Finishing touches are in place for my convert2bed tool (GitHub site).
This utility converts common genomics data formats (BAM, GFF, GTF, PSL, SAM, VCF, WIG) to lexicographically-sorted UCSC BED format. It offers two benefits over alternatives:
- It runs about 3-10x as fast as the bedtools *ToBed equivalents
- It converts all input fields in as non-lossy a way as possible, to allow recovery of data to the original format
As an example, here we use convert2bed to convert a 14M-read, indexed BAM file to a sorted BED file (output is redirected to /dev/null) on a 4 GB, dual-Core 2 (2.4 GHz) workstation running RHEL 6:
$ samtools view -c ../DS27127A_GTTTCG_L001.uniques.sorted.bam
14090028
Conversion is performed with default options (sorted BED as output, using BEDOPS sort-bed):
$ time ./convert2bed -i bam < ../DS27127A_GTTTCG_L001.uniques.sorted.bam > /dev/null
[bam_header_read] EOF marker is absent. The input is probably truncated.
real 3m5.508s
user 0m25.702s
sys 0m8.602s
Here is the same conversion, performed with bedtools v2.22 bamToBed and sortBed:
$ time ../bedtools2/bin/bamToBed -i ../DS27127A_GTTTCG_L001.uniques.sorted.bam | ../bedtools2/bin/sortBed -i stdin > /dev/null
real 28m22.057s
user 2m58.579s
sys 0m41.605s
For this file, convert2bed offers about a 9.2x improvement in wall-clock time (28m22s versus 3m5s). Other large BAM files show similar conversion speedups.
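The speedup is the ratio of the two "real" (wall-clock) times reported by time. A minimal sketch of that arithmetic in shell follows; the helper names to_seconds and speedup are hypothetical and are not part of BEDOPS or bedtools:

```shell
#!/usr/bin/env bash

# Convert a `time`-style duration such as "28m22.057s" into seconds.
# Only the "<minutes>m<seconds>s" form is handled.
to_seconds() {
    printf '%s\n' "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print slow/fast as a ratio rounded to one decimal place.
speedup() {
    awk -v slow="$(to_seconds "$1")" -v fast="$(to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", slow / fast }'
}

# Wall-clock times from the two runs: bedtools first, convert2bed second.
speedup 28m22.057s 3m5.508s
```

Comparing the user and sys rows the same way would show how much of the difference is CPU time rather than waiting on I/O.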
Further time reductions come from the bam2bedcluster and bam2starchcluster scripts (TBA), which use GNU Parallel or a Sun Grid Engine job scheduler to break conversion tasks down by chromosome.
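To illustrate the by-chromosome idea only (this is not the bam2bedcluster script, which is not shown here), the sketch below prints one conversion command per chromosome. It assumes an indexed BAM file, so that samtools view can extract a single region. The function name per_chrom_commands, the file name reads.bam, and the chromosome names are all hypothetical:

```shell
#!/usr/bin/env bash

# Print one conversion command per chromosome for an indexed BAM file.
# Usage: per_chrom_commands <bam> <chrom>...
per_chrom_commands() {
    local bam=$1
    shift
    local chrom
    for chrom in "$@"; do
        # `samtools view -b <bam> <region>` emits one chromosome as BAM,
        # which convert2bed reads from standard input.
        printf 'samtools view -b %s %s | convert2bed -i bam > %s.bed\n' \
            "$bam" "$chrom" "$chrom"
    done
}

# Hypothetical file and chromosome names, for illustration only.
per_chrom_commands reads.bam chr1 chr2 chrX
```

The printed lines are ordinary shell commands, so they could be piped to GNU Parallel (which reads commands from standard input when none is given on its command line) or to sh for a serial run. Because each per-chromosome file is already sorted, the pieces can afterwards be combined in chromosome order to get one sorted BED file.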
Once testing is complete, the code will be folded into the upcoming BEDOPS v2.4.3 release. The source is available now via GitHub.