Finishing touches are in place for my convert2bed tool (GitHub site).
This utility converts common genomics data formats (BAM, GFF, GTF, PSL, SAM, VCF, WIG) to lexicographically-sorted UCSC BED format. It offers two benefits over alternatives:
- It runs about 3-10x as fast as bedtools
- It converts all input fields in as non-lossy a way as possible, to allow recovery of data to the original format
As an example, here we use convert2bed to convert a 14M-read, indexed BAM file to a sorted BED file (output is sent to /dev/null) on a 4 GB, dual-Core 2 (2.4 GHz) workstation running RHEL 6:
$ samtools view -c ../DS27127A_GTTTCG_L001.uniques.sorted.bam
14090028
Conversion is performed with default options (sorted BED as output, using BEDOPS):
$ time ./convert2bed -i bam < ../DS27127A_GTTTCG_L001.uniques.sorted.bam > /dev/null
[bam_header_read] EOF marker is absent. The input is probably truncated.
real 3m5.508s
user 0m25.702s
sys 0m8.602s
Here is the same conversion, performed with bedtools v2.22:
$ time ../bedtools2/bin/bamToBed -i ../DS27127A_GTTTCG_L001.uniques.sorted.bam | ../bedtools2/bin/sortBed -i stdin > /dev/null
real 28m22.057s
user 2m58.579s
sys 0m41.605s
Using convert2bed on this file gives a 9.1x wall-clock speed improvement over the bedtools pipeline. Other large BAM files show similar conversion speedups.
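For anyone who wants to check the arithmetic against their own timings, here is a minimal sketch that turns `time`-style durations into seconds and divides them. The helper names `to_seconds` and `speedup` are made up for this example; they are not part of convert2bed or BEDOPS. On the "real" times above the ratio comes out at roughly 9.18.

```shell
#!/bin/sh
# Convert a `time`-style duration such as 28m22.057s into seconds.
# (Hypothetical helper, written for this post only.)
to_seconds() {
    printf '%s\n' "$1" | awk '{
        split($0, parts, "m")        # parts[1] = minutes, parts[2] = seconds + "s"
        sub(/s$/, "", parts[2])      # drop the trailing "s"
        printf "%.3f\n", parts[1] * 60 + parts[2]
    }'
}

# Print how many times longer the first duration is than the second.
speedup() {
    slow=$(to_seconds "$1")
    fast=$(to_seconds "$2")
    awk -v slow="$slow" -v fast="$fast" 'BEGIN { printf "%.2f\n", slow / fast }'
}

speedup 28m22.057s 3m5.508s     # wall-clock ("real") times from the runs above
speedup 2m58.579s 0m25.702s     # CPU ("user") times from the runs above
```

The wall-clock ratio is the one that matters for how long you wait; the "user" ratio only compares CPU time spent in user space and comes out lower.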
Further time reductions come from the bam2starchcluster scripts (TBA), which use GNU Parallel or a Sun Grid Engine job scheduler to break conversion tasks down by chromosome.
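Until those scripts are out, the per-chromosome idea can be sketched by hand. The following is only an illustration and not the bam2starchcluster implementation: it assumes an indexed BAM file, with `samtools`, GNU `parallel`, and `convert2bed` all on the PATH, and the function names and output layout are invented for this example. It covers only the GNU Parallel case, not Sun Grid Engine.

```shell
#!/bin/sh
# Sketch only: NOT the bam2starchcluster scripts, which are not yet released.

# Read `samtools idxstats` output on stdin and print the names of references
# with at least one mapped read (column 3), skipping the "*" row that counts
# unplaced reads.
chromosomes_with_reads() {
    awk -F '\t' '$1 != "*" && $3 > 0 { print $1 }'
}

# Convert an indexed BAM file into one sorted BED file per chromosome,
# running the per-chromosome jobs side by side with GNU Parallel.
# Arguments: path to the BAM file, output directory (both hypothetical).
convert_by_chromosome() {
    bam=$1
    outdir=$2
    mkdir -p "$outdir"
    samtools idxstats "$bam" | chromosomes_with_reads > "$outdir/chroms.txt"
    # Each job pulls one chromosome out of the BAM file and converts it.
    parallel "samtools view -b '$bam' {} | convert2bed -i bam > '$outdir'/{}.bed" \
        < "$outdir/chroms.txt"
}

# Run only when called with both arguments, e.g.:
#   sh by_chrom.sh reads.sorted.bam bed_by_chrom
if [ "$#" -eq 2 ]; then
    convert_by_chromosome "$1" "$2"
fi
```

Splitting by chromosome works because an indexed BAM file lets `samtools view` jump straight to one reference, so each job reads only its own slice. How much this helps will depend on the number of cores and on how evenly reads are spread across chromosomes.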