Finishing touches are in place for my convert2bed tool (GitHub site).
This utility converts common genomics data formats (BAM, GFF, GTF, PSL, SAM, VCF, WIG) to lexicographically-sorted UCSC BED format. It offers two benefits over alternatives:
- It runs about 3-10x as fast as bedtools
- It converts all input fields in as non-lossy a way as possible, to allow recovery of data to the original format
As an example, here we use convert2bed to convert a 14M-read, indexed BAM file to a sorted BED file (data are piped to /dev/null) on a 4 GB, dual-core Core 2 (2.4 GHz) workstation running RHEL 6:
$ samtools view -c ../DS27127A_GTTTCG_L001.uniques.sorted.bam
14090028
Conversion is performed with default options (sorted BED as output, sorted with BEDOPS sort-bed):
$ time ./convert2bed -i bam < ../DS27127A_GTTTCG_L001.uniques.sorted.bam > /dev/null
[bam_header_read] EOF marker is absent. The input is probably truncated.

real    3m5.508s
user    0m25.702s
sys     0m8.602s
Here is the same conversion, performed with bedtools v2.22:
$ time ../bedtools2/bin/bamToBed -i ../DS27127A_GTTTCG_L001.uniques.sorted.bam \
    | ../bedtools2/bin/sortBed -i stdin > /dev/null

real    28m22.057s
user    2m58.579s
sys     0m41.605s
The use of convert2bed for this file offers a 9.1x speed improvement. Other large BAM files show similar conversion speedups.
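The speedup figure follows directly from the two wall-clock (`real`) times reported above:

```python
# Wall-clock ("real") times reported above, converted to seconds.
convert2bed_s = 3 * 60 + 5.508    # 3m5.508s
bedtools_s = 28 * 60 + 22.057     # 28m22.057s

speedup = bedtools_s / convert2bed_s
print(speedup)  # just over 9x
```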
Further time reductions are possible with the bam2starchcluster scripts (TBA), which use GNU Parallel or a Sun Grid Engine job scheduler to break the conversion task down by chromosome.
For scientific work, I have used matrix2png to make a nice PNG image from a text-formatted matrix of data values. PNG looks great on the web, but it doesn't translate well to publication-quality figures.
Here are some useful resources for open-source C- and C++-based OCR libraries that could run under iOS (need to check licensing):
- Seven Segment Optical Character Recognition (ssocr)
- Advice for 7-Segment Display OCR with Tesseract
- Tesseract OCR iOS library
The end goal is to be able to use an iPhone to read LED displays, as commonly found on meters, etc., and then do something useful with that data (upload it somewhere, tagged with geodata). An aggregate of hundreds or thousands of users could conceivably collect data useful for themselves and also for the group as a whole.
$ unstarch --sha1-signature .foo
So far, so good.
But now I want to validate that the metadata are being digested correctly through some independent means, preferably via the command line, so that I can perform regression testing. I can use the openssl, xxd, and base64 tools together to test that I get the same answer:
$ unstarch --list-json-no-trailing-newline .foo \
    | openssl sha1 \
    | xxd -r -p \
    | base64
As a note to myself: I end up stripping the trailing newline from the JSON output of unstarch because this is what the PolarSSL library ends up digesting. This very nearly had me doubting whether PolarSSL was working correctly, or whether my command-line test was correct!
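The same digest chain is easy to sanity-check in a few lines of Python. The JSON string here is a hypothetical stand-in for the metadata that unstarch actually emits, but the transformation (SHA-1, raw digest bytes, then Base64) mirrors the pipeline above:

```python
import base64
import hashlib

# Hypothetical stand-in for the archive's JSON metadata; the real
# payload is whatever `unstarch --list-json-no-trailing-newline` prints.
metadata = '{"archive": {"type": "starch"}}'

# Mirror the pipeline: SHA-1, then the raw 20-byte digest
# (the `xxd -r -p` step), then Base64.
raw = hashlib.sha1(metadata.encode()).digest()
signature = base64.b64encode(raw).decode()

# A trailing newline yields a completely different digest, which is
# why the JSON must be stripped before hashing to match PolarSSL.
raw_with_newline = hashlib.sha1((metadata + "\n").encode()).digest()
assert raw != raw_with_newline
print(signature)
```

Hashing the stripped string on both sides is what makes the command-line result agree with the library's digest.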