Being able to identify the archive from its first few bytes will help, because I plan to move the metadata to the back of the archive file, and it would be expensive to seek to the end of the file just to determine the file type. I figure a 64-byte header is enough to hold some extra data that identifies the file type and points the unstarch and starchcat tools to the byte in the file where the metadata starts:
This image describes what I’m planning for Starch v2. Basically, we reserve a 64-byte header at the front of the archive that contains the following four items:
- Magic number – The magic number ca5cade5 identifies this as a Starch v2-formatted file.
- Offset – A zero-padded, 16-digit unsigned integer giving the byte position at which the archive’s metadata starts, counted from the start of the file (so it includes the 64-byte header itself).
- Hash – A SHA-1 hash of the string formed by concatenating the offset value and the metadata string. This value helps validate the integrity of the archive metadata and the offset value. (Later, we will add hashes of the chromosome streams to the metadata, to allow full archive validation.)
- Reserved – We keep 20 bytes of space free, in case we need it.
After these 64 bytes, the compressed, per-chromosome streams start, and we then wrap up with the metadata at the end of the file.
Initially, the header will be the magic number and all zeros. At the conclusion of creating the archive, once all the streams are prepared and the metadata is ready to hash, parts of the header are written over with calculated values (offset and hash).
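As a rough sketch of that write-then-patch flow (this is not the actual Starch code), here is how it might look in Python. Several details are my own assumptions for illustration: the magic number is stored as 8 ASCII characters, the hash is stored as the 20 raw SHA-1 bytes, the hash input is the offset field followed by the UTF-8 metadata, and the function names are hypothetical.

```python
import hashlib
import io

# Assumed layout (illustrative only): 8 + 16 + 20 + 20 = 64 bytes.
MAGIC = b"ca5cade5"                       # assumption: stored as 8 ASCII characters
HEADER_SIZE = 64
OFFSET_WIDTH = 16                         # zero-padded decimal digits
HASH_SIZE = hashlib.sha1().digest_size    # 20 raw bytes (assumption: not hex-encoded)
RESERVED_SIZE = 20


def placeholder_header() -> bytes:
    """Header written first: the magic number followed by zeros."""
    return MAGIC + b"\x00" * (HEADER_SIZE - len(MAGIC))


def final_header(offset: int, metadata: str) -> bytes:
    """Header written last, once the metadata offset and hash are known."""
    offset_field = f"{offset:0{OFFSET_WIDTH}d}".encode("ascii")
    if offset < 0 or len(offset_field) != OFFSET_WIDTH:
        raise ValueError("offset does not fit in 16 digits")
    # Assumed hash input: offset field, then the metadata string.
    digest = hashlib.sha1(offset_field + metadata.encode("utf-8")).digest()
    return MAGIC + offset_field + digest + b"\x00" * RESERVED_SIZE


def parse_header(header: bytes):
    """Return (offset, digest) from a finalized 64-byte header."""
    if len(header) != HEADER_SIZE or not header.startswith(MAGIC):
        raise ValueError("not a Starch v2 header")
    start = len(MAGIC)
    offset = int(header[start:start + OFFSET_WIDTH].decode("ascii"))
    start += OFFSET_WIDTH
    return offset, header[start:start + HASH_SIZE]


def write_archive(fh, streams, metadata: str) -> int:
    """Write placeholder header, streams, metadata, then patch the header."""
    fh.write(placeholder_header())
    for stream in streams:
        fh.write(stream)
    offset = fh.tell()                    # metadata starts here, header included
    fh.write(metadata.encode("utf-8"))
    fh.seek(0)
    fh.write(final_header(offset, metadata))
    return offset


if __name__ == "__main__":
    buf = io.BytesIO()
    where = write_archive(buf, [b"stream-1", b"stream-2"], "example metadata")
    print(where, parse_header(buf.getvalue()[:HEADER_SIZE])[0])
```

A reader only needs the first 64 bytes to recognize the file type and find the metadata, which is the point of keeping the header at the front.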
To assist with picking the right magic number, Ned Batchelder put together a Python script to generate all the hex words possible from a small subset of the English dictionary — words like deadbeef, dec0ded and 0ddba11 which can be used as magic numbers to identify a file type. Thanks, Ned! So far, I’m liking the magic word ca5cade5 as it has a nice biological flavor to it.
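To give a feel for how such a search can work, here is a minimal sketch of the idea. It is not Ned's script; the substitution table (o to 0, l to 1, s to 5) and the function name are my own assumptions, chosen because they reproduce the examples above.

```python
import re

# Assumed look-alike substitutions; a real script may use more or fewer.
SUBSTITUTIONS = str.maketrans({"o": "0", "l": "1", "s": "5"})
HEX_ONLY = re.compile(r"[0-9a-f]+")


def hex_words(words, length=8):
    """Return the words that spell a hex string of the given length."""
    found = []
    for word in words:
        candidate = word.strip().lower().translate(SUBSTITUTIONS)
        if len(candidate) == length and HEX_ONLY.fullmatch(candidate):
            found.append(candidate)
    return found


if __name__ == "__main__":
    sample = ["cascades", "deadbeef", "hello", "decoded", "oddball"]
    print(hex_words(sample, length=8))
    print(hex_words(sample, length=7))
```

Eight hex digits fill exactly four bytes, which is why eight-letter words are the interesting ones for a 32-bit magic number.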
When I want to make BEDOPS documentation available for offline browsing, here is the command I use:
wget --no-parent --recursive --page-requisites --html-extension --convert-links -E -l 2 http://code.google.com/p/bedops
This handy wget statement fixes up the img and other URL references so that links to images and other resources load up from the local copy. So what the end user sees is almost exactly like what she would see if the documentation was retrieved through a network connection. Very nice.
WordPress replaces straight quotation marks with curly equivalents, which is really frustrating when trying to copy and paste code. To turn this off for post content, add remove_filter('the_content', 'wptexturize'); to the bottom of the theme’s functions.php file.
Interestingly, this doesn’t seem to affect post titles, but that’s okay with me.
I had recently updated my copy of R to 2.15.1 and ended up needing to reinstall some libraries, including rgl.
If you use this R package, it can be tricky to install with the built-in build of Mesa/OpenGL in Lion. In fact, a straightforward install.packages("rgl") just won’t work at all. Instead, I install it from R-Forge with explicit configure arguments:
install.packages("rgl", repos="http://R-Forge.R-project.org", configure.args="--disable-cocoa --with-gl-includes=/opt/local/include --with-gl-libs=/opt/local/lib --with-x")
Once installed, rgl can make cool figures, like this: