How to get a list of HGNC symbols and names (descriptions)

Here’s a quick way to get HGNC symbols and their names, drawing on data from UCSC and the open-source MyGene.info project:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/refGene.txt.gz | gunzip -c | cut -f13 | sort | uniq | get_hgnc_names_for_symbols.py > hgnc_symbols_with_names.txt

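The last step in that pipeline is a Python script that I call get_hgnc_names_for_symbols.py. It needs the mygene package (pip install mygene), and it has to be executable and on your PATH for the command above to work as written: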

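#!/usr/bin/env python

import sys
from mygene import MyGeneInfo

# Read one symbol per line from standard input, skipping blank lines
hgnc_symbols = [line.strip() for line in sys.stdin if line.strip()]

mg = MyGeneInfo()
results = mg.querymany(hgnc_symbols, scopes='symbol', species='human', verbose=False)

for result in results:
    # Symbols that MyGene.info cannot match are flagged 'notfound' and
    # carry no 'symbol' or 'name' keys, so skip them to avoid a KeyError
    if result.get('notfound') or 'symbol' not in result or 'name' not in result:
        continue
    sys.stdout.write("%s\t%s\n" % (result['symbol'], result['name']))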

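The pipeline above writes a two-column, tab-delimited text file called hgnc_symbols_with_names.txt that contains each HGNC symbol (e.g., AAR2) and its name (e.g., AAR2 splicing factor homolog). That file could be loaded into a lookup table or, given that it is sorted by symbol, searched very quickly with a binary search via the Python bisect module. One caveat: the order produced by sort depends on your locale, while bisect relies on Python's own string comparison, so either run sort under LC_ALL=C or re-sort the symbols after loading them.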
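As a rough sketch of the bisect idea, the snippet below parses symbol/name lines and looks up a name by symbol. The helper names load_table and lookup_name are my own and purely illustrative. The sketch assumes a tab-delimited, two-column file, and it re-sorts the pairs in Python rather than trusting the order on disk.

```python
import bisect


def load_table(lines):
    """Parse 'symbol<TAB>name' lines into two parallel lists sorted by symbol."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        symbol, _, name = line.partition("\t")
        pairs.append((symbol, name))
    # Re-sort so the order matches Python's string comparison, which can
    # differ from the locale-dependent order produced by sort(1).
    pairs.sort()
    symbols = [symbol for symbol, _ in pairs]
    names = [name for _, name in pairs]
    return symbols, names


def lookup_name(symbols, names, symbol):
    """Binary-search the sorted symbols; return the name, or None if absent."""
    i = bisect.bisect_left(symbols, symbol)
    if i < len(symbols) and symbols[i] == symbol:
        return names[i]
    return None
```

To use it with the real file, open hgnc_symbols_with_names.txt, pass the file handle to load_table, and then call lookup_name(symbols, names, "AAR2"). If a symbol appears more than once, this returns the name of the first match in sorted order. For a file of this size a plain dict keyed on symbol would work just as well; bisect mostly pays off when you also want ordered or prefix-style lookups.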