--- author: einar comments: true date: 2007-04-25 14:26:07+00:00 layout: page slug: databases title: Databases wordpress_id: 237 categories: - Linux - Science header: image_fullwidth: banner_other.jpg --- As I've been working to get some results done for my Ph.D. thesis, I've stumbled across the problem of having different data obtained through different software. Even if it's just a matter of text files, the fields are all different and even if dealing with the same data, trying to infer relationships is a pain. Therefore I decided to create a small database to host the data of my work and query it accordingly. I didn't want to run a database server, so I settled for [SQLite](http://sqlite.org), a lightweight file-driven database, I don't handle enormous amount of data so it should be ok. Up to now I've inserted parts of the Entrez Gene database. First of all I downloaded the gene_info.gz from NCBI's FTP, which contains data such as gene name, gene symbol, and so on. Then it was a matter of filtering out non-human entries, and to do so I wrote a small script called taxon_filter.py: [code lang="python"] #!/usr/bin/env python import gzip import sys import csv """Filters NCBI annotation files by human taxon (9606). Works directly from the source gzipped file and outputs a tab-delimited file.""" class ncbi: delimiter = '\t' quotechar = '"' escapechar = None doublequote = True skipinitialspace = False lineterminator = '\n' quoting = csv.QUOTE_NONE def __init__(self): # Dialect registration csv.register_dialect("ncbi", self) ncbi() if len(sys.argv) < 3: print "Not enough command line arguments" sys.exit(0) try: compressed_file = gzip.open(sys.argv[1]) except IOError: print "Could not open file!" sys.exit(-1) delim = csv.reader(compressed_file,dialect="ncbi") try: destination_file = open(sys.argv[2],"wb") except IOError: print "Can't open destination file!" sys.exit(-1) write_delim = csv.writer(destination_file,dialect="ncbi") write_delim.writerow(delim.next()) for row in delim: if row[0] == "9606": write_delim.writerow(row) print "Complete!" compressed_file.close() destination_file.close() sys.exit(0) [/code] (apologies, tabs are messed up) That filtered the file for entries belonging to taxon 9606 (_Homo sapiens) . _Then I had to keep only the interesting bits in the file, so I cut the leading comment and selected only the correct fields: [code lang="bash"] sed '1d' gene_info_human | cut -f2,9,3,7-8 > entrez_gene.txt [/code] SQLite has Python bindings (officially part of Python since 2.5) but those don't allow the direct import of text files, so I fired up the command line sqlite3 command and created the relevant table, called entrez_gene, and imported the data: [code lang="bash"] .separator "\t" .import datafiles/File_NCBI/entrez_gene.txt entrez_gene [/code] Done! This is the first step, then I'll work on creating tables for my own data.