
---
author: einar
categories:
  - Linux
  - Science
comments: true
date: 2007-04-25T14:26:07Z
header:
  image_fullwidth: banner_other.jpg
slug: databases
title: Databases
omit_header_text: true
disable_share: true
wordpress_id: 237
---

As I've been working to get some results for my Ph.D. thesis, I've stumbled across the problem of having different data obtained through different software. Even though it's all just text files, the fields are all different, and even when they describe the same data, trying to infer relationships between them is a pain. I therefore decided to create a small database to host the data for my work and query it accordingly. I didn't want to run a database server, so I settled on SQLite, a lightweight file-based database; I don't handle enormous amounts of data, so it should be fine. So far I've inserted parts of the Entrez Gene database. First of all I downloaded gene_info.gz from NCBI's FTP site, which contains data such as gene name, gene symbol, and so on. Then it was a matter of filtering out non-human entries, and to do so I wrote a small script called taxon_filter.py:

[code lang="python"]
#!/usr/bin/env python

import gzip
import sys
import csv

"""Filters NCBI annotation files by human taxon (9606). Works directly
from the source gzipped file and outputs a tab-delimited file."""

class ncbi:
    delimiter = '\t'
    quotechar = '"'
    escapechar = None
    doublequote = True
    skipinitialspace = False
    lineterminator = '\n'
    quoting = csv.QUOTE_NONE

    def __init__(self):
        # Dialect registration
        csv.register_dialect("ncbi", self)

ncbi()

if len(sys.argv) < 3:
    print "Not enough command line arguments"
    sys.exit(0)

try:
    compressed_file = gzip.open(sys.argv[1])
except IOError:
    print "Could not open file!"
    sys.exit(-1)

delim = csv.reader(compressed_file, dialect="ncbi")

try:
    destination_file = open(sys.argv[2], "wb")
except IOError:
    print "Can't open destination file!"
    sys.exit(-1)

write_delim = csv.writer(destination_file, dialect="ncbi")

# Copy the header line first
write_delim.writerow(delim.next())

for row in delim:
    if row[0] == "9606":
        write_delim.writerow(row)

print "Complete!"
compressed_file.close()
destination_file.close()
sys.exit(0)
[/code]


That filtered the file down to entries belonging to taxon 9606 (_Homo sapiens_). Then I had to keep only the interesting bits of the file, so I cut the leading comment and selected only the fields I needed:

[code lang="bash"] sed '1d' gene_info_human | cut -f2,9,3,7-8 > entrez_gene.txt [/code]

SQLite has Python bindings (officially part of Python since 2.5), but those don't allow the direct import of text files, so I fired up the sqlite3 command-line tool, created the relevant table, called entrez_gene, and imported the data:

[code lang="bash"] .separator "\t" .import datafiles/File_NCBI/entrez_gene.txt entrez_gene [/code]

Done! This is only the first step; next I'll work on creating tables for my own data.