
---
author: einar
categories:
  - Linux
  - Science
comments: true
date: 2007-04-25T14:26:07Z
header:
  image_fullwidth: banner_other.jpg
slug: databases
title: Databases
omit_header_text: true
disable_share: true
wordpress_id: 237
---

As I've been working to get some results for my Ph.D. thesis, I've stumbled across the problem of having different data obtained through different software. Even though it's all just text files, the fields are all different, and even when they describe the same data, trying to infer relationships between them is a pain. I therefore decided to create a small database to host the data for my work and query it accordingly. I didn't want to run a database server, so I settled on SQLite, a lightweight file-based database; I don't handle enormous amounts of data, so it should be fine. So far I've inserted parts of the Entrez Gene database. First of all I downloaded gene_info.gz from NCBI's FTP site, which contains data such as gene name, gene symbol, and so on. Then it was a matter of filtering out non-human entries, and to do so I wrote a small script called taxon_filter.py:

[code lang="python"]
#!/usr/bin/env python

import gzip
import sys
import csv

"""Filters NCBI annotation files by human taxon (9606). Works directly
from the source gzipped file and outputs a tab-delimited file."""

class ncbi:
    delimiter = '\t'
    quotechar = '"'
    escapechar = None
    doublequote = True
    skipinitialspace = False
    lineterminator = '\n'
    quoting = csv.QUOTE_NONE

    def __init__(self):
        # Dialect registration
        csv.register_dialect("ncbi", self)

ncbi()

if len(sys.argv) < 3:
    print "Not enough command line arguments"
    sys.exit(0)

try:
    compressed_file = gzip.open(sys.argv[1])
except IOError:
    print "Could not open file!"
    sys.exit(-1)

delim = csv.reader(compressed_file, dialect="ncbi")

try:
    destination_file = open(sys.argv[2], "wb")
except IOError:
    print "Can't open destination file!"
    sys.exit(-1)

write_delim = csv.writer(destination_file, dialect="ncbi")

# Copy the header line first
write_delim.writerow(delim.next())

for row in delim:
    if row[0] == "9606":
        write_delim.writerow(row)

print "Complete!"
compressed_file.close()
destination_file.close()
sys.exit(0)
[/code]


That filtered the file down to entries belonging to taxon 9606 (_Homo sapiens_). Then I had to keep only the interesting bits of the file, so I cut the leading comment and selected only the fields I needed:

[code lang="bash"] sed '1d' gene_info_human | cut -f2,9,3,7-8 > entrez_gene.txt [/code]

SQLite has Python bindings (officially part of Python since 2.5), but those don't allow the direct import of text files, so I fired up the sqlite3 command-line tool, created the relevant table, called entrez_gene, and imported the data:

[code lang="bash"] .separator "\t" .import datafiles/File_NCBI/entrez_gene.txt entrez_gene [/code]

Done! This is only the first step; next I'll work on creating tables for my own data.