---
author: einar
categories:
- Linux
- Science
comments: true
date: "2007-04-25T14:26:07Z"
header:
  image_fullwidth: banner_other.jpg
slug: databases
title: Databases
disable_share: true
wordpress_id: 237
---

As I've been working to get some results done for my Ph.D. thesis, I've stumbled across the problem of having different data obtained through different software. Even though it's all just text files, the fields are all different, and even when dealing with the same data, trying to infer relationships is a pain.<!--more-->

Therefore I decided to create a small database to host the data for my work and query it accordingly. I didn't want to run a database server, so I settled for [SQLite](http://sqlite.org), a lightweight, file-based database; I don't handle enormous amounts of data, so it should be fine. So far I've inserted parts of the Entrez Gene database. First of all I downloaded gene_info.gz from NCBI's FTP site, which contains data such as gene name, gene symbol, and so on. Then it was a matter of filtering out the non-human entries, and to do so I wrote a small script called taxon_filter.py:
[code lang="python"]
|
||||
#!/usr/bin/env python
|
||||
|
||||
import gzip
|
||||
import sys
|
||||
import csv
|
||||
|
||||
"""Filters NCBI annotation files by human taxon (9606). Works directly from the source
|
||||
gzipped file and outputs a tab-delimited file."""
|
||||
|
||||
class ncbi:
|
||||
delimiter = '\t'
|
||||
quotechar = '"'
|
||||
escapechar = None
|
||||
doublequote = True
|
||||
skipinitialspace = False
|
||||
lineterminator = '\n'
|
||||
quoting = csv.QUOTE_NONE
|
||||
|
||||
def __init__(self):
|
||||
# Dialect registration
|
||||
csv.register_dialect("ncbi", self)
|
||||
|
||||
ncbi()
|
||||
|
||||
if len(sys.argv) < 3:
|
||||
print "Not enough command line arguments"
|
||||
sys.exit(0)
|
||||
|
||||
try:
|
||||
compressed_file = gzip.open(sys.argv[1])
|
||||
except IOError:
|
||||
print "Could not open file!"
|
||||
sys.exit(-1)
|
||||
|
||||
delim = csv.reader(compressed_file,dialect="ncbi")
|
||||
|
||||
try:
|
||||
destination_file = open(sys.argv[2],"wb")
|
||||
except IOError:
|
||||
print "Can't open destination file!"
|
||||
sys.exit(-1)
|
||||
|
||||
write_delim = csv.writer(destination_file,dialect="ncbi")
|
||||
|
||||
write_delim.writerow(delim.next())
|
||||
|
||||
for row in delim:
|
||||
if row[0] == "9606":
|
||||
write_delim.writerow(row)
|
||||
|
||||
print "Complete!"
|
||||
compressed_file.close()
|
||||
destination_file.close()
|
||||
sys.exit(0)
|
||||
|
||||
[/code]

That filtered the file down to the entries belonging to taxon 9606 (_Homo sapiens_). Then I had to keep only the interesting bits, so I cut the leading comment line and selected just the fields I needed:

[code lang="bash"]
sed '1d' gene_info_human | cut -f2,9,3,7-8 > entrez_gene.txt
[/code]

SQLite has Python bindings (officially part of Python since 2.5), but those don't allow importing text files directly, so I fired up the sqlite3 command-line shell, created the relevant table, called entrez_gene, and imported the data:

[code lang="bash"]
.separator "\t"
.import datafiles/File_NCBI/entrez_gene.txt entrez_gene
[/code]
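
The same import could also be done through the Python bindings themselves, by reading the tab-delimited file and inserting the rows in bulk. Here is a minimal sketch of that approach; the database file name and the column names are just assumptions for illustration (the fields kept by the cut above should correspond to GeneID, Symbol, chromosome, map_location and description), not my actual schema:

[code lang="python"]
# Sketch: import entrez_gene.txt through the sqlite3 module instead of the
# sqlite3 shell. Database file name and column names are assumptions.
import csv
import sqlite3

connection = sqlite3.connect("thesis.db")
cursor = connection.cursor()

# Column names mirror the fields selected by the cut command above
cursor.execute("""CREATE TABLE IF NOT EXISTS entrez_gene
                  (gene_id TEXT, symbol TEXT, chromosome TEXT,
                   map_location TEXT, description TEXT)""")

reader = csv.reader(open("entrez_gene.txt", "rb"), delimiter="\t")
cursor.executemany("INSERT INTO entrez_gene VALUES (?, ?, ?, ?, ?)", reader)

connection.commit()
connection.close()
[/code]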

Done! This is just the first step; next I'll work on creating tables for my own data.
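
In the meantime, querying the imported table from the Python side is already straightforward. A quick sketch along the same lines as above (the gene symbol and the column name are, again, just illustrative and follow the assumed schema from the previous sketch):

[code lang="python"]
# Sketch: query the imported table; the symbol is an arbitrary example.
import sqlite3

connection = sqlite3.connect("thesis.db")
cursor = connection.cursor()

cursor.execute("SELECT * FROM entrez_gene WHERE symbol = ?", ("TP53",))
for row in cursor:
    print row

connection.close()
[/code]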