---
author: einar
categories:
- Linux
- Science
comments: true
date: "2007-04-25T14:26:07Z"
header:
  image_fullwidth: banner_other.jpg
slug: databases
title: Databases
disable_share: true
wordpress_id: 237
---

As I've been working to get some results done for my Ph.D. thesis, I've stumbled across the problem of having different data obtained through different software. Even though it's all just text files, the fields are all different, and even when dealing with the same data, trying to infer relationships is a pain.<!--more-->

Therefore I decided to create a small database to host the data for my work and query it accordingly. I didn't want to run a database server, so I settled for [SQLite](http://sqlite.org), a lightweight, file-based database; I don't handle enormous amounts of data, so it should be fine. So far I've inserted parts of the Entrez Gene database. First of all I downloaded gene_info.gz from NCBI's FTP site, which contains data such as gene name, gene symbol, and so on. Then it was a matter of filtering out the non-human entries, and to do so I wrote a small script called taxon_filter.py:
[code lang="python"]
|
||||
#!/usr/bin/env python
|
||||
|
||||
import gzip
|
||||
import sys
|
||||
import csv
|
||||
|
||||
"""Filters NCBI annotation files by human taxon (9606). Works directly from the source
|
||||
gzipped file and outputs a tab-delimited file."""
|
||||
|
||||
class ncbi:
|
||||
delimiter = '\t'
|
||||
quotechar = '"'
|
||||
escapechar = None
|
||||
doublequote = True
|
||||
skipinitialspace = False
|
||||
lineterminator = '\n'
|
||||
quoting = csv.QUOTE_NONE
|
||||
|
||||
def __init__(self):
|
||||
# Dialect registration
|
||||
csv.register_dialect("ncbi", self)
|
||||
|
||||
ncbi()
|
||||
|
||||
if len(sys.argv) < 3:
|
||||
print "Not enough command line arguments"
|
||||
sys.exit(0)
|
||||
|
||||
try:
|
||||
compressed_file = gzip.open(sys.argv[1])
|
||||
except IOError:
|
||||
print "Could not open file!"
|
||||
sys.exit(-1)
|
||||
|
||||
delim = csv.reader(compressed_file,dialect="ncbi")
|
||||
|
||||
try:
|
||||
destination_file = open(sys.argv[2],"wb")
|
||||
except IOError:
|
||||
print "Can't open destination file!"
|
||||
sys.exit(-1)
|
||||
|
||||
write_delim = csv.writer(destination_file,dialect="ncbi")
|
||||
|
||||
write_delim.writerow(delim.next())
|
||||
|
||||
for row in delim:
|
||||
if row[0] == "9606":
|
||||
write_delim.writerow(row)
|
||||
|
||||
print "Complete!"
|
||||
compressed_file.close()
|
||||
destination_file.close()
|
||||
sys.exit(0)
|
||||
|
||||
[/code]

That filtered the file down to the entries belonging to taxon 9606 (_Homo sapiens_). Then I had to keep only the interesting bits, so I cut the leading comment line and selected just the fields I needed:

[code lang="bash"]
sed '1d' gene_info_human | cut -f2,9,3,7-8 > entrez_gene.txt
[/code]

SQLite has Python bindings (officially part of Python since 2.5), but those don't allow importing text files directly, so I fired up the sqlite3 command-line shell, created the relevant table, called entrez_gene, and imported the data:

[code lang="bash"]
.separator "\t"
.import datafiles/File_NCBI/entrez_gene.txt entrez_gene
[/code]
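
The same import could also be done through the Python bindings themselves, by reading the tab-delimited file and inserting the rows in bulk. Here is a minimal sketch of that approach; the database file name and the column names are just assumptions for illustration (the fields kept by the cut above should correspond to GeneID, Symbol, chromosome, map_location and description), not my actual schema:

[code lang="python"]
# Sketch: import entrez_gene.txt through the sqlite3 module instead of the
# sqlite3 shell. Database file name and column names are assumptions.
import csv
import sqlite3

connection = sqlite3.connect("thesis.db")
cursor = connection.cursor()

# Column names mirror the fields selected by the cut command above
cursor.execute("""CREATE TABLE IF NOT EXISTS entrez_gene
                  (gene_id TEXT, symbol TEXT, chromosome TEXT,
                   map_location TEXT, description TEXT)""")

reader = csv.reader(open("entrez_gene.txt", "rb"), delimiter="\t")
cursor.executemany("INSERT INTO entrez_gene VALUES (?, ?, ?, ?, ?)", reader)

connection.commit()
connection.close()
[/code]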

Done! This is just the first step; next I'll work on creating tables for my own data.
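
In the meantime, querying the imported table from the Python side is already straightforward. A quick sketch along the same lines as above (the gene symbol and the column name are, again, just illustrative and follow the assumed schema from the previous sketch):

[code lang="python"]
# Sketch: query the imported table; the symbol is an arbitrary example.
import sqlite3

connection = sqlite3.connect("thesis.db")
cursor = connection.cursor()

cursor.execute("SELECT * FROM entrez_gene WHERE symbol = ?", ("TP53",))
for row in cursor:
    print row

connection.close()
[/code]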