---
author: einar
comments: true
date: 2007-04-25 14:26:07+00:00
layout: page
slug: databases
title: Databases
wordpress_id: 237
categories:
- Linux
- Science
---

As I've been working on results for my Ph.D. thesis, I've run into the problem of having data obtained through different software. Even when it's just a matter of text files, the fields are all different, and trying to infer relationships between what is really the same data is a pain.<!-- more -->

Therefore I decided to create a small database to host the data of my work and query it accordingly. I didn't want to run a database server, so I settled on [SQLite](http://sqlite.org), a lightweight file-based database; I don't handle enormous amounts of data, so it should be fine. So far I've inserted parts of the Entrez Gene database. First of all I downloaded gene_info.gz from NCBI's FTP site, which contains data such as gene name, gene symbol, and so on. Then it was a matter of filtering out non-human entries, and to do so I wrote a small script called taxon_filter.py:

[code lang="python"]
#!/usr/bin/env python

"""Filters NCBI annotation files by human taxon (9606). Works directly from the
source gzipped file and outputs a tab-delimited file."""

import gzip
import sys
import csv


class ncbi:
    """csv dialect matching NCBI's tab-delimited annotation files."""
    delimiter = '\t'
    quotechar = '"'
    escapechar = None
    doublequote = True
    skipinitialspace = False
    lineterminator = '\n'
    quoting = csv.QUOTE_NONE

    def __init__(self):
        # Dialect registration
        csv.register_dialect("ncbi", self)

ncbi()

if len(sys.argv) < 3:
    print "Not enough command line arguments"
    sys.exit(1)

try:
    compressed_file = gzip.open(sys.argv[1])
except IOError:
    print "Could not open file!"
    sys.exit(-1)

delim = csv.reader(compressed_file, dialect="ncbi")

try:
    destination_file = open(sys.argv[2], "wb")
except IOError:
    print "Can't open destination file!"
    sys.exit(-1)

write_delim = csv.writer(destination_file, dialect="ncbi")

# Copy the header line, then keep only rows whose first field is the taxon ID
write_delim.writerow(delim.next())

for row in delim:
    if row[0] == "9606":
        write_delim.writerow(row)

print "Complete!"
compressed_file.close()
destination_file.close()
sys.exit(0)
[/code]
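As an aside, the same filter can be sketched much more compactly without the csv dialect machinery — a minimal sketch in current Python (not the script above), assuming as before that the taxon ID is the first tab-separated field and the first line is a header:

```python
import gzip

def filter_taxon(src, dst, taxon="9606"):
    """Copy the header plus all rows whose first field equals `taxon`."""
    with gzip.open(src, "rt") as fin, open(dst, "w") as fout:
        fout.write(next(fin))  # keep the header line
        for line in fin:
            if line.split("\t", 1)[0] == taxon:
                fout.write(line)
```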

That filtered the file down to entries belonging to taxon 9606 (_Homo sapiens_). Then I had to keep only the interesting bits, so I cut the leading comment line and selected only the relevant fields:

[code lang="bash"]
sed '1d' gene_info_human | cut -f2,9,3,7-8 > entrez_gene.txt
[/code]
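One caveat: cut always emits fields in their original file order, so `-f2,9,3,7-8` produces exactly the same output as `-f2,3,7-9` — columns 2, 3, 7, 8, 9 in that order. If a different output order ever matters, a few lines of Python do it; a minimal sketch (the file names are just placeholders):

```python
import csv

def select_fields(row, indices):
    """Pick the given 1-based fields from a row, in the order listed."""
    return [row[i - 1] for i in indices]

def cut_file(src, dst, indices):
    """Tab-delimited analogue of `cut -f...` that honours the listed order."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        writer = csv.writer(fout, delimiter="\t")
        for row in csv.reader(fin, delimiter="\t"):
            writer.writerow(select_fields(row, indices))
```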

SQLite has Python bindings (officially part of the standard library since Python 2.5), but those don't allow the direct import of text files, so I fired up the sqlite3 command-line tool, created the relevant table, called entrez_gene, and imported the data:

[code lang="bash"]
.separator "\t"
.import datafiles/File_NCBI/entrez_gene.txt entrez_gene
[/code]
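The .import shortcut can also be reproduced from Python with the built-in sqlite3 module, which becomes handy once the filtering and the loading live in one script. A minimal sketch — the column names here are illustrative guesses for fields 2, 3, 7, 8, 9 of gene_info, not necessarily the schema I actually used:

```python
import csv
import sqlite3

def load_table(db_path, txt_path):
    """Create entrez_gene and bulk-load a tab-delimited file into it."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS entrez_gene
                    (gene_id TEXT, symbol TEXT, chromosome TEXT,
                     map_location TEXT, description TEXT)""")
    with open(txt_path, newline="") as f:
        rows = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        conn.executemany("INSERT INTO entrez_gene VALUES (?,?,?,?,?)", rows)
    conn.commit()
    return conn
```

After that, queries work as usual, e.g. `conn.execute("SELECT symbol FROM entrez_gene WHERE chromosome = ?", ("19",))`.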

Done! This is just the first step; next I'll work on creating tables for my own data.