---
author: einar
categories:
- Science
comments: true
date: "2007-10-09T20:00:23Z"
header:
  image_fullwidth: banner_other.jpg
slug: soft-file-woes
tags:
- bioinformatics
- python
- R
- Science
- software
title: SOFT file woes
omit_header_text: true
disable_share: true
wordpress_id: 298
---
Today I started working on a data set published on [GEO](http://www.ncbi.nlm.nih.gov/geo/). As the sample data were inconsistent (they mentioned 23 controls, but I found 28), I decided to parse the [SOFT](http://www.ncbi.nlm.nih.gov/projects/geo/info/soft2.html#SOFTformat) file from GEO to get the exact sample information.

I made a grave mistake. First of all, [Biopython](http://www.biopython.org)'s SOFT parser is horribly broken (it doesn't work at all) and largely undocumented: I could work around the lack of documentation (API docs), but not around the fact that it doesn't work. So I turned to [R](http://www.r-project.org), which offers a GEO query module through [Bioconductor](http://www.bioconductor.org).

Again, that proved to be a terrible mistake. For a file containing 183 samples, the analysis has been running for **four hours**, with no sign of completing anytime soon (not to mention a possible memory leak). After this, I gave up. I'm going to get the reduced data sheet and write a small parser in Python myself.

What is frustrating is the lack of quality: I could concentrate on my own work rather than reinventing the wheel for the nth time if the existing implementations worked. What's the point of releasing non-working software? I could understand bugs, but this goes one step further.
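
For illustration, here is a minimal sketch of what such a hand-written parser might look like. It is not the parser from this post, and it assumes only the basic SOFT conventions: entity lines start with `^` (for example `^SAMPLE = GSM...`), attribute lines start with `!` and have the form `key = value`, and data tables sit between `!..._table_begin` and `!..._table_end` markers. The function name `parse_soft_samples`, the accessions, and the attribute values in the example are made up for illustration.

```python
def parse_soft_samples(lines):
    """Collect the attributes of every ^SAMPLE entity in a SOFT file.

    Returns a dict mapping sample accession -> {attribute: [values]}.
    Data tables are skipped; only the sample metadata is kept.
    """
    samples = {}
    current = None    # attribute dict of the sample being read, if any
    in_table = False  # True while inside a *_table_begin/_end block

    for raw in lines:
        line = raw.rstrip("\r\n")
        if in_table:
            # Ignore table rows until the closing marker shows up.
            if line.startswith("!") and line.lower().endswith("_table_end"):
                in_table = False
            continue
        if line.startswith("^"):
            # Entity line, e.g. "^SAMPLE = GSM0001"
            kind, _, accession = line[1:].partition("=")
            current = None
            if kind.strip().upper() == "SAMPLE":
                current = samples.setdefault(accession.strip(), {})
        elif line.startswith("!"):
            key, sep, value = line[1:].partition("=")
            key = key.strip()
            if key.lower().endswith("_table_begin"):
                in_table = True
            elif current is not None and sep:
                # Attributes may repeat, so keep every value in a list.
                current.setdefault(key, []).append(value.strip())
    return samples


# Hypothetical two-sample excerpt, not taken from a real GEO series.
example = """\
^SAMPLE = GSM0001
!Sample_title = control 1
!Sample_characteristics_ch1 = status: control
!sample_table_begin
ID_REF\tVALUE
probe_1\t7.5
!sample_table_end
^SAMPLE = GSM0002
!Sample_title = patient 1
!Sample_characteristics_ch1 = status: disease
""".splitlines()

samples = parse_soft_samples(example)
controls = [
    accession
    for accession, attrs in samples.items()
    if any("control" in v.lower()
           for v in attrs.get("Sample_characteristics_ch1", []))
]
print(controls)  # ['GSM0001']
```

Because the function only iterates over lines, an open file handle can be passed in directly, so the whole file never has to be held in memory at once. Which attribute actually identifies the controls depends on how the submitters annotated the samples, so the `Sample_characteristics_ch1` filter above is only a guess.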