---
author: einar
categories:
- Linux
- Science
comments: true
date: "2007-11-07T18:15:29Z"
header:
  image_fullwidth: banner.jpg
slug: data-clustering-with-python
tags:
- bioinformatics
- cluster
- python
- R
title: Data clustering with Python
disable_share: true
wordpress_id: 330
---

**Notice:** I just realized this post has been linked from [a Stack Overflow question](http://stackoverflow.com/questions/5002783/best-python-clustering-library-to-use-for-product-data-analysis). I recently wrote a new post that uses a different technique and a combination of R and Python. [Check it out!]({{ site.url }}/2011/05/multiscale-bootstrap-clustering-with-python-and-r)

Following up on my recent post, I've been looking for alternatives to TMeV. So far I've found the R package pvclust and the [Pycluster library](http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm#pycluster), part of [BioPython](http://biopython.org). The first one also performs bootstrapping (I'm not sure if it's similar to what support trees do, but it's still better than no resampling at all). I've also found [another Python project](http://python-cluster.sourceforge.net/), but it is still too basic for what I need.

<!--more-->
Pvclust would be my first choice, but it only plots dendrograms, not heatmaps, and because it only clusters columns, the clustering must be run twice, once on the data and once on its transpose. [The package's web page](http://www.is.titech.ac.jp/~shimo/prog/pvclust/) shows the various options and what to do with it.

Pycluster, on the other hand, can generate files that can be read by the Java TreeView program, where you can view a heat map of the results and their annotations. Although there's documentation available, it is not part of the Biopython documentation (as usual, I'd say: lack of documentation is a plague for Biopython). In any case, doing a cluster analysis is rather simple, but we need to remember to do two cluster runs (one for genes, the other for experiments). Here I show an example with hierarchical clustering, but [the documentation](http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/cluster.pdf) (the Python part is in chapter 8) also has examples with other methods such as SOMs or k-means.

{{< highlight python >}}

from Bio import Cluster

#   Load data, in Cluster format. Current Biopython releases read it with
#   Cluster.read(); older releases exposed the loader as DataFile("somefile.txt").
with open("somefile.txt") as handle:
    data = Cluster.read(handle)

#   Clustering genes using Pearson's correlation and average linkage
gene_clustering = data.treecluster(method="a", dist="c", transpose=0)

#   Same as above, but clustering samples
exp_clustering = data.treecluster(method="a", dist="c", transpose=1)

#   We then save the results to a series of files to view in Java TreeView
data.save("name", gene_clustering, exp_clustering)
{{< / highlight >}}
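To make the two runs more concrete, here is a minimal pure-Python sketch of what average linkage with a Pearson-based distance does, first on the rows (genes) and then on the transposed matrix (experiments). It is only an illustration under stated assumptions, not Pycluster's actual implementation: the helper names `pearson_distance` and `average_linkage` and the toy `expression` matrix are made up for this example, and the distance is taken to be 1 minus the Pearson correlation.

```python
from math import sqrt


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / sqrt(var_x * var_y)


def average_linkage(rows):
    """Agglomerate rows; return merges as (members_a, members_b, distance)."""
    clusters = [(i,) for i in range(len(rows))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(pearson_distance(rows[a], rows[b]) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical toy matrix: 4 genes (rows) x 3 experiments (columns)
expression = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.1],
    [3.0, 2.0, 1.0],
    [6.0, 4.1, 2.0],
]

# Cluster the genes (rows), analogous to transpose=0
gene_merges = average_linkage(expression)

# Cluster the experiments by transposing first, analogous to transpose=1
exp_merges = average_linkage([list(col) for col in zip(*expression)])
```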

[Java TreeView](http://jtreeview.sourceforge.net/) is a program to view trees and heat maps. Unlike its counterpart TreeView, it's truly cross-platform (Java) and GPLed, a nice added bonus. You can load the files directly and display the results as in this picture, taken with the sample data available on the project page.

![Java TreeView screenshot]({{ site.url }}/images/2007/11/treeview.png)

It's still not perfect (data values are shown only in the detailed view, not on the main map page), but it's a good start nevertheless. I'll investigate whether I can complement TMeV usage with these tools.