---
author: einar
comments: true
date: 2007-11-07 18:15:29+00:00
layout: page
slug: data-clustering-with-python
title: Data clustering with Python
wordpress_id: 330
categories:
- Linux
- Science
tags:
- bioinformatics
- cluster
- python
- R
---

**Notice:** I just realized this post has been linked from [a Stack Overflow question](http://stackoverflow.com/questions/5002783/best-python-clustering-library-to-use-for-product-data-analysis). I recently wrote a new post that uses a different technique and a combination of R and Python. [Check it out!]({{ site.url }}/2011/05/multiscale-bootstrap-clustering-with-python-and-r)

Following up on my recent post, I've been looking for alternatives to TMeV. So far I've found the R package pvclust and the [Pycluster library](http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm#pycluster), part of [Biopython](http://biopython.org). The first one also performs bootstrapping (I'm not sure whether it's similar to what support trees do, but it's still better than no resampling at all). I've also found [another Python project](http://python-cluster.sourceforge.net/), but it is still too basic for what I need.

<!-- more -->
Pvclust would be my first choice, but it only plots dendrograms, not heat maps, and because it only clusters columns, the clustering must be done twice, the second time on the transposed data. [The package's web page](http://www.is.titech.ac.jp/~shimo/prog/pvclust/) shows the various options and how to use them.

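To illustrate what "transposing the data" means here, below is a minimal plain-Python sketch. The `transpose` helper and the toy `expression` matrix are made up for this example; in R itself the built-in `t()` function does the same job before the second pvclust run.

```python
def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


# Hypothetical toy data: 3 genes (rows) x 2 experiments (columns).
expression = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.7, 0.3],
]

# After transposing, each row is an experiment and each column a gene,
# so a tool that only clusters columns now clusters the genes.
by_experiment = transpose(expression)
```
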
Pycluster, on the other hand, can generate files that can be read by the Java TreeView program, where you can view a heat map of the results and their annotations. Documentation is available, but it is not part of the Biopython documentation (as usual, I'd say: lack of documentation is a plague for Biopython). In any case, running a cluster analysis is rather simple; just remember that two cluster runs are needed (one for genes, the other for experiments). Here I show an example with hierarchical clustering, but [the documentation](http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/cluster.pdf) (the Python part is in chapter 8) also has examples of other methods, such as SOMs or k-means.

{% highlight python %}
from Bio.Cluster import DataFile

# Load data, in Cluster format
data = DataFile("somefile.txt")

# Cluster genes (rows) using Pearson's correlation and average linkage
gene_clustering = data.treecluster(method="a", dist="c", transpose=0)

# Same as above, but clustering samples (columns)
exp_clustering = data.treecluster(method="a", dist="c", transpose=1)

# Save the results to a series of files (name.cdt, name.gtr and name.atr)
# to view in Java TreeView
data.save("name", gene_clustering, exp_clustering)
{% endhighlight %}

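To make the `method="a"` and `dist="c"` arguments above more concrete, here is a small plain-Python sketch of average-linkage clustering on a Pearson correlation distance (1 minus the correlation). It is not Pycluster's implementation: the `pearson_distance` and `average_linkage` helpers and the toy `genes` matrix are invented for illustration, and the code assumes no row is constant (that would mean dividing by zero).

```python
from itertools import combinations
from math import sqrt


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / sqrt(var_x * var_y)


def average_linkage(rows):
    """Agglomerative clustering of rows; returns the merges in order.

    Each merge is (members_a, members_b, distance): two tuples of row
    indices and the mean pairwise distance between their members.
    """
    dist = {
        frozenset((i, j)): pearson_distance(rows[i], rows[j])
        for i, j in combinations(range(len(rows)), 2)
    }

    def linkage(pair):
        # Average linkage: mean distance over all cross-cluster pairs.
        a, b = pair
        total = sum(dist[frozenset((i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [(i,) for i in range(len(rows))]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters that are currently closest.
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical toy matrix: rows are genes, columns are experiments.
genes = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.5, 4.0, 2.0],
]

# First run: cluster the genes (like transpose=0 above).
gene_merges = average_linkage(genes)

# Second run: cluster the experiments by transposing (like transpose=1).
exp_merges = average_linkage([list(column) for column in zip(*genes)])
```
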
[Java TreeView](http://jtreeview.sourceforge.net/) is a program for viewing trees and heat maps. Unlike its counterpart TreeView, it's truly cross-platform (Java) and GPLed, a nice added bonus. You can load the files directly and display the results as in this picture, taken with the sample data available on the project page.

![Java TreeView showing the clustered sample data]({{ site.url }}/images/2007/11/treeview.png)

It's still not perfect (no data is shown on the main map page, only in the detailed view), but it's a good start nevertheless. I'll investigate whether I can complement TMeV usage with these tools.