---
author: einar
categories:
- Linux
- Science
comments: true
date: "2007-11-07T18:15:29Z"
header:
  image_fullwidth: banner.jpg
slug: data-clustering-with-python
tags:
- bioinformatics
- cluster
- python
- R
title: Data clustering with Python
disable_share: true
wordpress_id: 330
---

**Notice:** I just realized this post has been linked in [a Stack Overflow question](http://stackoverflow.com/questions/5002783/best-python-clustering-library-to-use-for-product-data-analysis). I recently wrote a new post that uses a different technique and a combination of R and Python. [Check it out!]({{ site.url }}/2011/05/multiscale-bootstrap-clustering-with-python-and-r)

Following up on my recent post, I've been looking for alternatives to TMeV. So far I've found the R package pvclust and the [Pycluster library](http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm#pycluster), part of [BioPython](http://biopython.org). The first one also performs bootstrapping (I'm not sure whether it's similar to what support trees do, but it's still better than no resampling at all). I also found [another Python project](http://python-cluster.sourceforge.net/), but it is still too basic for what I need.

<!--more-->
Pvclust would be my first choice, but it only plots dendrograms, not heat maps, and because it only clusters columns, the clustering must be run twice, the second time on the transposed data. [The package's web page](http://www.is.titech.ac.jp/~shimo/prog/pvclust/) shows the various options and how to use them.

Pycluster, on the other hand, can generate files that the Java TreeView program can read, where you can view a heat map of the results and their annotations. Although there's documentation available, it is not part of the Biopython documentation (as usual, I'd say: lack of documentation is a plague for Biopython). In any case, running a cluster analysis is rather simple, but remember that two clustering runs are needed (one for genes, the other for experiments). Here I show an example with hierarchical clustering, but [the documentation](http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/cluster.pdf) (the Python part is in chapter 8) also has examples with other methods such as SOMs or k-means.

{{< highlight python >}}

from Bio.Cluster import DataFile

# Load data, in Cluster format.
# Note: later Biopython releases load the file with Bio.Cluster.read(handle)
# instead, which returns a Record offering the same treecluster and save methods.
data = DataFile("somefile.txt")

# Cluster genes (rows) using Pearson's correlation and average linkage
gene_clustering = data.treecluster(method="a", dist="c", transpose=0)

# Same as above, but clustering samples (columns)
exp_clustering = data.treecluster(method="a", dist="c", transpose=1)

# Save the results to a series of files named "name.*" to view in Java TreeView
data.save("name", gene_clustering, exp_clustering)
{{< / highlight >}}
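
To make the `method="a"` and `dist="c"` settings a bit more concrete, here is a minimal pure-Python sketch of what they stand for: a distance of 1 minus Pearson's correlation, combined with average linkage. This is not Pycluster's implementation (which is written in C and far faster). The function names `pearson_distance` and `average_linkage` and the toy matrix are made up for illustration, and I'm assuming the usual definition of the correlation distance.

```python
from itertools import combinations
from math import sqrt


def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Note: a constant row has zero variance and would divide by zero here.
    return 1.0 - cov / sqrt(var_x * var_y)


def average_linkage(rows):
    """Agglomerative clustering of rows; returns the list of merges.

    Each merge is (members_a, members_b, distance), where the members
    are tuples of original row indices.
    """
    # Pairwise distances between the original rows, computed once.
    base = {
        (i, j): pearson_distance(rows[i], rows[j])
        for i, j in combinations(range(len(rows)), 2)
    }

    def cluster_distance(a, b):
        # Average linkage: mean distance over all cross-cluster pairs.
        pairs = [(min(i, j), max(i, j)) for i in a for j in b]
        return sum(base[p] for p in pairs) / len(pairs)

    clusters = [(i,) for i in range(len(rows))]
    merges = []
    while len(clusters) > 1:
        # Merge the two closest clusters, then repeat until one remains.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(*pair),
        )
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy expression matrix (made-up values): rows are genes, columns are experiments.
expression = [
    [1.0, 2.0, 3.0, 4.0],  # gene 0, rises
    [2.0, 4.0, 6.0, 8.5],  # gene 1, rises like gene 0
    [4.0, 3.0, 2.0, 1.0],  # gene 2, falls
    [8.0, 6.5, 4.0, 2.0],  # gene 3, falls like gene 2
]

# First run: cluster the genes (rows).
gene_merges = average_linkage(expression)

# Second run: cluster the experiments, i.e. the rows of the transposed matrix.
transposed = [list(column) for column in zip(*expression)]
experiment_merges = average_linkage(transposed)

for members_a, members_b, distance in gene_merges:
    print(members_a, members_b, round(distance, 4))
```

The second run is the same call on the transposed matrix, which is all that `transpose=1` asks for in the snippet above. Because the correlation distance ignores absolute magnitude, genes 0 and 1 end up together even though their values differ by a factor of two.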

[Java TreeView](http://jtreeview.sourceforge.net/) is a program to view trees and heat maps. Unlike its counterpart TreeView, it's truly cross-platform (Java) and GPLed, a nice added bonus. You can load the files directly and display the results as in this picture, taken with the sample data available on the project page.


[![Java TreeView]({{ site.url }}/images/2007/11/treeview.thumbnail.png)]({{ site.url }}/images/2007/11/treeview.png)


It's still not perfect (no data values are shown on the main map page, only in the detailed view), but it's a good start nevertheless. I'll investigate whether I can complement TMeV usage with these tools.