---
author: einar
comments: true
date: 2007-11-07 18:15:29+00:00
layout: page
slug: data-clustering-with-python
title: Data clustering with Python
wordpress_id: 330
categories:
- Linux
- Science
header:
  image_fullwidth: "banner.jpg"
tags:
- bioinformatics
- cluster
- python
- R
---

**Notice:** I just realized that this post has been linked from [a Stack Overflow question](http://stackoverflow.com/questions/5002783/best-python-clustering-library-to-use-for-product-data-analysis). I recently wrote a new post that uses a different technique and a combination of R and Python. [Check it out!]({{ site.url }}/2011/05/multiscale-bootstrap-clustering-with-python-and-r)

Following up on my recent post, I've been looking for alternatives to TMeV. So far I've found the R package pvclust and the [Pycluster library](http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm#pycluster), part of [Biopython](http://biopython.org). The former also performs bootstrap resampling (I'm not sure whether it's similar to what support trees do, but it's still better than no resampling at all). I've also found [another Python project](http://python-cluster.sourceforge.net/), but it is still too basic for what I need.

<!-- more -->

Pvclust would be my first choice, but it only plots dendrograms, not heat maps, and because it only clusters columns, the clustering must be run twice: once on the data and once on its transpose. [The package's web page](http://www.is.titech.ac.jp/~shimo/prog/pvclust/) describes the available options and how to use them.
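
Since pvclust only clusters columns, the second run needs the transposed matrix. As a minimal sketch in plain Python, assuming a small made-up genes-by-samples table (the values and the `transpose` helper are hypothetical, only there for illustration), the swap could look like this:

```python
def transpose(rows):
    """Swap rows and columns of a rectangular list of lists."""
    return [list(column) for column in zip(*rows)]


# Hypothetical expression matrix: 3 genes (rows) x 2 samples (columns)
genes_by_samples = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]

# After transposing, genes become the columns, so a tool that only
# clusters columns would now group genes instead of samples.
samples_by_genes = transpose(genes_by_samples)
print(samples_by_genes)
```

Transposing twice gives back the original table, so either orientation can be kept as the reference copy.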

Pycluster, on the other hand, can generate files that the Java TreeView program can read, so you can view a heat map of the results along with their annotations. Documentation is available, but it is not part of the Biopython documentation (as usual, I'd say: lack of documentation is a plague for Biopython). In any case, running a cluster analysis is rather simple; just remember that it takes two clustering runs (one for genes, the other for experiments). Here I show an example with hierarchical clustering, but [the documentation](http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/cluster.pdf) (the Python part is in chapter 8) also has examples of other methods, such as SOMs and k-means.

{% highlight python %}
from Bio.Cluster import DataFile

# Load data in Cluster format.
# Note: newer Biopython releases replace DataFile with Bio.Cluster.read(),
# which returns a Record offering the same treecluster() and save() methods.
data = DataFile("somefile.txt")

# Cluster genes (rows) using Pearson's correlation and average linkage
gene_clustering = data.treecluster(method="a", dist="c", transpose=0)

# Same as above, but clustering samples (columns)
exp_clustering = data.treecluster(method="a", dist="c", transpose=1)

# Save the results to a series of files to view in Java TreeView
data.save("name", gene_clustering, exp_clustering)
{% endhighlight %}
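
To make the `method="a"` and `dist="c"` options a bit less opaque, here is a minimal pure-Python sketch of average-linkage clustering on a Pearson correlation distance (1 - r). It only illustrates the idea and is not how Pycluster implements it; the function names and the toy data are made up, and rows are assumed to be non-constant (a constant row would have zero variance).

```python
from math import sqrt


def pearson_distance(x, y):
    """Return 1 - r, where r is Pearson's correlation of x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Assumes neither row is constant, otherwise this divides by zero
    return 1.0 - cov / sqrt(var_x * var_y)


def average_linkage(rows):
    """Merge the two closest clusters until one is left.

    Returns the merges in order as (members_a, members_b, distance).
    """
    clusters = [[i] for i in range(len(rows))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average of all pairwise distances between the two clusters
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(pearson_distance(rows[a], rows[b]) for a, b in pairs)
                d /= len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Toy data (hypothetical): rows 0 and 1 rise together, row 2 falls
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
]

for left, right, distance in average_linkage(profiles):
    print(left, right, round(distance, 3))
```

With this toy data the two rising profiles are merged first, and the falling one joins last at a distance close to 2, the largest value that 1 - r can take. Running the same function on the transposed table would cluster the samples instead, which is the role `transpose` plays in the Pycluster call.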

[Java TreeView](http://jtreeview.sourceforge.net/) is a program for viewing trees and heat maps. Unlike its counterpart TreeView, it's truly cross-platform (Java) and GPLed, a nice added bonus. You can load the files directly and display the results as in this picture, made with the sample data available on the project page.

![Java TreeView screenshot]({{ site.url }}/images/2007/11/treeview.png)

It's still not perfect (no data is shown on the main map page, only in the detailed view), but it's a good start nevertheless. I'll investigate whether I can complement TMeV usage with these tools.