Initial import of new posts
parent
e4bafbb361
commit
0e12688f04
391 changed files with 14594 additions and 0 deletions
51
_posts/2007-11-07-data-clustering-with-python.markdown
Normal file
@@ -0,0 +1,51 @@
---
author: einar
comments: true
date: 2007-11-07 18:15:29+00:00
layout: post
slug: data-clustering-with-python
title: Data clustering with Python
wordpress_id: 330
categories:
- Linux
- Science
tags:
- bioinformatics
- cluster
- python
- R
---

**Notice:** I just realized this post has been linked from [a Stack Overflow question](http://stackoverflow.com/questions/5002783/best-python-clustering-library-to-use-for-product-data-analysis). I recently wrote a new post that uses a different technique and a combination of R and Python. [Check it out!](http://www.dennogumi.org/2011/05/multiscale-bootstrap-clustering-with-python-and-r)

Following up on my recent post, I've been looking for alternatives to TMeV. So far I've found the R package pvclust and the [Pycluster library](http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm#pycluster), part of [BioPython](http://biopython.org). The first one also performs bootstrapping (I'm not sure whether it's similar to what support trees do, but it's still better than no resampling at all). I've also found [another Python project](http://python-cluster.sourceforge.net/), but it is still too basic for what I need.

<!-- more -->
Pvclust would be my first choice, but it only plots dendrograms, not heat maps, and because it only clusters columns, the clustering must be run twice: once on the data and once on its transpose. [The package's web page](http://www.is.titech.ac.jp/~shimo/prog/pvclust/) shows the various options and how to use them.

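
To make the "cluster twice by transposing" point concrete, here is a minimal sketch in Python of what transposing does to an expression matrix. The matrix values and the `transpose` helper are made up for illustration and are not part of pvclust (in R itself, the built-in `t()` function does the same job).

```python
# Hypothetical expression matrix: rows are genes, columns are samples.
expression = [
    [1.0, 2.0, 3.0],  # gene A
    [4.0, 5.0, 6.0],  # gene B
]


def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


# A tool that only clusters columns groups samples when given
# `expression`, and groups genes when given `transposed`.
transposed = transpose(expression)
print(transposed)
```

Running the column-only clustering once on each of the two matrices gives both the sample tree and the gene tree.
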
Pycluster, on the other hand, can generate files that can be read by the Java TreeView program, where you can view a heat map of the results and their annotations. Although documentation is available, it is not part of the Biopython documentation (as usual, I'd say: lack of documentation is a plague for Biopython). In any case, doing a cluster analysis is rather simple, as long as we remember to do two clustering runs (one for genes, the other for experiments). Here I show an example with hierarchical clustering, but [the documentation](http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/cluster.pdf) (the Python part is in chapter 8) also has examples of other methods, such as SOMs or k-means.

{% highlight python %}

from Bio.Cluster import DataFile

# Load data, in Cluster format.
# DataFile is the Pycluster class name at the time of writing; newer Biopython
# releases load the same file with Bio.Cluster.read(handle), which returns a Record.
data = DataFile("somefile.txt")

# Cluster genes (rows): Pearson correlation distance, average linkage
gene_clustering = data.treecluster(method="a", dist="c", transpose=0)

# Same as above, but clustering samples (columns)
exp_clustering = data.treecluster(method="a", dist="c", transpose=1)

# Save the results to a series of files to view in Java TreeView
data.save("name", gene_clustering, exp_clustering)
{% endhighlight %}

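
For readers wondering what `method="a"` and `dist="c"` amount to, the following is a rough pure-Python sketch of average-linkage clustering on a Pearson correlation distance (1 minus the correlation coefficient). It is not Pycluster's implementation: the function names and the toy data are invented here, the algorithm is the naive one, and it assumes every row has non-zero variance.

```python
from itertools import combinations
from math import sqrt


def pearson_distance(x, y):
    """Return 1 - r, where r is Pearson's correlation of x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return 1.0 - numerator / denominator


def average_linkage(rows):
    """Naive agglomerative clustering; returns the list of merges.

    Each merge is (members_a, members_b, distance), where the members
    are tuples of row indices.
    """

    def linkage(pair):
        # Average of all pairwise distances between the two clusters.
        a, b = pair
        total = sum(pearson_distance(rows[i], rows[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [(i,) for i in range(len(rows))]
    merges = []
    while len(clusters) > 1:
        # Merge the closest pair of clusters, then repeat.
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical data: three genes measured in four experiments.
genes = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
]
merges = average_linkage(genes)
for a, b, distance in merges:
    print(a, b, round(distance, 3))
```

The two genes with nearly proportional profiles merge first, and the anticorrelated one joins last at a distance close to 2. Clustering the experiments instead is the same call on the transposed matrix, which is what `transpose=1` selects in the Pycluster example.
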
[Java TreeView](http://jtreeview.sourceforge.net/) is a program to view trees and heat maps. Unlike its counterpart TreeView, it's truly cross-platform (Java) and GPLed, a nice added bonus. You can load the files directly and display the results as in this picture, taken with the sample data available on the project page.


[Java TreeView screenshot](http://www.dennogumi.org/wp-content/uploads/2007/11/treeview.png)


It's still not perfect (data values are shown only in the detailed view, not on the main map page), but it's a good start nevertheless. I'll investigate whether I can complement TMeV usage with these tools.