Initial import of new posts

2015-04-27 20:54:45 +02:00 · 2015-04-27 20:54:45 +02:00 · 0e12688f04
commit 0e12688f04
parent e4bafbb361
391 changed files with 14594 additions and 0 deletions
--- a/_posts/2007-11-07-data-clustering-with-python.markdown
+++ b/_posts/2007-11-07-data-clustering-with-python.markdown
@@ -0,0 +1,51 @@
+---
+author: einar
+comments: true
+date: 2007-11-07 18:15:29+00:00
+layout: post
+slug: data-clustering-with-python
+title: Data clustering with Python
+wordpress_id: 330
+categories:
+- Linux
+- Science
+tags:
+- bioinformatics
+- cluster
+- python
+- R
+---
+
+**Notice:** I just realized that this post has been linked from [a Stack Overflow question](http://stackoverflow.com/questions/5002783/best-python-clustering-library-to-use-for-product-data-analysis). I recently wrote a new post that uses a different technique and a combination of R and Python. [Check it out!](http://www.dennogumi.org/2011/05/multiscale-bootstrap-clustering-with-python-and-r)
+
+Following up on my recent post, I've been looking for alternatives to TMeV. So far I've found the R package pvclust and the [Pycluster library](http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm#pycluster), part of [Biopython](http://biopython.org). Pvclust also performs bootstrapping (I'm not sure if it's similar to what support trees do, but it's still better than no resampling at all). I also found [another Python project](http://python-cluster.sourceforge.net/), but it is still too basic for what I need.
+
+<!-- more -->
+Pvclust would be my first choice, but it only plots dendrograms, not heat maps, and because it only clusters columns, the clustering must be done twice, the second time on the transposed data. [The package's web page](http://www.is.titech.ac.jp/~shimo/prog/pvclust/) shows the various options and how to use them.
+
+Pycluster, on the other hand, can be used to generate files which can be read by the Java TreeView program, where you can view a heat map of the results and their annotations. Although there's documentation available, it is not part of the Biopython documentation (as usual, I'd say: lack of documentation is a plague for Biopython). In any case, running a cluster analysis is rather simple, as long as we remember to do two cluster runs (one for genes, the other for experiments). Here I show an example with hierarchical clustering, but [the documentation](http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/cluster.pdf) (the Python part is in chapter 8) also has examples of other methods such as SOMs or k-means.
+
+{% highlight python %}
+
+from Bio.Cluster import *
+
+#   Load data, in Cluster format
+data = DataFile("somefile.txt")
+
+#   Clustering using Pearson's correlation and average linkage
+gene_clustering = data.treecluster(method="a", dist="c", transpose=0)
+
+#   Same as above, but clustering samples
+exp_clustering = data.treecluster(method="a", dist="c", transpose=1)
+
+#   We then save the results to a series of files to view in Java TreeView
+data.save("name", gene_clustering, exp_clustering)
+{% endhighlight %}
+
+[Java TreeView](http://jtreeview.sourceforge.net/) is a program to view trees and heat maps. Unlike its counterpart TreeView, it's truly cross-platform (Java) and GPLed, a nice added bonus. You can load the files directly and display the results as in this picture, taken with the sample data available on the project page.
+
+
+[![Java TreeView](http://www.dennogumi.org/wp-content/uploads/2007/11/treeview.thumbnail.png)](http://www.dennogumi.org/wp-content/uploads/2007/11/treeview.png)
+
+
+It's still not perfect (no data is shown on the main map page, only in the detailed view), but it's a good start nevertheless. I'll investigate whether I can complement TMeV usage with these tools.
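
To make the two cluster runs in the post above more concrete, here is a minimal pure-Python sketch of what hierarchical clustering with a Pearson correlation distance and average linkage does, run once on the rows (genes) and once on the transposed matrix (experiments). It is not the Pycluster implementation: the function names, the node numbering (merged clusters get ids `n`, `n + 1`, ...) and the tiny data matrix are all assumptions made for illustration, and the naive search is far slower than a real library.

```python
from math import sqrt


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption for this sketch: a constant profile is treated as uncorrelated.
        return 1.0
    return 1.0 - cov / sqrt(var_x * var_y)


def average_linkage(rows):
    """Cluster rows bottom-up; return a list of (left, right, distance) merges.

    Items are numbered 0..n-1; each merged cluster gets the next free id.
    """
    n = len(rows)
    # Pairwise distances between the original items, keyed by (smaller, larger).
    dist = {
        (i, j): pearson_distance(rows[i], rows[j])
        for i in range(n)
        for j in range(i + 1, n)
    }
    clusters = {i: [i] for i in range(n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Average linkage: mean distance over all cross-cluster pairs.
                pairs = [
                    (min(i, j), max(i, j))
                    for i in clusters[a]
                    for j in clusters[b]
                ]
                d = sum(dist[p] for p in pairs) / len(pairs)
                if best is None or d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


def transpose(matrix):
    """Swap rows and columns, so experiments become the items to cluster."""
    return [list(column) for column in zip(*matrix)]


# Made-up expression matrix: rows are genes, columns are experiments.
matrix = [
    [1.0, 2.0, 3.0, 4.0],  # gene 0
    [2.0, 4.1, 6.0, 8.0],  # gene 1, almost proportional to gene 0
    [4.0, 3.0, 2.0, 1.0],  # gene 2, anti-correlated with gene 0
]

gene_tree = average_linkage(matrix)            # first run: genes
exp_tree = average_linkage(transpose(matrix))  # second run: experiments

for left, right, distance in gene_tree:
    print(left, right, round(distance, 4))
```

Each run yields one tree. The two trees correspond to the separate gene and experiment clusterings that the post passes to `data.save`.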