Thursday, February 19, 2009

tips for bioinformatics

This was from a talk by Joel Dudley, originally posted by Shirley Wu at

Quite useful if you want to start programming something...

1. Learn UNIX. It’s quick, it’s powerful, it’s easy to learn. What often takes several lines to code in a scripting language can usually be reduced to a single line on the command line.

2. Be a jack of all trades, but master of ONE. That is, be familiar with most programming languages, but be really good at one of them. In the hierarchy of languages, VB and C are more “primitive” while Ruby and Python are more “advanced” - he recommends starting with one of the more advanced languages if you are new to programming. Out of Ruby and Python, Python will probably give you more bang for your buck, due to the smorgasbord of libraries available and its broad acceptance (e.g. academic labs, Google). In addition, there are lots of bridges between languages, such as Jython (Java and Python) and JRuby (Java and Ruby), so expert knowledge of one is usually sufficient for you to make a lot of things work practically everywhere.

3. Don’t reinvent the wheel. “Frameworks are your friends.” Take advantage of large existing projects like BioPython/Perl/Ruby/Java, Django, Rails, etc., which contain lots of ready-to-go code for practically everything. Use the internet to find existing code solutions - e.g. Koders is like a Google search for open source code on the web.

4. Learn one text editor really well. Take your pick of Emacs, vi, or a GUI-based editor like TextMate for Macs. The advantage of Emacs and vi is that they will be installed on pretty much any system you come across.

5. “Don’t trust yourself”, i.e. use code versioning. Examples are Subversion, CVS, and Git. You can even outsource your code hosting with GitHub. Combine this with project management in GForge.

6. Don’t be afraid to use more than 3 letters to define a variable. Having short variable names won’t make the code run faster. It will, however, make the code more difficult for others (and you, 3 months from now) to understand!

7. Balance architecture and accomplishment. You may be tempted to create something that is complete, elegant, and perfectly structured. This will likely be a waste of time. It’s ok to sacrifice a little bit of structure to get something that actually works.

8. Automate documentation. Documentation is necessary, but it’s a pain to write. So come up with a convention for your headers and make it automatic. Use available tools like Doxygen, JavaDoc, and RDoc, many of which are free.

The above are generic for academic-level software engineering. Some tips that more specifically address high-throughput biomedical computing:

9. Kill the flat file (sort of). This is the most common file format used in bioinformatics, but it hardly lends itself to efficient computation. A common task is to read the data from the file and store it keyed, so that we can look up specific pieces of the data later. Hate databases? Cringe at SQL? If you can represent your data as key/value pairs, consider using an embeddable database like the open source BerkeleyDB (now licensed by Oracle), which requires no administration. If you don’t mind SQL, but hate the administration, SQLite allows you to create embedded, serverless databases. Other options that go beyond the relational database concept are CouchDB (“a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API”) and Hypertable (“a high performance distributed data storage system”).
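As a rough illustration of the SQLite option, here is a minimal sketch using Python's built-in sqlite3 module. The tab-separated record layout, the table name "records", and the helper names load_records and lookup are made up for this example and are not from the talk.

```python
import sqlite3

# Hypothetical flat-file contents: an identifier, a tab, then a description.
FLAT_FILE_LINES = [
    "geneA\tfirst example record",
    "geneB\tsecond example record",
    "geneC\tthird example record",
]

def load_records(connection, lines):
    """Parse tab-separated lines and store them keyed by the first column."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS records (key TEXT PRIMARY KEY, value TEXT)"
    )
    rows = [tuple(line.rstrip("\n").split("\t", 1)) for line in lines]
    connection.executemany("INSERT OR REPLACE INTO records VALUES (?, ?)", rows)
    connection.commit()

def lookup(connection, key):
    """Return the stored value for key, or None if the key is absent."""
    row = connection.execute(
        "SELECT value FROM records WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None

# ":memory:" keeps the example self-contained; a file path would persist the data.
connection = sqlite3.connect(":memory:")
load_records(connection, FLAT_FILE_LINES)
print(lookup(connection, "geneB"))
```

The primary key gives indexed lookups, so the file does not have to be rescanned for every query.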

10. New ways to do parallel computing. Determine whether your tasks are loosely coupled (independent) or tightly coupled. Although personal computers and laptops are coming out with more cores, most programs only use one at a time. Find ways to utilize idle cores - e.g. there is a way to do this in R. Think in terms of MapReduce. Take advantage of cloud computing, like Amazon’s EC2. Use platforms like Hadoop and Disco to build parallel computing applications. A cool example of this is Cloudburst-Bio, a massively parallel project for mapping next-generation sequencing reads that uses MapReduce.
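For loosely coupled tasks, the map-then-reduce idea can be sketched on a single multi-core machine with Python's standard multiprocessing module. This is only a toy illustration, not Hadoop or Disco; the GC-counting task and the function names gc_count and total_gc are assumptions chosen for the example.

```python
from multiprocessing import Pool

def gc_count(sequence):
    """Map step: count G and C bases in one sequence (an independent task)."""
    return sum(1 for base in sequence.upper() if base in "GC")

def total_gc(sequences, workers=2):
    """Run the map step across worker processes, then reduce by summing."""
    with Pool(processes=workers) as pool:
        per_sequence = pool.map(gc_count, sequences)
    return sum(per_sequence)

if __name__ == "__main__":
    # Hypothetical input; real data would come from sequence files.
    example_sequences = ["ATGC", "GGCC", "ATAT"]
    print(total_gc(example_sequences))
```

Because each sequence is processed independently, the same map function could in principle be handed to a cluster framework without changing its logic.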

11. Embrace hardware. New (and old) hardware is available that can give you significant speedups in biomedical computation, notably graphics processing units (GPUs), which have been used to accelerate molecular dynamics. Hardware vendors like Nvidia are starting to respond; you can now get GPU workstations like Nvidia’s Tesla personal supercomputer, offering speedups of several hundred times over traditional workstations. So if you don’t want to utilize the cloud, you can get an affordable and powerful cluster that fits on top of your desk. Aside from GPUs, there are field-programmable gate arrays (FPGAs) - chips you can program after manufacturing.

12. Playing nice with others. Think a bit about data exchange formats - but definitely use them! Suggestions are JSON, YAML, and, of course, XML. When working in teams, use an “agile software development” strategy - mainly many fast iterations of the specification-development-feedback cycle. Use tools to automate the development process, such as unit testing and the granddaddy, “make”. Tools like Basecamp (and perhaps Science 2.0 versions like Laboratree) can help with the more general project management aspects.
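To show what using an exchange format looks like in practice, here is a small sketch with Python's standard json module. The record and its field names (sample_id, organism, read_counts) are invented for the example.

```python
import json

# Hypothetical record; the field names are illustrative only.
record = {
    "sample_id": "sample-001",
    "organism": "Mus musculus",
    "read_counts": [120, 98, 143],
}

# Serialize to text that any JSON-aware tool or language can read.
encoded = json.dumps(record, indent=2, sort_keys=True)

# Parse it back, as a collaborator's script would.
decoded = json.loads(encoded)

print(encoded)
```

The round trip keeps the structure and values intact, which is the main point of agreeing on a shared format with collaborators.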


In summary:

Focus on the goal (biology or medicine).
Don’t be clever (you’ll trick yourself).
Value your time.
Outsource everything but genius.
Use tools available to you.
And have fun. ;)

Slides for Joel’s presentation are up on Slideshare

Monday, August 18, 2008

more on dynamics

As one of the previous posts describes, heterochromatin is not always cold and inert. It goes through dynamic transcription during cell cycle progression. The earlier studies all used mixed populations of cells, which is why heterochromatin transcription has been controversial.

To expand on this a little, most modern molecular and cellular biology research (except cell cycle studies) is done using mixed populations of cells at different cell cycle phases. If you think about it, the cell cycle approach could be applied to most of this research. It would not be surprising if a lot of our current knowledge, for example about cancer formation and development, were refined by taking this approach. A new direction in science is approaching.

Sunday, August 17, 2008

An important example of heterochromatin, showing the major heterochromatin regions in mouse and their amazing global structure in a cell. The detailed properties will be described later on.

(copyright JCB)

Thursday, July 31, 2008


These are two important hallmarks of heterochromatin:

Cytological level - electron microscopy picture

Molecular level - HP1-H3K9Me and the HMTase

Wednesday, July 30, 2008

dynamic heterochromatin

This is relatively new. But I think it is necessary to put it here before all the historical information. You have to keep this dynamic thing in mind whenever you think about heterochromatin. --I am sure this will be huge in the future.

Proliferation-dependent and cell cycle–regulated transcription of mouse pericentric heterochromatin

Cell cycle regulated transcription of heterochromatin in mammals vs. fission yeast: Functional conservation or coincidence?

Monday, July 28, 2008

wikipedia on heterochromatin

This is a description of heterochromatin on Wikipedia. I am amazed that it is basically correct and even covers most of the concepts. It is a good introduction to this interesting kind of chromatin.

However, I have to correct some of the obvious mistakes.

1. First paragraph. "Its major characteristic is that transcription is limited." This is definitely going to change, with current research showing transcription from heterochromatin in many species. At the least, it is not the "major" characteristic anymore. The concept of heterochromatin has been reshaped dramatically in recent years.

2. Still first paragraph. "As such, it is a means to control gene expression, through regulation of the transcription initiation." This is correct but certainly not the most important function of heterochromatin. Also, it is not clear whether the regulation is only through transcription initiation.

3. Structure: there are some genes in heterochromatin regions, especially in the fruit fly. As for replication timing, in most species heterochromatin replicates late in S phase, but in fission yeast it replicates very early. --Joel Huberman nicely demonstrated this years ago.

4. On the potential involvement of RNAi in heterochromatin formation in higher eukaryotes like mammalian systems, the evidence is currently very controversial. Almost all supporting evidence came from one single lab. My opinion is that it is not as general. There might be some similarity, but models from fission yeast cannot simply be applied to mammalian systems.

I will expand this hot area later on.

5. One important thing missing is the molecular hallmark of heterochromatin, i.e. the interaction between histone H3 methylated at lysine 9 (H3K9) and HP1 (heterochromatin protein 1), and the histone H3K9 methyltransferases.

Ok. Enough for Wikipedia.

more coming...

Sunday, July 27, 2008


It all started with Emil Heitz (a German botanist). His 1928 paper on "the Heterochromatin of Moss" first coined the term heterochromatin - the form of DNA that stays tightly packed throughout the cell cycle.