Saturday, August 23, 2008

Coding with Class Sequence

Ok… time to get down to business! This time, we’re going to see how to code with class Sequence. To use class Sequence in Python, download the source code file here and remember to import it into your Python source file or interactive session. The following examples assume that you're working at the Python interactive session command prompt. Input (what you type in...!) and output (what Python throws back at you) have been coloured blue and green respectively to help distinguish them from the post's text. Python keywords are coloured orange.

First, let’s create a Sequence object. To do this, type the following at the Python command prompt (‘>>>’):

>>> my_sequence = Sequence(name = 'Sequence 1', seq = 'This is a sequence')

Notice that the name of the sequence and the actual sequence are written inside quotes (you could use either single or double quotes). Anything written inside quotes is taken by the Python interpreter to be a string.
Now, type

>>> print my_sequence

The Python interpreter should print the following:

>Sequence 1

Ok… so it works! Now let’s try to add two Sequence objects…

>>> another_sequence = Sequence(name = 'Sequence 2', seq = ' of english characters')
>>> joined_sequences = my_sequence + another_sequence
>>> print joined_sequences
>Sequence 1+Sequence 2

Note that the variable ‘joined_sequences’ is also a Sequence object. Adding two Sequence objects creates a new Sequence object whose name reflects the fact that it is the sum of two Sequences. You may then change the name of a Sequence object if you wish:

>>> joined_sequences.setname('New Name')

One can also obtain the name or the sequence contained within a Sequence object if one wishes:

>>> name = joined_sequences.getname()
>>> seq = joined_sequences.getseq()
>>> type(name)
<type 'str'>

Notice that the name and sequence are themselves just Python strings. If you want to get just the 6th letter in the sequence, type:

>>> sixth_char = joined_sequences[5]
>>> print sixth_char

Note that to access the sixth letter we used ‘joined_sequences[5]’. That’s because the first character in a Python string is actually numbered zero! We can actually search for the first ‘I’ in joined_sequences:

>>> position_first_I = joined_sequences.find(motif = 'I')
>>> print position_first_I.start()
>>> print position_first_I.end()
>>> print position_first_I.span()
(2, 3)

The ‘find’ function can find not only characters but entire sub-strings!

>>> pos_subs = joined_sequences.find(motif = 'SEQUENCE')
>>> print pos_subs.span()
(10, 18)

Finally, the ‘fragment’ function returns the piece of the sequence that lies between the start and stop positions you supply:

>>> fragment = joined_sequences.fragment(my_start = 10, my_stop = 18)
>>> print fragment
>New Name(10,18)

Next time we’ll see how the more biologically relevant classes DNA, preMRNA, mRNA and Protein work.

Tuesday, July 15, 2008

Class Sequence

Class ‘Sequence’ is the result of an effort to create a generic class of object that encapsulates the properties common to most information-holding biological sequences. Hmmm… now what could those be? Obviously, one needs to store the actual sequence. It might also prove useful to define the alphabet (e.g. A, T, G and C for DNA) that makes up the sequence, as well as the properties of its letters, such as the molecular weight of each nucleotide or amino acid. Another important property is the number of letters that are read together (i.e. the word size) to derive meaning from the sequence. As you’ve probably already guessed, the word size in an ORF would be 3 (i.e. the size of a codon). We’ve also included a few artificial but useful properties like name, locus and any interesting references that you might want to associate with the sequence.
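
To make the idea of word size concrete, here is a minimal sketch in plain Python. It does not use GenePython, and the helper name `split_into_words` is made up for this illustration. It reads a sequence in non-overlapping words of a given size, so a word size of 3 yields codons.

```python
def split_into_words(seq, word_size):
    """Return the non-overlapping words of length word_size in seq.

    Trailing letters that do not fill a whole word are dropped.
    """
    usable = len(seq) - len(seq) % word_size
    return [seq[i:i + word_size] for i in range(0, usable, word_size)]


orf = 'ATGGCCTAA'            # toy ORF: start codon, one codon, stop codon
codons = split_into_words(orf, 3)
print(codons)                # ['ATG', 'GCC', 'TAA']
```

Dropping an incomplete trailing word is a choice made for this sketch; one could equally raise an error when the length is not a multiple of the word size.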

The class also contains methods to manipulate the sequence and other properties. Two of them mirror Python string operations: addition and array-like indexing of the sequence. The binary ‘+’ operator behaves exactly as one would expect! The addition of two Sequence objects results in the formation of a new Sequence object that contains the end-to-end ligation of the two sequences that the operator acts upon. The ‘[n]’ (index) operator returns the letter at position n in the sequence, counting from zero. There is also a set of methods whose names speak for themselves! ‘getname’ and ‘getseq’ return the name and the sequence, while ‘setname’ and ‘setseq’ alter them. ‘locus’ and ‘ref’ are methods that alter the locus or references.

Two important methods are ‘find’ and ‘fragment’. ‘find’ searches for a sub-sequence or ‘motif’ in the sequence and returns the position of the first instance (if it finds one). ‘fragment’ returns a fragment of the Sequence according to the start and end positions supplied by the programmer. These two methods are at the heart of the GenePython ideology. For example, the RNAPolymerase Virtual Enzyme uses the ‘find’ method to locate important signal sequences and then returns the ‘fragment’ that lies between the TATA box and poly-adenylation signal.
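
The behaviour described above can be sketched in a few lines. Please treat this as an illustrative stand-in and not as the GenePython source: the method and parameter names (‘find(motif=...)’, ‘fragment(my_start=..., my_stop=...)’) follow this blog, but everything inside the methods is an assumption. In particular, the sketch assumes that letters are stored in upper case, that ‘find’ hands back a regular-expression match object, and that printing shows only a ‘>name’ header. The motifs in the toy example are placeholders too.

```python
import re


class Sequence(object):
    """Illustrative stand-in for a generic sequence; NOT the GenePython source."""

    def __init__(self, name, seq):
        self.name = name
        self.seq = seq.upper()      # assumption: letters are stored in upper case

    def __str__(self):
        return '>' + self.name      # assumption: printing shows a header line

    def __add__(self, other):
        # End-to-end ligation; the new name records both parents.
        return Sequence(self.name + '+' + other.name, self.seq + other.seq)

    def __getitem__(self, n):
        return self.seq[n]          # zero-based, like a Python string

    def getname(self):
        return self.name

    def getseq(self):
        return self.seq

    def setname(self, name):
        self.name = name

    def setseq(self, seq):
        self.seq = seq.upper()

    def find(self, motif):
        # Match object for the first instance of motif, or None if absent.
        return re.search(re.escape(motif), self.seq)

    def fragment(self, my_start, my_stop):
        # Letters from my_start up to (not including) my_stop, as a new Sequence.
        name = '%s(%d,%d)' % (self.name, my_start, my_stop)
        return Sequence(name, self.seq[my_start:my_stop])


# Toy version of "return the fragment between two signals" (made-up sequence).
dna = Sequence(name='Gene X', seq='ggTATAAAatgcccAATAAAcc')
tata = dna.find(motif='TATAAA')
polya = dna.find(motif='AATAAA')
between = dna.fragment(my_start=tata.end(), my_stop=polya.start())
print(between)
print(between.getseq())
```

A real Virtual Enzyme would also have to handle the case where ‘find’ returns nothing; the sketch skips that check to stay short.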

Next time, we’ll look at some actual coding examples that will illustrate how class Sequence works.

Wednesday, July 2, 2008

Getting started with GenePython...

Before using GenePython, you should install the latest stable version of Python (which currently happens to be version 2.5.2; do not download any other version) on your computer. Just run the installer and... that's it! You are now ready to use Python. Check out the help and tutorial files for a detailed introduction to the language... or just read on!

Python is an object-oriented scripting language, and as soon as you enter a command the Python interpreter responds. The term ‘object’ refers to an entity that not only contains data, but also contains instructions (called methods or functions) on how to manipulate that data. The word ‘class’ essentially refers to the ‘type’ of the object or, for those with a more philosophical bent of mind, the abstract qualities that define an object! Let me give a real world analogy to make things clearer… your shiny new car, assuming you are lucky enough to own one, is an object of class ‘Ferrari’, or maybe class ‘BMW’ or maybe even class ‘Fiat’.

All car objects present you, the driver, with interfaces to manipulate them in a specific way. The steering-wheel, brake pedal and door-handle are some such interfaces that offer a simple way of doing pretty complicated things. For instance, if you turn the steering wheel to the right, the car turns (oddly enough) to the right! Unbeknownst to you, turning the steering wheel right actually ends up turning a shaft with a geared end, which, in turn, displaces a ratchet which in turn moves another thing, and another… that finally turns the wheels, which in turn transmit the frictional forces between them and the road to the rest of the car, which ends up turning (like I said, oddly enough!). The great part about being a driver is that you never needed to know all this! You just turn the wheel and the car turns… simple! As in the real world, well-designed objects in the programming world offer a simple and easy to use interface to ‘drivers’ so that they can get on with their driving.

You may have noticed that regardless of the class of car, there is a set of common interfaces: brake-pedal, steering-wheel etc. One can imagine that each particular class of car has ‘inherited’ some of its common properties from a parent class… class ‘Car’! So, in object-oriented programming, if one intends to create a lot of classes which have common properties, all one need do is create one ‘parent’ class which encapsulates all those common properties. Then one just creates different classes, each inheriting these common properties from the parent class! This saves you a whole lot of typing and helps you organize your thoughts and your coding. Of course, one can tweak the properties of each different daughter class as required. For instance, all cars have steering-wheels, but some have power steering while others don’t.
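
Here is roughly what the car analogy looks like in Python. This is only a sketch: the class names come from the analogy, while the ‘steer’ method and the ‘power_steering’ and ‘heading’ attributes are invented for the illustration.

```python
class Car(object):
    """Parent class holding what every car has in common."""

    power_steering = False          # default that daughter classes may tweak

    def __init__(self, owner):
        self.owner = owner
        self.heading = 0            # degrees; 0 means straight ahead

    def steer(self, degrees):
        # The simple interface: the driver never sees the shafts and gears.
        self.heading = (self.heading + degrees) % 360
        return self.heading


class Fiat(Car):
    """Inherits everything from Car unchanged."""


class Ferrari(Car):
    """Inherits from Car but tweaks one property."""

    power_steering = True


my_car = Ferrari(owner='you')
print(my_car.steer(90))                   # 90: 'steer' was written once, in Car
print(my_car.power_steering)              # True: tweaked in the daughter class
print(Fiat(owner='me').power_steering)    # False: inherited default
```

Neither daughter class repeats the steering code; that is the typing you save by inheriting from the parent.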

Python offers a set of standard classes so that programmers might while away their time productively. These classes are used to store and manipulate commonly required data types such as numbers, strings of characters, ordered lists of stuff etc. So for instance, in Python, the number ‘5’ is an object of type ‘int’ (integer), the word ‘five’ is an object of type ‘str’ (string of characters) and one could, if one wished, create a ‘list’ object in which the first element could be ‘5’, second element ‘five’ and so on. Most useful code consists of creating ‘variables’ of these classes that change their values according to your rules.
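
A tiny example of these standard classes at work (the variable names are arbitrary):

```python
number = 5                  # an object of class int
word = 'five'               # an object of class str
mixed = [number, word]      # a list whose first element is 5, second 'five'

print(type(number) is int)  # True
print(type(word) is str)    # True
print(mixed[0] + 1)         # 6: ints know how to add
print(word.upper())         # FIVE: strs carry their own methods

# A variable changes its value according to your rules.
count = 0
for item in mixed:
    count = count + 1
print(count)                # 2
```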

GenePython offers an additional set of classes that help biologists, bioinformaticians or anyone interested make sense of biological information such as sequences or strings. The concept of inheritance is used extensively in GenePython. For instance, classes representing information-holding sequences like DNA, RNA and protein inherit from the parent class ‘Sequence’, which holds all the common properties that one would like any information-holding sequence to have, something we’ll deal with next time.


Monday, June 16, 2008

GenePython - What’s the Big Idea?

GenePython is exactly what its name suggests! It is the use of Python, a popular and simple open-source programming language, to simulate gene expression in silico. Python has a way to depict real objects in what, in Pythonspeak, are called ‘classes’. Classes can also be formed from other classes, thereby inheriting their properties and behavior.

GenePython uses Python classes to depict molecular objects and define their behavior. This allows the flexibility of not just exploring the behavior of a mixture of different objects in-silico but also redefining the response of individual objects to different stimuli.

So in GenePython we have a class defining a sequence, for instance a genome comprising either DNA or RNA. Similarly, the sequences involved in the central dogma of molecular biology, as well as its adjuncts, can all be represented using the class ‘Sequence’ or classes that inherit from class Sequence.

This architecture allows GenePython to facilitate the rapid construction of a virtual cell capable of simulating processes within and similar to the central dogma. This means that it can be used to rapidly create and explore in silico gene expression, understand bio-molecules that can be derived, classify them, create metabolic networks and systems and even design a-life.

In our next blog, we will look at how class ‘Sequence’ works. We’ll get you to use this class and explore the different behavioral patterns possible with the class. Those of you who are new to Python need not fear… we’ll run you through the code in Englishspeak and guide you through installing Python on your computer.


Wednesday, June 11, 2008


MicroRNAs – Small RNAs with Large Potential and Several Complexities
Manoj Hariharan

In the genome, non-coding sequences are widely interspersed within and between genes and complicate gene annotation. Although their importance has been widely appreciated for the past few decades, the functional relevance of non-coding sequences is still not well understood. Moreover, a growing number of non-coding sequences are found to be transcribed into small RNAs with unknown functions. A novel class of such small molecules, viz. MicroRNAs (miRNAs), and other non-coding RNAs that are not translated to proteins have been found to operate at several levels of genomic architecture, regulating chromatin formation, RNA editing, RNA stability and the efficiency of mRNA translation. While identifying and annotating miRNAs is tricky, they possess a structural peculiarity, an imperfect hairpin structure, which makes the task easier. This feature comprises a palindromic nucleotide sequence interrupted by 5-10 non-self-complementary bases, giving rise to a hairpin with a stem 20-40 bases long. This type of molecule is an ideal candidate for a precursor miRNA. Several computational programs now exist that identify such sequences in the genome, and most miRNAs identified so far do exhibit these features.
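
To illustrate the hairpin feature described above, here is a deliberately simplified sketch in Python. It only recognises a perfect hairpin (a stem, a loop, then the exact reverse complement of the stem), whereas real precursor miRNAs form imperfect hairpins, so actual prediction programs tolerate mismatches and bulges and score folding energy. The function names and the toy sequence are made up for this example.

```python
PAIRS = {'A': 'U', 'U': 'A', 'G': 'C', 'C': 'G'}


def reverse_complement(rna):
    """Return the reverse complement of an RNA string (A, U, G, C only)."""
    return ''.join([PAIRS[base] for base in reversed(rna)])


def is_perfect_hairpin(rna, stem_len, loop_len):
    """True if rna is exactly: stem + loop + reverse complement of the stem."""
    if len(rna) != 2 * stem_len + loop_len:
        return False
    left_arm = rna[:stem_len]
    right_arm = rna[stem_len + loop_len:]
    return right_arm == reverse_complement(left_arm)


stem = 'GGCAUCGAUCGGAUCCAGUC'   # 20 bases (toy sequence)
loop = 'AAAAAAA'                # 7 bases that cannot pair with each other
candidate = stem + loop + reverse_complement(stem)
print(is_perfect_hairpin(candidate, stem_len=20, loop_len=7))   # True
```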

Here, I shall briefly describe the current understanding of miRNA-mediated regulation and the changing trends in this knowledge. Precursor miRNAs (pre-miRNAs), transcripts nearly 100 bases long, are processed to form mature miRNAs ~17-25 nucleotides long. These are encoded in the DNA either as clusters of primary miRNA (pri-miRNA), which could be as long as 1 kb, or as shorter stretches in the “intergenic” as well as “genic” regions. The earliest attempts at uncovering the regulatory potential of such small RNAs revealed finer differences in the mode of action of miRNA and small interfering RNA (siRNA). The siRNA bind to the entire target transcript, while the miRNA bind mostly to the 3’ UTR of the target transcripts through imperfect complementarity, creating stretches of continuous matches towards the 5’ end of the miRNA, termed seed matches, and bulges of 1-4 nucleotides, with a favorable Minimal Free Energy (MFE). This binding dramatically reduces the translational efficiency of the target transcript. Furthermore, a particular miRNA may interact with several transcripts (multiplicity), and a particular transcript can be targeted by more than one miRNA (co-operativity). Given the number of miRNAs expressed in each metazoan genome and in some viruses, this generates a large regulatory potential. Humans encode over 650 miRNAs. Several approaches to identifying the targets of these miRNAs, in silico, in vitro and in vivo, have helped us understand the regulatory potential of miRNAs during development, stress, cancer, apoptosis and host-pathogen interaction.

At the Institute of Genomics & Integrative Biology, CSIR, India, we have established a massive program to uncover the regulatory potential of miRNAs in various biological processes and in multiple experimental models backed by a strong in silico approach. We have developed computational tools to identify miRNAs, second generation target-prediction tools, miRNA expression profiling as well as miRNA-target databases that are available at

07 June, 2008.

The Flow of Genetic information

Sohan Modak

The beauty of Genetic Information Storage and Retrieval systems lies in how both process components are entrenched in a fail-safe mode. Genes are arranged linearly along the double stranded genome but not necessarily on the same strand. In principle, a single gene, i.e. piece of genetic information, is stored on one strand at a time. However, there may be arrays of genes that follow each other on the same strand or alternate between two strands, but rarely overlapping both strands.
Thus, given a stretch of double stranded DNA containing a gene, the strand containing the coding sequence is called the positive strand, while its complement is the negative strand. The principle of base pairing between the two complementary strands ensures that, given the sequence of nucleotides on one strand, one can correctly conceive, visualize and construct the sequence on the complementary strand. Since genes may lie on either strand, the same DNA strand may act as the positive or the negative strand at different locations.
What one means by the coding information is the sequence of nucleotides arranged as a linear array of triplets or codons that can be translated into a string of amino acids, or a polypeptide. As the information on the coding strand must be transferred verbatim to messenger RNA that will act as the transfer intermediate, one must copy the negative strand, and not the positive strand, to generate the exact copy of the positive strand coding sequence in the mRNA.
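
The two ideas above, that one strand determines the other and that copying the negative strand reproduces the positive strand's sequence in the mRNA, can be checked with a few lines of Python. This is only a sketch: the function names and the toy sequence are invented for the illustration, and all strands are written 5' to 3'.

```python
DNA_PAIRS = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
RNA_PAIRS = {'A': 'U', 'T': 'A', 'G': 'C', 'C': 'G'}   # DNA base -> RNA base


def complementary_strand(strand):
    """Return the complementary DNA strand, written 5' to 3'."""
    return ''.join([DNA_PAIRS[base] for base in reversed(strand)])


def transcribe(template):
    """Copy a template (negative) DNA strand into RNA, written 5' to 3'."""
    return ''.join([RNA_PAIRS[base] for base in reversed(template)])


positive = 'ATGGCCTAA'                       # coding sequence
negative = complementary_strand(positive)    # template strand
mrna = transcribe(negative)

print(negative)                              # TTAGGCCAT
print(mrna)                                  # AUGGCCUAA
print(mrna == positive.replace('T', 'U'))    # True
```

The mRNA reads exactly like the positive strand, with U in place of T, even though it was copied from the negative strand.
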
In the genetic language or nucleotide sequence, different types of signaling elements/motifs or words are required to act as punctuation marks to locate and access a given gene segment with the appropriate coding sequence and to specifically copy/transcribe it into RNA before converting it into a translatable form for protein synthesis.
Finally, inside a cell DNA is not a naked molecule but is protected as chromatin, a deoxyribonucleoprotein complex. While we do know of a number of signals such as transcriptor binding and initiation sites, ribosome binding site, translation initiation and termination sites, exon-intron junctions and quite a few transcription factor binding sites, there is a great deal of confusion about the strands on which the identifier signals of a desired coding sequence, and the process signals acting before and during transcription and translation, are positioned. On which strand are the signals or motifs located that allow localization, identification and specificity of a gene sentence?

Here we go…

1. On which DNA strand are located signals or motifs that ensure binding of RNA polymerase to the template/negative strand?
2. Which signals or processes generate binding space on DNA in the chromatin to ensure correct positioning of RNA polymerase?
3. What is the nature of signals or processes that allow translocation of RNA polymerase during the transcription of the negative strand of DNA?
4. On which DNA strand are located signals or motifs that allow processing/packaging/trimming of the nascent transcript into mRNA?
5. On which DNA strand are located signals or motifs that allow binding of the ribosome prior to the initiation of translation?
6. And there are many other questions related to the rules of the genetic grammar and specific words and clauses in the genetic language that still escape our knowledge.

Surely, out there, you would have much to add…
June 11, 2008


Sunday, June 1, 2008

Why is Gene annotation lagging way behind the rapid accumulation of Genome nucleotide sequences?

Sohan P. Modak
Since the technological tour de force of sequencing the Human Genome, over 4400 genomes have been sequenced (see Table below) and the list is increasing daily.
Completed Genome Sequences
Viruses 1895
Viroids 38
Plasmids 38
Archaea 65
Bacteria 858
Eukaryota 1508
Plants 112
Mammals 51
Birds 28
Reptiles 1
Amphibians 3
Fishes 22
Insects 52
Flatworms 8
Roundworms 28
Others 42

The Genome database provides sequences of a variety of genomes, full chromosomes, sequence maps with contigs, and integrated genetic and physical maps. The database, organized in six major taxa, includes complete chromosomes, organelles, plasmids & draft genome assemblies. With the advent of high throughput technologies, in a year or so, one would complete a sequence a day. DNA reannealing data and gene mapping tell us that only a small fraction of the genome contains information corresponding to amino acid sequences of polypeptides or molecular phenotypes or Phenes. Unfortunately, the methodology for finding specific Genes, their positions and sequences on the genome, and validating these, is not only circuitous and cumbersome but, because of the bottom-up approach, requires a combination of bioinformatics, validation through wet lab analyses and extensive statistical analyses. Gene finder (,

Gene Ontology Annotation @ EBI (
The EMBOSS (The European Molecular Biology Open Software Suite) platform accepts data in different formats, allows transparent retrieval of sequence data from the web, lets scientists develop and release software as open source, and integrates a range of currently available packages and tools for sequence analysis. Annotated sequences become GenBank entries. A tutorial, Gene bender, (, highlights the issues faced during gene annotation, stating, “There are no real paradigms or standards for annotation -- each person does it differently. It is very easy to miss or misinterpret genomic features. GenBank entries themselves are annotated very unevenly, depending on the knowledge and interest level of the sequencing lab (and no one is allowed to fix a bad annotation!). GenBank is not curated: entries are only provide suggestions for genomic features such as promoters, alternative splicing of mRNAs, retrotransposons, pseudogenes, tandem duplications, synteny, and homology”. Finally, the use of multiple sequence alignment protocols based on widely varying logic weakens the situation further.
The present bottom-up approach involves aligning the reverse-translated polypeptide sequence on the genome, using the genetic code along with known signaling elements, to identify bona fide Genes. It invariably ends in a fishing expedition, throwing up a large number of homologues, paralogues and partials. The difficulty here concerns our meager knowledge of the genetic grammar which, at present, is restricted to the triplet code, punctuation marks and motifs positioned within and immediately upstream and downstream of the coding sequence. But do we know the signals potentially dictating the functional linkages within and across gene batteries and epigenetically regulated temporal & positional gene networks?
The annotation of the Human Genome, a massive effort, was undertaken by a consortium consisting of [1] The Havana group (chr. 1, 6, 9, 10, 13, 20, 21, 22, X, Y) and Collins et al. (chr. 22) at the Wellcome Trust Sanger Institute, [2] Hillier et al. (chr. 7) at the Washington University Genome Center, [3] Genoscope (chr. 14) at CNRS, [4] The DOE Joint Genome Institute (chr. 16, 19), [5] The Broad Institute (chr. 8, 15, 17, 18), [6] Baylor College of Medicine (chr. 3, 12) and [7] Genome Analysis Group (chr. X) at the Institute of Mol. Biotechnology, covering 34055 genes. And yet, most gene entries are yet to be curated and assigned a function. As fate would have it, it was initially thought that the human genome would contain over 100,000 genes; then the number decreased first to 50,000, then to 23,000, and now it is on the increase again. All this is due to uncertainty about what one is looking for. I feel that in actual practice, if one were to include the Genes in the making, or Potential Genes, the number would probably go beyond 50,000. Well, well… time will tell, and that is not too far off, either!

Really, can we not apply a radically different logic to annotate Genes?