Sunday, June 1, 2008

Why is Gene annotation lagging way behind the rapid accumulation of Genome nucleotide sequences?

Sohan P. Modak
Since the technological tour de force of sequencing the Human Genome, over 4400 genomes have been sequenced (see Table below), and the list is increasing daily.
Completed Genome Sequences
Viruses 1895
Viroids 38
Plasmids 38
Archaea 65
Bacteria 858
Eukaryota 1508
Plants 112
Mammals 51
Birds 28
Reptiles 1
Amphibians 3
Fishes 22
Insects 52
Flatworms 8
Roundworms 28
Others 42

The Genome database provides sequences of a variety of genomes, full chromosomes, sequence maps with contigs, and integrated genetic and physical maps. The database, organized in six major taxa, includes complete chromosomes, organelles, plasmids & draft genome assemblies. With the advent of high-throughput technologies, within a year or so one genome sequence may well be completed every day. DNA reannealing data and gene mapping tell us that only a small fraction of the genome contains information corresponding to amino acid sequences of polypeptides or molecular phenotypes or Phenes. Unfortunately, the methodology for finding specific Genes, locating their positions and sequences on the genome, and validating these is circuitous and cumbersome. Because of the bottom-up approach, it also requires a combination of bioinformatics, validation through wet lab analyses, and extensive statistical analyses.

Among the available resources are Gene finder and Gene Ontology Annotation @ EBI.
The EMBOSS (The European Molecular Biology Open Software Suite) platform accepts data in different formats, allows transparent retrieval of sequence data from the web, lets scientists develop and release software as open source, and integrates a range of currently available packages and tools for sequence analysis. Annotated sequences become GenBank entries. A tutorial, Gene bender, highlights the issues faced during gene annotation, stating, “There are no real paradigms or standards for annotation -- each person does it differently. It is very easy to miss or misinterpret genomic features. GenBank entries themselves are annotated very unevenly, depending on the knowledge and interest level of the sequencing lab (and no one is allowed to fix a bad annotation!). GenBank is not curated: entries only provide suggestions for genomic features such as promoters, alternative splicing of mRNAs, retrotransposons, pseudogenes, tandem duplications, synteny, and homology”. Finally, the use of multiple sequence alignment protocols based on widely varying logic further weakens the scenario.
The present bottom-up approach involves aligning reverse-translated polypeptide sequences on the genome, using the genetic code along with known signaling elements to identify bona fide Genes. It invariably ends in a fishing expedition that throws up a large number of homologues, paralogues and partials. The difficulty here concerns our meager knowledge of the genetic grammar which, at present, is restricted to the triplet code, punctuation marks and motifs positioned within and immediately upstream and downstream of the coding sequence. But do we know the signals potentially dictating the functional linkages within and across gene batteries and epigenetically regulated temporal & positional gene networks?
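To make the bottom-up idea concrete, here is a minimal sketch in Python, not the method used by any particular annotation pipeline. Rather than reverse-translating the polypeptide, it does the equivalent in the other direction: it translates the genome in all six reading frames with the standard genetic code and looks for exact occurrences of the polypeptide. The function names (`reverse_complement`, `translate`, `find_peptide`) and the toy sequences are my own illustrative assumptions. The sketch ignores introns, mismatches and regulatory signals, which is precisely where real annotation gets difficult.

```python
# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate DNA codon by codon; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def find_peptide(genome, peptide):
    """Find exact matches of `peptide` in the six-frame translation of `genome`.

    Returns a list of (strand, start) tuples, where `start` is the 0-based
    position, on the forward strand, of the leftmost base of the matching
    region. Hypothetical helper for illustration only: no introns, no gaps.
    """
    if not peptide:
        return []
    genome = genome.upper()
    n = len(genome)
    span = 3 * len(peptide)  # nucleotides covered by the peptide
    hits = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(seq[frame:])
            idx = protein.find(peptide)
            while idx != -1:
                start = frame + 3 * idx
                if strand == "-":
                    # Convert reverse-strand offset to forward coordinates.
                    start = n - (start + span)
                hits.append((strand, start))
                idx = protein.find(peptide, idx + 1)
    return hits


# Toy example: the peptide MKV is encoded by ATGAAAGTT, starting at position 2.
toy_genome = "CC" + "ATGAAAGTT" + "TAAGG"
print(find_peptide(toy_genome, "MKV"))
```

Even on this toy scale, a short peptide will match by chance in a large genome, and a single amino acid can be encoded by up to six codons. That is one reason the real searches return so many homologues, paralogues and partials.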
The annotation of the Human Genome, a massive effort, was undertaken by a consortium consisting of [1] The Havana group (chr. 1, 6, 9, 10, 13, 20, 21, 22, X, Y) and Collins et al. (chr. 22) at the Wellcome Trust Sanger Institute, [2] Hillier et al. (chr. 7) at the Washington University Genome Center, [3] Genoscope (chr. 14) at CNRS, [4] The DOE Joint Genome Institute (chr. 16, 19), [5] The Broad Institute (chr. 8, 15, 17, 18), [6] Baylor College of Medicine (chr. 3, 12) and [7] Genome Analysis Group (chr. X) at the Institute of Molecular Biotechnology, covering 34,055 genes. And yet, most gene entries remain to be curated and assigned a function. As fate would have it, it was initially thought that the human genome would contain over 100,000 genes; the number then decreased first to 50,000, then to 23,000, and now it is on the increase again. All this is due to uncertainty about what one is looking for. I feel that in actual practice, if one were to include the Genes in the making, or Potential Genes, the number would probably go beyond 50,000. Well, well…time will tell, and that is not too far off, either!

Really, can we not apply a radically different logic to annotate Genes?

