Tuesday, July 15, 2008

Class Sequence

Class ‘Sequence’ is the result of an effort to create a generic class of object that encapsulates most of the common properties of biological, information-holding sequences. Hmmm… now what could those be? Obviously, one needs to store the actual sequence. It might also prove useful to define the alphabet (e.g. A, T, G and C for DNA) that makes up the sequence, as well as the properties of its letters, such as the molecular weight of each nucleotide or amino acid. Another important property is the number of letters that are read together (i.e. the word size) to derive meaning from the sequence. As you’ve probably already guessed, the word size in an ORF would be 3 (i.e. the size of a codon). We’ve also included a few artificial but useful properties like name, locus and any interesting references that you might want to associate with the sequence.
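To make the list of properties concrete, here is a minimal sketch of how such an object might hold them. This is not the GenePython source: the attribute names (`alphabet`, `word_size`, `refs`) and the `words` helper are assumptions made up for illustration.

```python
class Sequence:
    """Hypothetical sketch of the properties described above (not GenePython code)."""

    def __init__(self, seq, alphabet="ATGC", word_size=1,
                 name="", locus="", refs=None):
        self.seq = seq                # the actual sequence, as a plain string
        self.alphabet = alphabet      # letters allowed in the sequence
        self.word_size = word_size    # letters read together (3 for a codon)
        self.name = name
        self.locus = locus
        self.refs = list(refs) if refs else []

    def words(self):
        """Split the sequence into consecutive, non-overlapping words."""
        n = self.word_size
        return [self.seq[i:i + n] for i in range(0, len(self.seq) - n + 1, n)]


# A toy ORF read three letters at a time.
orf = Sequence("ATGGCCTAA", word_size=3, name="toy ORF")
print(orf.words())  # ['ATG', 'GCC', 'TAA']
```

With a word size of 3, the toy ORF is read codon by codon, which is the point of storing the word size alongside the sequence.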

The class also contains methods to manipulate the sequence and other properties. Two of them mirror Python string operations: addition and array-like indexing. The binary ‘+’ operator behaves exactly as one would expect! Adding two Sequence objects creates a new Sequence object that contains the end-to-end ligation of the two sequences that the operator acts upon. The ‘[n]’ (index) operator returns the nth letter in the sequence. There is also a set of methods whose names speak for themselves! ‘getname’ and ‘getseq’ return the name and the sequence, while ‘setname’ and ‘setseq’ alter them. ‘locus’ and ‘ref’ are methods that alter the locus and the references respectively.
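The operators and accessors described above could be sketched in Python roughly as follows. Only the method names ‘getname’, ‘getseq’, ‘setname’ and ‘setseq’ come from this post; the constructor signature, the zero-based indexing and the choice to keep the left operand's name after ligation are assumptions.

```python
class Sequence:
    """Hypothetical sketch of the operators described above (not GenePython code)."""

    def __init__(self, seq, name=""):
        self._seq = seq
        self._name = name

    def __add__(self, other):
        # End-to-end ligation: a new object, the operands are left untouched.
        # Keeping the left operand's name is an assumption.
        return Sequence(self._seq + other._seq, name=self._name)

    def __getitem__(self, n):
        # Assumes Python-style, zero-based indexing.
        return self._seq[n]

    def getname(self):
        return self._name

    def setname(self, name):
        self._name = name

    def getseq(self):
        return self._seq

    def setseq(self, seq):
        self._seq = seq


left = Sequence("ATG", name="left")
right = Sequence("CCC", name="right")
joined = left + right
print(joined.getseq())  # ATGCCC
print(joined[3])        # C
```

Returning a new object from ‘+’ keeps the two input sequences unchanged, which matches the description of addition forming a new Sequence.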

Two important methods are ‘find’ and ‘fragment’. ‘find’ searches for a sub-sequence or ‘motif’ in the sequence and returns the position of the first instance (if it finds one). ‘fragment’ returns a fragment of the Sequence according to the start and end positions supplied by the programmer. These two methods are at the heart of the GenePython ideology. For example, the RNAPolymerase Virtual Enzyme uses the ‘find’ method to locate important signal sequences and then returns the ‘fragment’ that lies between the TATA box and the poly-adenylation signal.
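As a rough illustration of how ‘find’ and ‘fragment’ might work together, here is a sketch built on Python's own string methods. The details are assumptions, not GenePython behaviour: positions are zero-based, the end position is exclusive, a missing motif gives -1 (as `str.find` does), and the motifs `TATAAA` and `AATAAA` are simply common textbook forms of the TATA box and the poly-adenylation signal.

```python
class Sequence:
    """Hypothetical sketch of 'find' and 'fragment' (not GenePython code)."""

    def __init__(self, seq, name=""):
        self.seq = seq
        self.name = name

    def find(self, motif):
        # Position of the first instance of motif, or -1 if absent (assumption).
        return self.seq.find(motif)

    def fragment(self, start, end):
        # New Sequence covering positions start..end-1 (end exclusive, assumption).
        return Sequence(self.seq[start:end], name=self.name)


gene = Sequence("GGTATAAAGCATGCCCAATAAAGG", name="toy gene")

tata_box = "TATAAA"
polya_signal = "AATAAA"

start = gene.find(tata_box)
end = gene.find(polya_signal)

# Take what lies strictly between the two signals.
between = gene.fragment(start + len(tata_box), end)
print(start, end, between.seq)  # 2 16 GCATGCCC
```

A real virtual enzyme would also have to decide what to do when either signal is missing; this sketch does not handle that case.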

Next time, we’ll look at some actual coding examples that will illustrate how class Sequence works.

3 Comments:

Blogger Ramray Bhat said...

Hello, once again. I am following the posts made in this blog with a lot of interest. At this stage I had a question. Will the instructions as to how to construct a class or how to impart a specific set of properties to a "class" be shared?
Also, when you start with the class "sequence", the objects will be sequences of nucleotides, right? But the individual nucleotides themselves can be considered objects belonging to the class "nucleotide" with a specific set of properties. We can go further into the subclasses of "nucleotide" ("purine" and "pyrimidine") and define properties for classes and subclasses at every stage. But for working exclusively at the level of the class "sequence", do we necessarily need to define nucleotides as objects or "class"ify them?

July 28, 2008 6:56 PM  
Blogger Pranav Yajnik said...

Hey Ramray. Sorry for the delay. The answers to your questions are... yes and no!
The instructions to construct a class and impart specific properties to it WILL be shared, i.e. the code is going to be open and uploaded onto the blog soon. So you can download the code, read it and use it as you wish. However, there is an important caveat... although we are going to explain how to use the classes that we have made (in the posts that will follow), actually reading and understanding the base code that we have written depends on how much Python/any programming you know. Although I personally believe that the code is quite simple, it may not be too easy for someone who hasn't programmed before to understand it. Although good programming practice dictates voluminous documentation, our code has none. That is an issue that we should hopefully address soon.
Coming to your second question... no, nucleotides are not 'objects' in GenePython. So currently, the actual sequence itself is taken to be a simple string of characters (with some rudimentary checking about whether the correct characters are present in the input string... i.e. you can't have a "U" in a DNA object). We did not feel the need to 'class'ify the monomers that make up the various sequences (DNA, RNA, proteins). GenePython is currently being designed to work as a mere parser of biological sequences. So individual units are currently not important enough to have a class to themselves. Of course, we may change that as time progresses and if we have good reason to do so.
Ok... finally, something unrelated. It would be very nice if I could get an invite to view your blog. I am a student of biology who is interested in things like evolution and self organization. My apologies to you and everyone else for making this request through this channel... unfortunately I did not know any other way to contact you!

August 7, 2008 6:06 AM  
Blogger Ramray Bhat said...

Ok. This is very embarrassing but I have to say this. Pranav, the only reason my blog is private is because there isn't anything there! I would be the happiest if I could put my thoughts into the public arena and get them critiqued, but right now, I am in a bit of a maelstrom in terms of thoughts and work. This is the wrong forum to discuss evolution and self-organization, which form the core of my interests. I can offer you my email address [ramray_bhat@nymc.edu] for the present. I am also on Nature Networks where there are appropriate fora for our mutual interests. Very soon, I will open up my blog for public viewing. My apologies to all, in reciprocation, to Pranav's, for again responding via this channel.

August 9, 2008 5:13 PM  
