CodeCereal: dna

Showing posts with label dna. Show all posts

Jun 5, 2011

CpG Islands (3) - Model Evaluation

Following the post we've built a Hidden Markov Model (HMM) for the CpG islands problem, using the training set. Now we should evaluate if our model is adequate to predict things about CpG islands. For evaluate we may use a tagged sequence and see if the HMM we built can predict the islands and what is it's accuracy.

Using the viterbi path and a tagged sequence (out ouf the training set), enable us to compare if the estimative given by the model is coherent with the real data. Using the HMM and the training and test sets of the last post, we can compare the results using a confusion matrix. For example, at the following snippet of the viterbi path paired with the tagged sequence:



[ ... 0,0,0,0,0,1,1,1,1,1, ... ]

[ ... a t t g t c C G A C  ... ]

We may calculate the following confusion matrix:



  +----------- P R E D I C T E D --------+

A |                      |     Island     | Non Island |

C +------------+------------+------------+

T |     Island     |           4         |      0         |

U +------------+------------+------------+

A | Non Island |           1     |           5         |

L +------------+------------+------------+

Note that the bigger are the main diagonal values, the more accurate our model is. Making this matrix with the experiment made (using HMM with maximum likelihood) I got the following matrix:



  +----------- P R E D I C T E D --------+

A |                      |     Island     | Non Island |

C +------------+------------+------------+

T |     Island     |   328815       |    61411       |

U +------------+------------+------------+

A | Non Island |   1381515  |   1889819  |

L +------------+------------+------------+

We may see that a good amount of islands was correctly predicted (84,26% of the islands), but 57,76% of the non islands regions was classified as islands, which is a bad indicator. A experiment with the HMM after running the Baum Welch was also evaluated. The results was better from the obtained by the maximum likelihood when predicting islands (96,56% of accuracy). But also obtained a higher miss rate (70,70%). Here is the confusion matrix for this HMM:

   +----------- P R E D I C T E D --------+

A |                      |     Island     | Non Island |

C +------------+------------+------------+

T |     Island     |   376839       |    13387       |

U +------------+------------+------------+

A | Non Island |   2313138  |   958196       |

L +------------+------------+------------+

Matching of Real Sequence, Viterbi and Posterior results

The result obtained was not very accurate. It may happened because we had few data to train and to evaluate our model. We could also build a HMM with other indicators that identify the CpG islands for model it better. Remember we simplified the CpG problem to the frequencies of occurring Cs and Gs, but a better model could also use the CG pairs. Using a more complicate HMM we could have more than 2 states and then associate a set of states to the CpG and non CpG islands sites. This would allow to use better the posterior result to classify the sequence. But I keep this as future work.

Jun 2, 2011

CpG Islands (2) - Building a Hidden Markov Model

By the definition of the CpG islands in the previous post and the Hidden Markov Models (HMMs) short introduction, we now can model a HMM for finding CpG islands. We can create a model very similar to the "Fair Bet Casino" problem.

When we are in a nucleotide of given DNA sequence there are two possibilities, that nucleotide belongs to CpG island (lets denote state S1) or do not (S0). If you analyse a sibling nucleotide it can stay in the same state or to change with complementary probabilities. We can view it as this Markov chain:

CpG states Markov chain

As the emissions of the states S0 and S1, we may have the associated probabilities of the emissions of ACGT symbols. However we can't know for now how much are these probabilities. However if we have a dataset of sequences with the attached information of where the CpG islands occurs, we can estimate those parameters. I used these DNA sequences tagged by biologists (used biopython library to parse the .fasta files) as the training and test sets to build the model and evaluate the technique. This dataset consists of nucleotides in upper and lower cases. When we have a given nucleotide in upper case, it denotes that it is a CpG site and the lower case nucleotides means they are not. I used a statistical technique called maximum likelihood to build the HMM.

Using the maximum likelihood to calculate the HMM emissions probabilities, I count each char frequency in the dataset and divide by the total amount of nucleotides in the same state:

Emissions probability diagram

P(A|S0) = a / (a+c+g+t)

P(C|S0) = c / (a+c+g+t)

P(G|S0) = g / (a+c+g+t)

P(T|S0) = t / (a+c+g+t)

P(A|S1) = A / (A+C+G+T)

P(C|S1) = C / (A+C+G+T)

P(G|S1) = G / (A+C+G+T)

P(T|S1) = T / (A+C+G+T)

And to calculate the transition probabilities of each state, count the number of transitions between a CpG to a non CpG site (out_of) and the reversal transitions (into). Then divide each of them by the state frequency which they're coming from. Note that the permanence probability of each state is the complement of the probability of transiting between states since there is no other state to go:



P(S0|S1) = out_of / (A+C+G+T)

P(S1|S0) = into / (a+c+g+t)

P(S0|S0) = 1 - P(S1|S0)

P(S1|S1) = 1 - P(S0|S1)

Our model is almost done, it just lacks of the initial transitions which can be supposed to be the frequency of each state over all nucleotides:



P(S0) = a+c+g+t/(a+c+g+t+A+C+G+T)

P(S1) = A+C+G+T/(a+c+g+t+A+C+G+T)

I used the ghmm to create my HMM with the given dataset and it generated this model:



DiscreteEmissionHMM(N=2, M=4)

  state 0 (initial=0.97)

    Emissions: 0.30, 0.21, 0.20, 0.29

    Transitions: ->0 (0.00), ->1 (1.00)

  state 1 (initial=0.03)

    Emissions: 0.22, 0.29, 0.29, 0.20

    Transitions: ->0 (1.00), ->1 (0.00)

It's important to notice that when we print the HMM it says that has the probability of 1 to stay in the same state. This occurs due to float formatted print. The reality is that in that value is very close to 1 but still has a chance to hop between states. It's also interesting to see the probability of emission of C and G when we contrast the states. In S1 the probability of getting a C or a G is significantly higher than in S0.

Another manner of creating the HMM, is by initializing a model with the same structure we modelled the problem and then apply some unsupervised learning algorithms, such as the Baum Welch algorithm. The Baum Welch uses randomized initial weights in the HMM and once you feed untagged sequences, it tries to infer a model. However, usually is far better initialize your HMM with biased data about the problem (for example using the model we created) and let the Baum Welch calibrate the probabilities. With ghmm do the following:



sigma = ghmm.IntegerRange(0,4) # our emission alphabet

emission_seq = ghmm.EmissionSequence(sigma, [0, 4, 5, ... ]) # create a emission sequence

hmm.baumWelch(sigma, emission_seq) # use a template hmm to calibrate the probablities

In the next post I will show how is the evaluation process of the HMM in the CpG island.

May 29, 2011

CpG Islands (1) - Problem Motivation & Definitions

This semester I'm attending the course Processing of Sequences and Methods and Algorithm in Computational Biology (basically DNA and proteins). One of the focus of it is the use of the Hidden Markov Models to solve many of it's problems. One well studied problems is how to find codifying sequences (snippets that can be translated into proteins) in a given sequence of DNA, which has both codifying and non codifying regions. One of the most used techniques is called CpG islands.

In a random DNA sequence we have a lower frequency of two ((G)uanine and (C)ytosine) than the other two nucleotide types ((A)denine and (T)hymine) due to a phenomenon called DNA methylation. Due to this methylation, it's more probable to occur a deletion of a C and G nucleotides than with A or T


DNA double helix structure

In the species evolution, occurs that genes which are essential for it's survival cannot mutate much, otherwise the organism can't live and reproduce to transmit it's gene to their descendants. Because of this, is very unlikely that even with lots of generations, those genes doesn't suffer much alteration. It's important to notice that non codifying regions can mutate without affecting the organism because it doesn't affect a vital part of the genetic code.

With these two facts in hand we can see that mutation of the DNA can occur more frequently on non codifying regions. As the methylation of C is a very common mutation cause, we can observe that non codifying regions will have less C and G due to the methylation which it doesn't interfere (much) in the organism. The opposite occurs with codifying regions because it has less frequent mutations.


CpG island around an important human DNA regulation site

With this information in our hands we can use it to discover regions which are likely to be a codifying region. The sites which contains greater concentrations of C-phosphate-G (called CpG islands) are a strong indicator that it hasn't suffered much mutations through the generations. So it's very probable to be an area of interest for biologists. I'll describe in the next posts how can we identify these CpG using HMM and the ghmm python library.

CodeCereal