CodeCereal: 05/2011

May 30, 2011

Hidden Markov Models

Nowadays, many applications use Hidden Markov Models (HMMs) to solve crucial issues such as bioinformatics, speech recognition, musical analysis, digital signal processing, data mining, financial applications, time series analysis and many others. HMMs are probabilistic models which are very useful to model sequence behaviours or discrete time series events. Formally it models Markov processes with hidden states, like an extension for Markov Chains. For computer scientists, is a state machine with probabilistic transitions where each state can emit a value with a given probability.

For better understanding HMMs, I will illustrate how it works with "The Fair Bet Casino" problem. Imagine you are in a casino where you can bet on coins tosses, tossed by a dealer. A coin toss can have two outcomes: head (H) or tail (T). Now suppose that the coin dealer has two coins, a fair (F) which outputs both H and T with 1/2 probabilities and a biased coin (B) which outputs H with probability 3/4 and T with 1/4. Using probability language we say:

P(H_i+1|F_i) = 1/2
P(T_i+1|F_i) = 1/2
P(H_i+1|B_i) = 3/4
P(T_i+1|B_i) = 1/4

Now imagine that the dealer changes the coin in a way you can't see, but you know that he does it with a 1/10 probability. So thinking the coin tosses as a sequence of events we can say:

P(F_i+1|F_i) = 9/10
P(B_i+1|F_i) = 1/10
P(B_i+1|B_i) = 9/10
P(F_i+1|B_i) = 1/10

We can model it using a graph to illustrate the process:

HMM for "The Fair Bet Casino" problem

That's a HMM! It isn't any rocket science. Is just important to add a few remarks. We call the set of all possible emissions of the Markov process as the alphabet Σ ({H, T} in our problem). For many of computational method involving HMMs you will also need a initial state distribution π. For our problem we may assume that the we have equal probability for each coin.

Now comes in our mind what we can do with the model in our hands. There are lot's of stuff to do with it, such as: given a sequence of results, when the dealer used the biased coin or even generate a random sequence with a coherent behaviour when compared to the model.

There is a nice library called ghmm (available for C and Python) which handles HMMs and already gives us the most famous and important HMM algorithms. Unfortunately the python wrapper is not pythonic. Let's model our problem in python to have some fun:





import ghmm



# setting 0 for Heads and 1 for Tails as our Alphabet 

sigma = ghmm.IntegerRange(0, 2)



# transition matrix: rows and columns means origin and destiny states

transitions_probabilities = [

    [0.9, 0.1], # 0: fair state

    [0.1, 0.9], # 1: biased state

]



# emission matrix: rows and columns means states and symbols respectively

emissions_probabilities = [

    [0.5, 0.5], # 0: fair state emissions probabilities

    [0.75, 0.25], # 1: biased state emissions probabilities

]



# probability of initial states

pi = [0.5, 0.5] # equal probabilities for 0 and 1



hmm = ghmm.HMMFromMatrices(

sigma,

    # you can model HMMs with others emission probability distributions

    ghmm.DiscreteDistribution(sigma),

    transitions_probabilities,

    emissions_probabilities,

    pi

)

>>> print hmm
DiscreteEmissionHMM(N=2, M=2)



  state 0 (initial=0.50)

    Emissions: 0.50, 0.50

    Transitions: ->0 (0.90), ->1 (0.10)



  state 1 (initial=0.50)

    Emissions: 0.75, 0.25

    Transitions: ->0 (0.10), ->1 (0.90)

Now that we have our HMM object on the hand we can play with it. Suppose you have the given sequence of coin tosses and you would like to distinguish which coin was being used at a given state:



tosses = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0,  1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1]

The viterbi algorithm can be used to trace the most probable states at each coin toss according to the HMM distribution:



# not as pythonic is could be :-/

sequence = ghmm.EmissionSequence(sigma, tosses)



viterbi_path, _ = hmm.viterbi(sequence)



>>> print viterbi_path

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Nice! But sometimes is interesting to have the probability of each state on the point instead of only the most probable one. To have that, you must use the posterior or forward algorithms to have more detailed information.



states_probabilities = hmm.posterior(sequence)



>>> print

states_probabilities
[[0.8407944139086141, 0.1592055860913865], [0.860787703168127, 0.13921229683187356], ... ]

The posterior method result, returns the list of probabilities at each state, for example, in the first index we have [0.8407944139086141, 0.1592055860913865]. That means that we have ~0.84 probability of chance that the dealer is using the fair coin and ~0.16 for the biased coin. We also can plot a graph to show the behaviour of the curve of the probability of the dealer being using the fair coin (I used matplotlib for the graphs).

Probability of being a fair coin over time

This is only a superficial example of what can HMMs do. It's worthy give a look at it if you want do some sequence or time series analysis in any domain. I hope this post presented and cleared what are HMM and how they can be used to analyse data.

May 29, 2011

CpG Islands (1) - Problem Motivation & Definitions

This semester I'm attending the course Processing of Sequences and Methods and Algorithm in Computational Biology (basically DNA and proteins). One of the focus of it is the use of the Hidden Markov Models to solve many of it's problems. One well studied problems is how to find codifying sequences (snippets that can be translated into proteins) in a given sequence of DNA, which has both codifying and non codifying regions. One of the most used techniques is called CpG islands.

In a random DNA sequence we have a lower frequency of two ((G)uanine and (C)ytosine) than the other two nucleotide types ((A)denine and (T)hymine) due to a phenomenon called DNA methylation. Due to this methylation, it's more probable to occur a deletion of a C and G nucleotides than with A or T


DNA double helix structure

In the species evolution, occurs that genes which are essential for it's survival cannot mutate much, otherwise the organism can't live and reproduce to transmit it's gene to their descendants. Because of this, is very unlikely that even with lots of generations, those genes doesn't suffer much alteration. It's important to notice that non codifying regions can mutate without affecting the organism because it doesn't affect a vital part of the genetic code.

With these two facts in hand we can see that mutation of the DNA can occur more frequently on non codifying regions. As the methylation of C is a very common mutation cause, we can observe that non codifying regions will have less C and G due to the methylation which it doesn't interfere (much) in the organism. The opposite occurs with codifying regions because it has less frequent mutations.


CpG island around an important human DNA regulation site

With this information in our hands we can use it to discover regions which are likely to be a codifying region. The sites which contains greater concentrations of C-phosphate-G (called CpG islands) are a strong indicator that it hasn't suffered much mutations through the generations. So it's very probable to be an area of interest for biologists. I'll describe in the next posts how can we identify these CpG using HMM and the ghmm python library.

May 27, 2011

QtQuick - WSL II

QtQuick uses QML, a powerful tool for creating fluid interfaces in a clean, easy and fast way. It uses a declarative language (Javascript based) and it has a really good performance, which makes it suitable to make apps for any kind of platform. It's also really easy to extend QML with Qt/C++, enbling developers to speedup the application logic or something else needed. Is also possible to use OpenGL to render QtQuick without any change in the code, sending the painting job to the video card. IMHO, it's greatest advantage, is because suitable for designer who already know HTML and CSS to build awesome interfaces, being easier than Flash and ActionScript. This is really welcome because the QtQuick prototypes are more straightforward, because can be directly integrated in the final application code.

These are some videos showing what can be done with QtQuick:

I've been developing with QtQuick (and some extensions such as the MeeGo and the Plasma Components) for quite a long time (for almost a year from now). So I'm sharing my QtQuick presentation on the II Workshop of Free Software at CIn-UFPE at 28/03/2011. The workshop was organized by the University Linux User Group (CIn-LUG) with the intent to show new free technologies and show how to start using them.

QtQuick - WSL II

View more presentations from Daker Fernandes.

The exercises resources and examples can be found here.

Happy hacking!

May 19, 2011

Plasma Components

Hi,

I'm here to talk about my GSoC project (\m/), the plasma components. As you may know, QML is a declarative language to build rich interfaces introduced in Qt 4.7 by providing simple primitives. As it is a powerful way to develop interfaces and it's the future of UI development for Qt was necessary to make the plasma support it.

Create interfaces in QML is really easy and fast but sometimes we need common widgets and may be boring to reimplement and replicate them in every application we create (e.g. Button, Slider, ScrollBar). For avoid this code replication, the Qt Components project was created to unify an API for a set of components.

The current (often updated) defined components and it's API can be found at QTCOMPONENTS-200. The Qt Components intends to be a cross platform, but sometimes we need to have a closer integration of the component and the platform. In plasma desktop we want to show tooltips or use the theme svg images for example. The common API defines a set of properties a component must have, but doesn't disallow to have extra properties and functionalities (suggestions ?). Creating the plasma components with the theme integration and adding plasma behaviour is what my GSoC project is about :-).

I already started working on the plasma components. They can be found in the kde-runtime repository at the plasma/declarative branch. And more speciffically in the plasma/declarativeimports/plasmacomponents directory. The components which already there are not yet done, they lack of tooltips, keyboard events handling, focus policy, and some (CheckBox and RadioButton) hasn't images yet. However they're fully functional and they cover all the properties and behaviours defined in the common API. Here is the list of the components in this state mentioned above (until this post publication):

BusyIndicator
Button
CheckBox
RadioButton
Slider

These are components not present in the common API but they're highly wanted in plasma:

ScrollBar
ListItem
ListItemView
ListHighlight

To test these components, its also kept a components gallery (plasma/declarativeimports/test/Gallery.qml) that you may view it with qmlviewer (after installing the components).

Gallery screenshot

Feedback is highly appreciated. ;-)

Cheers!

CodeCereal