Oct 6, 2011

Manifesto CodeCereal



I've been receiving criticism from several people who stopped reading my posts because they are in English, besides covering "complicated subjects". The fact is that I've been failing to publish a lot of material I produce in Portuguese, and my time is too limited to translate everything. I also like writing in English as a form of exercise, and it gives the posts a wider reach. So I've decided to make this blog hybrid: when possible I will post in English, but I will also take advantage of my material written in Portuguese, which would otherwise stay archived. I will keep sharing everything related to artificial intelligence at AIMotion, and I'll start publishing the Qt-related posts here and at Qt Labs Brasil.

I hope it works out.

Jul 18, 2011

Back to Plasma Components

This last month I've been working slowly on my GSoC project, due to university activities (thanks to the Brazilian academic calendar) and my thesis, which I shall talk about in another post. But at least I've now finished my undergraduate degree!

As I explained in my last post about Plasma Components, my GSoC project, I'm building graphical components for developers to build plasmoids in QML, covering non-trivial common components such as Sliders, ScrollBars and RadioButtons. Since the break, I've implemented these components following the Qt Components common API:
  • Switch
  • ButtonRow
  • ButtonColumn
  • ScrollBar
  • ScrollDecorator
Keyboard event handling and focus policy were added for both the new and the old components in this sprint. I also spent a lot of time refactoring some components; I think their code is much better now.

I'm also adding more complex use cases to the components gallery (at kde-runtime/plasma/declarativeimports/test/gallery). By the way, here is a screenshot of the new gallery:

Plasma Components Gallery 

Part of the work was straightforward, but since some of the components' behaviours are not defined, I have some questions and would like to hear what you think is best:
  1. Should ScrollDecorator appear only when its flickableItem is flicked, like in the mobile use case?
  2. Should ScrollDecorator look like ScrollBar, or have its own appearance?
  3. There are no SVG graphics for CheckBox, RadioButton and Switch; currently there are placeholders. What can I do?
  4. Currently, when you click a component, it gains focus. Should this logic stay inside the components as it is, or should it be left external to them?
  5. Qt Components doesn't define an enabled property for any component. I think it's important to have such a property in all interactive components. What do you think?
Any other tip/suggestion is highly appreciated.

I expect to give another update as soon as I have something more to report.

Jul 15, 2011

PySide on CITi

Here are the PySide slides from the Python course at CITi. Have fun!

Jul 7, 2011

Protein Profile with HMM

Profiling is the act of summarizing a data set in a smaller mathematical model. One practical use of profiling techniques is sequence classification: with the profile of a data set, you can calculate the distance of an instance to the model and classify the instance through this value.

Profiling proteins is a specific case of profiling sequences. As we know from the previous post about Hidden Markov Models (HMMs), the HMM is a very robust mathematical model to represent sequences probabilistically. This post will detail how to profile sets of proteins with an HMM.

The HMM for this problem must be one which, when we calculate the probability of a protein being generated by the model, gives higher probabilities to proteins similar to the ones used to create the profile than to all others.

To build the model, I started with a multiple sequence alignment to see the structure of the sequences; our HMM will be a probabilistic representation of a multiple alignment. Looking at the alignment, we see that some columns are complete, some are almost complete, and others have little data. The most populated columns can be used as the match columns of our model, while deletions and insertions can be modelled as other states. Here is an example of a few aligned sequences with the core columns of the alignment marked with an *:

A C - - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C
* * *       * * *

This sample alignment shows examples of insertions and deletions between the core columns, and each may have a state in the HMM to represent it. Note that insertions may occur an arbitrary number of times between the matching states, while a deletion always replaces some match. One possible HMM template for building the model is presented in the following picture:

HMM protein profile model
Each matching state (Mj) is related to the match at the jth core column; the same applies to the deletion states (Dj). Insertion is slightly different, because it represents the insertions between the core columns; that's why there is one extra insertion state, which makes it possible to represent insertions before the first and after the last core column.

Our model will be like the one in the picture, with the same length as the core alignment of the multiple alignment for the given set of sequences. We then use maximum likelihood to estimate the transition and emission probabilities of each state. The easiest way to count the total emissions/transitions is to thread each sequence to be profiled through the model. For each symbol in the aligned sequence, check whether the symbol is in a core column: if it is, increment the transition count into that match state; otherwise, go to the insertion state. If it is a deletion, go to the deletion state and increment that transition. Finally, to turn counts into probabilities, divide the count of each transition by the total count of transitions leaving the same state. It's important to notice that we have a stopping state, a special state that has no outgoing transitions.
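
Here is a rough sketch of this counting step, assuming alignment is a list of aligned sequences (with '-' for gaps) and core holds the indices of the core columns; insertion-state counting is omitted for brevity, and all names are illustrative:

from collections import defaultdict

def count_transitions(alignment, core):
    """Thread each aligned sequence through the match/deletion states."""
    transition_counts = defaultdict(int)
    for sequence in alignment:
        previous = 'begin'
        for j, column in enumerate(core):
            # A symbol in a core column goes to the match state,
            # a gap in a core column goes to the deletion state.
            state = ('M%d' if sequence[column] != '-' else 'D%d') % j
            transition_counts[(previous, state)] += 1
            previous = state
        transition_counts[(previous, 'end')] += 1  # the silent stop state
    return transition_counts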

Note that it is important to initialize the model with pseudo-counts on each possible transition. By adding these pseudo-counts we make our model less rigid and avoid overfitting to the training set; otherwise we could end up with zero probabilities for some sequences.

For the emission probabilities, at each match state you also count which symbols were emitted there. To get each probability, divide the count by the total number of symbols matched at that state during the threading process.

A similar process could be done for the insertions. However, the insertion states are characterized by a low occurrence rate, which may lead directly to an overfitting problem, because the number of observed emissions would be too small. To avoid this, we should use the background distribution as the emission probabilities of each insertion state. The background distribution is the probability of occurrence of a given amino acid over the entire protein set: count each amino acid type over all the training sequences and divide by the total count.
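
A small sketch of this computation, assuming sequences holds the training proteins as plain strings of amino acid letters:

from collections import Counter

def background_distribution(sequences):
    """Frequency of each amino acid over the whole training set."""
    counts = Counter()
    for sequence in sequences:
        counts.update(sequence)
    total = float(sum(counts.values()))
    return dict((aa, n / total) for aa, n in counts.items())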

The deletion states are silent states: they don't emit any symbol at all. To signal this in ghmm, just set all their emission probabilities to 0. Note that the end state is also a silent state, since it has no emission associated to it.

ALERT: the loglikelihood is the only function in the ghmm library which handles silent states correctly.

At this point, the model is ready for use. However, we still need to decide how to classify a new protein sequence. A threshold could be used to divide the two classes, but the most common alternative is to compare against a null model. The null model aims to represent any protein with similar probability as any other. With these two models, we can take a sequence and check whether it is more similar to a generic protein or to the specific family. The null model should model sequences with average size equal to the aligned sequences being handled, and should be able to emit any symbol at each position. A common way to create it is to use a single insertion state which goes to a stopping state with probability 1 divided by the average length of the sequences in the training set. For the emission probabilities we use the background distribution, since it reflects the general amino acid distribution. At the end, the model should look like this:

Null Model
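
The classification rule then becomes a simple log-odds test. A minimal sketch, assuming profile_model and null_model are ghmm models built as described above and seq is a ghmm.EmissionSequence for an unknown protein:

def belongs_to_family(profile_model, null_model, seq):
    # A positive log-odds score means the profile explains the
    # sequence better than the generic null model does.
    log_odds = profile_model.loglikelihood(seq) - null_model.loglikelihood(seq)
    return log_odds > 0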

To test the proposed model, I used a set of 100 globin proteins from the NCBI protein repository as a training set to build a profile model, and used ghmm to build and test it.

To check whether the model meets our expectations, 100 globins different from the ones in the training set were used, together with 1800 other random proteins of similar length. The loglikelihood function from the ghmm library was used to calculate the similarity index. The classification of globins versus non-globins was a comparison between the likelihood of the protein under the globin profile HMM and under the null model. This test gave us 100% accuracy! To display this result, each sequence was plotted in a graph where the horizontal axis shows the length of the sequence and the vertical axis shows the log of the ratio between the globin and null model likelihoods (i.e. globin minus null model loglikelihood). The globins are plotted in red and the others in blue; proteins above the zero line are classified as globins, and those below are not. The graph shows a clear distinction between the classes, and that our model is very precise for this problem.
Length x log(Globin's Profile Model Likelihood/Null Model Likelihood)

Jun 22, 2011

Encrypted Tweets

As a cryptography course project, @laisandrade and I developed a Twitter extension which lets a user tweet in a way that only a chosen group of followers can decrypt and understand. The goal is to securely broadcast tweets to a group of followers and prevent anyone else from understanding their content. This definition of security covers an eavesdropping attack, which common sense often takes as secure enough for cryptography, although it is not. Other concerns emerge when we consider other attacks or ways of extracting information from the ciphertext, for example the known-plaintext attack.
Configuration of the Encrypted Tweets problem - Alice securely broadcasting messages to a specific group
In this project we also had to consider other security issues, such as forgery attacks and secure information storage. The only real-world problem outside the project's scope was private key distribution. We assume that a user who wants to share his keys with his followers could use other mechanisms, such as a Diffie-Hellman key exchange or a trusted key broker. The full project specification can be found here.

For this task we used the Greasemonkey Firefox extension, which allows us to run custom JavaScript on specific web pages, so we can add functionality to known pages. In fact, we started from a script stub. This script adds an encrypt-tweet feature, as well as a key manager on the password settings page, and it provides a function to store (unsafely) the user information; it was our job to make that storage secure. We assumed the script itself couldn't be cracked: our concern was the safety of the information outside the script's execution memory.

Encrypt button and Group selection close to Tweet button

============================== Encryption Specification ==============================

Encrypted Message:
+--------+-----+--------------------------------------+
| "aes:" | Ctr | Extended-AES(Content, Ctr, GroupKey) |
+--------+-----+--------------------------------------+
* The Ctr is the base counter for the CTR block chaining cypher. Every message picks a random Ctr.
* The group key is the key of the group whose members are the only ones who should understand the message.
* The Extended-AES function is detailed below.

* The "aes:" indicator is not a security threat, due to Kerckhoffs's Principle.

Content:
+---------------+---------+
| sha1(Message) | Message |
+---------------+---------+
* The sha1 cryptographic hash is used to check the integrity of the message and prevent a modified message from passing as a real one.
* The sha1 produces a 160-bit digest, which is converted into a base64 string to be sent over Twitter.

* This Message Authentication Code (MAC) is necessary for Chosen Ciphertext Security (CCA Security).

Message: the original message
======================================================================================


For our Extended-AES implementation, we extended the AES block cypher with Counter (CTR) mode chaining. Note that CTR mode uses a base counter (Ctr) as an Initialization Vector (IV), which must also be transmitted to enable decryption. This IV is chosen at random by our pseudorandom generator: if the IV is not truly random, an attacker may predict the IV sequence and extract more information from the messages. A real problem faced by pseudorandom generators is a seed selection which is not truly random. To handle this, the script stub creates a seed using the mouse positions during the first seconds of running (an entropy test is also used to check the randomness of the seed).
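
As an illustration of the scheme (a Python sketch using PyCrypto rather than the project's actual JavaScript; the names here are ours, not from the script):

import base64
import hashlib
import os

from Crypto.Cipher import AES
from Crypto.Util import Counter
from Crypto.Util.number import bytes_to_long

def encrypt_tweet(message, group_key):
    """MAC-then-encrypt: prepend sha1(message), encrypt with AES in CTR mode."""
    # group_key must be 16, 24 or 32 bytes long for AES
    content = hashlib.sha1(message).digest() + message
    ctr_base = os.urandom(16)  # random base counter, sent along as the IV
    counter = Counter.new(128, initial_value=bytes_to_long(ctr_base))
    cipher = AES.new(group_key, AES.MODE_CTR, counter=counter)
    return "aes:" + base64.b64encode(ctr_base + cipher.encrypt(content))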

When someone receives an "aes:"-tagged message, the script checks the author of the message and gets the key provided by him. To decrypt, we also use Extended-AES, since CTR mode is symmetric (due to the XOR reversibility property). At last, the SHA-1 hash is checked against the message to verify that the key is correct and that the message didn't suffer a forgery attempt.

To store the group keys we used Greasemonkey's key-value storage (the GM_setValue and GM_getValue functions). We store and recover the values under the key "twit-key-" + login, which is stored as plaintext on the hard drive. To prevent this information from being stolen by other users, viruses or worms, we added an encryption layer over it. The stored data representation we developed is this one:

Keys management screen

==============================   Safe Storaging Specification   ==============================

Stored Data:
AES(Data, MasterKey)

MasterKey:
The user's secret key to access all the other keys. This key requires user input every time he logs in.

Data:
+---------------------+-----------------+
|   sha1(GroupKeys)   |   Groups Keys   |
+---------------------+-----------------+
* This cryptographic hash is used to check the integrity of the stored keys.
* Without it, a malicious agent could override the stored values and set keys he can decrypt.

Groups Keys:
+----------+----------+----------+----------+---------+----------+----------+
|   Item   |   "$$"   |   Item   |   "$$"   |   ...   |   "$$"   |   Item   |
+----------+----------+----------+----------+---------+----------+----------+

Item:
+----------+---------+-----------+---------+---------+
|   User   |   "$"   |   Group   |   "$"   |   Key   |
+----------+---------+-----------+---------+---------+

User: base64 string
Group: base64 string
Key: base64 string

==============================================================================================


It's important to notice that we implemented the SHA-1 hashing algorithm to ensure the data wasn't modified while stored or transmitted. We trust it wasn't modified due to the difficulty of cracking the AES cypher to forge a new MAC. And if an attacker tries to change bits of the stored data, we trust that the difficulty of finding SHA-1 collisions is enough to ensure that no one can produce valid message/hash pairs without looking at the plaintext. If the pair is corrupted, the user is alerted that someone modified his keys. This denies an attacker the chance to modify bits of our keys and get better odds of cracking the scheme. SHA-1 has some theoretical weaknesses; however, in practice it is still a very strong cryptographic hash function.

Jun 15, 2011

The ePassport Standard and its Cryptographic Scheme

ePassport is the name given to the new passport specification standardized by the International Civil Aviation Organization (ICAO). It was created to speed up and standardize immigration procedures. The standard specifies RFID as the communication channel between the passport and the reader. Its main feature is the insertion of biometric data into the passport for biometric checks, including facial recognition, fingerprint and iris information. To keep this important data protected, an interesting cryptographic layer was added on top of the usual anti-falsification mechanisms such as watermarks.
Brazilian ePassport, one of the few fully compatible with the ICAO specification.
The name, nationality, birth date and other personal information, as well as the biometric information, are held on a contactless chip inside the passport. This set of information is called the Machine Readable Zone (MRZ), and the chip asserts its security. This information is transmitted to an ePassport reader during the immigration process: as the embedded chip has low processing power, the biometric data is transmitted to the reader so it can be checked against the bearer of the passport.
Anatomy of an ePassport

The physical communication layer of the ePassport broadcasts its messages, even though RFID communication has a low range (less than a meter); and as we know, an eavesdropper can be very determined to listen in ;-). The passport cover also has an electromagnetic shield: it is not guaranteed to block communication between the chip and the outside world when the passport is closed, but it's an extra obstacle. Another issue to prevent is passport and digital data forgery; otherwise a forger could take someone else's passport, insert his own biometric information and easily bypass immigration. As we can see, the ePassport is a complete meal for cryptography specialists.

Looking at this cryptographic layer we have:

A unique id used as a private key. This key is hardcoded during the chip manufacturing process and is non-traceable. Reverse engineering this kind of chip is really difficult, because the equipment needed is extremely expensive and hard to obtain; it is only available to large chip manufacturers.

An access control mechanism prevents an attacker from forcing the ePassport to send MRZ information to readers unless the passport owner authorizes it; otherwise, an attacker could use a reader and simply request the biometric information from the passport. A protocol called Basic Access Control (BAC) grants access to the reader. It relies on an OCR-readable code inside the passport: with this code, the reader can prove to the passport that it is an authorized reader, and the passport then transmits data to it. The BAC code is also used as the key for the encrypted transmission of the data, making an eavesdropping attack more difficult. It's important to note that BAC security relies entirely on physical access to the passport. BAC is part of the ICAO standard, but its support is not mandatory.

BAC happy path's sequence diagram

The ICAO also standardizes an authentication mechanism for readers, called Extended Access Control (EAC). It consists of sets of terminal signatures stored in the passport's chip. These signatures are changed periodically on the terminals by the countries (which sign them), so that keys stolen from equipment can't be used for long. Some sensitive MRZ data, such as iris and fingerprint information, is only transmitted to EAC-authenticated readers; the other information is optional.

EAC happy path's sequence diagram
    Passive Authentication (PA) is a cryptographic mechanism to ensure the data in the passport was validated by the authority emitting the passport. PA is a kind of digital signature in which the country emitting passports vouches for the veracity of the data through an asymmetric Country Verifying Certificate Authority (CVCA), producing a kind of message authentication code (MAC); this process is known as asymmetric cryptographic hashing, or digital signing. The PA MAC is created by the country when it emits a passport, using a signing key which must be kept safe. The country then publishes the public key (all readers must know the countries' public keys), which can be used to check whether the MAC is valid and was really generated by the organization possessing the signing key. The PA scheme relies on mathematical functions which make creating the MAC easy with the private key, but computationally hard without it; the public key can easily verify that the MAC was generated by the holder of the private key, and so detect forged data. This makes the forgery of digital data a really hard computational problem and ensures, with only a negligible chance of failure, that the data is valid. All the MRZ data is signed using this process. PA is mandatory in the ICAO standard.
    PA happy path's sequence diagram

    Active Authentication (AA) is a challenge-response authentication mechanism to prevent the cloning of passports. It relies on the untraceable chip key: cloning the MRZ data is a simple task, but you can't get a passport's private key, due to the difficulty of reverse engineering the chip. AA works as a challenge to the passport, so you can verify whether it really is what it claims to be. The reader encrypts a random message, whose content only it knows, with the passport's public key. The message is then sent to the passport, which must decrypt it and tell the reader what the original message was. This works because of the duality principle of asymmetric cryptography. This concept of a challenge is the same one behind CAPTCHAs and the Turing Test.
    AA happy path's sequence diagram
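
    As a toy illustration of the challenge-response idea (a Python sketch with PyCrypto; not the actual ePassport protocol, algorithms or key sizes):

    import os
    from Crypto.PublicKey import RSA

    # The chip's key pair; the private half stands in for the
    # hardcoded, non-traceable chip key.
    chip_key = RSA.generate(1024)
    public_key = chip_key.publickey()  # known to the reader

    # Reader: encrypt a random challenge with the passport's public key.
    challenge = os.urandom(16).encode('hex')  # random nonce as a hex string
    encrypted = public_key.encrypt(challenge, 32)

    # Chip: only the genuine passport can decrypt and answer.
    answer = chip_key.decrypt(encrypted)

    # Reader: the passport is genuine if the answer matches.
    assert answer == challenge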

    Jun 11, 2011

    Optimizing Functions with Python Caching Decorators

    Over these last months I've been solving some problems (such as some HMM algorithms) whose best solutions involve some kind of dynamic programming. Some of them are quite simple to implement, but their recursive formulations are far more intuitive. The problem is that, even in functional languages, recursive functions aren't handled well unless you use some mechanism like tail calls, which isn't as intuitive as we would like. The simplest example that comes to mind is the Fibonacci function, usually defined as:

    fib(0) = 1
    fib(1) = 1
    fib(n) = fib(n-1) + fib(n-2)



    As we know, almost all language compilers and interpreters use the call stack to evaluate the recursive calls of the functions being executed. Let's analyze the following C version of fibonacci:

    int fib(int n)
    {
        if (n == 0 || n == 1)
            return 1;
        else
            return fib(n-1) + fib(n-2);
    }


    It is really simple to understand when contrasted with the definition. But if we trace the program (even with a small input value), we'll get something like the following evaluation order:

    fib(6) = fib(5) + fib(4)
    fib(5) = fib(4) + fib(3)
    fib(4) = fib(3) + fib(2)
    fib(3) = fib(2) + fib(1)
    fib(2) = fib(1) + fib(0)
    fib(3) = fib(2) + fib(1)
    fib(2) = fib(1) + fib(0)
    fib(4) = fib(3) + fib(2)
    fib(3) = fib(2) + fib(1)
    fib(2) = fib(1) + fib(0)
    fib(3) = fib(2) + fib(1)
    fib(2) = fib(1) + fib(0)

    Call stack for fib(6)
    As we can see, the calculation of fib(4) down to fib(1) is repeated many times, and this is something we can avoid. In fact, this solution has exponential computational complexity, because each input n branches into 2 calls until they reach 0 or 1, approximately n times, leading to O(2^n) complexity. A simple way to avoid this is to convert the function into an iterative form:

    int fib(int n)
    {
        int current = 1;
        int previous = 1;
        int i;
        for (i = 1; i < n; i++) {
            int temp = current; // XXX: nasty
            current += previous;
            previous = temp;
        }

        return current;
    }


    The same result is achieved by using tail calls in functional languages.

    As you can see, it obfuscates the intuitive definition given by the recursive formulation. And we still have a problem: whenever we calculate fib(n), we recalculate the previous results, even if they were calculated before. If this function is used many times in our program, it will spend a lot of processing re-computing many of the values. We can avoid this by using dynamic programming, which keeps a record of previously calculated results. The drawback of this technique is the memory usage, which for large inputs can become a bottleneck; however, processing is usually the scarcer computing resource. A C implementation (not the most elegant one) is:

    // XXX: kids, don't do this at home
    int fib_results[10000];
    int last_fib;

    int fib(int n)
    {
        if (n <= last_fib)
            return fib_results[n];

        int current = fib_results[last_fib];
        int previous = fib_results[last_fib - 1];
        while (last_fib < n) {
            int temp = current;
            current += previous;
            previous = temp;
            fib_results[++last_fib] = current;
        }

        return current;
    }

    int main()
    {
        fib_results[0] = 1;
        fib_results[1] = 1;
        last_fib = 1;

        // ... other stuff ...

        return 0;
    }



    As we can see, dynamic programming isn't too hard to implement. On the other hand, reading such code is a tough task unless you are already familiar with the algorithm.
    If we extract the fundamental concept of dynamic programming, which is "store pre-computed results", we find a regularity through which every recursive function can be transformed into a dynamic programming one. One of the reasons I love Python is that it's easy to use meta-programming concepts, and that's what I will use to transform recursive functions into their dynamic form in a ridiculously easy way, using function decorators.
    Function decorators (like annotations in Java) are a form of meta-programming for functions: they extend functions with functionality such as debugging, tracing, adding metadata, synchronization, or memoization (not memorization) of values. The last one is a great way of optimizing recursive functions by caching their results (if you have enough memory available). One possible implementation of a memoizing decorator in Python is the following:

    def cached(function):
        cache = {}
        def wrapper(*args):
            if args in cache:
                return cache[args]
            else:
                result = function(*args)
                cache[args] = result 
                return result

        return wrapper

    Note that I'm not using kwargs, because unlike the args tuple they're not hashable, and they would add complexity to the example. See also that we have a function that returns another function, which uses the given one to calculate results and stores them in a cache. To cache our fib function we may use the following code:

    @cached
    def fib(n):
        if n == 0 or n == 1:
            return 1
        else:
            return fib(n-1) + fib(n-2)

    # or in a not so clean version:
    def normal_fib(n):
        if n == 0 or n == 1:
            return 1
        else:
            return fib(n-1) + fib(n-2)

    fib = cached(normal_fib)
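
    With the decorator applied, each value fib(k) is computed only once, so even large inputs return instantly. A quick check in the interpreter:

    >>> fib(100)
    573147844013817084101L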


    This technique is really useful to improve your code's performance in a really easy way. On the other hand, it isn't the best solution for every case: many times, coding a dynamic programming method by hand (if performance is crucial) will be necessary. It's also important to notice that I didn't use any cache management policy, which is important to economize memory. More appropriate cache data structures (such as numpy arrays for integer arguments) are also welcome. Python 3.2 added the lru_cache decorator to the functools module to provide this caching mechanism; if you are already using this version, it's smarter to use it instead of implementing your own. Here is how it can be used:

    # Python >= 3.2
    import functools

    @functools.lru_cache(maxsize=500) # a fixed size cache to avoid memory usage explosion
    def fib(n):
        ...


    This technique is very useful not only to economize CPU resources, but also network resources (such as caching SQL query results), other IO operations (such as disk reads), and even user interaction input, in a straightforward way.

    Jun 5, 2011

    CpG Islands (3) - Model Evaluation

    Following the previous post, where we built a Hidden Markov Model (HMM) for the CpG islands problem using the training set, we should now evaluate whether our model is adequate for predicting things about CpG islands. For the evaluation we may use a tagged sequence and see whether the HMM we built can predict the islands, and with what accuracy.

    Using the viterbi path and a tagged sequence (out of the training set) enables us to check whether the estimate given by the model is coherent with the real data. Using the HMM and the training and test sets of the last post, we can compare the results using a confusion matrix. For example, take the following snippet of the viterbi path paired with the tagged sequence:

    [ ... 0,0,0,0,0,1,1,1,1,1, ... ]
    [ ... a t t g t c C G A C ... ]


    We may calculate the following confusion matrix:

      +----------- P R E D I C T E D --------+
    A |            |   Island   | Non Island |
    C +------------+------------+------------+
    T |   Island   |      4     |      0     |
    U +------------+------------+------------+
    A | Non Island |      1     |      5     |
    L +------------+------------+------------+
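
    A small sketch of how such a matrix can be computed, assuming viterbi_path is the list of 0/1 states and tagged is the corresponding nucleotide string, with uppercase meaning a CpG island site:

    def confusion_matrix(viterbi_path, tagged):
        counts = {(True, True): 0, (True, False): 0,
                  (False, True): 0, (False, False): 0}
        for state, nucleotide in zip(viterbi_path, tagged):
            actual = nucleotide.isupper()  # True when the site is an island
            predicted = (state == 1)       # state 1 predicts an island
            counts[(actual, predicted)] += 1
        return counts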


    Note that the larger the values on the main diagonal are, the more accurate our model is. Computing this matrix for the experiment (using the HMM built with maximum likelihood) I got the following matrix:

      +----------- P R E D I C T E D --------+
    A |            |   Island   | Non Island |
    C +------------+------------+------------+
    T |   Island   |   328815   |    61411   |
    U +------------+------------+------------+
    A | Non Island |   1381515  |   1889819  |
    L +------------+------------+------------+



    We may see that a good amount of islands was correctly predicted (84.26% of the islands), but 57.76% of the non-island regions were classified as islands, which is a bad indicator. An experiment with the HMM after running Baum-Welch was also evaluated. The results were better than those obtained by maximum likelihood when predicting islands (96.56% accuracy), but the miss rate was also higher (70.70%). Here is the confusion matrix for this HMM:

      +----------- P R E D I C T E D --------+
    A |            |   Island   | Non Island |
    C +------------+------------+------------+
    T |   Island   |   376839   |    13387   |
    U +------------+------------+------------+
    A | Non Island |   2313138  |   958196   |
    L +------------+------------+------------+

    Matching of Real Sequence, Viterbi and Posterior results

    The result obtained was not very accurate. This may have happened because we had little data to train and evaluate our model. We could also build an HMM with other indicators that identify the CpG islands, to model them better: remember we simplified the CpG problem to the frequencies of occurring Cs and Gs, but a better model could also use the CG pairs. Using a more complex HMM we could have more than 2 states and then associate a set of states to the CpG and non-CpG sites. This would allow a better use of the posterior result to classify the sequence. But I'll keep this as future work.

    Jun 2, 2011

    CpG Islands (2) - Building a Hidden Markov Model

    With the definition of CpG islands from the previous post and the short introduction to Hidden Markov Models (HMMs), we can now model an HMM for finding CpG islands. We can create a model very similar to the "Fair Bet Casino" problem.

    At each nucleotide of a given DNA sequence there are two possibilities: the nucleotide belongs to a CpG island (let's denote this state S1) or it does not (S0). Looking at the next nucleotide, the chain can stay in the same state or change state, with complementary probabilities. We can view it as this Markov chain:

    CpG states Markov chain

    As the emissions of states S0 and S1, we have the associated probabilities of emitting each of the ACGT symbols. However, we don't know yet what these probabilities are. But if we have a dataset of sequences with attached information about where the CpG islands occur, we can estimate those parameters. I used DNA sequences tagged by biologists (parsed from the .fasta files with the biopython library) as the training and test sets to build the model and evaluate the technique. The dataset consists of nucleotides in upper and lower case: an upper case nucleotide denotes a CpG site, and lower case means it is not one. I used the statistical technique called maximum likelihood to build the HMM.

    To calculate the HMM emission probabilities with maximum likelihood, I count each character's frequency in the dataset and divide it by the total amount of nucleotides in the same state:

    Emissions probability diagram
    P(A|S0) = a / (a+c+g+t)
    P(C|S0) = c / (a+c+g+t)
    P(G|S0) = g / (a+c+g+t)
    P(T|S0) = t / (a+c+g+t)
    P(A|S1) = A / (A+C+G+T)
    P(C|S1) = C / (A+C+G+T)
    P(G|S1) = G / (A+C+G+T)
    P(T|S1) = T / (A+C+G+T)

    To calculate the transition probabilities of each state, count the number of transitions from a CpG site to a non-CpG site (out_of) and the reverse transitions (into), then divide each count by the frequency of the state it is coming from. Note that the permanence probability of each state is the complement of the probability of transitioning between states, since there is no other state to go to:

    P(S0|S1) = out_of / (A+C+G+T)
    P(S1|S0) = into / (a+c+g+t)
    P(S0|S0) = 1 - P(S1|S0)
    P(S1|S1) = 1 - P(S0|S1)

    Our model is almost done; it only lacks the initial probabilities, which can be taken as the frequency of each state over all nucleotides:

    P(S0) = (a+c+g+t) / (a+c+g+t+A+C+G+T)
    P(S1) = (A+C+G+T) / (a+c+g+t+A+C+G+T)
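
    As a rough sketch of this counting (assuming sequences is a list of training strings where uppercase nucleotides mark CpG sites, as in the dataset described above):

    from collections import Counter

    def estimate_counts(sequences):
        emissions = {0: Counter(), 1: Counter()}  # S0 = non island, S1 = island
        state_totals = Counter()
        transitions = Counter()
        for seq in sequences:
            states = [1 if ch.isupper() else 0 for ch in seq]
            for ch, state in zip(seq, states):
                emissions[state][ch.lower()] += 1
                state_totals[state] += 1
            for prev, cur in zip(states, states[1:]):
                transitions[(prev, cur)] += 1
        return emissions, transitions, state_totals

    # e.g. P(S1|S0) = transitions[(0, 1)] / float(state_totals[0])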

    I used ghmm to create my HMM from the given dataset, and it generated this model:

    DiscreteEmissionHMM(N=2, M=4)
      state 0 (initial=0.97)
        Emissions: 0.30, 0.21, 0.20, 0.29
        Transitions: ->0 (0.00), ->1 (1.00)
      state 1 (initial=0.03)
        Emissions: 0.22, 0.29, 0.29, 0.20
        Transitions: ->0 (1.00), ->1 (0.00)


    It's important to notice that when we print the HMM, it says there is a probability of 1 of staying in the same state. This occurs due to the float-formatted print: in reality that value is very close to 1, but there is still a small chance of hopping between states. It's also interesting to compare the emission probabilities of C and G between the states: in S1 the probability of emitting a C or a G is significantly higher than in S0.

    Another way of creating the HMM is to initialize a model with the same structure we used to model the problem, and then apply an unsupervised learning algorithm such as the Baum-Welch algorithm. Baum-Welch starts from initial weights in the HMM and, given untagged sequences, tries to infer a model. However, it is usually far better to initialize your HMM with data biased towards the problem (for example, the model we created) and let Baum-Welch calibrate the probabilities. With ghmm, do the following:

    sigma = ghmm.IntegerRange(0, 4) # our emission alphabet (A, C, G, T as 0..3)
    emission_seq = ghmm.EmissionSequence(sigma, [0, 2, 1, ... ]) # create an emission sequence
    hmm.baumWelch(emission_seq) # use a template hmm to calibrate the probabilities


    In the next post I will show the evaluation process of this HMM on the CpG islands problem.

    May 30, 2011

    Hidden Markov Models

    Nowadays, many applications use Hidden Markov Models (HMMs) to solve crucial problems in areas such as bioinformatics, speech recognition, musical analysis, digital signal processing, data mining, financial applications and time series analysis. HMMs are probabilistic models which are very useful for modelling sequence behaviours or discrete time series events. Formally, an HMM models a Markov process with hidden states, like an extension of a Markov chain. For computer scientists, it is a state machine with probabilistic transitions where each state can emit a value with a given probability.

    To understand HMMs better, I will illustrate how they work with "The Fair Bet Casino" problem. Imagine you are in a casino where you can bet on coin tosses, tossed by a dealer. A coin toss has two outcomes: head (H) or tail (T). Now suppose the coin dealer has two coins: a fair one (F), which outputs both H and T with probability 1/2, and a biased one (B), which outputs H with probability 3/4 and T with probability 1/4. Using probability language we say:
    • P(Hi+1|Fi) = 1/2
    • P(Ti+1|Fi) = 1/2
    • P(Hi+1|Bi) = 3/4
    • P(Ti+1|Bi) = 1/4
    Now imagine that the dealer changes the coin in a way you can't see, but you know he does it with probability 1/10. Thinking of the coin tosses as a sequence of events, we can say:
    • P(Fi+1|Fi) = 9/10
    • P(Bi+1|Fi) = 1/10
    • P(Bi+1|Bi) = 9/10
    • P(Fi+1|Bi) = 1/10 
    We can model it using a graph to illustrate the process:

    HMM for "The Fair Bet Casino" problem

    That's an HMM! It isn't rocket science. It's just important to add a few remarks. We call the set of all possible emissions of the Markov process the alphabet Σ ({H, T} in our problem). For many of the computational methods involving HMMs you will also need an initial state distribution π; for our problem, we may assume equal probability for each coin.

    Now the question is what we can do with the model in our hands. There are lots of things, such as: given a sequence of results, discover when the dealer used the biased coin; or generate a random sequence with behaviour coherent with the model.

    There is a nice library called ghmm (available for C and Python) which handles HMMs and already provides the most famous and important HMM algorithms. Unfortunately, the Python wrapper is not pythonic. Let's model our problem in Python to have some fun:

    import ghmm

    # setting 0 for Heads and 1 for Tails as our Alphabet
    sigma = ghmm.IntegerRange(0, 2)

    # transition matrix: rows and columns mean origin and destination states
    transitions_probabilities = [
        [0.9, 0.1], # 0: fair state
        [0.1, 0.9], # 1: biased state
    ]

    # emission matrix: rows and columns mean states and symbols respectively
    emissions_probabilities = [
        [0.5, 0.5], # 0: fair state emissions probabilities
        [0.75, 0.25], # 1: biased state emissions probabilities
    ]

    # probability of initial states
    pi = [0.5, 0.5] # equal probabilities for 0 and 1

    hmm = ghmm.HMMFromMatrices(
        sigma,
        # you can model HMMs with other emission probability distributions
        ghmm.DiscreteDistribution(sigma),
        transitions_probabilities,
        emissions_probabilities,
        pi
    )


    >>> print hmm
    DiscreteEmissionHMM(N=2, M=2)

      state 0 (initial=0.50)
        Emissions: 0.50, 0.50
        Transitions: ->0 (0.90), ->1 (0.10)

      state 1 (initial=0.50)
        Emissions: 0.75, 0.25
        Transitions: ->0 (0.10), ->1 (0.90)


    Now that we have our HMM object in hand, we can play with it. Suppose you have a given sequence of coin tosses and would like to find out which coin was being used at each toss:

    tosses = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1]


    The viterbi algorithm can be used to trace the most probable states at each coin toss according to the HMM distribution:

    # not as pythonic as it could be :-/
    sequence = ghmm.EmissionSequence(sigma, tosses)

    viterbi_path, _ = hmm.viterbi(sequence)

    >>> print viterbi_path
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


    Nice! But sometimes it is interesting to have the probability of each state at each point, instead of only the most probable one. For that, use the posterior or forward algorithms to get more detailed information.

    states_probabilities = hmm.posterior(sequence)

    >>> print states_probabilities
    [[0.8407944139086141, 0.1592055860913865], [0.860787703168127, 0.13921229683187356], ... ]

    The posterior method returns the list of state probabilities at each position. For example, at the first index we have [0.8407944139086141, 0.1592055860913865]: that means there is a ~0.84 probability that the dealer is using the fair coin at that toss and ~0.16 for the biased coin. We can also plot a graph showing how the probability of the dealer using the fair coin behaves over time (I used matplotlib for the graphs).

    Probability of being a fair coin over time
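
    The plot itself is short once we extract the fair-coin column (a matplotlib sketch, with states_probabilities as computed above):

    import matplotlib.pyplot as plt

    fair_probabilities = [p[0] for p in states_probabilities]
    plt.plot(fair_probabilities)
    plt.xlabel("Coin toss")
    plt.ylabel("Probability of the fair coin")
    plt.show()
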
    This is only a superficial example of what HMMs can do. It's worth having a look at them if you want to do some sequence or time series analysis in any domain. I hope this post presented and clarified what HMMs are and how they can be used to analyse data.

    May 29, 2011

    CpG Islands (1) - Problem Motivation & Definitions

    This semester I'm attending the course Processing of Sequences and Methods and Algorithms in Computational Biology (basically DNA and proteins). One of its focuses is the use of Hidden Markov Models to solve many of its problems. One well-studied problem is how to find coding sequences (snippets that can be translated into proteins) in a given sequence of DNA, which has both coding and non-coding regions. One of the most used techniques relies on CpG islands.

    In a random DNA sequence we have a lower frequency of two nucleotide types, (G)uanine and (C)ytosine, than of the other two, (A)denine and (T)hymine, due to a phenomenon called DNA methylation. Because of this methylation, the deletion of a C or a G nucleotide is more probable than that of an A or a T.

    DNA double helix structure


    Through a species' evolution, genes which are essential for survival cannot mutate much, otherwise the organism can't live and reproduce to transmit its genes to its descendants. Because of this, it is very unlikely that those genes change much, even over many generations. It's important to notice that non-coding regions can mutate without affecting the organism, because they don't touch a vital part of the genetic code.

    With these two facts in hand, we can see that DNA mutations occur more frequently in non-coding regions. As the methylation of C is a very common cause of mutation, we can observe that non-coding regions will have fewer Cs and Gs, since there the methylation doesn't interfere (much) with the organism. The opposite occurs in coding regions, which have less frequent mutations.
    CpG island around an important human DNA regulation site

    With this information, we can discover regions which are likely to be coding regions. Sites containing greater concentrations of C-phosphate-G (called CpG islands) are a strong indicator that the region hasn't suffered many mutations through the generations, so it is very probably an area of interest for biologists. I'll describe in the next posts how we can identify these CpG islands using HMMs and the ghmm Python library.

    May 27, 2011

    QtQuick - WSL II

    QtQuick uses QML, a powerful tool for creating fluid interfaces in a clean, easy and fast way. It uses a declarative language (JavaScript-based) and has really good performance, which makes it suitable for building apps on any kind of platform. It's also really easy to extend QML with Qt/C++, enabling developers to speed up the application logic or whatever else is needed. It is also possible to render QtQuick with OpenGL without any change in the code, sending the painting job to the video card. IMHO, its greatest advantage is that it's suitable for designers who already know HTML and CSS to build awesome interfaces, being easier than Flash and ActionScript. This is really welcome because QtQuick prototypes are more straightforward: they can be directly integrated into the final application code.

    These are some videos showing what can be done with QtQuick:

    I've been developing with QtQuick (and some extensions such as the MeeGo and Plasma Components) for quite a long time (almost a year now). So I'm sharing my QtQuick presentation from the II Workshop of Free Software at CIn-UFPE, on 28/03/2011. The workshop was organized by the university's Linux User Group (CIn-LUG) with the intent of presenting new free technologies and showing how to start using them.
    The exercise resources and examples can be found here.

    Happy hacking!

    May 19, 2011

    Plasma Components

    Hi,

    I'm here to talk about my GSoC project (\m/): the Plasma components. As you may know, QML is a declarative language for building rich interfaces, introduced in Qt 4.7, which provides simple primitives. As it is a powerful way to develop interfaces and the future of UI development for Qt, it was necessary to make Plasma support it.

    Creating interfaces in QML is really easy and fast, but sometimes we need common widgets, and it may be boring to reimplement and replicate them in every application we create (e.g. Button, Slider, ScrollBar). To avoid this code replication, the Qt Components project was created to unify an API for a set of components.

    The current (often updated) set of defined components and their API can be found at QTCOMPONENTS-200. Qt Components intends to be cross-platform, but sometimes we need closer integration between a component and the platform: on the Plasma desktop we want to show tooltips or use the theme's SVG images, for example. The common API defines a set of properties a component must have, but doesn't disallow extra properties and functionality (suggestions?). Creating the Plasma components with theme integration and Plasma behaviour is what my GSoC project is about :-).

    I have already started working on the Plasma components. They can be found in the kde-runtime repository on the plasma/declarative branch, more specifically in the plasma/declarativeimports/plasmacomponents directory. The components already there are not yet done: they lack tooltips, keyboard event handling and focus policy, and some (CheckBox and RadioButton) don't have images yet. However, they're fully functional and cover all the properties and behaviours defined in the common API. Here is the list of components in the state mentioned above (as of this post's publication):
    • BusyIndicator
    • Button
    • CheckBox
    • RadioButton
    • Slider
    These are components not present in the common API, but highly wanted in Plasma:
    • ScrollBar
    • ListItem
    • ListItemView
    • ListHighlight
    To test these components, a components gallery is also kept (plasma/declarativeimports/test/Gallery.qml), which you may view with qmlviewer (after installing the components).

    Gallery screenshot
    Feedback is highly appreciated. ;-)

    Cheers!