Jun 22, 2011

Encrypted Tweets

As a cryptography course project, @laisandrade and I developed a Twitter extension which allows tweeting in a way that only a chosen group of followers can decrypt and understand. The goal is to securely broadcast tweets to a group of followers and prevent anyone else from understanding their content. This definition of security is safe against an eavesdropping attack, which common sense often considers secure enough, but which isn't by cryptographic standards. Other concerns emerge when we consider further attacks or ways of extracting information from the ciphertext, for example the Known-Plaintext Attack.
Configuration of the Encrypted Tweets problem - Alice securely broadcasting messages to a specific group
In this project we also had to consider other security issues such as forgery attacks and secure information storage. The only real-world problem outside the project's scope was private key distribution. We assume that a user who wants to share his keys with his followers could use another mechanism, such as a Diffie-Hellman key exchange or a trusted key broker. The full project specification can be found here.
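The key exchange we assume is available can be illustrated with a toy Diffie-Hellman round trip. This is purely illustrative, with tiny insecure parameters; real deployments use large primes:

```python
import random

# Toy public parameters; a real deployment uses a large prime p
p, g = 23, 5

a = random.randrange(1, p - 1)  # Alice's secret exponent
b = random.randrange(1, p - 1)  # the follower's secret exponent

A = pow(g, a, p)  # Alice sends A in the clear
B = pow(g, b, p)  # the follower sends B in the clear

# Both sides derive the same shared secret, usable as a group key seed
assert pow(B, a, p) == pow(A, b, p)
```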

To accomplish the task we used the Greasemonkey Firefox plugin. This plugin allows us to run custom JavaScript in specific web pages, so we can add functionality to known sites. In fact, we used a script stub to do it. This script adds an encrypt-tweet feature as well as a key manager on the password settings page. It also provides a function to store (unsafely) the user information. It was also our job to make this secure. We assumed that the script itself couldn't be cracked; our concern was the safety of the information outside the script's execution memory.

Encrypt button and Group selection close to Tweet button

============================== Encryption Specification ==============================

Encrypted Message:
| "aes:" | Ctr | Extended-AES(Content, Ctr, GroupKey) |
* Ctr is the base counter for the CTR block chaining cipher. Every message picks a random Ctr.
* GroupKey is the key of the group whose members are the only ones who should understand the message.
* The Extended-AES function is detailed below.
* The "aes:" indicator is not a security threat, due to Kerckhoffs's Principle.

Content:
| sha1(Message) | Message |
* The SHA-1 cryptographic hash is used to check integrity and prevent a modified message from passing as a real one.
* SHA-1 produces a 160-bit digest, which is converted into a base64 string so it can be sent over Twitter.
* This Message Authentication Code (MAC) is necessary for Chosen-Ciphertext Attack (CCA) security.

Message: the original message

For our Extended-AES implementation, we extended the AES block cipher with Counter (CTR) mode chaining. Note that CTR mode uses a base counter (Ctr) as an Initialization Vector (IV), which must also be transmitted to enable decryption. This IV is chosen at random by our pseudorandom generator. If the IV is not truly random, an attacker may predict the IV sequence and extract more information from the message. A real problem faced by pseudorandom generators is a seed selection that is not truly random. To handle this, the script stub creates a seed using the mouse positions during the first seconds of execution (an entropy test is also used to check the randomness of the seed).
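A minimal sketch of the CTR construction, using SHA-1 as a stand-in for the AES block function just to keep the example dependency-free (the names here are illustrative, not the project's actual code):

```python
import hashlib

BLOCK = 20  # SHA-1 digest size; AES would use 16-byte blocks

def keystream(key: bytes, ctr: int, nblocks: int) -> bytes:
    # Block i of the keystream is PRF(key, ctr + i)
    return b"".join(
        hashlib.sha1(key + (ctr + i).to_bytes(16, "big")).digest()
        for i in range(nblocks)
    )

def ctr_crypt(key: bytes, ctr: int, data: bytes) -> bytes:
    # Encryption and decryption are the same XOR operation
    stream = keystream(key, ctr, -(-len(data) // BLOCK))
    return bytes(d ^ s for d, s in zip(data, stream))

msg = b"only my group can read this"
ct = ctr_crypt(b"group-key", 1234, msg)
assert ctr_crypt(b"group-key", 1234, ct) == msg  # XOR reversibility
```

The XOR reversibility in the last line is exactly why decryption reuses the same function, as described below.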

When someone receives an "aes:" tagged message, the script checks the author of the message and fetches the key provided by him. To decrypt, we also use Extended-AES, since CTR mode is symmetric (due to the XOR reversibility property). Finally, the SHA-1 hash is checked against the message to verify whether the key is correct or the message suffered a forgery attempt.

To store the group keys we used Greasemonkey's key-value storage (the GM_setValue and GM_getValue functions). We store and recover the values under the key "twit-key-" + login, which is stored as plaintext on the hard drive. To prevent this information from being stolen by other users, viruses or worms, we added an encryption layer over it. The stored data representation we developed is this one:

Keys management screen

==============================   Safe Storaging Specification   ==============================

Stored Data:
|   AES(Data, MasterKey)   |

* MasterKey is the user's secret key to access all the other keys. It requires user input every time he logs in.

Data:
|   sha1(GroupKeys)   |   GroupKeys   |
* This cryptographic hash is used to check the integrity of the stored keys.
* Without it, a malicious agent could override the stored values and set keys he can decrypt.

GroupKeys:
|   Item   |   "$$"   |   Item   |   "$$"   |   ...   |   "$$"   |   Item   |

Item:
|   User   |   "$"   |   Group   |   "$"   |   Key   |

User: base64 string
Group: base64 string
Key: base64 string
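The layout above can be sketched with a pair of hypothetical serialize/parse helpers (illustrative names, not the project's code); base64-encoding each field guarantees the "$" and "$$" separators never collide with field content:

```python
import base64

def b64(s: str) -> str:
    return base64.b64encode(s.encode()).decode()

def unb64(s: str) -> str:
    return base64.b64decode(s).decode()

def serialize(items):
    # items: list of (user, group, key) tuples
    return "$$".join("$".join(b64(f) for f in item) for item in items)

def parse(data):
    return [tuple(unb64(f) for f in item.split("$"))
            for item in data.split("$$")]

items = [("alice", "friends", "k1"), ("bob", "work", "k2")]
assert parse(serialize(items)) == items
```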


It's important to notice that we used the SHA-1 hashing algorithm to ensure the data wasn't modified while stored or transmitted. We trust it wasn't modified due to the difficulty of cracking the AES cipher to forge a new MAC. And if an attacker tries to change bits of the stored data, we trust that SHA-1's collision resistance is enough to ensure that no one can produce valid (message, hash) pairs without looking at the plaintext. If the pair is corrupted, the user is alerted that someone modified his keys. By doing this, we prevent an attacker from modifying bits of our keys to get better chances of cracking the scheme. SHA-1 has some theoretical weaknesses; in practice, however, it is still a very strong cryptographic hash function.

Jun 15, 2011

The ePassport Standard and its Cryptographic Scheme

ePassport is the name given to the new passport specification standardized by the International Civil Aviation Organization (ICAO). It was created to speed up and standardize immigration procedures. This standard specifies RFID as the communication channel between the passport and the reader. Its main feature is the insertion of biometric data into the passport for biometric checks, including facial recognition, fingerprint and iris information. To keep this important data protected, an interesting cryptographic layer was added on top of the usual document falsification countermeasures such as watermarks.
Brazilian ePassport. One of the few with full compatibility with the ICAO specification.
The name, nationality, birth date and other personal information, as well as the biometric data, are held on a contactless chip inside the passport. This set of information is called the Machine Readable Zone (MRZ), and the chip asserts its security. This information is transmitted to an ePassport reader during the immigration process. As the embedded chip has low processing power, the biometric data is transmitted to the reader so it can be checked against the bearer of the passport.
Anatomy of an ePassport

The physical communication layer of the ePassport broadcasts its messages, even though RFID communication has a low range (less than a meter). And as we know, an eavesdropper can be very determined to listen to the communication ;-). The passport also has an electromagnetic shield on its cover; it doesn't guarantee to block communication between the chip and the outside world when the passport is closed, but it's an extra obstacle. Another issue to prevent is passport and digital data forgery. Otherwise a forger could get someone else's passport, insert his own biometric information and easily bypass immigration. As we can see, the ePassport is a complete meal for cryptography specialists.

Looking at this cryptographic layer we have:

A unique id used as a private key. This key is hardcoded during the chip manufacturing process and is untraceable. Reverse engineering this kind of chip is really difficult because the equipment to do it is extremely expensive and hard to obtain; it is only available to large chip manufacturers.

An access control that prevents an attacker from forcing the ePassport to send the MRZ information to readers unless the passport owner authorizes it. Otherwise an attacker could use a reader and simply request the biometric information from the passport. It uses a protocol called Basic Access Control (BAC) to grant access to the reader. It relies on an OCR-readable code printed inside the passport. With this code, the reader can prove to the passport that it is an authorized reader, and the passport then transmits data to it. The BAC code is also used as the key for the encrypted transmission of the data, making an eavesdropping attack more difficult. It's important to note that BAC security relies entirely on physical access to the passport. BAC is part of the ICAO standard, but its support is not mandatory.

BAC happy path's sequence diagram
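The spirit of BAC can be sketched as deriving a transmission key from the OCR-readable code. This is a deliberate simplification: the real ICAO Doc 9303 derivation hashes the document number, birth date and expiry date (with check digits) and derives a pair of 3DES keys, not a single truncated hash:

```python
import hashlib

def bac_key(mrz_info: str) -> bytes:
    # Simplified: hash the OCR-readable code and truncate to a 16-byte key.
    return hashlib.sha1(mrz_info.encode()).digest()[:16]

# Hypothetical MRZ line: document number + birth date + expiry date
key = bac_key("L898902C<369080619406236")
assert len(key) == 16
```

Only a party that physically read the code inside the passport can derive this key, which is what ties BAC security to physical access.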

The ICAO also standardizes an authentication mechanism for readers called Extended Access Control (EAC). It consists of sets of terminal signatures stored in the passport's chip. These signatures are changed periodically by the countries (which sign them) on the terminals, to prevent keys stolen from equipment from being used for long. Some sensitive MRZ data, such as iris and fingerprints, is only transmitted to EAC-authenticated readers. Other information is optional.

EAC happy path's sequence diagram
    Passive Authentication (PA) is a cryptographic mechanism to ensure the data in the passport was validated by the authority emitting passports. PA is a kind of digital signature where the country that emits passports vouches for the veracity of the data through an asymmetric Country Verification Certificate Authority (CVCA), a kind of message authentication code (MAC). This process is known as asymmetric cryptographic hashing, or digital signing. The PA MAC is created by the country when it emits a passport; the country must keep safe the signing key used to create the MAC code. It then publishes the public key (all readers must know the countries' public keys), which can be used to check that the MAC is valid and was really generated by the organization possessing the signing key. The PA scheme relies on mathematical functions which make it easy to produce a MAC with the private key, but computationally hard without it. The public key can easily check whether the MAC was generated by the owner of the private key, and so detect forged data. This process makes forgery of the digital data a really hard computational problem, and ensures with negligible chance of failure that the data is valid. All the MRZ data is signed using this process. Using PA is mandatory in the ICAO standard.
    PA happy path's sequence diagram

    Active Authentication (AA) is a challenge-response mechanism to prevent the cloning of passports. It relies on the untraceable chip key. Cloning the MRZ data is a simple task; however, you can't get a passport's private key, due to the difficulty of reverse engineering the chip. AA works as a challenge to the passport, so the reader can verify it is authentic. The reader encrypts a random message, whose content only it knows, with the passport's public key. The message is then sent to the passport, which must decrypt it and respond to the reader with the original message. This process works because of the duality of the key pair in asymmetric cryptography. This concept of challenge is the same one behind CAPTCHAs and the Turing Test.
    AA happy path's sequence diagram
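The AA round trip can be sketched with a toy RSA keypair (tiny insecure numbers, illustration only; real chips use proper key sizes and padding):

```python
import random

# Toy passport keypair: (n, e) public, d private (hardcoded in the chip)
p, q = 61, 53
n = p * q
e = 17
d = pow(e, -1, (p - 1) * (q - 1))  # modular inverse (Python 3.8+)

def passport_respond(ciphertext: int) -> int:
    # Only the genuine chip, holding d, can recover the challenge
    return pow(ciphertext, d, n)

challenge = random.randrange(2, n)      # reader picks a secret message
ciphertext = pow(challenge, e, n)       # and encrypts it with the public key
assert passport_respond(ciphertext) == challenge  # genuine chip passes
```

A cloned passport holding only the public MRZ data cannot compute `passport_respond`, so it fails the challenge.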

    Jun 11, 2011

    Optimizing Functions with Python Caching Decorators

    In these last months I've been solving some problems (such as some HMM algorithms) whose best solutions involve some kind of dynamic programming. Some of them are quite simple to implement, but their recursive formulations are far more intuitive. The problem is that, even in functional languages, recursive functions aren't handled well unless you use some mechanism like tail calls, which aren't as intuitive as we would like. The simplest example that comes to my mind is the fibonacci function, usually defined as:

    fib(0) = 1
    fib(1) = 1
    fib(n) = fib(n-1) + fib(n-2)

    As we know, almost all language compilers and interpreters use the call stack to evaluate the recursive calls of executing functions. We can analyze the following C fibonacci version:

    int fib(int n) {
        if (n == 0 || n == 1)
            return 1;
        return fib(n - 1) + fib(n - 2);
    }

    It is really simple to understand when contrasted with the definition. But if we trace the program (even with a small input value) we'll get something like the following evaluation order:

    fib(6) = fib(5) + fib(4)
    fib(5) = fib(4) + fib(3)
    fib(4) = fib(3) + fib(2)
    fib(3) = fib(2) + fib(1)
    fib(2) = fib(1) + fib(0)
    fib(3) = fib(2) + fib(1)
    fib(2) = fib(1) + fib(0)
    fib(4) = fib(3) + fib(2)
    fib(3) = fib(2) + fib(1)
    fib(2) = fib(1) + fib(0)
    fib(3) = fib(2) + fib(1)
    fib(2) = fib(1) + fib(0)

    Call stack for fib(6)
    As we can see, the calculation of fib for 4 down to 1 is repeated many times, which is something we can avoid. In fact, this solution has exponential computational complexity: for each n from the input we branch into 2 calls until n reaches 0 or 1, approximately n times, leading to O(2^n) complexity. A simple way to avoid it is converting it into an iterative form:

    int fib(int n) {
        int current = 1;
        int previous = 1;
        int i;
        for (i = 1; i < n; i++) {
            int temp = current; // XXX: nasty
            current += previous;
            previous = temp;
        }
        return current;
    }

    The same result is achieved by using tail calls in functional languages.

    As you can see, it obfuscates the intuitive definition given by the recursive formulation. And we still have a problem: whenever we calculate fib(n), we recalculate all its previous results, even if they were calculated before. If this function is used many times in our program, it will spend a lot of processing re-computing many of the values. We can avoid this by using dynamic programming, which keeps a record of previously calculated results. The drawback of this technique is the memory usage, which for large inputs can become a bottleneck. However, processing is usually the scarcer computing resource. A C implementation (not the most elegant one) is:

    // XXX: kids, don't do this at home
    int fib_results[10000];
    int last_fib;

    int fib(int n) {
        if (n <= last_fib)
            return fib_results[n];

        int current = fib_results[last_fib];
        int previous = fib_results[last_fib - 1];
        for (; last_fib < n; last_fib++) {
            int temp = current;
            current += previous;
            previous = temp;
            fib_results[last_fib + 1] = current;
        }
        return current;
    }

    int main() {
        fib_results[0] = 1;
        fib_results[1] = 1;
        last_fib = 1;

        // ... other stuff ...

        return 0;
    }

    As we can see, dynamic programming isn't too hard to implement. On the other hand, reading such code is a tough task unless you are already familiar with the algorithm.
    If we extract the fundamental concept of dynamic programming, which is "store pre-computed results", we find a regularity: every recursive function can be transformed into a dynamic programming one. One of the reasons I love Python is how easy it is to use meta-programming concepts, and that's what I will use to transform recursive functions into their dynamic programming form in a ridiculously easy way, using function decorators.
    Function decorators (like annotations in Java) are a form of meta-programming for functions. They extend functions with functionality such as debugging, tracing, adding meta-data, synchronization, or memoization (not memorization) of values; the last is a great way of optimizing recursive functions by caching their results (if you have enough memory available). One possible implementation of a memoizing decorator in Python is the following:

    def cached(function):
        cache = {}
        def wrapper(*args):
            if args in cache:
                return cache[args]
            result = function(*args)
            cache[args] = result
            return result
        return wrapper

    Note that I'm not using kwargs because dicts aren't hashable, unlike the args tuple, and supporting them would add some complexity to the example. See also that we have a function that returns another function, which uses the given one to calculate results and store them in a cache. To cache our fib function we may use the following code:

    @cached
    def fib(n):
        if n == 0 or n == 1:
            return 1
        return fib(n-1) + fib(n-2)

    # or in a not so clean version:
    def normal_fib(n):
        if n == 0 or n == 1:
            return 1
        return fib(n-1) + fib(n-2)  # recursive calls go through the cached fib

    fib = cached(normal_fib)
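A quick sanity check (with a hypothetical call counter added just for illustration, and the decorator repeated so the snippet is self-contained) shows each distinct argument hits the underlying function only once:

```python
def cached(function):
    cache = {}
    def wrapper(*args):
        if args not in cache:
            cache[args] = function(*args)
        return cache[args]
    return wrapper

calls = 0

@cached
def fib(n):
    global calls
    calls += 1  # counts only cache misses
    if n == 0 or n == 1:
        return 1
    return fib(n - 1) + fib(n - 2)

assert fib(30) == 1346269
assert calls == 31  # one underlying call per distinct n in 0..30
```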

    This technique is really useful to improve your code's performance in a really easy way. On the other hand, it isn't the best solution for every case. Many times, coding a dynamic programming method directly (if performance is crucial) will be necessary. It's also important to notice that I didn't use any cache memory management policy, which matters to save memory. More appropriate cache data structures (such as numpy arrays for integer arguments) are also welcome. Python 3.2 added the lru_cache decorator to the functools module to provide this cache mechanism. If you are already using this version, it's smarter to use it instead of implementing your own. Here is how it is used:

    # Python >= 3.2
    import functools

    @functools.lru_cache(maxsize=500) # a fixed-size cache avoids memory usage explosion
    def fib(n):
        ...

    This technique is very useful not only to save CPU resources but also network (such as caching SQL query results) and other IO operations (such as disk reads), and even user interaction input, in a straightforward way.

    Jun 5, 2011

    CpG Islands (3) - Model Evaluation

    Following the previous post, where we built a Hidden Markov Model (HMM) for the CpG islands problem using the training set, we should now evaluate whether our model is adequate to predict things about CpG islands. To evaluate it we may use a tagged sequence and see if the HMM we built can predict the islands, and with what accuracy.

    Using the viterbi path on a tagged sequence (out of the training set) enables us to compare whether the estimates given by the model are coherent with the real data. Using the HMM and the training and test sets of the last post, we can compare the results using a confusion matrix. For example, for the following snippet of the viterbi path paired with the tagged sequence:

    [ ... 0,0,0,0,0,1,1,1,1,1, ... ]
    [ ... a t t g t c C G A C ... ]

    We may calculate the following confusion matrix:

      +----------- P R E D I C T E D --------+
    A |            |   Island   | Non Island |
    C +------------+------------+------------+
    T |   Island   |      4     |      0     |
    U +------------+------------+------------+
    A | Non Island |      1     |      5     |
    L +------------+------------+------------+
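This tally can be reproduced with a small sketch, assuming (as in our tagging) that uppercase letters mark true island sites and state 1 is the predicted island state:

```python
def confusion_matrix(path, tagged_seq):
    # m[actual][predicted], with index 0 = island, 1 = non-island
    m = [[0, 0], [0, 0]]
    for state, nucleotide in zip(path, tagged_seq):
        actual = 0 if nucleotide.isupper() else 1
        predicted = 0 if state == 1 else 1
        m[actual][predicted] += 1
    return m

path = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
seq = "attgtcCGAC"
assert confusion_matrix(path, seq) == [[4, 0], [1, 5]]
```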

    Note that the bigger the main diagonal values are, the more accurate our model is. Building this matrix for the experiment made (using the HMM with maximum likelihood) I got the following matrix:

      +----------- P R E D I C T E D --------+
    A |            |   Island   | Non Island |
    C +------------+------------+------------+
    T |   Island   |   328815   |    61411   |
    U +------------+------------+------------+
    A | Non Island |   1381515  |   1889819  |
    L +------------+------------+------------+

    We may see that a good amount of the islands was correctly predicted (84.26% of the islands), but 42.23% of the non-island regions were classified as islands, which is a bad indicator. An experiment with the HMM after running Baum-Welch was also evaluated. The results were better than those obtained by maximum likelihood when predicting islands (96.56% accuracy), but it also had a higher miss rate (70.70% of non-islands classified as islands). Here is the confusion matrix for this HMM:

      +----------- P R E D I C T E D --------+
    A |            |   Island   | Non Island |
    C +------------+------------+------------+
    T |   Island   |   376839   |    13387   |
    U +------------+------------+------------+
    A | Non Island |   2313138  |   958196   |
    L +------------+------------+------------+

    Matching of Real Sequence, Viterbi and Posterior results

    The result obtained was not very accurate. This may have happened because we had little data to train and evaluate our model. We could also build an HMM with other indicators that identify CpG islands, to model them better. Remember we simplified the CpG problem to the frequencies of occurrence of Cs and Gs, but a better model could also use the CG pairs. Using a more complicated HMM we could have more than 2 states and then associate a set of states to the CpG and non-CpG island sites. This would allow better use of the posterior result to classify the sequence. But I'll keep this as future work.

    Jun 2, 2011

    CpG Islands (2) - Building a Hidden Markov Model

    With the definition of CpG islands from the previous post and the short introduction to Hidden Markov Models (HMMs), we can now model an HMM for finding CpG islands. We can create a model very similar to the "Fair Bet Casino" problem.

    When we are at a nucleotide of a given DNA sequence, there are two possibilities: that nucleotide belongs to a CpG island (let's denote this state S1) or it does not (S0). Moving to a neighboring nucleotide, we can stay in the same state or change, with complementary probabilities. We can view it as this Markov chain:

    CpG states Markov chain

    As the emissions of states S0 and S1, we have the associated probabilities of emitting the ACGT symbols. However, we don't know these probabilities upfront. But if we have a dataset of sequences with attached information about where the CpG islands occur, we can estimate those parameters. I used DNA sequences tagged by biologists (and the biopython library to parse the .fasta files) as the training and test sets to build the model and evaluate the technique. This dataset consists of nucleotides in upper and lower case. An upper case nucleotide denotes a CpG site, and lower case nucleotides mean it is not one. I used a statistical technique called maximum likelihood to build the HMM.

    Using maximum likelihood to calculate the HMM emission probabilities, I counted each character's frequency in the dataset and divided it by the total amount of nucleotides in the same state:

    Emissions probability diagram
    P(A|S0) = a / (a+c+g+t)
    P(C|S0) = c / (a+c+g+t)
    P(G|S0) = g / (a+c+g+t)
    P(T|S0) = t / (a+c+g+t)
    P(A|S1) = A / (A+C+G+T)
    P(C|S1) = C / (A+C+G+T)
    P(G|S1) = G / (A+C+G+T)
    P(T|S1) = T / (A+C+G+T)

    To calculate the transition probabilities of each state, I counted the number of transitions from a CpG to a non-CpG site (out_of) and the reverse transitions (into), then divided each of them by the frequency of the state they're coming from. Note that the probability of staying in each state is the complement of the probability of transiting between states, since there is no other state to go to:

    P(S0|S1) = out_of / (A+C+G+T)
    P(S1|S0) = into / (a+c+g+t)
    P(S0|S0) = 1 - P(S1|S0)
    P(S1|S1) = 1 - P(S0|S1)

    Our model is almost done; it just lacks the initial transitions, which can be taken as the frequency of each state over all nucleotides:

    P(S0) = (a+c+g+t) / (a+c+g+t+A+C+G+T)
    P(S1) = (A+C+G+T) / (a+c+g+t+A+C+G+T)
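These counts can be sketched in plain Python (a hypothetical helper, not the ghmm code used next; uppercase means S1, lowercase means S0):

```python
from collections import Counter

def estimate(sequences):
    emit, trans, total = Counter(), Counter(), Counter()
    for seq in sequences:
        for ch in seq:
            emit[(ch.isupper(), ch.upper())] += 1  # per-state symbol counts
            total[ch.isupper()] += 1               # per-state totals
        for prev, cur in zip(seq, seq[1:]):
            trans[(prev.isupper(), cur.isupper())] += 1
    p_emit = {k: v / total[k[0]] for k, v in emit.items()}
    p_s1_given_s0 = trans[(False, True)] / total[False]   # "into"
    p_s0_given_s1 = trans[(True, False)] / total[True]    # "out_of"
    return p_emit, p_s1_given_s0, p_s0_given_s1

p_emit, p01, p10 = estimate(["acgtACGTacgt"])
assert p_emit[(False, "A")] == 2 / 8   # P(A|S0)
assert p01 == 1 / 8                    # P(S1|S0)
```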

    I used the ghmm library to create my HMM from the given dataset, and it generated this model:

    DiscreteEmissionHMM(N=2, M=4)
      state 0 (initial=0.97)
        Emissions: 0.30, 0.21, 0.20, 0.29
        Transitions: ->0 (0.00), ->1 (1.00)
      state 1 (initial=0.03)
        Emissions: 0.22, 0.29, 0.29, 0.20
        Transitions: ->0 (1.00), ->1 (0.00)

    It's important to notice that the printed HMM says the probability of staying in the same state is 1. This occurs due to the float-formatted print: in reality that value is very close to 1, but there is still a chance of hopping between states. It's also interesting to compare the emission probabilities of C and G between the states: in S1 the probability of getting a C or a G is significantly higher than in S0.

    Another manner of creating the HMM is to initialize a model with the same structure we used for the problem and then apply an unsupervised learning algorithm, such as the Baum-Welch algorithm. Baum-Welch starts with randomized initial weights in the HMM and, once you feed it untagged sequences, tries to infer a model. However, it's usually far better to initialize your HMM with data biased towards the problem (for example, using the model we created) and let Baum-Welch calibrate the probabilities. With ghmm, do the following:

    sigma = ghmm.IntegerRange(0,4) # our emission alphabet
    emission_seq = ghmm.EmissionSequence(sigma, [0, 2, 1, ... ]) # create an emission sequence
    hmm.baumWelch(emission_seq) # use a template hmm to calibrate the probabilities

    In the next post I will show the evaluation process of the HMM on the CpG islands problem.