Codesprint 2012

After a good nights sleep I’ve put up some solutions to the Codesprint 2012 competition for the Quora Classification and Newsle Clustering problems.

Newsle Clustering

Quora Classification

CodeSprint 2012

So CodeSprint 2012 just finished and I finally have some time to look back over the experience. It was a hectic 48 hours full of non-stop coding and small amounts of sleep. I have to say I found many of the questions pretty interesting, especially those with a machine learning theme like the classification problem from Quora and the clustering problem from Newsle.

Looking back I learnt alot from the process and enjoyed myself aswell. I’d definitely consider participating again next year.

I’ll put up my solutions to some of the questions sometime soon.

Randomizing Training Data – Generating a Random Permutation on the Fly

So I’m working on a small problem involving the randomization of the order in which training data is presented to a machine learning algorithm. This is important for machine learning algorithms like stochastic gradient descent (SGD) since training order impacts the rate of convergence.

The simplest approach is to generate an array, fill it with the sequence 1..N and then use the Fisher-Yates shuffle to turn this into a random permutation. The problem with this approach however is that if the amount of training data is large then the size of the array may no longer fit into memory. In order to scale this we need a way of computing a random permutation on the fly. That is, at each iteration the method should return the next index in the random permutation sequence, which can then be used to load the appropriate training sample and feed it into SGD.

An elegant solution for this is to use a linear feedback shift register (LFSR – http://en.wikipedia.org/wiki/Linear_feedback_shift_register). At each iteration the state of the register is read out, and forms a (quasi-)random permutation sequence. Nothing needs to be stored other than the internal state of the LFSR. Polynomials for maximum period LFSRs are well known, and if these are used we are guaranteed the state will not repeat within the period, which is 2^N-1, where N is the number of registers in the LFSR.  This makes LFSRs perfect for generating a random permutation sequence on the fly.

In order to generate random permutation sequences of different lengths I’m planning to have a small lookup table for each possible maximal period, which are all powers of 2. When a random permutation of size M is needed I’ll use the maximal period LFSR with period 2^ceil(log2(M)). A check can be added at each iteration to see if the state is greater than M. If this is the case, the value will be discarded and the LFSR will be clocked to its next state. In the worst case this will result in half of the values being discarded which is still acceptable.