This post focuses on how to use functors and monads in practice with the HLearn library. We won’t talk about their category theoretic foundations; instead, we’ll go through ten concrete examples involving the categorical distribution. This distribution is somewhat awkwardly named for our purposes because it has nothing to do with category theory—it is the most general distribution over non-numeric (i.e. categorical) data. Its simplicity should make the examples a little easier to follow. Some more complicated models (e.g. the kernel density estimator and Bayesian classifier) also have functor and monad instances, but we’ll save those for another post.
Before we dive into using functors and monads, we need to set up our code and create some data. Let’s install the packages:
$ cabal install HLearn-distributions-1.1.0.1
Import our modules:
> import Control.ConstraintKinds.Functor
> import Control.ConstraintKinds.Monad
> import Prelude hiding (Functor(..), Monad (..))
>
> import HLearn.Algebra
> import HLearn.Models.Distributions
For efficiency reasons we’ll be using the Functor and Monad instances provided by the ConstraintKinds package and language extension. From the user’s perspective, everything works the same as normal monads.
Now let’s create a simple marble data type, and a small bag of marbles for our data set.
> data Marble = Red | Pink | Green | Blue | White
>     deriving (Read,Show,Eq,Ord)
>
> bagOfMarbles = [ Pink,Green,Red,Blue,Green,Red,Green,Pink,Blue,White ]
This is a very small data set just to make things easy to visualize. Everything we’ll talk about works just as well on arbitrarily large data sets.
We train a categorical distribution on this data set using the train function:
> marblesDist = train bagOfMarbles :: Categorical Double Marble
The Categorical type takes two parameters. The first is the type of our probabilities, and the second is the type of our data points. If you stick your hand into the bag and draw a random marble, this distribution tells you the probability of drawing each color.
Let’s plot our distribution:
ghci> plotDistribution (plotFile "marblesDist" $ PNG 400 300) marblesDist
Okay. Now we’re ready for the juicy bits. We’ll start by talking about the list functor. This will motivate the advantages of the categorical distribution functor.
A functor is a container that lets us “map” a function onto every element of the container. Lists are a functor, and so we can apply a function to our data set using the map function.
map :: (a -> b) -> [a] -> [b]
Example 1:
Let’s say instead of a distribution over the marbles’ colors, I want a distribution over the marbles’ weights. I might have a function that associates a weight with each type of marble:
> marbleWeight :: Marble -> Int -- weight in grams
> marbleWeight Red   = 3
> marbleWeight Pink  = 2
> marbleWeight Green = 3
> marbleWeight Blue  = 6
> marbleWeight White = 2
I can generate my new distribution by first transforming my data set, and then training on the result. Notice that the type of our distribution has changed. It is no longer a categorical distribution over marbles; it’s a distribution over ints.
> weightsDist = train $ map marbleWeight bagOfMarbles :: Categorical Double Int
ghci> plotDistribution (plotFile "weightsDist" $ PNG 400 300) weightsDist
This is the standard way of preprocessing data. But we can do better because the categorical distribution is also a functor. Functors have a function called fmap that is analogous to calling map on a list. This is its type signature specialized for the Categorical type:
fmap :: (Ord dp0, Ord dp1) => (dp0 -> dp1) -> Categorical prob dp0 -> Categorical prob dp1
We can use fmap to apply the marbleWeight function directly to the distribution:
> weightDist' = fmap marbleWeight marblesDist
This is guaranteed to generate the same exact answer, but it is much faster. It takes only constant time to call Categorical’s fmap, no matter how much data we have!
Let me put that another way. Below is a diagram showing the two possible ways to generate a model on a preprocessed data set. Every arrow represents a function application.
The normal way to preprocess data is to take the bottom left path. But because our model is a functor, the top right path becomes available. This path is better because it has the shorter run time.
Furthermore, let’s say we want to experiment with different preprocessing functions. The standard method will take O(n) time for every function we try, whereas using the categorical functor takes only O(1) time per function.
Note: The diagram treats the number of different categories (m) as a constant because it doesn’t depend on the number of data points. In our case, we have 5 types of marbles, so m=5. Every function call in the diagram is really multiplied by m.
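To see why fmap can be constant time in the number of data points, here is a toy sketch (not HLearn’s actual implementation) that represents a categorical distribution as a map from category to accumulated weight. fmapCat touches one entry per category, so its cost depends only on m:

```haskell
import qualified Data.Map as Map

-- Toy categorical distribution: category -> accumulated weight.
-- (A sketch only; HLearn's Categorical is more sophisticated.)
type Cat a = Map.Map a Double

-- Training counts each data point once: O(n).
trainCat :: Ord a => [a] -> Cat a
trainCat = Map.fromListWith (+) . map (\dp -> (dp, 1))

-- Mapping over the *distribution* touches one entry per category,
-- merging weights when f collapses categories: O(m log m), independent of n.
fmapCat :: (Ord a, Ord b) => (a -> b) -> Cat a -> Cat b
fmapCat = Map.mapKeysWith (+)
```

In this toy model, fmapCat f (trainCat xs) always equals trainCat (map f xs), which is exactly the commuting diagram above.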
Example 2:
For another example, what if we don’t want to differentiate between red and pink marbles? The following function converts all the pink marbles to red.
> pink2red :: Marble -> Marble
> pink2red Pink = Red
> pink2red dp   = dp
Let’s apply it to our distribution, and plot the results:
> nopinkDist = fmap pink2red marblesDist
ghci> plotDistribution (plotFile "nopinkDist" $ PNG 400 300) nopinkDist
That’s about all that a Functor can do by itself. When we call fmap, we can only process individual data points. We can’t change the number of points in the resulting distribution or do other complex processing. Monads give us this power.
Monads are functors with two more functions. The first is called return. Its type signature is
return :: (Ord dp) => dp -> Categorical prob dp
We’ve actually seen this function already in previous posts. It’s equivalent to the train1dp function found in the HomTrainer type class. All it does is train a categorical distribution on a single data point.
The next function is called join. It’s a little bit trickier, and it’s where all the magic lies. Its type signature is:
join :: (Ord dp) => Categorical prob (Categorical prob dp) -> Categorical prob dp
As input, join takes a categorical distribution whose data points are other categorical distributions. It then “flattens” the distribution into one that does not take other distributions as input.
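A toy picture of what join does (again a sketch in the map representation, not HLearn’s code): the outer distribution assigns a weight to each inner distribution, and flattening scales each inner distribution by its outer weight and sums the results:

```haskell
import qualified Data.Map as Map

-- Toy categorical distribution as a weight map (sketch only).
type Cat a = Map.Map a Double

-- The outer "distribution over distributions" is itself a weight map
-- whose keys are inner distributions. Flattening scales each inner
-- distribution by its outer weight and adds everything up.
joinCat :: Ord a => Map.Map (Cat a) Double -> Cat a
joinCat outer = Map.unionsWith (+)
    [ Map.map (* w) inner | (inner, w) <- Map.toList outer ]
```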
Example 3
Let’s write a function that removes all the pink marbles from our data set. Whenever we encounter a pink marble, we’ll replace it with an empty categorical distribution; if the marble is not pink, we’ll create a singleton distribution from it.
> forgetPink :: (Num prob) => Marble -> Categorical prob Marble
> forgetPink Pink = mempty
> forgetPink dp   = train1dp dp
>
> nopinkDist2 = join $ fmap forgetPink marblesDist
ghci> plotDistribution (plotFile "nopinkDist2" $ PNG 400 300) nopinkDist2
This idiom of join (fmap …) is used a lot. For convenience, the >>= operator (called bind) combines these steps for us. It is defined as:

(>>=) :: Categorical prob dp0 -> (dp0 -> Categorical prob dp1) -> Categorical prob dp1
dist >>= f = join $ fmap f dist
Under this notation, our new distribution can be defined as:
> nopinkDist2' = marblesDist >>= forgetPink
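In the toy map representation, bind works out to exactly join-after-fmap, and the forgetPink trick of replacing a point with an empty distribution simply drops that point’s weight. (A sketch with hypothetical names, not HLearn’s API:)

```haskell
import qualified Data.Map as Map

type Cat a = Map.Map a Double

-- bind: for each category, produce a new distribution, scale it by
-- that category's weight, and merge everything.
bindCat :: (Ord a, Ord b) => Cat a -> (a -> Cat b) -> Cat b
bindCat dist f = Map.unionsWith (+)
    [ Map.map (* w) (f dp) | (dp, w) <- Map.toList dist ]

-- Analogue of forgetPink: the empty map plays the role of mempty.
forgetPink' :: String -> Cat String
forgetPink' "pink" = Map.empty
forgetPink' dp     = Map.singleton dp 1
```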
Example 4
Besides removing data points, we can also add new ones. Let’s double the number of pink marbles in our training data:
> doublePink :: (Num prob) => Marble -> Categorical prob Marble
> doublePink Pink = 2 .* train1dp Pink
> doublePink dp   = train1dp dp
>
> doublepinkDist = marblesDist >>= doublePink
ghci> plotDistribution (plotFile "doublepinkDist" $ PNG 400 300) doublepinkDist
Example 5
Mistakes are often made when collecting data. One common machine learning task is to preprocess data sets to account for these mistakes. In this example, we’ll assume that our sampling process suffers from uniform noise. Specifically, if one of our data points is red, we will assume there is only a 60% chance that the marble was actually red, and a 10% chance each that it was one of the other colors. We will define a function to add this noise to our data set, increasing the accuracy of our final distribution.
Notice that we are using fractional weights for our noise, and that the weights are carefully adjusted so that each data point still contributes a total weight of one to the distribution. We don’t want to add or remove marbles while adding noise.
> addNoise :: (Fractional prob) => Marble -> Categorical prob Marble
> addNoise dp = 0.5 .* train1dp dp <> 0.1 .* train [ Red,Pink,Green,Blue,White ]
>
> noiseDist = marblesDist >>= addNoise
ghci> plotDistribution (plotFile "noiseDist" $ PNG 400 300) noiseDist
Adding uniform noise just made all our probabilities closer together.
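One property worth checking is that addNoise is weight-preserving: each data point contributes 0.5 + 5 × 0.1 = 1 total weight. A toy version of the same weights (hypothetical names, not HLearn code):

```haskell
import qualified Data.Map as Map

type Cat a = Map.Map a Double

marbles :: [String]
marbles = ["red", "pink", "green", "blue", "white"]

-- Same weights as addNoise: 0.5 on the observed color,
-- plus 0.1 of uniform noise on every color.
addNoise' :: String -> Cat String
addNoise' dp = Map.unionWith (+)
    (Map.singleton dp 0.5)
    (Map.fromList [ (c, 0.1) | c <- marbles ])

totalWeight :: Cat a -> Double
totalWeight = sum . Map.elems
```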
Example 6
Of course, the amount of noise we add to each sample doesn’t have to be the same everywhere. If I suffer from red-green color blindness, then I might use this as my noise function:
> rgNoise :: (Fractional prob) => Marble -> Categorical prob Marble
> rgNoise Red   = trainW [(0.7,Red),(0.3,Green)]
> rgNoise Green = trainW [(0.1,Red),(0.9,Green)]
> rgNoise dp    = train1dp dp
>
> rgNoiseDist = marblesDist >>= rgNoise
ghci> plotDistribution (plotFile "rgNoiseDist" $ PNG 400 300) rgNoiseDist
Because of my color blindness, the probability of drawing a red marble from the bag is higher than drawing a green marble. This is despite the fact that we observed more green marbles in our training data.
Example 7
In the real world, we can never know exactly how much error we have in the samples. Luckily, we can try to learn it by conducting a second experiment. We’ll first experimentally determine how red-green color blind I am, then we’ll use that to update our already trained distribution.
To determine the true error rate, we need some unbiased source of truth. In this case, we can just use someone with good vision. They will select ten red marbles and ten green marbles, and I will guess what color they are.
Let’s train a distribution on what I think green marbles look like:
> greenMarbles = [Green,Red,Green,Red,Green,Red,Red,Green,Green,Green]
> greenDist = train greenMarbles :: Categorical Double Marble
and what I think red marbles look like:
> redMarbles = [Red,Green,Red,Green,Red,Red,Green,Green,Red,Red]
> redDist = train redMarbles :: Categorical Double Marble
Now we’ll create the noise function based off of our empirical data. The (/.) function is scalar division, and we can use it because the categorical distribution is a vector space. We’re dividing by the number of data points in the distribution so that the distribution we output has an effective training size of one. This ensures that we’re not accidentally creating new data points when applying our function to another distribution.
> rgNoise2 :: Marble -> Categorical Double Marble
> rgNoise2 Green = greenDist /. numdp greenDist
> rgNoise2 Red   = redDist /. numdp redDist
> rgNoise2 dp    = train1dp dp
>
> rgNoiseDist2 = marblesDist >>= rgNoise2
ghci> plotDistribution (plotFile "rgNoiseDist2" $ PNG 400 300) rgNoiseDist2
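The effect of dividing by numdp can be sketched the same way: it rescales a trained distribution so its total weight is one, which is why rgNoise2 neither creates nor destroys data. (Toy code with hypothetical names, not HLearn’s implementation:)

```haskell
import qualified Data.Map as Map

type Cat a = Map.Map a Double

trainCat :: Ord a => [a] -> Cat a
trainCat = Map.fromListWith (+) . map (\dp -> (dp, 1))

-- Analogue of dist /. numdp dist: scale so the total weight is one.
normalize :: Cat a -> Cat a
normalize d = Map.map (/ total) d
  where total = sum (Map.elems d)
```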
Example 8
We can chain our preprocessing functions together in arbitrary ways.
> allDist = marblesDist >>= forgetPink >>= addNoise >>= rgNoise
ghci> plotDistribution (plotFile "allDist" $ PNG 400 300) allDist
But wait! Where’d that pink come from? Wasn’t the call to forgetPink supposed to remove it? The answer is that we did remove it, but then we added it back in with our noise functions. When using monadic functions, we must be careful about the order we apply them in. This is just as true when using regular functions.
Here’s another distribution created from those same functions in a different order:
> allDist2 = marblesDist >>= addNoise >>= rgNoise >>= forgetPink
ghci> plotDistribution (plotFile "allDist2" $ PNG 400 300) allDist2
We can also use Haskell’s do notation to accomplish the same exact thing:
> allDist2' :: Categorical Double Marble
> allDist2' = do
>     dp <- train bagOfMarbles
>     dp <- addNoise dp
>     dp <- rgNoise dp
>     dp <- forgetPink dp
>     return dp
(Since we’re using a custom Monad definition, do notation requires the RebindableSyntax extension.)
Example 9
Do notation gives us a convenient way to preprocess multiple data sets into a single data set. Let’s create two new data sets and their corresponding distributions for us to work with:
> bag1 = [Red,Pink,Green,Blue,White]
> bag2 = [Red,Blue,White]
>
> bag1dist = train bag1 :: Categorical Double Marble
> bag2dist = train bag2 :: Categorical Double Marble
Now, we’ll create a third data set that is a weighted combination of bag1 and bag2. We will do this by repeated sampling. On every iteration, with a 20% probability we’ll sample from bag1, and with an 80% probability we’ll sample from bag2. Imperative pseudocode for this algorithm is:
let comboDist be an empty distribution
loop until desired accuracy achieved:
    let r be a random number from 0 to 1
    if r < 0.2:
        sample dp1 from bag1
        add dp1 to comboDist
    else:
        sample dp2 from bag2
        add dp2 to comboDist
This sampling procedure will obviously not give us an exact answer. But since the categorical distribution supports weighted data points, we can use this simpler pseudocode to generate an exact answer:
let comboDist be an empty distribution
foreach datapoint dp1 in bag1:
    foreach datapoint dp2 in bag2:
        add dp1 with weight 0.2 to comboDist
        add dp2 with weight 0.8 to comboDist
Using do notation, we can express this as:
> comboDist :: Categorical Double Marble
> comboDist = do
>     dp1 <- bag1dist
>     dp2 <- bag2dist
>     trainW [(0.2,dp1),(0.8,dp2)]
ghci> plotDistribution (plotFile "comboDist" $ PNG 400 300) comboDist
And because the Categorical functor takes constant time, constructing comboDist also takes constant time. The naive imperative algorithm would have taken time proportional to the number of samples we drew.
When combining multiple distributions this way, the number of data points in our final distribution will be the product of the number of data points in the initial distributions:
ghci> numdp comboDist
15
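The weighted double loop from the pseudocode above can be sketched with the toy map representation; note that the total weight of the result is the product of the two input sizes, matching numdp. (A sketch, not HLearn code:)

```haskell
import qualified Data.Map as Map

type Cat a = Map.Map a Double

trainCat :: Ord a => [a] -> Cat a
trainCat = Map.fromListWith (+) . map (\dp -> (dp, 1))

-- For every pair (dp1, dp2): 0.2 of the pair's weight goes to dp1,
-- 0.8 to dp2, exactly as in the exact pseudocode.
combo :: Ord a => Cat a -> Cat a -> Cat a
combo d1 d2 = Map.unionsWith (+)
    [ Map.fromListWith (+) [ (dp1, 0.2 * w), (dp2, 0.8 * w) ]
    | (dp1, w1) <- Map.toList d1
    , (dp2, w2) <- Map.toList d2
    , let w = w1 * w2 ]
```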
Example 10
Finally, arbitrarily complex preprocessing functions can be written using Haskell’s do notation. And remember, no matter how complicated these functions are, their run time never depends on the number of elements in the initial data set.
This function adds uniform sampling noise to our bagOfMarbles, but only on those marbles that are also contained in bag2 above.
> comboDist2 :: Categorical Double Marble
> comboDist2 = do
>     dp1 <- marblesDist
>     dp2 <- bag2dist
>     if dp1 == dp2
>         then addNoise dp1
>         else return dp1
ghci> plotDistribution (plotFile "comboDist2" $ PNG 400 300) comboDist2
This application of monads to machine learning generalizes the monad used in probabilistic functional programming. The main difference is that PFP focused on manipulating already known distributions, not training them from data. Also, if you enjoy this kind of thing, you might be interested in the n-Category Café discussion on category theory in machine learning from a few years back.
In future posts, we’ll look at functors and monads for continuous distributions, multivariate distributions, and classifiers.
Subscribe to the RSS feed to stay tuned!
Haskell code is expressive. The HLearn library uses 6 lines of Haskell to define a function for training a Bayesian classifier; the equivalent code in the Weka library uses over 100 lines of Java. That’s a big difference! In this post, we’ll look at the actual code and see why the Haskell is so much more concise.
But first, a disclaimer: It is really hard to fairly compare two code bases this way. In both libraries, there is a lot of supporting code that goes into defining each classifier, and it’s not obvious what code to include and not include. For example, both libraries implement interfaces to a number of probability distributions, and this code is not contained in the source count. The Haskell code takes more advantage of this abstraction, so this is one language-agnostic reason why the Haskell code is shorter. If you think I’m not doing a fair comparison, here are some links to the full repositories so you can do it yourself:
HLearn implements training for a Bayesian classifier with these six lines of Haskell:
newtype Bayes labelIndex dist = Bayes dist
    deriving (Read,Show,Eq,Ord,Monoid,Abelian,Group)

instance (Monoid dist, HomTrainer dist) => HomTrainer (Bayes labelIndex dist) where
    type Datapoint (Bayes labelIndex dist) = Datapoint dist
    train1dp dp = Bayes $ train1dp dp
This code elegantly captures how to train a Bayesian classifier—just train a probability distribution. Here’s an explanation:
We only get the benefits of the HomTrainer type class because the bayesian classifier is a monoid. But we didn’t even have to specify what the monoid instance for bayesian classifiers looks like! In this case, it’s automatically derived from the monoid instances for the base distributions using a language extension called GeneralizedNewtypeDeriving. For examples of these monoid structures, check out the algebraic structure of the normal and categorical distributions, or more complex distributions using Markov networks.
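Here is a minimal sketch of the deriving trick with stand-in types (Count is hypothetical; HLearn’s Bayes wraps a real distribution): the Monoid instance is written once for the inner type, and the newtype wrapper inherits it for free:

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

-- Stand-in for a distribution: a "model" that just counts data points.
newtype Count = Count Int deriving (Show, Eq)

instance Semigroup Count where
    Count a <> Count b = Count (a + b)

instance Monoid Count where
    mempty = Count 0

-- Stand-in for the Bayes wrapper: its Monoid is derived, not written.
newtype Model = Model Count
    deriving (Show, Eq, Semigroup, Monoid)
```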
Look for these differences between the HLearn and Weka source:
/**
 * Generates the classifier.
 *
 * @param instances set of instances serving as training data
 * @exception Exception if the classifier has not been generated
 * successfully
 */
public void buildClassifier(Instances instances) throws Exception {

    // can classifier handle the data?
    getCapabilities().testWithFail(instances);

    // remove instances with missing class
    instances = new Instances(instances);
    instances.deleteWithMissingClass();

    m_NumClasses = instances.numClasses();

    // Copy the instances
    m_Instances = new Instances(instances);

    // Discretize instances if required
    if (m_UseDiscretization) {
        m_Disc = new weka.filters.supervised.attribute.Discretize();
        m_Disc.setInputFormat(m_Instances);
        m_Instances = weka.filters.Filter.useFilter(m_Instances, m_Disc);
    } else {
        m_Disc = null;
    }

    // Reserve space for the distributions
    m_Distributions = new Estimator[m_Instances.numAttributes() - 1]
                                   [m_Instances.numClasses()];
    m_ClassDistribution = new DiscreteEstimator(m_Instances.numClasses(), true);

    int attIndex = 0;
    Enumeration enu = m_Instances.enumerateAttributes();
    while (enu.hasMoreElements()) {
        Attribute attribute = (Attribute) enu.nextElement();

        // If the attribute is numeric, determine the estimator
        // numeric precision from differences between adjacent values
        double numPrecision = DEFAULT_NUM_PRECISION;
        if (attribute.type() == Attribute.NUMERIC) {
            m_Instances.sort(attribute);
            if ((m_Instances.numInstances() > 0)
                && !m_Instances.instance(0).isMissing(attribute)) {
                double lastVal = m_Instances.instance(0).value(attribute);
                double currentVal, deltaSum = 0;
                int distinct = 0;
                for (int i = 1; i < m_Instances.numInstances(); i++) {
                    Instance currentInst = m_Instances.instance(i);
                    if (currentInst.isMissing(attribute)) {
                        break;
                    }
                    currentVal = currentInst.value(attribute);
                    if (currentVal != lastVal) {
                        deltaSum += currentVal - lastVal;
                        lastVal = currentVal;
                        distinct++;
                    }
                }
                if (distinct > 0) {
                    numPrecision = deltaSum / distinct;
                }
            }
        }

        for (int j = 0; j < m_Instances.numClasses(); j++) {
            switch (attribute.type()) {
            case Attribute.NUMERIC:
                if (m_UseKernelEstimator) {
                    m_Distributions[attIndex][j] = new KernelEstimator(numPrecision);
                } else {
                    m_Distributions[attIndex][j] = new NormalEstimator(numPrecision);
                }
                break;
            case Attribute.NOMINAL:
                m_Distributions[attIndex][j] =
                    new DiscreteEstimator(attribute.numValues(), true);
                break;
            default:
                throw new Exception("Attribute type unknown to NaiveBayes");
            }
        }
        attIndex++;
    }

    // Compute counts
    Enumeration enumInsts = m_Instances.enumerateInstances();
    while (enumInsts.hasMoreElements()) {
        Instance instance = (Instance) enumInsts.nextElement();
        updateClassifier(instance);
    }

    // Save space
    m_Instances = new Instances(m_Instances, 0);
}
And the code for online learning is:
/**
 * Updates the classifier with the given instance.
 *
 * @param instance the new training instance to include in the model
 * @exception Exception if the instance could not be incorporated in
 * the model.
 */
public void updateClassifier(Instance instance) throws Exception {
    if (!instance.classIsMissing()) {
        Enumeration enumAtts = m_Instances.enumerateAttributes();
        int attIndex = 0;
        while (enumAtts.hasMoreElements()) {
            Attribute attribute = (Attribute) enumAtts.nextElement();
            if (!instance.isMissing(attribute)) {
                m_Distributions[attIndex][(int) instance.classValue()]
                    .addValue(instance.value(attribute), instance.weight());
            }
            attIndex++;
        }
        m_ClassDistribution.addValue(instance.classValue(), instance.weight());
    }
}
Every algorithm implemented in HLearn uses similarly concise code. I invite you to browse the repository and see for yourself. The most complicated algorithm is for Markov chains, which use only 6 lines for training and about 20 for defining the Monoid.
You can expect lots of tutorials on how to incorporate the HLearn library into Haskell programs over the next few months.
Subscribe to the RSS feed to stay tuned!
Code and instructions for reproducing these experiments are available on github.
Why is HLearn so much faster?
Well, it turns out that the Bayesian classifier has the algebraic structure of a monoid, a group, and a vector space. HLearn uses a new cross-validation algorithm that can exploit these algebraic structures. The standard algorithm runs in time O(k·n), where k is the number of “folds” and n is the number of data points. The algebraic algorithms, however, run in time O(n). In other words, it doesn’t matter how many folds we do, the run time is constant! And not only are we faster, but we get the exact same answer. Algebraic cross-validation is not an approximation, it’s just fast.
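The idea can be sketched with the simplest possible model: the mean, trained as a (sum, count) monoid with a group inverse. Leave-one-out cross-validation then trains the total model once and subtracts each point, giving O(n) work overall instead of O(n²). (A sketch of the principle only, not HLearn’s implementation:)

```haskell
-- Toy HomTrainer: the mean as a (sum, count) monoid.
data Mean = Mean Double Double deriving (Show, Eq)  -- running sum, count

instance Semigroup Mean where
    Mean s1 n1 <> Mean s2 n2 = Mean (s1 + s2) (n1 + n2)

instance Monoid Mean where
    mempty = Mean 0 0

train1 :: Double -> Mean
train1 x = Mean x 1

-- The group structure: every singleton model can be "untrained".
inverse :: Mean -> Mean
inverse (Mean s n) = Mean (negate s) (negate n)

getMean :: Mean -> Double
getMean (Mean s n) = s / n

-- Model trained on everything except point i, in O(1) per point.
looModels :: [Double] -> [Mean]
looModels xs = [ total <> inverse (train1 x) | x <- xs ]
  where total = mconcat (map train1 xs)
```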
Here are some run times for k-fold cross-validation on the census income data set. Notice that HLearn’s run time is constant as we add more folds.
And when we set k=n, we have leave-one-out cross-validation. Notice that Weka’s cross-validation has quadratic run time, whereas HLearn has linear run time.
HLearn certainly isn’t going to replace Weka any time soon, but it’s got a number of cool tricks like this going on inside. If you want to read more, you should check out these two recent papers:
I’ll continue to write more about these tricks in future blog posts.
Subscribe to the RSS feed to stay tuned.
As usual, this post is a literate haskell file. To run this code, you’ll need to install the hlearn-distributions package. This package requires GHC version 7.6 or later.
bash> cabal install hlearn-distributions-1.1
Now for some code. We start with our language extensions and imports:
> {-# LANGUAGE DataKinds #-}
> {-# LANGUAGE TypeFamilies #-}
> {-# LANGUAGE TemplateHaskell #-}
>
> import HLearn.Algebra
> import HLearn.Models.Distributions
Next, we’ll create a data type to represent Futurama characters. There are a lot of characters, so we’ll need to keep things pretty organized. The data type will have a record for everything we might want to know about a character. Each of these records will be one of the variables in our multivariate distribution, and all of our data points will have this type.
> data Character = Character
>     { _name    :: String
>     , _species :: String
>     , _job     :: Job
>     , _isGood  :: Maybe Bool
>     , _age     :: Double -- in years
>     , _height  :: Double -- in feet
>     , _weight  :: Double -- in pounds
>     }
>     deriving (Read,Show,Eq,Ord)
>
> data Job = Manager | Crew | Henchman | Other
>     deriving (Read,Show,Eq,Ord)
Now, in order for our library to be able to interpret the Character type, we call the template haskell function:
> makeTypeLenses ''Character
This function creates a bunch of data types and type classes for us. These “type lenses” give us a type-safe way to reference the different variables in our multivariate distribution. We’ll see how to use these type level lenses a bit later. There’s no need to understand what’s going on under the hood, but if you’re curious then check out the hackage documentation or source code.
Now, we’re ready to create a data set and start training. Here’s a list of the employees of Planet Express provided by the resident bureaucrat Hermes Conrad. This list will be our first data set.
> planetExpress =
>     [ Character "Philip J. Fry" "human" Crew (Just True) 1026 5.8 195
>     , Character "Turanga Leela" "alien" Crew (Just True) 43 5.9 170
>     , Character "Professor Farnsworth" "human" Manager (Just True) 85 5.5 160
>     , Character "Hermes Conrad" "human" Manager (Just True) 36 5.3 210
>     , Character "Amy Wong" "human" Other (Just True) 21 5.4 140
>     , Character "Zoidberg" "alien" Other (Just True) 212 5.8 225
>     , Character "Cubert Farnsworth" "human" Other (Just True) 8 4.3 135
>     ]
Let’s train a distribution from this data. Here’s how we would train a distribution where every variable is independent of every other variable:
> dist1 = train planetExpress :: Multivariate Character
>    '[ Independent Categorical '[String,String,Job,Maybe Bool]
>     , Independent Normal '[Double,Double,Double]
>     ]
>     Double
In the HLearn library, we always use the function train to train a model from data points. We specify which model to train in the type signature.
As you can see, the Multivariate distribution takes three type parameters. The first parameter is the type of our data point, in this case Character. The second parameter describes the dependency structure of our distribution. We’ll go over the syntax for the dependency structure in a bit. For now, just notice that it’s a typelevel list of distributions. Finally, the third parameter is the type we will use to store our probabilities.
What can we do with this distribution? One simple task we can do is to find marginal distributions. The marginal distribution is the distribution of a certain variable ignoring all the other variables. For example, let’s say I want a distribution of the species that work at planet express. I can get this by:
> dist1a = getMargin TH_species dist1
Notice that we specified which variable we’re taking the marginal of by using the type level lens TH_species. This data constructor was automatically created for us by our template Haskell function makeTypeLenses. Every one of the records in our data type has its own unique type lens, named after the record prefixed by TH_. These lenses let us infer the types of our marginal distributions at compile time, rather than at run time. For example, the type of the marginal distribution of species is:
ghci> :t dist1a
dist1a :: Categorical String Double
That is, a categorical distributions whose data points are Strings and which stores probabilities as a Double. Now, if I wanted a distribution of the weights of the employees, I can get that by:
> dist1b = getMargin TH_weight dist1
And the type of this distribution is:
ghci> :t dist1b
dist1b :: Normal Double
Now, I can easily plot these marginal distributions with the plotDistribution function:
ghci> plotDistribution (plotFile "dist1a" $ PNG 250 250) dist1a
ghci> plotDistribution (plotFile "dist1b" $ PNG 250 250) dist1b
In a traditional statistics library, we would have to retrain our data from scratch. If we had billions of elements in our data set, this would be an expensive mistake. But in our HLearn library, we can take advantage of the model’s monoid structure. In particular, the compiler used this structure to automatically derive a function called add1dp for us. Let’s look at its type:
ghci> :t add1dp
add1dp :: HomTrainer model => model -> Datapoint model -> model
It’s pretty simple. The function takes a model and adds the data point associated with that model. It returns the model we would have gotten if the data point had been in our original data set. This is called online training.
Again, because our distributions form monoids, the compiler derived an efficient and exact online training algorithm for us automatically.
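In monoid terms, online training is just mappend with a model trained on a single data point. A toy version of the same idea using the mean as a hypothetical stand-in for an HLearn model:

```haskell
-- Toy model: the running mean as a (sum, count) monoid.
data Mean = Mean Double Double deriving (Show, Eq)

instance Semigroup Mean where
    Mean s1 n1 <> Mean s2 n2 = Mean (s1 + s2) (n1 + n2)

instance Monoid Mean where
    mempty = Mean 0 0

train1dp' :: Double -> Mean
train1dp' x = Mean x 1

-- Analogue of add1dp: online update = existing model <> singleton model.
add1dp' :: Mean -> Double -> Mean
add1dp' m x = m <> train1dp' x
```

Because mappend is associative, training one point at a time gives exactly the same model as training in one batch.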
So let’s create a new distribution that includes Bender:
> bender = Character "Bender Rodriguez" "robot" Crew (Just True) 44 6.1 612
> dist1' = add1dp dist1 bender
And plot our new marginals:
ghci> plotDistribution (plotFile "dist1withbenderspecies" $ PNG 250 250) $ getMargin TH_species dist1'
ghci> plotDistribution (plotFile "dist1withbenderweight" $ PNG 250 250) $ getMargin TH_weight dist1'
ghci> mean dist1b
176.42857142857142
ghci> mean $ getMargin TH_weight dist1'
230.875
Bender’s weight really changed the distribution after all!
That’s cool, but our original distribution isn’t very interesting. What makes multivariate distributions interesting is when the variables affect each other. This is true in our case, so we’d like to be able to model it. For example, we’ve already seen that robots are much heavier than organic lifeforms, and are throwing off our statistics. The HLearn library supports a small subset of Markov Networks for expressing these dependencies.
We represent Markov Networks as graphs with undirected edges. Every attribute in our distribution is a node, and every dependence between attributes is an edge. We can draw this graph with the plotNetwork command:
ghci> plotNetwork "dist1network" dist1
As expected, there are no edges in our graph because everything is independent. Let’s create a more interesting distribution and plot its Markov network.
> dist2 = train planetExpress :: Multivariate Character
>    '[ Ignore '[String]
>     , MultiCategorical '[String]
>     , Independent Categorical '[Job,Maybe Bool]
>     , Independent Normal '[Double,Double,Double]
>     ]
>     Double
ghci> plotNetwork "dist2network" dist2
Okay, so what just happened?
The syntax for representing the dependence structure is a little confusing, so let’s go step by step. We represent the dependence information in the graph as a list of types. Each element in the list describes both the marginal distribution and the dependence structure for one or more records in our data type. We must list these elements in the same order as the original data type.
Notice that we’ve made two changes to the list. First, our list now starts with the type Ignore '[String]. This means that the first string in our data type—the name—will be ignored. Notice that TH_name is no longer in the Markov Network. This makes sense because we expect that a character’s name should not tell us too much about any of their other attributes.
Second, we’ve added a dependence. The MultiCategorical distribution makes everything afterward in the list dependent on that item, but not the things before it. This means that the exact types of dependencies it can specify are dependent on the order of the records in our data type. Let’s see what happens if we change the location of the MultiCategorical:
> dist3 = train planetExpress :: Multivariate Character
>    '[ Ignore '[String]
>     , Independent Categorical '[String]
>     , MultiCategorical '[Job]
>     , Independent Categorical '[Maybe Bool]
>     , Independent Normal '[Double,Double,Double]
>     ]
>     Double
ghci> plotNetwork "dist3network" dist3
As you can see, our species no longer have any relation to anything else. Unfortunately, using this syntax, the order of list elements is important, and so the order we specify our data records is important.
Finally, we can substitute any valid univariate distribution for our Normal and Categorical distributions. The HLearn library currently supports Binomial, Exponential, Geometric, LogNormal, and Poisson distributions. These just don’t make much sense for modelling Futurama characters, so we’re not using them.
Now, we might be tempted to specify that every variable is fully dependent on every other variable. In order to do this, we have to introduce the “Dependent” type. Any valid multivariate distribution can follow Dependent, but only those records specified in the typelist will actually be dependent on each other. For example:
> dist4 = train planetExpress :: Multivariate Character
>    '[ Ignore '[String]
>     , MultiCategorical '[String,Job,Maybe Bool]
>     , Dependent MultiNormal '[Double,Double,Double]
>     ]
>     Double
ghci> plotNetwork "dist4network" dist4
Undoubtedly, this is always going to be the case in the real world: everything has at least a slight influence on everything else. Unfortunately, it is not easy in practice to model these fully dependent distributions. The amount of data we need to train an accurate distribution grows rapidly with the connectivity of the network, where n is the number of nodes in our graph and e is the number of edges. Thus, by specifying that two attributes are independent of each other, we can greatly reduce the amount of data we need to train an accurate distribution.
I realize that this syntax is a little awkward. I chose it because it was relatively easy to implement. Future versions of the library should support a more intuitive syntax. I also plan to use copulas to greatly expand the expressiveness of these distributions. In the meantime, the best way to figure out the dependencies in a Markov network is just to plot it and look.
Okay. So what distribution makes the most sense for Futurama characters? We’ll say that everything depends on both the characters’ species and job, and that their weight depends on their height.
>planetExpressDist = train planetExpress :: Multivariate Character
>  '[ Ignore '[String]
>   , MultiCategorical '[String,Job]
>   , Independent Categorical '[Maybe Bool]
>   , Independent Normal '[Double]
>   , Dependent MultiNormal '[Double,Double]
>   ]
>   Double
ghci> plotNetwork "planetExpressnetwork" planetExpressDist
We still don’t have enough data to train this network, so let’s create some more. We start by creating a type for our Markov network called FuturamaDist. This is just for convenience so we don’t have to retype the dependence structure many times.
>type FuturamaDist = Multivariate Character
>  '[ Ignore '[String]
>   , MultiCategorical '[String,Job]
>   , Independent Categorical '[Maybe Bool]
>   , Independent Normal '[Double]
>   , Dependent MultiNormal '[Double,Double]
>   ]
>   Double
Next, we train some more distributions of this type on some of the characters. We’ll start with the Mom Corporation and the brave Space Force.
>momCorporation =
>    [ Character "Mom"   "human" Manager  (Just False) 100 5.5 130
>    , Character "Walt"  "human" Henchman (Just False)  22 6.1 170
>    , Character "Larry" "human" Henchman (Just False)  18 5.9 180
>    , Character "Igner" "human" Henchman (Just False)  15 5.8 175
>    ]
>momDist = train momCorporation :: FuturamaDist
>spaceForce =
>    [ Character "Zapp Brannigan" "human" Manager (Nothing)    45 6.0 230
>    , Character "Kif Kroker"     "alien" Crew    (Just True) 113 4.5 120
>    ]
>spaceDist = train spaceForce :: FuturamaDist
And now some more robots:
>robots =
>    [ bender
>    , Character "Calculon"        "robot" Other    (Nothing)    123 6.8  650
>    , Character "The Crushinator" "robot" Other    (Nothing)     45 8.0 4500
>    , Character "Clamps"          "robot" Henchman (Just False) 134 5.8  330
>    , Character "DonBot"          "robot" Manager  (Just False) 178 5.8  520
>    , Character "Hedonismbot"     "robot" Other    (Just False)  69 4.3 1200
>    , Character "Preacherbot"     "robot" Manager  (Nothing)     45 5.8  350
>    , Character "Roberto"         "robot" Other    (Just False)  77 5.9  250
>    , Character "Robot Devil"     "robot" Other    (Just False) 895 6.0  280
>    , Character "Robot Santa"     "robot" Other    (Just False) 488 6.3  950
>    ]
>robotDist = train robots :: FuturamaDist
Now we’re going to take advantage of the monoid structure of our multivariate distributions to combine all of these distributions into one.
> futuramaDist = planetExpressDist <> momDist <> spaceDist <> robotDist
The resulting distribution is equivalent to having trained a distribution from scratch on all of the data points:
train (planetExpress ++ momCorporation ++ spaceForce ++ robots) :: FuturamaDist
We can take advantage of this property any time we use the train function to automatically parallelize our code. The higher order function parallel will split the training task evenly over each of your available processors, then merge them together with the monoid operation. This results in “theoretically perfect” parallel training of these models.
parallel train (planetExpress ++ momCorporation ++ spaceForce ++ robots) :: FuturamaDist
Again, this is only possible because the distributions have a monoid structure.
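The mechanics can be sketched in a few lines of plain Haskell. This is not HLearn's implementation, and `parallelSketch` and `countAll` are made-up names; it just shows why a monoid structure is all you need: train each chunk of the data separately, then merge the partial models with `mconcat`. HLearn's real `parallel` does the same merge, but evaluates the chunks on separate cores.

```haskell
import Data.Monoid (Sum(..))

-- A sketch (not HLearn's implementation) of monoid-based parallel
-- training: split the data into chunks, train a model per chunk,
-- then merge the partial models with mconcat.
parallelSketch :: Monoid model => ([a] -> model) -> Int -> [a] -> model
parallelSketch trainChunk nChunks xs =
    mconcat (map trainChunk (chunksOf len xs))
  where
    len = max 1 (length xs `div` max 1 nChunks)
    chunksOf _ [] = []
    chunksOf k ys = let (h, t) = splitAt k ys in h : chunksOf k t

-- Example model: counting data points.  Training is a monoid
-- homomorphism, so chunked training gives the same answer as
-- training on everything at once.
countAll :: [a] -> Sum Int
countAll = Sum . length
```

For any chunk size, `parallelSketch countAll n xs` equals `countAll xs`; that homomorphism property is exactly what makes the merge safe.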
Now, let’s ask some questions of our distribution. If I pick a character at random, what’s the probability that they’re a good guy? Let’s plot the marginal.
ghci> plotDistribution (plotFile "goodguy" $ PNG 250 250) $ getMargin TH_isGood futuramaDist
But what if I only want to pick from those characters that are humans, or those characters that are robots? Statisticians call this conditioning. We can do that with the condition function:
ghci> plotDistribution (plotFile "goodguyhuman" $ PNG 250 250) $ getMargin TH_isGood $ condition TH_species "human" futuramaDist
ghci> plotDistribution (plotFile "goodguyrobot" $ PNG 250 250) $ getMargin TH_isGood $ condition TH_species "robot" futuramaDist
Now let’s ask: What’s the average age of an evil robot?
ghci> mean $ getMargin TH_age $ condition TH_isGood (Just False) $ condition TH_species "robot" futuramaDist
273.0769230769231
Notice that conditioning a distribution is a commutative operation. That means we can condition in any order and still get the exact same results. Let’s try it:
ghci> mean $ getMargin TH_age $ condition TH_species "robot" $ condition TH_isGood (Just False) futuramaDist
273.0769230769231
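To see why the order can't matter, here is a toy model of a joint distribution as a weight for every outcome; this is not HLearn code, and all the names below are made up. Conditioning keeps the outcomes consistent with the evidence and renormalizes. The two conditionings filter on different coordinates, and renormalizing commutes with scaling, so the composite is order-independent.

```haskell
import qualified Data.Map as M

-- A toy joint distribution over (species, isGood) outcomes,
-- stored as unnormalized weights.
type Joint = M.Map (String, Bool) Double

-- Conditioning = keep the consistent outcomes, then renormalize.
conditionSpecies :: String -> Joint -> Joint
conditionSpecies s = normalize . M.filterWithKey (\(s', _) _ -> s' == s)

conditionGood :: Bool -> Joint -> Joint
conditionGood b = normalize . M.filterWithKey (\(_, b') _ -> b' == b)

normalize :: Joint -> Joint
normalize m = M.map (/ sum (M.elems m)) m

-- A tiny example table of weights.
example :: Joint
example = M.fromList
    [ (("robot", False), 3), (("robot", True), 1)
    , (("human", False), 2), (("human", True), 4) ]
```

On `example`, conditioning on species then goodness gives the same map as conditioning on goodness then species.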
There’s one last thing for us to consider. What does our Markov network look like after conditioning? Let’s find out!
plotNetwork "conditionspeciesisGood" $ condition TH_species "robot" $ condition TH_isGood (Just False) futuramaDist
Notice that conditioning on these variables caused them to disappear from our Markov network.
Finally, there’s another process, similar to conditioning, called “marginalizing out.” This lets us ignore the effects of a single attribute without specifying what that attribute must be. When we marginalize out on our Markov network, we get the same dependence structure as if we conditioned.
plotNetwork "marginalizeOutspeciesisGood" $ marginalizeOut TH_species $ marginalizeOut TH_isGood futuramaDist
Effectively, what the marginalizeOut function does is “forget” the extra dependencies, whereas the condition function “applies” those dependencies. In the end, the resulting Markov network has the same structure, but different values.
Finally, at the start of the post, I mentioned that our multivariate distributions have group and vector space structure. This gives us two more operations we can use: the inverse and scalar multiplication. You can find more posts on how to take advantage of these structures here and here.
The best part of all of this is still coming. Next, we’ll take a look at full-on Bayesian classification and why it forms a monoid. Besides online and parallel trainers, this also gives us a fast cross-validation method.
There’ll also be posts about the monoid structure of Markov chains, the Free HomTrainer, and how this whole algebraic framework applies to NP-approximation algorithms as well.
Subscribe to the RSS feed to stay tuned.
In this article, we’ll go over the math behind the categorical distribution, the algebraic structure of the distribution, and how to manipulate it within Haskell’s HLearn library. We’ll also see some examples of how this focus on algebra makes HLearn’s interface more powerful than other common statistical packages. Everything that we’re going to see is in a certain sense very “obvious” to a statistician, but this algebraic framework also makes it convenient. And since programmers are inherently lazy, this is a Very Good Thing.
Before delving into the “cool stuff,” we have to look at some of the mechanics of the HLearn library.
The HLearn-distributions package contains all the functions we need to manipulate categorical distributions. Let’s install it:
$ cabal install HLearn-distributions-1.1
We import our libraries:
>import Control.DeepSeq
>import HLearn.Algebra
>import HLearn.Models.Distributions
We create a data type for Simon’s marbles:
>data Marble = Red | Green | Blue | White
>    deriving (Read,Show,Eq,Ord)
The easiest way to represent Simon’s bag of marbles is with a list:
>simonBag :: [Marble]
>simonBag = [Red, Red, Red, Green, Blue, Green, Red, Blue, Green, Green, Red, Red, Blue, Red, Red, Red, White]
And now we’re ready to train a categorical distribution of the marbles in Simon’s bag:
>simonDist = train simonBag :: Categorical Double Marble
We can load up ghci and plot the distribution with the conveniently named function plotDistribution:
ghci> plotDistribution (plotFile "simonDist" $ PDF 400 300) simonDist
This gives us a histogram of probabilities:
In the HLearn library, every statistical model is generated from data using either train or train’. Because these functions are overloaded, we must specify the type of simonDist so that the compiler knows which model to generate. Categorical takes two parameters. The first is the type of the probability (Double). The second is the type of the discrete data (Marble). We could easily create Categorical distributions with different types depending on the requirements for our application. For example:
>stringDist = train (map show simonBag) :: Categorical Float String
This is the first “cool thing” about Categorical: We can make distributions over any userdefined type. This makes programming with probabilities easier, more intuitive, and more convenient. Most other statistical libraries would require you to assign numbers corresponding to each color of marble, and then create a distribution over those numbers.
Now that we have a distribution, we can find some probabilities. If Simon pulls a marble from the bag, what’s the probability that it would be Red?
We can use the pdf function to do this calculation for us:
ghci> pdf simonDist Red
0.5294117647058824
ghci> pdf simonDist Blue
0.17647058823529413
ghci> pdf simonDist Green
0.23529411764705882
ghci> pdf simonDist White
0.058823529411764705
If we sum all the probabilities, as expected we would get 1:
ghci> sum $ map (pdf simonDist) [Red,Green,Blue,White]
1.0
Due to rounding errors, you may not always get 1. If you absolutely, positively, have to avoid rounding errors, you should use Rational probabilities:
>simonDistRational = train simonBag :: Categorical Rational Marble
Rationals are slower, but won’t be subject to floating point errors.
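For example, here is a sketch in plain Haskell (outside HLearn) of what exact probabilities look like with Rational; the counts are Simon's (9 red, 4 green, 3 blue, 1 white out of 17 marbles):

```haskell
import Data.Ratio ((%))

-- Exact probabilities for Simon's bag, mirroring what
-- `Categorical Rational Marble` would give us.
probs :: [Rational]
probs = [9 % 17, 4 % 17, 3 % 17, 1 % 17]  -- Red, Green, Blue, White

total :: Rational
total = sum probs  -- exactly 1, with no rounding error
```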
This is just about all the functionality you would get in a “normal” stats package like R or NumPy. But using Haskell’s nice support for algebra, we can get some extra cool features.
First, let’s talk about semigroups. A semigroup is any data structure that has a binary operation (<>) that joins two of those data structures together. The categorical distribution is a semigroup.
Don wants to play marbles with Simon, and he has his own bag. Don’s bag contains only red and blue marbles:
>donBag = [Red,Blue,Red,Blue,Red,Blue,Blue,Red,Blue,Blue]
We can train a categorical distribution on Don’s bag in the same way we did earlier:
>donDist = train donBag :: Categorical Double Marble
In order to play marbles together, Don and Simon will have to add their bags together.
>bothBag = simonBag ++ donBag
Now, we have two options for training our distribution. First is the naive way, we can train the distribution directly on the combined bag:
>bothDist = train bothBag :: Categorical Double Marble
This is the way we would have to approach this problem in most statistical libraries. But with HLearn, we have a more efficient alternative. We can combine the trained distributions using the semigroup operation:
>bothDist' = simonDist <> donDist
Under the hood, the categorical distribution stores the number of times each possibility occurred in the training data. The <> operator just adds the corresponding counts from each distribution together:
This method is more efficient because it avoids repeating work we’ve already done. Categorical’s semigroup operation runs in time O(1), so no matter how big the bags are, we can calculate the distribution very quickly. The naive method, in contrast, requires time O(n). If our bags had millions or billions of marbles inside them, this would be a considerable savings!
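Here is a minimal sketch of that idea in plain Haskell, assuming a made-up `Cat` type; this is not HLearn's actual definition. A categorical distribution is just a map from labels to counts, and `<>` is a union of maps that adds matching counts.

```haskell
import qualified Data.Map as Map

-- A sketch of what a categorical distribution stores:
-- a count for each label.
newtype Cat a = Cat (Map.Map a Double) deriving (Show, Eq)

-- Training tallies each data point.
trainCat :: Ord a => [a] -> Cat a
trainCat = Cat . Map.fromListWith (+) . map (\x -> (x, 1))

-- The semigroup operation adds the corresponding counts; its cost
-- depends only on the number of distinct labels, not on how much
-- data went into either distribution.
instance Ord a => Semigroup (Cat a) where
    Cat m1 <> Cat m2 = Cat (Map.unionWith (+) m1 m2)

-- A label's probability is its count over the total count.
catPdf :: Ord a => Cat a -> a -> Double
catPdf (Cat m) x = Map.findWithDefault 0 x m / sum (Map.elems m)
```

With this representation, `trainCat (xs ++ ys)` and `trainCat xs <> trainCat ys` build the same map, which is the property the post relies on.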
We get another cool performance trick “for free” based on the fact that Categorical is a semigroup: The function train can be automatically parallelized using the higher order function parallel. I won’t go into the details about how this works, but here’s how you do it in practice.
First, we must show the compiler how to resolve the Marble data type down to “normal form.” This basically means we must show the compiler how to fully compute the data type. (We only have to do this because Marble is a type we created. If we were using a built in type, like a String, we could skip this step.) This is fairly easy for a type as simple as Marble:
>instance NFData Marble where
>    rnf Red   = ()
>    rnf Blue  = ()
>    rnf Green = ()
>    rnf White = ()
Then, we can perform the parallel computation by:
>simonDist_par = parallel train simonBag :: Categorical Double Marble
Other languages require a programmer to manually create parallel versions of their functions. But in Haskell with the HLearn library, we get these parallel versions for free! All we have to do is ask for it!
A monoid is a semigroup with an empty element, which is called mempty in Haskell. It obeys the law that:
M <> mempty == mempty <> M == M
And it is easy to show that Categorical is also a monoid. We get this empty element by training on an empty data set:
mempty = train ([] :: [Marble]) :: Categorical Double Marble
The HomTrainer type class requires that all its instances also be instances of Monoid. This lets the compiler automatically derive “online trainers” for us. An online trainer can add new data points to our statistical model without retraining it from scratch.
For example, we could use the function add1dp (stands for: add one data point) to add another white marble into Simon’s bag:
>simonDistWhite = add1dp simonDist White
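Under the count-map view of the distribution, an online update is nothing more than a single increment. A hypothetical sketch follows; the names `add1` and `addMany` are made up, standing in for HLearn's add1dp and addBatch:

```haskell
import qualified Data.Map as Map

-- Adding one data point is a single increment --
-- no retraining from scratch.
add1 :: Ord a => a -> Map.Map a Int -> Map.Map a Int
add1 x = Map.insertWith (+) x 1

-- Adding a batch is just a fold of single-point updates.
addMany :: Ord a => [a] -> Map.Map a Int -> Map.Map a Int
addMany xs m = foldr add1 m xs
```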
This also gives us another approach for our earlier problem of combining Simon and Don’s bags. We could use the function addBatch:
>bothDist'' = addBatch simonDist donBag
Because Categorical is a monoid, we maintain the property that:
bothDist == bothDist' == bothDist''
Again, statisticians have always known that you could add new points into a categorical distribution without training from scratch. The cool thing here is that the compiler is deriving all of these functions for us, and it’s giving us a consistent interface for use with different data structures. All we had to do to get these benefits was tell the compiler that Categorical is a monoid. This makes designing and programming libraries much easier, quicker, and less error prone.
A group is a monoid with the additional property that all elements have an inverse. This lets us perform subtraction on groups. And Categorical is a group.
Ed wants to play marbles too, but he doesn’t have any of his own. So Simon offers to give Ed some from his own bag. He gives Ed one of each color:
>edBag = [Red,Green,Blue,White]
Now, if Simon draws a marble from his bag, what’s the probability it will be blue?
To answer this question without algebra, we’d have to go back to the original data set, remove the marbles Simon gave Ed, then retrain the distribution. This is awkward and computationally expensive. But if we take advantage of Categorical’s group structure, we can just subtract directly from the distribution itself. This makes more sense intuitively and is easier computationally.
>simonDist2 = subBatch simonDist edBag
This is a shorthand notation for using the group operations directly:
>edDist = train edBag :: Categorical Double Marble
>simonDist2' = simonDist <> (inverse edDist)
The way the inverse operation works is it multiplies the counts for each category by -1. In picture form, this flips the distribution upside down:
Then, adding an upside down distribution to a normal one is just subtracting the histogram columns and renormalizing:
Notice that the green bar in edDist looks really big—much bigger than the green bar in simonDist. But when we subtract it away from simonDist, we still have some green marbles left over in simonDist2. This is because the histogram is only showing the probability of a green marble, and not the actual number of marbles.
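In count form, the group operations are easy to sketch; again this is a toy count map, not HLearn's internals:

```haskell
import qualified Data.Map as Map

-- Counts for each marble color.
type Counts = Map.Map String Double

-- The group inverse negates every count...
invert :: Counts -> Counts
invert = Map.map negate

-- ...and subtraction is addition of the inverse.
minus :: Counts -> Counts -> Counts
minus m1 m2 = Map.unionWith (+) m1 (invert m2)
```

Subtracting one marble of each color from Simon's counts (9 red, 4 green, 3 blue, 1 white) leaves 8, 3, 2, and 0, so the probability of drawing a blue marble becomes 2/13.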
Finally, there’s one more crazy trick we can perform with the Categorical group. It’s perfectly okay to have both positive and negative marbles in the same distribution. For example:
ghci> plotDistribution (plotFile "mixedDist" $ PDF 400 300) (edDist <> (inverse donDist))
results in:
Most statisticians would probably say that these upside down Categoricals are not “real distributions.” But at the very least, they are a convenient mathematical trick that makes working with distributions much more pleasant.
Finally, an R-module is a group with two additional properties. First, it is abelian. That means <> is commutative. So, for all a, b:
a <> b == b <> a
Second, the data type supports multiplication by any element in the ring R. In Haskell, you can think of a ring as any member of the Num type class.
How is this useful? It lets us “retrain” our distribution on the data points it has already seen. Back to the example…
Well, Ed—being the clever guy that he is—recently developed a marble copying machine. That’s right! You just stick some marbles in on one end, and on the other end out pop 10 exact duplicates. Ed’s not just clever, but pretty nice too. He duplicates his new marbles and gives all of them back to Simon. What’s Simon’s new distribution look like?
Again, the naive way to answer this question would be to retrain from scratch:
>duplicateBag = simonBag ++ (concat $ replicate 10 edBag)
>duplicateDist = train duplicateBag :: Categorical Double Marble
Slightly better is to take advantage of the Semigroup property, and just apply that over and over again:
>duplicateDist' = simonDist2 <> (foldl1 (<>) $ replicate 10 edDist)
But even better is to take advantage of the fact that Categorical is a module and use the (.*) operator:
>duplicateDist'' = simonDist2 <> 10 .* edDist
In picture form:
Also notice that without the scalar multiplication, we would get back our original distribution:
Another way to think about the module’s scalar multiplication is that it allows us to weight our distributions.
Ed just realized that he still needs a marble, and has decided to take one. Someone has left their Marble bag sitting nearby, but he’s not sure whose it is. He thinks that Simon is more forgetful than Don is, so he assigns a 60% probability that the bag is Simon’s and a 40% probability that it is Don’s. When he takes a marble, what’s the probability that it is red?
We create a weighted distribution using module multiplication:
>weightedDist = 0.6 .* simonDist <> 0.4 .* donDist
Then in ghci:
ghci> pdf weightedDist Red
0.4929577464788732
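We can check this number by hand from the raw counts: Simon's bag has 9 red marbles out of 17, and Don's has 4 out of 10. Scalar multiplication scales the counts, `<>` adds them, and pdf renormalizes. A sketch of the arithmetic:

```haskell
-- By-hand check of the weighted mixture above: weight Simon's
-- counts by 0.6 and Don's by 0.4, then renormalize.
pRed :: Double
pRed = (0.6 * 9 + 0.4 * 4) / (0.6 * 17 + 0.4 * 10)
-- ≈ 7.0 / 14.2 ≈ 0.4929..., the value from the ghci session above
```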
We can also train directly on weighted data using the trainW function:
>weightedDataDist = trainW [(0.4,Red),(0.5,Green),(0.2,Green),(3.7,White)] :: Categorical Double Marble
which gives us:
Talking about the categorical distribution in algebraic terms lets us do some cool new stuff with our distributions that we can’t easily do in other libraries. None of this is statistically groundbreaking. The cool thing is that algebra just makes everything so convenient to work with.
I think I’ll do another post on some cool tricks with the kernel density estimator that are not possible at all in other libraries, then do a post about the category (formal categorytheoretic sense) of statistical training methods. At that point, we’ll be ready to jump into machine learning tasks. Depending on my mood we might take a pit stop to discuss the computational aspects of free groups and modules and how these relate to machine learning applications.
Sign up for the RSS feed to stay tuned!
Before we get into the math, we’ll need to review the basics of nuclear politics.
The nuclear Non-Proliferation Treaty (NPT) is the main treaty governing nuclear weapons. Basically, it says that there are five countries that are “allowed” to have nukes: the USA, UK, France, Russia, and China. “Allowed” is in quotes because the treaty specifies that these countries must eventually get rid of their nuclear weapons at some future, unspecified date. When another country, for example Iran, signs the NPT, they are agreeing to not develop nuclear weapons. What they get in exchange is help from the 5 nuclear weapons states in developing their own civilian nuclear power programs. (Iran has the legitimate complaint that Western countries are actively trying to stop its civilian nuclear program when they’re supposed to be helping it, but that’s a whole ‘nother can of worms.)
The Nuclear Notebook tracks the nuclear capabilities of all these countries. The most current estimates are from mid-2012. Here’s a summary:
Country | Delivery Method | Warhead | Yield (kt) | # Deployed
--------|-----------------|---------|------------|-----------
USA     | ICBM            | W78     | 335        | 250
USA     | ICBM            | W87     | 300        | 250
USA     | SLBM            | W76     | 100        | 468
USA     | SLBM            | W76-1   | 100        | 300
USA     | SLBM            | W88     | 455        | 384
USA     | Bomber          | W80     | 150        | 200
USA     | Bomber          | B61     | 340        | 50
USA     | Bomber          | B83     | 1200       | 50
UK      | SLBM            | W76     | 100        | 225
France  | SLBM            | TN75    | 100        | 150
France  | Bomber          | TN81    | 300        | 150
Russia  | ICBM            | RS-20V  | 800        | 500
Russia  | ICBM            | RS-18   | 400        | 288
Russia  | ICBM            | RS-12M  | 800        | 135
Russia  | ICBM            | RS-12M2 | 800        | 56
Russia  | ICBM            | RS-12M1 | 800        | 18
Russia  | ICBM            | RS-24   | 100        | 90
Russia  | SLBM            | RSM-50  | 50         | 144
Russia  | SLBM            | RSM-54  | 100        | 384
Russia  | Bomber          | AS-15   | 200        | 820
China   | ICBM            | DF-3A   | 3300       | 16
China   | ICBM            | DF-4    | 3300       | 12
China   | ICBM            | DF-5A   | 5000       | 20
China   | ICBM            | DF-21   | 300        | 60
China   | ICBM            | DF-31   | 300        | 20
China   | ICBM            | DF-31A  | 300        | 20
China   | Bomber          | H-6     | 3100       | 20
I’ve consolidated all this data into the file nukeslist.csv, which we will analyze in this post. If you want to try out this code for yourself (or the homework question at the end), you’ll need to download it. Every line in the file corresponds to a single nuclear warhead, not delivery method. Warheads are the parts that go boom! Bombers, ICBMs, and SSBN/SLBMs are the delivery method.
There are three things to note about this data. First, it’s only estimates based on public sources. In particular, it probably overestimates the Russian nuclear forces. Other estimates are considerably lower. Second, we will only be considering deployed, strategic warheads. Basically, this means the “really big nukes that are currently aimed at another country.” There are thousands more tactical warheads, and warheads in reserve stockpiles waiting to be disassembled. For simplicity—and because these nukes don’t significantly affect strategic planning—we won’t be considering them here. Finally, there are 4 countries who are not members of the NPT but have nuclear weapons: Israel, Pakistan, India, and North Korea. We will be ignoring them here because their inventories are relatively small, and most of their weapons would not be considered strategic.
First, let’s install the library:
$ cabal install HLearn-distributions-0.1
Now we’re ready to start programming. First, let’s import our libraries:
>import Control.Lens
>import Data.Csv
>import qualified Data.Vector as V
>import qualified Data.ByteString.Lazy.Char8 as BS
>
>import HLearn.Algebra
>import HLearn.Models.Distributions
>import HLearn.Gnuplot.Distributions
Next, we load our data using the Cassava package. (You don’t need to understand how this works.)
>main = do
>    Right rawdata <- fmap (fmap V.toList . decode True) $ BS.readFile "nukeslist.csv"
>        :: IO (Either String [(String, String, String, Int)])
And we’ll use the Lens package to parse the CSV file into a series of variables containing just the values we want. (You also don’t need to understand this.)
>    let list_usa    = fmap (\row -> row^._4) $ filter (\row -> (row^._1)=="USA"   ) rawdata
>    let list_uk     = fmap (\row -> row^._4) $ filter (\row -> (row^._1)=="UK"    ) rawdata
>    let list_france = fmap (\row -> row^._4) $ filter (\row -> (row^._1)=="France") rawdata
>    let list_russia = fmap (\row -> row^._4) $ filter (\row -> (row^._1)=="Russia") rawdata
>    let list_china  = fmap (\row -> row^._4) $ filter (\row -> (row^._1)=="China" ) rawdata
NOTE: All you need to understand about the above code is what these list_country variables look like. So let’s print one:
> putStrLn $ "List of American nuclear weapon sizes = " ++ show list_usa
gives us the output:
List of American nuclear weapon sizes = fromList [335,335,335,335,335,335,335,335,335,335 ... 1200,1200,1200,1200,1200]
If we want to know how many weapons are in the American arsenal, we can take the length of the list:
> putStrLn $ "Number of American weapons = " ++ show (length list_usa)
We get that there are 1951 American deployed, strategic nuclear weapons. If we want to know the total “blowing up” power, we take the sum of the list:
> putStrLn $ "Explosive power of American weapons = " ++ show (sum list_usa)
We get that the US has 516 megatons of deployed, strategic nuclear weapons. That’s the equivalent of 1,033,870,000,000 pounds of TNT.
To get the total number of weapons in the world, we concatenate every country’s list of weapons and find the length:
>    let list_all = list_usa ++ list_uk ++ list_france ++ list_russia ++ list_china
>    putStrLn $ "Number of nukes in the whole world = " ++ show (length list_all)
Doing this for every country gives us the table:
Country | Warheads | Total explosive power (kt)
--------|----------|---------------------------
USA     | 1,951    | 516,935
UK      | 225      | 22,500
France  | 300      | 60,000
Russia  | 2,435    | 901,000
China   | 168      | 284,400
Total   | 5,079    | 1,784,835
Now let’s do some algebra!
In a previous post, we saw that the Gaussian distribution forms a group. This means that it has all the properties of a monoid—an empty element (mempty) that represents the distribution trained on no data, and a binary operation (mappend) that merges two distributions together—plus an inverse. This inverse lets us “subtract” two Gaussians from each other.
It turns out that many other distributions also have this group property. For example, the categorical distribution. This distribution is used for measuring discrete data. Essentially, it assigns some probability to each “label.” In our case, the labels are the size of the nuclear weapon, and the probability is the chance that a randomly chosen nuke will be exactly that destructive. We train our categorical distribution using the train function:
> let cat_usa = train list_usa :: Categorical Int Double
If we plot this distribution, we’ll get a graph that looks something like:
A distribution like this is useful to war planners from other countries. It can help them statistically determine the amount of casualties their infrastructure will take from a nuclear exchange.
Now, let’s train equivalent distributions for our other countries.
>    let cat_uk     = train list_uk     :: Categorical Int Double
>    let cat_france = train list_france :: Categorical Int Double
>    let cat_russia = train list_russia :: Categorical Int Double
>    let cat_china  = train list_china  :: Categorical Int Double
Because training the categorical distribution is a group homomorphism, we can train a distribution over all nukes by either training directly on the data:
> let cat_allA = train list_all :: Categorical Int Double
or we can merge the already generated categorical distributions:
> let cat_allB = cat_usa <> cat_uk <> cat_france <> cat_russia <> cat_china
Because of the homomorphism property, we will get the same result both ways. Since we’ve already done the calculations for each of the countries, method B will be more efficient—it won’t have to repeat work we’ve already done. If we plot either of these distributions, we get:
The thing to notice in this plot is that most countries have a nuclear arsenal that is distributed similarly to the United States—except for China. These Chinese ICBMs will become much more important when we discuss nuclear strategy in the last section.
But nuclear war planners don’t particularly care about this complete list of nuclear weapons. What war planners care about is the survivable nuclear weapons—that is, weapons that won’t be blown up by a surprise nuclear attack. Our distributions above contain nukes dropped from bombers, but these are not survivable. They are easy to destroy. For our purposes, we’ll call anything that’s not a bomber a survivable weapon.
We’ll use the group property of the categorical distribution to calculate the survivable weapons. First, we create a distribution of just the unsurvivable bombers:
>    let list_bomber = fmap (\row -> row^._4) $ filter (\row -> (row^._2)=="Bomber") rawdata
>    let cat_bomber = train list_bomber :: Categorical Int Double
Then, we use our group inverse to subtract these unsurvivable weapons away:
> let cat_survivable = cat_allB <> (inverse cat_bomber)
Notice that we calculated this distribution indirectly—there was no possible way to combine our variables above to generate this value without using the inverse! This is the power of groups in statistics.
The categorical distribution is not sufficient to accurately describe the distribution of nuclear weapons. This is because we don’t actually know the yield of a given warhead. Like all things, it has some manufacturing tolerances that we must consider. For example, if we detonate a 300 kt warhead, the actual explosion might be 275 kt, 350 kt, or the bomb might even “fizzle out” and have almost a 0 kt explosion.
We’ll model this by using a kernel density estimator (KDE). The KDE basically takes all our data points, assigns each one a probability distribution called a “kernel,” then sums these kernels together. It is a very powerful and general technique for modelling distributions… and it also happens to form a group!
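The idea is simple enough to sketch directly; this is toy code, not HLearn's KDE type: center a Gaussian kernel of bandwidth h on every data point and average the kernels.

```haskell
-- A minimal kernel density estimator: one Gaussian bump of width h
-- per data point, averaged together.
kdePdf :: Double -> [Double] -> Double -> Double
kdePdf h xs x =
    sum [ gauss ((x - xi) / h) | xi <- xs ]
        / (fromIntegral (length xs) * h)
  where
    -- the standard Gaussian kernel
    gauss u = exp (-u * u / 2) / sqrt (2 * pi)
```

For instance, `kdePdf 20 yields 300` would estimate the density at 300 kt given a hypothetical list of yields.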
First, let’s create the parameters for our KDE. The bandwidth controls how wide each of the kernels is. Bigger means wider. I selected 20 because it made a reasonable looking density function. The sample points are exactly what they sound like: they are where we will sample the density from. We can generate them using the function genSamplePoints. Finally, the kernel is the shape of the distributions we will be summing up. There are many supported kernels.
>    let kdeparams = KDEParams
>            { bandwidth    = Constant 20
>            , samplePoints = genSamplePoints
>                0     -- minimum
>                4000  -- maximum
>                4000  -- number of samples
>            , kernel       = KernelBox Gaussian
>            } :: KDEParams Double
Now, we’ll train kernel density estimates on our data. Notice that because the KDE takes parameters, we must use the train’ function instead of just train.
> let kde_usa = train' kdeparams list_usa :: KDE Double
Again, plotting just the American weapons gives:
And we train the corresponding distributions for the other countries.
>    let kde_uk     = train' kdeparams list_uk     :: KDE Double
>    let kde_france = train' kdeparams list_france :: KDE Double
>    let kde_russia = train' kdeparams list_russia :: KDE Double
>    let kde_china  = train' kdeparams list_china  :: KDE Double
>
>    let kde_all = kde_usa <> kde_uk <> kde_france <> kde_russia <> kde_china
The KDE is a powerful technique, but the drawback is that it is computationally expensive—especially when a large number of sample points are used. Fortunately, all computations in the HLearn library are easily parallelizable by applying the higher order function parallel.
We can calculate the full KDE from scratch in parallel like this:
>    let list_double_all = map fromIntegral list_all :: [Double]
>    let kde_all_parA = (parallel (train' kdeparams)) list_double_all :: KDE Double
or we can perform a parallel reduction on the KDEs for each country like this:
> let kde_all_parB = (parallel reduce) [kde_usa, kde_uk, kde_france, kde_russia, kde_china]
And because the KDE is a homomorphism, we get the same exact thing either way. Let’s plot the parallel version:
> plotDistribution (genPlotParams "kde_all" kde_all_parA) kde_all_parA
The parallel computation takes about 16 seconds on my Core2 Duo laptop running on 2 processors, whereas the serial computation takes about 28 seconds.
This is a considerable speedup, but we can still do better. It turns out that there is a homomorphism from the Categorical distribution to the KDE:
>    let kde_fromcat_all = cat_allB $> kdeparams
>    plotDistribution (genPlotParams "kde_fromcat_all" kde_fromcat_all) kde_fromcat_all
(For more information about the morphism chaining operator $>, see the HLearn documentation.) This computation takes less than a second and gets the exact same result as the much more expensive computations above.
We can express this relationship with a commutative diagram:
No matter which path we take to get to a KDE, we will get the exact same answer. So we should always take the path that will be least computationally expensive for the data set we’re working on.
Why does this work? Well, the categorical distribution is a structure called the “free module” in disguise.
R-modules (like groups, but unlike monoids) have not seen much love from functional programmers. This is a shame, because they’re quite handy. It turns out they will increase our performance dramatically in this case.
It’s not super important to know the formal definition of an R-module, but here it is anyway: an R-module is a group with one additional property: it can be “multiplied” by any element of the ring R. This is a generalization of vector spaces because R need only be a ring instead of a field. (Rings do not necessarily have multiplicative inverses.) It’s probably easier to see what this means with an example.
Vectors are modules. Let’s say I have a vector:
> let vec = [1,2,3,4,5] :: [Int]
I can perform scalar multiplication on that vector like this:
> let vec2 = 3 .* vec
which as you might expect results in:
[3,6,9,12,15]
Our next example is the free R-module. A “free” structure is one that obeys only the axioms of the structure and nothing else. Functional programmers are very familiar with the free monoid—it’s the list data type. The free Z-module is like a beefed-up list. Instead of just storing the elements in a list, it also stores the number of times each element occurred. (Z is shorthand for the set of integers, which form a ring but not a field.) This lets us greatly reduce the memory required to store a repetitive data set.
In HLearn, we represent the free module over a ring r with the data type:
FreeMod r a
where a is the type of elements to be stored in the free module. We can convert our lists into free modules using the function list2module like this:
> let module_usa = list2module list_usa
But what does the free module actually look like? Let’s print it to find out:
> print module_usa
gives us:
FreeMod (fromList [(100,768),(150,200),(300,250),(335,249),(340,50),(455,384),(1200,50)])
This is much more compact! So this is the take away: The free module makes repetitive data sets easier to work with. Now, let’s convert all our country data into module form:
> let module_uk = list2module list_uk
> let module_france = list2module list_france
> let module_russia = list2module list_russia
> let module_china = list2module list_china
Because modules are also groups, we can combine them like so:
> let module_allA = module_usa <> module_uk <> module_france <> module_russia <> module_china
or, we could train them from scratch:
> let module_allB = list2module list_all
Again, because generating a free module is a homomorphism, both methods are equivalent.
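To see why both methods agree, here is a toy sketch of the free module as a Data.Map of counts. This mirrors the idea behind FreeMod but is not HLearn’s actual implementation; the names toModule, madd, and smul are hypothetical stand-ins for list2module, (<>), and (.*).

```haskell
import qualified Data.Map as Map

-- Each element is paired with the number of times it occurred.
type FreeModZ a = Map.Map a Int

-- The analogue of list2module: count occurrences of each element.
toModule :: Ord a => [a] -> FreeModZ a
toModule xs = Map.fromListWith (+) [ (x, 1) | x <- xs ]

-- The group operation: add the counts pointwise.
madd :: Ord a => FreeModZ a -> FreeModZ a -> FreeModZ a
madd = Map.unionWith (+)

-- Scalar multiplication by a ring element r: scale every count.
smul :: Ord a => Int -> FreeModZ a -> FreeModZ a
smul r = Map.map (r *)

-- toModule is a homomorphism:
--   toModule (xs ++ ys) == madd (toModule xs) (toModule ys)
```

Because counting occurrences commutes with list concatenation, building the module from the whole data set and merging per-country modules give the same answer.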
The categorical distribution and the KDE both have this module structure. This gives us two cool properties for free.
First, we can train these distributions directly from the free module. Because the free module is potentially much more compact than a list is, this can save both memory and time. If we run:
> let cat_module_all = train module_allB :: Categorical Int Double
> let kde_module_all = train' kdeparams module_allB :: KDE Double
Then we get the properties:
cat_module_all == cat_all
kde_module_all == kde_all == kde_fromcat_all
Extending our commutative diagram above gives:
Again, no matter which path we take to train our KDE, we still get the same result because each of these arrows is a homomorphism.
Second, if a distribution is a module, we can weight the importance of our data points. Let’s say we’re a general from North Korea (DPRK), and we’re planning our nuclear strategy. The US and North Korea have a very strained relationship in the nuclear department. It is much more likely that the US will try to nuke the DPRK than China will. And modules let us model this! We can weight each country’s influence on our “nuclear threat profile” distribution like this:
> let threats_dprk = 20 .* kde_usa
>                 <> 10 .* kde_uk
>                 <> 5 .* kde_france
>                 <> 2 .* kde_russia
>                 <> 1 .* kde_china
>
> plotDistribution (genPlotParams "threats_dprk" threats_dprk) threats_dprk
Basically, we’re saying that the USA is 20x more likely to attack the DPRK than China is. Graphically, our threat distribution is:
The maximum threat that we have to worry about is about 1300 kt, so we need to design all our nuclear bunkers to withstand this level of blast. Nuclear war planners would use the above distribution to figure out how much infrastructure would survive a nuclear exchange. To see how this is done, you’ll have to click the link.
On the other hand, if we’re an American general, then we might say that China is our biggest threat… who knows what they’ll do when we can’t pay all the debt we owe them!?
> let threats_usa = 1 .* kde_russia
>               <> 5 .* kde_china
>
> plotDistribution (genPlotParams "threats_usa" threats_usa) threats_usa
Graphically:
So now Chinese ICBMs are a real threat. For American infrastructure to be secure, most of it needs to be able to withstand ~3500 kt blast. (Actually, Chinese nuclear policy is called the “minimum means of reprisal”—these nukes are not targeted at military installations, but major cities. Unlike the other nuclear powers, China doesn’t hope to win a nuclear war. Instead, its nuclear posture is designed to prevent nuclear war in the first place. This is why China has the fewest weapons of any of these countries. For a detailed analysis, see the book Minimum Means of Reprisal. This means that American military infrastructure isn’t threatened by these large Chinese nukes, and really only needs to be able to withstand an 800kt explosion to be survivable.)
By the way, since we’ve already calculated all of the kde_country variables before, these computations take virtually no time at all to compute. Again, this is all made possible thanks to our friend abstract algebra.
If you want to try out the HLearn library for yourself, here’s a question you can try to answer: Create the DPRK and US threat distributions above, but only use survivable weapons. Don’t include bombers in the analysis.
In our next post, we’ll go into more detail about the mathematical plumbing that makes all this possible. Then we’ll start talking about Bayesian classification and full-on machine learning. Subscribe to the RSS feed so you don’t miss out!
Why don’t you listen to Tom Lehrer’s “Song for WWIII” while you wait?
]]>
This is the first in a series of posts about the HLearn library for Haskell that I’ve been working on for the past few months. The idea of the library is to show that abstract algebra—specifically monoids, groups, and homomorphisms—is useful not just in esoteric functional programming, but also in real world machine learning problems. In particular, by framing a learning algorithm according to these algebraic properties, we get three things for free: (1) an online version of the algorithm; (2) a parallel version of the algorithm; and (3) a procedure for cross-validation that runs asymptotically faster than the standard version.
We’ll start with the example of a Gaussian distribution. Gaussians are ubiquitous in learning algorithms because they accurately describe most data. But more importantly, they are easy to work with. They are fully determined by their mean and variance, and these parameters are easy to calculate.
In this post we’ll start with examples of why the monoid and group properties of Gaussians are useful in practice, then we’ll look at the math underlying these examples, and finally we’ll see that this technique is extremely fast in practice and results in near perfect parallelization.
Install the libraries from a shell:
$ cabal install HLearn-algebra-0.0.1
$ cabal install HLearn-distributions-0.0.1
Then import the HLearn libraries into a literate haskell file:
> import HLearn.Algebra
> import HLearn.Models.Distributions.Gaussian
And some libraries for comparing our performance:
> import Criterion.Main
> import Statistics.Distribution.Normal
> import qualified Data.Vector.Unboxed as VU
Now let’s create some data to work with. For simplicity’s sake, we’ll use a made up data set of how much money people make. Every entry represents one person making that salary. (We use a small data set here for ease of explanation. When we stress test this library at the end of the post we use much larger data sets.)
> gradstudents = [15e3,25e3,18e3,17e3,9e3] :: [Double]
> teachers = [40e3,35e3,89e3,50e3,52e3,97e3] :: [Double]
> doctors = [130e3,105e3,250e3] :: [Double]
In order to train a Gaussian distribution from the data, we simply use the train function, like so:
> gradstudents_gaussian = train gradstudents :: Gaussian Double
> teachers_gaussian = train teachers :: Gaussian Double
> doctors_gaussian = train doctors :: Gaussian Double
The train function is a member of the HomTrainer type class, which we’ll talk more about later. Also, now that we’ve trained some Gaussian distributions, we can perform all the normal calculations we might want to do on a distribution. For example, taking the mean, standard deviation, pdf, and cdf.
Now for the interesting bits. We start by showing that the Gaussian is a semigroup. A semigroup is any data structure that has an associative binary operation called (<>). Basically, we can think of (<>) as “adding” or “merging” the two structures together. (Semigroups are monoids with only a mappend function.)
So how do we use this? Well, what if we decide we want a Gaussian over everyone’s salaries? Using the traditional approach, we’d have to recompute this from scratch.
> all_salaries = concat [gradstudents,teachers,doctors]
> traditional_all_gaussian = train all_salaries :: Gaussian Double
But this repeats work we’ve already done. On a real world data set with millions or billions of samples, this would be very slow. Better would be to merge the Gaussians we’ve already trained into one final Gaussian. We can do that with the semigroup operation (<>):
> semigroup_all_gaussian = gradstudents_gaussian <> teachers_gaussian <> doctors_gaussian
Now,
traditional_all_gaussian == semigroup_all_gaussian
The coolest part about this is that the semigroup operation takes time O(1), no matter how much data we’ve trained the Gaussians on. The naive approach takes time O(n), so we’ve got a pretty big speed up!
Next, a monoid is a semigroup with an identity. The identity for a Gaussian is easy to define—simply train on the empty data set!
> gaussian_identity = train ([]::[Double]) :: Gaussian Double
Now,
gaussian_identity == mempty
But we’ve still got one more trick up our sleeves. The Gaussian distribution is not just a monoid, but also a group. Groups appear all the time in abstract algebra, but they haven’t seen much attention in functional programming for some reason. Well groups are simple: they’re just monoids with an inverse. This inverse lets us do “subtraction” on our data structures.
So back to our salary example. Let’s say we’ve calculated all our salaries, but we’ve realized that including grad students in the salary calculations was a mistake. (They’re not real people, after all.) In a normal library, we would have to recalculate everything from scratch again, excluding the grad students:
> nograds = concat [teachers,doctors]
> traditional_nograds_gaussian = train nograds :: Gaussian Double
But as we’ve already discussed, this takes a lot of time. We can use the inverse function to do this same operation in constant time:
> group_nograds_gaussian = semigroup_all_gaussian <> (inverse gradstudents_gaussian)
And now,
traditional_nograds_gaussian == group_nograds_gaussian
Again, we’ve converted an operation that would have taken time O(n) into one that takes time O(1). Can’t get much better than that!
As I’ve already mentioned, the HomTrainer type class is the basis of the HLearn library. Basically, any learning algorithm that is also a semigroup homomorphism can be made an instance of HomTrainer. This means that if xs and ys are lists of data points, the class obeys the following law:
train (xs ++ ys) == (train xs) <> (train ys)
It might be easier to see what this means in picture form:
On the left hand side, we have some data sets, and on the right hand side, we have the corresponding Gaussian distributions and their parameters. Because training the Gaussian is a homomorphism, it doesn’t matter whether we follow the orange or green paths to get to our final answer. We get the exact same answer either way.
Based on this property alone, we get the three “free” properties I mentioned in the introduction. (1) We get an online algorithm for free. The function add1dp can be used to add a single new point to an existing Gaussian distribution. Let’s say I forgot about one of the graduate students—I’m sure this would never happen in real life—I can add their salary like this:
> gradstudents_updated_gaussian = add1dp gradstudents_gaussian (10e3::Double)
This updated Gaussian is exactly what we would get if we had included the new data point in the original data set.
(2) We get a parallel algorithm. We can use the higher-order function parallel to parallelize any application of train. For example,
> gradstudents_parallel_gaussian = (parallel train) gradstudents :: Gaussian Double
The function parallel automatically detects the number of processors your computer has and evenly distributes the work load over them. As we’ll see in the performance section, this results in perfect parallelization of the training function. Parallelization literally could not be any simpler!
(3) We get asymptotically faster cross-validation; but that’s not really applicable to a Gaussian distribution, so we’ll ignore it here.
One last note about the HomTrainer class: we never actually have to define the train function for our learning algorithm explicitly. All we have to do is define the semigroup operation, and the compiler will derive our training function for us! We’ll save a discussion of why this homomorphism property gives us these results for another post. Instead, we’ll just take a look at what the Gaussian distribution’s semigroup operation looks like.
Our Gaussian data type is defined as:
data Gaussian datapoint = Gaussian
    { n  :: !Int        -- The number of samples trained on
    , m1 :: !datapoint  -- The mean (first moment) of the trained distribution
    , m2 :: !datapoint  -- The variance (second moment) times (n-1)
    , dc :: !Int        -- The number of "dummy points" that have been added
    }
In order to estimate a Gaussian from a sample, we must find the total number of samples (n), the mean (m1), and the variance (calculated from m2). (We’ll explain what dc means a little later.) Therefore, we must figure out an appropriate definition for our semigroup operation below:
(Gaussian na m1a m2a dca) <> (Gaussian nb m1b m2b dcb) = Gaussian n' m1' m2' dc'
First, we calculate the number of samples n’. The number of samples in the resulting distribution is simply the sum of the number of samples in both the input distributions:
Second, we calculate the new average m1′. We start with the definition that the final mean is:
Then we split the summation according to whether the input element was from the left Gaussian a or right Gaussian b, and substitute with the definition of the mean above:
Notice that this is simply the weighted average of the two means. This makes intuitive sense. But there is a slight problem with this definition: When implemented on a computer with floating point arithmetic, we will get infinity whenever n’ is 0. We solve this problem by adding a “dummy” element into the Gaussian whenever n’ would be zero. This increases n’ from 0 to 1, preventing the division by 0. The variable dc counts how many dummy variables have been added, so that we can remove them before performing calculations (e.g. finding the pdf) that would be affected by an incorrect number of samples.
Finally, we must calculate the new m2′. We start with the definition that the variance times (n1) is:
(Note that the second half of the equation is a property of variance, and its derivation can be found on wikipedia.)
Then, we do some algebra, split the summations according to which input Gaussian the data point came from, and resubstitute the definition of m2 to get:
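In symbols, the merged parameters work out to the following (reconstructed here to be consistent with the derivation described above):

```latex
n' = n_a + n_b, \qquad
m_1' = \frac{n_a\,m_{1a} + n_b\,m_{1b}}{n'}, \qquad
m_2' = m_{2a} + m_{2b} + n_a m_{1a}^2 + n_b m_{1b}^2 - n'\,m_1'^2
```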
Notice that this equation has no divisions in it. This is why we are storing m2 as the variance times (n1) rather than simply the variance. Adding in the extra divisions causes training our Gaussian distribution to run about 4x slower. I’d say haskell is getting pretty fast if the number of floating point divisions we perform is impacting our code’s performance that much!
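As a sanity check, here is a small self-contained sketch of the merge. This is a toy model, not HLearn’s Gaussian: it omits the dummy-point bookkeeping, and the names G, merge, and trainG are made up for illustration.

```haskell
-- A stripped-down Gaussian: sample count, mean, and variance times (n-1).
data G = G { gN :: Int, gM1 :: Double, gM2 :: Double } deriving (Show, Eq)

-- Merge two Gaussians using the equations derived above.
merge :: G -> G -> G
merge (G na m1a m2a) (G nb m1b m2b) = G n' m1' m2'
  where
    n'  = na + nb
    m1' = (fromIntegral na * m1a + fromIntegral nb * m1b) / fromIntegral n'
    m2' = m2a + m2b
        + fromIntegral na * m1a^2 + fromIntegral nb * m1b^2
        - fromIntegral n' * m1'^2

-- Batch training, for checking the homomorphism property.
trainG :: [Double] -> G
trainG xs = G n m1 m2
  where
    n  = length xs
    m1 = sum xs / fromIntegral n
    m2 = sum [ (x - m1)^2 | x <- xs ]

-- e.g. merge (trainG [1,2,3]) (trainG [4,5,6,7]) has the same count, mean,
-- and m2 as trainG [1,2,3,4,5,6,7]
```

The merge itself touches only the six stored fields, which is why it runs in O(1) regardless of how much data each side was trained on.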
This algebraic interpretation of the Gaussian distribution has excellent time and space performance. To show this, we’ll compare performance to the excellent Haskell package called “statistics” that also has support for Gaussian distributions. We use the criterion package to create three tests:
> size = 10^8
> main = defaultMain
>     [ bench "statisticsGaussian" $ whnf (normalFromSample . VU.enumFromN 0) (size)
>     , bench "HLearnGaussian" $ whnf
>         (train :: VU.Vector Double -> Gaussian Double)
>         (VU.enumFromN (0::Double) size)
>     , bench "HLearnGaussianParallel" $ whnf
>         (parallel $ (train :: VU.Vector Double -> Gaussian Double))
>         (VU.enumFromN (0::Double) size)
>     ]
In these tests, we time three different methods of constructing Gaussian distributions given 100,000,000 data points. On my laptop with 2 cores, I get these results:
statisticsGaussian     | 2.85 sec
HLearnGaussian         | 1.91 sec
HLearnGaussianParallel | 0.96 sec
Pretty nice! The algebraic method managed to outperform the traditional method for training a Gaussian by a handy margin. Plus, our parallel algorithm runs exactly twice as fast on two processors. Theoretically, this should scale to an arbitrary number of processors, but I don’t have a bigger machine to try it out on.
Another interesting advantage of the HLearn library is that we can trade off time and space performance by changing which data structures store our data set. Specifically, we can use the same functions to train on a list or an unboxed vector. We do this by using the ConstraintKinds package on hackage that extends the base type classes like Functor and Foldable to work on classes that require constraints. Thus, we have a Functor instance of Vector.Unboxed. This is not possible without ConstraintKinds.
Using this benchmark code:
main = do
    print $ (train [0..fromIntegral size::Double] :: Gaussian Double)
    print $ (train (VU.enumFromN (0::Double) size) :: Gaussian Double)
We generate the following heap profile:
Processing the data as a vector requires that we allocate all the memory in advance. This lets the program run faster, but prevents us from loading data sets larger than the amount of memory we have. Processing the data as a list, however, allows us to allocate the memory only as we use it. But because lists are boxed and lazy data structures, we must accept that our program will run about 10x slower. Lucky for us, GHC takes care of all the boring details of making this happen seamlessly. We only have to write our train function once.
There are still at least four more major topics to cover in the HLearn library: (1) we can extend this discussion to show how the Naive Bayes learning algorithm has a similar monoid and group structure; (2) there are many more learning algorithms with group structures we can look into; (3) we can look at exactly how all these higher-order functions, like batch and parallel, work under the hood; and (4) we can see how the fast cross-validation I briefly mentioned works and why it’s important.
Subscribe to the RSS feed and stay tuned!
]]>EDIT: WordPress seems to garble the code sections on occasion for no good reason. If you want to run the code, you should download the original file instead. Sorry.
This is a tutorial for how to use Hidden Markov Models (HMMs) in Haskell. We will use the Data.HMM package to find genes in the second chromosome of Vitis vinifera: the wine grape vine. Predicting gene locations is a common task in bioinformatics that HMMs have proven good at.
The basic procedure has three steps. First, we create an HMM to model the chromosome. We do this by running the Baum-Welch training algorithm on all the DNA. Second, we create an HMM to model transcription factor binding sites. This is where genes are located. Finally, we use Viterbi’s algorithm to determine which HMM best models the DNA at a given location in the chromosome. If it’s the first, this is probably not the start of a gene. If it’s the second, then we’ve found a gene!
Unfortunately, it’s beyond the scope of this tutorial to go into the math of HMMs and how they work. Instead, we will focus on how to use them in practice. And like all good Haskell tutorials, this page is actually a literate Haskell program, so you can simply cut and paste it into your favorite text editor to run it.
Before we do anything else, we must import the Data.HMM library, and some other libraries for the program
>import Data.HMM
>import Control.Monad
>import Data.Array
>import System.IO
Now, let’s create our first HMM. The HMM datatype is:
data HMM stateType eventType = HMM
    { states      :: [stateType]
    , events      :: [eventType]
    , initProbs   :: (stateType -> Prob)
    , transMatrix :: (stateType -> stateType -> Prob)
    , outMatrix   :: (stateType -> eventType -> Prob)
    }
Notice that states and events can be any type supported by Haskell. In this example, we will be using both integers and strings for the states, and characters for the events. DNA is composed of 4 base pairs that get repeated over and over: adenine (A), guanine (G), cytosine (C), and thymine (T), so “AGCT” will be the list of our events.
We’ll start by creating a simple HMM by hand:
>hmm1 = HMM { states=[1,2]
>           , events=['A','G','C','T']
>           , initProbs = ip
>           , transMatrix = tm
>           , outMatrix = om
>           }
>
>ip s
>    | s == 1 = 0.1
>    | s == 2 = 0.9
>
>tm s1 s2
>    | s1==1 && s2==1 = 0.9
>    | s1==1 && s2==2 = 0.1
>    | s1==2 && s2==1 = 0.5
>    | s1==2 && s2==2 = 0.5
>
>om s e
>    | s==1 && e=='A' = 0.4
>    | s==1 && e=='G' = 0.1
>    | s==1 && e=='C' = 0.1
>    | s==1 && e=='T' = 0.4
>    | s==2 && e=='A' = 0.1
>    | s==2 && e=='G' = 0.4
>    | s==2 && e=='C' = 0.4
>    | s==2 && e=='T' = 0.1
While creating HMMs manually is straightforward, we will typically want to start with one of the built-in HMMs. The simplest way to do this is the function simpleHMM:
>hmm2 = simpleHMM [1,2] "AGCT"
hmm2 is an HMM with the same states and events as hmm1, but all the initial, transition, and output probabilities are distributed in an unknown manner. This is okay, however, because we will normally want to train our HMM using Baum-Welch to determine those parameters automatically.
Another simple way to create an HMM is by creating a non-hidden Markov model with the simpleMM command. (Note the absence of an “H”.) Below, hmm3 is a 3rd order Markov model for DNA:
>hmm3 = simpleMM "AGCT" 3
Now, how do we train our model? The standard algorithm is called Baum-Welch. To illustrate the process, we’ll create a short array of DNA, then call three iterations of baumWelch on it.
>dnaArray = listArray (1,20) "AAAAGGGGCTCTCTCCAACC"
>hmm4 = baumWelch hmm3 dnaArray 3
We use arrays instead of lists because this gives us better performance when we start passing large training data to Baum-Welch. Doing three iterations is completely arbitrary. Baum-Welch is guaranteed to converge, but there is no way of knowing how long that will take.
Now, let’s train our HMM on an entire chromosome. We will use the winegrape-chromosome2 file. This DNA file was downloaded from the plant genomics database. We can load and process it like this:
>loadDNAArray len = do
>    dna <- readFile "winegrape-chromosome2"
>    let dnaArray = listArray (1,len) $ filter isBP dna
>    return dnaArray
>    where
>        isBP x = if x `elem` "AGCT" -- This filters out the "N" base pair;
>            then True               -- "N" means it could be any bp,
>            else False              -- so this should not affect results too much
>
>createDNAhmm file len hmm = do
>    dna <- loadDNAArray len
>    let hmm' = baumWelch hmm dna 10
>    putStrLn $ show hmm'
>    saveHMM file hmm'
>    return hmm'
The loadDNAArray function simply loads the DNA from the file into an array, and the createDNAhmm function actually calls the Baum-Welch algorithm. This function can take a while on long inputs—and DNA is a long input!—so we also pass a file parameter for it to save our HMM when it’s done for later use. Now let’s create our HMM:
>hmmDNA = createDNAhmm "trainedDNA.hmm" 50000 hmm3
This call takes almost a full day on my laptop. Luckily, you don’t have to repeat it. The Data.HMM.HMMFile module allows us to write our HMMs to disk and retrieve them later. Simply download trainedDNA.hmm and then call loadHMM:
>hmmDNA_file = loadHMM "trainedDNA.hmm" :: IO (HMM String Char)
NOTE: Whenever you use loadHMM, you must specify the type of the resulting HMM. loadHMM relies on the builtin “read” function, and this cannot work unless you specify the type!
Great! Now, we have a fully trained HMM for our chromosome. Our next step is to train another HMM on the transcription factor binding sites. There are many advanced ways to do this (e.g. Profile HMMs), but that’s beyond the scope of this tutorial. We’re simply going to download a list of TF binding sites, concatenate them, then train our HMM on them. This won’t be as effective, but saves us from taking an unnecessary tangent.
>createTFhmm file hmm = do
>    x <- strTF
>    let hmm' = baumWelch hmm (listArray (1,length x) x) 10
>    putStrLn $ show hmm'
>    saveHMM file hmm'
>    return hmm'
>    where
>        strTF = liftM (concat . map ((++) "")) loadTF
>        loadTF = liftM (filter isValidTF) $ (liftM lines) $ readFile "TFBindingSites"
>        isValidTF str = (length str > 0) && (not $ elemChecker "#(/)[]N" str)
>
>elemChecker :: (Eq a) => [a] -> [a] -> Bool
>elemChecker elemList list
>    | elemList == [] = False
>    | otherwise = if (head elemList) `elem` list
>        then True
>        else elemChecker (tail elemList) list
Now, let’s create our transcription factor HMM:
>hmmTF = createTFhmm "trainedTF.hmm" $ simpleMM "AGCT" 3
Or if you’re in a hurry, just download trainedTF.hmm and load it:
>hmmTF_file = loadHMM "trainedTF.hmm" :: IO (HMM String Char)
So now we have 2 HMMs, how are we going to use them? We’ll combine the two HMMs into a single HMM, then use Viterbi’s algorithm to determine which HMM best characterizes our DNA at a given point. If it’s hmmDNA, then we do not have a TF binding site at that location, but if it’s hmmTF, then we probably do.
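Viterbi’s algorithm itself is simple enough to sketch in a few lines. The following is a toy, self-contained version that mirrors the idea, not Data.HMM’s actual implementation; the helper names (viterbiPath, ip', tm', om') are made up, and the probability tables reproduce the ones from hmm1 above.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- For each observed event, keep the most probable state path ending in each
-- state; at the end, return the best path overall. Paths are stored reversed.
viterbiPath :: [s]                -- states
            -> (s -> Double)      -- initial probabilities
            -> (s -> s -> Double) -- transition probabilities
            -> (s -> e -> Double) -- output probabilities
            -> [e]                -- observed events
            -> [s]
viterbiPath _  _  _  _  []     = []
viterbiPath ss ip tm om (e:es) = reverse . snd . best $ foldl step start es
  where
    start = [ (ip s * om s e, [s]) | s <- ss ]
    step acc ev =
        [ best [ (p * tm s s' * om s' ev, s' : path)
               | (p, path@(s:_)) <- acc ]
        | s' <- ss ]
    best = maximumBy (comparing fst)

-- The same tables as hmm1, written as plain functions:
ip' s = if s == 1 then 0.1 else 0.9
tm' s1 s2 = [[0.9,0.1],[0.5,0.5]] !! (s1-1) !! (s2-1)
om' s e = let row = if s == 1 then [0.4,0.1,0.1,0.4] else [0.1,0.4,0.4,0.1]
          in row !! length (takeWhile (/= e) "AGCT")
-- e.g. viterbiPath [1,2] ip' tm' om' "GG" == [2,2]
```

Data.HMM’s viterbi plays exactly this role in the gene finder: it labels each position with the state (and hence the HMM) that best explains the observed base pairs.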
The Data.HMM library provides another convenient function for combining HMMs, hmmJoin. It adds transitions from every state in the first HMM to every state in the second, and vice versa, using the “joinParam” to determine the relative probability of making that transition. This is the simplest way to combine two HMMs. If you want more control over how they get combined, you can implement your own version.
>findGenes len joinParam hout = do
>    hmmTF <- loadHMM "hmm/TF3.hmm" :: IO (HMM String Char)
>    hmmDNA <- loadHMM "hmm/autowinegrape10003.hmm" :: IO (HMM String Char)
>    let hmm' = seq hmmDNA $ seq hmmTF $ hmmJoin hmmTF hmmDNA joinParam
>    dna <- loadDNAArray len
>    hPutStrLn hout ("len="++show len++",joinParam="++show joinParam++" -> "
>                    ++(show $ concat $ map (show . fst) $ viterbi hmm' dna))
>
>main = do
>    hout <- openFile "geneResults" WriteMode -- output file (name assumed)
>    mapM_ (\len -> mapM_ (\jp -> findGenes len jp hout)
>                         [0.5,0.51,0.52,0.53,0.54,0.55,0.56,0.57,0.58,0.59,0.6])
>          [50000]
>    hClose hout
Finally, our main function runs findGenes with several different joinParams. These act as thresholds for finding where the genes actually occur. You can download the full results here.
How should we interpret these results? Let’s look at the output from around 38000 base pairs into the chromosome:
jP=0.50 -> 222222222222222222222222222222222222222222222222222222
jP=0.51 -> 222222222222222222222222222222222222222222222222222222
jP=0.52 -> 222222222222222222222222222222222222222222222222222222
jP=0.53 -> 222222222222222222222222222222222222222222222222222222
jP=0.54 -> 222222222222222222222222222222222222222222222222222222
jP=0.55 -> 222222222222222222222222222222222222222222222222222222
jP=0.56 -> 222222222222222222222222222211112222222222222222222222
jP=0.57 -> 222222222222222222222222222211112222222222222111222222
jP=0.58 -> 222221111111112222222222222211111122222222222111222222
jP=0.59 -> 222221111111112222211111111111111122211111111111222222
jP=0.60 -> 222221111111112222211111111111111111111111111111112222
Everywhere where there is a 2, Viterbi selected hmmDNA; where there is a 1, Viterbi selected the hmmTF. Whether you select this area as a likely candidate for a transcription factor binding site depends on how you set your join parameter.
Now that you’re familiar with how the Data.HMM module works, let’s look at its performance characteristics.
Overall, the Data.HMM package performs well on medium size datasets of up to about 10,000 items. Unfortunately, on larger datasets, performance begins to suffer. Algorithms that should be running in linear time start taking superlinear time, presumably because Haskell’s garbage collector is interfering. More work is needed to determine the exact cause and fix it. Still, performance remains tractable on these large datasets up to 100,000 items, which is the largest I tried.
I ran these tests using Haskell’s criterion package. Criterion conveniently allows you to define multiple tests and does all the statistical analysis of them. For these tests, I did 3 trials each, and ran them on my Core 2 Duo laptop. The code for the tests can be found in the HMMPerf.hs file. In all graphs, the blue line is actual performance data and the red line is a best-fit curve.
Baum-Welch’s performance
First, as expected, we find that Baum-Welch runs in linear time in the number of iterations. In an imperative language, there would be no point in even testing this. But in Haskell, laziness can rear its head in unexpected ways, so it is important to ensure this is linear.
For small arrays, Baum-Welch runs in linear time.
But for larger arrays, it runs in superlinear time. It is interesting that the exponent on our polynomial function is not quite at 2. This provides evidence that the performance hit has to do with the Haskell compiler and not an incorrect implementation.
Viterbi’s performance
As expected, the Viterbi runs in quadratic time on the number of states in the HMM.
The curves for the Viterbi algorithm clearly demonstrate that something weird is going on. At small array sizes, Viterbi is only mildly superlinear. Its best-fit polynomial curve has an exponent of only 1.3. But at medium array lengths, this exponent increases to 1.8, and at large array lengths, the exponent increases to 1.97.
Data.HMM is a great tool if you just need a small HMM in your Haskell application for some reason. If you’re going to be making heavy use of HMMs and don’t specifically need to interact with Haskell, it’s probably better to use a package written in C++ that’s been optimized for speed.
]]>We’ll make our unfair coins by bending them. Our hypothesis is that the concave side will have less area to land on, and so the coin should land on it less often. Let’s get started.
It’s easy to bend the coins with your teeth:
WAIT! That really hurts! Using pliers or wrenches works much better:
I made seven coins this way, each with a different bending angle.
I did 100 flips for each coin, making sure each flip went at least a foot in the air and spun real well. “Umm… only 100 flips?” you ask, “That can’t be enough!” Just you wait until the section on the math.
Here’s the raw results:
Coin | Total Flips | Heads | Tails
0    | 100         | 53    | 47
1    | 100         | 55    | 45
2    | 100         | 49    | 51
3    | 100         | 41    | 59
4    | 100         | 39    | 61
5    | 100         | 27    | 73
6    | 100         | 0     | 100
Coin flipping is a Bernoulli process. This just means that all trials (flips) can have only two outcomes (heads or tails), and each trial is independent of every other trial. What we’re interested in calculating is the expected value of a coin flip for each of our coins. That is, what is the probability it will come up heads? The obvious way to calculate this probability is simply to divide the number of heads by the total number of trials. Unfortunately, this doesn’t give us a good idea about how accurate our estimate is.
Enter the beta distribution. This is a distribution over the bias of a Bernoulli process. Intuitively, this means that CDF(x) equals the probability that the expectation of a coin flip is at most x. In other words, we’re finding the probability that a probability is what we think it should be. That’s a convoluted definition! Some examples should make it clearer.
The beta distribution takes two parameters, α and β. α is the number of heads we have flipped plus one, and β is the number of tails plus one. We’ll talk about why that plus one is there in a bit, but first let’s see what the distribution actually looks like with some example parameters.
In both the above cases, the distribution is centered around 0.5 because α and β are equal—we’ve gotten the same number of heads as we have tails. As these parameters increase, the distribution gets tighter and tighter. This should make sense. The more flips we do, the more confident we can be that the data we’ve collected actually match the characteristics of the coin.
When the parameters are not equal to each other (for example, we’ve seen twice as many heads as we have tails) then the distribution is skewed to the left or right accordingly. The peak of the PDF occurs at the mode:

(α − 1) / (α + β − 2) = heads / (heads + tails)
That’s exactly what we said the expectation of the next coin flip should be above. Awesome!
So what happens when α and β are both one?
We get the flat distribution. Basically, we haven’t flipped the coin at all yet, so we have no data about how our coin is biased, so all biases are equally likely. This is why we must add one to the number of heads and tails we have flipped to get the appropriate α and β.
If α and β are less than one, we get something like this:
Essentially, this means that we know our coin is very biased in one way or the other, but we don’t know which way yet! As you can imagine, such perverse parameterizations are rarely used in practice.
Hopefully, this has given you an intuitive sense for what the beta distribution looks like. But for the pedantic, here’s how the beta distribution’s pdf is formally defined:

pdf(x) = Γ(α + β) / (Γ(α) Γ(β)) · x^(α − 1) · (1 − x)^(β − 1)

where Γ is the gamma function—you can think of it as being a generalization of factorials to the real numbers. That is, Γ(n) = (n − 1)! for positive integers n. Excel, many calculators, and any scientific programming package will be able to calculate that for you easily. Most of these applications will even have the beta function already built in.
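Since our α and β are always whole numbers (flips plus one), the gamma functions in the normalizer reduce to factorials, and the pdf is a few lines of code. This is a sketch with made-up names, not part of any library, and the Integer-to-Double conversion will overflow for very large flip counts:

```haskell
-- Gamma(n) = (n - 1)! for whole numbers, so the normalizer
-- Gamma(a + b) / (Gamma(a) * Gamma(b)) is a ratio of factorials.
factorial :: Integer -> Integer
factorial n = product [1 .. n]

-- Beta(a, b) density at x, where a = heads + 1 and b = tails + 1.
betaPdf :: Integer -> Integer -> Double -> Double
betaPdf a b x =
    norm * x ** fromIntegral (a - 1) * (1 - x) ** fromIntegral (b - 1)
  where
    norm = fromIntegral (factorial (a + b - 1))
         / fromIntegral (factorial (a - 1) * factorial (b - 1))

-- Peak (mode) of the distribution for a, b > 1:
-- exactly the observed fraction of heads.
betaMode :: Integer -> Integer -> Double
betaMode a b = fromIntegral (a - 1) / fromIntegral (a + b - 2)
```

For coin 0, `betaMode 54 48` gives 0.53, matching the raw 53-heads-in-100 estimate.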
We’re finally ready to see just how biased our coins actually are!
Coin 0: Heads: 53, Tails: 47 (Beta(54, 48))
Coin 1: Heads: 55, Tails: 45 (Beta(56, 46))
Coin 2: Heads: 49, Tails: 51 (Beta(50, 52))
Coin 3: Heads: 41, Tails: 59 (Beta(42, 60))
Coin 4: Heads: 39, Tails: 61 (Beta(40, 62))
Coin 5: Heads: 27, Tails: 73 (Beta(28, 74))
Coin 6: Heads: 0, Tails: 100 (Beta(1, 101))
Amazingly, it takes some pretty big bends to make a biased coin. It is not until coin 3, which has an almost 90 degree bend, that we can say with any confidence that the coin is biased at all. People might notice if you tried to flip that coin to settle a bet!
But first, we must start from the beginning. What exactly is a time series? Anything that can be plotted on a line graph. For example, the price of Google stock is a time series:
As you can imagine, time series have been studied extensively. Most scientists use them at some point in their careers. Unsurprisingly, they have developed many techniques for analyzing them. If we can convert our images into time series, then all these tools become available to us. Therefore, the time series distance measure has two steps:
STEP 1: Convert the images into a time series
STEP 2: Find the distance between two images by finding the distance between their time series
We have our choice of several algorithms for each step. In the rest of this post, we will look at two algorithms for converting images into time series: radial scanning and linear scanning. Then, we will look at two algorithms for measuring the distance between time series: Euclidean distance and dynamic time warping. We will conclude by looking at the types of problems time series analysis handles best and worst.
STEP 1A: Creating a time series by radial scanning
Radial scanning is tricky to explain, but once it clicks you’ll realize that it is both simple and elegant. Here’s an example from a human skull:
First we find the skull’s outline. Then we find the distance from the center of the skull to each point on the skull’s outline (B). Finally, we plot those distances as a time series (C). The lines connecting the skull to the graph show where that point on the skull maps to the time series below. In this case, we started at the skull’s mouth and went clockwise.
Skulls from different species produce different time series:
Take a careful look at these skulls and their time series. Make sure you can spot the differences in the time series between each grouping. Don’t worry yet about how the groupings were made. Right now, just get a feel for how a shape can be converted into a time series.
Another example of radial scanning comes from Korea University. Here we are trying to determine a tree’s species based on its leaf shapes:^{2}
The labeled points on the leaf at left correspond to the labeled positions on the time series at right. Radial scanning is a popular technique for leaf classification because every species of plant has a characteristic leaf shape. Each leaf will be unique, but the pattern of peaks and valleys in the resulting time series should be similar if the species of plant is the same.
We can already tell that the graphs created by the skulls and the leaf look very different to the human eye. This is a good sign that radial scanning captures important information about the object’s shape that we will be able to use in the comparison step.
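A bare-bones version of radial scanning can be sketched in a few lines, assuming we already have the outline as an ordered list of boundary points (the names here are made up for illustration; real systems also need the image-processing step that extracts the outline):

```haskell
type Point = (Double, Double)

-- Center of the outline points.
centroid :: [Point] -> Point
centroid ps = (sum xs / n, sum ys / n)
  where
    (xs, ys) = unzip ps
    n        = fromIntegral (length ps)

-- Radial scan: walk the outline in order and record the distance
-- from the centroid to each boundary point.
radialScan :: [Point] -> [Double]
radialScan ps = [ sqrt ((x - cx) ^ 2 + (y - cy) ^ 2) | (x, y) <- ps ]
  where
    (cx, cy) = centroid ps
```

A circular outline produces a flat time series; bumps and spikes in the series correspond to bumps and spikes on the shape.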
STEP 1B: Creating a time series by linear scanning
Some objects just aren’t circular, so radial scanning makes no sense. One example is handwritten words. The University of Massachusetts has analyzed a large collection of George Washington’s letters using the linear scanning method.^{3} ^{4} The first image is a picture of the word “Alexandria” as Washington actually wrote it:
Then, we remove the tilt from the image. All of Washington’s writing has a fairly constant tilt, so this process is easy to automate.
Finally, we create a time series from the word:
To create this time series, we start at the left of the image and consider each column of pixels in turn. The value at each “time” is just the number of dark pixels in that column. If you look closely at the time series, you should be able to tell where each bump corresponds to a specific letter. Some letters, like the “d”, get two bumps in the time series because they have two areas with a high concentration of dark pixels.
We could have constructed the time series in other ways as well. For example, we could have counted the number of pixels from the top of the column to the first dark pixel. This would have created an outline of the top of the word. We simply have to consider our application carefully and decide which method will work the best.
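The column-counting version of linear scanning is even shorter. In this sketch (not the UMass code) the image is a list of pixel rows, with True meaning a dark pixel:

```haskell
import Data.List (transpose)

-- Linear scan: one time-series value per pixel column,
-- counting the dark (True) pixels in that column.
linearScan :: [[Bool]] -> [Int]
linearScan rows = map (length . filter id) (transpose rows)
```

Swapping in a different column summary, such as the row index of the first dark pixel, gives the outline-of-the-top variant described above.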
We now have two simple methods for creating time series from images. These are the simplest and most common methods, but not the only ones. WARP^{5} and Beam Angle Statistics^{6} are two examples of other methods. Which is best depends—as always—on the specific application. Now that we can create the time series, let’s figure out how to compare them.
STEP 2: Comparing the distances
The whole purpose of creating the time series was to create a distance measure that uses them. The easiest way to do this is the Euclidean distance. (This is the ordinary straight-line distance that we are used to.) Consider the two time series below:^{7}
To calculate the overall distance, we calculate the distance between each corresponding point in the time series. Corresponding points are connected by black lines. Notice that the first blue hump corresponds to a flat red area, so this causes the black lines to be shorter. The second red hump corresponds to a flat blue area, so the black lines are longer. Everywhere else, the two time series line up fairly well, so the black lines have a mostly constant height. (Normally, we would start the two time series at the same height, so the first black line would be zero; however, the time series have been moved apart to make the black lines easy to see.)
More formally,

distance = sqrt( (r_1 − b_1)^2 + (r_2 − b_2)^2 + … + (r_n − b_n)^2 )

where r_i is the height of the red series at “time” i, b_i is the height of the blue series at “time” i, and n is the length of the time series. This is a simple and fast calculation, running in time O(n).
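In code, the Euclidean distance between two equal-length series is essentially one line:

```haskell
-- Pointwise Euclidean distance between two equal-length time series.
euclidean :: [Double] -> [Double] -> Double
euclidean xs ys = sqrt (sum [ (x - y) ^ 2 | (x, y) <- zip xs ys ])
```

Note that `zip` silently truncates to the shorter series; a production version would check the lengths match first.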
A more sophisticated way to compare time series is called Dynamic Time Warping (DTW). DTW tries to compare similar areas in each time series with each other. Here are the same two time series compared with DTW:
In this case, each of the humps in the blue series is matched with a hump in the red series, and all the flat areas are paired together. Notice that a single point in one time series can align with multiple points in the other. In this case, DTW gives a distance nearly zero—it is a nearly perfect match. Euclidean distance had a much worse match and would give a large distance.
For most applications, dynamic time warping outperforms straight Euclidean distance. Take a look at this dendrogram clustering:
The orange series contain three humps, the green four, and the blue five. But the humps do not line up, so this is a difficult problem for straight Euclidean distance. In contrast, DTW successfully clustered the time series based on the number of humps they have.
That’s great, but how did DTW decide which points in the red and blue time series should align?
Exhaustive search. We try every possible alignment and pick the one that works best. This will be easier to see with a simpler example:
To perform the search, we create an n × n matrix. Each row corresponds to a time along the red series, and each column corresponds to a time along the blue series. The value of each cell (i, j) is the distance between r_i and b_j. This effectively compares every time in the red series with every time in the blue series. Then, we select the path through the matrix that minimizes the total distance:
The colored boxes correspond to the colored lines connecting the two time series in the first image. For example, the four light blue squares in the top right are on a single row, so they map one point on the red series to four points on the blue one.
Using dynamic programming, DTW is an O(n^2) algorithm, which is much slower than Euclidean distance’s O(n). This is a serious problem if we want to use the algorithm to search a large database.
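The whole dynamic program fits in a dozen lines. This sketch builds the cost matrix as a lazy Data.Array and returns the DTW distance itself rather than the alignment path:

```haskell
import Data.Array

-- Dynamic time warping distance, O(n * m) time and space.
-- table!(i, j) holds the cheapest alignment cost of the first i points
-- of xs against the first j points of ys; laziness fills cells on demand.
dtw :: [Double] -> [Double] -> Double
dtw xs ys = table ! (n, m)
  where
    n  = length xs
    m  = length ys
    ax = listArray (1, n) xs
    ay = listArray (1, m) ys
    table = listArray ((0, 0), (n, m))
              [ cell i j | i <- [0 .. n], j <- [0 .. m] ]
    cell 0 0 = 0
    cell _ 0 = 1 / 0   -- infinity: can't align against an empty prefix
    cell 0 _ = 1 / 0
    cell i j = abs (ax ! i - ay ! j)
             + minimum [ table ! (i - 1, j - 1)   -- match both points
                       , table ! (i - 1, j)       -- repeat a blue point
                       , table ! (i, j - 1) ]     -- repeat a red point
```

Because a point may be matched several times, `dtw [1,2,3] [1,2,2,3]` is 0 even though the series have different lengths, which is exactly the hump-matching behavior described above.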
The easiest way to speed up the algorithm is to calculate only a small fraction of the matrix. Intuitively, we want our warping path to stay relatively close to a diagonal line. If it stays exactly on the diagonal line, then every red time corresponds exactly to one blue time. This is the same as the Euclidean distance. At the opposite extreme would be a path that follows the left-most, then top-most edges. In this case we are comparing the first blue value to all red values and the last red value to all blue values. This seems unlikely to make a good match.
There are two common ways to limit the number of calculations. First is the Sakoe-Chiba band:
The second method is the Itakura parallelogram:
The basic idea behind these restrictions is pretty straightforward from their pictures. What isn’t straightforward, however, is that these techniques also increase DTW’s accuracy.^{8} DTW was introduced to the data mining community in 1994.^{9} For over a decade researchers tried to find ways to increase the amount of the matrix they could search because they falsely believed that this would lead to more accurate results.
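Adding a Sakoe-Chiba band to the dynamic program is a single extra guard: any cell farther than r from the diagonal is treated as unreachable, so only a band of roughly O(n * r) cells ever gets evaluated. This is a self-contained sketch under the same assumptions as the plain DTW recurrence:

```haskell
import Data.Array

-- DTW restricted to a Sakoe-Chiba band of half-width r: cells with
-- |i - j| > r are treated as infinitely expensive, so laziness never
-- forces anything outside the band.
dtwBand :: Int -> [Double] -> [Double] -> Double
dtwBand r xs ys = table ! (n, m)
  where
    n  = length xs
    m  = length ys
    ax = listArray (1, n) xs
    ay = listArray (1, m) ys
    table = listArray ((0, 0), (n, m))
              [ cell i j | i <- [0 .. n], j <- [0 .. m] ]
    cell 0 0 = 0
    cell i j
      | i == 0 || j == 0 || abs (i - j) > r = 1 / 0  -- outside the band
      | otherwise = abs (ax ! i - ay ! j)
                  + minimum [ table ! (i - 1, j - 1)
                            , table ! (i - 1, j)
                            , table ! (i, j - 1) ]
```

With r = 0 the path is pinned to the diagonal, recovering the pointwise comparison; widening r smoothly trades speed for warping freedom.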
We can also speed up the calculation using an approximation function called a lower bound. A lower bound is computationally much cheaper than the full DTW function (a good one might run 1000 times faster) and is always less than or equal to the real DTW distance. We can run the lower bound on millions of images, and only select the potentially closest matches to run the full DTW algorithm on. Two good lower bounds are LB_Improved^{10} and LB_Keogh.^{11}
Finally, there are other methods for comparing time series. The most common is called Longest Common Subsequence (LCSS). It is useful for matching images suffering from occlusion.^{12}
When to use Time Series Analysis
Time series analysis is only sensitive to an object’s shape. It is invariant to colors and internal features. These properties make time series analysis good for comparing rigid objects, such as skulls, leaves, and handwriting. These shapes do not change over time, so they will have similar time series no matter when they are measured.
Time series analysis will not work on objects that can change their shapes over time. People are good examples of this, because we have many different postures. We can walk, sit, or curl into a ball. Another distance measure called “shock graphs” is better for comparing the shapes of objects that can move. We’ll cover shock graphs in a later post.
Footnotes