<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>My Experiments in Truth &#187; Computer Science</title>
	<atom:link href="http://izbicki.me/blog/category/computer-science/feed" rel="self" type="application/rss+xml" />
	<link>http://izbicki.me/blog</link>
	<description>Writing about computer science and religion</description>
	<lastBuildDate>Tue, 11 Jun 2013 17:59:15 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
<xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
		<item>
		<title>HLearn&#8217;s code is shorter and clearer than Weka&#8217;s</title>
		<link>http://izbicki.me/blog/hlearns-code-is-shorter-and-clearer-than-wekas?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=hlearns-code-is-shorter-and-clearer-than-wekas</link>
		<comments>http://izbicki.me/blog/hlearns-code-is-shorter-and-clearer-than-wekas#comments</comments>
		<pubDate>Tue, 11 Jun 2013 17:50:09 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Haskell]]></category>

		<guid isPermaLink="false">http://izbicki.me/blog/?p=2520</guid>
		<description><![CDATA[Haskell code is expressive.  The HLearn library uses 6 lines of Haskell to define a function for training a Bayesian classifier; the equivalent code in the Weka library uses over 100 lines of Java.  That&#8217;s a big difference!  In this post, we&#8217;ll look at the actual code and see why the Haskell is so much [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignright  wp-image-2478" alt="weka-lambda-haskell" src="http://izbicki.me/blog/wp-content/uploads/2013/05/weka-lambda-haskell-300x150.png" width="240" height="120" /></p>
<p>Haskell code is expressive.  The <a href="https://github.com/mikeizbicki/HLearn">HLearn library</a> uses 6 lines of Haskell to define a function for training a Bayesian classifier; the equivalent code in the <a href="http://www.cs.waikato.ac.nz/ml/weka/">Weka library</a> uses over 100 lines of Java.  That&#8217;s a big difference!  In this post, we&#8217;ll look at the actual code and see why the Haskell is so much more concise.</p>
<p><strong>But first, a disclaimer:</strong>  It is really hard to fairly compare two code bases this way.  In both libraries, there is a lot of supporting code that goes into defining each classifier, and it&#8217;s not obvious what code to include and what to leave out.  For example, both libraries implement interfaces to a number of probability distributions, and this code is not included in the line counts.  The Haskell code takes more advantage of this abstraction, so this is one language-agnostic reason why the Haskell code is shorter.  If you think I&#8217;m not doing a fair comparison, here are links to the full source files so you can check for yourself:</p>
<ul>
<li><a href="https://github.com/mikeizbicki/HLearn/blob/master/HLearn-classification/src/HLearn/Models/Classifiers/Bayes.hs">HLearn&#8217;s Bayesian classifier source code</a> (74 lines of code)</li>
<li><a href="https://svn.cms.waikato.ac.nz/svn/weka/trunk/weka/src/main/java/weka/classifiers/bayes/NaiveBayes.java">Weka&#8217;s naive Bayes source code</a> (946 lines of code)</li>
</ul>
<p><span id="more-2520"></span></p>
<h3>The HLearn code</h3>
<p>HLearn implements training for a <a href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier">Bayesian classifier</a> with these six lines of Haskell:</p>
<pre>newtype Bayes labelIndex dist = Bayes dist
    deriving (Read,Show,Eq,Ord,Monoid,Abelian,Group)

instance (Monoid dist, HomTrainer dist) =&gt; HomTrainer (Bayes labelIndex dist) where
    type Datapoint (Bayes labelIndex dist) = Datapoint dist
    train1dp dp = Bayes $ train1dp dp</pre>
<p>This code elegantly captures how to train a Bayesian classifier&#8212;just train a probability distribution.  Here&#8217;s an explanation:</p>
<ul>
<li>The first two lines define the Bayes data type as a wrapper around a distribution.</li>
<li>The fourth line says that we&#8217;re implementing the Bayesian classifier using the HomTrainer type class.  We do this because <strong>the Haskell compiler automatically generates a parallel batch training function, an online training function, and a fast cross-validation function for all HomTrainer instances.</strong></li>
<li>The fifth line says that our data points have the same type as the underlying distribution.</li>
<li>The sixth line says that in order to train, just train the corresponding distribution.</li>
</ul>
<p>We only get the benefits of the HomTrainer type class because the Bayesian classifier is a monoid.  But we didn&#8217;t even have to specify what the monoid instance for Bayesian classifiers looks like!  In this case, it&#8217;s automatically derived from the monoid instances for the base distributions using a language extension called <a href="http://www.haskell.org/ghc/docs/7.6.1/html/users_guide/deriving.html">GeneralizedNewtypeDeriving</a>.  For examples of these monoid structures, check out the algebraic structure of the <a href="http://izbicki.me/blog/gausian-distributions-are-monoids">normal</a> and <a href="http://izbicki.me/blog/the-categorical-distributions-algebraic-structure">categorical</a> distributions, or more complex distributions using <a href="http://izbicki.me/blog/markov-networks-monoids-and-futurama">Markov networks</a>.</p>
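<p>To make the role of the monoid concrete, here is a minimal, self-contained sketch of the idea.  It is not HLearn&#8217;s actual code: the ToyTrainer class, the MeanDist type, and the Wrapper newtype are hypothetical stand-ins for HomTrainer, a base distribution, and Bayes.  The sketch assumes a modern GHC (8.4 or later), where a Monoid instance also needs a Semigroup instance.  Only the single-point trainer is written by hand; the batch and online trainers fall out of the monoid structure.</p>
<pre>{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE TypeFamilies #-}

-- Hypothetical, simplified stand-in for HLearn's HomTrainer class.
class Monoid model =&gt; ToyTrainer model where
    type Point model
    -- The only function an instance has to write.
    trainOne :: Point model -&gt; model

    -- Batch training: train on each point and merge the results.
    trainBatch :: [Point model] -&gt; model
    trainBatch = foldMap trainOne

    -- Online training: merge a one-point model into an existing model.
    addOne :: model -&gt; Point model -&gt; model
    addOne m dp = m &lt;&gt; trainOne dp

-- A toy "distribution": a count and a running sum.
data MeanDist = MeanDist !Int !Double
    deriving (Show, Eq)

instance Semigroup MeanDist where
    MeanDist n1 s1 &lt;&gt; MeanDist n2 s2 = MeanDist (n1 + n2) (s1 + s2)

instance Monoid MeanDist where
    mempty = MeanDist 0 0

instance ToyTrainer MeanDist where
    type Point MeanDist = Double
    trainOne x = MeanDist 1 x

-- The wrapper borrows its Semigroup and Monoid instances from the
-- wrapped type via GeneralizedNewtypeDeriving.
newtype Wrapper dist = Wrapper dist
    deriving (Show, Eq, Semigroup, Monoid)

instance ToyTrainer dist =&gt; ToyTrainer (Wrapper dist) where
    type Point (Wrapper dist) = Point dist
    trainOne = Wrapper . trainOne

main :: IO ()
main = do
    let batch  = trainBatch [1, 2, 3] :: Wrapper MeanDist
        online = addOne (trainBatch [1, 2]) 3 :: Wrapper MeanDist
    -- Batch training and online training agree.
    print (batch == online)</pre>
<p>The design point is that the instance for Wrapper never mentions merging at all; it only says how to train on one data point, and everything else is inherited.</p>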
<h3>The Weka code</h3>
<p>Look for these differences between the HLearn and Weka source:</p>
<ul>
<li>In Weka we must separately define the online and batch trainers, whereas Haskell derived these for us automatically.</li>
<li>Weka must perform a variety of error handling that Haskell&#8217;s type system takes care of in HLearn.</li>
<li>The Weka code is tightly coupled to the underlying probability distribution, whereas the Haskell code was generic enough to handle any distribution. This means that while Weka must make the &#8220;naive bayes assumption&#8221; that all attributes are independent of each other, HLearn can support any dependence structure.</li>
<li>Weka&#8217;s code is made more verbose by for loops and if statements that aren&#8217;t necessary for HLearn.</li>
<li>The Java code requires extensive comments to maintain readability, but the Haskell code is simple enough to be self-documenting (at least once you know how to read Haskell).</li>
<li>Weka does not have parallel training, fast cross-validation, data point subtraction, or weighted data points, but HLearn does.</li>
</ul>
<pre>/**
   * Generates the classifier.
   *
   * @param instances set of instances serving as training data 
   * @exception Exception if the classifier has not been generated 
   * successfully
   */
  public void buildClassifier(Instances instances) throws Exception {

    // can classifier handle the data?
    getCapabilities().testWithFail(instances);

    // remove instances with missing class
    instances = new Instances(instances);
    instances.deleteWithMissingClass();

    m_NumClasses = instances.numClasses();

    // Copy the instances
    m_Instances = new Instances(instances);

    // Discretize instances if required
    if (m_UseDiscretization) {
      m_Disc = new weka.filters.supervised.attribute.Discretize();
      m_Disc.setInputFormat(m_Instances);
      m_Instances = weka.filters.Filter.useFilter(m_Instances, m_Disc);
    } else {
      m_Disc = null;
    }

    // Reserve space for the distributions
    m_Distributions = new Estimator[m_Instances.numAttributes() - 1]
      [m_Instances.numClasses()];
    m_ClassDistribution = new DiscreteEstimator(m_Instances.numClasses(), 
                                                true);
    int attIndex = 0;
    Enumeration enu = m_Instances.enumerateAttributes();
    while (enu.hasMoreElements()) {
      Attribute attribute = (Attribute) enu.nextElement();

      // If the attribute is numeric, determine the estimator 
      // numeric precision from differences between adjacent values
      double numPrecision = DEFAULT_NUM_PRECISION;
      if (attribute.type() == Attribute.NUMERIC) {
	m_Instances.sort(attribute);
	if ( (m_Instances.numInstances() &gt; 0)
	    &amp;&amp; !m_Instances.instance(0).isMissing(attribute)) {
	  double lastVal = m_Instances.instance(0).value(attribute);
	  double currentVal, deltaSum = 0;
	  int distinct = 0;
	  for (int i = 1; i &lt; m_Instances.numInstances(); i++) { 	    
            Instance currentInst = m_Instances.instance(i); 	    
              if (currentInst.isMissing(attribute)) {
                break; 	    
              }
 	    currentVal = currentInst.value(attribute);
 	    if (currentVal != lastVal) {
 	      deltaSum += currentVal - lastVal;
 	      lastVal = currentVal;
 	      distinct++;
 	    }
 	  }
 	  if (distinct &gt; 0) {
	    numPrecision = deltaSum / distinct;
	  }
	}
      }

      for (int j = 0; j &lt; m_Instances.numClasses(); j++) {
	switch (attribute.type()) {
	case Attribute.NUMERIC: 
	  if (m_UseKernelEstimator) {
	    m_Distributions[attIndex][j] = 
	      new KernelEstimator(numPrecision);
	  } else {
	    m_Distributions[attIndex][j] = 
	      new NormalEstimator(numPrecision);
	  }
	  break;
	case Attribute.NOMINAL:
	  m_Distributions[attIndex][j] = 
	    new DiscreteEstimator(attribute.numValues(), true);
	  break;
	default:
	  throw new Exception("Attribute type unknown to NaiveBayes");
	}
      }
      attIndex++;
    }

    // Compute counts
    Enumeration enumInsts = m_Instances.enumerateInstances();
    while (enumInsts.hasMoreElements()) {
      Instance instance = 
	(Instance) enumInsts.nextElement();
      updateClassifier(instance);
    }

    // Save space
    m_Instances = new Instances(m_Instances, 0);
  }</pre>
<p>And the code for online learning is:</p>
<pre>/**
   * Updates the classifier with the given instance.
   *
   * @param instance the new training instance to include in the model 
   * @exception Exception if the instance could not be incorporated in
   * the model.
   */
  public void updateClassifier(Instance instance) throws Exception {

    if (!instance.classIsMissing()) {
      Enumeration enumAtts = m_Instances.enumerateAttributes();
      int attIndex = 0;
      while (enumAtts.hasMoreElements()) {
	Attribute attribute = (Attribute) enumAtts.nextElement();
	if (!instance.isMissing(attribute)) {
	  m_Distributions[attIndex][(int)instance.classValue()].
            addValue(instance.value(attribute), instance.weight());
	}
	attIndex++;
      }
      m_ClassDistribution.addValue(instance.classValue(),
                                   instance.weight());
    }
  }</pre>
<h3>Conclusion</h3>
<p>Every algorithm implemented in HLearn uses similarly concise code.  I invite you to <a href="https://github.com/mikeizbicki/HLearn/">browse the repository</a> and see for yourself.  The most complicated algorithm is for Markov chains, which uses only <a href="https://github.com/mikeizbicki/HLearn/blob/master/HLearn-markov/src/HLearn/Models/Markov/MarkovChain.hs">6 lines for training, and about 20 for defining the Monoid</a>.</p>
<p>You can expect lots of tutorials on how to incorporate the HLearn library into Haskell programs over the next few months.</p>
<p>Subscribe to the <a href="http://izbicki.me/blog/feed">RSS feed</a> to stay tuned!</p>
]]></content:encoded>
			<wfw:commentRss>http://izbicki.me/blog/hlearns-code-is-shorter-and-clearer-than-wekas/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>HLearn cross-validates &gt;400x faster than Weka</title>
		<link>http://izbicki.me/blog/hlearn-cross-validates-400x-faster-than-weka?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=hlearn-cross-validates-400x-faster-than-weka</link>
		<comments>http://izbicki.me/blog/hlearn-cross-validates-400x-faster-than-weka#comments</comments>
		<pubDate>Mon, 03 Jun 2013 15:33:16 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Haskell]]></category>

		<guid isPermaLink="false">http://izbicki.me/blog/?p=2468</guid>
		<description><![CDATA[Weka is one of the most popular tools for data analysis.  But Weka takes 70 minutes to perform leave-one-out cross-validation using a simple naive Bayes classifier on the census income data set, whereas Haskell&#8217;s HLearn library only takes 9 seconds.  Weka is 465x slower! Code and instructions for reproducing these experiments are available on GitHub. Why is HLearn so much faster? [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignright  wp-image-2478" alt="weka-lambda-haskell" src="http://izbicki.me/blog/wp-content/uploads/2013/05/weka-lambda-haskell-300x150.png" width="240" height="120" /><a href="http://www.cs.waikato.ac.nz/~ml/weka/">Weka</a> is one of the most popular tools for data analysis.  But Weka takes <strong>70 minutes</strong> to perform leave-one-out cross-validation using a simple <a href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier">naive Bayes classifier</a> on the <a href="http://archive.ics.uci.edu/ml/datasets/Census-Income+(KDD)">census income</a> data set, whereas Haskell&#8217;s <a href="https://github.com/mikeizbicki/HLearn">HLearn</a> library only takes <strong>9 seconds</strong>.  Weka is 465x slower!</p>
<p><strong>Code and instructions for reproducing these experiments are <a href="https://github.com/mikeizbicki/HLearn/tree/master/HLearn-classification/src/examples/weka-cv#readme">available on GitHub</a>.</strong></p>
<p><strong><span id="more-2468"></span></strong></p>
<p>Why is HLearn so much faster?</p>
<p>Well, it turns out that the Bayesian classifier has the algebraic structure of a <a href="https://en.wikipedia.org/wiki/Monoid">monoid</a>, a <a href="https://en.wikipedia.org/wiki/Abelian_group">group</a>, and a <a href="https://en.wikipedia.org/wiki/Vector_space">vector space</a>.  HLearn uses a new cross-validation algorithm that can exploit these algebraic structures.  The standard algorithm runs in time <span id='tex_5229'>\(\Theta(kn)\)</span>, where <span id='tex_6636'>\(k\)</span> is the number of &#8220;folds&#8221; and <span id='tex_2370'>\(n\)</span> is the number of data points.  The algebraic algorithms, however, run in time <span id='tex_1643'>\(\Theta(n)\)</span>.  In other words, it doesn&#8217;t matter how many folds we do, the run time does not grow with the number of folds!  And not only are we faster, but we get the <em>exact same answer</em>.  Algebraic cross-validation is not an approximation, it&#8217;s just fast.</p>
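<p>Here is a rough sketch of how a monoid makes this possible.  It is an illustration of the idea, not HLearn&#8217;s implementation, and the function name leaveOneFoldOut is made up for this example.  We assume that one model has already been trained per fold (which touches every data point once) and that merging two models takes constant time.  The model for &#8220;everything except fold i&#8221; is then the merge of a prefix and a suffix, so all k of them cost only a linear number of merges.</p>
<pre>import Data.Monoid (Sum(..))

-- Given one trained model per fold, return for each fold the model
-- trained on all of the OTHER folds.
leaveOneFoldOut :: Monoid m =&gt; [m] -&gt; [m]
leaveOneFoldOut ms = zipWith (&lt;&gt;) prefixes suffixes
  where
    -- prefixes !! i merges the folds before fold i
    prefixes = scanl (&lt;&gt;) mempty ms
    -- suffixes !! i merges the folds after fold i
    suffixes = drop 1 (scanr (&lt;&gt;) mempty ms)

main :: IO ()
main = do
    -- Stand-in "models": each fold is summarized by a single number.
    let foldModels = map Sum [1, 2, 3 :: Int]
    print (map getSum (leaveOneFoldOut foldModels))  -- prints [5,4,3]</pre>
<p>Because the merged models are exactly the models we would have trained from scratch, the cross-validation result is unchanged; only the amount of repeated work goes down.</p>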
<p>Here are some run times for k-fold cross-validation on the census income data set.  Notice that HLearn&#8217;s run time is constant as we add more folds.</p>
<p style="text-align: center;"><img class="aligncenter size-full wp-image-2479" alt="k-fold-cross-validation-weka" src="http://izbicki.me/blog/wp-content/uploads/2013/05/k-fold-cross-validation-weka1.png" width="555" height="336" /></p>
<p>And when we set k=n, we have leave-one-out cross-validation.  Notice that Weka&#8217;s cross-validation has quadratic run time, whereas HLearn has linear run time.</p>
<p style="text-align: center;"><img class="aligncenter size-full wp-image-2480" alt="leave-one-out-fast-cross-validation-weka" src="http://izbicki.me/blog/wp-content/uploads/2013/05/leave-one-out-fast-cross-validation-weka1.png" width="553" height="333" /></p>
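<p>When the model is also a group, the leave-one-out case becomes even simpler: train once on everything, then &#8220;subtract&#8221; each data point in turn.  The sketch below shows the idea under stated assumptions.  The Group class here is a hypothetical one written for this example (it is not HLearn&#8217;s), and the trick as written relies on the group being commutative.</p>
<pre>import Data.Monoid (Sum(..))

-- Hypothetical minimal group class: a monoid with inverses.
class Monoid g =&gt; Group g where
    inverse :: g -&gt; g

instance Num a =&gt; Group (Sum a) where
    inverse (Sum x) = Sum (negate x)

-- Given one single-point model per data point, return the n models
-- that each leave out exactly one point.  Assumes a commutative group.
leaveOneOut :: Group g =&gt; [g] -&gt; [g]
leaveOneOut ms = [ total &lt;&gt; inverse m | m &lt;- ms ]
  where
    total = mconcat ms  -- computed once and shared by every element

main :: IO ()
main = print (map getSum (leaveOneOut (map Sum [1, 2, 3 :: Int])))
-- prints [5,4,3]</pre>
<p>This does one merge per data point after a single pass over the data, which is where the linear run time comes from.</p>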
<p>HLearn certainly isn&#8217;t going to replace Weka any time soon, but it&#8217;s got a number of cool tricks like this going on inside.  If you want to read more, you should check out these two recent papers:</p>
<ul>
<li>(ICML13) <a href="http://izbicki.me/public/papers/icml2013-algebraic-classifiers.pdf">Algebraic Classifiers: a generic approach to fast cross-validation, online training, and parallel training</a></li>
<li>(TFP13) <a href="http://izbicki.me/public/papers/tfp2013-hlearn-a-machine-learning-library-for-haskell.pdf">HLearn: a machine learning library for Haskell</a></li>
</ul>
<p>I&#8217;ll continue to write more about these tricks in future blog posts.</p>
<p>Subscribe to the <a href="http://izbicki.me/blog/feed">RSS feed</a> to stay tuned.</p>
]]></content:encoded>
			<wfw:commentRss>http://izbicki.me/blog/hlearn-cross-validates-400x-faster-than-weka/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Markov Networks, Monoids, and Futurama</title>
		<link>http://izbicki.me/blog/markov-networks-monoids-and-futurama?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=markov-networks-monoids-and-futurama</link>
		<comments>http://izbicki.me/blog/markov-networks-monoids-and-futurama#comments</comments>
		<pubDate>Thu, 09 May 2013 15:14:43 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Haskell]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://izbicki.me/blog/?p=2229</guid>
		<description><![CDATA[In this post, we&#8217;re going to look at how to manipulate multivariate distributions in Haskell&#8217;s HLearn library.  There are many ways to represent multivariate distributions, but we&#8217;ll use a technique called Markov networks.  These networks have the algebraic structure called a monoid (and group and vector space), and training them is a homomorphism.  Despite the scary names, these mathematical [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignright size-medium wp-image-2241" alt="fry" src="http://izbicki.me/blog/wp-content/uploads/2013/05/fry-300x225.jpg" width="300" height="225" />In this post, we&#8217;re going to look at how to manipulate multivariate distributions in Haskell&#8217;s <a href="https://github.com/mikeizbicki/HLearn">HLearn library</a>.  There are many ways to represent multivariate distributions, but we&#8217;ll use a technique called <a href="https://en.wikipedia.org/wiki/Markov_random_field">Markov networks</a>.  These networks have the algebraic structure called a <a href="https://en.wikipedia.org/wiki/Monoid">monoid</a> (and group and vector space), and training them is a <a href="https://en.wikipedia.org/wiki/Monoid_homomorphism#Monoid_homomorphisms">homomorphism</a>.  Despite the scary names, these mathematical structures make working with our distributions really easy and convenient&#8212;they give us online and parallel training algorithms &#8220;for free.&#8221;  If you want to go into the details of how, you can check out my <a href="http://izbicki.me/public/papers/tfp2013-hlearn-a-machine-learning-library-for-haskell.pdf">TFP13 submission</a>, but in this post we&#8217;ll ignore those mathy details to focus on how to use the library in practice.  We&#8217;ll use a running example of creating a distribution over characters in the show Futurama.</p>
<p><span id="more-2229"></span></p>
<h3>Preliminaries: Creating the data types</h3>
<p>As usual, this post is a literate Haskell file.  To run this code, you&#8217;ll need to install the <a href="http://hackage.haskell.org/package/HLearn-distributions">hlearn-distributions</a> package.  This package requires GHC 7.6 or later.</p>
<pre>bash&gt; cabal install hlearn-distributions-1.0.0.1</pre>
<p>Now for some code.  We start with our language extensions and imports:</p>
<pre>&gt;{-# LANGUAGE DataKinds #-}
&gt;{-# LANGUAGE TypeFamilies #-}
&gt;{-# LANGUAGE TemplateHaskell #-}
&gt;
&gt;import HLearn.Algebra
&gt;import HLearn.Models.Distributions</pre>
<p>Next, we&#8217;ll create a data type to represent Futurama characters.  There are a lot of characters, so we&#8217;ll need to keep things pretty organized.  The data type will have a record for everything we might want to know about a character.  Each of these records will be one of the variables in our multivariate distribution, and all of our data points will have this type.</p>
<p><img class="aligncenter size-large wp-image-2250" alt="FuturamaCast" src="http://izbicki.me/blog/wp-content/uploads/2013/05/FuturamaCast-1024x439.png" width="500" height="214" /></p>
<pre>&gt;data Character = Character
&gt;   { _name      :: String
&gt;   , _species   :: String
&gt;   , _job       :: Job
&gt;   , _isGood    :: Maybe Bool
&gt;   , _age       :: Double -- in years
&gt;   , _height    :: Double -- in feet
&gt;   , _weight    :: Double -- in pounds
&gt;   }
&gt;   deriving (Read,Show,Eq,Ord)
&gt;
&gt;data Job = Manager | Crew | Henchman | Other
&gt;   deriving (Read,Show,Eq,Ord)</pre>
<p>Now, in order for our library to be able to interpret the Character type, we call the Template Haskell function:</p>
<pre>&gt;makeTypeLenses ''Character</pre>
<p>This function creates a bunch of data types and type classes for us.  These &#8220;type lenses&#8221; give us a type-safe way to reference the different variables in our multivariate distribution.  We&#8217;ll see how to use these type-level lenses a bit later.  There&#8217;s no need to understand what&#8217;s going on under the hood, but if you&#8217;re curious then check out the <a href="http://hackage.haskell.org/packages/archive/HLearn-distributions/1.0.0.1/doc/html/HLearn-Models-Distributions-Multivariate-Internal-TypeLens.html">hackage documentation</a> or <a href="https://github.com/mikeizbicki/HLearn/blob/master/HLearn-distributions/src/HLearn/Models/Distributions/Multivariate/Internal/TypeLens.hs">source code</a>.</p>
<h3>Training a distribution</h3>
<p>Now, we&#8217;re ready to create a data set and start training.  Here&#8217;s a list of the employees of Planet Express provided by the resident bureaucrat Hermes Conrad.  This list will be our first data set.</p>
<p><img class="aligncenter size-full wp-image-2306" alt="hermes-zoom" src="http://izbicki.me/blog/wp-content/uploads/2013/05/hermes-zoom.png" width="700" height="250" /></p>
<pre>&gt;planetExpress = 
&gt;   [ Character "Philip J. Fry"         "human" Crew     (Just True) 1026   5.8 195
&gt;   , Character "Turanga Leela"         "alien" Crew     (Just True) 43     5.9 170
&gt;   , Character "Professor Farnsworth"  "human" Manager  (Just True) 85     5.5 160
&gt;   , Character "Hermes Conrad"         "human" Manager  (Just True) 36     5.3 210
&gt;   , Character "Amy Wong"              "human" Other    (Just True) 21     5.4 140
&gt;   , Character "Zoidberg"              "alien" Other    (Just True) 212    5.8 225
&gt;   , Character "Cubert Farnsworth"     "human" Other    (Just True) 8      4.3 135
&gt;   ]</pre>
<p>Let&#8217;s train a distribution from this data.  Here&#8217;s how we would train a distribution where every variable is independent of every other variable:</p>
<pre>&gt;dist1 = train planetExpress :: Multivariate Character
&gt;  '[ Independent Categorical '[String,String,Job,Maybe Bool]
&gt;   , Independent Normal '[Double,Double,Double]
&gt;   ]
&gt;   Double</pre>
<p>In the HLearn library, we always use the function <strong>train</strong> to train a model from data points.  We specify which model to train in the type signature.</p>
<p>As you can see, the Multivariate distribution takes three type parameters.  The first parameter is the type of our data point, in this case Character.  The second parameter describes the dependency structure of our distribution.  We&#8217;ll go over the syntax for the dependency structure in a bit.  For now, just notice that it&#8217;s a type-level list of distributions.  Finally, the third parameter is the type we will use to store our probabilities.</p>
<p>What can we do with this distribution?  One simple task we can do is to find <a href="https://en.wikipedia.org/wiki/Marginal_distribution">marginal distributions</a>.  The marginal distribution is the distribution of a certain variable ignoring all the other variables.  For example, let&#8217;s say I want a distribution of the species that work at Planet Express.  I can get this by:</p>
<pre>&gt;dist1a = getMargin TH_species dist1</pre>
<p>Notice that we specified which variable we&#8217;re taking the marginal of by using the type-level lens TH_species.  This data constructor was automatically created for us by our Template Haskell function makeTypeLenses.  Every record in the data type has its own unique type lens.  Its name is the name of the record, prefixed by TH.  These lenses let us infer the types of our marginal distributions at compile time, rather than at run time.  For example, the type of the marginal distribution of species is:</p>
<pre>ghci&gt; :t dist1a
dist1a :: Categorical String Double</pre>
<p>That is, a categorical distribution whose data points are Strings and which stores probabilities as a Double.  Now, if I wanted a distribution of the weights of the employees, I can get that by:</p>
<pre>&gt;dist1b = getMargin TH_weight dist1</pre>
<p>And the type of this distribution is:</p>
<pre>ghci&gt; :t dist1b
dist1b :: Normal Double</pre>
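<p>If the idea of a marginal feels abstract, it can help to compute one by hand.  The snippet below is a standalone sketch that does not use HLearn at all; the names speciesColumn and marginal are made up for this example.  It takes just the species column of the planetExpress data and turns the counts into probabilities, which is all that a categorical marginal over an independent variable amounts to.</p>
<pre>import qualified Data.Map as Map

-- The species column of the planetExpress data set, copied by hand.
speciesColumn :: [String]
speciesColumn =
    ["human", "alien", "human", "human", "human", "alien", "human"]

-- Empirical categorical distribution: count each value, then normalize.
marginal :: Ord a =&gt; [a] -&gt; Map.Map a Double
marginal xs = Map.map (/ total) counts
  where
    counts = Map.fromListWith (+) [ (x, 1) | x &lt;- xs ]
    total  = fromIntegral (length xs)

main :: IO ()
main = print (Map.toList (marginal speciesColumn))</pre>
<p>With two aliens and five humans among the seven employees, the probabilities come out to 2/7 and 5/7.</p>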
<p>Now, I can easily plot these marginal distributions with the <strong>plotDistribution</strong> function:</p>
<pre>ghci&gt; plotDistribution (plotFile "dist1a") dist1a
ghci&gt; plotDistribution (plotFile "dist1b") dist1b</pre>
<p><center><br />
<img class="alignnone size-full wp-image-2271" alt="dist1a" src="http://izbicki.me/blog/wp-content/uploads/2013/05/dist1a.png" width="250" height="250" /><img class="alignnone size-full wp-image-2272" alt="dist1b" src="http://izbicki.me/blog/wp-content/uploads/2013/05/dist1b.png" width="250" height="250" /></center><br />
<img class=" wp-image-2310 alignright" alt="futurama-bender-smoking-cigar-wallpaper" src="http://izbicki.me/blog/wp-content/uploads/2013/05/futurama-bender-smoking-cigar-wallpaper-225x300.jpg" width="108" height="144" />But wait! I accidentally forgot to include Bender in the planetExpress data set! What can I do?</p>
<p>In a traditional statistics library, we would have to retrain our model from scratch.  If we had billions of elements in our data set, this would be an expensive mistake.  But in our HLearn library, we can take advantage of the model&#8217;s monoid structure.  In particular, the compiler used this structure to automatically derive a function called <strong>add1dp</strong> for us.  Let&#8217;s look at its type:</p>
<pre>ghci&gt; :t add1dp
add1dp :: HomTrainer model =&gt; model -&gt; Datapoint model -&gt; model</pre>
<p>It&#8217;s pretty simple.  The function takes a model and adds the data point associated with that model.  It returns the model we would have gotten if the data point had been in our original data set.  This is called online training.</p>
<p>Again, because our distributions form monoids, the compiler derived an efficient and exact online training algorithm for us automatically.</p>
<p>So let&#8217;s create a new distribution that considers Bender:</p>
<pre>&gt;bender = Character "Bender Rodriguez" "robot" Crew (Just True) 44 6.1 612
&gt;dist1' = add1dp dist1 bender</pre>
<p>And plot our new marginals:</p>
<pre>ghci&gt; plotDistribution (plotFile "dist1-withbender-species" $ PNG 250 250) $ 
                getMargin TH_species dist1'
ghci&gt; plotDistribution (plotFile "dist1-withbender-weight"  $ PNG 250 250) $ 
                getMargin TH_weight dist1'</pre>
<p><center><br />
<img class="alignnone size-full wp-image-2267" alt="dist1-withbender-species" src="http://izbicki.me/blog/wp-content/uploads/2013/05/dist1-withbender-species.png" width="250" height="250" /><img class="alignnone size-full wp-image-2268" alt="dist1-withbender-weight" src="http://izbicki.me/blog/wp-content/uploads/2013/05/dist1-withbender-weight.png" width="250" height="250" /></center><br />
Notice that our categorical marginal has clearly changed, but that our normal marginal doesn&#8217;t seem to have changed at all. This is because the plotting routines automatically scale the distribution, and the normal distribution, when scaled, always looks the same. We can double check that we actually did change the weight distribution by comparing the mean:</p>
<pre>ghci&gt; mean dist1b
176.42857142857142
ghci&gt; mean $ getMargin TH_weight dist1'
230.875</pre>
<p>Bender&#8217;s weight really changed the distribution after all!</p>
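<p>We can reproduce these two means without HLearn to see what the online update is doing.  The sketch below is an assumption-laden simplification: it keeps only a count and a sum (the hypothetical Moments type), which is enough for a mean but less than a real Normal distribution stores.  Adding Bender is a single monoid merge; none of the original seven data points are revisited.</p>
<pre>import Data.Monoid (Sum(..))

-- A count and a running sum; pairs of monoids are themselves monoids.
type Moments = (Sum Int, Sum Double)

-- The summary of a single data point.
point :: Double -&gt; Moments
point x = (Sum 1, Sum x)

meanOf :: Moments -&gt; Double
meanOf (Sum n, Sum s) = s / fromIntegral n

-- Weights from the planetExpress data set, in pounds.
crewWeights :: [Double]
crewWeights = [195, 170, 160, 210, 140, 225, 135]

withoutBender, withBender :: Moments
withoutBender = foldMap point crewWeights
withBender    = withoutBender &lt;&gt; point 612  -- one merge, no retraining

main :: IO ()
main = do
    print (meanOf withoutBender)  -- 1235 / 7, roughly 176.43
    print (meanOf withBender)     -- 1847 / 8 = 230.875</pre>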
<h3>Complicated dependence structures</h3>
<p>That&#8217;s cool, but our original distribution isn&#8217;t very interesting.  What makes multivariate distributions interesting is when the variables affect each other.  This is true in our case, so we&#8217;d like to be able to model it.  For example, we&#8217;ve already seen that robots are much heavier than organic lifeforms, and are throwing off our statistics.  The HLearn library supports a small subset of Markov Networks for expressing these dependencies.</p>
<p>We represent Markov Networks as graphs with undirected edges.  Every attribute in our distribution is a node, and every dependence between attributes is an edge.  We can draw this graph with the <strong>plotNetwork</strong> command:</p>
<pre>ghci&gt; plotNetwork "dist1-network" dist1</pre>
<p style="text-align: left;"><img class="size-medium wp-image-2276 aligncenter" alt="dist1-network" src="http://izbicki.me/blog/wp-content/uploads/2013/05/dist1-network-300x276.png" width="300" height="276" />As expected, there are no edges in our graph because everything is independent.  Let&#8217;s create a more interesting distribution and plot its Markov network.</p>
<pre>&gt;dist2 = train planetExpress :: Multivariate Character
&gt;  '[ Ignore                  '[String]
&gt;   , MultiCategorical        '[String]
&gt;   , Independent Categorical '[Job,Maybe Bool]
&gt;   , Independent Normal      '[Double,Double,Double]
&gt;   ]
&gt;   Double</pre>
<pre>ghci&gt; plotNetwork "dist2-network" dist2</pre>
<p style="text-align: center;"> <img class="size-medium wp-image-2277 aligncenter" alt="dist2-network" src="http://izbicki.me/blog/wp-content/uploads/2013/05/dist2-network-300x263.png" width="300" height="263" /></p>
<p>Okay, so what just happened?</p>
<p>The syntax for representing the dependence structure is a little confusing, so let&#8217;s go step by step.  We represent the dependence information in the graph as a list of types.  Each element in the list describes both the marginal distribution and the dependence structure for one or more records in our data type.  We must list these elements in the same order as the original data type.</p>
<p>Notice that we&#8217;ve made two changes to the list.  First, our list now starts with the type Ignore '[String].  This means that the first string in our data type (the name) will be ignored.  Notice that TH_name is no longer in the Markov Network.  This makes sense because we expect that a character&#8217;s name should not tell us too much about any of their other attributes.</p>
<p>Second, we&#8217;ve added a dependence.  The MultiCategorical distribution makes everything afterward in the list dependent on that item, but not the things before it.  This means that the exact types of dependencies it can specify are dependent on the order of the records in our data type.  Let&#8217;s see what happens if we change the location of the MultiCategorical:</p>
<pre>&gt;dist3 = train planetExpress :: Multivariate Character
&gt;  '[ Ignore '[String]
&gt;   , Independent Categorical '[String]
&gt;   , MultiCategorical '[Job]
&gt;   , Independent Categorical '[Maybe Bool]
&gt;   , Independent Normal '[Double,Double,Double]
&gt;   ]
&gt;   Double</pre>
<pre>ghci&gt; plotNetwork "dist3-network" dist3</pre>
<p style="text-align: center;"><img class="size-medium wp-image-2279 aligncenter" alt="dist3-network" src="http://izbicki.me/blog/wp-content/uploads/2013/05/dist3-network1-300x246.png" width="300" height="246" /></p>
<p>As you can see, the species attribute no longer has any relation to anything else.  Unfortunately, with this syntax the order of the list elements matters, and so does the order in which we declare the records of our data type.</p>
<p>Finally, we can substitute any valid univariate distribution for our Normal and Categorical distributions.  The HLearn library currently supports Binomial, Exponential, Geometric, LogNormal, and Poisson distributions.  These just don&#8217;t make much sense for modelling Futurama characters, so we&#8217;re not using them.</p>
<p>Now, we might be tempted to specify that every variable is fully dependent on every other variable.  In order to do this, we have to introduce the &#8220;Dependent&#8221; type.  Any valid multivariate distribution can follow Dependent, but only those records specified in the type-list will actually be dependent on each other.  For example:</p>
<pre>&gt;dist4 = train planetExpress :: Multivariate Character
&gt;  '[ Ignore '[String]
&gt;   , MultiCategorical '[String,Job,Maybe Bool]
&gt;   , Dependent MultiNormal '[Double,Double,Double]
&gt;   ]
&gt;   Double</pre>
<pre>ghci&gt; plotNetwork "dist4-network" dist4</pre>
<p><img class="aligncenter size-medium wp-image-2281" alt="distb-network" src="http://izbicki.me/blog/wp-content/uploads/2013/05/distb-network-300x226.png" width="300" height="226" /></p>
<p>Undoubtedly, this is always going to be the case: everything always has a slight influence on everything else.  Unfortunately, it is not easy in practice to model these fully dependent distributions.  We need roughly <span id='tex_8318'></span> data points to accurately train a distribution, where n is the number of nodes in our graph and e is the number of edges in our network.  Thus, by specifying that two attributes are independent of each other, we can greatly reduce the amount of data we need to train an accurate distribution.</p>
<p>I realize that this syntax is a little awkward.  I chose it because it was relatively easy to implement.  Future versions of the library should support a more intuitive syntax.  I also plan to use <a href="https://en.wikipedia.org/wiki/Copula_(probability_theory)">copulas</a> to greatly expand the expressiveness of these distributions.  In the meantime, the best way to figure out the dependencies in a Markov Network is just to plot it and look at them.</p>
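<p>To get a feel for why independence assumptions save so much data, here&#8217;s a rough sketch that counts the free parameters of a purely categorical model under the two extremes.  This is a standard parameter count, not the estimate quoted above, and the function names and the example cardinalities are made up for illustration.</p>

```haskell
-- Count the free parameters of a discrete model under two extreme
-- dependence structures. Each list element is the number of values
-- one categorical attribute can take.

-- | Fully dependent: one probability per joint outcome, minus one
-- because the probabilities must sum to 1.
jointParams :: [Integer] -> Integer
jointParams cards = product cards - 1

-- | Fully independent: every attribute gets its own small table,
-- each of which also loses one parameter to the sum-to-1 constraint.
independentParams :: [Integer] -> Integer
independentParams cards = sum (map (subtract 1) cards)

-- | Illustrative cardinalities (an assumption, not read off the real
-- data type): 3 species, 4 jobs, and the 3 values of a Maybe Bool.
exampleCards :: [Integer]
exampleCards = [3, 4, 3]
```

<p>With these assumed cardinalities the fully dependent model has 35 free parameters and the fully independent one has 7.  The gap grows multiplicatively with every attribute we add.</p>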
<p>Okay.  So what distribution makes the most sense for Futurama characters?  We&#8217;ll say that everything depends on both the characters&#8217; species and job, and that their weight depends on their height.</p>
<pre>&gt;planetExpressDist = train planetExpress :: Multivariate Character
&gt;  '[ Ignore '[String]
&gt;   , MultiCategorical '[String,Job]
&gt;   , Independent Categorical '[Maybe Bool]
&gt;   , Independent Normal '[Double]
&gt;   , Dependent MultiNormal '[Double,Double]
&gt;   ]
&gt;   Double</pre>
<pre>ghci&gt; plotNetwork "planetExpress-network" planetExpressDist</pre>
<p><img class="aligncenter size-medium wp-image-2280" alt="dist4-network" src="http://izbicki.me/blog/wp-content/uploads/2013/05/dist4-network-300x225.png" width="300" height="225" /></p>
<p>We still don&#8217;t have enough data to train this network, so let&#8217;s create some more.  We start by creating a type for our Markov network called FuturamaDist.  This is just for convenience so we don&#8217;t have to retype the dependence structure many times.</p>
<pre>&gt;type FuturamaDist = Multivariate Character
&gt;  '[ Ignore '[String]
&gt;   , MultiCategorical '[String,Job]
&gt;   , Independent Categorical '[Maybe Bool]
&gt;   , Independent Normal '[Double]
&gt;   , Dependent MultiNormal '[Double,Double]
&gt;   ]
&gt;   Double</pre>
<p>Next, we train some more distributions of this type on some of the characters.  We&#8217;ll start with Mom Corporation and the brave Space Forces.</p>
<p><center> <img class="alignnone size-full wp-image-2316" alt="200-futurama_mom_and_sons" src="http://izbicki.me/blog/wp-content/uploads/2013/05/200-futurama_mom_and_sons.jpg" width="304" height="200" /> <img class="alignnone size-full wp-image-2318" alt="200-kif and zapp" src="http://izbicki.me/blog/wp-content/uploads/2013/05/200-kif-and-zapp.jpg" width="267" height="200" /></center></p>
<pre>&gt;momCorporation = 
&gt;   [ Character "Mom"                   "human" Manager  (Just False) 100 5.5 130
&gt;   , Character "Walt"                  "human" Henchman (Just False) 22  6.1 170
&gt;   , Character "Larry"                 "human" Henchman (Just False) 18  5.9 180
&gt;   , Character "Igner"                 "human" Henchman (Just False) 15  5.8 175
&gt;   ]
&gt;momDist = train momCorporation :: FuturamaDist</pre>
<pre>&gt;spaceForce = 
&gt;   [ Character "Zapp Brannigan"        "human" Manager  (Nothing)   45  6.0 230
&gt;   , Character "Kif Kroker"            "alien" Crew     (Just True) 113 4.5 120
&gt;   ]
&gt;spaceDist = train spaceForce :: FuturamaDist</pre>
<p style="text-align: left;">And now some more robots:</p>
<p><center><img class="alignnone size-full wp-image-2319" alt="200-robotmafia" src="http://izbicki.me/blog/wp-content/uploads/2013/05/200-robotmafia.jpg" width="330" height="200" /> <img class="alignnone size-full wp-image-2317" alt="200-hedonismbot" src="http://izbicki.me/blog/wp-content/uploads/2013/05/200-hedonismbot.jpg" width="250" height="200" /></center></p>
<pre>&gt;robots = 
&gt;   [ bender
&gt;   , Character "Calculon"              "robot" Other    (Nothing)    123  6.8 650
&gt;   , Character "The Crushinator"       "robot" Other    (Nothing)    45   8.0 4500
&gt;   , Character "Clamps"                "robot" Henchman (Just False) 134  5.8 330
&gt;   , Character "DonBot"                "robot" Manager  (Just False) 178  5.8 520
&gt;   , Character "Hedonismbot"           "robot" Other    (Just False) 69   4.3 1200
&gt;   , Character "Preacherbot"           "robot" Manager  (Nothing)    45   5.8 350
&gt;   , Character "Roberto"               "robot" Other    (Just False) 77   5.9 250
&gt;   , Character "Robot Devil"           "robot" Other    (Just False) 895  6.0 280
&gt;   , Character "Robot Santa"           "robot" Other    (Just False) 488  6.3 950
&gt;   ]
&gt;robotDist = train robots :: FuturamaDist</pre>
<p>Now we&#8217;re going to take advantage of the monoid structure of our multivariate distributions to combine all of these distributions into one.</p>
<pre>&gt; futuramaDist = dist1 &lt;&gt; momDist &lt;&gt; spaceDist &lt;&gt; robotDist</pre>
<p>The resulting distribution is equivalent to having trained a distribution from scratch on all of the data points:</p>
<pre>train (planetExpress++momCorporation++spaceForce++robots) :: FuturamaDist</pre>
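<p>Here&#8217;s a minimal sketch of why merging trained models can agree with training on the merged data.  It uses a hypothetical Moments type (count, sum, and sum of squares, which is enough to recover a normal distribution&#8217;s mean and variance) rather than HLearn&#8217;s actual internal representation, which isn&#8217;t shown in this post.  The ages are the ones from the Mom Corporation and Space Force tables above.</p>

```haskell
-- | Hypothetical sufficient statistics for a univariate normal:
-- number of points, their sum, and the sum of their squares.
data Moments = Moments
  { mCount :: Double
  , mSum   :: Double
  , mSumSq :: Double
  } deriving (Eq, Show)

-- Merging two models just adds the statistics componentwise.
instance Semigroup Moments where
  Moments n1 s1 q1 <> Moments n2 s2 q2 =
    Moments (n1 + n2) (s1 + s2) (q1 + q2)

-- The model trained on no data is the identity element.
instance Monoid Moments where
  mempty = Moments 0 0 0

-- | "Training" turns every data point into a one-point model and merges them.
trainMoments :: [Double] -> Moments
trainMoments = foldMap (\x -> Moments 1 x (x * x))

-- | Mean recovered from the statistics (undefined for the empty model).
meanOf :: Moments -> Double
meanOf m = mSum m / mCount m

momAges, spaceAges :: [Double]
momAges   = [100, 22, 18, 15]
spaceAges = [45, 113]
```

<p>In ghci, trainMoments (momAges ++ spaceAges) and trainMoments momAges &lt;&gt; trainMoments spaceAges compare equal.  The values here are whole numbers, so the floating point sums are exact; with arbitrary Doubles the two sides may differ by rounding error.</p>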
<div>
<p>We can take advantage of this property any time we use the train function to automatically parallelize our code.  The higher order function <strong>parallel</strong> will split  the training task evenly over each of your available processors, then merge them together with the monoid operation.  This results in &#8220;theoretically perfect&#8221; parallel training of these models.</p>
<pre>parallel train (planetExpress++momCorporation++spaceForce++robots) :: FuturamaDist</pre>
<div>
<p>Again, this is only possible because the distributions have a monoid structure.</p>
</div>
<p>Now, let&#8217;s ask some questions of our distribution.  If I pick a character at random, what&#8217;s the probability that they&#8217;re a good guy?  Let&#8217;s plot the marginal.</p>
</div>
<pre>ghci&gt; plotDistribution (plotFile "goodguy" $ PNG 250 250) $ getMargin TH_isGood futuramaDist</pre>
<p><img class="aligncenter size-full wp-image-2294" alt="goodguy" src="http://izbicki.me/blog/wp-content/uploads/2013/05/goodguy.png" width="250" height="250" /></p>
<p>But what if I only want to pick from those characters that are humans, or those characters that are robots?  Statisticians call this conditioning.  We can do that with the condition function:</p>
<pre>ghci&gt; plotDistribution (plotFile "goodguy-human" $ PNG 250 250) $
             getMargin TH_isGood $ condition TH_species "human" futuramaDist
ghci&gt; plotDistribution (plotFile "goodguy-robot" $ PNG 250 250) $
             getMargin TH_isGood $ condition TH_species "robot" futuramaDist</pre>
<p>&nbsp;</p>
<p><center><img class="alignright" alt="Preacherbot" src="http://izbicki.me/blog/wp-content/uploads/2013/05/Preacherbot-174x300.jpg" width="174" height="300" /><img class="alignnone size-full wp-image-2295" alt="goodguy-human" src="http://izbicki.me/blog/wp-content/uploads/2013/05/goodguy-human.png" width="250" height="250" /> <img class="size-medium wp-image-2296 alignnone" alt="goodguy-robot" src="http://izbicki.me/blog/wp-content/uploads/2013/05/goodguy-robot.png" width="250" height="250" /></center>On the left is the plot for humans, and on the right the plot for robots.  Apparently, original robot sin is much worse than that in humans!  If only they would listen to Preacherbot and repent of their wicked ways&#8230;</p>
<p>Now let&#8217;s ask: What&#8217;s the average age of an evil robot?</p>
<pre>ghci&gt; mean $ getMargin TH_age $ 
         condition TH_isGood (Just False) $ condition TH_species "robot" futuramaDist 
273.0769230769231</pre>
<p>Notice that conditioning a distribution is a commutative operation.  That means we can condition in any order and still get the exact same results.  Let&#8217;s try it:</p>
<pre>ghci&gt; mean $ getMargin TH_age $ 
         condition TH_species "robot" $ condition TH_isGood (Just False) futuramaDist 
273.0769230769231</pre>
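<p>One way to see why the order doesn&#8217;t matter: on raw data, conditioning amounts to keeping only the matching rows, and filters commute.  Here&#8217;s a minimal sketch of that idea with plain lists.  The Row type and the conditionSpecies, conditionIsGood, and meanAge helpers are made up for illustration; HLearn conditions the trained model instead of filtering raw data.  The sample rows reuse a few species, alignment, and age values from the tables above.</p>

```haskell
-- | A cut-down record: (species, isGood, age).
type Row = (String, Maybe Bool, Double)

-- | Keep only the rows with the given species.
conditionSpecies :: String -> [Row] -> [Row]
conditionSpecies s = filter (\(sp, _, _) -> sp == s)

-- | Keep only the rows with the given alignment.
conditionIsGood :: Maybe Bool -> [Row] -> [Row]
conditionIsGood g = filter (\(_, good, _) -> good == g)

-- | Average age of the remaining rows (undefined for an empty list).
meanAge :: [Row] -> Double
meanAge rows = sum [a | (_, _, a) <- rows] / fromIntegral (length rows)

sample :: [Row]
sample =
  [ ("robot", Just False, 134)  -- Clamps
  , ("robot", Just False, 178)  -- DonBot
  , ("robot", Nothing,    45)   -- Preacherbot
  , ("human", Just False, 22)   -- Walt
  , ("alien", Just True,  113)  -- Kif Kroker
  ]

-- The same two conditions applied in both orders.
speciesFirst, isGoodFirst :: Double
speciesFirst = meanAge (conditionIsGood (Just False) (conditionSpecies "robot" sample))
isGoodFirst  = meanAge (conditionSpecies "robot" (conditionIsGood (Just False) sample))
```

<p>On this small sample both speciesFirst and isGoodFirst evaluate to 156.0, the average of the two evil robots&#8217; ages.</p>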
<p>There&#8217;s one last thing for us to consider.  What does our Markov network look like after conditioning?  Let&#8217;s find out!</p>
<pre>ghci&gt; plotNetwork "condition-species-isGood" $ 
         condition TH_species "robot" $ condition TH_isGood (Just False) futuramaDist</pre>
<p style="text-align: center;"><img class="aligncenter size-medium wp-image-2344" alt="condition-species-isGood" src="http://izbicki.me/blog/wp-content/uploads/2013/05/condition-species-isGood-300x168.png" width="300" height="168" /></p>
<p>Notice that conditioning on these variables removed them from our Markov Network.</p>
<p>Finally, there&#8217;s another similar process to conditioning called &#8220;marginalizing out.&#8221; This lets us ignore the effects of a single attribute without specifically saying what that attribute must be. When we marginalize out on our Markov network, we get the same dependence structure as if we conditioned.</p>
<pre>ghci&gt; plotNetwork "marginalizeOut-species-isGood" $ 
         marginalizeOut TH_species $ marginalizeOut TH_isGood futuramaDist</pre>
<p><img class="aligncenter" alt="condition-species-isGood" src="http://izbicki.me/blog/wp-content/uploads/2013/05/condition-species-isGood-300x168.png" width="300" height="168" /></p>
<p>Effectively, what the marginalizeOut function does is &#8220;forget&#8221; the extra dependencies, whereas the condition function &#8220;applies&#8221; those dependencies.  In the end, the resulting Markov network has the same structure, but different values.</p>
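<p>To make the &#8220;same structure, different values&#8221; point concrete, here&#8217;s a small sketch on a two-attribute table of counts.  Everything in it is hypothetical: the Joint type, the function names, and the counts are invented for illustration, and I use a plain Bool for isGood to keep it short.  It is not how HLearn implements condition or marginalizeOut.</p>

```haskell
import qualified Data.Map.Strict as Map

-- | Joint counts over (species, isGood).
type Joint = Map.Map (String, Bool) Double

-- | Marginalizing out the species: forget it and add up the counts
-- that now share the same isGood key.
marginalizeOutSpecies :: Joint -> Map.Map Bool Double
marginalizeOutSpecies = Map.mapKeysWith (+) snd

-- | Conditioning on the species: keep only the matching entries,
-- then drop the (now constant) species from the key.
conditionOnSpecies :: String -> Joint -> Map.Map Bool Double
conditionOnSpecies s =
  Map.mapKeysWith (+) snd . Map.filterWithKey (\(sp, _) _ -> sp == s)

-- | Turn counts into probabilities (undefined for an empty table).
normalize :: Map.Map Bool Double -> Map.Map Bool Double
normalize m = Map.map (/ total) m
  where total = sum (Map.elems m)

-- Made-up counts, purely for illustration.
jointCounts :: Joint
jointCounts = Map.fromList
  [ (("human", True),  3), (("human", False), 4)
  , (("robot", True),  1), (("robot", False), 7)
  ]
```

<p>Both results are tables over isGood alone, so they have the same shape.  Marginalizing gives counts of 4 good and 11 evil, while conditioning on "robot" gives 1 good and 7 evil.</p>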
<p>&nbsp;</p>
<p>Finally, at the start of the post, I mentioned that our multivariate distributions have group and vector space structure.  This gives us two more operations we can use: the inverse and scalar multiplication.  You can find more posts on how to take advantage of these structures <a href="http://izbicki.me/blog/the-categorical-distributions-algebraic-structure">here</a> and <a href="http://izbicki.me/blog/nuclear-weapon-statistics-using-monoids-groups-and-modules-in-haskell">here</a>.</p>
<h3>Next time&#8230;</h3>
<p><img class="alignright size-medium wp-image-2248" alt="futurama-spacesuits" src="http://izbicki.me/blog/wp-content/uploads/2013/05/futurama-spacesuits-300x208.jpg" width="300" height="208" /></p>
<p>The best part of all of this is still coming.  Next, we&#8217;ll take a look at full on Bayesian classification and why it forms a monoid.  Besides online and parallel trainers, this also gives us a fast cross-validation method.</p>
<p>There&#8217;ll also be posts about the monoid structure of Markov <em>chains</em>, the Free HomTrainer, and how this whole algebraic framework applies to NP-approximation algorithms as well.</p>
<p>Subscribe to the <a href="http://izbicki.me/blog/feed">RSS feed</a> to stay tuned.</p>
 <img src="http://izbicki.me/blog/?feed-stats-post-id=2229" width="1" height="1" style="display: none;" />]]></content:encoded>
			<wfw:commentRss>http://izbicki.me/blog/markov-networks-monoids-and-futurama/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The categorical distribution&#8217;s algebraic structure</title>
		<link>http://izbicki.me/blog/the-categorical-distributions-algebraic-structure?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-categorical-distributions-algebraic-structure</link>
		<comments>http://izbicki.me/blog/the-categorical-distributions-algebraic-structure#comments</comments>
		<pubDate>Tue, 08 Jan 2013 14:43:15 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Haskell]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://izbicki.me/blog/?p=1932</guid>
		<description><![CDATA[ The categorical distribution is the main distribution for handling discrete data. I like to think of it as a histogram.  For example, let&#8217;s say Simon has a bag full of marbles.  There are four &#8220;categories&#8221; of marbles&#8212;red, green, blue, and white.  Now, if Simon reaches into the bag and randomly selects a marble, what&#8217;s the probability [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignright" alt="histogram of simonDist" src="http://izbicki.me/blog/wp-content/uploads/2013/01/histogram-of-simonDist.png" width="247" height="173" /> The <a href="https://en.wikipedia.org/wiki/Categorical_distribution">categorical distribution</a> is the main distribution for handling discrete data. I like to think of it as a <strong>histogram</strong>.  For example, let&#8217;s say Simon has a bag full of marbles.  There are four &#8220;categories&#8221; of marbles&#8212;red, green, blue, and white.  Now, if Simon reaches into the bag and randomly selects a marble, what&#8217;s the probability it will be green?  We would use the categorical distribution to find out.</p>
<p>In this article, we&#8217;ll go over the math behind the categorical distribution, the algebraic structure of the distribution, and how to manipulate it within Haskell&#8217;s <a href="http://hackage.haskell.org/package/HLearn-algebra">HLearn</a> library.  We&#8217;ll also see some examples of how this focus on algebra makes HLearn&#8217;s interface more powerful than other common statistical packages.  Everything that we&#8217;re going to see is in a certain sense very &#8220;obvious&#8221; to a statistician, but this algebraic framework also makes it <strong>convenient</strong>.  And since programmers are inherently lazy, this is a Very Good Thing.</p>
<p>Before delving into the &#8220;cool stuff,&#8221; we have to look at some of the mechanics of the HLearn library.</p>
<p><span id="more-1932"></span></p>
<h3>Preliminaries</h3>
<p>The <a href="http://hackage.haskell.org/package/HLearn-distributions">HLearn-distributions</a> package contains all the functions we need to manipulate categorical distributions. Let&#8217;s install it:</p>
<pre>$ cabal install HLearn-distributions</pre>
<p>We import our libraries:</p>
<pre>&gt;import Control.DeepSeq
&gt;import HLearn.Algebra
&gt;import HLearn.Gnuplot.Distributions
&gt;import HLearn.Models.Distributions</pre>
<p>We create a data type for Simon&#8217;s marbles:</p>
<pre>&gt;data Marble = Red | Green | Blue | White
&gt;    deriving (Read,Show,Eq,Ord)</pre>
<p><img class="aligncenter size-full wp-image-2058" alt="marbles" src="http://izbicki.me/blog/wp-content/uploads/2013/01/marbles.png" width="400" height="100" /></p>
<p>The easiest way to represent Simon&#8217;s bag of marbles is with a list:</p>
<pre>&gt;simonBag :: [Marble]
&gt;simonBag = [Red, Red, Red, Green, Blue, Green, Red, Blue, Green, Green, Red, Red, Blue, Red, Red, Red, White]</pre>
<p>And now we&#8217;re ready to train a categorical distribution of the marbles in Simon&#8217;s bag:</p>
<pre>&gt;simonDist = train simonBag :: Categorical Marble Double</pre>
<p>We can load up ghci and plot the distribution with the conveniently named function <a href="http://hackage.haskell.org/packages/archive/HLearn-distributions/0.1.0.1/doc/html/HLearn-Gnuplot-Distributions.html">plotDistribution</a>:</p>
<pre>ghci&gt; plotDistribution (plotFile "simonDist") simonDist</pre>
<p>This gives us a histogram of probabilities:</p>
<p><img class="aligncenter size-full wp-image-1965" alt="marbles trained into categorical" src="http://izbicki.me/blog/wp-content/uploads/2013/01/marbles-trained-into-categorical.png" width="700" height="210" /></p>
<p>In the HLearn library, every statistical model is generated from data using either <a href="http://hackage.haskell.org/packages/archive/HLearn-algebra/0.1.0.1/doc/html/HLearn-Algebra-Models.html">train</a> or <a href="http://hackage.haskell.org/packages/archive/HLearn-algebra/0.1.0.1/doc/html/HLearn-Algebra-Models.html">train&#8217;</a>.  Because these functions are overloaded, we must specify the type of simonDist so that the compiler knows which model to generate. <a href="http://hackage.haskell.org/packages/archive/HLearn-distributions/0.1.0.1/doc/html/HLearn-Models-Distributions-Categorical.html">Categorical</a> takes two parameters. The first is the type of the discrete data (Marble). The second is the type of the probability (Double). We could easily create Categorical distributions with different types depending on the requirements for our application. For example:</p>
<pre>&gt;stringDist = train (map show simonBag) :: Categorical String Float</pre>
<p>This is the first &#8220;cool thing&#8221; about Categorical:  <strong>We can make distributions over any user-defined type</strong>.  This makes programming with probabilities easier, more intuitive, and more convenient.  Most other statistical libraries would require you to assign numbers corresponding to each color of marble, and then create a distribution over those numbers.</p>
<p>Now that we have a distribution, we can find some probabilities. If Simon pulls a marble from the bag, what&#8217;s the probability that it would be red?</p>
<p style="text-align: center;"><span id='tex_5035'></span></p>
<p>We can use the pdf function to do this calculation for us:</p>
<pre>ghci&gt; pdf simonDist Red
0.5626
ghci&gt; pdf simonDist Blue
0.1876
ghci&gt; pdf simonDist Green
0.1876
ghci&gt; pdf simonDist White
6.26e-2</pre>
<p>If we sum all the probabilities, as expected we would get 1:</p>
<pre>ghci&gt; sum $ map (pdf simonDist) [Red,Green,Blue,White]
1.0</pre>
<p>Due to rounding errors, you may not always get 1. If you absolutely, positively, have to avoid rounding errors, you should use Rational probabilities:</p>
<pre>&gt;simonDistRational = train simonBag :: Categorical Marble Rational</pre>
<p>Rationals are slower, but won&#8217;t be subject to floating point errors.</p>
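<p>Here&#8217;s a tiny illustration of the difference that uses nothing from HLearn, just the Prelude and Data.Ratio.  Ten equal probabilities of one tenth are summed once as Doubles and once as Rationals; the names doubleSum and rationalSum are mine.</p>

```haskell
import Data.Ratio ((%))

-- | Ten probabilities of 0.1 each, added up in floating point.
-- 0.1 has no exact binary representation, so error accumulates.
doubleSum :: Double
doubleSum = sum (replicate 10 0.1)

-- | The same sum with exact rational arithmetic.
rationalSum :: Rational
rationalSum = sum (replicate 10 (1 % 10))
```

<p>In ghci, doubleSum == 1 is False, while rationalSum == 1 is True.</p>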
<p>This is just about all the functionality you would get in a &#8220;normal&#8221; stats package like R or NumPy. But using Haskell&#8217;s nice support for algebra, we can get some extra cool features.</p>
<h3>Semigroup</h3>
<p>First, let&#8217;s talk about semigroups. A <a href="https://en.wikipedia.org/wiki/Semigroup">semigroup</a> is any data structure that has an associative binary operation (<strong>&lt;&gt;</strong>) that joins two of those data structures together. The categorical distribution is a semigroup.</p>
<p>Don wants to play marbles with Simon, and he has his own bag. Don&#8217;s bag contains only red and blue marbles:</p>
<pre>&gt;donBag = [Red,Blue,Red,Blue,Red,Blue,Blue,Red,Blue,Blue]</pre>
<p>We can train a categorical distribution on Don&#8217;s bag in the same way we did earlier:</p>
<pre>&gt;donDist = train donBag :: Categorical Marble Double</pre>
<p>In order to play marbles together, Don and Simon will have to add their bags together.</p>
<pre>&gt;bothBag = simonBag ++ donBag</pre>
<p>Now, we have two options for training our distribution. The first is the naive way: we can train the distribution directly on the combined bag:</p>
<pre>&gt;bothDist = train bothBag :: Categorical Marble Double</pre>
<p>This is the way we would have to approach this problem in most statistical libraries. But with HLearn, we have a more efficient alternative. We can combine the trained distributions using the semigroup operation:</p>
<pre>&gt;bothDist' = simonDist &lt;&gt; donDist</pre>
<p>Under the hood, the categorical distribution stores the number of times each possibility occurred in the training data.  The &lt;&gt; operator just adds the corresponding counts from each distribution together:</p>
<p><img class="aligncenter size-full wp-image-1994" alt="semigroup and bothDist" src="http://izbicki.me/blog/wp-content/uploads/2013/01/semigroup-and-bothDist.png" width="700" height="260" /></p>
<p>This method is more efficient because it avoids repeating work we&#8217;ve already done. Categorical&#8217;s semigroup operation only adds one pair of counts per category, so it runs in time <strong>O(1)</strong> with respect to the number of marbles: no matter how big the bags are, we can calculate the combined distribution very quickly. The naive method, in contrast, requires time <strong>O(n)</strong>. If our bags had millions or billions of marbles inside them, this would be a considerable savings!</p>
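<p>If you want to see the idea without the library, here&#8217;s a minimal sketch built on Data.Map from the containers package.  The Counts type and the trainCounts, pdfCounts, and trainChunked functions are hypothetical stand-ins that I made up for this example; they are not HLearn&#8217;s Categorical or its API.  The last function shows why a semigroup makes chunked (and therefore parallelizable) training possible: train each chunk separately, then merge.</p>

```haskell
import qualified Data.Map.Strict as Map

data Marble = Red | Green | Blue | White
  deriving (Show, Eq, Ord)

-- | Hypothetical stand-in for a trained categorical model:
-- how many times each marble was seen.
newtype Counts = Counts (Map.Map Marble Int)
  deriving (Show, Eq)

-- Merging two models adds the counts category by category.
instance Semigroup Counts where
  Counts a <> Counts b = Counts (Map.unionWith (+) a b)

instance Monoid Counts where
  mempty = Counts Map.empty

-- | Train by counting occurrences.
trainCounts :: [Marble] -> Counts
trainCounts xs = Counts (Map.fromListWith (+) [(x, 1) | x <- xs])

-- | Probability of one category (undefined for the empty model).
pdfCounts :: Counts -> Marble -> Double
pdfCounts (Counts m) x =
  fromIntegral (Map.findWithDefault 0 x m) / fromIntegral (sum (Map.elems m))

-- | Split the data into chunks of the given size, train each chunk
-- on its own, and merge the results with (<>).
trainChunked :: Int -> [Marble] -> Counts
trainChunked n = mconcat . map trainCounts . chunks
  where
    size = max 1 n  -- guard against a non-positive chunk size
    chunks [] = []
    chunks ys = let (h, t) = splitAt size ys in h : chunks t

donBag, edBag :: [Marble]
donBag = [Red, Blue, Red, Blue, Red, Blue, Blue, Red, Blue, Blue]
edBag  = [Red, Green, Blue, White]
```

<p>In this sketch, trainCounts (edBag ++ donBag) equals trainCounts edBag &lt;&gt; trainCounts donBag, and trainChunked 3 donBag equals trainCounts donBag.  The chunks here are processed sequentially; each could just as well be handed to a different processor before merging.</p>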
<p>We get another cool performance trick &#8220;for free&#8221; based on the fact that Categorical is a semigroup: The function train can be <strong>automatically parallelized</strong> using the higher order function <a href="http://hackage.haskell.org/packages/archive/HLearn-algebra/0.1.1.0/doc/html/HLearn-Algebra-Functions.html">parallel</a>. I won&#8217;t go into the details about how this works, but here&#8217;s how you do it in practice.</p>
<p>First, we must show the compiler how to resolve the Marble data type down to &#8220;<a href="http://stackoverflow.com/questions/6872898/haskell-what-is-weak-head-normal-form">normal form</a>.&#8221; This basically means we must show the compiler how to fully compute the data type. (We only have to do this because Marble is a type we created.  If we were using a built in type, like a String, we could skip this step.) This is fairly easy for a type as simple as Marble:</p>
<pre>&gt;instance NFData Marble where
&gt;    rnf Red   = ()
&gt;    rnf Blue  = ()
&gt;    rnf Green = ()
&gt;    rnf White = ()</pre>
<p>Then, we can perform the parallel computation by:</p>
<pre>&gt;simonDist_par = parallel train simonBag :: Categorical Marble Double</pre>
<p>Other languages require a programmer to manually create parallel versions of their functions. But in Haskell with the HLearn library, we get these parallel versions for free! All we have to do is ask for it!</p>
<h3>Monoid</h3>
<p>A monoid is a semigroup with an empty element, which is called <strong>mempty</strong> in Haskell. It obeys the law that:</p>
<pre>M &lt;&gt; mempty == mempty &lt;&gt; M == M</pre>
<p>And it is easy to show that Categorical is also a monoid. We get this empty element by training on an empty data set:</p>
<pre>mempty = train ([] :: [Marble]) :: Categorical Marble Double</pre>
<p>The <a href="http://hackage.haskell.org/packages/archive/HLearn-algebra/0.1.1.0/doc/html/HLearn-Algebra-Models.html">HomTrainer</a> type class requires that all its instances also be instances of Monoid. This lets the compiler automatically derive &#8220;<a href="http://en.wikipedia.org/wiki/Online_machine_learning">online trainers</a>&#8221; for us. An online trainer can add new data points to our statistical model without retraining it from scratch.</p>
<p>For example, we could use the function <a href="http://hackage.haskell.org/packages/archive/HLearn-algebra/0.1.1.0/doc/html/HLearn-Algebra-Models.html">add1dp</a> (stands for: add one data point) to add another white marble into Simon&#8217;s bag:</p>
<pre>&gt;simonDistWhite = add1dp simonDist White</pre>
<p>This also gives us another approach for our earlier problem of combining Simon and Don&#8217;s bags. We could use the function <a href="http://hackage.haskell.org/packages/archive/HLearn-algebra/0.1.1.0/doc/html/HLearn-Algebra-Models.html">addBatch</a>:</p>
<pre>&gt;bothDist'' = addBatch simonDist donBag</pre>
<p>Because Categorical is a monoid, we maintain the property that:</p>
<pre>bothDist == bothDist' == bothDist''</pre>
<p>Again, statisticians have always known that you could add new points into a categorical distribution without training from scratch.  The cool thing here is that <strong>the compiler is deriving all of these functions for us</strong>, and it&#8217;s giving us a<strong> consistent interface </strong>for use with different data structures.  All we had to do to get these benefits was tell the compiler that Categorical is a monoid.  This makes designing and programming libraries much <strong>easier, quicker, and less error prone</strong>.</p>
<h3>Group</h3>
<p>A <a href="http://en.wikipedia.org/wiki/Group_(mathematics)">group</a> is a monoid with the additional property that all elements have an <strong>inverse</strong>. This lets us perform subtraction on groups.  And Categorical is a group.</p>
<p>Ed wants to play marbles too, but he doesn&#8217;t have any of his own. So Simon offers to give Ed some from his own bag. He gives Ed one of each color:</p>
<pre>&gt;edBag = [Red,Green,Blue,White]</pre>
<p>Now, if Simon draws a marble from his bag, what&#8217;s the probability it will be blue?</p>
<p>To answer this question without algebra, we&#8217;d have to go back to the original data set, remove the marbles Simon gave Ed, then retrain the distribution. This is awkward and computationally expensive. But if we take advantage of Categorical&#8217;s group structure, we can just subtract directly from the distribution itself. This makes more sense intuitively and is easier computationally.</p>
<pre>&gt;simonDist2 = subBatch simonDist edBag</pre>
<p>This is a shorthand notation for using the group operations directly:</p>
<pre>&gt;edDist = train edBag :: Categorical Marble Double
&gt;simonDist2' = simonDist &lt;&gt; (inverse edDist)</pre>
<p>The way the inverse operation works is it multiplies the counts for each category by -1. In picture form, this flips the distribution upside down:</p>
<p><img class="aligncenter size-full wp-image-2013" alt="edDist inversification" src="http://izbicki.me/blog/wp-content/uploads/2013/01/edDist-inversification.png" width="620" height="200" /></p>
<p>Then, adding an upside down distribution to a normal one is just subtracting the histogram columns and renormalizing:</p>
<p><img class="aligncenter size-full wp-image-2014" alt="Simon substraction edDist" src="http://izbicki.me/blog/wp-content/uploads/2013/01/Simon-substraction-edDist.png" width="700" height="200" /></p>
<p>Notice that the green bar in edDist looks really big&#8212;much bigger than the green bar in simonDist.  But when we subtract it away from simonDist, we still have some green marbles left over in simonDist2.  This is because the histogram is only showing the <em>probability</em> of a green marble, and not the <em>actual number</em> of marbles.</p>
<p>Finally, there&#8217;s one more crazy trick we can perform with the Categorical group.  It&#8217;s perfectly okay to have both positive and negative marbles in the same distribution.  For example:</p>
<pre>ghci&gt; plotDistribution (plotFile "mixedDist") (edDist &lt;&gt; (inverse donDist))</pre>
<p>results in:</p>
<p><img class="aligncenter size-full wp-image-2073" alt="mixedDist-300" src="http://izbicki.me/blog/wp-content/uploads/2013/01/mixedDist-300.png" width="300" height="209" /></p>
<p>Most statisticians would probably say that these upside down Categoricals are not &#8220;real distributions.&#8221; But at the very least, they are a convenient mathematical trick that makes <strong>working with distributions much more pleasant</strong>.</p>
<h3>Module</h3>
<p>Finally, an <a href="http://en.wikipedia.org/wiki/R-module">R-Module</a> is a group with two additional properties. First, it is <a href="http://en.wikipedia.org/wiki/Abelian_groups">abelian</a>. That means &lt;&gt; is commutative. So, for all a, b:</p>
<pre>a &lt;&gt; b == b &lt;&gt; a</pre>
<p>Second, the data type supports <strong>multiplication by any element in the </strong><a href="http://en.wikipedia.org/wiki/Ring_(mathematics)"><strong>ring</strong></a><strong> R</strong>. In Haskell, you can think of a ring as any member of the <a href="http://www.haskell.org/tutorial/numbers.html">Num</a> type class.</p>
<p>How is this useful?  It lets us &#8220;retrain&#8221; our distribution on the data points it has already seen.  Back to the example&#8230;</p>
<p>Well, Ed&#8212;being the clever guy that he is&#8212;recently developed a marble copying machine. That&#8217;s right! You just stick some marbles in on one end, and on the other end out pop 10 exact duplicates. Ed&#8217;s not just clever, but pretty nice too. He duplicates his new marbles and gives all of them back to Simon. What&#8217;s Simon&#8217;s new distribution look like?</p>
<p>Again, the naive way to answer this question would be to retrain from scratch:</p>
<pre>&gt;duplicateBag = simonBag ++ (concat $ replicate 10 edBag)
&gt;duplicateDist = train duplicateBag :: Categorical Marble Double</pre>
<p>Slightly better is to take advantage of the Semigroup property, and just apply that over and over again:</p>
<pre>&gt;duplicateDist' = simonDist2 &lt;&gt; (foldl1 (&lt;&gt;) $ replicate 10 edDist)</pre>
<p>But even better is to take advantage of the fact that Categorical is a module and the (<strong>.*</strong>) operator:</p>
<pre>&gt;duplicateDist'' = simonDist2 &lt;&gt; 10 .* edDist</pre>
<p>In picture form:</p>
<p><img class="aligncenter size-full wp-image-2066" alt="module example" src="http://izbicki.me/blog/wp-content/uploads/2013/01/module-example.png" width="700" height="200" /></p>
<p>Also notice that without the scalar multiplication, we would get back our original distribution:</p>
<p><img class="aligncenter size-full wp-image-2067" alt="module example-mod" src="http://izbicki.me/blog/wp-content/uploads/2013/01/module-example-mod.png" width="700" height="200" /></p>
<p>Another way to think about the module&#8217;s scalar multiplication is that it allows us to <strong>weight our distributions</strong>.</p>
<p>Ed just realized that he still needs a marble, and has decided to take one.  Someone has left their marble bag sitting nearby, but he&#8217;s not sure whose it is.  He thinks that Simon is more forgetful than Don is, so he assigns a 60% probability that the bag is Simon&#8217;s and a 40% probability that it is Don&#8217;s.  When he takes a marble, what&#8217;s the probability that it is red?</p>
<p>We create a weighted distribution using module multiplication:</p>
<pre>&gt;weightedDist = 0.6 .* simonDist &lt;&gt; 0.4 .* donDist</pre>
<p>Then in ghci:</p>
<pre>ghci&gt; pdf weightedDist Red
0.4929577464788732</pre>
<p>We can also train directly on weighted data:</p>
<pre>&gt;weightedDataDist = train [(0.4,Red),(0.5,Green),(0.2,Green),(3.7,White)] :: Categorical Marble Double</pre>
<p>which gives us:</p>
<p><img class="aligncenter size-full wp-image-2068" alt="weightedDataDist-300" src="http://izbicki.me/blog/wp-content/uploads/2013/01/weightedDataDist-300.png" width="300" height="209" /></p>
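<p>One plausible reading of training on weighted data is that each label accumulates the weights attached to it, and a label&#8217;s probability is its share of the total weight. The standalone sketch below works under that assumption. The Marble type, trainWeighted, and pdfOf are defined here for illustration only and are not HLearn&#8217;s API.</p>
<pre>import qualified Data.Map.Strict as M

-- Stand-in for the post's Marble type.
data Marble = Red | Green | White deriving (Eq, Ord, Show)

-- Hypothetical model: each label maps to its accumulated weight.
type Weights = M.Map Marble Double

-- "Training" on weighted data sums the weight attached to each label.
trainWeighted :: [(Double, Marble)] -&gt; Weights
trainWeighted xs = M.fromListWith (+) [(m, w) | (w, m) &lt;- xs]

-- The probability of a label is its share of the total weight.
pdfOf :: Weights -&gt; Marble -&gt; Double
pdfOf ws m = M.findWithDefault 0 m ws / sum (M.elems ws)

-- The weighted data set used above.
example :: Weights
example = trainWeighted [(0.4, Red), (0.5, Green), (0.2, Green), (3.7, White)]</pre>
<p>Under this model the total weight is 4.8, so Red gets about 0.08, Green about 0.15, and White about 0.77 of the probability mass.</p>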
<h3>The Takeaway and next posts</h3>
<p>Talking about the categorical distribution in algebraic terms lets us do some cool new stuff with our distributions that we can&#8217;t easily do in other libraries.  None of this is statistically groundbreaking. The cool thing is that <strong>algebra just makes everything so convenient to work with</strong>.</p>
<p>I think I&#8217;ll do another post on some cool tricks with the kernel density estimator that are not possible at all in other libraries, then do a post about the category (formal category-theoretic sense) of statistical training methods.  At that point, we&#8217;ll be ready to jump into machine learning tasks.  Depending on my mood we might take a pit stop to discuss the computational aspects of free groups and modules and how these relate to machine learning applications.</p>
<p><a href="http://izbicki.me/blog/feed">Sign up for the RSS feed</a> to stay tuned!</p>
]]></content:encoded>
			<wfw:commentRss>http://izbicki.me/blog/the-categorical-distributions-algebraic-structure/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Nuclear weapon statistics using monoids, groups, and modules in Haskell</title>
		<link>http://izbicki.me/blog/nuclear-weapon-statistics-using-monoids-groups-and-modules-in-haskell?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=nuclear-weapon-statistics-using-monoids-groups-and-modules-in-haskell</link>
		<comments>http://izbicki.me/blog/nuclear-weapon-statistics-using-monoids-groups-and-modules-in-haskell#comments</comments>
		<pubDate>Fri, 04 Jan 2013 14:47:40 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Haskell]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://izbicki.me/blog/?p=1766</guid>
		<description><![CDATA[The Bulletin of the Atomic Scientists tracks the nuclear capabilities of every country. We&#8217;re going to use their data to demonstrate Haskell&#8217;s HLearn library and the usefulness of abstract algebra to statistics. Specifically, we&#8217;ll see that the categorical distribution and kernel density estimates have monoid, group, and module algebraic structures.  We&#8217;ll explain what this crazy [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignright" alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/7/79/Operation_Upshot-Knothole_-_Badger_001.jpg/282px-Operation_Upshot-Knothole_-_Badger_001.jpg" width="197" height="168" />The <a href="http://www.thebulletin.org/">Bulletin of the Atomic Scientists</a> tracks the nuclear capabilities of every country. We&#8217;re going to use their data to demonstrate Haskell&#8217;s <a href="http://hackage.haskell.org/package/HLearn-algebra">HLearn</a> library and the usefulness of abstract algebra to statistics. Specifically, we&#8217;ll see that the <a href="https://en.wikipedia.org/wiki/Categorical_distribution">categorical distribution</a> and <a href="https://en.wikipedia.org/wiki/Kernel_density_estimation">kernel density estimates</a> have <a href="https://en.wikipedia.org/wiki/Monoid">monoid</a>, <a href="https://en.wikipedia.org/wiki/Group_(mathematics)">group</a>, and <a href="https://en.wikipedia.org/wiki/Module_(mathematics)">module</a> algebraic structures.  We&#8217;ll explain what this crazy lingo even means, then take advantage of these structures to <strong>efficiently</strong> <strong>answer real-world statistical questions about nuclear war</strong>. It&#8217;ll be a <a href="https://en.wikipedia.org/wiki/WOPR">WOPR</a>!</p>
<p><span id="more-1766"></span></p>
<p>Before we get into the math, we&#8217;ll need to review the basics of nuclear politics.</p>
<p>The nuclear <a href="https://en.wikipedia.org/wiki/Nuclear_Non-Proliferation_Treaty">Non-Proliferation Treaty</a> (<strong>NPT</strong>) is the main treaty governing nuclear weapons. Basically, it says that there are five countries that are &#8220;allowed&#8221; to have nukes: the <strong>USA</strong>, <strong>UK</strong>, <strong>France</strong>, <strong>Russia</strong>, and <strong>China</strong>. &#8220;Allowed&#8221; is in quotes because the treaty specifies that these countries must eventually get rid of their nuclear weapons at some future, unspecified date. When another country, for example Iran, signs the NPT, they are agreeing to not develop nuclear weapons. What they get in exchange is help from the 5 nuclear weapons states in developing their own civilian nuclear power programs. (Iran has the legitimate complaint that Western countries are actively trying to stop its civilian nuclear program when they&#8217;re supposed to be helping it, but that&#8217;s a <a href="http://www.csmonitor.com/Commentary/Opinion/2010/0917/Reality-check-Iran-is-not-a-nuclear-threat">whole &#8216;nother can of worms</a>.)</p>
<p>The <a href="http://bos.sagepub.com/">Nuclear Notebook</a> tracks the nuclear capabilities of all these countries.  The most-current estimates are from mid-2012.  Here&#8217;s a summary (click the warhead type for more info):</p>
<table border="1" align="center">
<tbody>
<tr>
<td align="LEFT"><strong>Country</strong></td>
<td align="LEFT"><strong>Delivery Method</strong></td>
<td align="LEFT"><strong>Warhead</strong></td>
<td align="LEFT"><strong>Yield (kt)</strong></td>
<td align="LEFT"><strong># Deployed</strong></td>
</tr>
<tr>
<td align="LEFT" height="17">USA</td>
<td align="LEFT">ICBM</td>
<td align="LEFT"><a href="http://nuclearweaponarchive.org/Usa/Weapons/W78.html">W78</a></td>
<td align="RIGHT">335</td>
<td align="RIGHT">250</td>
</tr>
<tr>
<td align="LEFT" height="17">USA</td>
<td align="LEFT">ICBM</td>
<td align="LEFT"><a href="http://nuclearweaponarchive.org/Usa/Weapons/W87.html">W87</a></td>
<td align="RIGHT">300</td>
<td align="RIGHT">250</td>
</tr>
<tr>
<td align="LEFT" height="17">USA</td>
<td align="LEFT">SLBM</td>
<td align="LEFT"><a href="http://nuclearweaponarchive.org/Usa/Weapons/W76.html">W76</a></td>
<td align="RIGHT">100</td>
<td align="RIGHT">468</td>
</tr>
<tr>
<td align="LEFT" height="17">USA</td>
<td align="LEFT">SLBM</td>
<td align="LEFT"><a href="http://nuclearweaponarchive.org/Usa/Weapons/W76.html">W76-1</a></td>
<td align="RIGHT">100</td>
<td align="RIGHT">300</td>
</tr>
<tr>
<td align="LEFT" height="17">USA</td>
<td align="LEFT">SLBM</td>
<td align="LEFT"><a href="http://nuclearweaponarchive.org/Usa/Weapons/W88.html">W88</a></td>
<td align="RIGHT">455</td>
<td align="RIGHT">384</td>
</tr>
<tr>
<td align="LEFT" height="17">USA</td>
<td align="LEFT">Bomber</td>
<td align="LEFT"><a href="http://nuclearweaponarchive.org/Usa/Weapons/W80.html">W80</a></td>
<td align="RIGHT">150</td>
<td align="RIGHT">200</td>
</tr>
<tr>
<td align="LEFT" height="17">USA</td>
<td align="LEFT">Bomber</td>
<td align="LEFT"><a href="https://en.wikipedia.org/wiki/B61_nuclear_bomb">B61</a></td>
<td align="RIGHT">340</td>
<td align="RIGHT">50</td>
</tr>
<tr>
<td align="LEFT" height="17">USA</td>
<td align="LEFT">Bomber</td>
<td align="LEFT"><a href="https://en.wikipedia.org/wiki/B83_nuclear_bomb">B83</a></td>
<td align="RIGHT">1200</td>
<td align="RIGHT">50</td>
</tr>
<tr>
<td align="LEFT" height="17">UK</td>
<td align="LEFT">SLBM</td>
<td align="LEFT"><a href="http://nuclearweaponarchive.org/Usa/Weapons/W76.html">W76</a></td>
<td align="RIGHT">100</td>
<td align="RIGHT">225</td>
</tr>
<tr>
<td align="LEFT" height="17">France</td>
<td align="LEFT">SLBM</td>
<td align="LEFT"><a href="https://en.wikipedia.org/wiki/TN_75">TN75</a></td>
<td align="RIGHT">100</td>
<td align="RIGHT">150</td>
</tr>
<tr>
<td align="LEFT" height="17">France</td>
<td align="LEFT">Bomber</td>
<td align="LEFT"><a href="https://en.wikipedia.org/wiki/TN_81">TN81</a></td>
<td align="RIGHT">300</td>
<td align="RIGHT">150</td>
</tr>
<tr>
<td align="LEFT" height="17">Russia</td>
<td align="LEFT">ICBM</td>
<td align="LEFT">RS-20V</td>
<td align="RIGHT">800</td>
<td align="RIGHT">500</td>
</tr>
<tr>
<td align="LEFT" height="17">Russia</td>
<td align="LEFT">ICBM</td>
<td align="LEFT">RS-18</td>
<td align="RIGHT">400</td>
<td align="RIGHT">288</td>
</tr>
<tr>
<td align="LEFT" height="17">Russia</td>
<td align="LEFT">ICBM</td>
<td align="LEFT">RS-12M</td>
<td align="RIGHT">800</td>
<td align="RIGHT">135</td>
</tr>
<tr>
<td align="LEFT" height="17">Russia</td>
<td align="LEFT">ICBM</td>
<td align="LEFT">RS-12M2</td>
<td align="RIGHT">800</td>
<td align="RIGHT">56</td>
</tr>
<tr>
<td align="LEFT" height="17">Russia</td>
<td align="LEFT">ICBM</td>
<td align="LEFT">RS-12M1</td>
<td align="RIGHT">800</td>
<td align="RIGHT">18</td>
</tr>
<tr>
<td align="LEFT" height="17">Russia</td>
<td align="LEFT">ICBM</td>
<td align="LEFT">RS-24</td>
<td align="RIGHT">100</td>
<td align="RIGHT">90</td>
</tr>
<tr>
<td align="LEFT" height="17">Russia</td>
<td align="LEFT">SLBM</td>
<td align="LEFT"><a href="https://en.wikipedia.org/wiki/R-29_Vysota">RSM-50</a></td>
<td align="RIGHT">50</td>
<td align="RIGHT">144</td>
</tr>
<tr>
<td align="LEFT" height="17">Russia</td>
<td align="LEFT">SLBM</td>
<td align="LEFT"><a href="https://en.wikipedia.org/wiki/R-29RM_Shtil">RSM-54</a></td>
<td align="RIGHT">100</td>
<td align="RIGHT">384</td>
</tr>
<tr>
<td align="LEFT" height="17">Russia</td>
<td align="LEFT">Bomber</td>
<td align="LEFT"><a href="https://en.wikipedia.org/wiki/Kh-55_(missile_family)">AS-15</a></td>
<td align="RIGHT">200</td>
<td align="RIGHT">820</td>
</tr>
<tr>
<td align="LEFT" height="17">China</td>
<td align="LEFT">ICBM</td>
<td align="LEFT"><a href="https://www.fas.org/nuke/guide/china/theater/df-3a.htm">DF-3A</a></td>
<td align="RIGHT">3300</td>
<td align="RIGHT">16</td>
</tr>
<tr>
<td align="LEFT" height="17">China</td>
<td align="LEFT">ICBM</td>
<td align="LEFT"><a href="https://www.fas.org/nuke/guide/china/theater/df-4.htm">DF-4</a></td>
<td align="RIGHT">3300</td>
<td align="RIGHT">12</td>
</tr>
<tr>
<td align="LEFT" height="17">China</td>
<td align="LEFT">ICBM</td>
<td align="LEFT"><a href="https://en.wikipedia.org/wiki/DF-5">DF-5A</a></td>
<td align="RIGHT">5000</td>
<td align="RIGHT">20</td>
</tr>
<tr>
<td align="LEFT" height="17">China</td>
<td align="LEFT">ICBM</td>
<td align="LEFT"><a href="https://en.wikipedia.org/wiki/DF-21">DF-21</a></td>
<td align="RIGHT">300</td>
<td align="RIGHT">60</td>
</tr>
<tr>
<td align="LEFT" height="17">China</td>
<td align="LEFT">ICBM</td>
<td align="LEFT"><a href="https://en.wikipedia.org/wiki/DF-31">DF-31</a></td>
<td align="RIGHT">300</td>
<td align="RIGHT">20</td>
</tr>
<tr>
<td align="LEFT" height="17">China</td>
<td align="LEFT">ICBM</td>
<td align="LEFT"><a href="https://en.wikipedia.org/wiki/DF-31">DF-31A</a></td>
<td align="RIGHT">300</td>
<td align="RIGHT">20</td>
</tr>
<tr>
<td align="LEFT" height="17">China</td>
<td align="LEFT">Bomber</td>
<td align="LEFT"><a href="https://en.wikipedia.org/wiki/Xian_H-6">H-6</a></td>
<td align="RIGHT">3100</td>
<td align="RIGHT">20</td>
</tr>
</tbody>
</table>
<p>&nbsp;</p>
<p>I&#8217;ve consolidated all this data into the file <a href="http://izbicki.me/public/datasets/nukes-list.csv">nukes-list.csv</a>, which we will analyze in this post.  If you want to try out this code for yourself (or the homework question at the end), you&#8217;ll need to download it.  Every line in the file corresponds to a single nuclear warhead, not a delivery method.  Warheads are the parts that go boom!  Bombers, <a href="https://en.wikipedia.org/wiki/Icbm">ICBMs</a>, and <a href="https://en.wikipedia.org/wiki/SSBN">SSBN</a>/<a href="https://en.wikipedia.org/wiki/SLBMs">SLBMs</a> are the delivery methods.</p>
<p><img class="aligncenter" alt="nuclear-triad" src="http://izbicki.me/blog/wp-content/uploads/2013/01/nuclear-triad.jpg" width="700" height="149" /></p>
<p>There are three things to note about this data.  First, the numbers are <strong>only estimates</strong> based on public sources.  In particular, they probably overestimate the Russian nuclear forces. <a href="http://russianforces.org/missiles/">Other estimates are considerably lower</a>.  Second, we will only be considering <strong>deployed, strategic warheads</strong>.  Basically, this means the &#8220;really big nukes that are currently aimed at another country.&#8221;  There are thousands more tactical warheads, and warheads in reserve stockpiles waiting to be disassembled.  For simplicity, and because these nukes don&#8217;t significantly affect strategic planning, we won&#8217;t be considering them here.  Finally, there are 4 countries that are not members of the NPT but have nuclear weapons: <strong>Israel</strong>, <strong>Pakistan</strong>, <strong>India</strong>, and <strong>North Korea</strong>.  We will be ignoring them here because their inventories are relatively small, and most of their weapons would not be considered strategic.</p>
<h3>Programming preliminaries</h3>
<p>Now we&#8217;re ready to start programming. First, let&#8217;s import our libraries:</p>
<pre>&gt;import Control.Lens
&gt;import Data.Csv
&gt;import qualified Data.Vector as V
&gt;import qualified Data.ByteString.Lazy.Char8  as BS
&gt; 
&gt;import HLearn.Algebra
&gt;import HLearn.Models.Distributions
&gt;import HLearn.Gnuplot.Distributions</pre>
<p>Next, we load our data using the <a href="http://hackage.haskell.org/package/cassava">Cassava</a> package.  (You don&#8217;t need to understand how this works.)</p>
<pre>&gt;main = do
&gt;   Right rawdata &lt;- fmap (fmap V.toList . decode True) $ BS.readFile "nukes-list.csv"
&gt;       :: IO (Either String [(String, String, String, Int)])</pre>
<p>And we&#8217;ll use the <a href="http://hackage.haskell.org/package/lens">Lens</a> package to parse the CSV file into a series of variables containing just the values we want.  (You also don&#8217;t need to understand this.)</p>
<pre>&gt;   let list_usa    = fmap (\row -&gt; row^._4) $ filter (\row -&gt; (row^._1)=="USA"   ) rawdata
&gt;   let list_uk     = fmap (\row -&gt; row^._4) $ filter (\row -&gt; (row^._1)=="UK"    ) rawdata 
&gt;   let list_france = fmap (\row -&gt; row^._4) $ filter (\row -&gt; (row^._1)=="France") rawdata 
&gt;   let list_russia = fmap (\row -&gt; row^._4) $ filter (\row -&gt; (row^._1)=="Russia") rawdata 
&gt;   let list_china  = fmap (\row -&gt; row^._4) $ filter (\row -&gt; (row^._1)=="China" ) rawdata</pre>
<p><strong>NOTE:</strong> All you need to understand about the above code is what these list_country variables look like. So let&#8217;s print one:</p>
<pre>&gt;   putStrLn $ "List of American nuclear weapon sizes = " ++ show list_usa</pre>
<p>gives us the output:</p>
<pre>List of American nuclear weapon sizes = fromList [335,335,335,335,335,335,335,335,335,335  ...  1200,1200,1200,1200,1200]</pre>
<p>If we want to know how many weapons are in the American arsenal, we can take the length of the list:</p>
<pre>&gt;   putStrLn $ "Number of American weapons = " ++ show (length list_usa)</pre>
<p>We get that there are <strong>1951 American deployed, strategic nuclear weapons</strong>.  If we want to know the total &#8220;blowing up&#8221; power, we take the sum of the list:</p>
<pre>&gt;   putStrLn $ "Explosive power of American weapons = " ++ show (sum list_usa)</pre>
<p>We get that the US has  <strong>516 megatons of deployed, strategic nuclear weapons</strong>.  That&#8217;s the equivalent of <strong>1,033,870,000,000 pounds of TNT</strong>.</p>
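<p>The conversion to pounds can be checked with a few lines of arithmetic. This standalone sketch uses the exact US total of 516,935 kt from the summary table in this post, and it assumes that a ton of TNT means a short ton of 2,000 pounds (the post does not say which ton is meant).</p>
<pre>-- Total US yield in kilotons, from the summary table in this post.
usKilotons :: Integer
usKilotons = 516935

-- Assumption: "ton" means a short ton of 2,000 pounds.
poundsPerTon :: Integer
poundsPerTon = 2000

-- 1 kiloton = 1,000 tons of TNT.
usPoundsOfTnt :: Integer
usPoundsOfTnt = usKilotons * 1000 * poundsPerTon</pre>
<p>Evaluating usPoundsOfTnt gives 1033870000000, which matches the figure quoted above.</p>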
<p>To get the total number of weapons in the world, we concatenate every country&#8217;s list of weapons and find the length:</p>
<pre>&gt;   let list_all = list_usa ++ list_uk ++ list_france ++ list_russia ++ list_china
&gt;   putStrLn $ "Number of nukes in the whole world = " ++ show (length list_all)</pre>
<p>Doing this for every country gives us the table:</p>
<table border="1" align="center">
<tbody>
<tr>
<td align="LEFT"><strong>Country</strong></td>
<td align="LEFT"><strong>Warheads</strong></td>
<td align="LEFT"><strong>Total explosive power (kt)</strong></td>
</tr>
<tr>
<td align="LEFT" height="17">USA</td>
<td align="RIGHT">1,951</td>
<td align="RIGHT">516,935</td>
</tr>
<tr>
<td align="LEFT" height="17">UK</td>
<td align="RIGHT">225</td>
<td align="RIGHT">22,500</td>
</tr>
<tr>
<td align="LEFT" height="17">France</td>
<td align="RIGHT">300</td>
<td align="RIGHT">60,000</td>
</tr>
<tr>
<td align="LEFT" height="17">Russia</td>
<td align="RIGHT">2,435</td>
<td align="RIGHT">901,000</td>
</tr>
<tr>
<td align="LEFT" height="17">China</td>
<td align="RIGHT">168</td>
<td align="RIGHT">284,400</td>
</tr>
<tr>
<td align="LEFT" height="17"><strong>Total</strong></td>
<td align="RIGHT">5,079</td>
<td align="RIGHT">1,784,835</td>
</tr>
</tbody>
</table>
<p>&nbsp;</p>
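<p>The rows of this table can be cross-checked against the first table without downloading the CSV file. The standalone sketch below expands (yield, number deployed) pairs into one entry per warhead, which is the shape of the list_country variables, and then takes the length and the sum. The helper names expand and summarize are made up for illustration.</p>
<pre>-- (yield in kt, number deployed), copied from the first table.
ukArsenal, franceArsenal :: [(Int, Int)]
ukArsenal     = [(100, 225)]
franceArsenal = [(100, 150), (300, 150)]

-- Expand the summary rows into one entry per warhead.
expand :: [(Int, Int)] -&gt; [Int]
expand rows = concat [replicate n y | (y, n) &lt;- rows]

-- (number of warheads, total explosive power in kt)
summarize :: [Int] -&gt; (Int, Int)
summarize ws = (length ws, sum ws)</pre>
<p>Here summarize (expand ukArsenal) evaluates to (225,22500) and summarize (expand franceArsenal) to (300,60000), in agreement with the UK and France rows.</p>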
<p>Now let&#8217;s do some algebra!</p>
<h3>Monoids and groups</h3>
<p>In a previous post, we saw that the <a href="http://izbicki.me/blog/gausian-distributions-are-monoids">Gaussian distribution forms a group</a>. This means that it has all the properties of a monoid&#8212;an empty element (<strong>mempty</strong>) that represents the distribution trained on no data, and a binary operation (<strong>mappend</strong>) that merges two distributions together&#8212;plus an <strong>inverse</strong>. This inverse lets us &#8220;subtract&#8221; two Gaussians from each other.</p>
<p>It turns out that many other distributions also have this group property.  One example is the <strong>categorical distribution</strong>, which is used for modelling discrete data. Essentially, it assigns some probability to each &#8220;label.&#8221;  In our case, the labels are the sizes of the nuclear weapons, and the probability is the chance that a randomly chosen nuke will be exactly that destructive.  We train our categorical distribution using the <a href="http://hackage.haskell.org/packages/archive/HLearn-algebra/0.0.1/doc/html/HLearn-Algebra-Models.html">train</a> function:</p>
<pre>&gt;   let cat_usa = train list_usa :: Categorical Int Double</pre>
<p>If we plot this distribution, we&#8217;ll get a graph that looks something like:</p>
<p style="text-align: center;"><img class="aligncenter size-full wp-image-1808" alt="catigorical distribution of american nuclear weapons" src="http://izbicki.me/blog/wp-content/uploads/2013/01/catigorical-distribution-of-american-nuclear-weapons1.png" width="572" height="402" /></p>
<p>A distribution like this is useful to war planners from other countries.  It can help them statistically determine the amount of casualties their infrastructure will take from a nuclear exchange.</p>
<p>Now, let&#8217;s train equivalent distributions for our other countries.</p>
<pre>&gt;   let cat_uk = train list_uk :: Categorical Int Double
&gt;   let cat_france = train list_france :: Categorical Int Double
&gt;   let cat_russia = train list_russia :: Categorical Int Double
&gt;   let cat_china = train list_china :: Categorical Int Double</pre>
<p>Because training the categorical distribution is a group <strong>homomorphism</strong>, we can train a distribution over all nukes by either training directly on the data:</p>
<pre>&gt;   let cat_allA = train list_all :: Categorical Int Double</pre>
<p>or we can merge the already generated categorical distributions:</p>
<pre>&gt;   let cat_allB = cat_usa &lt;&gt; cat_uk &lt;&gt; cat_france &lt;&gt; cat_russia &lt;&gt; cat_china</pre>
<p>Because of the homomorphism property, we will get the same result both ways. Since we&#8217;ve already done the calculations for each of the countries, method B will be more efficient: it won&#8217;t have to repeat work we&#8217;ve already done.  If we plot either of these distributions, we get:</p>
<p><img class="aligncenter size-full wp-image-1811" alt="catigorical distribution of all nuclear weapons" src="http://izbicki.me/blog/wp-content/uploads/2013/01/catigorical-distribution-of-all-nuclear-weapons.png" width="572" height="401" /></p>
<p>The thing to notice in this plot is that most countries have a nuclear arsenal that is distributed similarly to the United States&#8212;except for China.  These Chinese ICBMs will become much more important when we discuss nuclear strategy in the last section.</p>
<p>But nuclear war planners don&#8217;t particularly care about this complete list of nuclear weapons.  What war planners care about is the <strong>survivable nuclear weapons</strong>&#8212;that is, weapons that won&#8217;t be blown up by a surprise nuclear attack.  Our distributions above contain nukes dropped from bombers, but these are not survivable.  They are easy to destroy.  For our purposes, we&#8217;ll call anything that&#8217;s not a bomber a survivable weapon.</p>
<p><img class="aligncenter size-full wp-image-1803" alt="nuclear-triad-no-bomber" src="http://izbicki.me/blog/wp-content/uploads/2013/01/nuclear-triad-no-bomber.jpg" width="700" height="149" /></p>
<p>We&#8217;ll use the group property of the categorical distribution to calculate the survivable weapons.  First, we create a distribution of just the <em>un</em>survivable bombers:</p>
<pre>&gt;   let list_bomber = fmap (\row -&gt; row^._4) $ filter (\row -&gt; (row^._2)=="Bomber") rawdata
&gt;   let cat_bomber = train list_bomber :: Categorical Int Double</pre>
<p>Then, we use our group inverse to subtract these unsurvivable weapons away:</p>
<pre>&gt;   let cat_survivable = cat_allB &lt;&gt; (inverse cat_bomber)</pre>
<p>Notice that we calculated this distribution indirectly&#8212;there was no possible way to combine our variables above to generate this value without using the inverse! This is the power of groups in statistics.</p>
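<p>The homomorphism and inverse properties are easy to check on a toy model. The standalone sketch below treats a categorical distribution as a map from yield to warhead count. It is an illustration only, not how HLearn stores its distributions, and the functions train, merge, and inverse are local definitions rather than the library&#8217;s. It uses France&#8217;s rows from the table above.</p>
<pre>import qualified Data.Map.Strict as M

-- Hypothetical model: yield in kt -&gt; number of warheads.
-- Negative counts are allowed, which is what makes inverses possible.
type Counts = M.Map Int Int

train :: [Int] -&gt; Counts
train xs = dropZeros (M.fromListWith (+) [(x, 1) | x &lt;- xs])

merge :: Counts -&gt; Counts -&gt; Counts
merge a b = dropZeros (M.unionWith (+) a b)

inverse :: Counts -&gt; Counts
inverse = M.map negate

-- Remove labels whose count has cancelled to zero, so that equal
-- distributions compare equal.
dropZeros :: Counts -&gt; Counts
dropZeros = M.filter (/= 0)

-- France: 150 SLBM warheads at 100 kt, 150 bomber warheads at 300 kt.
franceSlbm, franceBomber, franceAll :: [Int]
franceSlbm   = replicate 150 100
franceBomber = replicate 150 300
franceAll    = franceSlbm ++ franceBomber

-- Homomorphism: training on the concatenation equals merging the parts.
homomorphismHolds :: Bool
homomorphismHolds =
  train franceAll == merge (train franceSlbm) (train franceBomber)

-- Group inverse: subtracting the bombers leaves only the SLBM warheads.
survivable :: Counts
survivable = merge (train franceAll) (inverse (train franceBomber))</pre>
<p>In this model homomorphismHolds is True, and survivable equals train franceSlbm even though it was computed without touching the SLBM list.</p>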
<h3>More distributions</h3>
<p>The categorical distribution is not sufficient to accurately describe the distribution of nuclear weapons. This is because we don&#8217;t actually know the exact yield of a given warhead. Like all things, it has some manufacturing tolerances that we must consider. For example, if we detonate a 300 kt warhead, the actual explosion might be 275 kt, 350 kt, or the bomb might even &#8220;fizzle out&#8221; and have almost a 0 kt explosion.</p>
<p>We&#8217;ll model this by using a <strong>kernel density estimator</strong> (KDE).  The KDE basically takes all our data points, assigns each one a probability distribution called a &#8220;kernel,&#8221; then sums these kernels together.  It is a very powerful and general technique for modelling distributions&#8230; and it also happens to form a group!</p>
<p>First, let&#8217;s create the parameters for our KDE.  The bandwidth controls how wide each of the kernels is.  Bigger means wider.  I selected 20 because it made a reasonable looking density function.  The sample points are exactly what they sound like: they are where we will sample the density from.  We can generate them using the function <a href="http://hackage.haskell.org/packages/archive/HLearn-distributions/0.1.0.1/doc/html/HLearn-Models-Distributions-KernelDensityEstimator.html#g:3">genSamplePoints</a>.  Finally, the kernel is the shape of the distributions we will be summing up.  There are many <a href="http://hackage.haskell.org/packages/archive/HLearn-distributions/0.1.0.1/doc/html/HLearn-Models-Distributions-KernelDensityEstimator-Kernels.html">supported kernels</a>.</p>
<pre>&gt;   let kdeparams = KDEParams
&gt;        { bandwidth    = Constant 20
&gt;        , samplePoints = genSamplePoints
&gt;               0       -- minimum
&gt;               4000    -- maximum
&gt;               4000    -- number of samples
&gt;        , kernel       = KernelBox Gaussian
&gt;        } :: KDEParams Double</pre>
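<p>To make the roles of these parameters concrete, here is a minimal standalone sketch of the textbook kernel density estimate with a Gaussian kernel. It is not HLearn&#8217;s code. The function samplePointsBetween is a hypothetical helper written in the spirit of genSamplePoints, and HLearn&#8217;s exact spacing and normalization may differ.</p>
<pre>-- Standard Gaussian kernel.
gaussian :: Double -&gt; Double
gaussian u = exp (negate (u * u) / 2) / sqrt (2 * pi)

-- Kernel density estimate at x: one kernel of width h per data point,
-- averaged over the data set.
kde :: Double -&gt; [Double] -&gt; Double -&gt; Double
kde h xs x =
  sum [gaussian ((x - xi) / h) | xi &lt;- xs] / (h * fromIntegral (length xs))

-- n evenly spaced sample points from lo to hi (n must be at least 2).
-- Hypothetical helper; not the library's genSamplePoints.
samplePointsBetween :: Double -&gt; Double -&gt; Int -&gt; [Double]
samplePointsBetween lo hi n =
  [lo + fromIntegral i * (hi - lo) / fromIntegral (n - 1) | i &lt;- [0 .. n - 1]]

-- Density of two 100 kt warheads and one 300 kt warhead, bandwidth 20,
-- sampled on a coarse grid at 0, 100, 200, 300, and 400 kt.
sampled :: [Double]
sampled = map (kde 20 [100, 100, 300]) (samplePointsBetween 0 400 5)</pre>
<p>With a bandwidth of 20 the kernels around 100 kt and 300 kt barely overlap, so the sampled density is highest at 100, lower at 300, and close to zero at 200. A larger bandwidth would smear the two bumps together.</p>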
<p>Now, we&#8217;ll train kernel density estimates on our data.  Notice that because the KDE takes parameters, we must use the <strong>train&#8217;</strong> function instead of just train.</p>
<pre>&gt;   let kde_usa     = train' kdeparams list_usa      :: KDE Double</pre>
<p>Again, plotting just the American weapons gives:</p>
<p><img class="aligncenter size-full wp-image-1816" alt="kernel density estimate of american nuclear weapons" src="http://izbicki.me/blog/wp-content/uploads/2013/01/kernel-density-estimate-of-american-nuclear-weapons.png" width="572" height="400" /></p>
<p>And we train the corresponding distributions for the other countries.</p>
<pre>&gt;   let kde_uk      = train' kdeparams list_uk       :: KDE Double
&gt;   let kde_france  = train' kdeparams list_france   :: KDE Double
&gt;   let kde_russia  = train' kdeparams list_russia   :: KDE Double
&gt;   let kde_china   = train' kdeparams list_china    :: KDE Double
&gt;
&gt;   let kde_all = kde_usa &lt;&gt; kde_uk &lt;&gt; kde_france &lt;&gt; kde_russia &lt;&gt; kde_china</pre>
<p>The KDE is a powerful technique, but the drawback is that it is computationally expensive, especially when a large number of sample points is used. Fortunately, all computations in the HLearn library are <strong>easily parallelizable</strong> by applying the higher order function <a href="http://hackage.haskell.org/packages/archive/HLearn-algebra/0.0.1/doc/html/HLearn-Algebra-Functions.html">parallel</a>.</p>
<p>We can calculate the full KDE from scratch in parallel like this:</p>
<pre>&gt;   let list_double_all = map fromIntegral list_all :: [Double]
&gt;   let kde_all_parA = (parallel (train' kdeparams)) list_double_all :: KDE Double</pre>
<p>or we can perform a parallel reduction on the KDEs for each country like this:</p>
<pre>&gt;   let kde_all_parB = (parallel reduce) [kde_usa, kde_uk, kde_france, kde_russia, kde_china]</pre>
<p>And because training the KDE is a homomorphism, we get the same exact thing either way.  Let&#8217;s plot the parallel version:</p>
<pre>&gt;   plotDistribution (genPlotParams "kde_all" kde_all_parA) kde_all_parA</pre>
<p><img class="aligncenter size-full wp-image-1817" alt="kernel density estimate of all nuclear weapons globally-mod" src="http://izbicki.me/blog/wp-content/uploads/2013/01/kernel-density-estimate-of-all-nuclear-weapons-globally-mod.png" width="573" height="401" /></p>
<p>The parallel computation takes about 16 seconds on my Core2 Duo laptop running on 2 processors, whereas the serial computation takes about 28 seconds.</p>
<p>This is a considerable speedup, but we can still do better. It turns out that there is a homomorphism from the Categorical distribution to the KDE:</p>
<pre>&gt;   let kde_fromcat_all = cat_allB $&gt; kdeparams
&gt;   plotDistribution (genPlotParams "kde_fromcat_all" kde_fromcat_all) kde_fromcat_all</pre>
<p>(For more information about the morphism chaining operator <strong>$&gt;</strong>, see the <a href="http://hackage.haskell.org/packages/archive/HLearn-algebra/0.1.0.1/doc/html/HLearn-Algebra-Morphism.html">HLearn documentation</a>.) This computation takes less than a second and gets the exact same result as the much more expensive computations above.</p>
<p>We can express this relationship with a commutative diagram:</p>
<p><img class="aligncenter size-full wp-image-1790" alt="kde-commutative-diagram-small" src="http://izbicki.me/blog/wp-content/uploads/2013/01/kde-commutative-diagram-small.png" width="300" height="400" /></p>
<p>No matter which path we take to get to a KDE, we will get the exact same answer.  So we should always take the path that will be least computationally expensive for the data set we&#8217;re working on.</p>
<p>Why does this work? Well, the categorical distribution is a structure called the &#8220;free module&#8221; in disguise.</p>
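<p>One way to see why the shortcut through the categorical distribution is so much cheaper: a kernel centered on a given yield looks the same for every warhead with that yield, so it only needs to be evaluated once and multiplied by the count. The standalone sketch below compares the two routes for an unnormalized Gaussian KDE, using France&#8217;s rows from the table. The function names are made up, and this is an illustration of the idea rather than HLearn&#8217;s implementation.</p>
<pre>import qualified Data.Map.Strict as M

gaussian :: Double -&gt; Double
gaussian u = exp (negate (u * u) / 2) / sqrt (2 * pi)

-- Unnormalized KDE from the raw list: one kernel evaluation per warhead.
kdeFromList :: Double -&gt; [Double] -&gt; Double -&gt; Double
kdeFromList h xs x = sum [gaussian ((x - xi) / h) | xi &lt;- xs]

-- The same quantity from (yield, count) pairs: one kernel evaluation
-- per distinct yield, multiplied by how often that yield occurs.
kdeFromCounts :: Double -&gt; M.Map Double Int -&gt; Double -&gt; Double
kdeFromCounts h counts x =
  sum [fromIntegral n * gaussian ((x - xi) / h) | (xi, n) &lt;- M.toList counts]

-- France: 150 warheads at 100 kt and 150 warheads at 300 kt.
franceList :: [Double]
franceList = replicate 150 100 ++ replicate 150 300

franceCounts :: M.Map Double Int
franceCounts = M.fromListWith (+) [(x, 1) | x &lt;- franceList]</pre>
<p>For every sample point, kdeFromList evaluates 300 kernels while kdeFromCounts evaluates 2, and the results agree up to floating point rounding.</p>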
<h3>Modules and the Free Module</h3>
<p><strong>R-Modules</strong> (like groups, but unlike monoids) have not seen much love from functional programmers. This is a shame, because they&#8217;re quite handy. It turns out they will increase our performance dramatically in this case.</p>
<p>It&#8217;s not super important to know the formal definition of an R-module, but here it is anyway: an R-module is a commutative group with one additional operation: its elements can be &#8220;multiplied&#8221; by any element of the <a href="https://en.wikipedia.org/wiki/Ring_(mathematics)">ring</a> R. This is a generalization of <a href="https://en.wikipedia.org/wiki/Vector_space">vector spaces</a> because R need only be a ring instead of a <a href="https://en.wikipedia.org/wiki/Field_(mathematics)">field</a>. (Rings do not necessarily have multiplicative inverses.)  It&#8217;s probably easier to see what this means by an example.</p>
<p>Vectors are modules.  Let&#8217;s say I have a vector:</p>
<pre>&gt;   let vec = [1,2,3,4,5] :: [Int]</pre>
<p>I can perform scalar multiplication on that vector like this:</p>
<pre>&gt;   let vec2 = 3 .* vec</pre>
<p>which as you might expect results in:</p>
<pre>[3,6,9,12,15]</pre>
<p>Our next example is the <strong>free R-module</strong>. A &#8220;free&#8221; structure is one that obeys only the axioms of the structure and nothing else. Functional programmers are very familiar with the free monoid&#8212;it&#8217;s the list data type. The <strong>free Z-module</strong> is like a beefed up list. Instead of just storing the elements in a list, it also stores the number of times that element occurred.  (Z is shorthand for the set of integers, which form a ring but not a field.) This lets us greatly reduce the memory required to store a repetitive data set.</p>
<p>In HLearn, we represent the free module over a ring r with the data type:</p>
<pre>:: FreeMod r a</pre>
<p>where a is the type of elements to be stored in the free module. We can convert our lists into free modules using the function <strong>list2module</strong> like this:</p>
<pre>&gt;   let module_usa = list2module list_usa</pre>
<p>But what does the free module actually look like? Let&#8217;s print it to find out:</p>
<pre>&gt;   print module_usa</pre>
<p>gives us:</p>
<pre>FreeMod (fromList [(100,768),(150,200),(300,250),(335,249),(340,50),(455,384),(1200,50)])</pre>
<p>This is much more compact! So this is the takeaway: <strong>The free module makes repetitive data sets easier to work with.</strong> Now, let&#8217;s convert all our country data into module form:</p>
<pre>&gt;   let module_uk       = list2module list_uk
&gt;   let module_france   = list2module list_france
&gt;   let module_russia   = list2module list_russia
&gt;   let module_china    = list2module list_china</pre>
<p>Because modules are also groups, we can combine them like so:</p>
<pre>&gt;   let module_allA = module_usa &lt;&gt; module_uk &lt;&gt; module_france &lt;&gt; module_russia &lt;&gt; module_china</pre>
<p>or, we could train them from scratch:</p>
<pre>&gt;   let module_allB = list2module list_all</pre>
<p>Again, because generating a free module is a homomorphism, both methods are equivalent.</p>
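<p>For readers who want to experiment without installing HLearn, a free Z-module can be sketched in a few lines on top of Data.Map. The type FreeZ and the functions listToFreeZ and scale below are hypothetical stand-ins for FreeMod, list2module, and the (.*) operator. They mimic the behaviour described above and make no claim about how the library is written.</p>
<pre>import qualified Data.Map.Strict as M

-- Hypothetical free Z-module: each element paired with an integer count.
newtype FreeZ a = FreeZ (M.Map a Int) deriving (Eq, Show)

-- Combining two modules adds counts; entries that cancel are dropped.
instance Ord a =&gt; Semigroup (FreeZ a) where
  FreeZ x &lt;&gt; FreeZ y = FreeZ (M.filter (/= 0) (M.unionWith (+) x y))

instance Ord a =&gt; Monoid (FreeZ a) where
  mempty = FreeZ M.empty

-- Analogue of list2module: count how often each element occurs.
listToFreeZ :: Ord a =&gt; [a] -&gt; FreeZ a
listToFreeZ xs = FreeZ (M.fromListWith (+) [(x, 1) | x &lt;- xs])

-- Scalar multiplication by an integer.
scale :: Int -&gt; FreeZ a -&gt; FreeZ a
scale k (FreeZ m) = FreeZ (M.filter (/= 0) (M.map (k *) m))

-- UK and France rows from the table, one list entry per warhead.
ukList, franceList :: [Int]
ukList     = replicate 225 100
franceList = replicate 150 100 ++ replicate 150 300

ukModule, franceModule :: FreeZ Int
ukModule     = listToFreeZ ukList
franceModule = listToFreeZ franceList</pre>
<p>In this sketch ukModule &lt;&gt; franceModule equals listToFreeZ (ukList ++ franceList), which is the homomorphism property, and scaling by -1 plays the role of the group inverse: franceModule &lt;&gt; scale (-1) franceModule equals mempty.</p>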
<h3>Module distributions</h3>
<p>The categorical distribution and the KDE both have this module structure. This gives us two cool properties for free.</p>
<p>First, <strong>we can train these distributions directly from the free module</strong>.  Because the free module is potentially much more compact than a list is, this can save both memory and time. If we run:</p>
<pre>&gt;   let cat_module_all = train module_allB :: Categorical Int Double
&gt;   let kde_module_all = train' kdeparams module_allB :: KDE Double</pre>
<p>Then we get the properties:</p>
<pre>cat_module_all == cat_allB
kde_module_all == kde_all
kde_module_all == kde_fromcat_all</pre>
<p>Extending our commutative diagram above gives:</p>
<p><img class="aligncenter size-full wp-image-1791" alt="kde-commutative-diagram-big" src="http://izbicki.me/blog/wp-content/uploads/2013/01/kde-commutative-diagram-big.png" width="580" height="400" /></p>
<p>Again, no matter which path we take to train our KDE, we still get the same result because each of these arrows is a homomorphism.</p>
<p>Second, <strong>if a distribution is a module, we can weight the importance of our data points</strong>.  Let&#8217;s say we&#8217;re a general from North Korea (DPRK), and we&#8217;re planning our nuclear strategy. The US and North Korea have a very strained relationship in the nuclear department. It is much more likely that the US will try to nuke the DPRK than China will. And modules let us model this!  We can weight each country&#8217;s influence on our &#8220;nuclear threat profile&#8221; distribution like this:</p>
<pre>&gt;   let threats_dprk = 20 .* kde_usa
&gt;                   &lt;&gt; 10 .* kde_uk
&gt;                   &lt;&gt; 5  .* kde_france
&gt;                   &lt;&gt; 2  .* kde_russia
&gt;                   &lt;&gt; 1  .* kde_china
&gt;
&gt;   plotDistribution (genPlotParams "threats_dprk" threats_dprk) threats_dprk</pre>
<p>Basically, we&#8217;re saying that the USA is 20x more likely to attack the DPRK than China is.  Graphically, our threat distribution is:</p>
<p style="text-align: center;"><img class="aligncenter size-full wp-image-1797" alt="nuclear-threat-against-dprk" src="http://izbicki.me/blog/wp-content/uploads/2013/01/nuclear-threat-against-dprk1.png" width="570" height="400" /></p>
<p>The maximum threat that we have to worry about is about 1300 kt, so we need to <a href="https://encrypted.google.com/url?sa=t&amp;rct=j&amp;q=nuclear%20blast%20dynamics&amp;source=web&amp;cd=3&amp;ved=0CEQQFjAC&amp;url=http%3A%2F%2Fwww.dtic.mil%2Fdtic%2Ftr%2Ffulltext%2Fu2%2F601139.pdf&amp;ei=sFbmUIPtM8K6igLqpIGICg&amp;usg=AFQjCNE_yhq31Q03YvFfGl1Te_WsQlKOgw&amp;sig2=Q3YNU84DaVTtWffkOlLOCQ&amp;bvm=bv.1355534169,d.cGE">design all our nuclear bunkers to withstand this level of blast</a>.  Nuclear war planners would use the above distribution to figure out how much infrastructure would survive a nuclear exchange.  To see how this is done, you&#8217;ll have to click the link.</p>
<p>On the other hand, if we&#8217;re an <strong>American general</strong>, then we might say that China is our biggest threat&#8230; who knows what they&#8217;ll do when we can&#8217;t pay all the debt we owe them!?</p>
<pre>&gt;   let threats_usa = 1 .* kde_russia 
&gt;                  &lt;&gt; 5 .* kde_china
&gt;
&gt;   plotDistribution (genPlotParams "threats_usa" threats_usa) threats_usa</pre>
<p>Graphically:</p>
<p><img class="aligncenter size-full wp-image-1798" alt="nuclear-threat-against-usa" src="http://izbicki.me/blog/wp-content/uploads/2013/01/nuclear-threat-against-usa.png" width="573" height="400" /></p>
<p>So now Chinese ICBMs are a real threat.  For American infrastructure to be secure, most of it needs to be able to withstand a ~3500 kt blast.  (Actually, Chinese nuclear policy is called the &#8220;minimum means of reprisal&#8221;: these nukes are not targeted at military installations, but at major cities.  Unlike the other nuclear powers, China doesn&#8217;t hope to win a nuclear war.  Instead, its nuclear posture is designed to prevent nuclear war in the first place.  This is why China has the fewest weapons of any of these countries.  For a detailed analysis, see the book <a href="http://mitpress.mit.edu/books/minimum-means-reprisal">Minimum Means of Reprisal</a>.  This means that American military infrastructure isn&#8217;t threatened by these large Chinese nukes, and really only needs to be able to withstand an 800 kt explosion to be survivable.)</p>
<p>By the way, since we&#8217;ve already calculated all of the kde_country variables before, <strong>these computations take virtually no time at all</strong>.  Again, this is all made possible thanks to our friend abstract algebra.</p>
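<p>Here is a rough sketch of why weighting is cheap. It uses a simple map of weights rather than HLearn&#8217;s KDE, and the names Weighted, scaleSketch, and combineSketch are hypothetical. Scaling and merging only touch the already-computed summaries, so the cost depends on the number of distinct values, not on the size of the raw data.</p>
<pre>import qualified Data.Map as Map

-- Hypothetical weighted module: each value maps to a real-valued weight.
type Weighted a = Map.Map a Double

-- Scalar multiplication: scale every weight by r.
-- This plays the role that (.*) plays in the post.
scaleSketch :: Double -&gt; Weighted a -&gt; Weighted a
scaleSketch r = Map.map (r *)

-- Combine several already-trained summaries, each with its own weight.
combineSketch :: Ord a =&gt; [(Double, Weighted a)] -&gt; Weighted a
combineSketch ws = Map.unionsWith (+) [ scaleSketch r m | (r, m) &lt;- ws ]</pre>
<p>With this sketch, a threat profile would be written as combineSketch [(20, usa), (1, china)], where usa and china are summaries that were trained earlier.</p>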
<h3>Homework + next Post</h3>
<p>If you want to try out the <a href="http://hackage.haskell.org/package/HLearn-algebra">HLearn library</a> for yourself, here&#8217;s a question you can try to answer: Create the DPRK and US threat distributions above, but only use survivable weapons.  Don&#8217;t include bombers in the analysis.</p>
<p>In our <strong>next post</strong>, we&#8217;ll go into more detail about the <strong>mathematical plumbing</strong> that makes all this possible. Then we&#8217;ll start talking about Bayesian classification and full-on machine learning. <a href="http://izbicki.me/blog/feed">Subscribe to the RSS feed</a> so you don&#8217;t miss out!</p>
<p>Why don&#8217;t you listen to <a href="http://www.youtube.com/watch?v=YDFqoReof6A">Tom Lehrer&#8217;s &#8220;Song for WWIII&#8221;</a> while you wait?</p>
]]></content:encoded>
			<wfw:commentRss>http://izbicki.me/blog/nuclear-weapon-statistics-using-monoids-groups-and-modules-in-haskell/feed</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>Gaussian distributions form a monoid</title>
		<link>http://izbicki.me/blog/gausian-distributions-are-monoids?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=gausian-distributions-are-monoids</link>
		<comments>http://izbicki.me/blog/gausian-distributions-are-monoids#comments</comments>
		<pubDate>Sun, 25 Nov 2012 00:43:24 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Haskell]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://izbicki.me/blog/?p=1442</guid>
		<description><![CDATA[(And why machine learning experts should care) This is the first in a series of posts about the HLearn library for Haskell that I&#8217;ve been working on for the past few months. The idea of the library is to show that abstract algebra (specifically monoids, groups, and homomorphisms) is useful not just in esoteric functional programming, but [...]]]></description>
				<content:encoded><![CDATA[<h4>(And why machine learning experts should care)</h4>
<p><img class="alignright" title="gaussian" src="http://izbicki.me/blog/wp-content/uploads/2012/11/gaussian.png" alt="" width="275" height="220" />This is the first in a series of posts about the <a href="http://hackage.haskell.org/package/HLearn-algebra">HLearn library</a> for Haskell that I&#8217;ve been working on for the past few months. The idea of the library is to show that abstract algebra (specifically <a href="https://en.wikipedia.org/wiki/Monoid">monoids</a>, <a href="https://en.wikipedia.org/wiki/Group_(mathematics)">groups</a>, and <a href="https://en.wikipedia.org/wiki/Homomorphism">homomorphisms</a>) is useful not just in esoteric functional programming, but also in real world machine learning problems.  In particular, <strong>by framing a learning algorithm according to these algebraic properties, we get three things for free</strong>: (1) an online version of the algorithm; (2) a parallel version of the algorithm; and (3) a procedure for cross-validation that runs asymptotically faster than the standard version.</p>
<p>We&#8217;ll start with the example of a <a href="https://en.wikipedia.org/wiki/Normal_distribution">Gaussian distribution</a>. Gaussians are ubiquitous in learning algorithms because they describe many real world data sets well.  But more importantly, they are easy to work with.  They are fully determined by their mean and variance, and these parameters are easy to calculate.</p>
<p>In this post we&#8217;ll start with examples of why the monoid and group properties of Gaussians are useful in practice, then we&#8217;ll look at the math underlying these examples, and finally we&#8217;ll see that this technique is extremely fast in practice and results in <strong>near perfect parallelization</strong>.</p>
<p><span id="more-1442"></span></p>
<h3>HLearn by Example</h3>
<p>Install the libraries from a shell:</p>
<pre>$ cabal install HLearn-distributions</pre>
<p>Then import the HLearn libraries into a literate Haskell file:</p>
<pre>&gt; import HLearn.Algebra
&gt; import HLearn.Models.Distributions.Gaussian</pre>
<p>And some libraries for comparing our performance:</p>
<pre>&gt; import Criterion.Main
&gt; import Statistics.Distribution.Normal
&gt; import qualified Data.Vector.Unboxed as VU</pre>
<p>Now let&#8217;s create some data to work with. For simplicity&#8217;s sake, we&#8217;ll use a made-up data set of how much money people make. Every entry represents one person making that salary. (We use a small data set here for ease of explanation.  When we stress test this library at the end of the post, we use much larger data sets.)</p>
<pre>&gt; gradstudents = [15e3,25e3,18e3,17e3,9e3]        :: [Double]
&gt; teachers     = [40e3,35e3,89e3,50e3,52e3,97e3]  :: [Double]
&gt; doctors      = [130e3,105e3,250e3]              :: [Double]</pre>
<p>In order to train a Gaussian distribution from the data, we simply use the <strong>train</strong> function, like so:</p>
<pre>&gt; gradstudents_gaussian = train gradstudents      :: Gaussian Double
&gt; teachers_gaussian     = train teachers          :: Gaussian Double
&gt; doctors_gaussian      = train doctors           :: Gaussian Double</pre>
<p>The train function is a member of the HomTrainer type class, which we&#8217;ll talk more about later.  Also, now that we&#8217;ve trained some Gaussian distributions, we can perform all the normal calculations we might want to do on a distribution: for example, taking the mean, standard deviation, pdf, and cdf.</p>
<p>Now for the interesting bits. We start by showing that the Gaussian is a semigroup. A <strong>semigroup</strong> is any data structure that has an associative binary operation called (<strong>&lt;&gt;</strong>). Basically, we can think of (&lt;&gt;) as &#8220;adding&#8221; or &#8220;merging&#8221; the two structures together. (A semigroup is a monoid without the identity element: it has only the mappend operation.)</p>
<p>So how do we use this? Well, what if we decide we want a Gaussian over everyone&#8217;s salaries? Using the traditional approach, we&#8217;d have to recompute this from scratch.</p>
<pre>&gt; all_salaries = concat [gradstudents,teachers,doctors]
&gt; traditional_all_gaussian = train all_salaries :: Gaussian Double</pre>
<p>But this repeats work we&#8217;ve already done. On a real world data set with millions or billions of samples, this would be very slow. Better would be to merge the Gaussians we&#8217;ve already trained into one final Gaussian. We can do that with the semigroup operation (&lt;&gt;):</p>
<pre>&gt; semigroup_all_gaussian = gradstudents_gaussian &lt;&gt; teachers_gaussian &lt;&gt; doctors_gaussian</pre>
<p>Now,</p>
<pre>traditional_all_gaussian == semigroup_all_gaussian</pre>
<p>The coolest part about this is that <em>the semigroup operation takes time <strong>O(1)</strong>, no matter how much data we&#8217;ve trained the Gaussians on.</em> The naive approach takes time <strong>O(n)</strong>, so we&#8217;ve got a pretty big speed up!</p>
<p>Next, a <strong>monoid</strong> is a semigroup with an identity. The identity for a Gaussian is easy to define&#8212;simply train on the empty data set!</p>
<pre>&gt; gaussian_identity = train ([]::[Double]) :: Gaussian Double</pre>
<p>Now,</p>
<pre>gaussian_identity == mempty</pre>
<p>But we&#8217;ve still got one more trick up our sleeves.  The Gaussian distribution is not just a monoid, but also a group. Groups appear all the time in abstract algebra, but they haven&#8217;t seen much attention in functional programming for some reason. Well <strong>groups</strong> are simple: they&#8217;re just monoids with an inverse. This inverse lets us do &#8220;subtraction&#8221; on our data structures.</p>
<p>So back to our salary example. Let&#8217;s say we&#8217;ve calculated all our salaries, but we&#8217;ve realized that including grad students in the salary calculations was a mistake. (They&#8217;re not real people after all.) In a normal library, we would have to recalculate everything from scratch again, excluding the grad students:</p>
<pre>&gt; nograds = concat [teachers,doctors]
&gt; traditional_nograds_gaussian = train nograds :: Gaussian Double</pre>
<p>But as we&#8217;ve already discussed, this takes a lot of time. We can use the <strong>inverse</strong> function to do this same operation in constant time:</p>
<pre>&gt; group_nograds_gaussian = semigroup_all_gaussian &lt;&gt; (inverse gradstudents_gaussian)</pre>
<p>And now,</p>
<pre>traditional_nograds_gaussian == group_nograds_gaussian</pre>
<p>Again, we&#8217;ve converted an operation that would have taken time <strong>O(n)</strong> into one that takes time <strong>O(1)</strong>. Can&#8217;t get much better than that!</p>
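<p>To see why the inverse can run in constant time, here is a small standalone sketch. It assumes a simpler summary than HLearn&#8217;s Gaussian, namely the count, the sum, and the sum of squares of the data; the names Sums, trainSums, mergeSums, inverseSums, and meanSums are hypothetical.</p>
<pre>-- Hypothetical summary: (count, sum of x, sum of x*x).  Not HLearn's layout.
data Sums = Sums !Int !Double !Double deriving (Eq, Show)

-- One pass over the data: this is the only O(n) step.
trainSums :: [Double] -&gt; Sums
trainSums xs = Sums (length xs) (sum xs) (sum [ x * x | x &lt;- xs ])

-- The group operation adds the three fields; it never looks at raw data.
mergeSums :: Sums -&gt; Sums -&gt; Sums
mergeSums (Sums n s q) (Sums n' s' q') = Sums (n + n') (s + s') (q + q')

-- The inverse negates every field, so merging with it "subtracts" a data set.
inverseSums :: Sums -&gt; Sums
inverseSums (Sums n s q) = Sums (negate n) (negate s) (negate q)

-- The mean can be recovered from the summary (undefined for an empty summary).
meanSums :: Sums -&gt; Double
meanSums (Sums n s _) = s / fromIntegral n</pre>
<p>Merging the summary of everyone&#8217;s salaries with inverseSums of the grad students&#8217; summary gives the same fields as training on teachers and doctors alone, up to floating point rounding, and it costs three additions no matter how large the data sets were.</p>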
<h3>The HomTrainer Type Class</h3>
<p>As I&#8217;ve already mentioned, the HomTrainer type class is the basis of the <a href="http://hackage.haskell.org/package/HLearn-algebra">HLearn library</a>.  Basically, any learning algorithm that is also a <strong>semigroup homomorphism</strong> can be made an instance of HomTrainer.  This means that if xs and ys are lists of data points, the class obeys the following law:</p>
<pre>train (xs ++ ys) == (train xs) &lt;&gt; (train ys)</pre>
<p>It might be easier to see what this means in picture form:</p>
<p style="text-align: center;"><img class="aligncenter size-full wp-image-1635" title="gaussian commutative table" src="http://izbicki.me/blog/wp-content/uploads/2012/11/gaussian-commutative-table22.png" alt="gaussian commutative table" width="525" height="489" /></p>
<p>On the left hand side, we have some data sets, and on the right hand side, we have the corresponding Gaussian distributions and their parameters.  Because training the Gaussian is a homomorphism, it doesn&#8217;t matter whether we follow the orange or green paths to get to our final answer.  We get the exact same answer either way.</p>
<p>Based on this property alone, we get the three &#8220;free&#8221; properties I mentioned in the introduction.  (1) We get an online algorithm for free.  The function <strong>add1dp</strong> can be used to add a single new point to an existing Gaussian distribution.  Let&#8217;s say I forgot about one of the graduate students&#8212;I&#8217;m sure this would never happen in real life&#8212;I can add their salary like this:</p>
<pre>&gt; gradstudents_updated_gaussian = add1dp gradstudents_gaussian (10e3::Double)</pre>
<p>This updated Gaussian is exactly what we would get if we had included the new data point in the original data set.</p>
<p>(2) We get a parallel algorithm.  We can use the higher order function <strong>parallel</strong> to parallelize any application of train.  For example,</p>
<pre>&gt; gradstudents_parallel_gaussian = (parallel train) gradstudents :: Gaussian Double</pre>
<p>The function parallel automatically detects the number of processors your computer has and evenly distributes the work load over them.  As we&#8217;ll see in the performance section, this results in near perfect parallelization of the training function.  Parallelization literally could not be any simpler!</p>
<p>(3) We get asymptotically faster cross-validation; but that&#8217;s not really applicable to a Gaussian distribution so we&#8217;ll ignore it here.</p>
<p>One last note about the HomTrainer class: we never actually have to define the <strong>train</strong> function for our learning algorithm explicitly.  All we have to do is define the semigroup operation, and the compiler will derive our training function for us!  We&#8217;ll save a discussion of why this homomorphism property gives us these results for another post.  Instead, we&#8217;ll just take a look at what the Gaussian distribution&#8217;s semigroup operation looks like.</p>
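<p>As a rough illustration of how a training function can fall out of the algebra, here is a standalone sketch that uses only the Monoid class from base. It is not how HLearn defines HomTrainer: the model (a running count and total) and the names MeanModel, singletonSketch, trainSketch, add1dpSketch, and trainChunked are all assumptions made for this example.</p>
<pre>import Data.Monoid (Sum(..), (&lt;&gt;))

-- Hypothetical model: running count and running total, enough for a mean.
type MeanModel = (Sum Int, Sum Double)

-- The model for a single data point.
singletonSketch :: Double -&gt; MeanModel
singletonSketch x = (Sum 1, Sum x)

-- Batch training derived from the monoid: map each point to a model, then merge.
trainSketch :: [Double] -&gt; MeanModel
trainSketch = foldMap singletonSketch

-- Online update: merge in the model for one new point.
add1dpSketch :: MeanModel -&gt; Double -&gt; MeanModel
add1dpSketch m x = m &lt;&gt; singletonSketch x

-- Chunked training: train each chunk of k points separately, then merge.
-- Each chunk could be handed to a different processor.
trainChunked :: Int -&gt; [Double] -&gt; MeanModel
trainChunked k = mconcat . map trainSketch . chunks
  where
    chunks [] = []
    chunks ys = let (h, t) = splitAt (max 1 k) ys in h : chunks t</pre>
<p>Because merging is associative, trainChunked gives the same model as trainSketch for any chunk size (up to floating point rounding in the totals), which is the property that the online and parallel versions rely on.</p>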
<h3>The Semigroup operation</h3>
<p>Our Gaussian data type is defined as:</p>
<pre>data Gaussian datapoint = Gaussian
    { n  :: !Int         -- The number of samples trained on
    , m1 :: !datapoint   -- The mean (first moment) of the trained distribution
    , m2 :: !datapoint   -- The variance (second moment) times (n-1)
    , dc :: !Int         -- The number of "dummy points" that have been added
    }</pre>
<p>In order to estimate a Gaussian from a sample, we must find the total number of samples (n), the mean (m1), and the variance (calculated from m2).  (We&#8217;ll explain what dc means a little later.)  Therefore, we must figure out an appropriate definition for our semigroup operation below:</p>
<pre>(Gaussian na m1a m2a dca) &lt;&gt; (Gaussian nb m1b m2b dcb) = Gaussian n' m1' m2' dc'</pre>
<p>First, we calculate the number of samples n&#8217;. The number of samples in the resulting distribution is simply the sum of the number of samples in both the input distributions:</p>
<p style="text-align: center;"><span id='tex_5003'></span></p>
<p>Second, we calculate the new average m1&#8242;. We start with the definition that the final mean is:</p>
<p style="text-align: center;"><span id='tex_4819'></span></p>
<p>Then we split the summation according to whether the input element <span id='tex_7282'></span> was from the left Gaussian a or right Gaussian b, and substitute with the definition of the mean above:</p>
<table cellpadding="10" align="center">
<tbody>
<tr>
<td style="text-align: left;"><span id='tex_9135'></span></td>
</tr>
<tr>
<td><span id='tex_2631'></span></td>
</tr>
</tbody>
</table>
<p>Notice that this is simply the weighted average of the two means. This makes intuitive sense. But there is a slight problem with this definition: when implemented on a computer with floating point arithmetic, the division produces infinity or NaN whenever n&#8217; is 0.  We solve this problem by adding a &#8220;dummy&#8221; element into the Gaussian whenever n&#8217; would be zero.  This increases n&#8217; from 0 to 1, preventing the division by 0.  The variable dc counts how many dummy points have been added, so that we can remove them before performing calculations (e.g. finding the pdf) that would be affected by an incorrect number of samples.</p>
<p>Finally, we must calculate the new m2&#8242;. We start with the definition that the variance times (n-1) is:</p>
<p style="text-align: center;"><span id='tex_8151'></span></p>
<p>(Note that the second half of the equation is a property of variance, and <a href="https://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance">its derivation can be found on Wikipedia</a>.)</p>
<p>Then, we do some algebra, split the summations according to which input Gaussian the data point came from, and resubstitute the definition of m2 to get:</p>
<table cellpadding="10" align="center">
<tbody>
<tr>
<td><span id='tex_543'></span></td>
</tr>
<tr>
<td><span id='tex_5954'></span></td>
</tr>
<tr>
<td><span id='tex_6081'></span></td>
</tr>
<tr>
<td><span id='tex_3104'></span></td>
</tr>
<tr>
<td><span id='tex_9246'></span></td>
</tr>
</tbody>
</table>
<p>Notice that this equation has no divisions in it.  This is why we are storing m2 as the variance times (n-1) rather than simply the variance.  Adding in the extra divisions causes training our Gaussian distribution to run about 4x slower.  I&#8217;d say Haskell is getting pretty fast if the number of floating point divisions we perform is impacting our code&#8217;s performance that much!</p>
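<p>Putting the three update rules together, a merge function might look like the sketch below. This is a simplified reconstruction from the definitions in this section, not the library&#8217;s source: the type GaussSketch drops the dc field, the empty case is handled with a guard instead of dummy points, and all the names are hypothetical.</p>
<pre>-- Simplified stand-in for the post's Gaussian type (no dummy-point counter).
data GaussSketch = GaussSketch
    { gN  :: !Int      -- number of samples
    , gM1 :: !Double   -- mean
    , gM2 :: !Double   -- variance times (n-1)
    } deriving Show

-- Train directly from the definitions of the mean and of m2.
trainGauss :: [Double] -&gt; GaussSketch
trainGauss [] = GaussSketch 0 0 0
trainGauss xs = GaussSketch n m1 m2
  where
    n  = length xs
    m1 = sum xs / fromIntegral n
    m2 = sum [ (x - m1) * (x - m1) | x &lt;- xs ]

-- Merge two trained summaries using
--   n'  = na + nb
--   m1' = (na*m1a + nb*m1b) / n'
--   m2' = m2a + m2b + na*m1a^2 + nb*m1b^2 - n'*m1'^2
mergeGauss :: GaussSketch -&gt; GaussSketch -&gt; GaussSketch
mergeGauss (GaussSketch na m1a m2a) (GaussSketch nb m1b m2b)
    | n' == 0   = GaussSketch 0 0 0   -- sidestep the division by zero
    | otherwise = GaussSketch n' m1' m2'
  where
    n'  = na + nb
    fa  = fromIntegral na
    fb  = fromIntegral nb
    fn  = fromIntegral n'
    m1' = (fa * m1a + fb * m1b) / fn
    m2' = m2a + m2b + fa * m1a * m1a + fb * m1b * m1b - fn * m1' * m1'

-- Sample variance recovered from the stored m2 (needs at least two samples).
varianceSketch :: GaussSketch -&gt; Double
varianceSketch g = gM2 g / fromIntegral (gN g - 1)</pre>
<p>The only division is the one for the mean. Merging trainGauss xs with trainGauss ys should agree with trainGauss (xs ++ ys) up to floating point rounding.</p>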
<h3>Performance</h3>
<p>This algebraic interpretation of the Gaussian distribution has excellent time and space performance.  To show this, we&#8217;ll compare performance to the excellent Haskell package called &#8220;<a href="http://hackage.haskell.org/package/statistics">statistics</a>&#8221; that also has support for Gaussian distributions.  We use the criterion package to create three tests:</p>
<pre>&gt; size = 10^8
&gt; main = defaultMain
&gt;     [ bench "statistics-Gaussian" $ whnf (normalFromSample . VU.enumFromN 0) (size)
&gt;     , bench "HLearn-Gaussian" $ whnf
&gt;         (train :: VU.Vector Double -&gt; Gaussian Double)
&gt;         (VU.enumFromN (0::Double) size)
&gt;     , bench "HLearn-Gaussian-Parallel" $ whnf
&gt;         (parallel $ (train :: VU.Vector Double -&gt; Gaussian Double))
&gt;         (VU.enumFromN (0::Double) size)
&gt;     ]</pre>
<p>In these tests, we time three different methods of constructing Gaussian distributions given 100,000,000 data points.  On my laptop with 2 cores, I get these results:</p>
<table border="1" cellspacing="0" cellpadding="5px" align="center">
<tbody>
<tr>
<td>statistics-Gaussian</td>
<td>2.85 sec</td>
</tr>
<tr>
<td>HLearn-Gaussian</td>
<td>1.91 sec</td>
</tr>
<tr>
<td><strong>HLearn-Gaussian-Parallel</strong></td>
<td><strong>0.96 sec</strong></td>
</tr>
</tbody>
</table>
<p>Pretty nice!  The algebraic method managed to outperform the traditional method for training a Gaussian by a handy margin.  Plus, our parallel algorithm runs almost exactly twice as fast on two processors (0.96 sec versus 1.91 sec).  Theoretically, this should scale to an arbitrary number of processors, but I don&#8217;t have a bigger machine to try it out on.</p>
<p>Another interesting advantage of the <a href="http://hackage.haskell.org/package/HLearn-algebra">HLearn library</a> is that <strong>we can trade off time and space performance</strong> by changing which data structures store our data set.  Specifically, we can use the same functions to train on a list or an unboxed vector.  We do this by using the <a href="http://hackage.haskell.org/package/ConstraintKinds">ConstraintKinds</a> package on Hackage, which extends base type classes like Functor and Foldable to work on types that require constraints.  Thus, we have a Functor instance for Vector.Unboxed. This is not possible without ConstraintKinds.</p>
<p>Using this benchmark code:</p>
<pre>main = do
    print $ (train [0..fromIntegral size::Double] :: Gaussian Double)
    print $ (train (VU.enumFromN (0::Double) size) :: Gaussian Double)</pre>
<p style="text-align: left;">We generate the following heap profile:</p>
<p style="text-align: center;"><img class="aligncenter size-full wp-image-1485" title="spacetests-gaussian" src="http://izbicki.me/blog/wp-content/uploads/2012/11/spacetests-gaussian.png" alt="" width="690" height="462" /></p>
<p style="text-align: left;">Processing the data as a vector requires that we allocate all the memory in advance.  This lets the program run faster, but prevents us from loading data sets larger than the amount of memory we have.  Processing the data as a list, however, allows us to allocate the memory only as we use it.  But because lists are boxed and lazy data structures, we must accept that our program will run about 10x slower.  Lucky for us, <strong>GHC takes care of all the boring details of making this happen seamlessly.  We only have to write our train function once.</strong></p>
<h3 style="text-align: left;">Future Posts</h3>
<p style="text-align: left;">There are still at least four more major topics to cover in the HLearn library:  (1) We can extend this discussion to show how the Naive Bayes learning algorithm has a similar monoid and group structure.  (2) There are many more learning algorithms with group structures we can look into.  (3) We can look at exactly how all these higher order functions, like batch and parallel, work under the hood.  And (4) we can see how the fast cross-validation I briefly mentioned works and why it&#8217;s important.</p>
<p style="text-align: left;"><a href="http://izbicki.me/blog/feed">Subscribe to the RSS feed</a> and stay tuned!</p>
]]></content:encoded>
			<wfw:commentRss>http://izbicki.me/blog/gausian-distributions-are-monoids/feed</wfw:commentRss>
		<slash:comments>34</slash:comments>
		</item>
		<item>
		<title>Using HMMs in Haskell for Bioinformatics</title>
		<link>http://izbicki.me/blog/using-hmms-in-haskell-for-bioinformatics?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=using-hmms-in-haskell-for-bioinformatics</link>
		<comments>http://izbicki.me/blog/using-hmms-in-haskell-for-bioinformatics#comments</comments>
		<pubDate>Thu, 22 Mar 2012 03:11:24 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Haskell]]></category>

		<guid isPermaLink="false">http://izbicki.me/blog/?p=767</guid>
		<description><![CDATA[EDIT: WordPress seems to garble the code sections on occasion for no good reason.  If you want to run the code, you should download the original file instead.  Sorry. This is a tutorial for how to use Hidden Markov Models (HMMs) in Haskell.  We will use the Data.HMM package to find genes in the second [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://izbicki.me/blog/wp-content/uploads/2012/03/dna.jpg"><img class="alignright" style="border: 0pt none;" title="dna" src="http://izbicki.me/blog/wp-content/uploads/2012/03/dna-300x195.jpg" alt="" width="240" height="156" /></a></p>
<p><span style="color: #ff0000;">EDIT: WordPress seems to garble the code sections on occasion for no good reason.  If you want to run the code, you should <a href="http://izbicki.me/public/cs/BioBlog.lhs">download the original file instead</a>.  Sorry.<br />
</span></p>
<p>This is a tutorial for how to use Hidden Markov Models (HMMs) in Haskell.  We will use the Data.HMM package to find genes in the second chromosome of <em>Vitis vinifera</em>: the wine grape vine. Predicting gene locations is a common task in bioinformatics that HMMs have proven good at.</p>
<p>The basic procedure has three steps.  First, we create an HMM to model the chromosome.  We do this by running the Baum-Welch training algorithm on all the DNA.  Second, we create an HMM to model transcription factor binding sites.  This is where genes are located.  Finally, we use Viterbi&#8217;s algorithm to determine which HMM best models the DNA at a given location in the chromosome.  If it&#8217;s the first, this is probably not the start of a gene.  If it&#8217;s the second, then we&#8217;ve found a gene!</p>
<p><span id="more-767"></span></p>
<p>Unfortunately, it&#8217;s beyond the scope of this tutorial to go into the math of HMMs and how they work.  Instead, we will focus on how to use them in practice.  And like all good Haskell tutorials, this page is actually a literate Haskell program, so you can simply cut and paste it into your favorite text editor to run it.</p>
<h3>The code</h3>
<p>Before we do anything else, we must import the Data.HMM library, and some other libraries for the program:</p>
<pre>&gt;import Data.HMM
&gt;import Control.Monad
&gt;import Data.Array
&gt;import System.IO</pre>
<p>Now, let&#8217;s create our first HMM.  The HMM datatype is:</p>
<pre>data HMM stateType eventType = HMM { states :: [stateType]
                                   , events :: [eventType]
                                   , initProbs :: (stateType -&gt; Prob)
                                   , transMatrix :: (stateType -&gt; stateType -&gt; Prob)
                                   , outMatrix :: (stateType -&gt; eventType -&gt; Prob)
                                   }</pre>
<p>Notice that states and events can be any type supported by Haskell.  In this example, we will be using both integers and strings for the states, and characters for the events.  DNA is composed of 4 bases that get repeated over and over: adenine (A), guanine (G), cytosine (C), and thymine (T), so &#8220;AGCT&#8221; will be the list of our events.</p>
<p>We&#8217;ll start by creating a simple HMM by hand:</p>
<pre>&gt;hmm1 = HMM { states=[1,2]
&gt;           , events=['A','G','C','T']
&gt;           , initProbs = ip
&gt;           , transMatrix = tm
&gt;           , outMatrix = om
&gt;           }
&gt;
&gt;ip s
&gt;    | s == 1  = 0.1
&gt;    | s == 2  = 0.9
&gt;
&gt;tm s1 s2
&gt;    | s1==1 &amp;&amp; s2==1    = 0.9
&gt;    | s1==1 &amp;&amp; s2==2    = 0.1
&gt;    | s1==2 &amp;&amp; s2==1    = 0.5
&gt;    | s1==2 &amp;&amp; s2==2    = 0.5
&gt;
&gt;om s e
&gt;    | s==1 &amp;&amp; e=='A'    = 0.4
&gt;    | s==1 &amp;&amp; e=='G'    = 0.1
&gt;    | s==1 &amp;&amp; e=='C'    = 0.1
&gt;    | s==1 &amp;&amp; e=='T'    = 0.4
&gt;    | s==2 &amp;&amp; e=='A'    = 0.1
&gt;    | s==2 &amp;&amp; e=='G'    = 0.4
&gt;    | s==2 &amp;&amp; e=='C'    = 0.4
&gt;    | s==2 &amp;&amp; e=='T'    = 0.1</pre>
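<p>When writing tables like these by hand, it is easy to end up with a row that does not sum to 1. The standalone sketch below shows one way to check for that. It works on plain Double-valued functions rather than Data.HMM&#8217;s Prob type, and the names isDistribution, rowsOk, and tmSketch are made up for this example.</p>
<pre>-- True when the values are non-negative and sum to 1 (within rounding error).
isDistribution :: [Double] -&gt; Bool
isDistribution ps = all (&gt;= 0) ps &amp;&amp; abs (sum ps - 1) &lt; 1e-9

-- Check that, for every state, the probabilities over all targets sum to 1.
rowsOk :: [s] -&gt; [x] -&gt; (s -&gt; x -&gt; Double) -&gt; Bool
rowsOk sts xs f = all (\s -&gt; isDistribution [ f s x | x &lt;- xs ]) sts

-- The transition table from hmm1, restated over plain Doubles.
tmSketch :: Int -&gt; Int -&gt; Double
tmSketch 1 1 = 0.9
tmSketch 1 2 = 0.1
tmSketch 2 1 = 0.5
tmSketch 2 2 = 0.5
tmSketch _ _ = 0</pre>
<p>Here rowsOk [1,2] [1,2] tmSketch evaluates to True. The same check applies to an output table by passing the list of events as the second argument.</p>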
<p>While creating HMMs manually is straightforward, we will typically want to start with one of the built-in HMMs.  The simplest way to do this is the function simpleHMM:</p>
<pre>&gt;hmm2 = simpleHMM [1,2] "AGCT"</pre>
<p>hmm2 is an HMM with the same states and events as hmm1, but all the initial, transition, and output probabilities are distributed in an unknown manner.  This is okay, however, because we will normally want to train our HMM using Baum-Welch to determine those parameters automatically.</p>
<p>Another simple way to create an HMM is by creating a non-hidden Markov model with the simpleMM command.  (Note the absence of an &#8220;H&#8221;.)  Below, hmm3 is a 3rd-order Markov model for DNA:</p>
<pre>&gt;hmm3 = simpleMM "AGCT" 3</pre>
<p>Now, how do we train our model?  The standard algorithm is called Baum-Welch.  To illustrate the process, we&#8217;ll create a short array of DNA, then call three iterations of baumWelch on it.</p>
<pre>&gt;dnaArray = listArray (1,20) "AAAAGGGGCTCTCTCCAACC"
&gt;hmm4 = baumWelch hmm3 dnaArray 3</pre>
<p>We use arrays instead of lists because this gives us better performance when we start passing large training data to Baum-Welch.  Doing three iterations is completely arbitrary.  Baum-Welch is guaranteed to converge to a local optimum, but there is no way of knowing in advance how many iterations that will take.</p>
<p>Now, let&#8217;s train our HMM on an entire chromosome.  We will use the <a href="http://izbicki.me/public/hmm/winegrape-chromosome2">winegrape-chromosome2</a> file.  This DNA file was downloaded from the <a href="http://www.plantgdb.org/">plant genomics database</a>.  We can load and process it like this:</p>
<pre>&gt;loadDNAArray len = do
&gt;    dna &lt;- readFile "winegrape-chromosome2"
&gt;    let dnaArray = listArray (1,len) $ filter isBP dna
&gt;    return dnaArray
&gt;    where
&gt;          isBP x = if x `elem` "AGCT" -- This filters out the "N" base pair
&gt;                      then True       -- "N" means it could be any bp
&gt;                      else False      -- so this should not affect results too much
&gt;
&gt;createDNAhmm file len hmm = do
&gt;    dna &lt;- loadDNAArray len
&gt;    let hmm' = baumWelch hmm dna 10
&gt;    putStrLn $ show hmm'
&gt;    saveHMM file hmm'
&gt;    return hmm'</pre>
<p>The loadDNAArray function simply loads the DNA from the file into an array, and the createDNAhmm function actually calls the Baum-Welch algorithm.  This function can take a while on long inputs&#8212;and DNA is a long input!&#8212;so we also pass a file parameter for it to save our HMM when it&#8217;s done for later use.  Now let&#8217;s create our HMM:</p>
<pre>&gt;hmmDNA = createDNAhmm "trainedDNA.hmm" 50000 hmm3</pre>
<p>This call takes almost a full day on my laptop.  Luckily, you don&#8217;t have to repeat it.  The Data.HMM.HMMFile module allows us to write our HMMs to disk and retrieve them later.  Simply download <a href="http://izbicki.me/public/hmm/trainedDNA.hmm">trainedDNA.hmm</a> and  then call loadHMM:</p>
<pre>&gt;hmmDNA_file = loadHMM "trainedDNA.hmm" :: IO (HMM String Char)</pre>
<p>NOTE: Whenever you use loadHMM, you must specify the type of the resulting HMM.  loadHMM relies on the built-in &#8220;read&#8221; function, and this cannot work unless you specify the type!</p>
<p>Great!  Now, we have a fully trained HMM for our chromosome.  Our next step is to train another HMM on the transcription factor binding sites.  There are many advanced ways to do this (e.g. Profile HMMs), but that&#8217;s beyond the scope of this tutorial.  We&#8217;re simply going to download a <a href="http://izbicki.me/public/hmm/TFBindingSites">list of TF binding sites</a>, concatenate them, then train our HMM on them.  This won&#8217;t be as effective, but saves us from taking an unnecessary tangent.</p>
<pre>&gt;createTFhmm file hmm = do
&gt;    x &lt;- strTF
&gt;    let hmm' = baumWelch hmm (listArray (1,length x) x) 10
&gt;    putStrLn $ show hmm'
&gt;    saveHMM file hmm'
&gt;    return hmm'
&gt;    where
&gt;          strTF = liftM concat loadTF  -- join all binding sites into one string
&gt;          loadTF = liftM (filter isValidTF . lines) $ readFile "TFBindingSites"
&gt;          -- keep non-empty lines that contain none of the excluded characters
&gt;          isValidTF str = not (null str) &amp;&amp; not (elemChecker "#(/)[]|N" str)
&gt;
&gt;-- True if any element of elemList occurs in list.
&gt;elemChecker :: (Eq a) =&gt; [a] -&gt; [a] -&gt; Bool
&gt;elemChecker elemList list = any (`elem` list) elemList</pre>
<p>Now, let&#8217;s create our transcription factor HMM:</p>
<pre>&gt;hmmTF = createTFhmm "trainedTF.hmm" $ simpleMM "AGCT" 3</pre>
<p>Or if you&#8217;re in a hurry, just download <a href="http://izbicki.me/public/hmm/trainedTF.hmm">trainedTF.hmm</a> and load it:</p>
<pre>&gt;hmmTF_file = loadHMM "trainedTF.hmm" :: IO (HMM String Char)</pre>
<p>So now we have two HMMs.  How are we going to use them?  We&#8217;ll combine the two HMMs into a single HMM, then use Viterbi&#8217;s algorithm to determine which HMM best characterizes our DNA at a given point.  If it&#8217;s hmmDNA, then we do not have a TF binding site at that location, but if it&#8217;s hmmTF, then we probably do.</p>
<p>The Data.HMM library provides another convenient function for combining HMMs, hmmJoin.  It adds transitions from every state in the first HMM to every state in the second, and vice versa, using the &#8220;joinParam&#8221; to determine the relative probability of making that transition.  This is the simplest way to combine two HMMs.  If you want more control over how they get combined, you can implement your own version.</p>
<pre>&gt;findGenes len joinParam hout = do
&gt;    hmmTF &lt;- loadHMM "hmm/TF-3.hmm" :: IO (HMM String Char)
&gt;    hmmDNA &lt;- loadHMM "hmm/autowinegrape-1000-3.hmm"  :: IO (HMM String Char)
&gt;    let hmm' = seq hmmDNA $ seq hmmTF $ hmmJoin hmmTF hmmDNA joinParam
&gt;    dna &lt;- loadDNAArray len
&gt;    hPutStrLn hout ("len="++show len++",joinParam="++show joinParam++" -&gt; "++(show $ concat $ map (show . fst) $ viterbi hmm' dna))
&gt;
&gt;main = do
&gt;    hout &lt;- openFile "results.txt" WriteMode  -- output path is a placeholder
&gt;    mapM_ (\len -&gt; mapM_ (\jp -&gt; findGenes len jp hout) [0.5,0.51,0.52,0.53,0.54,0.55,0.56,0.57,0.58,0.59,0.6]) [50000]
&gt;    hClose hout</pre>
<p>Finally, our main function runs findGenes with several different joinParams.  These act as thresholds for finding where the genes actually occur.  You can download the full results <a href="http://izbicki.me/public/hmm/BioResults">here</a>.</p>
<p>How should we interpret these results?  Let&#8217;s look at the output from around 38000 base pairs into the chromosome:</p>
<p style="text-align: center;">jP=0.50 -&gt; 222222222222222222222222222222222222222222222222222222<br />
jP=0.51 -&gt; 222222222222222222222222222222222222222222222222222222<br />
jP=0.52 -&gt; 222222222222222222222222222222222222222222222222222222<br />
jP=0.53 -&gt; 222222222222222222222222222222222222222222222222222222<br />
jP=0.54 -&gt; 222222222222222222222222222222222222222222222222222222<br />
jP=0.55 -&gt; 222222222222222222222222222222222222222222222222222222<br />
jP=0.56 -&gt; 222222222222222222222222222211112222222222222222222222<br />
jP=0.57 -&gt; 222222222222222222222222222211112222222222222111222222<br />
jP=0.58 -&gt; 222221111111112222222222222211111122222222222111222222<br />
jP=0.59 -&gt; 222221111111112222211111111111111122211111111111222222<br />
jP=0.60 -&gt; 222221111111112222211111111111111111111111111111112222</p>
<p>Wherever there is a 2, Viterbi selected hmmDNA; wherever there is a 1, Viterbi selected hmmTF.  Whether you select this area as a likely candidate for a transcription factor binding site depends on how you set your join parameter.</p>
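<p>One way to post-process a line of this output is to collect the runs of 1s, since each run is a candidate binding site.  The following is a standalone sketch rather than part of Data.HMM; the name tfRuns and the sample string are made up for illustration:</p>
<pre>import Data.List (group)

-- Start offset (0-based) and length of every maximal run of '1' in a label string.
tfRuns :: String -&gt; [(Int, Int)]
tfRuns labels = [ (start, length run) | (start, run@('1':_)) &lt;- zip starts runs ]
  where
    runs   = group labels                    -- e.g. ["222","11","22","111"]
    starts = scanl (+) 0 (map length runs)   -- offset at which each run begins

main :: IO ()
main = print (tfRuns "2221122111")   -- prints [(3,2),(7,3)]</pre>
<p>The offsets are relative to the start of the string you pass in, so you would need to add the position of the window within the chromosome to get real coordinates.</p>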
<p>Now that you&#8217;re familiar with how the Data.HMM module works, let&#8217;s look at its performance characteristics.</p>
<h3>Performance</h3>
<p>Overall, the Data.HMM package performs well on medium size datasets of up to about 10,000 items.  Unfortunately, on larger datasets, performance begins to suffer.  Algorithms that should be running in linear time start taking super-linear time, presumably because Haskell&#8217;s garbage collector is interfering.  More work is needed to determine the exact cause and fix it.  Still, performance remains tractable on these large datasets up to 100,000 items, which is the largest I tried.</p>
<p>I ran these tests using Haskell&#8217;s Criterion package.  Criterion conveniently allows you to define multiple tests and does all the statistical analysis of them.  For these tests, I did 3 trials each, and ran them on my Core 2 Duo laptop.  The code for the tests can be found in the HMMPerf.hs file.  In all graphs, the blue line is actual performance data and the red line is a best fit curve.</p>
<p><strong>Baum-Welch&#8217;s performance</strong></p>
<p>First, as expected we find that Baum-Welch runs in linear time based on the number of iterations.  In an imperative language, there would be no point in even testing this.  But in Haskell, laziness can rear its head in unexpected ways, so it is important to ensure this is linear.</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2012/03/bw-itr-lin1.png"><img class="aligncenter size-full wp-image-773" title="bw-itr-lin1" src="http://izbicki.me/blog/wp-content/uploads/2012/03/bw-itr-lin1.png" alt="" width="540" height="281" /></a></p>
<p>For small arrays, Baum-Welch runs in linear time.</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2012/03/bw-len-small.png"><img class="aligncenter size-full wp-image-774" title="bw-len-small" src="http://izbicki.me/blog/wp-content/uploads/2012/03/bw-len-small.png" alt="" width="540" height="285" /></a>But for larger arrays, it runs in super-linear time.  It is interesting that the exponent on our polynomial function is not quite at 2.  This provides evidence that the performance hit has to do with the Haskell compiler and not an incorrect implementation.</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2012/03/bw-len-large.png"><img class="aligncenter size-full wp-image-775" title="bw-len-large" src="http://izbicki.me/blog/wp-content/uploads/2012/03/bw-len-large.png" alt="" width="541" height="284" /></a></p>
<p><strong>Viterbi&#8217;s performance</strong></p>
<p>As expected, the Viterbi algorithm runs in quadratic time in the number of states in the HMM.</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2012/03/viterbi-states.png"><img class="aligncenter size-full wp-image-779" title="viterbi-states" src="http://izbicki.me/blog/wp-content/uploads/2012/03/viterbi-states.png" alt="" width="541" height="284" /></a></p>
<p>The curves for the Viterbi algorithm clearly demonstrate that something weird is going on.  At small array sizes, Viterbi is only mildly super-linear.  Its best-fit polynomial curve has an exponent of only 1.3.  But at medium array lengths, this exponent increases to 1.8, and at large array lengths, the exponent increases to 1.97.</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2012/03/viterbi-len-small.png"><img class="aligncenter size-full wp-image-778" title="viterbi-len-small" src="http://izbicki.me/blog/wp-content/uploads/2012/03/viterbi-len-small.png" alt="" width="545" height="285" /></a><br />
<a href="http://izbicki.me/blog/wp-content/uploads/2012/03/viterbi-len-med.png"><img class="aligncenter size-full wp-image-777" title="viterbi-len-med" src="http://izbicki.me/blog/wp-content/uploads/2012/03/viterbi-len-med.png" alt="" width="541" height="281" /></a><br />
<a href="http://izbicki.me/blog/wp-content/uploads/2012/03/viterbi-len-large.png"><img class="aligncenter size-full wp-image-777" title="viterbi-len-large" src="http://izbicki.me/blog/wp-content/uploads/2012/03/viterbi-len-large.png" alt="" width="541" height="281" /></a></p>
<h3>Conclusion</h3>
<p>Data.HMM is a great tool if you just need a small HMM in your Haskell application for some reason.  If you&#8217;re going to be making heavy use of HMMs and don&#8217;t specifically need to interact with Haskell, it&#8217;s probably better to use a package written in C++ that&#8217;s been optimized for speed.</p>
]]></content:encoded>
			<wfw:commentRss>http://izbicki.me/blog/using-hmms-in-haskell-for-bioinformatics/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>How to create an unfair coin and prove it with math</title>
		<link>http://izbicki.me/blog/how-to-create-an-unfair-coin-and-prove-it-with-math?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=how-to-create-an-unfair-coin-and-prove-it-with-math</link>
		<comments>http://izbicki.me/blog/how-to-create-an-unfair-coin-and-prove-it-with-math#comments</comments>
		<pubDate>Sat, 03 Dec 2011 09:46:32 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://izbicki.me/blog/?p=498</guid>
		<description><![CDATA[Want to make sure you win the coin toss just a little more often than you should?  I certainly do, so I made some unfair coins.  We&#8217;ll use the beta distribution to see just how unfair they are.  While this is just a toy example problem for using the beta distribution, machine learning algorithms rely [...]]]></description>
				<content:encoded><![CDATA[<p>Want to make sure you win the coin toss just a little more often than you should?  I certainly do, so I made some unfair coins.  We&#8217;ll use the beta distribution to see just how unfair they are.  While this is just a toy example problem for using the beta distribution, machine learning algorithms rely on this distribution for learning just about everything. Math is an amazing thing that way.</p>
<h3><span id="more-498"></span>Making the coins</h3>
<p>We&#8217;ll make our unfair coins by bending them.  Our hypothesis is that the concave side will have less area to land on, and so the coin should land on it less often.  Let&#8217;s get started.</p>
<p>It&#8217;s easy to bend the coins with your teeth:</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/11/bend_coin_with_tooth.jpg"><img class="aligncenter size-full wp-image-515" title="bend_coin_with_tooth" src="http://izbicki.me/blog/wp-content/uploads/2011/11/bend_coin_with_tooth.jpg" alt="Bending a coin with my teeth" width="3264" height="1952" /></a></p>
<p>WAIT!  That really hurts!  Using pliers or wrenches works much better:</p>
<p style="text-align: center;"><a href="http://izbicki.me/blog/wp-content/uploads/2011/11/IMAG0076.jpg"><img class="aligncenter size-full wp-image-503" title="Bending coins with pliers" src="http://izbicki.me/blog/wp-content/uploads/2011/11/IMAG0076.jpg" alt="Bending coins with pliers" width="3264" height="1952" /></a></p>
<p>I made seven coins this way, each with a different bending angle.</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/11/coins-all.jpg"><img class="aligncenter size-full wp-image-624" title="coins-all" src="http://izbicki.me/blog/wp-content/uploads/2011/11/coins-all.jpg" alt="" width="700" height="230" /></a></p>
<p>I did 100 flips for each coin, making sure each flip went at least a foot in the air and spun real well.  &#8220;Umm&#8230; only 100 flips?&#8221; you ask, &#8220;That can&#8217;t be enough!&#8221;  Just you wait until the section on the math.</p>
<p>Here are the raw results:</p>
<table border="1" cellspacing="0">
<tbody>
<tr>
<td style="text-align: center;">Coin</td>
<td>Total Flips</td>
<td>Heads</td>
<td>Tails</td>
</tr>
<tr>
<td>0</td>
<td> 100</td>
<td>53</td>
<td>47</td>
</tr>
<tr>
<td>1</td>
<td> 100</td>
<td>55</td>
<td>45</td>
</tr>
<tr>
<td>2</td>
<td> 100</td>
<td>49</td>
<td>51</td>
</tr>
<tr>
<td>3</td>
<td> 100</td>
<td>41</td>
<td>59</td>
</tr>
<tr>
<td>4</td>
<td> 100</td>
<td>39</td>
<td>61</td>
</tr>
<tr>
<td>5</td>
<td> 100</td>
<td>27</td>
<td>73</td>
</tr>
<tr>
<td>6</td>
<td> 100</td>
<td>0</td>
<td>100</td>
</tr>
</tbody>
</table>
<p>&nbsp;</p>
<h3>Now for the math</h3>
<p>Coin flipping is a <a href="http://en.wikipedia.org/wiki/Bernoulli_process">Bernoulli process</a>.  This just means that each trial (flip) has only two possible outcomes (heads or tails), that the probability of heads is the same on every trial, and that each trial is independent of every other trial.  What we&#8217;re interested in calculating is the expected value of a coin flip for each of our coins.  That is, what is the probability it will come up heads?  The obvious way to calculate this probability is simply to divide the number of heads by the total number of trials.  Unfortunately, this doesn&#8217;t give us a good idea about how accurate our estimate is.</p>
<p>Enter the <a href="http://en.wikipedia.org/wiki/Beta_distribution">beta distribution</a>. This is a distribution over the bias of a Bernoulli process.  Intuitively, this means that CDF(x) equals the probability that the expectation of a coin flip is <span id='tex_1667'>&le;</span> x.  In other words, we&#8217;re finding the probability that a probability is what we think it should be.  That&#8217;s a convoluted definition!  Some examples should make it clearer.</p>
<p>The beta distribution takes two parameters <span id='tex_8878'>&alpha;</span> and <span id='tex_1827'>&beta;</span>.  <span id='tex_6661'>&alpha;</span> is the number of heads we have flipped plus one, and <span id='tex_3696'>&beta;</span> is the number of tails plus one.  We&#8217;ll talk about why that plus one is there in a bit, but first let&#8217;s see what the distribution actually looks like with some example parameters.</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_center.png"><img class="aligncenter size-full wp-image-621" title="gamma_center" src="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_center.png" alt="" width="710" height="280" /></a>In both the above cases, the distribution is centered around 0.5 because <span id='tex_9099'>&alpha;</span> and <span id='tex_5796'>&beta;</span> are equal: we&#8217;ve gotten the same number of heads as we have tails.  As these parameters increase, the distribution gets tighter and tighter.  This should make sense. The more flips we do, the more confident we can be that the data we&#8217;ve collected actually match the characteristics of the coin.</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_offset.png"><img title="gamma_offset" src="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_offset.png" alt="" width="710" height="280" /></a></p>
<p>When the parameters are not equal to each other&#8212;for example, we&#8217;ve seen twice as many heads as we have tails&#8212;then the distribution is skewed to the left or right accordingly.  The peak of the PDF occurs at:</p>
<p style="text-align: center;"><span id='tex_6318'>(&alpha; &minus; 1) / (&alpha; + &beta; &minus; 2)</span></p>
<p>That&#8217;s exactly what we said the expectation of the next coin flip should be above.  Awesome!</p>
<p>So what happens when <span id='tex_7250'>&alpha;</span> and <span id='tex_6329'>&beta;</span> are one?</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_1_1.png"><img class="aligncenter size-full wp-image-648" title="gamma_1_1" src="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_1_1.png" alt="" width="350" height="280" /></a>We get the flat distribution.  Basically, we haven&#8217;t flipped the coin at all yet, so we have no data about how our coin is biased, so all biases are equally likely.  This is why we must add one to the number of heads and tails we have flipped to get the appropriate <span id='tex_2271'>&alpha;</span> and <span id='tex_3330'>&beta;</span>.</p>
<p>If <span id='tex_9424'>&alpha;</span> and <span id='tex_1517'>&beta;</span> are less than one, we get something like this:</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_.5_.5.png"><img class="aligncenter size-full wp-image-653" title="gamma_.5_.5" src="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_.5_.5.png" alt="" width="350" height="280" /></a>Essentially, this means that we know our coin is very biased in one way or the other, but we don&#8217;t know which way yet!  As you can imagine, such perverse parameterizations are rarely used in practice.</p>
<p>Hopefully, this has given you an intuitive sense for what the beta distribution looks like.  But for the pedantic, here&#8217;s how the beta distribution&#8217;s PDF is formally defined:</p>
<p style="text-align: center;"><span id='tex_4985'>f(x; &alpha;, &beta;) = [&Gamma;(&alpha; + &beta;) / (&Gamma;(&alpha;) &Gamma;(&beta;))] &middot; x<sup>&alpha; &minus; 1</sup> (1 &minus; x)<sup>&beta; &minus; 1</sup></span></p>
<p>Where <span id='tex_7747'>&Gamma;</span> is the <a href="http://en.wikipedia.org/wiki/Gamma_function">gamma function</a>, which you can think of as a generalization of factorials to the real numbers.  That is, <span id='tex_7389'>&Gamma;(n) = (n &minus; 1)!</span> for any positive integer n.  Excel, many calculators, and any scientific programming package will be able to calculate that for you easily.  Most of these applications will even have the beta distribution already built in.</p>
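<p>If you would rather compute it yourself, here is a rough Haskell sketch.  It assumes integer parameters (which is all we need for counts of heads and tails), so the gamma function reduces to a factorial.  The names factorial, betaPdf and betaCdf are made up for this example, and the CDF is only approximated with a simple midpoint rule using an arbitrary 1000 slices:</p>
<pre>import Data.Ratio ((%))

factorial :: Integer -&gt; Integer
factorial n = product [1..n]

-- Beta PDF for integer parameters, using Gamma(n) = (n-1)!.
betaPdf :: Integer -&gt; Integer -&gt; Double -&gt; Double
betaPdf a b x = norm * x ^ (a - 1) * (1 - x) ^ (b - 1)
  where
    -- Gamma(a+b) / (Gamma(a) * Gamma(b)), computed exactly before converting to Double
    norm = fromRational (factorial (a + b - 1) % (factorial (a - 1) * factorial (b - 1)))

-- Approximate CDF(x) with the midpoint rule on n equal slices of [0, x].
betaCdf :: Int -&gt; Integer -&gt; Integer -&gt; Double -&gt; Double
betaCdf n a b x = h * sum [ betaPdf a b (h * (fromIntegral i + 0.5)) | i &lt;- [0 .. n - 1] ]
  where h = x / fromIntegral n

main :: IO ()
main = do
    print (betaPdf 1 1 0.5)          -- flat distribution: prints 1.0
    print (betaPdf 2 2 0.5)          -- 6 * 0.5 * 0.5: prints 1.5
    -- 41 heads and 59 tails give alpha = 42 and beta = 60; this estimates the
    -- probability that the coin's true chance of heads is below 0.5.
    print (betaCdf 1000 42 60 0.5)</pre>
<p>The last line is the kind of question the beta distribution lets us answer: how much of the probability mass sits on the &#8220;biased toward tails&#8221; side of 0.5.</p>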
<h3>Applying the beta distribution to our coins</h3>
<p>We&#8217;re finally ready to see just how biased our coins actually are!</p>
<table>
<tbody>
<tr>
<td>
<p style="text-align: center;">Coin 0</p>
<p><img title="coin0" src="http://izbicki.me/blog/wp-content/uploads/2011/11/coin0.jpg" alt="" width="100" height="70" /></p>
<p style="text-align: center;">Heads: 53</p>
<p style="text-align: center;">Tails: 47</p>
</td>
<td><a href="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_54_48.png"><img class="aligncenter size-full wp-image-638" title="gamma_54_48" src="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_54_48.png" alt="" width="350" height="280" /></a></td>
</tr>
<tr>
<td>
<p style="text-align: center;">Coin 1</p>
<p> <a href="http://izbicki.me/blog/wp-content/uploads/2011/11/coin1.jpg"><img title="coin1" src="http://izbicki.me/blog/wp-content/uploads/2011/11/coin1.jpg" alt="" width="100" height="70" /></a></p>
<p style="text-align: center;">Heads: 55</p>
<p style="text-align: center;">Tails: 45</p>
</td>
<td> <a href="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_56_46.png"><img class="aligncenter size-full wp-image-639" title="gamma_56_46" src="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_56_46.png" alt="" width="350" height="280" /></a></td>
</tr>
<tr>
<td>
<p style="text-align: center;">Coin 2</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/11/coin2.jpg"><img title="coin2" src="http://izbicki.me/blog/wp-content/uploads/2011/11/coin2.jpg" alt="" width="100" height="70" /></a></p>
<p style="text-align: center;">Heads: 49</p>
<p style="text-align: center;">Tails: 51</p>
</td>
<td> <a href="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_50_52.png"><img class="aligncenter size-full wp-image-637" title="gamma_50_52" src="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_50_52.png" alt="" width="350" height="280" /></a></td>
</tr>
<tr>
<td>
<p style="text-align: center;">Coin 3</p>
<p> <a href="http://izbicki.me/blog/wp-content/uploads/2011/11/coin3.jpg"><img title="coin3" src="http://izbicki.me/blog/wp-content/uploads/2011/11/coin3.jpg" alt="" width="100" height="70" /></a></p>
<p style="text-align: center;">Heads: 41</p>
<p style="text-align: center;">Tails: 59</p>
</td>
<td> <a href="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_42_60.png"><img class="aligncenter size-full wp-image-636" title="gamma_42_60" src="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_42_60.png" alt="" width="350" height="280" /></a></td>
</tr>
<tr>
<td>
<p style="text-align: center;">Coin 4</p>
<p> <a href="http://izbicki.me/blog/wp-content/uploads/2011/11/coin4.jpg"><img title="coin4" src="http://izbicki.me/blog/wp-content/uploads/2011/11/coin4.jpg" alt="" width="100" height="70" /></a></p>
<p style="text-align: center;">Heads: 39</p>
<p style="text-align: center;">Tails: 61</p>
</td>
<td> <a href="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_40_62.png"><img class="aligncenter size-full wp-image-635" title="gamma_40_62" src="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_40_62.png" alt="" width="350" height="280" /></a></td>
</tr>
<tr>
<td>
<p style="text-align: center;">Coin 5</p>
<p> <a href="http://izbicki.me/blog/wp-content/uploads/2011/11/coin5.jpg"><img title="coin5" src="http://izbicki.me/blog/wp-content/uploads/2011/11/coin5.jpg" alt="" width="100" height="70" /></a></p>
<p style="text-align: center;">Heads: 27</p>
<p style="text-align: center;">Tails: 73</p>
</td>
<td> <a href="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_28_74.png"><img class="aligncenter size-full wp-image-634" title="gamma_28_74" src="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_28_74.png" alt="" width="350" height="280" /></a></td>
</tr>
<tr>
<td>
<p style="text-align: center;">Coin 6</p>
<p> <a href="http://izbicki.me/blog/wp-content/uploads/2011/11/coin6.jpg"><img title="coin6" src="http://izbicki.me/blog/wp-content/uploads/2011/11/coin6.jpg" alt="" width="100" height="70" /></a></p>
<p style="text-align: center;">Heads: 0</p>
<p style="text-align: center;">Tails: 100</p>
</td>
<td><a href="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_1_101.png"><img class="aligncenter size-full wp-image-633" title="gamma_1_101" src="http://izbicki.me/blog/wp-content/uploads/2011/11/gamma_1_101.png" alt="" width="350" height="280" /></a></td>
</tr>
</tbody>
</table>
<p>&nbsp;</p>
<p>Amazingly, it takes some pretty big bends to make a biased coin. It is not until coin 3, which has an almost 90-degree bend, that we can say with any confidence that the coin is biased at all.  People might notice if you tried to flip that coin to settle a bet!</p>
]]></content:encoded>
			<wfw:commentRss>http://izbicki.me/blog/how-to-create-an-unfair-coin-and-prove-it-with-math/feed</wfw:commentRss>
		<slash:comments>21</slash:comments>
		</item>
		<item>
		<title>Converting images into time series for data mining</title>
		<link>http://izbicki.me/blog/converting-images-into-time-series-for-data-mining?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=converting-images-into-time-series-for-data-mining</link>
		<comments>http://izbicki.me/blog/converting-images-into-time-series-for-data-mining#comments</comments>
		<pubDate>Fri, 28 Oct 2011 08:02:52 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://izbicki.me/blog/?p=265</guid>
		<description><![CDATA[The first step in data mining images is to create a distance measure for two images.  In the intro to data mining images, we called this distance measure the &#8220;black box.&#8221;  This post will cover how to create distance measures based on time series analysis.  This technique is great for comparing objects with a constant, [...]]]></description>
				<content:encoded><![CDATA[<p>The first step in data mining images is to create a distance measure for two images.  <a href="http://izbicki.me/blog/data-mining-images-tutorial">In the intro to data mining images</a>, we called this distance measure the &#8220;black box.&#8221;  This post will cover how to create distance measures based on <em>time series analysis</em>.  This technique is great for comparing objects with a constant, rigid shape.  For example, it will work well on classifying images of skulls, but not on images of people.  Skulls always have the same shape, whereas a person might be walking, standing, sitting, or curled into a ball.  By the end of this post, you should understand how to compare these <a href="http://en.wikipedia.org/wiki/Hominid">hominid</a> skulls from UC Riverside<sup><a href="http://izbicki.me/blog/converting-images-into-time-series-for-data-mining#footnote_0_265" id="identifier_0_265" class="footnote-link footnote-identifier-link" title=" Eamonn Keogh, Li Wei, Xiaopeng Xi, Sang-Hee Lee and Michail Vlachos &nbsp;&amp;#8221;LB_Keogh Supports Exact Indexing of Shapes under Rotation Invariance with Arbitrary Representations and Distance Measures.&amp;#8221; VLDB 2006. (PDF) ">1</a></sup> using radial scanning and dynamic time warping.</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/gorilla-skulls2.png"><img class="aligncenter size-full wp-image-438" title="gorilla-skulls2" src="http://izbicki.me/blog/wp-content/uploads/2011/10/gorilla-skulls2.png" alt="" width="478" height="500" /></a><span id="more-265"></span>But first, we must start from the beginning.  What exactly is a time series?  Anything that can be plotted on a line graph.  For example, the price of Google stock is a time series:</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/stocks-goog.png"><img class="aligncenter" title="stocks-goog" src="http://izbicki.me/blog/wp-content/uploads/2011/10/stocks-goog.png" alt="" width="805" height="168" /></a>As you can imagine, time series have been studied extensively.  Most scientists use them at some point in their careers.  Unsurprisingly, they have developed many techniques for analyzing them.  If we can convert our images into time series, then all these tools become available to us.  Therefore, the time series distance measure has two steps:</p>
<p>STEP 1: Convert the images into a time series</p>
<p>STEP 2: Find the distance between two images by finding the distance between their time series</p>
<p>We have our choice of several algorithms for each step.  In the rest of this post, we will look at two algorithms for converting images into time series: radial scanning and linear scanning.  Then, we will look at two algorithms for measuring the distance between time series: Euclidean distance and dynamic time warping.  We will conclude by looking at the types of problems time series analysis handles best and worst.</p>
<p><strong>STEP 1A: Creating a time series by radial scanning<br />
</strong></p>
<p>Radial scanning is tricky to explain, but once it clicks you&#8217;ll realize that it is both simple and elegant.  Here&#8217;s an example from a human skull:</p>
<p><a><img class="aligncenter size-full wp-image-309" title="human-skull" src="http://izbicki.me/blog/wp-content/uploads/2011/10/human-skull.png" alt="" width="456" height="220" /></a>First we find the skull&#8217;s outline.  Then we find the distance from the center of the skull to each point on the skull&#8217;s outline (B).  Finally, we plot those distances as a time series (C).  The lines connecting the skull to the graph show where that point on the skull maps to the time series below.  In this case, we started at the skull&#8217;s mouth and went clockwise.</p>
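<p>As a rough sketch of the idea in Haskell, suppose we already have the outline as a list of points in clockwise order.  Extracting the outline is a separate problem, and using the mean of the outline points as the &#8220;center&#8221; is just one simple assumption; the names Point, centroid and radialSeries are made up for this example:</p>
<pre>type Point = (Double, Double)

-- Mean of the outline points, used here as the "center" of the shape.
centroid :: [Point] -&gt; Point
centroid ps = (sum (map fst ps) / n, sum (map snd ps) / n)
  where n = fromIntegral (length ps)

-- Distance from the center to each outline point, in the order given.
radialSeries :: [Point] -&gt; [Double]
radialSeries ps = [ sqrt ((x - cx) ^ 2 + (y - cy) ^ 2) | (x, y) &lt;- ps ]
  where (cx, cy) = centroid ps

main :: IO ()
main = print (radialSeries [(5,0), (0,3), (-5,0), (0,-3)])   -- prints [5.0,3.0,5.0,3.0]</pre>
<p>The diamond in the example is wider than it is tall, so the series alternates between long and short distances.  A real outline would have many more points, and the result would look like the graph in (C).</p>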
<p>Skulls from different species produce different time series:</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/skull-tree.png"><img class="aligncenter size-full wp-image-344" title="skull-tree" src="http://izbicki.me/blog/wp-content/uploads/2011/10/skull-tree.png" alt="" width="660" height="520" /></a>Take a careful look at these skulls and their time series.  Make sure you can spot the differences in the time series between each grouping. Don&#8217;t worry yet about how the groupings were made.  Right now, just get a feel for how a shape can be converted into a time series.</p>
<p>Another example of radial scanning comes from Korea University.  Here we are trying to determine a tree&#8217;s species based on its leaf shapes:<sup><a href="http://izbicki.me/blog/converting-images-into-time-series-for-data-mining#footnote_1_265" id="identifier_1_265" class="footnote-link footnote-identifier-link" title="Yoon-Sik Tak and Eenjun Hwang.&nbsp; &amp;#8220;A Leaf Image Retrieval Scheme Based on Partial Dynamic Warping and Two-Level Filtering&amp;#8221; 7th International Conference on Computer and Information Technology, 2007. (Access on IEEE) ">2</a></sup></p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/leaf.png"><img class="aligncenter size-full wp-image-282" title="leaf" src="http://izbicki.me/blog/wp-content/uploads/2011/10/leaf.png" alt="" width="663" height="300" /></a><strong></strong></p>
<p>The labeled points on the leaf at left correspond to the labeled positions on the time series at right.  Radial scanning is a popular technique for leaf classification because every species of plant has a characteristic leaf shape.  Each leaf will be unique, but the pattern of peaks and valleys in the resulting time series should be similar if the species of plant is the same.</p>
<p>We can already tell that the graphs created by the skulls and the leaf look very different to the human eye.  This is a good sign that radial scanning captures important information about the object&#8217;s shape that we will be able to use in the comparison step.</p>
<p><strong>STEP 1B: Creating a time series by linear scanning</strong></p>
<p>Some objects just aren&#8217;t circular, so radial scanning makes no sense.  One example is handwritten words.  The University of Massachusetts has analyzed a large collection of George Washington&#8217;s letters using the linear scanning method.<sup><a href="http://izbicki.me/blog/converting-images-into-time-series-for-data-mining#footnote_2_265" id="identifier_2_265" class="footnote-link footnote-identifier-link" title=" Rath, Kane, Lehman, Partridge, and Manmatha. &amp;#8220;Indexing for a Digital Library of George Washinton&amp;#8217;s Manuscripts: A Study of Word Matching Techniques.&amp;#8221; CIIR Technical Report. (PDF) ">3</a></sup> <sup><a href="http://izbicki.me/blog/converting-images-into-time-series-for-data-mining#footnote_3_265" id="identifier_3_265" class="footnote-link footnote-identifier-link" title=" Rath, Manmatha. &amp;#8220;Word Image Matching Using Dynamic Time Warping,&amp;#8221;&nbsp; the Proceedings of CVPR-03 conference,vol. 2, pp. 521-527. (PDF) ">4</a></sup>  The first image shows the word &#8220;Alexandria&#8221; as Washington actually wrote it:</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/alexandria-slant.png"><img class="aligncenter" title="alexandria-slant" src="http://izbicki.me/blog/wp-content/uploads/2011/10/alexandria-slant.png" alt="" width="573" height="86" /></a>Then, we remove the tilt from the image.  All of Washington&#8217;s writing has a fairly constant tilt, so this process is easy to automate.</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/alexandria-noslant.png"><img class="aligncenter" title="alexandria-noslant" src="http://izbicki.me/blog/wp-content/uploads/2011/10/alexandria-noslant.png" alt="" width="571" height="85" /></a>Finally, we create a time series from the word:</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/alexandria-dtw.png"><img class="aligncenter size-full wp-image-288" title="alexandria-dtw" src="http://izbicki.me/blog/wp-content/uploads/2011/10/alexandria-dtw.png" alt="" width="695" height="170" /></a><strong></strong></p>
<p>To create this time series, we start at the left of the image and consider each column of pixels in turn.  The value at each &#8220;time&#8221; is just the number of dark pixels in that column.  If you look closely at the time series, you should be able to tell where each bump corresponds to a specific letter.  Some letters, like the &#8220;d&#8221; get two bumps in the time series because they have two areas with a high concentration of dark pixels.</p>
<p>We could have constructed the time series in other ways as well.  For example, we could have counted the number of pixels from the top of the column to the first dark pixel.  This would have created an outline of the top of the word.  We simply have to consider our application carefully and decide which method will work the best.</p>
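<p>Both variants are easy to sketch in Haskell.  This is only an illustration: it assumes the image has already been straightened and thresholded into rows of characters, with '#' for a dark pixel and '.' for a light one, and the names columnCounts and topProfile are made up:</p>
<pre>import Data.List (transpose)

-- Number of dark pixels in each column, scanning left to right.
columnCounts :: [String] -&gt; [Int]
columnCounts = map (length . filter (== '#')) . transpose

-- Number of light pixels above the first dark pixel in each column,
-- which traces the top outline of the word.
topProfile :: [String] -&gt; [Int]
topProfile = map (length . takeWhile (/= '#')) . transpose

main :: IO ()
main = do
    let image = [ "#.#"
                , "##."
                , "#.." ]
    print (columnCounts image)   -- prints [3,1,1]
    print (topProfile image)     -- prints [0,1,0]</pre>
<p>Note that topProfile returns the full column height for a column with no dark pixels at all, which is one of the details you would need to decide on for a real application.</p>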
<p>We now have two simple methods for creating time series from images.  These are the simplest and most common methods, but not the only ones.  WARP<sup><a href="http://izbicki.me/blog/converting-images-into-time-series-for-data-mining#footnote_4_265" id="identifier_4_265" class="footnote-link footnote-identifier-link" title="Bartolini, Ciaccia, Patella, &amp;#8220;WARP: Accurate Retrieval of Shapes Using Phase of Fourier Descriptors and Time Warping Distance&amp;#8221; IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol 27 No 1, January 2005. (PDF) ">5</a></sup> and Beam Angle Statistics<sup><a href="http://izbicki.me/blog/converting-images-into-time-series-for-data-mining#footnote_5_265" id="identifier_5_265" class="footnote-link footnote-identifier-link" title=" Arica, Yarman-vural. &amp;#8220;BAS: a perceptual shape descriptor based on the beam angle statistics.&amp;#8221;&nbsp; Pattern Recognition Letters 2003. (PDF) ">6</a></sup> are two examples of other methods.  Which is best depends, as always, on the specific application.  Now that we can create the time series, let&#8217;s figure out how to compare them.</p>
<p><strong>STEP 2: Comparing the time series<br />
</strong></p>
<p>The whole purpose of creating the time series was to create a distance measure that uses them.  The easiest way to do this is the <a href="http://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a>.  (This is the normal <span id='tex_231'></span> that we are used to.)  Consider the two time series below:<sup><a href="http://izbicki.me/blog/converting-images-into-time-series-for-data-mining#footnote_6_265" id="identifier_6_265" class="footnote-link footnote-identifier-link" title=" Keogh, &amp;#8220;Exact Indexing of Dynamic Time Warping&amp;#8221; (PDF) ">7</a></sup></p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/dist-euclidean.png"><img class="aligncenter size-full wp-image-352" title="dist-euclidean" src="http://izbicki.me/blog/wp-content/uploads/2011/10/dist-euclidean.png" alt="" width="475" height="228" /></a>To calculate the overall distance, we calculate the distance between each corresponding point in the time series.  Corresponding points are connected by black lines.  Notice that the first blue hump corresponds to a flat red area, so this causes the black lines to be shorter.  The second red hump corresponds to a flat blue area, so the black lines are longer.  Everywhere else, the two time series line up fairly well, so the black lines have a mostly constant height.  (Normally, we would start the two time series at the same height, so the first black line would be zero; however, the time series have been moved apart to make the black lines easy to see.)</p>
<p>More formally,</p>
<p style="text-align: center;"><span id='tex_822'></span></p>
<p>where <span id='tex_4104'></span> is the height of the red series at &#8220;time&#8221; <span id='tex_6883'></span>, <span id='tex_4509'></span> is the height of the blue series at &#8220;time&#8221; <span id='tex_3203'></span>, and <span id='tex_2678'></span> is the length of the time series.  This is a simple and fast calculation, running in time <span id='tex_826'></span>.</p>
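<p>As a quick sketch, and assuming the usual definition of Euclidean distance (the square root of the summed squared differences between corresponding points), the calculation looks like this in Python.  The names <code>red</code> and <code>blue</code> simply mirror the colors in the figure.</p>

```python
import math


def euclidean_distance(red, blue):
    """Point-by-point distance between two equal-length time series."""
    if len(red) != len(blue):
        raise ValueError("both time series must have the same length")
    # One pass over the series, so the running time is linear in its length.
    return math.sqrt(sum((r - b) ** 2 for r, b in zip(red, blue)))
```

<p>Note that the two series must have the same length, because each point is compared only with the point at the same &#8220;time.&#8221;</p>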
<p>A more sophisticated way to compare time series is called Dynamic Time Warping (DTW).  DTW tries to compare similar areas in each time series with each other.  Here are the same two time series compared with DTW:</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/dist-DTW.png"><img class="aligncenter size-full wp-image-351" title="dist-DTW" src="http://izbicki.me/blog/wp-content/uploads/2011/10/dist-DTW.png" alt="" width="469" height="244" /></a>In this case, each of the humps in the blue series is matched with a hump in the red series, and all the flat areas are paired together.  Notice that a single point in one time series can align with multiple points in the other.  In this case, DTW gives a distance nearly zero&#8212;it is a nearly perfect match.  Euclidean distance had a much worse match and would give a large distance.</p>
<p>For most applications, dynamic time warping outperforms straight Euclidean distance.  Take a look at this dendrogram clustering:</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/dist-compare.png"><img class="aligncenter size-full wp-image-350" title="dist-compare" src="http://izbicki.me/blog/wp-content/uploads/2011/10/dist-compare.png" alt="" width="534" height="481" /></a>The orange series contain three humps, the green four, and the blue five.  But the humps do not line up, so this is a difficult problem for straight Euclidean distance.  In contrast, DTW successfully clustered the time series based on the number of humps they have.</p>
<p>That&#8217;s great, but how did DTW decide which points in the red and blue time series should align?</p>
<p>Exhaustive search.  We try every possible alignment and pick the one that works best.  This will be easier to see with a simpler example:</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/CQ2.png"><img class="aligncenter size-full wp-image-458" title="CQ2" src="http://izbicki.me/blog/wp-content/uploads/2011/10/CQ2.png" alt="" width="280" height="165" /></a>To perform the search, we create an <span id='tex_452'></span>x<span id='tex_8998'></span> matrix.  Each row corresponds to a time along the red series, and each column corresponds to a time along the blue series.  The value of each cell <span id='tex_3088'></span> is the distance between <span id='tex_3772'></span> and <span id='tex_8421'></span>.  This effectively compares every time in the red series with every time in the blue series.  Then, we select the path through the matrix that minimizes the total distance:</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/CQ2-matrix.png"><img class="aligncenter size-full wp-image-459" title="CQ2-matrix" src="http://izbicki.me/blog/wp-content/uploads/2011/10/CQ2-matrix.png" alt="" width="370" height="370" /></a>The colored boxes correspond to the colored lines connecting the two time series in the first image.  For example, the four light blue squares in the top right are on a single row, so they map one point on the red series to four points on the blue one.</p>
<p>Using <a href="http://en.wikipedia.org/wiki/Dynamic_programming">dynamic programming</a>, DTW is an <span id='tex_4595'></span> algorithm, which is much slower than Euclidean distance&#8217;s <span id='tex_8748'></span>.  This is a serious problem if we want to use the algorithm to search a large database.</p>
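<p>The dynamic programming version can be sketched in a few lines of Python.  This sketch assumes the cell distance is the absolute difference between the two heights and that the total is the plain sum along the path; other cost functions are common too.  The name <code>dtw_distance</code> is made up for this example.</p>

```python
def dtw_distance(red, blue):
    """Dynamic time warping distance between two time series.

    cost[i][j] holds the cheapest total distance of any warping path
    that aligns the first i points of `red` with the first j of `blue`.
    """
    n, m = len(red), len(blue)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(red[i - 1] - blue[j - 1])  # assumed cell distance
            # A path reaches this cell from the left, from above,
            # or diagonally; keep whichever was cheapest.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


# The same hump, shifted by one step: DTW lines the humps up.
shifted = dtw_distance([0, 0, 1, 2, 1, 0, 0], [0, 1, 2, 1, 0, 0, 0])
```

<p>The two nested loops fill every cell of the matrix once, which is where the quadratic running time comes from.</p>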
<p>The easiest way to speed up the algorithm is to calculate only a small fraction of the matrix.  Intuitively, we want our warping path to stay relatively close to a diagonal line.  If it stays exactly on the diagonal line, then every red and blue time correspond exactly. This is the same as the Euclidean distance.  At the opposite extreme would be a path that follows the leftmost, then topmost edges.  In this case we are comparing the first blue value to all red values and the last red value to all blue values.  This seems unlikely to make a good match.</p>
<p>There are two common ways to limit the number of calculations.  First is the Sakoe-Chiba band:</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/Sakoe-Chiba1.png"><img class="aligncenter size-full wp-image-464" title="Sakoe-Chiba" src="http://izbicki.me/blog/wp-content/uploads/2011/10/Sakoe-Chiba1.png" alt="" width="370" height="370" /></a>The second method is the Itakura parallelogram:</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/Itakura.png"><img class="aligncenter size-full wp-image-463" title="Itakura" src="http://izbicki.me/blog/wp-content/uploads/2011/10/Itakura.png" alt="" width="370" height="370" /></a>The basic idea behind these restrictions is pretty straightforward from their pictures.  What isn&#8217;t straightforward, however, is that these techniques also increase DTW&#8217;s accuracy.<sup><a href="http://izbicki.me/blog/converting-images-into-time-series-for-data-mining#footnote_7_265" id="identifier_7_265" class="footnote-link footnote-identifier-link" title="&nbsp;Ratanamahatana, Keogh. &amp;#8220;Three Myths about&nbsp; Dynamic Time Warping.&amp;#8221; SDM 2005. (PDF) ">8</a></sup>  DTW was introduced to the data mining community in 1994.<sup><a href="http://izbicki.me/blog/converting-images-into-time-series-for-data-mining#footnote_8_265" id="identifier_8_265" class="footnote-link footnote-identifier-link" title=" Berndt, Clifford.&nbsp; &amp;#8220;Using Dynamic Time Warping to Find Patterns in Time Series,&amp;#8221; KDD 1994. (PDF) ">9</a></sup>  For over a decade researchers tried to find ways to increase the amount of the matrix they could search because they falsely believed that this would lead to more accurate results.</p>
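<p>To make the Sakoe-Chiba band concrete, here is a rough Python sketch of DTW that only fills in cells within <code>band</code> steps of the diagonal.  The absolute-difference cell distance and the name <code>dtw_banded</code> are assumptions for illustration.  The Itakura parallelogram would need a different rule for which cells are allowed and is not shown.</p>

```python
def dtw_banded(red, blue, band):
    """DTW restricted to cells (i, j) with |i - j| at most `band`.

    Returns infinity when no warping path fits inside the band,
    for example when the lengths differ by more than `band`.
    """
    n, m = len(red), len(blue)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        first = max(1, i - band)
        last = min(m, i + band)
        for j in range(first, last + 1):  # skip everything off the band
            d = abs(red[i - 1] - blue[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


red = [0, 0, 1, 2, 1, 0, 0]
blue = [0, 1, 2, 1, 0, 0, 0]
no_warping = dtw_banded(red, blue, band=0)   # diagonal only
one_step = dtw_banded(red, blue, band=1)     # humps may shift by one
```

<p>With a band of zero the path is forced onto the diagonal, so every point is compared only with its counterpart at the same time.  A narrow band also means far fewer cells to compute than the full matrix.</p>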
<p>We can also speed up the calculation using an approximation function called a <em>lower bound</em>.  A lower bound is computationally much cheaper than the full DTW function (a good one might run 1000 times faster) and is always less than or equal to the true DTW distance.  We can run the lower bound on millions of images, and run the full DTW algorithm only on the candidates whose lower bound is small enough that they could still be among the closest matches. Two good lower bounds are LB_Improved<sup><a href="http://izbicki.me/blog/converting-images-into-time-series-for-data-mining#footnote_9_265" id="identifier_9_265" class="footnote-link footnote-identifier-link" title=" Lemire, &amp;#8220;Faster Retrieval with a Two-Pass Dynamic-Time-Warping Lower Bound,&amp;#8221; Pattern Recognition 2008. (PDF) ">10</a></sup> and LB_Keogh.<sup><a href="http://izbicki.me/blog/converting-images-into-time-series-for-data-mining#footnote_10_265" id="identifier_10_265" class="footnote-link footnote-identifier-link" title="&nbsp;Keogh, Ratanamahatana. &amp;#8220;Exact indexing of dynamic time warping,&amp;#8221; Knowledge and Information Systems 2002. (PDF) ">11</a></sup></p>
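<p>Below is a simplified Python sketch in the spirit of LB_Keogh, together with the pruning loop described above.  It assumes equal-length series, a Sakoe-Chiba band of width <code>band</code>, and a DTW that sums absolute differences; the published bound is usually stated with squared differences, so treat this as an illustration of the idea rather than the exact formula.  The <code>full_distance</code> argument is a placeholder for whatever banded DTW implementation you use.</p>

```python
def lb_keogh(query, candidate, band):
    """Cheap lower bound on banded DTW (absolute-difference variant).

    Builds an envelope around `query` and adds up how far each point
    of `candidate` falls outside of it.
    """
    if len(query) != len(candidate):
        raise ValueError("this sketch assumes equal-length series")
    n = len(query)
    total = 0.0
    for i, c in enumerate(candidate):
        window = query[max(0, i - band):min(n, i + band + 1)]
        upper, lower = max(window), min(window)
        if c > upper:
            total += c - upper
        elif c < lower:
            total += lower - c
    return total


def nearest(query, database, band, full_distance):
    """Find the closest series, skipping full DTW whenever possible."""
    best, best_dist = None, float("inf")
    for candidate in database:
        # If even the lower bound cannot beat the best match so far,
        # the true distance cannot either, so skip the expensive call.
        if lb_keogh(query, candidate, band) >= best_dist:
            continue
        d = full_distance(query, candidate, band)
        if d < best_dist:
            best, best_dist = candidate, d
    return best, best_dist
```

<p>The bound costs one pass over the series instead of a pass over the matrix.  How often it lets us skip the full calculation depends on the data and on the order in which candidates are visited.</p>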
<p>Finally, there are other methods for comparing time series.  The most common is called <em>Longest Common Sub-Sequence</em> (LCSS).  It is useful for matching images suffering from occlusion. <sup><a href="http://izbicki.me/blog/converting-images-into-time-series-for-data-mining#footnote_11_265" id="identifier_11_265" class="footnote-link footnote-identifier-link" title="&nbsp;Yazdani, Meral &Ouml;zsoyoglu. 1996 Sequence matching of images. In Proc. 8th Int. Conf. Sci. Stat. Database Manag. pp. 53&ndash;62. ">12</a></sup></p>
<p><strong>When to use Time Series Analysis</strong></p>
<p>Time series analysis is only sensitive to an object&#8217;s shape.  It is invariant to colors and internal features.  These properties make time series analysis good for comparing rigid objects, such as skulls, leaves, and handwriting.  These shapes do not change over time, so they will have similar time series no matter when they are measured.</p>
<p>Time series analysis will not work on objects that can change their shapes over time.  People are good examples of this, because we have many different postures.  We can walk, sit, or curl into a ball.  Another distance measure called &#8220;shock graphs&#8221; is better for comparing the shapes of objects that can move.  We&#8217;ll cover shock graphs in a later post.</p>
<strong>Footnotes</strong><ol class="footnotes"><li id="footnote_0_265" class="footnote"> Eamonn Keogh, Li Wei, Xiaopeng Xi, Sang-Hee Lee and Michail Vlachos.  &#8220;LB_Keogh Supports Exact Indexing of Shapes under Rotation Invariance with Arbitrary Representations and Distance Measures.&#8221; VLDB 2006. (<a href="http://www.cs.ucr.edu/%7Eeamonn/VLDB2006_Expanded.pdf">PDF</a>) </li><li id="footnote_1_265" class="footnote">Yoon-Sik Tak and Eenjun Hwang.  &#8220;A Leaf Image Retrieval Scheme Based on Partial Dynamic Warping and Two-Level Filtering&#8221; <em>7th International Conference on Computer and Information Technology, </em>2007. (<a href="http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4385155&amp;isnumber=4385041">Access on IEEE</a>) </li><li id="footnote_2_265" class="footnote"> Rath, Kane, Lehman, Partridge, and Manmatha. &#8220;Indexing for a Digital Library of George Washington&#8217;s Manuscripts: A Study of Word Matching Techniques.&#8221; CIIR Technical Report. (<a href="http://maroo.cs.umass.edu/pub/web/getpdf.php?id=334">PDF</a>) </li><li id="footnote_3_265" class="footnote"> Rath, Manmatha. &#8220;Word Image Matching Using Dynamic Time Warping,&#8221;  the Proceedings of CVPR-03 conference, vol. 2, pp. 521-527. (<a href="http://maroo.cs.umass.edu/pub/web/getpdf.php?id=336">PDF</a>) </li><li id="footnote_4_265" class="footnote">Bartolini, Ciaccia, Patella, &#8220;WARP: Accurate Retrieval of Shapes Using Phase of Fourier Descriptors and Time Warping Distance&#8221; <em>IEEE Transactions on Pattern Analysis and Machine Intelligence, </em>Vol 27 No 1, January 2005. (<a href="http://www-db.deis.unibo.it/courses/SI-M/papers/BCP05.pdf">PDF</a>) </li><li id="footnote_5_265" class="footnote"> Arica, Yarman-vural. &#8220;BAS: a perceptual shape descriptor based on the beam angle statistics.&#8221;  <em>Pattern Recognition Letters</em> 2003. 
(<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.99.969&amp;rep=rep1&amp;type=pdf">PDF</a>) </li><li id="footnote_6_265" class="footnote"> Keogh, &#8220;Exact Indexing of Dynamic Time Warping&#8221; (<a href="http://www.cse.ust.hk/vldb2002/VLDB2002-proceedings/slides/S12P01slides.pdf">PDF</a>) </li><li id="footnote_7_265" class="footnote"> Ratanamahatana, Keogh. &#8220;Three Myths about  Dynamic Time Warping.&#8221; SDM 2005. (<a href="http://www.cs.ucr.edu/%7Eeamonn/DTW_myths.pdf">PDF</a>) </li><li id="footnote_8_265" class="footnote"> Berndt, Clifford.  &#8220;Using Dynamic Time Warping to Find Patterns in Time Series,&#8221; KDD 1994. (<a href="https://www.aaai.org/Papers/Workshops/1994/WS-94-03/WS94-03-031.pdf">PDF</a>) </li><li id="footnote_9_265" class="footnote"> Lemire, &#8220;Faster Retrieval with a Two-Pass Dynamic-Time-Warping Lower Bound,&#8221; <em>Pattern Recognition</em> 2008. (<a href="http://arxiv.org/pdf/0811.3301v2">PDF</a>) </li><li id="footnote_10_265" class="footnote"> Keogh, Ratanamahatana. &#8220;Exact indexing of dynamic time warping,&#8221; <em>Knowledge and Information Systems</em> 2002. (<a href="http://www.cs.ucr.edu/~eamonn/KAIS_2004_warping.pdf">PDF</a>) </li><li id="footnote_11_265" class="footnote"> Yazdani, Meral Özsoyoglu. 1996 Sequence matching of images. In <em>Proc. 8th Int. Conf. Sci. Stat. Database Manag</em>. pp. 53–62. </li></ol>]]></content:encoded>
			<wfw:commentRss>http://izbicki.me/blog/converting-images-into-time-series-for-data-mining/feed</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Introduction to data mining images</title>
		<link>http://izbicki.me/blog/data-mining-images-tutorial?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=data-mining-images-tutorial</link>
		<comments>http://izbicki.me/blog/data-mining-images-tutorial#comments</comments>
		<pubDate>Thu, 20 Oct 2011 06:36:40 +0000</pubDate>
		<dc:creator>Mike</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://izbicki.me/blog/?p=100</guid>
		<description><![CDATA[Image processing is one of those things people are still much better at than computers.  Take this set of cats: Just at a glance, you can easily tell the difference between the cartoon animals and the photographs.  You can tell that the hearts in the top left probably don&#8217;t belong, and that Odie is tackling [...]]]></description>
				<content:encoded><![CDATA[<p>Image processing is one of those things people are still much better at than computers.  Take this set of cats:</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/cats.png"><img class="aligncenter" title="cats" src="http://izbicki.me/blog/wp-content/uploads/2011/10/cats.png" alt="" width="500" height="280" /></a>Just at a glance, you can easily tell the difference between the cartoon animals and the photographs.  You can tell that the hearts in the top left probably don&#8217;t belong, and that Odie is tackling Garfield in the top right.  The human brain does this really well on small datasets.</p>
<p><span id="more-100"></span>But what if we had thousands, millions, or even billions of images?  Could we make an image search engine, where I give it a picture of an animal and it says what type it is?  Could we make it automatically find patterns that people miss?</p>
<p>Yes!  This post is the beginning of a series about how.  Finding patterns in large databases of images is still an active research area, and these posts will hopefully make those results more accessible.  The current research still isn&#8217;t perfect, but it&#8217;s probably much better than you&#8217;d guess.</p>
<p><strong>The &#8220;black box&#8221; framework</strong></p>
<p>There are three basic steps in data mining images:</p>
<p>STEP 1: Create the &#8220;black box&#8221;</p>
<p>STEP 2: Cluster</p>
<p>STEP 3: Run queries</p>
<p>That&#8217;s it!</p>
<p>&#8230; well &#8230; sort of &#8230;</p>
<p>There are many different algorithms that can be used at each step.  Which ones you decide to use will depend on the type of information you&#8217;re mining from the images.  The rest of this post gives a high level overview of how each of these steps works, and later posts will focus on specific implementations for each step.</p>
<p><strong>STEP 1: Creating the black box</strong></p>
<p>The black box defines the &#8220;distance&#8221; between two images.  The smaller the distance, the more similar the images are.  For example:</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/garfield1.png"><img class="aligncenter size-full wp-image-178" title="garfield" src="http://izbicki.me/blog/wp-content/uploads/2011/10/garfield1.png" alt="" width="778" height="276" /></a>Garfield is very similar to himself, which is why Box A gives him a low score, nearly zero.  Odie is not very similar to Garfield, but he&#8217;s a lot closer than a palm tree.  The specific numbers the box outputs don&#8217;t matter.  All that matters is the ordering created by those numbers.  In this case:</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/ordering1.png"><img class="aligncenter size-full wp-image-236" title="ordering1" src="http://izbicki.me/blog/wp-content/uploads/2011/10/ordering1.png" alt="" width="767" height="153" /></a>Of course, if we compare against a different image, we will probably get a different ordering.</p>
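<p>In code, a black box is just a function of two images that returns a number, and the ordering comes from sorting by that number.  The scores in this Python sketch are made up to stand in for a real black box; only their order matters.</p>

```python
def rank_by_distance(query, candidates, distance):
    """Return the candidates sorted from most to least similar."""
    return sorted(candidates, key=lambda c: distance(query, c))


# Hypothetical scores standing in for a real black box.
toy_scores = {"garfield": 0.1, "odie": 4.0, "palm tree": 9.5}


def box_a(query, candidate):
    return toy_scores[candidate]


def box_a_rescaled(query, candidate):
    # Any increasing transformation of the scores keeps the ordering.
    return 100 * toy_scores[candidate] + 7


names = ["palm tree", "odie", "garfield"]
ranking = rank_by_distance("garfield", names, box_a)
same_ranking = rank_by_distance("garfield", names, box_a_rescaled)
```

<p>Both calls give the same ordering, which is why the specific numbers are not important.</p>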
<p>Likewise, we can get different orderings with a different black box.  Let&#8217;s imagine that Box A was designed to determine if two pictures are of the same type of animal.  If we test it on some new input, we might get:</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/boxA.png"><img class="aligncenter size-full wp-image-183" title="boxA" src="http://izbicki.me/blog/wp-content/uploads/2011/10/boxA.png" alt="" width="778" height="276" /></a>Notice that Box A thinks the real cat is more similar to Garfield than Odie is.  Now let&#8217;s consider another black box.  Imagine Box B is designed to see if two images were drawn in a similar style.  Box B might give the following:</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/boxB.png"><img class="aligncenter size-full wp-image-185" title="boxB" src="http://izbicki.me/blog/wp-content/uploads/2011/10/boxB.png" alt="" width="778" height="276" /></a>Now, Odie is similar to Garfield (they&#8217;re both drawn by <a href="http://en.wikipedia.org/wiki/Jim_Davis_%28cartoonist%29">Jim Davis</a>), but the cat is no longer similar to Garfield (because it&#8217;s a photograph).  On these images, Box B gives the opposite ordering from Box A.</p>
<p>Creating a good black box is the hardest part of data mining images.  Most research is dedicated to this area, and most of this series will be focused on evaluating the performance of different black boxes.  Which ones are good depends on your dataset and what information you&#8217;re trying to extract.  Some general categories of black boxes we&#8217;ll look at are:</p>
<ol>
<li>Histogram analysis (a simple technique that can be surprisingly effective on colored input)</li>
<li>Converting images into a time series (for analyzing the shapes of rigid objects, e.g. fruit)</li>
<li>Creating shock graphs (for analyzing the shapes of non-rigid objects, e.g. animals)</li>
<li>Kolmogorov complexity of the images (for comparing an image&#8217;s textures)</li>
</ol>
<p>But first, let&#8217;s take a closer look at what makes a black box good.</p>
<p><strong>Properties of a good black box<br />
</strong></p>
<p>There are two more aspects of black boxes we must look at.  First, every black box will be <em>sensitive</em> to certain features of an image and <em>invariant</em> to others.  In the examples below, Box C is sensitive to shape, but invariant to color.  Box D is sensitive to color, but invariant to shape.</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/boxC.png"><img class="aligncenter size-full wp-image-191" title="boxC" src="http://izbicki.me/blog/wp-content/uploads/2011/10/boxC.png" alt="" width="769" height="275" /></a></p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/boxD1.png"><img class="aligncenter size-full wp-image-245" title="boxD1" src="http://izbicki.me/blog/wp-content/uploads/2011/10/boxD1.png" alt="" width="753" height="275" /></a></p>
<p>Most black box algorithms contain both sensitivities and invariances.  These are the properties you will use to decide which black box is best for your application.</p>
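<p>To see sensitivity and invariance in code, here is a toy Python sketch of a color-only black box in the style of Box D.  It treats an image as a flat list of color names and compares the fraction of pixels of each color.  The representation and the function names are assumptions for illustration.</p>

```python
from collections import Counter


def color_histogram(pixels):
    """Fraction of the image covered by each color."""
    counts = Counter(pixels)
    total = len(pixels)
    return {color: n / total for color, n in counts.items()}


def histogram_distance(pixels_a, pixels_b):
    """Sum of absolute differences between two color histograms."""
    ha, hb = color_histogram(pixels_a), color_histogram(pixels_b)
    colors = set(ha) | set(hb)
    return sum(abs(ha.get(c, 0.0) - hb.get(c, 0.0)) for c in colors)


orange_shape = ["orange", "orange", "orange", "white"]
other_orange_shape = ["white", "orange", "orange", "orange"]  # rearranged
blue_shape = ["blue", "blue", "blue", "white"]

same_colors = histogram_distance(orange_shape, other_orange_shape)
different_colors = histogram_distance(orange_shape, blue_shape)
```

<p>Rearranging the pixels changes the shape but not the histogram, so the distance stays at zero.  Changing the color changes the distance even though the arrangement is the same.</p>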
<p>Second, a black box is a <a href="http://en.wikipedia.org/wiki/Metric_%28mathematics%29">metric</a> and as such must satisfy four criteria:</p>
<ol>
<li><em>distance</em>(<em>x</em>, <em>y</em>) ≥ 0     (<em><a title="Non-negative" href="http://en.wikipedia.org/wiki/Non-negative">non-negativity</a></em>)</li>
<li><em><em>distance</em></em>(<em>x</em>, <em>y</em>) = 0   if and only if   <em>x</em> = <em>y</em>     (<em><a title="Identity of indiscernibles" href="http://en.wikipedia.org/wiki/Identity_of_indiscernibles">identity of indiscernibles</a></em>)</li>
<li><em><em>distance</em></em>(<em>x</em>, <em>y</em>) = <em><em>distance</em></em>(<em>y</em>, <em>x</em>)     (<em><a title="Symmetric relation" href="http://en.wikipedia.org/wiki/Symmetric_relation">symmetry</a></em>)</li>
<li><em><em>distance</em></em>(<em>x</em>, <em>z</em>) ≤ <em><em>distance</em></em>(<em>x</em>, <em>y</em>) + <em><em>distance</em></em>(<em>y</em>, <em>z</em>)     (<em><a title="Subadditivity" href="http://en.wikipedia.org/wiki/Subadditivity">subadditivity</a></em> / <em><a title="Triangle inequality" href="http://en.wikipedia.org/wiki/Triangle_inequality">triangle inequality</a></em>).</li>
</ol>
<p>If you don&#8217;t understand these criteria, don&#8217;t worry too much.  All the black boxes we&#8217;ll look at in the rest of this series will satisfy these criteria automatically.</p>
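<p>If you want to experiment, the four criteria can be spot-checked on a handful of samples.  This Python sketch (the name <code>check_metric</code> is made up) reports the first criterion that fails.  Passing on a sample does not prove that a function is a metric, but a failure proves that it is not.</p>

```python
from itertools import product


def check_metric(distance, samples, tol=1e-9):
    """Return the name of the first violated criterion, or None."""
    for x, y in product(samples, repeat=2):
        d = distance(x, y)
        if d < -tol:
            return "non-negativity"
        if (d <= tol) != (x == y):
            return "identity of indiscernibles"
        if abs(d - distance(y, x)) > tol:
            return "symmetry"
    for x, y, z in product(samples, repeat=3):
        if distance(x, z) > distance(x, y) + distance(y, z) + tol:
            return "triangle inequality"
    return None


# Absolute difference passes; squared difference does not.
ok = check_metric(lambda a, b: abs(a - b), [0, 1, 2.5, -3])
bad = check_metric(lambda a, b: (a - b) ** 2, [0, 1, 2])
```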
<p><strong>STEP 2: Cluster the images</strong></p>
<p><a href="http://en.wikipedia.org/wiki/Cluster_analysis">Clustering</a> is much easier than designing the black box.  Clustering algorithms are used in many fields, so they have received much more attention.  Some algorithms commonly used for this step are:</p>
<ol>
<li><a href="http://en.wikipedia.org/wiki/Support_vector_machine">Support vector machines</a></li>
<li><a href="http://en.wikipedia.org/wiki/K-means_clustering">K-means</a></li>
<li><a href="http://en.wikipedia.org/wiki/Artificial_neural_network">Neural networks</a></li>
<li><a href="http://en.wikipedia.org/wiki/Hierarchical_clustering">Hierarchical clustering</a> (i.e. <a href="http://en.wikipedia.org/wiki/Dendrogram">Dendrograms</a>)</li>
</ol>
<p>There are many more as well.  In general, you can use whatever clustering algorithm you want.  When developing an application, most people will try several and pick whichever one happens to work the best for their data.</p>
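<p>As one concrete example, here is a small Python sketch of bottom-up hierarchical clustering (single linkage) that only needs a black box distance function.  It is deliberately naive and slow, and the name <code>single_linkage</code> and the choice to stop at <code>k</code> clusters are illustrative assumptions.</p>

```python
def single_linkage(items, distance, k):
    """Merge the two closest clusters until only k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[item] for item in items]  # start with one item each
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: the gap between two clusters is the
                # distance between their closest pair of members.
                gap = min(distance(x, y)
                          for x in clusters[a] for y in clusters[b])
                if best is None or gap < best[0]:
                    best = (gap, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters


# Numbers stand in for images; absolute difference is the black box.
groups = single_linkage([1, 2, 10, 11, 50], lambda x, y: abs(x - y), k=3)
```

<p>Recording the order of the merges instead of stopping at <code>k</code> would give the tree that a dendrogram draws.</p>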
<p>Here&#8217;s an example clustering of our cat data using Black Box A (i.e. by what the picture is of):</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/cats_cluster1.png"><img class="aligncenter size-full wp-image-220" title="cats_cluster1" src="http://izbicki.me/blog/wp-content/uploads/2011/10/cats_cluster1.png" alt="" width="500" height="280" /></a>We&#8217;ve created three clusters.  The red cluster contains hearts, the white cluster contains cats, and the blue cluster is an anomaly.  It contains both a cat and a dog, and there is no easy way to separate them.  If we had used a hierarchical clustering algorithm, the &#8220;contains cats and dogs&#8221; cluster might be a sub-cluster of the &#8220;contains cats&#8221; cluster.</p>
<p>Here&#8217;s the same data clustered using Black Box B (i.e. by how the picture is drawn):</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/cats_cluster2.png"><img class="aligncenter size-full wp-image-221" title="cats_cluster2" src="http://izbicki.me/blog/wp-content/uploads/2011/10/cats_cluster2.png" alt="" width="500" height="280" /></a>Now we have only two clusters: the white cluster contains cartoons, and the red cluster contains photographs.</p>
<p>One last note.  Most of the CPU work gets done during this step.  On large datasets, clustering can take hours to months depending on the algorithm and the speed of the black box.  There are many tricks for speeding up clustering, which we&#8217;ll take a look at in later posts.</p>
<p><strong>STEP 3: Run your query</strong></p>
<p>Queries are fairly easy once the groundwork is set up with the black box and clustering. Sometimes, all you want to know is how STEP 2 clustered your input.  For example, you could query &#8220;how many types of animals are in this dataset?&#8221;  The answer would just be the number of clusters using Box A.  Typically, however, your query will supply the database with an image and ask it to find similar images.</p>
<p><a href="http://izbicki.me/blog/wp-content/uploads/2011/10/query1.png"><img class="aligncenter size-full wp-image-227" title="query1" src="http://izbicki.me/blog/wp-content/uploads/2011/10/query1.png" alt="" width="660" height="177" /></a>If we&#8217;ve done steps 1 and 2 well, this should take only seconds even when the database contains millions of images.  Of course, it&#8217;s not always possible to do steps 1 and 2 well enough to make this happen.  Later posts may cover some new techniques for speeding up the querying process.</p>
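<p>One simple way the clusters can speed up a query is sketched below in Python.  Each cluster gets a representative (here its medoid, the member closest to all the others), the query image is compared against the representatives only, and then the members of the closest cluster are ranked.  This is one possible scheme under those assumptions, not necessarily the one a real system would use, and it can miss a close match that landed in a different cluster.</p>

```python
def medoid(cluster, distance):
    """The member with the smallest total distance to the others."""
    return min(cluster, key=lambda c: sum(distance(c, o) for o in cluster))


def query(image, clusters, distance):
    """Rank the members of the cluster whose medoid is closest."""
    # In a real system the representatives would be computed once,
    # at clustering time, rather than on every query.
    reps = [medoid(c, distance) for c in clusters]
    closest = min(range(len(clusters)),
                  key=lambda i: distance(image, reps[i]))
    return sorted(clusters[closest], key=lambda c: distance(image, c))


# Numbers stand in for images; absolute difference is the black box.
clusters = [[1, 2, 3], [10, 11, 12], [50]]
matches = query(10.4, clusters, lambda x, y: abs(x - y))
```

<p>The query touches one representative per cluster plus the members of a single cluster, instead of every image in the database.</p>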
<p><strong>The Rest of the Series</strong></p>
<p>So far, we&#8217;ve seen that the black box framework for image data mining is very simple:</p>
<p>STEP 1: Create the &#8220;black box&#8221;</p>
<p>STEP 2: Cluster</p>
<p>STEP 3: Run queries</p>
<p>The tricky part is putting the right algorithm in each step.  In the rest of the series, we&#8217;ll look at a few different black boxes, and show how to efficiently combine them with a clustering algorithm.  The different types of black boxes are the most interesting part of image mining, so we will focus on that first.</p>
]]></content:encoded>
			<wfw:commentRss>http://izbicki.me/blog/data-mining-images-tutorial/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>