# Computer Science


## Functors and monads for analyzing data

Functors and monads are powerful design patterns used in Haskell. They give us two cool tricks for analyzing data.  First, we can “preprocess” data after we’ve already trained a model.  The model will be automatically updated to reflect the changes.  Second, this whole process happens asymptotically faster than the standard method of preprocessing.  In some cases, you can do it in constant time no matter how many data points you have!

This post focuses on how to use functors and monads in practice with the HLearn library.  We won’t talk about their category theoretic foundations; instead, we’ll go through ten concrete examples involving the categorical distribution. This distribution is somewhat awkwardly named for our purposes because it has nothing to do with category theory—it is the most general distribution over non-numeric (i.e. categorical) data. Its simplicity should make the examples a little easier to follow.  Some more complicated models (e.g. the kernel density estimator and Bayesian classifier) also have functor and monad instances, but we’ll save those for another post.
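To make the “preprocess after training” trick concrete, here is a minimal sketch of the idea with a histogram-backed categorical distribution. All the names below are hypothetical stand-ins, not HLearn’s actual interface:

```haskell
import qualified Data.Map as Map

-- A toy categorical distribution: a histogram of counts per category.
newtype Categorical a = Categorical (Map.Map a Int) deriving (Show, Eq)

-- Training just counts occurrences of each category.
train :: Ord a => [a] -> Categorical a
train = Categorical . Map.fromListWith (+) . map (\x -> (x, 1))

-- "Preprocessing after training": relabel the categories of an
-- already-trained model.  This touches one map entry per *category*,
-- not one per data point, which is where the asymptotic speedup
-- comes from.  (A lawful Functor can't carry the Ord constraint, so
-- HLearn uses its own functor-like machinery for this.)
catMap :: Ord b => (a -> b) -> Categorical a -> Categorical b
catMap f (Categorical m) = Categorical (Map.mapKeysWith (+) f m)
```

For example, after training on fine-grained color labels you could merge several labels into one bucket without ever revisiting the raw data points.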

## HLearn’s code is shorter and clearer than Weka’s

Haskell code is expressive.  The HLearn library uses 6 lines of Haskell to define a function for training a Bayesian classifier; the equivalent code in the Weka library uses over 100 lines of Java.  That’s a big difference!  In this post, we’ll look at the actual code and see why the Haskell version is so much more concise.

But first, a disclaimer: it is really hard to compare two code bases fairly this way.  In both libraries, a lot of supporting code goes into defining each classifier, and it’s not obvious which code to include in the count.  For example, both libraries implement interfaces to a number of probability distributions, and that code is not included in the line counts.  The Haskell code takes more advantage of this abstraction, so this is one language-agnostic reason why it is shorter.  If you think I’m not doing a fair comparison, here are links to the full repositories so you can do it yourself:

## HLearn cross-validates >400x faster than Weka

Weka is one of the most popular tools for data analysis.  But Weka takes 70 minutes to perform leave-one-out cross-validation using a simple naive Bayes classifier on the census income data set, whereas Haskell’s HLearn library takes only 9 seconds.  Weka is 465x slower!

Code and instructions for reproducing these experiments are available on GitHub.
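The speedup comes from the models’ group structure: instead of retraining from scratch for every fold, you train once on all the data and then “untrain” each held-out point with the group inverse. Here is a minimal sketch of that idea using a toy running-mean model (hypothetical names, not HLearn’s naive Bayes API):

```haskell
-- A toy model: a running mean stored as (count, sum).
data MeanModel = MeanModel Int Double deriving (Show, Eq)

instance Semigroup MeanModel where
  MeanModel n1 s1 <> MeanModel n2 s2 = MeanModel (n1 + n2) (s1 + s2)

instance Monoid MeanModel where
  mempty = MeanModel 0 0

train1 :: Double -> MeanModel
train1 x = MeanModel 1 x

-- The group structure: every model has an inverse, so we can
-- *untrain* a data point in O(1).
inverse :: MeanModel -> MeanModel
inverse (MeanModel n s) = MeanModel (negate n) (negate s)

mean :: MeanModel -> Double
mean (MeanModel n s) = s / fromIntegral n

-- Leave-one-out: train once on everything, then remove each point
-- with the inverse instead of retraining n times from scratch.
looMeans :: [Double] -> [Double]
looMeans xs = [ mean (full <> inverse (train1 x)) | x <- xs ]
  where full = foldMap train1 xs
```

This turns leave-one-out cross-validation from n full training runs into one training run plus n constant-time subtractions, which is the shape of the asymptotic win described above.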


## Markov Networks, Monoids, and Futurama

In this post, we’re going to look at how to manipulate multivariate distributions in Haskell’s HLearn library. There are many ways to represent multivariate distributions, but we’ll use a technique called Markov networks.  These networks have the algebraic structure called a monoid (and group and vector space), and training them is a homomorphism.  Despite the scary names, these mathematical structures make working with our distributions really easy and convenient—they give us online and parallel training algorithms “for free.” If you want to go into the details of how, you can check out my TFP13 submission, but in this post we’ll ignore those mathy details to focus on how to use the library in practice.  We’ll use a running example of creating a distribution over characters in the show Futurama.
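The claim that “training is a homomorphism” can be stated concretely: training on concatenated data equals merging independently trained models, which is exactly what makes online and parallel training free. A toy sketch with counts of Futurama character names (hypothetical names, standing in for HLearn’s Markov network type):

```haskell
import qualified Data.Map as Map

-- A toy count-based model over character names.
type Model = Map.Map String Int

train :: [String] -> Model
train = Map.fromListWith (+) . map (\w -> (w, 1))

-- The monoid operation merges two trained models.  Because
-- train (xs ++ ys) == merge (train xs) (train ys), chunks of data
-- can be trained independently (online, or in parallel) and
-- combined afterwards.
merge :: Model -> Model -> Model
merge = Map.unionWith (+)
```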

## The categorical distribution’s algebraic structure

The categorical distribution is the main distribution for handling discrete data. I like to think of it as a histogram.  For example, let’s say Simon has a bag full of marbles.  There are four “categories” of marbles—red, green, blue, and white.  Now, if Simon reaches into the bag and randomly selects a marble, what’s the probability it will be green?  We would use the categorical distribution to find out.

In this article, we’ll go over the math behind the categorical distribution, the algebraic structure of the distribution, and how to manipulate it within Haskell’s HLearn library.  We’ll also see some examples of how this focus on algebra makes HLearn’s interface more powerful than other common statistical packages.  Everything that we’re going to see is in a certain sense very “obvious” to a statistician, but this algebraic framework also makes it convenient.  And since programmers are inherently lazy, this is a Very Good Thing.
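As a preview of the math, Simon’s bag of marbles can be sketched as a histogram with a probability query and the monoid structure the post discusses. The names below are hypothetical, not HLearn’s actual API:

```haskell
import qualified Data.Map as Map

-- Simon's bag of marbles as a histogram of counts per category.
newtype Categorical = Categorical (Map.Map String Int)
  deriving (Show, Eq)

train :: [String] -> Categorical
train = Categorical . Map.fromListWith (+) . map (\c -> (c, 1))

-- P(category) = count / total.
prob :: Categorical -> String -> Double
prob (Categorical m) c =
  fromIntegral (Map.findWithDefault 0 c m)
    / fromIntegral (sum (Map.elems m))

-- The algebraic structure: merging two histograms is associative
-- and has an identity, so Categorical is a monoid.
instance Semigroup Categorical where
  Categorical a <> Categorical b = Categorical (Map.unionWith (+) a b)

instance Monoid Categorical where
  mempty = Categorical Map.empty
```

So if the bag holds 3 red, 2 green, and 5 blue marbles, the probability of drawing green is 2/10.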

Before delving into the “cool stuff,” we have to look at some of the mechanics of the HLearn library.

## Nuclear weapon statistics using monoids, groups, and modules in Haskell

The Bulletin of the Atomic Scientists tracks the nuclear capabilities of every country. We’re going to use their data to demonstrate Haskell’s HLearn library and the usefulness of abstract algebra to statistics. Specifically, we’ll see that the categorical distribution and kernel density estimates have monoid, group, and module algebraic structures.  We’ll explain what this crazy lingo even means, then take advantage of these structures to efficiently answer real-world statistical questions about nuclear war. It’ll be a WOPR!

## Gaussian distributions form a monoid

#### (And why machine learning experts should care)

This is the first in a series of posts about the HLearn library for Haskell that I’ve been working on for the past few months. The idea of the library is to show that abstract algebra—specifically monoids, groups, and homomorphisms—is useful not just in esoteric functional programming, but also in real world machine learning problems.  In particular, by framing a learning algorithm according to these algebraic properties, we get three things for free: (1) an online version of the algorithm; (2) a parallel version of the algorithm; and (3) a procedure for cross-validation that runs asymptotically faster than the standard version.

We’ll start with the example of a Gaussian distribution. Gaussians are ubiquitous in learning algorithms because they describe many kinds of real-world data well.  But more importantly, they are easy to work with.  They are fully determined by their mean and variance, and these parameters are easy to calculate.

In this post we’ll start with examples of why the monoid and group properties of Gaussians are useful in practice, then we’ll look at the math underlying these examples, and finally we’ll see that this technique is extremely fast in practice and results in near perfect parallelization.
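The monoid can be sketched by storing a Gaussian as its sufficient statistics: the count, sum, and sum of squares of the training points. The names below are hypothetical, and the naive sum-of-squares variance formula trades numerical stability for simplicity:

```haskell
-- A Gaussian summarized by (count, sum, sum of squares).
data Gaussian = Gaussian Int Double Double deriving (Show, Eq)

-- Merging two trained Gaussians just adds their statistics, which is
-- what makes online and parallel training work.
instance Semigroup Gaussian where
  Gaussian na sa qa <> Gaussian nb sb qb =
    Gaussian (na + nb) (sa + sb) (qa + qb)

instance Monoid Gaussian where
  mempty = Gaussian 0 0 0

train1 :: Double -> Gaussian
train1 x = Gaussian 1 x (x * x)

mean :: Gaussian -> Double
mean (Gaussian n s _) = s / fromIntegral n

-- Population variance via E[x^2] - E[x]^2; simple, but less
-- numerically stable than Welford-style updates.
variance :: Gaussian -> Double
variance g@(Gaussian n _ q) = q / fromIntegral n - mean g ^ (2 :: Int)
```

Training is then a monoid homomorphism: training on two halves of the data and merging gives the same model as training on all of it at once.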

## Using HMMs in Haskell for Bioinformatics

EDIT: WordPress seems to garble the code sections on occasion for no good reason.  If you want to run the code, you should download the original file instead.  Sorry.

This is a tutorial for how to use Hidden Markov Models (HMMs) in Haskell.  We will use the Data.HMM package to find genes in the second chromosome of Vitis vinifera: the wine grape vine. Predicting gene locations is a common task in bioinformatics that HMMs have proven good at.

The basic procedure has three steps.  First, we create an HMM to model the chromosome.  We do this by running the Baum-Welch training algorithm on all the DNA.  Second, we create an HMM to model transcription factor binding sites.  This is where genes are located.  Finally, we use Viterbi’s algorithm to determine which HMM best models the DNA at a given location in the chromosome.  If it’s the first, this is probably not the start of a gene.  If it’s the second, then we’ve found a gene!
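To give a feel for step three, here is a from-scratch sketch of Viterbi decoding on a tiny made-up two-state model; the Data.HMM package used in the post has its own interface, and all numbers below are purely illustrative:

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- A minimal HMM: states, initial, transition, and emission probabilities.
data HMM s o = HMM
  { states :: [s]
  , initP  :: s -> Double
  , transP :: s -> s -> Double
  , emitP  :: s -> o -> Double
  }

-- Viterbi decoding: the most likely state sequence for a sequence of
-- observations.  Each candidate is (probability, reversed path).
viterbi :: HMM s o -> [o] -> [s]
viterbi _ [] = []
viterbi hmm (o0 : os) = reverse (snd (best (foldl step start os)))
  where
    start = [ (initP hmm s * emitP hmm s o0, [s]) | s <- states hmm ]
    step paths o =
      [ let (p, path) = best [ (p' * transP hmm (head pth) s, pth)
                             | (p', pth) <- paths ]
        in (p * emitP hmm s o, s : path)
      | s <- states hmm ]
    best = maximumBy (comparing fst)

-- A toy model: 'n' = non-coding background, 'g' = gene-like region.
toyHMM :: HMM Char Char
toyHMM = HMM
  { states = "ng"
  , initP  = const 0.5
  , transP = \_ _ -> 0.5  -- memoryless, to keep the example tiny
  , emitP  = \s o -> if s == 'n'
                       then 0.25
                       else if o `elem` "gc" then 0.45 else 0.05
  }
```

With these emissions, GC-rich stretches of the observation string decode to the gene-like state and AT-rich stretches to the background, which is the qualitative behavior the real gene finder relies on.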

## How to create an unfair coin and prove it with math

Want to make sure you win the coin toss just a little more often than you should?  I certainly do, so I made some unfair coins.  We’ll use the beta distribution to see just how unfair they are.  While this is just a toy example problem for using the beta distribution, machine learning algorithms rely on this distribution for learning just about everything. Math is an amazing thing that way.
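The beta-distribution calculation itself is tiny: starting from a uniform Beta(1, 1) prior, observing h heads and t tails gives a Beta(1+h, 1+t) posterior. A minimal sketch, with hypothetical names:

```haskell
-- Beta distribution parameters.
data Beta = Beta Double Double deriving (Show, Eq)

-- A uniform prior: every bias for the coin is equally likely.
uniformPrior :: Beta
uniformPrior = Beta 1 1

-- Conjugacy makes the Bayesian update a simple addition:
-- heads increment alpha, tails increment beta.
observe :: Beta -> (Int, Int) -> Beta
observe (Beta a b) (heads, tails) =
  Beta (a + fromIntegral heads) (b + fromIntegral tails)

-- Posterior mean estimate of the probability of heads.
pHeads :: Beta -> Double
pHeads (Beta a b) = a / (a + b)
```

So a coin showing 70 heads in 100 flips has posterior Beta(71, 31), with a posterior mean of 71/102 ≈ 0.696 for heads—measurably unfair.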

## Converting images into time series for data mining

The first step in data mining images is to create a distance measure for two images.  In the intro to data mining images, we called this distance measure the “black box.”  This post will cover how to create distance measures based on time series analysis.  This technique is great for comparing objects with a constant, rigid shape.  For example, it will work well on classifying images of skulls, but not on images of people.  Skulls always have the same shape, whereas a person might be walking, standing, sitting, or curled into a ball.  By the end of this post, you should understand how to compare these hominid skulls from UC Riverside [1] using radial scanning and dynamic time warping.
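Once radial scanning has turned each contour into a numeric series, the distance between two images reduces to a distance between two time series. Here is a naive O(n·m) dynamic time warping sketch—an illustration of the idea, not the paper’s optimized LB_Keogh method:

```haskell
import Data.Array

-- Dynamic time warping distance between two time series.
-- table!(i,j) is the cost of the best warping path aligning the
-- first i points of xs with the first j points of ys.
dtw :: [Double] -> [Double] -> Double
dtw xs ys = table ! (n, m)
  where
    n  = length xs
    m  = length ys
    xa = listArray (1, n) xs
    ya = listArray (1, m) ys
    table = array ((0, 0), (n, m))
              [ ((i, j), cell i j) | i <- [0 .. n], j <- [0 .. m] ]
    cell 0 0 = 0
    cell 0 _ = 1 / 0  -- infinity: nothing left to align against
    cell _ 0 = 1 / 0
    cell i j = abs (xa ! i - ya ! j)
             + minimum [ table ! (i - 1, j)
                       , table ! (i, j - 1)
                       , table ! (i - 1, j - 1) ]
```

Unlike Euclidean distance, DTW lets one series stretch or compress in time, so a repeated sample costs nothing when a matching neighbor exists.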

Footnotes
1. Eamonn Keogh, Li Wei, Xiaopeng Xi, Sang-Hee Lee and Michail Vlachos. “LB_Keogh Supports Exact Indexing of Shapes under Rotation Invariance with Arbitrary Representations and Distance Measures.” VLDB 2006.
