# Gaussian distributions form a monoid

#### (And why machine learning experts should care)

This is the first in a series of posts about the HLearn library for Haskell that I’ve been working on for the past few months. The idea of the library is to show that abstract algebra—specifically monoids, groups, and homomorphisms—is useful not just in esoteric functional programming, but also in real-world machine learning problems.  In particular, by framing a learning algorithm according to these algebraic properties, we get three things for free: (1) an online version of the algorithm; (2) a parallel version of the algorithm; and (3) a procedure for cross-validation that runs asymptotically faster than the standard version.

We’ll start with the example of a Gaussian distribution. Gaussians are ubiquitous in learning algorithms because they describe many kinds of real-world data well.  But more importantly, they are easy to work with.  They are fully determined by their mean and variance, and these parameters are easy to calculate.

In this post we’ll start with examples of why the monoid and group properties of Gaussians are useful in practice, then we’ll look at the math underlying these examples, and finally we’ll see that this technique is extremely fast in practice and results in near perfect parallelization.

### HLearn by Example

Install the libraries from a shell:

```
$ cabal install HLearn-algebra-0.0.1
$ cabal install HLearn-distributions-0.0.1
```

Then import the HLearn libraries into a literate Haskell file:

```
> import HLearn.Algebra
> import HLearn.Models.Distributions.Gaussian
```

And some libraries for comparing our performance:

```
> import Criterion.Main
> import Statistics.Distribution.Normal
> import qualified Data.Vector.Unboxed as VU
```

Now let’s create some data to work with. For simplicity’s sake, we’ll use a made up data set of how much money people make. Every entry represents one person making that salary. (We use a small data set here for ease of explanation.  When we stress test this library at the end of the post we use much larger data sets.)

```
> gradstudents = [15e3,25e3,18e3,17e3,9e3]        :: [Double]
> teachers     = [40e3,35e3,89e3,50e3,52e3,97e3]  :: [Double]
> doctors      = [130e3,105e3,250e3]              :: [Double]
```

In order to train a Gaussian distribution from the data, we simply use the train function, like so:

```
> gradstudents_gaussian = train gradstudents      :: Gaussian Double
> teachers_gaussian     = train teachers          :: Gaussian Double
> doctors_gaussian      = train doctors           :: Gaussian Double
```

The train function is a member of the HomTrainer type class, which we’ll talk more about later.  Now that we’ve trained some Gaussian distributions, we can perform all the normal calculations we might want to do on a distribution: for example, taking the mean, standard deviation, pdf, and cdf.

Now for the interesting bits. We start by showing that the Gaussian is a semigroup. A semigroup is any data structure that has an associative binary operation called (<>). Basically, we can think of (<>) as “adding” or “merging” the two structures together. (In Haskell terms, a semigroup is a monoid without the mempty identity element.)

So how do we use this? Well, what if we decide we want a Gaussian over everyone’s salaries? Using the traditional approach, we’d have to recompute this from scratch.

```
> all_salaries = concat [gradstudents,teachers,doctors]
> traditional_all_gaussian = train all_salaries :: Gaussian Double
```

But this repeats work we’ve already done. On a real world data set with millions or billions of samples, this would be very slow. Better would be to merge the Gaussians we’ve already trained into one final Gaussian. We can do that with the semigroup operation (<>):

`> semigroup_all_gaussian = gradstudents_gaussian <> teachers_gaussian <> doctors_gaussian`

Now,

`traditional_all_gaussian == semigroup_all_gaussian`

The coolest part about this is that the semigroup operation takes time O(1), no matter how much data we’ve trained the Gaussians on. The naive approach takes time O(n), so we’ve got a pretty big speed up!

Next, a monoid is a semigroup with an identity. The identity for a Gaussian is easy to define—simply train on the empty data set!

`> gaussian_identity = train ([]::[Double]) :: Gaussian Double`

Now,

`gaussian_identity == mempty`

But we’ve still got one more trick up our sleeves.  The Gaussian distribution is not just a monoid, but also a group. Groups appear all the time in abstract algebra, but they haven’t seen much attention in functional programming for some reason. Well groups are simple: they’re just monoids with an inverse. This inverse lets us do “subtraction” on our data structures.

So back to our salary example. Let’s say we’ve calculated all our salaries, but we’ve realized that including grad students in the salary calculations was a mistake. (They’re not real people after all.) In a normal library, we would have to recalculate everything from scratch again, excluding the grad students:

```
> nograds = concat [teachers,doctors]
> traditional_nograds_gaussian = train nograds :: Gaussian Double
```
But as we’ve already discussed, this takes a lot of time. We can use the inverse function to do this same operation in constant time:

`> group_nograds_gaussian = semigroup_all_gaussian <> (inverse gradstudents_gaussian)`

And now,

`traditional_nograds_gaussian == group_nograds_gaussian`

Again, we’ve converted an operation that would have taken time O(n) into one that takes time O(1). Can’t get much better than that!
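To make the “subtraction” concrete, here is a self-contained miniature that mimics the idea with raw summary statistics rather than HLearn’s actual Gaussian type. The names Stats, trainStats, and inverseStats are invented for illustration only:

```haskell
-- An invented miniature (not HLearn's API): count, sum, and sum of
-- squares form a group, and the inverse negates every field.
import Data.List (foldl')

data Stats = Stats Int Double Double deriving (Eq, Show)

instance Semigroup Stats where
  Stats na sa qa <> Stats nb sb qb = Stats (na + nb) (sa + sb) (qa + qb)

instance Monoid Stats where
  mempty = Stats 0 0 0

-- "untraining": fieldwise negation
inverseStats :: Stats -> Stats
inverseStats (Stats n s q) = Stats (negate n) (negate s) (negate q)

trainStats :: [Double] -> Stats
trainStats = foldl' (\acc x -> acc <> Stats 1 x (x * x)) mempty

main :: IO ()
main = do
  let grads = [15e3, 25e3, 18e3, 17e3, 9e3]
      rest  = [40e3, 35e3, 89e3]
  -- subtracting the grad students matches retraining without them
  print (trainStats (grads ++ rest) <> inverseStats (trainStats grads)
           == trainStats rest)
```

Because every field is just a running sum, the inverse is simply fieldwise negation, and merging with an inverse exactly cancels the data it was trained on.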

### The HomTrainer Type Class

As I’ve already mentioned, the HomTrainer type class is the basis of the HLearn library.  Basically, any learning algorithm whose training function is a semigroup homomorphism can be made an instance of HomTrainer.  This means that if xs and ys are lists of data points, the class obeys the following law:

`train (xs ++ ys) == (train xs) <> (train ys)`

It might be easier to see what this means in picture form:

On the left hand side, we have some data sets, and on the right hand side, we have the corresponding Gaussian distributions and their parameters.  Because training the Gaussian is a homomorphism, it doesn’t matter whether we follow the orange or green paths to get to our final answer.  We get the exact same answer either way.
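The law is easy to check with a toy model. Here the standard Sum monoid stands in for a trained model (trainSum is an invented name, not part of HLearn):

```haskell
-- A toy check of the homomorphism law using the standard Sum monoid
-- as a stand-in model.
import Data.Monoid (Sum(..))

trainSum :: [Double] -> Sum Double
trainSum = foldMap Sum

main :: IO ()
main = print (trainSum ([1, 2] ++ [3, 4]) == trainSum [1, 2] <> trainSum [3, 4])
-- prints True
```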

Based on this property alone, we get the three “free” properties I mentioned in the introduction.  (1) We get an online algorithm for free.  The function add1dp can be used to add a single new point to an existing Gaussian distribution.  Let’s say I forgot about one of the graduate students—I’m sure this would never happen in real life—I can add their salary like this:

`> gradstudents_updated_gaussian = add1dp gradstudents_gaussian (10e3::Double)`

This updated Gaussian is exactly what we would get if we had included the new data point in the original data set.

(2) We get a parallel algorithm.  We can use the higher order function parallel to parallelize any application of train.  For example,

`> gradstudents_parallel_gaussian = (parallel train) gradstudents :: Gaussian Double`

The function parallel automatically detects the number of processors your computer has and evenly distributes the workload over them.  As we’ll see in the performance section, this results in perfect parallelization of the training function.  Parallelization literally could not be any simpler!

(3) We get asymptotically faster cross-validation; but that’s not really applicable to a Gaussian distribution so we’ll ignore it here.

One last note about the HomTrainer class: we never actually have to define the train function for our learning algorithm explicitly.  All we have to do is define the semigroup operation, and the compiler will derive our training function for us!  We’ll save a discussion of why this homomorphism property gives us these results for another post.  Instead, we’ll just take a look at what the Gaussian distribution’s semigroup operation looks like.
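A minimal sketch of how this derivation works (SketchTrainer is an invented class, not HLearn’s real HomTrainer): once a model is a Monoid and can be trained on a single point, the batch trainer is written once and for all in terms of mempty and (<>):

```haskell
{-# LANGUAGE TypeFamilies, FlexibleInstances #-}
-- An invented miniature of the idea: single-point training plus the
-- monoid structure gives batch training for free.
import Data.Monoid (Sum(..))

class Monoid model => SketchTrainer model where
  type Datapoint model
  trainOne :: Datapoint model -> model

-- the "free" batch trainer, derived purely from the monoid
trainSketch :: SketchTrainer model => [Datapoint model] -> model
trainSketch = foldr (\dp acc -> trainOne dp <> acc) mempty

-- a trivial model: the running sum of the data points
instance SketchTrainer (Sum Double) where
  type Datapoint (Sum Double) = Double
  trainOne = Sum

main :: IO ()
main = print (getSum (trainSketch [1, 2, 3] :: Sum Double))
-- prints 6.0
```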

### The Semigroup Operation

Our Gaussian data type is defined as:

```
data Gaussian datapoint = Gaussian
    { n  :: !Int         -- The number of samples trained on
    , m1 :: !datapoint   -- The mean (first moment) of the trained distribution
    , m2 :: !datapoint   -- The variance (second moment) times (n-1)
    , dc :: !Int         -- The number of "dummy points" that have been added
    }
```

In order to estimate a Gaussian from a sample, we must find the total number of samples (n), the mean (m1), and the variance (calculated from m2).  (We’ll explain what dc means a little later.)  Therefore, we must figure out an appropriate definition for our semigroup operation below:

`(Gaussian na m1a m2a dca) <> (Gaussian nb m1b m2b dcb) = Gaussian n' m1' m2' dc'`

First, we calculate the number of samples n’. The number of samples in the resulting distribution is simply the sum of the number of samples in both the input distributions:

$$n' = n_a + n_b$$

Second, we calculate the new average m1′. We start with the definition that the final mean is the average of all the data points:

$$m_1' = \frac{1}{n'} \sum_{i=1}^{n'} x_i$$

Then we split the summation according to whether the input element was from the left Gaussian a or right Gaussian b, and substitute with the definition of the mean above:

$$m_1' = \frac{n_a m_{1a} + n_b m_{1b}}{n'}$$

Notice that this is simply the weighted average of the two means. This makes intuitive sense. But there is a slight problem with this definition: when implemented on a computer with floating point arithmetic, we divide by zero whenever n’ is 0.  We solve this problem by adding a “dummy” element into the Gaussian whenever n’ would be zero.  This increases n’ from 0 to 1, preventing the division by 0.  The variable dc counts how many dummy variables have been added, so that we can remove them before performing calculations (e.g. finding the pdf) that would be affected by an incorrect number of samples.

Finally, we must calculate the new m2′. We start with the definition that the variance times (n-1) is:

$$m_2 = \sum_{i=1}^{n} (x_i - m_1)^2 = \sum_{i=1}^{n} x_i^2 - n m_1^2$$

(Note that the second half of the equation is a property of variance, not a simple squaring of terms; its derivation can be found on Wikipedia.)

Then, we do some algebra, split the summations according to which input Gaussian the data point came from, and resubstitute the definition of m2 to get:

$$m_2' = m_{2a} + m_{2b} + n_a m_{1a}^2 + n_b m_{1b}^2 - n' m_1'^2$$

Notice that this equation has no divisions in it.  This is why we are storing m2 as the variance times (n-1) rather than simply the variance.  Adding in the extra divisions causes training our Gaussian distribution to run about 4x slower.  I’d say Haskell is getting pretty fast if the number of floating-point divisions we perform impacts our code’s performance that much!
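The whole derivation fits in a few lines of code. Here is a self-contained transcription of the merge rules, with the dummy-point bookkeeping omitted (so neither side may be empty); G and trainG are invented names, not HLearn’s actual definitions:

```haskell
-- A direct transcription of the merge rules derived above.
import Data.List (genericLength)

data G = G { gn :: Double, gm1 :: Double, gm2 :: Double } deriving Show

merge :: G -> G -> G
merge (G na m1a m2a) (G nb m1b m2b) = G n' m1' m2'
  where
    n'  = na + nb
    m1' = (na * m1a + nb * m1b) / n'                             -- weighted mean
    m2' = m2a + m2b + na * m1a ^ 2 + nb * m1b ^ 2 - n' * m1' ^ 2 -- division-free

-- direct (naive) training, for comparison
trainG :: [Double] -> G
trainG xs = G n m1 m2
  where
    n  = genericLength xs
    m1 = sum xs / n
    m2 = sum [ (x - m1) ^ 2 | x <- xs ]

main :: IO ()
main = do
  let merged = merge (trainG [1, 2, 3]) (trainG [4, 5, 6, 7])
      direct = trainG [1 .. 7]
  -- the merged moments agree with training on all the data at once
  print (gn merged == gn direct && gm1 merged == gm1 direct
           && abs (gm2 merged - gm2 direct) < 1e-9)
```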

### Performance

This algebraic interpretation of the Gaussian distribution has excellent time and space performance.  To show this, we’ll compare against the excellent Haskell package called “statistics”, which also supports Gaussian distributions.  We use the criterion package to create three tests:

```
> size = 10^8
> main = defaultMain
>     [ bench "statistics-Gaussian" $ whnf (normalFromSample . VU.enumFromN 0) (size)
>     , bench "HLearn-Gaussian" $ whnf
>         (train :: VU.Vector Double -> Gaussian Double)
>         (VU.enumFromN (0::Double) size)
>     , bench "HLearn-Gaussian-Parallel" $ whnf
>         (parallel $ (train :: VU.Vector Double -> Gaussian Double))
>         (VU.enumFromN (0::Double) size)
>     ]
```

In these tests, we time three different methods of constructing Gaussian distributions given 100,000,000 data points.  On my laptop with 2 cores, I get these results:

- statistics-Gaussian: 2.85 sec
- HLearn-Gaussian: 1.91 sec
- HLearn-Gaussian-Parallel: 0.96 sec

Pretty nice!  The algebraic method managed to outperform the traditional method for training a Gaussian by a handy margin.  Plus, our parallel algorithm runs exactly twice as fast on two processors.  Theoretically, this should scale to an arbitrary number of processors, but I don’t have a bigger machine to try it out on.

Another interesting advantage of the HLearn library is that we can trade off time and space performance by changing which data structure stores our data set.  Specifically, we can use the same functions to train on a list or an unboxed vector.  We do this by using the ConstraintKinds package on Hackage, which extends the base type classes like Functor and Foldable to work on types that require constraints.  Thus, we have a Functor instance for unboxed vectors, which is not possible without ConstraintKinds.

Using this benchmark code:

```
main = do
    print $ (train [0..fromIntegral size::Double] :: Gaussian Double)
    print $ (train (VU.enumFromN (0::Double) size) :: Gaussian Double)
```

We generate the following heap profile:

Processing the data as a vector requires that we allocate all the memory in advance.  This lets the program run faster, but prevents us from loading data sets larger than the amount of memory we have.  Processing the data as a list, however, allows us to allocate the memory only as we use it.  But because lists are boxed and lazy data structures, we must accept that our program will run about 10x slower.  Lucky for us, GHC takes care of all the boring details of making this happen seamlessly.  We only have to write our train function once.

### Future Posts

There are still at least four more major topics to cover in the HLearn library:  (1) We can extend this discussion to show how the Naive Bayes learning algorithm has a similar monoid and group structure.  (2) There are many more learning algorithms with group structures we can look into.  (3) We can look at exactly how all these higher order functions, like batch and parallel, work under the hood.  And (4) we can see how the fast cross-validation I briefly mentioned works and why it’s important.

Subscribe to the RSS feed and stay tuned!

1. The title, “Gaussian distributions are monoids” makes it sound like each individual Gaussian distribution is a monoid (which is nonsense and is the reason I clicked through to see what the heck was going on). I think you meant something like “Gaussian distributions form a monoid”, or more precisely that the type “Gaussian a” in your library is a monoid. To make this clear consider “The natural numbers form a monoid” vs “Natural numbers are monoids”.

1. You’re right. I’ve updated the title. Thanks.

1. Two remarks.

First, groups form transformations over some space. The group of Gaussian distributions form a transformation over what?

Presumably some function space?

(Once you answer this, you can then categorify the group vertically to a 2-group, or horizontally to a groupoid, and do a lot more abstract nonsense. But it will sound cool!)

Second, the title shouldn’t have monoid be plural. The set of Gaussian distributions has a monoidal structure induced from the List’s monoidal structure.

Then from this perspective, it’s not so exciting as categorifying this observation.

1. I’m not sure what you mean by “groups form transformations over some space.” Are you referring to group actions?

Yeah, I realized I made a mistake in the title and fixed it.

2. I’m intrigued by your Haskell code, but confused as to how you’re realizing Gaussian distributions as groups…

For instance, the operations performed aren’t on elements of a Gaussian distribution as a set (which is awkward to speak of, since distributions are measures). So I think instead you meant to call the set of all Gaussian distributions a group. But even then, what is the empty Gaussian? What is the inverse of a Gaussian? These are not Gaussian distributions (they are not even probability distributions at all). I can’t even imagine what the following piece of code would evaluate to, if the sample function were defined to sample from a distribution as one would expect:

Is there something I’m missing? It’s obvious that you have an efficient way of manipulating the samples implicitly, and that’s great, but I don’t think they’re group operations.

1. I guess technically this is a generalization of the Gaussian into something else. The empty element is:

`Gaussian 0 0 0 0`

and

`inverse \$ Gaussian n m1 m2 dc `

is defined as

`Gaussian (-n) m1 (-m2) dc`

The plots of the upside down Gaussian in the post are taken just by plugging these numbers directly into the formulas for the pdf. This gives a negative probability, which is meaningless, I agree.

You can verify that the associativity required for a group holds, or you could do some quick Haskell tests to convince yourself. The only tricky thing, as mentioned in the post, is that you have to account for all the divisions by zero you get from working with the empty element. You could dive into the code to see how this is handled, but I figured it was too much to explain in this post.

1. I guess it’s a case of what’s in the name of the rose.

You labeled n “The number of samples trained on” and you said it’s a Gaussian; and then you’re saying it’s also a group because it has an inverse.

A negative cardinality set doesn’t make sense.

So maybe give some caveats to your naming scheme so old fogeys like myself don’t get their panties in a bunch…

2. The author is looking at Gaussian distributions as monoids, not groups, so they don’t need to have an inverse. This is actually why monoids are a generalization of groups.

I admit that the analogy might be a bit forced, but it also looks pretty handy (and intriguing)!

1. The Gaussian is both a monoid and a group.

2. My point is that even with just a monoid structure, the “empty Gaussian” is not a probability distribution. If we’re trying to be rigorous in terms of algebra, we should also be rigorous in terms of probability theory. I can readily believe there are a few ways to combine distributions via an associative operation to make a semigroup.

I think one way you could phrase this is by saying it’s isomorphic to the formal group generated by the set of your Haskell Gaussian objects (trained on arbitrary lists). Then these inverses and identity objects are just that: formal objects which need no sensible interpretation as distributions.

At least mathematicians do this all the time with formal sums in free abelian groups and such, but programmers are usually not as sloppy as mathematicians (they can’t afford to be).

1. I see what you mean, but I’m arguing that maybe mathematicians should have a much more general notion of what it means to be a “Gaussian distribution.” We should be using the construct that is the most elegant, and this seems more elegant to me.

If I went the whole isomorphism route you suggest, I would have to add a couple of very pointless paragraphs explaining what all that means, and no one would slog through that to get to the juicier bits below.

1. I disagree. Probability theorists have perfectly good reasons for calling Gaussians Gaussians. And, alas, that means that Gaussian distributions only form a semigroup and no more. This is part of the reason why functional programmers like myself don’t talk about groups more often; while groups may be common in mathematics, they aren’t nearly so common for the kind of data I work on (as a computational linguist).

However, in spite of Gaussian distributions only forming a semigroup, it is very much worth pointing out that population statistics form a group. Without moving to a quantum setting there’s no meaning behind negative probabilities; however, there’s a perfectly good meaning behind a negative population statistic. Populations are something like sets, and so it’s perfectly valid to have an empty population or to take the inverse statistics of a population; in spite of the fact that these no longer generate meaningful Gaussian distributions.

Being rigorous about the abstract algebra is important, but so is being rigorous about the statistics. There is a connection between raw populations, summary statistics, and distributions; but it is vital not to confuse them. Doing so is akin to confusing between scalars and vectors.

2. Sorry for the late reply, but here are my two cents as a mathematician.

For brevity, let’s call G the set of all Gaussian distributions. First of all, note that this semigroup operation on G is clearly commutative. Moreover, every semigroup can be made into a monoid in a universal way by adding an identity element if there isn’t one already. So, even though the “0 Gaussian distribution” is not a probability distribution, it is uniquely determined by G (up to isomorphism).
Now that we have a commutative monoid G’, there is a universal way of getting a group out of it, called the Grothendieck group of G’ (see for example the Wikipedia article).

This is just to say that, although those described in this article are not Gaussian distributions in the usual meaning, there is a “best” way to get them, which is independent of the chosen representation.
So maybe a better title, albeit less catchy, would have been something like: “Why the Grothendieck group of Gaussian distributions matters”.

3. That’s very interesting and helpful. Thanks. Is there a version of the Grothendieck group that doesn’t require commutativity? The Gaussian semigroup happens to commute, but other important semigroups for learning problems don’t.

I think technically the construction I gave is not a Grothendieck group because I didn’t follow his procedure for constructing it. I’m going to have to think more about the potential performance issues with using the more general version.

3. I think the statement is a misnomer. It’s correct, but the author is dealing here not with distributions but with estimates of distributions’ parameters. A normal distribution is fully defined by its mean and variance, so the semigroup operation should depend only on them, but it also takes the number of elements in the sample into account. It has to, in order to get a correct estimate. The semigroup operation is really a join of two samples; it’s just more efficient and doesn’t keep the samples in memory.

I tried to implement a vaguely similar thing for estimating various statistics but never completed it due to a lack of time. If you’re interested, an incomplete description is here:

The inverse is a weird thing. It has negative variance, for starters. It could be seen as just a computational trick, but it probably corresponds to some mathematical object. Both imaginary numbers and generalized functions started out as computational tricks.

In fact, it’s very easy to make arbitrary distributions into semigroups and monoids. If we have random variables X and Y (not necessarily normally distributed), then the result of the operation would be the distribution of X+Y. The neutral element would be the distribution where we always get 0. Addition could be replaced with any other associative operation. But this is entirely different from what is described in the post.
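That alternative semigroup is easy to write down for the normal case. In this sketch, Normal is just a mean/variance pair, not a sampler: summing independent normal random variables adds means and variances, and the point mass at 0 is the identity.

```haskell
-- Sketch of the "sum of independent random variables" monoid for
-- normal distributions (Normal is an invented illustration type).
data Normal = Normal { mu :: Double, var :: Double } deriving (Eq, Show)

instance Semigroup Normal where
  Normal m1 v1 <> Normal m2 v2 = Normal (m1 + m2) (v1 + v2)

instance Monoid Normal where
  mempty = Normal 0 0  -- the point mass at 0

main :: IO ()
main = print (Normal 1 2 <> Normal 3 4 <> mempty == Normal 4 6)
-- prints True
```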

4. M

```
Registering HLearn-algebra-0.0.1...
Installed HLearn-algebra-0.0.1
Configuring HLearn-distributions-0.0.1.1...
Building HLearn-distributions-0.0.1.1...
Preprocessing library HLearn-distributions-0.0.1.1...
[1 of 4] Compiling HLearn.Models.Distributions.Common ( src/HLearn/Models/Distributions/Common.hs, dist/build/HLearn/Models/Distributions/Common.o )
[2 of 4] Compiling HLearn.Models.Distributions.Categorical ( src/HLearn/Models/Distributions/Categorical.hs, dist/build/HLearn/Models/Distributions/Categorical.o )
[3 of 4] Compiling HLearn.Models.Distributions.Gaussian ( src/HLearn/Models/Distributions/Gaussian.lhs, dist/build/HLearn/Models/Distributions/Gaussian.o )
```


```
src/HLearn/Models/Distributions/Gaussian.lhs:187:33:
    Not in scope: type constructor or class `Unbox'
    Perhaps you meant `U.Unbox' (imported from Data.Vector.Unboxed)
Failed to install HLearn-distributions-0.0.1.1
cabal: Error: some packages failed to install:
HLearn-distributions-0.0.1.1 failed during the building phase. The exception
was: ExitFailure 1
```

1. Thanks. I’m not sure why it’s failing to compile on the hackage server. It has something to do with the unboxed instance for the Gaussian, but I’m not sure why.

` cabal install hlearn-distributions`

Still works fine for me. I’m looking into it.

1. It is an update to `vector-th-unbox`. I was just trying to figure it out. Compare the incantations for version 0.1.x with those for version 0.2

1. Thanks a bunch! I’ve updated the dependencies in the cabal file and re-uploaded to hackage. Hopefully it works for you now.

1. The haddocks for version 0.2 are subtly wrong. It seems the new incantation should be:

```
derivingUnbox "Gaussian"
    [t| (U.Unbox a) => (Gaussian a) -> (Int, a, a, Int) |]
    [| \ (Gaussian n m1 m2 dc) -> (n,m1,m2,dc) |]
    [| \ (n,m1,m2,dc) -> (Gaussian n m1 m2 dc) |]
```

note the ‘t’ in place of ‘d’.

2. I like the layout of version 0.2 better than 0.1 because it looks more like standard haskell syntax. I’ve gone ahead and updated the package to use your code above. (I haven’t tested it, so hopefully it doesn’t break anything.)

Thanks again!

5. Cool post! Very intriguing!

Doesn’t directly carrying the mean around introduce a lot of floating-point inaccuracies? Wouldn’t it be better to keep the sum in the data structure and calculate the mean = sum / (n - dc) when we need it?

This especially worries me when doing a lot of add1dp…

Cheers,
Kosta

1. If we do what you suggest, then we have several divisions in the definition of m2, which is worse.

1. Hmmm… then let’s do both?

I don’t know how bad the floating point inaccuracies are w.r.t. those two approaches. That could be an option if it’s worth it…

1. I don’t know what you mean by “do both”?

6. Nevermind, my last comment made no sense whatsoever.

However, couldn’t we treat variance in a similar way, that it is variance * (n-1) * (n-dc)? That way, we would only need to do 2 divisions when showing the variance and keep our mean and variance calculations division-free. Of course, that would only make sense if the fp accuracy is greatly increased by this. I’m actually not really experienced in that field

(Sorry, it’s too late down here at the moment to figure out the math if that is even possible…)

1. I messed around with the algebra and couldn’t get anything like that to work out. Maybe it’s possible though.

7. Hi, I’d like to talk to you about translating this article to portuguese, giving all the due credits to you. Can you please send me an email? thanks a lot!

8. Maybe I missed something, but how did you arrive at the conclusion that $(x_i - m)^2 = (x_i^2 - m^2)$?

1. It follows directly from the definition of variance. There’s a proof on wikipedia: https://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance

1. I was scratching my head over the same point. Maybe it would be helpful to add a note that that equivalence is not arrived at by a simple squaring.

1. Good idea. I added a note in the main text.

9. I found your article very interesting and tried to run your code. Unfortunately this was not possible:
The HLearn.Models.Distributions.Gaussian module is no longer available in the newer versions of HLearn-distributions! On the other hand, all older versions 0.0.x.x fail to install!
Can you rework the code so that it works with the newest version of HLearn-distributions?

Many thanks and kind regards
Roland

```
roland@freebsd /usr/home/roland #ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.6.1
roland@freebsd /usr/home/roland #cabal --version
cabal-install version 1.16.0.2
using version 1.16.0.3 of the Cabal library

roland@freebsd /usr/home/roland #cabal install HLearn-distributions-0.0.1.3
Resolving dependencies...
Configuring HLearn-distributions-0.0.1.3...
Building HLearn-distributions-0.0.1.3...
Preprocessing library HLearn-distributions-0.0.1.3...
[1 of 4] Compiling HLearn.Models.Distributions.Common ( src/HLearn/Models/Distributions/Common.hs, dist/build/HLearn/Models/Distributions/Common.o )
[2 of 4] Compiling HLearn.Models.Distributions.Categorical ( src/HLearn/Models/Distributions/Categorical.hs, dist/build/HLearn/Models/Distributions/Categorical.o )

src/HLearn/Models/Distributions/Categorical.hs:33:10:
    Not in scope: type constructor or class `NFData'

src/HLearn/Models/Distributions/Categorical.hs:53:50:
    Not in scope: type constructor or class `NFData'
Failed to install HLearn-distributions-0.0.1.3
cabal: Error: some packages failed to install:
HLearn-distributions-0.0.1.3 failed during the building phase. The exception
was: ExitFailure 1
```