I joined the Navy because I wanted to serve my country. My religious beliefs no longer allow me to kill, but I still want to serve. Service, in fact, is an integral part of my beliefs. My country has given me a lot. I value the ideals of freedom and democracy. I want to give everything I have to my country and the ideals for which it stands. Ideally, I would serve in a capacity that maximizes the peace and welfare of the United States but minimizes my contribution to war. I believe these goals are not mutually exclusive. This document explores how well my service options meet these goals, both inside and outside the military, and it will explain my decision not to apply for noncombatant (1-A-0) status.

All billets in the military are designed to maximize the security of the United States, and these billets contribute to war in varying degrees. If a billet existed which did not contribute to war in any way, I would gladly volunteer for it. No matter how dangerous, difficult, time-consuming, or otherwise undesirable the job may be, I would enthusiastically perform it to the best of my abilities. I cannot know every billet available, but I do know what communities exist. The Navy’s officer community is divided into four main groups: unrestricted line officers, restricted line officers, special duty officers, and the staff corps. I will classify these communities according to whether they present high, medium, or low conflict with my beliefs.

I will demonstrate that had I applied for noncombatant (1-A-0) status, I would still be placed in a billet which conflicts with my beliefs. According to regulation MILPERSMAN 1900-020, a noncombatant can be assigned to serve “on board an armed ship or aircraft in a combat zone provided the member is not personally and directly involved in the operation of weapons.” For example, as a nuclear trained officer, I could be assigned to operate the nuclear propulsion system for an aircraft carrier. I would not be the individual delivering bombs to their targets, so according to regulations I would not be responsible. But according to my conscience I would still be responsible.

**High conflict communities**

Most of the Navy’s communities are primarily warfare related. These communities present the greatest conflict with my convictions. The unrestricted line officers form the heart of the Navy. Their duties involve training for war and conducting war once it begins. This directly contradicts my nonviolent religious beliefs. These communities include:

- Surface warfare
- Submarine warfare
- Naval aviation
- Naval flight officers
- Special warfare

Notably, by the definition of a 1-A-0 noncombatant I could still be billeted within these high conflict communities.

**Medium conflict communities**

Even if I were guaranteed a billet outside the high conflict communities, all naval communities present at least a medium conflict with my beliefs. They all participate in war indirectly because their missions are to make the warfighters more effective. The Navy divides these medium conflict communities into three categories: restricted line officers, special duty officers, and staff corps.

Restricted line officers prepare the Navy for warfare. Without their support, the fighting elements of the Navy could not complete their missions. Therefore, these communities still present significant conflict with my nonviolent religious beliefs. These communities include:

- Human Resources Officers “plan, program and execute life-cycle management of our Navy’s most important resource – people.”
- Nuclear Propulsion Training officers teach students the fundamentals of nuclear propulsion. This training qualifies students to operate warships and is a critical part of their preparation for war.
- Naval Reactors Engineers ensure the safe and reliable operation of the Navy’s nuclear propulsion plants. This ensures the combat readiness of the Navy’s submarine force and aircraft carriers.
- Engineering Duty Officers design, construct, and maintain the Navy’s ships. These ships are designed around their capabilities to project power and deliver weapons systems to enemy targets.
- Aerospace Engineering Duty officers perform a similar role for the Navy’s airplanes.
- Foreign Area Officers “manage and analyze politico-military activities overseas.”

Special duty officers differ from unrestricted line officers in that they are usually only indirectly involved in warfare. These communities include:

- Intelligence officers provide “tactical, operational and strategic intelligence support to U.S. naval forces, joint services, multi-national forces, and executive level decision-makers.”
- Public Affairs officers are responsible for projecting a positive moral image of the Navy’s warfighting.
- Recruiters convince young men and women to join the warfighting elements of the Navy.
- Fleet Support officers provide engineering assistance to warfighting units.
- Meteorology/Oceanography officers “collect, analyze, and distribute data about the ocean and the atmosphere to Navy forces operating all over the world. They assist the war fighter in taking tactical advantage of the environment.”

Other special duty communities include:

- Information Professionals maintain the electronic equipment aboard naval installations.
- Information Warfare officers “deliver overwhelming information superiority that successfully supports command objectives… And ultimately, providing war-fighters, planners and policy makers with real-time warning, offensive opportunities and an ongoing operational advantage.”
- Cyber Warfare Engineers conduct electronic attacks.

Staff corps officers are special duty officers whose billets require professional training; they include doctors and JAGs. I would not be qualified for any of these billets.

I believe there are many opportunities outside the military that would allow me to serve in a manner consistent with my beliefs. Should I be given a discharge, I will pursue such service. I would gladly accept as a condition of my discharge some other type of obligated service. Many conscientious objectors in the past have served honorably in government service. They have volunteered to restore national parks, serve in psychiatric wards, and even have medical experiments conducted on themselves. The smoke jumpers—an elite group of firefighters who parachute into blazing fires—were founded by conscientious objectors.

I have received training that can be utilized nonviolently in two areas: computer science and nuclear power. This training can be used nonviolently to promote the effective defense of the United States.

In a defensive capacity, my computer science training could be used to safeguard electronic systems against attack. Criminal organizations routinely target government electronic infrastructure. Sometimes they are looking for specific information; other times they simply want to cause disruptions. I have significant experience protecting electronic assets. I would proudly serve in a role where I would harden the United States government and infrastructure against such threats.

In a defensive capacity, my nuclear training could be used to reduce the threat of nuclear weapons. The current administration has expressed an intent to reduce the nation’s nuclear arsenal. I could apply my nuclear training with the Department of Energy.

All available billets within the Navy present high conflict with my belief in Jesus. I therefore cannot apply for noncombatant (1-A-0) status. But there are many other roles within the federal government for which I am highly qualified and which present no such conflict. I would gladly serve in such a capacity, no matter how difficult or dangerous the job may be.

Here’s a picture:

The blue dot in the center of the parallelogram is **me** (or **you**!). Each of the dots on the corners represents a different archetype that we can follow.

In the bottom left is **Jonah**. Jonah was nonviolent, but he wasn’t a very good person. When God commanded Jonah to go help Nineveh, Jonah ran away. He was too concerned about his own personal comfort and safety to think about others. When we act like Jonah, the world suffers. As the saying goes, “all it takes for evil to triumph is for good men to do nothing.”

But we can get quite a bit more evil. If we perform violent actions, we travel right in the diagram. This takes us to **Judas**. Judas came with armed men to capture and kill the innocent Jesus. By using violence in this way, we can bring about quite a bit more evil than by doing nothing. That is why Judas is farther down in the diagram than Jonah. The farther down you are, the more evil you are.

But not all violence is created equal. If we travel up from Judas, we get to **David**. David was a king of Israel and is considered one of the most righteous people of the Old Testament. He used violence to protect the innocent. When a soldier kills a suicide bomber before the bomber can kill innocent civilians, the soldier is imitating David. The soldier has risked his own safety and done a *good* thing.

But it was not the *best* thing. If we travel again to the left in the diagram we come to **Jesus**. Jesus did not use violence, and he embodied all that is good in the world. When the devil gave Jesus the opportunity to use violence to stop evil people (Matthew 4:1-11), Jesus chose a better path: He sacrificed himself for those evil people. He died on the cross. While violence may be useful in protecting the innocent, it is useless when saving the guilty from themselves. This is a much harder (and in the Christian perspective much more righteous) task. That is why Jesus is higher in the diagram than David.

My goal is to be as much like Jesus as possible. Here are two examples of how we can use the parallelogram to do this:

**Example 1:** Let’s rethink the story of David and Goliath as told by 1 Samuel 17. The Philistines are invading Israel, and are camped inside the borders of Judah. Every day, the giant Goliath comes forward and challenges the Israelites to single combat. At this point, the Jonah option would be to hide in the ranks. Jonah would depend on someone else to save the Israelites. The Judas option would be to secretly meet with the invading army. Judas would help the Philistines kill the Jews in the hope of escaping a similar fate. Enter David. David was brave. He chose to fight Goliath single-handedly. He wanted to save his friends from doom, and this was a *good* thing.

But it was not the *best* thing. What would Jesus have done? I can’t know for sure, but I can speculate: I think Jesus would have helped the Philistines. He would have delivered them water and food. He would have healed their wounded and cared for the widows and orphans left behind. Jesus would have been willing to die not just for the Israelites (like David), but also for the Philistines. What greater love is there than *that*!?

**Example 2**: The parallelogram has informed my personal development as a Christian. **(1)** Like most adolescents, I had no desire to risk my own safety for others. I didn’t stand up for the weird kids when the bullies picked on them. I followed Jonah. **(2)** This changed after September 11th. Around that time, I started taking my Christian faith seriously. The fall of the World Trade Center taught me that there is evil in the world, and Christ showed me that this was not how the world was meant to be. I decided to do my best to fix the world, so I joined the Navy. David became my role model. **(3)** But David couldn’t heal my broken soul. I thought I could be the world’s savior, but only Jesus can do that. So I recommitted myself to Christ and decided to take his teaching to “turn the other cheek” seriously. (There’s a lot more to this transformation, and you can read about it here.)

In graphical form:

Notice that I don’t consider myself more righteous than David. In fact, I firmly believe that there have been violent people more righteous than I am! Nonetheless, my calling is to be like Jesus. That means striving for something better than David. My new goal is to follow the dotted line**…** to change the world by offering myself as a living sacrifice.

I fail every day. But with Christ’s grace, I find renewed strength to keep trying. That is why I call myself a Christian pacifist.

This post focuses on how to use functors and monads in practice with the HLearn library. We won’t talk about their category theoretic foundations; instead, we’ll go through **ten concrete examples** involving the categorical distribution. This distribution is somewhat awkwardly named for our purposes because it has nothing to do with category theory—it is the most general distribution over non-numeric (i.e. categorical) data. Its simplicity should make the examples a little easier to follow. Some more complicated models (e.g. the kernel density estimator and Bayesian classifier) also have functor and monad instances, but we’ll save those for another post.

Before we dive into using functors and monads, we need to set up our code and create some data. Let’s install the packages:

$ cabal install HLearn-distributions-1.1.0.1

Import our modules:

> import Control.ConstraintKinds.Functor
> import Control.ConstraintKinds.Monad
> import Prelude hiding (Functor(..), Monad (..))
>
> import HLearn.Algebra
> import HLearn.Models.Distributions

For efficiency reasons we’ll be using the Functor and Monad instances provided by the ConstraintKinds package and language extension. From the user’s perspective, everything works the same as normal monads.

Now let’s create a simple marble data type, and a small bag of marbles for our data set.

> data Marble = Red | Pink | Green | Blue | White
>     deriving (Read,Show,Eq,Ord)
>
> bagOfMarbles = [ Pink,Green,Red,Blue,Green,Red,Green,Pink,Blue,White ]

This is a very small data set just to make things easy to visualize. Everything we’ll talk about works just as well on arbitrarily large data sets.

We train a categorical distribution on this data set using the **train** function:

> marblesDist = train bagOfMarbles :: Categorical Double Marble

The **Categorical** type takes two parameters. The first is the type of our probabilities, and the second is the type of our data points. If you stick your hand into the bag and draw a random marble, this distribution tells you the probability of drawing each color.

Let’s plot our distribution:

ghci> plotDistribution (plotFile "marblesDist" $ PNG 400 300) marblesDist

Okay. Now we’re ready for the juicy bits. We’ll start by talking about the list functor. This will motivate the advantages of the categorical distribution functor.

A functor is a container that lets us “map” a function onto every element of the container. Lists are a functor, and so we can apply a function to our data set using the **map** function.

map :: (a -> b) -> [a] -> [b]

**Example 1:**

Let’s say instead of a distribution over the marbles’ colors, I want a distribution over the marbles’ weights. I might have a function that associates a weight with each type of marble:

> marbleWeight :: Marble -> Int -- weight in grams
> marbleWeight Red   = 3
> marbleWeight Pink  = 2
> marbleWeight Green = 3
> marbleWeight Blue  = 6
> marbleWeight White = 2

I can generate my new distribution by first transforming my data set, and then training on the result. Notice that the type of our distribution has changed. It is no longer a categorical distribution over marbles; it’s a distribution over ints.

> weightsDist = train $ map marbleWeight bagOfMarbles :: Categorical Double Int

ghci> plotDistribution (plotFile "weightsDist" $ PNG 400 300) weightsDist

This is the standard way of preprocessing data. But we can do better because the categorical distribution is also a functor. Functors have a function called **fmap** that is analogous to calling map on a list. This is its type signature specialized for the Categorical type:

fmap :: (Ord dp0, Ord dp1) => (dp0 -> dp1) -> Categorical prob dp0 -> Categorical prob dp1

We can use fmap to apply the marbleWeights function directly to the distribution:

> weightDist' = fmap marbleWeight marblesDist

This is guaranteed to generate the same exact answer, but it is much faster. **It takes only constant time to call Categorical’s fmap, no matter how much data we have!**

Let me put that another way. Below is a diagram showing the two possible ways to generate a model on a preprocessed data set. Every arrow represents a function application.

The normal way to preprocess data is to take the bottom left path. But because our model is a functor, the top right path becomes available. This path is better because it has the shorter run time.

Furthermore, let’s say we want to experiment with different preprocessing functions. The standard method takes O(n) time for every new function we try, whereas using the categorical functor takes only O(1) time per function.

*Note: The diagram treats the number of different categories (m) as a constant because it doesn’t depend on the number of data points. In our case, we have 5 types of marbles, so m=5. Every function call in the diagram is really multiplied by m.*
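To see why a constant-time fmap is plausible, here is a self-contained toy sketch (an illustration of the idea only, not HLearn's actual representation), where a "distribution" is just a map from category to count. Re-keying the m category counts never touches the original n data points:

```haskell
import qualified Data.Map as M

-- Toy categorical "distribution": a map from category to count.
-- This is an illustration only, not HLearn's internal representation.
type Cat k = M.Map k Double

-- Training tallies one count per data point: O(n log m).
trainCat :: Ord k => [k] -> Cat k
trainCat = M.fromListWith (+) . map (\k -> (k, 1))

-- The fmap analogue re-keys the m category counts, merging any
-- collisions. It runs in O(m log m) -- independent of how many
-- data points went into the counts.
fmapCat :: (Ord a, Ord b) => (a -> b) -> Cat a -> Cat b
fmapCat f = M.fromListWith (+) . map (\(k, w) -> (f k, w)) . M.toList
```

Mapping a marble-weight function over the toy distribution merges the colors that share a weight into a single count, just as fmap marbleWeight merges Red and Green (both 3 grams) in Example 1.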

**Example 2:**

For another example, what if we don’t want to differentiate between red and pink marbles? The following function converts all the pink marbles to red.

> pink2red :: Marble -> Marble
> pink2red Pink = Red
> pink2red dp   = dp

Let’s apply it to our distribution, and plot the results:

> nopinkDist = fmap pink2red marblesDist

ghci> plotDistribution (plotFile "nopinkDist" $ PNG 400 300) nopinkDist

That’s about all that a Functor can do by itself. When we call fmap, we can only process individual data points. We can’t change the number of points in the resulting distribution or do other complex processing. Monads give us this power.

Monads are functors with two more functions. The first is called **return**. Its type signature is

return :: (Ord dp) => dp -> Categorical prob dp

We’ve actually seen this function already in previous posts. It’s equivalent to the **train1dp** function found in the **HomTrainer** type class. All it does is train a categorical distribution on a single data point.

The next function is called **join**. It’s a little bit trickier, and it’s where all the magic lies. Its type signature is:

join :: (Ord dp) => Categorical prob (Categorical prob dp) -> Categorical prob dp

As input, join takes a categorical distribution whose data points are other categorical distributions. It then “flattens” the distribution into one that does not take other distributions as input.

**Example 3**

Let’s write a function that removes all the pink marbles from our data set. Whenever we encounter a pink marble, we’ll replace it with an empty categorical distribution; if the marble is not pink, we’ll create a singleton distribution from it.

> forgetPink :: (Num prob) => Marble -> Categorical prob Marble
> forgetPink Pink = mempty
> forgetPink dp   = train1dp dp
>
> nopinkDist2 = join $ fmap forgetPink marblesDist

ghci> plotDistribution (plotFile "nopinkDist2" $ PNG 400 300) nopinkDist2

This idiom of **join ( fmap … )** is used a lot. For convenience, the **>>=** operator (called **bind**) combines these steps for us. It is defined as:

(>>=) :: Categorical prob dp0 -> (dp0 -> Categorical prob dp1) -> Categorical prob dp1
dist >>= f = join $ fmap f dist

Under this notation, our new distribution can be defined as:

> nopinkDist2' = marblesDist >>= forgetPink
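The semantics of return, join, and bind can be sketched with a toy count-map standing in for Categorical (an illustration of the behavior only, not HLearn's code). Join scales each inner distribution by its outer weight and sums pointwise, which is exactly why forgetPink's mempty makes pink vanish:

```haskell
import qualified Data.Map as M

-- Toy categorical "distribution": category -> weight (illustration only).
type Cat k = M.Map k Double

-- return / train1dp analogue: a singleton distribution.
train1 :: Ord k => k -> Cat k
train1 k = M.singleton k 1

-- join analogue: scale every inner distribution by its outer weight
-- and sum the results pointwise.
joinCat :: Ord k => [(Double, Cat k)] -> Cat k
joinCat inners = M.unionsWith (+) [ M.map (* w) d | (w, d) <- inners ]

-- bind (>>=) analogue: join composed with fmap.
bindCat :: (Ord a, Ord b) => Cat a -> (a -> Cat b) -> Cat b
bindCat dist f = joinCat [ (w, f k) | (k, w) <- M.toList dist ]

-- forgetPink analogue: an empty inner distribution removes the point.
forgetToy :: String -> Cat String
forgetToy "Pink" = M.empty
forgetToy dp     = train1 dp
```

Binding forgetToy over a toy bag drops the pink counts and leaves every other category's weight untouched.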

**Example 4**

Besides removing data points, we can also add new ones. Let’s double the number of pink marbles in our training data:

> doublePink :: (Num prob) => Marble -> Categorical prob Marble
> doublePink Pink = 2 .* train1dp Pink
> doublePink dp   = train1dp dp
>
> doublepinkDist = marblesDist >>= doublePink

ghci> plotDistribution (plotFile "doublepinkDist" $ PNG 400 300) doublepinkDist

**Example 5**

Mistakes are often made when collecting data. One common machine learning task is to preprocess data sets to account for these mistakes. In this example, we’ll assume that our sampling process suffers from uniform noise. Specifically, if one of our data points is red, we will assume there is only a 60% chance that the marble was actually red, and a 10% chance each that it was one of the other colors. We will define a function to add this noise to our data set, increasing the accuracy of our final distribution.

Notice that we are using fractional weights for our noise, and that the weights are carefully adjusted so that each marble still contributes a total weight of one to the distribution. We don’t want to add or remove marbles while adding noise.

> addNoise :: (Fractional prob) => Marble -> Categorical prob Marble
> addNoise dp = 0.5 .* train1dp dp <> 0.1 .* train [ Red,Pink,Green,Blue,White ]
>
> noiseDist = marblesDist >>= addNoise

ghci> plotDistribution (plotFile "noiseDist" $ PNG 400 300) noiseDist

Adding uniform noise just made all our probabilities closer together.

**Example 6**

Of course, the amount of noise we add to each sample doesn’t have to be the same everywhere. If I suffer from red-green color blindness, then I might use this as my noise function:

> rgNoise :: (Fractional prob) => Marble -> Categorical prob Marble
> rgNoise Red   = trainW [(0.7,Red),(0.3,Green)]
> rgNoise Green = trainW [(0.1,Red),(0.9,Green)]
> rgNoise dp    = train1dp dp
>
> rgNoiseDist = marblesDist >>= rgNoise

ghci> plotDistribution (plotFile "rgNoiseDist" $ PNG 400 300) rgNoiseDist

Because of my color blindness, the probability of drawing a red marble from the bag is higher than drawing a green marble. This is despite the fact that we observed more green marbles in our training data.

**Example 7**

In the real world, we can never know exactly how much error we have in the samples. Luckily, we can try to learn it by conducting a second experiment. We’ll first experimentally determine how red-green color blind I am, then we’ll use that to update our already trained distribution.

To determine the true error rate, we need some unbiased source of truth. In this case, we can just use someone with good vision. They will select ten red marbles and ten green marbles, and I will guess what color they are.

Let’s train a distribution on what I think green marbles look like:

> greenMarbles = [Green,Red,Green,Red,Green,Red,Red,Green,Green,Green]
> greenDist = train greenMarbles :: Categorical Double Marble

and what I think red marbles look like:

> redMarbles = [Red,Green,Red,Green,Red,Red,Green,Green,Red,Red]
> redDist = train redMarbles :: Categorical Double Marble

Now we’ll create the noise function based on our empirical data. The **(/.)** function is scalar division, and we can use it because the categorical distribution is a vector space. We’re dividing by the number of data points in the distribution so that the distribution we output has an effective training size of one. This ensures that we’re not accidentally creating new data points when applying our function to another distribution.

> rgNoise2 :: Marble -> Categorical Double Marble
> rgNoise2 Green = greenDist /. numdp greenDist
> rgNoise2 Red   = redDist /. numdp redDist
> rgNoise2 dp    = train1dp dp
>
> rgNoiseDist2 = marblesDist >>= rgNoise2

ghci> plotDistribution (plotFile "rgNoiseDist2" $ PNG 400 300) rgNoiseDist2

**Example 8**

We can chain our preprocessing functions together in arbitrary ways.

> allDist = marblesDist >>= forgetPink >>= addNoise >>= rgNoise

ghci> plotDistribution (plotFile "allDist" $ PNG 400 300) allDist

But wait! Where’d that pink come from? Wasn’t the call to forgetPink supposed to remove it? The answer is that we did remove it, but then we added it back in with our noise functions. When using monadic functions, we must be careful about the order we apply them in. This is just as true when using regular functions.

Here’s another distribution created from those same functions in a different order:

> allDist2 = marblesDist >>= addNoise >>= rgNoise >>= forgetPink

ghci> plotDistribution (plotFile "allDist2" $ PNG 400 300) allDist2

We can also use Haskell’s do notation to accomplish the same exact thing:

> allDist2' :: Categorical Double Marble
> allDist2' = do
>    dp <- train bagOfMarbles
>    dp <- addNoise dp
>    dp <- rgNoise dp
>    dp <- forgetPink dp
>    return dp

(Since we’re using a custom Monad definition, do notation requires the RebindableSyntax extension.)

**Example 9**

Do notation gives us a convenient way to preprocess multiple data sets into a single data set. Let’s create two new data sets and their corresponding distributions for us to work with:

> bag1 = [Red,Pink,Green,Blue,White]
> bag2 = [Red,Blue,White]
>
> bag1dist = train bag1 :: Categorical Double Marble
> bag2dist = train bag2 :: Categorical Double Marble

Now, we’ll create a third data set that is a weighted combination of bag1 and bag2. We will do this by repeated sampling. On every iteration, with a 20% probability we’ll sample from bag1, and with an 80% probability we’ll sample from bag2. Imperative pseudo-code for this algorithm is:

let comboDist be an empty distribution
loop until desired accuracy achieved:
    let r be a random number from 0 to 1
    if r < 0.2:
        sample dp1 from bag1
        add dp1 to comboDist
    else:
        sample dp2 from bag2
        add dp2 to comboDist

This sampling procedure will obviously not give us an exact answer. But since the categorical distribution supports weighted data points, we can use this simpler pseudo-code to generate an exact answer:

let comboDist be an empty distribution
foreach datapoint dp1 in bag1:
    foreach datapoint dp2 in bag2:
        add dp1 with weight 0.2 to comboDist
        add dp2 with weight 0.8 to comboDist

Using do notation, we can express this as:

> comboDist :: Categorical Double Marble
> comboDist = do
>    dp1 <- bag1dist
>    dp2 <- bag2dist
>    trainW [(0.2,dp1),(0.8,dp2)]

ghci> plotDistribution (plotFile "comboDist" $ PNG 400 300) comboDist

And because Categorical’s monadic operations take constant time, constructing comboDist also takes constant time. The naive imperative sampling algorithm, by contrast, would take time proportional to the number of samples drawn.

When combining multiple distributions this way, the number of data points in our final distribution will be the product of the number of data points in the initial distributions:

ghci> numdp comboDist
15
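This product rule can be checked with a toy count-map standing in for Categorical (an illustration only, not HLearn's code): binding over two distributions pairs every point of one with every point of the other, and since trainW [(0.2,dp1),(0.8,dp2)] gives each pair a total weight of 0.2 + 0.8 = 1, the combined distribution holds n1 × n2 effective data points:

```haskell
import qualified Data.Map as M

-- Toy categorical "distribution": category -> weight (illustration only).
type Cat k = M.Map k Double

-- bind (>>=) analogue: weight each inner distribution by its outer
-- point's weight and sum pointwise.
bindCat :: (Ord a, Ord b) => Cat a -> (a -> Cat b) -> Cat b
bindCat dist f =
    M.unionsWith (+) [ M.map (* w) (f k) | (k, w) <- M.toList dist ]

-- trainW analogue: build a distribution from weighted points.
trainWCat :: Ord k => [(Double, k)] -> Cat k
trainWCat = M.fromListWith (+) . map (\(w, k) -> (k, w))

-- 20/80 weighted combination of two bags, mirroring comboDist.
comboToy :: Cat String
comboToy =
    bag1d `bindCat` \dp1 ->
    bag2d `bindCat` \dp2 ->
    trainWCat [(0.2, dp1), (0.8, dp2)]
  where
    bag1d = trainWCat [ (1, m) | m <- ["Red","Pink","Green","Blue","White"] ]
    bag2d = trainWCat [ (1, m) | m <- ["Red","Blue","White"] ]
```

With 5 points in the first bag and 3 in the second, the toy combination carries a total weight of 5 × 3 = 15, matching the numdp result above.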

**Example 10**

Finally, arbitrarily complex preprocessing functions can be written using Haskell’s do notation. And remember, no matter how complicated these functions are, their run time never depends on the number of elements in the initial data set.

This function adds uniform sampling noise to our bagOfMarbles, but only on those marbles that are also contained in bag2 above.

> comboDist2 :: Categorical Double Marble
> comboDist2 = do
>    dp1 <- marblesDist
>    dp2 <- bag2dist
>    if dp1==dp2
>        then addNoise dp1
>        else return dp1

ghci> plotDistribution (plotFile "comboDist2" $ PNG 400 300) comboDist2

This application of monads to machine learning generalizes the monad used in probabilistic functional programming. The main difference is that PFP focused on manipulating already known distributions, not training them from data. Also, if you enjoy this kind of thing, you might be interested in the n-Category Café discussion on category theory in machine learning from a few years back.

In future posts, we’ll look at functors and monads for continuous distributions, multivariate distributions, and classifiers.

Subscribe to the RSS feed to stay tuned!

Here’s a picture of our full chiller assembly. The internal chiller is on the right, and the external chiller is on the left.

The external chiller just sits around the outside of the boil pot. The pot’s handles keep the coil in place:

When we start the cooldown, water flows through the internal chiller, then through the external chiller. The external chiller has a number of holes cut into it. Water sprays out of these holes and onto the outside of the pot:

This dramatically increases the surface area of water cooling. Heat is being transferred not just at the internal coils, but also along the whole pot. Here’s a zoomed-out picture of the whole thing in action:

We only had to buy a 5-foot section of copper coil to wrap around the pot, and this cost about $5 at Lowe’s. I used a Dremel to cut slots in the copper tubing about every 4 inches.

This external chilling reduced our cooling times from just over 30 minutes to just under 20 minutes. It was definitely worth the investment.

Haskell code is expressive. The HLearn library uses 6 lines of Haskell to define a function for training a Bayesian classifier; the equivalent code in the Weka library uses over 100 lines of Java. That’s a big difference! In this post, we’ll look at the actual code and see why the Haskell is so much more concise.

**But first, a disclaimer:** It is really hard to fairly compare two code bases this way. In both libraries, there is a lot of supporting code that goes into defining each classifier, and it’s not obvious what code to include and not include. For example, both libraries implement interfaces to a number of probability distributions, and this code is not contained in the source count. The Haskell code takes more advantage of this abstraction, so this is one language-agnostic reason why the Haskell code is shorter. If you think I’m not doing a fair comparison, here are some links to the full repositories so you can do it yourself:

- HLearn’s bayesian classifier source code (74 lines of code)
- Weka’s naive bayes source code (946 lines of code)

HLearn implements training for a Bayesian classifier with these six lines of Haskell:

newtype Bayes labelIndex dist = Bayes dist
    deriving (Read,Show,Eq,Ord,Monoid,Abelian,Group)

instance (Monoid dist, HomTrainer dist) => HomTrainer (Bayes labelIndex dist) where
    type Datapoint (Bayes labelIndex dist) = Datapoint dist
    train1dp dp = Bayes $ train1dp dp

This code elegantly captures how to train a Bayesian classifier—just train a probability distribution. Here’s an explanation:

- The first two lines define the Bayes data type as a wrapper around a distribution.
- The fourth line says that we’re implementing the Bayesian classifier using the HomTrainer type class. We do this because **the Haskell compiler automatically generates a parallel batch training function, an online training function, and a fast cross-validation function for all HomTrainer instances.**
- The fifth line says that our data points have the same type as the underlying distribution.
- The sixth line says that in order to train, just train the corresponding distribution.

We only get the benefits of the HomTrainer type class because the Bayesian classifier is a monoid. But we didn’t even have to specify what the monoid instance for Bayesian classifiers looks like! In this case, it’s automatically derived from the monoid instances for the base distributions using a language extension called GeneralizedNewtypeDeriving. For examples of these monoid structures, check out the algebraic structure of the normal and categorical distributions, or more complex distributions using Markov networks.
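The idea behind that derived instance can be sketched in a few self-contained lines (with toy types, not HLearn's): give the underlying counts an additive Monoid, then let GeneralizedNewtypeDeriving lift that instance through the wrapper, just as the Bayes newtype lifts its distribution's instance:

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import qualified Data.Map as M

-- Toy stand-in for a trained distribution: counts that merge by addition.
newtype Counts = Counts (M.Map String Double)
    deriving (Eq, Show)

instance Semigroup Counts where
    Counts a <> Counts b = Counts (M.unionWith (+) a b)

instance Monoid Counts where
    mempty = Counts M.empty

-- The wrapper inherits the wrapped type's instances for free, which is
-- the mechanism behind HLearn's `deriving (..., Monoid, ...)` on Bayes.
newtype BayesToy = BayesToy Counts
    deriving (Semigroup, Monoid)
```

Combining two toy classifiers with <> merges their counts pointwise, which is exactly the monoid structure the HomTrainer machinery exploits for parallel batch training.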

Look for these differences between the HLearn and Weka source:

- In Weka we must separately define the online and batch trainers, whereas Haskell derived these for us automatically.
- Weka must perform a variety of error handling that Haskell’s type system takes care of in HLearn.
- The Weka code is tightly coupled to the underlying probability distribution, whereas the Haskell code was generic enough to handle any distribution. This means that while Weka must make the “naive bayes assumption” that all attributes are independent of each other, HLearn can support any dependence structure.
- Weka’s code is made more verbose by for loops and if statements that aren’t necessary for HLearn.
- The Java code requires extensive comments to maintain readability, but the Haskell code is simple enough to be self-documenting (at least once you know how to read Haskell).
- Weka does not have parallel training, fast cross-validation, data point subtraction, or weighted data points, but HLearn does.

    /**
     * Generates the classifier.
     *
     * @param instances set of instances serving as training data
     * @exception Exception if the classifier has not been generated
     * successfully
     */
    public void buildClassifier(Instances instances) throws Exception {

      // can classifier handle the data?
      getCapabilities().testWithFail(instances);

      // remove instances with missing class
      instances = new Instances(instances);
      instances.deleteWithMissingClass();

      m_NumClasses = instances.numClasses();

      // Copy the instances
      m_Instances = new Instances(instances);

      // Discretize instances if required
      if (m_UseDiscretization) {
        m_Disc = new weka.filters.supervised.attribute.Discretize();
        m_Disc.setInputFormat(m_Instances);
        m_Instances = weka.filters.Filter.useFilter(m_Instances, m_Disc);
      } else {
        m_Disc = null;
      }

      // Reserve space for the distributions
      m_Distributions = new Estimator[m_Instances.numAttributes() - 1]
                                     [m_Instances.numClasses()];
      m_ClassDistribution = new DiscreteEstimator(m_Instances.numClasses(), true);

      int attIndex = 0;
      Enumeration enu = m_Instances.enumerateAttributes();
      while (enu.hasMoreElements()) {
        Attribute attribute = (Attribute) enu.nextElement();

        // If the attribute is numeric, determine the estimator
        // numeric precision from differences between adjacent values
        double numPrecision = DEFAULT_NUM_PRECISION;
        if (attribute.type() == Attribute.NUMERIC) {
          m_Instances.sort(attribute);
          if ((m_Instances.numInstances() > 0)
              && !m_Instances.instance(0).isMissing(attribute)) {
            double lastVal = m_Instances.instance(0).value(attribute);
            double currentVal, deltaSum = 0;
            int distinct = 0;
            for (int i = 1; i < m_Instances.numInstances(); i++) {
              Instance currentInst = m_Instances.instance(i);
              if (currentInst.isMissing(attribute)) {
                break;
              }
              currentVal = currentInst.value(attribute);
              if (currentVal != lastVal) {
                deltaSum += currentVal - lastVal;
                lastVal = currentVal;
                distinct++;
              }
            }
            if (distinct > 0) {
              numPrecision = deltaSum / distinct;
            }
          }
        }

        for (int j = 0; j < m_Instances.numClasses(); j++) {
          switch (attribute.type()) {
          case Attribute.NUMERIC:
            if (m_UseKernelEstimator) {
              m_Distributions[attIndex][j] = new KernelEstimator(numPrecision);
            } else {
              m_Distributions[attIndex][j] = new NormalEstimator(numPrecision);
            }
            break;
          case Attribute.NOMINAL:
            m_Distributions[attIndex][j] =
              new DiscreteEstimator(attribute.numValues(), true);
            break;
          default:
            throw new Exception("Attribute type unknown to NaiveBayes");
          }
        }
        attIndex++;
      }

      // Compute counts
      Enumeration enumInsts = m_Instances.enumerateInstances();
      while (enumInsts.hasMoreElements()) {
        Instance instance = (Instance) enumInsts.nextElement();
        updateClassifier(instance);
      }

      // Save space
      m_Instances = new Instances(m_Instances, 0);
    }

And the code for online learning is:

    /**
     * Updates the classifier with the given instance.
     *
     * @param instance the new training instance to include in the model
     * @exception Exception if the instance could not be incorporated in
     * the model.
     */
    public void updateClassifier(Instance instance) throws Exception {
      if (!instance.classIsMissing()) {
        Enumeration enumAtts = m_Instances.enumerateAttributes();
        int attIndex = 0;
        while (enumAtts.hasMoreElements()) {
          Attribute attribute = (Attribute) enumAtts.nextElement();
          if (!instance.isMissing(attribute)) {
            m_Distributions[attIndex][(int) instance.classValue()]
              .addValue(instance.value(attribute), instance.weight());
          }
          attIndex++;
        }
        m_ClassDistribution.addValue(instance.classValue(), instance.weight());
      }
    }

Every algorithm implemented in HLearn uses similarly concise code. I invite you to browse the repository and see for yourself. The most complicated is the Markov chain, which uses only 6 lines for training and about 20 for defining the Monoid.

You can expect lots of tutorials on how to incorporate the HLearn library into Haskell programs over the next few months.

Subscribe to the RSS feed to stay tuned!

**Code and instructions for reproducing these experiments are available on github.**

Why is HLearn so much faster?

Well, it turns out that the Bayesian classifier has the algebraic structure of a monoid, a group, and a vector space. HLearn uses a new cross-validation algorithm that exploits these algebraic structures. The standard algorithm runs in time O(k·n), where k is the number of “folds” and n is the number of data points. The algebraic algorithm, however, runs in time O(n). In other words, it doesn’t matter how many folds we do; the run time stays the same! And not only are we faster, but we get the *exact same answer*. Algebraic cross-validation is not an approximation, it’s just fast.
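The group structure is what makes the fast leave-one-out case possible. Here is a self-contained sketch (a toy count model with hypothetical names, not HLearn’s actual code) of “un-training” a single data point instead of retraining from scratch for each fold:

```haskell
import qualified Data.Map as Map

-- A toy count-based model with a *group* structure: we can remove
-- data points as well as add them.
newtype Counts a = Counts (Map.Map a Int) deriving (Eq, Show)

train :: Ord a => [a] -> Counts a
train xs = Counts (Map.fromListWith (+) [ (x, 1) | x <- xs ])

-- The group inverse in action: un-train a single data point.
subtract1 :: Ord a => Counts a -> a -> Counts a
subtract1 (Counts m) x = Counts (Map.update dec x m)
  where dec n = if n <= 1 then Nothing else Just (n - 1)

-- All n leave-one-out models from ONE full training pass, instead of
-- retraining from scratch n times.
looModels :: Ord a => [a] -> [Counts a]
looModels xs = map (subtract1 full) xs
  where full = train xs

main :: IO ()
main = do
  let xs = "aabc"
      deleteAt i ys = take i ys ++ drop (i + 1) ys
  -- Each subtracted model equals the model retrained without that point.
  print (looModels xs == [ train (deleteAt i xs) | i <- [0 .. length xs - 1] ])
```

Because subtraction is exact, every fold’s model is identical to what full retraining would produce.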

Here are some run times for k-fold cross-validation on the census income data set. Notice that HLearn’s run time is constant as we add more folds.

And when we set k=n, we have leave-one-out cross-validation. Notice that Weka’s cross-validation has quadratic run time, whereas HLearn has linear run time.

HLearn certainly isn’t going to replace Weka any time soon, but it’s got a number of cool tricks like this going on inside. If you want to read more, you should check out these two recent papers:

I’ll continue to write more about these tricks in future blog posts.

Subscribe to the RSS feed to stay tuned.

This fully automatic AK-47 was used by the Romanian army during the Cold War. It could shoot 600 rounds per minute with an effective range of 400 yards. It was an instrument of death; but now, it is an instrument of life. It has been redeemed.

You see, 2500 years ago the prophet Isaiah wrote:

Nations will beat their swords into plowshares and their spears into pruning hooks.

Nation will not take up sword against nation, nor will they train for war anymore.

I want Isaiah’s vision to become reality.

This is the rifle fully disassembled.

This is a closeup of the barrel assembly that I actually made into a spoon. I’m still looking for ideas on what to make out of everything else.

Here’s a closeup of the end where the bullet enters the barrel. On top are the “rear sights.” These are adjustable to shoot at targets anywhere from 0-800 meters away.

Notice that there are actually many pieces of metal here: the barrel itself and two large blocks of steel attached to it. These blocks are held in place by “pinions,” the circle-shaped pieces. If I had a hydraulic press, I could push out the pinions and then remove the big metal blocks. But I don’t, so I’ll just beat them off with my forging hammer!

Looking down the barrel. This is where we would insert the bullet when firing.

Now for the business end. Here I have the flash suppressor removed. The forward sight (the tall thing jutting out to the right) will make a great handle for the future ladle.

In order to turn this chunk of metal into a spoon, I needed to build a forge. I bought a basic anvil and 5 lbs hammer on the internet for about $80, and the stump came from craigslist for free:

Next, I needed a way to heat the metal. I decided to build a propane forge, because I already had a good burner. The burner came out of a portable stove that we’ve used to serve free chili with a group called Food not Bombs. The base of the stove is below the bricks, but I unscrewed the burner and placed it on the bricks.

Then, I simply stacked the bricks on top of each other to create a nice chamber for the flames. Here’s a picture of a test run with a piece of rebar.

The flame is actually gigantic, shooting 1-2 feet past the opening of the forge. The middle of the rebar is a bright glowing orange-yellow, over 2000 degrees Fahrenheit. My camera just can’t do it justice.

The bricks are also really cheap. I paid 25 cents each at Lowes. Overall, the whole setup cost less than $100.

Let the hammering begin! You have to work quickly to hit the metal while it’s still hot from the forge!

Oooohhh…. glowing…… purty…..

After just a few hammer blows, look how much the blocks have shrunk. Also, I did NOT have a flash for this. The metal is so hot it’s putting off enough light to light up the wall I’m holding it next to!

I had to stop hammering after about an hour because of blisters. This is what I was starting with on the next day. Notice that you can still make out the serial number!

And here’s a view from the bottom up.

After a few more hours hammering, everything’s much flatter. The serial number has long since been flattened away.

Here’s the same view from the bottom. It looks like a ghost!

The layers of metal from the attached blocks are quite distinct and starting to get in the way.

After a lot of wrestling with some pliers, I’ve finally managed to remove the blocks of metal. All that remains is this “shrapnel.”

And the gun barrel itself is now the world’s coolest spatula.

And from the bottom:

All that’s left is to turn this spatula into a spoon. The metal is still pretty thick, so I flattened it out as much as I could.

Then I made a spoon shape. I did this by just holding the spatula at a slight angle while hammering. Every blow bent the metal just a little bit until the full bowl shape was complete.

I made my spoon about 2 inches too long. Whoops! No worries, I used a Dremel to cut the extra bits off. I also used it to smooth around some of the edges.

Notice that there’s a little hole in the middle of the spoon. I accidentally hammered the steel too thin and went all the way through. Meh. It’s still good. It’s just a straining spoon now!

There’s also a little bit of burnt steel along the sides. Seriously?! Steel can burn? Thankfully, a soak in vinegar and scrubbing brought it out. Thanks to the folks at iforgeiron.com for giving me the tip!

Here’s the final product:

That’s it!

I wish I had some pictures of me eating from the spoon, but unfortunately it’s not food safe. Real gunpowder has been detonated countless times inside this barrel. I tried cleaning it as best as I could, but I’m pretty sure there’s plenty of little cancer molecules still hanging out in there.

One gun down, only 874,999,999 to go. Depressing.

As usual, this post is a literate haskell file. To run this code, you’ll need to install the hlearn-distributions package, which requires GHC 7.6 or later.

bash> cabal install hlearn-distributions-1.1

Now for some code. We start with our language extensions and imports:

>{-# LANGUAGE DataKinds #-}
>{-# LANGUAGE TypeFamilies #-}
>{-# LANGUAGE TemplateHaskell #-}
>
>import HLearn.Algebra
>import HLearn.Models.Distributions

Next, we’ll create a data type to represent Futurama characters. There are a lot of characters, so we’ll need to keep things pretty organized. The data type will have a record for everything we might want to know about a character. Each of these records will be one of the variables in our multivariate distribution, and all of our data points will have this type.

>data Character = Character
>    { _name    :: String
>    , _species :: String
>    , _job     :: Job
>    , _isGood  :: Maybe Bool
>    , _age     :: Double -- in years
>    , _height  :: Double -- in feet
>    , _weight  :: Double -- in pounds
>    }
>    deriving (Read,Show,Eq,Ord)
>
>data Job = Manager | Crew | Henchman | Other
>    deriving (Read,Show,Eq,Ord)

Now, in order for our library to be able to interpret the Character type, we call the template haskell function:

>makeTypeLenses ''Character

This function creates a bunch of data types and type classes for us. These “type lenses” give us a type-safe way to reference the different variables in our multivariate distribution. We’ll see how to use these type level lenses a bit later. There’s no need to understand what’s going on under the hood, but if you’re curious then check out the hackage documentation or source code.

Now, we’re ready to create a data set and start training. Here’s a list of the employees of Planet Express provided by the resident bureaucrat Hermes Conrad. This list will be our first data set.

>planetExpress =
>    [ Character "Philip J. Fry" "human" Crew (Just True) 1026 5.8 195
>    , Character "Turanga Leela" "alien" Crew (Just True) 43 5.9 170
>    , Character "Professor Farnsworth" "human" Manager (Just True) 85 5.5 160
>    , Character "Hermes Conrad" "human" Manager (Just True) 36 5.3 210
>    , Character "Amy Wong" "human" Other (Just True) 21 5.4 140
>    , Character "Zoidberg" "alien" Other (Just True) 212 5.8 225
>    , Character "Cubert Farnsworth" "human" Other (Just True) 8 4.3 135
>    ]

Let’s train a distribution from this data. Here’s how we would train a distribution where every variable is independent of every other variable:

>dist1 = train planetExpress :: Multivariate Character
>    '[ Independent Categorical '[String,String,Job,Maybe Bool]
>     , Independent Normal '[Double,Double,Double]
>     ]
>     Double

In the HLearn library, we always use the function **train** to train a model from data points. We specify which model to train in the type signature.

As you can see, the Multivariate distribution takes three type parameters. The first parameter is the type of our data point, in this case Character. The second parameter describes the dependency structure of our distribution. We’ll go over the syntax for the dependency structure in a bit. For now, just notice that it’s a type-level list of distributions. Finally, the third parameter is the type we will use to store our probabilities.

What can we do with this distribution? One simple task we can do is to find marginal distributions. The marginal distribution is the distribution of a certain variable ignoring all the other variables. For example, let’s say I want a distribution of the species that work at planet express. I can get this by:

>dist1a = getMargin TH_species dist1

Notice that we specified which variable we’re taking the marginal of by using the type level lens TH_species. This data constructor was automatically created for us by our template haskell function makeTypeLenses. Every one of the records in our data type has its own unique type lens; its name is the name of the record prefixed by TH_. These lenses let us infer the types of our marginal distributions at compile time, rather than at run time. For example, the type of the marginal distribution of species is:

ghci> :t dist1a
dist1a :: Categorical String Double

That is, a categorical distribution whose data points are Strings and which stores its probabilities as Doubles. Now, if I wanted a distribution of the weights of the employees, I can get that by:

>dist1b = getMargin TH_weight dist1

And the type of this distribution is:

ghci> :t dist1b
dist1b :: Normal Double

Now, I can easily plot these marginal distributions with the **plotDistribution** function:

ghci> plotDistribution (plotFile "dist1a" $ PNG 250 250) dist1a
ghci> plotDistribution (plotFile "dist1b" $ PNG 250 250) dist1b

But wait! I accidentally forgot to include Bender in the planetExpress data set! What can I do?

In a traditional statistics library, we would have to retrain our model from scratch. If we had billions of elements in our data set, this would be an expensive mistake. But in the HLearn library, we can take advantage of the model’s monoid structure. In particular, the compiler used this structure to automatically derive a function called **add1dp** for us. Let’s look at its type:

ghci> :t add1dp
add1dp :: HomTrainer model => model -> Datapoint model -> model

It’s pretty simple. The function takes a model and a new data point, and returns the model we would have gotten if that data point had been in our original data set. This is called online training.

Again, because our distributions form monoids, the compiler derived an efficient and exact online training algorithm for us automatically.
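Under the hood, online training is just the monoid operation applied against a model trained on a singleton data set. Here is a base-only sketch with a hypothetical toy model (not HLearn’s derived code) showing the idea:

```haskell
import qualified Data.Map as Map

-- A toy frequency model; HLearn's real add1dp is derived generically
-- from the HomTrainer class, but the idea is the same.
type Model k = Map.Map k Int

train :: Ord k => [k] -> Model k
train = Map.fromListWith (+) . map (\x -> (x, 1))

-- Online training = merge the old model with a model trained on the
-- singleton data set containing just the new point.
add1dp :: Ord k => Model k -> k -> Model k
add1dp model dp = Map.unionWith (+) model (train [dp])

main :: IO ()
main = print (add1dp (train "abb") 'a' == train "abba")  -- True
```

The result is exactly the model we would have trained had the point been present from the start.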

So let’s create a new distribution that includes Bender:

>bender = Character "Bender Rodriguez" "robot" Crew (Just True) 44 6.1 612
>
>dist1' = add1dp dist1 bender

And plot our new marginals:

ghci> plotDistribution (plotFile "dist1-withbender-species" $ PNG 250 250) $ getMargin TH_species dist1'
ghci> plotDistribution (plotFile "dist1-withbender-weight" $ PNG 250 250) $ getMargin TH_weight dist1'

Notice that our categorical marginal has clearly changed, but our normal marginal doesn’t seem to have changed at all. This is because the plotting routines automatically scale the distribution, and the normal distribution, when scaled, always looks the same. We can double check that we actually did change the weight distribution by comparing the means:

ghci> mean dist1b
176.42857142857142
ghci> mean $ getMargin TH_weight dist1'
230.875

Bender’s weight really changed the distribution after all!

That’s cool, but our original distribution isn’t very interesting. What makes multivariate distributions interesting is when the variables affect each other. This is true in our case, so we’d like to be able to model it. For example, we’ve already seen that robots are much heavier than organic lifeforms, and are throwing off our statistics. The HLearn library supports a small subset of Markov Networks for expressing these dependencies.

We represent Markov Networks as graphs with undirected edges. Every attribute in our distribution is a node, and every dependence between attributes is an edge. We can draw this graph with the **plotNetwork** command:

ghci> plotNetwork "dist1-network" dist1

As expected, there are no edges in our graph because everything is independent. Let’s create a more interesting distribution and plot its Markov network.

>dist2 = train planetExpress :: Multivariate Character
>    '[ Ignore '[String]
>     , MultiCategorical '[String]
>     , Independent Categorical '[Job,Maybe Bool]
>     , Independent Normal '[Double,Double,Double]
>     ]
>     Double

ghci> plotNetwork "dist2-network" dist2

Okay, so what just happened?

The syntax for representing the dependence structure is a little confusing, so let’s go step by step. We represent the dependence information in the graph as a list of types. Each element in the list describes both the marginal distribution and the dependence structure for one or more records in our data type. We must list these elements in the same order as the original data type.

Notice that we’ve made two changes to the list. First, our list now starts with the type Ignore '[String]. This means that the first string in our data type (the name) will be ignored. Notice that TH_name is no longer in the Markov network. This makes sense because we expect that a character’s name should not tell us much about any of their other attributes.

Second, we’ve added a dependence. The MultiCategorical distribution makes everything afterward in the list dependent on that item, but not the things before it. This means that the exact types of dependencies it can specify are dependent on the order of the records in our data type. Let’s see what happens if we change the location of the MultiCategorical:

>dist3 = train planetExpress :: Multivariate Character
>    '[ Ignore '[String]
>     , Independent Categorical '[String]
>     , MultiCategorical '[Job]
>     , Independent Categorical '[Maybe Bool]
>     , Independent Normal '[Double,Double,Double]
>     ]
>     Double

ghci> plotNetwork "dist3-network" dist3

As you can see, our species no longer have any relation to anything else. Unfortunately, using this syntax, the order of list elements is important, and so the order we specify our data records is important.

Finally, we can substitute any valid univariate distribution for our Normal and Categorical distributions. The HLearn library currently supports Binomial, Exponential, Geometric, LogNormal, and Poisson distributions. These just don’t make much sense for modelling Futurama characters, so we’re not using them.

Now, we might be tempted to specify that every variable is fully dependent on every other variable. In order to do this, we have to introduce the “Dependent” type. Any valid multivariate distribution can follow Dependent, but only those records specified in the type-list will actually be dependent on each other. For example:

>dist4 = train planetExpress :: Multivariate Character
>    '[ Ignore '[String]
>     , MultiCategorical '[String,Job,Maybe Bool]
>     , Dependent MultiNormal '[Double,Double,Double]
>     ]
>     Double

ghci> plotNetwork "dist4-network" dist4

Undoubtedly, this is always going to be the case in reality: everything has at least a slight influence on everything else. Unfortunately, fully dependent distributions are not easy to train in practice. The amount of data we need grows quickly with both n, the number of nodes in our graph, and e, the number of edges in our network. Thus, by specifying that two attributes are independent of each other, we can greatly reduce the amount of data we need to train an accurate distribution.

I realize that this syntax is a little awkward. I chose it because it was relatively easy to implement. Future versions of the library should support a more intuitive syntax. I also plan to use copulas to greatly expand the expressiveness of these distributions. In the meantime, the best way to figure out the dependencies in a Markov network is just to plot it and look.

Okay. So what distribution makes the most sense for Futurama characters? We’ll say that everything depends on both the characters’ species and job, and that their weight depends on their height.

>planetExpressDist = train planetExpress :: Multivariate Character
>    '[ Ignore '[String]
>     , MultiCategorical '[String,Job]
>     , Independent Categorical '[Maybe Bool]
>     , Independent Normal '[Double]
>     , Dependent MultiNormal '[Double,Double]
>     ]
>     Double

ghci> plotNetwork "planetExpress-network" planetExpressDist

We still don’t have enough data to train this network, so let’s create some more. We start by creating a type synonym for our Markov network called FuturamaDist. This is just for convenience, so we don’t have to retype the dependence structure many times.

>type FuturamaDist = Multivariate Character
>    '[ Ignore '[String]
>     , MultiCategorical '[String,Job]
>     , Independent Categorical '[Maybe Bool]
>     , Independent Normal '[Double]
>     , Dependent MultiNormal '[Double,Double]
>     ]
>     Double

Next, we train some more distributions of this type on some of the other characters. We’ll start with Mom Corporation and the brave Space Forces.

>momCorporation =
>    [ Character "Mom" "human" Manager (Just False) 100 5.5 130
>    , Character "Walt" "human" Henchman (Just False) 22 6.1 170
>    , Character "Larry" "human" Henchman (Just False) 18 5.9 180
>    , Character "Igner" "human" Henchman (Just False) 15 5.8 175
>    ]
>
>momDist = train momCorporation :: FuturamaDist

>spaceForce =
>    [ Character "Zapp Brannigan" "human" Manager (Nothing) 45 6.0 230
>    , Character "Kif Kroker" "alien" Crew (Just True) 113 4.5 120
>    ]
>
>spaceDist = train spaceForce :: FuturamaDist

And now some more robots:

>robots =
>    [ bender
>    , Character "Calculon" "robot" Other (Nothing) 123 6.8 650
>    , Character "The Crushinator" "robot" Other (Nothing) 45 8.0 4500
>    , Character "Clamps" "robot" Henchman (Just False) 134 5.8 330
>    , Character "DonBot" "robot" Manager (Just False) 178 5.8 520
>    , Character "Hedonismbot" "robot" Other (Just False) 69 4.3 1200
>    , Character "Preacherbot" "robot" Manager (Nothing) 45 5.8 350
>    , Character "Roberto" "robot" Other (Just False) 77 5.9 250
>    , Character "Robot Devil" "robot" Other (Just False) 895 6.0 280
>    , Character "Robot Santa" "robot" Other (Just False) 488 6.3 950
>    ]
>
>robotDist = train robots :: FuturamaDist

Now we’re going to take advantage of the monoid structure of our multivariate distributions to combine all of these distributions into one.

> futuramaDist = planetExpressDist <> momDist <> spaceDist <> robotDist

The resulting distribution is equivalent to having trained a distribution from scratch on all of the data points:

train (planetExpress++momCorporation++spaceForce++robots) :: FuturamaDist

We can take advantage of this property any time we use the train function to automatically parallelize our code. The higher order function **parallel** will split the training task evenly over each of your available processors, then merge them together with the monoid operation. This results in “theoretically perfect” parallel training of these models.

parallel train (planetExpress++momCorporation++spaceForce++robots) :: FuturamaDist

Again, this is only possible because the distributions have a monoid structure.
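The idea behind the parallel trainer can be sketched with a toy model (sequential chunking here stands in for the actual multicore scheduling, and the names are illustrative, not HLearn’s internals):

```haskell
import Data.List (foldl')
import qualified Data.Map as Map

type Model = Map.Map String Int

-- Sequential batch trainer for a toy frequency model.
train :: [String] -> Model
train = foldl' (\m x -> Map.insertWith (+) x 1 m) Map.empty

chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf k xs = take k xs : chunksOf k (drop k xs)

-- Train each chunk independently (these calls could run on separate
-- cores), then merge the partial models with the monoid operation.
trainChunked :: Int -> [String] -> Model
trainChunked k = foldr (Map.unionWith (+)) Map.empty . map train . chunksOf k

main :: IO ()
main = do
  let dat = words "human robot human alien robot robot"
  -- Chunked training gives exactly the sequential answer.
  print (trainChunked 2 dat == train dat)  -- True
```

Since the merge is associative, the chunks can be combined in any order, which is exactly what makes the parallelization both safe and exact.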

Now, let’s ask some questions of our distribution. If I pick a character at random, what’s the probability that they’re a good guy? Let’s plot the marginal.

ghci> plotDistribution (plotFile "goodguy" $ PNG 250 250) $ getMargin TH_isGood futuramaDist

But what if I only want to pick from those characters that are humans, or those characters that are robots? Statisticians call this conditioning. We can do that with the condition function:

ghci> plotDistribution (plotFile "goodguy-human" $ PNG 250 250) $ getMargin TH_isGood $ condition TH_species "human" futuramaDist
ghci> plotDistribution (plotFile "goodguy-robot" $ PNG 250 250) $ getMargin TH_isGood $ condition TH_species "robot" futuramaDist

Now let’s ask: What’s the average age of an evil robot?

ghci> mean $ getMargin TH_age $ condition TH_isGood (Just False) $ condition TH_species "robot" futuramaDist
273.0769230769231

Notice that conditioning a distribution is a commutative operation. That means we can condition in any order and still get the exact same results. Let’s try it:

ghci> mean $ getMargin TH_age $ condition TH_species "robot" $ condition TH_isGood (Just False) futuramaDist
273.0769230769231

There’s one last thing for us to consider. What does our Markov network look like after conditioning? Let’s find out!

ghci> plotNetwork "condition-species-isGood" $ condition TH_species "robot" $ condition TH_isGood (Just False) futuramaDist

Notice that conditioning on these variables removed them from our Markov network.

Finally, there’s a process similar to conditioning called “marginalizing out.” This lets us ignore the effects of a single attribute without specifying what that attribute’s value must be. When we marginalize out a variable in our Markov network, we get the same dependence structure as if we had conditioned on it.

ghci> plotNetwork "marginalizeOut-species-isGood" $ marginalizeOut TH_species $ marginalizeOut TH_isGood futuramaDist

Effectively, what the marginalizeOut function does is “forget” the extra dependencies, whereas the condition function “applies” those dependencies. In the end, the resulting Markov network has the same structure, but different values.
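Here is a toy illustration of the difference between the two operations on a small joint distribution over species and goodness. The names mirror HLearn’s functions but this is plain Data.Map, not the library’s code:

```haskell
import qualified Data.Map as Map

-- A toy joint distribution over (species, isGood), stored as
-- unnormalized mass.
type Joint = Map.Map (String, Bool) Double

-- Conditioning keeps only the mass consistent with the observed value.
conditionSpecies :: String -> Joint -> Map.Map Bool Double
conditionSpecies s j =
  Map.fromListWith (+) [ (g, p) | ((s', g), p) <- Map.toList j, s' == s ]

-- Marginalizing out sums the mass over ALL values of the variable.
marginalizeOutSpecies :: Joint -> Map.Map Bool Double
marginalizeOutSpecies j =
  Map.fromListWith (+) [ (g, p) | ((_, g), p) <- Map.toList j ]

main :: IO ()
main = do
  let joint = Map.fromList
        [ (("human", True), 6), (("human", False), 4)
        , (("robot", True), 1), (("robot", False), 7) ]
  print (conditionSpecies "robot" joint)   -- only the robot rows survive
  print (marginalizeOutSpecies joint)      -- species forgotten, rows summed
```

Both results are distributions over isGood alone, but with different values: conditioning applies the dependence on species, while marginalizing out forgets it.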

Finally, at the start of the post, I mentioned that our multivariate distributions have group and vector space structure. This gives us two more operations we can use: the inverse and scalar multiplication. You can find more posts on how to take advantage of these structures here and here.

The best part of all of this is still coming. Next, we’ll take a look at full on Bayesian classification and why it forms a monoid. Besides online and parallel trainers, this also gives us a fast cross-validation method.

There’ll also be posts about the monoid structure of Markov *chains*, the Free HomTrainer, and how this whole algebraic framework applies to NP-approximation algorithms as well.

Subscribe to the RSS feed to stay tuned.

You see, **in the United States, roughly half of our tax dollars go to financing war**. (You can find a detailed breakdown here.) This is ridiculous and unacceptable. I would gladly pay more taxes to finance roads, schools, or public health care. But I will no longer pay other people to kill America’s enemies on my behalf.

I deeply regret the need for tax resistance because it contradicts a number of Biblical commands. For example, in Romans 13:7 Paul tells us that “if you owe taxes, pay taxes,” and in Matthew 22:21 Jesus commands us to “give unto Caesar the things that are Caesar’s.” I wish I could obey these commands at face value. But **obeying the commands to pay taxes would result in me breaking the greatest commandment of them all: to love my neighbor as myself**. Jesus calls everyone my neighbor, even my enemies. Even people who kill Americans, like Osama bin Laden. I’m deeply ashamed that my tax dollars helped finance his assassination. Not to mention the near-daily drone strikes that continue to happen, the torture at gitmo, and the DOD’s research into newer and deadlier weapons systems. I paid for it all.

I could say a lot more about why I feel morally compelled to not pay war taxes, but I won’t. I’ll skip right to the part where **I’m making a public statement that I will not finance war, and I will accept whatever consequences that entails**. I also acknowledge that by taking this stand, I am sinning. But this is the least sinful option my limited wisdom can find. So I will continue on, “sinning boldly” as ever.

Below I describe the exact mechanics of how I’m refusing to pay war taxes. I’m following advice provided mainly by the War Resisters League’s War Tax Resistance book.

Today I filed my taxes just like everyone else. I filled out my form 1040, and found out that I owed 48 dollars. It’s not very much, but it’s something. I did my best to be as honest and complete as possible in the paperwork. But instead of including a check, I wrote them the following letter:

To whom it may concern:

After careful consideration, I have decided not to pay my 2012 taxes to the Federal government. I cannot in good conscience provide any financial support for our ongoing wars and excessive military spending.

I do, however, want to be a good citizen and contribute my fair share to society. Therefore, I am paying the taxes I owe to the federal government ($48) to my local state government (CA) instead. I have scanned a copy of my contribution check below.

Sincerely,

Michael Izbicki

I was happy to do my California taxes in addition to giving them this extra money. It’s only *war* that I’m against, not taxes in general. Here’s a copy of the actual check I wrote:

Also, for anyone interested, I’ve posted my form 1040:

Finally, just before mailing my envelope, I said the St Francis prayer:

Lord, make me an instrument of your peace. Where there is hatred, let me sow love; where there is injury, pardon; where there is doubt, faith; where there is despair, hope; where there is darkness, light; and where there is sadness, joy.

O Divine Master, grant that I may not so much seek to be consoled, as to console; to be understood, as to understand; to be loved, as to love. For it is in giving that we receive; it is in pardoning that we are pardoned, and it is in dying that we are born to Eternal Life.

In this article, we’ll go over the math behind the categorical distribution, the algebraic structure of the distribution, and how to manipulate it within Haskell’s HLearn library. We’ll also see some examples of how this focus on algebra makes HLearn’s interface more powerful than other common statistical packages. Everything that we’re going to see is in a certain sense very “obvious” to a statistician, but this algebraic framework also makes it **convenient**. And since programmers are inherently lazy, this is a Very Good Thing.

Before delving into the “cool stuff,” we have to look at some of the mechanics of the HLearn library.

The HLearn-distributions package contains all the functions we need to manipulate categorical distributions. Let’s install it:

$ cabal install HLearn-distributions-1.1

We import our libraries:

>import Control.DeepSeq
>import HLearn.Algebra
>import HLearn.Models.Distributions

We create a data type for Simon’s marbles:

>data Marble = Red | Green | Blue | White
>    deriving (Read,Show,Eq,Ord)

The easiest way to represent Simon’s bag of marbles is with a list:

>simonBag :: [Marble]
>simonBag = [Red, Red, Red, Green, Blue, Green, Red, Blue, Green, Green, Red, Red, Blue, Red, Red, Red, White]

And now we’re ready to train a categorical distribution of the marbles in Simon’s bag:

>simonDist = train simonBag :: Categorical Double Marble

We can load up ghci and plot the distribution with the conveniently named function plotDistribution:

ghci> plotDistribution (plotFile "simonDist" $ PDF 400 300) simonDist

This gives us a histogram of probabilities:

In the HLearn library, every statistical model is generated from data using either train or train’. Because these functions are overloaded, we must specify the type of simonDist so that the compiler knows which model to generate. Categorical takes two type parameters. The first is the type of the probability (Double). The second is the type of the discrete data (Marble). We could easily create Categorical distributions with different types depending on the requirements of our application. For example:

>stringDist = train (map show simonBag) :: Categorical Float String

This is the first “cool thing” about Categorical: **We can make distributions over any user-defined type**. This makes programming with probabilities easier, more intuitive, and more convenient. Most other statistical libraries would require you to assign numbers corresponding to each color of marble, and then create a distribution over those numbers.

Now that we have a distribution, we can find some probabilities. If Simon pulls a marble from the bag, what’s the probability that it would be Red?

We can use the pdf function to do this calculation for us:

ghci> pdf simonDist Red
0.5294117647058824
ghci> pdf simonDist Blue
0.17647058823529413
ghci> pdf simonDist Green
0.23529411764705882
ghci> pdf simonDist White
0.058823529411764705

If we sum all the probabilities, as expected we would get 1:

ghci> sum $ map (pdf simonDist) [Red,Green,Blue,White]
1.0

Due to rounding errors, you may not always get 1. If you absolutely, positively, have to avoid rounding errors, you should use Rational probabilities:

>simonDistRational = train simonBag :: Categorical Rational Marble

Rationals are slower, but won’t be subject to floating point errors.

This is just about all the functionality you would get in a “normal” stats package like R or NumPy. But using Haskell’s nice support for algebra, we can get some extra cool features.

First, let’s talk about semigroups. A semigroup is any data structure that has an associative binary operation (**<>**) that joins two of those data structures together. The categorical distribution is a semigroup.

Don wants to play marbles with Simon, and he has his own bag. Don’s bag contains only red and blue marbles:

>donBag = [Red,Blue,Red,Blue,Red,Blue,Blue,Red,Blue,Blue]

We can train a categorical distribution on Don’s bag in the same way we did earlier:

>donDist = train donBag :: Categorical Double Marble

In order to play marbles together, Don and Simon will have to add their bags together.

>bothBag = simonBag ++ donBag

Now, we have two options for training our distribution. First is the naive way, we can train the distribution directly on the combined bag:

>bothDist = train bothBag :: Categorical Double Marble

This is the way we would have to approach this problem in most statistical libraries. But with HLearn, we have a more efficient alternative. We can combine the trained distributions using the semigroup operation:

>bothDist' = simonDist <> donDist

Under the hood, the categorical distribution stores the number of times each possibility occurred in the training data. The <> operator just adds the corresponding counts from each distribution together:

This method is more efficient because it avoids repeating work we’ve already done. Categorical’s semigroup operation runs in time **O(1)**, so no matter how big the bags are, we can calculate the distribution very quickly. The naive method, in contrast, requires time **O(n)**. If our bags had millions or billions of marbles inside them, this would be a considerable savings!
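To make this concrete, here is a minimal sketch of what such a count-based representation might look like. This is not HLearn’s actual code; CatSketch, trainSketch, and pdfSketch are names invented for illustration:

```haskell
import qualified Data.Map as Map

-- A categorical distribution, represented as a map from outcomes to counts.
newtype CatSketch a = CatSketch (Map.Map a Double)
    deriving (Show, Eq)

-- Training just tallies one count per data point.
trainSketch :: Ord a => [a] -> CatSketch a
trainSketch xs = CatSketch (Map.fromListWith (+) [(x, 1) | x <- xs])

-- The semigroup operation adds the corresponding counts together.  Its cost
-- depends only on the number of distinct categories, not on the data size.
instance Ord a => Semigroup (CatSketch a) where
    CatSketch m1 <> CatSketch m2 = CatSketch (Map.unionWith (+) m1 m2)

-- A probability is a category's count divided by the total count.
pdfSketch :: Ord a => CatSketch a -> a -> Double
pdfSketch (CatSketch m) x = Map.findWithDefault 0 x m / sum (Map.elems m)
```

With this representation, `trainSketch xs <> trainSketch ys` gives the same distribution as `trainSketch (xs ++ ys)`, which is exactly the semigroup property we used above.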

We get another cool performance trick “for free” based on the fact that Categorical is a semigroup: The function train can be **automatically parallelized** using the higher order function parallel. I won’t go into the details about how this works, but here’s how you do it in practice.

First, we must show the compiler how to resolve the Marble data type down to “normal form.” This basically means we must show the compiler how to fully evaluate the data type. (We only have to do this because Marble is a type we created. If we were using a built-in type, like String, we could skip this step.) This is fairly easy for a type as simple as Marble:

>instance NFData Marble where
>    rnf Red   = ()
>    rnf Blue  = ()
>    rnf Green = ()
>    rnf White = ()

Then, we can perform the parallel computation by:

>simonDist_par = parallel train simonBag :: Categorical Double Marble

Other languages require a programmer to manually create parallel versions of their functions. But in Haskell with the HLearn library, we get these parallel versions for free! All we have to do is ask for it!
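The scheme behind parallel can be sketched as a sequential mock-up over plain count maps (chunksOf and trainParallelish are invented names; a real implementation evaluates each chunk on a separate core, and the reduction step is only well-defined because <> is associative):

```haskell
import qualified Data.Map as Map
import Data.List (foldl')

type Counts a = Map.Map a Int

-- Tally one chunk of the data.
trainChunk :: Ord a => [a] -> Counts a
trainChunk xs = Map.fromListWith (+) [(x, 1) | x <- xs]

-- Split the data into chunks of at most n elements.
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t

-- Train each chunk independently, then reduce with the semigroup op.
-- A parallel version would run the trainChunk calls on separate cores;
-- associativity guarantees the result is the same either way.
trainParallelish :: Ord a => Int -> [a] -> Counts a
trainParallelish n xs =
    foldl' (Map.unionWith (+)) Map.empty (map trainChunk (chunksOf n xs))
```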

A monoid is a semigroup with an identity element, which is called **mempty** in Haskell. It obeys the law that:

M <> mempty == mempty <> M == M

And it is easy to show that Categorical is also a monoid. We get this empty element by training on an empty data set:

mempty = train ([] :: [Marble]) :: Categorical Double Marble

The HomTrainer type class requires that all its instances also be instances of Monoid. This lets the compiler automatically derive “online trainers” for us. An online trainer can add new data points to our statistical model without retraining it from scratch.

For example, we could use the function add1dp (stands for: add one data point) to add another white marble into Simon’s bag:

>simonDistWhite = add1dp simonDist White

This also gives us another approach for our earlier problem of combining Simon and Don’s bags. We could use the function addBatch:

>bothDist'' = addBatch simonDist donBag

Because Categorical is a monoid, we maintain the property that:

bothDist == bothDist' == bothDist''

Again, statisticians have always known that you could add new points into a categorical distribution without training from scratch. The cool thing here is that **the compiler is deriving all of these functions for us**, and it’s giving us a **consistent interface** for use with different data structures. All we had to do to get these benefits was tell the compiler that Categorical is a monoid. This makes designing and programming libraries much **easier, quicker, and less error prone**.
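Under the hood, these derived online trainers are essentially one-liners built from train and the monoid operation. A sketch with plain count maps (trainC, add1dpC, and addBatchC are invented names, not HLearn’s internals):

```haskell
import qualified Data.Map as Map

type Counts a = Map.Map a Int

trainC :: Ord a => [a] -> Counts a
trainC xs = Map.fromListWith (+) [(x, 1) | x <- xs]

-- Add one data point: combine the model with a model trained on just [dp].
add1dpC :: Ord a => Counts a -> a -> Counts a
add1dpC model dp = Map.unionWith (+) model (trainC [dp])

-- Add a whole batch of new data points the same way.
addBatchC :: Ord a => Counts a -> [a] -> Counts a
addBatchC model dps = Map.unionWith (+) model (trainC dps)
```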

A group is a monoid with the additional property that all elements have an **inverse**. This lets us perform subtraction on groups. And Categorical is a group.

Ed wants to play marbles too, but he doesn’t have any of his own. So Simon offers to give Ed some from his own bag. He gives Ed one of each color:

>edBag = [Red,Green,Blue,White]

Now, if Simon draws a marble from his bag, what’s the probability it will be blue?

To answer this question without algebra, we’d have to go back to the original data set, remove the marbles Simon gave Ed, then retrain the distribution. This is awkward and computationally expensive. But if we take advantage of Categorical’s group structure, we can just subtract directly from the distribution itself. This makes more sense intuitively and is easier computationally.

>simonDist2 = subBatch simonDist edBag

This is a shorthand notation for using the group operations directly:

>edDist = train edBag :: Categorical Double Marble
>simonDist2' = simonDist <> (inverse edDist)

The inverse operation works by multiplying the counts for each category by -1. In picture form, this flips the distribution upside down:

Then, adding an upside down distribution to a normal one is just subtracting the histogram columns and renormalizing:

Notice that the green bar in edDist looks really big—much bigger than the green bar in simonDist. But when we subtract it away from simonDist, we still have some green marbles left over in simonDist2. This is because the histogram is only showing the *probability* of a green marble, and not the *actual number* of marbles.
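Here is a sketch of how such a group inverse can work on plain count maps (inverseC and subBatchC are invented names for illustration):

```haskell
import qualified Data.Map as Map

type Counts a = Map.Map a Int

trainC :: Ord a => [a] -> Counts a
trainC xs = Map.fromListWith (+) [(x, 1) | x <- xs]

-- The inverse negates every count, flipping the distribution upside down.
inverseC :: Counts a -> Counts a
inverseC = Map.map negate

-- Subtracting a batch means combining with the inverse of its distribution.
subBatchC :: Ord a => Counts a -> [a] -> Counts a
subBatchC model dps = Map.unionWith (+) model (inverseC (trainC dps))
```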

Finally, there’s one more crazy trick we can perform with the Categorical group. It’s perfectly okay to have both positive and negative marbles in the same distribution. For example:

ghci> plotDistribution (plotFile "mixedDist" $ PDF 400 300) (edDist <> (inverse donDist))

results in:

Most statisticians would probably say that these upside down Categoricals are not “real distributions.” But at the very least, they are a convenient mathematical trick that makes **working with distributions much more pleasant**.

Finally, an R-Module is a group with two additional properties. First, it is abelian. That means <> is commutative. So, for all a, b:

a <> b == b <> a

Second, the data type supports **multiplication by any element in the ring R**. In Haskell, you can think of a ring as any member of the Num type class.

How is this useful? It lets us “retrain” our distribution on the data points it has already seen. Back to the example…

Well, Ed—being the clever guy that he is—recently developed a marble copying machine. That’s right! You just stick some marbles in on one end, and on the other end out pop 10 exact duplicates. Ed’s not just clever, but pretty nice too. He duplicates his new marbles and gives all of them back to Simon. What’s Simon’s new distribution look like?

Again, the naive way to answer this question would be to retrain from scratch:

>duplicateBag = simonBag ++ (concat $ replicate 10 edBag)
>duplicateDist = train duplicateBag :: Categorical Double Marble

Slightly better is to take advantage of the Semigroup property, and just apply that over and over again:

>duplicateDist' = simonDist2 <> (foldl1 (<>) $ replicate 10 edDist)

But even better is to take advantage of the fact that Categorical is a module and use the **(.*)** operator:

>duplicateDist'' = simonDist2 <> 10 .* edDist

In picture form:

Also notice that without the scalar multiplication, we would get back our original distribution:

Another way to think about the module’s scalar multiplication is that it allows us to **weight our distributions**.
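A sketch of scalar multiplication over plain count maps (the operator name (.*.) is invented here to avoid clashing with HLearn’s (.*)):

```haskell
import qualified Data.Map as Map

type Counts a = Map.Map a Double

trainC :: Ord a => [a] -> Counts a
trainC xs = Map.fromListWith (+) [(x, 1) | x <- xs]

-- Scalar multiplication scales every count by the same ring element.
-- Integer scalars duplicate the data; fractional scalars weight it.
(.*.) :: Double -> Counts a -> Counts a
r .*. m = Map.map (r *) m
infixl 7 .*.
```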

Ed just realized that he still needs a marble, and has decided to take one. Someone has left their marble bag sitting nearby, but he’s not sure whose it is. He thinks that Simon is more forgetful than Don is, so he assigns a 60% probability that the bag is Simon’s and a 40% probability that it is Don’s. When he takes a marble, what’s the probability that it is red?

We create a weighted distribution using module multiplication:

>weightedDist = 0.6 .* simonDist <> 0.4 .* donDist

Then in ghci:

ghci> pdf weightedDist Red
0.4929577464788732
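We can check this number by hand: Simon’s bag holds 9 red marbles out of 17, and Don’s holds 4 red out of 10, so weighting the counts and renormalizing gives

```haskell
-- Simon's bag: 9 red out of 17 marbles.  Don's bag: 4 red out of 10.
-- Scale each bag's counts by its weight, then renormalize.
weightedRed :: Double
weightedRed = (0.6 * 9 + 0.4 * 4) / (0.6 * 17 + 0.4 * 10)
-- approximately 0.4930, matching the ghci output above
```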

We can also train directly on weighted data using the **trainW** function:

>weightedDataDist = trainW [(0.4,Red),(0.5,Green),(0.2,Green),(3.7,White)] :: Categorical Double Marble

which gives us:

Talking about the categorical distribution in algebraic terms lets us do some cool new stuff with our distributions that we can’t easily do in other libraries. None of this is statistically groundbreaking. The cool thing is that **algebra just makes everything so convenient to work with**.

I think I’ll do another post on some cool tricks with the kernel density estimator that are not possible at all in other libraries, then do a post about the category (in the formal category-theoretic sense) of statistical training methods. At that point, we’ll be ready to jump into machine learning tasks. Depending on my mood, we might take a pit stop to discuss the computational aspects of free groups and modules and how these relate to machine learning applications.

Sign up for the RSS feed to stay tuned!
