In today’s adventure, our hero ghc faces its final challenge: granting parametricity to our lensified Functor, Applicative, and Monad classes. Parametricity is the key to finishing the task of simplifying our type signatures that we started two days ago. At the end we’ll still have some loose ends left untied, but they’ll have to wait for another haskell binge.
You are currently browsing the archive for the Haskell category.
It’s round 5 of typeparams versus GHC. We’ll be extending our Functor and Applicative classes to define a new Monad class. It’s all pretty simple if you just remember: lensified monads are like burritos with fiber optic cables telling you where to bite next. They’re also just monoids in the category of lens-enhanced endofunctors. Piece of cake.
This post covers a pretty neat trick with closed type families. Normal type families are “open” because any file can add new instances of the type. Closed type families, however, must be defined in a single file. This lets the type checker make more assumptions, and so the closed families are more powerful. In this post, we will circumvent this restriction and define certain closed type families over many files.
(or how to promote quickcheck and rewrite rules to the type level)
We’ve seen how to use the typeparams library to soup up our Functor and Applicative type classes. But we’ve been naughty little haskellers—we’ve been using type lenses without discussing their laws! Today we fix this oversight. Don’t worry if you didn’t read/understand the previous posts. This post is much simpler and does not require any background.
First, we’ll translate the standard lens laws to the type level. Then we’ll see how these laws can greatly simplify the type signatures of our functions. Finally, I’ll propose a very simple (yes, I promise!) GHC extension that promotes rewrite rules to the type level. These type level rewrite rules would automatically simplify our type signatures for us. It’s pretty freakin awesome.
Functors and monads are powerful design patterns used in Haskell. They give us two cool tricks for analyzing data. First, we can “preprocess” data after we’ve already trained a model. The model will be automatically updated to reflect the changes. Second, this whole process happens asymptotically faster than the standard method of preprocessing. In some cases, you can do it in constant time no matter how many data points you have!
This post focuses on how to use functors and monads in practice with the HLearn library. We won’t talk about their category theoretic foundations; instead, we’ll go through ten concrete examples involving the categorical distribution. This distribution is somewhat awkwardly named for our purposes because it has nothing to do with category theory—it is the most general distribution over non-numeric (i.e. categorical) data. It’s simplicity should make the examples a little easier to follow. Some more complicated models (e.g. the kernel density estimator and Bayesian classifier) also have functor and monad instances, but we’ll save those for another post.
Read the rest of this entry »
Haskell code is expressive. The HLearn library uses 6 lines of Haskell to define a function for training a Bayesian classifier; the equivalent code in the Weka library uses over 100 lines of Java. That’s a big difference! In this post, we’ll look at the actual code and see why the Haskell is so much more concise.
But first, a disclaimer: It is really hard to fairly compare two code bases this way. In both libraries, there is a lot of supporting code that goes into defining each classifier, and it’s not obvious what code to include and not include. For example, both libraries implement interfaces to a number of probability distributions, and this code is not contained in the source count. The Haskell code takes more advantage of this abstraction, so this is one language-agnostic reason why the Haskell code is shorter. If you think I’m not doing a fair comparison, here’s some links to the full repositories so you can do it yourself:
- HLearn’s bayesian classifier source code (74 lines of code)
- Weka’s naive bayes source code (946 lines of code)
Weka is one of the most popular tools for data analysis. But Weka takes 70 minutes to perform leave-one-out cross-validate using a simple naive bayes classifier on the census income data set, whereas Haskell’s HLearn library only takes 9 seconds. Weka is 465x slower!
Code and instructions for reproducing these experiments are available on github.
In this post, we’re going to look at how to manipulate multivariate distributions in Haskell’s HLearn library. There are many ways to represent multivariate distributions, but we’ll use a technique called Markov networks. These networks have the algebraic structure called a monoid (and group and vector space), and training them is a homomorphism. Despite the scary names, these mathematical structures make working with our distributions really easy and convenient—they give us online and parallel training algorithms “for free.” If you want to go into the details of how, you can check out my TFP13 submission, but in this post we’ll ignore those mathy details to focus on how to use the library in practice. We’ll use a running example of creating a distribution over characters in the show Futurama.