Weka is one of the most popular tools for data analysis. But Weka takes 70 minutes to perform leave-one-out cross-validation using a simple naive Bayes classifier on the census income data set, whereas Haskell’s HLearn library only takes 9 seconds. Weka is 465x slower!
Code and instructions for reproducing these experiments are available on GitHub.
Why is HLearn so much faster?
Well, it turns out that the Bayesian classifier has the algebraic structure of a monoid, a group, and a vector space. HLearn uses a new cross-validation algorithm that can exploit these algebraic structures. The standard algorithm runs in time $\Theta(kn)$, where $k$ is the number of “folds” and $n$ is the number of data points. The algebraic algorithms, however, run in time $\Theta(n)$. In other words, it doesn’t matter how many folds we do: the run time does not depend on $k$! And not only are we faster, but we get the exact same answer. Algebraic cross-validation is not an approximation, it’s just fast.
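To make the idea concrete, here is a minimal sketch of cross-validation by model subtraction. It is written in Python for brevity and is not HLearn's code or API: the `NBModel` class, the `algebraic_cv` and `standard_cv` functions, the toy data, and the add-one smoothing for binary features are all assumptions made for illustration. The point it shows is that a naive Bayes model is just a bag of counts, so models can be added (the monoid) and subtracted (the group), and each held-out model is "everything minus this fold" rather than a fresh training run.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class NBModel:
    """Naive Bayes sufficient statistics: nothing but counts."""
    label_counts: Counter = field(default_factory=Counter)
    feature_counts: Counter = field(default_factory=Counter)  # (label, index, value)

    @staticmethod
    def train(rows):
        model = NBModel()
        for features, label in rows:
            model.label_counts[label] += 1
            for i, value in enumerate(features):
                model.feature_counts[(label, i, value)] += 1
        return model

    def __add__(self, other):
        # Monoid operation: merging two models is adding their counts.
        return NBModel(self.label_counts + other.label_counts,
                       self.feature_counts + other.feature_counts)

    def __sub__(self, other):
        # Group inverse: "untrain" a subset. Only valid when `other` was
        # trained on a subset of the data behind `self`.
        return NBModel(self.label_counts - other.label_counts,
                       self.feature_counts - other.feature_counts)

    def predict(self, features):
        total = sum(self.label_counts.values())

        def score(label):
            n = self.label_counts[label]
            s = math.log(n / total)
            for i, value in enumerate(features):
                # Add-one smoothing, assuming binary features (an assumption).
                s += math.log((self.feature_counts[(label, i, value)] + 1) / (n + 2))
            return s

        # Sorting the labels makes tie-breaking deterministic.
        return max(sorted(self.label_counts), key=score)


def split_folds(rows, k):
    return [rows[i::k] for i in range(k)]


def algebraic_cv(rows, k):
    """Every data point is trained on once; folds are removed by subtraction."""
    folds = split_folds(rows, k)
    fold_models = [NBModel.train(fold) for fold in folds]
    total = NBModel()
    for fold_model in fold_models:
        total = total + fold_model
    errors = 0
    for fold, fold_model in zip(folds, fold_models):
        held_out = total - fold_model
        errors += sum(held_out.predict(x) != y for x, y in fold)
    return errors / len(rows)


def standard_cv(rows, k):
    """Baseline: retrain from scratch on the other k-1 folds, k times."""
    folds = split_folds(rows, k)
    errors = 0
    for i, fold in enumerate(folds):
        training = [row for j, f in enumerate(folds) if j != i for row in f]
        model = NBModel.train(training)
        errors += sum(model.predict(x) != y for x, y in fold)
    return errors / len(rows)


# Hypothetical toy data: ((feature_0, feature_1), label).
TOY_ROWS = [
    ((1, 0), "yes"), ((1, 1), "yes"), ((0, 0), "no"), ((0, 1), "no"),
    ((1, 0), "yes"), ((0, 1), "no"), ((1, 1), "no"), ((0, 0), "yes"),
]
```

Calling `algebraic_cv(TOY_ROWS, k)` and `standard_cv(TOY_ROWS, k)` with the same `k` gives the same error rate, because both end up predicting from identical counts. The difference is only in how much training work is repeated: the baseline touches each point k-1 times, while the algebraic version touches each point once and then does one cheap subtraction per fold.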
Here are some run times for k-fold cross-validation on the census income data set. Notice that HLearn’s run time is constant as we add more folds.
And when we set k = n, we have leave-one-out cross-validation. Notice that Weka’s cross-validation has quadratic run time, whereas HLearn has linear run time.
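To see where quadratic versus linear comes from, here is a back-of-the-envelope count. It assumes a simplified cost model that is not taken from the experiments: equal-sized folds, one unit of work for each training point absorbed into a model, and one unit for each model addition or subtraction. The cost of the $n$ predictions is the same for both methods and is left out.

$$
\begin{aligned}
T_{\text{standard}}(k, n) &= k\left(n - \frac{n}{k}\right) = (k-1)\,n, & T_{\text{standard}}(n, n) &= n(n-1) = \Theta(n^2),\\
T_{\text{algebraic}}(k, n) &= n + (k-1) + k = n + 2k - 1, & T_{\text{algebraic}}(n, n) &= 3n - 1 = \Theta(n).
\end{aligned}
$$

Since $k \le n$, the algebraic count is $\Theta(n)$ for every choice of $k$, while the standard count grows with $k$ and becomes quadratic at $k = n$.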
HLearn certainly isn’t going to replace Weka any time soon, but it’s got a number of cool tricks like this going on inside. If you want to read more, you should check out these two recent papers:
I’ll continue to write more about these tricks in future blog posts.
Subscribe to the RSS feed to stay tuned.

Congratulations! I’ve been following your work on HLearn and the associated blog posts closely for a while now, and I even played with HLearn at some point. Please keep it up, we have to keep people from using the Python machine learning kits…
What are your plans for other learning algorithms in the future?

I’m now wondering about bootstrapping and these algebraic ideas. Seems like you should be able to get speedups through the same idea, but by reusing fragments of bootstrap samples?

Incredible! You’ve done a great thing for me.
Before I found this article, I had a hard time conveying the advantages of Haskell to other programmers who use C++, Java, or something else. But after I showed them this article, they were really shocked and opened their minds to functional programming, especially Haskell, although they are still unsure what homomorphisms and algebra are all about. But I think this is a really good start for them. I have some mathematics background, so I think I can help you with both documentation and programming. Let me help if possible.