HLearn cross-validates >400x faster than Weka

weka-lambda-haskellWeka is one of the most popular tools for data analysis.  But Weka takes 70 minutes to perform leave-one-out cross-validate using a simple naive bayes classifier on the census income data set, whereas Haskell’s HLearn library only takes 9 seconds.  Weka is 465x slower!

Code and instructions for reproducing these experiments are available on github.

Why is HLearn so much faster?

Well, it turns out that the bayesian classifier has the algebraic structure of a monoid, a group, and a vector space.  HLearn uses a new cross-validation algorithm that can exploit these algebraic structures.  The standard algorithm runs in time , where is the number of “folds” and is the number of data points.  The algebraic algorithms, however, run in time .  In other words, it doesn’t matter how many folds we do, the run time is constant!  And not only are we faster, but we get the exact same answer.  Algebraic cross-validation is not an approximation, it’s just fast.

Here’s some run times for k-fold cross-validation on the census income data set.  Notice that HLearn’s run time is constant as we add more folds.

k-fold-cross-validation-weka

And when we set k=n, we have leave-one-out cross-validation.  Notice that Weka’s cross-validation has quadratic run time, whereas HLearn has linear run time.

leave-one-out-fast-cross-validation-weka

HLearn certainly isn’t going to replace Weka any time soon, but it’s got a number of cool tricks like this going on inside.  If you want to read more, you should check out these two recent papers:

I’ll continue to write more about these tricks in future blog posts.

Subscribe to the RSS feed to stay tuned.

  1. Alp’s avatar

    Congratulations! Been following your work on HLearn and the associated blog posts closely for a while now, I did even play with HLearn at some point. Please keep it up, we have to keep people from using the python machine learning kits… :)

    What are your plans for other learning algorithms in the future?

    Reply

    1. Mike’s avatar

      Thanks! I’m working on 4 things right now:

      1) Proving that a certain algorithm I’ve made based on algebra is in fact a “boosting” algorithm. (Boosting algorithms are really important to machine learning because they are very accurate.)

      2) Working on the Functor -> Monad hierarchy of classes. In particular, the techniques in the “Probabilistic Functional Programming” can be generalized to the way HLearn works and we get lots of cool time saving algorithms from that.

      3) Working on “approximation algebras” and showing that all of the stuff in HLearn can be generalized so that it will work with any learning algorithm. I’m not yet sure how realistic this is yet though.

      4) I’m pretty sure there’s a lot of usability improvements that could be made to make incorporating the algorithms into an average Haskell program almost trivial. I’m mostly just thinking about the ways to do this right now and don’t have a lot of concrete ideas.

      Reply

  2. Carlos’s avatar

    I’m now wondering about bootstrapping and these algebraic ideas. Seems like you should be able to get speedups through the same idea, but by reusing fragments of bootstrap samples?

    Reply

    1. Mike’s avatar

      Yep! It turns out the bagging is a homomorphism as well. There’s several different versions implemented in HLearn, unfortunately they all need better documentation.

      The simplest one is the FreeHomTrainer. For more theoretic details of this model, you can check out my ICML13 paper.

      Next, there’s the somewhat misnamed MonoidBoost type. This is a specific implementation of the Free HomTrainer that selects good default parameters and has been pretty accurate in my informal tests. The downside is that it is only a monoid, and has no other algebraic structure (e.g. group/vector space).

      Finally, there is what’s implemented as the Bagging type. This type subsumes whatever algebraic structure the base models have.

      I should probably write a full blog post describing how to use these at some point.

      Reply

  3. He-chien Tsai’s avatar

    Incredible! you’ve done a big deal for me.
    Before I know this article, I have a hard time to convey the advantages of Haskell to other programmers, who use c++ ,java or something else. But after I show this article, they’re really shocked and opened their mind to functional programming especially Haskell although they still doubt with what the homomorphism and the algebra are all about. But I think this is really a good start for them. I have some mathematic backgroud, so I think I can help you both documentation and programming. let me help if possible.

    Reply

    1. Mike’s avatar

      Thanks! You should checkout the github page. You can help by looking through the documentation and finding things that aren’t clear, or if you want to implement code the easiest place to start is probably by implementing a new univariate distribution. Let me know what sounds best and I’ll give you an outline of small tasks that need doing.

      Reply

      1. He-chien Tsai’s avatar

        Implementing code sounds better because I think my english may be not good enough for elegent documenting.
        please send me emails if needed

        Reply

Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>