Tuesday, November 11, 2014

Review: Thoughtful Machine Learning by Matthew Kirk

There seems to be a little tension these days between REPL people and Unit Test people. Some users of languages which feature only a Read-Eval-Print Loop (REPL) claim that unit tests are heavyweight and unnecessary. Others, I suppose, view any development workflow other than deliberate Red-Green-Refactor as deviant, unprofessional, and irresponsible. These are extreme cases. Clearly, using a REPL and writing unit tests are not mutually exclusive, but we all have our tendencies toward one side or the other.

Data science, and scientific programming in general, favors the REPL approach. R, the lingua franca of data science, is basically an interactive language that allows scripting--as opposed to the other big data science language, Python, which is a scripting language that features a REPL. MATLAB (and, I assume, Octave, its open-source imitator) is similar. Haskell and F#, academic languages used especially in math-heavy industries like finance, also feature scripts and favor interactive exploration using a REPL. I suspect Julia is similar, but I have yet to take a look at it. All this is to point out that a test-driven approach to data science is a bit of a novelty. This is the approach that Thoughtful Machine Learning by Matthew Kirk takes.

And it's a pretty good idea. Personally, I am a little more on the side of the REPL, but as a primarily .NET developer I don't have many options in that quarter. So, when I wanted to test out some machine learning algorithms, the first thing I did was create a Unit Test project and write a test. I was excited, then, when the very next day I happened to see a machine learning book that is explicitly test driven!

I was somewhat disappointed. I thought that the introduction oversells TDD. I literally rolled my eyes several times while reading it. "Hypothesize, test, theorize could be called 'red-green-refactor' instead," claims the author on the 3rd page. Yeah... no. They could not be. There is nothing remotely similar about forming a hypothesis and creating a failing test; indeed, they are opposites. I would have thought the argument would have focused more on producing reproducible research or providing regression tests when swapping algorithms. I don't recall these being touched on. A valuable part of the introduction was the list of risks in machine learning and a discussion of how to use automated tests to guard against these risks. It was good, and I wish that this section had been expanded.
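To make concrete the kind of guard I mean, here is a minimal sketch (my own, not the book's) of a regression test that pins a model's accuracy on a held-out set, so that swapping in a new algorithm can't silently make things worse. The `threshold_model` stand-in and the data are hypothetical:

```python
def accuracy(predict, examples):
    """Fraction of (features, label) pairs the classifier gets right."""
    correct = sum(1 for x, y in examples if predict(x) == y)
    return correct / len(examples)

# Hypothetical stand-in model: label a number "big" if it is at least 10.
def threshold_model(x):
    return "big" if x >= 10 else "small"

# A small held-out set the model is never trained on.
HELD_OUT = [(3, "small"), (15, "big"), (9, "small"), (22, "big")]

def test_model_does_not_regress():
    # If a replacement algorithm drops below this floor, the test goes red.
    assert accuracy(threshold_model, HELD_OUT) >= 0.9

test_model_does_not_regress()
```

The point is not the toy model but the workflow: the test encodes a minimum acceptable quality, which is exactly the sort of safety net the introduction's list of risks calls for.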

Next, the author is a bit touchy on the subject of Ruby. Most machine learning books use Python or R, but the author favors Ruby because of its great automated test abstractions. Fair enough. I don't have a lot of experience with either Python or Ruby, but I will say this: I could understand 99% of the Python code in Toby Segaran's Programming Collective Intelligence instantly, but found most of the code in Thoughtful Machine Learning to be gibberish sprinkled with pipes. Because of this, I mostly read this book from the perspective, as the author puts it, of the CTO or Business Analyst.

After the introductory chapter on TDD, there is an overview of machine learning algorithms. I thought it was a bit superficial and suffered from introducing terms and jargon before explaining them. "The curse of dimensionality" was thrown around a few times but not discussed or defined until a sidebar a chapter or so later. I wish the overview had been more detailed and had explained the criteria for picking the algorithms covered in the book. Why were tree-based techniques omitted, for instance?

The rest of the book covers algorithms. The chapters follow a pattern: introduce a technique, describe a problem, write some tests while solving it, and give a summary. I have to say that the example problems really didn't capture my imagination. When discussing k-Nearest Neighbor, the example is detecting beards and glasses in photographs. OK... As a comparison, Segaran's book used eBay's web API as a source for price prediction. Which do you find more helpful?
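For readers unfamiliar with the technique, the core of k-Nearest Neighbor fits in a few lines. This is a generic sketch in Python, not the book's Ruby code; the feature vectors and labels are invented for illustration:

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """train is a list of (point, label); return the majority label
    among the k training points closest to the query."""
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two clusters with made-up labels.
train = [((1, 1), "glasses"), ((1, 2), "glasses"),
         ((8, 8), "beard"), ((9, 7), "beard")]
print(knn_predict(train, (2, 1)))  # -> glasses
```

The whole algorithm is just "find the k closest labeled examples and let them vote," which is why the interest lies almost entirely in the choice of features and distance metric rather than the code.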

In the Naive Bayes Classifier chapter, the clichéd spam detector is used. Which is fine, but not very original. Why try to compete with Paul Graham's classic A Plan for Spam? Now, I understand that the primary goal of the book is teaching and that this is the canonical example, but it would be nice to see something more creative.
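Canonical or not, the example is canonical for a reason: the whole classifier is a word-count table plus Bayes' rule. A minimal sketch in Python (my own, with Laplace smoothing; the training phrases are invented):

```python
import math
from collections import Counter

class NaiveBayes:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, label, words):
        self.doc_counts[label] += 1
        self.word_counts[label].update(words)

    def score(self, label, words):
        """Log-probability of the label given the words (up to a constant)."""
        log_prob = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        total = sum(self.word_counts[label].values())
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        for w in words:
            # Add-one (Laplace) smoothing so an unseen word doesn't zero everything.
            log_prob += math.log((self.word_counts[label][w] + 1) / (total + len(vocab)))
        return log_prob

    def classify(self, words):
        return max(("spam", "ham"), key=lambda lbl: self.score(lbl, words))

nb = NaiveBayes()
nb.train("spam", ["buy", "cheap", "pills", "now"])
nb.train("ham", ["meeting", "at", "noon", "tomorrow"])
print(nb.classify(["cheap", "pills"]))  # -> spam
```

Graham's essay adds the engineering that makes this practical at scale (token selection, probability clamping, corpus hygiene), which is why a bare textbook version feels thin by comparison.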

It was nice to see a popular Machine Learning book that covers Hidden Markov Models. I admit that I need to revisit this chapter a few more times, because I haven't fully internalized it, but this is a very interesting technique and I wish there were more popular treatments.
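For anyone else trying to internalize the technique, the decoding half of an HMM is the Viterbi algorithm. Here is a compact Python sketch using the common rainy/sunny textbook example (mine, not the book's):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # Each column maps state -> (best probability so far, best path so far).
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        V.append({})
        for s in states:
            # Best way to arrive at state s given every possible previous state.
            prob, path = max(
                (V[-2][prev][0] * trans_p[prev][s] * emit_p[s][o],
                 V[-2][prev][1] + [s])
                for prev in states
            )
            V[-1][s] = (prob, path)
    return max(V[-1].values())[1]

states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4},
          "Sunny": {"walk": 0.6, "shop": 0.3}}
print(viterbi(["walk", "shop"], states, start_p, trans_p, emit_p))
# -> ['Sunny', 'Sunny']
```

The interesting part, and what takes a few readings to absorb, is that the hidden states (the weather) are never observed directly; the algorithm infers them from the observations alone.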

Finally, I read this book primarily on a Kindle (full disclosure: I got a free review eBook from the publisher). It isn't great. There is no table of contents for some reason, and the formatting isn't as clear as it should be. When I opened the PDF, I was surprised to see how beautifully it was laid out.

In summary, I was a bit disappointed by the book, but I am glad I looked at it. Get it if you are a Rubyist, really into the "TDD way," and want a fairly high-level view. Otherwise, I would recommend Toby Segaran's book. Alternatively, check out the free online machine learning courses from Coursera, Udacity, and edX. (Or see the excellent resources on FastML.com.)

Product Information: