Monday, December 1, 2014

Review: Programming Collective Intelligence by Toby Segaran

Machine Learning is a hot topic these days.  O'Reilly just published a book called Thoughtful Machine Learning.  There is a .NET focused ML book being published in January, and one or two Python ML books this spring.  R is everywhere.  There are more than a few start-ups creating machine learning APIs, in addition to the big guys like Google.  Microsoft has an Azure Machine Learning offering that looks very interesting.  And if you want to try a MOOC, a new Machine Learning class is put online for free every week.

Years and years ago when I was studying Neural Networks in graduate school, we used a commercial add-on to MATLAB and there were very few books on the topic.  Now, rather than too little information, there seems to be too much.

Programming Collective Intelligence by Toby Segaran was published in 2007.  This is almost ancient history for a computer book, but it does not feel dated.  It says something that even though I received a free review ebook (full disclosure), I purchased a hard copy too.  I immediately recommended this book to a colleague.

There was a lot to like about this book.

First, the book uses Python and I think this is a great choice.  The purpose of books like this is learning and I can't think of a better language for teaching.  Even though I don't know Python very well, I was able to implement a .NET version of the decision tree chapter very quickly.  I never had to look up anything Python related.  It illustrated the points and never got in the way.

Second, the book rarely uses libraries.  In the real world, most projects that use machine learning algorithms will tend to use libraries and APIs.  But to understand what is going on and to get an intuition for the appropriate approach to a problem, you need to write the algorithms yourself.  To this end Segaran usually implements the most basic version of the algorithm under discussion.  Each chapter includes exercises so that readers can make the implementation more sophisticated.
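
To give a flavor of what "writing the algorithm yourself" looks like, here is a minimal sketch of the core step of a decision tree: scoring candidate splits by Gini impurity and picking the best one. This is not code from the book. The function names (gini_impurity, best_split) and the choice of numeric thresholds are my own assumptions for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Chance that two labels drawn at random (with replacement) differ."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def best_split(rows, labels):
    """Find the numeric column and threshold with the lowest weighted impurity.

    Hypothetical helper: returns (column, threshold, score), or None when
    no threshold separates the rows into two non-empty groups.
    """
    best = None
    for col in range(len(rows[0])):
        for threshold in sorted({row[col] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[col] < threshold]
            right = [lab for row, lab in zip(rows, labels) if row[col] >= threshold]
            if not left or not right:
                continue  # a split must put something on both sides
            score = (len(left) * gini_impurity(left)
                     + len(right) * gini_impurity(right)) / len(labels)
            if best is None or score < best[2]:
                best = (col, threshold, score)
    return best


rows = [[1, 10], [2, 20], [3, 10], [4, 20]]
labels = ["a", "a", "b", "b"]
print(best_split(rows, labels))
```

A full tree would apply best_split recursively to each side until the groups are pure, which is the kind of extension the chapter exercises leave to the reader.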

Third, there is a unifying theme.  I love this.  Some of the other books I've read on Machine Learning treat the subject as a grab bag of algorithms and then treat them like a series of articles.  But Programming Collective Intelligence writes about the algorithms only insofar as they help solve Web 2.0 problems.  This gives a coherence to the book that most others lack.

The only thing I didn't like about it was its example of a Neural Network (used in the search chapter otherwise focused on web crawling).  It seemed like a poor fit and just thrown in there because people expect Neural Networks.  But this is a very minor annoyance.

In summary, this is a terrific book and I'd recommend it to anyone who wants to learn about Machine Learning and especially how to use these techniques to make their web sites better.

Tuesday, November 11, 2014

Review: Thoughtful Machine Learning by Matthew Kirk

There seems to be a little tension these days between REPL people and Unit Test people. Some users of languages which feature only a Read-Eval-Print Loop (REPL) claim that unit tests are heavyweight and unnecessary. Others, I suppose, view any development workflow other than deliberate Red-Green-Refactor as deviant, unprofessional, and irresponsible. These are extreme cases. Clearly, using a REPL and writing unit tests are not mutually exclusive, but we all have our tendencies on one side or the other.

Data Science and scientific programming, in general, favor the REPL approach. R, the lingua franca of data science, is basically an interactive language that allows scripting, as opposed to the other big data science language, Python, which is a scripting language that features a REPL. MATLAB (and I assume Octave, its Open Source imitator) is similar. Haskell and F#, academic languages used especially in math-heavy industries like Finance, also feature scripts and favor interactive exploration using a REPL. I suspect Julia is similar, but I have yet to take a look at it. All this is to point out that a Test Driven Approach in Data Science is a bit of a novelty. This is the approach that Thoughtful Machine Learning by Matthew Kirk takes.

And it's a pretty good idea. Personally I am a little more on the side of the REPL, but as a .NET developer primarily I don't have many options in that quarter. So, when I wanted to test out some machine learning algorithms, the first thing I did was create a Unit Test project and create a Test. I was excited then when the very next day I happened to see a Machine Learning book which is explicitly test driven!

I was somewhat disappointed. I thought that the initial introduction oversells TDD. I literally rolled my eyes several times while reading it. "Hypothesize, test, theorize could be called 'red-green-refactor' instead" claims the author on the 3rd page. Yeah... no. They could not be. There is nothing remotely similar about forming a hypothesis and creating a failing test; indeed, they are opposites. I would have thought the argument would have focused more on producing reproducible research or providing regressions when swapping algorithms. I don't recall these being touched on. A valuable part of the introduction was the list of risks in Machine Learning and a discussion of how to use automated tests to guard against these risks. It was good and I wish this section had been expanded.
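
As a rough illustration of the argument I expected, here is a minimal sketch in Python of tests that guard reproducibility and act as a regression check when an algorithm is swapped. It is not taken from the book (which uses Ruby). The toy data generator, the midpoint-threshold classifier, and every name in it are assumptions made up for the example.

```python
import random
import unittest


def make_data(seed, n=200):
    """Toy one-dimensional data: class 0 centred at 0.0, class 1 at 2.0."""
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    data = []
    for _ in range(n):
        label = rng.choice([0, 1])
        data.append((rng.gauss(2.0 * label, 0.5), label))
    return data


def train_threshold(data):
    """Stand-in 'algorithm': the midpoint between the two class means."""
    means = {}
    for label in (0, 1):
        xs = [x for x, y in data if y == label]
        means[label] = sum(xs) / len(xs)
    return (means[0] + means[1]) / 2


def accuracy(threshold, data):
    """Fraction of points whose side of the threshold matches the label."""
    hits = sum((x > threshold) == bool(y) for x, y in data)
    return hits / len(data)


class ClassifierRegressionTest(unittest.TestCase):
    def setUp(self):
        data = make_data(seed=42)
        self.train, self.test = data[:150], data[150:]

    def test_is_reproducible(self):
        # Same seed, same data: the experiment can be rerun exactly.
        self.assertEqual(make_data(seed=42), make_data(seed=42))

    def test_beats_majority_baseline(self):
        # Any replacement for train_threshold must still beat always
        # guessing the most common class on the held-out data.
        ones = sum(y for _, y in self.test)
        majority = max(ones, len(self.test) - ones) / len(self.test)
        threshold = train_threshold(self.train)
        self.assertGreater(accuracy(threshold, self.test), majority)
```

Saved to a file, this can be run with python -m unittest. The point is that the second test stays the same when the classifier behind it changes, which is where I think automated tests earn their keep in this kind of work.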

Next, the author is a bit touchy on the subject of Ruby. Most Machine Learning books use Python or R, but the author favors Ruby because of the great automated test abstractions. Fair enough. I don't have a lot of experience in Python or Ruby, but I will say this: I could understand 99% of the Python code in Toby Segaran's Programming Collective Intelligence instantly, but found most of the code in Thoughtful Machine Learning to be gibberish sprinkled with pipes. Because of this I mostly read this book from the perspective, as the author puts it, of the CTO or Business Analyst.

After the introductory chapter on TDD, there is an overview of Machine Learning algorithms. I thought it was a bit superficial and suffered from introducing terms and jargon before explaining them. "The curse of dimensionality" was thrown around a few times but not discussed or defined until a sidebar a chapter or so later. I wish the introduction was more detailed and explained the criteria for picking the algorithms detailed in the book. Why were tree techniques omitted, for instance?

The rest of the book covers algorithms. The chapters follow the pattern: introduce a technique, describe a problem, write some tests to try to solve it, give a summary. I have to say that the example problems really didn't capture my imagination. When discussing k-Nearest Neighbor, the example is detecting beards and glasses in photographs. Ok... As a comparison, Segaran's book used eBay's web API as a source for price prediction. Which do you find more helpful?

In the Naive Bayes Classifier chapter, the clichéd spam detector is used. Which is fine, but none of this is very original. Why try to compete with Paul Graham's classic A Plan for Spam? Now, I understand that the primary goal of the book is teaching and that this is the canonical example, but it would be nice to see something more creative.
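
For readers who have not seen the canonical example, here is a minimal sketch of a naive Bayes spam filter in Python. It is neither the book's Ruby code nor Paul Graham's method; it is a plain multinomial naive Bayes with add-one smoothing, and the class name, the whitespace tokenizer, and the training messages are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Hypothetical minimal classifier: word counts per label, log scores."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> documents seen
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            # Start from the log prior, then add one log likelihood per word.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing the score.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


nb = NaiveBayes()
nb.train("win cash prize now", "spam")
nb.train("cheap prize click now", "spam")
nb.train("meeting agenda for monday", "ham")
nb.train("lunch on monday", "ham")
print(nb.classify("claim your prize now"))
print(nb.classify("agenda for lunch"))
```

Working in log space avoids multiplying many tiny probabilities together, which is the one design choice here that matters once messages get long.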

It was nice to see a popular Machine Learning book that covers Hidden Markov Models. I admit that I need to revisit this chapter a few more times, because I haven't fully internalized it, but this is a very interesting technique and I wish there were more popular treatments.

Finally, I read this book primarily on a Kindle (full disclosure: I got a free review eBook from the publisher). It isn't great. There is no Table of Contents for some reason and the formatting isn't as clear. When I opened the PDF, I was surprised to see how beautifully it was laid out.

In summary, I was a bit disappointed by the book but I am glad I looked at it. Get it if you are a Rubyist, really into the "TDD way", and want a fairly high-level view. Otherwise, I would recommend Toby Segaran's book. Alternatively, check out free online courses from Coursera and Udacity and edX on Machine Learning. (Or see the excellent resources on FastML.com.)

Wednesday, October 29, 2014

Review: "Exam Ref 70-486: Developing ASP.NET MVC 4 Web Applications"

Last year I took and passed Microsoft's Exam 70-480 (Programming HTML5/CSS3/JavaScript).  I perhaps over-studied, living and breathing JavaScript for a couple months.  This was prior to any published books on the test so I had to find my own way.  In any case, I found a process that worked well.  For each skill on the Exam's website, I did the following.
  1. Look around the internet for lists of study materials (though these were often wrong or outdated)
  2. Find several articles or book chapters.  
  3. Brainstorm projects and exercises that would apply to the skill
Once this was done, I had a giant study list that I could work off of, and gauge the speed of my learning.  This worked well but it had the drawback that I spent almost as much time identifying study materials and thinking up exercises, as I did studying and practicing.  Further, the material was uneven.

This is where a book like "Exam Ref 70-486: Developing ASP.NET MVC 4 Web Applications" by William Penberthy comes in handy.  It already gathers together in one place discussions on each subject the exam covers, as well as exercises to work through, and links to further information.

The chapters are concise and relatively well-written, and since I have a good bit of experience with MVC, I could tell that the author knows his subject.

In the end I decided not to pursue taking the exam (my interests have taken me in different directions), but if I ever decide to take another Microsoft Exam, I'll definitely consider getting a book from this series.