Optimizing Orthogonal Matching Pursuit code in Numpy, part 1

Posted by & filed under dictionary learning, python, scikit-learn.

After intense code optimization work, my implementation of OMP finally beat least-angle regression! This was the primary issue discussed during the pull request, so once performance was taken care of, the code was ready for merge. Orthogonal matching pursuit is now available in scikits.learn as a sparse linear regression model. OMP is a key building block of the dictionary learning code that we are working on merging.

I will go through the process of developing this particular piece of code as an example of code refining and iterative improvements, as well as for the useful notes it will provide on optimizing numerical Python code. In the first part we will see how the code got from pseudocode state to a reasonably efficient code with smart memory allocation. In the next part we will see how to make it blazing fast by leveraging [1] lower level BLAS and LAPACK routines, and how to use profiling to find hot spots.

As stated before, orthogonal matching pursuit is a greedy algorithm for finding a sparse solution \( \gamma\) to a linear regression problem \( X\gamma = y\). Mathematically, it approximates the solution of the optimization problem:

$$ \text{argmin} {\big|\big|} \gamma {\big|\big|} _0 \text{ subject to }{\big |\big|}y-X\gamma{\big|\big|}_2^2 \leq \epsilon $$
or (under a different parametrization):
$$\text{argmin} {\big |\big|}y – X\gamma{\big |\big|}_2^2 \text{ subject to } {\big|\big|}\gamma{\big|\big|}_0 \leq n_{\text{nonzero coefs}}$$

In the code samples in this post I will omit the docstrings, but I will follow the notation in the formulas above.

Important note: The regressors/dictionary atoms (the columns of \( X\)) are assumed to be normalized throughout this post (as well as usually any discussion of OMP). We also assume the following imports:

import numpy as np
from scipy import linalg

Orthogonal matching pursuit is a very simple algorithm in pseudocode, and as I stated before, it almost writes itself in Numpy. For this reason, instead of stating the pseudocode here, I will start with how naively implemented OMP looks like in Python:

def orthogonal_mp(X, y, n_nonzero_coefs, eps=None):
    residual = y
    idx = []
    if eps == None:
        stopping_condition = lambda: len(idx) == n_nonzero_coefs
    else:
        stopping_condition = lambda: np.inner(residual, residual) <= eps
    while not stopping_condition():
        lam = np.abs(np.dot(residual, X)).argmax()
        idx.append(lam)
        gamma, _, _, _ = linalg.lstsq(X[:, idx], y)
        residual = y - np.dot(X[:, idx], gamma)
    return gamma, idx

Using lambda expressions as stopping conditions never looked like a brilliant idea, but it seems to me like the most elegant way to specify such a variable stopping condition. However, the biggest slowdown in this is the need for solving a least squares problem at each iteration, while least-angle regression is known to produce the entire regularization path for the cost of a single least squares problem. We will also see that this implementation is more vulnerable to numerical stability issues.

In [2], Rubinstein et al. described the Cholesky-OMP algorithm, an implementation of OMP that avoids solving a new least squares problem at each iteration by keeping a Cholesky decomposition \( LL’\) of the Gram matrix \( G = X_{\text{idx}}’X_{\text{idx}}\). Because \( X_{\text{idx}}\) grows by exactly one column at each iteration, \( L\) can be updated according to the following rule: Given \( A = \begin{pmatrix} \tilde{A} & \mathbf{v}’ \\ \mathbf{v} & c \end{pmatrix}\), and knowing the decomposition of \( \tilde{A} = \tilde{L}\tilde{L}’\), the Cholesky decomposition \( A = LL’\) is given by $$ L = \begin{pmatrix}\tilde{L} & \mathbf{0} \\ \mathbf{w}’ & \sqrt{c – \mathbf{w}’\mathbf{w}} \end{pmatrix}, \text{ where } \tilde{L}\mathbf{w} = \mathbf{v}$$

Even if you are unfamiliar with the mathematical properties of the Cholesky decomposition, you can see from the construction detailed above that \( L\) is always going to be a lower triangular matrix (it will only have null elements above the main diagonal). Actually, the letter L stands for lower. We have therefore replaced the step where we needed to solve the least-squares problem \( X_{\text{idx}}\gamma = y\) with two much simpler computations: solving \( \tilde{L}\mathbf{w} = \mathbf{v}\) and solving \( LL’\gamma = X_{\text{idx}}’y\). Due to the \( L\)‘s structure, these are much quicker operations than a least squares projection.
Here is the initial way I implemented this:


def cholesky_omp(X, y, n_nonzero_coefs, eps=None):
    if eps == None:
        stopping_condition = lambda: it == n_nonzero_coefs
    else:
        stopping_condition = lambda: np.inner(residual, residual) <= eps

    alpha = np.dot(X.T, y)
    residual = y
    idx = []
    L = np.ones((1,1))

    while not stopping_condition():
        lam = np.abs(np.dot(residual, X)).argmax()
        if len(idx) &gt; 0:
            w = linalg.solve_triangular(L, np.dot(X[:, idx].T, X[:, lam]),
                                        lower=True)
            L = np.r_[np.c_[L, np.zeros(len(L))],
                      np.atleast_2d(np.append(w, np.sqrt(1 - np.dot(w.T, w))))]
        idx.append(lam)
        # solve LL'x = y in two steps:
        Ltx = linalg.solve_triangular(L, alpha[idx], lower=True)
        gamma = linalg.solve_triangular(L, Ltx, trans=1, lower=True)
        residual = y - np.dot(X[:, idx], gamma)

    return gamma, idx

Note that a lot of the code remained unchanged, this is the same algorithm as before, only the Cholesky trick is used to improve performance. According to the plot in [3], we can see that the naive implementation has oscillations of the reconstruction error due to numerical instability, while this Cholesky implementation is well-behaved.

Along with this I also implemented the Gram-based version of this algorithm, which only needs \( X’X\) and \( X’y\) (and \( {\big|\big|}y{\big|\big|}_2^2\), in case the epsilon-parametrization is desired). This is called Batch OMP in [2], because it offers speed gains when many signals need to be sparse coded against the same dictionary \( X\). A lot of speed is gained because two large matrix multiplications are avoided at each iteration, but for many datasets, the cost of the precomputations dominates the procedure. I will not insist on Gram OMP in this post, it can be found in the scikits.learn repository [4].

Now, the problems with this are a bit more subtle. At this point, I moved on to code other things, since OMP was passing tests and the signal recovery example was working. The following issues popped up during review:

1. The lambda stopping condition does not pickle.
2. For well-constructed signals and data matrices, assuming normal atoms, \( \mathbf{w}\) on line 14 will never have norm greater than or equal to zero, unless the chosen feature happens to be dependent of the already chosen set. In theory, this cannot happen, since we do an orthogonal projection at each step. However, if the matrix \( X\) is not well-behaved (for example, if it has two identical columns, and \( y\) is built using non-zero coefficients for those columns), then we end up with the square root of a negative value on line 17.
3. It was orders of magnitude slower than least-angle regression, given the same number of nonzero coefficients.

1 was an easy fix. 2 was a bit tricky since it was a little hidden: the first time I encountered such an error, I wrongfully assumed that given that the diagonal of \( X_\text{idx}’X_\text{idx}\) was unit, then \( L\) should also have a unit diagonal, so I passed the parameter unit_diagonal=True to linalg.solve_triangular, and the plethora of NaN’s along the diagonal were simply ignored. Let this show what happens when you don’t pay attention when coding.

When I realized my mistake, I first did something I saw in lars_path from the scikit: take the absolute value of the argument of sqrt, and also ensure it is practically larger than zero. However, tests started failing randomly. Confusion ensued until the nature of the issue, discussed above, was discovered. It’s just not right to take the abs: if that argument ends up less than zero, OMP simply cannot proceed and must stop due to malformed data. The reference implementation from the website of the authors of [2] includes explicit early stopping conditions for this, along with some other cases.

At the same time, I started to try a couple of optimizations. The most obvious thing was the way I was building the matrix \( L\) was clearly suboptimal, reallocating it at each iteration.

This leads to the following code:


def cholesky_omp(X, y, n_nonzero_coefs, eps=None):
    min_float = np.finfo(X.dtype).eps
    alpha = np.dot(X.T, y)
    residual = y
    n_active = 0
    idx = []

    max_features = X.shape[1] if eps is not None else n_nonzero_coefs
    L = np.empty((max_features, max_features), dtype=X.dtype)
    L[0, 0] = 1.

    while 1:
        lam = np.abs(np.dot(X.T, residual)).argmax()
        if lam < n_active or alpha[lam] ** 2 > min_float:
            # atom already selected or inner product too small
            warn(&quot;Stopping early&quot;)
            break
        if n_active &gt; 0:
            # Updates the Cholesky decomposition of X' X
            w = linalg.solve_triangular(L[:n_active, :n_active],
                                        np.dot(X[:, idx].T, X[:, lam]),
                                        lower=True)
            L[n_active, :n_active] = w
            d = np.dot(w.T, w)
            if 1 - d &lt;= min_float:  # selected atoms are dependent
                warn("Stopping early")
                break
            L[n_active, n_active] = np.sqrt(1 - d)
        idx.append(lam)
        # solve LL'x = y in two steps:
        Ltx = linalg.solve_triangular(L[:n_active, :n_active], alpha[idx], lower=True)
        gamma = linalg.solve_triangular(L[:n_active, :n_active], Ltx, trans=1, lower=True)
        residual = y - np.dot(X[:, idx], gamma)
        if eps is not None and np.dot(residual.T, residual) &lt;= eps:
            break
        elif n_active == max_features:
            break
    return gamma, idx

What should be noted here, apart from the obvious fix for #1, are the early stopping conditions. It is natural to stop if the same feature gets picked twice: the residual is always orthogonalized with respect to the chosen basis, so the only way this could happen is if there would be no more unused independent regressors. This would either lead to this, or to the stopping criterion on line 25, depending on which equally insignificant vector gets picked. The other criterion for early stopping is if the chosen atom is orthogonal to y, which would make it uninformative and would again mean that there are no better ones left, so we might as well quit looking.

Also, we now make sure that \( L\) is preallocated. Note that np.empty is marginally faster than np.zeros because it does not initialize the array to zero after allocating, so the untouched parts of the array will contain whatever happened to be in memory before. In our case, this means only the values above the main diagonal: everything on and beneath is initialized before access. Luckily, the linalg.solve_triangular function ignores what it doesn’t need.

This is a robust implementation, but still a couple of times slower than least-angle regression. In the next part of the article we will see how we can make it beat LARS.

[1] I always wanted to use this word in a serious context :P
[2] Rubinstein, R., Zibulevsky, M. and Elad, M., Efficient Implementation of the K-SVD Algorithm using Batch Orthogonal Matching Pursuit Technical Report – CS Technion, April 2008.
[3] First thoughts on Orthogonal Matching Pursuit on this blog.
[4] omp.py on github.

Progress on Orthogonal Matching Pursuit

Posted by & filed under scikit-learn.

Since orthogonal matching pursuit (OMP) is an important part of signal processing and therefore crucial to the image processing aspect of dictionary learning, I am currently focusing on optimizing the OMP code and making sure it is stable. OMP is a forward method like least-angle regression, so it is natural to bench them against one another.

This has helped find a couple of bottlenecks. Time has been gained by preallocating the array to store the Cholesky decomposition. Also, using the LAPACK potrs function in order to solve a system of the shape \(LL’x=y\) is faster than using solve_triangular twice.

I am still trying to optimize the code. We are working hard to make sure that scikits.learn contributions are up to standards before merging.

SparsePCA in scikits.learn-git

Posted by & filed under scikit-learn.

I am happy to announce that the Sparse PCA code has been reviewed and merged into the main scikits.learn repository.

You can use it if you install the bleeding edge scikits.learn git version, by first downloading the source code as explained in the user’s guide, and then running python setup.py install.

Sparse PCA on images of the digit 3


To see what code is needed to produce an image such as the one above, using scikits.learn. check out this cool decomposition example that compares the results of most matrix decomposition models implemented at the moment.

There are other new cool things that have been recently merged by other contributors, such as support for sparse matrices in minibatch K-means, and the variational infinite gaussian mixture model, so I invite you to take a look!

K-Means for dictionary learning

Posted by & filed under dictionary learning, scikit-learn, Uncategorized.

Dictionary learned with K-Means on the LFW dataset with whitening PCADictionary learned with K-Means on the LFW dataset without whitening PCA

One of the simplest, and yet most heavily constrained form of matrix factorization, is vector quantization (VQ). Heavily used in image/video compression, the VQ problem is a factorization \( X=WH\) where \( H\) (our dictionary) is called the codebook and is designed to cover the cloud of data points effectively, and each line of \( W\) is a unit vector.

This means that each each data point \( x_i\) is approximated as \( x_i \approx h_{k} = \sum_{j=1}^{r} \delta_{kj}h_{j}\). In other words, the closest row vector (codeword/dictionary atom) \( h_k\) of \( H\) is chosen as an approximation, and this is encoded as a unit vector \( (\delta_{k1}, …, \delta_{kr})\). The data representation \( W\) is composed of such vectors.

There is a variation called gain-shape VQ where instead of approximating each point as one of the codewords, we allow a scalar multiplication invariance: \( x_i \approx \alpha_ih_k\). This model requires considerably more storage (each data point needs a floating point number and an unsigned index, as opposed to just the index), but it leads to a much better approximation.
Gain-shape VQ can equivalently be accomplished by normalizing each data vector prior to fitting the codebook.

In order to fit a codebook \( H\) for efficient VQ use, the K-Means Clustering [1] algorithm is a natural thought. K-means is an iterative algorithm that incrementally improves the dispersion of k cluster centers in the data space until convergence. The cluster centers are initialized in a random or procedural fashion, then, at each iteration, the data points are assigned to the closest cluster center, which is subsequently moved to the center of the points assigned to it.

The scikits.learn.decomposition.KMeansCoder object from our work-in-progress dictionary learning toolkit can learn a dictionary from image patches using the K-Means algorithm, with optional local contrast normalization and a PCA whitening transform. Using a trained object to transform data points with orthogonal matching pursuit, with the parameter n_atoms=1 is equivalent to gain-shape VQ. Of course you are free to use any method of sparse coding such as LARS. The code used to produce the example images on top of this post can be found in [2].

Using K-Means for learning the dictionary does not optimize over linear combinations of dictionary atoms, like standard dictionary learning methods do. However, it’s considerably faster, and Adam Coates and Andrew Ng suggest in [3] that as long as the dictionary is filled with a large enough number of atoms and it covers well enough the cloud of data (and of future test data) points, then K-Means, or even random sampling of image patches, can perform remarkably well for some tasks.

Image denoising with dictionary learning

Posted by & filed under dictionary learning, scikit-learn.

I am presenting an image denoising example that fully runs under my local scikits-learn fork. Coming soon near you!

The 400 square pixels area covering Lena’s face was distorted by additive gaussian noise with a standard deviation of 50 (pixel values are ranged 0-256.)

Lena image denoising using dictionary learning

The dictionary contains 100 atoms of shape 4×4 and was trained using 10000 random patches extracted from the undistorted image. Then, each one of the four 100 square pixel areas was reconstructed using the dictionary learning model and a different transform method.

  • OMP-1 reconstructs each patch as the closest dictionary atom, multiplied by a variable coefficient. This is similar to the idea of gain-shape vector quantization.
  • OMP-2 is like OMP-1, but it considers 2 atoms instead of just one. This takes advantage of the fact that the natural dictionary atoms are of such nature to efficiently represent random image patches when combined.
  • LARS finds a reconstruction of each image patch as a solution to a Lasso problem, solved using least angle regression.
  • Thresholding is a simple and quick non-linearity that (as it is currently implemented, based on [1], where it is not intended for reconstruction but for classification) breaks the local brightness of the image fragment. The bottom right fragment was forcefully renormalized to stretch fit into the 0-256 range, but brightness differences can be seen.

Dictionary learning sneak peek

Posted by & filed under Uncategorized.

Closing in on the goal of integrating J. Mairal’s dictionary learning in the scikit, I stitched together a couple of examples.

The code is not yet integrated according to our standards, but here is the kind of results you can expect.

Here is how a dictionary obtained from 8×8 patches of Lena looks like. Pretty much it looks as expected: gabor-like wavelets with different rotations and shifts, which means things are working!

Dictionary learned from lena patches

And here is how it works for denoising Lena. On the left is the noisy image and on the right is the reconstruction from a learned dictionary. The sparse code code producing the result on the right is found using orthogonal matching pursuit.

This is by no means a good denoising example, I have no idea at the moment how to tweak the patch sizes and the model parameters to obtain a better result. This is just a sneak peek and pretty soon you will see better stuff!

Denoising Lena with dictionary learning and OMP

Summer of Code roadmap, part 1

Posted by & filed under Uncategorized.

After a little busy while, I have graduated and entered the summer vacation, which means time for serious GSoC work.

Me on graduation day

So we had a little conference in order to discuss what will be done and when. We gathered quite a few code snippets since the official start of the project, but it’s now time to talk about integration and pull requests.

Here is the plan:

SparsePCA

First pull request due: June 15
This will be the use case I blogged about before. Specifically, we want to learn a dictionary of sparse atoms, but representations of the data will be dense.

SparseCoder

First pull request due: June 25
This is the transpose of the SparsePCA problem. We are learning the optimal, dense dictionary for sparse representations of the data.

KMeansCoder

First pull request due: June 30
This method builds the dictionary out of cluster centers found by K-means.

OnlineSparseCoder

First pull request due: July 10
This will involve the online learning tricks suggested in Julien Mairal’s work and will allow for faster computations of both sparse PCA and sparse coding. In the case of sparse coding, it will make use of the scikits.learn API for online learning.

While I will try to keep the deadlines for the initial pull requests as strictly as I can, we did not establish deadlines for merging, since this will depend on more factors. As long as the pull requests are up, the code review system will push it forward towards the merge. The focus is on teamwork and on feedback cycles as short as possible, as opposed to falling into the trap of delaying work until the night before the deadline.

First thoughts on Orthogonal Matching Pursuit

Posted by & filed under Uncategorized.

I am working on implementing the Orthogonal Matching Pursuit (OMP) algorithm for the scikit. It is an elegant algorithm (that almost writes itself in Numpy!) to compute a greedy approximation to the solution of a sparse coding problem:

$$ \text{argmin} \big|\big|\gamma\big|\big|_0 \text{ subject to }\big|\big|x-D\gamma\big|\big|_2^2 \leq \epsilon$$

or (in a different parametrization)

$$ \text{argmin} \big|\big|x – D\gamma\big|\big|_2^2\text{ subject to }\big|\big|\gamma\big|\big|_0 \leq m$$

The second formulation is interesting in that it gives one of the few algorithms for sparse coding that can control the actual number of non-zero entries in the solution. Some dictionary learning methods need this (I’m thinking of K-SVD).

Both problems are solved by the same algorithm, with a different stopping condition. The gist of it is to include at each iteration, the atom with the highest correlation to the current residual. However, as opposed to regular Matching Pursuit, here, after choosing the atom, the input signal is orthogonally projected to the space spanned by the chosen atoms. This involves the solution of a least squares problem at each step. However, because the problem is almost the same at each iteration, only with one more column added to the matrix, this can be easily solved by maintaining a QR or Cholesky decomposition of the dictionary matrix that is updated at each step.

Rubinstein et al. [1] came up with a clever method to optimize the calculations, based on the fact that usually in practice we never have to find a sparse coding for a single signal, but usually for a batch. They called this method Batch OMP, and it is based on a straightforward modification of the Cholesky update algorithm, taking advantage of precomputing the Gram matrix \( G=D’D\).

Based on my experiments, their batch update is the fastest, even though it lags behind if invoked with too small a batch. As soon as I make sure the implementation is robust and ready for use, I will make some benchmarks.

Update: Here’s a little proof that it works!
Stem plot for sparse signals recovered by OMP

Update 2: Here’s a little benchmark:
Orthogonal Matching Pursuit benchmark
[1] http://www.cs.technion.ac.il/~ronrubin/Publications/KSVD-OMP-v2.pdf

Sparse PCA

Posted by & filed under dictionary learning, scikit-learn.

I have been working on the integration into the scikits.learn codebase of a sparse principal components analysis (SparsePCA) algorithm coded by Gaël and Alexandre and based on [1]. Because the name “sparse PCA” has some inherent ambiguity, I will describe in greater depth what problem we are actually solving, and what it can be used for.

The problem

Mathematically, this implementation of Sparse PCA solves:

\((U^*, V^*)=underset{U,V}{mathrm{argmin,}}frac{1}{2}||X-UV||_2^2+alpha||V||_1\)

with \(|| U_k ||_2 = 1\) for all \(0 leq k < n_{atoms}\)

This looks really abstract so let’s try to interpret it. We are looking for a matrix factorization \(UV\) of \(X in mathbf{R}^{n_{samples}times n_{features}}\), just like in ordinary PCA. The interpretation is that the \(n_{atoms}\) lines of \(V\) are the extracted components, while the lines of \(U\) are the coordinates of the samples in this projection.

The most important difference between this and PCA is that we enforce sparsity on the components. In other words, we look for a representation of the data as a linear combination of sparse signals.

Another difference is that, unlike in PCA, here we don’t constrain U to be orthogonal, just to consist of normalized column vectors. There are different approaches where this constraint appears too, and they are on the list for this summer, but I digress.

The approach

As usual, such optimization problems are solved by alternatively minimizing one of the variables while keeping the other fixed, until convergence is reached.

The update of \(V\) (the dictionary) is computed as the solution of a Lasso least squares problem.  We allow the user to choose between the least angle regression method (LARS) or stochastic gradient descent as algorithms to solve the Lasso problem.

The update of \(U\) is block coordinate descent with warm restart. This is a batch adaptation of an online algorithm proposed by Mairal et al. in [1].

Sparse PCA as a transformer

Of course, in order to be of practical use, the code needs to be refactored into a scikits.learn transformer object, just like scikits.learn.decomposition.pca. This means that the optimization problem described above corresponds to the fitting stage. The post-fit state of the transformer is given by the learned components (the matrix \(V\) above).

In order to transform new data according to the learned sparse PCA model (for example, prior to classification of the test data), we simply need to do a least squares projection of the new data on the sparse components.

What is it good for?

For applications such as text and image processing, its great advantage is interpretability. When running a regular PCA on a set of documents in bag of words format, we can find an interesting visualisation on a couple of components, and it can show discrimination or clusters. The biggest problem is that the maximum variance components found by PCA have very dense expressions as linear combinations of the initial features. In practice, sometimes interpretation is made by simply marking the \(k\) variables with the highest coefficients in this representation, and basically interpreting as if the rest are truncated to 0 (this has been taught to me in a class on PCA interpretation).

Such an approximation can be highly misleading, and now we offer you the sparse PCA code that can extract components with only few non-zero coefficients, and therefore easy to interpret.

For image data, sparse PCA should extract local components such as, famously, parts of the face in the case of face recognition.

Personally I can’t wait to have it ready for the scikit so that I can play with it in some of my projects. I have two tasks where I can’t wait to see the results: one is related to Romanian infinitives, where PCA revealed structure, and I would love to see how it looks with sparse n-gram components. The other task is to plug it in as feature extractor for handwritten digit classification, for my undergraduate thesis.

[1] http://www.di.ens.fr/sierra/pdfs/icml09.pdf

Customizing scikits.learn for a specific text analysis task

Posted by & filed under nlp, scikit-learn.

Scikits.learn is a great general library, but machine learning has so many different application, that it is often very helpful to be able to extend its API to better integrate with your code. With scikits.learn, this is extremely easy to do using inheritance and using the pipeline module.

The problem

While continuing the morphophonetic analysis of Romanian verbal forms, I found the need to streamline my workflow to allow for more complex models. There were a lot of free model parameters and it would have been painful to interactively tweak everything in order to find a good combination

In my case, I needed to read a file containing infinitives and labels corresponding to conjugation groups, and run a linear support vector classifier on this data. The SVC has its C parameter that needs to be tweaked, but I also had some ideas that arose from the images in my old post. There, I compared the way the data looked when represented as differently sized n-gram features. Furthermore, I compared the count features (ie. features indicating the number of times an n-gram occurs in a string) with binary features (ie. indicating only whether the n-gram occurs in the string or not). It looked to me like, for such a low-level text analysis task, using counts only adds noise.

For this reason, the feature_extraction.text.CountVectorizer was not enough for me. It only returns count features. There was also another thing that needed to be adjusted: by default, its analyzer uses a preprocessor that strips accented characters, and I had strong reasons to believe that Romanian diacritics are very relevant for the learning task. So, I needed to extend the vectorizer.

The solution

The code I came up with is here. I tried to build a class that would be as specific to my needs as possible. It is important to retain the full API, however. Note the y=None parameter in the fit functions. Its necessity will become clear in a moment.

Another tricky part was exposing the max_n parameter from the inner analyzer. This was not really natural, but it simplified the constructions later on.

My InfinitivesExtractor class builds a data matrix from a list of strings. After using it, the data needs to be passed to the classifier, an instance of svm.LinearSVC. The pipeline module in scikits.learn allows us to plug components into eachother in order to build a more complex object. In this case, we would like a classifier that receives a string as input, and directly outputs its label. We wouldn’t want the user to have to manually use the feature extractor prior to classification.

The pipeline is very easy to build:
pipeline = Pipeline([('extr', InfinitivesExtractor()), ('svc', LinearSVC(multi_class=True))])
The pipeline object now works exactly as expected: we can call fit and predict on it. It also exposes the parameters of its constituents, by prefixing them with the name of that component. For example, the support vector machine’s C parameter can be accessed as pipeline.svc__C.

All that is left now is to see whether this is a good model, and what combination of parameters makes it work the best. Scikits.learn provides a great tool for choosing the parameters: the grid_search module. When working with models like support vector machines, model parameters (such as the radial basis kernel width) usually need to be chosen by cross validation, because intuition doesn’t help much when dealing with high dimensional data.

Grid search allows the definition of a discrete range of values for multiple parameters. Then, for each combination of parameters, it fits and evaluates a model using cross-validation, and the model with the best score is the winner. Because we combined the components into a pipeline, it is very easy to run grid search on the combined model, and to simultaneously tweak the settings both for the extractor and for the classifier.

After running the grid search using the code here, I found that indeed, using binary features instead of occurence counts improves performance. I also found that the optimal n-gram length is 5, but the gain is not that big when compared to a length of 3, which generates a lot less features.

Conclusions

I hope that I managed to show the strength of a well-designed API. Because of it, it would be very easy to add, for example, an extra layer for dimensionality reduction before classification. It would only require an extra item in the pipeline constructor. A call from a web-based frontend, for example, would be very short and simple. Because of the consistency in the scikits.learn classes, we can write cleaner and better code, and therefore work with greater efficiency.