GSoC 2012 proposal: Need for scikit-learn speed

Posted by & filed under scikit-learn.

This summer I hope to be able to put in another full-time amount of effort into scikit-learn. After a successful Google Summer of Code project last year on dictionary learning, I now plan to do some low-level work. The title of my proposal is: “Need for scikit-learn speed” and, in a nutshell, will make the scikit go faster and will help it stay that way.

Scikit-learn has always enforced standards of quality that kept all implementations at a non-trivial level (i.e. faster than using the generic optimizers in scipy). However, not all modules are equal: some have received more attention for speed than others (for example the SGD classes). I intend to raise the bar towards a more uniform level.

Are you crazy, can you really do this?

Well, of course. This might not the usual GSoC proposal, but I can show how I can do it and how it’s easily quantifiable. Actually, a very important part of the work will be to make scikit-learn’s speed easily measurable.

As for the specific speed-ups, I have shown in the past that I can do algorithmic and memory layout optimizations in numerical code. There are parts in the scikit-learn that can benefit from such work: for example only recently Peter merged this pull request significantly improving SGDClassifier’s test time performance by switching the memory layout of the coefficients: they were laid out optimally for the training phase, not for the prediction phase.

There are certainly more opportunities for such speed improvements in the scikit. Of course there is a lot of code that can’t reasonably be made any faster (I have a feeling that SGDClassifier is at the moment such a case, but we can’t know for sure without heavy profiling). But generally there are many speed fixes that could weigh a lot: for example, a Cython implementation of the euclidean_distances function that is able to use preallocated memory will improve the performance of raw NearestNeighbours queries as well as of the KMeans and hierarchical clustering algorithms.

How will we be able to tell if you succeed?

A key part of the GSoC project is setting up a CI-style benchmark platform. The point is to be able to track how the speed of certain operations evolves in time. For such purposes, Wes McKinney developed the vbench project, introduced in this blog post. The goal is for every scikit-learn module to have several such benchmarks, for differently shaped and structured data.

Having such a benchmark suite available is the equivalent of a test suite, in terms of performance. It makes developers be extra conscious of the effect of their changes. It also makes it more fun to chase speed improvements, thanks to the positive reinforcement it gives.

There are some static benchmarks comparing the performance of scikit-learn algorithms with other well-known libraries in the ml-benchmarks project. It would be very helpful to have such a benchmark suite that automatically keeps up-to-date.

Side effects

The cool thing about such a project is that it should raise the overall quality of the scikit. The refactoring will lead to an increase in test coverage, because the low-coverage modules are expected to be less optimized as well. Also, the benchmarks will lead to well-backed summaries in the documentation, such as the one recently added in the clustering section.

Since the scikit is reaching a state where many well-known algorithms are available, the 1.0 release is slowly approaching. My Google Summer of Code project should bring the scikit significantly closer to that milestone.

Romanian people and coffee

Posted by & filed under Uncategorized.

So I got my hands of the Google N-gram data for the Romanian language. It’s noisy as hell, has some other subtle issues too, but here’s the first thing I noticed:

The Romanian word for coffee is cafea, and the more you crave it, the longer you pronunce the final a: I really need some cafeaaaa right now.

Thanks to Google, here are the numbers:

Distribution of the length of the final letter in the Romanian word for coffee.

Post scriptum: I hope you like the theme: I installed the sane matplotlib color scheme from Huy Nguyen.

Nash-Williams theorem on the Hamiltonian property of some regular graphs

Posted by & filed under Uncategorized.

I have been digging on the internet for the proof of this theorem for the last couple of days without success. The result was published by Sir Crispin Nash-Williams as Valency Sequences which force graphs to have Hamiltonian Circuits. Interim Rep, University of Waterloo Res Rep., Waterloo, Ontario, 1969. However, this old paper is unavailable online but I have a proof in some lecture notes from my class, that I want to share here.

Theorem. Let \(G=(V, E)\) be an \(n\)-regular graph with \(|V| = 2n + 1\). Then, \(G\) is Hamiltonian.

Proof. We first remark that \(n\) must be even, since $$\sum_{x \in V} d(x) = n(2n + 1) = 2|E|$$ We might try to apply Dirac’s theorem, which would give us a Hamiltonian cycle if \( \forall x \in V, d(x) \geq \frac{|V|}{2}\). But in the current case, \(\forall x \in V, d(x) = n < \frac{2n+1}{2}\).

So we force Dirac by adding an extra vertex \(w\) and connecting it to all \( x \in V \). In this new graph \(G'\), \(d(x) = n + 1 \forall x \in V\) and \(d(w) = 2n + 1\). Therefore we have a Hamiltonian cycle that passes through \(w\) and in which, \(w\) is adjacent to two vertices \(x\) and \(y \in V\). Therefore this cycle induces a Hamiltonian path in \(G\): $$P = [x = v_0, v_1, ..., v_{2n-1}, v_{2n}=y] $$

Suppose that \(G\) is not Hamiltonian. It follows that if \( v_0v_i \in E \), then \( v_{i-1}v_{2n} \notin E\) and also that if \( v_0v_i \notin E \), then \( v_{i-1}v_{2n} \in E\).

We have two cases. If \(v_0\) is adjacent to \(v_1, ..., v_n\) then it follows that \(v_{2n}\) is adjacent to \(v_n, v_{n+1}, ..., v_{2n-1}\), since it cannot be adjacent to any \(v_i, i < n\) without creating a Hamiltonian cycle. But in this case, in the graph induced by the first half \(G[\{v_0, v_1, ... v_n\}]\), \(v_n\) cannot be adjacent to all the others, since in \(G\) it has degree \(n\) and it already has \(2\) outgoing edges. So there is at least one vertex \(v_i, i < n\) that isn't adjacent to it, which means \(v_i\) is adjacent to some \(v_j, j > n\), thus forming a Hamiltonian cycle.

In the second case, we have a vertex \(v_i, 2 \leq i \leq 2n – 1\) such that \(v_0v_i \notin E\) and \(v_0v_{i+1} \in E\). This also means that \(v_{i-1}v_{2n} \in E\).

We therefore have a cycle of length \(2n\) in \(G\) that excludes \(v_i\). Let’s rename this cycle \(C=[y_1, y_2, ..., y_{2n}, y_1]\) and \(v_i=y_0\).

\(y_0\) cannot be adjacent to two consecutive vertices \(y_i\) and \(y_{i+1}\) because this will give a Hamiltonian cycle. But we know that \(deg(y_0) = n\). It follows that it’s adjacent to all of the even or odd numbered vertices. We assume the latter, without loss of generality. Let \(2k\) be some even index. Notice that we have \(\{y_0y_{2k-1}, y_0y_{2k+1}\} \subset E\) and we can follow the cycle \(C\) from \(y_{2k+1}\) all the way back to \(y_{2n-1}\) giving us a new cycle \(C’ = [y_1, y_2, ..., y_{2n-1}, y_0, y_{2k+1}, ..., y_{2n}, y_1]\) also of length \(2n\). So by repeating the same reasoning for every even vertex, by placing it in the middle and building a cycle around it, it follows that every even vertex is adjacent to all the odd vertices. But there are \(n+1\) even indices, so it follows that the degree of any odd vertex is at least \(n+1\), contradicting the initial conditions of the theorem. \(\square\)

Moving out

Posted by & filed under Uncategorized.

Happy new year, friends!

I’ve made a New Year’s resolution to build a better web presence and make better use of the domain that I previously only used for mail.

This has prompted me to move my blog over to http://blog.vene.ro which hopefully is shorter, better, faster and stronger!

The nasty bug crawling in my Orthogonal Matching Pursuit code

Posted by & filed under dictionary learning, scikit-learn.

A while back, Bob L. Sturm blogged about a similar implementation of OMP to the one in scikit-learn. Instead of using the Cholesky decomposition like we did, his Matlab code uses the QR decomposition, to a similar (or maybe even identical) outcome, in theory. So lucky that Alejandro pointed out to him the existence of the scikit-learn implementation, and that Bob’s code exposed a bug that all the test coverage didn’t catch! This plot should increase, certainly not decrease! Something is clearly wrong here.
OMP buggy phase transition, decreasing instead of increasing
Luckily we were able to find it and fix it very quickly. I have updated the old entries I wrote on the OMP optimizations, so they no longer include the bug. But I take this opportunity to explain what exactly went wrong.

A key part of the optimization was that slicing out arbitrary columns out of an array is slow when they are passed to BLAS functions like matrix multiplication. In order to make the most out of your code, the data should have a contiguous layout. We achieved this by swapping active dictionary atoms (columns) to the beginning of the array.

Something that can happen, but won’t happen very often, is that after an atom is selected as active, the atom that takes its place after swapping needs to be selected. This is rare because dictionaries have many columns, out of which only very very few will be active. But when it happens, because the code didn’t keep track of swapped indices, the corresponding coefficient of the solution would get updated twice, leading to more zero entries than we should have. A keen eye could have noticed that the first `n_nonzero_coefs` entries in OMP solution vectors were never non-zero. But alas, my eye was not a keen one at all.

In other words, the following test (that was written after the bug was found, unfortunately) was failing:

def test_swapped_regressors():
    gamma = np.zeros(n_features)
    # X[:, 21] should be selected first, then X[:, 0] selected second,
    # which will take X[:, 21]'s place in case the algorithm does
    # column swapping for optimization (which is the case at the moment)
    gamma[21] = 1.0
    gamma[0] = 0.5
    new_y = np.dot(X, gamma)
    new_Xy = np.dot(X.T, new_y)
    gamma_hat = orthogonal_mp(X, new_y, 2)
    gamma_hat_gram = orthogonal_mp_gram(G, new_Xy, 2)
    # active indices should be [0, 21], but prior to the bugfix
    # the algorithm would update only [21] but twice
    assert_equal(np.flatnonzero(gamma_hat), [0, 21])
    assert_equal(np.flatnonzero(gamma_hat_gram), [0, 21])

Note that this bug has been fixed for a while, but I didn’t get the free time to write this post until now. Good news is: we fixed it, and did so very quickly after the report. So you can still trust me, I guess!

Sampling Gamma random variates through the ratio-of-uniforms method

Posted by & filed under python.

One year ago I had the chance to take a class on Monte Carlo simulation with prof. Ion Văduva, and my assignment for the class was to implement exactly what it says in the title of the blog post. I am going to walk you through the idea behind this.

General formulation

The ratio-of-uniforms is a method that can be applied to many density functions. Essentially, given a density function over \( \mathbb{R}^m\), \( f(x) = \frac{h(x)}{H}\) where \(H\) is a normalization constant (ie. \( h(x) \geq 0\), \( H = \int h(x)dX\)). Given a parameter \( c > 0 \) and a parametrization \(\phi\) from \( \mathbb{R}^{m+1}\) to \( \mathbb{R}^{m}\) expressed as: $$ \phi(v_0, v_1, …, v_m) = \left ( \frac{v_1}{v_0^c}, \frac{v_2}{v_0^c}, …, \frac{v_m}{v_0^c} \right )$$
Define the set \( \mathcal{C} = \{\mathbf{v} \big | \gamma(\mathbf{v}) \leq 0, v_0 > 0\} \in \mathbb{R}^{m+1}\) where
$$\gamma(\mathbf{v}) = \log v_0 – \frac{\log h(\phi(v_0, v_1, …, v_m))}{mc + 1}$$ If \( \mathcal{C}\) is bounded and we sample a uniform vector \( \mathbf{V} \sim \text{Uniform}(\mathcal{C})\) then \( \phi(\mathbf{V}) \sim f(x)\). Also note that the measure (volume) of the set \( \mathcal{C}\) is \( \frac{H}{mc + 1}\). I do not have any references for the proof, except for a book in Romanian, but if you are interested, just leave me a comment and I’ll do a follow-up post with the proofs.

Univariate scenario

For the univariate case, all the above simplifies to $$ \mathcal{C} = \left \{(u, v) \Big | 0 < u < \sqrt
{h\left (\frac{v}{u^c}\right )} \right \} $$. We generate \( (U, V) \sim \text{Uniform}(\mathcal{C})\) and take \( \frac{V}{U^c} \sim f(x)\).
Since we are looking at the (univariate) Gamma distribution, described by: $$ f(x; \nu, \theta) = \frac{x^{\mu - 1} \exp(\frac{-x}{\theta})}{\theta^k\Gamma(k)}$$ \( \nu\) is the shape parameter and \( \theta\) is the scale parameter.
But because of the property that if \( X \sim \text{Gamma}(\nu, \theta)\), then for any \( k > 0\), \( kX \sim \text{Gamma}(\nu, k\theta)\), we conclude that we can fix \( \theta\) to 1 without loss of generality. Replacing in the style of the definition in the previous section, we have \( h(x; \nu) = x^{\nu-1}e^{-x}\) and \( H_\nu = \Gamma(\nu)\).
This allows us to compute the equation of the boundary of the set \( \mathcal{C}\) which ends up being described by \(\gamma(u, v) = \log{u} – \frac{\nu – 1}{c + 1} \log{\left(\frac{v}{u^c}\right)} + \frac{1}{c+1} \frac{v}{u^c}\). For visualisation purposes, here is how it would look like for \( \nu=6, c=1\) (plotted using Wolfram Alpha):

Sampling algorithm

In order to uniformly sample from this set, we can apply basic rejection sampling: just uniformly sample from a rectangular region surrounding the set, and reject the points that do not satisfy the condition. In order to do this as efficiently as possible, we need to compute the minimal bounding box, which can be done by solving a couple of optimization problems using Lagrange multipliers and the KKT conditions. Also by looking closely at the image, you can see that the lower left corner is exactly the origin: this turns out not to be a coincidence. I won’t go into detail here, but here are the bounds I derived:
$$ 0 < u < (\nu - 1)^\frac{\nu - 1}{c + 1} e ^ {-\frac{\nu - 1}{c + 1}} \text{ and } 0< v < \left(\frac{c\nu + 1}{c}\right)^{\frac{c\nu + 1}{c + 1}} e ^ {- \frac {c\nu + 1}{c+1}}$$

The probability of acceptance (which can be seen as the efficiency) of the rejection sampling method is given by the ratio of the areas of the set \( \mathcal{C}\) and the bounding box. The larger this probability, the less points we throw away and the more efficient the algorithm is. Using the values derived above, this probability is: $$ p(\nu, c) = \frac{\Gamma(\nu)e^{\nu}}{(c+1) (\nu - 1)^{\frac{\nu - 1}{c + 1}} \left(\frac{c\nu + 1}{c}\right)^{\frac{c\nu + 1}{c + 1}}}$$

Personally I got stumped here. The idea would be to determine the ideal \( c\) for a given \( \nu\) in order to maximize the probability, but I didn't manage to do it (I leave it as an exercise for the reader ;) ). Anyway, this is enough to proceed with an implementation, so I'm gonna give the Python code for it. Note that I used the name k for the shape parameter instead of \( \nu\). Also note that the case when \( 0 < \nu < 1\) needed to be treated separately, which I did using the following property: Let \( \nu \in (0, 1)\). If \( X' \sim \text{Gamma}(1+\nu, 1), U \sim \text{Uniform}(0, 1)\) then $$ X = X' \cdot \sqrt[\nu]{U} \sim \text{Gamma}(\nu, 1)$$ For a proof of this fact, see [1], which is a great article on generating Gamma variates.

Implementation

from import numpy as np

def _cond(u, v, k, c):
    """Identity function describing the acceptance region"""
    x = v / u ** c
    return (c + 1) * np.log(u) <= (k - 1) * np.log(x) - x

def vn_standard_gamma(k, c=1.0, rng=np.random):
    """Generates a single standard gamma random variate"""
    if k <= 0:
        raise ValueError("Gamma shape should be positive")
    elif k < 1:
        return vn_standard_gamma(1 + k, c, rng) * rng.uniform() ** (1 / k)
    elif k == 1:
        return rng.standard_exponential()
    else:
        a, b = get_bounds(k, c)
        while True:
            u, v = rng.uniform(0, a), rng.uniform(0, b)
            if _cond(u, v, k, c):
                break;
        return v / u ** c

def vn_gamma(k, t, shape=1, c=1.0, rng=np.random):
    """Vectorized function to generate multiple gamma variates"""
    generator = lambda x: t * vn_standard_gamma(k, c, rng)
    generator = np.vectorize(generator)
    return generator(np.empty(shape))

def get_bounds(k, c=1.0):
    """Computes the minimal upper bounds surrounding the acceptance region"""
    a = ((k - 1) / np.e) ** ((k - 1) / (c + 1))
    b = ((c * k + 1) / (c * np.e)) ** ((c * k + 1) / (c + 1))
    return a, b

def prob_acc(k, c=1.0):
    """Calculates the probability of acceptance for the given parameters"""
    from scipy.special import gamma
    a, b = get_bounds(k, c)
    return gamma(k) / ((c + 1) * a * b)

Results

And of course I should show you that it works. Here are some histograms for various values of \( \nu\), with the theoretical density plotted in dotted red, after sampling \( 10^5\) values. The y-axis is the frequency (sorry for labeling in Romanian), and for the red dotted line it can be interpreted as the theoretical probability. You can clearly see the goodness of fit is excellent.

[1]: George Marsaglia and Wai Wan Tsang. 1998. The Monty Python method for generating random variables. ACM Trans. Math. Softw. 24, 3 (September 1998), 341-350. DOI=10.1145/292395.292453

RANLP 2011 in Hissar, BG

Posted by & filed under conferences, nlp.

Last week was marked by the international RANLP (Recent Advances in Natural Language Processing) conference, taking place in a nice spa in Hissar, Bulgaria. The excellent folks from the computational linguistics group at the University of Wolverhampton were behind it, together with the Institute of Information and Communication Technologies from the Bulgarian Academy of Sciences.

A fountain with warm mineral spring water in Hissarya.
Picture by Miranda


I must begin by thanking them: the organization was impeccable! I’m not sure, but I think that at one point Ivelina was even running around buying routers to improve wifi coverage (which is already spectacular in Bulgaria — I’ve received reports from Miranda that you can get wifi in the mountains!)

The schedule was busy, with three tracks going in parallel, in order to cover a wide range of topics in computational linguistics. The student workshop should also be noted for the excellent quality of the works there.

Of course it would be infeasible to write about all the great people I met and their high quality work. And if I were to write about all the fun we had, it would probably make this post look unprofessional :) . This doesn’t mean I forgot about any of you, and as soon as I get the chance to work on something related, I will most certainly write about it, and you.

So, if I would have to summarize the trends and the ideas stated during the conference and especially during the keynotes, I would say:

  • When talking about word sense disambiguation, it’s wrong to speak about the different meanings of a word, but rather about the potential a word has for bringing a certain meaning in a certain context. See Patrick HanksCorpus Pattern Analysis. Without something like this, to have good WSD you need to heavily adjust the overlapping meanings from a Wordnet-style ontology.
  • Certain relations, such as temporal and spacial ones, can naturally be modeled by complex domain-specific logics (see Inderjeet Mani‘s new book, Interpreting Motion: Grounded Representations for Spatial Language, that is due for publishing). But these only appear in a small subset of human communication. The attempt to map human language to a complete logic in which to do general-purpose inference seems futile: Ido Dagan suggests textual entailment: reasoning directly in natural language, and only abstracting away to a formal logic system when need arises.
  • If you have a large enough sample of n-gram frequency data, you can eventually beat the performance you can get with a limited amount of labeled data, and most importantly: it generalizes much better when going out of the domain you trained on. Apparently the best tool for this at the moment is the Google n-gram data, which has some limitations. In time, we can easily extend this data by huge amounts by mining n-grams from Wikipedia (which allegedly has a higher count of distinct n-grams than the Google dataset), and more importantly, by aligning multi-language data, making use of transliterations and cognate identification.

Please note that I may be (or most probably, am) ignorant of older instances of similar ideas, and I may have misunderstood certain claims. Please feel free to discuss in the comments whether you think I forgot about something important, or whether I am plain wrong about something. In particular, I seem to have been completely ignorant of the existence of the Google n-gram data, which has been around for quite a while, so I must have missed other important things as well.

Take care, kind readers, and express your opinion!
V

Dictionary learning in scikit-learn 0.9

Posted by & filed under dictionary learning, scikit-learn.

Thanks to Olivier, Gaël and Alex, who reviewed the code heavily the last couple of days, and with apologies for my lack of activity during a sequence of conferences, Dictionary learning has officially been merged into scikit-learn master, and just in time for the new scikit-learn 0.9 release. Here are some glimpses of the examples you can run for yourself:

Dictionary learned from Lena patches Noisy image for denoising Image denoising with Dictionary learning and Orthogonal matching pursuit

The stars of this new release are: the manifold learning module by Jake Vanderplas and Fabian Pedregosa, the Dirichlet process gaussian mixture model by Alexandre Passos, and many others, as you can see from the development changelog (as soon as the release is made, I will update this post with permanent links).

The release is due tomorrow. I will also be in charge with building the Windows installers for this release, let’s hope I do a good job and you can think of me and smile when installing!

Long overdue update. EuroScipy and SSLST 2011

Posted by & filed under Uncategorized.

Anybody reading my blog should have expected me to blog about the end of my GSoC. Sorry to disappoint, but I simply did not experience anything similar to an ending. On the contrary, I feel like things have barely started. Also, I apologize for one of the few posts here without pretty pictures! :)

For the last two weeks, I’ve been traveling. I attended the EuroScipy conference thanks to Fabian, who offered me a place to sleep during the week. We sprinted hard, we discussed tricky APIs, we drank a lot of coffee, beer, and ate well in lovely Paris. It was great to meet all of the celebrities, the people who keep the scientific Python globe turning.

Many thanks to Gael and Emmanuelle, who worked very, very hard on organizing everything, so they weren’t around and I didn’t get to say goodbye when I ran to catch my plane last Sunday.

I was in a hurry, heading to Tarragona, a beautiful city on the Catalan coast, where the public university organized the 2011 summer school in linguistics and speech technologies (SSLST). This was a great opportunity to meet many fellow young researchers working in computational linguistics. I will not go into details now, because I plan expand on this, but I would like to state a couple of things. Firsty, even though NLP seems to be mostly a Java-dominated affair (note for example Stanford’s NLP toolkit and Sheffield’s GATE), the computational linguistics and psycholinguistics research center (CLiPS) at the University of Antwerp actually briefly manifested its devotion to Python and NLTK via its research director, Walter Daelemans.

It was good to see a little love for Python in this field. NLTK is very underrepresented in the SciPy community, I couldn’t find anybody at the EuroScipy conference knowing too much about it or about the people behind it.

Another lab that has done a lot of cool work is Cental at the Catholic University of Louvain, and they also use Python for natural language processing. Maybe in the coming years, we will see a Python for Computational Linguistics sattelite, along with Physics and Neuroscience. Doesn’t it sound more fun? :P

Secondly, I wish SSLST were organized by someone like Gael! As the discussion at dinner regarding who will organize next year’s EuroScipy went, it is imperative that the organizers be actively involved in the community, and generally passionate about it. Even though I’m comparing apples and oranges, Carlos Martin-Vide behaved in this context like a old, tired, emotionless academic, not taking into account even lunch breaks for the whole group, not to mention any sort of getting together or even a group photo (which, alas, we were not able to take, apart from small groups.) They said it couldn’t be done. Of course it could, they just didn’t want it hard enough.

Finally, before signing off, I would like to announce that because the Romanian Ministry of Education failed to specify the allocated number of public positions for masters’ programmes, the admission exam at the University of Bucharest will be delayed by a couple of weeks. Luckily, this will allow me to attend RANLP 2011 in Hissar, Bulgaria a week from now, where I will present my poster entitled:
“Can alternations be learned? A machine learning approach to Romanian verb conjugation” by Liviu P. Dinu, Emil Ionescu, Vlad Niculae and Octavia-Maria Sulea. See you in Hissar!

Optimizing Orthogonal Matching Pursuit code in Numpy, part 2

Posted by & filed under dictionary learning, python, scikit-learn.

EDIT: There was a bug in the final version of the code presented here. It is fixed now, for its backstory, check out my blog post on it.

When we last saw our hero, he was fighting with the dreaded implementation of least-angle regression, knowing full well that it was his destiny to be faster.

We had come up with a more robust implementation, catching malformed cases that would have broken the naive implementation, and also it was orders of magnitude faster than said implementation. However, our benchmark [1] showed that it was a couple of times slower than least-angle regression.

By poking around the scikits.learn codebase, I noticed that there is a triangular system solver in scikits.learn.utils.arrayfuncs. Unlike the scipy.linalg one, this one only works with lower triangular arrays, and it forcefully overwrites b. Even though if weren’t faster, it should still be used: scikits.learn aims to be as backwards-compatible with SciPy as possible, and linalg.solve_triangular was added in 0.9.0. Anyway, let’s just see whether it’s faster:

In [1]: import numpy as np

In [2]: from scipy import linalg

In [3]: from scikits.learn.datasets import make_spd_matrix

In [4]: from scikits.learn.utils.arrayfuncs import solve_triangular

In [5]: G = make_spd_matrix(1000)

In [6]: L = linalg.cholesky(G, lower=True)

In [7]: x = np.random.randn(1000)

In [8]: y = x.copy()

In [9]: timeit solve_triangular(L, x)
100 loops, best of 3: 3.45 ms per loop

In [10]: timeit linalg.solve_triangular(L, y, lower=True, overwrite_b=True)
10 loops, best of 3: 134 ms per loop

Wow! That’s 40x faster. We’re catching two rabbits with one stone here, let’s do the change! Notice that we can just copy \( \mathbf{v}\) into the appropriate place in \( L\) and then solve in place.

But whoops! When solving the \( LL’\) system, we take advantage of the transpose attribute in linalg.solve_triangular, which the scikits.learn version does not expose. We could think of a solution, but here’s a better idea: Shouldn’t there be some way to directly solve the entire system in one go?

Well, there is. It is an LAPACK function by the name of potrs. If you are not aware, LAPACK is a Fortran library with solvers for various types of linear systems and eigenproblems. LAPACK along with BLAS (on which it is based) pretty much powers all the scientific computation that happens. BLAS is an API with multiple implementations dating from 1979, while LAPACK dates from 1992. If you ever used Matlab, this is what was called behind the scenes. SciPy, again, provides a high-level wrapper around this, the linalg.cho_solve function.

But SciPy also gives us the possibility to import functions directly from LAPACK, through the use of linalg.lapack.get_lapack_funcs. Let’s see how the low-level LAPACK function compares to the SciPy wrapper, for our use case:

In [11]: x = np.random.randn(1000)

In [12]: y = x.copy()

In [13]: timeit linalg.cho_solve((L, True), x)
1 loops, best of 3: 95.4 ms per loop

In [14]: potrs, = linalg.lapack.get_lapack_funcs(('potrs',), (G,))

In [15]: potrs
Out[15]: &lt;fortran object&gt;

In [16]: timeit potrs(L, y)
100 loops, best of 3: 9.49 ms per loop

That’s 10 times faster! So now we found an obvious way to optimize the code:

def cholesky_omp(X, y, n_nonzero_coefs, eps=None):
    min_float = np.finfo(X.dtype).eps
    potrs, = get_lapack_funcs(('potrs',), (X,))
    alpha = np.dot(X.T, y)
    residual = y
    n_active = 0
    idx = []

    max_features = X.shape[1] if eps is not None else n_nonzero_coefs
    L = np.empty((max_features, max_features), dtype=X.dtype)
    L[0, 0] = 1.

    while 1:
        lam = np.abs(np.dot(X.T, residual)).argmax()
        if lam &lt; n_active or alpha[lam] ** 2 &lt; min_float:
            # atom already selected or inner product too small
            warn(&quot;Stopping early&quot;)
            break
        if n_active &gt; 0:
            # Updates the Cholesky decomposition of X' X
            L[n_active, :n_active] = np.dot(X[:, idx].T, X[:, lam]
            solve_triangular(L[:n_active, :n_active], L[n_active, :n_active])
            d = np.dot(L[n_active, :n_active].T, L[n_active, :n_active])
            if 1 - d &lt;= min_float:  # selected atoms are dependent
                warn(&quot;Stopping early&quot;)
                break
            L[n_active, n_active] = np.sqrt(1 - d)
        idx.append(lam)
        # solve LL'x = y in two steps:
        gamma, _ = potrs(L[:n_active, :n_active], alpha[idx], lower=True,
                         overwrite_b=False)
        residual = y - np.dot(X[:, idx], gamma)
        if eps is not None and np.dot(residual.T, residual) &lt;= eps:
            break
        elif n_active == max_features:
            break
    return gamma, idx

Woohoo! But we still lag behind. Now that we delegated the trickiest parts of the code to fast and reliable solvers, it’s time to use a profiler and see what the bottleneck is now. Python has excellent tools for this purpose. What solved the problem in this case was line_profiler [2]. There is a great article by Olivier Grisel here [2] regarding how to use these profilers. I’m just going to say that line_profiler‘s output is very helpful, basically printing the time taken by each line of code next to that line.

Running the profiler on this code, we found that 58% of the time is spent on line 14, 20.5% on line 21, and 20.5% on line 32, with the rest being insignificant (potrs takes 0.1%!). The code is clearly dominated by the matrix multiplications. By running some more timings with IPython, I found that multiplying such column-wise views of the data as X[:, idx] is considerably slower then multiplying a contiguous array. The least-angle regression code in scikits.learn avoids this by swapping columns towards the front of the array as they are chosen, so we can replace X[:, idx] with X[:, :n_active]. The nice part is that if the array is stored in Fortran-contiguous order (ie. column contiguous order, as opposed to row contiguous order, as in C), swapping two columns is a very fast operation!. Let’s see some more benchmarks!

In [17]: X = np.random.randn(5000, 5000)

In [18]: Y = X.copy('F')  # fortran-ordered

In [19]: a, b = 1000, 2500

In [20]: swap, = linalg.get_blas_funcs(('swap',), (X,))

In [21]: timeit X[:, a], X[:, b] = swap(X[:, a], X[:, b])
100 loops, best of 3: 6.29 ms per loop

In [22]: timeit Y[:, a], Y[:, b] = swap(Y[:, a], Y[:, b])
10000 loops, best of 3: 111 us per loop

We can see that using Fortran-order takes us from the order of miliseconds to the order of microseconds!

Side note: I almost fell into the trap of swapping columns the pythonic way. That doesn’t work:

In [23]: X[:, a], X[:, b] = X[:, b], X[:, a]

In [24]: np.testing.assert_array_equal(X[:, a], X[:, b])

In [25]:

However this trick works great for swapping elements of one-dimensional arrays.

Another small optimization that we can do: I found that on my system, it’s slightly faster to compute the norm using the BLAS function nrm2. So by putting all of these together, we end up with the final version of our code:

def cholesky_omp(X, y, n_nonzero_coefs, eps=None, overwrite_X=False):
    if not overwrite_X:
        X = X.copy('F')
    else:  # even if we are allowed to overwrite, still copy it if bad order
        X = np.asfortranarray(X)

    min_float = np.finfo(X.dtype).eps
    nrm2, swap = linalg.get_blas_funcs(('nrm2', 'swap'), (X,))
    potrs, = get_lapack_funcs(('potrs',), (X,))

    indices = range(len(Gram))  # keeping track of swapping
    alpha = np.dot(X.T, y)
    residual = y
    n_active = 0

    max_features = X.shape[1] if eps is not None else n_nonzero_coefs
    L = np.empty((max_features, max_features), dtype=X.dtype)
    L[0, 0] = 1.

    while True:
        lam = np.abs(np.dot(X.T, residual)).argmax()
        if lam &lt; n_active or alpha[lam] ** 2 &lt; min_float:
            # atom already selected or inner product too small
            warn(&quot;Stopping early&quot;)
            break
        if n_active &gt; 0:
            # Updates the Cholesky decomposition of X' X
            L[n_active, :n_active] = np.dot(X[:, :n_active].T, X[:, lam])
            solve_triangular(L[:n_active, :n_active], L[n_active, :n_active])
            v = nrm2(L[n_active, :n_active]) ** 2
            if 1 - v &lt;= min_float:  # selected atoms are dependent
                warn(&quot;Stopping early&quot;)
                break
            L[n_active, n_active] = np.sqrt(1 - v)
        X.T[n_active], X.T[lam] = swap(X.T[n_active], X.T[lam])
        alpha[n_active], alpha[lam] = alpha[lam], alpha[n_active]
        indices[n_active], indices[lam] = indices[lam], indices[n_active]
        n_active += 1
        # solves LL'x = y as a composition of two triangular systems
        gamma, _ = potrs(L[:n_active, :n_active], alpha[:n_active], lower=True,
                         overwrite_b=False)

        residual = y - np.dot(X[:, :n_active], gamma)
        if eps is not None and nrm2(residual) ** 2 &lt;= eps:
            break
        elif n_active == max_features:
            break

    return gamma, indices[:n_active]

Now, the benchmark at [1] indicates victory over least-angle regression! I hope you have enjoyed this short tour. See you next time!

[1] OMP vs. Lars benchmark
[2] Profiling Python code