# BibTeX-powered publications list for Pelican with pelican-bibtex


## Hook

Wouldn’t you like to manage your academic publications list easily within the context of your static website? Without resorting to external services, or to software like bibtex2html, which is very nice but will then require restyling to fit your templates?

Look no further: with the help of pelican-bibtex you can now manage your papers from within Pelican!

## Backstory

On Fabian‘s advice, I started playing around with Pelican, a static website/blog generator for Python. I like it better than the other generators I have used before, so I chose it the next time I had to set up a website. I still haven’t worked up the courage to migrate my current website and blog to it, but I promise I will.

Pelican has a public plugins repository, but it comes with a license constraint for all contributions. My plugin isn’t complicated, but I had to “reverse engineer” undocumented parts of the pybtex API. I think the code I used to render citations programmatically could be useful to others, so I don’t want to release it under a restrictive license. For this reason, I publish pelican-bibtex on my personal GitHub account.
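For the curious, the core of that rendering approach looks roughly like this (a simplified sketch, close in spirit to what pelican-bibtex does; publications.bib is a placeholder path):

    # sketch: render a .bib file to HTML snippets with pybtex
    from pybtex.database.input import bibtex
    from pybtex.backends import html
    from pybtex.style.formatting import plain

    bib_data = bibtex.Parser().parse_file('publications.bib')  # placeholder path
    style = plain.Style()
    backend = html.Backend()

    # format_entries sorts and styles the entries; each one renders to HTML
    for formatted_entry in style.format_entries(bib_data.entries.values()):
        print(formatted_entry.text.render(backend))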

You can see it in action in the source code for the website I am working on at the moment, the home page of my research group. Example output generated using pelican-bibtex can be seen here.

## Possible extensions

I have not dug in too deeply, but I believe this plugin can be extended, without much difficulty, to support referencing in Pelican blogs and render BibTeX references at the end of every post. This idea was suggested by Avaris on #pelican, and I find it very cool. Since I don’t need this feature at the moment, it’s not a priority, but it’s something I would like to see at some point.

# Really the most common English idioms?


A while back I ran into this blog post and it made me wonder. I’m not a native speaker, but the idiomatic phrases it presents as common don’t strike me as such: I can’t recall encountering them often in real dialogue.

The blog post lists the 10 most common idioms in English. Idioms, also known less ambiguously as fixed expressions, are units of language that span at least two words. Their meaning, relative to the individual meanings of the parts of the phrase, is figurative. Despite this, fixed expressions don’t classify as creative language, or exploitations: by definition, most speakers will unequivocally be familiar with them.

For example, they cite piece of cake as the most common idiomatic expression. This refers to using the phrase to mean that something is easy, that it isn’t challenging. An example of literal use, however, would be ordering a piece of cake for dessert in a restaurant.

Everyone knows that language is a perpetually changing thing, so to begin with it’s even slightly misleading to discuss the commonness of a phrase without giving more context. The blog post doesn’t justify the ranking with any numbers anyway, so let’s take the idioms one by one and find out how common they really are!

## Corpus Linguistics

The approach we are taking here is known as corpus linguistics. Under corpus linguistics, the best way to argue that a certain phrase is common, that something is used with a specific meaning, or that some constructions are normal is not to make up examples that seem reasonable, but to look at representative collections of text (corpora) and find the examples there. The conclusions you reach this way are backed by real-world language use.
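To make the methodology concrete, here is a toy version of such a query, using NLTK’s Brown corpus (the BNC and OEC are not freely downloadable, so Brown stands in here):

    # toy corpus query: find a word in context and judge each use by hand
    import nltk
    nltk.download('brown', quiet=True)
    from nltk.corpus import brown

    text = nltk.Text([w.lower() for w in brown.words()])
    text.concordance('cake')  # inspect hits in context: literal or idiomatic?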

An argument often brought against generative linguistics is that it focuses on the (hard) border between grammatical and ungrammatical, and that border is usually defined by made-up examples. This is inappropriate for studying how norms are exploited in real language use, for example. I refer the interested reader to the work of Patrick Hanks [1, 2].

Corpus linguistics is sensitive to the corpus used. For this example let’s use two British English corpora: the British National Corpus and the Oxford English Corpus. Measured by number of words, the latter is around 20 times bigger. The strong point of the BNC is the attention given to the mixing proportions of the various domains. The OEC, on the other hand, is larger and more recent. I suspect (though I cannot firmly claim) that the differences in the following results arise from the inclusion in the OEC of blogs dating from the mid-2000s.

## Cognitive salience vs. social salience

One of the key ideas motivating corpus approaches is the mismatch between the two. The cognitive salience of something is the ease with which we can recall it. An example often used in language is the fixed expression kicking the bucket: it is one of the standard examples people give when asked for a fixed expression. It is supposed to mean dying.

However, big surprise: the BNC has only 18 instances of this phrase, out of which only 3 are idiomatic, the rest being either literal or metalinguistic. This is a nice example of the salience contrast, but we mustn’t rush to conclusions. The OEC has 193 examples (still few, relative to its size), but a lot more of them are idiomatic uses. To save time, I didn’t look at all the examples, but took a random sample of size 18 to compare the relative frequencies with the BNC. Here, 15 out of 18 instances are idiomatic and none are meta. Quite a difference!

This goes to show the importance of context when we draw conclusions about language use. Now let’s tackle the list with a similar analysis.

## The idioms

1. Piece of cake

In the BNC, this phrase occurs 51 times. In 29 of these occurrences, however, the meaning is literal. In the OEC we find 601 occurrences; in a random sample of size 51 we find 12 literal uses.

2. Costing an arm and a leg

For flexibility we search for the phrase an arm and a leg. In BNC it can be found 29 times: one literal, four with the verb to pay, and 16 with to cost. In OEC it appears 228 times. We take, again, a sample of size 29 and find no literal uses, 16 with to cost, four with to pay, three with to charge and a few different uses. The figurative meaning is the same in all cases: a lot of money.

3. Break a leg

BNC: 16 instances, 13 of which are literal. OEC: 70 hits; in a sample of 16, 10 were literal.

4. Hitting the books

BNC: 1 occurrence of hit the record books, which has a different meaning; the idiom itself is never used. OEC: 135, one of which is literal.

5. Letting the cat out of the bag

We just looked for co-occurrences of cat in the context of the phrase out of the bag.
BNC: 19, out of which 3 metalinguistic/literal. OEC: 298, and out of a sample of 19, all were idiomatic.

6. Hitting the nail on the head

BNC: 12 instances, all idiomatic. OEC: 484, and out of a sample of 12 all were idiomatic.

7. When pigs fly

We looked for the lemma fly before the word pigs, thereby catching multiple variations.
BNC: 17 hits, OEC: 240.

8. Judging a book by its cover

We looked for the fixed phrase book by its cover, because the leading verb might vary.
In the BNC, 11 instances (1 of them with tell instead of judge). In OEC, 195 instances. Sampling 11, all were idiomatic.

9. Biting off more than one can chew

BNC: 16 occurrences, one of which with “to take” instead of “to bite”. OEC: 231, all idiomatic after sampling 16.

10. Scratching one’s back

BNC: 23, out of which only 5 idiomatic. OEC: 756, 5/23 idiomatic.

## Recalculating the rank

We now have enough data to reorder the expressions and compare. The result will be more approximate for the OEC because of our use of small subsamples to estimate the frequencies, but hopefully it will still be interesting. The way we are estimating the counts for the OEC is as follows: take, for instance, break a leg. It was found 70 times, and out of a sample of 16, 10 were literal. The expected number of idiomatic uses is therefore:

$$n = \left( 1 - \frac{10}{16} \right) \cdot 70 = 26.25$$
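In code, the estimate is just this (a tiny helper; the numbers are the break a leg counts from above):

    def expected_idiomatic(total_hits, sample_size, non_idiomatic_in_sample):
        # scale the total count by the idiomatic proportion observed in the sample
        return (1 - non_idiomatic_in_sample / sample_size) * total_hits

    print(expected_idiomatic(70, 16, 10))  # 26.25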

Repeating this computation and skipping a ton of steps leads to the following rankings:

In the British National Corpus:

1. Costing an arm and a leg
2. Piece of cake
3. When pigs fly
4. Letting the cat out of the bag
5. Biting off more than one can chew
6. Hitting the nail on the head
7. Judging a book by its cover
8. Scratching one’s back
9. Break a leg
10. Hitting the books

In the Oxford English Corpus:

1. Hitting the nail on the head
2. Piece of cake
3. Letting the cat out of the bag
4. When pigs fly
5. Biting off more than one can chew
6. Costing an arm and a leg
7. Judging a book by its cover
8. Scratching one’s back
9. Hitting the books
10. Break a leg

We can see that, apart from the apparent swap of hitting the nail on the head with costing an arm and a leg, the rankings are not too different. We can quantify this using Rank Distance, a metric introduced by Liviu P. Dinu [3, 4]. Here, all three of our rankings are over the same domain: we are not looking for the most frequent idioms in the corpora (this would be very hard), we are just reordering the proposed ranking according to the occurrences in the BNC and the OEC. In this simple case, Rank Distance reduces to the $$\ell_1$$ distance between rank position vectors. The weighted Rank Distance, bounded on $$[0, 1]$$, is in this case obtained by dividing by a scaling factor of $$0.5k^2$$, where $$k$$ is the length of the rankings (10 in our case).

The computed distance between the original ranking and the BNC reordering is 0.52. Between the original and the OEC reordering, it is 0.68. Our two reorderings are much closer: the distance is 0.28. This is mostly because the permutations between the two reorderings affect the top positions, and are therefore weighted more.
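For the record, here is a small script reproducing these numbers (the rank vectors are transcribed from the lists above):

    def rank_distance(r1, r2):
        # l1 distance between rank position vectors, scaled by 0.5 * k**2
        k = len(r1)
        return sum(abs(a - b) for a, b in zip(r1, r2)) / (0.5 * k ** 2)

    # position of each idiom (taken in the original order) within each reordering
    original = list(range(1, 11))
    bnc = [2, 1, 9, 10, 4, 6, 3, 7, 5, 8]
    oec = [2, 6, 10, 9, 3, 1, 4, 7, 5, 8]

    print(rank_distance(original, bnc))  # 0.52
    print(rank_distance(original, oec))  # 0.68
    print(rank_distance(bnc, oec))       # 0.28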

It’s also interesting to look at the ratios of the counts: they differ by a roughly constant factor, not far from the relative size difference of the two corpora, as would be expected.

We have to throw away hitting the books because its BNC zero count leads to divisions by zero. After this step, the average of the relative counts of the idioms is 19.5, with a standard deviation of 10.1, while the OEC is supposed to have around 20 times more words than the BNC.
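These summary statistics can be reproduced from the counts above (assuming, where the post gives no literal breakdown, that all hits are idiomatic):

    import numpy as np

    # (BNC idiomatic count, estimated OEC idiomatic count); hitting the books dropped
    counts = {
        'piece of cake':          (51 - 29, (1 - 12 / 51) * 601),
        'an arm and a leg':       (29 - 1, 228.0),
        'break a leg':            (16 - 13, (1 - 10 / 16) * 70),
        'cat out of the bag':     (19 - 3, 298.0),
        'nail on the head':       (12, 484.0),
        'when pigs fly':          (17, 240.0),  # no literal breakdown: assumed idiomatic
        'book by its cover':      (11, 195.0),
        'more than one can chew': (16, 231.0),
        "scratching one's back":  (5, (5 / 23) * 756),
    }
    ratios = np.array([oec / bnc for bnc, oec in counts.values()])
    print(ratios.mean(), ratios.std())  # approximately 19.5 and 10.1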

## Conclusions

Well, it seems people don’t say break a leg and let’s hit the books as often as the original author claims. The popularity of most of the cited idioms seems supported by the data, but we have no easy way to find other idioms that might turn out to be much more frequent. Corpus linguistics is a reliable way to measure the social salience of language patterns. It should always be used to verify and back otherwise empty claims of the form X is correct, Y is frequent or Nobody says Z.

[1] Patrick Hanks. How people use words to make meanings.
[2] Patrick Hanks. Lexical Analysis: Norms and Exploitations. The MIT Press, 2013.
[3] Liviu P. Dinu, Florin Manea. An efficient approach for the rank aggregation problem. Theoretical Computer Science 359(1), 2006, pp. 455–461.
[4] Liviu P. Dinu. On the Classification and Aggregation of Hierarchies with Different Constitutive Elements. Fundamenta Informaticae 55(1): 39–50, 2003.

# Scikit-learn-speed: An overview on the final day


This summer, I was granted the project called scikit-learn-speed: developing a benchmarking platform for scikit-learn, using it to find potential speedups and, in the end, making the library go faster wherever I could.

On the official closing day of this work, I’d like to take a moment and recall the accomplishments and failures of this project, and all the lessons to be learned.

## The scikit-learn-speed benchmark platform

Scikit-learn-speed is a continuous benchmark suite for the scikit-learn library. It has the following features:

• vbench-powered integration with Git
• Easily triggered build and report generation: just type make
• Easily readable and writeable template for benchmarks:
    {
        'obj': 'LogisticRegression',
        'init_params': {'C': 1e5},
        'statements': ('fit', 'predict')
    }, ...

• Many attributes recorded: time (with estimated standard deviation), memory usage, cProfile output, line_profiler output, tracebacks
• Multi-step benchmarks: e.g. fit followed by predict

What were the lessons I learned here?

### Make your work reusable: the trade-off between good design and get-it-working-now

For the task of rolling out a continuous benchmarking platform, we decided pretty early in the project to adopt Wes McKinney’s vbench. If my goal had been to maintain vbench and extend it into a multi-purpose, reusable benchmarking framework, the work would have been structured differently. It also would have been very open-ended and difficult to quantify.

The way things went, I came up with the features we needed in scikit-learn-speed and tried to implement them in vbench without refactoring too much, while still trying to make them as reusable as possible.

The result? I got all the features for scikit-learn-speed, but the implementation is not yet clean enough to be merged into vbench. This is fine for a project with a tight deadline such as this one: after it’s done, I will just spend another weekend cleaning the work up and making sure it can be appreciated upstream. This will be easier because of the constraint of keeping compatibility with scikit-learn-speed.

### Never work quietly (unless you’re a ninja)

I know some students who prefer that the professor doesn’t even know they exist until the final, when they score an A and (supposedly) leave the professor amazed. In real life, plenty of people would be interested in what you are doing, as long as they know about it. The PSF goes a long way to help this, with the “blog weekly” rule. In the end, however, it’s up to you to make sure that everybody who should know finds out about your work. It will spare the world duplicated work and abandoned projects, but most importantly, those people can point you to things you have missed. Try to mingle in real life as well: attend conferences, meetups, coding sprints.

I was able to “join forces” with a couple of people who contacted me about my new vbench features (hi Jon and Joel!). I have shaped my design slightly towards their requirements as well, and hopefully the result will be a more general vbench.

## The speedups

Once scikit-learn-speed was up and running, I couldn’t believe how useful it is to be able to scroll, catch slow code and jump straight to the profiler output with one click. I jumped on the following speed-ups:

• Multiple outputs in linear models. (PR)

Some of them proved trickier than expected, so I haven’t implemented it for the whole module yet, but it is ready for some estimators.

• Less callable functions passed around in FastICA (merged)
• Speed up euclidean_distances by rewriting in Cython. (PR)

This meant making more operations support an out argument for passing preallocated memory. This touches many different objects in the codebase: clustering, manifold learning, nearest neighbour methods.

• Insight into inverse and pseudoinverse computation, new pinvh function for inverting symmetric/hermitian matrices. (PR)

This speeds up the covariance module (especially MinCovDet), ARDRegression and the mixture models. It also led to an upstream contribution to Scipy.

• OrthogonalMatchingPursuit forward stepwise path for cross-validation (PR)

This is only halfway finished, but it will lead to faster and easier optimization of the OMP sparsity parameter.

Lessons? These will be pretty obvious.

### Write tests, tests, tests!

This is a no-brainer, but it still didn’t stick. In that one case out of 10 that I didn’t explicitly test, a bug was obviously hiding. When you want to add a new feature, it’s best to start by writing a failing test and then making it pass. Sure, you might still miss tricky bugs, but you will never have embarrassing, obvious bugs in your code.
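For instance, when adding the pinvh function mentioned above, the natural first step would be a test pinning down its expected behavior (a sketch; the test name is made up):

    import numpy as np
    from scipy import linalg

    def test_pinvh_agrees_with_pinv2():
        # fails until the feature exists, then guards it forever
        a = np.random.randn(30, 10)
        a = np.dot(a, a.T)  # symmetric and rank-deficient
        np.testing.assert_array_almost_equal(linalg.pinvh(a), linalg.pinv2(a))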

### Optimization doesn’t have to be ugly

Developers often shun optimization. It’s true, you should profile first, and you shouldn’t focus on speeding up stuff that is dominated by other computations that are orders of magnitude slower. However, there is an elephant in the room: the assumption that making code faster invariably makes it less clear, and takes a lot of effort.

The following code is a part of scipy’s pinv2 function as it currently is written:

    cutoff = cond * np.maximum.reduce(s)
    psigma = np.zeros((m, n), t)
    for i in range(len(s)):
        if s[i] > cutoff:
            psigma[i, i] = 1.0 / np.conjugate(s[i])
    return np.transpose(np.conjugate(np.dot(np.dot(u, psigma), vh)))


psigma is a diagonal matrix, and some time and memory can be saved with simple vectorization. However, this part of the code is dominated by an earlier call to svd. The profiler output would say that we shouldn’t bother, but is it really a bother? Look at Jake’s new version:

    above_cutoff = (s > cond * np.max(s))
    psigma_diag = np.zeros_like(s)
    psigma_diag[above_cutoff] = 1.0 / s[above_cutoff]

    return np.transpose(np.conjugate(np.dot(u * psigma_diag, vh)))


It’s shorter, more elegant, easier to read, and nevertheless faster. I would say it is worth it.

### Small speed-ups can propagate

Sure, it’s great if you can compute an inverse two times faster, say in 0.5s instead of 1s. But if some algorithm calls this function in a loop that might iterate 100, 300, or 1000 times, this small speed-up seems much more important, doesn’t it?

What I’m trying to say with this is that in a well-engineered system, a performance improvement to a relatively small component (such as the function that computes a pseudoinverse) can lead to multiple spread out improvements. Be careful of the double edge of this sword, a bug introduced in a small part can cause multiple failures downstream. But you are fully covered by your test suite, aren’t you?

Overall, it has been a fruitful project that may not have resulted in a large number of speed-ups, but a few considerable ones nonetheless. And I venture the claim that the scikit-learn-speed tool will prove useful over time, and that the efforts deployed during this project have stretched beyond the boundaries of scikit-learn.

# Inverses and pseudoinverses. Numerical issues, speed, symmetry.


The matrix inverse is a cornerstone of linear algebra, taught, along with its applications, since high school. The inverse of a matrix $$A$$, if it exists, is the matrix $$A^{-1}$$ such that $$AA^{-1} = A^{-1}A = I_n$$. Based on the requirement that the left and right multiplications should be equal, it follows that it only makes sense to speak of inverting square matrices. But just the square shape is not enough: for a matrix $$A$$ to have an inverse, $$A$$ must be full rank.

The inverse provides an elegant (on paper) method of finding solutions to systems of $$n$$ equations with $$n$$ unknowns, which correspond to solving $$Ax = b$$ for $$x$$. If we’re lucky and $$A^{-1}$$ exists, then we can find $$x = A^{-1}b$$. For this to work, it must be the case that:

• We have exactly as many unknowns as equations
• No equation is redundant, i.e. can be expressed as a linear combination of the others

In this setting, there is a unique solution for $$x$$.

## The Moore-Penrose pseudoinverse

What if we have more equations than unknowns? It is most likely the case that we cannot satisfy all the equations perfectly, so let’s settle for a solution that best fits the constraints, in the sense of minimising the sum of squared errors. We solve $$\operatorname{arg\,min}_x ||b - Ax||$$.

And how about the other extreme, where we have a lot of unknowns but just a few equations constraining them? We will probably have infinitely many solutions; how can we choose one? A popular choice is to take the one of least $$\ell_2$$ norm: $$\operatorname{arg\,min}_x ||x|| \text{ s.t. } Ax = b$$. Is there a way to generalize the idea of a matrix inverse to this setting?

The pseudoinverse of an arbitrary-shaped matrix $$A$$, written $$A^{+}$$, has the same shape as $$A^{T}$$ and solves our problem: the answer to both optimization problems above is given by $$x = A^{+}b$$.
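Both behaviors are easy to check numerically (a quick sketch using numpy’s pinv and lstsq):

    import numpy as np

    # tall system (more equations than unknowns): least-squares fit
    A = np.random.randn(20, 5)
    b = np.random.randn(20)
    x = np.dot(np.linalg.pinv(A), b)
    assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])

    # wide system (more unknowns than equations): minimum-norm exact solution
    A = np.random.randn(5, 20)
    b = np.random.randn(5)
    x = np.dot(np.linalg.pinv(A), b)
    assert np.allclose(np.dot(A, x), b)  # the constraints hold exactly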

The theoretical definition of the pseudoinverse is given by the following conditions. The intuitive way to read them is as properties of $$AA^+$$ or $$A^+A$$:

• $$AA^+A = A$$
• $$A^+AA^+ = A^+$$
• $$(AA^+)^T = AA^+$$
• $$(A^+A)^T = A^+A$$

These conditions, however, do not give us a way to get our hands on a pseudoinverse, so we need something else.
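They are easy to verify numerically, though (here with numpy’s pinv, treated as a black box for now):

    import numpy as np

    A = np.random.randn(5, 3)
    Ap = np.linalg.pinv(A)

    # the four defining conditions, checked numerically
    assert np.allclose(np.dot(np.dot(A, Ap), A), A)
    assert np.allclose(np.dot(np.dot(Ap, A), Ap), Ap)
    assert np.allclose(np.dot(A, Ap), np.dot(A, Ap).T)
    assert np.allclose(np.dot(Ap, A), np.dot(Ap, A).T)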

## How to compute the pseudoinverse on paper

The first time I ran into the pseudoinverse, I didn’t even know its definition, only the expression of the closed-form solution of such a problem, given as:

$$A^+ = (A^T A)^{-1}A^T$$

What can we see from this expression:

• It gives us a way to compute the pseudoinverse, and hence to solve the problem
• If $$A$$ is actually invertible, it means $$A^T$$ is invertible, so we have $$A^+ = A^{-1}(A^T)^{-1}A^T = A^{-1}$$
• Something bad happens if $$A^TA$$ is not invertible.

The pseudoinverse is still defined, and unique, when $$A^TA$$ is not invertible, but we cannot use the expression above to compute it.
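When $$A^TA$$ is invertible, though, the closed form agrees with the general pseudoinverse (a quick numerical check):

    import numpy as np

    A = np.random.randn(20, 5)  # tall and, almost surely, full column rank
    closed_form = np.dot(np.linalg.inv(np.dot(A.T, A)), A.T)
    assert np.allclose(closed_form, np.linalg.pinv(A))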

## Numerical issues

Before going on, we should clarify and demystify some of the urban legends about numerical computation of least squares problems. You might have heard the following unwritten rules:

1. Never compute $$A^{-1}$$, solve the system directly
2. If you really need $$A^{-1}$$, use pinv and not inv

The first of these rules is based on some misguided beliefs, but is still good advice. If your goal is a one-shot answer to a system, there’s no use in explicitly computing a possibly large inverse when all you need is $$x$$. But this paper shows that computing the inverse is not necessarily a bad thing. The key to this is conditional accuracy: as long as the inv function used has good conditional bounds, you will get results as good as with a least-squares solver.
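In practice, the two routes agree to high accuracy on a well-conditioned problem (a quick sketch):

    import numpy as np
    from scipy import linalg

    A = np.random.randn(500, 500)
    b = np.random.randn(500)

    x_solve = linalg.solve(A, b)      # one-shot: no explicit inverse formed
    x_inv = np.dot(linalg.inv(A), b)  # explicit inverse: extra work, same answer here
    np.testing.assert_array_almost_equal(x_solve, x_inv)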

The second rule comes from numerical stability, and will definitely bite you if misunderstood. If $$A$$ is a square matrix with a row full of zeros, it’s clearly not invertible, so an algorithm attempting to compute the inverse will fail and you will be able to catch that failure. But what if the row is not exactly zero, but the sum of several other rows, and a slight loss of precision is propagated at every step?

## Numerical rank vs. actual rank

The rank of a matrix $$A$$ is defined as the number of linearly independent rows (or equivalently, columns) in $$A$$. In other words, the number of non-redundant equations in the system. We’ve seen before that if the rank is less than the total number of rows, the system cannot have a unique solution anymore, so the matrix $$A$$ is not invertible.

Computing the rank of a matrix is a tricky numerical problem. On paper, with small matrices, you would look at minors of decreasing size until you find the first non-zero one. This is infeasible to implement on a computer, so numerical analysis takes a different approach. Enter the singular value decomposition!

The SVD of a matrix $$A$$ is $$A = USV^{T}$$, where $$S$$ is diagonal and $$U, V$$ are orthogonal. The elements on the diagonal of $$S$$ are called the singular values of $$A$$. It can be seen that to get a row full of zeros when multiplying three such matrices, a singular value needs to be exactly zero.

The ugly thing that could happen is that one (or, usually, more) singular values are not exactly zero, but very small, due to propagated imprecision. Why is this a problem? By looking at the SVD and noting its properties, it becomes clear that $$A^{-1} = VS^{-1}U^{T}$$, and since $$S$$ is diagonal, its inverse is formed by taking the reciprocal of every element on the diagonal. But if a singular value is very small yet not quite zero, its reciprocal is very large and will blow up the whole computation of the inverse. The right thing to do here is either to tell the user that $$A$$ is numerically rank deficient, or to return a pseudoinverse instead. A pseudoinverse would mean: give up on trying to get $$AA^+$$ to be the identity matrix, and simply aim for a diagonal matrix with approximately ones and zeros. In other words, when singular values are very small, set them to 0.
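A minimal sketch of this thresholding idea (the rcond default here is an arbitrary assumption; the real library functions are more careful):

    import numpy as np

    def pinv_via_svd(a, rcond=1e-10):
        u, s, vt = np.linalg.svd(a, full_matrices=False)
        keep = s > rcond * s.max()        # drop "numerically zero" singular values
        s_inv = np.zeros_like(s)
        s_inv[keep] = 1.0 / s[keep]
        return np.dot(vt.T * s_inv, u.T)  # V S^+ U^T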

How do you set the threshold? This is actually a delicate issue, which is being discussed on the numerical Python mailing list.

## Scipy implementations

Scipy exposes inv, pinv and pinv2. inv secretly invokes LAPACK, that ancient but crazy robust code that’s been used since the 70s, to first compute a pivoted LU decomposition that is then used to compute the inverse. pinv also uses LAPACK, but for computing the least-squares solution to the system $$AX = I$$. pinv2 computes the SVD and transposes everything, as shown above. Both pinv and pinv2 expose cond and rcond arguments to handle the treatment of very small singular values, but (attention!) they behave differently!

The different implementations also lead to different speeds. Let’s look at inverting a random square matrix:

In [1]: import numpy as np

In [2]: from scipy import linalg

In [3]: a = np.random.randn(1000, 1000)

In [4]: timeit linalg.inv(a)
10 loops, best of 3: 132 ms per loop

In [5]: timeit linalg.pinv(a)
1 loops, best of 3: 18.8 s per loop

In [6]: timeit linalg.pinv2(a)
1 loops, best of 3: 1.58 s per loop


Woah, huge difference! But do all three methods return the “right” result?

In [7]: linalg.inv(a)[:3, :3]
Out[7]:
array([[ 0.03636918,  0.01641725,  0.00736503],
[-0.04575771,  0.03578062,  0.02937733],
[ 0.00542367,  0.01246306,  0.0122156 ]])

In [8]: linalg.pinv(a)[:3, :3]
Out[8]:
array([[ 0.03636918,  0.01641725,  0.00736503],
[-0.04575771,  0.03578062,  0.02937733],
[ 0.00542367,  0.01246306,  0.0122156 ]])

In [9]: linalg.pinv2(a)[:3, :3]
Out[9]:
array([[ 0.03636918,  0.01641725,  0.00736503],
[-0.04575771,  0.03578062,  0.02937733],
[ 0.00542367,  0.01246306,  0.0122156 ]])

In [10]: np.testing.assert_array_almost_equal(linalg.inv(a), linalg.pinv(a))

In [11]: np.testing.assert_array_almost_equal(linalg.inv(a), linalg.pinv2(a))


Looks good! This is because we got lucky, though, and a was invertible to start with. Let’s look at its spectrum:

In [12]: _, s, _ = linalg.svd(a)

In [13]: np.min(s), np.max(s)
Out[13]: (0.029850235603382822, 62.949785645178906)


This is a lovely range for the singular values of a matrix, not too small, not too large. But what if we built the matrix in a way that would always pose problems? Specifically, let’s look at the case of covariance matrices:

In [14]: a = np.random.randn(1000, 50)

In [15]: a = np.dot(a, a.T)

In [16]: _, s, _ = linalg.svd(a)

In [17]: s[-9:]
Out[17]:
array([  7.40548924e-14,   6.48102455e-14,   5.75803505e-14,
5.44263048e-14,   4.51528730e-14,   3.55317976e-14,
2.46939141e-14,   1.54186776e-14,   5.08135874e-15])



a has at least 9 tiny singular values. Actually, since a was built from a matrix of rank at most 50, it’s easy to see why there are 950 of them:

In [18]: np.sum(s < 1e-10)
Out[18]: 950


How do our functions behave in this case? Instead of just looking at a corner, let’s use our gift of sight:

The small eigenvalues are large enough that inv thinks the matrix is full rank. pinv does better, but it still fails: you can see a group of high-amplitude noisy columns. pinv2 is faster and also gives us a useful result in this case.

Wait, does this mean that pinv2 is simply better, and pinv is useless?

Not quite. Remember, we are now trying to actually invert matrices, and degrade gracefully in case of rank deficiency. But what if we need the pseudoinverse to solve an actual non-square, wide or tall system?

In [19]: a = np.random.randn(1000, 50)

In [20]: timeit linalg.pinv(a)
10 loops, best of 3: 104 ms per loop

In [21]: timeit linalg.pinv(a.T)
100 loops, best of 3: 7.08 ms per loop

In [22]: timeit linalg.pinv2(a)
10 loops, best of 3: 114 ms per loop

In [23]: timeit linalg.pinv2(a.T)
10 loops, best of 3: 126 ms per loop


Huge victory for pinv in the wide case! Hurray! With all this insight, we can sum up what we learned.

• If you are 100% sure that your matrix is invertible, use inv for a huge speed gain. The implementation of inv from Scipy is based on LAPACK’s *getrf + *getri, known to have good bounds.
• If you are trying to solve a tall or wide system, use pinv.
• If your matrix is square but might be rank deficient, use pinv2 for speed and numerical gain.

## Improving the symmetric case

But wait a second, can’t we do better? $$AA^T$$ is symmetric; can’t we make use of that to speed up the computation even more? Clearly, if $$A$$ is symmetric, in its SVD $$A = USV^T$$ we must have $$U = V$$. But this is exactly the eigendecomposition of a symmetric matrix $$A$$. The eigendecomposition can be computed more cheaply than the SVD using Scipy’s eigh, which uses LAPACK’s *evr. As part of my GSoC this year, with help from Jake VanderPlas, we made a pull request to Scipy containing a pinvh function that is equivalent to pinv2 but faster for symmetric matrices.

In [24]: timeit linalg.pinv2(a)
1 loops, best of 3: 1.54 s per loop

In [25]: timeit linalg.pinvh(a)
1 loops, best of 3: 621 ms per loop

In [26]: np.testing.assert_array_almost_equal(linalg.pinv2(a), linalg.pinvh(a))
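For intuition, here is a minimal sketch of the idea behind pinvh (not the actual Scipy implementation; the cond default is an arbitrary assumption):

    import numpy as np
    from scipy import linalg

    def pinvh_sketch(a, cond=1e-10):
        s, u = linalg.eigh(a)  # eigendecomposition: cheaper than the SVD
        keep = np.abs(s) > cond * np.abs(s).max()
        psigma_diag = np.zeros_like(s)
        psigma_diag[keep] = 1.0 / s[keep]
        return np.dot(u * psigma_diag, u.T)  # U S^+ U^T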


# The scikit-learn-speed ship has set sail! Faster than ever, with multi-step benchmarks!


I am pleased to announce that last night at 2:03 AM, the first fully automated run of the scikit-learn-speed test suite has run on our Jenkins instance! You can admire it at its temporary home for now. As soon as we verify that everything is good, we will move this to the official scikit-learn page.

I would like to take this opportunity to tell you about our latest changeset. We made running the benchmark suite tons simpler by adding a friendly Makefile. You can read more about its usage in the guide. But by far, our coolest new toy is:

## Multi-step benchmarks

A standard vbench benchmark has three units of code, represented as strings: code, setup and cleanup. With the original timeit-based benchmarks, this means that for every run, the setup would be executed once. Then, the main loop runs repeat times, and within each iteration, the code is run ncalls times. Then cleanup happens, the best time is returned, and everybody is happy.
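A rough sketch of that flow, with repeat and ncalls as described above (illustrative only, not vbench’s actual code):

    import time

    def run_benchmark(setup, code, cleanup, repeat=3, ncalls=10):
        ns = {}
        exec(setup, ns)              # setup runs once
        timings = []
        for _ in range(repeat):      # main loop
            start = time.time()
            for _ in range(ncalls):
                exec(code, ns)
            timings.append((time.time() - start) / ncalls)
        exec(cleanup, ns)
        return min(timings)          # the best time is returned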

In scikit-learn, most of our interesting objects go through a state change called fitting. This metaphor is right at home in the machine learning field, where we separate the learning phase from the prediction phase. The prediction step cannot be invoked on an object that hasn’t been fitted.

For some algorithms, one of these steps is trivial. A brute force Nearest Neighbors classifier can be instantaneously fit, but prediction takes a while. On the opposite end we have linear models, with tons of complicated algorithms to fit them, but evaluation is a simple matrix-vector product that Numpy handles perfectly.

But for many of scikit-learn’s estimators, both steps are interesting. Let’s take Non-negative Matrix Factorization. It has three interesting functions: fit, which computes the factorization $$X \approx WH$$; transform, which computes a non-negative projection on the components learned in fit; and fit_transform, which takes advantage of the observation that when fitting, we also get the transformed $$X$$ for free.
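For illustration, the three calls are used roughly like this (a minimal example with arbitrary data and parameters):

    import numpy as np
    from sklearn.decomposition import NMF

    X = np.abs(np.random.randn(100, 20))  # NMF needs non-negative data
    nmf = NMF(n_components=2)
    W = nmf.fit_transform(X)              # fitting yields the transformed X for free
    H = nmf.components_
    print(np.linalg.norm(X - np.dot(W, H)))  # reconstruction error of X ≈ WH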

When benchmarking NMF, we initially had to design 3 benchmarks:

• setup = standard, code = obj.fit(X)
• setup = standard, code = obj.fit_transform(X)
• setup = standard + obj.fit(X), code = obj.transform(X)

## How much time were we wasting?

Let’s count. For every benchmark, we time the code by running it 3 times. We run it once more to measure memory usage, once more for cProfile and one last time for line_profiler. This is a total of 6 runs per benchmark. We need to multiply this by 2 again for running on two datasets. So when benchmarking NMF, because we need to fit before transforming, we do the fit 12 extra times. If a fit takes 5 seconds, this means one minute wasted on benchmarking just one estimator. Wouldn’t it be nice to run fit, fit_transform and transform in a sequence?

## Behind the scenes

We made the PythonBenchmark code parameter accept a sequence of strings instead of just a string. On the database side, every benchmark result entry gets an extra component in the primary key: the number of the step it measures.

In the benchmark description files, nothing is changed:

{
    'obj': 'NMF',
    'init_params': {'n_components': 2},
    'datasets': ('blobs',),
    'statements': ('fit_unsup', 'transform_unsup', 'fit_transform')
},


But before, we would take the Cartesian product of datasets and statements, and build a Benchmark object for every pairing. Now, we just pass the tuple as it is, and vbench is smart enough to do the right thing.

We avoided the extra calls to fit in a lot of benchmarks. The whole suite now takes almost half the time to run!
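Schematically, the change looks like this (an illustrative sketch; make_benchmark stands in for the real Benchmark construction, and the dataset names are arbitrary):

    from itertools import product

    datasets = ('blobs', 'arcene')
    statements = ('fit', 'transform')

    def make_benchmark(dataset, statement_or_steps):
        return (dataset, statement_or_steps)  # stand-in for a vbench Benchmark

    # before: Cartesian product, one single-step benchmark per pairing
    single_step = [make_benchmark(d, s) for d, s in product(datasets, statements)]

    # now: the statements tuple is passed through whole, one multi-step
    # benchmark per dataset
    multi_step = [make_benchmark(d, statements) for d in datasets]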

Note: This trick is currently hosted in the abstract_multistep_benchmarks vbench branch in my fork.

# Profiler output, benchmark standard deviation and other goodies in scikit-learn-speed


This post is about the scikit-learn benchmarking project that I am working on, called scikit-learn-speed. This is a continuous benchmarking suite that runs and generates HTML reports using Wes McKinney’s vbench framework, to which I had to make some (useful, I hope) additions.

## What it looks like now

You can check out a teaser/demo that was run on equidistant releases from the last two months. What has changed since the last version? Here’s a list in order of obviousness:

• We now use the lovely scikit-learn theme
• Timing graphs now show the ±1 standard deviation range
• cProfile output is displayed for all the benchmarks, so we can easily see at a glance what’s up
• Said profiler output is collapsible, using jQuery UI goodness
• There is now an improved Quick Start guide to running vbench on your machine

## What made this possible

I have done some more refactoring in my vbench fork, because I didn’t want to have a huge, monolithic Benchmark class that was specific to what we want in scikit-learn-speed. So on this branch, I set up a mixin/multiple inheritance hierarchy of benchmark classes.

The Benchmark class in vbench is now an abstract base class, with some common functionality and structure.
Our SklBenchmark class is defined in scikit-learn-speed as:

class SklBenchmark(CProfileBenchmarkMixin, MemoryBenchmarkMixin, PythonBenchmark):

Let’s read this from right to left:

• PythonBenchmark: This class stores code, setup and cleanup Python code as strings, and implements simple timing mechanisms using the time module.
• Bonus: TimeitBenchmark: This class extends PythonBenchmark with the timeit micro-benchmark timing method previously used in vbench. We turned this off in scikit-learn-speed.
• MemoryBenchmarkMixin: This adds memory benchmarking using memory_profiler.
• CProfileBenchmarkMixin: This runs the code through cProfile and implements mechanisms to report the output.

The database is not flexible enough to adapt to arbitrary benchmark structures right now, so if anybody would like to help the effort, it would be much appreciated.

# Scikit-learn-speed HTML reports teaser


EDIT: I made the plots a little more readable, check it out!

Last time, I teased you with a screenshot of local output. Now, I will tease you with the benchmarks run on a couple of recent commits, along with some from earlier this year.

After some effort and bugfixes, the project now reliably runs on different machines, so the next step, hosting it on a remote server and invoking it daily, is getting closer. In the meantime, you can have a look at the sample output.

Note that, just like last time, the plots look jagged, but the differences are mostly minor and significant conclusions cannot be drawn yet; once the suite starts running daily, the plots will become much more meaningful. I could waste time running the suite on more previous commits, but the results wouldn’t be comparable with the ones from the deployed system, because of hardware differences.

Playing around with this makes me want a couple of features in vbench. One is the possibility to overlay related benchmarks on the same plot (for example, different parameters for the same algorithm and data): this could be useful for spotting patterns. A second is some query/sorting support: seeing which benchmarks are the most expensive, or which show the biggest jumps in performance (this could become a historical wall of fame, or shame).

# Memory benchmarking with vbench


The scikit-learn-speed project now has memory usage benchmarking!

This was accomplished by building on what I described in my recent posts, specifically the extensions to Fabian’s memory_profiler that you can find in my fork, though they will be merged upstream soon. The key element is the %memit magic, whose development I blogged about on several occasions. I plugged this into vbench in a similar way to how the timings are computed, all with great success.

Here is a screenshot of the way a simple benchmark looks now, with just a few data points.

Memory benchmarking in scikit-learn-speed powered by vbench.

You can check it out and use it yourself for your benchmarks, but you need to use the vbench from the memory branch on my fork.

Of course, there are some important caveats. I am running this on my laptop, which runs OS X Lion, so, under the effect of this bug, I hardcoded the ‘-i‘ flag, and the memory benchmarks are not realistic. Also, the y-range should probably be forced wider, because the plots look erratic, showing very small noise at a large scale.

# On why my %memit fails on OSX


In my last post I mentioned that I’m not satisfied with the current state of %memit, because some more complicated numerical function calls make it crash. I will start this post with a reminder of a pretty important bug:

On MacOS X (10.7 but maybe more), after forking a new process, there is a segfault in Grand Central Dispatch on the BLAS DGEMM function from Accelerate.

EDIT 1: In a hurry, I forgot to mention how Olivier Grisel and David Cournapeau spent some time narrowing down this issue, starting from an odd testing bug in scikit-learn. They reported it to Apple, but there was, as of the date of this post, no reaction.

EDIT 2: MinRK confirms, and I verified shortly after, that this bug is fixed in Mountain Lion (10.8). Still not sure how far back it goes, though, so feedback is welcome.

When I first tried to make the %memit magic, I thought about simply measuring the current memory, running the command, and measuring the memory again. The problem is that the results are not consistent, because Python tries to reuse already allocated memory whenever it can.
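Here is a sketch of that naive approach, to show the problem (assuming psutil, which memory_profiler uses under the hood):

    import os
    import psutil  # assumption: psutil installed

    def naive_memit(fn):
        # measure RSS before and after: unreliable, because the allocation
        # may be served from memory Python already holds
        proc = psutil.Process(os.getpid())
        before = proc.memory_info().rss
        fn()
        return (proc.memory_info().rss - before) / 1024.0 ** 2  # MB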

Using memory_profiler, here’s an example illustrating this elastic memory management:

# mem_test.py
import numpy as np

def make_a_large_array():
    return np.ones((1000, 1000))

def main():
    make_a_large_array()
    make_a_large_array()
    make_a_large_array()

# in IPython:
In [1]: import mem_test

In [2]: %mprun -f mem_test.main mem_test.main()
Filename: mem_test.py

Line #    Mem usage  Increment   Line Contents
==============================================
8   24.8477 MB  0.0000 MB   def main():
9   24.8633 MB  0.0156 MB       make_a_large_array()
10   32.4688 MB  7.6055 MB       make_a_large_array()
11   32.4688 MB  0.0000 MB       make_a_large_array()


If this were in an IPython environment and one wanted to see how much memory make_a_large_array() uses, you might say we can simply run it a few times and take the maximum. However, if you happened to accidentally call main() once before, you will no longer get a good result:

In [3]: %mprun -f mem_test.main mem_test.main()
Filename: mem_test.py

Line #    Mem usage  Increment   Line Contents
==============================================
8   32.4922 MB  0.0000 MB   def main():
9   32.5234 MB  0.0312 MB       make_a_large_array()
10   32.5234 MB  0.0000 MB       make_a_large_array()
11   32.5234 MB  0.0000 MB       make_a_large_array()


So how can we get consistent results for the memory usage of an instruction? We could run it in a fresh, new process. I implemented this in %memit and it shows:

In [5]: %memit mem_test.make_a_large_array()
maximum of 3: 8.039062 MB per loop

In [6]: %memit mem_test.make_a_large_array()
maximum of 3: 8.035156 MB per loop

In [7]: %memit mem_test.make_a_large_array()
maximum of 3: 8.042969 MB per loop


This way you can also realistically benchmark assignments:

In [8]: %memit X = mem_test.make_a_large_array()
maximum of 3: 8.054688 MB per loop

In [9]: %memit X = mem_test.make_a_large_array()
maximum of 3: 8.058594 MB per loop

In [10]: %memit X = mem_test.make_a_large_array()
maximum of 3: 8.058594 MB per loop


If we don’t spawn a subprocess, del doesn’t help, but allocating new variables does:

In [11]: %memit -i X = mem_test.make_a_large_array()
maximum of 3: 7.632812 MB per loop

In [12]: del X

In [13]: %memit -i X = mem_test.make_a_large_array()
maximum of 3: 0.000000 MB per loop

In [14]: %memit -i Y = mem_test.make_a_large_array()
maximum of 3: 7.632812 MB per loop

In [15]: %memit -i Z = mem_test.make_a_large_array()
maximum of 3: 7.632812 MB per loop


Now, the problem is that when the function you are benchmarking contains calls to np.dot (matrix multiplication), the subprocess will consistently fail with SIGSEGV on affected OS X systems. These are pretty much exactly the functions I intended %memit for: numerical applications. For that reason, I have made %memit notify the user when all subprocesses fail and suggest the usage of the -i flag.

I think that, with this update, %memit is flexible and usable enough for actual use, and therefore for merging into memory_profiler.

# More on memory benchmarking


Following up on my task to make it easier to benchmark memory usage in Python, I updated Fabian’s memory_profiler to include a couple of useful IPython magics. While in my last post, I used the new IPython 0.13 syntax for defining magics, this time I used the backwards-compatible one from the previous version.

You can find this work-in-progress as a pull request on memory_profiler from where you can trace it to my GitHub repo. Here’s what you can do with it:

## %mprun

Copying the spirit of %lprun, since imitation is the most sincere form of flattery, you can use %mprun to easily view line-by-line memory usage reports, without having to go in and add the @profile decorator.

For example:


In [1]: import numpy as np

In [2]: from sklearn.linear_model import ridge_regression

In [3]: X, y = np.array([[1, 2], [3, 4], [5, 6]]), np.array([2, 4, 6])

In [4]: %mprun -f ridge_regression ridge_regression(X, y, 1.0)

(...)

109   41.6406 MB  0.0000 MB           if n_features > n_samples or \
110   41.6406 MB  0.0000 MB              isinstance(sample_weight, np.ndarray) or \
111   41.6406 MB  0.0000 MB              sample_weight != 1.0:
112
113                                       # kernel ridge
114                                       # w = X.T * inv(X X^t + alpha*Id) y
115                                       A = np.dot(X, X.T)
116                                       A.flat[::n_samples + 1] += alpha * sample_weight
117                                       coef = np.dot(X.T, _solve(A, y, solver, tol))
118                                   else:
119                                       # ridge
120                                       # w = inv(X^t X + alpha*Id) * X.T y
121   41.6484 MB  0.0078 MB               A = np.dot(X.T, X)
122   41.6875 MB  0.0391 MB               A.flat[::n_features + 1] += alpha
123   41.7344 MB  0.0469 MB               coef = _solve(A, np.dot(X.T, y), solver, tol)
124
125   41.7344 MB  0.0000 MB       return coef.T



## %memit

As described in my previous post, this is a %timeit-like magic for quickly seeing how much memory a Python command uses.
Unlike %timeit, however, the command needs to be executed in a fresh process. I have to dig in some more to debug this, but if the command is run in the current process, very often the difference in memory usage will be insignificant, I assume because preallocated memory is reused. The problem is that, when running in a new process, some functions that I tried to benchmark crash with SIGSEGV. For a lot of stuff, though, %memit is currently usable:

In [1]: import numpy as np

In [2]: X = np.ones((1000, 1000))

In [3]: %memit X.T
worst of 3: 0.242188 MB per loop

In [4]: %memit np.asfortranarray(X)
worst of 3: 15.687500 MB per loop

In [5]: Y = X.copy('F')

In [6]: %memit np.asfortranarray(Y)
worst of 3: 0.324219 MB per loop


It is very easy, using this small tool, to see what forces memory copying and what does not.

## Installation instructions

First, you have to get the source code of this version of memory_profiler. Then, it depends on your version of IPython. If you have 0.10, you have to edit ~/.ipython/ipy_user_conf.py like this: (once again, instructions borrowed from line_profiler)

# These two lines are standard and probably already there.
import IPython.ipapi
ip = IPython.ipapi.get()

# These two are the important ones.
import memory_profiler
ip.expose_magic('mprun', memory_profiler.magic_mprun)
ip.expose_magic('memit', memory_profiler.magic_memit)


If you’re using IPython 0.11 or newer, the steps are different. First create a configuration profile:

$ ipython profile create


Then create a file named ~/.ipython/extensions/memory_profiler_ext.py with the following content:

import memory_profiler

def load_ipython_extension(ip):
    ip.define_magic('mprun', memory_profiler.magic_mprun)
    ip.define_magic('memit', memory_profiler.magic_memit)


Then register it in ~/.ipython/profile_default/ipython_config.py, like this. Of course, if you already have other extensions such as line_profiler_ext, just add the new one to the list.

c.TerminalIPythonApp.extensions = [
'memory_profiler_ext',
]
c.InteractiveShellApp.extensions = [
'memory_profiler_ext',
]


Now launch IPython and you can use the new magics like in the examples above.