Don’t be tricked by the Hashing Trick

Lucas Bernardi
Booking.com Data Science
12 min read · Jan 10, 2018


In Machine Learning, the Hashing Trick is a technique to encode categorical features. It’s been gaining popularity lately after being adopted by libraries like Vowpal Wabbit and Tensorflow (where it plays a key role) and others like sklearn, where support is provided to enable out-of-core learning.

Unfortunately, the Hashing Trick is not parameter-free; the hashing space size must be decided beforehand. In this article, the Hashing Trick is described in depth, the effects of different hashing space sizes are illustrated with real world data sets, and a criterion to decide the hashing space size is constructed.

Out-of-core learning

Consider the problem of learning a linear model: an out-of-core algorithm learns the model without loading the whole data set in memory. It reads and processes the data row by row, updating feature coefficients on the fly. This makes the algorithm very scalable since its memory footprint is independent of the number of rows, which is a very attractive property when dealing with data sets that don’t fit in memory.

One possible implementation uses a mapping from a feature name (and possibly a value) to its corresponding coefficient. For example, if the data contains ‘country’ as a feature with values like ‘Netherlands’, ‘Argentina’ or ‘Nigeria’, the weight associated with each country is updated like this:

W['country_argentina'] = W['country_argentina'] + delta(X, y, 'country')

This is implicitly applying one-hot encoding to the feature ‘country’. The fact that it happens implicitly is great: we don’t need to create a file with one column for each possible country, use sparse data file formats or do any other preprocessing. We don’t even need to know how many categories there are; we can just take our .csv data as is, throw it into our learner and get back a coefficient for each feature. This also plays nicely with numerical features, for which we simply skip appending the value to the key:

W['price'] = W['price'] + delta(X, y, 'price')
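To make this concrete, here is a minimal sketch of such an out-of-core learner: a binary logistic regression trained with stochastic gradient descent, reading a CSV row by row and keeping one coefficient per observed feature in a dictionary. The column names ('label', 'country', 'price') and the learning rate are made up for illustration; they are not part of any data set discussed here.

import csv
from collections import defaultdict
from math import exp

def train(path, lr=0.1):
    # one coefficient per feature name, created on the fly (implicit one-hot encoding)
    W = defaultdict(float)

    def predict(features):
        z = sum(W[name] * value for name, value in features)
        return 1.0 / (1.0 + exp(-z))

    with open(path) as f:
        for row in csv.DictReader(f):  # one row at a time: memory is independent of the number of rows
            y = float(row.pop("label"))
            features = [
                ("country_" + row["country"], 1.0),  # categorical: the key includes the value
                ("price", float(row["price"])),      # numerical: the key is just the feature name
            ]
            p = predict(features)
            for name, value in features:
                W[name] += lr * (y - p) * value      # the 'delta' update from the text
    return W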

While this approach looks good at first glance, it relies heavily on the way we implement the mapping from features to coefficients. This takes us to the next topic: Hashing.

Hashing

In order to implement the coefficients table W, we could use a hash table. This is a data structure that maps ‘keys’ of arbitrary length to ‘values’. In our case, the features are the keys and the coefficients are the values. This mapping is performed using a hash function, which takes a key of arbitrary length as input and outputs an integer in a specific range. The hash table uses this integer as an index in a 1-dimensional array where the values are stored.

Good hash functions have an interesting property: they map the expected input evenly over the output range, which means that every hash value has roughly the same probability of being observed after hashing a typical sample of the key space.

Hash functions also come with a not-so-nice side effect: they can hash different keys to the same integer value (this is known as a ‘collision’), which is a big problem for our hash table: we can’t simply store each coefficient at the position the hash of its key points to, since we could be overwriting a previously inserted value.

A collision-resolution strategy is necessary. The most straightforward approach (known as ‘separate chaining’) is to store a list in each position of the array, containing all the keys and values hashed to that bucket. When the value associated with a key is requested, the key is hashed to give an index into the array, then the corresponding list is scanned until the requested key is found, at which point the associated value is returned. When a new key and value are added to the table, a similar procedure is followed; the only difference is that instead of returning the value, the key and value are appended to the list.
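As a reference point (this is just the textbook data structure, nothing specific to the setup above), a separate-chaining table might look like this; note how every key has to be stored alongside its value:

class ChainedHashTable:
    # Separate chaining: each bucket holds a list of (key, value) pairs,
    # so colliding keys can coexist, but every key must be kept in memory.
    def __init__(self, n_buckets=1024):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def get(self, key, default=0.0):
        for k, v in self._bucket(key):   # scan the chain for the exact key
            if k == key:
                return v
        return default

    def set(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key: append to the chain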

This whole process makes reads and writes slower, though, and forces us to keep the keys in memory. This is bad news for our out-of-core learning algorithm: the whole learning process becomes slower and its memory footprint is now linear in the number of features.

Unless we are willing to accept collisions...

The Ostrich Algorithm

The Ostrich Algorithm is very simple: ignore the problem. In our case, this means we don’t try to resolve collisions at all; we just go ahead and do:

W = 1-dimensional array of length n

W[hash('country_' + X['country'])] = W[hash('country_' + X['country'])] + delta(X, y, 'country')

This means that we don’t need to scan a list nor keep the keys in memory, making our out-of-core learning algorithm constant in memory and fast.
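Putting it together, here is a sketch of the same update with the hashing trick: the coefficient table is a fixed-size array and the keys are never stored. The hash size and feature names are placeholders; in practice you would use a stable hash function such as MurmurHash, since Python's built-in string hash is randomised between runs.

from math import exp

n = 2 ** 20        # hashing space size, decided beforehand
W = [0.0] * n      # fixed-size coefficient array; no keys stored anywhere

def bucket(feature_name):
    # map the arbitrary-length key to an index in [0, n);
    # colliding features will simply share a coefficient
    return hash(feature_name) % n

def update(X, y, lr=0.1):
    features = [("country_" + X["country"], 1.0), ("price", X["price"])]
    z = sum(W[bucket(name)] * value for name, value in features)
    p = 1.0 / (1.0 + exp(-z))
    for name, value in features:
        W[bucket(name)] += lr * (y - p) * value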

Collisions do happen though. What’s the effect of collisions on this approach?

Figures 1, 2 and 3 show the effect of collisions on predictive power for 3 data sets:

  1. Booking.com’s internal data set containing ~200k features and ~250M examples for a 4-class classification problem;
  2. The Criteo Display Advertising Kaggle Challenge data set, which contains ~30M features and ~40M examples;
  3. The Avazu CTR Prediction Kaggle Challenge data set, with ~40M features and ~40M examples.

For all data sets, the algorithm is a simple logistic regression trained with Vowpal Wabbit (one against all for the multi-class problem).

In all charts, the x-axis is the actual proportion of colliding features during training. The blue curve, associated with the left-hand y-axis, is n, the hash size in number of buckets; the red curve, associated with the right-hand y-axis, is the log loss on a held-out data set.

All charts are quite consistent: collisions hurt, and the more collisions there are, the worse the log loss. This is expected, but the effect is surprisingly small: even with 50% of the features colliding, the performance loss is much less than half a percent. This could be considered a big win for the Ostrich Algorithm, but depending on the application such an impact could be unacceptable. Furthermore, the impact also depends on the specific data set: at 50% colliding features, the Booking.com data set is hurt more than twice as much as the Criteo data set.

Figure 1 — Booking.com Dataset
Figure 2 — Criteo Kaggle Challenge Data
Figure 3 — Avazu Kaggle Challenge Data

Another effect of collisions is a potential loss of model interpretability. Even if collisions don’t hurt predictive power, they might produce absurd models; for example, binary features like ‘logged in’ and ‘logged out’ could share exactly the same weight, which wouldn’t make much sense. Another example is colliding countries: if we want to estimate the price of a hotel and use ‘country’ as a feature, we expect prices in Denmark to be much higher than in Greece. But if Denmark and Greece are hashed to the same bucket, they contribute exactly the same amount to the price of a hotel, contradicting our intuition.

Why not just use as big a coefficient table as possible? That would certainly minimise collisions, but at the expense of memory, which defeats one of the purposes of the Hashing Trick: avoiding the memory consumed by the features. Minimising memory consumption has many advantages, and one of particular interest is that it allows us to train many versions of the model in parallel on a single computer, which is very helpful for things like hyperparameter search.

Whether we care about a specific metric down to the last bit, about how interpretable our model is, or about making the most of our computational resources, understanding how the hashing space size, the feature space size and the number of collisions interact is key to making principled trade-offs (and, as you’ll see, a lot of fun). Let’s dig deeper.

Expecting Collisions

If we know roughly how many features our data has (denoted by k) and we fix the hash space size (denoted by n), what’s the expected number of features colliding? This is a nice puzzle, and it is related to a very famous problem: The Birthday Paradox. In the original version, it is described as follows:

In a group of k people, what is the probability of at least 2 persons having the same birthday?

But in our case we are interested in a slightly different problem:

In a group of k people, how many of them are expected to share the same birthday with at least one other person?

For example, consider a group of 5 people, 2 of them having their birthday on February 17th, and 3 of them on March 23rd, then everybody shares their birthday with at least someone else. But if 2 of them have their birthday on February 17th and the others on April 25th, August 2nd and December 18th, then only 2 out of 5 people share their birthday with someone else. That is 40% colliding people.

So let’s solve this puzzle step by step:

1. The probability of persons X and Y sharing their birthday is simply 1/n.

2. The probability of persons X and Y not sharing their birthday is then 1 - 1/n.

3. The probability of person X not sharing their birthday with any of the other k-1 people in the group is therefore:

(1 - 1/n)^(k-1)

4. Then the probability p of X sharing their birthday with at least one other person is:

p = 1 - (1 - 1/n)^(k-1)

5. Define one binary random variable B_i for each person i: it takes value 1 when the person shares their birthday with at least one other person in the group and 0 otherwise. It follows a Bernoulli distribution with probability of success p.

6. Define another random variable H as the sum of the k previously defined random variables:

H = B_1 + B_2 + … + B_k

which simply represents how many of the k people share their birthday with at least someone else.

7. We want the expected value of H. By linearity of expectation we can write:

E[H] = E[B_1] + E[B_2] + … + E[B_k] = k·p

8. Finally, in a group of k people with n possible birthdays, the expected number of people sharing their birthday with at least one other person is:

E[H] = k · (1 - (1 - 1/n)^(k-1))

Equation 1

Back to hashing. The analogy is direct: the group of people is the feature space, the number of possible birthdays is the hashing space size, and two people sharing a birthday is two features sharing a hash value, a collision. We assume that all birthdays are equally probable, which is not quite true for real birthdays, but is very close to true for hashing: we are just assuming the hash function is a good one.

one experiment is worth a thousand equations

So equation 1 looks good, but one experiment is worth a thousand equations. Does it actually work? For our 3 data sets we know how many features there are, how big the hash space is, and how many collisions actually happened, so let’s compare what equation 1 predicts with what reality shows:

The results are conclusive: the model correctly predicts how many collisions happen in all 3 data sets.
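If you want to convince yourself without these data sets, here is a small simulation with synthetic features: assign each of k keys a uniformly random bucket among n (which is what a good hash function approximates) and count how many keys share a bucket, then compare with equation 1. The values of k and n below are arbitrary examples.

import random
from collections import Counter

def expected_collisions(k, n):
    # Equation 1: expected number of features sharing a bucket with at least one other feature
    return k * (1 - (1 - 1 / n) ** (k - 1))

def simulated_collisions(k, n, seed=0):
    rng = random.Random(seed)
    buckets = [rng.randrange(n) for _ in range(k)]   # one uniformly random bucket per feature
    counts = Counter(buckets)
    return sum(1 for b in buckets if counts[b] > 1)  # features whose bucket is shared

k, n = 200_000, 2 ** 18
print(expected_collisions(k, n))   # what equation 1 predicts
print(simulated_collisions(k, n))  # what actually happens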

Controlling Collisions

We now have a good model for the expected number of collisions given the feature space and hashing space sizes. A very natural next question follows: for a given feature space size k, how big should the hashing space n be to produce an expected number of collisions c? All we need to do is take equation 1, fix c and solve for n as a function of k:

n = 1 / (1 - (1 - c/k)^(1/(k-1)))

It is a pretty ugly function, but let’s see if it works.

Figure 5

Figure 5 shows the results of applying this formula. The x-axis represents different feature space sizes (k), the y-axis represents the hashing space size (n), and each curve corresponds to a specific number of collisions (c). The chart can be interpreted as a contour plot of equation 1. One surprising result is that, for a fixed number of collisions, the hash size seems to grow quadratically with the number of features. It is also surprising that all curves can be fitted perfectly by a parabola, which suggests that it should be possible to find a parabolic form for n given c and k (as an alternative to that ugly formula).

Indeed, under rather weak assumptions (n > 1 and k ≪ n), we can apply a binomial approximation to the power term in equation 1:

(1 - 1/n)^(k-1) ≈ 1 - (k-1)/n

and then plug it back to get:

c ≈ k · (1 - (1 - (k-1)/n)) = k(k-1)/n

which allows us to write n as a function of k and c:

n = k(k-1)/c

Equation 2

To validate this approximation we can simply compare it with the original formula:

Equation 2 is a very good approximation and can therefore be used directly to decide the hash space size, given the feature space size and the desired number of collisions. It’s worth noting that for the case of 0 collisions we can conveniently set c=1 to get n=k(k-1)≈k².
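A small helper makes the comparison easy to reproduce: the first function is the exact inverse of equation 1 derived above, the second is equation 2; the (k, c) pairs below are arbitrary examples.

def hash_size_exact(k, c):
    # exact inverse of equation 1: the n that yields c expected collisions among k features
    return 1 / (1 - (1 - c / k) ** (1 / (k - 1)))

def hash_size_approx(k, c):
    # equation 2: the binomial approximation, valid when k is much smaller than n
    return k * (k - 1) / c

for k in (1_000, 100_000):
    for c in (1, 0.05 * k):
        print(k, c, round(hash_size_exact(k, c)), round(hash_size_approx(k, c)))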

Trade-offs

In practice there are two main approaches to implement the hashing trick:

Global hashing space: There’s only one hashing space and one single parameter to decide, but cross-field collisions can happen — countries can collide with user ids, for example. This is how Vowpal Wabbit works.

Per-field hashing space: There’s a hashing space per feature, allowing finer-grained control at the cost of some speed and more parameters to tune. This is how Tensorflow works; a short sketch of both approaches follows.
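In code, the difference between the two approaches is just where the field name enters the hash and how many tables there are; the sizes below are made-up placeholders.

# Global hashing space: one table; the field name is hashed together with the value,
# so e.g. a country can still collide with a user id.
N_GLOBAL = 2 ** 22

def global_bucket(field, value):
    return hash(field + "_" + str(value)) % N_GLOBAL

# Per-field hashing spaces: one table and one size per feature,
# so collisions can only happen within the same field.
N_PER_FIELD = {"country": 2 ** 16, "user_id": 2 ** 24}

def per_field_bucket(field, value):
    return hash(str(value)) % N_PER_FIELD[field]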

Also, in practice, there are two main types of categorical features:

Moderate cardinality and static: These are features with fewer than a thousand categories that don’t change much, like country. Usually it’s important to have one weight for each category, so 0 collisions is what we’re after.

High cardinality and dynamic: These are features with more than a thousand categories that constantly get new ones, like user id or hotel id. Usually it’s not that important to have a weight for each individual category; the distribution of weights over the categories is what matters. Collisions are acceptable, but it’s hard to tell how many we are willing to accept. 5% collisions is a pretty conservative ansatz; plugging c = 0.05k into equation 2 gives the simple rule n = 20k (roughly). Cross-validation techniques can also help, but require more time and resources.

One criterion: After quite a bit of analysis, we now have all the elements to recommend a practical criterion for deciding the size of the hashing space, one that gives a good balance of memory usage, interpretability and predictive power (a short code sketch follows the list):

If you can choose the hashing space on a per-feature basis, use n=k² for features with fewer than a thousand categories and n=20k for the others.

If there is only one hashing space and fewer than twenty thousand features in total, use n=k², otherwise use n=20k.

If you want to control collisions as a proportion r of the number of features, then use n=k/r.
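As a sketch, the criterion translates into a couple of one-liners; the cardinalities passed in are whatever estimates you have for your own data.

def per_field_hash_size(n_categories):
    # per-feature hashing space: aim for ~0 collisions on small, static features (n = k^2),
    # accept ~5% collisions on large, dynamic ones (n = 20k)
    k = n_categories
    return k * k if k < 1_000 else 20 * k

def global_hash_size(total_features):
    # single hashing space: k^2 while it stays affordable, 20k otherwise
    k = total_features
    return k * k if k < 20_000 else 20 * k

print(per_field_hash_size(230))        # e.g. a 'country' feature
print(per_field_hash_size(5_000_000))  # e.g. a 'user_id' feature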

Conclusion

Of course, the proposed criterion is not absolutely general; that is not the intent of this article (text problems might show different behaviour, for example). This work presents the general principles that govern the Hashing Trick, the trade-offs involved, and an analysis that gives you the tools to construct heuristics and criteria for deciding the size of the hash in many different regimes. Your mileage may vary.

A couple of findings were surprising:

  • In practice, the effect of collisions on predictive power is very low.
  • The minimum hash size required to expect a fixed number of collisions grows quadratically with the feature space size (this is especially relevant for the case of 0 collisions; if we only want to control collisions as a percentage of the number of features, then the hash size n grows only linearly with the feature space size k).

I like to see the Hashing Trick as a successful example of the Ostrich Algorithm: algorithmically, the collision issue is simply ignored.

Finally, two things. First, all the code to reproduce these results is available on my GitHub, which contains some useful scripts for dealing with Vowpal Wabbit models and some freaky snippets like maths with big numbers in Python. Second, if you are interested in a formal analysis of the hashing trick, I recommend Feature Hashing for Large Scale Multitask Learning by Weinberger, Dasgupta, Attenberg, Langford and Smola.

I want to thank Stas Girkin for asking this (at that moment) awkward question, Kristian Holsheimer for figuring out the power expansion, Denis Bilenko for the fruitful discussions, Themis Mavridis for reviewing the article and Steven Baguley and Kristofer Barber for making it readable.

