Update 9/30/17: Code for a faster version of Groupby is available here as part of the hdfe package. Although Groupby is much faster than Pandas GroupBy.apply and GroupBy.transform with user-defined functions, Pandas is much faster with common functions like mean and sum because they are implemented in Cython. The speed differences are not small.
|Apply||Cython (i.e. mean)||1.0|
|Transform||Cython (i.e. mean)||4.1|
The current version of Groupby can handle multi-dimensional outputs. It also checks if data is already sorted and runs faster if it is.
With categorical data, we often want to apply an operation within each category. For example, we may have a list of students, a list of teachers, and a list of test scores, so that student[i] has teacher[i] and test score[i], and we may want to know the average test score of each teacher’s students. The Python package Pandas has split-apply-combine functionality that makes this easy.
For example, say we have categorical data ‘first category’, and we want to find the average of ‘y’ within each group of ‘first category’:
first category 0 0.300000 1 0.333333 2 0.200000 Name: y, dtype: float64
This means that on average, someone in category 0 has a y value of 0.3, someone in category 1 has a y value of 1/3, and so on. (I used the scientific computing package Numpy to calculate means.) We may also want to give the vector of means the same dimension as the original data, so that group_means[i] is the mean of the group that has a member in position i. We can do this using the transform method instead of the apply method:
first category y mean 0 0 0.0 0.300000 1 1 0.1 0.333333 2 2 0.2 0.200000 3 0 0.3 0.300000 4 1 0.4 0.333333 5 1 0.5 0.333333 6 0 0.6 0.300000
However, with many groups, apply operations can be slow:
time to compute group means once with Pandas: 0.082 time to compute group means 100 times with Pandas: 7.156
We can do a lot better. Here’s a class that allows for faster groupby-apply operations:
To understand what’s going on here, take a look at the first line of init. Numpy’s unique function returns a numpy array containing the function’s first arguments without duplicates. It’s second return value, when return_inverse is set to True, gives the indices that, when applied to the first return value, give back the original array.
But the “indices” argument is even cooler than that – it’s a way of transforming keys from categorical data to a set of integers, whose highest value is the number of values the data takes on (minus one). So we can store the indices corresponding to the second unique key in the second element self.indices. If the data weren’t positive integers, we would have to use costly dictionary lookups instead of a list or array.
The “broadcast” argument indicates whether the answer should be broadcast to the shape of the original data, or if the length the reuslt should equal the number of keys. Setting broadcast=False is like using Pandas’s “apply” method, and setting broadcast=True is like using Pandas’s “transform” method.
Now we can use this class to get group means:
time to compute group means once with Grouped: 0.013 time to compute group means 100 times with Grouped: 0.848
Nine times faster!