Fast groupby-apply operations in Python with and without Pandas

Update 9/30/17: Code for a faster version of Groupby is available here as part of the hdfe package. Although Groupby is much faster than Pandas GroupBy.apply and GroupBy.transform with user-defined functions, Pandas is much faster with common functions like mean and sum because they are implemented in Cython. The speed differences are not small.

    Pandas Groupby
Apply Cython (i.e. mean) 1.0  
  Python (user-defined) 60.0 6.8
Transform Cython (i.e. mean) 4.1  
  Python (user-defined) 164.3 7.1

The current version of Groupby can handle multi-dimensional outputs. It also checks if data is already sorted and runs faster if it is.

With categorical data, we often want to apply an operation within each category. For example, we may have a list of students, a list of teachers, and a list of test scores, so that student[i] has teacher[i] and test score[i], and we may want to know the average test score of each teacher’s students. The Python package Pandas has split-apply-combine functionality that makes this easy.

For example, say we have categorical data ‘first category’, and we want to find the average of ‘y’ within each group of ‘first category’:

import numpy as np
import pandas as pd

df = pd.DataFrame({'first category': [0, 1, 2, 0, 1, 1, 0], 
                   'y': np.arange(0, .7, .1)})
group_means = df.groupby('first category')['y'].apply(np.mean)
first category
0    0.300000
1    0.333333
2    0.200000
Name: y, dtype: float64

This means that on average, someone in category 0 has a y value of 0.3, someone in category 1 has a y value of 1/3, and so on. (I used the scientific computing package Numpy to calculate means.) We may also want to give the vector of means the same dimension as the original data, so that group_means[i] is the mean of the group that has a member in position i. We can do this using the transform method instead of the apply method:

df['mean'] = df.groupby('first category')['y'].transform(np.mean)
   first category    y      mean
0               0  0.0  0.300000
1               1  0.1  0.333333
2               2  0.2  0.200000
3               0  0.3  0.300000
4               1  0.4  0.333333
5               1  0.5  0.333333
6               0  0.6  0.300000

However, with many groups, apply operations can be slow:

import time

n_decimals = 3
n_iters = 100

n_obs = 10**4
n_categories = 10**3

first_category = np.random.choice(n_categories, n_obs)
y = np.random.normal(0, 1, n_obs)

df = pd.DataFrame({'first category': first_category,
                   'y': y})
start = time.clock()
grouped = df.groupby('first category')
pandas_answer = grouped.apply(np.mean)
print('time to compute group means once with Pandas: {0}'\
      .format(round(time.clock() - start, n_decimals)))

start = time.clock()
for i in range(n_iters):
print('time to compute group means {0} times with Pandas: {1}'\
      .format(n_iters, round(time.clock() - start, n_decimals)))
time to compute group means once with Pandas: 0.082
time to compute group means 100 times with Pandas: 7.156

Faster operations without Pandas: The Groupby Class

We can do a lot better. Here’s a class that allows for faster groupby-apply operations:

class Groupby:
    def __init__(self, keys):
        _, self.keys_as_int = np.unique(keys, return_inverse = True)
        self.n_keys = max(self.keys_as_int) + 1
    def set_indices(self):
        self.indices = [[] for i in range(self.n_keys)]
        for i, k in enumerate(self.keys_as_int):
        self.indices = [np.array(elt) for elt in self.indices]
    def apply(self, function, vector, broadcast):
        if broadcast:
            result = np.zeros(len(vector))
            for idx in self.indices:
                result[idx] = function(vector[idx])
            result = np.zeros(self.n_keys)
            for k, idx in enumerate(self.indices):
                result[self.keys_as_int[k]] = function(vector[idx])

        return result

To understand what’s going on here, take a look at the first line of init. Numpy’s unique function returns a numpy array containing the function’s first arguments without duplicates. It’s second return value, when return_inverse is set to True, gives the indices that, when applied to the first return value, give back the original array.

unique_keys, indices = np.unique(['a', 'b', 'a', 'a', 'b'], return_inverse = True)
# array([{'b', 'a'}], dtype=object)
# [0, 1, 0, 0, 1]
# array(['a', 'b', 'a', 'a', 'b'], dtype='<U1')

But the “indices” argument is even cooler than that – it’s a way of transforming keys from categorical data to a set of integers, whose highest value is the number of values the data takes on (minus one). So we can store the indices corresponding to the second unique key in the second element self.indices. If the data weren’t positive integers, we would have to use costly dictionary lookups instead of a list or array.

The “broadcast” argument indicates whether the answer should be broadcast to the shape of the original data, or if the length the reuslt should equal the number of keys. Setting broadcast=False is like using Pandas’s “apply” method, and setting broadcast=True is like using Pandas’s “transform” method.

Now we can use this class to get group means:

start = time.clock()
grouped = Groupby(first_category)

group_means = Groupby(first_category).apply(np.mean, y, broadcast=False)
print('time to compute group means once with Grouped: {0}'\
      .format(round(time.clock() - start, n_decimals)))

start = time.clock()
grouped = Groupby(first_category)
for i in range(n_iters):
    grouped.apply(np.mean, y, broadcast=False)
print('time to compute group means {0} times with Grouped: {1}'\
      .format(n_iters, round(time.clock() - start, n_decimals)))
time to compute group means once with Grouped: 0.013
time to compute group means 100 times with Grouped: 0.848

Nine times faster!