@waylonflinn
Created September 30, 2015 16:38
Filtering and Aggregation with Bquery
import bcolz
import bquery
data_path = '/some/place/with/bcolz/data'
c_table = bquery.ctable(rootdir=data_path)
## Filter
# build the criterion: rows whose 'feature' column equals the given byte string
string_feature = 'some_string_feature'
criterion = "feature == b'{0}'".format(string_feature)
# evaluate the criterion with numexpr to get a boolean array (one value per row)
boolarr = c_table.eval(criterion)
## Aggregate
# column_name whose unique values define groups
group_columns = ['group_column_name']
# input_column_name, operation, output_column_name
aggregation_operations = [['number_column_name', 'mean', 'mean_column']]
# use boolean array in aggregation
mean_repin_count = c_table.groupby(group_columns, aggregation_operations,
                                   bool_arr=boolarr)
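
To make the intent of the filter-then-aggregate step concrete, here is a minimal sketch in plain Python of what the call above is expected to compute: a per-group mean over only the rows where the boolean array is true. It does not use bquery and is not how bquery is implemented. The sample values are hypothetical; the lists simply reuse the placeholder column names from the snippet.

```python
from collections import defaultdict

# Hypothetical in-memory columns standing in for the on-disk ctable.
feature = [b'a', b'b', b'a', b'a']
group_column_name = ['x', 'x', 'y', 'x']
number_column_name = [1.0, 2.0, 3.0, 5.0]

# Filter: one boolean per row, True where the criterion holds.
boolarr = [value == b'a' for value in feature]

# Aggregate: per-group sum and count, using only rows where boolarr is True.
sums = defaultdict(float)
counts = defaultdict(int)
for keep, group, number in zip(boolarr, group_column_name, number_column_name):
    if keep:
        sums[group] += number
        counts[group] += 1

# Mean per group, analogous to the 'mean_column' output column.
mean_column = {group: sums[group] / counts[group] for group in sums}
print(mean_column)
```

Groups with no rows passing the filter do not appear in this sketch's result. Whether bquery treats such groups the same way is not something the snippet above shows.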