MattFaus /
Last active Aug 3, 2018
All of the code necessary to implement and test protobuf projection in a Google Appengine web application.
import db_util
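The preview above shows only the first import, but the description names the technique: projection, i.e. materializing only the requested properties of an entity. As a rough illustration of the idea (not the gist's actual datastore-protobuf implementation), a plain dict can stand in for the entity:

```python
def project(entity, fields):
    """Sketch of the projection idea: keep only the requested
    properties of an entity, so less data is deserialized and sent
    over the wire. The gist applies this to App Engine datastore
    protocol buffers; a plain dict stands in here for illustration."""
    return {name: value for name, value in entity.items() if name in fields}
```

For example, `project({'id': 1, 'title': 'x', 'body': '...'}, {'id', 'title'})` drops the `body` property entirely.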
MattFaus /
Created Nov 17, 2014
Verification of my keybase public key

Keybase proof

I hereby claim:

  • I am mattfaus on github.
  • I am mattfaus ( on keybase.
  • I have a public key whose fingerprint is 1CF5 6643 9369 2689 9402 2358 69E8 0354 58E5 E154

To claim this, I am signing this object:

MattFaus /
Created Oct 29, 2014
Writes CSV data into multiple output shards, grouping rows by keys. Output shards are written to Google Cloud Storage.
class BatchedGcsCsvShardFileWriter(object):
"""Writes CSV data into multiple output shards, grouping rows by keys.
This class is a context manager, which closes all shards upon exit.
Say you are writing a lot of CSV data, like:
[0, "Bakery"],
[2, "Francisco"],
[3, "Matt"],
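The core of the technique is routing every row to a shard determined by its key, so all rows sharing a key end up in the same output file. A minimal in-memory sketch (the gist writes the shards to GCS via a context manager; `StringIO` buffers stand in here, and `zlib.crc32` is used because, unlike the built-in `hash()`, it is stable across processes):

```python
import csv
import io
import zlib

def write_sharded(rows, key_index, num_shards):
    """Group CSV rows into output shards by key: every row whose key
    hashes to the same value lands in the same shard, so downstream
    consumers can process each shard independently."""
    buffers = [io.StringIO() for _ in range(num_shards)]
    writers = [csv.writer(buf, lineterminator="\n") for buf in buffers]
    for row in rows:
        # Stable hash of the key column picks the shard.
        shard = zlib.crc32(str(row[key_index]).encode()) % num_shards
        writers[shard].writerow(row)
    return [buf.getvalue() for buf in buffers]
```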
MattFaus /
Last active Feb 23, 2022
Merge-reads several sorted .csv files stored on Google Cloud Storage.
class SortedGcsCsvShardFileMergeReader(object):
"""Merges several sorted .csv files stored on GCS.
This class is both an iterator and a context manager.
Let's say there are 2 .csv files stored on GCS, with contents like:
[0, "Matt"],
[0, "Sam"],
MattFaus /
Created Oct 29, 2014
A Pipeline job which launches a map-only job to sort .csv files in memory. Each .csv file is read from Google Cloud Storage into memory, sorted by the specified key, and then written back out to Google Cloud Storage. The machine running the sorting process must have roughly 10x as much memory as the size of each .csv file.
class ParallelInMemorySortGcsCsvShardFiles(pipeline.Pipeline):
def run(self, input_bucket, input_pattern, sort_columns,
model_type, output_bucket, output_pattern):
"""Sorts each input file in-memory, then writes it to an output file.
input_bucket - The GCS bucket which contains the unsorted .csv
input_pattern - A regular expression used to find files in the
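Each map task does the same simple thing: load one whole shard, sort it, serialize it back. A miniature version of that step (the raw text, the parsed rows, and the sorted output all coexist in memory at once, which is why the description calls for roughly 10x the file size in RAM):

```python
import csv
import io

def sort_csv_shard(csv_text, sort_columns):
    """Read one whole .csv shard into memory, sort its rows by the
    given column indices, and serialize it back out."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: tuple(row[col] for col in sort_columns))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```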
MattFaus /
Created Oct 8, 2014
An improvement over the CompressedFeatures class, achieved by not requiring the key->component mapping to be stored.
class DeterministicCompressedFeatures(CompressedFeatures):
"""Generates random components after seeding with the component_key.
By using a known seed to generate the random components, we do not need to
store or manage them. We can just recompute them whenever we need.
def __init__(self, num_features=RANDOM_FEATURE_LENGTH):
super(DeterministicCompressedFeatures, self).__init__(num_features)
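The trick is that a PRNG seeded with the component key always reproduces the same random vector, so the component can be recomputed on demand instead of stored or managed. A self-contained sketch of that idea (the Gaussian distribution and the default length are assumptions for illustration; the gist uses RANDOM_FEATURE_LENGTH):

```python
import random

def component_for_key(component_key, num_features=16):
    """Recompute a 'random' component deterministically: seeding with
    the component key yields the same vector on every call, so nothing
    needs to be stored."""
    rng = random.Random(component_key)
    return [rng.gauss(0, 1) for _ in range(num_features)]
```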
MattFaus / 2014_05_31_transformed.Video.json
Created Jun 4, 2014
BigQuery's JSON representation of the schema of 2014_05_31_transformed.Video.
u'fields': [{
    u'type': u'STRING',
    u'name': u'playlists',
    u'mode': u'REPEATED'
}, {
    u'type': u'STRING',
    u'name': u'source_table',
    u'mode': u'NULLABLE'
}, {
MattFaus /
Last active Aug 29, 2015
Some helper functions to build a SELECT statement for defining a view.
def get_table_schema(dataset, table):
"""If the table exists, returns its schema. Otherwise, returns None."""
table_service = BigQueryService.get_service().tables()
get_result = table_service.get(
return get_result['schema']
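Given the schema from `tables().get()`, building the view's SELECT is mostly a matter of joining the field names. A sketch of such a helper in the same spirit (nested/repeated fields are ignored here, and the legacy-SQL `[dataset.table]` form is an assumption):

```python
def build_view_select(schema, dataset, table):
    """Turn a BigQuery table schema (as returned by tables().get())
    into a SELECT statement listing every top-level column, suitable
    for defining a view over the table."""
    columns = ', '.join(field['name'] for field in schema['fields'])
    return 'SELECT %s FROM [%s.%s]' % (columns, dataset, table)
```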
MattFaus /
Created May 1, 2014
Experimental code demonstrating arbitrary mappers and reducers in the mapreduce library
import collections
import jinja2
import logging
import os
import request_handler
import third_party.mapreduce
import third_party.mapreduce.input_readers
import third_party.mapreduce.output_writers
import third_party.mapreduce.lib.files
import third_party.mapreduce.operation
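Stripped of the App Engine machinery the imports above pull in, the arbitrary mapper/reducer flow has a simple shape: map each record to (key, value) pairs, shuffle (group by key), then reduce each group. An in-process sketch of that flow:

```python
import collections

def run_mapreduce(records, mapper, reducer):
    """Map each record to (key, value) pairs, group the values by key,
    then reduce each group to a single result."""
    groups = collections.defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}
```

The classic word-count example: `run_mapreduce(lines, lambda line: [(w, 1) for w in line.split()], lambda key, values: sum(values))`.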
MattFaus /
Created Mar 22, 2014
A custom property transformer to translate a ndb.JsonProperty into a repeated record with fields for each of the keys in the original JSON.
class TransformedVideoTranslationInfo(bq_property_transform.TransformedEntity):
'translated_youtube_ids': {
'name': 'translated_youtube_ids',
'type': 'record',
'mode': 'repeated',
'fields': [
{'name': 'language',
'type': 'string'},
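The transform itself amounts to flattening the dict stored in the ndb.JsonProperty into a list of records, one per key, matching the repeated-record schema above. A sketch of that flattening (the `youtube_id` field name is an assumption; the schema snippet above only shows `language`):

```python
def translations_to_repeated_record(translated_youtube_ids):
    """Flatten a language -> youtube id dict (as stored in an
    ndb.JsonProperty) into a BigQuery-style repeated record with one
    entry per key."""
    return [
        {'language': language, 'youtube_id': youtube_id}
        for language, youtube_id in sorted(translated_youtube_ids.items())
    ]
```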