@MattFaus
MattFaus / appengine_config.py
Last active August 3, 2018 12:28
All of the code necessary to implement and test protobuf projection in a Google App Engine web application.
import db_util
db_util.enable_db_protobuf_projection()
db_util.enable_ndb_protobuf_projection()
@MattFaus
MattFaus / keybase.md
Created November 17, 2014 19:35
Verification of my keybase public key

Keybase proof

I hereby claim:

  • I am mattfaus on github.
  • I am mattfaus (https://keybase.io/mattfaus) on keybase.
  • I have a public key whose fingerprint is 1CF5 6643 9369 2689 9402 2358 69E8 0354 58E5 E154

To claim this, I am signing this object:

@MattFaus
MattFaus / BatchedGcsCsvShardFileWriter.py
Created October 29, 2014 21:27
Writes CSV data into multiple output shards, grouping rows by keys. Output shards are written to Google Cloud Storage.
class BatchedGcsCsvShardFileWriter(object):
    """Writes CSV data into multiple output shards, grouping rows by keys.

    This class is a context manager, which closes all shards upon exit.

    Say you are writing a lot of CSV data, like:

        [0, "Bakery"],
        [2, "Francisco"],
        [3, "Matt"],
@MattFaus
MattFaus / SortedGcsCsvShardFileMergeReader.py
Last active February 23, 2022 11:58
Merge-reads several sorted .csv files stored on Google Cloud Storage.
class SortedGcsCsvShardFileMergeReader(object):
    """Merges several sorted .csv files stored on GCS.

    This class is both an iterator and a context manager.

    Let's say there are 2 .csv files stored on GCS, with contents like:

    /bucket/file_1.csv:
        [0, "Matt"],
        [0, "Sam"],
@MattFaus
MattFaus / ParallelInMemorySortGcsCsvShardFiles.py
Created October 29, 2014 21:01
A Pipeline job which launches a map-only job to sort .csv files in memory. Each .csv file is read from Google Cloud Storage into memory, sorted by the specified key, and then written back out to Google Cloud Storage. The machine running the sorting process must have roughly 10x as much memory as the size of each .csv file.
class ParallelInMemorySortGcsCsvShardFiles(pipeline.Pipeline):

    def run(self, input_bucket, input_pattern, sort_columns,
            model_type, output_bucket, output_pattern):
        """Sorts each input file in-memory, then writes it to an output file.

        Arguments:
            input_bucket - The GCS bucket which contains the unsorted .csv
                files.
            input_pattern - A regular expression used to find files in the
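A minimal local-disk sketch of the per-file work described above (read a whole shard into memory, sort it by the key columns, write it back out); it omits the mapreduce/pipeline plumbing, and all names and paths are placeholders.

import csv

def sort_csv_file_in_memory(input_path, output_path, sort_columns):
    """Toy stand-in: fully load one shard, sort it, and write it back out."""
    with open(input_path) as f:
        rows = list(csv.reader(f))  # the whole shard must fit in memory

    # sort_columns is assumed to be a list of column indexes, most significant first.
    rows.sort(key=lambda row: tuple(row[i] for i in sort_columns))

    with open(output_path, "w") as f:
        csv.writer(f).writerows(rows)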
@MattFaus
MattFaus / DeterministicCompressedFeatures.py
Created October 8, 2014 20:49
An improvement over the CompressedFeatures class introduced at http://derandomized.com/post/51709771229/compressed-features-for-machine-learning#disqus_thread by not requiring the key->component mapping to be stored.
class DeterministicCompressedFeatures(CompressedFeatures):
    """Generates random components after seeding with the component_key.

    By using a known seed to generate the random components, we do not need to
    store or manage them. We can just recompute them whenever we need.
    """

    def __init__(self, num_features=RANDOM_FEATURE_LENGTH):
        super(DeterministicCompressedFeatures, self).__init__(num_features)
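A minimal sketch of the seeding trick the docstring describes, assuming numpy and a hypothetical get_component helper: hashing the component_key into a seed makes the "random" component reproducible, so it never has to be stored.

import hashlib

import numpy as np

RANDOM_FEATURE_LENGTH = 100  # assumed default for illustration only

def get_component(component_key, num_features=RANDOM_FEATURE_LENGTH):
    """Derive a component vector deterministically from its key."""
    # Same key -> same seed -> same vector, on any machine, at any time.
    seed = int(hashlib.md5(component_key.encode("utf-8")).hexdigest(), 16) % (2 ** 32)
    rng = np.random.RandomState(seed)
    return rng.normal(0, 1, num_features)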
@MattFaus
MattFaus / 2014_05_31_transformed.Video.json
Created June 4, 2014 21:07
BigQuery's JSON representation of the schema of 2014_05_31_transformed.Video.
{
    u'fields': [{
        u'type': u'STRING',
        u'name': u'playlists',
        u'mode': u'REPEATED'
    }, {
        u'type': u'STRING',
        u'name': u'source_table',
        u'mode': u'NULLABLE'
    }, {
@MattFaus
MattFaus / bq_connection.py
Last active August 29, 2015 14:02
Some helper functions to build a SELECT statement for defining a view.
def get_table_schema(dataset, table):
    """If the table exists, returns its schema. Otherwise, returns None."""
    table_service = BigQueryService.get_service().tables()
    try:
        get_result = table_service.get(
            projectId=BQ_PROJECT_ID,
            datasetId=dataset,
            tableId=table
        ).execute()
        return get_result['schema']
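The preview ends before the SELECT-building helpers the description mentions. A minimal sketch of that step, assuming the schema dict returned by get_table_schema above and legacy BigQuery SQL's [dataset.table] syntax; build_view_select is a hypothetical name, not the gist's.

def build_view_select(dataset, table, schema):
    """Turn a table schema into a SELECT statement that could back a view."""
    field_names = [field['name'] for field in schema['fields']]
    return 'SELECT %s FROM [%s.%s]' % (', '.join(field_names), dataset, table)

# schema = get_table_schema('my_dataset', 'my_table')
# if schema:
#     view_query = build_view_select('my_dataset', 'my_table', schema)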
@MattFaus
MattFaus / advanced_mapreduce.py
Created May 1, 2014 20:38
Experimental code demonstrating arbitrary mappers and reducers in the mapreduce library
import collections
import jinja2
import logging
import os
import request_handler
import third_party.mapreduce
import third_party.mapreduce.input_readers
import third_party.mapreduce.output_writers
import third_party.mapreduce.lib.files
import third_party.mapreduce.operation
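The preview shows only the imports. For orientation, a library-agnostic sketch of the map/shuffle/reduce flow that "arbitrary mappers and reducers" refers to; it does not use the mapreduce library's actual pipeline classes, and run_mapreduce is a made-up name.

import collections

def run_mapreduce(records, mapper, reducer):
    """Toy in-process model of the flow: map -> group by key -> reduce."""
    grouped = collections.defaultdict(list)
    for record in records:
        for key, value in mapper(record):  # a mapper may emit any number of pairs
            grouped[key].append(value)
    results = {}
    for key, values in grouped.items():
        results[key] = reducer(key, values)
    return results

# e.g. word count:
# run_mapreduce(["a b", "b c"],
#               mapper=lambda line: [(w, 1) for w in line.split()],
#               reducer=lambda key, values: sum(values))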
@MattFaus
MattFaus / custom_bq_transformers.py
Created March 22, 2014 00:34
A custom property transformer to translate an ndb.JsonProperty into a repeated record with fields for each of the keys in the original JSON.
class TransformedVideoTranslationInfo(bq_property_transform.TransformedEntity):

    CUSTOM_SCHEMAS = {
        'translated_youtube_ids': {
            'name': 'translated_youtube_ids',
            'type': 'record',
            'mode': 'repeated',
            'fields': [
                {'name': 'language',
                 'type': 'string'},
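The schema above is only half the story; the other half is reshaping the JsonProperty's value to match it. A minimal sketch, assuming the stored JSON is a {language: youtube_id} dict (an assumption for illustration, not taken from the gist).

def transform_translated_youtube_ids(json_value):
    """Flatten a {language: youtube_id} dict into a repeated record."""
    return [
        {'language': language, 'youtube_id': youtube_id}
        for language, youtube_id in sorted(json_value.items())
    ]

# transform_translated_youtube_ids({'es': 'abc123', 'pt': 'def456'})
# -> [{'language': 'es', 'youtube_id': 'abc123'},
#     {'language': 'pt', 'youtube_id': 'def456'}]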