Matt Faus MattFaus

MattFaus / appengine_config.py
Last active Aug 3, 2018
All of the code necessary to implement and test protobuf projection in a Google Appengine web application.
import db_util
db_util.enable_db_protobuf_projection()
db_util.enable_ndb_protobuf_projection()
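The `appengine_config.py` module is a natural home for these calls because App Engine imports it at startup, before application code runs. As a rough illustration of the monkey-patch-at-startup idea (using an entirely hypothetical toy model class and helper, not the gist's `db_util` internals), a projection hook can wrap a deserializer so only the requested fields survive decoding:

```python
class FakeModel(object):
    """Stand-in for a datastore model class (hypothetical)."""

    @classmethod
    def _deserialize(cls, data):
        # Pretend this decodes an entity protobuf into property values.
        return dict(data)


def enable_projection(model_class, fields):
    """Wrap the class's deserializer to keep only the projected fields.

    This mirrors the general shape of installing a projection hook at
    startup; the real gist patches db/ndb protobuf decoding instead.
    """
    original = model_class._deserialize

    def projected(data):
        return {k: v for k, v in original(data).items() if k in fields}

    model_class._deserialize = staticmethod(projected)
```

Installing the hook once at startup means every later deserialization is projected, with no changes at call sites.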
MattFaus / keybase.md
Created Nov 17, 2014
Verification of my keybase public key

Keybase proof

I hereby claim:

  • I am mattfaus on github.
  • I am mattfaus (https://keybase.io/mattfaus) on keybase.
  • I have a public key whose fingerprint is 1CF5 6643 9369 2689 9402 2358 69E8 0354 58E5 E154

To claim this, I am signing this object:

MattFaus / BatchedGcsCsvShardFileWriter.py
Created Oct 29, 2014
Writes CSV data into multiple output shards, grouping rows by keys. Output shards are written to Google Cloud Storage.
class BatchedGcsCsvShardFileWriter(object):
    """Writes CSV data into multiple output shards, grouping rows by keys.

    This class is a context manager, which closes all shards upon exit.

    Say you are writing a lot of CSV data, like:

        [0, "Bakery"],
        [2, "Francisco"],
        [3, "Matt"],
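A minimal in-memory sketch of the key-grouped sharding idea, with local `StringIO` buffers standing in for GCS shard files (class and method names here are hypothetical, not from the gist):

```python
import csv
import io


class InMemoryCsvShardWriter(object):
    """Routes CSV rows to shards so rows sharing a key stay together.

    The shard is chosen by hashing the key modulo the shard count, which
    gives a stable key -> shard assignment within a process.
    """

    def __init__(self, num_shards):
        self.buffers = [io.StringIO() for _ in range(num_shards)]
        self.writers = [
            csv.writer(b, lineterminator="\n") for b in self.buffers
        ]

    def write_row(self, key, row):
        """Write the row to its key's shard and return the shard index."""
        shard = hash(key) % len(self.buffers)
        self.writers[shard].writerow(row)
        return shard
```

Because the assignment is a pure function of the key, a later merge step can rely on all rows for a key living in exactly one shard.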
MattFaus / SortedGcsCsvShardFileMergeReader.py
Last active Aug 29, 2015
Merge-reads several sorted .csv files stored on Google Cloud Storage.
class SortedGcsCsvShardFileMergeReader(object):
    """Merges several sorted .csv files stored on GCS.

    This class is both an iterator and a context manager.

    Let's say there are 2 .csv files stored on GCS, with contents like:

        /bucket/file_1.csv:
            [0, "Matt"],
            [0, "Sam"],
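The core merge-reading idea can be sketched with the standard library's `heapq.merge`, which lazily interleaves already-sorted streams without loading them all into memory (local row lists stand in for the GCS shard files; the function name is hypothetical):

```python
import heapq


def merge_sorted_shards(shards, key_column=0):
    """Yield rows from several sorted row sequences in global sorted order.

    Each shard must already be sorted by the key column -- the same
    precondition the gist's merge reader relies on.
    """
    return heapq.merge(*shards, key=lambda row: row[key_column])
```

Iterating the result visits every row exactly once, in key order, pulling one row at a time from each shard, which is what makes merge-reading memory-cheap.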
MattFaus / ParallelInMemorySortGcsCsvShardFiles.py
Created Oct 29, 2014
A Pipeline job which launches a map-only job to sort .csv files in memory. Each .csv file is read from Google Cloud Storage into memory, sorted by the specified key, and then written back out to Google Cloud Storage. The machine running the sorting process must have roughly 10x as much memory as the size of each .csv file.
class ParallelInMemorySortGcsCsvShardFiles(pipeline.Pipeline):

    def run(self, input_bucket, input_pattern, sort_columns,
            model_type, output_bucket, output_pattern):
        """Sorts each input file in-memory, then writes it to an output file.

        Arguments:
            input_bucket - The GCS bucket which contains the unsorted .csv
                files.
            input_pattern - A regular expression used to find files in the
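The per-file sort step reduces to "read everything, sort by the key columns, write it back". A self-contained sketch with CSV text standing in for the GCS objects (the helper name is hypothetical):

```python
import csv
import io


def sort_csv_in_memory(csv_text, sort_columns):
    """Load a whole CSV into memory, sort by the given column indexes,
    and return the sorted CSV text.

    A local stand-in for one mapper's read/sort/write round trip; the
    full-materialization is why the pipeline needs memory headroom well
    beyond each file's size.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: tuple(row[i] for i in sort_columns))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

Sorting each shard independently like this is what makes the later merge-read step possible.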
MattFaus / DeterministicCompressedFeatures.py
Created Oct 8, 2014
An improvement over the CompressedFeatures class introduced at http://derandomized.com/post/51709771229/compressed-features-for-machine-learning#disqus_thread: it removes the need to store the key->component mapping.
class DeterministicCompressedFeatures(CompressedFeatures):
    """Generates random components after seeding with the component_key.

    By using a known seed to generate the random components, we do not need to
    store or manage them. We can just recompute them whenever we need.
    """

    def __init__(self, num_features=RANDOM_FEATURE_LENGTH):
        # The original snippet passed DeterministicallyRandomFeatures to
        # super(); the class is named DeterministicCompressedFeatures.
        super(DeterministicCompressedFeatures, self).__init__(num_features)
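The key insight is that a pseudo-random component can be recomputed from its key alone by seeding the generator deterministically. A small sketch of that idea (the hashing and distribution choices here are assumptions, not necessarily the gist's):

```python
import hashlib
import random


def deterministic_component(component_key, num_features=4):
    """Recompute a pseudo-random feature vector from the key alone.

    Seeding a private Random instance with a digest of the key means the
    same key always yields the same components, so nothing needs to be
    stored or managed.
    """
    seed = int(hashlib.md5(component_key.encode("utf-8")).hexdigest(), 16)
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_features)]
```

Using a cryptographic digest rather than Python's built-in `hash()` keeps the seed stable across processes and runs, which is what makes the recomputation safe.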
MattFaus / 2014_05_31_transformed.Video.json
Created Jun 4, 2014
BigQuery's JSON representation of the schema of 2014_05_31_transformed.Video.
{
    u'fields': [{
        u'type': u'STRING',
        u'name': u'playlists',
        u'mode': u'REPEATED'
    }, {
        u'type': u'STRING',
        u'name': u'source_table',
        u'mode': u'NULLABLE'
    }, {
MattFaus / bq_connection.py
Last active Aug 29, 2015
Some helper functions to build a SELECT statement for defining a view.
def get_table_schema(dataset, table):
    """If the table exists, returns its schema. Otherwise, returns None."""
    table_service = BigQueryService.get_service().tables()
    try:
        get_result = table_service.get(
            projectId=BQ_PROJECT_ID,
            datasetId=dataset,
            tableId=table,
        ).execute()
        return get_result['schema']
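Given a schema dict shaped like the `{'fields': [{'name': ...}, ...]}` structure that `tables().get()` returns, building the view's SELECT statement is mostly string assembly. A hypothetical helper sketching that step (the bracketed table syntax matches BigQuery's legacy SQL of that era):

```python
def build_view_select(schema, dataset, table):
    """Build a flat SELECT over every field in a BigQuery schema dict.

    `schema` is assumed to look like {'fields': [{'name': ...}, ...]},
    the shape returned by the tables().get() call above.
    """
    columns = ", ".join(field["name"] for field in schema["fields"])
    return "SELECT %s FROM [%s.%s]" % (columns, dataset, table)
```

Deriving the column list from the live table schema keeps the view definition from drifting when the underlying table gains fields.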
MattFaus / advanced_mapreduce.py
Created May 1, 2014
Experimental code demonstrating arbitrary mappers and reducers in the mapreduce library
import collections
import jinja2
import logging
import os
import request_handler
import third_party.mapreduce
import third_party.mapreduce.input_readers
import third_party.mapreduce.output_writers
import third_party.mapreduce.lib.files
import third_party.mapreduce.operation
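The "arbitrary mappers and reducers" the gist demonstrates follow the standard map -> shuffle -> reduce flow. A tiny single-process model of that flow, for orientation only (the real library distributes each phase across task-queue workers):

```python
import collections


def simple_mapreduce(records, mapper, reducer):
    """Run a map -> shuffle -> reduce pass over an in-memory iterable.

    mapper(record) yields (key, value) pairs; the shuffle groups values
    by key; reducer(key, values) folds each group to a final value.
    """
    shuffled = collections.defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    return {key: reducer(key, values) for key, values in shuffled.items()}
```

The classic word-count example is one mapper emitting `(word, 1)` pairs and one reducer summing them.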
MattFaus / custom_bq_transformers.py
Created Mar 22, 2014
A custom property transformer to translate an ndb.JsonProperty into a repeated record with fields for each of the keys in the original JSON.
class TransformedVideoTranslationInfo(bq_property_transform.TransformedEntity):

    CUSTOM_SCHEMAS = {
        'translated_youtube_ids': {
            'name': 'translated_youtube_ids',
            'type': 'record',
            'mode': 'repeated',
            'fields': [
                {'name': 'language',
                 'type': 'string'},
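The transformation itself amounts to flattening a JSON dict into one record per key. A sketch of that step, assuming the repeated record carries `language` and `youtube_id` fields (the second field name is a guess; the schema fragment above is truncated after `language`):

```python
def json_property_to_repeated_record(translations):
    """Flatten a JSON dict like {"es": "yt_es_id"} into repeated-record
    rows: one {language, youtube_id} record per key.

    Sorting by language keeps the output deterministic across runs.
    """
    return [
        {"language": language, "youtube_id": youtube_id}
        for language, youtube_id in sorted(translations.items())
    ]
```

Each emitted dict matches one repetition of the record, which is the row shape BigQuery expects for a repeated record field.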