mr1azl / 0-self-publishing.md
Created August 31, 2022 08:43 — forked from caseywatts/0-self-publishing.md
Self-Publishing via Markdown
mr1azl / load_parquet_s3.py
Created December 12, 2018 11:45 — forked from asmaier/load_parquet_s3.py
PySpark script for downloading a single Parquet file from Amazon S3 via the s3a protocol. It also reads the credentials from ~/.aws/credentials, so we don't need to hardcode them. See also https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html .
import os
import configparser

#
# Some constants
#
aws_profile = "your_profile"
aws_region = "your_region"
s3_bucket = "your_bucket"

#
# Reading environment variables from aws credential file
#
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
access_id = config.get(aws_profile, "aws_access_key_id")
access_key = config.get(aws_profile, "aws_secret_access_key")
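The rest of the gist hands these keys to Hadoop's s3a filesystem and reads the file. A minimal sketch of that wiring, assuming Spark 2.x with the hadoop-aws package on the classpath (the bucket path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load_parquet_s3").getOrCreate()
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)
hadoop_conf.set("fs.s3a.endpoint", "s3." + aws_region + ".amazonaws.com")

df = spark.read.parquet("s3a://" + s3_bucket + "/path/to/file.parquet")
df.show()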
mr1azl / .block
Created December 6, 2016 15:56 — forked from Bl3f/.block
Simple D3.js heatmap
license: WTFPL
mr1azl / faster_toPandas.py
Created April 13, 2016 22:13 — forked from joshlk/faster_toPandas.py
PySpark faster toPandas using mapPartitions
import pandas as pd

def _map_to_pandas(rdds):
    """ Needs to be here due to pickling issues """
    return [pd.DataFrame(list(rdds))]

def toPandas(df, n_partitions=None):
    """
    Returns the contents of `df` as a local `pandas.DataFrame` in a speedy fashion.
    The DataFrame is repartitioned if `n_partitions` is passed.
    """
    if n_partitions is not None: df = df.repartition(n_partitions)
    df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()
    df_pand = pd.concat(df_pand)
    df_pand.columns = df.columns
    return df_pand
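A hedged usage sketch (assumes an existing SparkSession and a PySpark DataFrame `sdf`; the partition count is arbitrary):

# Builds one pandas frame per partition in parallel instead of serializing
# everything row by row through the driver, then stitches them together locally.
pdf = toPandas(sdf, n_partitions=16)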
mr1azl / tweet_dumper.py
Created January 18, 2016 14:01 — forked from yanofsky/LICENSE
A script to download all of a user's tweets into a CSV file
#!/usr/bin/env python
# encoding: utf-8
import tweepy #https://github.com/tweepy/tweepy
import csv
#Twitter API credentials
consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""
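The core of the script pages backwards through the user's timeline 200 tweets at a time. A minimal sketch of that loop against the classic tweepy 3.x / Twitter API v1.1 interface (the screen name and output path are placeholders; the modern X API no longer works this way):

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

all_tweets = []
new_tweets = api.user_timeline(screen_name="example_user", count=200)
while new_tweets:
    all_tweets.extend(new_tweets)
    oldest = all_tweets[-1].id - 1  # page backwards from the oldest tweet seen
    new_tweets = api.user_timeline(screen_name="example_user", count=200, max_id=oldest)

with open("tweets.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "created_at", "text"])
    writer.writerows([[t.id_str, t.created_at, t.text] for t in all_tweets])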
mr1azl / comparison.py
Created December 29, 2015 09:59 — forked from patrickfuller/comparison.py
Compares tornado.auth.GoogleMixin with tornado.auth.GoogleOAuth2Mixin. The latter is required after Google's OAuth updates.
"""
A webserver to test Google OAuth in a couple of scenarios.
"""
import argparse
import time
import tornado.ioloop
import tornado.web
import tornado.auth
import tornado.gen
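For reference, the post-update flow with GoogleOAuth2Mixin follows the two-leg pattern from the Tornado docs; a minimal sketch (REDIRECT_URI is a placeholder, and self.settings['google_oauth'] must hold the client key and secret):

class GoogleOAuth2LoginHandler(tornado.web.RequestHandler,
                               tornado.auth.GoogleOAuth2Mixin):
    @tornado.gen.coroutine
    def get(self):
        if self.get_argument('code', False):
            # Second leg: exchange the authorization code for tokens.
            user = yield self.get_authenticated_user(
                redirect_uri=REDIRECT_URI,
                code=self.get_argument('code'))
            self.write(user)
        else:
            # First leg: redirect the user to Google's consent screen.
            yield self.authorize_redirect(
                redirect_uri=REDIRECT_URI,
                client_id=self.settings['google_oauth']['key'],
                scope=['profile', 'email'],
                response_type='code',
                extra_params={'approval_prompt': 'auto'})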
mr1azl / mysql_to_big_query.sh
Created December 24, 2015 15:19 — forked from shantanuo/mysql_to_big_query.sh
Copy a MySQL table to BigQuery. If you need to copy all tables, use the loop given at the end. Exits with error code 3 if BLOB or TEXT columns are found. The CSV files are first copied to Google Cloud Storage before being imported into BigQuery.
#!/bin/sh
TABLE_SCHEMA=$1
TABLE_NAME=$2
mytime=`date '+%y%m%d%H%M'`
hostname=`hostname | tr 'A-Z' 'a-z'`
file_prefix="trimax$TABLE_NAME$mytime$TABLE_SCHEMA"
bucket_name=$file_prefix
splitat="4000000000"
bulkfiles=200
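For comparison, the same three-step pipeline (dump to CSV, stage in Google Cloud Storage, load into BigQuery) sketched in Python; the bucket, dataset, and table names are placeholders, and the mysql/gsutil/bq CLIs must be installed and authenticated:

import subprocess

csv_path = "/tmp/mytable.csv"
bucket = "gs://my-staging-bucket"

# 1. Dump the table to CSV (the mysql client emits tab-separated rows; the sed
#    swap to commas is naive and breaks on embedded tabs or commas).
subprocess.run("mysql -B -e 'SELECT * FROM mydb.mytable' | sed 's/\\t/,/g' > " + csv_path,
               shell=True, check=True)
# 2. Stage the file in Google Cloud Storage.
subprocess.run(["gsutil", "cp", csv_path, bucket], check=True)
# 3. Load it into BigQuery, autodetecting the schema and skipping the header row.
subprocess.run(["bq", "load", "--autodetect", "--skip_leading_rows=1",
                "--source_format=CSV", "mydataset.mytable",
                bucket + "/mytable.csv"], check=True)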
mr1azl / MaximumValueProgam.scala
Created December 3, 2015 09:41 — forked from kbastani/MaximumValueProgam.scala
PregelProgram Abstraction for Spark GraphX
package org.mazerunner.core.programs
import org.apache.spark.graphx.{Graph, EdgeTriplet, VertexId}
import org.mazerunner.core.abstractions.PregelProgram
/**
 * @author kbastani
 * The [[MaximumValueProgram]] is an example graph algorithm implemented on the [[PregelProgram]]
 * abstraction.
 */
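The program propagates the largest vertex value across the graph in Pregel supersteps. A tiny framework-free Python sketch of the same idea (toy adjacency list and values, purely illustrative of the superstep loop, not the gist's GraphX code):

graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}  # undirected toy graph
value = {0: 3, 1: 6, 2: 2, 3: 1}                # initial vertex values

changed = True
while changed:  # one iteration of this loop = one Pregel superstep
    changed = False
    # Each vertex receives the max value among its neighbors as its "message".
    messages = {v: max(value[u] for u in nbrs) for v, nbrs in graph.items()}
    for v, msg in messages.items():
        if msg > value[v]:
            value[v] = msg
            changed = True

print(value)  # every vertex converges to the global maximum, 6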
mr1azl / apply_df_by_multiprocessing.py
Created November 20, 2015 09:59 — forked from yong27/apply_df_by_multiprocessing.py
pandas DataFrame apply multiprocessing
import multiprocessing
import pandas as pd
import numpy as np

def _apply_df(args):
    df, func, kwargs = args
    return df.apply(func, **kwargs)

def apply_by_multiprocessing(df, func, **kwargs):
    workers = kwargs.pop('workers')
    pool = multiprocessing.Pool(processes=workers)
    # Split the frame into one chunk per worker and apply `func` in parallel.
    result = pool.map(_apply_df, [(d, func, kwargs) for d in np.array_split(df, workers)])
    pool.close()
    return pd.concat(list(result))
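A hedged usage sketch (the applied function must be defined at module level so it can be pickled across processes):

def square_x(row):
    return row['x'] ** 2

df = pd.DataFrame({'x': np.arange(10)})
result = apply_by_multiprocessing(df, square_x, axis=1, workers=4)
print(result)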