@Mlawrence95
Mlawrence95 / read_csv_from_aws_s3_targz.python
Created July 27, 2020 22:54
Given a CSV file that's inside a tar.gz file on AWS S3, read it into a Pandas dataframe without downloading or extracting the entire tar file
# checked against python 3.7.3, pandas 0.24.2, s3fs 0.4.2
import tarfile
import io
import s3fs
import pandas as pd
tar_path = "s3://my-bucket/debug.tar.gz" # path in s3
metadata_path = "debug/metadata.csv" # path inside of the tar file
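The preview above stops before the actual read. A minimal sketch of how the remaining step might look, assuming the archive is opened as a forward-only gzip stream: `read_tar_member` is a hypothetical helper name, not taken from the gist. Note that a gzipped tar still has to be read sequentially up to the requested member; what is avoided is writing the archive to disk or unpacking the other members.

```python
import io
import tarfile


def read_tar_member(fileobj, member_path):
    """Return one member of a gzipped tar as an in-memory buffer.

    `fileobj` is any binary file-like object (hypothetical helper, an assumption).
    """
    # "r|gz" treats the archive as a forward-only stream, so nothing touches disk
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:
            if member.name == member_path:
                extracted = tar.extractfile(member)
                if extracted is None:
                    # extractfile returns None for directories and other non-files
                    raise ValueError(f"{member_path} is not a regular file")
                return io.BytesIO(extracted.read())
    raise FileNotFoundError(member_path)


# Usage sketch (needs s3fs, pandas, and AWS credentials; not run here):
#   fs = s3fs.S3FileSystem()
#   with fs.open(tar_path, "rb") as f:
#       df = pd.read_csv(read_tar_member(f, metadata_path))
```

Returning a `BytesIO` keeps the helper independent of pandas, so the same buffer can be handed to `pd.read_csv` or any other reader.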
@Mlawrence95
Mlawrence95 / md5_decorator.py
Last active May 20, 2020 22:31
A python decorator that adds a column to your pandas dataframe -- the MD5 hash of the specified column
import pandas as pd
from hashlib import md5
def text_to_hash(text):
    return md5(text.encode("utf8")).hexdigest()

def add_hash(column_name="document"):
    """
    Decorator. Wraps a function that returns a dataframe, must have column_name in columns.
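The preview is cut off inside the docstring. One way the decorator could be completed is sketched below; the name of the added column (`<column_name>_hash`) and the use of `Series.map` are assumptions, since the original body is not visible.

```python
from functools import wraps
from hashlib import md5

import pandas as pd


def text_to_hash(text):
    """Return the hex MD5 digest of a string."""
    return md5(text.encode("utf8")).hexdigest()


def add_hash(column_name="document"):
    """Decorator factory: hash `column_name` of the dataframe the wrapped function returns."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            df = func(*args, **kwargs)
            # Assumption: the new column is named "<column_name>_hash"
            df[f"{column_name}_hash"] = df[column_name].map(text_to_hash)
            return df
        return wrapper
    return decorator


@add_hash(column_name="document")
def load_documents():
    # Hypothetical loader used only to demonstrate the decorator
    return pd.DataFrame({"document": ["hello", "world"]})
```

Calling `load_documents()` then returns the original frame with one extra column holding the digest of each row's `document` value.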
@Mlawrence95
Mlawrence95 / mp3_to_plot.py
Created April 21, 2020 21:30
[python] convert .mp3 file into a .wav, then visualize the sound using a matplotlib plot
import matplotlib.pyplot as plt
import soundfile as sf
from pydub import AudioSegment
# we want to convert source, mp3, into dest, a .wav file
source = "./recordings/test.mp3"
dest = "./recordings/test.wav"
# conversion - check!
@Mlawrence95
Mlawrence95 / get_timestamp.py
Created March 31, 2020 22:18
Use Python's time library to get the date as a single string in m_d_y format (e.g. '3_31_2020'), GMT. Useful for adding timestamps to filenames
import time
def get_timestamp():
    """
    Return the date in m_d_y format, GMT
    >>> get_timestamp()
    '3_31_2020'
    """
    t = time.gmtime()
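The preview ends right after `time.gmtime()` is called. A minimal sketch of how the string in the docstring could be assembled from the resulting `struct_time` follows; the optional `t` argument is an addition for testing and is not part of the original signature.

```python
import time


def get_timestamp(t=None):
    """Return the GMT date as 'm_d_yyyy', e.g. '3_31_2020'."""
    if t is None:
        # Default to the current time in GMT, as in the original snippet
        t = time.gmtime()
    # tm_mon and tm_mday are plain ints, so no zero padding is applied
    return f"{t.tm_mon}_{t.tm_mday}_{t.tm_year}"


# Example: tag an output file with today's date
filename = f"results_{get_timestamp()}.csv"
```

Underscores rather than slashes keep the result safe to embed in a filename.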
@Mlawrence95
Mlawrence95 / open_files.py
Created March 26, 2020 18:03
Helpers to open file types common in Python data analysis: JSON and pickle. A great addition to your startup.ipy file in ~/.ipython/profile_default/startup/
import json
import pickle
def openJSON(path):
    """
    Safely opens json file at 'path'
    """
    with open(path, 'r') as File:
        data = json.load(File)
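The preview shows only part of the JSON helper, and the pickle helper mentioned in the description is not visible. A sketch of what the pair might look like is given below; `openPickle` is an assumed name chosen to match `openJSON`.

```python
import json
import pickle


def openJSON(path):
    """Load and return the contents of the JSON file at `path`."""
    # The context manager closes the file even if parsing fails
    with open(path, "r") as f:
        return json.load(f)


def openPickle(path):
    """Load and return the object pickled at `path` (assumed helper name)."""
    # Pickle files are binary, so the file must be opened with "rb".
    # Only unpickle files you trust: loading can execute arbitrary code.
    with open(path, "rb") as f:
        return pickle.load(f)
```

Dropped into an IPython startup file, both helpers are available in every session without an explicit import.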
@Mlawrence95
Mlawrence95 / pyplot_set_params.py
Created December 16, 2019 17:24
matplotlib allows you to set plot parameters via a param dict. Here's one such example
import matplotlib.pyplot as plt
params = {'legend.fontsize': 'x-large',
          'figure.figsize': (15, 15),
          'axes.labelsize': 'x-large',
          'axes.titlesize': 'x-large',
          'xtick.labelsize': 'x-large',
          'ytick.labelsize': 'x-large'}
plt.rcParams.update(params)
@Mlawrence95
Mlawrence95 / make_old_pickles_openable.py
Created December 5, 2019 23:51
Old pickle files can be a pain to work with. This can make SliceTypes and ObjectType exceptions go away in certain circumstances.
import pickle
import dill
dill._dill._reverse_typemap['SliceType'] = slice
dill._dill._reverse_typemap['ObjectType'] = object
@Mlawrence95
Mlawrence95 / clone_private_repo.txt
Created December 5, 2019 23:45
Trying to access a private repo? Use this format to pull it down. (Yes, this puts your password directly on the command line, where it can end up in your shell history. Only do this in low-risk environments.)
git clone https://[insert username]:[insert password]@github.com/[insert organisation name]/[insert repo name].git
@Mlawrence95
Mlawrence95 / get_word_counts.py
Last active November 5, 2019 19:13
Takes a document (string) or iterable of documents and returns a Pandas dataframe containing the number of occurrences of each unique word. Note that this is not efficient enough to replace scikit-learn's CountVectorizer class as a bag-of-words transformer.
import numpy as np
import pandas as pd
def get_word_counts(document: str) -> pd.DataFrame:
    """
    Turns a document into a dataframe of word, counts
    Use preprocessing/lowercasing before this step for best results.
    If passing many documents, use document = '\n'.join(iterable_of_documents)
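The body of the function is not shown in the preview. As an illustration of the counting step only, here is a standard-library sketch that swaps in `collections.Counter` for whatever numpy/pandas approach the gist uses; `count_words` is a hypothetical name, and whitespace splitting is an assumption about tokenization.

```python
from collections import Counter


def count_words(document):
    """Count occurrences of each whitespace-separated word in `document`."""
    # str.split() with no arguments splits on any run of whitespace,
    # including the newlines used to join multiple documents
    return Counter(document.split())


# Several documents can be joined first, as the docstring above suggests
documents = ["the cat sat", "the dog sat"]
counts = count_words("\n".join(documents))
```

If a dataframe is needed, `pd.DataFrame(counts.most_common(), columns=["word", "count"])` would give one row per unique word, sorted by frequency.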
@Mlawrence95
Mlawrence95 / shallow_flatten_directory.py
Last active October 21, 2019 23:38
** DESTRUCTIVE CODE -- DON'T COPY AND PASTE WITHOUT READING** Unpacks folders at the specified location to one level. Can be applied recursively to flatten everything if desired.
import os
import shutil
def flatten_directory(directory, delete_after=False):
    """
    Flattens all folders in directory, deleting the empty folders after.
    **WARNING**
    This code WILL DELETE YOUR FILES
    if used naively. Seriously.
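The function body is cut off in the preview. The sketch below shows one cautious way a one-level flatten could work, keeping the visible signature. The collision handling (skip rather than overwrite) and the use of `os.rmdir`, which refuses to remove non-empty folders, are assumptions chosen to limit data loss and may differ from the original.

```python
import os
import shutil


def flatten_directory(directory, delete_after=False):
    """Move the contents of each immediate subfolder up into `directory`."""
    # os.listdir returns a snapshot, so items moved up are not revisited
    for name in os.listdir(directory):
        subfolder = os.path.join(directory, name)
        if not os.path.isdir(subfolder):
            continue
        for item in os.listdir(subfolder):
            target = os.path.join(directory, item)
            if os.path.exists(target):
                # Assumption: on a name clash, leave the item in place
                # instead of overwriting what is already there
                continue
            shutil.move(os.path.join(subfolder, item), target)
        # Only remove the subfolder if everything was moved out of it
        if delete_after and not os.listdir(subfolder):
            os.rmdir(subfolder)
```

Because nested folders are moved up as whole units, calling the function again flattens the next level, which matches the "can be applied recursively" note in the description. Try it on a copy of the data first.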