@zredlined
zredlined / bike-orders.json
Created December 15, 2020 16:54
bike-orders.json - open dataset that's useful for benchmarking and testing tabular NER. 100 records.
[
  {
    "CustomerID": 26159,
    "Title": null,
    "FirstName": "Virginia",
    "MiddleName": null,
    "LastName": "Raman",
    "Suffix": null,
    "AddressLine1": "3242 Coralie Drive",
    "AddressLine2": null,
zredlined / setup-tensorflow-gpu-ubuntu-18_04.sh
Created September 25, 2020 18:17
Shell script to set up NVIDIA GPU acceleration for TensorFlow on Ubuntu 18.04 with CUDA 10.1
# Shell script to set up GPU acceleration for TensorFlow on Ubuntu 18.04
# Tested on a default Ubuntu 18.04 VM image in Google Compute
# Install CUDA
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda
#!pip install gretel-synthetics --upgrade
from gretel_synthetics.batch import DataFrameBatch
from pathlib import Path
config_template = {
    "max_lines": 0,
    "max_line_len": 2048,
    "epochs": 7,
    "vocab_size": 20000,
zredlined / create_knn_dataset.py
Created July 22, 2020 16:51
SMOTE-like approach building training dataset for synthetics from K-nearest neighbors to minority class
#!pip install s3fs smart_open pandas scikit-learn
import pandas as pd
from smart_open import open
from sklearn.neighbors import NearestNeighbors
# Set params
NEAREST_NEIGHBOR_COUNT = 5
TRAINING_SET = 's3://gretel-public-website/datasets/creditcard_train.csv'
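The description and parameters above suggest keeping every minority-class record plus its nearest neighbors from the rest of the training set. The following is a minimal, dependency-free sketch of that idea, not the gist's code: the gist imports scikit-learn's `NearestNeighbors`, whereas this stand-in uses a brute-force Euclidean search, and the function name `knn_training_rows`, the `minority_label` argument, and the toy rows are all hypothetical.

```python
import math

NEAREST_NEIGHBOR_COUNT = 5  # same parameter name as the gist, reused for illustration


def knn_training_rows(features, labels, k=NEAREST_NEIGHBOR_COUNT, minority_label=1):
    """Return sorted indices of minority rows plus their k nearest neighbors.

    Brute-force Euclidean search standing in for sklearn's NearestNeighbors.
    """
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    selected = set(minority)
    for i in minority:
        # Distance from minority row i to every other row, closest first
        dists = sorted(
            (math.dist(features[i], features[j]), j)
            for j in range(len(features))
            if j != i
        )
        selected.update(j for _, j in dists[:k])
    return sorted(selected)


# Toy data (hypothetical): one minority row at the origin, majority rows spread out
features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [6.0, 5.0], [9.0, 9.0]]
labels = [1, 0, 0, 0, 0, 0]
rows = knn_training_rows(features, labels, k=2)
```

With `k=2` the toy call keeps the minority row and the two majority rows closest to it. On real data the features would usually be scaled first so that no single column dominates the distance.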
zredlined / customer_validator-uci.py
Created May 18, 2020 16:02
Custom record validator function for training a model on UC Irvine's Heart Disease Dataset
# Validate each generated record
# Note: This custom validator verifies the record structure matches
# the expected format for UCI healthcare data, and also that
# generated records are Female (i.e. column 1 is 0)
def validate_record(line):
    rec = line.strip().split(",")
    if int(rec[1]) != 0:
        raise Exception("record generated must be female")
    if len(rec) == 14:
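The preview above is cut off after the field-count check, so the rest of the validator is not shown. One possible complete validator in the same spirit is sketched below, under stated assumptions: the count of 14 fields and the female check come from the snippet, while the order of the checks, the numeric check on every column, the `ValueError` type, the helper names `validate_record_sketch` and `is_valid`, and the sample lines are illustrative and not taken from the gist.

```python
EXPECTED_FIELDS = 14  # field count taken from the snippet above


def validate_record_sketch(line):
    """Raise ValueError if the line is not a well-formed female record."""
    rec = line.strip().split(",")
    # Check the field count first so that rec[1] is known to exist
    if len(rec) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(rec)}")
    if int(rec[1]) != 0:
        raise ValueError("record generated must be female")
    # Assumption: every column is numeric; float() raises ValueError otherwise
    for value in rec:
        float(value)


def is_valid(line):
    """Convenience wrapper that turns the exception into a boolean."""
    try:
        validate_record_sketch(line)
    except ValueError:
        return False
    return True


# Illustrative lines in a 14-column layout
good = "57,0,0,120,354,0,1,163,1,0.6,2,0,2,1"
male = "63,1,3,145,233,1,0,150,0,2.3,0,0,1,1"
short = "57,0,0,120"
```

Checking the length before indexing means a short line fails with a clear message instead of an `IndexError`.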
zredlined / gretel_synthetics_config-uci.py
Last active May 18, 2020 16:11
Optimal settings for training a synthetic data generation model on the UCI heart disease dataset from Kaggle
from pathlib import Path
from gretel_synthetics.config import LocalConfig
# Create a config that we can use for both training and generating, with CPU-friendly settings
# The default values for ``max_chars`` and ``epochs`` are better suited for GPUs
config = LocalConfig(
    max_lines=0,       # read all lines (zero)
    epochs=15,         # 15-30 epochs for production
    vocab_size=20000,  # tokenizer model vocabulary size
zredlined / ehr_config.py
Created May 4, 2020 15:19
Gretel synthetic data configuration optimized for EHR datasets
from gretel_synthetics.config import LocalConfig
# EHR configuration, optimal settings
# Note: this config is optimized for calculation on a GPU
config = LocalConfig(
    max_lines=0,             # read all lines (zero)
    epochs=30,               # 30 epochs for production
    vocab_size=25000,        # vocabulary size
    character_coverage=1.0,  # fraction of characters covered by the tokenizer model (1.0 = all)
    gen_chars=0,             # the maximum number of characters per generated line of text
zredlined / example_ontonotes5_spacy_format.json
Created March 13, 2020 16:50
example ontonotes5 converted to spacy training format
{
  "id": "fake",
  "paragraphs": [
    {
      "raw": "Israel has blockaded all West Bank cities after 10 people died in one of the worst days of Israeli-Palestinian violence in more than 10 weeks. Israeli tank-fire killed five Palestinians including four policemen in the West Bank town of Jenine. Israeli forces killed one Palestinian near Bethlehem and another in Arab East Jerusalem. Palestinian gunmen in the West Bank killed two Jewish settlers in a roadside ambush near Hebron and a third Israeli in an attack against a bus outside of Jericho.",
      "sentences": [
        {
          "tokens": [
            {
              "dep": "",
# training settings
max_chars: 0 # use a non-zero number to limit training data
epochs: 30 # number of training epochs (typically 15-30)
# RNN settings
batch_size: 64 # training batches
buffer_size: 10000 # maximum buffer size
seq_length: 100 # max length sentence for a single input in characters
embedding_dim: 256 # the embedding dimension
rnn_units: 256 #1024 # number of RNN units
logging.info("Utilizing differential privacy in optimizer")
RMSPropOptimizer = tf.compat.v1.train.RMSPropOptimizer
DPRmsPropGaussianOptimizer = make_dp_gaussian_optimizer(RMSPropOptimizer)
optimizer = DPRmsPropGaussianOptimizer(
    l2_norm_clip=store.l2_norm_clip,
    noise_multiplier=store.noise_multiplier,
    num_microbatches=store.microbatches,
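The three arguments above correspond to the usual knobs of differentially private gradient descent. Gradients are computed per microbatch, each one is clipped to an L2 norm of at most `l2_norm_clip`, the clipped gradients are summed, Gaussian noise with standard deviation `noise_multiplier * l2_norm_clip` is added, and the result is averaged over the microbatches. A rough pure-Python sketch of that single step is shown below. It is not the TensorFlow Privacy implementation, and the function names `clip_by_l2_norm` and `dp_average_gradient`, the `rng` argument, and the toy gradients are hypothetical.

```python
import math
import random


def clip_by_l2_norm(grad, l2_norm_clip):
    """Scale grad down so its L2 norm is at most l2_norm_clip."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(microbatch_grads, l2_norm_clip, noise_multiplier, rng=random):
    """Clip each microbatch gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_by_l2_norm(g, l2_norm_clip) for g in microbatch_grads]
    # Sum the clipped gradients coordinate by coordinate
    summed = [sum(col) for col in zip(*clipped)]
    # Noise scale is tied to the clipping bound
    stddev = noise_multiplier * l2_norm_clip
    noisy = [s + rng.gauss(0.0, stddev) for s in summed]
    return [n / len(microbatch_grads) for n in noisy]


# Toy gradients (hypothetical): the first has norm 5.0, the second has norm 0.5
grads = [[3.0, 4.0], [0.3, 0.4]]
no_noise = dp_average_gradient(grads, l2_norm_clip=1.0, noise_multiplier=0.0)
```

Setting `noise_multiplier=0.0` isolates the clipping behavior: the first toy gradient is scaled down to norm 1.0 while the second passes through unchanged. A larger `noise_multiplier` gives stronger privacy at the cost of noisier updates.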