@rjurney
Last active May 2, 2022 18:08
Create a random sample of records from a JSON Lines file
import json
import math

import numpy as np

# Load every record from the JSON Lines file (one JSON object per line).
with open("data/repos.jsonl") as f:
    records = [json.loads(x) for x in f]

count = len(records)
sample_ratio = 0.01
sample_count = math.ceil(count * sample_ratio)

# np.random.choice samples with replacement by default, which can return the
# same record more than once; replace=False draws distinct indexes instead.
sample_indexes = np.random.choice(
    count,
    sample_count,
    replace=False
)

sample_records = []
for sample_index in sample_indexes:
    sample_record = records[sample_index]
    sample_records.append(sample_record)

assert len(sample_records) == sample_count

# Write the sample back out as JSON Lines.
with open("data/repos_sample_0.01.jsonl", "w") as f:
    for record in sample_records:
        f.write(json.dumps(record) + "\n")

print("Sampled {} records from {} original records.".format(
    sample_count,
    count
))
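
The script above loads the whole file into memory before sampling. For files too large for that, a one-pass alternative is reservoir sampling (Algorithm R). The following is only a minimal sketch of that different technique, not the gist's method: the helper name `reservoir_sample_jsonl` and its parameters `path`, `k`, and `seed` are hypothetical, and because the record count is not known up front it takes a fixed sample size `k` rather than a ratio.

```python
import json
import random


def reservoir_sample_jsonl(path, k, seed=None):
    """Return up to k distinct records drawn uniformly from a JSON Lines file.

    Hypothetical helper, not part of the original gist. It reads the file
    once and keeps at most k parsed records in memory (Algorithm R).
    """
    rng = random.Random(seed)
    reservoir = []
    seen = 0  # number of non-blank lines read so far
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            seen += 1
            if len(reservoir) < k:
                # Fill the reservoir with the first k records.
                reservoir.append(json.loads(line))
            else:
                # Keep the new record with probability k / seen by
                # overwriting a uniformly chosen slot.
                j = rng.randrange(seen)
                if j < k:
                    reservoir[j] = json.loads(line)
    return reservoir
```

Assuming the same input path as the script, `reservoir_sample_jsonl("data/repos.jsonl", 100, seed=42)` would return 100 records, or every record if the file holds fewer than 100. Passing a `seed` makes the draw repeatable.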