@chapmanjacobd
Last active October 15, 2022 01:13
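# https://old.reddit.com/r/pushshift/comments/ajmcc0/information_and_code_examples_on_how_to_use_the/
import json

import zstandard as zstd

with open("filename.zst", 'rb') as fh:
    # Allow frames whose window is as large as 2 GiB (2**31 bytes)
    dctx = zstd.ZstdDecompressor(max_window_size=2147483648)
    with dctx.stream_reader(fh) as reader:
        previous_line = b""
        while True:
            chunk = reader.read(2**24)  # 16 MiB chunks
            if not chunk:
                break
            # Split on bytes: a multi-byte UTF-8 character cut by a chunk
            # boundary would make chunk.decode('utf-8') raise
            lines = (previous_line + chunk).split(b"\n")
            for line in lines[:-1]:
                obj = json.loads(line)  # json.loads accepts UTF-8 bytes
                # do something with the object here
            previous_line = lines[-1]  # incomplete tail, completed by the next chunk
        if previous_line.strip():
            # last record when the file does not end with a newline
            obj = json.loads(previous_line)
            # do something with the object here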
@chapmanjacobd

I think I'll use unzstd and read from stdin instead.
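
A minimal sketch of what that stdin variant might look like, assuming the archive is decompressed by the zstd command-line tool and piped into Python, for example `unzstd --long=31 --stdout filename.zst | python read_ndjson.py`. The script name `read_ndjson.py` and the helpers `iter_json_lines` and `main` are hypothetical names chosen for illustration, not part of the original snippet.

```python
import json
import sys


def iter_json_lines(stream):
    """Yield one parsed JSON object per non-empty line of a binary stream."""
    # Iterating a binary file object yields complete lines, so there is no
    # manual chunk stitching and no risk of splitting a UTF-8 character.
    for raw_line in stream:
        raw_line = raw_line.strip()
        if raw_line:
            yield json.loads(raw_line)  # json.loads accepts UTF-8 bytes


def main():
    # sys.stdin.buffer is the underlying byte stream of standard input
    for obj in iter_json_lines(sys.stdin.buffer):
        pass  # do something with the object here
```

Calling `main()` at the bottom of the script runs it over whatever is piped in. Decompression then happens in a separate process, and the Python side no longer needs the `zstandard` package.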
