@mattbillenstein
Last active December 6, 2018 18:53
#!/usr/bin/env python3
# Read a newline-delimited JSON file, sort the records by 'id',
# and write them back out with keys in sorted order.
import json
import time

start = time.time()

L = []
i = 0
with open('in.json') as f:
    for line in f:
        L.append(json.loads(line))
        i += 1
        if i % 100000 == 0:
            print(i)  # progress marker every 100k records
print('read', time.time() - start)

L.sort(key=lambda x: x['id'])
print('sort', time.time() - start)

i = 0
with open('out.json', 'w') as f:
    for d in L:
        f.write(json.dumps(d, sort_keys=True) + '\n')
        i += 1
        if i % 100000 == 0:
            print(i)
print('write', time.time() - start)
@mattbillenstein
Author

Total runtime about 90s, memory usage ~5GB (~16GB on Python 2!)
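Most of that ~5GB is the ~1.1M parsed dicts held in memory at once. A lower-memory variant (just a sketch, not the script above) would keep each record as its raw line and parse out only the sort key — with the caveat that output lines are then the input lines verbatim, without the `sort_keys=True` re-serialization the original does. It assumes `id` is a top-level key in every record:

```python
import json

def sort_ndjson(src, dst):
    """Sort a newline-delimited JSON file by each record's top-level 'id'.

    Lower-memory sketch: store (id, raw_line) pairs instead of full parsed
    dicts. Unlike the original script, output lines are copied verbatim,
    so keys are NOT re-serialized in sorted order.
    """
    with open(src) as f:
        pairs = [(json.loads(line)['id'], line) for line in f]
    pairs.sort(key=lambda p: p[0])
    with open(dst, 'w') as f:
        for _, line in pairs:
            f.write(line)
```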

$ wc -l in.json out.json
   1164999 in.json
   1164999 out.json
   2329998 total

$ ls -lh in.json out.json
-rw-rw-r-- 1 push push 4.2G Dec  6 17:08 in.json
-rw-rw-r-- 1 push push 4.2G Dec  6 17:10 out.json

$ jq '.id' in.json | head
1563347
1104667
1077234
1038933
1741123
1626877
1098945
1521056
237805
334571

$ jq '.id' out.json | head
17
18
19
20
21
22
23
24
25
26

$ ./foo.py
100000
200000
300000
400000
500000
600000
700000
800000
900000
1000000
1100000
read 55.55036544799805
sort 56.46403622627258
100000
200000
300000
400000
500000
600000
700000
800000
900000
1000000
1100000
write 92.79180216789246

@nvictor

nvictor commented Dec 6, 2018

Not my experience dealing with large newline-delimited JSON files. What's inside in.json? The bottleneck has always been the json module...
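One way to sanity-check whether `json.loads`/`json.dumps` dominate a run like this is a rough micro-benchmark on a synthetic record (a sketch, not a rigorous benchmark — the record shape here is made up and throughput will vary with the real data):

```python
import json
import time

def json_roundtrips_per_sec(n=100_000):
    """Rough gauge of stdlib json throughput: parse and re-serialize n
    copies of a synthetic record, return records per second."""
    raw = json.dumps({'id': 1234567, 'name': 'x' * 50, 'vals': list(range(20))})
    start = time.time()
    for _ in range(n):
        d = json.loads(raw)
        json.dumps(d, sort_keys=True)
    return n / (time.time() - start)
```

Multiplying the per-record cost by ~1.1M records gives a rough idea of how much of the 90s is parse/serialize versus sort and I/O.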

@mattbillenstein
Author

It's part of a db table dump -- just the largest line-delimited JSON I had lying around. I was curious what a Python script could do re https://genius.engineering/faster-and-simpler-with-the-command-line-deep-comparing-two-5gb-json-files-3x-faster-by-ditching-the-code/
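For the deep-compare task from that article, an in-memory Python sketch might look like the following. It assumes both files fit in memory and that every record has a top-level `id` (the function name and return shape are mine, not from the article):

```python
import json

def diff_ndjson(path_a, path_b):
    """Sketch: deep-compare two newline-delimited JSON files keyed by 'id'.

    Returns (changed, only_a, only_b): ids whose records differ, and ids
    present in only one file. Assumes both files fit in memory.
    """
    def load(path):
        with open(path) as f:
            return {rec['id']: rec for rec in map(json.loads, f)}
    a, b = load(path_a), load(path_b)
    changed = sorted(k for k in a.keys() & b.keys() if a[k] != b[k])
    return changed, sorted(a.keys() - b.keys()), sorted(b.keys() - a.keys())
```

Dict equality handles the "deep" part (nested dicts/lists compare by value), which is why no manual recursion is needed.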
