@mattbillenstein mattbillenstein/foo.py
Last active Dec 6, 2018

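#!/usr/bin/env python3
import json
import time

start = time.time()

# Read: parse every line of the line-delimited JSON file into memory.
L = []
i = 0
with open('in.json') as f:
    for line in f:
        L.append(json.loads(line))
        i += 1
        if i % 100000 == 0:
            print(i)  # progress marker every 100k records

# All timings below are cumulative seconds since start.
print('read', time.time() - start)

# Sort: order the records by their 'id' field.
L.sort(key=lambda x: x['id'])

print('sort', time.time() - start)

# Write: serialize each record with sorted keys, one per line.
i = 0
with open('out.json', 'w') as f:
    for d in L:
        f.write(json.dumps(d, sort_keys=True) + '\n')
        i += 1
        if i % 100000 == 0:
            print(i)

print('write', time.time() - start)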
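
A possible lower-memory variant, offered only as a sketch: instead of keeping every parsed dict alive, it keeps the raw line plus its `id` and re-parses each line when writing. The function name `sort_ndjson` is hypothetical, it trades a second parse for memory, and whether it actually helps on the original `in.json` has not been measured here.

```python
import io
import json


def sort_ndjson(src, dst):
    """Sort line-delimited JSON from src by 'id' and write it to dst.

    Hypothetical helper: keeps (id, raw_line) pairs instead of parsed dicts.
    """
    keyed = []
    for line in src:
        # Parse once just to extract the sort key, then drop the dict.
        keyed.append((json.loads(line)['id'], line))

    keyed.sort(key=lambda pair: pair[0])

    for _, line in keyed:
        # Parse again so the output can be re-serialized with sorted keys.
        dst.write(json.dumps(json.loads(line), sort_keys=True) + '\n')


# Small in-memory example standing in for in.json / out.json.
src = io.StringIO('{"id": 3, "b": 1, "a": 2}\n{"id": 1, "z": 0}\n')
dst = io.StringIO()
sort_ndjson(src, dst)
print(dst.getvalue(), end='')
```

With real files, `src` and `dst` would be the handles returned by `open('in.json')` and `open('out.json', 'w')`.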
@mattbillenstein

Owner Author

commented Dec 6, 2018

Total runtime is about 90s and memory usage is ~5GB on python3 (~16GB on python2!)
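
One way to get a rough memory figure for the load step, as a sketch only: the standard-library `tracemalloc` module reports the peak of Python-level allocations. It will not match the process-level figure quoted above, and the helper name `peak_bytes_while_loading` and the synthetic sample rows are assumptions made for illustration.

```python
import json
import tracemalloc


def peak_bytes_while_loading(lines):
    """Parse every line into a list; return (record_count, peak_traced_bytes)."""
    tracemalloc.start()
    records = [json.loads(line) for line in lines]
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return len(records), peak


# Synthetic stand-in for the real file: 10,000 small records.
sample = [json.dumps({'id': i, 'name': 'row %d' % i}) for i in range(10000)]
count, peak = peak_bytes_while_loading(sample)
print(count, 'records, peak traced memory:', peak, 'bytes')
```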

$ wc -l in.json out.json
   1164999 in.json
   1164999 out.json
   2329998 total

$ ls -lh in.json out.json
-rw-rw-r-- 1 push push 4.2G Dec  6 17:08 in.json
-rw-rw-r-- 1 push push 4.2G Dec  6 17:10 out.json

$ jq '.id' in.json | head
1563347
1104667
1077234
1038933
1741123
1626877
1098945
1521056
237805
334571

$ jq '.id' out.json | head
17
18
19
20
21
22
23
24
25
26

$ ./foo.py
100000
200000
300000
400000
500000
600000
700000
800000
900000
1000000
1100000
read 55.55036544799805
sort 56.46403622627258
100000
200000
300000
400000
500000
600000
700000
800000
900000
1000000
1100000
write 92.79180216789246
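
Since the script prints cumulative elapsed time, the per-phase cost has to be derived by subtraction. A small sketch of that arithmetic, using the three timestamps from the run above (the helper name `phase_durations` is hypothetical):

```python
# Cumulative seconds since start, copied from the ./foo.py output above.
cumulative = {
    'read': 55.55036544799805,
    'sort': 56.46403622627258,
    'write': 92.79180216789246,
}


def phase_durations(stamps):
    """Turn cumulative timestamps (in phase order) into per-phase durations."""
    durations = {}
    previous = 0.0
    for name, stamp in stamps.items():
        durations[name] = stamp - previous
        previous = stamp
    return durations


durations = phase_durations(cumulative)
total = cumulative['write']
for name, seconds in durations.items():
    print('%-5s %6.2fs  %4.1f%% of total' % (name, seconds, 100 * seconds / total))
```

Worked out, that is roughly 55.6s reading and parsing, under 1s sorting, and about 36.3s serializing and writing.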
@nvictor


commented Dec 6, 2018

That's not my experience dealing with large newline-delimited JSON files. What's inside in.json? The bottleneck has always been the json module...
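
To check whether the json module or the file reading dominates, the two can be timed separately. A minimal sketch, assuming a hypothetical helper `split_read_and_parse_time` and a tiny synthetic file in place of in.json; note that it holds every raw line in memory, so on a multi-gigabyte file it would need to run on a sample.

```python
import json
import os
import tempfile
import time


def split_read_and_parse_time(path):
    """Time raw line reading and json.loads separately for one file."""
    t0 = time.perf_counter()
    with open(path) as f:
        lines = f.readlines()  # I/O and text decoding only, no JSON parsing
    t1 = time.perf_counter()
    records = [json.loads(line) for line in lines]  # JSON parsing only
    t2 = time.perf_counter()
    return {'read': t1 - t0, 'parse': t2 - t1, 'records': len(records)}


# Synthetic stand-in for in.json: 1,000 small records.
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    for i in range(1000):
        tmp.write(json.dumps({'id': i, 'payload': 'x' * 100}) + '\n')
    sample_path = tmp.name

result = split_read_and_parse_time(sample_path)
os.remove(sample_path)
print(result)
```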

@mattbillenstein

Owner Author

commented Dec 6, 2018

It's part of a db table dump, just a slice of the largest line-delimited JSON file I had lying around. I was curious what a Python script could do in response to https://genius.engineering/faster-and-simpler-with-the-command-line-deep-comparing-two-5gb-json-files-3x-faster-by-ditching-the-code/
