@haridas
Created November 23, 2018 05:17
Ensure that a particular field in a big JSON-lines file doesn't include nulls. Helpful as part of a data-cleanup process.
import json

import pandas as pd


def read_json_lines(fname, field_name):
    """Scan a JSON-lines file and collect the token count of `field_name`
    for each record; records where the field is missing, null, or not a
    string are returned separately for later cleanup."""
    num = 0
    doc_size = []
    error_docs = []
    with open(fname) as f:
        while True:
            line = f.readline()
            if not line:
                break
            print(num)
            num += 1
            d = json.loads(line)
            try:
                doc_size.append(len(d[field_name].split()))
            except Exception:
                # Field is missing, null, or not a string.
                error_docs.append(d)
    return doc_size, error_docs


doc_size, error_docs = read_json_lines("./file.json", "data")

# Quick analysis of doc sizes, useful when pd.read_json / pd.read_csv fails
# to read the original file.
# Fix the error_docs by removing or updating them.
doc_df = pd.DataFrame(doc_size, columns=["doc_size"])
doc_df.describe()
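Once the problem records are identified, a cleanup pass can drop (or patch) them and write a new JSON-lines file that pandas can read directly. The sketch below is an illustrative assumption, not part of the original gist: the clean_json_lines helper and the output path are made up, and it simply drops records whose field is missing or not a non-empty string.

import json


def clean_json_lines(src, dst, field_name):
    """Copy src to dst, keeping only records whose `field_name` is a
    non-empty string. (Hypothetical helper, not part of the original gist.)"""
    kept, dropped = 0, 0
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            d = json.loads(line)
            if isinstance(d.get(field_name), str) and d[field_name].strip():
                fout.write(json.dumps(d) + "\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped


kept, dropped = clean_json_lines("./file.json", "./file_clean.json", "data")
print(kept, dropped)

# The cleaned file should then load without errors, e.g.:
# df = pd.read_json("./file_clean.json", lines=True)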