@danielecook
Last active November 17, 2019 07:50
Sense / infer / generate a big query schema string for import #bigquery
# Written for Python 2 (xrange, print statement).
# Takes the path to a tab-delimited file, optionally gzipped, as its only argument.
import mimetypes
import sys
from collections import OrderedDict

filename = sys.argv[1]


def file_type(filename):
    # guess_type returns a (type, encoding) tuple; encoding is "gzip" for .gz files
    return mimetypes.guess_type(filename)


filetype = file_type(filename)[1]

if filetype == "gzip":
    import gzip
    readfile = gzip.GzipFile(filename, 'r')
else:
    readfile = open(filename, 'r')

with readfile as f:
    header = next(f).strip().split("\t")
    # Sample the first 50000 data rows; next(f) raises StopIteration if the file is shorter
    lines = [dict(zip(header, next(f).strip().split("\t"))) for x in xrange(50000)]

# Every column starts as the most specific type (bool) and is demoted as values require
schema = OrderedDict(zip(header, [bool] * len(header)))


def boolify(s):
    if s == 'True' or s == "TRUE" or s == "T":
        return True
    if s == 'False' or s == "FALSE" or s == "F":
        return False
    raise ValueError("huh?")


def autoconvert(s):
    for fn in (boolify, int, float):
        try:
            return fn(s)
        except ValueError:
            pass
    return s


type_precedence = {str: 0, float: 1, int: 2, bool: 3}
type_map = {str: "STRING", float: "FLOAT", int: "INTEGER", bool: "BOOLEAN"}

# Sense header
for line in lines:
    for k, v in line.items():
        if v == "" or v == ".":
            # Empty and "." values are treated as missing and skipped
            pass
        else:
            sense_type = type(autoconvert(v))
            if schema[k] == sense_type or schema[k] == str:
                pass
            elif type_precedence[schema[k]] > type_precedence[sense_type]:
                schema[k] = sense_type

print ','.join([k.replace("/", "_") + ":" + type_map[v] for k, v in schema.items()])
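To illustrate the kind of schema string the script produces, here is a minimal sketch of the same type-narrowing rule applied to a few made-up rows. It assumes Python 3; the function name `infer_schema` and the sample data are hypothetical and not part of the gist.

```python
from collections import OrderedDict

TYPE_PRECEDENCE = {str: 0, float: 1, int: 2, bool: 3}
TYPE_MAP = {str: "STRING", float: "FLOAT", int: "INTEGER", bool: "BOOLEAN"}


def boolify(s):
    if s in ("True", "TRUE", "T"):
        return True
    if s in ("False", "FALSE", "F"):
        return False
    raise ValueError("not a boolean: %r" % s)


def autoconvert(s):
    # Try the most specific conversion first, fall back to the raw string
    for fn in (boolify, int, float):
        try:
            return fn(s)
        except ValueError:
            pass
    return s


def infer_schema(header, rows):
    # Hypothetical helper: start every column at bool and demote as needed
    schema = OrderedDict((name, bool) for name in header)
    for row in rows:
        for key, value in row.items():
            if value in ("", "."):
                continue  # treated as missing
            sensed = type(autoconvert(value))
            if TYPE_PRECEDENCE[schema[key]] > TYPE_PRECEDENCE[sensed]:
                schema[key] = sensed
    return ",".join(k.replace("/", "_") + ":" + TYPE_MAP[t] for k, t in schema.items())


# Made-up sample rows, as the script would build them from a TSV file
header = ["id", "name", "score", "flag"]
rows = [
    {"id": "1", "name": "foo", "score": "0.5", "flag": "T"},
    {"id": "2", "name": "bar", "score": "3", "flag": "F"},
]
result = infer_schema(header, rows)
print(result)  # id:INTEGER,name:STRING,score:FLOAT,flag:BOOLEAN
```

The `score` column stays FLOAT even though its second value parses as an integer, because a column is only ever demoted toward the less specific type.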
@rohandora

Hi, I tried your code and it gives me this error:

File "bigquery_schema.py", line 22, in
lines = [dict(zip(header,next(f).strip().split("\t"))) for x in xrange(50000)]
StopIteration

Thanks for your valuable code.

@hlecuanda

hlecuanda commented Jan 6, 2017

@rohandora: did you try to generate a schema for JSON data?
It's not obvious, but you get that error because the script expects data in TSV format, not JSON.
@danielecook kindly provides even for the case of a gzipped file to be read and processed. It's a good script, but lacking in documentation or even comments :( As you can see, the relevant line:

 lines = [dict(zip(header,next(f).strip().split("\t"))) for x in xrange(50000)]

processes a chunk of 50K lines from the file, stripping surrounding whitespace from each line (strip()) and splitting it on tab delimiters (split("\t")).
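Note that the quoted line calls next(f) exactly 50,000 times, so it raises StopIteration on any input with fewer than 50,000 data rows, TSV included. A minimal sketch of a more forgiving read, assuming Python 3 and itertools.islice; `read_sample` and the sample data are hypothetical names for illustration, not part of the gist:

```python
import io
from itertools import islice


def read_sample(f, max_rows=50000):
    """Return the header and up to max_rows data rows as dicts (hypothetical helper)."""
    header = next(f).strip().split("\t")
    # islice simply stops when the file runs out, unlike a fixed number of next(f) calls
    rows = [dict(zip(header, line.strip().split("\t"))) for line in islice(f, max_rows)]
    return header, rows


# Made-up two-row TSV, far shorter than 50000 lines
sample = io.StringIO("id\tname\tscore\n1\tfoo\t0.5\n2\tbar\t1.5\n")
header, rows = read_sample(sample)
print(header)     # ['id', 'name', 'score']
print(len(rows))  # 2
```

With this approach a short file yields however many rows it has, and a long file is still capped at max_rows.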

@danielecook
Author

@hlecuanda the lack of documentation is a point well taken. I'd like to throw together something a bit more formal if I have time. I'll add it to my to-do list.
