@evianzhow
Last active August 29, 2015 14:08
Github Data Cleaner
#!/usr/bin/env ruby
require 'json'
# Returns true when every line of +json+ parses as a JSON document.
def valid_json?(json)
  json.each_line { |line| JSON.parse(line) }
  true
rescue JSON::ParserError
  # Rescue only parse failures; rescuing Exception would also swallow
  # interrupts and out-of-memory errors.
  false
end
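# Returns a copy of +json+ with a newline inserted after every top-level
# JSON object. The input string is not modified, so calling this twice on
# the same data does not insert the newlines twice.
def split(json)
  out = ''.dup
  depth = 0          # number of currently open '{' outside of strings
  in_string = false  # true while inside a double-quoted JSON string
  escaped = false    # true when the previous character was a backslash
  json.each_char do |char|
    out << char
    if escaped
      # This character is consumed by the preceding backslash.
      escaped = false
    elsif char == '\\'
      escaped = true
    elsif char == '"'
      in_string = !in_string
    elsif !in_string && char == '{'
      depth += 1
    elsif !in_string && char == '}'
      depth -= 1 if depth > 0
      out << "\n" if depth.zero?
    end
  end
  out
end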
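# ARGV.count is an Integer, and 0 is truthy in Ruby, so test for emptiness.
if ARGV.empty?
  puts "Give me a path!"
  exit 1
end
ARGV.each do |filename|
  next unless File.file?(filename)
  file = File.read(filename)
  # Only process files without any newline, i.e. archives whose JSON
  # objects all sit on one line (the same check `wc -l` == 0 performed).
  next if file.include?("\n")
  fixed = split(file)
  if valid_json?(fixed)
    # Prefix the base name so paths that include a directory still work.
    target = File.join(File.dirname(filename), "fixed-" + File.basename(filename))
    File.open(target, 'w') { |f| f.write(fixed) }
    # puts "Valid: #{filename}"
  else
    # File.open("error-" + File.basename(filename), 'w') { |f| f.write(fixed) }
    puts "Invalid: #{filename}"
  end
end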
@evianzhow
Author

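Unfortunately, for some periods of time GitHub Archive squeezed all JSON objects onto one huge line. This script inserts a newline at every candidate position, that is, after each closing brace that ends a top-level object, and writes the fixed file if every resulting line parses as JSON. The problem is described in "Finding great software engineers with GitHub".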
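
To illustrate the idea on a small input, here is a minimal sketch of the same brace-depth scan. The helper name `each_top_level_object` and the sample events are made up for illustration and are not part of the gist; the sketch assumes the input consists only of JSON objects concatenated back to back.

```ruby
require 'json'

# Hypothetical helper (not part of the gist): yields each top-level JSON
# object found in a string of back-to-back objects.
def each_top_level_object(text)
  return enum_for(:each_top_level_object, text) unless block_given?

  depth = 0          # open '{' count outside of strings
  in_string = false  # inside a double-quoted string?
  escaped = false    # previous character was a backslash?
  start = 0          # index where the current top-level object began
  text.each_char.with_index do |char, index|
    if escaped
      escaped = false
    elsif char == '\\'
      escaped = true
    elsif char == '"'
      in_string = !in_string
    elsif !in_string && char == '{'
      start = index if depth.zero?
      depth += 1
    elsif !in_string && char == '}' && depth > 0
      depth -= 1
      yield text[start..index] if depth.zero?
    end
  end
end

# Made-up sample: two events on one line. The first one carries braces and
# an escaped quote inside a string, which must not affect the depth count.
sample = '{"type":"PushEvent","msg":"a } b \" {"}{"type":"ForkEvent","repo":{"id":1}}'
objects = each_top_level_object(sample).to_a
puts objects
```

Each yielded string can be passed to `JSON.parse` on its own. Tracking string and escape state is what keeps a `}` inside a message text from being mistaken for the end of an object.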
