@ncarboni
Created May 26, 2020 10:16
Convert geonames RDF to usable format
#!/usr/bin/python
# Script modified from https://github.com/rhasan/sw/blob/master/genames/convert2ntriples.py
# This script takes the GeoNames RDF dump available at http://download.geonames.org/all-geonames-rdf.zip
# and converts its triples to N-Triples serialization.
# The dump holds one RDF document per toponym, each on a single line. Only every
# second line (odd zero-based index) is parsed; the other lines are feature lines.
# The resulting N-Triples are written to geonames.nt, which is approximately 13.21 GB.
import rdflib

totalStmt = 0
# Open the output in binary mode; both files are closed automatically on exit.
with open("geonames.nt", "wb") as fo, open("all-geonames-rdf.txt", encoding="utf-8") as fileobject:
    for count, line in enumerate(fileobject):
        # print("Line number: ", count + 1, ":", line)
        # Report progress every 10,000 lines.
        if count % 10000 == 0:
            print(count)
        if count % 2 != 0:
            # Odd-indexed lines hold the RDF/XML document for one toponym.
            g = rdflib.Graph()
            g.parse(data=line, format='xml')
            # print("graph has %s statements." % len(g))
            totalStmt += len(g)
            s = g.serialize(format='nt')
            # Older rdflib versions return bytes here, newer ones return str.
            if isinstance(s, str):
                s = s.encode("utf-8")
            fo.write(s)
        # else:
        #     print("Feature: ", line)
        # To try the script on a small sample, stop early:
        # if count + 1 == 3000:
        #     break
print("Total statements: ", totalStmt)
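The `count % 2` check in the script suggests that the dump alternates between a feature line and the RDF/XML document for that feature. A minimal sketch of that pairing, using only the standard library, might look like the following. The helper name `iter_rdf_documents` and the sample lines are hypothetical placeholders, not part of the original script or real GeoNames data.

```python
def iter_rdf_documents(lines):
    """Yield (feature_line, rdf_xml_line) pairs from an iterable of lines.

    Assumes the input alternates: a feature line first, then the RDF/XML
    document for that feature. A trailing unpaired line is ignored.
    """
    it = iter(lines)
    # zip(it, it) pulls two consecutive items from the same iterator per step.
    for feature, rdf_xml in zip(it, it):
        yield feature.strip(), rdf_xml.strip()


# Made-up placeholder lines standing in for the real dump contents.
sample = [
    "https://sws.geonames.org/1/\n",
    "<rdf:RDF>...first document...</rdf:RDF>\n",
    "https://sws.geonames.org/2/\n",
    "<rdf:RDF>...second document...</rdf:RDF>\n",
]

pairs = list(iter_rdf_documents(sample))
for feature, rdf_xml in pairs:
    print(feature, "->", len(rdf_xml), "characters of RDF/XML")
```

Because the helper accepts any iterable of lines, an open file object can be passed in directly, which keeps memory use flat even for a very large dump.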