Last active Nov 17, 2020
Import DBpedia 2020 into Neo4j v4 with Neosemantics


  1. Prerequisite: OpenJDK 11. If you run Ubuntu and have root access, you can install it with

    apt-get install default-jdk

    Otherwise, consider using Docker:

    A third option, not recommended, is to install Java in userspace; you will have to adjust your terminal configuration yourself. Here is a starting point under "Installing OpenJDK Manually":

  2. Get Neo4j v4.1.X Community Server, install the Neosemantics plugin, configure Neosemantics, and add the required index (a uniqueness constraint on the uri property of Resource nodes)

  3. Download the DBpedia files and uncompress them so they are ready to be imported

    ./ dbpedia_files.txt
  4. Load the data files

    Notice 1: DBpedia contains malformed IRIs. I've done my best to exclude those, but some can still get through. A better solution is needed.

    Notice 2: DBpedia has multi-valued properties with inconsistent types. At the moment, handleMultival: "OVERWRITE" could be an option.

  5. Check that the data was imported correctly:

    • Count nodes

      ${NEO4J_HOME}/bin/cypher-shell -u neo4j -p 'admin' "MATCH (r:Resource) RETURN COUNT(r)"
    • Count edges

      ${NEO4J_HOME}/bin/cypher-shell -u neo4j -p 'admin' "MATCH (r1:Resource)-[l]->(r2:Resource) RETURN COUNT(l)"
    • Distinct relationship types

      ${NEO4J_HOME}/bin/cypher-shell -u neo4j -p 'admin' "CALL db.relationshipTypes() YIELD relationshipType RETURN relationshipType"
    • Example node-edges

      ${NEO4J_HOME}/bin/cypher-shell -u neo4j -p 'admin' "MATCH (r1:Resource)-[l]->(r2:Resource) RETURN r1, l, r2 LIMIT 20"
#!/usr/bin/env bash
set -e
export DATA_DIR="${PWD}/data"
export NEO4J_HOME="${PWD}/neo4j-server"
export NEO4J_IMPORT="${NEO4J_HOME}/import"
mkdir -p -v "${DATA_DIR}"
mkdir -p -v "${NEO4J_IMPORT}"
# Expect exactly one argument: the text file listing the files to download
if [ "$#" -ne 1 ]; then
  echo "Illegal number of parameters."
  exit 1
fi
if [ -d "${DATA_DIR}" ]
then
  echo "Downloading files..."
  rm -v "${DATA_DIR}"/*.* || true
  while read -r line; do
    # Skip comment lines in the list
    [[ "$line" =~ ^#.*$ ]] && continue
    wget -P "${DATA_DIR}/" "$line"
    bzip2 -dk "${DATA_DIR}/${line##*/}"
    # Name of the decompressed file: the downloaded name without .bz2
    filename=$(basename -- "${line##*/}" .bz2)
    # Remove corrupted chars and lines
    iconv -f utf-8 -t ascii -c "${DATA_DIR}/${filename}" | grep -E '^<(https?|ftp|file)://[-A-Za-z0-9\+&@#/%?=~_|!:,.;]*[A-Za-z0-9\+&@#/%?=~_|]>\W<' | grep -v 'xn--b1aew' > "${DATA_DIR}/clean-${filename}"
    rm -v "${DATA_DIR}/${filename}"
    # Split into chunks of 5 million lines inside the Neo4j import folder
    split -l 5000000 --numeric-suffixes "${DATA_DIR}/clean-${filename}" "${NEO4J_IMPORT}/part-${filename}"
  done < "$1"
  chmod -R 777 "${NEO4J_IMPORT}"
else
  echo "No destination folder ${DATA_DIR}"
fi
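To see what the cleaning step keeps and drops, the two grep stages can be tried on a few hand-written lines. This is only a minimal sketch: the function name clean_triples and the sample triples are made up for illustration, the iconv stage is left out, and only the two grep expressions are copied from the script above.

```shell
#!/usr/bin/env bash
# Sketch: apply the two grep stages of the cleaning pipeline to stdin.
# clean_triples is a hypothetical helper name, not part of the gist.
clean_triples() {
  grep -E '^<(https?|ftp|file)://[-A-Za-z0-9\+&@#/%?=~_|!:,.;]*[A-Za-z0-9\+&@#/%?=~_|]>\W<' \
    | grep -v 'xn--b1aew'
}

# Made-up sample lines (not real DBpedia data)
good='<http://example.org/resource/A> <http://example.org/ontology/p> <http://example.org/resource/B> .'
bad_iri='<http://example.org/resource/A{x}> <http://example.org/ontology/p> <http://example.org/resource/B> .'
puny='<http://xn--b1aew.example.org/A> <http://example.org/ontology/p> <http://example.org/resource/B> .'

# Only the first line should survive: the second has a character outside
# the allowed set in its subject IRI, the third contains 'xn--b1aew'.
printf '%s\n' "$good" "$bad_iri" "$puny" | clean_triples
```

Note that the pattern is anchored at the start of the line, so it only inspects the subject IRI. A malformed IRI in the object position would still pass, which fits Notice 1 above.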
ulimit -n 65535
rm -rf neo4j-server
wget${NEO4J_VERSION}-unix.tar.gz -O neo4j.tar.gz
tar xf neo4j.tar.gz
mv neo4j-community-${NEO4J_VERSION} neo4j-server
rm neo4j.tar.gz
export NEO4J_HOME=${PWD}/neo4j-server
export NEO4J_DATA_DIR=${NEO4J_HOME}/data
rm -rf $NEO4J_DATA_DIR
# APOC_FILE=apoc-${APOC_VERSION}-core.jar
# there is a difference between `core` and `all`
# In theory we don't need this, since
# apoc- contains a subset of the functionality and will be bundled from Neo4j 4.1.1
#if [ ! -f ${NEO4J_HOME}/plugins/${APOC_FILE} ]
# echo "Downloading Neo4j APOC plugin..."
# wget -P ${NEO4J_HOME}/plugins/${APOC_VERSION}/${APOC_FILE}
# Do we need the following?
echo "Installing Neo4j APOC plugin..."
echo '*' >> ${NEO4J_HOME}/conf/neo4j.conf
echo 'apoc.export.file.enabled=true' >> ${NEO4J_HOME}/conf/neo4j.conf
echo 'apoc.import.file.use_neo4j_config=false' >> ${NEO4J_HOME}/conf/neo4j.conf
if [ ! -f "${NEO4J_HOME}/plugins/${NEOSEM_FILE}" ]
then
  echo "Downloading Neo4j RDF plugin..."
fi
echo "Installing Neo4j RDF plugin..."
echo 'dbms.unmanaged_extension_classes=n10s.endpoint=/rdf' >> ${NEO4J_HOME}/conf/neo4j.conf
${NEO4J_HOME}/bin/neo4j start
sleep 10
$NEO4J_HOME/bin/neo4j-admin set-initial-password admin
$NEO4J_HOME/bin/neo4j restart
sleep 10
echo "Creating index"
${NEO4J_HOME}/bin/cypher-shell -u neo4j -p 'admin' "CREATE CONSTRAINT n10s_unique_uri ON (r:Resource) ASSERT r.uri IS UNIQUE;"
${NEO4J_HOME}/bin/cypher-shell -u neo4j -p 'admin' 'call n10s.graphconfig.init( { handleMultival: "OVERWRITE", handleVocabUris: "SHORTEN", keepLangTag: false, handleRDFTypes: "NODES" })'
echo Neo4j log:
tail -n 12 $NEO4J_HOME/logs/neo4j.log
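The fixed sleep 10 pauses assume the server is ready within ten seconds. One possible alternative, sketched here and not part of the original script, is to poll until a command succeeds. wait_for is a hypothetical helper, and the cypher-shell probe in the usage comment is an assumption about how readiness could be checked.

```shell
#!/usr/bin/env bash
# wait_for MAX_TRIES CMD [ARGS...]
# Hypothetical helper: re-run CMD once per second until it succeeds,
# giving up after MAX_TRIES attempts.
wait_for() {
  local max_tries="$1"
  shift
  local attempt=1
  until "$@" >/dev/null 2>&1; do
    if [ "$attempt" -ge "$max_tries" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep 1
  done
  return 0
}

# Possible use in place of `sleep 10` (assumes the credentials used above):
# wait_for 60 "${NEO4J_HOME}/bin/cypher-shell" -u neo4j -p 'admin' "RETURN 1"
```

The helper returns a non-zero status when it gives up, so with set -e the script would stop instead of continuing against a server that never came up.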
export NEO4J_HOME=${PWD}
export NEO4J_IMPORT="${NEO4J_HOME}/neo4j-server/import"
export NEO4J_DB_DIR=$NEO4J_HOME/neo4j-server/data/databases/graph.db
ulimit -n 65535
echo "Importing"
for file in "${NEO4J_IMPORT}"/*.ttl*; do
  # Extracting filename
  echo "$file"
  filename="$(basename "${file}")"
  echo "Importing $filename from ${NEO4J_HOME}"
  "${NEO4J_HOME}/neo4j-server/bin/cypher-shell" -u neo4j -p 'admin' "CALL n10s.rdf.import.fetch(\"file://${NEO4J_IMPORT}/$filename\",\"Turtle\");"
done
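To preview the statements the loop sends without touching the database, the Cypher string can be built by a small function and printed. This is a sketch only: import_statement is a hypothetical helper name and the example path is made up. The n10s.rdf.import.fetch call and the "Turtle" argument are copied from the loop above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Cypher statement for one file.
# $1 = import directory (absolute path), $2 = file name inside it
import_statement() {
  local import_dir="$1"
  local filename="$2"
  printf 'CALL n10s.rdf.import.fetch("file://%s/%s","Turtle");\n' "$import_dir" "$filename"
}

# Dry run over an example path (made up for illustration)
import_statement "/tmp/neo4j-server/import" "part-labels.ttl00"
```

Because the import directory is an absolute path, "file://" followed by "/tmp/..." yields the usual three-slash form of a local file URL.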