Tested with Cloudera 5.12.0 Quickstart VM (https://www.cloudera.com/downloads/quickstart_vms/5-12.html)
Library | Version |
---|---|
JanusGraph | 0.3.0-SNAPSHOT |
TinkerPop | 3.3.0 |
Spark | 2.2.0 |
HBase | 1.2.0 |
Cassandra | 2.2.11 |
Java | 1.8.0_151 |
Maven | 3.5.2 |
Update from packages to parcels
sudo /home/cloudera/parcels
Update to Java 1.8
https://www.cloudera.com/documentation/enterprise/5-12-x/topics/cdh_cm_upgrading_to_jdk8.html
Update to Spark 2.2
https://www.cloudera.com/documentation/spark2/latest/topics/spark2_installing.html
Create an application JAR with the required dependencies. Build the shaded JAR and copy it to a directory accessible from all cluster nodes (/public in the example below).
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.janusgraph</groupId>
    <artifactId>janusgraph-spark</artifactId>
    <packaging>jar</packaging>
    <version>0.0.1-SNAPSHOT</version>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.2</version>
                <configuration>
                    <filters>
                        <filter>
                            <artifact>*:*</artifact>
                            <excludes>
                                <exclude>META-INF/*.SF</exclude>
                                <exclude>META-INF/*.DSA</exclude>
                                <exclude>META-INF/*.RSA</exclude>
                            </excludes>
                        </filter>
                    </filters>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency>
            <groupId>org.janusgraph</groupId>
            <artifactId>janusgraph-hadoop-2</artifactId>
            <version>0.3.0-SNAPSHOT</version>
        </dependency>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>18.0</version>
        </dependency>
        <!-- needed to resolve NoClassDefFoundError: org/apache/hadoop/hbase/mapreduce/TableInputFormat -->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>1.2.0</version>
            <exclusions>
                <exclusion>
                    <groupId>org.apache.hadoop</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>
</project>
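With this POM in place, the build-and-copy step looks roughly like the following (the /public path is just the example location used throughout; use any directory visible from all cluster nodes):
mvn clean package
sudo mkdir -p /public
sudo cp target/janusgraph-spark-0.0.1-SNAPSHOT.jar /public/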
Common Spark configuration
spark.master=yarn
spark.submit.deployMode=client
spark.executor.memory=1g
# include path to Spark jars and Hadoop native libs
spark.yarn.jars=/opt/cloudera/parcels/SPARK2/lib/spark2/jars/*
spark.yarn.am.extraJavaOptions=-Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
# use Java 1.8
spark.yarn.appMasterEnv.JAVA_HOME=/usr/local/jdk1.8.0_151/jre
spark.executorEnv.JAVA_HOME=/usr/local/jdk1.8.0_151/jre
# prepend shaded dependency jar to executor classpath
spark.executor.extraClassPath=/public/janusgraph-spark-0.0.1-SNAPSHOT.jar
spark.serializer=org.apache.spark.serializer.KryoSerializer
Gremlin classpath (set before launching ./bin/gremlin.sh)
export CLASSPATH=/public/janusgraph-spark-0.0.1-SNAPSHOT.jar:/etc/hadoop/conf:/opt/cloudera/parcels/SPARK2/lib/spark2/jars/*:/opt/cloudera/parcels/CDH/lib/hadoop-yarn/*
Note: if using HBase snapshots, include /etc/hbase/conf in both CLASSPATH and spark.executor.extraClassPath above.
Update conf/hadoop-graph/hadoop-load.properties to include the Spark configuration from above.
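For orientation, the resulting file might look roughly like the sketch below. The gremlin.hadoop.* keys are the ones shipped with the JanusGraph distribution and may differ slightly by version; the input location assumes the HDFS upload path used in the next step.
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.inputLocation=data/grateful-dead.kryo
gremlin.hadoop.outputLocation=output
gremlin.hadoop.jarsInDistributedCache=true
# plus the common Spark configuration listed above (spark.master, spark.yarn.jars, ...)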
Upload test data to HDFS
hadoop fs -mkdir data
hadoop fs -copyFromLocal data/grateful-dead.kryo data
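To confirm the upload:
hadoop fs -ls data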
(Gremlin shell) Load schema
:load data/grateful-dead-janusgraph-schema.groovy
graph = JanusGraphFactory.open('conf/janusgraph-hbase.properties')
defineGratefulDeadSchema(graph)
graph.close()
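The schema script shipped with JanusGraph defines the vertex/edge labels and property keys for the Grateful Dead data set through the management API. An abbreviated sketch of what it does (not the full distribution script, which declares additional keys and indices):
mgmt = graph.openManagement()
mgmt.makeVertexLabel('song').make()
mgmt.makeVertexLabel('artist').make()
mgmt.makeEdgeLabel('followedBy').make()
mgmt.makeEdgeLabel('sungBy').make()
mgmt.makeEdgeLabel('writtenBy').make()
mgmt.makePropertyKey('name').dataType(String.class).make()
blid = mgmt.makePropertyKey('bulkLoader.vertex.id').dataType(Long.class).make()
mgmt.buildIndex('byBulkLoaderVertexId', Vertex.class).addKey(blid).buildCompositeIndex()
mgmt.commit()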
(Gremlin shell) Execute vertex program
:plugin use tinkerpop.spark
graph = GraphFactory.open('conf/hadoop-graph/hadoop-load.properties')
blvp = BulkLoaderVertexProgram.build().writeGraph('conf/janusgraph-hbase.properties').create(graph)
graph.compute(SparkGraphComputer).program(blvp).submit().get()
graph.close()
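(Gremlin shell) Optionally verify the bulk load by opening the JanusGraph directly; for the stock grateful-dead data set the counts should come back as 808 vertices and 8049 edges.
graph = JanusGraphFactory.open('conf/janusgraph-hbase.properties')
g = graph.traversal()
g.V().count()
g.E().count()
graph.close()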
Update conf/hadoop-graph/read-hbase.properties to include the Spark configuration from above.
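As with the load configuration, the file might look roughly like the following sketch (keys as shipped with the JanusGraph distribution; the hostname and table name are assumptions and must match your conf/janusgraph-hbase.properties):
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hostname=localhost
janusgraphmr.ioformat.conf.storage.hbase.table=janusgraph
# plus the common Spark configuration listed above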
(Gremlin shell) Execute traversal
:plugin use tinkerpop.spark
graph = GraphFactory.open('conf/hadoop-graph/read-hbase.properties')
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()
graph.close()
What about the Hadoop version? Do I really need a running Hadoop cluster to use Hadoop Graphs?
I would like to replace HDFS with a NAS and run all calculations on Spark.
Best regards,
Mirko