Testing OLAP using JanusGraph with TinkerPop 3.3.0 and Spark 2.2 on YARN (Cloudera)

Tested with Cloudera 5.12.0 Quickstart VM (https://www.cloudera.com/downloads/quickstart_vms/5-12.html)

Library      Version
JanusGraph   0.3.0-SNAPSHOT
TinkerPop    3.3.0
Spark        2.2.0
HBase        1.2.0
Cassandra    2.2.11
Java         1.8.0_151
Maven        3.5.2

Update Cloudera to Spark 2.2

Migrate the CDH installation from packages to parcels using the script bundled with the Quickstart VM

sudo /home/cloudera/parcels

Update to Java 1.8

https://www.cloudera.com/documentation/enterprise/5-12-x/topics/cdh_cm_upgrading_to_jdk8.html

Update to Spark 2.2

https://www.cloudera.com/documentation/spark2/latest/topics/spark2_installing.html

Application JAR

Create an application JAR with the required dependencies. Build the shaded JAR and copy it to a directory accessible across all cluster nodes (/public in the example below).

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.janusgraph</groupId>
  <artifactId>janusgraph-spark</artifactId>
  <packaging>jar</packaging>
  <version>0.0.1-SNAPSHOT</version>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.2</version>
        <configuration>
            <filters>
                <filter>
                    <artifact>*:*</artifact>
                    <excludes>
                        <exclude>META-INF/*.SF</exclude>
                        <exclude>META-INF/*.DSA</exclude>
                        <exclude>META-INF/*.RSA</exclude>
                    </excludes>
                </filter>
            </filters>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency>
      <groupId>org.janusgraph</groupId>
      <artifactId>janusgraph-hadoop-2</artifactId>
      <version>0.3.0-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>18.0</version>
    </dependency>
    <!-- needed to resolve NoClassDefFoundError: org/apache/hadoop/hbase/mapreduce/TableInputFormat -->
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-server</artifactId>
      <version>1.2.0</version>
      <exclusions>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
  </dependencies>
</project>
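
Build and copy the shaded JAR, for example (the JAR name follows the artifactId/version above; /public is the example shared directory):

mvn clean package
sudo cp target/janusgraph-spark-0.0.1-SNAPSHOT.jar /public/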

Common Spark configuration

spark.master=yarn
spark.submit.deployMode=client
spark.executor.memory=1g
# include path to Spark jars and Hadoop native libs
spark.yarn.jars=/opt/cloudera/parcels/SPARK2/lib/spark2/jars/*
spark.yarn.am.extraJavaOptions=-Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
# use Java 1.8
spark.yarn.appMasterEnv.JAVA_HOME=/usr/local/jdk1.8.0_151/jre
spark.executorEnv.JAVA_HOME=/usr/local/jdk1.8.0_151/jre
# prepend shaded dependency jar to executor classpath
spark.executor.extraClassPath=/public/janusgraph-spark-0.0.1-SNAPSHOT.jar
spark.serializer=org.apache.spark.serializer.KryoSerializer

Gremlin classpath (set before launching ./bin/gremlin.sh)

export CLASSPATH=/public/janusgraph-spark-0.0.1-SNAPSHOT.jar:/etc/hadoop/conf:/opt/cloudera/parcels/SPARK2/lib/spark2/jars/*:/opt/cloudera/parcels/CDH/lib/hadoop-yarn/*

Note: if using HBase snapshots, include /etc/hbase/conf in both CLASSPATH and spark.executor.extraClassPath above, as shown below.
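
For example, combining the values already shown above:

export CLASSPATH=/etc/hbase/conf:/public/janusgraph-spark-0.0.1-SNAPSHOT.jar:/etc/hadoop/conf:/opt/cloudera/parcels/SPARK2/lib/spark2/jars/*:/opt/cloudera/parcels/CDH/lib/hadoop-yarn/*
spark.executor.extraClassPath=/etc/hbase/conf:/public/janusgraph-spark-0.0.1-SNAPSHOT.jar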

Test BulkLoaderVertexProgram

Update conf/hadoop-graph/hadoop-load.properties to include the Spark configuration from above.
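
For reference, a minimal hadoop-load.properties might look like the following (standard TinkerPop 3.3 Gryo input settings, a sketch only; merge with the common Spark configuration above):

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.inputLocation=data/grateful-dead.kryo
gremlin.hadoop.outputLocation=output
gremlin.hadoop.jarsInDistributedCache=true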

Upload test data to HDFS

hadoop fs -mkdir data
hadoop fs -copyFromLocal data/grateful-dead.kryo data

(Gremlin shell) Load schema

:load data/grateful-dead-janusgraph-schema.groovy
graph = JanusGraphFactory.open('conf/janusgraph-hbase.properties')
defineGratefulDeadSchema(graph)
graph.close()

(Gremlin shell) Execute vertex program

:plugin use tinkerpop.spark
graph = GraphFactory.open('conf/hadoop-graph/hadoop-load.properties')
blvp = BulkLoaderVertexProgram.build().writeGraph('conf/janusgraph-hbase.properties').create(graph)
graph.compute(SparkGraphComputer).program(blvp).submit().get()
graph.close()
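
(Gremlin shell) Optionally verify the load with a quick OLTP count; 808 is the vertex count of the grateful-dead dataset:

graph = JanusGraphFactory.open('conf/janusgraph-hbase.properties')
g = graph.traversal()
g.V().count()  // expect 808 for grateful-dead
graph.close()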

Test OLAP Traversal

Update conf/hadoop-graph/read-hbase.properties to include the Spark configuration from above.
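
For reference, a minimal read-hbase.properties might contain the following (a sketch using JanusGraph's HBaseInputFormat keys; the localhost hostname and janusgraph table name are assumptions for a default Quickstart VM install):

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hostname=localhost
janusgraphmr.ioformat.conf.storage.hbase.table=janusgraph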

(Gremlin shell) Execute traversal

:plugin use tinkerpop.spark
graph = GraphFactory.open('conf/hadoop-graph/read-hbase.properties')
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()
graph.close()
@Miroka96

What about the Hadoop version? Do I really need a running Hadoop cluster to use Hadoop graphs?
I would like to replace HDFS with a NAS and run all calculations on Spark.

Best regards,
Mirko
