@robcowie
Created June 13, 2016 08:41
Configure S3 filesystem support for Spark on OSX

For a Homebrew-installed Spark 1.6.x:

  1. Install Hadoop to get the required jars (brew install hadoop)
  2. Create a spark-env.sh (cp /usr/local/Cellar/apache-spark/1.6.1/libexec/conf/spark-env.sh.template /usr/local/Cellar/apache-spark/1.6.1/libexec/conf/spark-env.sh)
  3. Set HADOOP_CONF_DIR in spark-env.sh (export HADOOP_CONF_DIR=/usr/local/Cellar/hadoop/2.7.2/libexec/etc/hadoop/)
  4. Add the required jars to SPARK_CLASSPATH in spark-env.sh (see below; a quick classpath check follows the snippet)
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.2/libexec
export SPARK_CLASSPATH="$SPARK_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-2.7.2.jar"
export SPARK_CLASSPATH="$SPARK_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar"
export SPARK_CLASSPATH="$SPARK_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/guava-11.0.2.jar"
# Or
# export SPARK_CLASSPATH="$SPARK_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*"
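To confirm the jars actually reach the JVM, here is a quick sanity check from pyspark. This is only a sketch: sc._jvm is a private pyspark handle, fine for interactive use but not a public API.

from pyspark import SparkContext

sc = SparkContext(appName="classpath-check")

# Each call raises a Py4JJavaError wrapping ClassNotFoundException if
# hadoop-aws (or its aws-java-sdk dependency) is missing from the classpath.
sc._jvm.java.lang.Class.forName("org.apache.hadoop.fs.s3.S3FileSystem")
sc._jvm.java.lang.Class.forName("org.apache.hadoop.fs.s3native.NativeS3FileSystem")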
  5. Ensure a core-site.xml exists in HADOOP_CONF_DIR
  6. Add properties mapping the s3 and s3n URL schemes to their filesystem implementations, as below (a pyspark smoke test follows the XML)
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>fs.s3.impl</name>
        <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
    </property>

    <property>
        <name>fs.s3n.impl</name>
        <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
    </property>
</configuration>
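With the jars on the classpath and the schemes registered, a minimal smoke test from pyspark looks like this. The bucket, key, and credentials are placeholders; the fs.s3n.* keys can equally live in core-site.xml.

from pyspark import SparkContext

sc = SparkContext(appName="s3n-smoke-test")

# Placeholder credentials; fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey
# can also be set in core-site.xml instead of on the live configuration.
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
hconf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

# Any readable object in a bucket you control will do.
print(sc.textFile("s3n://your-bucket/path/to/object.txt").take(5))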
erikcw commented Oct 5, 2016

Have you tried this configuration with Spark 2.0 and Hadoop 2.7.3? Doesn't seem to work in my case...

In 16:03:59 [7]: path = 's3://my-bucket/spark/application_1475092284966_0002_1475103701/part-r-01355-1b911ca0-9a25-473f-8ee7-54180024b534.json'

In 16:04:00 [8]: df = sqlContext.read.json(path)
16/10/05 16:04:00 WARN DataSource: Error while looking for metadata directory.
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-8-6f085f5ce069> in <module>()
----> 1 df = sqlContext.read.json(path)

/usr/local/Cellar/apache-spark/2.0.0/libexec/python/pyspark/sql/readwriter.pyc in json(self, path, schema, primitivesAsString, prefersDecimal, allowComments, allowUnquotedFieldNames, allowSingleQuotes, allowNumericLeadingZero, allowBackslashEscapingAnyCharacter, mode, columnNameOfCorruptRecord)
    218             path = [path]
    219         if type(path) == list:
--> 220             return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
    221         elif isinstance(path, RDD):
    222             def func(iterator):

/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    931         answer = self.gateway_client.send_command(command)
    932         return_value = get_return_value(
--> 933             answer, self.gateway_client, self.target_id, self.name)
    934
    935         for temp_arg in temp_args:

/usr/local/Cellar/apache-spark/2.0.0/libexec/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    310                 raise Py4JJavaError(
    311                     "An error occurred while calling {0}{1}{2}.\n".
--> 312                     format(target_id, ".", name), value)
    313             else:
    314                 raise Py4JError(

Py4JJavaError: An error occurred while calling o127.json.
: java.io.IOException: No FileSystem for scheme: s3
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:352)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:350)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
        at scala.collection.immutable.List.flatMap(List.scala:344)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
        at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:287)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:211)
        at java.lang.Thread.run(Thread.java:745)


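The "No FileSystem for scheme: s3" error means the JVM never saw the hadoop-aws jar: SPARK_CLASSPATH is deprecated and is not a reliable way to extend the classpath under Spark 2.0. One way to wire this up for Spark 2.0 / Hadoop 2.7.x, sketched below and not verified against this exact setup, is to ship the connector at launch time and switch to the s3a scheme:

from pyspark.sql import SparkSession

# Assumes the connector was supplied at launch, e.g.:
#   pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3
# Credentials come from fs.s3a.access.key / fs.s3a.secret.key or the usual
# AWS environment variables.
spark = (SparkSession.builder
         .appName("s3a-json")
         # Explicit for clarity; Hadoop 2.7 already maps s3a to this class.
         .config("spark.hadoop.fs.s3a.impl",
                 "org.apache.hadoop.fs.s3a.S3AFileSystem")
         .getOrCreate())

# The path from the comment above, addressed via s3a instead of s3.
df = spark.read.json("s3a://my-bucket/spark/application_1475092284966_0002_1475103701/part-r-01355-1b911ca0-9a25-473f-8ee7-54180024b534.json")

s3a is the maintained S3 connector in Hadoop 2.7+; the older block-store s3 scheme was deprecated and later removed.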
