I'm recieving a strange error on a new install of Spark. I have setup a small 3 node spark cluster on top of an existing hadoop instance. The error I get is the same for any command I try to run on pyspark shell I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/spark/python/pyspark/rdd.py", line 1041, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/opt/spark/python/pyspark/rdd.py", line 1032, in sum
return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
File "/opt/spark/python/pyspark/rdd.py", line 906, in fold
vals = self.mapPartitions(func).collect()
File "/opt/spark/python/pyspark/rdd.py", line 809, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/opt/spark/python/pyspark/rdd.py", line 2439, in _jrdd
self._jrdd_deserializer, profiler)
File "/opt/spark/python/pyspark/rdd.py", line 2374, in _wrap_function
sc.pythonVer, broadcast_vars, sc._javaAccumulator)
File "/usr/local/lib/python3.5/site-packages/py4j/java_gateway.py", line 1414, in __call__
answer, self._gateway_client, None, self._fqn)
File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.5/site-packages/py4j/protocol.py", line 324, in get_return_value
format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.python.PythonFunction. Trace:
py4j.Py4JException: Constructor org.apache.spark.api.python.PythonFunction([class [B, class java.util.HashMap, class java.util.ArrayList, class java.lang.String, class java.lang.String, class java.util.ArrayList, class org.apache.spark.api.python.PythonAccumulatorV2]) does not exist
at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
at py4j.Gateway.invoke(Gateway.java:235)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
It appears the pyspark is unable to find the class org.apache.spark.api.python.PythonFunction
. If I'm reading the code correctly pyspark uses py4j to connect to an existing JVM, in this case I'm guessing there is a Scala file it is trying to gain access to, but it fails. Any ideas?
In an effort to understand what calls are being made by py4j to java I manually added some debugging calls to:
py4j/java_gateway.py
command = proto.CONSTRUCTOR_COMMAND_NAME +\
self._command_header +\
args_command +\
proto.END_COMMAND_PART
print(" ")
print("proto.CONSTRUCTOR_COMMAND_NAME")
print("%s", proto.CONSTRUCTOR_COMMAND_NAME)
print("self._command_header")
print("%s", self._command_header)
print("args_command")
print("%s", args_command)
print("proto.END_COMMAND_PART")
print("%s", proto.END_COMMAND_PART)
print(" ")
print(" ")
When I run pyspark shell after adding the debug prints above this is the ouput I get on a simple command:
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.0.2
/_/
Using Python version 3.5.2 (default, Dec 5 2016 08:51:55)
SparkSession available as 'spark'.
>>> sc.parallelize(range(1000)).count()
proto.CONSTRUCTOR_COMMAND_NAME
%s i
self._command_header
%s java.util.HashMap
args_command
%s
proto.END_COMMAND_PART
%s e
proto.CONSTRUCTOR_COMMAND_NAME
%s i
self._command_header
%s java.util.ArrayList
args_command
%s
proto.END_COMMAND_PART
%s e
proto.CONSTRUCTOR_COMMAND_NAME
%s i
self._command_header
%s java.util.ArrayList
args_command
%s
proto.END_COMMAND_PART
%s e
proto.CONSTRUCTOR_COMMAND_NAME
%s i
self._command_header
%s org.apache.spark.api.python.PythonFunction
args_command
%s jgAIoY3B5c3BhcmsuY2xvdWRwaWNrbGUKX2ZpbGxfZnVuY3Rpb24KcQAoY3B5c3BhcmsuY2xvdWRwaWNrbGUKX21ha2Vfc2tlbF9mdW5jCnEBY3B5c3BhcmsuY2xvdWRwaWNrbGUKX2J1aWx0aW5fdHlwZQpxAlgIAAAAQ29kZVR5cGVxA4VxBFJxBShLAksASwJLBUsTY19jb2RlY3MKZW5jb2RlCnEGWBoAAADCiAAAfAAAwogBAHwAAHwBAMKDAgDCgwIAU3EHWAYAAABsYXRpbjFxCIZxCVJxCk6FcQspWAUAAABzcGxpdHEMWAgAAABpdGVyYXRvcnENhnEOWCAAAAAvb3B0L3NwYXJrL3B5dGhvbi9weXNwYXJrL3JkZC5weXEPWA0AAABwaXBlbGluZV9mdW5jcRBNZgloBlgCAAAAAAFxEWgIhnESUnETWAQAAABmdW5jcRRYCQAAAHByZXZfZnVuY3EVhnEWKXRxF1JxGF1xGShoAChoAWgFKEsCSwBLAksCSxNoBlgMAAAAwogAAHwBAMKDAQBTcRpoCIZxG1JxHE6FcR0pWAEAAABzcR5oDYZxH2gPaBRNWQFoBlgCAAAAAAFxIGgIhnEhUnEiWAEAAABmcSOFcSQpdHElUnEmXXEnaAAoaAFoBShLAUsASwNLBEszaAZYMgAAAMKIAQB9AQB4HQB8AABEXRUAfQIAwogAAHwBAHwCAMKDAgB9AQBxDQBXfAEAVgFkAABTcShoCIZxKVJxKk6FcSspaA1YAwAAAGFjY3EsWAMAAABvYmpxLYdxLmgPaBRNggNoBlgIAAAAAAEGAQ0BEwFxL2gIhnEwUnExWAIAAABvcHEyWAkAAAB6ZXJvVmFsdWVxM4ZxNCl0cTVScTZdcTcoY19vcGVyYXRvcgphZGQKcThLAGV9cTmHcTpScTt9cTxOfXE9WAsAAABweXNwYXJrLnJkZHE+dFJhaDmHcT9ScUB9cUFOfXFCaD50UmgAKGgBaBhdcUMoaAAoaAFoJl1xRGgAKGgBaAUoSwFLAEsBSwJLU2gGWA4AAAB0AAB8AADCgwEAZwEAU3FFaAiGcUZScUdOhXFIWAMAAABzdW1xSYVxSlgBAAAAeHFLhXFMaA9YCAAAADxsYW1iZGE+cU1NCARjX19idWlsdGluX18KYnl0ZXMKcU4pUnFPKSl0cVBScVFdcVJoOYdxU1JxVH1xVU59cVZoPnRSYWg5h3FXUnFYfXFZTn1xWmg+dFJoAChoAWgYXXFbKGgAKGgBaCZdcVxoAChoAWgFKEsBSwBLAUsDS1NoBlgdAAAAdAAAZAEAZAIAwoQAAHwAAETCgwEAwoMBAGcBAFNxXWgIhnFeUnFfTmgFKEsBSwBLAksCS3NoBlgVAAAAfAAAXQsAfQEAZAAAVgFxAwBkAQBTcWBoCIZxYVJxYksBToZxYylYAgAAAC4wcWRYAQAAAF9xZYZxZmgPWAkAAAA8Z2VuZXhwcj5xZ00RBGgGWAIAAAAGAHFoaAiGcWlScWopKXRxa1JxbFguAAAAUkRELmNvdW50Ljxsb2NhbHM+LjxsYW1iZGE+Ljxsb2NhbHM+LjxnZW5leHByPnFth3FuaEmFcW9YAQAAAGlxcIVxcWgPaE1NEQRoTykpdHFyUnFzXXF0aDmHcXVScXZ9cXdOfXF4aD50UmFoOYdxeVJxen1xe059cXxoPnRSaAAoaAFoBShLAksASwJLBUsTaAZYJgAAAHQAAMKIAAB8AADCgwEAwogAAHwAAGQBABfCgwEAwogBAMKDAwBTcX1oCIZxflJxf05LAYZxgFgGAAAAeHJhbmdlcYGFcYJoDGgNhnGDWCQAAAAvb3B0L3NwYXJrL3B5dGhvbi9weXNwYXJrL2NvbnRleHQucHlxhGgjTcIBaAZYAgAAAAABcYVoCIZxhlJxh1gIAAAAZ2V0U3RhcnRxiFgEAAAAc3RlcHGJhnGKKXRxi1JxjF1xjShoAChoAWgFKEsBSwBLAUsESxNoBlgfAAAAwogCAHQAAHwAAMKIAQAUwogAABvCgwEAwogDABQXU3GOaAiGcY9ScZBOhXGRWAMAAABpbnRxkoVxk2gMhXGUaIRoiE2/AWgGWAIAAAAAAXGVaAiGcZZScZcoWAkAAABudW1TbGljZXNxmFgEAAAAc2l6ZXGZWAYAAABzdGFydDBxmmiJdHGbKXRxnFJxnV1xnihLAk3oA0sASwFlfXGfh3GgUnGhfXGiTn1xo1gPAAAAcHlzcGFyay5jb250ZXh0caR0UksBZWifh3GlUnGmfXGnaIFjX19idWlsdGluX18KeHJhbmdlCnGoc059calopHRSZWg5h3GqUnGrfXGsTn1xrWg+dFJlaDmHca5Sca99cbBOfXGxaD50UmVoOYdxslJxs31xtE59cbVoPnRSTmNweXNwYXJrLnNlcmlhbGl6ZXJzCkJhdGNoZWRTZXJpYWxpemVyCnG2KYFxt31xuChYCQAAAGJhdGNoU2l6ZXG5SwFYCgAAAHNlcmlhbGl6ZXJxumNweXNwYXJrLnNlcmlhbGl6ZXJzClBpY2tsZVNlcmlhbGl6ZXIKcbspgXG8fXG9WBMAAABfb25seV93cml0ZV9zdHJpbmdzcb6Jc2J1YmNweXNwYXJrLnNlcmlhbGl6ZXJzCkF1dG9CYXRjaGVkU2VyaWFsaXplcgpxvymBccB9ccEoaLlLAGi6aLxYCAAAAGJlc3RTaXplccJKAAABAHVidHHDLg==
ro24
ro25
s/usr/local/bin/python3.5
s3.5
ro26
ro14
proto.END_COMMAND_PART
%s e
# ERROR
# Then the error from above prints here...
Any ideas?
did you ever figure this out?