pyspark error

I'm receiving a strange error on a new install of Spark. I have set up a small three-node Spark cluster on top of an existing Hadoop instance. Any command I try to run in the pyspark shell produces the same error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/rdd.py", line 1041, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/opt/spark/python/pyspark/rdd.py", line 1032, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/opt/spark/python/pyspark/rdd.py", line 906, in fold
    vals = self.mapPartitions(func).collect()
  File "/opt/spark/python/pyspark/rdd.py", line 809, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/opt/spark/python/pyspark/rdd.py", line 2439, in _jrdd
    self._jrdd_deserializer, profiler)
  File "/opt/spark/python/pyspark/rdd.py", line 2374, in _wrap_function
    sc.pythonVer, broadcast_vars, sc._javaAccumulator)
  File "/usr/local/lib/python3.5/site-packages/py4j/java_gateway.py", line 1414, in __call__
    answer, self._gateway_client, None, self._fqn)
  File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/local/lib/python3.5/site-packages/py4j/protocol.py", line 324, in get_return_value
    format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.python.PythonFunction. Trace:
py4j.Py4JException: Constructor org.apache.spark.api.python.PythonFunction([class [B, class java.util.HashMap, class java.util.ArrayList, class java.lang.String, class java.lang.String, class java.util.ArrayList, class org.apache.spark.api.python.PythonAccumulatorV2]) does not exist
	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
	at py4j.Gateway.invoke(Gateway.java:235)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)

It appears that pyspark is unable to find the class org.apache.spark.api.python.PythonFunction. If I'm reading the code correctly, pyspark uses py4j to connect to an existing JVM; in this case, I'm guessing there is a Scala class it is trying to access, but it fails. Any ideas?
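
To illustrate the mechanism (my own sketch, run from the same pyspark shell where sc is the active SparkContext): py4j proxies JVM classes through the gateway, and a constructor call only succeeds if the JVM class actually has a constructor matching the argument types py4j sends; otherwise py4j raises the "Constructor ... does not exist" error seen above.

    # Sketch: how pyspark reaches JVM classes through py4j. Assumes a running
    # pyspark shell, where sc is the active SparkContext.
    jvm = sc._jvm  # py4j gateway view of the driver JVM

    # Any class on the JVM classpath resolves as an attribute path:
    m = jvm.java.util.HashMap()
    m.put("key", "value")

    # Instantiating a class sends a constructor command over the gateway;
    # if no JVM constructor matches the argument types py4j sends, it raises
    # "Constructor ... does not exist" -- exactly the failure traced above.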

In an effort to understand what calls py4j makes to the JVM, I manually added some debug prints to py4j/java_gateway.py, right after the constructor command is assembled:

        # inside py4j's constructor-call path, where the wire command
        # for a JVM constructor invocation is assembled
        command = proto.CONSTRUCTOR_COMMAND_NAME +\
            self._command_header +\
            args_command +\
            proto.END_COMMAND_PART

        # debug prints to dump each piece of the command before it is sent
        print(" ")
        print("proto.CONSTRUCTOR_COMMAND_NAME")
        print("%s", proto.CONSTRUCTOR_COMMAND_NAME)
        print("self._command_header")
        print("%s", self._command_header)
        print("args_command")
        print("%s", args_command)
        print("proto.END_COMMAND_PART")
        print("%s", proto.END_COMMAND_PART)
        print(" ")
        print(" ")

When I run the pyspark shell after adding the debug prints above, this is the output I get for a simple command:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/

Using Python version 3.5.2 (default, Dec  5 2016 08:51:55)
SparkSession available as 'spark'.
>>> sc.parallelize(range(1000)).count()

proto.CONSTRUCTOR_COMMAND_NAME
%s i

self._command_header
%s java.util.HashMap

args_command
%s
proto.END_COMMAND_PART
%s e




proto.CONSTRUCTOR_COMMAND_NAME
%s i

self._command_header
%s java.util.ArrayList

args_command
%s
proto.END_COMMAND_PART
%s e




proto.CONSTRUCTOR_COMMAND_NAME
%s i

self._command_header
%s java.util.ArrayList

args_command
%s
proto.END_COMMAND_PART
%s e




proto.CONSTRUCTOR_COMMAND_NAME
%s i

self._command_header
%s org.apache.spark.api.python.PythonFunction

args_command
%s jgAIoY3B5c3BhcmsuY2xvdWRwaWNrbGUKX2ZpbGxfZnVuY3Rpb24KcQAoY3B5c3BhcmsuY2xvdWRwaWNrbGUKX21ha2Vfc2tlbF9mdW5jCnEBY3B5c3BhcmsuY2xvdWRwaWNrbGUKX2J1aWx0aW5fdHlwZQpxAlgIAAAAQ29kZVR5cGVxA4VxBFJxBShLAksASwJLBUsTY19jb2RlY3MKZW5jb2RlCnEGWBoAAADCiAAAfAAAwogBAHwAAHwBAMKDAgDCgwIAU3EHWAYAAABsYXRpbjFxCIZxCVJxCk6FcQspWAUAAABzcGxpdHEMWAgAAABpdGVyYXRvcnENhnEOWCAAAAAvb3B0L3NwYXJrL3B5dGhvbi9weXNwYXJrL3JkZC5weXEPWA0AAABwaXBlbGluZV9mdW5jcRBNZgloBlgCAAAAAAFxEWgIhnESUnETWAQAAABmdW5jcRRYCQAAAHByZXZfZnVuY3EVhnEWKXRxF1JxGF1xGShoAChoAWgFKEsCSwBLAksCSxNoBlgMAAAAwogAAHwBAMKDAQBTcRpoCIZxG1JxHE6FcR0pWAEAAABzcR5oDYZxH2gPaBRNWQFoBlgCAAAAAAFxIGgIhnEhUnEiWAEAAABmcSOFcSQpdHElUnEmXXEnaAAoaAFoBShLAUsASwNLBEszaAZYMgAAAMKIAQB9AQB4HQB8AABEXRUAfQIAwogAAHwBAHwCAMKDAgB9AQBxDQBXfAEAVgFkAABTcShoCIZxKVJxKk6FcSspaA1YAwAAAGFjY3EsWAMAAABvYmpxLYdxLmgPaBRNggNoBlgIAAAAAAEGAQ0BEwFxL2gIhnEwUnExWAIAAABvcHEyWAkAAAB6ZXJvVmFsdWVxM4ZxNCl0cTVScTZdcTcoY19vcGVyYXRvcgphZGQKcThLAGV9cTmHcTpScTt9cTxOfXE9WAsAAABweXNwYXJrLnJkZHE+dFJhaDmHcT9ScUB9cUFOfXFCaD50UmgAKGgBaBhdcUMoaAAoaAFoJl1xRGgAKGgBaAUoSwFLAEsBSwJLU2gGWA4AAAB0AAB8AADCgwEAZwEAU3FFaAiGcUZScUdOhXFIWAMAAABzdW1xSYVxSlgBAAAAeHFLhXFMaA9YCAAAADxsYW1iZGE+cU1NCARjX19idWlsdGluX18KYnl0ZXMKcU4pUnFPKSl0cVBScVFdcVJoOYdxU1JxVH1xVU59cVZoPnRSYWg5h3FXUnFYfXFZTn1xWmg+dFJoAChoAWgYXXFbKGgAKGgBaCZdcVxoAChoAWgFKEsBSwBLAUsDS1NoBlgdAAAAdAAAZAEAZAIAwoQAAHwAAETCgwEAwoMBAGcBAFNxXWgIhnFeUnFfTmgFKEsBSwBLAksCS3NoBlgVAAAAfAAAXQsAfQEAZAAAVgFxAwBkAQBTcWBoCIZxYVJxYksBToZxYylYAgAAAC4wcWRYAQAAAF9xZYZxZmgPWAkAAAA8Z2VuZXhwcj5xZ00RBGgGWAIAAAAGAHFoaAiGcWlScWopKXRxa1JxbFguAAAAUkRELmNvdW50Ljxsb2NhbHM+LjxsYW1iZGE+Ljxsb2NhbHM+LjxnZW5leHByPnFth3FuaEmFcW9YAQAAAGlxcIVxcWgPaE1NEQRoTykpdHFyUnFzXXF0aDmHcXVScXZ9cXdOfXF4aD50UmFoOYdxeVJxen1xe059cXxoPnRSaAAoaAFoBShLAksASwJLBUsTaAZYJgAAAHQAAMKIAAB8AADCgwEAwogAAHwAAGQBABfCgwEAwogBAMKDAwBTcX1oCIZxflJxf05LAYZxgFgGAAAAeHJhbmdlcYGFcYJoDGgNhnGDWCQAAAAvb3B0L3NwYXJrL3B5dGhvbi9weXNwYXJrL2NvbnRleHQucHlxhGgjTcIBaAZYAgAAAAABcYVoCIZxhlJxh1gIAAAAZ2V0U3RhcnRxiFgEAAAAc3RlcHGJhnGKKXRxi1JxjF1xjShoAChoAWgFKEsBSwBLAUsESxNoBlgfAAAAwogCAHQAAHwAAMKIAQAUwogAABvCgwEAwogDABQXU3GOaAiGcY9ScZBOhXGRWAMAAABpbnRxkoVxk2gMhXGUaIRoiE2/AWgGWAIAAAAAAXGVaAiGcZZScZcoWAkAAABudW1TbGljZXNxmFgEAAAAc2l6ZXGZWAYAAABzdGFydDBxmmiJdHGbKXRxnFJxnV1xnihLAk3oA0sASwFlfXGfh3GgUnGhfXGiTn1xo1gPAAAAcHlzcGFyay5jb250ZXh0caR0UksBZWifh3GlUnGmfXGnaIFjX19idWlsdGluX18KeHJhbmdlCnGoc059calopHRSZWg5h3GqUnGrfXGsTn1xrWg+dFJlaDmHca5Sca99cbBOfXGxaD50UmVoOYdxslJxs31xtE59cbVoPnRSTmNweXNwYXJrLnNlcmlhbGl6ZXJzCkJhdGNoZWRTZXJpYWxpemVyCnG2KYFxt31xuChYCQAAAGJhdGNoU2l6ZXG5SwFYCgAAAHNlcmlhbGl6ZXJxumNweXNwYXJrLnNlcmlhbGl6ZXJzClBpY2tsZVNlcmlhbGl6ZXIKcbspgXG8fXG9WBMAAABfb25seV93cml0ZV9zdHJpbmdzcb6Jc2J1YmNweXNwYXJrLnNlcmlhbGl6ZXJzCkF1dG9CYXRjaGVkU2VyaWFsaXplcgpxvymBccB9ccEoaLlLAGi6aLxYCAAAAGJlc3RTaXplccJKAAABAHVidHHDLg==
ro24
ro25
s/usr/local/bin/python3.5
s3.5
ro26
ro14

proto.END_COMMAND_PART
%s e

# ERROR
# Then the error from above prints here...

Any ideas?

@lostinplace commented Jan 26, 2018

Did you ever figure this out?

@varunpatwardhan commented Nov 20, 2018

If somebody stumbles upon this in the future without finding an answer: I was able to work around this using the findspark package, inserting findspark.init() at the beginning of my code.
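
Something like the following, assuming pyspark is launched from a plain Python interpreter (the app name and the count example are illustrative):

    # Sketch of the findspark workaround: call findspark.init() before any
    # pyspark import, so the pyspark package that matches the local Spark
    # install is the one put on sys.path.
    import findspark
    findspark.init()  # or findspark.init("/opt/spark") to point at SPARK_HOME

    from pyspark import SparkContext
    sc = SparkContext(appName="test")  # illustrative app name
    print(sc.parallelize(range(1000)).count())
    sc.stop()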

@ppetruneac commented Nov 27, 2018

I am having a similar issue, but findspark.init(spark_home='/root/spark/', python_path='/root/anaconda3/bin/python3') did not solve it.

---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
<ipython-input-2-1727818e5299> in <module>()
      6 from pyspark.sql import SQLContext
      7 
----> 8 sc = SparkContext(appName="Pi")
      9 # spark = SQLContext(sc)
     10 

~/anaconda3/envs/spark-env/lib/python3.7/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    116         try:
    117             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
--> 118                           conf, jsc, profiler_cls)
    119         except:
    120             # If an error occurs, clean up in order to allow future SparkContext creation:

~/anaconda3/envs/spark-env/lib/python3.7/site-packages/pyspark/context.py in _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, jsc, profiler_cls)
    186         self._accumulatorServer = accumulators._start_update_server()
    187         (host, port) = self._accumulatorServer.server_address
--> 188         self._javaAccumulator = self._jvm.PythonAccumulatorV2(host, port)
    189         self._jsc.sc().register(self._javaAccumulator)
    190 

~/anaconda3/envs/spark-env/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1523         answer = self._gateway_client.send_command(command)
   1524         return_value = get_return_value(
-> 1525             answer, self._gateway_client, None, self._fqn)
   1526 
   1527         for temp_arg in temp_args:

~/anaconda3/envs/spark-env/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    330                 raise Py4JError(
    331                     "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
--> 332                     format(target_id, ".", name, value))
    333         else:
    334             raise Py4JError(

Py4JError: An error occurred while calling None.org.apache.spark.api.python.PythonAccumulatorV2. Trace:
py4j.Py4JException: Constructor org.apache.spark.api.python.PythonAccumulatorV2([class java.lang.String, class java.lang.Integer]) does not exist
	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
	at py4j.Gateway.invoke(Gateway.java:237)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-2~deb9u1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

Using Python 3 with Anaconda.
Spark 2.4.0

Any suggestions?

@vadalikasi commented Jan 21, 2019

I had a similar issue: the Spark version and the pyspark module version were different. After correcting the mismatch, the issue was resolved.
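
A quick way to check for this kind of mismatch (a sketch; the paths and versions are examples):

    # Print the version of the pip-installed pyspark module...
    import pyspark
    print(pyspark.__version__)

    # ...and compare it with the cluster's Spark build, e.g. from a shell:
    #   $SPARK_HOME/bin/spark-submit --version
    # If they differ, align them, e.g.:
    #   pip install pyspark==2.4.0   # match the installed Spark version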

@dharadhruve commented Jan 24, 2019

Any ideas? I'm getting the same error mentioned in the main thread.

@dharadhruve commented Apr 1, 2019

Has anyone found a solution? Please let me know.
