@alokito
Created November 17, 2014 21:20
GroupBy bug in Spark's Java API, see https://issues.apache.org/jira/browse/SPARK-4459
pom.xml
<project>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.0.2</version>
    </dependency>
  </dependencies>
</project>
SimpleApp.java
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

public class SimpleApp {
    // Key function for groupBy. Despite the name, it divides the index by 4
    // (it does not take index % 4), so each key covers four consecutive lines.
    private static Function<Tuple2<String,Long>, Long> indexMod4 = new Function<Tuple2<String,Long>, Long>() {
        @Override
        public Long call(Tuple2<String, Long> arg0) throws Exception {
            return (long)(arg0._2 / 4);
        }
    };

    public static void main(String[] args) {
        String logFile = "$YOUR_SPARK_HOME/README.md"; // Should be some file on your system
        JavaSparkContext sc = new JavaSparkContext("local", "Simple App",
            "$YOUR_SPARK_HOME", new String[]{"target/simple-project-1.0.jar"});
        JavaRDD<String> logData = sc.textFile(logFile).cache();

        // Count the lines containing "a" and the lines containing "b".
        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("a"); }
        }).count();
        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("b"); }
        }).count();

        // zipWithIndex() yields a JavaPairRDD<String, Long>. The groupBy call on it
        // is the one javac rejects (SPARK-4459). The result is declared as a
        // JavaPairRDD because JavaRDD takes only a single type parameter.
        final JavaPairRDD<Long,Iterable<Tuple2<String,Long>>> parsedFiles1 = logData
            .zipWithIndex()
            .groupBy(indexMod4);

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    }
}

alokito commented Nov 17, 2014

If you create a simple Maven project with the attached pom.xml and src/main/java/SimpleApp.java, you should find that "mvn package" fails with the compile error "no suitable method found for groupBy(org.apache.spark.api.java.function.Function<scala.Tuple2<java.lang.String,java.lang.Long>,java.lang.Long>)".
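
For reference, the grouping that SimpleApp is trying to express can be sketched in plain Java without Spark. This is only an illustration under assumptions: the class GroupByIndexSketch, the method groupByIndexDiv4, and the sample lines are hypothetical names that are not part of the gist or of Spark. It mirrors the arithmetic in indexMod4 (index / 4) and, for brevity, stores only the lines in each group, whereas the Spark groupBy would hold the (line, index) tuples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Spark-free illustration; not part of the original gist.
class GroupByIndexSketch {

    // Pairs each line with its position (the role zipWithIndex plays) and
    // groups the lines by index / 4, the same arithmetic indexMod4 uses.
    static Map<Long, List<String>> groupByIndexDiv4(List<String> lines) {
        Map<Long, List<String>> groups = new LinkedHashMap<>();
        for (int i = 0; i < lines.size(); i++) {
            long key = i / 4L; // integer division: indices 0..3 -> 0, 4..7 -> 1, ...
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(lines.get(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("l0", "l1", "l2", "l3", "l4", "l5");
        // prints {0=[l0, l1, l2, l3], 1=[l4, l5]}
        System.out.println(groupByIndexDiv4(lines));
    }
}
```

Because the key is a quotient and not a remainder, each group is a block of four consecutive lines, which is what the Spark version would produce if the groupBy call compiled.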
