@mhausenblas
Created February 8, 2015 16:07
Scala Spark skeleton implementing grep
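<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>my.org</groupId>
  <artifactId>spark-grep</artifactId>
  <version>1.0-SNAPSHOT</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.2.0</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
      </plugin>
      <!-- Assumption: maven-compiler-plugin compiles only Java sources, so a Scala
           compiler plugin is needed; scala-maven-plugin is one option. -->
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.0</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>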
package spark.example

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkGrep {
  def main(args: Array[String]): Unit = {
    if (args.length < 3) {
      System.err.println("Usage: SparkGrep <host> <input_file> <match_term>")
      System.exit(1)
    }
    // args(0) is the Spark master URL, e.g. "local[2]" or "spark://host:7077"
    val conf = new SparkConf().setAppName("SparkGrep").setMaster(args(0))
    val sc = new SparkContext(conf)
    // Read the input as an RDD of lines, split into at least 2 partitions
    val inputFile = sc.textFile(args(1), 2).cache()
    val matchTerm: String = args(2)
    // Count the lines that contain the match term
    val numMatches = inputFile.filter(line => line.contains(matchTerm)).count()
    println("%s lines in %s contain %s".format(numMatches, args(1), matchTerm))
    // Shut down the SparkContext cleanly before the JVM exits
    sc.stop()
  }
}
@Ayush257

I ran this code on 700 files, passing //.txt as argument 2. I want the output to name the individual file, as in "4 lines in //part-123.txt contain" followed by my search term, but instead I get "4 lines in //.txt contain" followed by the term. How can I find out which of the 700 files contain my search term? Any help would be much appreciated.
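
One possible approach, sketched under assumptions rather than taken from the gist: `sc.textFile` merges every matched file into a single RDD of lines, so the file of origin is lost before the count runs. `sc.wholeTextFiles` instead yields one (path, content) pair per file, which allows a count per file. The object name `SparkGrepPerFile` and the helper `countMatchingLines` are hypothetical, and the sketch assumes each file is small enough to be held in memory as a single string.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

// Hypothetical variant of SparkGrep that reports matches per input file.
object SparkGrepPerFile {

  // Pure helper: number of lines in `content` that contain `term`.
  def countMatchingLines(content: String, term: String): Int =
    content.split("\n").count(line => line.contains(term))

  def main(args: Array[String]): Unit = {
    if (args.length < 3) {
      System.err.println("Usage: SparkGrepPerFile <host> <input_path> <match_term>")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("SparkGrepPerFile").setMaster(args(0))
    val sc = new SparkContext(conf)
    val matchTerm = args(2)

    // wholeTextFiles returns one (filePath, fileContent) pair per file,
    // so the path stays attached to the content being searched.
    val perFile = sc.wholeTextFiles(args(1))
      .map { case (path, content) => (path, countMatchingLines(content, matchTerm)) }
      .filter { case (_, n) => n > 0 } // keep only files with at least one match
      .collect()

    perFile.foreach { case (path, n) =>
      println("%s lines in %s contain %s".format(n, path, matchTerm))
    }
    sc.stop()
  }
}
```

It takes the same three arguments as SparkGrep and prints one line per file that has at least one match. For files too large to load whole, a different technique would be needed, since `wholeTextFiles` reads each file as a single record.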
