Apache Spark querying a CSV using SQL Context
Using Apache Spark to query a CSV file with SQL-like syntax.

Launch the Spark shell with the spark-csv package for CSV parsing:

./bin/spark-shell --packages com.databricks:spark-csv_2.10:1.1.0
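If your Spark build uses Scala 2.11 rather than 2.10, swap in the matching artifact (assuming a 2.11 build of spark-csv is published for your version):

./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.1.0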

In the Scala REPL, enter the following, substituting the path to your CSV file:

import org.apache.spark.sql.SQLContext

val dataFile = "/Users/teg/Desktop/vewProductName.csv"
val sqlContext = new SQLContext(sc)

// Load the CSV via spark-csv (treating the first row as a header) and
// register it as a temporary table named "products"
sqlContext.load("com.databricks.spark.csv", Map("path" -> dataFile, "header" -> "true")).registerTempTable("products")
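Before saving anything, you can sanity-check the table directly in the shell. A minimal sketch, assuming the products table registered above:

sqlContext.sql("SELECT * FROM products LIMIT 5").show()

Once the data looks right, run your query and write the matching rows back out through spark-csv: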


sqlContext.sql("""select * from products WHERE SKU = 'ZZ888806041'""").save("/tmp/agg.csv", "com.databricks.spark.csv")
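spark-csv accepts further options in the same Map. For example, to load a pipe-delimited file (the path and table name below are hypothetical; "delimiter" is a documented spark-csv option, and later spark-csv releases also support "inferSchema"):

// Hypothetical pipe-delimited input; "delimiter" overrides the default comma
sqlContext.load("com.databricks.spark.csv", Map("path" -> "/tmp/other.csv", "header" -> "true", "delimiter" -> "|")).registerTempTable("other")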

Spark writes the result as a directory of fragments: /tmp/agg.csv is a directory containing one CSV part file per partition. To get the results of the query as a single file, merge the fragments back together with Hadoop's FileUtil.copyMerge:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._

// Concatenate all part files under srcPath into a single file at dstPath.
// Signature: copyMerge(srcFS, srcDir, dstFS, dstFile, deleteSource, conf, addString)
def merge(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
}

merge("/tmp/agg.csv", "agg.csv")

If you then open agg.csv, you will see the merged results of your query.
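If the result set is small, you can skip the merge step entirely by collapsing the DataFrame to a single partition before saving. A sketch, assuming Spark 1.4+ where DataFrame.coalesce is available (also handy because FileUtil.copyMerge was removed in Hadoop 3); the output is still a directory, but it holds a single part file, and all rows funnel through one executor:

// Write the query result as one part file (small results only)
sqlContext.sql("""select * from products WHERE SKU = 'ZZ888806041'""")
  .coalesce(1)
  .save("/tmp/agg_single.csv", "com.databricks.spark.csv")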
