@daschl
Created August 22, 2016 07:07
Couchbase Spark Samples
// Start the Shell
./pyspark --packages com.couchbase.client:spark-connector_2.10:1.2.1 --conf "spark.couchbase.bucket.travel-sample="
// Create a DF
>>> df = sqlContext.read.format("com.couchbase.spark.sql.DefaultSource").option("schemaFilter", "type=\"airline\"").load()
// Print the Schema
>>> df.printSchema()
root
 |-- META_ID: string (nullable = true)
 |-- callsign: string (nullable = true)
 |-- country: string (nullable = true)
 |-- iata: string (nullable = true)
 |-- icao: string (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- type: string (nullable = true)
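
Once loaded, the DataFrame behaves like any other Spark SQL DataFrame. A quick sketch of querying it from the same shell (column names come from the schema above; this assumes the shell was started as shown and the travel-sample bucket is reachable):

```python
>>> # Select a few columns and filter with ordinary DataFrame operations
>>> df.select("name", "callsign").where(df.country == "United States").show(3)

>>> # Or register it as a temp table and use plain SQL (Spark 1.x API)
>>> df.registerTempTable("airlines")
>>> sqlContext.sql("SELECT name, iata FROM airlines WHERE iata IS NOT NULL").show(3)
```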
==========
Available options:
- schemaFilter => the predicate used, as above, in the WHERE clause of each query that defines the schema/type (see http://developer.couchbase.com/documentation/server/4.5/connectors/spark-1.2/spark-sql.html)
- bucket => if more than one bucket is open, specifies the name of the bucket to use
- idField => renames the document ID field; by default it's META_ID, and that's how you'd access it from your Spark SQL query
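
For illustration, here is how those options might be combined on a single read (the bucket name and the renamed ID column "docId" are just example values):

```python
>>> df = sqlContext.read.format("com.couchbase.spark.sql.DefaultSource") \
...     .option("bucket", "travel-sample") \
...     .option("idField", "docId") \
...     .option("schemaFilter", "type=\"airline\"") \
...     .load()
>>> df.select("docId", "name").show(5)
```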
@markmikostv

Hi,

Thanks this has been really helpful.

At the moment I have three buckets: Game, Beer, and Airline. If I have linked to more than one bucket within my interpreter, how do I specify a specific bucket?

thanks,

Mark

@daschl
Author

daschl commented Oct 18, 2016

@markmikostv you can provide more "bucket" properties on startup, and then on each DataFrame you need to add an option specifying which bucket you want.
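
A rough sketch of that setup (the second bucket name, beer-sample, is just an example):

```python
// Open two buckets when starting the shell
./pyspark --packages com.couchbase.client:spark-connector_2.10:1.2.1 \
  --conf "spark.couchbase.bucket.travel-sample=" \
  --conf "spark.couchbase.bucket.beer-sample="

// Then pick the bucket per DataFrame via the "bucket" option
>>> airlines = sqlContext.read.format("com.couchbase.spark.sql.DefaultSource") \
...     .option("bucket", "travel-sample") \
...     .option("schemaFilter", "type=\"airline\"").load()
>>> beers = sqlContext.read.format("com.couchbase.spark.sql.DefaultSource") \
...     .option("bucket", "beer-sample") \
...     .option("schemaFilter", "type=\"beer\"").load()
```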

@markmikostv

@daschl

I am looking to infer the schema first from either an id or a JSON document, and then load my data based on a type="airline" filter.

What's the best way to infer the schema and then load the data?

df = sqlContext.read.format("com.couchbase.spark.sql.DefaultSource").option("schemaFilter", "type=\"airline\"").load()
