Ubiquity conf. workshop setup

Setup for the "Processing And Analyzing Real-time Event Streams in the Cloud" workshop

This workshop will use some Google Cloud Platform services, including Cloud Dataflow, Cloud Pub/Sub, and BigQuery.

You'll use your own laptop for the workshop. Before it starts, do some initial setup and configuration prep so that you don't need to spend time on that during the workshop.

Cloud Project setup

Follow the instructions on the Dataflow 'getting started' page for Cloud Project Setup. Create a Cloud Project as necessary (see the free trial info below), set up billing, and enable the APIs we'll use.

Note: While it's necessary to enable billing on your project, we can apply credits to your account so that you don't get charged for the workshop activities.

  • If you do not already have a Google Cloud Platform Project created, go to https://cloud.google.com/ and click the "Try it Free" button to set up your project.

    You will then fill out a form. This will set up your new project and apply some starter credits to it. You will need to provide billing information when you set up your project, but the credits will cover this workshop (and much more) so you won't be charged.

  • If you already signed up for the free trial, but the credits have expired, talk to the instructor at the start of the workshop.

Then, enable the necessary project APIs using the provided link on that page.
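The console link is the supported path, but depending on your Cloud SDK version you may also be able to enable the APIs from the command line. A hedged sketch (service names as commonly used for these products):

```shell
# Enable the APIs used in the workshop from the command line.
# This assumes a Cloud SDK recent enough to support "gcloud services enable";
# if yours doesn't, use the console link described above instead.
gcloud services enable dataflow.googleapis.com \
    pubsub.googleapis.com \
    bigquery.googleapis.com
```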

Development environment setup

Once you've created your project, follow the instructions on the Dataflow 'getting started' page for Development Environment Setup.

  • Install the Cloud SDK, authenticate it with your account, and configure it to use your project as default.

  • Then, install Java as necessary. For this workshop, install Java 8. We'll run examples that use Java 8 lambda expression syntax. (The Dataflow Java SDK itself requires Java Development Kit version 1.7 or higher).
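Once the Cloud SDK and Java are installed, the auth and configuration steps above might look like the following sketch. The project ID shown is a placeholder; substitute your own.

```shell
# Authenticate the Cloud SDK with your Google account (opens a browser).
gcloud auth login

# Set your workshop project as the default.
# "my-workshop-project" is a placeholder for your actual project ID.
gcloud config set project my-workshop-project

# Sanity-check the Java install; the version string should mention 1.8.
java -version
```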

Install Maven

Install Maven according to the instructions for your OS.

https://cloud.google.com/dataflow/getting-started-maven

Download the example code and configure its maven dependencies

Download the Dataflow for Java SDK

Download or clone the Dataflow Java SDK repo from GitHub.

This SDK includes the examples we'll be using.
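Cloning the repo from GitHub might look like this:

```shell
# Clone the Dataflow Java SDK, which includes the examples used in the workshop.
git clone https://github.com/GoogleCloudPlatform/DataflowJavaSDK.git
cd DataflowJavaSDK
```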

Check the Maven pom file in the examples directory

Check that the Maven pom file in the examples directory (<path_to_sdk>/examples/pom.xml) includes the Dataflow dependency under the <dependencies> tag. It will probably look like this:

    <dependency>
      <groupId>com.google.cloud.dataflow</groupId>
      <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
      <version>${project.version}</version>
    </dependency>

If you'd rather compile the SDK from source instead of depending on the released Maven artifact, see the instructions in the Dataflow SDK top-level README. In a nutshell, run:

$ mvn install -DskipTests

from the top-level directory of the SDK. Do this if you have any issues compiling the examples using the Maven dependency.

Compile the Dataflow examples as a sanity check

Once you have either added the Dataflow maven dependency or compiled the SDK, you should be able to compile the examples. Do this to check your config. From the top-level directory of the SDK, run:

$ mvn clean install -pl examples

Download and configure project service account credentials

In the workshop, you'll run code to publish to a PubSub topic from your laptop. You'll need to use your project credentials for this.

Create a service account using the Google Developers Console. Navigate to the section APIs & Auth, then the sub-section Credentials. Create a service account or choose an existing one, then select Generate new JSON key and download the key. Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the JSON file downloaded, e.g.:

$ export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/credentials-key.json

(The code we'll run uses Application Default Credentials -- see that page for more info on how this works.)

Create a Google Cloud Storage 'staging bucket' for Dataflow

Follow these instructions on creating a GCS bucket to use when deploying Dataflow pipelines. Remember the name of that bucket.
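Creating the bucket from the command line might look like this. Both the project ID and bucket name below are placeholders; bucket names must be globally unique.

```shell
# Create a GCS staging bucket for Dataflow pipeline deployment.
# "my-workshop-project" and "my-dataflow-staging-bucket" are placeholders.
gsutil mb -p my-workshop-project gs://my-dataflow-staging-bucket/
```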

Create a BigQuery Dataset

You'll need to create a BigQuery dataset to hold your data, or know the name of one of your existing datasets.

To create a new dataset, go to the BigQuery dashboard, make sure the desired project is selected in the left nav bar, then click the arrow to the right of the project name to create a dataset.
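Alternatively, if you have the Cloud SDK installed, you can create the dataset with the bq command-line tool. The project ID and dataset name below are placeholders.

```shell
# Create a BigQuery dataset to hold the workshop data.
# "my-workshop-project" and "workshop_dataset" are placeholders.
bq --project_id my-workshop-project mk workshop_dataset
```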

Optional: set up Eclipse

If you're an Eclipse user, you might want to use the Cloud Dataflow Plugin for Eclipse. (This workshop won't assume use of Eclipse, though).

Troubleshooting and tips

  • Multiple Maven installs will likely cause trouble. If you have more than one Maven installation, you will probably need to remove all but one.

  • With the GCP Free Trial, you get only 8 GCE cores by default. Since each Dataflow streaming worker uses 4 cores, the default of 3 workers would require 12 cores, so you'll need to run your streaming jobs with 2 workers instead. Do this by adding --numWorkers=2 as a command-line argument when you start the Dataflow pipeline.
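For instance, launching one of the SDK's example pipelines with the reduced worker count might look like the following sketch. The project ID and bucket name are placeholders, and the exact example class you run will depend on the workshop exercise.

```shell
# Hedged example: run an example pipeline on the Dataflow service with 2 workers.
# Project, bucket, and main class are placeholders / assumptions.
mvn exec:java -pl examples \
  -Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
  -Dexec.args="--project=my-workshop-project \
    --stagingLocation=gs://my-dataflow-staging-bucket/staging \
    --runner=DataflowPipelineRunner \
    --numWorkers=2"
```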
