This workshop will use some Google Cloud Platform services, including Cloud Dataflow, Cloud Pub/Sub, and BigQuery.
You'll use your own laptop for the workshop. Before it starts, do some initial setup and config prep so that you don't need to spend time on that during the workshop.
- Cloud Project setup
- Development environment setup
- Install Maven
- Download the example code and configure its Maven dependencies
- Download and configure project service account credentials
- Create a Google Cloud Storage 'staging bucket' for Dataflow
- Create a BigQuery Dataset
- Optional: set up Eclipse
- Troubleshooting and Tips
Follow the instructions on the Dataflow 'getting started' page for Cloud Project Setup. Create a Cloud Project as necessary (see the free trial info below), set up billing, and enable the APIs we'll use.
Note: While it's necessary to enable billing on your project, we can apply credits to your account so that you don't get charged for the workshop activities.
- If you do not already have a Google Cloud Platform Project created, go to https://cloud.google.com/ and click the "Try it Free" button to set up your project.
You will then fill out a form. This will set up your new project and apply some starter credits to it. You will need to provide billing information when you set up your project, but the credits will cover this workshop (and much more) so you won't be charged.
- If you already signed up for the free trial, but the credits have expired, talk to the instructor at the start of the workshop.
Then, enable the necessary project APIs using the provided link on that page.
Once you've created your project, follow the instructions on the Dataflow 'getting started' page for Development Environment Setup.
- Install the Cloud SDK, authenticate it with your account, and configure it to use your project as the default.
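A minimal sketch of those steps with the gcloud CLI (the project ID shown is a placeholder, not a value from this document):

```shell
# Authenticate the Cloud SDK with your Google account
gcloud auth login

# Make your workshop project the default (replace with your project ID)
gcloud config set project your-project-id

# Confirm the active account and project
gcloud config list
```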
- Then, install Java as necessary. For this workshop, install Java 8: we'll run examples that use Java 8 lambda expression syntax. (The Dataflow Java SDK itself requires Java Development Kit version 1.7 or higher.)
Install Maven according to the instructions for your OS.
https://cloud.google.com/dataflow/getting-started-maven
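Once both are installed, a quick sanity check:

```shell
# Maven prints its own version plus the Java version it will use;
# the Java line should report 1.8.x for this workshop.
java -version
mvn -v
```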
Download or clone the Dataflow Java SDK repo from GitHub.
This SDK includes the examples we'll be using.
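For example, assuming you're cloning over HTTPS from the GoogleCloudPlatform organization on GitHub:

```shell
# Clone the Dataflow Java SDK repo, which contains the examples module
git clone https://github.com/GoogleCloudPlatform/DataflowJavaSDK.git
cd DataflowJavaSDK
```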
Check that the Maven pom file in the examples directory (<path_to_sdk>/examples/pom.xml) includes the Dataflow dependency under the <dependencies> tag. It will probably look like this:
<dependency>
  <groupId>com.google.cloud.dataflow</groupId>
  <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
  <version>${project.version}</version>
</dependency>
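If you prefer to check from the command line, a simple grep run from the SDK's top-level directory will confirm the dependency is present:

```shell
# Print any lines in the examples pom that reference the Dataflow SDK artifact
grep -n "google-cloud-dataflow-java-sdk-all" examples/pom.xml
```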
If you'd rather compile the SDK directly instead of using the Maven build, see the instructions in the Dataflow SDK top-level README. In a nutshell, run:
$ mvn install -DskipTests
from the top-level directory of the SDK. Do this if you have any issues compiling the examples using the Maven dependency.
Once you have either added the Dataflow Maven dependency or compiled the SDK, you should be able to compile the examples. Do this to check your config. From the top-level directory of the SDK, run:
$ mvn clean install -pl examples
In the workshop, you'll run code to publish to a Pub/Sub topic from your laptop. You'll need to use your project credentials for this.
Create a service account using the Google Developers Console. Navigate to the APIs & Auth section, then the Credentials sub-section. Create a service account or choose an existing one, then select Generate new JSON key and download the key. Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the downloaded JSON file, e.g.:
$ export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/credentials-key.json
(The code we'll run uses Application Default Credentials -- see that page for more info on how this works.)
Follow these instructions on creating a GCS bucket to use when deploying Dataflow pipelines. Remember the name of that bucket.
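For example, with the gsutil tool that ships with the Cloud SDK (the bucket name below is a placeholder, and bucket names must be globally unique):

```shell
# Create a staging bucket for Dataflow job resources
gsutil mb gs://your-dataflow-staging-bucket
```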
You'll need to create a BigQuery dataset to hold your data, or know the name of one of your existing datasets.
To create a new dataset, go to the BigQuery dashboard, make sure the desired project is selected in the left nav bar, then click the arrow to the right of the project name to create a dataset.
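Alternatively, if you have the Cloud SDK's bq tool installed, you can create a dataset from the command line (the dataset name is a placeholder):

```shell
# Create a BigQuery dataset in your default project
bq mk workshop_dataset
```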
If you're an Eclipse user, you might want to use the Cloud Dataflow Plugin for Eclipse. (This workshop won't assume use of Eclipse, though).
- Multiple Maven installations will likely cause trouble. If you have more than one Maven installation, you will probably need to clear or delete one.
- With the GCP Free Trial, you get only 8 GCE cores by default. Since each Dataflow streaming worker uses 4 cores, the default of 3 workers would need 12 cores, so you'll need to run your streaming jobs with 2 workers instead. Do this by adding --numWorkers=2 as a command-line argument when you start up the Dataflow pipeline.
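For example, when launching a pipeline via Maven, the flag can be passed through the pipeline's arguments. This is an illustrative sketch only: the main class, project ID, and bucket below are placeholders, not values from this document.

```shell
# Launch a pipeline capped at 2 workers (replace placeholders with your values)
mvn compile exec:java -pl examples \
  -Dexec.mainClass=com.example.YourPipeline \
  -Dexec.args="--project=your-project-id \
    --stagingLocation=gs://your-dataflow-staging-bucket/staging \
    --runner=DataflowPipelineRunner \
    --numWorkers=2"
```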