Hasura Notes

Problem

As mentioned in #1558, users don't want to be restricted by a specific project directory structure. Removing config.yaml will enable us to do that.

Context

Purpose of config.yaml:

  1. Provide configuration key-value pairs.
  2. Determine the execution directory location (by validating the project directory, which in turn enables us to run commands from anywhere inside the Hasura project directory tree).

Current alternatives available:

  • For 1, all the key/value pairs in config.yaml can be provided using CLI flags / ENV vars / a .env file (user-defined or the default .env).

  • For 2, the execution directory is currently set by validating the project directory, i.e. by checking whether the required file, config.yaml, is present. Thus, 2 proves to be a blocker.

Proposed Solutions

Proposed ideas for unblocking 2:

  • Sol 1: Remove support for config.yaml entirely and only allow running commands from the project root. In this case, pwd will always be treated as the project root.

    Cons: Users cannot run commands from any location inside the project directory tree except the project's root.

  • Sol 2: Remove support for config.yaml entirely and define a --config-file global flag, so that users provide the complete path to the configuration file with every command.

    Pros: Solves 2 completely. Directory validation can be done using the path provided for the config file (assuming the config file provided is in fact at the project root), and commands can be run from anywhere inside the Hasura project directory tree.

    Cons: Providing the location of the config file with every command might not be the best user experience.

We need to choose between allowing users to run commands from anywhere inside the project directory tree and requiring them to pass --config-file with every command.

  • Sol 3: Combining the above, we can define a global flag --config-file and make config.yaml optional.

    In this case, depending on how config is handled:

    • User-defined config file, i.e. --config-file flag provided: the execution directory is derived from the provided config file. Commands can be run from anywhere inside the project directory tree.
    • Default config file: this is our current workflow. We can probably extend our support regarding what counts as the default config file, e.g. by allowing a choice between config.yaml, hasura-config.yaml, etc., selected via a flag on the hasura init command.
    • No config file: pwd is treated as the execution directory. Every command needs to be run from the project root.

    In all of the above cases, configuration key/values are read from CLI flags / ENV vars / a .env file / the config file (if present). If none are provided, default values are used (see the sketch below).

The above provides a reasonable trade-off between the two choices.
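
A minimal Python sketch of the resolution order Sol 3 implies; the default file names and the helper function are illustrative only, not the actual CLI implementation:

```python
import os

# Hypothetical set of accepted default config file names (Sol 3 above).
DEFAULT_CONFIG_NAMES = ["config.yaml", "hasura-config.yaml"]


def resolve_execution_directory(config_file_flag=None):
    """Illustrative resolution order for Sol 3 (not the real CLI code)."""
    # 1. --config-file provided: treat its parent directory as the project root,
    #    so commands can be run from anywhere inside the project tree.
    if config_file_flag:
        return os.path.dirname(os.path.abspath(config_file_flag))

    # 2. No flag: walk up from pwd looking for a default config file
    #    (the current workflow, extended to several default names).
    current = os.getcwd()
    while True:
        if any(os.path.isfile(os.path.join(current, name)) for name in DEFAULT_CONFIG_NAMES):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root without finding one
            break
        current = parent

    # 3. No config file at all: pwd is treated as the execution directory,
    #    so every command must be run from the project root.
    return os.getcwd()
```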

Pub/Sub to BigQuery

This can be done in 2 ways:

  • If data processing needed: use Apache Beam
  • If no data processing needed: use Cloud Functions

Cost Calculator

If data processing needed: Apache Beam approach

Reference

Direction of Data Flow: Overview

Pub/Sub -> Apache Beam-> BigQuery

Data flow inside Apache Beam:

Pub/Sub -> Read Transform -> Write Transform -> BigQuery

Google-provided template for streaming inserts from Pub/Sub to BigQuery - link [Caveat: this feature is in beta]

Apache Beam

Reading from Pub/Sub:

Since Pub/Sub is a continuously updating data source, the resulting PCollection will be an unbounded one. Thus, if a bounded PCollection is needed, we will have to bound it explicitly.
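
A minimal sketch of the read side with the Beam Python SDK; the topic path is an assumption and the 60-second fixed window is arbitrary:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import window

# Reading from Pub/Sub yields an unbounded PCollection, so the pipeline
# has to run in streaming mode.
options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    events = (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/my-topic")  # illustrative topic path
        | "ParseJSON" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        # WindowInto subdivides the unbounded stream into fixed 60s windows;
        # it does not make the PCollection bounded.
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
    )
```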

Writing to BigQuery:

Documentation

  • write disposition: BigQueryDisposition.WRITE_APPEND appends rows to an existing table

  • insert method: the Python SDK doesn't support setting the insert method explicitly; it is chosen depending on the PCollection type (bounded for batch loads, unbounded for streaming inserts)

  • Transform: WriteToBigQuery

    • applied to a PCollection of dictionaries (see the sketch below)
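
A minimal sketch of the write transform, using an illustrative table reference and schema; with a bounded PCollection like this one the batch load path is used, while the unbounded Pub/Sub read above would go through streaming inserts:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        # A small bounded PCollection of dictionaries, just to exercise the write.
        | "CreateRows" >> beam.Create([
            {"event": "signup", "count": 1},
            {"event": "login", "count": 3},
        ])
        # WriteToBigQuery is applied to a PCollection of dictionaries whose keys
        # match the table schema; WRITE_APPEND appends rows to the table.
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="my-project:analytics.events",   # illustrative table reference
            schema="event:STRING,count:INTEGER",   # illustrative schema
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```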

Pricing

If no data processing needed: Cloud Functions

Direction of Data Flow

Pub/Sub -> Cloud Function -> BigQuery

Cloud Functions:

Trigger: There are two ways to do this:

  • Pub/Sub Trigger: Trigger a cloud function whenever messages are published to a Pub/Sub topic. Every message published to this topic will trigger function execution with message contents passed as input data (see the sketch after this list).

  • Cloud Scheduler: Trigger a cloud function at regular intervals.
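
For the Pub/Sub-trigger option, a minimal sketch of a Python Cloud Function that streams each message into BigQuery; the table id and the JSON message format are assumptions:

```python
import base64
import json

from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.analytics.events"  # illustrative fully-qualified table id


def pubsub_to_bigquery(event, context):
    """Background Cloud Function triggered by a Pub/Sub message.

    Assumes the message payload is a JSON object whose keys match the
    BigQuery table's columns.
    """
    # Pub/Sub delivers the payload base64-encoded in event["data"].
    row = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    # Streaming insert; insert_rows_json returns a list of per-row errors.
    errors = client.insert_rows_json(TABLE_ID, [row])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```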

Costs will be incurred by:

  • Cloud Functions
  • Pub/Sub
  • Cloud Scheduler: $0.10 per job per month, 3 free jobs

Requirements:

  • Scalability: easily up/down scaled depending on our needs
  • No ops: no/least hassle of infrastructure management
  • Reliability: always-on availability and constant uptime
  • Speed: ingestion and querying data with least latency
  • Analytics: native support for analytics
  • Cost: cost-effective

DataStore

Google BigQuery

  • Infrastructure: serverless, SaaS-like, automatically scales to fit needs in the background, requires minimal to no management

  • Pricing: Details

    • pay for streaming inserts, storage, and queries
    • loading and exporting are free
    • pay-as-you-go ($5 per TB, 1 TB per month free) or flat-rate pricing ($10,000/month for 500 slots)
  • BigQuery Limits

Loading data into BigQuery:

|                            | Batch                            | Streaming                        |
| -------------------------- | -------------------------------- | -------------------------------- |
| Delivery time              | Delayed                          | Almost instant, near real-time   |
| Cost                       | Free (pay only for data storage) | $0.01 per 200 MB                 |
| Loading steps              | Google Cloud Service -> BigQuery | Directly                         |
| Table rate limit (per day) | 1,000 load jobs                  | 100K records                     |
| Reprocess data             | Supported                        | Depends on the streaming service |
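
A back-of-the-envelope sketch combining the per-unit figures quoted above ($5 per TB queried with 1 TB free, $0.01 per 200 MB streamed); storage and flat-rate pricing are left out since they aren't quantified here in comparable units, and the inputs are purely illustrative:

```python
def monthly_cost_usd(tb_queried, mb_streamed):
    """Rough monthly cost from the per-unit prices quoted in these notes."""
    query_cost = max(tb_queried - 1, 0) * 5        # $5 per TB after the 1 TB/month free tier
    streaming_cost = (mb_streamed / 200) * 0.01    # $0.01 per 200 MB of streaming inserts
    return query_cost + streaming_cost


# e.g. 3 TB queried and 50,000 MB streamed in a month:
print(monthly_cost_usd(3, 50_000))  # 10.0 + 2.5 = 12.5
```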

Q: Do we need real-time analysis?

Data Visualisation:

  • Metabase provides a driver for connecting to BigQuery directly. Since we already use Metabase, this makes sense. (Details)

  • Other options: Google's DataStudio, Google Analytics, and other supported tools

  • Note: DataStudio supports visualising data on a map on the basis of Country, City, and Region from all data sources (including BigQuery) [Details]

Comparison with Amazon RedShift

| Data Warehouse           | BigQuery                                                | RedShift                                                            |
| ------------------------ | ------------------------------------------------------- | ------------------------------------------------------------------- |
| Infrastructure           | Serverless                                              | Cluster and nodes                                                    |
| Scaling                  | Automatically done in the background, with no downtime  | Requires downtime, needs to be configured manually                   |
| No Ops                   | Yes                                                     | User configures infrastructure, periodic management tasks required   |
| Max table columns        | 10,000 per table                                        | 1,600                                                                |
| Streaming data ingestion | Supported                                               | Must use Amazon Kinesis Firehose                                     |
| Metabase support         | Yes, Metabase provides a native driver                  | Yes                                                                  |
| Pricing                  | Costs may add up from streaming data and querying       | If we know all our requirements beforehand, we can set up an efficient pricing plan, but it charges on an hourly basis as opposed to usage |