Hasura Notes

Problem

As mentioned in #1558, users don't want to be restricted by a specific project directory structure. Removing config.yaml will enable us to do that.

Context

Purpose of config.yaml:

  1. Provide configuration key-value pairs.
  2. Determine the execution directory location (by validating the project directory, which in turn enables us to run commands from anywhere inside the Hasura project directory tree).

Current alternatives available:

  • For 1, all the key/value pairs in config.yaml can be provided using CLI flags / ENV vars / a .env file (user-defined or the default .env).

  • For 2, the execution directory is currently set by validating the project directory, i.e. by checking whether the required file, config.yaml, is present. Thus, 2 proves to be a blocker.

Proposed Solutions

Proposed ideas for unblocking 2:

  • Sol 1: Remove support for config.yaml entirely and only allow running commands from the project root. In this case, pwd will always be treated as the project root.

    Cons: Users cannot run commands from any location inside the project directory tree except the project's root.

  • Sol 2: Remove support for config.yaml entirely and define a --config-file global flag, so that users provide the complete path to the configuration file with every command.

    Pros: Solves 2 completely. Directory validation can be done using the path provided for the config file (assuming the config file provided is in fact at the project root), and commands can be run from anywhere inside the Hasura project directory tree.

    Cons: Providing the location of the config file with every command might not be the best user experience.

We need to choose between allowing users to run commands from anywhere inside the project directory tree and requiring them to pass --config-file with every command.

  • Sol 3: Combining the above, we can define a global flag --config-file and make config.yaml optional.

    In this case, depending on how config is handled:

    • User-defined config file, i.e. --config-file flag provided: the execution directory is derived from the provided config file. Commands can be run from anywhere inside the project directory tree.
    • Default config file: this is our current workflow. We can probably extend our support regarding what counts as the default config file, e.g. by allowing a choice between config.yaml, hasura-config.yaml, etc., selected via a flag on the hasura init command.
    • No config file: pwd is treated as the execution directory. Every command needs to be run from the project root.

    In all of the above cases, configuration key/values are read from CLI flags / ENV vars / a .env file / the config file (if present). If none are provided, default values are used (see the sketch below).

The above provides a reasonable trade-off between the two choices.
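
A minimal Python sketch of the resolution order Sol 3 implies; the default file names and the helper function are illustrative only, not the actual CLI implementation:

```python
import os

# Hypothetical set of accepted default config file names (Sol 3 above).
DEFAULT_CONFIG_NAMES = ["config.yaml", "hasura-config.yaml"]


def resolve_execution_directory(config_file_flag=None):
    """Illustrative resolution order for Sol 3 (not the real CLI code)."""
    # 1. --config-file provided: treat its parent directory as the project root,
    #    so commands can be run from anywhere inside the project tree.
    if config_file_flag:
        return os.path.dirname(os.path.abspath(config_file_flag))

    # 2. No flag: walk up from pwd looking for a default config file
    #    (the current workflow, extended to several default names).
    current = os.getcwd()
    while True:
        if any(os.path.isfile(os.path.join(current, name)) for name in DEFAULT_CONFIG_NAMES):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root without finding one
            break
        current = parent

    # 3. No config file at all: pwd is treated as the execution directory,
    #    so every command must be run from the project root.
    return os.getcwd()
```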

Pub/Sub to BigQuery

This can be done in 2 ways:

  • If data processing needed: use Apache Beam
  • If no data processing needed: use Cloud Functions

Cost Calculator

If data processing needed: Apache Beam approach

Reference

Direction of Data Flow: Overview

Pub/Sub -> Apache Beam-> BigQuery

Data flow inside Apache Beam:

Pub/Sub -> Read Transform -> Write Transform -> BigQuery

Google-provided template for streaming inserts from Pub/Sub to BigQuery - link [Caveat: this feature is in beta]

Apache Beam

Reading from Pub/Sub:

Since Pub/Sub is a continuously updating data source, the resulting PCollection will be an unbounded one. Thus, if a bounded PCollection is needed, we will have to bound it explicitly.
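
A minimal sketch of the read side with the Beam Python SDK; the topic path is an assumption and the 60-second fixed window is arbitrary:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import window

# Reading from Pub/Sub yields an unbounded PCollection, so the pipeline
# has to run in streaming mode.
options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    events = (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/my-topic")  # illustrative topic path
        | "ParseJSON" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        # WindowInto subdivides the unbounded stream into fixed 60s windows;
        # it does not make the PCollection bounded.
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
    )
```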

Writing to BigQuery:

Documentation

  • write disposition: BigQueryDisposition.WRITE_APPEND appends rows to an existing table

  • insert method: the Python SDK doesn't support setting the insert method explicitly; it is chosen depending on the PCollection type (bounded for batch loads, unbounded for streaming inserts)

  • Transform: WriteToBigQuery

    • applied to a PCollection of dictionaries (see the sketch below)
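
A minimal sketch of the write transform, using an illustrative table reference and schema; with a bounded PCollection like this one the batch load path is used, while the unbounded Pub/Sub read above would go through streaming inserts:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        # A small bounded PCollection of dictionaries, just to exercise the write.
        | "CreateRows" >> beam.Create([
            {"event": "signup", "count": 1},
            {"event": "login", "count": 3},
        ])
        # WriteToBigQuery is applied to a PCollection of dictionaries whose keys
        # match the table schema; WRITE_APPEND appends rows to the table.
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="my-project:analytics.events",   # illustrative table reference
            schema="event:STRING,count:INTEGER",   # illustrative schema
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```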

Pricing

If no data processing needed: Cloud Functions

Direction of Data Flow

Pub/Sub -> Cloud Function -> BigQuery

Cloud Functions:

Trigger: There are two ways to do this:

  • Pub/Sub Trigger: Trigger a cloud function whenever messages are published to a Pub/Sub topic. Every message published to this topic will trigger function execution with message contents passed as input data (see the sketch after this list).

  • Cloud Scheduler: Trigger a cloud function at regular intervals.
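
For the Pub/Sub-trigger option, a minimal sketch of a Python Cloud Function that streams each message into BigQuery; the table id and the JSON message format are assumptions:

```python
import base64
import json

from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.analytics.events"  # illustrative fully-qualified table id


def pubsub_to_bigquery(event, context):
    """Background Cloud Function triggered by a Pub/Sub message.

    Assumes the message payload is a JSON object whose keys match the
    BigQuery table's columns.
    """
    # Pub/Sub delivers the payload base64-encoded in event["data"].
    row = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    # Streaming insert; insert_rows_json returns a list of per-row errors.
    errors = client.insert_rows_json(TABLE_ID, [row])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```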

Costs will be incurred by:

  • Cloud Functions
  • Pub/Sub
  • Cloud Scheduler: $0.10 per job per month, 3 free jobs

Requirements:

  • Scalability: easily up/down scaled depending on our needs
  • No ops: no/least hassle of infrastructure management
  • Reliability: always-on availability and constant uptime
  • Speed: ingestion and querying data with least latency
  • Analytics: native support for analytics
  • Cost: cost-effective

DataStore

Google BigQuery

  • Infrastructure: serverless, SaaS-like, automatically scales to fit needs in the background, requires minimal to no management

  • Pricing: Details

    • pay for streaming inserts, storage, and queries
    • loading and exporting are free
    • pay-as-you-go ($5 per TB, 1 TB per month free) or flat-rate pricing ($10,000/month for 500 slots)
  • BigQuery Limits

Loading data into BigQuery:

|                            | Batch                            | Streaming                        |
| -------------------------- | -------------------------------- | -------------------------------- |
| Delivery time              | Delayed                          | Almost instant, near real-time   |
| Cost                       | Free (pay only for data storage) | $0.01 per 200 MB                 |
| Loading steps              | Google Cloud Service -> BigQuery | Directly                         |
| Table rate limit (per day) | 1,000 load jobs                  | 100K records                     |
| Reprocess data             | Supported                        | Depends on the streaming service |
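
A back-of-the-envelope sketch combining the per-unit figures quoted above ($5 per TB queried with 1 TB free, $0.01 per 200 MB streamed); storage and flat-rate pricing are left out since they aren't quantified here in comparable units, and the inputs are purely illustrative:

```python
def monthly_cost_usd(tb_queried, mb_streamed):
    """Rough monthly cost from the per-unit prices quoted in these notes."""
    query_cost = max(tb_queried - 1, 0) * 5        # $5 per TB after the 1 TB/month free tier
    streaming_cost = (mb_streamed / 200) * 0.01    # $0.01 per 200 MB of streaming inserts
    return query_cost + streaming_cost


# e.g. 3 TB queried and 50,000 MB streamed in a month:
print(monthly_cost_usd(3, 50_000))  # 10.0 + 2.5 = 12.5
```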

Q: Do we need real-time analysis?

Data Visualisation:

  • Metabase provides a driver for connecting to BigQuery directly. Since we already use Metabase, this makes sense. (Details)

  • Other options: Google's DataStudio, Google Analytics, and other supported tools

  • Note: DataStudio supports visualising data on a map on the basis of Country, City, and Region from all data sources (including BigQuery) [Details]

Comparison with Amazon RedShift

| Data Warehouse           | BigQuery                                                | RedShift                                                            |
| ------------------------ | ------------------------------------------------------- | ------------------------------------------------------------------- |
| Infrastructure           | Serverless                                              | Cluster and nodes                                                    |
| Scaling                  | Automatically done in the background, with no downtime  | Requires downtime, needs to be configured manually                   |
| No Ops                   | Yes                                                     | User configures infrastructure, periodic management tasks required   |
| Max table columns        | 10,000 per table                                        | 1,600                                                                |
| Streaming data ingestion | Supported                                               | Must use Amazon Kinesis Firehose                                     |
| Metabase support         | Yes, Metabase provides a native driver                  | Yes                                                                  |
| Pricing                  | Costs may add up from streaming data and querying       | If we know all our requirements beforehand, we can set up an efficient pricing plan, but it charges on an hourly basis as opposed to usage |