Go SQL

Database Environments

A schema contains a group of tables. A database contains a group of schemas.

In prod, we use a Cloud SQL PostgreSQL instance. The reports DB is for prod only. Its presence should be a gatekeeper to prevent destructive teardowns.

Everywhere else, we assume a local PostgreSQL running on port 5432 with two existing databases. The khan_test DB is for integration tests. The khan_dev DB is for local development.

We can use the postgres user in all three environments, but in prod we want to vary the user to better differentiate the access pattern metrics.

The schemas for all three should be kept the same.

We do this with migrations.

Migrations

How to manually run these

Assuming you have the golang-migrate CLI installed (e.g. brew install golang-migrate).

NOTE: Running migrations in prod is super-dangerous! Measure twice, cut once!

golang-migrate provides a CLI that can be used to manually perform the database migrations.

For instance to get the initial schema created:

migrate -verbose -path . -database "postgres://localhost:5432/khan_test?sslmode=disable" goto 1

For an existing database that already has the schemas applied, you can set the version without applying migrations via:

migrate -verbose -path . -database "postgres://localhost:5432/khan_test?sslmode=disable" force 1

A complete tutorial is available here.
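If you'd rather run migrations from Go code (e.g. at service startup) instead of the CLI, golang-migrate also exposes a library API. Here is a minimal sketch, assuming the migration files live under reports/migrations and a local dev DSN; adjust the path and credentials to your setup:

package main

import (
    "log"

    "github.com/golang-migrate/migrate/v4"
    // Blank imports register the postgres database driver and the file source driver.
    _ "github.com/golang-migrate/migrate/v4/database/postgres"
    _ "github.com/golang-migrate/migrate/v4/source/file"
)

func main() {
    // Assumed local-dev values, mirroring the CLI examples above.
    m, err := migrate.New(
        "file://reports/migrations",
        "postgres://postgres@localhost:5432/khan_dev?sslmode=disable")
    if err != nil {
        log.Fatalf("migrate.New: %v", err)
    }
    // Apply all pending up migrations; ErrNoChange just means we are already current.
    if err := m.Up(); err != nil && err != migrate.ErrNoChange {
        log.Fatalf("migrate up: %v", err)
    }
}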

khan_dev=# \l
                           List of databases
   Name    |  Owner   | Encoding | Collate | Ctype | Access privileges
-----------+----------+----------+---------+-------+-------------------
 khan_dev  | postgres | UTF8     | C       | C     |
 khan_test | postgres | UTF8     | C       | C     |
 postgres  | steve    | UTF8     | C       | C     |

This is an example of a script that will set up the empty databases for dev and test:

# A simple script to create the test / dev postgres databases needed for webapp
# tests and development that depend on postgres.

# Our test and dev environments assume that the user has created a postgres
# user with create db privs.  This should have been done within dotfiles, but
# we try our best to do it here anyways.

if ! psql -tc "SELECT rolname from pg_catalog.pg_roles"  postgres \
  | grep -c 'postgres' > /dev/null 2>&1 ; then
    psql --quiet -c "CREATE ROLE postgres LOGIN SUPERUSER;" postgres;
fi

# Create the db that we'll use in our tests.
if ! psql -U postgres -l | grep khan_test ; then
  createdb khan_test -U postgres -O postgres
fi
psql -U postgres -qc "create extension if not exists btree_gist" khan_test;
psql -U postgres -qc "create extension if not exists pg_stat_statements" khan_test;

# We'll also create the dev db, in case we run tests that use the
# dev_appserver
if ! psql -U postgres -l | grep khan_dev ; then
  createdb khan_dev -U postgres -O postgres
fi
psql -U postgres -qc "create extension if not exists btree_gist" khan_dev;
psql -U postgres -qc "create extension if not exists pg_stat_statements" khan_dev;

How to add migrations

Please note the naming of the files in this directory.

They need to be named so that the migrations can be ordered lexicographically. This is often accomplished by using a numbered prefix like 001_something.sql.

If you added another file you would name it 002_whatever_you_want.sql.

Another possibility is yyyymmdd_whatever_you_want.sql (e.g. 20200921_something.sql), but you would need to rename all the files that already exist here to follow this convention.

How to Create The Initial Migration

The initial migrations in 001_initial_schema.up.sql were generated by running:

cloud_sql_proxy -dir=/Users/steve/cloudsql -instances=$DBHOST=tcp:5433 &
pg_dump -U $DBUSER -h 127.0.0.1 -p 5433 $DBNAME -s -f 001_initial_schema.up.sql

I deleted a few out-of-the-box functions, as well as an old deprecated table or two.

Subsequent migrations should be placed in files that sort lexicographically after the existing ones.

How to update the migration bindata for unit and integration tests

Assuming you have installed go-bindata using go get -u github.com/go-bindata/go-bindata/...:

cd ../..
go-bindata -pkg migrations -ignore bindata -nometadata -prefix reports/migrations/ -o ./reports/migrations/bindata.go ./reports/migrations

This will allow the unit and integration tests to verify that the database schema is in sync with the code.
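For reference, this is roughly how the generated bindata can be wired into golang-migrate inside a test. This is a sketch, not the actual test code: the test package name, the import path for the generated bindata package, and the test DSN are assumptions.

package migrations_test

import (
    "testing"

    "github.com/golang-migrate/migrate/v4"
    _ "github.com/golang-migrate/migrate/v4/database/postgres"
    bindata "github.com/golang-migrate/migrate/v4/source/go_bindata"

    // Hypothetical import path for the generated bindata package.
    migrations "example.com/webapp/reports/migrations"
)

func TestMigrationsApplyCleanly(t *testing.T) {
    // Wrap the generated AssetNames/Asset functions as a migrate source.
    res := bindata.Resource(migrations.AssetNames(),
        func(name string) ([]byte, error) {
            return migrations.Asset(name)
        })
    src, err := bindata.WithInstance(res)
    if err != nil {
        t.Fatal(err)
    }
    m, err := migrate.NewWithSourceInstance("go-bindata", src,
        "postgres://postgres@localhost:5432/khan_test?sslmode=disable")
    if err != nil {
        t.Fatal(err)
    }
    // Running all up migrations against khan_test checks that schema and code agree.
    if err := m.Up(); err != nil && err != migrate.ErrNoChange {
        t.Fatal(err)
    }
}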

No ORM

I prefer to write queries by hand. Maximum control and minimum mental overhead.

Generally ORMs are more trouble than they are worth, especially since the Go standard library has nice support out of the box. Beyond the database/sql package, there is also the handy https://github.com/jmoiron/sqlx.
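To make that concrete, a hand-written query against database/sql looks something like the sketch below; the reports table and its columns are made up for illustration.

package main

import (
    "context"
    "database/sql"
    "fmt"
    "log"

    _ "github.com/lib/pq" // registers the "postgres" driver name
)

// Report is a hypothetical row type for illustration only.
type Report struct {
    ID    int64
    Title string
}

func main() {
    db, err := sql.Open("postgres",
        "postgres://postgres@localhost:5432/khan_dev?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Plain parameterized SQL: no ORM, no query builder.
    rows, err := db.QueryContext(context.Background(),
        `SELECT id, title FROM reports WHERE id > $1 ORDER BY id`, 0)
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    for rows.Next() {
        var r Report
        if err := rows.Scan(&r.ID, &r.Title); err != nil {
            log.Fatal(err)
        }
        fmt.Println(r.ID, r.Title)
    }
    if err := rows.Err(); err != nil {
        log.Fatal(err)
    }
}

With sqlx, the Scan loop collapses into a single db.Select(&reports, query, args...) call.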

Postgres driver choices

  • github.com/jackc/pgx - nice type support
  • github.com/lib/pq - has been around for ages and is battle tested. Cloud SQL Proxy uses this.
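Either driver plugs into database/sql, so switching is mostly a matter of the import and the driver name. A sketch of connecting through pgx's stdlib adapter (assuming pgx v4, whose adapter registers the "pgx" driver name):

package main

import (
    "database/sql"
    "log"

    _ "github.com/jackc/pgx/v4/stdlib" // registers the "pgx" driver name
)

func main() {
    // Same database/sql API as lib/pq; only the driver name and import change.
    db, err := sql.Open("pgx",
        "postgres://postgres@localhost:5432/khan_dev?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    if err := db.Ping(); err != nil {
        log.Fatal(err)
    }
    log.Println("connected via pgx stdlib adapter")
}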

DB setup and migrations

I like to do migrations externally myself. The libs always seemed overly complex and loaded with features that my teams did not need. If a library is needed, then golang-migrate is worth a look.

Testing databases stuff

Make sure you can run go test ./... without that Docker setup. Skip tests that require a database unless it's explicitly configured via an environment variable or build tag, IMO.
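One way to gate database tests on an environment variable is a small test helper like the sketch below; the variable name KHAN_TEST_POSTGRES_DSN is invented for illustration.

package reports_test

import (
    "database/sql"
    "os"
    "testing"

    _ "github.com/lib/pq"
)

// testDB skips the calling test unless a Postgres DSN is explicitly provided,
// so a bare `go test ./...` stays green without Docker or a local database.
func testDB(t *testing.T) *sql.DB {
    t.Helper()
    dsn := os.Getenv("KHAN_TEST_POSTGRES_DSN") // hypothetical variable name
    if dsn == "" {
        t.Skip("set KHAN_TEST_POSTGRES_DSN to run database tests")
    }
    db, err := sql.Open("postgres", dsn)
    if err != nil {
        t.Fatal(err)
    }
    t.Cleanup(func() { db.Close() })
    return db
}

func TestReportsQuery(t *testing.T) {
    db := testDB(t)
    if err := db.Ping(); err != nil {
        t.Fatal(err)
    }
}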

I’ve also used SQLite under test and PostgreSQL in production and worked around the (few) incompatibilities.

Google Cloud SQL

Postgres Connection Pooling In Go

Cloud SQL Postgres has connection limits.

Briefly, if the number of clients is more than ~100, the performance of Postgres degrades. That happens because of Postgres-side implementation quirks: launching a dedicated process for every connection, snapshot-taking mechanism, and using shared memory for interactions — all these factors are relevant. You can brush up on the reasons behind that in this brilliant article brandur.org/postgres-connections.

http://go-database-sql.org/connection-pool.html

Generally, prefer Go’s connection pool over PgBouncer. PgBouncer’s transaction pooling doesn’t play well with the prepared statements that Go’s drivers use under the hood, so it’s a no-go IMO. On GCP, you are already connecting through their proxy connector. That said, it is still challenging as you scale up the number of services talking to PostgreSQL, and GCP doesn’t let you change the connection limit parameter for their managed PostgreSQL service.

A good step is to introspect either the service count or the connection count and call SetMaxOpenConns based on it.
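Concretely, the pool limits live on *sql.DB. A sketch with made-up numbers, sized so that instances × MaxOpenConns stays comfortably under the Cloud SQL connection limit:

package main

import (
    "database/sql"
    "log"
    "time"

    _ "github.com/jackc/pgx/v4/stdlib"
)

func main() {
    db, err := sql.Open("pgx",
        "postgres://postgres@localhost:5432/khan_dev?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Example numbers only: with ~10 service instances and a ~100-connection
    // Cloud SQL limit, cap each instance well below limit / instances.
    db.SetMaxOpenConns(8)
    db.SetMaxIdleConns(4)
    db.SetConnMaxLifetime(30 * time.Minute)
}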

Need more scaling for Postgres?

PostgreSQL and go demo app

Data stuff:

Parquet is a columnar format that's faster than CSV:

NextCloud

Prometheus Metrics

GraphQL Client

Google Cloud Run

  • https://medium.com/google-cloud/how-to-run-serverless-batch-jobs-on-google-cloud-ca45a4e33cb1
  • https://medium.com/google-cloud/cloud-run-and-cloud-function-what-i-use-and-why-12bb5d3798e1
  • https://medium.com/google-cloud/cloud-run-vs-cloud-functions-whats-the-lowest-cost-728d59345a2e
  • https://medium.com/google-cloud/making-requests-to-cloud-run-with-the-service-account-620014dc1486

Google Cloud Composer

Google Cloud Composer is basically Apache Airflow. We handle a wide variety of workflows on a daily basis that require advanced tooling such as complex dependency management, fan-in and fan-out, and more. In addition, Airflow provides useful tools for logging, metrics, and monitoring.

Airflow, in its design, provides the wrong abstraction. Airflow Operators, instead of simply orchestrating work to be executed, actually implement some of the functional work themselves. This means that Airflow Operators inherently combine orchestration bugs with execution bugs.

  1. Each step of this DAG is a different functional task, so each step is created using a different Airflow Operator. Developers must spend time researching, understanding, using, and debugging the Operator they want to use. This also means that each time a developer wants to perform a new type of task, they must repeat all of these steps with a new Operator. And, as we found at Bluecore, Operators are often buggy. Developer after developer moved a previously-working workflow over to Airflow only to have it brought down by an issue with an Airflow Operator itself.

  2. Developers can quickly get started creating DAGs using the plug-and-play nature of the Operators, but in the face of any issues, the Operators themselves complicate root-cause analysis. This is because Operators themselves often handle the bulk of the work! This can range from creating connections, querying databases, parsing results, and more. Ultimately, this is abstracting away functionality that the developer should, ideally, totally understand!

  3. Operators are executed on the Airflow workers themselves. The Airflow Scheduler, which runs on Kubernetes Pod A, will indicate to a Worker, which runs on Kubernetes Pod B, that an Operator is ready to be executed. At that point, the Worker will pick up the Operator and execute the work directly on Pod B. This will happen for every Operator that it executes. This means that all Python package dependencies from each workflow will need to be installed on each Airflow Worker for Operators to be executed successfully. Different workflows can have very different requirements. In the best case, this means soaking up valuable memory by loading all of the packages onto each Worker (and costing more money). In the worst case, Python package conflicts could prevent workflows from being run on the same Airflow instance altogether. These are the issues that arise simply from using Airflow in its prescribed way.

In lieu of a growing list of functionality-specific Operators, we believe that there should be a single, bug-free Operator that would be able to execute any arbitrary task. This shift would allow us to separate workflow management from workflow execution, simplifying our understanding of Airflow and our ability to quickly debug issues.

Other Google Cloud stuff

Interesting opinion stuff
