Skip to content

Instantly share code, notes, and snippets.

@profjsb
Last active October 2, 2018 13:50
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 3 You must be signed in to fork a gist
  • Save profjsb/8c079b33f9a2a2effeefecd093aeeb81 to your computer and use it in GitHub Desktop.
Save profjsb/8c079b33f9a2a2effeefecd093aeeb81 to your computer and use it in GitHub Desktop.
Refreshing Materialized Views in Postgres using Docker
# Dockerfile to run the cron jobs in root, including python/psycopg2
FROM alpine:edge
RUN apk add --no-cache -X http://dl-cdn.alpinelinux.org/alpine/edge/testing py-psycopg2
EXPOSE 6000
COPY refresh.py /refresh.py
COPY root /var/spool/cron/crontabs/root
CMD crond -l 2 -f
{
"AWSEBDockerrunVersion": "1",
"Ports": [
{ "ContainerPort": "6000" }
],
"Logging": "/var/log"
}
CREATE MATERIALIZED VIEW usefulname_hourly AS
SELECT super_horrible_SQL_that_makes_your_eyes_bleed
FROM massive_table
JOIN another_massive_table ...
JOIN yet_another_massive_table ...
WHERE ...
GROUP BY ...
CREATE UNIQUE INDEX usefulname_hourly_i ON
usefulname_hourly(column_to_index);

Refresh Materialized Views

This small codebase uses Docker to refresh materialized views in Postgresql on a periodic basis. It's intended to be installed in Elasticbeanstalk but can be run from your laptop. The Docker image is about 52 MB.

Materialized views defined in the target database with names ending in hourly and daily will get refreshed.

Development

  1. Build the Docker image
docker build -t mvs_refresh .
  1. If you need to jump into the image to test the script python /refresh.py, use the appropriate environment variables:
docker exec  -e MV_DB_HOST='...rds.amazonaws.com' \
   -e MV_DB_USER='db_write_user' \
   -e MV_DB_PW='secret' \
   -e MV_DB='database' \
   -it mvs_refresh /bin/sh

Elasticbeanstalk Deployment

  1. Make a file .ebextensions/secret.config that includes the environment variable info (fill in ... appropriately):
option_settings:  
  - option_name: MV_DB_HOST
    value: ...
  - option_name: MV_DB_USER
    value: ...
  - option_name: MV_DB_PW
    value: ...
  - option_name: MV_DB
    value: ...
  1. Start eb and deploy
eb init
# follow prompts
eb deploy

N.B.: You'll need to make sure that your database has given your EC2 instance(s) proper permissions to access the database.

Creating a Materialized View on the DB

  1. As a user that can SELECT on all the relevant tables and can CREATE in that database, make a materialize view. You should name it something that you'll remember and add a suffix hourly or daily so it can be refreshed appropriately.
CREATE MATERIALIZED VIEW usefulname_hourly AS 
SELECT super_horrible_SQL_that_makes_your_eyes_bleed
  FROM massive_table
  JOIN another_massive_table ...
  WHERE ...
  GROUP BY ...
  1. For the refresh to work (using CONCURRENT access), you'll need to create an index on at least one of the columns in that MV:
# Simply script to refresh materialized views in postgres
# J.Bloom, Wise.io 2016
import os
import sys
import psycopg2
if len(sys.argv) >= 2:
timescale_str = sys.argv[1]
else:
timescale_str = "hourly"
host = os.environ.get("MV_DB_HOST")
user = os.environ.get("MV_DB_USER")
pw = os.environ.get("MV_DB_PW")
db = os.environ.get("MV_DB")
ignores = []
conn = psycopg2.connect(
database = db,
user = user,
password = pw,
host = host
)
print "Connected to DB at {host}".format(host=host)
with conn.cursor() as cur:
# get all the materalized views and then loop
get_all_mv_SQL = """SELECT oid::regclass::text as name
FROM pg_class
WHERE relkind = 'm'
and oid::regclass::text LIKE '%{rate}';""".format(rate=timescale_str)
print "Getting the materialized views"
cur.execute(get_all_mv_SQL)
rows = cur.fetchall()
for row in rows:
mvname = row[0]
if mvname in ignores:
print "Ignoring a refresh for",mvname
continue
print "Refreshing",mvname,"..."
cur.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY {mvname};".format(mvname=mvname))
print "done."
conn.close()
1 * * * * python /refresh.py hourly
31 12 * * * python /refresh.py daily
@ededdneddyfan
Copy link

Since the connection was not opened using a with statement, shouldn't there be a conn.commit() at some point after executing a refresh?
ref: http://initd.org/psycopg/docs/connection.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment