Ran Google Cloud Platform BigQuery examples for working with time-series financial data. Used the `gcloud` command-line tool to connect to the Google Cloud project `financials-test`, initialized `bq` (the BigQuery CLI) and pointed it at the project, then imported two large CSV files of GBP/USD exchange rates from 2014.
```sh
# my GCP project
gcloud config set project financials-test

# my temporary dataset that is imported to BigQuery
bq mk timeseries

# note the column definitions appended to the `load` command
bq load timeseries.gbpusd_0114 \
  gs://solutions-public-assets/time-series-master/GBPUSD_2014_01.csv \
  venue:STRING,currencies:STRING,time:TIMESTAMP,bid:FLOAT,ask:FLOAT
bq load timeseries.gbpusd_0214 \
  gs://solutions-public-assets/time-series-master/GBPUSD_2014_02.csv \
  venue:STRING,currencies:STRING,time:TIMESTAMP,bid:FLOAT,ask:FLOAT
```
Then I switched to the BigQuery web UI and used the SQL query builder to run queries against the data sets. I ran a simple `SELECT * FROM <table> ORDER BY <column> LIMIT 1000;` to check the data. Then I updated the query to use the `FORMAT_UTC_USEC(time) AS time` pattern, to print microseconds.
```sql
SELECT
  venue,
  currencies,
  FORMAT_UTC_USEC(time) AS time,
  bid,
  ask
FROM
  timeseries.gbpusd_0114
ORDER BY
  time ASC
LIMIT 1000;
```
Next, I selected the `time` and `bid` column values where `time` fell within a 30-minute window at the start of the data set.
```sql
SELECT
  TIME(time) AS time,
  bid
FROM
  timeseries.gbpusd_0114
WHERE
  time BETWEEN TIMESTAMP("2014-01-01 00:00:00.000")
    AND TIMESTAMP("2014-01-01 00:29:59.999")
ORDER BY
  time ASC;
```
I learned to query data across two tables with common columns, based on a query of the table `id` itself. For example: find the `min` and `max` `time` across every table whose `id` contains `"gbpusd"`.
```sql
SELECT
  MIN(time) AS time1,
  MAX(time) AS time2
FROM
  (TABLE_QUERY(timeseries, 'table_id CONTAINS "gbpusd"'));
```
I briefly used a string formatter to clip the fractional seconds off of the `time` values, so that I could easily view results spanning the boundary between the two tables, without the extra string overhead.
```sql
SELECT
  STRFTIME_UTC_USEC(time, "%m-%d-%Y %r") AS time,
  ask
FROM
  [timeseries.gbpusd_0114], [timeseries.gbpusd_0214]
WHERE
  time BETWEEN TIMESTAMP("2014-01-31 23:30:00.000")
    AND TIMESTAMP("2014-02-01 00:29:59.999")
ORDER BY
  time ASC;
```
To produce a histogram, I grouped the count of ticks (`time` records) by `hour`, for 12- and 24-hour periods. Then, to make it more interesting, I calculated the average ticks per hour over all the days of the data set.
```sql
SELECT
  HOUR(time) AS hour,
  COUNT(time) / COUNT(DISTINCT DAY(time)) AS avg_ticks
FROM
  [timeseries.gbpusd_0114]
GROUP BY
  hour
ORDER BY
  hour ASC;
```
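The per-hour arithmetic can be sketched in plain Python. The ticks below are made-up sample timestamps, not rows from the real data set:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample ticks; the real tables hold millions of rows.
ticks = [
    datetime(2014, 1, 1, 9, 0, 5),
    datetime(2014, 1, 1, 9, 30, 0),
    datetime(2014, 1, 2, 9, 15, 0),
    datetime(2014, 1, 1, 10, 0, 0),
]

counts = defaultdict(int)   # total ticks per hour-of-day
days = defaultdict(set)     # distinct days contributing to each hour-of-day

for t in ticks:
    counts[t.hour] += 1
    days[t.hour].add(t.date())

# Mirrors COUNT(time) / COUNT(DISTINCT DAY(time)) per HOUR(time) group:
avg_ticks = {hour: counts[hour] / len(days[hour]) for hour in counts}
```

Each bucket divides the total tick count by the number of distinct days that contributed to it, which is what the `COUNT(time) / COUNT(DISTINCT DAY(time))` expression computes per `HOUR(time)` group.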
Last, and most importantly, I calculated a 60-second Simple Moving Average (SMA) for each tick in a 3-hour period, using a self-join on the table. I selected `time` and `AVG(bid)`, from two copies of the table joined using `JOIN EACH` for optimization, where the second copy's timestamp was between the tick's `time` and `time - 60` seconds, and where the tick itself fell within the 3-hour period.
I expect to reproduce this method, to calculate more interesting indicators.
```sql
SELECT
  TIME(S1.time) AS bid_time,
  AVG(S2.bid) AS avg_bid
FROM
  timeseries.gbpusd_0114 AS S1
JOIN EACH
  timeseries.gbpusd_0114 AS S2
ON
  S1.currencies = S2.currencies
WHERE
  TIMESTAMP_TO_SEC(S2.time)
    BETWEEN (TIMESTAMP_TO_SEC(S1.time) - 60)
    AND TIMESTAMP_TO_SEC(S1.time)
  AND S1.time
    BETWEEN TIMESTAMP("2014-01-16 09:00:00.000")
    AND TIMESTAMP("2014-01-16 11:59:59.999")
GROUP BY
  bid_time
ORDER BY
  bid_time ASC;
```
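To sanity-check the windowing logic, here is a small Python sketch of the same 60-second trailing average over a handful of hypothetical ticks (the real query computes this inside BigQuery via the join above):

```python
from datetime import datetime, timedelta

# Hypothetical (time, bid) ticks; in BigQuery these come from the table itself.
ticks = [
    (datetime(2014, 1, 16, 9, 0, 0), 1.6400),
    (datetime(2014, 1, 16, 9, 0, 30), 1.6410),
    (datetime(2014, 1, 16, 9, 1, 0), 1.6420),
    (datetime(2014, 1, 16, 9, 5, 0), 1.6380),
]

def sma(ticks, window=timedelta(seconds=60)):
    """For each tick, average all bids whose time falls in
    [tick_time - window, tick_time], inclusive on both ends,
    matching the BETWEEN clause in the query."""
    out = []
    for t1, _ in ticks:
        bids = [bid for t2, bid in ticks if t1 - window <= t2 <= t1]
        out.append((t1, sum(bids) / len(bids)))
    return out
```

For example, the tick at 09:01:00 averages the three bids from 09:00:00 onward, while the isolated tick at 09:05:00 averages only itself.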