judell/steampipe-hackernews-accumulator.md

## steampipe-hackernews-accumulator.md

      
    Raw
  

              steampipe-hackernews-accumulator.md
            
          
    Use case: I want to collect hackernews stories that mention certain keywords. The keywords are defined by a regex, for example steampipe|sql|postgres|\sapi\s'.
The basic steampipe query:
select id, 'hn_new' as source, title, time from hackernews_new where title ~* hn_search_regex()
union
select id, 'hn_show' as source, title, time from hackernews_show_hn where title ~* hn_search_regex()
union
select id, 'hn_ask' as source, title, time from hackernews_ask_hn where title ~*  hn_search_regex()
order by time desc;

To avoid having to rewrite the regex in many places, I've encapsulated it in a Postgres function, hn_search_regex().
create or replace function hn_search_regex() returns text as $$
  begin
    return squote('steampipe|sql|postgres|\sapi\s')
  end;
$$ language plpgsql;

That function has a helper, squote(), to wrap single quotes around the returned value.
Here's some output from the query.
    id    | source  |                                      title                                      |        time
----------+---------+---------------------------------------------------------------------------------+---------------------
 28642326 | hn_new  | A single person answered 76k questions about SQL on StackOverflow               | 2021-09-24 13:52:08
 28636464 | hn_show | Show HN: A Tableau+SQL course for data professionals                            | 2021-09-23 22:37:48
 28630563 | hn_show | Show HN: pganalyze Index Advisor – A better way to index your Postgres database | 2021-09-23 15:25:47

The tables hn_new, hn_ask, and hn_show return recent info. How to accumulate it over time? Here's one way: into an accumulator table.
create table hn(id bigint primary key, source text, title text, ts timestamp with timezone);

Here's its definition.
     Table "public.hn"
 Column |           Type           | 
--------+--------------------------+
 id     | bigint                   |
 source | text                     |
 title  | text                     |
 ts     | timestamp with time zone |
Indexes:
    "hn_pkey" PRIMARY KEY, btree (id)

Here's a procedure to insert rows into that table.
create or replace procedure hn(id bigint, source text, title text, ts timestamp with time zone) as $$
  begin
    insert into hn(id, source, title, ts) values (id, source, title, ts);
  end;
$$ language plplgsl;

Calling that procedure in steampipe.
psql -h localhost -p 9193 -d steampipe -U steampipe

steampipe=> call hn(1, 'hn_new', 'Sample title', current_timestamp);
CALL

steampipe=> call hn(1, 'hn_new', 'Sample title', current_timestamp);
ERROR:  duplicate key value violates unique constraint "hn_pkey"

A file, hn.sql, containing SQL that generating calls to the procedure
with data as (
  select id, 'hn_new' as source, title, time from hackernews_new where title ~* hn_search_regex()
  union
  select id, 'hn_show' as source, title, time from hackernews_show_hn where title ~* hn_search_regex()
  union
  select id, 'hn_ask' as source, title, time from hackernews_ask_hn where title ~*  hn_search_regex()
  order by time desc
)
select
  concat(
    'call hn(',
    d.id,
    ',' || squote(escape_single_quote(d.source)),
    ',' || squote(escape_single_quote(d.title)),
    ',' || squote(d.time::text),
    ');'
  ) as sql
from data d;

Ouput of that query at a point in time.
                                                                 sql
--------------------------------------------------------------------------------------------------------------------------------------
 call hn(28642326,'hn_new','A single person answered 76k questions about SQL on StackOverflow','2021-09-24 13:52:08');
 call hn(28636464,'hn_show','Show HN: A Tableau+SQL course for data professionals','2021-09-23 22:37:48');
 call hn(28630563,'hn_show','Show HN: pganalyze Index Advisor – A better way to index your Postgres database','2021-09-23 15:25:47');

A script, hn.sh, to run that query, outputs rows only (-t) to a file, hn.txt, strips double quotes, and then pipes the procedure calls into psql.
#!/bin/bash

cd /home/jon/playground
psql -h localhost -p 9193 -U steampipe -t --csv < hn.sql > hn.txt
sed -i 's/\"//g' hn.txt
psql -h localhost -p 9193 -U steampipe < hn.txt

A crontab entry to run hn.sh every 30 minutes.
30 * * * * /home/jon/playground/hn.sh