Notes on Datomic Analytics Preview

Datomic Analytics in Preview

Initial notes and findings

Cognitect has bundled presto-server directly in the Datomic distribution. Taking the datomic plugin and dropping it into stock Presto (v0.226) resulted in an error on startup; I'm not sure whether the bundled version of Presto has been modified somehow. Judging by the io.prestosql package names in the stack trace further down, the bundled server looks like a prestosql build (Presto 316) rather than prestodb, which would explain why stock prestodb 0.226 can't find a com.facebook.presto.spi.Plugin implementation.

2019-10-04T15:52:26.045-0400	INFO	main	com.facebook.presto.server.PluginManager	-- Loading plugin /Users/adam/dev/presto-data/plugin/datomic --
2019-10-04T15:52:26.085-0400	WARN	main	com.facebook.presto.server.PluginManager	No service providers of type com.facebook.presto.spi.Plugin

The Presto plugin uses Peer Server to access the db instead of the Peer library directly. I hope this isn't a fundamental limitation, as it seems like it could limit scalability since data access can't go directly against storage the way it does with a Peer. It would be cool if every worker node in the Presto cluster could be a full Datomic peer, along with something similar to Cloud's query group functionality for cache coherence.

The release notes mention that the latest version of Datomic is required. It appears that, in addition to this, the db schema also needs to be upgraded for tuple support (see the sketch after the error output below):

presto:adam_dev> show tables;
Query 20191004_201958_00005_f3u47 failed: Assert failed: db must support tuples - see https://docs.datomic.com/client-api/datomic.client.api.html#var-administer-system
(:db/tupleTypes ret)
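
The upgrade itself is presumably done with the administer-system action that the error message links to; roughly something like the following. This is a sketch only: the client setup, credentials, endpoint, and db name here are illustrative, not my actual configuration.

;; Sketch: upgrade the db schema for tuple support via the client API,
;; per the administer-system docs linked in the error above.
;; All connection parameters below are illustrative.
(require '[datomic.client.api :as d])

(def client
  (d/client {:server-type :peer-server
             :access-key  "myaccesskey"
             :secret      "mysecret"
             :endpoint    "localhost:8998"
             :validate-hostnames false}))

(d/administer-system client {:db-name "adam_dev"
                             :action  :upgrade-schema})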

With the schema upgraded on my dev database, let's try an initial query. My dev Datomic database is, among other things, loaded with a complete copy of my Salesforce.com org. Using a minimal metaschema edn, I should have access to all the Lead objects in this org:

{:tables {:sfdc.Lead/Id {} }}

presto:adam_dev> SELECT COUNT(*) FROM sfdc_lead;

Query 20191004_211906_00011_f3u47, FAILED, 1 node
Splits: 36 total, 3 done (8.33%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20191004_211906_00011_f3u47 failed: [?start ?e] not bound in expression clause: [(>= ?e ?start)]

Hmmmm....it is trying, but something doesn't seem to work. I guess COUNT(*) isn't supported by the connector at this point.

Let's try checking the metadata:

presto:adam_dev> DESCRIBE sfdc_lead;
             Column             |     Type      | Extra | Comment 
--------------------------------+---------------+-------+---------
 annualrevenue                  | decimal(38,2) |       |         
 website                        | varchar       |       |         
 donotcall                      | boolean       |       |         
 description                    | varchar       |       |         
 isconverted                    | boolean       |       |         
 state                          | varchar       |       |         
 fastcall__productinterest__c   | bigint        |       |         
 city                           | varchar       |       |         
 email                          | varchar       |       |         
 qualifier_need__c              | bigint        |       |         
 dandbcompanyid                 | bigint        |       |         
 primary__c                     | bigint        |       |         
 fastcall__primary__c           | bigint        |       |         
 jigsaw                         | varchar       |       |         
 lastname                       | varchar       |       |         
 currentgenerators__c           | varchar       |       |         
 address                        | bigint        |       |         
 mobilephone                    | varchar       |       |         
 status                         | bigint        |       |         
 lasttransferdate               | timestamp     |       |         
 last_website_visit_date__c     | timestamp     |       |         
 productinterest__c             | bigint        |       |         
 systemmodstamp                 | timestamp     |       |         
 emailbounceddate               | timestamp     |       |         
 ownerid                        | bigint        |       |         
 rating                         | bigint        |       |         
 phone                          | varchar       |       |         
 reason_for_reason__c           | bigint        |       |         
 qualifier_budget__c            | bigint        |       |         
 convertedcontactid             | bigint        |       |         
 leadsource                     | bigint        |       |         
 call_count__c                  | double        |       |         
 fax                            | varchar       |       |         
 emailbouncedreason             | varchar       |       |         
 id                             | varchar       |       |         
 numberofemployees              | bigint        |       |         
 name                           | varchar       |       |         
 score__c                       | double        |       |         
 salutation                     | bigint        |       |         
 fastcall__currentgenerators__c | varchar       |       |         
 siccode__c                     | varchar       |       |         
 fastcall__siccode__c           | varchar       |       |         
 hasoptedoutofemail             | boolean       |       |         
 createddate                    | timestamp     |       |         
 hasoptedoutoffax               | boolean       |       |         
 street                         | varchar       |       |         
 qualifier_authority__c         | bigint        |       |         
 country                        | varchar       |       |         
 geocodeaccuracy                | bigint        |       |         
 statusreason__c                | bigint        |       |         
 title                          | varchar       |       |         
 lastreferenceddate             | timestamp     |       |         
 jigsawcontactid                | varchar       |       |         
 qualifier_timing__c            | bigint        |       |         
 converteddate                  | timestamp     |       |         
 company                        | varchar       |       |         
 convertedopportunityid         | bigint        |       |         
 numberoflocations__c           | double        |       |         
 bdr_owner__c                   | bigint        |       |         
 postalcode                     | varchar       |       |         
 lastmodifiedbyid               | bigint        |       |         
 convertedaccountid             | bigint        |       |         
 lastmodifieddate               | timestamp     |       |         
 photourl                       | varchar       |       |         
 firstname                      | varchar       |       |         
 isunreadbyowner                | boolean       |       |         
 longitude                      | double        |       |         
 actiontakenid__c               | varchar       |       |         
 industry                       | bigint        |       |         
 latestcampaign__c              | varchar       |       |         
 fastcall__numberoflocations__c | double        |       |         
 companydunsnumber              | varchar       |       |         
 star__c                        | boolean       |       |         
 masterrecordid                 | bigint        |       |         
 latitude                       | double        |       |         
 createdbyid                    | bigint        |       |         
 acuityid__c                    | varchar       |       |         
 isdeleted                      | boolean       |       |         
 lastvieweddate                 | timestamp     |       |         
 timezone__c                    | varchar       |       |         
 lastactivitydate               | timestamp     |       |         
 db__id                         | bigint        |       |         
(82 rows)

Query 20191004_211826_00009_f3u47, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:01 [82 rows, 6.07KB] [105 rows/s, 7.81KB/s]

Nice! It looks like all the Lead object columns are brought in automatically just by following Datomic naming conventions (e.g. :sfdc.Lead/FirstName apparently surfaces as the firstname column on sfdc_lead, and the entity id comes through as db__id).

Let's try another query that might work better than COUNT(*):

presto:adam_dev> select name from sfdc_lead limit 10;
        name         
---------------------
 Num 29108 Load Test 
 Num 29109 Load Test 
 Num 29106 Load Test 
 Num 29107 Load Test 
 Num 29112 Load Test 
 Num 29113 Load Test 
 Num 29110 Load Test 
 Num 29111 Load Test 
 Num 29116 Load Test 
 Num 29117 Load Test 
(10 rows)

Query 20191004_211856_00010_f3u47, FINISHED, 1 node
Splits: 36 total, 26 done (72.22%)
0:02 [58.7K rows, 57.3KB] [28.3K rows/s, 27.7KB/s]

Getting somewhere... all my test data seems to be accessible. Let's try another aggregate:

presto:adam_dev> SELECT MAX(createddate) FROM sfdc_lead;
          _col0          
-------------------------
 2019-01-28 15:57:34.000 
(1 row)

Query 20191004_212752_00012_f3u47, FINISHED, 1 node
Splits: 36 total, 36 done (100.00%)
0:03 [106K rows, 103KB] [40.4K rows/s, 39.5KB/s]

That worked... so I guess it's just something about COUNT(*). I'll raise a support case for this.

Connecting Metabase

Aside from some rough edges, Datomic analytics seems to give us SQL over Datomic databases nearly for free (the cost is writing a metaschema, although in our case we can probably auto-generate the metaschema from our own metamodel; a hypothetical sketch follows). Let's see if Metabase lets us get free reporting and dashboard tools on top of it.
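
The metaschema format looks simple enough that generating it should be trivial. A hypothetical sketch of what that generation could look like, with a hard-coded attribute list standing in for whatever our metamodel would actually produce:

;; Hypothetical sketch: build a minimal metaschema from a list of
;; "table-defining" attributes. The attribute list is a stand-in for
;; output from our own metamodel; nothing here is part of the preview itself.
(def table-attrs
  [:sfdc.Lead/Id
   :sfdc.Account/Id])

(defn metaschema
  "Returns a metaschema map of the shape the preview expects:
  one entry per table attribute, each with an empty options map."
  [attrs]
  {:tables (into {} (map (fn [attr] [attr {}]) attrs))})

(spit "metaschema.edn" (pr-str (metaschema table-attrs)))
;; writes {:tables {:sfdc.Lead/Id {}, :sfdc.Account/Id {}}}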

Metabase has a Presto database type available. Username appears to be a required field in Metabase, but, having not set up any authentication for this test, it seems like any username is accepted.

Metabase is syncing in the schema successfully. It has some sort of auto-generated field labels that look weird; it puts the word breaks in the wrong places. It probably doesn't help that there is no case preservation, but it would be even better if there were some way of exposing human-friendly field names through the catalog. I imagine this could be done as a logical extension of the metaschema (e.g. use a :my/label attribute declared on attributes to generate something like a SQL comment), or even via the built-in :db/doc?

It seems like Metabase can connect and is able to sync my simple schema. It is a little sluggish, but that may have to do with everything running on my laptop while connecting over the Internet to DDB. Also, as caches warm up, things get a bit better. Metabase is hitting Presto with all sorts of queries, some directly related to what I'm doing, while others seem to be part of its internal data analysis functionality.

After adding another table (Account) and joins for my SFDC picklists, the metaschema now looks like this:

{
 :tables {
  :sfdc.Lead/Id {} 
  :sfdc.Lead.LeadSource/value {}
  :sfdc.Lead.Industry/value {}
  :sfdc.Account/Id {}
 }
 :joins {
  :sfdc.Lead/LeadSource "sfdc_lead_leadsource"
  :sfdc.Lead/Industry "sfdc_lead_industry"
 }
}

Another query failure cropped up along the way, with a different sort of error than the COUNT(*) issue:

The query:

-- Metabase
SELECT "adam_dev"."sfdc_account"."targetingscore__c" AS "targetingscore__c", "adam_dev"."sfdc_account"."shippinglongitude" AS "shippinglongitude", "adam_dev"."sfdc_account"."billingaddress" AS "billingaddress", "adam_dev"."sfdc_account"."billingstreet" AS "billingstreet", "adam_dev"."sfdc_account"."sla__c" AS "sla__c", "adam_dev"."sfdc_account"."jigsaw" AS "jigsaw", "adam_dev"."sfdc_account"."systemmodstamp" AS "systemmodstamp", "adam_dev"."sfdc_account"."numberofemployees" AS "numberofemployees", "adam_dev"."sfdc_account"."createdbyid" AS "createdbyid", "adam_dev"."sfdc_account"."name" AS "name", "adam_dev"."sfdc_account"."lastreferenceddate" AS "lastreferenceddate", "adam_dev"."sfdc_account"."accountsource" AS "accountsource", "adam_dev"."sfdc_account"."site" AS "site", "adam_dev"."sfdc_account"."shippingstate" AS "shippingstate", "adam_dev"."sfdc_account"."shippingstreet" AS "shippingstreet", "adam_dev"."sfdc_account"."yearstarted" AS "yearstarted", "adam_dev"."sfdc_account"."isexcludedfromrealign" AS "isexcludedfromrealign", "adam_dev"."sfdc_account"."dandbcompanyid" AS "dandbcompanyid", "adam_dev"."sfdc_account"."accountnumber" AS "accountnumber", "adam_dev"."sfdc_account"."accounttype__c" AS "accounttype__c", "adam_dev"."sfdc_account"."shippingaddress" AS "shippingaddress", "adam_dev"."sfdc_account"."photourl" AS "photourl", "adam_dev"."sfdc_account"."annualrevenue" AS "annualrevenue", "adam_dev"."sfdc_account"."shippingcountry" AS "shippingcountry", "adam_dev"."sfdc_account"."naicsdesc" AS "naicsdesc", "adam_dev"."sfdc_account"."lastvieweddate" AS "lastvieweddate", "adam_dev"."sfdc_account"."numberoflocations__c" AS "numberoflocations__c", "adam_dev"."sfdc_account"."lastmodifiedbyid" AS "lastmodifiedbyid", "adam_dev"."sfdc_account"."upsellscore__c" AS "upsellscore__c", "adam_dev"."sfdc_account"."phone" AS "phone", "adam_dev"."sfdc_account"."slaserialnumber__c" AS "slaserialnumber__c", "adam_dev"."sfdc_account"."billinggeocodeaccuracy" AS "billinggeocodeaccuracy", "adam_dev"."sfdc_account"."website" AS "website", "adam_dev"."sfdc_account"."acuityid__c" AS "acuityid__c", "adam_dev"."sfdc_account"."shippinglatitude" AS "shippinglatitude", "adam_dev"."sfdc_account"."tradestyle" AS "tradestyle", "adam_dev"."sfdc_account"."fax" AS "fax", "adam_dev"."sfdc_account"."sic" AS "sic", "adam_dev"."sfdc_account"."tickersymbol" AS "tickersymbol", "adam_dev"."sfdc_account"."billingcountry" AS "billingcountry", "adam_dev"."sfdc_account"."creditscore__c" AS "creditscore__c", "adam_dev"."sfdc_account"."shippinggeocodeaccuracy" AS "shippinggeocodeaccuracy", "adam_dev"."sfdc_account"."description" AS "description", "adam_dev"."sfdc_account"."billingstate" AS "billingstate", "adam_dev"."sfdc_account"."naicscode" AS "naicscode", "adam_dev"."sfdc_account"."sicdesc" AS "sicdesc", "adam_dev"."sfdc_account"."billingcity" AS "billingcity", "adam_dev"."sfdc_account"."shippingpostalcode" AS "shippingpostalcode", "adam_dev"."sfdc_account"."shippingcity" AS "shippingcity", "adam_dev"."sfdc_account"."employee_growth__c" AS "employee_growth__c", "adam_dev"."sfdc_account"."db__id" AS "db__id", "adam_dev"."sfdc_account"."ownership" AS "ownership", "adam_dev"."sfdc_account"."isdeleted" AS "isdeleted", "adam_dev"."sfdc_account"."lastactivitydate" AS "lastactivitydate", "adam_dev"."sfdc_account"."revenuegrowth__c" AS "revenuegrowth__c", "adam_dev"."sfdc_account"."runoffscore__c" AS "runoffscore__c", "adam_dev"."sfdc_account"."capex__c" AS "capex__c", "adam_dev"."sfdc_account"."slaexpirationdate__c" AS 
"slaexpirationdate__c", "adam_dev"."sfdc_account"."upsellopportunity__c" AS "upsellopportunity__c", "adam_dev"."sfdc_account"."jigsawcompanyid" AS "jigsawcompanyid", "adam_dev"."sfdc_account"."rating" AS "rating", "adam_dev"."sfdc_account"."lastmodifieddate" AS "lastmodifieddate", "adam_dev"."sfdc_account"."billingpostalcode" AS "billingpostalcode", "adam_dev"."sfdc_account"."billinglongitude" AS "billinglongitude", "adam_dev"."sfdc_account"."call_count__c" AS "call_count__c", "adam_dev"."sfdc_account"."parentid" AS "parentid", "adam_dev"."sfdc_account"."customerpriority__c" AS "customerpriority__c", "adam_dev"."sfdc_account"."billinglatitude" AS "billinglatitude", "adam_dev"."sfdc_account"."ownerid" AS "ownerid", "adam_dev"."sfdc_account"."dunsnumber" AS "dunsnumber", "adam_dev"."sfdc_account"."region__c" AS "region__c", "adam_dev"."sfdc_account"."active__c" AS "active__c", "adam_dev"."sfdc_account"."createddate" AS "createddate", "adam_dev"."sfdc_account"."industry" AS "industry", "adam_dev"."sfdc_account"."type" AS "type", "adam_dev"."sfdc_account"."masterrecordid" AS "masterrecordid" FROM "adam_dev"."sfdc_account" LIMIT 10000

results in an error:

java.lang.ArithmeticException: Rounding necessary
	at java.math.BigDecimal.commonNeedIncrement(BigDecimal.java:4179)
	at java.math.BigDecimal.needIncrement(BigDecimal.java:4235)
	at java.math.BigDecimal.divideAndRound(BigDecimal.java:4143)
	at java.math.BigDecimal.setScale(BigDecimal.java:2455)
	at java.math.BigDecimal.setScale(BigDecimal.java:2389)
	at io.prestosql.spi.type.Decimals.encodeScaledValue(Decimals.java:161)
	at datomic.presto$create_connector$reify$reify$reify$reify__2400.getSlice(presto.clj:278)
	at io.prestosql.spi.connector.RecordPageSource.getNextPage(RecordPageSource.java:119)
	at io.prestosql.operator.TableScanOperator.getOutput(TableScanOperator.java:280)
	at io.prestosql.operator.Driver.processInternal(Driver.java:379)
	at io.prestosql.operator.Driver.lambda$processFor$8(Driver.java:283)
	at io.prestosql.operator.Driver.tryWithLock(Driver.java:675)
	at io.prestosql.operator.Driver.processFor(Driver.java:276)
	at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1075)
	at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
	at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
	at io.prestosql.$gen.Presto_316____20191004_220045_1.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

The docs seem to indicate that any configuration around scale is optional, but that's the only idea I have as to the cause of this error, since this is a trivial projection with no arithmetic in the SQL.

The support response on this is that scale should be specified to indicate what is in the attribute, not what you want in the column. I'm still unsure why this is an issue, but setting :scale {:sfdc.Lead/AnnualRevenue 0} has resolved these exceptions. It is strange, because in Java this doesn't raise an exception: new BigDecimal("10").setScale(2)
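
Presumably the :scale entry sits at the top level of the metaschema alongside :tables and :joins, so mine now looks roughly like this (sketch; only the :scale map is new):

{
 :tables {
  :sfdc.Lead/Id {}
  :sfdc.Lead.LeadSource/value {}
  :sfdc.Lead.Industry/value {}
  :sfdc.Account/Id {}
 }
 :joins {
  :sfdc.Lead/LeadSource "sfdc_lead_leadsource"
  :sfdc.Lead/Industry "sfdc_lead_industry"
 }
 :scale {
  :sfdc.Lead/AnnualRevenue 0
 }
}

As for the Java comparison: setScale with no rounding mode only throws when the new scale would discard fractional digits, which is presumably what encodeScaledValue hit on some stored value while rescaling to the column's declared scale. A quick REPL check with illustrative values, not values from my data:

;; BigDecimal.setScale behaviour (Clojure REPL, illustrative values)
(.setScale 10M 2)
;; => 10.00M      widening the scale never needs rounding, hence no exception above

(.setScale 10.123M 2)
;; => ArithmeticException: Rounding necessary
;;    narrowing the scale would drop digits and no rounding mode was given

(.setScale 10.123M 2 java.math.RoundingMode/HALF_UP)
;; => 10.12M      an explicit rounding mode makes it succeed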
