Skip to content

Instantly share code, notes, and snippets.

What would you like to do?
Spark SQL generate rowcounts for all tables in a hive database
val dbname = "datalake"
val rowCountSQL = spark.catalog.listTables("$dbname")
.map(name => spark.sql(s"SELECT 'datalake.$name' as table, count(*) as rowcount FROM $dbname.$name")
.mkString("\nUNION ALL\n")
val rowCounts = spark.sql(rowCountSQL).as[(String, Long)]

This comment has been minimized.

Copy link
Owner Author

commented Apr 12, 2018

Return a dataset of table names in a given Hive database, and using that collection of table names, generate a SQL query to count rows for each table, collect the dataset and join all the rows (of sql strings) into one big string with UNION ALL on a line between each, then take that big string and execute it returning a new dataset ( which can then be written to a Hive table). The \n newlines around the UNION ALL could easily be spaces, but newlines allow some pretty printing.

It's fun using functional programming constructs like map to do SQL or other code generation, but this is also pretty horrendous. There's probably a much faster way of getting rowcounts from Hive metadata that I'm not aware of, and possible I should stick to using the Spark SQL DSL rather than the inline SQL, although does it make a difference? Don't they both resolve to the same thing? I'd prefer to stick to SQL over learning yet another SQL DSL.

I find this code generation technique using map() very useful, and Scala's s"$var" string interpolation construct is cool too. Python 3 now has f-strings which work similarly, bash (and other sh) have had that forever of course. So useful except when it's not and then it's inception.

I also like using as[(Tuple of types)] construct for creating strongly typed Spark datasets when prototyping or scripting as opposed to creating case classes, which I do use in Spark apps. You should be using datasets instead of dataframes and RDDs where possible. Strong types are good for your digestion, don't shy away from being explicit about them. Ultimately all data needs types and structuring to be useful. Even unsupervised machine learning across unstructured data is in reality taking soooo long to train and optimize hyper-parameters because it's seeking emergent structure. How philosophical, aren't we all? Reducing entropy takes time, help it along if you can, don't infer types if you can avoid it, you'll save both time and errors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.