Skip to content

Instantly share code, notes, and snippets.

@mcuadros
Last active April 23, 2019 08:42
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save mcuadros/2da1267404648fb7fbb97f4282cf1d2c to your computer and use it in GitHub Desktop.
Save mcuadros/2da1267404648fb7fbb97f4282cf1d2c to your computer and use it in GitHub Desktop.

EXPLODE solutions applied to diff, blame problems

This solutions is based on the behaviour of EXPLODE in SparkSQL and the premise that the UDFs can return complex types like lists, structs and maps (this types are represented on wire as JSON type)

Functions

  • DIFF_TREE(repository_id, from<revision>, [to<revision>] -> something similar to []git.Change
  • DIFF(repository_id, from<revision>, [to<revision>] -> []struct{file_path, marker, line, content}
  • COMMIT_FILES_STATS(repository_id, from<revision>, [to<revision>] -> []struct{file_path, additions, deletions}
  • BLAME(repository_id, from<revision>, [file_path]] -> []struct{file_path, line, author_name, when, content}

Queries

The queries are originale extracted from this document

Query 1

Get all files in HEAD with the timestamp of their last change (repository, file path, timestamp)

NOTE: In fact his queries returns the last files changed by HEAD, but this is how was done in the google docs example queries

SELECT 
    repository_id, 
    file_path_to, 
    committer_when 
FROM (
    SELECT *, EXPLODE(DIFF_TREE(repository_id, commit_hash))
    FROM commits 
    NATURAL JOIN refs 
    WHERE ref_name = 'HEAD'
) as raw

A more acurate anweser is:

SELECT
   repository_id,
   file_path_to,
   MAX(committer_when)
FROM (
   SELECT *, EXPLODE(BLAME(repository_id, commit_hash))
   FROM commits
   NATURAL JOIN refs
   WHERE ref_name = 'HEAD'
) as raw
GROUP BY repository_id, file_path_to;

Query 2

Get the number of lines added and deleted per developer and language (repository, author email, language, lines added, lines deleted). It would be ok to do two different queries for added and deleted, but both in the same query would be ideal.

SELECT 
   commit_author_email, 
   x.marker,
   LANG(blob_content, file_path) AS lang, 
   COUNT(1) as lines
FROM (
   SELECT *, EXPLODE(DIFF(repository_id, commit_hash))
   FROM commits 
   NATURAL JOIN commit_files
   WHERE 
      marker = '>' OR 
      marker = '<' OR 
      marker = '|' 
) as raw
GROUP BY     
   commit_author_email,   
   lang, 
   x.marker

Version with COMMIT_FILE_STATS

SELECT 
   commit_author_email, 
   SUM(additions), SUM(deletions)
   LANG(blob_content, file_path) AS lang, 
   COUNT(1) as lines
FROM (
   SELECT *, EXPLODE(COMMIT_FILES_STATS(repository_id, commit_hash)) 
   FROM commits 
   NATURAL JOIN commit_files
) as raw
GROUP BY commit_author_email, lang

Query 3

Get all added lines per developer and language (repository, author email, language, line). Note that this is not the number of lines, but the lines themselves.

SELECT 
    repository_id, 
    author_email, 
    LANG(blob_content, file_path) as lang,  
    line 
FROM (
   SELECT *, EXPLODE(DIFF(repository_id, commit_hash))
   FROM files 
   NATURAL JOIN commit_files 
   NATURAL JOIN commits
) as raw
WHERE d.marker = '>'
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment