@rgbkrk
Last active June 1, 2017 07:49
What are notebooks requiring?

nteract doesn't support requirejs because we have the builtin require at our fingertips. Jupyter Notebook, however, has long operated under the assumption that you can use its bundled requirejs to load modules asynchronously:

require(['d3'], function(d3) {...

I started exploring the idea of providing some of these modules in a "quirks" sort of mode where we provide limited access to "requirejs" while still sandboxed. To find out what modules were commonly required, I turned to Google BigQuery, the GitHub dataset, and a User Defined Function (UDF) written in JavaScript.
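To make the "quirks" idea concrete, here is a minimal sketch of what a limited require could look like. It is purely hypothetical and not nteract's implementation: `makeQuirksRequire`, the allowlist object, and the stub d3 are made-up names for illustration.

```javascript
// Hypothetical sketch: a requirejs-style `require(deps, callback)` that only
// resolves modules from an explicit allowlist supplied by the host.
function makeQuirksRequire(allowed) {
  return function quirksRequire(deps, callback) {
    var resolved = deps.map(function (name) {
      if (!Object.prototype.hasOwnProperty.call(allowed, name)) {
        // Design choice for this sketch: fail loudly on unknown modules.
        throw new Error("Module not available in quirks mode: " + name);
      }
      return allowed[name];
    });
    // Notebook code expects asynchronous loading, so defer the callback
    // with a resolved promise instead of calling it inline.
    return Promise.resolve().then(function () {
      return callback.apply(null, resolved);
    });
  };
}

// Stand-in object; a real host would supply the actual d3 module here.
var fakeD3 = { version: "stub" };
var quirksRequire = makeQuirksRequire({ d3: fakeD3 });

quirksRequire(["d3"], function (d3) {
  console.log("loaded d3", d3.version);
});
```

Which modules belong in the allowlist is exactly the question the query below tries to answer.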

I'll flesh out this gist or a blog post later. For now, I'll just provide my query code:

# JavaScript UDF for extracting requireJS modules from Jupyter Notebooks on GitHub

CREATE TEMPORARY FUNCTION
  extractModules(notebookJSON STRING)
  RETURNS Array<string>
  LANGUAGE js AS """
    /**
     * Grab all the modules loaded with requirejs within a jupyter notebook.
     */
    function getModules(s) {
      // Note: The backslash has to be escaped for BigQuery's editor
      // Visualize half as many backslashes here ;)
      var re = /require\\((\\[[^\\]]+\\])/gm;
      
      var modules = [];
      if(!s) {
        return []
      }
      
      var match = re.exec(s);
      while(match !== null) {
        try {
          var hopefullyJSONArray = match[1].replace(/'/g, '"');       
          var arr = JSON.parse(hopefullyJSONArray);
          if(Array.isArray(arr)) {
            modules = modules.concat(arr);
          }
        } catch(e) {
          // assume invalid, can't use
        }
        
        match = re.exec(s);
      }
      
      return modules;
    }
    
    function flatten(a,b) {
      return a.concat(b);
    }
  
    try {
      var notebook = JSON.parse(notebookJSON);
      if(!notebook.cells) {
        return []
      }
      
      var mods = notebook.cells.map(function(cell) {
        if(!cell.outputs) {
          return []
        }
        
        return cell.outputs.map(function(output) {
          var modules = [];
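          // Guard against outputs that have no `data` field
          var data = output.data || {};
          if((output.output_type === "display_data" || output.output_type === "execute_result") && (data['text/html'] || data['application/javascript'])) {
            // nbformat may store multi-line strings as arrays of lines, so join them
            var html = [].concat(data['text/html'] || []).join('');
            var js = [].concat(data['application/javascript'] || []).join('');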
           
            if(html) {
              modules = modules.concat(getModules(html))
            }
            if(js) {
              modules = modules.concat(getModules(js))
            }
          }
          return modules;
        }).reduce(flatten, []);
      }).reduce(flatten, []);
      
      return [...new Set(mods)];
      
    } catch (e) {
      return ["ERROR" + e.toString()];
    }
  """;
  
  
SELECT
  CONCAT("https://github.com/", F.repo_name, "/blob/master/", F.path) AS URL,
  extractModules(C.content) AS modules
FROM (
  SELECT
    id,
    content
  FROM
    `bigquery-public-data.github_repos.contents`
  WHERE
    REGEXP_CONTAINS(content, "require\\(\\['") ) AS C
JOIN (
  SELECT
    repo_name,
    path,
    id
  FROM
    `bigquery-public-data.github_repos.files`
  WHERE
    path LIKE '%.ipynb' ) AS F
ON
  C.id = F.id
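
To try the extraction logic outside BigQuery, the same regex can be run in Node.js. This is a rough standalone sketch of the `getModules` helper from the UDF above; the sample HTML string is made up for illustration. Note that outside BigQuery's triple-quoted string each backslash is written only once.

```javascript
// Standalone copy of the UDF's extraction logic, runnable in Node.js.
function getModules(s) {
  if (!s) {
    return [];
  }
  // Capture the array literal passed as the first argument to require(...).
  var re = /require\((\[[^\]]+\])/gm;
  var modules = [];
  var match;
  while ((match = re.exec(s)) !== null) {
    try {
      // Turn ['d3', 'x'] into valid JSON before parsing.
      var arr = JSON.parse(match[1].replace(/'/g, '"'));
      if (Array.isArray(arr)) {
        modules = modules.concat(arr);
      }
    } catch (e) {
      // Not a plain array of string literals (e.g. a variable name); skip it.
    }
  }
  return modules;
}

// Made-up sample of an HTML output from a notebook cell.
var sampleHtml =
  "<script>require(['d3', 'nbextensions/widgets'], function(d3, w) {});</script>" +
  "<script>require([moduleName], function(m) {});</script>";

console.log(getModules(sampleHtml));
```

The second call in the sample is skipped because `[moduleName]` is not valid JSON, which is the same behavior as the `catch` branch in the UDF.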

rgbkrk commented May 26, 2017

Excuse the poor JS above; it was a bit strange to write embedded code for a UDF.

Note that this is standard SQL in Google BigQuery, not their legacy syntax. You'll have to enable standard SQL to use it.


fhoffa commented Jun 1, 2017

Suggestion: First extract all the .ipynb files, then run the analysis over that way smaller set.

(see https://cloudplatform.googleblog.com/2016/06/GitHub-on-BigQuery-analyze-all-the-open-source-code.html)

Thanks for sharing!
