@vmarkovtsev
Last active August 21, 2017 17:34
ML Spark API use cases

Domains

First of all, ML has two quite different activity domains:

  1. Running something on many repositories.
  2. Running something on a single repository.

Depending on the size of (2), launching Spark may or may not make sense. For example, consider the topic model application scenario:

  1. Load the topic model (basically, a giant sparse matrix)
  2. Extract all source code identifiers from a single repository, calculate their frequencies.
  3. Multiply the vector by the matrix, sort and report the most relevant topics.

This is supposed to run inside a Docker container. The model size is about 300 MB, plus 150 MB for the OS and Python runtime. Now imagine having to include the JRE and Spark on top of pulling the Babelfish server.
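The application scenario above can be sketched in plain Python with scipy. The vocabulary, topic names, and weights below are toy stand-ins, not the real model:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-ins: the real model is a giant sparse topics x identifiers matrix.
vocabulary = {"read": 0, "file": 1, "socket": 2, "send": 3}
topics = ["io", "networking"]
topic_matrix = csr_matrix(np.array([
    [0.9, 0.8, 0.1, 0.0],  # weights of the "io" topic
    [0.1, 0.0, 0.9, 0.8],  # weights of the "networking" topic
]))

# Step 2: identifier frequencies extracted from a single repository.
frequencies = {"read": 4, "file": 6, "socket": 1}
vec = np.zeros(len(vocabulary))
for identifier, freq in frequencies.items():
    vec[vocabulary[identifier]] = freq

# Step 3: multiply the vector by the matrix, sort, report the top topics.
scores = topic_matrix.dot(vec)
ranked = sorted(zip(topics, scores), key=lambda ts: ts[1], reverse=True)
```

None of this needs Spark: it is one sparse matrix-vector product per query repository.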

This makes me think that the Spark API applies to (1) and to relatively large repos in (2). Moderately sized repos (95% of them) can be handled by Python alone.

Use cases

Terminology

"Must" means that I am 80% sure. "Should" means that I am 50% sure.

Identifier embeddings

Train

  1. Obtain the list of repos to work on.
  2. For every file in every repo we extract/load UASTs.
  3. For every UAST we extract identifiers and transform them into several subidentifiers. The tree structure is not preserved.
  4. We build a set of these subidentifiers for every repo.
  5. We turn the sets into maps and sum them across all the repos, i.e. calculate "document frequencies".
  6. We sort the items by value and pick the greatest N, with N ≈ 1M or even bigger.
  7. We save the chosen items with their values. This is the "document frequencies" model. It is needed in various use cases, and the ML team stores it in Modelforge format. The typical size is 30-50 MB (ASDF + zlib).
  8. For every file in every repo we extract/load UASTs.
  9. For every UAST we extract identifiers and transform them into several subidentifiers. The tree structure is preserved.
  10. We throw away subidentifiers which are not chosen in (5).
  11. We calculate the sparse co-occurrence matrix from the remaining tree of subidentifiers. The shape is NxN.
  12. We sum those matrices for every file in every repo.
  13. We save the resulting matrix in Modelforge format.
  14. We embed this matrix and save the result in Modelforge format.

Steps which must involve ast2vec:

  • (2) identifier -> subidentifiers
  • (6) Modelforge format
  • (8) identifier -> subidentifiers
  • (10) tree of subidentifiers -> sparse co-occurrence matrix. It is supposed to be a scipy.sparse.csr_matrix
  • (11) summing scipy.sparse.csr_matrix
  • (12) Modelforge format
  • (13) not related to Spark API

Steps 8..10 are a single ast2vec step internally and should not require the Spark API.
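Steps 2..6 can be sketched as follows. The identifier-splitting heuristic is a guess at what ast2vec does, not its actual code, and the repos are toy data:

```python
import re
from collections import Counter

def split_identifier(identifier):
    """Split an identifier into subidentifiers by snake_case and camelCase.
    This splitting heuristic is an assumption, not the real ast2vec code."""
    parts = []
    for chunk in identifier.split("_"):
        parts.extend(p.lower() for p in
                     re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", chunk))
    return parts

# Toy repos: each is the list of identifiers extracted from its UASTs.
repos = [
    ["readFile", "read_buffer"],
    ["readSocket", "sendBuffer"],
]

# Steps 3..5: a set of subidentifiers per repo, summed across repos
# -> "document frequencies".
docfreq = Counter()
for identifiers in repos:
    subids = set()
    for identifier in identifiers:
        subids.update(split_identifier(identifier))
    docfreq.update(subids)

# Step 6: sort by value and pick the greatest N (N=3 here; ~1M in reality).
chosen = dict(docfreq.most_common(3))
```

The per-repo set construction parallelizes trivially; only the summation in step 5 is a global reduction.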

Apply

Not directly applicable.

Repository similarity

Train

  1. Obtain the list of repos to work on.
  2. For every file in every repo we extract/load UASTs.
  3. For every UAST we extract identifiers and transform them into several subidentifiers. The tree structure is not preserved.
  4. We load the document frequencies model (step 5, "Identifier embeddings") and throw away subidentifiers which are not present.
  5. We calculate the frequencies of the approved subidentifiers for every repo.
  6. Apply TF-IDF to every value in the map. This requires the document frequencies model once again.
  7. Concatenate (not merge!) the resulting maps.
  8. Save the list of maps into Modelforge format aka "bags-of-words".

Steps which must involve ast2vec:

  • (2) identifier -> subidentifiers
  • (3) loading the docfreq model
  • (5) TF-IDF
  • (7) Modelforge format

Steps 2..5 are a single ast2vec step internally and should not require the Spark API.

Steps 0..7 are named "bags-of-words generation".
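The bags-of-words generation can be sketched as below. The docfreq values and the exact TF-IDF weighting are assumptions for illustration:

```python
import math

# Hypothetical docfreq model (step 5 of "Identifier embeddings"): toy values.
num_docs = 100
docfreq = {"read": 50, "file": 10, "socket": 5}

def tfidf_bag(frequencies):
    """Steps 3..5: keep only subidentifiers present in the docfreq model
    and weight their frequencies by TF-IDF (the scheme here is a guess)."""
    bag = {}
    for subid, tf in frequencies.items():
        if subid in docfreq:
            bag[subid] = tf * math.log(num_docs / docfreq[subid])
    return bag

# Step 7: concatenate (not merge!) one bag per repo into a list.
bags = [tfidf_bag({"read": 4, "unknownid": 2}), tfidf_bag({"socket": 3})]
```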

Apply

  1. We select the single repository. The repository can be outside of our dataset.
  2. For every file in it we extract/load UASTs.
  3. For every UAST we extract identifiers and transform them into several subidentifiers. The tree structure is not preserved.
  4. We load the document frequencies model (step 5, "Identifier embeddings") and throw away subidentifiers which are not present.
  5. We calculate the frequencies of the approved subidentifiers.
  6. Apply TF-IDF to every value in the map. This requires the document frequencies model once again.
  7. We load "Identifier embeddings" model.
  8. Search for the similar repositories combining (5) and (6).

Steps 0..5 can be replaced by a simple fetch from the "bags-of-words" model.

Steps which must involve ast2vec:

  • (2) identifier -> subidentifiers
  • (3) loading the docfreq model
  • (5) TF-IDF
  • (6) Modelforge format
  • (7) src-d/vecino
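The final search step can be sketched as a nearest-neighbor query over the bags-of-words. Cosine similarity here is a toy stand-in for whatever src-d/vecino actually does:

```python
import math

def cosine(bag_a, bag_b):
    """Cosine similarity between two sparse bags-of-words stored as dicts."""
    dot = sum(w * bag_b.get(s, 0.0) for s, w in bag_a.items())
    norm_a = math.sqrt(sum(w * w for w in bag_a.values()))
    norm_b = math.sqrt(sum(w * w for w in bag_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# The query repo's TF-IDF bag vs. the dataset's bags (toy values).
query = {"read": 1.0, "file": 2.0}
dataset = {"repo-a": {"read": 1.0, "file": 2.0}, "repo-b": {"socket": 3.0}}
ranked = sorted(dataset, key=lambda r: cosine(query, dataset[r]), reverse=True)
```

A brute-force scan like this is fine for a single query; at dataset scale it would be replaced by an approximate nearest-neighbor index.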

Topic modeling

Train

Steps 0..7 are the same as in "Repository similarity". The subsequent steps are not related to the Spark API.

Apply

  1. We select the single repository. The repository can be outside of our dataset.
  2. For every file in it we extract/load UASTs.
  3. For every UAST we extract identifiers and transform them into several subidentifiers. The tree structure is not preserved.
  4. We load the document frequencies model (step 5, "Identifier embeddings") and throw away subidentifiers which are not present.
  5. We calculate the frequencies of the approved subidentifiers.
  6. Apply TF-IDF to every value in the map. This requires the document frequencies model once again.
  7. We load the model which we generated in "Train".
  8. We combine both entities and output the result: a list of topics with ranks.

Steps which must involve ast2vec:

  • (2) identifier -> subidentifiers
  • (3) loading the docfreq model
  • (5) TF-IDF
  • (6) Modelforge format
  • (7) combining

Babelfish role embeddings

Train

  1. Obtain the list of repos to work on.
  2. For every file in every repo we extract/load UASTs.
  3. We transform every UAST into a black-box serializable entity.
  4. We perform the reduction of those entities across all the files in all the repos.
  5. We save the result in Modelforge format.

Apparently, steps 2..4 involve ast2vec/role2vec.

Apply

Undecided, but most likely does not require Spark API.

Snippet exploratory search

Train

  1. Gather the list of repositories corresponding to a specific ecosystem. In addition, we know the repository of the ecosystem provider.
  2. For every file in the ecosystem provider repo, we generate the set of subidentifiers (steps 2..3 in "Identifier embeddings").
  3. We save it in Modelforge format.
  4. For every UAST in ecosystem repos we extract the subtrees of functions.
  5. We filter the functions by using the set from (2).
  6. For every function we extract identifiers and transform them into several subidentifiers. The tree structure is not preserved.
  7. Save all the results to Modelforge format.
  8. Terra incognita

Steps 1..2 can be replaced by a simple fetch from the "bags-of-words" model.

Apply

Undecided, but most likely does not require Spark API.

Tabs vs spaces

  1. Choose the list of repos.
  2. For every file in every repo, execute a simple Python script (see the blog post).
  3. Aggregate and reduce the result by language.
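A minimal sketch of the per-file script and the aggregation, assuming a classification as simple as "the first character of an indented line is a tab or a space":

```python
from collections import defaultdict

def classify_indentation(source):
    """Count lines indented with tabs vs. spaces in one file.
    A simplified stand-in for the script from the blog post."""
    tabs = spaces = 0
    for line in source.splitlines():
        if line.startswith("\t"):
            tabs += 1
        elif line.startswith(" "):
            spaces += 1
    return tabs, spaces

# Step 3: aggregate and reduce by language (toy (language, source) pairs).
totals = defaultdict(lambda: [0, 0])  # language -> [tab lines, space lines]
files = [("Python", "def f():\n    return 1\n"),
         ("Go", "func f() {\n\treturn\n}\n")]
for lang, source in files:
    tabs, spaces = classify_indentation(source)
    totals[lang][0] += tabs
    totals[lang][1] += spaces
```

This is an embarrassingly parallel map over files followed by a per-language reduce, which is exactly the shape Spark handles well.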

Shell command prediction

Train

  1. Get the list of repos.
  2. Scan for files with specific names.
  3. Save their contents in a structured format.

Apply

Does not require Spark API.

Neural code completion

Train

  1. Choose the list of repos.
  2. For every file in every repo, extract UASTs.
  3. Save the UASTs in Modelforge format (actually, may be too slow, not decided yet).

Apply

Does not require Spark API.

Hercules

  1. Choose the list of repos.
  2. For every *git.Repository, run Hercules, which generates a file with the results. Currently the format is Protocol Buffers.
  3. Save the files. Later they will be processed by JS/Python into plots.

Working days stats

blog post

It boils down to aggregation and reduction ops over languages and commit metadata.
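The kind of aggregation involved can be sketched with toy commit metadata; real data would of course come from git:

```python
from collections import Counter
from datetime import datetime

# Toy commit metadata: (author, timestamp) pairs.
commits = [
    ("alice", datetime(2017, 8, 19, 14, 0)),  # Saturday
    ("alice", datetime(2017, 8, 21, 10, 0)),  # Monday
    ("bob", datetime(2017, 8, 21, 16, 0)),    # Monday
]

# Aggregate commits per weekday (0 = Monday .. 6 = Sunday), then reduce
# to a single statistic such as the share of weekend commits.
by_weekday = Counter(ts.weekday() for _, ts in commits)
weekend_share = sum(by_weekday[d] for d in (5, 6)) / len(commits)
```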

Fuzzy forks

That is, when 99% of the repo is the same as the base one. blog post

It boils down to analysis on "bags-of-words".

Blackduck Protex Killer

BD Protex is a tool which reports fuzzy copy-paste from open source repos.

It boils down to operating on extracted UASTs.
