TL;DR: keep working on ML for language identification in Markdown code blocks; read up on implementations of large-scale collection/storage pipelines in both theory (papers) and practice (Software Heritage).
Recap: suggest a programming language for fenced code blocks in Markdown, so users do not need to type it in.
Last OSD's fenced-code-blocks dataset was collected using a regexp: REGEXP_EXTRACT_ALL(c.content, r"(```[\S]*?\n[\s\S]+?\n```)").
Small: 12 MB, 7k examples, and quality is quite low due to the complexity of the Markdown spec :/
Re-collected the same dataset using the actual reference Markdown parser implementation https://github.com/jgm/commonmark.js on all READMEs of 3.9M repos: 1.8 GB with 7M examples. Bonus: a nice JSON schema: repo_name, file_path, lang, text.
# Use a JS UDF with commonmark.js to extract 'code_block' nodes
CREATE TEMPORARY FUNCTION extract_code_blocks(md STRING)
RETURNS ARRAY<STRUCT<lang STRING, text STRING>>
LANGUAGE js AS
"""
if (!md) return [];
var reader = new commonmark.Parser();
var walker = reader.parse(md).walker();
var event, node;
var blocks = [];
while ((event = walker.next())) {
  node = event.node;
  if (event.entering && node.type === 'code_block') {
    blocks.push({lang: node.info ? node.info : '', text: node.literal});
  }
}
return blocks;
"""
OPTIONS (
  library="gs://srcd-production-dataproc/commonmark.js"
);
WITH readmes AS (
  SELECT
    f.repo_name AS repo_name,
    f.path AS path,
    extract_code_blocks(c.content) AS code_blocks
  FROM `bigquery-public-data.github_repos.sample_files` f
  LEFT JOIN `bigquery-public-data.github_repos.sample_contents` c
  ON f.id = c.id
  WHERE
    REGEXP_CONTAINS(f.path, r'(?i)(^|/)readme(\.|$)') AND
    ARRAY_LENGTH(extract_code_blocks(c.content)) <> 0)
SELECT
  repo_name, path, code_block.lang AS lang, code_block.text AS text
FROM readmes r
CROSS JOIN UNNEST(r.code_blocks) AS code_block;
Results
-------
README.md only:
- sample tables (30 GB processed): 70k blocks, 18 MB
- full dataset (2.29 TB processed, 15,508 s): 7M blocks, 1.8 GB
- can expect ~8x more if run over all *.md files
Stats on info strings (languages):
- 3.5M empty
- 0.5M js, bash
- the first ~1k languages are meaningful (out of ~3k distinct info strings)
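The per-language counts above can be reproduced with a short script over the exported dataset (a sketch; the gzipped newline-delimited JSON path is hypothetical, the lang/text schema is the one from the query above):

```python
import collections
import gzip
import json

def lang_histogram(path):
    """Count fenced-code-block info strings in a gzipped
    newline-delimited JSON export (one {"lang": ..., "text": ...} per line)."""
    counts = collections.Counter()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            counts[row.get("lang", "")] += 1
    return counts
```

`lang_histogram(...).most_common(20)` then shows the long tail: empty info strings dominate, followed by js and bash.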
Data available at https://drive.google.com/open?id=0BxNVgwtOUkMUdmhNNlIwLUV4ekk
gzcat markdown/fenced_code_blocks_json/enced_code_blocks000000000000 | less
TODO:
- for all 16M repos, use https://data.world/vmarkovtsev/github-readme-files and Apache Spark with https://github.com/atlassian/commonmark-java, expecting ~4x dataset size
- run pre-trained fastText over this corpus, calculate accuracy
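For the accuracy step, a minimal scoring harness could look like this (a sketch; it treats the block's info string as ground truth and assumes predictions come from elsewhere, e.g. a fastText model):

```python
def accuracy(pairs):
    """Fraction of (predicted, actual) language pairs that match.

    Blocks with an empty info string carry no label, so they are
    excluded from the ground truth rather than counted as errors.
    """
    labelled = [(pred, actual) for pred, actual in pairs if actual]
    if not labelled:
        return 0.0
    return sum(1 for pred, actual in labelled if pred == actual) / len(labelled)
```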
Now that we have opened ALL of our pipeline for repository discovery, collection and storage:
- rovers/borges
- core with the RootedRepo transactioner
- go-git, go-billy, go-billy-siva, siva
- berserker
it makes sense to get familiar with the tech stack used by Software Heritage for similar efforts.
- design paper https://upsilon.cc/~zack/stuff/software-heritage-draft.pdf esp. sections 6 & 7
- API https://archive.softwareheritage.org/api/
- source https://forge.softwareheritage.org/source
- storage https://wiki.softwareheritage.org/index.php?title=Repository_snapshot_objects
- pure Python git impl https://github.com/jelmer/dulwich
Custom 'Merkle DAG' implementation in Python, a super-set of Git; has SHA256, salted SHA1, and BLAKE2 implementations. Includes file/dir/revision/release nodes, plus metadata on origins, types, snapshots/refs. Blobs live in KV storage, addressed by hash; the rest of the graph is stored in an RDBMS, table-per-node-type.
Append-only, but each node type has a "change feed" - its modification history.
Strong de-duplication; allows storing .tar, .deb, git/svn/etc.
Allows context-independent browsing by hash.
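The content-addressing idea behind that deduplication can be sketched in a few lines (a toy illustration using the Git-style scheme; Software Heritage's actual serialization and its salted/SHA256 variants differ):

```python
import hashlib

def blob_id(data: bytes) -> str:
    """Content id of a file: hash over a git-style header plus the raw bytes."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()

def tree_id(entries) -> str:
    """Content id of a directory: hash over its sorted (name, child_id) entries.

    Because the id depends only on content, identical files and identical
    subtrees collapse to the same node wherever (and however often) they occur.
    """
    payload = b"".join(
        name.encode() + b"\x00" + child_id.encode()
        for name, child_id in sorted(entries)
    )
    return hashlib.sha1(b"tree %d\x00" % len(payload) + payload).hexdigest()
```

Revisions and releases hash in the same way over their own payloads, which is what makes the whole graph a Merkle DAG.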
Listers (full/incremental) -> loaders -> schedulers -> archivers
Fully extracted: GitHub, Gitorious, Google Code, Debian archive, GNU codebases.
Each copy of the object storage currently occupies ≈150 TB of individually compressed files (300 TB raw).
Each copy of the RDBMS used to store the rest of the graph (Postgres) takes ≈5 TB.
There are 3 copies of the object storage and 2 copies of the database,
the latter with point-in-time recovery over a 2-week time window.
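Back-of-envelope on those numbers (just arithmetic over the figures quoted above):

```python
# Figures quoted above, in TB.
object_store_copy = 150  # compressed object storage, per copy
raw_size = 300           # the same content uncompressed
rdbms_copy = 5           # Postgres graph store, per copy

compression_ratio = raw_size / object_store_copy          # ~2x from per-file compression
total_footprint = 3 * object_store_copy + 2 * rdbms_copy  # 460 TB across all copies
```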
The logical Merkle DAG has ≈5 billion nodes and ≈50 billion edges.
More than half of the nodes are (unique) file contents (≈3B), and there are
≈750M revision/commit nodes, collected from ≈55M origins.
TODO: run it locally on same 2k repos as borges
Paper "SEDA: An Architecture for Well-Conditioned, Scalable Internet Services" by Matt Welsh, David Culler, and Eric Brewer.
This paper on large-scale system design was the motivation for the current implementation of the biggest search engine's indexing pipeline.
Event-driven, decomposed into stages.
Homeostasis + feedback for predictable performance past the saturation point (graceful degradation).
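The staged, event-driven decomposition can be sketched with bounded queues and per-stage worker pools (a toy illustration of the idea, not SEDA's actual API; the back-pressure from bounded queues is what buys graceful degradation):

```python
import queue
import threading

class Stage:
    """One stage: an event queue plus a small worker pool.

    The bounded queue pushes back on upstream stages when this one
    saturates, so throughput degrades gracefully instead of collapsing.
    """
    def __init__(self, handler, downstream=None, workers=2, maxsize=100):
        self.handler = handler
        self.downstream = downstream
        self.events = queue.Queue(maxsize=maxsize)  # bounded => back-pressure
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            event = self.events.get()
            result = self.handler(event)
            if self.downstream is not None and result is not None:
                self.downstream.events.put(result)  # blocks when downstream is full
            self.events.task_done()

# Two-stage pipeline: normalize tokens, then collect them.
results = []
sink = Stage(lambda e: results.append(e))
normalize = Stage(lambda e: e.strip().lower(), downstream=sink)
for word in ["  Foo ", "BAR"]:
    normalize.events.put(word)
normalize.events.join()
sink.events.join()
```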