TL;DR: keep working on ML for language identification in Markdown code blocks; read up on implementations of large-scale collection/storage pipelines in both theory (papers) and practice (Software Heritage).
Recap: suggest a programming language for fenced code blocks in Markdown, so users do not need to type it in.
Last OSD's fenced-code-blocks dataset was collected using a regexp: REGEXP_EXTRACT_ALL(c.content, r"(```[\S]*?\n[\s\S]+?\n```)").
Small: 12 MB, 7k examples, and quality is quite low due to the complexity of the Markdown spec :/
Re-collected the same dataset using the actual reference Markdown parser implementation https://github.com/jgm/commonmark.js on all READMEs of 3.9M repos: 1.8 GB with 7M examples. Bonus: a nice JSON schema: repo_name, file_path, lang, text.
# Use a JS UDF with commonmark.js to extract 'code_block' nodes
CREATE TEMPORARY FUNCTION extract_code_blocks(md STRING)
RETURNS ARRAY<STRUCT<lang STRING, text STRING>>
LANGUAGE js AS
"""
if (!md) return [];
var reader = new commonmark.Parser();
var walker = reader.parse(md).walker();
var event, node;
var blocks = [];
while ((event = walker.next())) {
  node = event.node;
  if (event.entering && node.type === 'code_block') {
    blocks.push({lang: node.info ? node.info : '', text: node.literal});
  }
}
return blocks;
"""
OPTIONS (
  library="gs://srcd-production-dataproc/commonmark.js"
);
WITH readmes AS (
  SELECT
    f.repo_name AS repo_name,
    f.path AS path,
    extract_code_blocks(c.content) AS code_blocks
  FROM `bigquery-public-data.github_repos.sample_files` f
  LEFT JOIN `bigquery-public-data.github_repos.sample_contents` c
  ON f.id = c.id
  WHERE
    REGEXP_CONTAINS(f.path, r'(?i)(^|/)readme(\.|$)') AND
    ARRAY_LENGTH(extract_code_blocks(c.content)) <> 0)
SELECT
  repo_name, path, code_block.lang AS lang, code_block.text AS text
FROM readmes r
CROSS JOIN UNNEST(r.code_blocks) AS code_block;
Results
-------
README.md only:
- sample tables (30 GB processed): 70k blocks, 18 MB
- full dataset (2.29 TB processed, 15,508 s): 7M blocks, 1.8 GB
- can expect ~8x more if run over all *.md files
Stats on info strings (languages):
- 3.5M empty
- 0.5M js, bash
- the first ~1k languages are meaningful (out of ~3k distinct info strings)
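The per-language counts above can be reproduced with a short script over the exported dataset (a sketch; the gzipped newline-delimited JSON path is hypothetical, the lang/text schema is the one from the query above):

```python
import collections
import gzip
import json

def lang_histogram(path):
    """Count fenced-code-block info strings in a gzipped
    newline-delimited JSON export (one {"lang": ..., "text": ...} per line)."""
    counts = collections.Counter()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            counts[row.get("lang", "")] += 1
    return counts
```

`lang_histogram(...).most_common(20)` then shows the long tail: empty info strings dominate, followed by js and bash.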
Data available at https://drive.google.com/open?id=0BxNVgwtOUkMUdmhNNlIwLUV4ekk
gzcat markdown/fenced_code_blocks_json/enced_code_blocks000000000000 | less
TODO:
- for all 16M repos, use https://data.world/vmarkovtsev/github-readme-files and Apache Spark with https://github.com/atlassian/commonmark-java, expecting ~4x dataset size
- run pre-trained fastText over this corpus, calculate accuracy
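For the accuracy step, a minimal scoring harness could look like this (a sketch; it treats the block's info string as ground truth and assumes predictions come from elsewhere, e.g. a fastText model):

```python
def accuracy(pairs):
    """Fraction of (predicted, actual) language pairs that match.

    Blocks with an empty info string carry no label, so they are
    excluded from the ground truth rather than counted as errors.
    """
    labelled = [(pred, actual) for pred, actual in pairs if actual]
    if not labelled:
        return 0.0
    return sum(1 for pred, actual in labelled if pred == actual) / len(labelled)
```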
Now that we have opened ALL of our pipeline for repository discovery, collection and storage:
- rovers/borges
- core with the RootedRepo transactioner
- go-git, go-billy, go-billy-siva, siva
- berserker
it makes sense to get familiar with the tech stack used by Software Heritage for similar efforts.
- design paper https://upsilon.cc/~zack/stuff/software-heritage-draft.pdf esp. sections 6 & 7
- API https://archive.softwareheritage.org/api/
- source https://forge.softwareheritage.org/source
- storage https://wiki.softwareheritage.org/index.php?title=Repository_snapshot_objects
- pure Python git impl https://github.com/jelmer/dulwich
Custom 'Merkle DAG' implementation in Python, a super-set of Git; has SHA256, salted SHA1, and BLAKE2 implementations. Includes file/dir/revision/release nodes, plus metadata on origins, types, snapshots/refs. Blobs live in KV storage, addressed by hash; the rest of the graph is stored in an RDBMS, table-per-node-type.
Append-only, but each node type has a "change feed" - its modification history.
Strong de-duplication; allows storing .tar, .deb, git/svn/etc.
Allows context-independent browsing by hash.
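The content-addressing idea behind that deduplication can be sketched in a few lines (a toy illustration using the Git-style scheme; Software Heritage's actual serialization and its salted/SHA256 variants differ):

```python
import hashlib

def blob_id(data: bytes) -> str:
    """Content id of a file: hash over a git-style header plus the raw bytes."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()

def tree_id(entries) -> str:
    """Content id of a directory: hash over its sorted (name, child_id) entries.

    Because the id depends only on content, identical files and identical
    subtrees collapse to the same node wherever (and however often) they occur.
    """
    payload = b"".join(
        name.encode() + b"\x00" + child_id.encode()
        for name, child_id in sorted(entries)
    )
    return hashlib.sha1(b"tree %d\x00" % len(payload) + payload).hexdigest()
```

Revisions and releases hash in the same way over their own payloads, which is what makes the whole graph a Merkle DAG.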
Listers (full/incremental) -> loaders -> schedulers -> archivers
Fully extracted: GitHub, Gitorious, Google Code, Debian archive, GNU codebases.
Each copy of the object storage currently occupies ≈150 TB of individually compressed files (300 TB raw).
Each copy of the RDBMS used to store the rest of the graph (Postgres) takes ≈5 TB.
There are 3 copies of the object storage and 2 copies of the database,
the latter with point-in-time recovery over a 2-week time window.
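Back-of-envelope on those numbers (just arithmetic over the figures quoted above):

```python
# Figures quoted above, in TB.
object_store_copy = 150  # compressed object storage, per copy
raw_size = 300           # the same content uncompressed
rdbms_copy = 5           # Postgres graph store, per copy

compression_ratio = raw_size / object_store_copy          # ~2x from per-file compression
total_footprint = 3 * object_store_copy + 2 * rdbms_copy  # 460 TB across all copies
```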
The logical Merkle DAG has ≈5 billion nodes and ≈50 billion edges.
More than half of the nodes are (unique) file contents (≈3B), and there are
≈750M revision/commit nodes, collected from ≈55M origins.
TODO: run it locally on same 2k repos as borges
Paper "SEDA: An Architecture for Well-Conditioned, Scalable Internet Services" by Matt Welsh, David Culler, and Eric Brewer.
This paper on large-scale system design was the motivation for the current implementation of the biggest search engine's indexing pipeline.
Event-driven, decomposed into stages.
Homeostasis + feedback for predictable performance past the saturation point (graceful degradation).
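The staged, event-driven decomposition can be sketched with bounded queues and per-stage worker pools (a toy illustration of the idea, not SEDA's actual API; the back-pressure from bounded queues is what buys graceful degradation):

```python
import queue
import threading

class Stage:
    """One stage: an event queue plus a small worker pool.

    The bounded queue pushes back on upstream stages when this one
    saturates, so throughput degrades gracefully instead of collapsing.
    """
    def __init__(self, handler, downstream=None, workers=2, maxsize=100):
        self.handler = handler
        self.downstream = downstream
        self.events = queue.Queue(maxsize=maxsize)  # bounded => back-pressure
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            event = self.events.get()
            result = self.handler(event)
            if self.downstream is not None and result is not None:
                self.downstream.events.put(result)  # blocks when downstream is full
            self.events.task_done()

# Two-stage pipeline: normalize tokens, then collect them.
results = []
sink = Stage(lambda e: results.append(e))
normalize = Stage(lambda e: e.strip().lower(), downstream=sink)
for word in ["  Foo ", "BAR"]:
    normalize.events.put(word)
normalize.events.join()
sink.events.join()
```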