OSD #5

TL;DR: keep working on ML for language identification in MD code blocks; read up on implementations of large-scale collection/storage pipelines in both theory (papers) and practice (Software Heritage).

I. Learning Linguist

Recap: suggest a programming language for fenced code blocks in markdown, so users do not need to type it in.
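To make the task concrete, a minimal sketch of the kind of baseline this could start from, assuming scikit-learn and char n-grams (illustration only, with stand-in toy data; not the final model):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Stand-ins for the (text, info-string) pairs collected below
train_texts = ["print('hi')", "console.log(1)", "SELECT 1;"]
train_langs = ["python", "js", "sql"]

# Char n-gram TF-IDF + Naive Bayes: a cheap, language-agnostic baseline
model = make_pipeline(
    TfidfVectorizer(analyzer='char_wb', ngram_range=(1, 3)),
    MultinomialNB(),
)
model.fit(train_texts, train_langs)
print(model.predict(["print('hello')"]))  # likely ['python']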

Last OSD's fenced code blocks dataset was collected using the regexp REGEXP_EXTRACT_ALL(c.content, r"(```[\S]*?\n[\s\S]+?\n```)"). Small (12 MB, 7k examples), and quality is quite low due to the complexity of the Markdown spec :/
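Why the quality is low, in one example: CommonMark also allows tilde fences and indented code blocks, which the regexp never sees. A quick Python check with the same pattern:

import re

# Same pattern as the REGEXP_EXTRACT_ALL above
fence = re.compile(r"```[\S]*?\n[\s\S]+?\n```")

md = "~~~python\nprint('hi')\n~~~\n\n```js\nconsole.log(1)\n```\n"
print(fence.findall(md))
# ["```js\nconsole.log(1)\n```"] -- the ~~~ block is silently dropped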

Re-collected the same dataset using the actual reference Markdown parser implementation, https://github.com/jgm/commonmark.js, on all READMEs of 3.9m repos: 1.8 GB with 7m examples. Bonus: a nice JSON schema: repo_name, file_path, lang, text.

# Use a UDF with commonmark.js to extract 'code_block' nodes

CREATE TEMPORARY FUNCTION extract_code_blocks(md STRING)
  RETURNS ARRAY<STRUCT<lang STRING, text STRING>>
  LANGUAGE js AS
"""
if (!md) return [];

var reader = new commonmark.Parser();
var walker = reader.parse(md).walker();
var event, node;
var blocks = [];

// Walk the AST and collect every code_block node (both fenced and indented)
while ((event = walker.next())) {
  node = event.node;
  if (event.entering && node.type === 'code_block') {
    blocks.push({lang: node.info ? node.info : '', text: node.literal});
  }
}

return blocks;
"""
OPTIONS (
  library="gs://srcd-production-dataproc/commonmark.js"
);

WITH readmes AS (
  SELECT f.repo_name AS repo_name, f.path AS path, extract_code_blocks(c.content) AS code_blocks
  FROM `bigquery-public-data.github_repos.sample_files` f
  LEFT JOIN `bigquery-public-data.github_repos.sample_contents` c
  ON f.id = c.id
  WHERE
    REGEXP_CONTAINS(f.path, r'(?i)(^|/)readme(\.|$)') AND
    ARRAY_LENGTH(extract_code_blocks(c.content)) <> 0)
SELECT
  repo_name, path, code_block.lang AS lang, code_block.text AS text
FROM readmes r
CROSS JOIN UNNEST(r.code_blocks) AS code_block;
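The walker logic can be sanity-checked locally before burning BigQuery quota. A sketch assuming the commonmark PyPI package (the Python port of commonmark.js; there node.t holds the node type instead of node.type):

# pip install commonmark
import commonmark

def extract_code_blocks(md):
    # Same walk as the UDF: collect every code_block node
    if not md:
        return []
    walker = commonmark.Parser().parse(md).walker()
    blocks = []
    for node, entering in walker:
        if entering and node.t == 'code_block':
            blocks.append({'lang': node.info or '', 'text': node.literal})
    return blocks

print(extract_code_blocks("```python\nprint('hi')\n```\n"))
# [{'lang': 'python', 'text': "print('hi')\n"}]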

Results

README.md
---------
// sample tables: 30 GB scanned
70k blocks, 18 MB

// full tables: 2.29 TB scanned, 15508 s
7m blocks, 1.8 GB

Can expect ~8x more if run over all *.md files.

Stats on info-strings with languages:

3.5m empty
0.5m js, bash
the first ~1k languages are meaningful (out of 3k)

(figure: info-string histogram)

Data available at https://drive.google.com/open?id=0BxNVgwtOUkMUdmhNNlIwLUV4ekk

gzcat markdown/fenced_code_blocks_json/fenced_code_blocks000000000000 | less
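The info-string histogram above can be reproduced from the export with a few lines of Python (assuming gzip-compressed shards with one JSON object per line, as in the gzcat command):

import glob
import gzip
import json
from collections import Counter

langs = Counter()
for path in glob.glob('markdown/fenced_code_blocks_json/*'):
    with gzip.open(path, 'rt') as shard:
        for line in shard:
            langs[json.loads(line).get('lang', '')] += 1

for lang, count in langs.most_common(20):
    print(count, lang or '<empty>')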

TODO:

II. Software Heritage

After we opened up ALL of our pipeline for repository discovery, collection and storage:

  • rovers/borges,
  • core with the RootedRepo transactioner,
  • go-git, go-billy, go-billy-siva, siva,
  • berserker,

it makes sense to get familiar with the tech stack used by Software Heritage for similar efforts:

  • https://upsilon.cc/~zack/stuff/software-heritage-draft.pdf (esp. sections 6 & 7)
  • https://archive.softwareheritage.org/api/
  • https://forge.softwareheritage.org/source

Object storage

Custom Merkle DAG implementation in Python, a super-set of Git; has SHA256, salted SHA1 and BLAKE2 implementations. Includes file/dir/revision/release nodes, plus metadata on origins, type, snapshots/refs. Blobs live in KV storage, addressed by hash; the rest of the graph is stored in an RDBMS, one table per node type.
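The addressing scheme is easy to picture with a toy sketch (my illustration, not the Software Heritage code): a node's id is a hash over its type, payload and children ids, so identical subtrees collapse into a single stored object:

import hashlib

def node_id(node_type, payload, children=()):
    # Toy Merkle-DAG id: hash of type + payload + sorted child ids
    h = hashlib.sha256()
    h.update(node_type.encode())
    h.update(payload)
    for child_id in sorted(children):
        h.update(child_id)
    return h.digest()

blob = node_id('content', b"print('hi')\n")
tree = node_id('directory', b'main.py', children=[blob])
rev = node_id('revision', b'initial commit', children=[tree])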

Append-only, but each node type has a "change feed" - a modification history.

Strong de-duplication; allows storing .tar, .deb, git/svn/etc.

Allows for context-independent browsing by hash.

High-level architecture:

Listers (full/incremental) -> loaders -> schedulers -> archivers

Current status:

Extracted in full: GitHub, Gitorious, Google Code, Debian archive, GNU codebases.

Each copy of the object storage currently occupies ≈150 TB of individually compressed files (300 TB raw).
Each copy of the RDBMS used to store the rest of the graph (Postgres) takes ≈5 TB.

There are 3 copies of the object storage and 2 copies of the database,
the latter with point-in-time recovery over a 2-week time window.

The logical Merkle DAG has ≈5 billion nodes and ≈50 billion edges.
More than half of the nodes are (unique) file contents (≈3 B), and there are
≈750 M revision/commit nodes, collected from ≈55 M origins.

TODO: run it locally on the same 2k repos as borges.

III. Reading

Paper "SEDA: An Architecture for Well-Conditioned, Scalable Internet Services" by Eric Brewer et al.

This paper on large-scale system design was a motivation for the current implementation of the biggest search engine's indexing pipeline.

Event-driven design: decomposition into stages connected by explicit event queues.

Homeostasis + feedback for predictable performance past the saturation point (graceful degradation).
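The core idea fits in a few lines. A toy single stage (my sketch of the pattern, not code from the paper): an explicit event queue, a small worker pool, and crude admission control that sheds load once the queue saturates:

import queue
import threading

class Stage:
    def __init__(self, handler, workers=4, max_queue=1000):
        self.events = queue.Queue(maxsize=max_queue)
        self.handler = handler
        for _ in range(workers):
            threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, event):
        # Admission control: reject instead of queueing unboundedly
        try:
            self.events.put_nowait(event)
            return True
        except queue.Full:
            return False  # caller degrades gracefully (drop, retry, backpressure)

    def _loop(self):
        while True:
            self.handler(self.events.get())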
