Skip to content

Instantly share code, notes, and snippets.

@tazjin
Last active September 28, 2019 17:32
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save tazjin/c929e4cceb269cd7fd2117c45a4d1ab1 to your computer and use it in GitHub Desktop.
Save tazjin/c929e4cceb269cd7fd2117c45a4d1ab1 to your computer and use it in GitHub Desktop.
Documentation languages

Documentation format popularity

The SQL query below queries the GitHub dataset on Google BigQuery for information about the popularity of different documentation formats.

Identification is based on file extensions in this version, and duplicates (e.g. READMEs across forks) are deduplicated by content hashes (which the table uses for the identifiers).

Read the query for more information, or behold these results:

Language # of files # of megabytes
Markdown 7982489 26187
AsciiDoc (with .asc) 124213 9059
AsciiDoc (no .asc) 86765 823
Org-Mode 24779 314

Fun fact #1: The first AsciiDoc table row is giving it a significant benefit of the doubt by including files with a .asc extension since AsciiDoc has no standard extension.

Removing this extension removes 91% of the content attributed to AsciiDoc, but only 31% of the files. This leads me to believe that most of the .asc files are actually GPG public keys, but draw your own conclusions.

Fun fact #2: The plain text format (.txt) outclasses all of these by a large margin (9829711 files, 473 gigabytes), but I suspect that a lot of that is data of some kind and that a lot of the actual documentation is actually Markdown without the .md extension.

Without writing some more sophisticated content analysis function this won't be possible to include in the stats.

-- Queries the public BigQuery dataset for Github for all documentation
-- files and their sizes. Files are identified by content hashes, and can
-- thus be deduplicated (to exclude duplicated READMEs in forks and such).
--
-- Query results in the total count of unique per language, and the
-- total number of bytes per language.
#StandardSQL
WITH
-- Subtable of all distinct files (deduplicated per file hash) tagged
-- with the documentation language they are written in, as guessed per
-- extension.
docs AS (
SELECT
id,
ANY_VALUE(
CASE WHEN ENDS_WITH(path, '.md') THEN 'markdown'
WHEN ENDS_WITH(path, '.asciidoc') THEN 'asciidoc'
WHEN ENDS_WITH(path, '.adoc') THEN 'asciidoc'
-- including .asc might also match various GPG keys, but
-- Markdown will still outclass AsciiDoc significantly.
WHEN ENDS_WITH(path, '.asc') THEN 'asciidoc'
WHEN ENDS_WITH(path, '.org') THEN 'org'
-- Plain text files might actually be Markdown, too, but
-- lets ignore that
WHEN ENDS_WITH(path, '.txt') THEN 'plain'
ELSE NULL
END
) AS language
FROM `bigquery-public-data.github_repos.files`
GROUP BY id),
docs_files AS (
SELECT d.id AS id, d.language AS language, c.size AS size
FROM docs d
INNER JOIN `bigquery-public-data.github_repos.contents` c
ON d.id = c.id AND c.binary = false
WHERE d.language IS NOT NULL)
SELECT
language,
COUNT(id) AS total_count,
SUM(size) AS total_size
FROM docs_files
GROUP BY language
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment