tazjin/description.md

## description.md

      
### Documentation format popularity

The SQL query below searches the GitHub dataset on Google BigQuery for
information about the popularity of different documentation formats.
In this version, formats are identified by file extension, and duplicates (e.g.
READMEs across forks) are collapsed by content hash (which the table uses
as the file identifier).
Read the query for more information, or behold these results:


| Language | # of files | # of megabytes |
|---|---|---|
| Markdown | 7982489 | 26187 |
| AsciiDoc (with `.asc`) | 124213 | 9059 |
| AsciiDoc (no `.asc`) | 86765 | 823 |
| Org-Mode | 24779 | 314 |


Fun fact #1: The first AsciiDoc table row gives the format a significant benefit
of the doubt by including files with a `.asc` extension, since AsciiDoc has no
standard extension.
Removing this extension removes 91% of the content attributed to AsciiDoc, but
only about 30% of the files. This leads me to believe that most of the `.asc` files
are actually GPG public keys, but draw your own conclusions.
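
The two percentages can be recomputed from the two AsciiDoc rows of the results table. The snippet below is only a quick arithmetic check; the variable and function names are illustrative and not part of the original query.

```python
# Figures copied from the two AsciiDoc rows of the results table.
with_asc = {"files": 124213, "megabytes": 9059}
without_asc = {"files": 86765, "megabytes": 823}


def share_removed(before, after):
    """Fraction of `before` that disappears when only `after` remains."""
    return (before - after) / before


files_removed = share_removed(with_asc["files"], without_asc["files"])
size_removed = share_removed(with_asc["megabytes"], without_asc["megabytes"])

print(f"files removed:   {files_removed:.1%}")   # files removed:   30.1%
print(f"content removed: {size_removed:.1%}")    # content removed: 90.9%
```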

Fun fact #2: The plain text format (`.txt`) outclasses all of these by a large
margin (9829711 files, 473 gigabytes), but I suspect that much of that is data
of some kind, and that a lot of the actual documentation is Markdown
without the `.md` extension.
Without a more sophisticated content analysis function, it won't be
possible to include this in the stats.
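
As a rough idea of what such a content analysis function might look like, here is a minimal sketch that flags text as "probably Markdown" when it contains at least two different Markdown constructs. The function name `looks_like_markdown`, the choice of patterns, and the `min_hits` threshold are all assumptions made for illustration; none of this was used to produce the numbers above, and the heuristic would certainly misclassify some files.

```python
import re

# Hypothetical heuristics: each pattern matches one common Markdown construct.
MARKDOWN_PATTERNS = [
    re.compile(r"^#{1,6} \S", re.MULTILINE),    # ATX heading, e.g. "# Title"
    re.compile(r"\[[^\]]+\]\([^)]+\)"),         # inline link, e.g. [text](url)
    re.compile(r"^```", re.MULTILINE),          # fenced code block
    re.compile(r"^\s*[-*+] \S", re.MULTILINE),  # bullet list item
]


def looks_like_markdown(text, min_hits=2):
    """Return True if at least `min_hits` distinct constructs appear in `text`."""
    hits = sum(1 for pattern in MARKDOWN_PATTERNS if pattern.search(text))
    return hits >= min_hits


sample_doc = "# Title\n\nSee [the docs](http://example.com) for details.\n"
sample_data = "1,2,3\n4,5,6\n"

print(looks_like_markdown(sample_doc))   # True
print(looks_like_markdown(sample_data))  # False
```

Requiring two distinct constructs rather than one is a guess at a reasonable trade-off: a lone `-` bullet or `#` line is common in plain data files and changelogs.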

  
## query.sql
-- Queries the public BigQuery dataset for GitHub for all documentation
-- files and their sizes. Files are identified by content hashes, and can
-- thus be deduplicated (to exclude duplicated READMEs in forks and such).
--
-- The query returns the total count of unique files per language, and the
-- total number of bytes per language.
#StandardSQL
WITH
  -- Subtable of all distinct files (deduplicated per file hash) tagged
  -- with the documentation language they are written in, as guessed per
  -- extension.
  docs AS (
     SELECT
       id,
       ANY_VALUE(
        CASE WHEN ENDS_WITH(path, '.md') THEN 'markdown'
             WHEN ENDS_WITH(path, '.asciidoc') THEN 'asciidoc'
             WHEN ENDS_WITH(path, '.adoc') THEN 'asciidoc'
             -- including .asc might also match various GPG keys, but
             -- Markdown will still outclass AsciiDoc significantly.
             WHEN ENDS_WITH(path, '.asc') THEN 'asciidoc'
             WHEN ENDS_WITH(path, '.org') THEN 'org'
             -- Plain text files might actually be Markdown, too, but
             -- let's ignore that.
             WHEN ENDS_WITH(path, '.txt') THEN 'plain'
             ELSE NULL
        END
       ) AS language
     FROM `bigquery-public-data.github_repos.files`
     GROUP BY id),

   docs_files AS (
     SELECT d.id AS id, d.language AS language, c.size AS size
     FROM docs d
     INNER JOIN `bigquery-public-data.github_repos.contents` c
     ON d.id = c.id AND c.binary = false
     WHERE d.language IS NOT NULL)

SELECT
  language,
  COUNT(id) AS total_count,
  SUM(size) AS total_size
FROM docs_files
GROUP BY language
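
To make the logic of the query easier to follow, here is a small Python sketch of the same idea, run against a handful of made-up rows: guess the language from the extension, count each content hash once, then total files and bytes per language. The tuple shape, the function names, and the `include_asc` switch are assumptions for illustration. The sketch keeps the first recognised path per hash, which is only one possible outcome of `ANY_VALUE`, and it skips the join that excludes binary content.

```python
# Mirrors the extension list in the CASE expression of the query.
EXTENSIONS = {
    ".md": "markdown",
    ".asciidoc": "asciidoc",
    ".adoc": "asciidoc",
    ".asc": "asciidoc",
    ".org": "org",
    ".txt": "plain",
}


def classify(path, include_asc=True):
    """Return the language guessed from the file extension, or None."""
    for suffix, language in EXTENSIONS.items():
        if suffix == ".asc" and not include_asc:
            continue
        if path.endswith(suffix):
            return language
    return None


def summarise(files, include_asc=True):
    """Total (file_count, bytes) per language, counting each hash once.

    `files` is an iterable of (content_hash, path, size_in_bytes) tuples,
    a hypothetical stand-in for the joined BigQuery tables.
    """
    seen = {}
    for content_hash, path, size in files:
        language = classify(path, include_asc)
        if language is not None and content_hash not in seen:
            seen[content_hash] = (language, size)
    totals = {}
    for language, size in seen.values():
        count, total = totals.get(language, (0, 0))
        totals[language] = (count + 1, total + size)
    return totals


# Made-up example rows, not taken from the dataset.
rows = [
    ("a1", "README.md", 100),
    ("a1", "fork/README.md", 100),  # same content hash: counted once
    ("b2", "key.asc", 50),
    ("c3", "notes.org", 10),
    ("d4", "main.c", 999),          # not a documentation extension
]

print(summarise(rows))
print(summarise(rows, include_asc=False))
```

Calling `summarise` with `include_asc=False` corresponds to the "AsciiDoc (no `.asc`)" row of the results, where the `.asc` branch is left out.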