The SQL query below runs against the GitHub dataset on Google BigQuery to measure the popularity of different documentation formats.
In this version, formats are identified by file extension, and duplicates (e.g. READMEs across forks) are collapsed by content hash, which the table uses as the file identifier.
Read the query for more information, or behold these results:
Language | # of files | # of megabytes |
---|---|---|
Markdown | 7982489 | 26187 |
AsciiDoc (with .asc) | 124213 | 9059 |
AsciiDoc (no .asc) | 86765 | 823 |
Org-Mode | 24779 | 314 |
Fun fact #1: The first AsciiDoc table row gives the format a significant benefit
of the doubt by including files with a .asc extension, since AsciiDoc has no
standard extension.
Dropping this extension removes 91% of the content attributed to AsciiDoc, but
only about 30% of the files. This leads me to believe that most of the .asc files
are actually GPG public keys, but draw your own conclusions.
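
The percentages in fun fact #1 can be recomputed from the two AsciiDoc rows of the table. The snippet below is a minimal sketch of that arithmetic; the helper name `share_removed` is made up for illustration, and the numbers are copied from the table rather than fetched from BigQuery. With these rows, the content share comes out at roughly 91% and the file share at roughly 30%.

```python
# Rows copied from the results table above: (number of files, megabytes).
with_asc = (124213, 9059)     # AsciiDoc, counting .asc files
without_asc = (86765, 823)    # AsciiDoc, ignoring .asc files


def share_removed(before: int, after: int) -> float:
    """Percentage of `before` that disappears when only `after` remains."""
    return 100 * (before - after) / before


# Share of files and of content that is attributable to the .asc extension.
files_removed = share_removed(with_asc[0], without_asc[0])
size_removed = share_removed(with_asc[1], without_asc[1])

print(f"files removed:   {files_removed:.1f}%")
print(f"content removed: {size_removed:.1f}%")
```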
Fun fact #2: The plain text format (.txt) outclasses all of these by a large
margin (9829711 files, 473 gigabytes), but I suspect that a lot of that is data
of some kind and that a lot of the actual documentation is really Markdown
without the .md extension.
Including those files in the stats would require a more sophisticated content analysis function than the extension match used here.
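
To give an idea of what such a content analysis function might look like, here is a rough sketch in Python. Everything in it is an assumption made for illustration: the names `markdown_score` and `looks_like_markdown`, the choice of patterns, and the threshold of two distinct constructs are not part of the query, and a heuristic this simple will misclassify plenty of real files.

```python
import re

# Hypothetical heuristic: the patterns and the threshold are illustrative
# assumptions, not something the BigQuery query does.
MARKDOWN_PATTERNS = [
    re.compile(r"^#{1,6} \S", re.MULTILINE),      # ATX heading: "# Title"
    re.compile(r"\[[^\]\n]+\]\([^)\s]+\)"),       # inline link: [text](url)
    re.compile(r"^```", re.MULTILINE),            # fenced code block
    re.compile(r"^\s*[-*+] \S", re.MULTILINE),    # bullet list item
]


def markdown_score(text: str) -> int:
    """Count how many distinct Markdown constructs appear in the text."""
    return sum(1 for pattern in MARKDOWN_PATTERNS if pattern.search(text))


def looks_like_markdown(text: str, threshold: int = 2) -> bool:
    """Guess whether a .txt file is really Markdown (assumed threshold)."""
    return markdown_score(text) >= threshold


if __name__ == "__main__":
    readme = "# Title\n\n- item one\n- item two\n\nSee [docs](https://example.com).\n"
    data = "a,b,c\n1,2,3\n4,5,6\n"
    print(looks_like_markdown(readme))  # heading, bullet list and link all match
    print(looks_like_markdown(data))    # no Markdown constructs found
```

Requiring two distinct constructs rather than one is a deliberate choice in this sketch: a single leading dash or hash is common in plain data and log files, so one match alone is weak evidence.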