@naruaway
Last active January 8, 2022 07:01
Filetype detection in different ecosystems

The concept of a "file type" is important. For example, a text editor needs to know whether the currently open file is written in Python or TypeScript in order to provide file-type-specific features such as syntax highlighting.

Many editors and tools use short identifiers to distinguish file types (e.g. javascript, typescriptreact, zsh, python). For example, VS Code, probably the most popular open source text editor, has a list of "known language identifiers", and the closely related LSP spec also mentions "common language identifiers". Since VS Code and LSP are popular, I think these language identifiers are treated as a kind of de facto standard, and other editors tend to follow them (e.g. Vim/Neovim uses "typescriptreact" for ".tsx" files; see this vim GitHub issue for the decision).

Note that the VS Code / LSP language identifiers are of course not the only definition of "file types". For example, tree-sitter uses a "scope" (e.g. source.js or source.html), and its documentation says it tries to follow TextMate grammars / Linguist.

Editors and tools need to figure out the file type of a given file, and each tool has its own set of definitions and strategies for doing so. In general, the file type can be determined from the file name extension (e.g. .ts, .js, or .py), but there are many edge cases, and sometimes the file contents are needed to decide. When there is no file name extension, the shebang line is a useful piece of information to inspect.
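To make the "extension first, shebang as a fallback" idea concrete, here is a minimal sketch in Python. It is not how any particular editor does it: the function name `detect_language` and both lookup tables are hypothetical and deliberately tiny, and the identifiers just follow the VS Code-style names mentioned above.

```python
import os
import re

# Hypothetical, deliberately tiny tables for illustration only.
# Real tools ship far larger definitions.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".ts": "typescript",
    ".tsx": "typescriptreact",
}

INTERPRETER_TO_LANGUAGE = {
    "python": "python",
    "python3": "python",
    "node": "javascript",
    "zsh": "zsh",
}

# Captures the program after "#!" and, optionally, its first argument.
SHEBANG_RE = re.compile(r"^#!\s*(\S+)(?:\s+(\S+))?")


def detect_language(filename, first_line=""):
    """Guess a language identifier from the extension, then the shebang."""
    _, ext = os.path.splitext(filename)
    language = EXTENSION_TO_LANGUAGE.get(ext.lower())
    if language is not None:
        return language

    match = SHEBANG_RE.match(first_line)
    if match is None:
        return None

    program, argument = match.groups()
    interpreter = os.path.basename(program)
    # "#!/usr/bin/env python3" names the real interpreter as the argument.
    if interpreter == "env" and argument:
        interpreter = os.path.basename(argument)
    return INTERPRETER_TO_LANGUAGE.get(interpreter)
```

For example, `detect_language("app.tsx")` resolves from the extension alone, while `detect_language("run", "#!/usr/bin/env python3")` has to fall back to the shebang. Anything that needs a deeper look at the file contents is out of scope for this sketch.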

Let's see how some (randomly chosen...) tools do it:

Open questions that I am still not sure about

  • Is it possible to share the definitions across ecosystems?

    • I am not even sure whether this would be beneficial, though. To some extent, each editor / ecosystem wants to handle things in its own way.
    • Note that even sharing regexes is a pain once non-JS-based editors are considered. Not every ecosystem bundles an ECMAScript regex engine (e.g. Neovim).
  • Why aren't we using MIME types?
