naruaway/filetype-detection-in-different-ecosystems.md

## filetype-detection-in-different-ecosystems.md

      
Filetype detection in different ecosystems

The concept of a "file type" is important. For example, a text editor needs to know whether the open file is written in Python or TypeScript in order to provide file-type-specific features such as syntax highlighting.
Many editors and tools use short identifiers to distinguish file types (e.g. javascript, typescriptreact, zsh, python).
For example, VS Code, probably the most popular open source text editor, has a list of "known language identifiers", and the closely related LSP spec also mentions "common language identifiers".
Since VS Code and LSP are popular, I think these language identifiers are treated as a kind of de facto standard, and other editors tend to follow them (e.g. Vim/Neovim uses "typescriptreact" for ".tsx" files; see this vim GitHub issue for the decision).
Note that the VS Code / LSP language identifiers are of course not the only definition of "file types". For example, tree-sitter uses a "scope" (e.g. source.js or source.html), and they say they try to follow TextMate grammars / Linguist.
Editors and tools need to figure out the file type of a given file. To do that, different tools have different sets of definitions and strategies.
In general, the file type can be determined from the file name extension (e.g. .ts, .js, or .py), but there are many edge cases, and sometimes the file contents are needed to decide.
Also, when there is no file name extension, the shebang is a useful piece of information to inspect.
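
As a rough illustration of the two signals described above, here is a minimal Python sketch that tries the extension first and falls back to the shebang. The lookup tables and the function name `detect_filetype` are made up for this example; real tools ship far larger tables and handle many more edge cases.

```python
import os

# Illustrative tables only; not taken from any real editor.
EXTENSION_TO_FILETYPE = {
    ".ts": "typescript",
    ".js": "javascript",
    ".py": "python",
}
INTERPRETER_TO_FILETYPE = {
    "python3": "python",
    "python": "python",
    "node": "javascript",
    "zsh": "zsh",
}


def detect_filetype(filename, first_line=""):
    """Return a language identifier, or None when nothing matches."""
    # 1. The extension is the cheapest signal, so try it first.
    _, ext = os.path.splitext(filename)
    if ext.lower() in EXTENSION_TO_FILETYPE:
        return EXTENSION_TO_FILETYPE[ext.lower()]

    # 2. No usable extension: look at the shebang on the first line.
    if first_line.startswith("#!"):
        parts = first_line[2:].split()
        if not parts:
            return None
        # "#!/usr/bin/env python3" names the interpreter in the second word.
        if os.path.basename(parts[0]) == "env" and len(parts) > 1:
            interpreter = parts[1]
        else:
            interpreter = os.path.basename(parts[0])
        return INTERPRETER_TO_FILETYPE.get(interpreter)
    return None
```

With these tables, `detect_filetype("run", "#!/usr/bin/env python3")` resolves to `python` even though the file name carries no extension.
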
Let's see how some (randomly chosen...) tools do it:

Vim / Neovim

Uses a bunch of imperative logic to decide file types from several pieces of information, including the file name, absolute path, file contents, and sometimes even file attributes (e.g. the executable flag)
Extremely flexible, since the logic is written imperatively and can do anything to figure out the file type
The definitions are not so well organized (compared with something like VS Code configuration, which has clear per-language contribution points like firstLine)


Plenary (an external library / util for Neovim)

It pulls definition data from Linguist
It also relies on some pre-bundled definitions


Atom

It's based on declarative definitions for each file type
input

fileTypes (array of filename suffixes (e.g. .tsx))
firstLineRegex (regex to match against the first line)
contentRegex (regex to match against the whole file)


examples

shell script
javascript


It seems to be based on TextMate convention / behavior

Implementation used in Atom


Sublime Text

It's based on declarative definitions for each file type
input

file_extensions
first_line_match


examples

javascript
shell script


bat

Uses syntect, which uses Sublime Text syntax definitions
It just pulls syntax definitions from the Sublime Text repo

https://github.com/sharkdp/bat/tree/master/assets/syntaxes directly includes https://github.com/sublimehq/Packages/


VS Code

input

extensions (e.g. .sh, .tsx)
filenames (full file name like Makefile)
firstLine (i.e. regex to match against the first line of the file)


examples

shellscript
javascript
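
To make the three inputs above concrete, here is a small Python sketch of how a declarative table with extensions, filenames, and firstLine could be resolved. The entries are shaped like VS Code language contributions but are trimmed and illustrative, not copied from VS Code, and the precedence used here (exact file name, then longest extension, then first line) is an assumption of this sketch rather than VS Code's documented algorithm.

```python
import re

# Shaped like language contribution entries; trimmed and illustrative.
LANGUAGES = [
    {"id": "shellscript", "extensions": [".sh", ".bash"],
     "filenames": [".bashrc"], "firstLine": r"^#!.*\b(bash|zsh|sh)\b"},
    {"id": "makefile", "extensions": [".mk"], "filenames": ["Makefile"]},
    {"id": "typescriptreact", "extensions": [".tsx"]},
]


def resolve_language_id(filename, first_line=""):
    # Assumed precedence for this sketch: exact file name, then the longest
    # matching extension, then the first-line regex.
    for lang in LANGUAGES:
        if filename in lang.get("filenames", []):
            return lang["id"]
    matches = [
        (len(ext), lang["id"])
        for lang in LANGUAGES
        for ext in lang.get("extensions", [])
        if filename.lower().endswith(ext)
    ]
    if matches:
        return max(matches)[1]
    for lang in LANGUAGES:
        pattern = lang.get("firstLine")
        if pattern and re.search(pattern, first_line):
            return lang["id"]
    return None
```

The point of the declarative shape is that adding a language only means adding a table entry; the resolver itself stays untouched.
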


Emacs

Emacs has special handling for the shebang

interpreter-mode-alist
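
interpreter-mode-alist pairs a regexp for the interpreter name with a mode. The Python sketch below mimics that lookup under two assumptions of mine: the regexp has to match the whole interpreter name, and the interpreter name has already been extracted from the shebang line. The entries and the name `mode_for_interpreter` are hypothetical, and Python's `re` syntax is not the same as Emacs regexp syntax.

```python
import re

# Hypothetical entries shaped like interpreter-mode-alist: (regexp, mode).
# Python's `re` syntax is used here, which differs from Emacs regexp syntax.
INTERPRETER_MODES = [
    (r"python[0-9.]*", "python-mode"),
    (r"(ba|z|k)?sh", "sh-mode"),
    (r"node(js)?", "js-mode"),
]


def mode_for_interpreter(interpreter):
    """Return the mode of the first entry whose regexp matches the whole name."""
    for pattern, mode in INTERPRETER_MODES:
        if re.fullmatch(pattern, interpreter):
            return mode
    return None
```

Using regexps rather than exact names is what lets one entry cover versioned interpreters such as `python3.11`.
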


Helix

special handling for shebang

implemented here


Linguist

Heuristics for files sharing the same extension: example
Note that this data is not necessarily directly usable by editors. For example, the .tsx definition does not include typescriptreact in any field.
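
To show what a same-extension heuristic can look like, here is a Python sketch for `.h` files, which could be C, C++, or Objective-C. The rule table is invented and heavily simplified for illustration; it is not Linguist's actual data, and `disambiguate` is a hypothetical name.

```python
import re

# Candidate rules per extension, tried in order; a rule without a pattern is
# the fallback. The patterns are simplified for illustration.
HEURISTICS = {
    ".h": [
        ("Objective-C", r"^\s*@(interface|implementation|protocol|end)\b"),
        ("C++", r"^\s*(template\s*<|namespace\s+\w+|class\s+\w+)"),
        ("C", None),
    ],
}


def disambiguate(extension, content):
    """Return the first language whose pattern matches the file content."""
    for language, pattern in HEURISTICS.get(extension, []):
        if pattern is None or re.search(pattern, content, re.MULTILINE):
            return language
    return None
```

The ordering matters: more specific languages are tried first, and the pattern-less entry acts as the default when nothing else matches.
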


tree-sitter

tree-sitter also has a language detection mechanism based on definitions provided for each language

e.g. "tree-sitter TypeScript" includes a filename extension definition ("ts") and even a language injection regex
e.g. "tree-sitter Python" includes a filename extension definition ("py")
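
The two kinds of per-language data mentioned above answer different questions: the extension picks a language for a file, while the injection regex picks a language for code embedded in another document (for example a Markdown fence labelled "ts"). A minimal Python sketch, with invented values and hypothetical key and function names:

```python
import re

# Hypothetical per-language metadata; the keys echo the kind of fields a
# tree-sitter grammar declares, but the values are invented for this sketch.
LANGUAGE_CONFIGS = [
    {"scope": "source.ts", "file_types": ["ts"],
     "injection_regex": r"^(ts|typescript)$"},
    {"scope": "source.python", "file_types": ["py"],
     "injection_regex": r"^(py|python)$"},
]


def scope_for_file(filename):
    """Pick a language from the file name extension."""
    ext = filename.rsplit(".", 1)[-1] if "." in filename else ""
    for config in LANGUAGE_CONFIGS:
        if ext in config["file_types"]:
            return config["scope"]
    return None


def scope_for_injection(label):
    """Pick a language for embedded code from its label."""
    for config in LANGUAGE_CONFIGS:
        if re.search(config["injection_regex"], label):
            return config["scope"]
    return None
```
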


Open questions (things I am not so sure about)


Is it possible to share the definitions across ecosystems?

I am not even sure whether this would be beneficial, though. To some extent each editor / ecosystem wants to handle things in its own way.
Note that even sharing regexes is a pain for non-JS-based editors... Not every ecosystem bundles an ECMAScript regex engine (e.g. Neovim)
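
A tiny example of the regex portability problem, using Python's `re` module as a stand-in for "some engine that is not ECMAScript": named groups are spelled `(?<name>...)` in ECMAScript but `(?P<name>...)` in Python, so a first-line pattern written for one engine may not even compile in the other. The patterns and the helper name are made up for this sketch.

```python
import re


def compiles_in_python(pattern):
    """Report whether Python's `re` module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# ECMAScript spelling of a named group: Python's `re` does not accept it.
ECMASCRIPT_STYLE = r"^#!.*\b(?<interp>node|deno)\b"
# The same pattern using Python's own named-group syntax.
PYTHON_STYLE = r"^#!.*\b(?P<interp>node|deno)\b"
```

So a shared definition format would either need a restricted regex subset or a per-engine translation step.
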


Why aren't we using MIME types?