The concept of a "file type" is important. For example, a text editor needs to figure out whether the currently open file is written in Python or TypeScript in order to provide file-type-specific features such as syntax highlighting.
Many editors and tools use short identifiers to distinguish file types (e.g. `javascript`, `typescriptreact`, `zsh`, `python`).
For example, VS Code, (probably) the most popular open source text editor, has a list of "known language identifiers", and the closely related LSP spec also mentions a "common language identifier".
Since VS Code and LSP are popular, I think these language identifiers are considered a kind of "standard", and other editors tend to follow them (e.g. Vim/Neovim uses "typescriptreact" for ".tsx" files; see this vim GitHub issue for the decision).
Note that the VS Code / LSP language identifiers are of course not the only definition of "file types". For example, tree-sitter uses a "scope" (e.g. `source.js` or `source.html`), and they say they are trying to follow TextMate grammars / Linguist.
Editors and tools need to figure out the file type of a given file. To do that, different tools have different sets of definitions and strategies.
In general, the file type can be determined by checking the file name extension (e.g. `.ts`, `.js`, or `.py`), but there are many edge cases, and sometimes the file contents are needed to decide the file type.
Also, when there is no file name extension, the shebang is useful information to inspect.
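As a rough illustration of the "extension first, shebang as a fallback" idea, here is a minimal Python sketch. The lookup tables and the `detect_filetype` function are hypothetical and written only for this note; they are not taken from any editor, and real tools handle far more edge cases.

```python
import os
import re

# Hypothetical lookup tables, for illustration only.
EXTENSION_TO_FILETYPE = {
    ".ts": "typescript",
    ".tsx": "typescriptreact",
    ".js": "javascript",
    ".py": "python",
}
INTERPRETER_TO_FILETYPE = {
    "python": "python",
    "python3": "python",
    "node": "javascript",
    "zsh": "zsh",
}

# Captures the program path and, optionally, its first argument.
SHEBANG = re.compile(r"^#!\s*(\S+)(?:\s+(\S+))?")


def detect_filetype(filename, first_line=""):
    """Return a file type identifier, or None if nothing matches."""
    # 1. Try the file name extension.
    ext = os.path.splitext(filename)[1].lower()
    if ext in EXTENSION_TO_FILETYPE:
        return EXTENSION_TO_FILETYPE[ext]

    # 2. Fall back to the shebang on the first line.
    match = SHEBANG.match(first_line)
    if match is None:
        return None
    program, argument = match.group(1), match.group(2)
    name = os.path.basename(program)
    # "#!/usr/bin/env python3" names the interpreter in the first argument.
    if name == "env" and argument:
        name = os.path.basename(argument)
    return INTERPRETER_TO_FILETYPE.get(name)
```

With these tables, `detect_filetype("app.tsx")` gives `"typescriptreact"`, and `detect_filetype("run", "#!/usr/bin/env python3")` gives `"python"`. Anything more ambiguous needs the content-based strategies described for the tools below.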
Let's see how some (randomly chosen...) tools do it:
- Vim / Neovim
  - Uses a bunch of imperative logic to decide file types from several pieces of information, including the file name, absolute path, file contents, and sometimes even file attributes (e.g. the executable flag)
  - Extremely flexible, since the logic is written imperatively and can do anything to figure out the file type
  - The definitions are not so well organized (compared with something like the VS Code configuration, which has clear per-language contribution points like `firstLine`)
- Plenary (an external library / util for Neovim)
  - It pulls definition data from Linguist
  - It also relies on some pre-bundled definitions
- Atom
  - It is based on declarative definitions for each file type
  - input
    - fileTypes (array of filename suffixes, e.g. `.tsx`)
    - firstLineRegex (regex to match against the first line)
    - contentRegex (regex to match against the whole file)
  - examples
  - It seems to be based on TextMate conventions / behavior
- Sublime Text
  - It is based on declarative definitions for each file type
  - input
    - file_extensions
    - first_line_match
  - examples
- bat
  - Uses syntect, which uses Sublime Text syntax definitions
  - It just pulls the syntax definitions from the sublimetext repo
- VS Code
  - input
    - extensions (e.g. `.sh`, `.tsx`)
    - filenames (full file names like `Makefile`)
    - firstLine (i.e. a regex to match against the first line of the file)
  - examples
- Emacs
  - Emacs has special handling for shebangs
    - interpreter-mode-alist
- Helix
- Linguist
  - Has heuristics for files that share the same extension (example)
  - Note that this data is not necessarily directly usable by editors. For example, the `.tsx` definition does not include `typescriptreact` in any field.
- tree-sitter
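
To make the declarative style more concrete, here is a minimal Python sketch of a matcher driven by the three inputs listed for VS Code above (`extensions`, `filenames`, `firstLine`). The `LanguageDefinition` class, the sample definitions, and the matching order (exact file name, then longest extension, then first line) are assumptions made for illustration; this is not how any particular editor actually resolves conflicts.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LanguageDefinition:
    # Field names loosely follow the inputs listed above; the class is hypothetical.
    id: str
    extensions: list = field(default_factory=list)
    filenames: list = field(default_factory=list)
    first_line: Optional[str] = None  # regex source matched against line 1


# Sample data for illustration only.
DEFINITIONS = [
    LanguageDefinition("typescriptreact", extensions=[".tsx"]),
    LanguageDefinition("makefile", filenames=["Makefile"]),
    LanguageDefinition(
        "shellscript",
        extensions=[".sh"],
        first_line=r"^#!.*\b(bash|zsh|sh)\b",
    ),
]


def match_definition(filename, first_line=""):
    """Return the id of the first matching definition, or None."""
    base = filename.rsplit("/", 1)[-1]

    # 1. Exact file name (assumed to be the most specific signal).
    for definition in DEFINITIONS:
        if base in definition.filenames:
            return definition.id

    # 2. Longest matching extension, case-insensitive.
    best = None
    for definition in DEFINITIONS:
        for ext in definition.extensions:
            if base.lower().endswith(ext):
                if best is None or len(ext) > len(best[0]):
                    best = (ext, definition.id)
    if best is not None:
        return best[1]

    # 3. Regex against the first line of the file.
    for definition in DEFINITIONS:
        if definition.first_line and re.search(definition.first_line, first_line):
            return definition.id
    return None
```

Here `match_definition("src/App.tsx")` gives `"typescriptreact"`, and `match_definition("deploy", "#!/bin/bash")` gives `"shellscript"`. The point of the sketch is that the matcher itself stays tiny and all language knowledge lives in data, which is what makes this style easier to organize than imperative detection, at the cost of flexibility.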
- Is it possible to share the definitions across ecosystems?
  - I am not even sure whether this would be beneficial, though. To some extent, each editor / ecosystem wants to handle things in its own way.
  - Note that even sharing regexes is a pain for non-JS-based editors... Not every ecosystem bundles an ECMAScript regex engine (e.g. Neovim)
- Why aren't we using MIME types?