First draft, Easter Sunday 2014
Proposal for a lightweight, database-less, general-purpose, name-based file management app
In line with current trends toward lean and simple software solutions reviving and repurposing long-established standards (open plain text vs proprietary rich text formats; file-based static site generators vs bloated database-driven CMSs), the present proposal inquires into a method (and its application) to devise a lightweight, general-purpose solution for small-scale digital asset management.
No database is to be used, there shall be no external dependencies, all information carriers should be self-containing, and everything would be file-based. The software would build on a (yet to be established) convention of file naming, which would store metadata for arbitrary files inside the file name, advancing its portability across platforms.
Put as a (somewhat far-fetched) YC-style one-liner pitch: this proposal is about “Markdown for file names”, enabling non-tech-savvy users to rapid-prototype custom (yet fully functional) asset management apps.
_IMG00123.JPG
index.html
Turner_-_Rain%2C_Steam_and_Speed_-_National_Gallery_file.jpg

_READ ME ········································.txt
J.M.W. Turner | Rain, Steam and Speed | ···· 1844.jpg
W. Blake ···· | Newton ·············· | 1795–1805.jpg
Likely enough, the following is an all too wordy elaboration of what is basically a very simple idea. I leave it to people much wiser and more skillful to improve on the specifics and, hopefully, can incite them enough to pump out some minimum viable code…
Problem: Metadata portability
Some file formats allow for metadata to be stored inside the file (e.g. DRM, Exif, ID3, XMP, YAML front matter, etc.). Such embedded metadata offers great data portability, as the information is kept inside the file, even when exchanged between platforms. At the same time, it lacks extensibility, because users cannot easily expand the set of pre-provided metadata properties, and neither is it feasible to create and market custom file or MIME types for each and every conceivable data model. For plain text files, solutions like XMP and YAML front matter do provide the desired extensibility, but binary files are not provided for.
Some operating systems provide means (e.g. OS X Finder Tags) to store file metadata that is not made available by common file formats, within a database that belongs to the OS. Depending on the implementation, users are offered somewhat greater extensibility of metadata properties. However, since the additional metadata belongs to the OS and is not stored within the file, that information is lost when files are exchanged between platforms (OSs).
Applications (e.g. content management systems) can adopt an approach similar to that of OSs, and give users means to augment files with tags, keywords, and all sorts of metadata for which file format specs (or, mutatis mutandis, MIME types) do not provide. However, that information too needs to be stored outside the file, in a separate index file or database belonging to the application; moreover, this approach also requires the files to have a unique and persistent identifier hook, which may get complicated to implement and maintain. Alternatively, the application would wrap the original file inside its own custom proprietary “file format” (thereby effectively creating yet another spec) and store the metadata alongside it. Here too, metadata portability is compromised as soon as the application is not available (it is not, or is no longer, installed on the user's machine, development or maintenance has gone stale, etc.).
Solution: store metadata inside the file name
For lack of other means to store, read, and edit file metadata (and consequently use it to browse and filter the files to which it is added), it is proposed to store such general purpose metadata for arbitrary files inside the file name.
By “arbitrary files” all files are meant that can have file names, whatever their type. By “general purpose” it is intended to allow users to put any number of metadata fields (properties) inside the file name, and give them any kind of value (i.e. any string). That is, at least in theory: given the fact that the present metadata storage proposal would effectively hijack the use and purpose of file names, one can do only so much.
Obviously, arbitrarily large sets of metadata fields cannot be stored inside a file name. Operating systems (including the Web as a platform, with its own restrictions as defined in, e.g., the URL spec) rightfully restrict file names both in length (maximum allowed number of characters), and in character allowance (some characters may be reserved and must be escaped or re-encoded, e.g.
Any implementation of the present proposal should take these restrictions into consideration, and strive for a pragmatic solution that meets both the requirements for data portability across platforms and the broadest possible support for a varied set of use cases. Quite likely, such an implementation would look for the lowest common denominator between modern OSs, i.e. Unicode support and a maximum of 255 characters, both to the detriment of support for older OSs.
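The restrictions mentioned above can be checked mechanically. What follows is only a minimal sketch under stated assumptions: the 255 limit is counted in UTF-8 bytes (some file systems count differently), and the reserved set is the one Windows applies to file names, taken here as a strict common denominator. The function name portability_problems is hypothetical.

```python
# Characters that Windows refuses in file names; assumed here to be the
# strictest set among common desktop platforms.
WINDOWS_RESERVED = set('<>:"/\\|?*')
MAX_NAME_BYTES = 255  # assumption: the limit is counted in UTF-8 bytes


def portability_problems(file_name: str) -> list[str]:
    """Return human-readable reasons why a name may not travel well."""
    problems = []
    size = len(file_name.encode("utf-8"))
    if size > MAX_NAME_BYTES:
        problems.append(f"too long: {size} bytes (limit {MAX_NAME_BYTES})")
    bad = sorted(set(file_name) & WINDOWS_RESERVED)
    if bad:
        problems.append("reserved characters: " + " ".join(bad))
    if any(ord(ch) < 32 for ch in file_name):
        problems.append("contains control characters")
    return problems


print(portability_problems("index.html"))
print(portability_problems("W. Blake | Newton | 1795–1805.jpg"))
```

Two things are worth noting. The vertical line is among the characters Windows rejects, so an implementation that targets Windows would have to settle on a different delimiter. And a non-ASCII character such as the middle dot takes two bytes in UTF-8, which matters wherever the length limit is expressed in bytes rather than characters.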
While the proposed file naming convention would be severely limited by these restrictions (especially the character limit) as regards its use for more complex and larger sets of metadata, it would undoubtedly address a manifold need for small-scale, lightweight file management. Larger, data-heavy applications will continue to benefit from dedicated management tools, specifically built on top of custom databases and business logic, wherein rarely changing data structures have been laid out by domain experts. For general purposes, however, small sets of elementary metadata kept inside the file name, without the need for a database, would offer a practical, user-friendly and cross-platform solution for easy management of any kind of files.
As an added benefit, one could foresee such a file naming convention (along with an ecosystem of complementary parsers and graphical user interfaces) establishing a foundation on which rapid prototyping apps could be built, enabling users who lack the necessary programming skills to still exploit the full potential of the domain expertise they do have.
Implementation: core requirements
Any implementation of the proposed file naming convention must satisfy both of the following use case scenarios:
(1) Human-readable: At all times, file names should be as human-readable as they can be, for circumstances wherein the user has no access to a dedicated app that improves on the readability, browsability and editability of the file names and the metadata they contain.
(2) Machine-readable: File names should be easily parsable by applications that extract structured metadata from them in order to provide an improved user experience (UX), all while dealing with parse errors that may occur as a result of poor syntax introduced by users who edited file names directly, without a dedicated file name editing app responsible for preventing malformedness.
These requirements imply that any implementation would strive for the best possible UX, both for the “plain text” editing of file names (say within the interface provided by the OS for string operations on file names), and for editing by means of a dedicated file name editor application which would make the most of the proposed convention.
In the first use case scenario, files could be easily exchanged between platforms, and users could directly peruse and, whenever appropriate, also edit file names, e.g. right inside the interface of a cloud file hosting service (like Dropbox, Google Drive, OneDrive, etc.). In the second scenario, when users browse files and edit their metadata (i.e. the file names concerned) through a dedicated front-end and GUI, likely errors (caused by file names not being well-formed) would be dealt with conveniently.
From the above two requirements, catering for two use case scenarios, a third requirement obviously follows, that of portability:
(3) Self-containing: At any time, a complying implementation would assume that files are self-containing pieces of information, that all of their metadata is stored within the file name (or, by extension, for well-established file types, within the file itself), and, thus, that no external resources are to be allowed.
Like rows in a table, file names of individual files would have nothing in common with those of their siblings, except for their structure. There would be no cross-referencing (at least not at the level of the convention), and the common data structure would be flat, tabular, non-relational, and without hierarchy.
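Since sibling file names share nothing but their structure, one simple way for a parser to cope with hand-edited, malformed names is to compare the number of fields per name. The following is a rough sketch, not a definitive approach; it assumes the space+pipe+space delimiter seen in the example file names at the top of this proposal, and the function name find_malformed is hypothetical.

```python
from collections import Counter


def find_malformed(file_names, delimiter=" | "):
    """Return names whose field count differs from the most common one."""
    counts = {name: len(name.split(delimiter)) for name in file_names}
    if not counts:
        return []
    # The field count shared by most files is taken to be the intended one.
    expected, _ = Counter(counts.values()).most_common(1)[0]
    return [name for name, count in counts.items() if count != expected]


names = [
    "J.M.W. Turner | Rain, Steam and Speed | 1844.jpg",
    "W. Blake | Newton | 1795–1805.jpg",
    "W. Blake | Newton 1795–1805.jpg",  # a delimiter was lost while editing by hand
]
print(find_malformed(names))
```

A front-end could list the flagged names for repair instead of refusing to open the folder.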
Implementation: optional requirements: smart parsing using external schemas
Although the requirement of portability states that files should be self-containing, that all metadata should be stored either inside the file name, or embedded within the file, and that thus there may not be any dependency on external resources, this does not exclude the concept of external document models or schemas.
For more demanding use cases, an “index file” of sorts might be provided, which could be stored alongside the files (e.g. in a _READ ME.txt file, inside the same folder). Such a file could contain a cached index and, more importantly, a pointer to a data schema, or model, whose pattern the current file names follow.
Well-formedness excepted, a basic implementation would however not be responsible for data validation: it is up to the data model, or schema, to provide the mechanics for such validation, whereas a front-end application would use those methods to implement and enforce validation.
Also, the external index file would only augment the user experience, and should not contain core information that does not also reside inside the file names. In other words: an index file might contain names for keys (column headers, so to speak), but not values (the actual data in the “row cells”).
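To illustrate the division of labour between index file and file names, here is a small sketch. The JSON layout of the index and the key names (artist, title, year) are pure assumptions for the sake of the example; the proposal does not prescribe a format. The function label_fields is likewise hypothetical.

```python
import json

# Hypothetical index content: only key names, never values.
INDEX_TEXT = '{"fields": ["artist", "title", "year"]}'


def label_fields(index_text: str, values: list[str]) -> dict[str, str]:
    """Pair key names from the index with values taken from a file name."""
    keys = json.loads(index_text)["fields"]
    labelled = {}
    for position, value in enumerate(values):
        # Without a matching key, fall back to a positional name so that
        # no value from the file name is dropped.
        key = keys[position] if position < len(keys) else f"field{position + 1}"
        labelled[key] = value
    return labelled


print(label_fields(INDEX_TEXT, ["W. Blake", "Newton", "1795–1805"]))
```

If the index file goes missing, the values are still all present in the file names; only the column headers are lost.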
Implementation: some particulars
By now it will be obvious that the character-separated values (csv) file format may serve as inspiration for an implementation of the file naming convention proposed here.
However, whereas in csv all data is stored in a single plain text file, each record on its own row, and each record field in its own “cell” (between commas, tabs or another delimiter), in the proposed file naming convention file metadata will be spread across files (i.e. within the file names), which serve as “rows”. It is only after processing that these “rows” could be assembled into a temporary csv file, or any other appropriate serialized stream.
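Assembling the “rows” into csv could look roughly like this. It is a bare sketch that assumes the space+pipe+space delimiter used in the example file names at the top of this proposal, leaves the extension attached to the last field, and relies on Python's standard csv module for quoting; names_to_csv is a hypothetical name.

```python
import csv
import io


def names_to_csv(file_names: list[str], delimiter: str = " | ") -> str:
    """Serialize one csv row per file name, splitting on the field delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for name in file_names:
        writer.writerow(name.split(delimiter))
    return buffer.getvalue()


print(names_to_csv([
    "J.M.W. Turner | Rain, Steam and Speed | 1844.jpg",
    "W. Blake | Newton | 1795–1805.jpg",
]))
```

Note how the title containing commas has to be quoted in the csv output, whereas it needed no escaping in the file name.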
There is no official specification for csv, but RFC 4180, the de facto standard, describes some basic, well-established rules for csv encoding. While these rules, likely due to their simplicity, have turned csv into a widely used and proven data storage format, much remains to be desired as for the readability of csv formatted data. Since the proposed convention must yield file names that are as readable as possible, at all times, it is proposed to introduce some additional differences between the formatting of traditional csv and that of the proposed file naming convention.
In csv, separators (delimiter characters) can be anything, although commas and tabs are most commonly used. Neither can be used in the case of file names, because commas may conflict with existing file naming restrictions, while the Tab key, on most OSs, moves the cursor out of the file name editing box instead of inserting a tab character.
Moreover, commas are quite common in user-provided values, and should thus be avoided as delimiters, reducing the need for character escaping — an unfortunate necessity that troubles csv all too often. Tabs, on the other hand, are whitespace non-printing characters, and it may not be always obvious to a user whether the visual space is a regular word space, a tab, or yet something else — white spaces are too error-inducing.
It is therefore proposed to have visual delimiters that introduce enough typographical separation between metadata fields, all while getting out of the way. The “pipe”, “polon” or “vertical line” character | (U+007C) seems an obvious candidate, as it does not often occur in user-provided values, and still is directly available on most keyboards (albeit through a combination of keys; on the Mac:
It would be better still to not have a single character as a delimiter, but a string of characters; the combination of “ | ” (space+pipe+space) makes sense. This approach would both improve readability and reduce the likelihood of ambiguous double use of said string, i.e. as a delimiter and within a user-provided value. Hence, it too would help avoid the need for character escaping.
An even better implementation wouldn’t restrict the delimiter to a single reserved character or character string. In a dedicated editor app, on each file save, all file names in a collection would be examined, after which any delimiter is picked that does not conflict or overlap with characters used in any of the field values (possibly from a set of reserved delimiter charstrings: •, etc.). Conversely, fed with a collection of file names, the parser would intelligently determine the delimiter, look for a distinctive repetition of characters and assume that string to be the delimiter.
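Both directions can be sketched in a few lines. This is a simplification under assumptions: the candidate list is hypothetical (only the pipe and the bullet are mentioned in this proposal), and the parser side merely tests known candidates rather than discovering arbitrary repeated strings. The names pick_delimiter and guess_delimiter are made up for the example.

```python
CANDIDATES = [" | ", " • ", " ~ "]  # hypothetical set of reserved delimiter strings


def pick_delimiter(rows, candidates=CANDIDATES):
    """Editor side: first candidate that appears in none of the field values."""
    for candidate in candidates:
        if not any(candidate in value for row in rows for value in row):
            return candidate
    raise ValueError("every candidate delimiter occurs inside some value")


def guess_delimiter(file_names, candidates=CANDIDATES):
    """Parser side: a candidate occurring equally often (at least once) in every name."""
    for candidate in candidates:
        counts = {name.count(candidate) for name in file_names}
        if len(counts) == 1 and counts != {0}:
            return candidate
    return None


rows = [["Turner", "Rain | Steam"], ["Blake", "Newton"]]
print(repr(pick_delimiter(rows)))
print(repr(guess_delimiter(["a | b | c.jpg", "d | e | f.jpg"])))
```

In the first call the pipe string is skipped because it occurs inside a value, so the next candidate is chosen.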
The need for character escaping should be avoided, because the escape method would probably break compliance with common file naming specs (e.g. use of quotes, or percent-encoding).
File names have extensions, which by themselves provide useful metadata on the file. It will be evident that any implementation of the proposed convention will always include the filename extension as a separate metadata field. The dot (.) that separates the file name per se from its extension will be accounted for by the parser looking for field delimiters.
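A parser along these lines might be sketched as follows. It assumes the space+pipe+space delimiter and treats the middle dot (U+00B7) seen in the example file names as padding to be stripped; both are assumptions of this sketch, and parse_file_name is a hypothetical name.

```python
import os

DELIMITER = " | "  # the space+pipe+space string suggested in the text
SPACER = "·"       # assumption: U+00B7 MIDDLE DOT only pads fields


def parse_file_name(file_name: str) -> dict:
    """Split a file name into metadata fields plus its extension."""
    # splitext cuts at the last dot, so dots inside values stay untouched.
    stem, extension = os.path.splitext(file_name)
    fields = [part.strip(SPACER + " ") for part in stem.split(DELIMITER)]
    return {"fields": fields, "extension": extension.lstrip(".")}


print(parse_file_name("J.M.W. Turner | Rain, Steam and Speed | ···· 1844.jpg"))
```

Here the extension ends up as a field of its own, next to the three fields recovered from the name.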
In a graphical interface, the extension could be replaced by a thumbnail of the actual file contents, or it could be a clickable element giving access to such a document preview. Additionally, it would be advantageous to have any embedded metadata (Exif, YAML, etc.) also included in said metadata editor, and have it displayed after the filename extension.
While csv is a versatile data storage format, in cases where values are of uneven length (i.e. number of characters) across rows, the visual alignment of “columns” will break, and readability will be severely hampered. Take this example:
"Joseph Mallord William Turner","14/5/1775-19/12/1851","Rain, Steam and Speed: The Great Western Railway","1844","oil on canvas","91 cm","121,8 cm"
"William Blake","28/11/1757-12/8/1827","Newton","1795-1805","monotype","46 cm","60 cm"
as compared to:
Joseph Mallord William Turner | 14/05/1775–19/12/1851 | Rain, Steam and Speed: The Great Western Railway | ···· 1844 | oil on canvas | ·91,0 cm | 121,8 cm
William Blake ··············· | 28/11/1757–12/08/1827 | Newton ········································· | 1795–1805 | monotype ···· | ·46,0 cm | ·60,0 cm
Or in an editor with a GUI, after parsing:
[Table: the same two records (Joseph Mallord William Turner; William Blake) laid out as rows, one field per cell]
As csv may not be intended to be viewed and edited directly, as plain text, poor alignment of columns may not be too big of an issue, as it is probably assumed that csv will typically be edited from within an editor with a graphical interface (e.g. a spreadsheet application). On the other hand, viewing and editing of file names directly (e.g. from within default file browsing interfaces provided by OSs) will be a primary use case scenario, requiring the best possible readability and user experience, and thus a solution for values of unequal length is needed. It is therefore proposed to introduce (and likely reserve) spacer characters to make sure “columns” are always of even width.
While parsing file names, a standard implementation of the proposed convention would of course recognize these spacer characters, and temporarily hide or mitigate them in a graphical editor, only to re-add them on file save. Quite likely, it would look for the file/record with the longest value for the relevant field, and add as many spacer characters as needed to make the others fit. In addition, a dedicated file name editor app could offer users the option to put spacer characters either on the left of the value (e.g. for numerical values), or on the right (for alphabetical strings), mimicking text alignment in a graphical interface.
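The re-adding of spacers on save can be sketched as below. The sketch assumes the middle dot as spacer and mimics the layout of the example file names in this proposal: one plain space between a value and its run of spacers, except when the gap is a single character. The names pad and pad_columns are hypothetical, and all rows are assumed to have the same number of fields.

```python
SPACER = "·"  # U+00B7 MIDDLE DOT, assumed as the spacer character


def pad(value: str, width: int, align_right: bool = False) -> str:
    """Fill a value up to the given width with spacer characters."""
    gap = width - len(value)
    if gap <= 0:
        return value
    if gap == 1:
        # No room for a separating space next to a single spacer.
        return SPACER + value if align_right else value + SPACER
    filler = SPACER * (gap - 1)
    return f"{filler} {value}" if align_right else f"{value} {filler}"


def pad_columns(rows, right_aligned=()):
    """Pad every field so each column is as wide as its longest value."""
    widths = [max(len(value) for value in column) for column in zip(*rows)]
    return [
        [pad(value, widths[i], i in right_aligned) for i, value in enumerate(row)]
        for row in rows
    ]


rows = [
    ["J.M.W. Turner", "Rain, Steam and Speed", "1844"],
    ["W. Blake", "Newton", "1795–1805"],
]
for row in pad_columns(rows, right_aligned={2}):
    print(" | ".join(row) + ".jpg")
```

With the third column right-aligned, the printed names should come out padded like the Turner and Blake examples shown near the top of this proposal.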
Spacer characters could be anything, but it is recommended that they too are visual characters (i.e. not white spaces, nor any other non-printing characters). Nevertheless, they must be visually as undistracting as possible. The middle dot · character (U+00B7) might be an appropriate candidate.
Especially provident implementations might alternatively use dedicated spacer characters depending on the data type of the relevant field: dates could be normalized with zeros (e.g. 1979/03/07 instead of
Characters indicating empty values
Whereas, in csv, empty values are just empty strings, it is proposed that individual records/files that, for a particular field, do not contain a value, will be given a special spacer character that visually indicates a value is missing. It seems only natural to use the underscore or “low line” _ (U+005F) character for this purpose:
14/02/__ | ···114,51 € | ·····_,__ €
Determining which spacer character is to be used (_) is up to the user, in case of direct manual editing; in the usage scenario of a dedicated editor app, the user may be assisted by the application, which would use the data type of the field in question as specified in the schema.
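How a parser might read such placeholders is sketched below, under assumptions: the underscore marks missing (parts of) values, the middle dot is mere padding, and a field without any letter or digit is treated as entirely empty. The function read_field is hypothetical. A side effect of reserving the underscore is that a genuine value containing one would be reported as incomplete.

```python
from typing import Optional, Tuple

EMPTY = "_"   # U+005F marks a missing value or a missing part of a value
SPACER = "·"  # U+00B7 only pads and carries no meaning


def read_field(field: str) -> Tuple[Optional[str], bool]:
    """Return (value, is_complete) for one field of a file name."""
    value = field.strip(SPACER + " ")
    if not any(ch.isalnum() for ch in value):
        # Nothing but placeholders, punctuation or units: no value at all.
        return None, False
    return value, EMPTY not in value


for field in ["14/02/__", "···114,51 €", "·····_,__ €"]:
    print(repr(field), "->", read_field(field))
```

The partially known date is kept but flagged as incomplete, while the amount consisting only of placeholders is reported as missing.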
Sample Use Cases
- basic accounting application
- basic photo browser application
- manage a collection of items in a folder that is synced with any web service, through IFTTT
- Sublime Text plugin
- OSX Finder plugin
- Chrome extension for editing file names on Dropbox.com
- stand-alone native app
- web app: connect with Dropbox, Google Drive, etc., manage files