I've been fermenting this post for a while. There is a lot more I want to post on caching + webpack but I want to start here with what I see as a list of priorities and what I have learned concerning those from developing hard-source-webpack-plugin.
- Easy to Use
- Disk Cache First
- Natural Modules
- Extensible Serialization
Easy to Use
best scenario: a user never has to delete the cache
An easy to use cache is one that either responds with the right version of built modules or no version until it can store a fresh correct copy.
A cached module is only valid as long as all of its inputs have not changed. webpack tracks all the file inputs whose contents are used in a module, but does not at this time track the loader, plugin, or webpack's own files, or the other dependencies used by any of these.
If the cache could additionally track all these other items either in relation to the built modules or whole cache it may open opportunities to add longevity to a cache or parts of it. That will be a really large set of work that could be looked into after a first version of some cache in webpack.
In hard-source's case, it trusts yarn.lock or package-lock.json to represent those extra dependencies and hashes their contents for comparison in future builds. If those aren't available it hashes all of the package.json files of directories directly under node_modules for a project. Since these hashes can't make comparisons of smaller sets of packages, the whole cache is invalid when the hash value changes.
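As a rough sketch of that fallback order, assuming a hypothetical helper named environmentHash (this is not hard-source's actual code, and the choice of sha1 is only for illustration), the lockfiles are hashed when present and the package.json files directly under node_modules are hashed otherwise:

```javascript
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// Hypothetical helper: one hash that stands in for everything installed
// for the project. Any change to it invalidates the whole cache.
function environmentHash(projectDir) {
  const hash = crypto.createHash('sha1');

  // Prefer the lockfiles when at least one of them exists.
  const lockfiles = ['yarn.lock', 'package-lock.json']
    .map(name => path.join(projectDir, name))
    .filter(file => fs.existsSync(file));
  if (lockfiles.length > 0) {
    for (const file of lockfiles) hash.update(fs.readFileSync(file));
    return hash.digest('hex');
  }

  // Fall back to the package.json of each directory directly under
  // node_modules. Sorting keeps the hash independent of listing order.
  // Scoped packages (node_modules/@scope/name) are not descended into here.
  const modulesDir = path.join(projectDir, 'node_modules');
  const names = fs.existsSync(modulesDir) ? fs.readdirSync(modulesDir).sort() : [];
  for (const name of names) {
    const manifest = path.join(modulesDir, name, 'package.json');
    if (fs.existsSync(manifest)) hash.update(fs.readFileSync(manifest));
  }
  return hash.digest('hex');
}
```

Comparing the returned value with the one stored alongside the cache is enough to decide whether the whole cache has to be thrown away.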
An additional helper for invalidating the cache is placing it by default under node_modules, where it will be deleted if modules are reinstalled. This is an option that may not be desired when combined with another priority.
Another input the cache needs to track is build configuration. Most projects, even if they use one configuration file, end up with multiple runtime configurations if, say, they use both the webpack and webpack-dev-server cli tools. When webpack-dev-server modifies the config to add plugins or change entries, this changes the configuration input to the cache.
While a change to the extended dependency set likely means employing a scorched earth policy on the cache, a different configuration as a cache input can be resolved by producing multiple caches. As webpack projects may frequently switch between multiple runtime configurations, being able to use a different existing cache instead of replacing the current one improves the benefit users gain from a filesystem persistent cache.
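One way to picture multiple caches is a directory per configuration hash. The sketch below is an assumption for illustration: stableStringify and cacheDirectoryFor are hypothetical names, and a real configuration with cyclic references or stateful plugin instances would need more careful handling than this.

```javascript
const crypto = require('crypto');
const path = require('path');

// Hypothetical helper: serialize a configuration with sorted keys so that
// two equal configurations always produce the same string.
function stableStringify(value) {
  if (typeof value === 'function' || value instanceof RegExp) {
    return JSON.stringify(String(value));
  }
  if (Array.isArray(value)) {
    return '[' + value.map(stableStringify).join(',') + ']';
  }
  if (value && typeof value === 'object') {
    return '{' + Object.keys(value).sort()
      .map(key => JSON.stringify(key) + ':' + stableStringify(value[key]))
      .join(',') + '}';
  }
  const json = JSON.stringify(value);
  return json === undefined ? 'null' : json;
}

// Each distinct configuration gets its own directory under baseDir, so
// switching between configurations reuses a cache instead of replacing it.
function cacheDirectoryFor(config, baseDir) {
  const configHash = crypto.createHash('sha1')
    .update(stableStringify(config))
    .digest('hex');
  return path.join(baseDir, configHash);
}
```

With this shape, a build started by one cli tool and a build started by another simply land in different directories, and neither evicts the other.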
Multiple caches help solve a quality of life issue but can also create one. If the cache isn't occasionally removed, the multiple internal caches can build up, increasing the amount of space the whole cache uses. hard-source does not currently solve this but I think there are some easy to understand solutions. First, caches could be removed automatically if they are older than some period of time, like two weeks. They could be removed when the Nth cache is created. They could be removed when the size of all caches is greater than X MBs. Or some set of these solutions, with options exposed for modifying them, could provide a reasonable overall solution.
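The three removal rules can be combined in one pass. This is only a sketch of the idea, not something hard-source does; selectCachesToPrune, the entry fields (name, lastUsedMs, sizeBytes), and the option names are all made up for the example.

```javascript
// Hypothetical policy: walk caches from newest to oldest and mark one for
// removal when it is too old, when enough newer caches are already kept,
// or when keeping it would push the total size over the limit.
function selectCachesToPrune(caches, { now, maxAgeMs, maxCount, maxTotalBytes }) {
  const newestFirst = [...caches].sort((a, b) => b.lastUsedMs - a.lastUsedMs);
  const remove = [];
  let keptCount = 0;
  let keptBytes = 0;

  for (const cache of newestFirst) {
    const tooOld = now - cache.lastUsedMs > maxAgeMs;
    const tooMany = keptCount >= maxCount;
    const tooBig = keptBytes + cache.sizeBytes > maxTotalBytes;
    if (tooOld || tooMany || tooBig) {
      remove.push(cache.name);
    } else {
      keptCount += 1;
      keptBytes += cache.sizeBytes;
    }
  }
  return remove;
}
```

Returning the names rather than deleting anything keeps the policy easy to test and leaves the actual removal to whatever owns the cache directory.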
file and context dependencies
The last common input for module and cache invalidation is the file and context dependencies webpack itself tracks for modules. webpack compares the built time of a Module against the timestamps of those dependencies to determine if it can use the entry of a module in its module cache. That comparison will not be reliable for a disk cache.
A few cases exist for timestamp comparisons for a continuous webpack instance that do not lead to rebuilt modules when there was a change. Context dependency timestamps are an aggregate of the timestamps of all the files underneath it. If the latest file in the context dependency is deleted, the context timestamp decreases instead of increasing, leading to webpack not rebuilding any modules for that dependency. This can also happen for files that are swapped with another existing file whose time stamp is older. The files haven't changed but their names have. Since webpack tracks the names and timestamps and not the internal content it doesn't know a change has happened.
When these edge cases occur, you can currently fall back to a tried and true workaround: turn webpack off and on again, restarting it with a new cache. Since a disk persistent cache will persist through turning the entire computer off and on, that workaround is no longer workable. One possible change is making the timestamp comparisons true when the stamps are before or after the module's built time.
The cached modules need to compare the content that was last used to build them and the content that is now on disk. We can fix this by replacing timestamp checks with hash comparisons. File dependencies can use the hashes of the file content. Context dependencies can use a hash of all of the deeply nested paths under that context.
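A minimal sketch of both kinds of hash, using only Node's built-in modules. The function names are hypothetical, and hashing the sorted list of nested paths is one reading of "a hash of all of the deeply nested paths"; hard-source's real implementation may differ.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// File dependency: hash the content, so a touched but unchanged file
// still matches and a swapped file with an older timestamp does not.
function hashFile(file) {
  return crypto.createHash('sha1').update(fs.readFileSync(file)).digest('hex');
}

// Collect every nested path under dir, relative to root, with forward
// slashes so the result does not depend on the operating system.
function listNestedPaths(dir, root = dir) {
  const found = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    found.push(path.relative(root, full).split(path.sep).join('/'));
    if (entry.isDirectory()) found.push(...listNestedPaths(full, root));
  }
  return found;
}

// Context dependency: hash the sorted list of nested paths, so adding,
// deleting, or renaming anything underneath changes the hash.
function hashContext(dir) {
  const hash = crypto.createHash('sha1');
  for (const relative of listNestedPaths(dir).sort()) hash.update(relative + '\n');
  return hash.digest('hex');
}
```

Deleting the newest file in a directory, the case that lowers an aggregate timestamp, changes the context hash here because the deleted path drops out of the list.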
Disk Cache First
if it can't be cached to disk, it's not cacheable
To return a correct module from a disk persistent cache it will be important to reflect that in metadata that covers a module or other item's ability to be cached. Failure to capture that in a memory cache can be worked around. Failure to capture that in a disk persistent cache will leave an invalid module in the cache in some cases.
An example of this is ExtractTextWebpackPlugin. As it is implemented, rebuilds with a persistent cache do not emit the assets that were built for the extracted css. This can be easily fixed by ExtractText moving all built assets to the module from the child compilation. With a memory cache this is worked around with each first build. With a persistent cache only the first run without an existing cache would reliably emit those assets.
If ExtractText could not be fixed though, would it still be considered cacheable? I think a simple solution would be good. It would not be considered cacheable.
reusable cache across computers and directories
As a feature request from the community for hard-source-webpack-plugin, portability of the cache has been its greatest inspiration for change.
This priority goes hand in hand with Easy to Use.
Building a value to compare between builds for node_modules and others was added for this. It started as modified timestamps through a third party dependency. To support the cache being "portable" between CI executions, this was changed to hashes of the package.json files in the packages directly under node_modules. Later it defaulted to hashing yarn.lock and package-lock.json and fell back to hashing package.json files.
A second group of inputs needed to be validated with something other than time stamps. The file and context dependencies' time stamps are not reliable between one run of a CI environment and another. Tools that stand up a target project may stand up their content and not care to match time stamps to their last state on another CI environment or other computer. This was the original reason to use hashes of file and context dependency content over time stamps. Additional cases found where time stamps were not reliable reinforced this issue in the Easy to Use priority.
Most recently all of the paths stored in the cache are made relative to the configured context or cache directory.
The configuration input as a hash has its paths made relative to what should be a fixed location in the project, the cache directory. This way the configuration hash is independent of the directory on a system a build occurs in.
All the paths in cached dependency, module, and other objects are relative at rest on the file system. When the cache is loaded, paths are turned back into absolute paths. Some cases are left relative and hard-source has to make up for those by checking some values with an absolute path and a relative path. A key example for the double check is raw request paths. Some modules generated by loaders will use an absolute path in their require statements. This is stored in the cache as relative paths. The raw requests cannot be made back into absolute paths without knowing that they were originally absolute paths. Alternatively they can be left alone when thawing an object, at the cost of making two lookups or comparisons. This is hard-source's current handling for this variable member of some objects.
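The double check can be sketched as a lookup that tries the raw request as written and then its relative form. The names toCachePath and findByRawRequest are hypothetical, as is the idea of keying a Map by raw request; the point is only the two comparisons.

```javascript
const path = require('path');

// Hypothetical helper: the form a path takes at rest in the cache,
// relative to the context and with forward slashes.
function toCachePath(context, absolutePath) {
  return path.relative(context, absolutePath).split(path.sep).join('/');
}

// Raw requests are stored without being thawed, so an incoming request
// is compared twice: once as is, once made relative to the context.
function findByRawRequest(entries, context, rawRequest) {
  if (entries.has(rawRequest)) return entries.get(rawRequest);
  if (path.isAbsolute(rawRequest)) {
    return entries.get(toCachePath(context, rawRequest));
  }
  return undefined;
}
```

The second lookup is what lets a loader-generated absolute require match an entry that was frozen on another machine or in another directory.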
dehydrates as they were before
hard-source-webpack-plugin as it was initially implemented and until recently created facsimile modules in place of the recorded NormalModules and ContextModules. This required using the RecordIds webpack feature at all times and hybrid Dependency objects that reduced the stored info to what is needed to trace the dependencies as the rendered source was whatever was used in the last normal module build. This in some ways reduced the complexity of caching and restoring webpack builds, but created headaches around busting these modules so a NormalModule could be built in its place if a dependency resolved to a new module or if harmony export usage changed.
hard-source-webpack-plugin proved out recreating NormalModules and ContextModules, their dependencies, generators, parsers, resolveDependencies, etc. as they were in the last run. This involves a lot of work to precisely freeze and thaw all of the properties of these objects and creating objects that are heavily manipulated in plugins to recreate their state that can include added functions and objects that may not be part of webpack's core.
Thawed NormalModules can be returned by the createModule hook. The difference between the thawed module and a regularly created one is the thawed one will already have built dependencies, original source, and cached source, if its source was ever rendered. Already being built it can be side loaded into the compilation cache before being returned, letting webpack make the normal needRebuild judgement and either rebuild or use the module as is. Since NormalModules have a reference to their resolver, parser, and generator (in webpack 4), created by factory methods and plugins, the inputs to those functions can be saved and used when thawing the modules to recreate those objects through the same factories.
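A small sketch of a plugin returning thawed modules, assuming webpack 4's hook names (compiler.hooks.normalModuleFactory and the factory's createModule bail hook). The class name, the frozenByRequest map, and the thawModule callback are hypothetical stand-ins for the real freezing and thawing work.

```javascript
// Hypothetical plugin: hand webpack a thawed module when one is cached
// for the request, otherwise return undefined so webpack creates a
// NormalModule itself.
class ThawNormalModulesPlugin {
  constructor(frozenByRequest, thawModule) {
    this.frozenByRequest = frozenByRequest; // Map of request -> frozen clone
    this.thawModule = thawModule;           // (frozen, createData) -> module
  }

  apply(compiler) {
    const name = 'ThawNormalModulesPlugin';
    compiler.hooks.normalModuleFactory.tap(name, factory => {
      factory.hooks.createModule.tap(name, createData => {
        const frozen = this.frozenByRequest.get(createData.request);
        if (!frozen) return undefined;
        // createData carries the parser, generator, and loaders webpack
        // resolved for this request, so the thawed module can reuse them.
        return this.thawModule(frozen, createData);
      });
    });
  }
}
```

Because the hook bails on the first defined return value, webpack still runs its usual needRebuild check on whatever module comes back.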
ContextModules need a hook like createModule to make their thawing more natural and should have resolveDependencies assigned after creation or be created through their own unique plugin hook. Currently hard-source-webpack-plugin gets around this somewhat by passing through the afterResolve plugin to get the expected resolveDependencies function.
Transforming webpack modules and writing them to disk is split into two steps: creating serializable "cache clones" of modules and other types, and writing the cache clones to and reading them from disk in an efficient manner. The intermediate clones serve an important use where their file system relationships like file dependencies and resolved paths can be accumulated at the beginning to precollect hashes and file existence to verify modules later in needRebuild and other stages. This intermediate data has to wait for a running Compilation to be able to fully thaw into module instances.
Creating and using the clones further breaks down into two ways to turn the runtime instances into plain data. There are varied dynamic types and fixed concrete types. The Dependency instances that fill a module's dependencies array are an example group of the dynamic types. Making up an individual Dependency type is mostly fixed concrete types like requests and arrays of strings. These two definitions for the data types can be handled by two systems to form a larger one that can then freeze and thaw various webpack objects.
Writing and reading the intermediate clones to and from disk can be done with one or two further layers. The required layer is naturally something to write all the modules to disk and read them in the next build. Given JSON objects the layer will stringify and later parse those objects. This layer also handles binary Asset objects for the emitted files webpack can produce through file-loader and other loaders. An optional layer may add a more sophisticated stringify, or bufferify, method that would possibly follow a structure like the creation of the cache clones to turn objects into and from strings or buffers quicker than standard JSON calls.
Varied dynamic shape and relations dehydrate/hydrate hooks
A lot of webpack objects have a similar structure and subclass some common ancestor. webpack's dependency types all subclass Dependency. webpack's module types all subclass Module, which subclasses DependenciesBlock. These and others share some minimum set of members but then branch out into including their own uniquely needed information. Some of that information is the references to other objects that end up creating cyclical references that have made caching webpack to disk difficult.
Instead of using a general algorithm to search and find those cyclical references and remove functions and other transforms to the cache to write it to disk, as well as the reverse work to thaw the cache, a method to bind clone factories for each dependency, module, or other archetype gives a means to have specific handling of each type, their members, and extensions to those types. Some groups of types have extended handling that can vary by active plugins or options.
This area of serializing code handles higher level type serialization, serializing extended information stored by third-party plugins or non-default webpack features, and serializing smaller parts of an archetype.
hard-source-webpack-plugin currently isolates some webpack archetypes
Compilation and Module archetypes are two examples of the third use case. Compilation adds an opportunity to kick off Module serialization. Module also is used this way for ConcatenatedModule to kick off serialization of those Modules.
DependenciesBlock is an archetype used to create clones of Blocks, async Blocks, and Module types. The Module use case provides extension data of Module internal DependenciesBlock data. Blocks and async Blocks handle normal clone creation.
Dependency creates clones of the 49 different webpack core dependency types, as well as extensions to that data for Parser assigned values.
The clone data for these archetypes can be thought of being organized into different lifecycle stages. All of the created data has a first lifecycle stage involving the type's constructor. All of these types either are created later in thawing an item through a constructor (Module and Dependency types) or a common factory method (Resolver, Parser, and Generator through factories on NormalModuleFactory). After that many objects have "assigned" extensions. Dependency types have various info assigned by Parser plugins, webpack's Compilation object or other plugin hooks. Any of that information would be stored by extensions in this area. After that Module types also have "build" and "source" extensions, whose data is set in those methods of their type along with hash values for later validation. "build" is validated through needRebuild. "source" is validated inside its own method checking the stored hash. Freezing and thawing types by stepping through these stages helps organize the information in relation to webpack's workflow.
To help with references and other hard to serialize information all of this handling includes hooks to call other handlers for nested types or to refer to parent objects. For example some Dependency types are created referring to their owning Module. To make this possible to thaw, the Module and DependenciesBlock handling code calling a hook to thaw a Dependency passes extra data including a reference to the Module instance that the to be created Dependency may refer to. This echoes the behaviour in Parser which passes around an extra state object storing objects for later Dependency objects to refer to.
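To make the parent reference idea concrete, here is a small sketch with stand-in classes. ExampleModule, ExampleDependency, and the register/freeze/thaw registry are all hypothetical; they are not webpack or hard-source types, only an illustration of passing extra data down while thawing.

```javascript
// Hypothetical registry of clone handlers, one per type name.
const handlers = new Map();
const register = (type, handler) => handlers.set(type, handler);
const freeze = (type, value, extra) => handlers.get(type).freeze(value, extra);
const thaw = (type, frozen, extra) => handlers.get(type).thaw(frozen, extra);

// Stand-in types: the dependency points back at its owning module, which
// is the kind of cycle that plain JSON.stringify cannot handle.
class ExampleDependency {
  constructor(request, originModule) {
    this.request = request;
    this.originModule = originModule;
  }
}
class ExampleModule {
  constructor(request) {
    this.request = request;
    this.dependencies = [];
  }
}

register('ExampleDependency', {
  // The back reference is dropped when freezing...
  freeze: dep => ({ type: 'ExampleDependency', request: dep.request }),
  // ...and restored from the extra data the module handler passes along.
  thaw: (frozen, { module }) => new ExampleDependency(frozen.request, module),
});

register('ExampleModule', {
  freeze: module => ({
    type: 'ExampleModule',
    request: module.request,
    dependencies: module.dependencies.map(dep =>
      freeze(dep.constructor.name, dep, { module })),
  }),
  thaw: frozen => {
    const module = new ExampleModule(frozen.request);
    module.dependencies = frozen.dependencies.map(dep =>
      thaw(dep.type, dep, { module }));
    return module;
  },
});
```

The frozen form is plain data with no cycles, so it can pass through JSON, and the thawed dependency ends up pointing at the newly created module rather than a stale one.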
Known fixed shape dehydrate/hydrate helpers
While the varied and dynamic shape handling is rather abstract, creating clones and thawed objects of fixed shape is more concrete. While Dependency types are grouped under an archetype, each individual type has a known shape. We can provide libraries and helpers to create data clones of these fixed shapes and later thaw the objects as well.
The functions providing this handling can call the hooks to build dynamic shaped objects or fixed shaped objects, of note paths and request strings. To create a portable cache, all cached information needs to be stored in the cache with relative paths so that when the project and cache are moved to another directory or system, the cache can be inflated with the new context. To help all the various objects needing to freeze and thaw paths, requests, and other combined path information, the libraries providing help include functions to handle this otherwise commonly repeated logic.
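A sketch of what such path and request helpers could look like. The function names are hypothetical, and the sketch assumes a resolved request where every file part is absolute, with loaders separated by "!" and an optional "?" query on each part.

```javascript
const path = require('path');

// Freeze an absolute path into a form that always starts with ./ or ../
// so thawing can tell frozen paths apart from anything left untouched.
function freezePath(context, absolutePath) {
  const relative = path.relative(context, absolutePath).split(path.sep).join('/');
  return relative.startsWith('../') ? relative : './' + relative;
}

function thawPath(context, frozenPath) {
  return path.resolve(context, frozenPath);
}

// Apply fn to the file portion of each "!" separated part of a request,
// leaving any "?query" suffix as it was.
function mapRequestFiles(request, fn) {
  return request.split('!').map(part => {
    const [file, ...query] = part.split('?');
    return [fn(file), ...query].join('?');
  }).join('!');
}

function freezeRequest(context, request) {
  return mapRequestFiles(request, file =>
    path.isAbsolute(file) ? freezePath(context, file) : file);
}

function thawRequest(context, frozenRequest) {
  return mapRequestFiles(frozenRequest, file =>
    file.startsWith('./') || file.startsWith('../') ? thawPath(context, file) : file);
}
```

Freezing under one context and thawing under another is what moves a cache between directories: the relative form is the same at rest, and only the context supplied at load time differs.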
How to persist between builds
Persisting the cache to disk and later reading it is passed through a type in hard-source that separates this concern from the rest of the cache. This has been helpful in growing hard-source from its earlier serialization methods and experimenting in ways to increase performance.
It also provides an avenue to create a second layer of the same api that calls to the actual to/from disk layer, to transform the cache clones into and from strings or buffers. Otherwise the first layer will use JSON calls to serialize. The second layer provides an avenue to serialize objects with systems like protobuf and flatbuf. hard-source hasn't made use of this possibility yet due to needing to figure out changes to the cache clones that allow nesting the various contained types in a larger cloned object.
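The layering can be sketched with two small classes sharing a read/write shape. This is an illustration only: FileLayer, EncodingLayer, and the codec object are hypothetical names, the calls are synchronous for brevity, and hard-source's real serializer api is not reproduced here.

```javascript
const fs = require('fs');

// Required layer: moves raw buffers to and from disk.
class FileLayer {
  constructor(file) {
    this.file = file;
  }
  read() {
    return fs.existsSync(this.file) ? fs.readFileSync(this.file) : null;
  }
  write(buffer) {
    fs.writeFileSync(this.file, buffer);
  }
}

// Default codec: plain JSON calls. A protobuf or flatbuf based codec
// would expose the same encode/decode pair.
const jsonCodec = {
  encode: clones => Buffer.from(JSON.stringify(clones), 'utf8'),
  decode: buffer => JSON.parse(buffer.toString('utf8')),
};

// Optional layer: same read/write shape, but it speaks cache clones and
// delegates the actual disk access to the layer it wraps.
class EncodingLayer {
  constructor(inner, codec = jsonCodec) {
    this.inner = inner;
    this.codec = codec;
  }
  read() {
    const raw = this.inner.read();
    return raw === null ? {} : this.codec.decode(raw);
  }
  write(clones) {
    this.inner.write(this.codec.encode(clones));
  }
}
```

Swapping the codec changes how clones become bytes without touching either the code that builds the clones or the code that touches the file system.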
(kind of obvious)
- mention because it should obviously be maintained as a goal
- a simple hobby project takes 12s w/o HS, 13s w/ HS w/o cache, 2.5s w/ HS w/ cache
- 24 MB of module data on disk
- JSON.parsing 24MB will always take some measurable time (500 ms)
- thawing (350 ms)
- a lodash-es benchmark project takes 1.1s w/o HS, 1.3s w/ HS w/o cache, 0.8s w/ HS w/ cache
- 6.5 MB of module data on disk
- creating cache clones
- transforming between disk format and object format
- garbage creation
- reading and writing the cache
- freezing and thawing the cache
- may be faster to create some extra garbage than try to optimize existing memory use while taking more time to do that