# Create Your Own Archive Format with Just 60 Lines of JavaScript

Table of Contents

  • Explanation (Introduction, Why & How)
  • Browser Limitations (Memory, Streaming, Writing, Uploading)
  • What are Blobs? (How Do They Work?)
  • The Solution (Code Sample)
    • How to Use the Sample Code
  • Takeaways (Reflection)
  • NodeJS Solution ( <-- )

Explanation

Not every archive format has to be compatible with anything outside your own project. Sometimes you only need an internal solution. Alternatively, you might want to avoid the extra weight that comes with formats like zip, tar, or 7z, which include checksums, encryption, license information, file permission modes, creation/modification dates, symbolic links, and multi-file logic like .part1, .part2, etc.

Requiring any of these tools can slow down your website (more JS = more dependencies = more downloads = more compilation). Additionally, certain files, such as JPEG and other binary files, are already compressed, making the overhead of compressing already compressed files unnecessary.

The solution is quite simple:

  1. Create an array of all the files.
  2. Append some central directory (json) information about how the files are concatenated (including file names, folders, path, size & offset - similar to how zips are made) at the start or at the end.
  3. Append a number describing how large your central directory (in json) is (also similar to how zips work).
  4. Concatenate all the files, the central dir json, and its size into one blob: new Blob([...files, json, json_size])
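
Laid out back to back, the resulting archive then looks like this:

[ file 1 bytes ][ file 2 bytes ] ... [ central directory JSON ][ 4-byte JSON size ]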

And you're done! The process is synchronous and instantaneous. This may not be the best-compressed archive format, and it's not compatible with any unpacker besides your own, but it's efficient, fast, uses very little RAM, and is browser-friendly.

Before we delve into the code samples, let's discuss some browser limitations and learn a bit about blobs.

Browser limitations

Everyone knows that shipping more code means slower code. One way to make anything fast is simply to do less. Pulling in an over-engineered archive tool such as zip or tar, whether as WASM or as a huge JavaScript library, requires a lot of extra code.

To understand why I created my own way of archiving blobs and files, it helps to grasp how Blob really works under the hood.

What are blobs?

You can read up on how Chrome's blob storage design works here; to summarize it shortly: Blobs are essentially magical containers with multiple blob parts that point to different places from which all the content can be read.

For instance, if you were to take two large files from a file input <input type="file" multiple> and merge them together like new Blob([file1, file2]), you wouldn't create a Blob that consumes the same amount of memory as file1.size + file2.size. Instead, the Blob would be nothing more than a reference pointing to file1 and file2 somewhere on disk. However, if you were to create a Blob out of strings or typed arrays, then you would start using memory. Therefore, creating a zip file would involve reading the file content into some typed-array buffer and then merging the chunks into one final Blob, resulting in significantly more RAM usage unless it gets offloaded to disk or some limited browser storage.
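
A quick illustration of the difference (file1 and file2 here stand for two File objects picked from such an input):

// Merging two picked files is only bookkeeping; nothing is read from disk
const combined = new Blob([file1, file2])
console.log(combined.size === file1.size + file2.size) // true, still just references

// Building blob parts from typed arrays, on the other hand, holds real memory
const inMemory = new Blob([new Uint8Array(10 * 1024 * 1024)]) // ~10 MiB kept in RAM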

The solution

This solution minimizes memory usage as it relies on references to other pieces rather than reading or compressing files.

const IS_FOLDER = 0, IS_FILE = 1

function pack (iterator) {
  let chunks = [], tree = [], iterators = [[tree, iterator]], offset = 0

  for (let [dir, iterator] of iterators) {
    for (let item of iterator) {
      if (item instanceof File) {
        chunks.push(item)
        dir.push([IS_FILE, item.name, offset, item.size])
        offset += item.size
      } else {
        const folder = [IS_FOLDER, item.name]
        dir.push(folder)
        iterators.push([folder, item.children])
      }
    }
  }

  // Create a central directory that describes all the files and their offset
  const json = new Blob([JSON.stringify(tree)])
  chunks.push(json)

  // push 4 bytes describing how large the central directory is
  // (written little-endian, to match how readCentralDir reads it back).
  // This allows a maximum of 4 GiB for the central directory, which should
  // be well enough for keeping information about the whole archive
  const dirSize = new Uint8Array(4)
  new DataView(dirSize.buffer).setUint32(0, json.size, true)
  chunks.push(dirSize)

  // smash everything together into one large file
  return new File(chunks, 'myfiles.archive')
}

/**
 * @param {Blob} blob
 * @param {Array} centralDir
 */
function unpack (blob, centralDir) {
  const tree = []
  const iterators = [[tree, centralDir]]

  for (let [dir, iterator] of iterators) {
    for (let [type, name, ...item] of iterator) {
      if (type === IS_FILE) {
        const [start, size] = item
        dir.push(new File([blob.slice(start, start + size)], name))
      } else {
        // A folder entry carries its children after the type and name
        const folder = { name, children: [] }
        dir.push(folder)
        iterators.push([folder.children, item])
      }
    }
  }

  return tree
}

async function readCentralDir (blob) { 
  // Figure out how large the central dir is
  const size = new DataView(await blob.slice(-4).arrayBuffer()).getUint32(0, true)
  // Get the whole central dir part
  const jsonPart = blob.slice(blob.size - size - 4, blob.size - 4)
  // Parse the central dir part
  return new Response(jsonPart).json()
}

Usage

// create some dummy files
const tree = [
  new File(['abc'], 'sample.txt'), 
  { 
    name: 'Sub Folder', 
    children: [ new File(['xyz'], 'xyz.txt') ] 
  },
  new File(['hello world'], 'hello.txt')
]

// Pack everything together without ever reading the content of the files.
const archive = pack(tree)

// Unpacking
// The only part that isn't synchronous is where you
// have to read the central directory from the file
const centralDir = await readCentralDir(archive)

const unpacked = unpack(archive, centralDir)
const sample = /** @type {File} */ (unpacked[0])
const folder = unpacked[1]
const helloWorld = /** @type {File} */ (unpacked[2])

// using one of the newer blob reading methods
console.log(await helloWorld.text()) // hello world

Takeaways

I wanted to share that creating your own archiving tool doesn't have to be too complicated, and you may not need to include third-party modules with additional dependencies.

The amount of code required to implement this is roughly the same as what you would write to wire up another archiving tool to pack the necessary files; that would essentially be a wrapper upon another wrapper.

  • The packing function doesn't have to be synchronous if you don't have all the files at hand already.
  • You can start with the central directory at the beginning. When you upload the file to the server, you can figure out the size of all files early, validate them, and abort the connection if necessary. This is in contrast to a multipart/form-data upload, where you don't know the size of each individual file until afterwards; you only know the content-length of the whole form, and you can't upload folders and other metadata that you wish to include.
  • You could also do what tar does, which has no central directory at all. Instead, each entry carries information about how to jump over it, allowing you to seek past the files one at a time. If you have one of these archive formats on your server and you only want to fetch one file inside the archive container, be sure to support partial requests by honoring byte-range requests, so one request can read the central directory and the next can fetch just the range that's needed (see the sketch after this list).
  • If you only have a JSON REST API, uploading a bunch of files as base64 data URLs isn't ideal since it uses more memory and bandwidth. Nothing can stop you from using both binary and JSON in the same request; you just have to decide how to parse the request.
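
As a rough sketch of the byte-range idea above (assuming the archive produced by pack() is hosted at some archiveUrl on a server that honors Range requests; the helper names are made up):

async function fetchCentralDir (archiveUrl) {
  // First request: only the last 4 bytes, which hold the central directory size
  const sizeRes = await fetch(archiveUrl, { headers: { Range: 'bytes=-4' } })
  const size = new DataView(await sizeRes.arrayBuffer()).getUint32(0, true)

  // Second request: the central directory itself (plus the 4 size bytes we already know)
  const dirRes = await fetch(archiveUrl, { headers: { Range: `bytes=-${size + 4}` } })
  const buf = await dirRes.arrayBuffer()
  return JSON.parse(new TextDecoder().decode(buf.slice(0, size)))
}

async function fetchFile (archiveUrl, entry) {
  // entry is one [IS_FILE, name, offset, size] record from the central directory
  const [, name, offset, size] = entry
  const res = await fetch(archiveUrl, {
    headers: { Range: `bytes=${offset}-${offset + size - 1}` }
  })
  return new File([await res.blob()], name)
}

Given the central directory, fetchFile can then download a single entry without touching the rest of the archive.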

NodeJS solution

Use the new fs.openAsBlob(path), construct a new File from the blob with a filename, or alternatively use fetch-blob.
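
A minimal sketch of what that could look like (assuming an ESM module on Node 20+, where fs.openAsBlob and the global File class are available; the file paths are made up):

import fs from 'node:fs'
import path from 'node:path'
import { Readable } from 'node:stream'
import { pipeline } from 'node:stream/promises'

// Turn a path on disk into a File without pulling its content into memory
async function fileFromPath (p) {
  const blob = await fs.openAsBlob(p)
  return new File([blob], path.basename(p))
}

const tree = [
  await fileFromPath('./sample.txt'),
  { name: 'Sub Folder', children: [await fileFromPath('./xyz.txt')] }
]

// pack() is the same function as in the browser example above
const archive = pack(tree)

// Stream the resulting blob to disk without buffering it all in memory
await pipeline(Readable.fromWeb(archive.stream()), fs.createWriteStream('myfiles.archive'))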

(This is more like a blog post/article; I posted it here since I don't have a blog and gists support syntax highlighting and markdown.)

jimmywarting commented Jul 23, 2023

Alternatively, if you want a custom JSON serializer/reviver, you could use something like this:

function pack(json) {
  /** @type {File[]} */
  const files = [];
  let offset = 0;

  let data = JSON.stringify(json, (key, value) => {
    if (value instanceof File) {
      files.push(value);
      const fileInfo = {
        offset,
        size: value.size,
        name: value.name,
        type: value.type,
        lastModified: value.lastModified,
      };
      // Increment the offset for the next file
      offset += value.size;
      return { $file: fileInfo };
    }
    return value;
  });

  const binaryData = new Blob([data]);
  const size = new Uint32Array(1);
  new DataView(size.buffer).setUint32(0, binaryData.size, true);
  return new Blob([...files, binaryData, size]);
}

/**
 * @param {Blob} data
 */
async function unpack(data) {
  // Read the size of the binary data (4 bytes)
  const jsonSizeArrayBuffer = await data.slice(-4).arrayBuffer();
  const jsonSize = new DataView(jsonSizeArrayBuffer).getUint32(0, true);
  const json = await data.slice(data.size - 4 - jsonSize, -4).text();

  // Parse the JSON data and recreate the objects with files
  const result = JSON.parse(json, (key, value) => {
    if (typeof value === 'object' && value !== null && '$file' in value) {
      const fileInfo = value.$file;
      const { offset, size, name, type, lastModified } = fileInfo;
      const fileBlob = data.slice(offset, offset + size);
      // Reconstruct the File object
      const file = new File([fileBlob], name, { type, lastModified });
      return file;
    }
    return value;
  });

  return result;
}

// Test example
const foo = {
  foundation: "Mozilla",
  model: "box",
  week: 45,
  transport: "car",
  month: 7,
  data: new File(['Hello, World!'], 'foo.txt', { type: 'text/plain' })
};

async function test() {
  const packedData = pack(foo);
  console.log('Packed Data:', packedData);

  const unpackedData = await unpack(packedData);
  console.log('Unpacked Data:', unpackedData);
}

test();

Instead of returning a JSON string, it returns a Blob, to avoid reading the content of the files.

itsbrex commented Aug 15, 2023

Great write up!

@furkankadioglu

Great tutorial, thanks Jimmy! I have a question for you:
How can I store it as a file? When I save it with arrayBuffer, it won't unpack anymore.
