@donmccurdy
Last active September 21, 2021 17:06
Meshopt Compression and Tabular Data

Evaluation: Meshopt Compression with Tabular Data

Summary: Short evaluation of Meshopt compression for use with tabular data. For the purposes of this evaluation, tabular data is defined as a dataset having many observations ("rows"), where each observation consists of one or more property values from a common schema.

Steps

I downloaded the results of the 2015 NYC Tree Census from BigQuery's NYC Street Trees public dataset, also available through NYC Open Data. The 2015 census consists of 683,788 rows and 41 columns, and is about 500 MB when exported as JSON. Much of that data is text, and because meshopt is designed for numeric input, we have good reason to believe results for the numeric columns will be "at least as good" as results for the dataset as a whole. For a simpler evaluation, and to establish an upper bound on compression ratio, only numeric columns were included in this evaluation.

Using the attached script main.js, we compute a series of estimates for individual columns, and for the entire dataset.

For each column:

  • Uncompressed byte size in smallest appropriate binary format
  • Meshopt-compressed byte size
  • Gzip-compressed byte size
  • Meshopt+Gzip-compressed byte size
  • Meshopt decoding time

For the entire dataset:

  • Total uncompressed byte size
  • Total gzip-compressed byte size
  • Total meshopt-compressed byte size
    • (a) with columns compressed in isolation
    • (b) with columns interleaved before compression
  • Total meshopt-and-gzip-compressed byte size
    • (a) with columns compressed in isolation
    • (b) with columns interleaved before compression

The purpose of evaluating columns compressed both interleaved and in isolation is to better understand the pros and cons of accessor- and bufferView-based storage for tabular data in 3D Tiles and glTF. The current draft specification stores columns using one buffer view apiece, preventing interleaving. Accessor storage would allow interleaving, which benefits both compression size and decoding speed, at least for triangle mesh data. Whether those benefits extend to tabular data is unknown, and is an outcome of interest for this evaluation.

Lossy filters were excluded from evaluation; all encoding below is lossless. Exponential filters may be appropriate for some column types, and should be considered in any production implementation. I see no reason to believe that filters would disproportionately benefit either interleaved or isolated arrays.
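
For reference, a filtered encoding of one float32 column might look like the sketch below. This assumes the encodeFilterExp and 'EXPONENTIAL' filter options exposed by the meshoptimizer JS API; no filters were used for any results in this document, and the column name and bit depth here are illustrative only.

// Sketch: lossy exponential filter on one float32 column, then meshopt.
const count = 683788;
const latitude = new Float32Array(count); // ... filled from the dataset

// Quantize to ~15 significant bits (lossy); stride is 4 bytes per scalar.
const filtered = MeshoptEncoder.encodeFilterExp(latitude, count, 4, 15);
const compressed = MeshoptEncoder.encodeVertexBuffer(filtered, count, 4);

// Decoding applies the inverse filter when the filter is named explicitly.
const target = new Uint8Array(count * 4);
MeshoptDecoder.decodeVertexBuffer(target, count, 4, compressed, 'EXPONENTIAL');
const decoded = new Float32Array(target.buffer);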

Meshopt conforms to the padding requirements of GPU APIs; each array element must align to 4-byte boundaries. For scalar uint8[] columns this requires 3 bytes of padding per 1 byte of data, which mostly disappears in compressed storage but is a significant increase in uncompressed memory. That additional padding is included in meshopt compression results, and omitted from uncompressed and gzip-only results.
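
For example, a five-element uint8 column grows from 5 to 20 bytes when padded; this is what padArray in the attached main.js does:

const src = new Uint8Array([3, 1, 4, 1, 5]); // 5 bytes of data
const dst = new Uint8Array(src.length * 4);  // 20 bytes after padding
for (let i = 0; i < src.length; i++) {
  dst[i * 4] = src[i]; // each value is followed by 3 zero padding bytes
}
// dst = [3,0,0,0, 1,0,0,0, 4,0,0,0, 1,0,0,0, 5,0,0,0]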

More details about meshopt encoding are available in the meshoptimizer package documentation.

Results

After several executions of the script on this dataset, the table of results below appears typical. The difference in total decoding time (5% faster with interleaved columns) was not consistent across trials. With many more columns, or many fewer rows, interleaving might take better advantage of SIMD and the difference could become significant, but that effect was not observed here.

All sizes below are given in bytes:

┌─────────────────────┬──────────────┬───────────┬───────────┬──────────────┬─────────────┐
│ prop                │ uncompressed │   meshopt │      gzip │ meshopt+gzip │ decode (ms) │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ tree_id (uint32)    │    2,735,152 │ 1,218,851 │ 1,349,885 │    1,127,803 │        11.9 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ block_id (uint32)   │    2,735,152 │   853,827 │   700,396 │      755,100 │         9.8 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ tree_dbh (uint16)   │    1,367,576 │   598,936 │   520,736 │      503,556 │         8.8 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ stump_diam (uint8)  │      683,788 │   125,769 │    36,136 │       51,217 │         7.5 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ cb_num (uint16)     │    1,367,576 │    53,548 │     8,559 │        7,968 │         6.8 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ borocode (uint8)    │      683,788 │    42,854 │       753 │          200 │         6.7 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ cncldist (uint8)    │      683,788 │    71,408 │    17,535 │       19,386 │         6.8 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ st_assem (uint8)    │      683,788 │   104,110 │    30,896 │       37,327 │         8.8 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ st_senate (uint8)   │      683,788 │    87,424 │    24,164 │       27,681 │         6.8 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ boro_ct (uint32)    │    2,735,152 │    65,061 │    11,364 │       14,148 │         7.1 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ latitude (float32)  │    2,735,152 │ 1,041,441 │ 1,436,100 │      994,986 │         7.8 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ longitude (float32) │    2,735,152 │   990,308 │ 1,354,792 │      937,450 │         6.8 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ x_sp (float32)      │    2,735,152 │ 1,455,694 │ 1,825,429 │    1,401,403 │         6.5 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ y_sp (float32)      │    2,735,152 │ 1,608,317 │ 2,054,044 │    1,569,914 │         6.9 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ TOTAL (SEPARATE)    │   25,300,156 │ 8,317,548 │ 9,370,789 │    7,448,139 │       109.0 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ TOTAL (COMBINED)    │              │ 8,146,167 │           │    7,607,325 │       103.8 │
└─────────────────────┴──────────────┴───────────┴───────────┴──────────────┴─────────────┘

To summarize the results: meshopt compression alone produces smaller output than gzip alone, and meshopt+gzip does better still. Appropriate use of exponential filters may improve the average compression ratio further. However, not all results are unambiguously positive:

  • For smaller storage types (uint8, uint16), meshopt increases size compared to gzip alone, and meshopt+gzip is sometimes, but not always, smaller than gzip alone. This is likely a result of the required 4-byte element alignment.
  • No significant difference in compression ratio is apparent between interleaved and isolated columns: interleaved does somewhat better before gzip, isolated somewhat better after. The difference seems small enough to be case-dependent, but it is also quite possible that gzip achieves better results on isolated arrays because they contain tightly packed runs of similar values. We can conjecture that tabular data is more likely to have such runs than triangle mesh data.
  • No significant difference in decoding speed is apparent between interleaved and isolated columns. Such a difference would likely appear with a high enough columns-to-rows ratio, but large row counts seem like the more common scenario, as in this dataset.

Compression ratios are about -70% for Meshopt+Gzip (7,448,139 / 25,300,156 ≈ 0.29) and -63% for Gzip alone (9,370,789 / 25,300,156 ≈ 0.37). Use of lossy encoding filters would improve the Meshopt compression ratio further — in earlier tests on triangle mesh data, the improvement was about -15% on average, compared to lossless Meshopt compression.

Recommendations

While I'd hoped that the test would indicate a conclusive advantage for interleaving property arrays, in either compression ratio or decoding speed, it does not. However, implementing this did highlight the requirement of 4-byte alignment for array elements, which cannot be satisfied under the current bufferView-based storage plan. As a result, use of meshopt compression with EXT_feature_metadata may require either (a) refactoring the schema to use accessor-based storage that supports padding, or (b) repacking the data into 4-byte array types, like Int32Array or Float32Array, when users want to take advantage of Meshopt.

main.js

import fs from 'fs';
import path from 'path';
import ndjson from 'ndjson';
import gzipSize from 'gzip-size';
import Table from 'cli-table';
import { performance } from 'perf_hooks';
import { MeshoptDecoder, MeshoptEncoder } from 'meshoptimizer';
import { createTable, schema } from './schema.js';

await Promise.all([MeshoptDecoder.ready, MeshoptEncoder.ready]);

const ROW_COUNT = 683788;

let rowCount = 0;
const table = createTable(ROW_COUNT);

// Stream the NDJSON export row by row, parsing each numeric column
// into its preallocated typed array.
fs.createReadStream(path.resolve('./new_york_tree_census_2015.json'))
  .pipe(ndjson.parse())
  .on('data', (row) => {
    for (const prop in table) {
      table[prop][rowCount] = schema[prop].parse(row[prop]);
    }
    rowCount++;
  })
  .on('end', main)
  .on('error', error);

function main() {
  console.log(`Parsed ${rowCount} rows, ${Object.keys(table).length} columns.`);

  const totalBytes = new Array(4).fill(0);
  let totalDecodeMillis = 0;

  const results = new Table({
    head: ['prop', 'uncompressed', 'meshopt', 'gzip', 'meshopt+gzip', 'decode (ms)'],
    colAligns: ['left', 'right', 'right', 'right', 'right', 'right'],
  });

  // Per-column sizes: uncompressed, meshopt, gzip, meshopt+gzip.
  for (const prop in table) {
    const propBytes = new Array(4).fill(0);
    const {compressed, decodeMillis} = compressColumn(table[prop]);
    totalBytes[0] += (propBytes[0] = table[prop].byteLength);
    totalBytes[1] += (propBytes[1] = compressed.byteLength);
    totalBytes[2] += (propBytes[2] = gzipSize.sync(Buffer.from(table[prop].buffer)));
    totalBytes[3] += (propBytes[3] = gzipSize.sync(compressed));
    totalDecodeMillis += decodeMillis;
    results.push([
      `${prop} (${formatType(table[prop])})`,
      ...propBytes.map(formatLong),
      formatMillis(decodeMillis),
    ]);
  }

  results.push(['TOTAL (SEPARATE)', ...totalBytes.map(formatLong), formatMillis(totalDecodeMillis)]);

  // Whole-table sizes, with all columns interleaved before compression.
  const {compressed, decodeMillis} = compressTable(table);
  const compressedGzipped = gzipSize.sync(compressed);
  results.push([
    'TOTAL (COMBINED)',
    '',
    formatLong(compressed.byteLength),
    '',
    formatLong(compressedGzipped),
    formatMillis(decodeMillis),
  ]);

  console.log(results.toString());
  process.exit(0);
}

function error(e) {
  console.error(e);
  console.error(`Error after ${rowCount} rows.`);
  process.exit(1);
}

// -------- COMPRESSION HELPERS --------

/** Rounds a byte length up to the nearest multiple of 4. */
function pad(n) {
  return Math.ceil(n / 4) * 4;
}

/** Pads each element of a uint8/uint16 array to a 4-byte stride. */
function padArray(src) {
  if (src.BYTES_PER_ELEMENT === 4) {
    return src;
  } else if (src.BYTES_PER_ELEMENT > 4) {
    throw new Error('64-bit types not implemented.');
  }
  const dstStride = 4 / src.BYTES_PER_ELEMENT;
  const dst = new src.constructor(src.length * dstStride);
  for (let i = 0; i < src.length; i++) {
    dst[i * dstStride] = src[i];
  }
  return dst;
}

/** Compresses a single column in isolation, and times its decode. */
function compressColumn(array) {
  // TODO(confirm): Output and check RMSE.
  const paddedArray = padArray(array);
  const compressed = MeshoptEncoder.encodeVertexBuffer(paddedArray, ROW_COUNT, 4);
  const target = new Uint8Array(paddedArray.byteLength);
  const t0 = performance.now();
  MeshoptDecoder.decodeVertexBuffer(target, ROW_COUNT, 4, compressed);
  const decodeMillis = performance.now() - t0;
  return {compressed, decodeMillis};
}

/** Interleaves all columns into one buffer, compresses it, and times its decode. */
function compressTable(table) {
  let bytesPerRow = 0;
  for (const prop in table) {
    bytesPerRow += table[prop].BYTES_PER_ELEMENT;
  }
  bytesPerRow = pad(bytesPerRow);

  const tableBuffer = new ArrayBuffer(ROW_COUNT * bytesPerRow);
  const tableView = new DataView(tableBuffer);

  // Write each column into its interleaved position within each row.
  // Note: DataView accessors default to big-endian byte order.
  let rowByteOffset = 0;
  for (const prop in table) {
    for (let i = 0; i < ROW_COUNT; i++) {
      const byteOffset = i * bytesPerRow + rowByteOffset;
      const value = table[prop][i];
      switch (table[prop].constructor) {
        case Float32Array:
          tableView.setFloat32(byteOffset, value);
          break;
        case Uint32Array:
          tableView.setUint32(byteOffset, value);
          break;
        case Uint16Array:
          tableView.setUint16(byteOffset, value);
          break;
        case Uint8Array:
          tableView.setUint8(byteOffset, value);
          break;
        default:
          throw new Error('Unsupported array type.');
      }
    }
    rowByteOffset += table[prop].BYTES_PER_ELEMENT;
  }

  // TODO(confirm): Output and check RMSE.
  const compressed = MeshoptEncoder.encodeVertexBuffer(new Uint8Array(tableBuffer), ROW_COUNT, bytesPerRow);
  const target = new Uint8Array(tableBuffer.byteLength);
  const t0 = performance.now();
  MeshoptDecoder.decodeVertexBuffer(target, ROW_COUNT, bytesPerRow, compressed);
  const decodeMillis = performance.now() - t0;
  return {compressed, decodeMillis};
}

// -------- FORMATTING HELPERS --------

function formatLong(value) {
  return value.toLocaleString();
}

function formatMillis(ms) {
  return ms.toFixed(1);
}

// (Unused in this script.)
function formatBytes(bytes, decimals = 2) {
  if (bytes === 0) return '0 Bytes';
  const k = 1024;
  const dm = decimals < 0 ? 0 : decimals;
  const sizes = ['Bytes', 'KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'];
  const i = Math.floor(Math.log(bytes) / Math.log(k));
  return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + sizes[i];
}

function formatType(array) {
  return array.constructor.name.replace('Array', '').toLowerCase();
}

package.json

{
  "type": "module",
  "dependencies": {
    "cli-table": "^0.3.6",
    "gzip-size": "^6.0.0",
    "meshoptimizer": "^0.16.1",
    "ndjson": "^2.0.0"
  }
}

schema.js

// Column schema for the 2015 tree census. Columns with a typed `Array`
// are numeric and participate in the evaluation; string and date columns
// are defined but unused here.
export const schema = {
  tree_id: { parse: Number, Array: Uint32Array, required: true },
  block_id: { parse: Number, Array: Uint32Array },
  created_at: { parse: (v) => new Date(v).toLocaleDateString('en-US') },
  tree_dbh: { parse: Number, Array: Uint16Array },
  stump_diam: { parse: Number, Array: Uint8Array },
  curb_loc: { parse: String },
  status: { parse: String },
  health: { parse: String },
  spc_latin: { parse: String },
  spc_common: { parse: String },
  steward: { parse: String },
  guards: { parse: String },
  sidewalk: { parse: String },
  user_type: { parse: String },
  problems: { parse: String },
  root_stone: { parse: String },
  root_grate: { parse: String },
  root_other: { parse: String },
  trunk_wire: { parse: String },
  trnk_light: { parse: String },
  trnk_other: { parse: String },
  brch_light: { parse: String },
  brch_shoe: { parse: String },
  brch_other: { parse: String },
  address: { parse: String },
  zipcode: { parse: String },
  zip_city: { parse: String },
  cb_num: { parse: Number, Array: Uint16Array },
  borocode: { parse: Number, Array: Uint8Array },
  boroname: { parse: String },
  cncldist: { parse: Number, Array: Uint8Array },
  st_assem: { parse: Number, Array: Uint8Array },
  st_senate: { parse: Number, Array: Uint8Array },
  nta: { parse: String },
  nta_name: { parse: String },
  boro_ct: { parse: Number, Array: Uint32Array },
  state: { parse: String },
  latitude: { parse: Number, Array: Float32Array },
  longitude: { parse: Number, Array: Float32Array },
  x_sp: { parse: Number, Array: Float32Array },
  y_sp: { parse: Number, Array: Float32Array },
};

export function createTable(count) {
  const table = {};
  for (const prop in schema) {
    // Preallocate typed arrays for numeric columns only.
    if (schema[prop].parse === Number) {
      table[prop] = new schema[prop].Array(count);
    }
  }
  return table;
}

@zeux commented Sep 21, 2021

One thing I'd recommend trying when compressing arrays of uint8 or uint16 is artificially setting the stride to 4 (bytes), and adjusting the count accordingly. E.g. to compress a sequence of 7 uint8 numbers you can set the stride to 4, count to 2, and add a dummy padding element at the end.

In general the results here make sense to me. For cases where you see degradation in compression when meshopt+gzip is used instead of gzip, it looks like the data is highly redundant, so gzip has a very high degree of compression. meshopt assumes data is related but not necessarily redundant, and doesn't have a way to disable delta encoding, which may break up the structure a bit in some cases. Maybe there are some other, subtler effects here, or maybe the meshopt codec just needs to be tuned a bit more on this type of data.

One other note re: interleaving is that it restricts you a bit if you want to use filters, but I don't think it's very important in this particular case, because here when the data is integer you shouldn't use any filters, and if it's floating point then using exponential filters doesn't prevent interleaving, as long as only floating point data is interleaved with itself.

@zeux commented Sep 21, 2021

P.S. If the point of this exercise is to figure out whether a future extension should use accessors or not, then it looks to me like compression-wise the answer is "it doesn't really matter one way or the other" :) Assuming my suggestion wrt uint8/uint16 works fine which I think it should -- it's not super effective for morph targets but that's for a very specific reason that I don't think would apply to tabular data generally.

@zeux commented Sep 21, 2021

P.P.S. And yeah for gzip without meshopt it's common to see better results on deinterleaved data because gzip can take advantage of repetition or different statistics in different parts of the stream. Although there are also cases where it can be the reverse because the cost of repeating the entire row wholesale is much higher with deinterleaved data because you need to spend match bits several times. Once you apply meshopt compression the difference should largely disappear because meshopt codecs internally do bytewise deinterleaving.

@donmccurdy (Author)

Thank you @zeux! As you mentioned, the original prompt for this was to determine whether an extension for tabular data would have technical reasons to prefer accessors, and that doesn't seem to be necessary. This isn't a Khronos extension, so a bit of adjustment or custom application of the meshopt codec might not be out of the question down the road; we're just trying to lay the groundwork properly. I tested your suggestion of setting the stride artificially and padding the end of the array rather than every element (a sketch follows the list below), and the results are quite interesting.

  • Significant decrease in pre-gzip size of uint8 and uint16 columns, as expected
  • Mixed results in post-gzip size of each column, overall a modest decrease
  • Significant improvement in decoding time
  • Implementation of combined / interleaved case was not changed, so no difference there
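
A sketch of one way to apply that change in compressColumn: a hypothetical compressColumnPacked keeps elements tightly packed and pads only the tail of the array. This is illustrative, not necessarily the exact code used to produce the results below.

function compressColumnPacked(array) {
  // Present the tightly packed bytes as `paddedByteLength / 4` elements of
  // stride 4, rather than one 4-byte-padded element per row.
  const bytes = new Uint8Array(array.buffer, array.byteOffset, array.byteLength);
  const paddedByteLength = Math.ceil(bytes.byteLength / 4) * 4;
  const padded = new Uint8Array(paddedByteLength); // zero-filled tail padding
  padded.set(bytes);
  return MeshoptEncoder.encodeVertexBuffer(padded, paddedByteLength / 4, 4);
}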

Full results below:

┌─────────────────────┬──────────────┬───────────┬───────────┬──────────────┬─────────────┐
│ prop                │ uncompressed │   meshopt │      gzip │ meshopt+gzip │ decode (ms) │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ tree_id (uint32)    │    2,735,152 │ 1,218,851 │ 1,349,885 │    1,127,803 │        12.6 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ block_id (uint32)   │    2,735,152 │   853,827 │   700,396 │      755,100 │         9.9 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ tree_dbh (uint16)   │    1,367,576 │   592,008 │   520,736 │      512,688 │         4.4 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ stump_diam (uint8)  │      683,788 │   100,963 │    36,136 │       48,990 │         2.1 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ cb_num (uint16)     │    1,367,576 │    34,982 │     8,559 │        9,044 │         3.7 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ borocode (uint8)    │      683,788 │    10,969 │       753 │          196 │         2.4 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ cncldist (uint8)    │      683,788 │    53,999 │    17,535 │       23,344 │         2.0 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ st_assem (uint8)    │      683,788 │    93,140 │    30,896 │       43,615 │         2.1 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ st_senate (uint8)   │      683,788 │    71,434 │    24,164 │       32,523 │         2.6 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ boro_ct (uint32)    │    2,735,152 │    65,061 │    11,364 │       14,148 │         7.3 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ latitude (float32)  │    2,735,152 │ 1,041,441 │ 1,436,100 │      994,986 │         6.6 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ longitude (float32) │    2,735,152 │   990,308 │ 1,354,792 │      937,450 │         8.3 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ x_sp (float32)      │    2,735,152 │ 1,455,694 │ 1,825,429 │    1,401,403 │         7.9 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ y_sp (float32)      │    2,735,152 │ 1,608,317 │ 2,054,044 │    1,569,914 │         6.5 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ TOTAL (SEPARATE)    │   25,300,156 │ 8,190,994 │ 9,370,789 │    7,471,204 │        78.5 │
├─────────────────────┼──────────────┼───────────┼───────────┼──────────────┼─────────────┤
│ TOTAL (COMBINED)    │              │ 8,146,167 │           │    7,607,325 │       110.0 │
└─────────────────────┴──────────────┴───────────┴───────────┴──────────────┴─────────────┘

@zeux commented Sep 21, 2021

Cool thanks. The practical implication of the increased stride is that the delta compression works in gaps of 4 or 2, instead of being sequential. Which is often not that big of a deal.

If you're feeling adventurous you could compare it with a stride=1 or stride=2:

  • You'll need to change the .js wrappers to remove the assertions that check stride alignment
  • You'll need to disable SIMD in the decoder .js

I think the scalar implementation of decoder, and the encoder, will work correctly with stride=1 after that.
