diff --git a/README.md b/README.md index 7827f08..29a64aa 100644 --- a/README.md +++ b/README.md @@ -141,6 +141,10 @@ The actual memory usage can be as low as a half of the specified due to the inte **Model base divisor** (CLI `-Zmd|--model-base-divisor DIVISOR`, API `modelRecipBaseCount` in the options object) adjusts how fast should individual contexts adapt *initially*, where larger is faster. The optimal value typically ranges from 10 to 100 for JS code inputs. +**Dynamic model flags** (CLI `-Zdy|--dynamic-models FLAGS`, API `dynamicModels` in the options object) are used to enable or disable specific dynamic models, where each bit is turned on if the model is in use. The value of -1 is specially recognized as a default. There is currently one supported model: + +* The bit 0 (value 1) models quoted strings (', " or \`) and works well for source codes. It assumes that every quotes are paired, so it can't be used in English texts with contractions (e.g. isn't) and turned off by default in non-JS inputs. + **Number of abbreviations** (CLI `-Zab|--num-abbreviations NUM`, API `numAbbreviations` in the options object) affects the preprocessing for JS code inputs. Common identifiers and reserved words can be abbreviated to single otherwise unused bytes during the preprocessing; this lessens the burden of context modelling which can only look at the limited number of past bytes. If this parameter is less than the number of allowable abbreviations some identifiers will be left as is, which can sometimes improve the compression. ### Tips and Tricks diff --git a/cli.mjs b/cli.mjs index 911403a..f3772eb 100644 --- a/cli.mjs +++ b/cli.mjs @@ -46,7 +46,6 @@ Output options: 0 Use the baseline parameters. Default when any optimizable arguments are given. 1 Tries to optimize -S and most -Z arguments with ~30 attempts. - Also tries to replace "-t js" with "-t text" if beneficial. Default when no optimizable arguments are given. 2 Same to -O1 but with ~300 attempts. 
Anything beyond -O0 prints the best parameters unless -q is given. @@ -74,6 +73,12 @@ Output options: -Zco|--context-bits BITS [Range: 1..24+, Default: derived] Sets the size of each context model, as opposed to the total size (-M). The maximum can range from 24 to 30 depending on the number of contexts. +-Zdy|--dynamic-models FLAGS [Default: 1 for JS, 0 for others] + Enables or disables specific dynamic models. + The value is a bitwise OR of the following bits: + 1 Puts quoted strings (', ", \`) into a separate context. + This is only useful when every quotes are paired, so it + can't be used in English texts with contractions (e.g. isn't). -Zlr|--learning-rate RATE [Range: 1..2^53, Default: 500] Configures the learning rate of context mixer; smaller adapts faster. -Zmc|--model-max-count COUNT [Range: 1..32767, Default: 5] @@ -218,6 +223,9 @@ async function parseArgs(args) { if (options.contextBits !== undefined) throw 'duplicate --context-bits arguments'; options.contextBits = parseInt(getArg(m), 10); // -Zco is not optimizable, so its use doesn't change -O defaults + } else if (m = matchOptArg('dynamic-models', 'Zdy')) { + if (options.dynamicModels !== undefined) throw 'duplicate --dynamic-models arguments'; + options.dynamicModels = parseInt(getArg(m), 10); } else if (m = matchOptArg('learning-rate', 'Zlr')) { if (options.recipLearningRate !== undefined) throw 'duplicate --learning-rate arguments'; options.recipLearningRate = parseInt(getArg(m), 10); @@ -278,6 +286,9 @@ async function parseArgs(args) { if (options.numAbbreviations !== undefined && !between(0, options.numAbbreviations, 64)) { throw 'invalid --num-abbreviations argument'; } + if (options.dynamicModels !== undefined && !between(0, options.dynamicModels, 1)) { + throw 'invalid --dynamic-models argument'; + } if (options.recipLearningRate !== undefined && !between(1, options.recipLearningRate, 2**53)) { throw 'invalid --learning-rate argument'; } @@ -343,12 +354,12 @@ async function compress({ 
inputs, options, optimize, outputPath, verbose }) { if (typeof combined.recipLearningRate === 'number') { args = `-Zlr${combined.recipLearningRate} ${args}`; } + if (typeof combined.dynamicModels === 'number') { + args = `-Zdy${combined.dynamicModels} ${args}`; + } if (typeof combined.numAbbreviations === 'number') { args = `-Zab${combined.numAbbreviations} ${args}`; } - if (combined.preferTextOverJS) { - args = `-t text ${args}`; - } return args; }; diff --git a/index.d.ts b/index.d.ts index 7acd2b5..71f4ac7 100644 --- a/index.d.ts +++ b/index.d.ts @@ -173,6 +173,10 @@ export interface Input { action: InputAction; } +export const enum DynamicModelFlags { + Quotes = 1 << 0, +} + export interface PackerOptions { sparseSelectors?: number[]; maxMemoryMB?: number; @@ -185,6 +189,7 @@ export interface PackerOptions { arrayBufferPool?: ResourcePool; recipLearningRate?: number; numAbbreviations?: number; + dynamicModels?: number; // bit flags out of DynamicModelFlags allowFreeVars?: boolean; } @@ -195,7 +200,9 @@ export interface OptimizedPackerOptions { modelRecipBaseCount?: number; recipLearningRate?: number; numAbbreviations?: number; + /** @deprecated Replaced by {@link OptimizedPackerOptions.dynamicModels}, no longer used */ preferTextOverJS?: boolean; + dynamicModels?: number; } export class Packer { diff --git a/index.mjs b/index.mjs index 5b97bb7..d898fca 100644 --- a/index.mjs +++ b/index.mjs @@ -681,6 +681,8 @@ const contextBitsFromMaxMemory = options => { // the threshold is implementation-defined, but 2^16 - epsilon seems common. const TEXT_DECODER_THRESHOLD = 65000; +const DYN_MODEL_QUOTES = 1; + export class Packer { constructor(inputs, options = {}) { this.options = { @@ -693,6 +695,7 @@ export class Packer { contextBits: options.contextBits, resourcePool: options.resourcePool || options.arrayBufferPool || new ResourcePool(), numAbbreviations: typeof options.numAbbreviations === 'number' ? 
options.numAbbreviations : 64, + dynamicModels: options.dynamicModels, allowFreeVars: options.allowFreeVars, disableWasm: options.disableWasm, }; @@ -734,6 +737,10 @@ export class Packer { if (inputs.length !== 1 || !['js', 'text'].includes(inputs[0].type) || !['eval', 'write', 'console', 'return'].includes(inputs[0].action)) { throw new Error('Packer: this version of Roadroller supports exactly one JS or text input, please stay tuned for more!'); } + + if (this.options.dynamicModels === undefined) { + this.options.dynamicModels = inputs[0].type === 'js' ? DYN_MODEL_QUOTES : 0; + } } get memoryUsageMB() { @@ -752,15 +759,22 @@ export class Packer { } } - static prepareJs(inputs, { numAbbreviations }) { + static prepareJs(inputs, { dynamicModels, numAbbreviations }) { + const modelQuotes = dynamicModels & DYN_MODEL_QUOTES; + // we strongly avoid a token like 'this\'one' because the context model doesn't // know about escapes and anything after that would be suboptimally compressed. // we can't still avoid something like `foo${`bar`}quux`, where `bar` would be - // suboptimall compressed, but at least we will return to the normal state at the end. - const reescape = (s, pattern) => - s.replace( - new RegExp(`\\\\?(${pattern})|\\\\.`, 'g'), - (m, q) => q ? '\\x' + q.charCodeAt(0).toString(16).padStart(2, '0') : m); + // suboptimally compressed, but at least we will return to the normal state at the end. + const reescape = (s, pattern) => { + if (modelQuotes) { + return s.replace( + new RegExp(`\\\\?(${pattern})|\\\\.`, 'g'), + (m, q) => q ? '\\x' + q.charCodeAt(0).toString(16).padStart(2, '0') : m); + } else { + return s; + } + }; const identFreqs = new Map(); const inputTokens = []; @@ -828,7 +842,7 @@ export class Packer { for (let i = 0; i < 128; ++i) { // even though there might be no whitespace in the tokens, // we may have to need some space between two namelike tokens later. 
- if (![32, 34, 39, 96].includes(i)) unseenChars.add(String.fromCharCode(i)); + if (i !== 32 && !(modelQuotes && [34, 39, 96].includes(i))) unseenChars.add(String.fromCharCode(i)); } for (const tokens of inputTokens) { for (const token of tokens) { @@ -945,7 +959,7 @@ export class Packer { const inBits = combinedInput.every(c => c <= 0x7f) ? 7 : 8; const outBits = 6; // TODO again, this should be controlled dynamically - const modelQuotes = preparedJs.code.length > 0; + const modelQuotes = !!(options.dynamicModels & DYN_MODEL_QUOTES); const { sparseSelectors, precision, modelMaxCount, modelRecipBaseCount, @@ -1331,24 +1345,17 @@ export class Packer { const performance = await getPerformanceObject(); const copy = v => JSON.parse(JSON.stringify(v)); - const cache = new Map(); // `${preferTextOverJS | 0},${numAbbreviations}` -> { preparedText, preparedJs } + const cache = new Map(); // `${dynamicModels},${numAbbreviations}` -> { preparedText, preparedJs } const mainInputAction = (this.inputsByType['text'] || this.inputsByType['js'])[0].action; let maxAbbreviations = -1; const calculateSize = current => { const options = { ...this.options, ...current }; - const key = `${options.preferTextOverJS},${options.numAbbreviations}`; + const key = `${options.dynamicModels},${options.numAbbreviations}`; if (!cache.has(key)) { - let textInputs = this.inputsByType['text'] || []; - let jsInputs = this.inputsByType['js'] || []; - if (current.preferTextOverJS) { - textInputs = [...textInputs, ...jsInputs.map(input => ({ ...input, type: 'text' }))]; - jsInputs = []; - } - - const preparedText = Packer.prepareText(textInputs, options); - const preparedJs = Packer.prepareJs(jsInputs, options); + const preparedText = Packer.prepareText(this.inputsByType['text'] || [], options); + const preparedJs = Packer.prepareJs(this.inputsByType['js'] || [], options); cache.set(key, { preparedText, preparedJs }); } @@ -1469,15 +1476,18 @@ export class Packer { }); if (best.modelMaxCount === 
this.options.modelMaxCount) delete best.modelMaxCount; + // optimize dynamicModels + for (let i = 0; i < 2; ++i) { + await updateBestAndReportProgress({ ...best, dynamicModels: i }, 'dynamicModels', i / 2); + } + if (best.dynamicModels === this.options.dynamicModels) delete best.dynamicModels; + // optimize numAbbreviations await search(0, maxAbbreviations, LINEAR, [0, 16, 32, 64], async (i, ratio) => { return await updateBestAndReportProgress({ ...best, numAbbreviations: i }, 'numAbbreviations', ratio); }); if (best.numAbbreviations === this.options.numAbbreviations) delete best.numAbbreviations; - // try to switch the JS input to text if any - await updateBestAndReportProgress({ ...best, preferTextOverJS: true }, 'preferTextOverJS'); - // optimize sparseSelectors by simulated annealing let current = this.options.sparseSelectors.slice(); let currentSize = bestSize; @@ -1523,14 +1533,6 @@ export class Packer { // apply the final result to this this.options = { ...this.options, ...best }; - if (best.preferTextOverJS && (this.inputsByType['text'] || this.inputsByType['js'])) { - this.inputsByType['text'] = [ - ...this.inputsByType['text'] || [], - ...(this.inputsByType['js'] || []).map(input => ({ ...input, type: 'text' })), - ]; - delete this.inputsByType['js']; - } - return { elapsedMsecs: performance.now() - searchStart, best, bestSize }; } } diff --git a/test.mjs b/test.mjs index 93fe97d..e1b95bf 100644 --- a/test.mjs +++ b/test.mjs @@ -435,21 +435,29 @@ test('abbreviations', t => { }); test('reescaping', t => { - t.is(packAndReturn('"asdf\'asdf"'), '"asdf\'asdf"'); - t.is(packAndReturn('"asdf\\"asdf"'), '"asdf\\x22asdf"'); - t.is(packAndReturn('"asdf\\\'asdf"'), '"asdf\\\'asdf"'); - t.is(packAndReturn('"asdf\\\"asdf"'), '"asdf\\x22asdf"'); - t.is(packAndReturn('"asdf\\\\\'asdf"'), '"asdf\\\\\'asdf"'); - t.is(packAndReturn("'asdf\"asdf'"), "'asdf\"asdf'"); - t.is(packAndReturn("'asdf\\'asdf'"), "'asdf\\x27asdf'"); - t.is(packAndReturn("'asdf\\\"asdf'"), 
"'asdf\\\"asdf'"); - t.is(packAndReturn("'asdf\\\'asdf'"), "'asdf\\x27asdf'"); - t.is(packAndReturn("'asdf\\\\\"asdf'"), "'asdf\\\\\"asdf'"); - t.is(packAndReturn('`asdf\\\`asdf`'), '`asdf\\x60asdf`'); - t.is(packAndReturn('`asdf\\\\\\\`asdf`'), '`asdf\\\\\\x60asdf`'); - t.is(packAndReturn('`foo\\\`${`asdf\\\\\\\`asdf`}\\\`bar`'), '`foo\\x60${`asdf\\\\\\x60asdf`}\\x60bar`'); - t.is(packAndReturn('/[\'"`]/g'), '/[\\x27\\x22\\x60]/g'); - t.is(packAndReturn('/[\\\'\\"\\`]/g'), '/[\\x27\\x22\\x60]/g'); + const examples = [ + ['"asdf\'asdf"', '"asdf\'asdf"'], + ['"asdf\\"asdf"', '"asdf\\x22asdf"'], + ['"asdf\\\'asdf"', '"asdf\\\'asdf"'], + ['"asdf\\\"asdf"', '"asdf\\x22asdf"'], + ['"asdf\\\\\'asdf"', '"asdf\\\\\'asdf"'], + ["'asdf\"asdf'", "'asdf\"asdf'"], + ["'asdf\\'asdf'", "'asdf\\x27asdf'"], + ["'asdf\\\"asdf'", "'asdf\\\"asdf'"], + ["'asdf\\\'asdf'", "'asdf\\x27asdf'"], + ["'asdf\\\\\"asdf'", "'asdf\\\\\"asdf'"], + ['`asdf\\\`asdf`', '`asdf\\x60asdf`'], + ['`asdf\\\\\\\`asdf`', '`asdf\\\\\\x60asdf`'], + ['`foo\\\`${`asdf\\\\\\\`asdf`}\\\`bar`', '`foo\\x60${`asdf\\\\\\x60asdf`}\\x60bar`'], + ['/[\'"`]/g', '/[\\x27\\x22\\x60]/g'], + ['/[\\\'\\"\\`]/g', '/[\\x27\\x22\\x60]/g'], + ]; + + for (const [input, reescaped] of examples) { + // we don't need to reescape literals if we are not affected by quotes model anyway + t.is(packAndReturn(input, { dynamicModels: 0 }), input); + t.is(packAndReturn(input, { dynamicModels: 1 }), reescaped); + } }); const LONG_ENOUGH_INPUT = 100000; // ...so that an alternative code path is triggered diff --git a/tools/demo.html b/tools/demo.html index 7af6441..e5e9eef 100644 --- a/tools/demo.html +++ b/tools/demo.html @@ -113,6 +113,7 @@
Learning rate adjusts how fast the context mixer adapts, where smaller is faster. The default is 500, which should be fine for long enough inputs. If your demo is smaller than 10 KB you can also try smaller numbers.
Model max count adjusts how fast individual contexts adapt, where smaller is faster. The model adapts fastest when a particular context is first seen, but that process becomes slower as the context is seen multiple times. This parameter limits how slow the adaptation process can become. The default of 5 is specifically tuned for JS code inputs.
Model base divisor adjusts how fast individual contexts adapt initially, where larger is faster. The optimal value typically ranges from 10 to 100 for JS code inputs. +
Dynamic model flags are used to enable or disable specific dynamic models, where each bit is turned on if the model is in use. The value of -1 is specially recognized as a default. There is currently one supported model: +
Number of abbreviations affects the preprocessing for JS code inputs. Common identifiers and reserved words can be abbreviated to single, otherwise unused bytes during the preprocessing; this lessens the burden on context modelling, which can only look at a limited number of past bytes. If this parameter is less than the number of allowable abbreviations, some identifiers will be left as is, which can sometimes improve the compression.
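
The flag handling introduced in the `index.mjs` hunks can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the project's code: `resolveDynamicModels` is a hypothetical helper name that mirrors the default chosen in the `Packer` constructor, and this `reescape` takes the flags as a third argument and is assumed to receive only the body of a literal (without its surrounding quotes). Neither detail is confirmed by the diff.

```javascript
// Bit flag as defined in the diff (DynamicModelFlags.Quotes = 1 << 0).
const DYN_MODEL_QUOTES = 1;

// Hypothetical helper: picks the default when the caller gave no flags,
// i.e. the quotes model is on for JS inputs and off for everything else.
function resolveDynamicModels(dynamicModels, inputType) {
    if (dynamicModels === undefined) {
        return inputType === 'js' ? DYN_MODEL_QUOTES : 0;
    }
    return dynamicModels;
}

// Rewrites every character matched by `pattern` (escaped or not) into a
// \xNN escape so that it no longer looks like a quote to the context model.
// Other escape sequences are left untouched. Nothing is rewritten when the
// quotes model is disabled. Assumption: `s` is the literal's body only.
function reescape(s, pattern, dynamicModels) {
    if (!(dynamicModels & DYN_MODEL_QUOTES)) return s;
    return s.replace(
        new RegExp(`\\\\?(${pattern})|\\\\.`, 'g'),
        (m, q) => q ? '\\x' + q.charCodeAt(0).toString(16).padStart(2, '0') : m);
}

// Usage: the body of the string literal "asdf\"asdf".
const flags = resolveDynamicModels(undefined, 'js');
console.log(reescape('asdf\\"asdf', '"', flags)); // prints asdf\x22asdf
console.log(reescape('asdf\\"asdf', '"', 0));     // prints asdf\"asdf
```

Passing the flags explicitly keeps the sketch free of closures; in the diff itself the equivalent check is captured once as `modelQuotes` inside `prepareJs`.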