Commit

Replace preferTextOverJS with new dynamicModels (-Zdy).
so, preferTextOverJS was a hack made after it was observed
that some JS inputs are better modelled w/o quote handling.
due to its hackish nature there was no separate option, so
it was specially handled and could cause some confusion:
for example, once the input type is set to text in the online
demo, subsequent optimization will never set it back to JS
even when that might be beneficial.

a new dedicated option solves this issue and also allows
quote modelling to be used for texts whenever appropriate.
the old preferTextOverJS is now deprecated.
lifthrasiir committed Sep 13, 2021
1 parent 2c727ea commit 4dd5fe7
Showing 6 changed files with 93 additions and 55 deletions.
4 changes: 4 additions & 0 deletions README.md
@@ -141,6 +141,10 @@ The actual memory usage can be as low as a half of the specified due to the inte

**Model base divisor** (CLI `-Zmd|--model-base-divisor DIVISOR`, API `modelRecipBaseCount` in the options object) adjusts how fast should individual contexts adapt *initially*, where larger is faster. The optimal value typically ranges from 10 to 100 for JS code inputs.

+**Dynamic model flags** (CLI `-Zdy|--dynamic-models FLAGS`, API `dynamicModels` in the options object) enable or disable specific dynamic models; each bit is set if the corresponding model is in use. The value -1 is specially recognized as the default. There is currently one supported model:

+* Bit 0 (value 1) models quoted strings (', " or \`) and works well for source code. It assumes that all quotes are paired, so it can't be used for English text with contractions (e.g. isn't) and is turned off by default for non-JS inputs.
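
As a rough illustration of the flag semantics described above, the following sketch resolves the effective flags for an input. The helper names `resolveDynamicModels` and `usesQuotesModel` are hypothetical and not part of Roadroller's API; the defaulting rule (quotes model on for JS inputs, off otherwise) follows the description above, and treating a negative value as "use the default" is an assumption based on the -1 convention.

```javascript
// Bit 0 (value 1): the quotes model, the only dynamic model described so far.
const DYN_MODEL_QUOTES = 1 << 0;

// Hypothetical helper: returns the effective flags for an input type.
// `undefined` or a negative value (e.g. -1) is treated as "use the default".
function resolveDynamicModels(inputType, dynamicModels) {
    if (dynamicModels === undefined || dynamicModels < 0) {
        return inputType === 'js' ? DYN_MODEL_QUOTES : 0;
    }
    return dynamicModels;
}

// Hypothetical helper: tests whether the quotes model bit is set.
function usesQuotesModel(flags) {
    return (flags & DYN_MODEL_QUOTES) !== 0;
}

console.log(resolveDynamicModels('js'));       // default for JS inputs
console.log(resolveDynamicModels('text', -1)); // default for other inputs
console.log(usesQuotesModel(resolveDynamicModels('text', 1))); // explicitly enabled
```

Keeping the value as a bit mask rather than a boolean leaves room for further models to be added as additional bits.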

**Number of abbreviations** (CLI `-Zab|--num-abbreviations NUM`, API `numAbbreviations` in the options object) affects the preprocessing for JS code inputs. Common identifiers and reserved words can be abbreviated to single otherwise unused bytes during the preprocessing; this lessens the burden of context modelling which can only look at the limited number of past bytes. If this parameter is less than the number of allowable abbreviations some identifiers will be left as is, which can sometimes improve the compression.

### Tips and Tricks
19 changes: 15 additions & 4 deletions cli.mjs
@@ -46,7 +46,6 @@ Output options:
0 Use the baseline parameters.
Default when any optimizable arguments are given.
1 Tries to optimize -S and most -Z arguments with ~30 attempts.
-Also tries to replace "-t js" with "-t text" if beneficial.
Default when no optimizable arguments are given.
2 Same to -O1 but with ~300 attempts.
Anything beyond -O0 prints the best parameters unless -q is given.
@@ -74,6 +73,12 @@ Output options:
-Zco|--context-bits BITS [Range: 1..24+, Default: derived]
Sets the size of each context model, as opposed to the total size (-M).
The maximum can range from 24 to 30 depending on the number of contexts.
+-Zdy|--dynamic-models FLAGS [Default: 1 for JS, 0 for others]
+Enables or disables specific dynamic models.
+The value is a bitwise OR of the following bits:
+1 Puts quoted strings (', ", \`) into a separate context.
+This is only useful when all quotes are paired, so it
+can't be used for English text with contractions (e.g. isn't).
-Zlr|--learning-rate RATE [Range: 1..2^53, Default: 500]
Configures the learning rate of context mixer; smaller adapts faster.
-Zmc|--model-max-count COUNT [Range: 1..32767, Default: 5]
@@ -218,6 +223,9 @@ async function parseArgs(args) {
if (options.contextBits !== undefined) throw 'duplicate --context-bits arguments';
options.contextBits = parseInt(getArg(m), 10);
// -Zco is not optimizable, so its use doesn't change -O defaults
+} else if (m = matchOptArg('dynamic-models', 'Zdy')) {
+if (options.dynamicModels !== undefined) throw 'duplicate --dynamic-models arguments';
+options.dynamicModels = parseInt(getArg(m), 10);
} else if (m = matchOptArg('learning-rate', 'Zlr')) {
if (options.recipLearningRate !== undefined) throw 'duplicate --learning-rate arguments';
options.recipLearningRate = parseInt(getArg(m), 10);
@@ -278,6 +286,9 @@ async function parseArgs(args) {
if (options.numAbbreviations !== undefined && !between(0, options.numAbbreviations, 64)) {
throw 'invalid --num-abbreviations argument';
}
+if (options.dynamicModels !== undefined && !between(0, options.dynamicModels, 1)) {
+throw 'invalid --dynamic-models argument';
+}
if (options.recipLearningRate !== undefined && !between(1, options.recipLearningRate, 2**53)) {
throw 'invalid --learning-rate argument';
}
@@ -343,12 +354,12 @@ async function compress({ inputs, options, optimize, outputPath, verbose }) {
if (typeof combined.recipLearningRate === 'number') {
args = `-Zlr${combined.recipLearningRate} ${args}`;
}
+if (typeof combined.dynamicModels === 'number') {
+args = `-Zdy${combined.dynamicModels} ${args}`;
+}
if (typeof combined.numAbbreviations === 'number') {
args = `-Zab${combined.numAbbreviations} ${args}`;
}
-if (combined.preferTextOverJS) {
-args = `-t text ${args}`;
-}
return args;
};
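
The option handling in the hunks above (duplicate detection, integer parsing, range check) can be sketched as a stand-alone function. `parseDynamicModelsArg` is a hypothetical name, and the accepted spellings (`-ZdyN`, `-Zdy N`, `--dynamic-models N`) are assumptions made for illustration; the actual CLI relies on its own `matchOptArg` and `getArg` helpers, which are not shown in this diff.

```javascript
// Hypothetical stand-alone parser for the -Zdy / --dynamic-models option.
// Handles `-ZdyN`, `-Zdy N` and `--dynamic-models N`; other arguments are ignored.
function parseDynamicModelsArg(args) {
    let dynamicModels;
    for (let i = 0; i < args.length; ++i) {
        let value;
        if (args[i].startsWith('-Zdy')) {
            // the value is either attached to the flag or given as the next argument
            value = args[i].slice(4) || args[++i];
        } else if (args[i] === '--dynamic-models') {
            value = args[++i];
        } else {
            continue;
        }
        if (dynamicModels !== undefined) throw 'duplicate --dynamic-models arguments';
        dynamicModels = parseInt(value, 10);
        // NaN fails this range check as well, so malformed values are rejected too
        if (!(0 <= dynamicModels && dynamicModels <= 1)) throw 'invalid --dynamic-models argument';
    }
    return dynamicModels; // undefined means "derive the default from the input type"
}

console.log(parseDynamicModelsArg(['-O1', '-Zdy0']));
```

Leaving the result `undefined` when the option is absent lets the packer choose the default later, once the input type is known.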

7 changes: 7 additions & 0 deletions index.d.ts
@@ -173,6 +173,10 @@ export interface Input {
action: InputAction;
}

+export const enum DynamicModelFlags {
+Quotes = 1 << 0,
+}

export interface PackerOptions {
sparseSelectors?: number[];
maxMemoryMB?: number;
@@ -185,6 +189,7 @@ export interface PackerOptions {
arrayBufferPool?: ResourcePool;
recipLearningRate?: number;
numAbbreviations?: number;
+dynamicModels?: number; // bit flags out of DynamicModelFlags
allowFreeVars?: boolean;
}

@@ -195,7 +200,9 @@ export interface OptimizedPackerOptions {
modelRecipBaseCount?: number;
recipLearningRate?: number;
numAbbreviations?: number;
+/** @deprecated Replaced by {@link OptimizedPackerOptions.dynamicModels}, no longer used */
preferTextOverJS?: boolean;
+dynamicModels?: number;
}

export class Packer {
62 changes: 32 additions & 30 deletions index.mjs
@@ -681,6 +681,8 @@ const contextBitsFromMaxMemory = options => {
// the threshold is implementation-defined, but 2^16 - epsilon seems common.
const TEXT_DECODER_THRESHOLD = 65000;

+const DYN_MODEL_QUOTES = 1;

export class Packer {
constructor(inputs, options = {}) {
this.options = {
@@ -693,6 +695,7 @@ export class Packer {
contextBits: options.contextBits,
resourcePool: options.resourcePool || options.arrayBufferPool || new ResourcePool(),
numAbbreviations: typeof options.numAbbreviations === 'number' ? options.numAbbreviations : 64,
+dynamicModels: options.dynamicModels,
allowFreeVars: options.allowFreeVars,
disableWasm: options.disableWasm,
};
@@ -734,6 +737,10 @@ export class Packer {
if (inputs.length !== 1 || !['js', 'text'].includes(inputs[0].type) || !['eval', 'write', 'console', 'return'].includes(inputs[0].action)) {
throw new Error('Packer: this version of Roadroller supports exactly one JS or text input, please stay tuned for more!');
}

+if (this.options.dynamicModels === undefined) {
+this.options.dynamicModels = inputs[0].type === 'js' ? DYN_MODEL_QUOTES : 0;
+}
}

get memoryUsageMB() {
@@ -752,15 +759,22 @@ export class Packer {
}
}

-static prepareJs(inputs, { numAbbreviations }) {
+static prepareJs(inputs, { dynamicModels, numAbbreviations }) {
+const modelQuotes = dynamicModels & DYN_MODEL_QUOTES;

// we strongly avoid a token like 'this\'one' because the context model doesn't
// know about escapes and anything after that would be suboptimally compressed.
// we can't still avoid something like `foo${`bar`}quux`, where `bar` would be
-// suboptimall compressed, but at least we will return to the normal state at the end.
-const reescape = (s, pattern) =>
-s.replace(
-new RegExp(`\\\\?(${pattern})|\\\\.`, 'g'),
-(m, q) => q ? '\\x' + q.charCodeAt(0).toString(16).padStart(2, '0') : m);
+// suboptimally compressed, but at least we will return to the normal state at the end.
+const reescape = (s, pattern) => {
+if (modelQuotes) {
+return s.replace(
+new RegExp(`\\\\?(${pattern})|\\\\.`, 'g'),
+(m, q) => q ? '\\x' + q.charCodeAt(0).toString(16).padStart(2, '0') : m);
+} else {
+return s;
+}
+};

const identFreqs = new Map();
const inputTokens = [];
@@ -828,7 +842,7 @@ export class Packer {
for (let i = 0; i < 128; ++i) {
// even though there might be no whitespace in the tokens,
// we may have to need some space between two namelike tokens later.
-if (![32, 34, 39, 96].includes(i)) unseenChars.add(String.fromCharCode(i));
+if (i !== 32 && !(modelQuotes && [34, 39, 96].includes(i))) unseenChars.add(String.fromCharCode(i));
}
for (const tokens of inputTokens) {
for (const token of tokens) {
@@ -945,7 +959,7 @@ export class Packer {
const inBits = combinedInput.every(c => c <= 0x7f) ? 7 : 8;
const outBits = 6;
// TODO again, this should be controlled dynamically
-const modelQuotes = preparedJs.code.length > 0;
+const modelQuotes = !!(options.dynamicModels & DYN_MODEL_QUOTES);

const {
sparseSelectors, precision, modelMaxCount, modelRecipBaseCount,
@@ -1331,24 +1345,17 @@ export class Packer {
const performance = await getPerformanceObject();
const copy = v => JSON.parse(JSON.stringify(v));

-const cache = new Map(); // `${preferTextOverJS | 0},${numAbbreviations}` -> { preparedText, preparedJs }
+const cache = new Map(); // `${dynamicModels},${numAbbreviations}` -> { preparedText, preparedJs }
const mainInputAction = (this.inputsByType['text'] || this.inputsByType['js'])[0].action;

let maxAbbreviations = -1;
const calculateSize = current => {
const options = { ...this.options, ...current };

-const key = `${options.preferTextOverJS},${options.numAbbreviations}`;
+const key = `${options.dynamicModels},${options.numAbbreviations}`;
if (!cache.has(key)) {
-let textInputs = this.inputsByType['text'] || [];
-let jsInputs = this.inputsByType['js'] || [];
-if (current.preferTextOverJS) {
-textInputs = [...textInputs, ...jsInputs.map(input => ({ ...input, type: 'text' }))];
-jsInputs = [];
-}
-
-const preparedText = Packer.prepareText(textInputs, options);
-const preparedJs = Packer.prepareJs(jsInputs, options);
+const preparedText = Packer.prepareText(this.inputsByType['text'] || [], options);
+const preparedJs = Packer.prepareJs(this.inputsByType['js'] || [], options);
cache.set(key, { preparedText, preparedJs });
}

@@ -1469,15 +1476,18 @@ export class Packer {
});
if (best.modelMaxCount === this.options.modelMaxCount) delete best.modelMaxCount;

+// optimize dynamicModels
+for (let i = 0; i < 2; ++i) {
+await updateBestAndReportProgress({ ...best, dynamicModels: i }, 'dynamicModels', i / 2);
+}
+if (best.dynamicModels === this.options.dynamicModels) delete best.dynamicModels;

// optimize numAbbreviations
await search(0, maxAbbreviations, LINEAR, [0, 16, 32, 64], async (i, ratio) => {
return await updateBestAndReportProgress({ ...best, numAbbreviations: i }, 'numAbbreviations', ratio);
});
if (best.numAbbreviations === this.options.numAbbreviations) delete best.numAbbreviations;

-// try to switch the JS input to text if any
-await updateBestAndReportProgress({ ...best, preferTextOverJS: true }, 'preferTextOverJS');

// optimize sparseSelectors by simulated annealing
let current = this.options.sparseSelectors.slice();
let currentSize = bestSize;
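
The "optimize dynamicModels" step in the hunk above tries every flag value and keeps whichever compresses best. A simplified, synchronous sketch of that idea follows. `optimizeDynamicModels` and `sizeOf` are hypothetical names, and the assumption that a candidate replaces the current best only when it is strictly smaller is a guess at what `updateBestAndReportProgress` does, since its body is not part of this diff.

```javascript
// Hypothetical, synchronous sketch of the "optimize dynamicModels" step.
// `sizeOf(options)` stands in for an actual compression run and returns a size in bytes.
function optimizeDynamicModels(defaults, best, sizeOf) {
    let bestSize = sizeOf({ ...defaults, ...best });
    // with a single supported model there are only two flag values to try: 0 and 1
    for (let flags = 0; flags < 2; ++flags) {
        const candidate = { ...best, dynamicModels: flags };
        const size = sizeOf({ ...defaults, ...candidate });
        if (size < bestSize) {
            best = candidate;
            bestSize = size;
        }
    }
    // only report the parameter when it differs from what was already configured
    if (best.dynamicModels === defaults.dynamicModels) {
        best = { ...best };
        delete best.dynamicModels;
    }
    return { best, bestSize };
}

// toy size function: pretend the input compresses better without the quotes model
const toySize = options => (options.dynamicModels === 0 ? 90 : 100);
console.log(optimizeDynamicModels({ dynamicModels: 1 }, {}, toySize));
```

Because the flag is an ordinary option rather than a change of input type, a later optimization run can turn the quotes model back on, which was not possible with the old `preferTextOverJS` switch.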
@@ -1523,14 +1533,6 @@ export class Packer {

// apply the final result to this
this.options = { ...this.options, ...best };
-if (best.preferTextOverJS && (this.inputsByType['text'] || this.inputsByType['js'])) {
-this.inputsByType['text'] = [
-...this.inputsByType['text'] || [],
-...(this.inputsByType['js'] || []).map(input => ({ ...input, type: 'text' })),
-];
-delete this.inputsByType['js'];
-}

return { elapsedMsecs: performance.now() - searchStart, best, bestSize };
}
}
38 changes: 23 additions & 15 deletions test.mjs
@@ -435,21 +435,29 @@ test('abbreviations', t => {
});

test('reescaping', t => {
-t.is(packAndReturn('"asdf\'asdf"'), '"asdf\'asdf"');
-t.is(packAndReturn('"asdf\\"asdf"'), '"asdf\\x22asdf"');
-t.is(packAndReturn('"asdf\\\'asdf"'), '"asdf\\\'asdf"');
-t.is(packAndReturn('"asdf\\\"asdf"'), '"asdf\\x22asdf"');
-t.is(packAndReturn('"asdf\\\\\'asdf"'), '"asdf\\\\\'asdf"');
-t.is(packAndReturn("'asdf\"asdf'"), "'asdf\"asdf'");
-t.is(packAndReturn("'asdf\\'asdf'"), "'asdf\\x27asdf'");
-t.is(packAndReturn("'asdf\\\"asdf'"), "'asdf\\\"asdf'");
-t.is(packAndReturn("'asdf\\\'asdf'"), "'asdf\\x27asdf'");
-t.is(packAndReturn("'asdf\\\\\"asdf'"), "'asdf\\\\\"asdf'");
-t.is(packAndReturn('`asdf\\\`asdf`'), '`asdf\\x60asdf`');
-t.is(packAndReturn('`asdf\\\\\\\`asdf`'), '`asdf\\\\\\x60asdf`');
-t.is(packAndReturn('`foo\\\`${`asdf\\\\\\\`asdf`}\\\`bar`'), '`foo\\x60${`asdf\\\\\\x60asdf`}\\x60bar`');
-t.is(packAndReturn('/[\'"`]/g'), '/[\\x27\\x22\\x60]/g');
-t.is(packAndReturn('/[\\\'\\"\\`]/g'), '/[\\x27\\x22\\x60]/g');
+const examples = [
+['"asdf\'asdf"', '"asdf\'asdf"'],
+['"asdf\\"asdf"', '"asdf\\x22asdf"'],
+['"asdf\\\'asdf"', '"asdf\\\'asdf"'],
+['"asdf\\\"asdf"', '"asdf\\x22asdf"'],
+['"asdf\\\\\'asdf"', '"asdf\\\\\'asdf"'],
+["'asdf\"asdf'", "'asdf\"asdf'"],
+["'asdf\\'asdf'", "'asdf\\x27asdf'"],
+["'asdf\\\"asdf'", "'asdf\\\"asdf'"],
+["'asdf\\\'asdf'", "'asdf\\x27asdf'"],
+["'asdf\\\\\"asdf'", "'asdf\\\\\"asdf'"],
+['`asdf\\\`asdf`', '`asdf\\x60asdf`'],
+['`asdf\\\\\\\`asdf`', '`asdf\\\\\\x60asdf`'],
+['`foo\\\`${`asdf\\\\\\\`asdf`}\\\`bar`', '`foo\\x60${`asdf\\\\\\x60asdf`}\\x60bar`'],
+['/[\'"`]/g', '/[\\x27\\x22\\x60]/g'],
+['/[\\\'\\"\\`]/g', '/[\\x27\\x22\\x60]/g'],
+];
+
+for (const [input, reescaped] of examples) {
+// we don't need to reescape literals if we are not affected by quotes model anyway
+t.is(packAndReturn(input, { dynamicModels: 0 }), input);
+t.is(packAndReturn(input, { dynamicModels: 1 }), reescaped);
+}
});

const LONG_ENOUGH_INPUT = 100000; // ...so that an alternative code path is triggered
18 changes: 12 additions & 6 deletions tools/demo.html
@@ -113,6 +113,7 @@
<li><label>Learning rate: <input id=$learningrate type=number value=500 min=1 max=99999></label> <a href=#learning-rate title=Help>ℹ️</a>
<li><label>Model max count: <input id=$modelmaxcount type=number value=5 min=1 max=32767></label> <a href=#model-max-count title=Help>ℹ️</a>
<li><label>Model base divisor: <input id=$modelbasedivisor type=number value=20 min=1 max=99999></label> <a href=#model-base-divisor title=Help>ℹ️</a>
+<li><label>Dynamic model flags: <input id=$dynmodels type=number value=-1 min=-1 max=1></label> <a href=#dynamic-models title=Help>ℹ️</a>
<li><label>Number of abbreviations: <input id=$numabbrevs type=number value=64 min=0 max=64></label> <a href=#num-abbreviations title=Help>ℹ️</a>
<!--<li><label><input type=checkbox id=$uncompressed> Optimize for uncompressed size</label> <a href=#uncompressed-only title=Help>ℹ️</a>-->
</ul></details></footer>
@@ -177,6 +178,10 @@ <h3>Advanced Configuration</h3>
<p id=learning-rate><strong>Learning rate</strong> adjusts how fast would the context mixer adapt, where smaller is faster. The default is 500 which should be fine for long enough inputs. If your demo is smaller than 10 KB you can also try smaller numbers.
<p id=model-max-count><strong>Model max count</strong> adjusts how fast would individual contexts adapt, where smaller is faster. The model adapts fastest when a particular context is first seen, but that process becomes slower as the context is seen multiple times. This parameter limits how slowest the adaptation process can be. The default of 5 is specifically tuned for JS code inputs.
<p id=model-base-divisor><strong>Model base divisor</strong> adjusts how fast should individual contexts adapt <em>initially</em>, where larger is faster. The optimal value typically ranges from 10 to 100 for JS code inputs.
+<p id=dynamic-models><strong>Dynamic model flags</strong> enable or disable specific dynamic models; each bit is set if the corresponding model is in use. The value -1 is specially recognized as the default. There is currently one supported model:
+<ul>
+<li>Bit 0 (value 1) models quoted strings (', " or `) and works well for source code. It assumes that all quotes are paired, so it can't be used for English text with contractions (e.g. isn't) and is turned off by default for non-JS inputs.
+</ul>
<p id=num-abbreviations><strong>Number of abbreviations</strong> affects the preprocessing for JS code inputs. Common identifiers and reserved words can be abbreviated to single otherwise unused bytes during the preprocessing; this lessens the burden of context modelling which can only look at the limited number of past bytes. If this parameter is less than the number of allowable abbreviations some identifiers will be left as is, which can sometimes improve the compression.

<h2>Command-line Usage and API</h2>
@@ -321,6 +326,7 @@ <h2>Command-line Usage and API</h2>
$learningrate.onchange =
$modelmaxcount.onchange =
$modelbasedivisor.onchange =
+$dynmodels.onchange =
$numabbrevs.onchange =
$dirty.onchange = () => refreshOutput();

@@ -415,6 +421,9 @@ <h2>Command-line Usage and API</h2>
if (!(1 <= modelMaxCount && modelMaxCount <= 32767)) throw 'invalid model max count';
const modelRecipBaseCount = parseInt($modelbasedivisor.value, 10);
if (!(1 <= modelRecipBaseCount && modelRecipBaseCount <= 99999)) throw 'invalid model base divisor';
+let dynamicModels = parseInt($dynmodels.value, 10); // `let`, since -1 is mapped to undefined below
+if (!(-1 <= dynamicModels && dynamicModels <= 1)) throw 'invalid dynamic model flags';
+if (dynamicModels < 0) dynamicModels = undefined;
const numAbbreviations = parseInt($numabbrevs.value, 10);
if (!(0 <= numAbbreviations && numAbbreviations <= 64)) throw 'invalid number of abbreviations';

@@ -427,6 +436,7 @@ <h2>Command-line Usage and API</h2>
recipLearningRate,
modelMaxCount,
modelRecipBaseCount,
+dynamicModels,
numAbbreviations,
allowFreeVars: $dirty.checked,
};
@@ -456,12 +466,8 @@ <h2>Command-line Usage and API</h2>
if (result.best.modelMaxCount) $modelmaxcount.value = result.best.modelMaxCount;
if (result.best.modelRecipBaseCount) $modelbasedivisor.value = result.best.modelRecipBaseCount;
if (result.best.recipLearningRate) $learningrate.value = result.best.recipLearningRate;
+if (result.best.dynamicModels !== undefined) $dynmodels.value = result.best.dynamicModels;
if (result.best.numAbbreviations !== undefined) $numabbrevs.value = result.best.numAbbreviations;
-if (result.best.preferTextOverJS) {
-for (const input of $inputs.children) {
-if (input.i$type.value === 'js') input.i$type.value = 'text';
-}
-}
$optimize.textContent = 'Optimize parameters (harder)';
}

@@ -498,7 +504,7 @@ <h2>Command-line Usage and API</h2>
initOutput();
const input = makeInput();
$inputs.append(input);
-input.i$data.value = 'document.write`<xmp style=white-space:pre-wrap>\n\n' + $demotext.innerText + '`';
+input.i$data.value = 'document.write`<xmp style=white-space:pre-wrap>\n\n' + $demotext.innerText.replace(/[\\`$]/g, m => '\\' + m) + '`';
input.i$data.oninput();

github.hidden = true;
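
The last hunk escapes the demo text before embedding it in a template literal. The sketch below isolates that replacement and checks it with a round trip; the function name `escapeForTemplateLiteral` and the sample text are made up for illustration.

```javascript
// Escapes backslashes, backticks and dollar signs so that `text` can be placed
// inside a template literal without ending it or starting a ${...} substitution.
function escapeForTemplateLiteral(text) {
    return text.replace(/[\\`$]/g, m => '\\' + m);
}

const text = 'it costs $5, see `README` or C:\\temp and ${this}';
const source = 'return `' + escapeForTemplateLiteral(text) + '`';
// evaluating the generated source should give the original text back
console.log(new Function(source)() === text);
```

Without the escaping, a backtick in the text would end the literal early and a `${` sequence would be evaluated as a substitution.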