You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
gbk (should be identical to gb18030 but it is instead broken)
big5
euc-jp
iso-2022-jp
shift_jis (fails at even ascii input)
euc-kr
Unimplemented encodings that throw:
iso-8859-16
x-user-defined
If built without icu, utf-16le encoding also returns invalid results:
>newTextDecoder('utf-16le').decode(Uint16Array.of(0xd800))'�'// correct'\ud800'// no ICU
Performance
utf-8 (aka default) TextDecoder is much slower on ascii input than it can and should be 1.3x on 4096 bytes, ~3x on 1 MiB input
The above applies to buffer.toString() too
It's much slower on ASCII input than a checked js impl (same 1.3x-3x)
windows-1252 aka new TextDecoder('ascii') aka new TextDecoder('latin1')
is ~2x-4x slower than an optimized impl on ascii input
windows-1252 aka new TextDecoder('latin1')
is ~6x-12x slower than an optimized impl on latin1 input
windows-1252 is ~7x-12x slower than an optimized js impl
Other single-byte encodings that are significantly slower than js impl even on non-ascii input: iso-8859-3, iso-8859-6, iso-8859-7, iso-8859-8, iso-8859-8-i, windows-1253, windows-1255, windows-1257
None of the single-byte encodings are faster than the js impl even on non-ascii input
All of the single-byte encodings except windows-1252 are >=10x slower than the js impl on ascii input
(windows-1252 is only ~2-4x slower)
References
Nothing of the above requires any changes on the native side, I compared to a somewhat optimized JS implementation
Perhaps replace single-byte decoders with a js impl, remove native paths and lib usage. They all are just mappers, the implementation for all of them is identical
Or at least replace the slow, unsupported, or invalid ones. lib: implement all 1-byte encodings in js #61093
Fix or replace implementations for big5, euc-jp, iso-2022-jp, shift_jis, euc-kr
To fix legacy multi-byte decoders, attempt to re-use what Chromium has or import js code from @exodus/bytes
Correctness
Encodings that return invalid results:
ibm866(fails at even ascii input)koi8-uwindows-874windows-1252windows-1253windows-1255gb18030):gbk(should be identical togb18030but it is instead broken)big5euc-jpiso-2022-jpshift_jis(fails at even ascii input)euc-krUnimplemented encodings that throw:
iso-8859-16x-user-definedIf built without
icu,utf-16leencoding also returns invalid results:Performance
utf-8(aka default)TextDecoderis much slower on ascii input than it can and should be1.3xon 4096 bytes,~3xon 1 MiB inputbuffer.toString()tooIt's much slower on ASCII input than a checked js impl (same
1.3x-3x)windows-1252akanew TextDecoder('ascii')akanew TextDecoder('latin1')is ~
2x-4xslower than an optimized impl on ascii inputwindows-1252akanew TextDecoder('latin1')is ~
6x-12xslower than an optimized impl on latin1 inputwindows-1252is ~7x-12xslower than an optimized js impliso-8859-3,iso-8859-6,iso-8859-7,iso-8859-8,iso-8859-8-i,windows-1253,windows-1255,windows-1257windows-1252are>=10xslower than the js impl on ascii input(
windows-1252is only ~2-4xslower)References
Nothing of the above requires any changes on the native side, I compared to a somewhat optimized JS implementation
See https://docs.google.com/spreadsheets/d/1pdEefRG6r9fZy61WHGz0TKSt8cO4ISWqlpBN5KntIvQ/edit
See tests in https://github.com/ExodusOSS/bytes/blob/master/tests/encoding/mistakes.test.js (comment out the import and it can be run on Node.js without deps with only that file)
Suggestions
buffer.toString()src: improve StringBytes::Encode perf on ASCII #61119
new TextDecoder().decode(arg)src: improve StringBytes::Encode perf on ASCII #61119
Or at least replace the slow, unsupported, or invalid ones.
lib: implement all 1-byte encodings in js #61093
gbkdecoder path and make it do the same asgb18030as the spec sayslib: gbk decoder is gb18030 decoder per spec #61099
lib: add utf16 fast path for TextDecoder #61559
lib: unify ICU and no-ICU TextDecoder #61409
lib: use utf8 fast path for streaming TextDecoder #61549
lib: add utf16 fast path for TextDecoder #61559
big5,euc-jp,iso-2022-jp,shift_jis,euc-krTo fix legacy multi-byte decoders, attempt to re-use what Chromium has or import js code from
@exodus/bytes