Skip to content

Latest commit

 

History

History
57 lines (42 loc) · 2.64 KB

README.md

File metadata and controls

57 lines (42 loc) · 2.64 KB

UTF-8 Validator for Lua

This is an example of the "branchy" algorithm in Lua as described in Daniel Lemire's paper on Validating UTF-8 In Less Than One Instruction Per Byte.

I needed a fast UTF-8 validator that portably works in all versions of Lua. Since I don't have access to SIMD instructions using plain Lua, the "branchy" algorithm described in the paper was used.

The standard Lua (and Teal) versions in the repo will work in Lua 5.1-5.4 and LuaJIT 2.0/2.1. There are some notes below regarding potential optimizations for Lua 5.2+

Notes

This code will mark the UTF-16 surrogates (U+D800 through U+DFFF) as invalid. If you are using Lua 5.3+'s utf8.len for validation, it will mark these as valid.

Potential Optimizations

Examples of both optimizations are commented in the utf8validator.lua file.

find in the right spot

After nil/range checking the second/third/fourth bytes, a find() pattern match can offload some processing to C for all the ASCII characters in regular Lua. This drastically improves the speed of validation and still passes all the tests (tests run in 52% of the original time)

_, i = input:find(".*", i)
i = i + 1

Note: This slows down LuaJIT (tests run in 160% of the time without it)

Adding goto for a small speedup

A simple speedup in Lua 5.2+ is just adding labels to the nil checks on the bytes after the branches. The tests run in 88% of the original time.

::fourth:: if fourth == nil or fourth < 128 or fourth > 191 then return false, code_points end
::third:: if third == nil or third < 128 or third > 191 then return false, code_points end
::second:: if second == nil or second < 128 or second > 191 then return false, code_points end
::none::

and then adding the goto in the respective byte checks:

if byte <= 127 then
	i = i + 1
	goto none

elseif byte >= 194 and byte <= 223 then
	second = sbyte(input, i + 1)
	i = i + 2
	goto second

-- you can see all the goto's commented in utf8validator.lua/tl

Licenses