feat(hogql): local evaluation (HogVM part 2) #16275

Merged
merged 16 commits into from
Jun 28, 2023
1 change: 1 addition & 0 deletions .dockerignore
@@ -32,3 +32,4 @@
!plugin-server/.eslintrc.js
!plugin-server/.prettierrc
!share/GeoLite2-City.mmdb
!hogvm/python
2 changes: 1 addition & 1 deletion .github/actions/run-backend-tests/action.yml
@@ -104,7 +104,7 @@ runs:
GROUPS_ON_EVENTS_ENABLED: ${{ inputs.person-on-events }}
shell: bash
run: | # async_migrations are covered in ci-async-migrations.yml
- pytest posthog -m "not async_migrations" \
+ pytest hogvm posthog -m "not async_migrations" \
--splits ${{ inputs.concurrency }} --group ${{ inputs.group }} \
--durations=100 --durations-min=1.0 --store-durations \
$PYTEST_ARGS
1 change: 1 addition & 0 deletions .github/workflows/ci-e2e.yml
@@ -34,6 +34,7 @@ jobs:
# code completely
- 'ee/**/*'
- 'posthog/**/*'
- 'hogvm/**/*'
- 'bin/*.py'
- requirements.txt
- requirements-dev.txt
2 changes: 2 additions & 0 deletions .gitignore
@@ -51,3 +51,5 @@ playwright/e2e-vrt/**/*-darwin.png
# antlr4 generated temp files when "npm run grammar:build" crashes
gen
upgrade/
hogvm/typescript/dist

55 changes: 55 additions & 0 deletions hogvm/CHANGELOG.md
@@ -0,0 +1,55 @@
# HogQL bytecode changelog

## 2023-06-28 - First version

### Operations added

```bash
FIELD = 1 # [arg3, arg2, arg1, FIELD, 3] # arg1.arg2.arg3
CALL = 2 # [arg2, arg1, CALL, 'concat', 2] # concat(arg1, arg2)
AND = 3 # [val3, val2, val1, AND, 3] # val1 and val2 and val3
OR = 4 # [val3, val2, val1, OR, 3] # val1 or val2 or val3
NOT = 5 # [val, NOT] # not val
PLUS = 6 # [val2, val1, PLUS] # val1 + val2
MINUS = 7 # [val2, val1, MINUS] # val1 - val2
MULTIPLY = 8 # [val2, val1, MULTIPLY] # val1 * val2
DIVIDE = 9 # [val2, val1, DIVIDE] # val1 / val2
MOD = 10 # [val2, val1, MOD] # val1 % val2
EQ = 11 # [val2, val1, EQ] # val1 == val2
NOT_EQ = 12 # [val2, val1, NOT_EQ] # val1 != val2
GT = 13 # [val2, val1, GT] # val1 > val2
GT_EQ = 14 # [val2, val1, GT_EQ] # val1 >= val2
LT = 15 # [val2, val1, LT] # val1 < val2
LT_EQ = 16 # [val2, val1, LT_EQ] # val1 <= val2
LIKE = 17 # [val2, val1, LIKE] # val1 like val2
ILIKE = 18 # [val2, val1, ILIKE] # val1 ilike val2
NOT_LIKE = 19 # [val2, val1, NOT_LIKE] # val1 not like val2
NOT_ILIKE = 20 # [val2, val1, NOT_ILIKE] # val1 not ilike val2
IN = 21 # [val2, val1, IN] # val1 in val2
NOT_IN = 22 # [val2, val1, NOT_IN] # val1 not in val2
REGEX = 23 # [val2, val1, REGEX] # val1 =~ val2
NOT_REGEX = 24 # [val2, val1, NOT_REGEX] # val1 !~ val2
IREGEX = 25 # [val2, val1, IREGEX] # val1 =~* val2
NOT_IREGEX = 26 # [val2, val1, NOT_IREGEX] # val1 !~* val2
TRUE = 29 # [TRUE] # true
FALSE = 30 # [FALSE] # false
NULL = 31 # [NULL] # null
STRING = 32 # [STRING, 'text'] # 'text'
INTEGER = 33 # [INTEGER, 123] # 123
FLOAT = 34 # [FLOAT, 123.12] # 123.12

# Added for completion, but not yet implemented. Stay tuned!
IN_COHORT = 27 # [val2, val1, IN_COHORT] # val1 in cohort val2
NOT_IN_COHORT = 28 # [val2, val1, NOT_IN_COHORT] # val1 not in cohort val2
```

### Functions added

```bash
concat(...) # concat('test: ', 1, null, '!') == 'test: 1!'
match(string, pattern) # match('fish', '^fi.*') == true
toString(val) # toString(true) == 'true'
toInt(val) # toInt('123') == 123
toFloat(val) # toFloat('123.2') == 123.2
toUUID(val) # toUUID('string') == 'string'
```
96 changes: 96 additions & 0 deletions hogvm/README.md
@@ -0,0 +1,96 @@
# HogVM

A HogVM is a 🦔 that runs HogQL bytecode. Its purpose is to locally evaluate HogQL expressions against any object.

## HogQL bytecode

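HogQL Bytecode is a compact representation of a subset of the HogQL AST nodes. Each program is a flat array that starts with the `_h` identifier (written `_H` below). Operands are pushed in reverse order, and each operation follows its operands, as in these examples: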

```
1 + 2 # [_H, op.INTEGER, 2, op.INTEGER, 1, op.PLUS]
1 and 2 # [_H, op.INTEGER, 2, op.INTEGER, 1, op.AND, 2]
1 or 2 # [_H, op.INTEGER, 2, op.INTEGER, 1, op.OR, 2]
not true # [_H, op.TRUE, op.NOT]
properties.bla # [_H, op.STRING, "bla", op.STRING, "properties", op.FIELD, 2]
call('arg', 'another') # [_H, op.STRING, "another", op.STRING, "arg", op.CALL, "call", 2]
1 = 2 # [_H, op.INTEGER, 2, op.INTEGER, 1, op.EQ]
'bla' !~ 'a' # [_H, op.STRING, 'a', op.STRING, 'bla', op.NOT_REGEX]
```
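As a rough illustration of the layout above, the sketch below evaluates a few of these arrays with a plain Python stack. The opcode numbers come from the operations table in this README and the `_h` marker from `operation.py`; the `run` helper and the use of bare integers as opcodes are assumptions made for this example, and this is not the reference implementation.

```python
from typing import Any, List

# Opcode numbers taken from the operations table; "_h" is the header marker.
HEADER = "_h"
INTEGER, PLUS, MINUS, EQ = 33, 6, 7, 11


def run(bytecode: List[Any]) -> Any:
    """Evaluate a tiny subset of HogQL bytecode (hypothetical helper)."""
    it = iter(bytecode)
    if next(it) != HEADER:
        raise ValueError("bytecode must start with the header marker")
    stack: List[Any] = []
    for op in it:
        if op == INTEGER:
            stack.append(next(it))  # the literal follows its opcode
        elif op == PLUS:
            stack.append(stack.pop() + stack.pop())
        elif op == MINUS:
            # The first pop is the left operand, since it was pushed last.
            left, right = stack.pop(), stack.pop()
            stack.append(left - right)
        elif op == EQ:
            stack.append(stack.pop() == stack.pop())
        else:
            raise ValueError(f"unsupported opcode: {op}")
    return stack.pop()


three = run([HEADER, INTEGER, 2, INTEGER, 1, PLUS])  # 1 + 2
diff = run([HEADER, INTEGER, 2, INTEGER, 5, MINUS])  # 5 - 2
equal = run([HEADER, INTEGER, 2, INTEGER, 1, EQ])    # 1 = 2
```

Because the right-hand operand is pushed first, non-commutative operations such as `MINUS` can pop their operands in source order without any reordering.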

## Compliant implementation

The `execute_bytecode` function in `python/execute.py` in this folder acts as the reference implementation in case of disputes.

### Operations

To be considered a PostHog HogQL Bytecode Certified Parser, you must implement the following operations:

```bash
FIELD = 1 # [arg3, arg2, arg1, FIELD, 3] # arg1.arg2.arg3
CALL = 2 # [arg2, arg1, CALL, 'concat', 2] # concat(arg1, arg2)
AND = 3 # [val3, val2, val1, AND, 3] # val1 and val2 and val3
OR = 4 # [val3, val2, val1, OR, 3] # val1 or val2 or val3
NOT = 5 # [val, NOT] # not val
PLUS = 6 # [val2, val1, PLUS] # val1 + val2
MINUS = 7 # [val2, val1, MINUS] # val1 - val2
MULTIPLY = 8 # [val2, val1, MULTIPLY] # val1 * val2
DIVIDE = 9 # [val2, val1, DIVIDE] # val1 / val2
MOD = 10 # [val2, val1, MOD] # val1 % val2
EQ = 11 # [val2, val1, EQ] # val1 == val2
NOT_EQ = 12 # [val2, val1, NOT_EQ] # val1 != val2
GT = 13 # [val2, val1, GT] # val1 > val2
GT_EQ = 14 # [val2, val1, GT_EQ] # val1 >= val2
LT = 15 # [val2, val1, LT] # val1 < val2
LT_EQ = 16 # [val2, val1, LT_EQ] # val1 <= val2
LIKE = 17 # [val2, val1, LIKE] # val1 like val2
ILIKE = 18 # [val2, val1, ILIKE] # val1 ilike val2
NOT_LIKE = 19 # [val2, val1, NOT_LIKE] # val1 not like val2
NOT_ILIKE = 20 # [val2, val1, NOT_ILIKE] # val1 not ilike val2
IN = 21 # [val2, val1, IN] # val1 in val2
NOT_IN = 22 # [val2, val1, NOT_IN] # val1 not in val2
REGEX = 23 # [val2, val1, REGEX] # val1 =~ val2
NOT_REGEX = 24 # [val2, val1, NOT_REGEX] # val1 !~ val2
IREGEX = 25 # [val2, val1, IREGEX] # val1 =~* val2
NOT_IREGEX = 26 # [val2, val1, NOT_IREGEX] # val1 !~* val2
TRUE = 29 # [TRUE] # true
FALSE = 30 # [FALSE] # false
NULL = 31 # [NULL] # null
STRING = 32 # [STRING, 'text'] # 'text'
INTEGER = 33 # [INTEGER, 123] # 123
FLOAT = 34 # [FLOAT, 123.12] # 123.12

# Added for completion, but not yet implemented. Stay tuned!
IN_COHORT = 27 # [val2, val1, IN_COHORT] # val1 in cohort val2
NOT_IN_COHORT = 28 # [val2, val1, NOT_IN_COHORT] # val1 not in cohort val2
```
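The variadic operations in this table (`FIELD`, `AND`, `OR`) are followed by a count that says how many values to pop. A minimal sketch of that convention, assuming bare integer opcodes and a hypothetical `run` helper written only for this example:

```python
from typing import Any, Dict, List

STRING, TRUE, FALSE, FIELD, AND = 32, 29, 30, 1, 3


def run(bytecode: List[Any], fields: Dict[str, Any]) -> Any:
    """Evaluate FIELD and AND over a stack (hypothetical helper)."""
    it = iter(bytecode[1:])  # skip the "_h" header
    stack: List[Any] = []
    for op in it:
        if op == STRING:
            stack.append(next(it))
        elif op == TRUE:
            stack.append(True)
        elif op == FALSE:
            stack.append(False)
        elif op == FIELD:
            # Popping `count` items yields the chain in source order.
            value: Any = fields
            for _ in range(next(it)):
                key = stack.pop()
                value = value.get(key) if isinstance(value, dict) else None
            stack.append(value)
        elif op == AND:
            stack.append(all([stack.pop() for _ in range(next(it))]))
        else:
            raise ValueError(f"unsupported opcode: {op}")
    return stack.pop()


event = {"properties": {"bla": "hello"}}
# properties.bla
looked_up = run(["_h", STRING, "bla", STRING, "properties", FIELD, 2], event)
# true and false
both = run(["_h", TRUE, FALSE, AND, 2], event)
```

Returning `None` for a missing intermediate key is a choice made in this sketch; the reference implementation may behave differently on such input.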

### Functions

A PostHog HogQL Bytecode Certified Parser must also implement the following function calls:

```bash
concat(...) # concat('test: ', 1, null, '!') == 'test: 1!'
match(string, pattern) # match('fish', '^fi.*') == true
toString(val) # toString(true) == 'true'
toInt(val) # toInt('123') == 123
toFloat(val) # toFloat('123.2') == 123.2
toUUID(val) # toUUID('string') == 'string'
```

### Null handling

In HogQL equality comparisons, `null` is treated like any other value. Its presence does not make functions automatically return `null`, which is the ClickHouse default.

```sql
1 == null # false
1 != null # true
```

Nulls are simply ignored in `concat`.
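In Python terms, these rules line up with how `None` behaves under `==` and `!=`. The sketch below mirrors both rules; `concat` here is an illustrative stand-in written for this example, not an import from the HogVM package.

```python
from typing import Any


def concat(*args: Any) -> str:
    """Join arguments as strings, skipping nulls (illustrative stand-in)."""
    parts = []
    for arg in args:
        if arg is None:
            continue  # nulls contribute nothing
        if isinstance(arg, bool):
            parts.append("true" if arg else "false")  # HogQL-style booleans
        else:
            parts.append(str(arg))
    return "".join(parts)


eq_null = 1 == None  # noqa: E711  (mirrors `1 == null`)
ne_null = 1 != None  # noqa: E711  (mirrors `1 != null`)
joined = concat("test: ", 1, None, "!")
```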

## Known broken features

- **Regular Expression** support is implemented, but is NOT GUARANTEED to work the same way across platforms. Different implementations (ClickHouse, Python, Node) use different regexp engines: ClickHouse uses `re2`, while the others use `pcre`-style backtracking engines. Use the case-insensitive regex operators instead of passing modifier flags through the expression.
- **DateTime** comparisons are not supported.
- **Cohort Matching** operations are not implemented.
- Only a small subset of functions is enabled. This list is bound to expand.
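To illustrate the regex advice in the list above, the sketch below contrasts an inline modifier with an explicit flag in Python's `re` module. The inline `(?i)` form is valid in Python, but it is the kind of engine-specific syntax that is not accepted everywhere; passing the flag separately is roughly what a case-insensitive operator such as `IREGEX` amounts to. The `iregex` helper name is an assumption for this example.

```python
import re


def iregex(value: str, pattern: str) -> bool:
    """Case-insensitive search with the flag kept out of the pattern."""
    return re.search(pattern, value, re.IGNORECASE) is not None


# Engine-specific: the modifier is embedded in the expression itself.
inline = re.search(r"(?i)^fi", "FISH") is not None
# Portable intent: the pattern stays plain, the flag travels separately.
flagged = iregex("FISH", "^fi")
# Without any flag the same pattern does not match the upper-case input.
case_sensitive = re.search("^fi", "FISH") is not None
```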
143 changes: 143 additions & 0 deletions hogvm/python/execute.py
@@ -0,0 +1,143 @@
import re
from typing import List, Any, Dict

from hogvm.python.operation import Operation, HOGQL_BYTECODE_IDENTIFIER


class HogVMException(Exception):
    pass


def like(string, pattern, flags=0):
    pattern = re.escape(pattern).replace("%", ".*")
    re_pattern = re.compile(pattern, flags)
    return re_pattern.search(string) is not None


def get_nested_value(obj, chain) -> Any:
    for key in chain:
        if isinstance(key, int):
            obj = obj[key]
        else:
            obj = obj.get(key, None)
    return obj


def to_concat_arg(arg) -> str:
    if arg is None:
        return ""
    if arg is True:
        return "true"
    if arg is False:
        return "false"
    return str(arg)


def execute_bytecode(bytecode: List[Any], fields: Dict[str, Any]) -> Any:
    try:
        stack = []
        iterator = iter(bytecode)
        if next(iterator) != HOGQL_BYTECODE_IDENTIFIER:
            raise HogVMException(f"Invalid bytecode. Must start with '{HOGQL_BYTECODE_IDENTIFIER}'")

        while (symbol := next(iterator, None)) is not None:
            match symbol:
                case Operation.STRING:
                    stack.append(next(iterator))
                case Operation.INTEGER:
                    stack.append(next(iterator))
                case Operation.FLOAT:
                    stack.append(next(iterator))
                case Operation.TRUE:
                    stack.append(True)
                case Operation.FALSE:
                    stack.append(False)
                case Operation.NULL:
                    stack.append(None)
                case Operation.NOT:
                    stack.append(not stack.pop())
                case Operation.AND:
                    stack.append(all([stack.pop() for _ in range(next(iterator))]))
                case Operation.OR:
                    stack.append(any([stack.pop() for _ in range(next(iterator))]))
                case Operation.PLUS:
                    stack.append(stack.pop() + stack.pop())
                case Operation.MINUS:
                    stack.append(stack.pop() - stack.pop())
                case Operation.DIVIDE:
                    stack.append(stack.pop() / stack.pop())
                case Operation.MULTIPLY:
                    stack.append(stack.pop() * stack.pop())
                case Operation.MOD:
                    stack.append(stack.pop() % stack.pop())
                case Operation.EQ:
                    stack.append(stack.pop() == stack.pop())
                case Operation.NOT_EQ:
                    stack.append(stack.pop() != stack.pop())
                case Operation.GT:
                    stack.append(stack.pop() > stack.pop())
                case Operation.GT_EQ:
                    stack.append(stack.pop() >= stack.pop())
                case Operation.LT:
                    stack.append(stack.pop() < stack.pop())
                case Operation.LT_EQ:
                    stack.append(stack.pop() <= stack.pop())
                case Operation.LIKE:
                    stack.append(like(stack.pop(), stack.pop()))
                case Operation.ILIKE:
                    stack.append(like(stack.pop(), stack.pop(), re.IGNORECASE))
                case Operation.NOT_LIKE:
                    stack.append(not like(stack.pop(), stack.pop()))
                case Operation.NOT_ILIKE:
                    stack.append(not like(stack.pop(), stack.pop(), re.IGNORECASE))
                case Operation.IN:
                    stack.append(stack.pop() in stack.pop())
                case Operation.NOT_IN:
                    stack.append(stack.pop() not in stack.pop())
                case Operation.REGEX:
                    args = [stack.pop(), stack.pop()]
                    stack.append(bool(re.search(re.compile(args[1]), args[0])))
                case Operation.NOT_REGEX:
                    args = [stack.pop(), stack.pop()]
                    stack.append(not bool(re.search(re.compile(args[1]), args[0])))
                case Operation.IREGEX:
                    args = [stack.pop(), stack.pop()]
                    stack.append(bool(re.search(re.compile(args[1], re.RegexFlag.IGNORECASE), args[0])))
                case Operation.NOT_IREGEX:
                    args = [stack.pop(), stack.pop()]
                    stack.append(not bool(re.search(re.compile(args[1], re.RegexFlag.IGNORECASE), args[0])))
                case Operation.FIELD:
                    chain = [stack.pop() for _ in range(next(iterator))]
                    stack.append(get_nested_value(fields, chain))
                case Operation.CALL:
                    name = next(iterator)
                    args = [stack.pop() for _ in range(next(iterator))]
                    if name == "concat":
                        stack.append("".join([to_concat_arg(arg) for arg in args]))
                    elif name == "match":
                        stack.append(bool(re.search(re.compile(args[1]), args[0])))
                    elif name == "toString" or name == "toUUID":
                        if args[0] is True:
                            stack.append("true")
                        elif args[0] is False:
                            stack.append("false")
                        elif args[0] is None:
                            stack.append("null")
                        else:
                            stack.append(str(args[0]))
                    elif name == "toInt" or name == "toFloat":
                        try:
                            stack.append(int(args[0]) if name == "toInt" else float(args[0]))
                        except ValueError:
                            stack.append(None)
                    else:
                        raise HogVMException(f"Unsupported function call: {name}")
                case _:
                    raise HogVMException(f"Unexpected node while running bytecode: {symbol}")

        if len(stack) > 1:
            raise HogVMException("Invalid bytecode. More than one value left on stack")

        return stack.pop()
    except IndexError:
        raise HogVMException("Unexpected end of bytecode")
43 changes: 43 additions & 0 deletions hogvm/python/operation.py
@@ -0,0 +1,43 @@
from enum import Enum

HOGQL_BYTECODE_IDENTIFIER = "_h"


SUPPORTED_FUNCTIONS = ("concat", "match", "toString", "toInt", "toFloat", "toUUID")


class Operation(str, Enum):
    FIELD = 1
    CALL = 2
    AND = 3
    OR = 4
    NOT = 5
    PLUS = 6
    MINUS = 7
    MULTIPLY = 8
    DIVIDE = 9
    MOD = 10
    EQ = 11
    NOT_EQ = 12
    GT = 13
    GT_EQ = 14
    LT = 15
    LT_EQ = 16
    LIKE = 17
    ILIKE = 18
    NOT_LIKE = 19
    NOT_ILIKE = 20
    IN = 21
    NOT_IN = 22
    REGEX = 23
    NOT_REGEX = 24
    IREGEX = 25
    NOT_IREGEX = 26
    IN_COHORT = 27
    NOT_IN_COHORT = 28
    TRUE = 29
    FALSE = 30
    NULL = 31
    STRING = 32
    INTEGER = 33
    FLOAT = 34
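
One detail worth noting about the `str` mixin used by `Operation`: Python's `Enum` should pass each integer value through `str(...)`, so members compare equal to strings such as `"1"` rather than to integers, and they serialize to JSON as strings. A small sketch with a hypothetical two-member `Op` enum standing in for `Operation`:

```python
import json
from enum import Enum


class Op(str, Enum):  # hypothetical stand-in for Operation
    FIELD = 1
    CALL = 2


field_value = Op.FIELD.value       # the stored value is a string
equals_string = Op.FIELD == "1"    # compares as a plain str
equals_int = Op.FIELD == 1         # a str never equals an int
encoded = json.dumps(["_h", Op.CALL])
```

If this holds, integer literals in the bytecode cannot be mistaken for opcodes by value, though bytecode written with raw integer opcodes would not match the `case Operation.X` branches.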