
πŸ›‘οΈ High Memory Consumption leading to Network Outage #618

Closed
ccamel opened this issue May 15, 2024 · 5 comments Β· Fixed by axone-protocol/prolog#6 or #703
Labels
security audit Categorizes an issue or PR as relevant to Security Audit

Comments

@ccamel
Member

ccamel commented May 15, 2024

Note

Severity: Critical
Target: v7.1.0 - Commit: 3c854270b006db30aa3894da2cdba10cc31b8c5f
Ref: OKP4 Blockchain Audit Report v1.0 - 02-05-2024 - BlockApex

Description

While probing the Prolog code executed on the okp4 blockchain, we used a blackbox approach in which we tried various Prolog predicates natively allowed on the blockchain. We found that if a large list is generated and then mapped over, it causes high memory usage which may lead to the termination of that specific blockchain instance. This issue was
also verified on ichiban/prolog's scripting engine. If a very large length, e.g. N = 10000000000, was passed, memory usage would grow so high that the node instance is forced to terminate. It is interesting to note that for smaller values such as N = 10000, the blockchain gracefully handles the issue and returns an out-of-gas error.

Furthermore, testing showed that this issue applies to all predicates that may incur high memory consumption.

This finding is not isolated to our custom testing environments but was also corroborated using the ichiban/prolog scripting engine, highlighting a fundamental issue within the underlying computational logic that poses a serious risk to the blockchain's stability and operational integrity.

Impact

The ramifications of this vulnerability are twofold:

  • Single Node Crash: An attacker could exploit this vulnerability to specifically target a single node by sending a query that triggers the high memory consumption issue. This would result in the node crashing due to memory overload, leading to potential disruptions in service and degradation of network performance.
  • Multiple Node Failure: More alarmingly, this vulnerability could be exploited through a valid transaction propagated across the network. If such a transaction contains a smart contract or operation that invokes the problematic Prolog predicates, it could lead to simultaneous high memory consumption on multiple nodes. This scenario could cause widespread network outages or even facilitate a network takeover by malicious validators who could remain operational while others fail.

Recommendation

TBC

@ccamel ccamel added the security audit Categorizes an issue or PR as relevant to Security Audit label May 15, 2024
@github-project-automation github-project-automation bot moved this to πŸ“‹ Backlog in πŸ’» Development May 15, 2024
@ccamel ccamel moved this from πŸ“‹ Backlog to πŸ“† To do in πŸ’» Development May 15, 2024
@bdeneux
Contributor

bdeneux commented Jun 6, 2024

Below are my analysis and the steps to reproduce the critical issue at hand. Let's open a discussion about this issue and potential solutions to resolve it.

πŸ“ length/2 allocation issue

Here is the simplified Prolog query that causes high memory consumption:

length(List, 10000000000), maplist(=(1), List).

Initially, I thought the issue was due to the recursive execution of all clauses through maplist/2. However, the recursion is not the problem, since the logic module requests Prolog execution through a given context that handles the out-of-gas limit. The context is passed to the Promise that handles each predicate and stops immediately when the gas limit is reached.

When calling length(List, 4), Prolog allocates a new Variable four times and unifies them into List:

https://github.com/axone-protocol/prolog/blob/d6ea3496c94deb1dbb755c4e2c78914507c67d0b/engine/builtin.go#L2926-L2935 :

func lengthRundown(vm *VM, list Variable, n Integer, k Cont, env *Env) *Promise {
	elems, err := makeSlice(int(n))
	if err != nil {
		return Error(resourceError(resourceMemory, env))
	}
	for i := range elems {
		elems[i] = NewVariable()
	}
	return Unify(vm, list, List(elems...), k, env)
}

The only limitation checked here is the memory limit, which depends on the node host and is caught by makeSlice(). Otherwise, for very large numbers, allocating NewVariable() through the loop takes a long time and is not covered by the context limitation. This means that when we call this predicate through a context with a deadline:

import (
	"bytes"
	"context"
	"errors"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
)

func TestInterpreter_performance(t *testing.T) {
	var out bytes.Buffer
	p := New(nil, &out)

	tests := []struct {
		name     string
		query    string
		output   string
		outputFn func(t *testing.T, output string)
		err      error
		waits    bool
	}{
		{name: "1", query: `length(List, 10000000), maplist(=(1), List).`, err: context.DeadlineExceeded, waits: true},
		{name: "2", query: `length(List, 10000000000), maplist(=(1), List).`, err: context.DeadlineExceeded, waits: true},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			ctx, cancel := context.WithTimeout(context.Background(), 10000*time.Millisecond)
			defer cancel()

			assert.Equal(t, tt.err, p.QuerySolutionContext(ctx, tt.query).Err())
			assert.Equal(t, tt.waits, errors.Is(ctx.Err(), context.DeadlineExceeded))
		})
	}
}

The first test case passes since the deadline is checked by maplist/2, but the second one fails: the allocation loop never observes the context cancellation, so the test itself times out.

Workaround

I made a small workaround for testing purposes by adding the context to the VM and using it to check that the context is not done before continuing:

func lengthRundown(vm *VM, list Variable, n Integer, k Cont, env *Env) *Promise {
	elems, err := makeSlice(int(n))
	if err != nil {
		return Error(resourceError(resourceMemory, env))
	}
	for i := range elems {
		// Check for cancellation on every iteration so a deadline can
		// interrupt the allocation loop.
		select {
		case <-vm.Ctx.Done():
			fmt.Println("TIME OUT")
			return Error(fmt.Errorf("time out"))
		default:
			elems[i] = NewVariable()
		}
	}
	return Unify(vm, list, List(elems...), k, env)
}

πŸ’Ύ Memory limit

Here is the snippet of code used to make a new slice while controlling available memory:

https://github.com/axone-protocol/prolog/blob/d6ea3496c94deb1dbb755c4e2c78914507c67d0b/engine/malloc.go#L14-L43

The calculation of the memory limit for making a new slice is not safe and depends essentially on the host where the node runs. It is a classic time-of-check/time-of-use race: between the time the available memory is checked (free := memFree()) and the time the slice is allocated (make([]Term, n)), another part of the program can allocate memory, causing the actual memory usage to exceed the limit. (Under blockchain node execution conditions, this happened multiple times during my tests.) Besides, we cannot catch memory overflows: ichiban/prolog#226.
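
For illustration, a minimal sketch of a race-free alternative: instead of sampling host free memory, account every allocation against a fixed, host-independent budget with an atomic reservation. All names here (memUsed, memLimit, termCost, makeSliceBudgeted, the Term stand-in) are assumptions for the sketch, not the actual axone-protocol/prolog API:

import (
	"errors"
	"sync/atomic"
)

type Term interface{} // stand-in for the engine's Term interface

const termCost = 64 // assumed per-Term accounting cost, in bytes

var (
	memUsed  atomic.Int64
	memLimit int64 = 256 << 20 // fixed, host-independent budget (assumed 256 MiB)
)

// makeSliceBudgeted reserves memory atomically before allocating, so two
// concurrent callers can never both fit into the same remaining budget.
func makeSliceBudgeted(n int) ([]Term, error) {
	cost := int64(n) * termCost
	if memUsed.Add(cost) > memLimit {
		memUsed.Add(-cost) // roll back the failed reservation
		return nil, errors.New("resource_error(memory)")
	}
	return make([]Term, n), nil
}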

πŸ“ Next steps

Let's discuss possible fixes we can implement πŸ˜…. @ccamel @amimart

@ccamel
Member Author

ccamel commented Jun 7, 2024

Thanks @bdeneux for your detailed analysis of the nature of the security problem revealed by the audit.

Regarding the length/2 allocation issue, we can't use the deadline pattern to solve it because it introduces unpredictability into the execution of requests. Depending on the complexity of the query and the current load on the system, the same query may sometimes succeed and sometimes fail, compromising the ability of the validation nodes to reach consensus.

As a preliminary consideration, I believe the best approach would be to implement controls on the allocators (memory, variables, etc.) that:

  1. Ensure that the amount of allocated resources (memory, variables, etc.) cannot exceed a maximum, which assures validators that allocated resources remain constrained at all times. This serves as the absolute limit.
  2. Account for this allocation with a proportional consumption of gas. This acts as the utilitarian limit (see the sketch after this list).
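
To make the idea concrete, here is a minimal sketch of the two limits combined, assuming a hypothetical Allocator type; the GasMeter interface, the hard cap, and the unit costs are illustrative, not the actual axone-protocol/prolog or Cosmos SDK wiring:

import "errors"

// GasMeter abstracts the chain's gas accounting (assumed interface).
type GasMeter interface {
	ConsumeGas(amount uint64, descriptor string)
}

// Allocator enforces both limits: a hard cap and proportional gas.
type Allocator struct {
	gas       GasMeter
	maxVars   uint64 // absolute limit: host-independent, consensus-safe
	gasPerVar uint64 // utilitarian limit: gas charged per variable
	allocated uint64
}

func (a *Allocator) AllocVariables(n uint64) error {
	if a.allocated+n > a.maxVars {
		return errors.New("resource_error(variables)") // absolute limit reached
	}
	a.gas.ConsumeGas(n*a.gasPerVar, "prolog/variable") // utilitarian limit
	a.allocated += n
	return nil
}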

That said, it's not easy, quite tricky (as you mentioned, ichiban/prolog#226 is a good read), and it has to be applied to all the code, mainly in the ichiban/prolog library, but also to our native predicates where necessary.

But I'd need a bit more time to get a better grasp of the issue and to see whether other approaches might be more effective and safer.

@amimart
Member

amimart commented Jul 4, 2024

@ccamel @bdeneux This is really tricky. I would propose a first, far-from-ideal fix: prohibit unifying the length with an uninstantiated variable in the length/2 predicate; the same approach could be used for the functor/3 predicate. This way we can still get the length of a list, but not allocate one.
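
If it helps the discussion, a rough sketch of what that guard could look like, reusing the names from the lengthRundown snippet above; the instantiation check and the lengthMeasure helper are hypothetical, not existing code:

func lengthGuarded(vm *VM, list Term, n Integer, k Cont, env *Env) *Promise {
	if _, ok := env.Resolve(list).(Variable); ok {
		// The list is still an unbound variable: refuse to materialize it.
		return Error(InstantiationError(env))
	}
	// The list is bound: measuring its length allocates nothing.
	return lengthMeasure(vm, list, n, k, env) // hypothetical helper
}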

What do you think?

@bdeneux
Contributor

bdeneux commented Jul 22, 2024

@amimart Yes, it seems a good temporary solution, but after working on axone-protocol/prolog#3, I think it would be better (as a temporary solution too) to instead add a maximum number of NewVariable allocations allowed. It would still allow unifying length with an uninitialized variable, but within a defined limit, as @ccamel said in his first comment about allocated resources. This is a temporary solution since it acts only on the variable resource and not on global memory, but it's a good starting point.
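
For illustration, a sketch of such a cap on variable creation; the counter, the limit value, and NewVariableChecked are assumptions, not the actual change in axone-protocol/prolog#3:

import (
	"errors"
	"sync/atomic"
)

var (
	variableCount atomic.Int64
	maxVariables  int64 = 1_000_000 // assumed limit, to be tuned
)

// NewVariableChecked wraps NewVariable with a global cap, failing with a
// resource error once the budget is exhausted.
func NewVariableChecked() (Variable, error) {
	if variableCount.Add(1) > maxVariables {
		variableCount.Add(-1) // keep the counter consistent on failure
		return 0, errors.New("resource_error(variables)")
	}
	return NewVariable(), nil
}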

@amimart
Member

amimart commented Jul 22, 2024

@amimart Yes, it seems a good temporary solution, but after working on axone-protocol/prolog#3, I think it would be better (as a temporary solution too) to instead add a maximum number of NewVariable allocations allowed. It would still allow unifying length with an uninitialized variable, but within a defined limit, as @ccamel said in his first comment about allocated resources. This is a temporary solution since it acts only on the variable resource and not on global memory, but it's a good starting point.

Totally agree with you that's way better πŸ‘, let's go this way

@github-project-automation github-project-automation bot moved this from πŸ“† To do to βœ… Done in πŸ’» Development Jul 23, 2024
@github-project-automation github-project-automation bot moved this from πŸ“‹ Backlog to βœ… Done in πŸ’» Development Jul 23, 2024
@bdeneux bdeneux linked a pull request Jul 25, 2024 that will close this issue
@bdeneux bdeneux reopened this Jul 25, 2024
@github-project-automation github-project-automation bot moved this from βœ… Done to πŸ“† To do in πŸ’» Development Jul 25, 2024
@github-project-automation github-project-automation bot moved this from πŸ“† To do to βœ… Done in πŸ’» Development Jul 25, 2024