Project Proposal: Weighted Audit Contest score and other warden performance stats #28
adamavenir started this conversation in Masonry
Issue
C4 relies on a mixture of established talent and the development of emerging talent in order to continue producing results that are on par with, and even surpass, top audit firms.
We know that warden growth has dramatically outpaced sponsor growth, but balancing both sides of the community is essential.
We intend to keep growing the number of contests we run concurrently. Right now we have only rudimentary methods for ensuring contest coverage and for helping wardens choose the contests where they can add the most value, and therefore benefit the most from participating.
We have a ton of data about how wardens have performed in contests, but it's not very contextualized, and our current approach to evaluating talent is deeply biased toward "I've seen that name at the top of a lot of contests", which becomes a less useful signal as more people compete.
C4 needs a comparative performance metric
Because of the elegant design of the C4 award mechanism, individual contest leaderboards do a nice job of comparing performance within a single contest. And on the overall leaderboard, we can get a sense of how player performance compares based on aggregate totals of earnings and issues found. (We could do an even better job of this by showing average performance stats with a per-contest divisor.)
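To make the idea concrete, here's a minimal sketch of that per-contest divisor; the field names (`total_earnings`, `total_findings`, `contests_entered`) are hypothetical, not C4's actual leaderboard schema:

```python
# Hypothetical per-contest averages derived from aggregate leaderboard totals.
def per_contest_averages(warden: dict) -> dict:
    contests = max(warden["contests_entered"], 1)  # guard against division by zero
    return {
        "avg_earnings": warden["total_earnings"] / contests,
        "avg_findings": warden["total_findings"] / contests,
    }

print(per_contest_averages(
    {"total_earnings": 12000, "total_findings": 18, "contests_entered": 6}
))
# -> {'avg_earnings': 2000.0, 'avg_findings': 3.0}
```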
But we know that there is a big difference between contests. Contracts written by junior developers produce more vulnerabilities. Some contests feature code that’s already been audited once or even twice.
And some competitors are so new to auditing and/or smart contracts that it would be nice to give them a way to view their performance compared to others closer to their presumed skill level. Providing more meaningful statistical feedback to wardens on their performance based on their zone of proximal development will help people to stay motivated in the journey of leveling up their skills.
Further, because C4 is also a talent aggregator and has played a part in folks landing jobs as smart contract developers or auditors, a way of measuring performance will help C4 play this critical role better, both for individuals and for the overall ecosystem.
North star: C4 is an esport
Sports excel at tracking numbers and analyzing performance in order to compare players’ contributions and output. We can look to examples from well-established sports as a way to develop our own stat.
I’m a baseball geek—yes, a rarity in our global C4 community—but follow me here, as I think there is a useful analogy in baseball stats that we might be able to use in how we think about a comparative performance stat.
RC (Runs Created) is a metric that attempts to describe a player's offensive output. The highest impact thing you can do is hit a home run (hit the ball out of the field, which means advancing four bases). Hitting a 'double' (two bases) or a 'triple' (three bases) is more impactful than a standard hit (a 'single'). RC is a counting stat, so one nice thing about it is that you can see a player's cumulative impact over time. (See the all-time leaders in runs created, for example.)
But baseball is weird. All baseball fields are different sizes, some dramatically different from others. And different eras of the game have been more prone to home runs than others (for lots of reasons: bounciness of the ball, pitching rules that made it harder or easier for hitters, the prevalence of steroid use). These things obviously impact the likelihood of hitting a home run.
So wRC+ (Weighted Runs Created Plus) also adjusts for different size fields and eras of the game, and it is specifically scoped to players who play at the same level of competition. (A minor league player could have a 120 wRC+ in the lower leagues but a 92 wRC+ after coming up to the top league.) In addition, wRC+ removes from the average set players who cannot be expected to perform at the same level (namely pitchers).
To make things more user-friendly, wRC+ normalizes player stats so that 100 is the league average, with each point above/below 100 meaning 1 percentage point above/below average. This lets you quickly see where someone's performance fits on the overall curve: a 105 wRC+ is 5% above average; a 95 wRC+ is 5% below average.
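For the normalization step alone, here is a minimal sketch (the real wRC+ formula also folds in park and league adjustments, which are omitted here):

```python
# wRC+-style normalization: 100 = league average; each point above or
# below 100 is one percentage point above or below average.
def plus_stat(raw_score: float, league_average: float) -> float:
    return 100 * raw_score / league_average

print(plus_stat(raw_score=6.3, league_average=6.0))  # 105.0 -> 5% above average
print(plus_stat(raw_score=5.7, league_average=6.0))  # 95.0  -> 5% below average
```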
What are our RC and wRC+ equivalents?
I find that RC and wRC+ make a nice analogy for how we might think about C4 performance.
We have several types of results that scale in impact: hitting a home run is a bit like finding a solo high severity issue in a C4 contest :) But we also have shared high severity findings, solo medium findings, shared medium findings, plus the graded performance of a low/non-critical report or gas optimization report.
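As a thought experiment, a raw RC-style counting stat for C4 might look something like the sketch below. The weights are placeholders for illustration, not proposed values:

```python
# Hypothetical RC-style counting stat: weight each finding type by impact.
# Weight values are placeholders, not a proposal for actual constants.
WEIGHTS = {
    "high_solo": 10,
    "high_shared": 5,
    "medium_solo": 3,
    "medium_shared": 1.5,
    "qa_report": 1,   # graded low/non-critical report
    "gas_report": 1,  # graded gas optimization report
}

def contest_score(findings: dict[str, int]) -> float:
    """Sum weighted finding counts for one warden in one contest."""
    return sum(WEIGHTS[kind] * count for kind, count in findings.items())

print(contest_score({"high_solo": 1, "medium_shared": 2}))  # 13.0
```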
We actually already have a couple of Audit Contest score methods, in their most basic form, readily available to us in our findings.csv data, though we have not yet used them:
The adjusted version is cool because it tells us how hard the bug was to find given the competition, with each bug's value decreasing based on the number of times it was found.
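For reference, the shape of that adjustment looks roughly like the sketch below. The base weight of 10 for a high severity finding and the 0.9 decay follow C4's published award calculation, but treat the exact constants here as assumptions:

```python
# Duplicate-count decay in the spirit of C4's award formula: a finding's
# weight shrinks as more wardens report the same bug.
def adjusted_weight(base: float, times_found: int) -> float:
    return base * (0.9 ** (times_found - 1)) / times_found

print(adjusted_weight(10, 1))  # 10.0  -> a solo high keeps full value
print(adjusted_weight(10, 4))  # ~1.82 -> a high found by 4 wardens is worth far less
```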
Unfortunately, using that kind of weighting isn't useful from contest to contest, because we don't know what the competition looked like:
There’s also some other data which would be extremely useful to incorporate:
So there is value in having a few statistical models which can tell us:
It should ultimately be possible to split these numbers out by filters like:
tl;dr
hey like what if we had advanced stats
Proposed Solution: stats contest