RFC: Grades are coming for everything. Here's the draft rubric. #34
Replies: 9 comments 28 replies
-
Awesome, I think this mechanism is sound and good for C4. The only doubt I have is about the working code requirements: "Below 50 may be increasingly penalized in the future."
-
Interesting idea, and I like scoring the submissions. Poor-quality reports, in my opinion, don't merit a payout: finding vulnerabilities is useless if you can't explain or justify them. But I have a big issue with the payouts being distributed on a curve. All that does is bring more subjectivity into what should be an objective contest. Disqualify reports under a threshold, but distribution should be even for everyone who qualifies.
-
Will this also be implemented for QA? I believe it should.
-
It would be nice to have a clear statement of the intended effect of this proposed change. As discussed in another comment thread, how good or bad this change is depends strongly on the grading curve. If the grading is too granular and/or the curve is too steep, it will reduce audit quality: wardens may shift more of their effort into polishing their submissions to secure a higher spot on the curve rather than searching for more issues. A balance needs to be struck between requiring a minimum level of quality and capping the maximum useful quality.

I believe any submission that meets a certain set of criteria should be rewarded the same. Submissions that obviously lack quality, are malicious, and/or are spam should be penalized or sanctioned accordingly, but there should nevertheless be a set, objective threshold beyond which an auditor can feel certain that investing more effort will not lead to a larger reward for that issue. This lets wardens focus more strongly on finding more issues while still providing an incentive to meet a quality standard. If the total required effort is simply shifted from judges to wardens without a useful reduction in overall effort, it makes C4 less efficient. The more we can centralize and streamline different steps of the process, the more efficient and therefore competitive C4 can be.

I'd argue that a warden's core focus should mainly be finding high, medium, and low/non-critical issues, and creating submissions with a PoC, impact, and severity justification for each issue. Judging, creating mitigation recommendations, and final report compilation can all be done in a more centralized manner downstream. Specifically requiring wardens to create mitigation recommendations and make their submissions report-ready unnecessarily duplicates work per issue. Instead, these steps of the process should be centralized and separated to allow for specialization and higher efficiency. As an added note, I haven't mentioned gas issues yet, as I feel those could better be separated into their own offering.

TL;DR: If this proposal leads to judges spending much less time in exchange for wardens spending a little more on submissions, I'm all for it. Conversely, if it requires wardens to spend a lot more time on submissions while saving little for judges and not really contributing to overall audit quality, then it should be revised and/or better specified.
-
I think it's also important to put some emphasis on having the sponsor include basic test/deploy scripts (as they mostly do), and make them as understandable, extendable, and usable for the wardens as possible (they mostly are, but there's always room for improvement). This can save wardens a lot of time.
-
I would suggest that first-time offenders get a warning before permanent penalties are imposed. If the penalty applies only to the contest at hand, that would be fine. As a relatively new warden, I have made mistakes myself, and even repeated those mistakes because I had no feedback on my reports; I simply didn't know about them. One could argue I should have taken the time to go through the rules, but I was caught up between learning Solidity, coding, and submitting. In those times, how much I wished for some constructive feedback! It would be unfair to expect a first-time auditor to know all the rules and conventions for submitting. So for the first offense, show the stick, and from the second one onward, give a whack. :) I appreciate how the C4 team is working toward making C4 even better. The grade system is great. Though it might not benefit a beginner like me at this point, from a holistic perspective it's the right approach. Thank you. _/\_
-
Some judges do not currently upgrade QA to Med/High, and given that wardens won't know ahead of time which judge will be judging, this will lead to wardens submitting inflated severities. With the new scoring system, will judges be required to upgrade, will upgrades be done away with, or will it still be up to each judge to decide what they want to do?
-
I am completely, directionally on board with this proposal: grades are going to be a big improvement and I trust that we'll work out the details. (In fact, I am on board with adopting this as is and trusting judges to figure it out.) But I've noticed a couple specific edge cases over the past few weeks I want to call out.
If there's a theme in both of these, I think it's that combining "grading" and "scoring" may be more complicated than it appears. (But that should not stop us from designing a way to do it!)
-
Grades for everything
Soon™ we will also be asking judges to grade everything, including medium and high severity issues, on the 0 to 100 scale, just as QA and gas reports are now. Within a given set of duplicates, awards will be distributed on a curve based on the judges' grading. This approach gives judges the flexibility to have their own style. (Some prefer buckets, others prefer granularity; the 100-point scale and curve allow both.)
There are three aspects to the grading criteria for quality submissions (60+):
Only "passing" grades will be eligible to be included in awards. The minimum threshold for passing is 60.
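To make the mechanism concrete, here is a minimal sketch of how a curve-based split among duplicates could work. The threshold of 60 comes from the proposal above; the warden names, pot size, and the power-curve exponent are purely illustrative assumptions, not C4's actual award formula.

```python
# Hypothetical sketch of curve-based awards among duplicate findings.
# PASSING_THRESHOLD comes from the proposal; everything else
# (names, pot, exponent) is an illustrative assumption.

PASSING_THRESHOLD = 60

def split_pot(grades, pot, steepness=2.0):
    """Split `pot` among duplicate submissions graded 0-100.

    Grades below the passing threshold receive nothing; eligible
    submissions are weighted by grade ** steepness, so a steeper
    curve rewards higher grades disproportionately.
    """
    payouts = {wid: 0.0 for wid in grades}
    weights = {wid: g ** steepness
               for wid, g in grades.items() if g >= PASSING_THRESHOLD}
    total = sum(weights.values())
    for wid, w in weights.items():
        payouts[wid] = pot * w / total
    return payouts

# Three duplicates of the same finding, graded 95, 70, and 40:
# the 40 fails the threshold and gets nothing; the other two split
# the pot roughly ~648 / ~352 under this example curve.
print(split_pot({"alice": 95, "bob": 70, "carol": 40}, pot=1000.0))
```

Note that `steepness=1.0` gives a split simply proportional to grade, while giving all eligible submissions equal weight reproduces the flat even split some commenters prefer, so the same mechanism covers both positions depending on how the curve is tuned.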
Medium and high severity finding criteria
Medium and high severity issues will require some level of clear evidence and a justification for why they merit that severity within the criteria guidelines. The more complex the claimed vulnerability, the more working code is required to demonstrate the conclusion. (To support this, we will be asking sponsors to include their full code repo.)
Submitting a high severity issue and failing to include working code which demonstrates the impact is a risk wardens may take, but this may lead to a high severity issue being downgraded and/or deemed ineligible for awards.
To ensure folks are aware of this requirement, we will be adding the severity descriptions to the finding form when a severity level is selected, clearly stating that the issue will not be awarded if it does not include the appropriate evidence.
Draft rubric
Passing grades:
Borderline passing:
Borderline failing:
Below 50 may be increasingly penalized in the future: