Rethink accessibility scoring #3444
Forgot to mention, I think a tool like Tenon.io also weighs the score based on page complexity. So if there are 1000 DOM nodes and 1 failing test, it's considered less severe than if there are 100 DOM nodes and a failing test.
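To illustrate, here is a minimal sketch of what complexity-based weighting could look like. The function name, the density scaling, and the saturation point are all invented for illustration; this is not Tenon.io's or Lighthouse's actual algorithm.

```ts
// Hypothetical sketch: a severity value adjusted for page complexity.
// Neither the names nor the scaling come from Tenon.io or Lighthouse.
function complexityWeightedSeverity(failingNodes: number, totalDomNodes: number): number {
  // One failure among 1000 nodes counts as less severe than one among 100.
  const failureDensity = failingNodes / Math.max(totalDomNodes, 1);
  return Math.min(1, failureDensity * 100); // ~1% failure density saturates at full severity
}

complexityWeightedSeverity(1, 1000); // 0.1 (large page, lower severity)
complexityWeightedSeverity(1, 100);  // 1   (small page, full severity)
```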
I think having a %-based score is a bad idea - 100% implies that the job is done. Perhaps consider reversing it so that the goal is to get to 0 errors. In my mind, having 0 errors doesn't carry the same implications as reaching 100%.
@jnurthen Maybe something in the vein of how HTML_Sniffer displays results: there you get errors, warnings, and notices for the page. You could also add some sort of color coding, as @robdodson suggested: a red background for too many errors on a page with few DOM nodes (or a certain number of errors for an average number of DOM nodes, and so forth), a yellow background for better results, and a green background for desired results.
True, but this is basically the case for everything. For performance, best practices, PWA, or anything else, we'll always be looking at a subset of all worthwhile issues. So I think that just places the onus on the tool developers to be as comprehensive as possible.
SGTM. I'm totally on board with looking at weighting things differently, even if it includes aspects like DOM node count.
I know we have work to do in aXe-core for inapplicable, and the "WCAG bad page" (which, funny enough, @WilcoFiers said he worked on). I do really appreciate accessibility being given such prominence! Just want to set devs up for success by giving them realistic expectations.
Here are the failure rates for the most recent HTTP Archive run (Sept 1-15). This is from runs of the a11y audits over 427,306 URLs (there were 2,705 URLs the audits weren't able to return results for, due to a variety of errors):
(As stated above, the complementary percentage isn't necessarily the pass rate...the audit may not have been applicable to the page being tested.)
Wow, that's excellent data! Interesting that one of the rules has 0%.

As for the topic at hand: this relates to a well-known problem in accessibility, metrics. There just isn't a good way to grade the accessibility of a page. Either you passed, or you didn't. Having X number of issues, or X number of rules failed, or X number of criteria failed generally doesn't say much about how accessible that page is. A page can have one very bad accessibility issue and be a disaster to work with, or 100 trivial problems that users with disabilities can easily work around. The W3C had a whole symposium on the problem of a11y metrics with no solution to speak of: https://www.w3.org/TR/accessibility-metrics-report/

I personally don't much like the percentage approach. The problem I have with it is two-fold. First, 100% quite heavily implies there are no problems, which is an impression we should very much try to avoid, since that's not what having no issues in aXe means. We've solved this in our products by adding an indication that further testing is always necessary. The second problem is that as we add more rules, the numbers change. Going from 100% accessible to 80% accessible due to an update isn't a nice message to receive, and it's hard to explain to someone who doesn't understand the inner workings.

My favourite approach for metrics has been to just use the absolute number of passes and failures per rule as a score: two numbers. There is no "highest number", in the sense that you don't imply that hitting 100%, or 10, or A+, or whatever means you're done. It gives some perspective, because you can still gauge passes against failures. And it is relatively easy to understand that new rules will grow the number of tests, which can mean more passes, more fails, or no change at all because a rule was inapplicable.

Hope that helps!
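As a rough illustration of that two-number idea, here is a minimal sketch that tallies absolute passes and failures from an axe-style results object. Only the general shape of the `passes` and `violations` arrays is assumed; the interfaces and counting logic are illustrative, not part of any tool's actual API.

```ts
// Sketch only: collapse axe-style results into two absolute numbers.
// Assumes a result object with `passes` and `violations` arrays whose
// entries list the matched elements under `nodes`.
interface RuleResult { id: string; nodes: unknown[]; }
interface AxeStyleResults { passes: RuleResult[]; violations: RuleResult[]; }

function passFailCounts(results: AxeStyleResults): { passed: number; failed: number } {
  const passed = results.passes.reduce((sum, rule) => sum + rule.nodes.length, 0);
  const failed = results.violations.reduce((sum, rule) => sum + rule.nodes.length, 0);
  // No "highest number" to reach: adding rules can grow either count,
  // or neither if the new rule is inapplicable to the page.
  return { passed, failed };
}
```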
In practice we encourage people to see each issue found as a barrier, and how big a barrier it is depends on the context of the user journey. For example, a keyboard-inaccessible 'add to basket' button is a huge barrier; missing alt text for the logo in the footer, much less so. Would it be possible to flip the metric around to "Barriers found", or something similar? Then at the top of an audit it would show the number of issues found (perhaps including a weighting factor, or splitting into higher/lower-severity issues). At the top of the accessibility section it could say something like:
@robdodson let's talk some more about this today. We think we can do some quick fixes here by re-weighting the audits within the category. And then we can do some research to sort out how to more dynamically adjust the weightings/score based on the results coming back from aXe.
Just wanted to provide an update for the folks subscribed to this thread. I think we have a multi-part plan we'd like to enact.

The first step will be to re-weight the scores based on how bad the offending error is. Currently aXe lists all errors as either major or critical, and there's no way to filter out non-applicable tests, so this re-weighting will be pretty subjective. My current thinking is that the stuff that is really egregious, and really common, will be weighted very heavily. I've already started putting together the new weights using the stats Brendan linked above. Along with this work, we'll also add language to the report that clearly explains that the tests can only cover a small subset of a11y issues and folks still need to do manual checks. Similar warnings and manual checks already exist in the PWA tests report, so there is prior art for this.

The second step is to work with the aXe team to filter out non-applicable tests (dequelabs/axe-core#473). This would be very helpful because then we could probably switch back to scoring based purely on what aXe defines as major vs. critical. If you end up with only one applicable test, and it's critical, and you fail it, you'd get a very bad a11y score.

The third step is to work with the Lighthouse team on a bigger rethink of how we present these results to the user. There has been talk of doing this for other parts of the Lighthouse report, so we can just make accessibility part of that larger redesign. There are folks on this thread who have said we shouldn't do scoring at all; however, I've also heard from folks on Slack and in person that they really like the scoring and that it has been helpful inside their larger organizations. I think we'll have to iterate on a few different UI options to see what feels right.

I'll ping this thread again when the PR for the re-weighted scores is up so folks can try it out if they're interested.
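For readers following along: a category score is, roughly speaking, a weighted average of the individual audit scores, so re-weighting comes down to changing the per-audit weights. Here is a minimal sketch of that idea; the audit ids and weights are placeholders, not Lighthouse's actual configuration.

```ts
// Simplified sketch of a weighted category score.
// The audit ids and weights below are placeholders for illustration.
interface AuditResult { id: string; score: number; weight: number; } // score: 0 = fail, 1 = pass

function categoryScore(audits: AuditResult[]): number {
  const totalWeight = audits.reduce((sum, a) => sum + a.weight, 0);
  if (totalWeight === 0) return 0;
  const earned = audits.reduce((sum, a) => sum + a.score * a.weight, 0);
  return Math.round((earned / totalWeight) * 100);
}

// An egregious, commonly failed audit with a heavy weight drags the
// score down far more than a rarely relevant one.
categoryScore([
  { id: 'image-alt', score: 0, weight: 10 },
  { id: 'video-caption', score: 1, weight: 1 },
]); // ≈ 9
```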
Score re-weighting is in PR now if anyone is interested: #3515
Closing this out since the PR was merged :) If there are any outstanding issues, feel free to re-open.
Short update: I'm going to see if I can work with the aXe folks at CSUN to look into fixing the non-applicable tests results array.
Awesome, thanks Rob! :D
I know this is three years old, but I thought I should comment nonetheless to say that the "100% is misleading" point still feels true to this day, despite the addition of explanatory text and manual checks in #3834. It's a common misconception that automated checks can be enough, and Lighthouse presenting "no issues" as "100%" only reinforces this, to the point that Lighthouse is very frequently used as the go-to bad example to demonstrate the problem. I realise this is the same as other Lighthouse scores, but it feels worse for this one because of the more fundamental misconception about automation in accessibility testing.

Is there anything else that could be done to bring Lighthouse closer to other accessibility checkers, or otherwise expand upon what it already does to alleviate the confusion? As a starting point, here is a comparison of how major automated accessibility checkers describe a "perfect score", compared to Lighthouse:

- WAVE
- Accessibility Insights
- Axe
- Lighthouse
The note could do with a stronger choice of words to start with. Perhaps cite how many issues Axe finds on average, to bring the point home. Of course, this is still just a band-aid on the scoring problem – really, what would be better is to de-emphasize the "100" and instead steer people towards caring about "0 issues". Perhaps follow the approach of the other checkers – upon reaching the perfect score, display something else that congratulates the tester and clearly guides them to manual tests as the obvious next step.
Currently the accessibility tests are all equally weighted. This means that even if a test is not applicable, it still counts as a pass and artificially inflates the score. As a result, pages in the WCAG's "bad" section still get a score of >89%.
One suggestion is to score inapplicable tests at a weight of 0; however, based on this issue I'm not sure whether aXe actually returns inapplicable tests.
Another thing we should definitely do in the near term is re-weight the tests. aXe itself has criticality ratings, and we can look at HTTP Archive stats to figure out which tests fail most often and maybe boost their weight even more.
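A minimal sketch of how those two ideas could combine (inapplicable audits weighted at 0, the rest weighted by impact). The type names and the impact-to-weight mapping are invented for illustration; they are not aXe's or Lighthouse's actual values.

```ts
// Sketch: inapplicable audits get weight 0, applicable ones are weighted
// by impact. The impact-to-weight mapping is invented for illustration.
type Impact = 'minor' | 'moderate' | 'serious' | 'critical';
interface A11yAudit { id: string; applicable: boolean; passed: boolean; impact: Impact; }

const IMPACT_WEIGHT: Record<Impact, number> = { minor: 1, moderate: 2, serious: 5, critical: 10 };

function a11yScore(audits: A11yAudit[]): number | null {
  const applicable = audits.filter(a => a.applicable); // inapplicable => weight 0
  const totalWeight = applicable.reduce((sum, a) => sum + IMPACT_WEIGHT[a.impact], 0);
  if (totalWeight === 0) return null; // nothing applicable: no meaningful score
  const earned = applicable.reduce(
    (sum, a) => sum + (a.passed ? IMPACT_WEIGHT[a.impact] : 0),
    0
  );
  return Math.round((earned / totalWeight) * 100);
}

// One applicable, critical, failing audit now produces a very low score
// instead of being diluted by inapplicable "passes".
a11yScore([
  { id: 'button-name', applicable: true, passed: false, impact: 'critical' },
  { id: 'video-caption', applicable: false, passed: true, impact: 'moderate' },
]); // 0
```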
Ultimately though, an automated tool like aXe can only ever test a subset of accessibility issues, so giving someone a score of 100% can be misleading. It's entirely possible to make something that gets a 100% accessibility score but is still not very usable. For this reason we might consider ditching the accessibility score altogether and replacing it with something else. Maybe just an indicator that there are (minor|major|critical) errors? Open to suggestions here :)
@marcysutton @WilcoFiers