Strawman: Target privacy constraints #17
Comments
Thanks for the writeup, Ben. I note that your information leakage is linked pairwise to sites. One thing I liked about what we worked out for IPA is that it wasn't pairwise, which can ultimately provide a stronger privacy guarantee (which you might also be able to trade for better utility through larger About the "budget" notion more generally as opposed to simply stating that there is an upper bound on the amount of information that a site can gain about activity on other sites in each epoch. Again, whether that is an upper bound for each site, or a global bound is something I am happy to leave open. Friendly amendment, add "single" here under security constraints:
That might have been implied, but I think it is important. Again, we might offer stronger assurances, but this is a reasonable baseline. No doubt some will object to this constraint (I note that any system that relies exclusively on TEE cannot pass the proposed test), but ... well, we can have that debate when it comes time. Regarding open source requirement on client code, I think that most of the participating browsers will have no problem there, but I don't know if that is universally true. But I don't know whether the requirement is necessary at this stage. What steps a browser vendor takes to allow users to trust that their browser is good are out of scope for standardization. However, if there are cases where code is run by browsers on behalf of others, then maybe this is a fine requirement. It might be premature to add that now. The server-side stuff probably needs a little more development. The best we might reasonably say right now is that there will need to be a process by which server operators are authorized to operate the service. That process probably involves browsers certifying particular operators, but we'll need to get into that more as we get further into this. |
Thank you martinthomson for the feedback! I've updated my issue to reflect your comments.
The only place I disagree is about the Client-side code. I think for both privacy AND competition we need to have something. If we are going to be consistent in our application of the "3 Cs" framework, we need to consider the browser as a potential point of compromise. If TEEs are not acceptable because the TEE manufacturer or cloud operator is a single point of failure which can be compelled to break privacy, then why is that not ALSO the case for the browser / OS? I thought "open source code" was a pretty low bar to aim for, which as you say is already the case for most participating browsers (although NOT the case for iOS). I also think it would be really good to have something to banish the spectre of doubt that stems from competition concerns. If everyone can be confident that Google / Microsoft / Apple are all using the same private measurement API everyone else is, and everyone can validate that there isn't some privileged side-channel Chrome / Edge / Safari are also running in addition to said private measurement API, that will help build trust in the ecosystem. |
Working for a browser-maker, I feel obligated to defend our ability to defend data that doesn't leave the browser. Protecting browsing history is on the list of things that browsers have had to do forever. We also protect cookies and passwords and other much more sensitive stuff. The distinction that I think is relevant here is between treatment of data on a device that the user controls and data that leaves that space. As Luke mentioned yesterday, where things are most challenging is where data leaves a zone the user controls (where we have a well-established understanding or at least expectations about how data is treated), which might make that data available to others. Additional scrutiny on data that exits user-controlled space, particularly when it involves data from multiple people, is entirely appropriate. Systems that aggregate private information from many people are something of a novelty here. But I don't see it as within our remit to talk about treatment within browsers, especially for such a narrow domain. I understand the competition angle (I would be OK with having a bigger discussion about that, is it worth a separate thread?), but I think that we should limit our discussion there to data that leaves the user's device. We are best not talking about the issues of self-preferencing that might occur within larger companies, limited to browser vendors. The W3C is not even the right place to have that conversation. Various competition regulators are taking a keen interest, for instance.
|
First, I believe @benjaminsavage's strawperson and the incorporated updates suggested by @martinthomson are a very defensible position to begin with. Second, like @benjaminsavage, I believe some affordance should be made here on the part of the browser maker to demonstrate the system they are a part of (if/when these APIs are generally available), which hopefully becomes part of powering a multi-billion dollar/euro/etc industry, is a transparent actor. The ask @benjaminsavage is making here is like he says "a pretty low bar," but I think the concession is an important one. We don't have to talk about self-preferencing in browser or OS companies to stipulate that the standard we are developing here has openness for all parties involved in making the API(s) work for software developers and ultimately the better data protection and privacy of end users. |
Quick initial thoughts reading this proposal: High level comment: I would prefer if we split out security and privacy constraints, mostly because I think we can have meaningful discussions about them in relative isolation without mixing things. I have strong concerns about enforcing k = 100, since for some advertisers conversions can be quite rare events and even a relatively tight epsilon should give good data for many values of k < 100 (e.g. eps=1 will yield only a ~15% error on counts of 10). Regarding privacy unit / privacy grain, I think what is written now is stronger even than IPA, which has a privacy unit of user x site. Did you intend to propose full user-level privacy here: "total amount of cross-site/cross-app information a caller can learn about a given person"? We should try to be very precise about this. |
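The "~15% error on counts of 10" figure above can be sanity-checked with a minimal sketch. It assumes (the comment does not say so) that the noise comes from a Laplace mechanism with sensitivity 1, and that "error" means the noise standard deviation relative to the true count. The function name is hypothetical.

```python
import math

def laplace_relative_error(true_count, epsilon, sensitivity=1.0):
    """Noise standard deviation divided by the true count.

    Assumption: Laplace noise with scale b = sensitivity / epsilon,
    whose standard deviation is sqrt(2) * b.
    """
    scale = sensitivity / epsilon
    std_dev = math.sqrt(2) * scale
    return std_dev / true_count

# A count of 10 with epsilon = 1
print(f"{laplace_relative_error(10, 1.0):.1%}")  # prints 14.1%
```

Under these assumptions the relative error is about 14%, in line with the rough figure quoted in the comment. A different mechanism or error metric would give a different number.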
This is a very good point, @csharrison. |
You're right. I think splitting this conversation into two will be helpful in making more rapid progress and having more focused conversations. Per your suggestion I've moved the security model into a separate issue: #18 I apologize to all the other commenters for this - as it leaves your comments looking really confusing as the sections of the post they were referencing are no longer visible here. Sorry! Please feel free to copy-paste your comments over to the new issue if you'd like. |
Two thoughts: Firstly, I agree that conversions are rare. The vast majority of Facebook advertisers have only a handful of conversions to measure each week. I care a great deal about supporting small businesses and I want to develop an API that can support their measurement needs. But just to be clear, I am proposing that K=100 applies to the impressions, not to the conversions. Even the smallest advertisers who just spend a few dollars wind up getting at least 100 impressions. Just to give an example, if an advertiser spent $5 on ads and got 200 impressions, which led to just 3 conversions, that would pass this proposed bar. We would need to add some random noise to the number 3, so the API might add some Gaussian noise, but so long as those 3 conversions originated from a group of > 100 people we would pass this bar. The thinking here is that we can say: "Yeah, there were roughly 3 conversions. We do not know which of these 100 people they came from." This feels like a pretty simple-to-communicate privacy story. Blending in with a crowd of 100+ people is something all of us have experience with every day. |
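The $5 / 200 impressions / 3 conversions example can be sketched as follows. This is only an illustration of the idea, not the proposed API: the function name, the `sigma` parameter, and the choice to treat the bar as "at least 100 distinct people" are all assumptions.

```python
import random

K_MIN_PEOPLE = 100  # strawman threshold discussed in this thread

def noisy_conversions(impression_match_keys, conversions, sigma, rng=random):
    """Return a noised conversion count, or None if the crowd is too small.

    The threshold is applied to distinct people who saw an impression,
    not to the number of conversions.
    """
    distinct_people = len(set(impression_match_keys))
    if distinct_people < K_MIN_PEOPLE:
        return None  # breakdown suppressed
    # Hypothetical noise scale; the thread does not fix a value.
    return conversions + rng.gauss(0.0, sigma)

# 200 impressions from 200 distinct people, 3 conversions
people = [f"user-{i}" for i in range(200)]
print(noisy_conversions(people, conversions=3, sigma=1.0))
```

A breakdown with only 50 distinct people would return `None` under this sketch, regardless of how many conversions it had.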
@benjaminsavage thanks for clarifying. I see indeed you mentioned "unique match keys appearing across the set of source-events with breakdown key equal to b". I will need to think about this constraint more to see if I am comfortable with it, but yeah it's definitely better than counting attributed convs. |
This is super difficult to explain. I agree we should try to be very precise. Let me try again and see if I can do better on try 3 =). So here is what I wrote:
I used the term "a caller" without defining it. I think this is where I need to be more precise.
In the paragraph I wrote, I had each "app / website" in mind when I wrote the words "a caller". This is one way in which we could achieve an upper bound on the total per-user information leakage to a given app/website. As @martinthomson alluded to above, an alternative way to achieve the same goal would be to do a data analysis to see how many apps / websites the P95 user actually interacts with, and based on that decide on a pairwise privacy budget (i.e. a separate budget per source-site x trigger-site combo). This seems to me like it would incur more noise due to the uneven distribution of sites visited per user. |
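One way to read the pairwise alternative is sketched below. It assumes that budgets compose additively and that the pairwise budget is the per-site target divided by the P95 number of counterpart sites; neither assumption is stated in the thread, and the data and function names are made up for illustration.

```python
import math

def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def pairwise_epsilon(per_site_epsilon, sites_per_user):
    """Pairwise budget chosen so the P95 user stays within the per-site target."""
    return per_site_epsilon / p95(sites_per_user)

# Hypothetical, deliberately uneven distribution: most users touch 2 sites,
# a few touch 40.
sites_per_user = [2] * 90 + [40] * 10
eps_pair = pairwise_epsilon(1.0, sites_per_user)

# A typical user with 2 sites only ever "uses" 2 * eps_pair of the target.
typical_usage = 2 * eps_pair
print(eps_pair, typical_usage)
```

With a skewed distribution like this, the pairwise budget is sized for the heavy users, so the typical user gets far more noise than a single per-site bound would require. That appears to be the effect described in the comment above.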
Helpful clarification @benjaminsavage. I too thought you were talking about conversions. K=100 impressions seems like an ok lower bound. |
About the minimum number of events to be aggregated. There is the concept of K-anonymity. The lower bound recommended in Europe is 15. Meaning each cohort has at least 15 individuals. |
"Cannot be linked to a specific individual" is not enough. Any widely available system will be used by adversaries to carry out attacks on users (Microtargeting as Information Warfare) not just by legit advertisers making win-win offers. Group size will have to depend on adversary's capabilities and goals. Instead of "cannot be linked", the (fraction of targeted users in the group) * (cost of attacking one target) needs to be lower than whatever value the adversary places on a successful attack. |
Would we like to present these in an upcoming meeting? |
I think we've already covered this discussion. @csharrison's presentation on "private measurement of individual events" answered the "(tentative)" part of this original proposal in the negative and we've captured this as an area of consensus here: https://github.com/patcg/docs-and-reports/blob/main/design-dimensions/Dimensions-with-General-Agreement.md#private-measurement-of-single-events From my perspective I think we can close this issue. |
Strawman target privacy constraints
A private measurement API should only return aggregated, anonymous information. That means this information:
The implications of this are that:
NOTE: implication 2 is marked as (tentative) because it does not improve privacy in the worst case. This is because an adversary could generate a group of 100 source events comprised of 1 authentic event and 99 fake source events, or 1 authentic event and 99 real source events they know will not generate matches. However, this type of constraint would make it much simpler to explain the system to people (e.g. "Each breakdown is aggregated across a group of at least 100 people, after which a small amount of random noise is added to further protect privacy").
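The worst case described in the note can be sketched as below. This is a toy model, not the API: it assumes the threshold counts distinct match keys, shows the aggregate before any noise is added, and uses hypothetical names throughout.

```python
K_MIN_PEOPLE = 100  # strawman group-size threshold

def raw_aggregate(source_match_keys, converting_match_keys):
    """Un-noised count of converting people, or None below the threshold."""
    people = set(source_match_keys)
    if len(people) < K_MIN_PEOPLE:
        return None
    return sum(1 for p in people if p in converting_match_keys)

# Adversary pads one authentic source event with 99 fabricated ones
# that can never match a conversion.
target = "authentic-user"
padding = [f"fake-{i}" for i in range(99)]
group = [target] + padding

# The group passes the size check, yet the raw count is exactly
# "did the target convert?"
print(raw_aggregate(group, converting_match_keys={target}))
print(raw_aggregate(group, converting_match_keys=set()))
```

In this sketch the group-size check passes while the count depends on one person only, which is why the note says the constraint does not help in the worst case and that any real protection has to come from the added noise.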
Regarding information leakage over time
In each "epoch" (strawman: each week), a private measurement API should provide some upper bound, or limit, on the total amount of cross-site/cross-app information a caller can learn about a given person. That limit should be low enough that it does not in effect leak browsing history.
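A per-epoch limit of this kind could be tracked roughly as follows. This is a minimal sketch under assumptions not made in the proposal: leakage is measured as an additive epsilon, the budget is keyed by (caller, person, epoch), and the class and method names are invented for illustration.

```python
from collections import defaultdict

class EpochBudget:
    """Tracks how much budget each caller has spent on each person per epoch."""

    def __init__(self, epsilon_per_epoch):
        self.limit = epsilon_per_epoch
        self.spent = defaultdict(float)  # (caller, person, epoch) -> epsilon used

    def try_spend(self, caller, person, epoch, epsilon):
        """Deduct epsilon if it fits within the epoch limit; otherwise refuse."""
        key = (caller, person, epoch)
        if self.spent[key] + epsilon > self.limit:
            return False
        self.spent[key] += epsilon
        return True

budget = EpochBudget(epsilon_per_epoch=1.0)
print(budget.try_spend("shop.example", "person-1", "2022-W40", 0.6))  # fits
print(budget.try_spend("shop.example", "person-1", "2022-W40", 0.6))  # exceeds limit
print(budget.try_spend("shop.example", "person-1", "2022-W41", 0.6))  # new epoch
```

Whether the key should include the caller at all (a per-site bound) or only the person (a global bound) is exactly the question left open earlier in the thread.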