-
Notifications
You must be signed in to change notification settings - Fork 232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IPIP: format for denylists for IPFS Nodes and Gateways #299
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for submitting this @foreseaz.
As noted in #298 (comment) IPFS project has a clear need for standardizing allow/deny lists, and discussion around this IPIP is welcome.
I will be pinging some stakeholders for additional review, but my initial feedback/asks:
- To be aligned with the rest of IPFS stack, we need an escape hatch for other hash functions than
sha2-256
– details inline - Make it clear which fields are optional (e.g.
description
andstatus_code
) and what is the implicit default (emptydescription
andstatus_code
410
) - Add top level optional
description
field where denylist maintainers can include additional context (purpose, policy, link to more details), which could be displayed in list management UIs, and perhaps on Gateway HTTP error pages?
|
||
The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content. | ||
|
||
Before the hashing, all CIDv0 in both `cid` field and `content_path` fields are converted to CIDv1 for consistency. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before the hashing, all CIDv0 in both `cid` field and `content_path` fields are converted to CIDv1 for consistency. | |
Before the hashing, all CIDs in both `cid` field and `content_path` fields MUST be converted to CIDv1 in Base32 for consistency. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't support base32, whoever is hashing this is already IPFS specific details so I don't think there is value in using a text based format.
Hashing raw binary CIDs is faster and use less memory.
This is a non forward compatible change, pls be carefull.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how we can avoid CIDs-as-strings when arbitrary content_path
are hashed.
Are you suggesting this IPIP define different CID normalization rule for cid
and different for content_path
values?
|
||
**Side notes on `hashed_cids` & `hashed_content_paths` types** | ||
|
||
The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using sha256 is sensible, but hard-coding a specific hash function in the spec is against the spirit Multiformats, which we aim to use in IPFS stack.
Perhaps we could future-proof this at the low cost of adding 1220
(hex for Multihash prefix for sha256
).
This will keep the digest string intact, but turn the field into a valid Multihash, allowing list creators to switch the hash function in the future. An alternative is to have hash function type in a separate field, but this seems less expensive.
The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content. | |
The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is hashed (by default with `sha256()`) and stored as a [Multihash](https://docs.ipfs.io/concepts/glossary/#multihash) encoded as hex for easier interop with existing tools. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content. |
We could even make it a valid Multibase
string by adding f
(Base16) at the front, but not sure how useful alternatives to hex would be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why not using the multibase prefix, it's cheap to compare, and allows people to use a more compact base64 or base2048 in the future.
We don't need the full cid, so I do like @lidel's idea of just using the multihash.
Why adding content path ? Everything could be done with cids, you just include intermediaries cids. It seems we will have to write code checking intermediaries cids anyway in case the path is reachable multiple ways. |
Content paths allow us to block a specific domain. So if /ipns/bad-website.com keeps updating their content, we'll continue to block requests to that domain. |
```js= | ||
{ | ||
action: "block", | ||
entries: [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another question raised during triage today: what happens when the block list grows to megabytes?
This is a real concern, as https://badbits.dwebops.pub/ alone is getting close to 900KiB, and history shows that even efficient pattern-matching things like adblock lists are multiple megabytes in size (example).
This spec should provide a way for representing handle big, big blocklists.
💡 One idea is to introduce a special entry with type: "import"
and either cid
or content_path
pointing at some other list. This provides a solution for sharding and maintaining big lists AND allows composing denylists using existing ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sharded lists, we may need to specify a prefix or something to indicate which list is for which shard
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mathew-cf you want to spec a JSON based HAMT ? 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now that you mention it @Jorropo i think composability is probably enough for now lol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey folks,
We have been operating an internal denylist that is synced with badbits in {nft/web3}.storage. We need to align well on this direction with content path. There is at least a limitation around this that we have found:
- if we block a CID with content path, if user just tries to fetch the resource in the path via its own CID (present in response etag from gateway), it won't be flagged or we will need a follow up check in the denylist
- same as before, but on the other way around
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A workaround for the first case (content path blocked, bypassed via CID) could be to resolve the content path and then block that CID as well.
For the second case (CID blocked, bypassed with content path), we're planning on using x-ipfs-roots
to ensure that none of the resolved CIDs are blocked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are currently hitting a limitation with current badbits implementation, and thoughts here would not solve that as well.
Currently in nftstorage.link + w3s.link, we rely on anchors fetched from badbits. We put them in a KV Store in the edge and before resolving content (including via Cache), we check KV Store and if is there, we don't serve content. If cid+path resolution, we will also check ETag
presence in our denylist before serving.
However, we want to move forward with more agressive caching, and current approach leads to problems that we cannot solve. Let's consider for instance, https://blog.cloudflare.com/introducing-cache-reserve/ that allows content to be permanently cached on the Edge for up to 30 days (without HITs). This would be a desirable feature for a gateway, given the costs to use this kind of feature are considerably lower than the needed bandwidth to go to the origin all the time. So, if a gateway wants to rely on this kind of feature, it needs to be proactive on purging from Cache bad content. And so, what do we need to purge the content? We need its HTTP URL, aka cid+path.
With the above in mind, I think we should attempt to move out of the hashed_cid
+ hashed_content_path
types. Having root CIDs and content path are likely what we need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to confirm you describe your service's use case, and not suggesting removal of support for hashed entries?
These must be part of spec like this one, due to very bad bits.
I'll throw my two cents:
|
- `content`: stores the content that should be blocked according to the type. It's suggested that all CIDv0 needs to be converted into CIDv1 to keep the consistency. | ||
- `content_path`: the content path needs to be blocked). | ||
- `description`: description of the CIDs or content paths. | ||
- `status_code`: status code to be responded for the blocked content. E.g. [410 Gone](https://github.com/ipfs/specs/blob/main/http-gateways/PATH_GATEWAY.md#410-gone); [451 Unavailable For Legal Reasons](https://github.com/ipfs/specs/blob/main/http-gateways/PATH_GATEWAY.md#451-unavailable-for-legal-reasons) or `200 OK` for allowed entry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be authoritative.
I think anything not 451
, 410
or 200
has no place on a badlist list. I don't want people start to use 3xx or whatever.
Secondly, I don't fully understand why giving out HTTP codes, assuming the goal is to join forces and let gateway operators share ban reasons, this is really unspecific,
I would like a machine readable version enum, containing things like Dos
, Legal
(maybe with a reason like Legal Copyright
, Legal Hatespeech
, Legal other
, ...), ...
Some gateway operators might prefer faking 500
s or just flat out ignore certain reasons.
|
||
The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content. | ||
|
||
Before the hashing, all CIDv0 in both `cid` field and `content_path` fields are converted to CIDv1 for consistency. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't support base32, whoever is hashing this is already IPFS specific details so I don't think there is value in using a text based format.
Hashing raw binary CIDs is faster and use less memory.
This is a non forward compatible change, pls be carefull.
- `content`: stores the content that should be blocked according to the type. It's suggested that all CIDv0 needs to be converted into CIDv1 to keep the consistency. | ||
- `content_path`: the content path needs to be blocked). | ||
- `description`: description of the CIDs or content paths. | ||
- `status_code`: status code to be responded for the blocked content. E.g. [410 Gone](https://github.com/ipfs/specs/blob/main/http-gateways/PATH_GATEWAY.md#410-gone); [451 Unavailable For Legal Reasons](https://github.com/ipfs/specs/blob/main/http-gateways/PATH_GATEWAY.md#451-unavailable-for-legal-reasons) or `200 OK` for allowed entry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A status code of 200
is weird, you already have the action field saying if it's blocked are allowed, what about just saying the usage of status_code
is undefined if the content is allowed ?
|
||
**Side notes on `hashed_cids` & `hashed_content_paths` types** | ||
|
||
The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why not using the multibase prefix, it's cheap to compare, and allows people to use a more compact base64 or base2048 in the future.
We don't need the full cid, so I do like @lidel's idea of just using the multihash.
This IPIP is scheduled for discussion & ratification at the next IPFS Implementers call September 29th, 2022! https://calendar.google.com/calendar/u/0/embed?src=ipfs.io_eal36ugu5e75s207gfjcu0ae84@group.calendar.google.com |
I have a few questions about this proposal:
Also, how do we trust these lists? If we keep a decentralized design, I feel that a signature should be added to blocking records to verify the source of a blocking decision. Web browsers use a blocking mechanism called "Safe Browsing" (see https://wiki.mozilla.org/Security/Safe_Browsing for details on the Firefox implementation). The design is very different, since it relies on a centralized service but it has interesting privacy characteristics to prevent leaking the client's navigation while minimizing the amount of extra resources needed (both bandwidth and storage). I think it would be interesting to build something similar on top of the denylist so that light IPFS nodes don't have to download and update large datasets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Some open comments about the way hashes are generated/normalized are not resolved
- We are missing a working implementation – is there any library in the works @foreseaz?
- Missing test fixtures – if we had sample implementation, we could reuse test vectors.
- Bit worried about low engagement from IPFS implementations in general
- I suggest we continue the discussion, but perhaps park ratification talk until IPFS filtering to allow node operators to decide on content they are willing to serve kubo#8492 is picked up? (or a sibling JS/Rust implementation starts similar work)
description: "example.com", | ||
status_code: 410 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UX nits:
description
andstatus_code
per entry should be optionaldescription
andstatus_code
should also be optional at the top level- when entry has no value, global one is used.
- when global
description
andstatus_code
are not defined, and action isblock
,Gone
and410
are used
|
||
The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content. | ||
|
||
Before the hashing, all CIDv0 in both `cid` field and `content_path` fields are converted to CIDv1 for consistency. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how we can avoid CIDs-as-strings when arbitrary content_path
are hashed.
Are you suggesting this IPIP define different CID normalization rule for cid
and different for content_path
values?
|
||
The following concern may not lie in this scope, but it is worth to be mentioned in this proposal. The blocking of CIDs which are not malicious and are widely used can potentially jeopardize the availability of multiple sites on that IPFS gateway. Possible examples include common images, javascript bundles, or stylesheets. | ||
|
||
### Alternatives |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: useful to mention if https://jsonlines.org was considered instead of JSON. It allows for applying big denylistis without loading entire thing in memory.
Seems the direction is to split lists into composable chunks instead, but worth documenting why we did not choose https://jsonlines.org
```js= | ||
{ | ||
action: "block", | ||
entries: [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to confirm you describe your service's use case, and not suggesting removal of support for hashed entries?
These must be part of spec like this one, due to very bad bits.
We've discussed this IPIP during IPFS Implementers call today and want to move forward, including ability to compose lists from smaller ones, however @foreseaz is no longer working on it, which makes applying changes challenging. Next steps:
|
I apologize for the late reply and really sorry that I haven't been following the discussion since I was back to school last month.
There is working implementation related to work, but it may not be shared. Thanks for the discussion on the IPIP today, I just saw the issue has been closed, I may miss the on-going discussion in work channels, so either @mathew-cf creates a new IPIP PR or re-open the issue and I'll update the current PR works for me. Thanks again! |
Hi, is another PR in the works? |
I've iterated over this commit and opened #340 |
This IPIP adds spec for (deny|allow)lists for IPFS Nodes and Gateways.
Supported rules:
cid
content_path
hashed_cid
hashed_content_path
Closes #298