Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPIP: format for denylists for IPFS Nodes and Gateways #299

Closed
wants to merge 2 commits into from

Conversation

foreseaz
Copy link

@foreseaz foreseaz commented Jul 15, 2022

This IPIP adds spec for (deny|allow)lists for IPFS Nodes and Gateways.

Supported rules:

  • plain
    • cid
    • content_path
  • hashed (for trully bad bits)
    • hashed_cid
    • hashed_content_path

Closes #298

@lidel lidel changed the title ipip: add IPIP0002 Denylists for IPFS Nodes and Gateways IPIP: denylists for IPFS Nodes and Gateways Jul 21, 2022
Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for submitting this @foreseaz.

As noted in #298 (comment) IPFS project has a clear need for standardizing allow/deny lists, and discussion around this IPIP is welcome.

I will be pinging some stakeholders for additional review, but my initial feedback/asks:

  • To be aligned with the rest of IPFS stack, we need an escape hatch for other hash functions than sha2-256 – details inline
  • Make it clear which fields are optional (e.g. description and status_code) and what is the implicit default (empty description and status_code 410)
  • Add top level optional description field where denylist maintainers can include additional context (purpose, policy, link to more details), which could be displayed in list management UIs, and perhaps on Gateway HTTP error pages?


The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content.

Before the hashing, all CIDv0 in both `cid` field and `content_path` fields are converted to CIDv1 for consistency.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Before the hashing, all CIDv0 in both `cid` field and `content_path` fields are converted to CIDv1 for consistency.
Before the hashing, all CIDs in both `cid` field and `content_path` fields MUST be converted to CIDv1 in Base32 for consistency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't support base32, whoever is hashing this is already IPFS specific details so I don't think there is value in using a text based format.
Hashing raw binary CIDs is faster and use less memory.

This is a non forward compatible change, pls be carefull.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how we can avoid CIDs-as-strings when arbitrary content_path are hashed.

Are you suggesting this IPIP define different CID normalization rule for cid and different for content_path values?


**Side notes on `hashed_cids` & `hashed_content_paths` types**

The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content.
Copy link
Member

@lidel lidel Jul 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using sha256 is sensible, but hard-coding a specific hash function in the spec is against the spirit Multiformats, which we aim to use in IPFS stack.

Perhaps we could future-proof this at the low cost of adding 1220 (hex for Multihash prefix for sha256).

This will keep the digest string intact, but turn the field into a valid Multihash, allowing list creators to switch the hash function in the future. An alternative is to have hash function type in a separate field, but this seems less expensive.

Suggested change
The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content.
The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is hashed (by default with `sha256()`) and stored as a [Multihash](https://docs.ipfs.io/concepts/glossary/#multihash) encoded as hex for easier interop with existing tools. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content.

We could even make it a valid Multibase string by adding f (Base16) at the front, but not sure how useful alternatives to hex would be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why not using the multibase prefix, it's cheap to compare, and allows people to use a more compact base64 or base2048 in the future.

We don't need the full cid, so I do like @lidel's idea of just using the multihash.

@lidel lidel changed the title IPIP: denylists for IPFS Nodes and Gateways IPIP: format for denylists for IPFS Nodes and Gateways Jul 22, 2022
@Jorropo
Copy link
Contributor

Jorropo commented Jul 22, 2022

Why adding content path ? Everything could be done with cids, you just include intermediaries cids.

It seems we will have to write code checking intermediaries cids anyway in case the path is reachable multiple ways.

@lidel lidel requested a review from a team July 22, 2022 16:15
@mathew-cf
Copy link

Why adding content path ? Everything could be done with cids, you just include intermediaries cids.

It seems we will have to write code checking intermediaries cids anyway in case the path is reachable multiple ways.

Content paths allow us to block a specific domain. So if /ipns/bad-website.com keeps updating their content, we'll continue to block requests to that domain.

```js=
{
action: "block",
entries: [
Copy link
Member

@lidel lidel Jul 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another question raised during triage today: what happens when the block list grows to megabytes?
This is a real concern, as https://badbits.dwebops.pub/ alone is getting close to 900KiB, and history shows that even efficient pattern-matching things like adblock lists are multiple megabytes in size (example).

This spec should provide a way for representing handle big, big blocklists.

💡 One idea is to introduce a special entry with type: "import" and either cid or content_path pointing at some other list. This provides a solution for sharding and maintaining big lists AND allows composing denylists using existing ones.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sharded lists, we may need to specify a prefix or something to indicate which list is for which shard

Copy link
Contributor

@Jorropo Jorropo Jul 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mathew-cf you want to spec a JSON based HAMT ? 🙂

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now that you mention it @Jorropo i think composability is probably enough for now lol

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey folks,
We have been operating an internal denylist that is synced with badbits in {nft/web3}.storage. We need to align well on this direction with content path. There is at least a limitation around this that we have found:

  • if we block a CID with content path, if user just tries to fetch the resource in the path via its own CID (present in response etag from gateway), it won't be flagged or we will need a follow up check in the denylist
  • same as before, but on the other way around

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A workaround for the first case (content path blocked, bypassed via CID) could be to resolve the content path and then block that CID as well.
For the second case (CID blocked, bypassed with content path), we're planning on using x-ipfs-roots to ensure that none of the resolved CIDs are blocked

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are currently hitting a limitation with current badbits implementation, and thoughts here would not solve that as well.

Currently in nftstorage.link + w3s.link, we rely on anchors fetched from badbits. We put them in a KV Store in the edge and before resolving content (including via Cache), we check KV Store and if is there, we don't serve content. If cid+path resolution, we will also check ETag presence in our denylist before serving.

However, we want to move forward with more agressive caching, and current approach leads to problems that we cannot solve. Let's consider for instance, https://blog.cloudflare.com/introducing-cache-reserve/ that allows content to be permanently cached on the Edge for up to 30 days (without HITs). This would be a desirable feature for a gateway, given the costs to use this kind of feature are considerably lower than the needed bandwidth to go to the origin all the time. So, if a gateway wants to rely on this kind of feature, it needs to be proactive on purging from Cache bad content. And so, what do we need to purge the content? We need its HTTP URL, aka cid+path.

With the above in mind, I think we should attempt to move out of the hashed_cid + hashed_content_path types. Having root CIDs and content path are likely what we need.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to confirm you describe your service's use case, and not suggesting removal of support for hashed entries?
These must be part of spec like this one, due to very bad bits.

@hsanjuan
Copy link
Contributor

I'll throw my two cents:

  • We don't need ONE list-format spec. If anything we need several.
  • This one is particularly bad for processing (parsing an object, with a huge array inside, with objects inside), when all I need are CIDs or paths. Go cannot read each CID in this file without parsing it all into memory for example.
  • Whatever the list format, it is probably going to be quite easy to parse anyways. The question is what is the best format for this? But for a start let's just stick to the badbits format which is no good, but at least we don't have multiple formats that suck in the same ways (like being JSON objects).
  • Where is the badbits format documented? I always forget that.
  • We need a discussion on what formats are good and for what:
    • Easy to read and edit and append: a text file with 1 cid per line, or one path per line, or one double-hash per line (plus modifiers)
    • Fast to move around (gzipped stuff)
    • Easy to share over IPFS (an IPLD linked lists of things)
    • Store-the-hash database files
    • CAR files.
    • What about multiple files?
    • How does ublock &co does it?
    • What about blocking by regexp'ed path?
  • Another discussion is what to exactly to doublehash:
    • Cid + path (like badbits)
    • Resolved paths?
    • Multihash?
  • Block lists should be have as little information as possible.
  • Any double hashing should happen using the fastest hash function available (and no need to use very long hashes here)
  • I think everything should be double hashed. Why not.
  • Details, response codes, descriptions, original hashes etc etc do not belong in the denylists. If you need that keep it out of the denylist, which is derived from it. If anything, put is as comments and not as part of the schema, so that the parser can ignore it.

- `content`: stores the content that should be blocked according to the type. It's suggested that all CIDv0 needs to be converted into CIDv1 to keep the consistency.
- `content_path`: the content path needs to be blocked).
- `description`: description of the CIDs or content paths.
- `status_code`: status code to be responded for the blocked content. E.g. [410 Gone](https://github.com/ipfs/specs/blob/main/http-gateways/PATH_GATEWAY.md#410-gone); [451 Unavailable For Legal Reasons](https://github.com/ipfs/specs/blob/main/http-gateways/PATH_GATEWAY.md#451-unavailable-for-legal-reasons) or `200 OK` for allowed entry.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be authoritative.
I think anything not 451, 410 or 200 has no place on a badlist list. I don't want people start to use 3xx or whatever.

Secondly, I don't fully understand why giving out HTTP codes, assuming the goal is to join forces and let gateway operators share ban reasons, this is really unspecific,
I would like a machine readable version enum, containing things like Dos, Legal (maybe with a reason like Legal Copyright, Legal Hatespeech, Legal other, ...), ...
Some gateway operators might prefer faking 500s or just flat out ignore certain reasons.


The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content.

Before the hashing, all CIDv0 in both `cid` field and `content_path` fields are converted to CIDv1 for consistency.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't support base32, whoever is hashing this is already IPFS specific details so I don't think there is value in using a text based format.
Hashing raw binary CIDs is faster and use less memory.

This is a non forward compatible change, pls be carefull.

- `content`: stores the content that should be blocked according to the type. It's suggested that all CIDv0 needs to be converted into CIDv1 to keep the consistency.
- `content_path`: the content path needs to be blocked).
- `description`: description of the CIDs or content paths.
- `status_code`: status code to be responded for the blocked content. E.g. [410 Gone](https://github.com/ipfs/specs/blob/main/http-gateways/PATH_GATEWAY.md#410-gone); [451 Unavailable For Legal Reasons](https://github.com/ipfs/specs/blob/main/http-gateways/PATH_GATEWAY.md#451-unavailable-for-legal-reasons) or `200 OK` for allowed entry.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A status code of 200 is weird, you already have the action field saying if it's blocked are allowed, what about just saying the usage of status_code is undefined if the content is allowed ?


**Side notes on `hashed_cids` & `hashed_content_paths` types**

The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why not using the multibase prefix, it's cheap to compare, and allows people to use a more compact base64 or base2048 in the future.

We don't need the full cid, so I do like @lidel's idea of just using the multihash.

@b5
Copy link
Contributor

b5 commented Sep 16, 2022

This IPIP is scheduled for discussion & ratification at the next IPFS Implementers call September 29th, 2022! https://calendar.google.com/calendar/u/0/embed?src=ipfs.io_eal36ugu5e75s207gfjcu0ae84@group.calendar.google.com

@fabricedesre
Copy link

I have a few questions about this proposal:

  • How are IPFS services expected to exchange lists?
  • How are conflicts resolved when the same resource has different status_code (or whatever reason mechanism ends up being used) in different lists?
  • If other actions than deny are added in the future, how is conflict resolution envisioned if the same resource ends up being denied and allowed?
  • When IPFS will be successful, scalability of these lists will be a problem. Json is not very good even as an interchange format for large data sets.

Also, how do we trust these lists? If we keep a decentralized design, I feel that a signature should be added to blocking records to verify the source of a blocking decision.

Web browsers use a blocking mechanism called "Safe Browsing" (see https://wiki.mozilla.org/Security/Safe_Browsing for details on the Firefox implementation). The design is very different, since it relies on a centralized service but it has interesting privacy characteristics to prevent leaking the client's navigation while minimizing the amount of extra resources needed (both bandwidth and storage). I think it would be interesting to build something similar on top of the denylist so that light IPFS nodes don't have to download and update large datasets.

Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ I'm positive, but this IPIP needs some additional work before merge:

  • Some open comments about the way hashes are generated/normalized are not resolved
  • We are missing a working implementation – is there any library in the works @foreseaz?
  • Missing test fixtures – if we had sample implementation, we could reuse test vectors.
  • Bit worried about low engagement from IPFS implementations in general

Comment on lines +37 to +38
description: "example.com",
status_code: 410
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UX nits:

  • description and status_code per entry should be optional
  • description and status_code should also be optional at the top level
  • when entry has no value, global one is used.
  • when global description and status_code are not defined, and action is block, Gone and 410 are used


The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content.

Before the hashing, all CIDv0 in both `cid` field and `content_path` fields are converted to CIDv1 for consistency.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how we can avoid CIDs-as-strings when arbitrary content_path are hashed.

Are you suggesting this IPIP define different CID normalization rule for cid and different for content_path values?


The following concern may not lie in this scope, but it is worth to be mentioned in this proposal. The blocking of CIDs which are not malicious and are widely used can potentially jeopardize the availability of multiple sites on that IPFS gateway. Possible examples include common images, javascript bundles, or stylesheets.

### Alternatives
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: useful to mention if https://jsonlines.org was considered instead of JSON. It allows for applying big denylistis without loading entire thing in memory.

Seems the direction is to split lists into composable chunks instead, but worth documenting why we did not choose https://jsonlines.org

```js=
{
action: "block",
entries: [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to confirm you describe your service's use case, and not suggesting removal of support for hashed entries?
These must be part of spec like this one, due to very bad bits.

@lidel
Copy link
Member

lidel commented Sep 29, 2022

We've discussed this IPIP during IPFS Implementers call today and want to move forward, including ability to compose lists from smaller ones, however @foreseaz is no longer working on it, which makes applying changes challenging.

Next steps:

  • I am closing this PR.
  • @mathew-cf will go over feedback and open a new PR and link below, so we can continue there.

@lidel lidel closed this Sep 29, 2022
@foreseaz
Copy link
Author

foreseaz commented Sep 29, 2022

⚠️ I'm positive, but this IPIP needs some additional work before merge:

  • Some open comments about the way hashes are generated/normalized are not resolved

  • We are missing a working implementation – is there any library in the works @foreseaz?

  • Missing test fixtures – if we had sample implementation, we could reuse test vectors.

  • Bit worried about low engagement from IPFS implementations in general

    • I suggest we continue the discussion, but perhaps park ratification talk until ipfs/kubo#8492 is picked up? (or a sibling JS/Rust implementation starts similar work)

I apologize for the late reply and really sorry that I haven't been following the discussion since I was back to school last month.

  • We are missing a working implementation – is there any library in the works @foreseaz?

There is working implementation related to work, but it may not be shared.

Thanks for the discussion on the IPIP today, I just saw the issue has been closed, I may miss the on-going discussion in work channels, so either @mathew-cf creates a new IPIP PR or re-open the issue and I'll update the current PR works for me. Thanks again!

@LaurenSpiegel
Copy link

Hi, is another PR in the works?

@thibmeu
Copy link
Contributor

thibmeu commented Nov 10, 2022

I've iterated over this commit and opened #340

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Create spec for (allow|deny)lists for IPFS Nodes and Gateways
10 participants