Spec for HTTPS-only providers on Amino DHT #496

Open
lidel opened this issue Mar 6, 2025 · 3 comments
Labels
need/triage Needs initial labeling and prioritization

Comments

lidel commented Mar 6, 2025

Filing a public issue because this has died in private channels and threads many times.
Part of work towards https://github.com/ipshipyard/roadmaps/issues/9 and https://github.com/ipshipyard/roadmaps/issues/15
This is also helping with webseeds from https://github.com/ipshipyard/roadmaps/issues/19

Need

We need to agree on and write down a specification for how HTTP-only Trustless Gateway providers can be announced on the existing Amino DHT.

Such a provider can have a synthetic PeerID for interop with routing systems and software, but in reality it won't have a libp2p networking stack; it will only expose an HTTP endpoint that follows the Trustless Gateway spec.

Wider context

  • We already have big providers that run an HTTP-only provider under a synthetic "PeerID" (e.g. Storacha announces a special peer with a /tls/http multiaddr – right now it is only announced to IPNI)
  • We already have experimental HTTP-only retrieval in boxo/rainbow that
    • Detects /tls/http multiaddrs
    • Performs content-type negotiation / probe to confirm the HTTP endpoint supports trustless gateway protocol
    • Performs HTTP-only retrieval from such provider
    • The PeerID is not used for anything.
      • There is no auth when a client learns about this provider from a delegated routing system that proxies to IPNI and the DHT, nor does IPNI run any validation checks when it accepts such an announcement.
  • We want to allow people to self-host over HTTP and turn off Bitswap
    • Stability and cost reduction thanks to HTTP caching and free HTTP CDNs make self-hosting possible with cheap hardware and narrow bandwidth
  • We can't deploy protocol changes to DHT without waiting 6-24+ months for significant % of DHT server nodes to update
  • Most people who self-host run Kubo, or IPFS Cluster backed by a fleet of Kubo nodes, announcing to the Amino DHT (and some run a sidecar that also announces to IPNI)

North star

Proposed spec direction

Open questions

  1. Any concerns with using /tls/http in announced Multiaddrs for this?
    • Example: is it ok for a Kubo user to put their gateway (Gateway.NoFetch=true & Gateway.DeserializedResponses) behind Cloudflare and put the URL in Addresses.AppendAnnounce as /dns4/gw.example.com/tcp/443/tls/http as an extra hint for clients that prefer HTTP-only retrieval?
    • afaik we have all the specs necessary for reliable interop – see the spec direction above, but comment below if any spec gaps exist
  2. If content providers start announcing their trustless, non-recursive gateways as /dnsX/example.com/../tls/http, do we need any auth of DNS names?
    • Should DHT nodes accept and gossip /dnsX/example.com/../tls/http addrs blindly, or should extra validation be added at the routing system level (DHT, IPNI)?
    • Following the prior art of ACME challenges, we need to think about HTTPS on raw IPs and other setups without access to DNS. We could require a signed PeerID published in a DNS TXT record or as a file served via HTTP GET at the .well-known/libp2p/signed-peerid path, but this complicates deployments and requires people to deal with PeerIDs – and at the end of the day, how is this auth check any better than an HTTP client blindly sending a trustless probe (GET /ipfs/{cid}) and treating a 404 as an indication that the endpoint is not a valid gateway?
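For illustration, here is a minimal Go sketch of such a blind probe and of how a client might interpret the result. The function names and the interpretation rule (only a 200 with a raw-block Content-Type counts as confirmation) are assumptions for this sketch, not part of any spec:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// Content type for verifiable raw-block responses from a trustless gateway.
const rawBlockType = "application/vnd.ipld.raw"

// probeRequest builds the blind probe mentioned above: GET /ipfs/{cid}
// with an Accept header asking for a raw block. The CID passed in is
// one the client already learned from routing, or a well-known one.
func probeRequest(baseURL, cid string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+"/ipfs/"+cid, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", rawBlockType)
	return req, nil
}

// looksLikeTrustlessGateway interprets the probe result: a 200 whose
// Content-Type echoes the raw-block type is strong evidence the endpoint
// speaks the trustless gateway protocol; anything else (404, an HTML
// error page, a redirect to a landing page) is treated as "not a gateway".
func looksLikeTrustlessGateway(status int, contentType string) bool {
	return status == http.StatusOK && strings.HasPrefix(contentType, rawBlockType)
}

func main() {
	// "someCID" is a placeholder path segment, not a real CID.
	req, err := probeRequest("https://gw.example.com", "someCID")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL.String())
	fmt.Println(looksLikeTrustlessGateway(http.StatusOK, rawBlockType))
}
```

Note that the probe itself fetches (and must validate) a block, which is exactly the "how is auth any better" trade-off raised above.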

As usual, ideas, feedback welcome.

@lidel lidel added the need/triage Needs initial labeling and prioritization label Mar 6, 2025
@lidel lidel changed the title Spec for HTTP-only providers on Amino DHT Spec for HTTPS-only providers on Amino DHT Mar 6, 2025
@aschmahmann (Contributor)

Any concerns with using /tls/http in announced Multiaddrs for this?

Yes. I don't think this is enough information for what you're trying to do (trustless-gateway over HTTP only).

Two main issues:

  1. How to handle more than 1 HTTP-based protocol
  2. Inconsistency regarding attack vectors for HTTPS vs libp2p-based protocols like tcp + tls + yamux + bitswap

These are separate issues that can be solved independently, but both are triggered by the current proposal.

More context and nuance below, but if you're looking for a suggestion that seems relatively in line with the proposal:

  1. Enable handling multiple HTTP-based protocols by adding a small metadata field (similar to what exists in IPNI) and using it to indicate trustless-http-gateway (it will be ignored by most servers at the start and eventually adopted by them), and for the next few months during the upgrade optimistically assume HTTP multiaddrs mean trustless-gateway. Set a deadline (e.g. a year) after which clients should stop assuming HTTP multiaddrs mean trustless-gateway.
    • If desired (or if the idea of setting a deadline for changing behavior seems scary), define a way to identify all the HTTP protocols an endpoint is willing to serve (we can reuse the libp2p one or choose something else if there's a good reason). This will be nice for debugging as well as for handling multiple protocols together, and doesn't seem super high effort.
  2. Remove the server-side check that provider records come from the peer that is advertising them: switch the libp2p kad spec from a MUST to something softer like a MAY and document it, and make it a SHOULD NOT (or MUST NOT) in the Amino DHT spec.

Multiple HTTP-based protocols

Background:

  • Amino DHT: Today the main reason it's ok that no protocol (e.g. /ipfs/bitswap/1.2.0) is specified along with the provider record is that multistream and identify exist for protocol discovery and negotiation. Otherwise things become more complicated, as the application has to guess the supported protocol – for example as we upgraded versions of bitswap over time, or if we wanted to add support for new protocols.
  • IPNI: IPNI solves this problem for HTTP by adding a "transport-ipfs-gateway-http" flag into its metadata.
    • It also happens to have an extra bogus peerID associated with the HTTP provider for legacy reasons, which is similar to the bogus peerID that would come with this proposal; however, this proposal misses the metadata field.

With this proposal how would it be possible to support multiple HTTP-based protocols when data was only discovered using the Amino DHT?

Some example additional / alternative HTTP APIs already in use in the generalized IPFS space:

HTTP-based transfer protocols can be added the same way; all they need to do is use a new content-type in the Accept / Content-Type header

I don't think this is enough given examples like the above. Just looking at these examples, far more than just the content types would need to change, and they may already have their own conflicting content types.

Potential Solutions

Solving this problem could be done with either (or both) of the following:

  1. Add a mechanism for adding a small fixed amount of (protocol) metadata to provider records. It could even be made fixed per peer if the potential (ab)use of per-record metadata was deemed too much.
  2. Leveraging a discovery mechanism for HTTP content retrieval endpoint such as https://github.com/libp2p/specs/tree/6d38f88f7b2d16b0e4489298bcd0737a6d704f7e/http#namespace

I understand that there might be some resistance to adding per-protocol metadata based on the network taking a long time to support new functionality, and that's fine since we have other options that could be pursued in parallel. Some examples:

  • Optimistically guess (the proposal here) for a few months while a solution like metadata is rolled out
  • Insist on using the path gateway form /ipfs/ and /ipns/ rather than the subdomain form and claim those for trustless-gateway (i.e. all other protocols will need their own path-roots)
  • Have a discovery endpoint we could use, and use optimistic guessing to save a round-trip if desired (similar to what is doable with libp2p + bitswap streams)
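As an illustration of the discovery-endpoint option, a client could fetch GET /.well-known/libp2p/protocols and check whether the protocol it wants is listed. A minimal Go sketch follows; the exact JSON schema (a map of protocol IDs to objects with a "path" mount point) is an assumption here, so check it against the libp2p HTTP spec linked above:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// protocolEntry mirrors the assumed shape of one entry in the
// well-known map: {"<protocol-id>": {"path": "<mount-path>"}}.
type protocolEntry struct {
	Path string `json:"path"`
}

// supportsProtocol parses the body of GET /.well-known/libp2p/protocols
// and reports the mount path of protoID, if the endpoint serves it.
func supportsProtocol(body []byte, protoID string) (string, bool) {
	var m map[string]protocolEntry
	if err := json.Unmarshal(body, &m); err != nil {
		return "", false
	}
	e, ok := m[protoID]
	return e.Path, ok
}

func main() {
	// "/ipfs/gateway" is a hypothetical protocol ID used for the demo.
	body := []byte(`{"/ipfs/gateway": {"path": "/"}}`)
	path, ok := supportsProtocol(body, "/ipfs/gateway")
	fmt.Println(path, ok)
}
```

The optimistic-guessing option above would simply start the trustless-gateway request in parallel with this lookup to save a round-trip.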

Inconsistency regarding attack vectors

Currently there is a check in the DHT https://github.com/libp2p/specs/tree/6d38f88f7b2d16b0e4489298bcd0737a6d704f7e/kad-dht#content-provider-advertisement.

Each peer that receives the ADD_PROVIDER RPC should validate that the received PeerInfo matches the sender's peerID, and if it does, that peer should store the PeerInfo in its datastore

This in effect disables the ability for third-parties to advertise "Alice has Foo" into the Amino DHT and instead allows only "I have Foo" to be advertised.

By choosing not to check the corresponding peerID, this in effect enables delegated advertising, but ONLY for HTTPS retrieval and not for libp2p-based protocols (bitswap, pubsub discovery, etc.). It seems perverse that we're choosing to take a libp2p-based system and then hobble it ONLY for the libp2p use case.

My understanding is that the primary motivation for this check (which far predates me and IIUC is 11+ years old) is to limit the impact of reflection / amplification attacks: https://en.wikipedia.org/wiki/Denial-of-service_attack#Reflected_attack.

Currently the Amino DHT allows for some reflection attacks:

  • Since server nodes do not specifically validate all multiaddrs sent via provider records (or identify), malicious providers can put in bogus IP addresses that will cause some amount of work to be done by the server – however, the work only continues until the security handshake fails
  • If server nodes want to cause clients to perform attacks, they can, in addition to manufacturing bogus IP addresses, also redirect requests to peers that don't have the content. This request is more expensive (the security handshake completes and a request for content is made) than the above

Both the HTTP proposal and removing the peerID check allow arbitrary peers to cause the same reflection attacks that previously only malicious server nodes could perform.

Note: This is related to:

Potential Solutions

Solving this problem could be done with either of the following:

  1. Decide that the type of reflection attack mitigated by the peerID check isn't worth it, remove that check (which might take months to propagate, but that's fine), and then things end up consistent
  2. Add additional checks to the HTTP-based approach to bring it roughly equal to the libp2p one. There are options like:

Option 1 seems to be implied by this issue, and that's likely ok. Probably the only thing I would flag is that, depending on how this expands over time (e.g. allowing HTTP pathing to download blobs), it becomes even more important that clients don't download arbitrary amounts of data without validating it. If they don't, the amount of amplification they enable can really grow.

@guillaumemichel (Contributor)

On attack vectors

IMO we shouldn't sacrifice security, because we want production systems to build on IPFS and rely on the DHT. Moreover, any abuse of reflected/amplification attacks (even against a non-IPFS target) will make IPFS look bad and contribute toward aggressive firewall policies against libp2p/ipfs.io.

That being said, as @aschmahmann mentioned, we don't have a good way to validate all multiaddrs shared via identify. It is not practical to verify all of the addresses (e.g. the server is IPv4-only and the client also advertises an IPv6 addr), and even if the peer record is signed, when it is transferred from server0 to server1, server1 must either trust that server0 has actually checked all the addrs, or verify them all again.

Currently, during a FIND_PROVIDERS request an honest DHT server will return:

  1. verified provider peer ids (verified because the server has directly received the PROVIDE request from the providers, which may or may not serve the advertised content)
  2. unverified multiaddrs for the above peer ids (unverified because the server trusts the provider to share valid addresses)

In a libp2p-only context, unverified multiaddrs are bad but not critical, since the connection will fail when the dial fails or on a peer id mismatch. No significant work is performed.

the only thing I would flag is that depending on how this expands over time (e.g. allowing HTTP pathing to download blobs) it becomes even more important that clients don't download arbitrary amounts of data without validating it. If they don't then the amount of amplification they do can really grow.

In an HTTP context, the server would return a 404, but it could become problematic as mentioned above.

A straightforward mitigation to verify an HTTP addr provided by a libp2p node making a PROVIDE request is to send a HEAD request for the advertised CID. If multiple CIDs are provided by the same libp2p node, probabilistically checking a few CIDs should be enough to validate the HTTP addr. This check doesn't guarantee that the node will actually serve the data, but it verifies that the user controls the IP+port. The same check could be performed for libp2p maddrs, but we probably want to avoid that since there could be many more addresses, hole punching, etc.

HTTP-only Provides

We don't want to allow any host to advertise that an arbitrary IP+port provides arbitrary CIDs.

Since a DHT is ultimately a key-value store, we could split the PROVIDE operation (CID -> PeerID) and the ADD_ADDR operation (PeerID -> multiaddrs).

The PROVIDE operation would require the advertising peer to be authenticated using HTTP PeerID Auth or its replacement (based on https://datatracker.ietf.org/doc/rfc9729/ – as indicated in libp2p/specs#564 (comment), the RFC was approved less than a month ago).

Before or after a PROVIDE operation, a node could perform an ADD_ADDR to DHT servers. An HTTP ADD_ADDR is an invitation for the DHT server to verify the address provided by the client by running an authentication protocol. DHT servers would only serve verified HTTP addresses along with provider records.

This means that malicious actors could still advertise content that they won't serve, but they aren't able to advertise content on HTTP addresses that they don't control. It should bring the same security guarantees for HTTP addrs as a PROVIDE operation using libp2p with the mitigation described above.

Delegated Provide

So far, we have covered only nodes advertising to the DHT that they themselves are serving CIDs.

Allowing delegated provides would let more actors be advertised as content providers in the DHT, and it removes friction around providing to the DHT, since you could delegate it to another entity.

  1. Decide that the type of reflection attack mitigated by the peerID check isn't worth it, remove that check (which might take months to propagate but that's fine) and then things end up consistent

removing the peerID check allow arbitrary peers to cause the same reflection attacks that previously only malicious server nodes could perform.

We want to avoid weakening the system, but we can allow delegated provides simply by using signatures.

E.g. a node willing to delegate its provides (to the DHT or another routing system) could sign the following:

{
  "peerid": "12D3Koo1",
  "cid": "bafy...",
  "expires": "2025-04-01 ...",
  "whitelist": ["12D3Koo2", "12D3Koo3"]
}

Note that the whitelist field is optional; if it is not included, any node can relay the advertisement. Otherwise, content routing systems should reject an advertisement coming from a peer that is NOT in the whitelist.

A DELEGATED_PROVIDE operation would include the data structure above (the encoding can be optimized) and the signature. The DHT server would verify the signature and check that the sending peer is in the whitelist before accepting the request.

Compared to current PROVIDE requests, the DELEGATED_PROVIDE has a larger payload, adding 1) the signature, 2) the peerid, 3) the timestamp, and 4) the (optional) whitelist.

The storage on the node responsible for the (re)provide can be optimised if all delegated provides from a provider share the same expiration date and whitelist. The only additional storage would be for the signatures (1 per CID).

Multiaddrs caching

For libp2p nodes delegating provides, the node performing the DELEGATED_PROVIDE operation can include a signed peer record. Alternatively, the DHT server MAY look up the peer id and add the addresses it finds to its peer store.

For HTTP providers, the delegate node may also get a certificate allowing it to run DELEGATED_ADD_ADDR against DHT servers. This request invites DHT servers to authenticate the provided HTTP address for the provider, and upon success cache it.

Migration/Deployment

The above would add new capabilities to DHT servers, hence leading to a breaking change in the network. We have (at least) two choices for deploying the change:

  1. Ship the new capabilities on DHT servers and wait for enough adoption (~80%?) before rolling out the change to clients
  2. Nodes running the newer version would form a sub-DHT within the Amino DHT network. Delegated/HTTP requests could immediately be performed in this sub-DHT. The sub-DHT would be bootstrapped to the existing Amino network, ensuring no actor could take over the sub-DHT given its small size at the beginning.

The second solution requires some additional changes to allow sub-DHTs, but it would let us ship DHT protocol upgrades faster in the future.

References


lidel commented Mar 14, 2025

Thank you. Some extra notes from meetings today:

  • +1 on not relaxing DHT guarantees (we don't want to weaken non-HTTP use cases like PNET)
  • threat model: peers aren't able to advertise HTTP endpoints that they don't control (ensure DHT can't be used for HTTP amplification attack)
    • (option A) validation of HTTP addr before retrieval
      • not feasible; this would be a dead spec – we have NO "libp2p with HTTP semantics" protocols, the only ones are HTTP-only trustless gateways. This means that, in practice, trustless gateway clients will choose NOT to implement an arbitrary peerid check and will do a basic probe or just send a GET directly
    • (option B) validation of HTTP addrs during advertisements, before accepting /tls/http to "DHT server's peerstore"
      • Investigate: Peer ID Authentication over HTTP on GET /.well-known/libp2p/protocols as the mechanism a DHT server uses before accepting a /tls/http addr
        • we don't want to do it for every CID PROVIDE
          • performing and caching the result of a one-time check on ADD_ADDR feels like a better direction, but see "Concerns" below
        • RFC 9729 (Concealed HTTP Authentication Scheme) might be used in the future, but it is not feasible right now due to the lack of support in browser contexts (afaik Web APIs and some libraries don't give access to TLS keying material, potentially blocking adoption)
      • ⚠ Concern: introducing network IO on DHT PUTs may create similar problem to the one we are trying to fix
        • someone could constantly spam announcements for random peerids (to cover the entire DHT server space) with the same /tls/http endpoint, which creates surface for potential abuse
        • lose-lose situation: DHT servers would either
          • (a) execute peerid auth checks each time, spamming (DDoSing) that /tls/http endpoint with peer id auth requests, OR
          • (b) cache the initial failure per /tls/http endpoint to avoid spamming it all the time, but that could effectively act as a DoS against the legitimate owner of that /tls/http endpoint
          • 💭 Possibly (b) is a footgun and (a) is an acceptable "tax" (HTTP servers are being scanned constantly anyway). The number of DHT servers will always be lower than the number of potential clients that could be tricked into attempting HTTP retrieval from a /tls/http addr learned from the DHT.
  • threat model: server returning a lot of bad data
    • Solved: HTTP retrieval introduced in HTTP retrieval proposal boxo#747 limits accepted responses to ~2-4MiB on the client side; it does not feel like we need to do anything more here
  • delegated provides
    • TBD – we don't have to solve delegated provides right now if we can solve announcing /tls/http without it
  • we want to flesh out the details once we have a basic DHT spec, in the next week or two
