Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libp2p + HTTP an exploratory refactor #102

Closed
wants to merge 3 commits into from

Conversation

MarcoPolo
Copy link

Hi folks!

I'm play-testing the new libp2p+HTTP api libp2p/go-libp2p#2438 so I wanted to try using it somewhere. I think the dagsync code here would benefit from libp2p+HTTP because:

  1. It means you can delete all of dagsync/dtsync eventually (after clients upgrade).
  2. It consolidates the http and libp2p stream logic.
  3. You no longer have to manage the prefix logic. That is built into libp2p+HTTP.
    a. This means your client and server are also simpler. You can toss all the prefix logic into the trash!

I've added a test case that shows a simple head fetch and sync over HTTP both on top of a native HTTP transport and a libp2p stream.

I'd love feedback on what you think about this API.

cc @masih

Comment on lines +127 to +134
publisherHost, err := libp2phttp.New(
libp2phttp.StreamHost(publisherStreamHost),
libp2phttp.ListenAddrs([]multiaddr.Multiaddr{multiaddr.StringCast("/ip4/127.0.0.1/tcp/0/http")}),
)
req.NoError(err)

go publisherHost.Serve()
defer publisherHost.Close()
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the "HTTP Host". It's like the libp2p "stream host" (aka core host.Host), but it uses HTTP semantics instead of stream semantics.

You can pass in options on creation like a stream host to do HTTP over libp2p streams, and multiaddrs to create listeners on.

serverHTTPMa := publisherHost.Addrs()[1]
req.Contains(serverHTTPMa.String(), "/http")

publisherHost.SetHttpHandlerAtPath(protoID, "/ipni/", http.StripPrefix("/ipni/", publisher))
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is where we attach our request handler. Note that we are mounting the "/ipni-sync/1" protocol at /ipni/. libp2phttp manages this mapping and clients can learn about the mapping at .well-known/libp2p.

In this case we also want out HTTP handler to not even know about the prefix, so we use the stdlib http.StripPrefix.

clientHost, err := libp2phttp.New()
req.NoError(err)

c, err := clientHost.NamespacedClient(nil, protoID, peer.AddrInfo{Addrs: []multiaddr.Multiaddr{serverHTTPMa}})
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates an http.Client that automatically knows about the prefix for a given protocol ID.

clientHost, err := libp2phttp.New()
req.NoError(err)

c, err := clientHost.NamespacedClient(clientStreamHost, protoID, peer.AddrInfo{ID: publisherStreamHost.ID(), Addrs: []multiaddr.Multiaddr{serverStreamMa}})
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same thing as above but it is going over a libp2p stream.

Note: I think I'll change the NamespacedClient api to not accept a stream host here, and instead use the one passed in as an option to libp2phttp.New.

}

for _, tc := range testCases {
t.Run(tc.name, func(t *testing.T) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing special here. Just plumbing to set up the test.

Copy link
Member

@masih masih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! LGTM

@rvagg
Copy link
Contributor

rvagg commented Aug 7, 2023

Neat, but it seems to me that the main trade-off here is that we have to rely on libp2p discovery semantics to figure out how to communicate? a client has to be /.well-known/libp2p aware in order to figure out the path to communicate on to talk in /ipni-sync/1. I guess that means there's a secondary trade-off of an additional round-trip to start actually talking about ipni stuff?

I really like that the code is simpler in here. But unfortunately it's also only simpler if we get to remove the other form of this - but if we do that, then users on both ends need to opt-in to the libp2p version of talking on http. Either using this library, or writing it themselves. One nice thing about the current incarnation is that HTTP is simple, and writing the server (and client) is fairly straightforward, with all of the complexity being in the ipni protocol itself. This would add an additional layer of complexity.

@MarcoPolo
Copy link
Author

Thanks for taking a look!

we have to rely on libp2p discovery semantics to figure out how to communicate

This is just one way to discover a peer's protocol mapping. IPNI decided on encoding this mapping in the multiaddr (which is fine), but I'd prefer not to have every application have reimplement this. I think it makes sense to have this mapping at the server metadata layer rather than defined per application/protocol. It allows the protocols to be agnostic to their path prefix, gives peers a standard way they can learn about each other, and it's an HTTP native resource.

That doesn't mean this is the only solution to create a peer's protocol mapping. IPNI can continue encoding the mapping in the multiaddr, and give that information to the HTTP host. That's not supported right now, but I think that's a great addition to the API (edit: just pushed this change).

users on both ends need to opt-in to the libp2p version of talking on http. Either using this library, or writing it themselves.

I don't think this is a "libp2p version of talking on http". A goal here is to interoperate with existing systems. This is all plain old HTTP. Yes, we're defining the .well-known endpoint to have some relatively easy way (but not free) to learn about a peer given. But if you already have this information about the peer this isn't required as mentioned above.

If this .well-known endpoint is useful for an application, implementing it is trivial and doesn't require a library.

Server:

mux.HandleFunc("/.well-known/libp2p", func(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintln(w, `{"protocols":{"/ipni-sync/1":{"path":"/custom-prefix"}}}`)
})

Client:

const prefix = await fetch("<remote>/.well-known/libp2p")
  .then(resp => resp.json())
  .then(mapping => mapping.protocols["/ipni-sync/1"].path)

Alright, so you may ask "if it's so easy what's the point of libp2p+HTTP then?" A couple of reasons, but for IPNI, the biggest benefit is that it lets you delete dagsync/dtsync and consolidate your http + libp2p stream logic. Rather than maintaining two different stacks that do the same thing, you build on top of HTTP primitives and that works on the many underlying transports (thanks to libp2p). And again, this doesn't introduce a libp2p flavored HTTP. It's just HTTP.


This API is all subject to change, so if you see something that could improve here I'm all ears!

@rvagg
Copy link
Contributor

rvagg commented Aug 8, 2023

Overall seems like a positive API addition to libp2p http, it does weaken the argument for httpath in the addr and brings it more inline with the other protocols.

It makes me ponder the possibilities for protocol probing, we do all this juggling in lassie to figure out how we can talk to peers, some of it depends on inspecting the multiaddr and some involves opening a stream and probing: https://github.com/filecoin-project/lassie/blob/main/pkg/retriever/directcandidatefinder.go

Potentially, if the trustless gateway http protocol were to adopt this, we could do some versioning and make allowances for different root paths. We're already at the point of exploring a mix of OPTIONS, Accept headers, and params to do some of the work around different implementation functionality support. I can imagine this getting out of hand once we add more features to the protocol. I'm not sure what you have here would really solve any of this better, but maybe if we needed to do hard versioning breaks it would help. We're also stuck with the fixed /ipfs/ path for this stuff, which is unfortunate (IMO), being able to vary that would be nice.

@gammazero gammazero changed the base branch from main to libp2phttp August 8, 2023 23:05
Copy link
Collaborator

@gammazero gammazero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for writing this. This is very helpful for understanding how this can be used to make libp2p work easily as a transport for HTTP. Thank you!

I have some questions concerning the best way to adapt this for use in IPNI http-sync in the PR review.

clientHost, err := libp2phttp.New()
req.NoError(err)

c, err := clientHost.NamespacedClient(protoID, peer.AddrInfo{Addrs: []multiaddr.Multiaddr{serverHTTPMa}})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to mean that a separate NamespacedClient is needed for each separate server that a request is sent to. This is somewhat different than the semantics of the standard http.Client where the same client instance can be used to send requests to servers at different URLs. Is this correct?

This matters because with the structure of the httpsync code as it currently is, the sync is a long-lived object (created once) that contains the http.Client which is reused for separate Syncer requests. What will need to change is that a new NamespacedClient will need to be associated with each per-sync Syncer instance and not with the Sync.

Can NamespacedClient instances be closed and removed from their clientHost?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick note: Connections are managed by the underlying Roundtripper. NamespacedClients are a thin wrapper around an underlying RoundTripper that adds a prefix path to http requests. They don't change how connections are managed or cached.

This matters because with the structure of the httpsync code as it currently is, the sync is a long-lived object (created once) that contains the http.Client which is reused for separate Syncer requests. What will need to change is that a new NamespacedClient will need to be associated with each per-sync Syncer instance and not with the Sync.

Yep that makes sense to me. 👍

an NamespacedClient instances be closed and removed from their clientHost?

Not necessary to remove them from the clientHost (the client http host doesn't track them). You may optionally close the idle connections on them with client.CloseIdleConnections, but it's not necessary. The HTTP transport or stream host is responsible for managing/closing these connections. Clients are relatively cheap to make.

urls: nil,
sync: s,
}, nil
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A Syncer instance is created for each sync operation and is responsible for making a request to a specific advertisement publisher. This means that each instance of aSyncer requires the addresses to connect to since these will not be known in advance.

I think instead of NewSyncerWithoutAddrs here, that NewSyncer should be modified to create a new clientHost.NamespacedClient if its parent Sync is using a libp2phttp client. The NamespacedClient can then be used as the http.Client in Syncer functions.

Will it be necessary to close each NamespacedClient instance and remove it from the clientHost and remove its peer info from the peerstore? Will that happen automatically after some time?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of NewSyncerWithoutAddrs here, that NewSyncer should be modified to create a new clientHost.NamespacedClient if its parent Sync is using a libp2phttp client. The NamespacedClient can then be used as the http.Client in Syncer functions.

👍 . I thought about this change, but wanted to make a less invasive change to demo. I like it though.

Will it be necessary to close each NamespacedClient instance and remove it from the clientHost and remove its peer info from the peerstore? Will that happen automatically after some time?

Happens automatically :) (just like the stock http.Client does it)

gammazero added a commit that referenced this pull request Aug 9, 2023
This is another version of the libp2p + HTTP refactor in #102. The key difference is that a new `NamespacedClient`, having the publisher's address, is created for each sync within the per-sync Syncer.
gammazero added a commit that referenced this pull request Aug 9, 2023
This is another version of the libp2p + HTTP refactor in #102. The key difference is that a new `NamespacedClient`, having the publisher's address, is created for each sync within the per-sync Syncer.
@gammazero
Copy link
Collaborator

@MarcoPolo Please take a look at PR #103 as a modification of this PR, and see PR comments within code.

cc @masih

@MarcoPolo
Copy link
Author

Replying to @rvagg comments first:

It makes me ponder the possibilities for protocol probing, we do all this juggling in lassie to figure out how we can talk to peers, some of it depends on inspecting the multiaddr and some involves opening a stream and probing: https://github.com/filecoin-project/lassie/blob/main/pkg/retriever/directcandidatefinder.go

Long term, I think we'll push this data closer to the "provider record" equivalent. You should know what datatransfer protocols a peer supports (and in the case of HTTP, where they are mounted). I believe the IPNI metadata already has some support for this (extra metadata for the application protocol, which if it is an HTTP-semantic one can include the prefix path). That would save us a round trip and possible a wasted connection when we want to learn about a new peer. Note that this doesn't conflict with a .well-known/libp2p endpoint, it complements it.

Potentially, if the trustless gateway http protocol were to adopt this, we could do some versioning and make allowances for different root paths. We're already at the point of exploring a mix of OPTIONS, Accept headers, and params to do some of the work around different implementation functionality support. I can imagine this getting out of hand once we add more features to the protocol. I'm not sure what you have here would really solve any of this better, but maybe if we needed to do hard versioning breaks it would help. We're also stuck with the fixed /ipfs/ path for this stuff, which is unfortunate (IMO), being able to vary that would be nice.

I wrote a comment around OPTIONS here. In short, I think it makes sense to use OPTIONS to define application level features. If new, backward-incompatible, version is released it can be referred to by a different protocol id (e.g. /trustless-http-gateway/2) and the server can specify where that is mounted.

I think (I'm not super familiar here) that we could mount the existing trustless-http-gateway on a separate prefix path besides root /. Clients can learn about that prefix path either out of band or via .well-known/libp2p. I'm not sure if anything in the trustless gateway spec explicitly forbids this.

@gammazero
Copy link
Collaborator

This package is incorporated into the changes in #113. Closing.

@gammazero gammazero closed this Aug 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants