libp2p + HTTP an exploratory refactor #102

MarcoPolo · 2023-08-04T01:46:05Z

Hi folks!

I'm play-testing the new libp2p+HTTP api libp2p/go-libp2p#2438 so I wanted to try using it somewhere. I think the dagsync code here would benefit from libp2p+HTTP because:

It means you can delete all of dagsync/dtsync eventually (after clients upgrade).
It consolidates the http and libp2p stream logic.
You no longer have to manage the prefix logic. That is built into libp2p+HTTP.
a. This means your client and server are also simpler. You can toss all the prefix logic into the trash!

I've added a test case that shows a simple head fetch and sync over HTTP both on top of a native HTTP transport and a libp2p stream.

I'd love feedback on what you think about this API.

cc @masih

MarcoPolo · 2023-08-04T01:48:17Z

dagsync/httpsync/publisher_test.go

+	publisherHost, err := libp2phttp.New(
+		libp2phttp.StreamHost(publisherStreamHost),
+		libp2phttp.ListenAddrs([]multiaddr.Multiaddr{multiaddr.StringCast("/ip4/127.0.0.1/tcp/0/http")}),
+	)
+	req.NoError(err)
+
+	go publisherHost.Serve()
+	defer publisherHost.Close()


This is the "HTTP Host". It's like the libp2p "stream host" (aka core host.Host), but it uses HTTP semantics instead of stream semantics.

You can pass in options on creation like a stream host to do HTTP over libp2p streams, and multiaddrs to create listeners on.

MarcoPolo · 2023-08-04T01:49:47Z

dagsync/httpsync/publisher_test.go

+	serverHTTPMa := publisherHost.Addrs()[1]
+	req.Contains(serverHTTPMa.String(), "/http")
+
+	publisherHost.SetHttpHandlerAtPath(protoID, "/ipni/", http.StripPrefix("/ipni/", publisher))


Here is where we attach our request handler. Note that we are mounting the "/ipni-sync/1" protocol at /ipni/. libp2phttp manages this mapping and clients can learn about the mapping at .well-known/libp2p.

In this case we also want out HTTP handler to not even know about the prefix, so we use the stdlib http.StripPrefix.

MarcoPolo · 2023-08-04T01:50:44Z

dagsync/httpsync/publisher_test.go

+			clientHost, err := libp2phttp.New()
+			req.NoError(err)
+
+			c, err := clientHost.NamespacedClient(nil, protoID, peer.AddrInfo{Addrs: []multiaddr.Multiaddr{serverHTTPMa}})


This creates an http.Client that automatically knows about the prefix for a given protocol ID.

MarcoPolo · 2023-08-04T01:51:43Z

dagsync/httpsync/publisher_test.go

+			clientHost, err := libp2phttp.New()
+			req.NoError(err)
+
+			c, err := clientHost.NamespacedClient(clientStreamHost, protoID, peer.AddrInfo{ID: publisherStreamHost.ID(), Addrs: []multiaddr.Multiaddr{serverStreamMa}})


This is the same thing as above but it is going over a libp2p stream.

Note: I think I'll change the NamespacedClient api to not accept a stream host here, and instead use the one passed in as an option to libp2phttp.New.

MarcoPolo · 2023-08-04T01:52:28Z

dagsync/httpsync/publisher_test.go

+	}
+
+	for _, tc := range testCases {
+		t.Run(tc.name, func(t *testing.T) {


Nothing special here. Just plumbing to set up the test.

masih

Nice! LGTM

rvagg · 2023-08-07T03:15:37Z

Neat, but it seems to me that the main trade-off here is that we have to rely on libp2p discovery semantics to figure out how to communicate? a client has to be /.well-known/libp2p aware in order to figure out the path to communicate on to talk in /ipni-sync/1. I guess that means there's a secondary trade-off of an additional round-trip to start actually talking about ipni stuff?

I really like that the code is simpler in here. But unfortunately it's also only simpler if we get to remove the other form of this - but if we do that, then users on both ends need to opt-in to the libp2p version of talking on http. Either using this library, or writing it themselves. One nice thing about the current incarnation is that HTTP is simple, and writing the server (and client) is fairly straightforward, with all of the complexity being in the ipni protocol itself. This would add an additional layer of complexity.

MarcoPolo · 2023-08-07T22:59:24Z

Thanks for taking a look!

we have to rely on libp2p discovery semantics to figure out how to communicate

This is just one way to discover a peer's protocol mapping. IPNI decided on encoding this mapping in the multiaddr (which is fine), but I'd prefer not to have every application have reimplement this. I think it makes sense to have this mapping at the server metadata layer rather than defined per application/protocol. It allows the protocols to be agnostic to their path prefix, gives peers a standard way they can learn about each other, and it's an HTTP native resource.

That doesn't mean this is the only solution to create a peer's protocol mapping. IPNI can continue encoding the mapping in the multiaddr, and give that information to the HTTP host. That's not supported right now, but I think that's a great addition to the API (edit: just pushed this change).

users on both ends need to opt-in to the libp2p version of talking on http. Either using this library, or writing it themselves.

I don't think this is a "libp2p version of talking on http". A goal here is to interoperate with existing systems. This is all plain old HTTP. Yes, we're defining the .well-known endpoint to have some relatively easy way (but not free) to learn about a peer given. But if you already have this information about the peer this isn't required as mentioned above.

If this .well-known endpoint is useful for an application, implementing it is trivial and doesn't require a library.

Server:

mux.HandleFunc("/.well-known/libp2p", func(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintln(w, `{"protocols":{"/ipni-sync/1":{"path":"/custom-prefix"}}}`)
})

Client:

const prefix = await fetch("<remote>/.well-known/libp2p")
  .then(resp => resp.json())
  .then(mapping => mapping.protocols["/ipni-sync/1"].path)

Alright, so you may ask "if it's so easy what's the point of libp2p+HTTP then?" A couple of reasons, but for IPNI, the biggest benefit is that it lets you delete dagsync/dtsync and consolidate your http + libp2p stream logic. Rather than maintaining two different stacks that do the same thing, you build on top of HTTP primitives and that works on the many underlying transports (thanks to libp2p). And again, this doesn't introduce a libp2p flavored HTTP. It's just HTTP.

This API is all subject to change, so if you see something that could improve here I'm all ears!

rvagg · 2023-08-08T04:27:37Z

Overall seems like a positive API addition to libp2p http, it does weaken the argument for httpath in the addr and brings it more inline with the other protocols.

It makes me ponder the possibilities for protocol probing, we do all this juggling in lassie to figure out how we can talk to peers, some of it depends on inspecting the multiaddr and some involves opening a stream and probing: https://github.com/filecoin-project/lassie/blob/main/pkg/retriever/directcandidatefinder.go

Potentially, if the trustless gateway http protocol were to adopt this, we could do some versioning and make allowances for different root paths. We're already at the point of exploring a mix of OPTIONS, Accept headers, and params to do some of the work around different implementation functionality support. I can imagine this getting out of hand once we add more features to the protocol. I'm not sure what you have here would really solve any of this better, but maybe if we needed to do hard versioning breaks it would help. We're also stuck with the fixed /ipfs/ path for this stuff, which is unfortunate (IMO), being able to vary that would be nice.

gammazero

Thank you so much for writing this. This is very helpful for understanding how this can be used to make libp2p work easily as a transport for HTTP. Thank you!

I have some questions concerning the best way to adapt this for use in IPNI http-sync in the PR review.

gammazero · 2023-08-09T00:06:36Z

dagsync/httpsync/publisher_test.go

+			clientHost, err := libp2phttp.New()
+			req.NoError(err)
+
+			c, err := clientHost.NamespacedClient(protoID, peer.AddrInfo{Addrs: []multiaddr.Multiaddr{serverHTTPMa}})


This appears to mean that a separate NamespacedClient is needed for each separate server that a request is sent to. This is somewhat different than the semantics of the standard http.Client where the same client instance can be used to send requests to servers at different URLs. Is this correct?

This matters because with the structure of the httpsync code as it currently is, the sync is a long-lived object (created once) that contains the http.Client which is reused for separate Syncer requests. What will need to change is that a new NamespacedClient will need to be associated with each per-sync Syncer instance and not with the Sync.

Can NamespacedClient instances be closed and removed from their clientHost?

Just a quick note: Connections are managed by the underlying Roundtripper. NamespacedClients are a thin wrapper around an underlying RoundTripper that adds a prefix path to http requests. They don't change how connections are managed or cached.

This matters because with the structure of the httpsync code as it currently is, the sync is a long-lived object (created once) that contains the http.Client which is reused for separate Syncer requests. What will need to change is that a new NamespacedClient will need to be associated with each per-sync Syncer instance and not with the Sync.

Yep that makes sense to me. 👍

an NamespacedClient instances be closed and removed from their clientHost?

Not necessary to remove them from the clientHost (the client http host doesn't track them). You may optionally close the idle connections on them with client.CloseIdleConnections, but it's not necessary. The HTTP transport or stream host is responsible for managing/closing these connections. Clients are relatively cheap to make.

gammazero · 2023-08-09T00:27:10Z

dagsync/httpsync/sync.go

+		urls:    nil,
+		sync:    s,
+	}, nil
+}


A Syncer instance is created for each sync operation and is responsible for making a request to a specific advertisement publisher. This means that each instance of aSyncer requires the addresses to connect to since these will not be known in advance.

I think instead of NewSyncerWithoutAddrs here, that NewSyncer should be modified to create a new clientHost.NamespacedClient if its parent Sync is using a libp2phttp client. The NamespacedClient can then be used as the http.Client in Syncer functions.

Will it be necessary to close each NamespacedClient instance and remove it from the clientHost and remove its peer info from the peerstore? Will that happen automatically after some time?

I think instead of NewSyncerWithoutAddrs here, that NewSyncer should be modified to create a new clientHost.NamespacedClient if its parent Sync is using a libp2phttp client. The NamespacedClient can then be used as the http.Client in Syncer functions.

👍 . I thought about this change, but wanted to make a less invasive change to demo. I like it though.

Will it be necessary to close each NamespacedClient instance and remove it from the clientHost and remove its peer info from the peerstore? Will that happen automatically after some time?

Happens automatically :) (just like the stock http.Client does it)

This is another version of the libp2p + HTTP refactor in #102. The key difference is that a new `NamespacedClient`, having the publisher's address, is created for each sync within the per-sync Syncer.

gammazero · 2023-08-09T08:15:11Z

@MarcoPolo Please take a look at PR #103 as a modification of this PR, and see PR comments within code.

cc @masih

MarcoPolo · 2023-08-09T18:19:08Z

Replying to @rvagg comments first:

It makes me ponder the possibilities for protocol probing, we do all this juggling in lassie to figure out how we can talk to peers, some of it depends on inspecting the multiaddr and some involves opening a stream and probing: https://github.com/filecoin-project/lassie/blob/main/pkg/retriever/directcandidatefinder.go

Long term, I think we'll push this data closer to the "provider record" equivalent. You should know what datatransfer protocols a peer supports (and in the case of HTTP, where they are mounted). I believe the IPNI metadata already has some support for this (extra metadata for the application protocol, which if it is an HTTP-semantic one can include the prefix path). That would save us a round trip and possible a wasted connection when we want to learn about a new peer. Note that this doesn't conflict with a .well-known/libp2p endpoint, it complements it.

Potentially, if the trustless gateway http protocol were to adopt this, we could do some versioning and make allowances for different root paths. We're already at the point of exploring a mix of OPTIONS, Accept headers, and params to do some of the work around different implementation functionality support. I can imagine this getting out of hand once we add more features to the protocol. I'm not sure what you have here would really solve any of this better, but maybe if we needed to do hard versioning breaks it would help. We're also stuck with the fixed /ipfs/ path for this stuff, which is unfortunate (IMO), being able to vary that would be nice.

I wrote a comment around OPTIONS here. In short, I think it makes sense to use OPTIONS to define application level features. If new, backward-incompatible, version is released it can be referred to by a different protocol id (e.g. /trustless-http-gateway/2) and the server can specify where that is mounted.

I think (I'm not super familiar here) that we could mount the existing trustless-http-gateway on a separate prefix path besides root /. Clients can learn about that prefix path either out of band or via .well-known/libp2p. I'm not sure if anything in the trustless gateway spec explicitly forbids this.

gammazero · 2023-08-31T21:01:38Z

This package is incorporated into the changes in #113. Closing.

MarcoPolo added 2 commits August 3, 2023 18:37

First draft

1caa0de

Use non local go-libp2p version

0f02c80

MarcoPolo commented Aug 4, 2023

View reviewed changes

MarcoPolo mentioned this pull request Aug 4, 2023

feat: Implement HTTP spec libp2p/go-libp2p#2438

Merged

masih requested review from gammazero and rvagg August 4, 2023 08:47

masih reviewed Aug 4, 2023

View reviewed changes

Remove streamHost parameter. Use option

e431ed0

gammazero changed the base branch from main to libp2phttp August 8, 2023 23:05

gammazero reviewed Aug 9, 2023

View reviewed changes

gammazero mentioned this pull request Aug 9, 2023

Exploratory refactor-2 of libp2p + HTTP #103

Closed

gammazero closed this Aug 31, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

libp2p + HTTP an exploratory refactor #102

libp2p + HTTP an exploratory refactor #102

MarcoPolo commented Aug 4, 2023

MarcoPolo Aug 4, 2023

MarcoPolo Aug 4, 2023

MarcoPolo Aug 4, 2023

MarcoPolo Aug 4, 2023

MarcoPolo Aug 4, 2023

masih left a comment

rvagg commented Aug 7, 2023

MarcoPolo commented Aug 7, 2023

rvagg commented Aug 8, 2023

gammazero left a comment

gammazero Aug 9, 2023

MarcoPolo Aug 9, 2023

gammazero Aug 9, 2023

MarcoPolo Aug 9, 2023

gammazero commented Aug 9, 2023

MarcoPolo commented Aug 9, 2023

gammazero commented Aug 31, 2023

libp2p + HTTP an exploratory refactor #102

libp2p + HTTP an exploratory refactor #102

Conversation

MarcoPolo commented Aug 4, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

masih left a comment

Choose a reason for hiding this comment

rvagg commented Aug 7, 2023

MarcoPolo commented Aug 7, 2023

rvagg commented Aug 8, 2023

gammazero left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gammazero commented Aug 9, 2023

MarcoPolo commented Aug 9, 2023

gammazero commented Aug 31, 2023