Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Jetty 11 adapter #447

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

alexandergunnarson
Copy link
Contributor

No description provided.

ptaoussanis and others added 11 commits September 5, 2023 11:22
Requires further review.

This is a minor but potentially BREAKING change.
Zero urgency, consider holding till next breaking set.

Before this commit:

  Client-side `[:chsk/ws-ping]` events are wrapped to
  `[:chsk/recv [:chsk/ws-ping]]` when `wrap-recv-evs?` is true.

After this commit:

  Client-side `[:chsk/ws-ping` events are NOT wrapped, even
  when `wrap-recv-evs?` is true.

MOTIVATION:

  While `:chsk/ping` is indeed pushed from server, `:chsk/recv` is
  in practice reserved for user-level events. Keeping it that way
  may be better (more convenient for users).
@ptaoussanis
Copy link
Member

Thanks so much for this Alex! I've got my hands a bit full atm with some other open source tasks, but will definitely get this reviewed+merged the next time I'm on batched Sente work.

Will be really nice to have this in 👍

@ptaoussanis ptaoussanis self-assigned this May 9, 2024
@alexandergunnarson
Copy link
Contributor Author

No worries! Thanks :)

@awb99
Copy link

awb99 commented Oct 8, 2024

@ptaoussanis any idea when this will be merged? Thanks!

@ptaoussanis
Copy link
Member

@awb99 Are you blocking on this? Would you need a full release, or just a merge?

@awb99
Copy link

awb99 commented Oct 8, 2024

I am stuck to a very old sente version that still has jetty support. I would like to update to a more uptodate version of sente that again supports jetty. I have two reasons: 1) old jetty needs websocket handlers to be marked in the config as such (which means my routing table needs to be done effectively twice, or at least my config becomes more convoluted). 2) old sente versions sometimes to not receive disconnected client notification, so I believe I am sometimes sending push event's to clients that are already gone. I could just use a git commit ID to refer to sente via deps.edn if this is what you refer to. I really would like to upgrade ... but I prefer to do it once it is clear the new code works. Thanks!

@ptaoussanis
Copy link
Member

Thanks, and just to clarify - would you need a full release, or just a review and merge of this PR?

@awb99
Copy link

awb99 commented Oct 8, 2024

If you just review and merge, then I can use it in deps.edn projects fine. And if it is helpful to you I could try it in the next 2-4 weeks on some of my apps and report here in case there are issues before you do a full release.

@ptaoussanis
Copy link
Member

ptaoussanis commented Oct 8, 2024

@awb99

And if it is helpful to you I could try it in the next 2-4 weeks on some of my apps and report here in case there are issues before you do a full release.

That'd be great, thank you! 🙏 I've just pushed v1.20.0-SNAPSHOT without fully reviewing or testing this PR except for adding missing dependencies, etc.

Please let me know if/when you've had an opportunity to test.

The reference example includes tools for testing both WebSocket and AJAX features.

@awb99
Copy link

awb99 commented Oct 12, 2024

Unfortunately I cannot add snapshot dependencies; most of my libraries are on clojars, and I cannot depend
on snapshot dependencies. I was looking for a commit somewhere, a branch or something that I could use?
But I cound not find any commits here. What do I overlook? Thanks

ptaoussanis added a commit that referenced this pull request Oct 12, 2024
@ptaoussanis
Copy link
Member

I've just added the relevant (temporary, experimental) commits to master 👍

@ptaoussanis
Copy link
Member

@awb99 Have you had any opportunity to test this adapter? I'm aiming to cut a new stable release shortly, and it'd be nice to have this in if possible.

ptaoussanis pushed a commit that referenced this pull request Oct 28, 2024
…narson)

Not officially tested, waiting on confirmation from users that this works as intended
@ptaoussanis
Copy link
Member

Actually, I'll just go ahead and merge this as v1.20 RC1. Hopefully someone can try the adapter out and confirm that it works as expected before v1.20 final - will keep the PR open till then.

@danielsz
Copy link
Collaborator

danielsz commented Oct 28, 2024

Heads up: upon trying the current release candidate in a project of mine, I immediately get a java.lang.ExceptionInInitializerError when starting. Probably a typo somewhere because it is caused by the Clojure compiler not understanding syntax.

@ptaoussanis
Copy link
Member

@daniels Thanks for the heads-up Daniel! Is there any other info you can share (ns, etc.)? It seems to be running fine on my side, but I might be missing something.

@danielsz
Copy link
Collaborator

I'm investigating, but I can share the following stack trace already.

Caused by: Syntax error macroexpanding at (taoensso/sente.cljc:1:1). at clojure.lang.Compiler.load(Compiler.java:8177) at clojure.lang.RT.loadResourceScript(RT.java:401) at clojure.lang.RT.loadResourceScript(RT.java:392) at clojure.lang.RT.load(RT.java:479) at clojure.lang.RT.load(RT.java:444) at clojure.core$load$fn__6931.invoke(core.clj:6189) at clojure.core$load.invokeStatic(core.clj:6188) at clojure.core$load.doInvoke(core.clj:6172) at clojure.lang.RestFn.invoke(RestFn.java:411) at clojure.core$load_one.invokeStatic(core.clj:5961) at clojure.core$load_one.invoke(core.clj:5956) at clojure.core$load_lib$fn__6873.invoke(core.clj:6003) at clojure.core$load_lib.invokeStatic(core.clj:6002) at clojure.core$load_lib.doInvoke(core.clj:5981) at clojure.lang.RestFn.applyTo(RestFn.java:145) at clojure.core$apply.invokeStatic(core.clj:669) at clojure.core$load_libs.invokeStatic(core.clj:6044) at clojure.core$load_libs.doInvoke(core.clj:6028) at clojure.lang.RestFn.applyTo(RestFn.java:140) at clojure.core$apply.invokeStatic(core.clj:669) at clojure.core$require.invokeStatic(core.clj:6066) at clojure.core$require.doInvoke(core.clj:6066) at clojure.lang.RestFn.invoke(RestFn.java:485) at system.components.sente$loading__6812__auto____12652.invoke(sente.cljc:1) at system.components.sente__init.load(Unknown Source) at system.components.sente__init.<clinit>(Unknown Source) ... 71 more

@danielsz
Copy link
Collaborator

OK, good news. What happened is that I changed the version in deps.edn, but I didn't think about the compiled classes in my target directory. In other words, I wasn't upgrading cleanly. I have now, and the problem went away. I will be happy to test the release candidate further.

@ptaoussanis
Copy link
Member

Okay, happy to hear that! Thanks for updating, and for the testing - much appreciated 🙏

@stefanroex
Copy link

Nice to see Jetty support added in RC1!

Gave it a spin, but I ran into an issue. It seems like the current implementation is tied to Jetty 11 due to org.eclipse.jetty.websocket.api.WebSocketListener, which is no longer available in Jetty 12. I was able to get it running with Jetty 12 by swapping that class with org.eclipse.jetty.websocket.common.WebSocketSession.

Would it make sense to dynamically import based on the Jetty version or create version-specific adapters? I couldn’t sort out the dynamic import part, so any ideas are welcome.

@ptaoussanis
Copy link
Member

@stefanroex Hi Stefan, thanks for pinging about this-

It'd definitely be nice to have the Jetty adapter support additional version/s if possible. I'm not familiar on the details of what'd need to change from version to version - but I can help with the conditional compilation (dynamic import) if you or someone else can provide an example of the relevant diffs needed between versions.

@stefanroex
Copy link

stefanroex commented Nov 1, 2024

@ptaoussanis Jetty 12’s working fine on my machine so far. Planning to roll it out gradually in production to see how it holds up. So maybe more changes are required. But for now the difference between jetty 11 and 12 is minimal. Instead of using org.eclipse.jetty.websocket.api.WebSocketListener here:

[org.eclipse.jetty.websocket.api WebSocketListener]

(extend-type WebSocketListener

It should import and extend-type for class org.eclipse.jetty.websocket.common.WebSocketSession.

@ptaoussanis
Copy link
Member

Thanks! That should be easy to do dynamically (or at least, at compile time). I'll wait on confirmation from you that everything looks good, then I'll add the change 👍

@stefanroex
Copy link

We'll start using websockets in production by the end of the year, though I’m not sure if you want to wait that long. Right now, it’s running smoothly in dev, staging, and production (behind a feature flag for a small user set).

@p-himik
Copy link
Contributor

p-himik commented Nov 11, 2024

AFAICT, the code doesn't handle timeouts properly. I.e. ajax-read! will return :ajax-resp-timeout by default, but then it'll be treated as a string which will result in a much less transparent ClassCastException.

@ptaoussanis
Copy link
Member

@p-himik Hi Eugene, please note that you'll need to look here for the code that was actually committed. I changed a few things while reviewing, including removing :ajax-resp-timeout-val.

Thanks for taking a look, it definitely helps to get extra eyes on this 👍

@p-himik
Copy link
Contributor

p-himik commented Nov 11, 2024

@ptaoussanis Right, but that only substitutes CCE with NPE. Perhaps a custom exception makes sense, or java.util.concurrent.TimeoutException or some other existing one.

On a separate topic - surprisingly, I can't quite get it to work with Jetty 12, even with the aforementioned change.
This line keeps getting triggered and at the moment I don't have enough background knowledge to understand why exactly it happens.

@ptaoussanis
Copy link
Member

ptaoussanis commented Nov 11, 2024

Unfortunately can't offer any advice since I've little familiarity with this myself.

Have just marked the issue - hopefully someone else (Alex / Stefan?) might be able to make a suggestion! Otherwise I'll try investigate next time I'm on batched Sente work.

Edit to add: re: CCE/NPE - would be happy to see a PR for the Jetty adapter (and presumably also Undertow?) for a clearer error, assuming that's the best we can do. Would also be nice to update the docstrings to make it clear what actually happens on timeout (exception is thrown, etc.).

@alexandergunnarson
Copy link
Contributor Author

alexandergunnarson commented Nov 11, 2024

@p-himik

AFAICT, the code doesn't handle timeouts properly. I.e. ajax-read! will return :ajax-resp-timeout by default, but then it'll be treated as a string which will result in a much less transparent ClassCastException.

Ah, so this is why there are some random Keyword to String cast complaints in our logs :D The stacktrace is suppressed by Sente IIRC, by an errorf.

@p-himik

On a separate topic - surprisingly, I can't quite get it to work with Jetty 12 ...

Yeah I'm not sure about interaction with Jetty 12. We're running ring/ring-jetty-adapter {:mvn/version "1.13.0"} in production, and that uses Jetty 11. Would need more testing/debugging to get to the bottom of what's happening to you.

@stefanroex
Copy link

@p-himik I’m only running with WebSockets, no Ajax fallback, on Jetty 12. I suspect the ClassCastException and WritePendingException are specific to the Ajax fallback. Unfortunately, that means I’m not much help here. But if you think this pops up with WebSockets, let me know. I’m happy to dig in and see if it shows up on our end!

@p-himik
Copy link
Contributor

p-himik commented Nov 11, 2024

I think I figured it out, PEBKAC is to blame for at least 80% of it:

  • One of my middlewares was throwing an exception specifically for WS responses
  • In my current logging setup, exceptions logged with clojure.tools.logging get printed into stdout, but exceptions logged with Timbre get printed into stderr. It seems that it sometimes result in the exceptions being reported out of order, so I was blaming the exception from Jetty even though it was my middleware
  • The exception returns a plain HTTP 500 response with the exception as the body, but Chrome DevTools Network panel for some reason doesn't print it at all and just reports "Status: Finished" with no payload or messages (had to use ngrep to actually see what's going on)
  • Sente doesn't know about the issue on the server side (after all, ajax-get-or-ws-handshake-fn was called and returned a proper result) and proceeds as usual. No clue what actually goes on in the internals of Jetty here in such a degenerate case, but in the end it results in that obscure WritePendingException

I'm not sure what I can improve here to make such errors more tractable. Perhaps close the Sente connection on the server side when there's an error downstream of Sente? @ptaoussanis But what would be a proper way to do it, given (= ev [:chsk/close]) ; Currently undocumented?

As a second-degree consequence, a different potential issue has manifested itself: after receiving HTTP 500 from the WS request, Sente on the client side keeps spamming the server with plain GET requests with no upgrade headers.

And due to a potential error in the adapter implementation, Sente on the server side keeps responding with HTTP 200 with an empty body, which after a delay (most likely, the one specified by ajax-resp-timeout-ms) results in a deluge of the aforementioned ClassCastException from Keyword to String.

@ptaoussanis
Copy link
Member

@p-himik Thanks for the update! [:chsk/close] will continue to be supported and should be documented, just haven't done so yet. Unfortunately can't comment offhand - would need to take a closer look, and might not have a chance for a while.

Is there any urgency from your side to get feedback from me on this? If so, can try make some time this weekend.

@p-himik
Copy link
Contributor

p-himik commented Nov 11, 2024

@ptaoussanis No urgency at all, and thanks for confirming that I can safely use [:chsk/close] - I just found that I've actually used it before in a different project.

@p-himik
Copy link
Contributor

p-himik commented Nov 12, 2024

@stefanroex Do you not experience timeouts? After fixing the initial issue, I'm getting lines like this in the logs, approximately once every 30 seconds:

[taoensso.sente:784] - [ws/on-error] Server sch error for u_2/c_5d8640/n_374957: org.eclipse.jetty.websocket.api.exceptions.WebSocketTimeoutException: Connection Idle Timeout

It's triggered by org.eclipse.jetty.websocket.core.WebSocketConnection#onIdleExpired, with the default timeout specified by org.eclipse.jetty.websocket.core.WebSocketConstants#DEFAULT_IDLE_TIMEOUT, which is exactly 30 seconds.

Even though Sente has a built-in ping-pong mechanism that by default is set to 25 seconds for the server side, the actual ping messages seem to have a longer delay - 28 seconds on the picture, that one time it took less than 30 seconds so I didn't get a timeout exception with reconnection:
image

@p-himik
Copy link
Contributor

p-himik commented Nov 12, 2024

@ptaoussanis Not urgent - maybe I misunderstand something but to me it seems that the documentation of the ws-kalive-ms option is inaccurate.

The current description in make-channel-socket-server!: "Ping to keep a WebSocket conn alive if no activity w/in given msecs." So it makes it sound as if when there's no activity within the specified time, a ping is immediately sent.

However, here you describe it differently:
"Check every ws-kalive-ms if there's been connection activity. Noop if yes. Otherwise attempt to send server->client ping." So, if at T0 there's a check, at T0+1ms there's a message, at T0+ws-kalive-ms another check - that check would result in noop, so only the check at T0+ws-kalive-ms*2 would result in a ping, effectively making the pings happen half as frequently as ws-kalive-ms specifies.

I haven't analyzed Sente's code enough to confirm that the above scenario is exactly what's going on, but at least it corresponds to what I see - after setting ws-kalive-ms to 10 s, I get pings every 20 s:
image

@ptaoussanis
Copy link
Member

ptaoussanis commented Nov 12, 2024

@p-himik Hi Eugene, thanks for checking on this-

It's late here and I'm about to sign off for the night - so I might also be missing something quite obvious ^^ - but it seems the two descriptions are compatible?

The actual implementation is (IIRC) as you linked here. There's a fixed-interval check every kalive msecs. If there was no activity in the previous window, then a ping will be fired.

So the kalive setting effectively sets the maximum time allowed between activity.

Let's say that kalive is 10s and there's NO other activity. I'd then expect a ping at t=10s, t=20s, t=30s, ...

So, if at T0 there's a check, at T0+1ms there's a message, at T0+ws-kalive-ms another check - that check would result in noop, so only the check at T0+ws-kalive-ms*2 would result in a ping, effectively making the pings happen half as frequently as ws-kalive-ms specifies.

So there should be opportunities to ping at T0+ws-kalive-ms, T0+ws-kalive-ms*2, T0+ws-kalive-ms*3, ...

Could it be that you're observing the result of pong's coming back and therefore registering as activity?

I.e.:

Period1 (t0->t1): no activity    => PING at t1
Period2 (t1->t2): pong received  => noop at t2
Period3 (t2->t3): no activity    => PING at t3
Period4 (t3->t4): pong received  => noop at t4

etc.

This'll show as a ping every second interval if there's a successful pong in between each.

Does that make sense? Could well be I'm confused, or could be there's some issue with the implementation - will check back in the morning.

I'll note that there does seem to be a potential ambiguity in "Ping to keep a WebSocket conn alive if no activity w/in given msecs" since it's not stated whether it's talking about a rolling or fixed window/period. I'll adjust to make that clearer.

@p-himik
Copy link
Contributor

p-himik commented Nov 12, 2024

Yeah, same, about to log off.

I think the behavior that I've observed and your description are perfectly compatible. It's just that the docs are vague enough to allow for different interpretations.

Your description operates with separate discrete time windows, and I was initially thinking in the context of a sliding window of sorts. Because that's how I've seen timeouts generally being implemented - actions reset a timer, and when the timer reaches 0, a timeout exception is triggered. So the longest possible period of inactivity is exactly the setting of the timer.
But with how ws-kalive-ms operates on non-intersecting periods, that's no longer the case, and the longest possible period of inactivity becomes ws-kalive-ms * 2 - dT where dT is the shortest time between a ping and a corresponding pong.

Alas, I can't quite think of an alternative description of the argument that's unambiguous and doesn't take a whole paragraph.

@ptaoussanis
Copy link
Member

ptaoussanis commented Nov 13, 2024

Your description operates with separate discrete time windows, and I was initially thinking in the context of a sliding window of sorts. Because that's how I've seen timeouts generally being implemented - actions reset a timer, and when the timer reaches 0, a timeout exception is triggered. So the longest possible period of inactivity is exactly the setting of the timer.

Makes sense 👍

But with how ws-kalive-ms operates on non-intersecting periods, that's no longer the case, and the longest possible period of inactivity becomes ws-kalive-ms * 2 - dT where dT is the shortest time between a ping and a corresponding pong.

Unless I'm missing something, I'm not sure this part is right though. I'd assert that the longest period of inactivity remains ws-kalive-ms even with the fixed (rather than sliding) window.

I'm intentionally considering a pong as activity on the connection - do you not? The purpose of ws-kalive-ms was to ensure that connections don't idle longer than the specified amount - and I do believe that that holds true?

A pong will still keep the connection open.

So if we have a 10s ws-kalive-ms, I assert (believe) that the connection won't idle for longer than 10 seconds.

In your example the pong is received very shortly after the ping - that's common, but not universal - especially on sketchier mobile connections, etc.

A sliding window would indeed trigger pings more frequently, but it's not immediately obvious to me that that would necessarily be preferable.

To clarify:

  • I'm definitely happy to detail specifics in the docstring!
  • I'm also happy to consider a different mechanism for the keep-alive if there's a reason to

Just want to make sure that I understand what the motivations are for pursuing this. Is there a specific reason you're concerned about the ping interval? (Context was given here)

@ptaoussanis
Copy link
Member

ptaoussanis commented Nov 13, 2024

Okay after re-reading my post, I finally think I understand the point that you've been making - sorry about the slow uptake ^^

The current behaviour is indeed unintentional, my assertion here was wrong:

I'd assert that the longest period of inactivity remains ws-kalive-ms even with the fixed (rather than sliding) window.

Your original example is accurate:

But with how ws-kalive-ms operates on non-intersecting periods, that's no longer the case, and the longest possible period of inactivity becomes ws-kalive-ms * 2 - dT where dT is the shortest time between a ping and a corresponding pong.

Would be happy to see a PR that switched to a sliding window (and documented that), otherwise I'll do that myself next time I'm on batched Sente work.

Still want to double check on context though - is there a specific reason you're concerned about the ping interval? (Context was given here, apologies for missing that)

@ptaoussanis
Copy link
Member

Created a separate issue for the keep-alive bug here 👍

Apologies for noise on this, just juggling too many things atm.

@stefanroex
Copy link

Thanks for clarifying how ws-kalive-ms works. We've run into timeouts as well and weren't entirely sure about its behavior. To be safe, we've reduced the value to ping every 10 seconds, keeping it well below Jetty's default 30-second timeout. We did add a ticket to our backlog to get a better grasp of its functionality.

We initially thought ws-kalive-ms acted like a sliding window: it would trigger activity exactly after ws-kalive-ms had passed since the last message. That behavior would make it easier to fine-tune Sente to match server timeouts.

Turns out I was seeing the exact same thing as @p-himik but hadn't started debugging it yet. Thanks @p-himik and @ptaoussanis for clarifying!

@awb99
Copy link

awb99 commented Nov 25, 2024

I got it to work with the official ring version that uses jetty 11.
The last working release of jetty adapter used the jetty adapter from sunng87.
Am I correct that we now use official ring version and no longer sunng87 version?
I ask because it seems that ring has now updated at least to jetty 11, but
jetty 12 seems to be only in sunng87.
Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants