Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encountering ERR_CONNECTION_Closed to serve bigger file over https #3183

Closed
Somritag opened this issue Jul 11, 2024 · 14 comments
Closed

Encountering ERR_CONNECTION_Closed to serve bigger file over https #3183

Somritag opened this issue Jul 11, 2024 · 14 comments

Comments

@Somritag
Copy link

Somritag commented Jul 11, 2024

Hello Andy,
I am referring libwebsocket minimal example for https server, trying to open leaf.jpg (appx 7 mb)
For SSL i am using Mbedtls

I see its always getting time out to serve this file or any file of large size, while the smaller files get served.
we always hit here :
/* 1: something requested a callback when it was OK to write */

if ((pollfd->revents & LWS_POLLOUT) &&
    lwsi_state_can_handle_POLLOUT(wsi) &&
    lws_handle_POLLOUT_event(wsi, pollfd)) {
	if (lwsi_state(wsi) == LRS_RETURNED_CLOSE)
		lwsi_set_state(wsi, LRS_FLUSHING_BEFORE_CLOSE);
	/* the write failed... it's had it */
	wsi->socket_is_permanently_unusable = 1;

	**return LWS_HPI_RET_PLEASE_CLOSE_ME;**
}

And the __lws_sul_service_ripe() -> function
pt->inside_lws_service = 1;
hit->cb(hit); // here we are hitting lws_sul_wsitimeout_cb()
pt->inside_lws_service = 0;

Attaching the log:
image

@lws-team
Copy link
Member

... could you perhaps spend more than two words in the title to describe your problem?

@Somritag Somritag changed the title Encountering ERR_CONNECTION Encountering ERR_CONNECTION_Closed to serve bigger file over https Jul 11, 2024
@Somritag
Copy link
Author

Somritag commented Jul 11, 2024

Updated my issue, that was a hasty mistake.

In summary Https-> server (tried with minimal http server tls example code with a bigger file to load)-> Getting time out every time while serving
Http->server (tried with minimal http server example code with a bigger file to load)-> Does not get time out and files get served successfully
Post opening a WebSocket -> trying to load bigger file over https-> Server-> Able to serve the files

Is there any limitations in case of only https server?

@lws-team
Copy link
Member

Is stuff transferring over tls? Eg, if you look with tcpdump, how far does it get?

What do the logs say in the failing case? What is the platform?

@Somritag
Copy link
Author

Somritag commented Jul 11, 2024

Log:
https-fail.txt

After some fragments it times out -> The Time is not fixed mostly around

We want to host a webserver on our stm32H5 controller
Using libwebsockets for that purpose
FreeRtos + LWIP + MbedTLS

Here we are unable to serve a single file.
image

@lws-team
Copy link
Member

Does this help?

#3165

@Somritag
Copy link
Author

Hi Andy this did not fix my issue, I am attaching the log

https_serve_fail_log.txt

@lws-team
Copy link
Member

It doesn't seem to show why it wants to close when it does. Without a timestamp on the logs, it makes it impossible to understand what is happening close in time, or after some delay.

closing [wsisrv|1|adopted|h1|h2]: parent [null wsi]
_realloc: size 49: lws_metrics_hist_bump_ (free heap 560831)
 [mux|0|h2_sid1_(wsisrv|1|adopted)|default|GET|/], his parent [wsisrv|1|adopted|h1|h2]: siblings:
closing [mux|0|h2_sid1_(wsisrv|1|adopted)|default|GET|/]: parent [wsisrv|1|adopted|h1|h2]

For a timeout there should be a TIMEOUT log before, for other problems they also should log. The question is how did we end up at closing [wsisrv|1|adopted|h1|h2]: parent [null wsi]

@Somritag
Copy link
Author

Somritag commented Jul 12, 2024

https_serve_fail_leaf.txt
https_file_size_10mb_error.txt

Attached the logs with probable time stamp :
Instance 1 :
closing [mux|0|h2_sid1_(wsisrv|1|adopted)|default|GET|/]: parent [wsisrv|1|adopted|h1|h2]
disentangled from sibling [null wsi]
[1970/01/01 00:00:32:0000] I:
lws_close_free_wsi: shutdown conn: [mux|0|h2_sid1(wsisrv|1|adopted)|default|GET|/] (sk -1, state 0x15)
real just_kill_connection: sockfd -1
fd 5 events 517 -> 517
lwsi_set_state 0x20000015 -> 0x20000020
cce=1
[1970/01/01 00:00:33:0000] I:
rops_destroy_role_h2: [mux|0|h2_sid1
(wsisrv|1|adopted)|default|GET|/]: ah det due to close
[1970/01/01 00:00:33:0000] I:
_lws_header_table_detach: [mux|0|h2_sid1(wsisrv|1|adopted)|default|GET|/]: ah 20053de0 (tsi=0, count = 2)
[1970/01/01 00:00:33:0000] D:
_lws_header_table_detach: [mux|0|h2_sid1(wsisrv|1|adopted)|default|GET|/]: ah held 1720785130s, role/state 0x20000000 0x20,
[1970/01/01 00:00:33:0000] I:
__lws_header_table_detach: nobody usable waiting
[1970/01/01 00:00:32:0000] I:
_lws_destroy_ah: freed ah 20053de0 : pool length 1
[1970/01/01 00:00:32:0000] I:
_lws_header_table_detach: [mux|0|h2_sid1(wsisrv|1|adopted)|default|GET|/]: ah 20053de0 (tsi=0, count = 1)
vh default: count_bound_wsi 2
tsi fds count 2
fd 6 events 517 -> 517
TIMEDOUT WAITING 15, dhdr 0, ah 20052628, wl 0

Instance 2:
lws_serve_http_file_fragment: sending 4087
15 secs, reason 10
[1970/01/01 00:00:38:0000] D:
lws_h2_frame_write: [mux|2|h2_sid1_(wsisrv|5|adopted)|default|GET|/] (eff [wsisrv|5|adopted|h1|h2]). typ 0, fl 0x0, sid=1, len=4087, txcr=5890930, nwsi->txcr=15328114
ssl_capable_write (4096) says 4096
[1970/01/01 00:00:38:0000] I:
lws_h2_tx_cr_get: [mux|2|h2_sid1_(wsisrv|5|adopted)|default|GET|/]: own tx credit 5886843: nwsi credit 15324027
[1970/01/01 00:00:38:0000] D:
lws_serve_http_file_fragment: sending 4087
15 secs, reason 10
[1970/01/01 00:00:38:0000] D:
lws_h2_frame_write: [mux|2|h2_sid1_(wsisrv|5|adopted)|default|GET|/] (eff [wsisrv|5|adopted|h1|h2]). typ 0, fl 0x0, sid=1, len=4087, txcr=5886843, nwsi->txcr=15324027
ssl_capable_write (4096) says 4096
[1970/01/01 00:00:38:0000] I:
lws_h2_tx_cr_get: [mux|2|h2_sid1_(wsisrv|5|adopted)|default|GET|/]: own tx credit 5882756: nwsi credit 15319940
[1970/01/01 00:00:38:0000] D:
lws_serve_http_file_fragment: sending 4087
15 secs, reason 10
[1970/01/01 00:00:38:0000] D:
lws_h2_frame_write: [mux|2|h2_sid1_(wsisrv|5|adopted)|default|GET|/] (eff [wsisrv|5|adopted|h1|h2]). typ 0, fl 0x0, sid=1, len=4087, txcr=5882756, nwsi->txcr=15319940
ssl_capable_write (4096) says 4096
[1970/01/01 00:00:38:0000] I:
lws_h2_tx_cr_get: [mux|2|h2_sid1_(wsisrv|5|adopted)|default|GET|/]: own tx credit 5878669: nwsi credit 15315853
[1970/01/01 00:00:38:0000] D:
lws_serve_http_file_fragment: sending 4087
15 secs, reason 10
[1970/01/01 00:00:38:0000] D:
lws_h2_frame_write: [mux|2|h2_sid1_(wsisrv|5|adopted)|default|GET|/] (eff [wsisrv|5|adopted|h1|h2]). typ 0, fl 0x0, sid=1, len=4087, txcr=5878669, nwsi->txcr=15315853
ssl_capable_write (4096) says 4096
sid 1, pending writable
sid 0, pending writable
fd 6 events 517 -> 519
[1970/01/01 00:00:38:0000] D:
lws_serve_http_file_fragment says 0
[1970/01/01 00:00:38:0000] D:
already pending writable
sid 1, pending writable
sid 0, pending writable
fd 6 events 519 -> 519
fd 6 events 519 -> 519
TIMEDOUT WAITING 15, dhdr 0, ah 20052660, wl 0

@lws-team
Copy link
Member

Not sure why this log is different and mentions TIMEDOUT, it seems to be sending up to the end. Every time it sends, it resets the timeout for 15s.

Time seems to go backwards 33 -> 32s on the https_serve_fail_leaf.txt is this because multiple things are pasted in the log?

If the time isn't stable, timeouts won't act right. But not sure that's what happened (ntp happened partway through?)

It says "lws_ssl_capable_write failed: 5" near the end, this seems to imply it was the client side that timed out and closed the tls connection, the server would close the tls connection as part of its general close handling later. It didn't stop because of a server side timeout. So I don't know what the timeout is at the end, the client's timeout?

@Somritag
Copy link
Author

https_file_serve_sucess_case.txt

Hi Andy ,
Attached a log where this time it served the file, it happens once in a while , one difference with failed case

TIMEDOUT WAITING 15, dhdr 0, ah 20052630, wl 0

happened in between (line no 3143) but

after closing it reconnects
caller: timeout
real just_kill_connection A: (sockfd 6)
[1970/01/01 00:00:45:0000] I:
[wsisrv|0|adopted|h1|h2], his parent [null wsi]: siblings:
[1970/01/01 00:00:45:0000] I:
closing [wsisrv|0|adopted|h1|h2]: parent [null wsi]
closing child
caller: mux child recurse
[1970/01/01 00:00:45:0000] D:
realloc: size 49: lws_metrics_hist_bump (free heap 567895)
real just_kill_connection A: (sockfd -1)
[1970/01/01 00:00:45:0000] I:
[mux|0|h2_sid3_(wsisrv|0|adopted)|default|GET|/], his parent [wsisrv|0|adopted|h1|h2]: siblings:
[1970/01/01 00:00:45:0000] I:
closing [mux|0|h2_sid3_(wsisrv|0|adopted)|default|GET|/]: parent [wsisrv|0|adopted|h1|h2]
disentangled from sibling [null wsi]
real just_kill_connection: sockfd -1
fd 5 events 1 -> 517
lwsi_set_state 0x20000015 -> 0x20000020
cce=1
[1970/01/01 00:00:45:0000] I:
rops_destroy_role_h2: [mux|0|h2_sid3_(wsisrv|0|adopted)|default|GET|/]: ah det due to close
[1970/01/01 00:00:45:0000] I:
_lws_header_table_detach: [mux|0|h2_sid3(wsisrv|0|adopted)|default|GET|/]: ah 20053de8 (tsi=0, count = 2)
[1970/01/01 00:00:45:0000] D:
_lws_header_table_detach: [mux|0|h2_sid3(wsisrv|0|adopted)|default|GET|/]: ah held 19s, role/state 0x20000000 0x20,
[1970/01/01 00:00:45:0000] I:
__lws_header_table_detach: nobody usable waiting
[1970/01/01 00:00:45:0000] I:
_lws_destroy_ah: freed ah 20053de8 : pool length 1
[1970/01/01 00:00:45:0000] I:
_lws_header_table_detach: [mux|0|h2_sid3(wsisrv|0|adopted)|default|GET|/]: ah 20053de8 (tsi=0, count = 1)
vh default: count_bound_wsi 2
tsi fds count 2
[1970/01/01 00:00:45:0000] D:
SSL_shutdown=1 for fd 6
real just_kill_connection: sockfd 6
fd 5 events 517 -> 517
lwsi_set_state 0x20000119 -> 0x20000020
cce=1
[1970/01/01 00:00:45:0000] I:
lws_tls_restrict_return: 1 -> 0
[1970/01/01 00:00:45:0000] N:
lws_gate_accepts: on = 0
[1970/01/01 00:00:45:0000] I:
rops_destroy_role_h2: [wsisrv|0|adopted|h1|h2]: ah det due to close
[1970/01/01 00:00:45:0000] I:
__lws_header_table_detach: [wsisrv|0|adopted|h1|h2]: ah 20052630 (tsi=0, count = 1)
[1970/01/01 00:00:45:0000] D:
__lws_header_table_detach: [wsisrv|0|adopted|h1|h2]: ah held 19s, role/state 0x20000000 0x20,
[1970/01/01 00:00:45:0000] I:
__lws_header_table_detach: nobody usable waiting
[1970/01/01 00:00:45:0000] I:
_lws_destroy_ah: freed ah 20052630 : pool length 0
[1970/01/01 00:00:45:0000] I:
__lws_header_table_detach: [wsisrv|0|adopted|h1|h2]: ah 20052630 (tsi=0, count = 0)
vh default: count_bound_wsi 1
tsi fds count 1
[1970/01/01 00:00:45:0000] D:
accepted new conn port 51111 on fd=6
1 29
[1970/01/01 00:00:45:0000] D:
_realloc: size 1192: __lws_wsi_create_with_role (free heap 566752)...

And continues ..

But in other failed cases once it losses connection does not revive.

This try was in mozilla though i dont think so its browser dependent .

@Somritag
Copy link
Author

One more observation
defining
info.protocols with lws_callback_http vs while we define "http-only" while defining mount
shows different behavior.
http-only by default points to lws_callback_http_dummy() i suppose does this play role here?

image

@lws-team
Copy link
Member

The mystery retry can be the browser deciding to have another go because from their pov, it failed partway through (or timed out). It's worth looking at Ctrl-Shift-C on firefox and the network tab to see what it thinks happens on each stream.

All the protocol callbacks should call through to the dummy callback for stuff they didn't handle themselves. I dunno what "different behaviour" means.

@Somritag
Copy link
Author

Somritag commented Jul 15, 2024

image

I see when we try https-> client is issuing a GOAWAY after a FIN set, and then a RST is happening. And No retry (close connection)

But post opening a socket querying the same same page leads to a success

image

Is there any difference the way the https request getting served while socket is open vs socket is not open ?

At times I see retry automatically happens, is there a way to retry in such scenarios?

@Somritag
Copy link
Author

Hi Andy, took some time. Indeed, it was a timing out due to improperly setting the lws_now_usecs() and lws_now_secs().

HTTPS server file querry is stable now.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants