fallocate is interrupted by signal at startup #1368

Open
chrhong opened this issue Sep 8, 2021 · 8 comments

@chrhong

chrhong commented Sep 8, 2021

A pool creation failure was detected in our system. The error shows that the fallocate system call was interrupted:
"odp_ishm.c:707:create_file():Huge page memory allocation failed: fd=582, file=/dev/hugepages/0/odp-16-ishm-pool_008_pkt-rx:7-0, err="Interrupted system call""

Would it be better to retry the system call after getting this error?
Why the signal is raised is not yet known...

@chrhong
Author

chrhong commented Sep 8, 2021

@MatiasElo Do you have any comments on this?

@MatiasElo
Collaborator

Hmm, this is the first time I've seen this failure. Does this happen constantly or was it a random occurrence? Also, what was the return code of fallocate() and the size of allocated shm block?

@chrhong
Author

chrhong commented Sep 8, 2021

The error occurs easily in a k8s environment, with about a 10% recurrence rate. I think the fallocate return code is EINTR (Interrupted system call). The size is around 4M.

@MatiasElo
Collaborator

Thanks for the info. Looks like a good solution would be to add a number of retries if EINTR is received.
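
A bounded retry on EINTR could look roughly like the sketch below. It is only an illustration, not the actual ODP change: the helper name `fallocate_retry` and the limit `MAX_EINTR_RETRIES` are hypothetical, and the retry count is an arbitrary assumption.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/* Hypothetical retry limit; the value is an assumption, not from ODP. */
#define MAX_EINTR_RETRIES 10

/* Call fallocate() and retry only when it fails with EINTR.
 * Returns 0 on success, or -1 with errno set by the last attempt. */
static int fallocate_retry(int fd, int mode, off_t offset, off_t len)
{
	int ret;
	int attempts = 0;

	do {
		ret = fallocate(fd, mode, offset, len);
	} while (ret == -1 && errno == EINTR &&
		 ++attempts <= MAX_EINTR_RETRIES);

	return ret;
}
```

Any error other than EINTR is returned immediately, so a genuine failure such as ENOSPC is still reported to the caller on the first attempt.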

@MatiasElo
Collaborator

Does this change fix the issue you are seeing?

@chrhong
Author

chrhong commented Sep 10, 2021

Strange, the issue is not reproduced after I recompile... I will update later.

@chrhong
Author

chrhong commented Sep 16, 2021

Update:

  1. When I recompile ODP and copy the new libs into my Docker container, the issue cannot be reproduced even after hundreds of restarts.
  2. When I do not update ODP, the issue occurs easily. The most important thing is that nothing related to startup changed between the new and old ODP libs.

Matias, do you know any method to trace which signal interrupts the system call, and why? I want to dig into why the call is only interrupted with the older libs.
I used Linux strace to trace my process, but didn't see any signal in my process...
mkdir("/dev/hugepages/0", 0744)         = -1 EEXIST (File exists)
open("/dev/hugepages/0/odp-48-ishm-far_pool", O_RDWR|O_CREAT|O_TRUNC, 0644) = 602
fallocate(602, 0, 0, 618659840)         = -1 EINTR (**Interrupted system call**)
write(2, "odp_ishm.c:707:create_file():Hug"..., 151) = 151
close(602)                              = 0
unlink("/dev/hugepages/0/odp-48-ishm-far_pool") = 0
write(2, "odp_ishm.c:1168:_odp_ishm_reserv"..., 112) = 112
mkdir("/dev/shm/0", 0744)               = -1 EEXIST (File exists)
open("/dev/shm/0/odp-48-ishm-far_pool", O_RDWR|O_CREAT|O_TRUNC, 0644) = 602
fallocate(602, 0, 0, 618139648)         = -1 ENOSPC (No space left on device)
write(2, "odp_ishm.c:707:create_file():Nor"..., 147) = 147
close(602)                              = 0
unlink("/dev/shm/0/odp-48-ishm-far_pool") = 0

The other issue, similar to this one, is that I sometimes hit a SIGSEGV in DPDK code called from odp_pktio_start() at startup.
Since the pktio handle is created by odp_pktio_open(), I do not think this is an application code issue.
I wonder if this is related to my environment initialization. Do you have an environment initialization example?
Currently, we only create hugepages and load the PMD for DPDK.

Thanks.

@MatiasElo
Collaborator

Hmm, I haven't had to trace signals before, so unfortunately I cannot help much. Usually I just isolate the data plane cores and redirect all signals to a set of control cores.

One thing that stands out in your log is the No space left on device error. Perhaps you are running out of space in /dev/shm. In the ODP CI Docker images we set --shm-size 8g to be on the safe side. I don't do any special environment setup for DPDK. I just map the huge pages and bind NICs as you have done.
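
To check whether /dev/shm is large enough before reserving memory, the free space can be queried with `statvfs()`. This is a minimal sketch under that assumption; the helper name `fs_available_bytes` is hypothetical and is not part of ODP.

```c
#include <stdint.h>
#include <sys/statvfs.h>

/* Return the number of bytes available to unprivileged users on the
 * filesystem containing path, or -1 if statvfs() fails. */
static int64_t fs_available_bytes(const char *path)
{
	struct statvfs st;

	if (statvfs(path, &st) != 0)
		return -1;

	/* f_bavail is counted in units of f_frsize bytes. */
	return (int64_t)st.f_bavail * (int64_t)st.f_frsize;
}
```

Comparing `fs_available_bytes("/dev/shm")` against the requested size (618139648 bytes in the strace log above) would show whether the ENOSPC is expected. Docker's default /dev/shm size is 64 MB unless --shm-size is given, which is far below that request.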

MatiasElo added a commit to MatiasElo/odp that referenced this issue Jan 7, 2022
fallocate() (and ftruncate()) may fail due to system interrupts, so retry
the operation FALLOCATE_RETRIES times.

Fixes: OpenDataPlane#1368

Signed-off-by: Matias Elo <matias.elo@nokia.com>
Reported-and-tested-by: Christian Hong <guochun.hgc@alibaba-inc.com>
MatiasElo added a commit to MatiasElo/odp that referenced this issue Jan 7, 2022
fallocate() (and ftruncate()) may fail due to system interrupts, so retry
using TEMP_FAILURE_RETRY macro.

Fixes: OpenDataPlane#1368

Signed-off-by: Matias Elo <matias.elo@nokia.com>
Reported-and-tested-by: Christian Hong <guochun.hgc@alibaba-inc.com>
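
For reference, `TEMP_FAILURE_RETRY` is a glibc macro from `<unistd.h>` (available with `_GNU_SOURCE`) that re-evaluates an expression for as long as it returns -1 with errno set to EINTR. The sketch below shows how it could wrap fallocate(); the wrapper name `fallocate_eintr_safe` is hypothetical and the code is not copied from the referenced commit.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Allocate len bytes at offset in fd, transparently retrying when the
 * call is interrupted by a signal. Returns 0 on success, or -1 with
 * errno set to an error other than EINTR. */
static int fallocate_eintr_safe(int fd, off_t offset, off_t len)
{
	return (int)TEMP_FAILURE_RETRY(fallocate(fd, 0, offset, len));
}
```

Unlike a loop with a fixed retry count, the macro retries without an upper bound, so it only terminates once the call succeeds or fails with a different error.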