Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

obj: handle ENOMEM during vector allocations #5537

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

bfaccini
Copy link
Contributor

@bfaccini bfaccini commented Feb 13, 2023

When missing, Handle ENOMEM during vector allocations to avoid later crashing.

Ref: pmem/issues #5515
Signed-off-by: Bruno Faccini bruno.faccini@intel.com


This change is Reviewable

@@ -1013,8 +1019,16 @@ pmemobj_tx_commit(void)
operation_start(tx->lane->external);

struct user_buffer_def *userbuf;
VEC_FOREACH_BY_PTR(userbuf, &tx->redo_userbufs)
operation_add_user_buffer(tx->lane->external, userbuf);
struct operation_context *ctx = tx->lane->external;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I indicated earlier in the issue, I don't think we need to check for this error here. Instead, we simply need to ensure that the operation_add_user_buffer never fails (like you've already done in tx_construct_user_buffer).

To have the FATAL/ASSERT that you added you can change the operation_add_user_buffer to return an error (-1) if VEC_PUSH_BACK(&ctx->next, buffer_offset); fails. This would let you avoid making all the memops structs public.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure to fully follow your reasoning, but anyway I will try to push a new commit based on what I think I have understood...

@@ -39,7 +40,53 @@ struct user_buffer_def {
size_t size;
};

struct operation_context;
enum operation_state {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better if we kept the structs private and instead relied on some inter-module API if necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok....

@@ -1013,8 +1015,12 @@ pmemobj_tx_commit(void)
operation_start(tx->lane->external);

struct user_buffer_def *userbuf;
struct operation_context *ctx = tx->lane->external;
uint64_t buf_off;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buf_off is unusued.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, sorry to have missed that

@bfaccini bfaccini marked this pull request as ready for review February 23, 2023 17:40
Copy link
Member

@lukaszstolarczuk lukaszstolarczuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 5 files reviewed, 6 unresolved discussions (waiting on @bfaccini and @pbalcer)


-- commits line 20 at r3:
you can squash all the commits into a single one, please


src/libpmemobj/memops.h line 2 at r3 (raw file):

/* SPDX-License-Identifier: BSD-3-Clause */
/* Copyright 2016-2020, Intel Corporation */

Since you're using an Intel e-mail address for your commits, you have to update all dates to -2023.
E.g., in this file: 2016-2023


src/libpmemobj/tx.c line 714 at r3 (raw file):

			userbuf.size - TX_INTENT_LOG_BUFFER_OVERHEAD;
	} else {
		if (operation_add_user_buffer(ctx, &userbuf) == -1);

is this ; expected here?

@bfaccini
Copy link
Contributor Author

bfaccini commented Mar 1, 2023

you can squash all the commits into a single one, please

Ok

Since you're using an Intel e-mail address for your commits, you have to update all dates to -2023. E.g., in this file: 2016-2023

Ok

  if (operation_add_user_buffer(ctx, &userbuf) == -1);

is this ; expected here?

Oops, sorry about the typo :-(

All these changes will be in next commit.

When missing, Handle ENOMEM during vector allocations
to avoid later crashing.

Ref: pmem/issues#5515
Signed-off-by: Bruno Faccini <bruno.faccini@intel.com>
@bfaccini bfaccini requested review from lukaszstolarczuk and removed request for pbalcer March 1, 2023 08:45
Copy link
Member

@lukaszstolarczuk lukaszstolarczuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r4, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @bfaccini and @pbalcer)


src/libpmemobj/memops.c line 618 at r4 (raw file):

	if (VEC_PUSH_BACK(&ctx->next, buffer_offset) != 0)
		return -1;

just wondering, should we clean something up in this case? I can see we're persisting something above 😉

@bfaccini
Copy link
Contributor Author

bfaccini commented Mar 1, 2023

just wondering, should we clean something up in this case? I can see we're persisting something above 😉

Well, I thought this kind of concern could araise when I started to work on how to fix this, as usual for error paths...
The problem is that I don't know what exactly does/provisions pmemops_persist() nor I have been able to identify which method/operation can revert its effects...
Also, and strangely (!!), I only found one error path after pmemops_persist() has been called (other place do not show any error path at all !!...), in src/libpmemobj/obj.c:obj_runtime_init(), and there is no clean-up there too !!

@grom72
Copy link
Contributor

grom72 commented Apr 4, 2023

@bfaccini are you going to fix this PR to pass tests?

Looks like doing so affects VEC_PUSH_BACK() execution,
surprisingly for non-debug build only, preventing next
vector allocation...

Signed-off-by: Bruno Faccini <bruno.faccini@intel.com>
@bfaccini
Copy link
Contributor Author

bfaccini commented Apr 6, 2023

@bfaccini are you going to fix this PR to pass tests?

Looks like embeding VEC_PUSH_BACK() in ASSERT() affects VEC_PUSH_BACK() execution, surprisingly for non-debug build only, preventing next vector allocation...

Signed-off-by: Bruno Faccini <bruno.faccini@intel.com>
@bfaccini
Copy link
Contributor Author

bfaccini commented Apr 7, 2023

Well, now that I have fixed the original errors/SEGVs during "./RUNTESTS -b nondebug obj_action -s TEST0", I now get new ones/asserts/aborts, but unfortunately I am unable to reproduce any of them locally :-(
@grom72 , any help/idea would be appreciated.

testing return value from VEC_PUSH_BACK() equal to 0
for error is not good !!...

Ref: pmem/issues#5515
Signed-off-by: Bruno Faccini <bruno.faccini@intel.com>
@bfaccini
Copy link
Contributor Author

Well, now that I have fixed the original errors/SEGVs during "./RUNTESTS -b nondebug obj_action -s TEST0", I now get new ones/asserts/aborts, but unfortunately I am unable to reproduce any of them locally :-(

Now that I have found that these asserts/aborts were due to the "wrong/reverse test" in my previous changes, some workflows are canceled and one has failed, but I am not able to find any reason in the logs... Can somebody help ??

@bfaccini
Copy link
Contributor Author

Can somebody help ??

???

@codecov
Copy link

codecov bot commented Apr 20, 2023

Codecov Report

Merging #5537 (e717f1a) into master (3c623ae) will increase coverage by 0.56%.
The diff coverage is 20.00%.

@@            Coverage Diff             @@
##           master    #5537      +/-   ##
==========================================
+ Coverage   71.62%   72.18%   +0.56%     
==========================================
  Files         162      146      -16     
  Lines       24256    22655    -1601     
  Branches        0     3778    +3778     
==========================================
- Hits        17373    16354    -1019     
+ Misses       6883     6301     -582     

@pbalcer
Copy link
Member

pbalcer commented Apr 20, 2023

I restarted the CI and most things pass, and the things that fail seem unrelated:


2023-04-13T08:00:40.9198249Z obj_basic_integration/TEST7: pmemdetect: /dev/dax0.2 -- permission denied
2023-04-13T08:00:40.9205687Z RUNTESTS: stopping: obj_basic_integration/TEST7 failed, TEST=check FS=none BUILD=debug
2023-04-13T08:00:41.0181676Z obj_basic_integration/TEST8: pmemdetect: /dev/dax0.2 -- permission denied
2023-04-13T08:00:41.0189683Z RUNTESTS: stopping: obj_basic_integration/TEST8 failed, TEST=check FS=none BUILD=debug
2023-04-13T08:00:41.1141737Z obj_basic_integration/TEST9: pmemdetect: /dev/dax0.2 -- permission denied
2023-04-13T08:00:41.1149866Z RUNTESTS: stopping: obj_basic_integration/TEST9 failed, TEST=check FS=none BUILD=debug
2023-04-13T08:00:41.2133923Z obj_basic_integration/TEST10: pmemdetect: /dev/dax0.2 -- permission denied
2023-04-13T08:00:41.2141758Z RUNTESTS: stopping: obj_basic_integration/TEST10 failed, TEST=check FS=none BUILD=debug

Can you rebase to see if this goes away?

@wlemkows
Copy link
Contributor

I restarted the CI and most things pass, and the things that fail seem unrelated:


2023-04-13T08:00:40.9198249Z obj_basic_integration/TEST7: pmemdetect: /dev/dax0.2 -- permission denied
2023-04-13T08:00:40.9205687Z RUNTESTS: stopping: obj_basic_integration/TEST7 failed, TEST=check FS=none BUILD=debug
2023-04-13T08:00:41.0181676Z obj_basic_integration/TEST8: pmemdetect: /dev/dax0.2 -- permission denied
2023-04-13T08:00:41.0189683Z RUNTESTS: stopping: obj_basic_integration/TEST8 failed, TEST=check FS=none BUILD=debug
2023-04-13T08:00:41.1141737Z obj_basic_integration/TEST9: pmemdetect: /dev/dax0.2 -- permission denied
2023-04-13T08:00:41.1149866Z RUNTESTS: stopping: obj_basic_integration/TEST9 failed, TEST=check FS=none BUILD=debug
2023-04-13T08:00:41.2133923Z obj_basic_integration/TEST10: pmemdetect: /dev/dax0.2 -- permission denied
2023-04-13T08:00:41.2141758Z RUNTESTS: stopping: obj_basic_integration/TEST10 failed, TEST=check FS=none BUILD=debug

Can you rebase to see if this goes away?

Unfortunately, this is a known issue (platform setup issue), it is not related to your PR. We are working on it.

@bfaccini
Copy link
Contributor Author

Can you rebase to see if this goes away?

Unfortunately, this is a known issue (platform setup issue), it is not related to your PR. We are working on it.

But should I rebase finally ?

@pbalcer
Copy link
Member

pbalcer commented Apr 21, 2023

No, everything seems fine.

LGTM overall.

Copy link
Member

@lukaszstolarczuk lukaszstolarczuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r7, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @bfaccini and @pbalcer)


src/libpmemobj/memops.c line 618 at r4 (raw file):
@wlemkows, can you please take a look at this piece of code...?

I allow myself to quote bfaccini's words (it was out of this issues' thread):

Well, I thought this kind of concern could araise when I started to work on how to fix this, as usual for error paths...
The problem is that I don't know what exactly does/provisions pmemops_persist() nor I have been able to identify which method/operation can revert its effects...
Also, and strangely (!!), I only found one error path after pmemops_persist() has been called (other place do not show any error path at all !!...), in src/libpmemobj/obj.c:obj_runtime_init(), and there is no clean-up there too !!

@janekmi janekmi added this to the 1.13 milestone Apr 25, 2023
@lukaszstolarczuk lukaszstolarczuk removed this from the 1.13 milestone Apr 26, 2023
@janekmi janekmi added this to the 1.14 milestone Apr 27, 2023
@janekmi janekmi modified the milestones: 2.0.0, 2.0.1 Jul 21, 2023
@janekmi janekmi modified the milestones: 2.0.1, 2.0.2 Nov 23, 2023
@janekmi janekmi modified the milestones: 2.1.0, 2.x Feb 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants