-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(shortfin-sd) Cleanup fiber distribution, logging, error handling. #555
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A number of small nits. Approving since I'm going to be out today and you can address.
shortfin/python/lib_ext.h
Outdated
@@ -65,6 +65,7 @@ template <typename CppType, typename KeepAlivePatient, typename... Args> | |||
inline py::object custom_new_keep_alive(py::handle py_type, | |||
KeepAlivePatient &keep_alive, | |||
Args &&...args) { | |||
py::set_leak_warnings(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move this to the module registration function in lib_ext.cpp. it is a global.
Agreed this thing is flaky. ASAN does a more reliable job.
@@ -135,7 +136,7 @@ def get_configs(args): | |||
f"--model={modelname}", | |||
f"--topology={topology_inp}", | |||
] | |||
outs = subprocess.check_output(cfg_builder_args).decode() | |||
outs = subprocess.check_output(cfg_builder_args, stderr=subprocess.DEVNULL).decode() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes debugging impossible... We really need to finish the build API and get this stuff more for free
@@ -158,18 +159,19 @@ def get_configs(args): | |||
arglist = spec.strip("--").split("=") | |||
arg = arglist[0] | |||
if len(arglist) > 2: | |||
print(arglist) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stray print?
logger.info(f"Preparing runtime artifacts for {modelname}...") | ||
logger.debug( | ||
f"COMMAND LINE EQUIVALENT: " | ||
+ " ".join([str(argn) for argn in builder_args]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If using an fstring, don't also concat like this.
+ " ".join([str(argn) for argn in builder_args]) | ||
) | ||
output = subprocess.check_output( | ||
builder_args, stderr=subprocess.DEVNULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto. Debugging not good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall took a look through it all. It's a pretty decent change to get in right before the deadline so please run flake8 / pylint against it to get some static checking in before submitting. It'll help to give extra qa coverage since you are changing some of the function inputs/outputs and you may have missed an edge case
value = arglist[1:] | ||
for val in value: | ||
try: | ||
val = int(val) | ||
except ValueError: | ||
continue | ||
val = val |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you doing val=val and value=value below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly for the device ids arg, where user can pass either device indexes (ints) or device UIDs (strings).
The logic being: if we can't convert them to ints, then we pass them as they were recieved.
doing continue here breaks parsing of non-list/int arguments e.g. isolation
.
@amd-chrissosa thanks for the pylint/flake8 suggestion -- didn't find any regressions but did clean up quite a few unused imports/variables. |
I could have sworn we had those enabled in pre-commit. Probably running of another repo. Should fix that. |
Switches fiber distribution to a much simpler process where idle fibers are kept in a set and pop/replaced when used. For dense batches this should be near optimal, though it cycles through fibers without regard of each fiber's current load (this only matters in per_call/none isolation where we allow in-fiber concurrency -- in per_call/none isolation the fibers are put back in the idle_fibers set after boarding)
Suppresses leak messages for shortfin nanobind objects with keep_awake (maybe fix and rollback after V3.0)
Makes port selection/error return much more friendly
Segments configuration and suppresses some builder outputs for now.
Makes logging look a little better, removes some extraneous outputs.