Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(shortfin-sd) Cleanup fiber distribution, logging, error handling. #555

Merged
merged 13 commits into from
Nov 18, 2024

Conversation

monorimet
Copy link
Contributor

@monorimet monorimet commented Nov 17, 2024

  • Switches fiber distribution to a much simpler process where idle fibers are kept in a set and pop/replaced when used. For dense batches this should be near optimal, though it cycles through fibers without regard of each fiber's current load (this only matters in per_call/none isolation where we allow in-fiber concurrency -- in per_call/none isolation the fibers are put back in the idle_fibers set after boarding)

  • Suppresses leak messages for shortfin nanobind objects with keep_awake (maybe fix and rollback after V3.0)

  • Makes port selection/error return much more friendly

  • Segments configuration and suppresses some builder outputs for now.

  • Makes logging look a little better, removes some extraneous outputs.

Copy link
Contributor

@stellaraccident stellaraccident left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A number of small nits. Approving since I'm going to be out today and you can address.

@@ -65,6 +65,7 @@ template <typename CppType, typename KeepAlivePatient, typename... Args>
inline py::object custom_new_keep_alive(py::handle py_type,
KeepAlivePatient &keep_alive,
Args &&...args) {
py::set_leak_warnings(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this to the module registration function in lib_ext.cpp. it is a global.

Agreed this thing is flaky. ASAN does a more reliable job.

@@ -135,7 +136,7 @@ def get_configs(args):
f"--model={modelname}",
f"--topology={topology_inp}",
]
outs = subprocess.check_output(cfg_builder_args).decode()
outs = subprocess.check_output(cfg_builder_args, stderr=subprocess.DEVNULL).decode()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes debugging impossible... We really need to finish the build API and get this stuff more for free

@@ -158,18 +159,19 @@ def get_configs(args):
arglist = spec.strip("--").split("=")
arg = arglist[0]
if len(arglist) > 2:
print(arglist)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stray print?

logger.info(f"Preparing runtime artifacts for {modelname}...")
logger.debug(
f"COMMAND LINE EQUIVALENT: "
+ " ".join([str(argn) for argn in builder_args])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If using an fstring, don't also concat like this.

+ " ".join([str(argn) for argn in builder_args])
)
output = subprocess.check_output(
builder_args, stderr=subprocess.DEVNULL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. Debugging not good.

Copy link
Contributor

@amd-chrissosa amd-chrissosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall took a look through it all. It's a pretty decent change to get in right before the deadline so please run flake8 / pylint against it to get some static checking in before submitting. It'll help to give extra qa coverage since you are changing some of the function inputs/outputs and you may have missed an edge case

value = arglist[1:]
for val in value:
try:
val = int(val)
except ValueError:
continue
val = val
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you doing val=val and value=value below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly for the device ids arg, where user can pass either device indexes (ints) or device UIDs (strings).
The logic being: if we can't convert them to ints, then we pass them as they were recieved.
doing continue here breaks parsing of non-list/int arguments e.g. isolation.

@monorimet
Copy link
Contributor Author

@amd-chrissosa thanks for the pylint/flake8 suggestion -- didn't find any regressions but did clean up quite a few unused imports/variables.

@monorimet monorimet merged commit a968fc4 into main Nov 18, 2024
11 of 13 checks passed
@monorimet monorimet deleted the sd-logging-nanobind branch November 18, 2024 18:51
@stellaraccident
Copy link
Contributor

I could have sworn we had those enabled in pre-commit. Probably running of another repo. Should fix that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants