Inference server, lots of related changes #42

Merged
merged 22 commits into main on Sep 19, 2024

Conversation

@francoishernandez (Contributor) commented Jun 24, 2024

This is a very first draft of a simple FastAPI-based inference server. It is not much yet, but it will be a base to iterate on.

Key concepts/changes

  • transforms and transforms_configs are saved in an inference.json config file within the model directory, for transparent loading; convert_HF is tentatively adapted to grab everything transparently;
  • prediction settings are transparently supported in requests, via inheritance of DecodingConfig;
  • support for dynamic settings (updated in the predictor for each request), e.g. temperature, top_p, etc.; this might not be super robust yet, but it works for now (see the sketch after this list);
  • renaming of random-sampling flags (random_sampling_topk/p -> top_k/p, random_sampling_temp -> temperature) and homogenization across the codebase;
  • removal of the gpu flag in PredictConfig, which duplicated world_size/gpu_ranks (this might still be improved).
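
Below is a minimal sketch of what a request with per-request decoding settings could look like. The endpoint path, port, and payload field names are illustrative assumptions, not the actual server API.

```python
# Minimal sketch of a client request overriding decoding settings per request.
# The endpoint path ("/infer"), port, and payload field names are assumptions
# for illustration, not the actual server API.
import requests

payload = {
    "inputs": ["Translate this sentence into German."],
    # Settings inherited from DecodingConfig can be overridden per request:
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
}

response = requests.post("http://localhost:5000/infer", json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```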

Some short-term TODOs

  • proper support of GPU assignment, model loading/unloading;
  • prompt template support + OpenAI-like chat completion API;
  • allow configuration of some model level settings (e.g. quantization);

Some nice-to-haves

  • streaming support (requires significant adaptations in inference_engine and underlying codepaths);
  • lightweight docker image;
  • some nice caching mechanisms (e.g. prompt caching, see rustformers/llm#14);
  • CT2 format support once conversion is manageable;
  • dynamic batching?;

@francoishernandez (Contributor, Author) commented Sep 16, 2024

238ab22 -> mapped_tokens are now retrieved from HF's added_tokens (special_tokens_map.json). A rough sketch of the idea is shown below.
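
A rough sketch of that idea, assuming a standard HF special_tokens_map.json layout; the function names and the placeholder format are illustrative, not the actual convert_HF code.

```python
# Rough sketch (not the actual convert_HF code): collect the special/added
# tokens declared in a HF checkpoint's special_tokens_map.json, then map each
# one to a placeholder token. The "｟...｠" placeholder format is an assumption.
import json
from pathlib import Path

def collect_special_tokens(model_dir: str) -> list[str]:
    """Return the special/added token strings declared by the HF tokenizer."""
    data = json.loads((Path(model_dir) / "special_tokens_map.json").read_text())
    tokens = []
    for value in data.values():
        entries = value if isinstance(value, list) else [value]
        for entry in entries:
            # Entries are either plain strings or dicts with a "content" field.
            tokens.append(entry["content"] if isinstance(entry, dict) else entry)
    return tokens

def build_mapped_tokens(tokens: list[str]) -> list[tuple[str, str]]:
    """Pair each special token with a placeholder the transforms won't split."""
    return [(tok, "｟" + tok.strip("<|>") + "｠") for tok in tokens]
```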

TODO:

  • move stuff from inference.json to the main config.json, to avoid multiplying files;
  • load the base inference config in all inference paths (server/predict);
  • prompt template support (retrieve the jinja template from HF).

@francoishernandez changed the title from "[WIP] Inference server, lots of related changes" to "Inference server, lots of related changes" on Sep 18, 2024
@francoishernandez marked this pull request as ready for review on September 18, 2024, 15:25
@francoishernandez (Contributor, Author) commented Sep 18, 2024

We can probably merge this. The server itself works. It needs some improvements (GPU/memory model management, error handling, etc.), but all of that can be added iteratively.
Also, this PR fixes a few annoying things, such as the unnecessary "gpu" inference flag, and moves towards better support of llama-style placeholder tokens and chat templates. (Note: the eos_token patch in convert_HF is quite fishy, but #45 should improve it.)
Bumping to 0.0.2/0.1.0 after merging might not hurt, for clarity. (Maybe 0.0.2 first, with 0.1.0 after finalizing #45.)

@francoishernandez (Contributor, Author) commented:
d2fd18f aligns the behaviour of converted and trained models: the transforms_configs of a trained model are adapted to facilitate loading of the corresponding artifacts.
E.g. when training a model using "long/path/to/subwords.bpe", this file will be saved to the model's directory as "subwords.bpe", and the transform config in config.json will be updated to "${MODEL_PATH}/subwords.bpe", allowing transparent loading when predicting (or later fine-tuning).
This is quite a nice step towards simplifying the whole config/command management from a user point of view: we can now run inference with a simple command line, even with complex transforms. A rough sketch of the path rewriting is shown below.
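
A rough sketch of that path-rewriting idea, assuming a dict-like transforms_configs; the names and config structure are illustrative, not the actual implementation from d2fd18f.

```python
# Rough sketch (not the actual d2fd18f code): copy file-based transform
# artifacts into the model directory and rewrite their paths in the saved
# config as "${MODEL_PATH}/<basename>" so loading is transparent later.
import shutil
from pathlib import Path

def relocate_transform_artifacts(transforms_configs: dict, model_dir: str) -> dict:
    updated = {}
    for transform_name, config in transforms_configs.items():
        new_config = dict(config)
        for key, value in config.items():
            if isinstance(value, str) and Path(value).is_file():
                target = Path(model_dir) / Path(value).name
                # e.g. long/path/to/subwords.bpe -> <model_dir>/subwords.bpe
                shutil.copy(value, target)
                new_config[key] = "${MODEL_PATH}/" + target.name
        updated[transform_name] = new_config
    return updated
```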

@francoishernandez (Contributor, Author) commented:
fe8e8d7 -> when calling the infer entrypoint on a new model, any model that is already loaded is unloaded before loading the new one, to prevent potential conflicts. More specific logic (multi-model, multi-GPU, memory limits, etc.) can be implemented later depending on use cases. A minimal sketch of this unload-before-load pattern is shown below.
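
A minimal sketch of the unload-before-load pattern, under the assumption of a single in-memory engine; the class and method names here are hypothetical, not the actual server code.

```python
# Minimal sketch of unload-before-load: a single engine slot, freed before a
# different model is loaded. Names (Server, _load_engine, ...) are hypothetical.
import gc

import torch

class Server:
    def __init__(self):
        self.loaded_model_id = None
        self.engine = None

    def infer(self, model_id: str, inputs: list[str]):
        if self.loaded_model_id != model_id:
            self._unload()
            self.engine = self._load_engine(model_id)
            self.loaded_model_id = model_id
        return self.engine.predict(inputs)

    def _unload(self):
        if self.engine is not None:
            del self.engine
            self.engine = None
            self.loaded_model_id = None
            gc.collect()
            torch.cuda.empty_cache()  # release GPU memory held by the old model

    def _load_engine(self, model_id: str):
        raise NotImplementedError("placeholder for the actual engine loading")
```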

@francoishernandez merged commit 520566a into main on Sep 19, 2024. 5 checks passed.
Labels: enhancement (New feature or request)