Inference server, lots of related changes #42

Merged
merged 22 commits into main on Sep 19, 2024

Conversation

@francoishernandez (Contributor) commented Jun 24, 2024

This is a very first draft of a simple FastAPI-based inference server. It is not much yet, but it will be a base to iterate on.

Key concepts/changes

  • transforms and transforms_configs are saved in an inference.json config file within the model directory, for transparent loading; convert_HF is tentatively adapted to grab everything transparently;
  • prediction settings are transparently supported in requests, via inheritance of DecodingConfig;
  • support for dynamic settings (updated in the predictor for each request), e.g. temperature, top_p, etc.; this might not be super robust yet, but it works for now (see the sketch after this list);
  • renaming of random-sampling flags (random_sampling_topk/p -> top_k/p, random_sampling_temp -> temperature) and homogenization across the codebase;
  • removal of the gpu flag in PredictConfig, which duplicated world_size/gpu_ranks (this might still be improved).
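
Below is a minimal sketch of what a request with per-request decoding settings could look like. The endpoint path, port, and payload field names are illustrative assumptions, not the actual server API.

```python
# Minimal sketch of a client request overriding decoding settings per request.
# The endpoint path ("/infer"), port, and payload field names are assumptions
# for illustration, not the actual server API.
import requests

payload = {
    "inputs": ["Translate this sentence into German."],
    # Settings inherited from DecodingConfig can be overridden per request:
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
}

response = requests.post("http://localhost:5000/infer", json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```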

Some short-term TODOs

  • proper support of GPU assignment, model loading/unloading;
  • prompt template support + OpenAI-like chat completion API;
  • allow configuration of some model level settings (e.g. quantization);

Some nice-to-haves

  • streaming support (requires significant adaptations in inference_engine and underlying codepaths);
  • lightweight docker image;
  • some nice caching mechanisms (e.g. prompt caching, see rustformers/llm#14);
  • CT2 format support once conversion is manageable;
  • dynamic batching?;

@francoishernandez (Contributor, Author) commented Sep 16, 2024

238ab22 -> mapped_tokens are now retrieved from HF's added_tokens (special_tokens_map.json). A rough sketch of the idea is shown below.
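
A rough sketch of that idea, assuming a standard HF special_tokens_map.json layout; the function names and the placeholder format are illustrative, not the actual convert_HF code.

```python
# Rough sketch (not the actual convert_HF code): collect the special/added
# tokens declared in a HF checkpoint's special_tokens_map.json, then map each
# one to a placeholder token. The "｟...｠" placeholder format is an assumption.
import json
from pathlib import Path

def collect_special_tokens(model_dir: str) -> list[str]:
    """Return the special/added token strings declared by the HF tokenizer."""
    data = json.loads((Path(model_dir) / "special_tokens_map.json").read_text())
    tokens = []
    for value in data.values():
        entries = value if isinstance(value, list) else [value]
        for entry in entries:
            # Entries are either plain strings or dicts with a "content" field.
            tokens.append(entry["content"] if isinstance(entry, dict) else entry)
    return tokens

def build_mapped_tokens(tokens: list[str]) -> list[tuple[str, str]]:
    """Pair each special token with a placeholder the transforms won't split."""
    return [(tok, "｟" + tok.strip("<|>") + "｠") for tok in tokens]
```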

TODO:

  • move stuff from inference.json to the main config.json, to avoid multiplying files;
  • load the base inference config in all inference paths (server/predict);
  • prompt template support (retrieve the jinja template from HF).

@francoishernandez changed the title from "[WIP] Inference server, lots of related changes" to "Inference server, lots of related changes" on Sep 18, 2024
@francoishernandez marked this pull request as ready for review on September 18, 2024, 15:25
@francoishernandez (Contributor, Author) commented Sep 18, 2024

We can probably merge this. The server itself works. It needs some improvements (GPU/memory model management, error handling, etc.), but all of that can be added iteratively.
Also, this PR fixes a few annoying things, such as the unnecessary "gpu" inference flag, and moves towards better support of llama-style placeholder tokens and chat templates. (Note: the eos_token patch in convert_HF is quite fishy, but #45 should improve it.)
Bumping to 0.0.2/0.1.0 after merging might not hurt, for clarity. (Maybe 0.0.2 first, with 0.1.0 after finalizing #45.)

@francoishernandez (Contributor, Author) commented:
d2fd18f aligns the behaviour of converted and trained models: the transforms_configs of a trained model are adapted to facilitate loading of the corresponding artifacts.
E.g. when training a model using "long/path/to/subwords.bpe", this file will be saved to the model's directory as "subwords.bpe", and the transform config in config.json will be updated to "${MODEL_PATH}/subwords.bpe", allowing transparent loading when predicting (or later fine-tuning).
This is quite a nice step towards simplifying the whole config/command management from a user point of view: we can now run inference with a simple command line, even with complex transforms. A rough sketch of the path rewriting is shown below.
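
A rough sketch of that path-rewriting idea, assuming a dict-like transforms_configs; the names and config structure are illustrative, not the actual implementation from d2fd18f.

```python
# Rough sketch (not the actual d2fd18f code): copy file-based transform
# artifacts into the model directory and rewrite their paths in the saved
# config as "${MODEL_PATH}/<basename>" so loading is transparent later.
import shutil
from pathlib import Path

def relocate_transform_artifacts(transforms_configs: dict, model_dir: str) -> dict:
    updated = {}
    for transform_name, config in transforms_configs.items():
        new_config = dict(config)
        for key, value in config.items():
            if isinstance(value, str) and Path(value).is_file():
                target = Path(model_dir) / Path(value).name
                # e.g. long/path/to/subwords.bpe -> <model_dir>/subwords.bpe
                shutil.copy(value, target)
                new_config[key] = "${MODEL_PATH}/" + target.name
        updated[transform_name] = new_config
    return updated
```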

@francoishernandez (Contributor, Author) commented:
fe8e8d7 -> when calling the infer entrypoint on a new model, any model that is already loaded is unloaded before loading the new one, to prevent potential conflicts. More specific logic (multi-model, multi-GPU, memory limits, etc.) can be implemented later depending on use cases. A minimal sketch of this unload-before-load pattern is shown below.
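
A minimal sketch of the unload-before-load pattern, under the assumption of a single in-memory engine; the class and method names here are hypothetical, not the actual server code.

```python
# Minimal sketch of unload-before-load: a single engine slot, freed before a
# different model is loaded. Names (Server, _load_engine, ...) are hypothetical.
import gc

import torch

class Server:
    def __init__(self):
        self.loaded_model_id = None
        self.engine = None

    def infer(self, model_id: str, inputs: list[str]):
        if self.loaded_model_id != model_id:
            self._unload()
            self.engine = self._load_engine(model_id)
            self.loaded_model_id = model_id
        return self.engine.predict(inputs)

    def _unload(self):
        if self.engine is not None:
            del self.engine
            self.engine = None
            self.loaded_model_id = None
            gc.collect()
            torch.cuda.empty_cache()  # release GPU memory held by the old model

    def _load_engine(self, model_id: str):
        raise NotImplementedError("placeholder for the actual engine loading")
```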

@francoishernandez merged commit 520566a into main on Sep 19, 2024. 5 checks passed.
Labels: enhancement (New feature or request)