FedML 0.8.4
What's Changed
New Features in 0.8.4
At FedML, our mission is to remove the friction and pain points of moving your ML & AI models from R&D into production-scale distributed and federated training & serving via our no-code MLOps platform.
FedML is happy to announce release 0.8.4. This release is filled with new capabilities, bug fixes, and enhancements. A key announcement is the launch of FedLLM, which simplifies and reduces the cost of training and serving large language models. You can read more about it in our blog post.
New Features
- [CoreEngine/MLOps] Launched FedLLM (Federated Large Language Model) for training and serving (GitHub, Blog)
- [CoreEngine] Added Helm charts to our repository for packaging and easier deployment on Kubernetes:
  https://github.com/FedML-AI/FedML/blob/master/installation/install_on_k8s/fedml-edge-client-server/fedml-server-deployment-latest.tgz
  https://github.com/FedML-AI/FedML/blob/master/installation/install_on_k8s/fedml-edge-client-server/fedml-client-deployment-latest.tgz
- [Documents] Refactored the devops and installation directory structures (devops for internal pipelines, installation for external users): https://github.com/FedML-AI/FedML/tree/master/installation
- [DevOps] Released a new fedml-light Docker image and related documentation (DockerHub, GitHub doc)
- [DevOps] Built the light Docker image for deployment to a Kubernetes cluster and refined the Kubernetes installation sections of the documentation: https://hub.docker.com/r/fedml/fedml-edge-client-server-light/tags
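For reference, a minimal Python sketch of pulling the light image with the Docker SDK for Python (the "latest" tag is an assumption; see the tags page above for the actual tags):

import docker

# Connect to the local Docker daemon
client = docker.from_env()

# Pull the light edge client/server image; the "latest" tag is an assumption,
# check the DockerHub tags page above for the available tags
image = client.images.pull("fedml/fedml-edge-client-server-light", tag="latest")
print(image.id)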
- [CoreEngine] Added support for multiple simultaneous training jobs when using our open-source MLOps commands.
- [CoreEngine] Improved training health monitoring so that a failed status is properly reported.
- [CoreEngine] Added APIs for enabling, disabling, and querying the client agent status. The APIs are as follows:
curl -XPOST http://localhost:40800/fedml/api/v2/disableAgent -d '{}'
curl -XPOST http://localhost:40800/fedml/api/v2/enableAgent -d '{}'
curl -XPOST http://localhost:40800/fedml/api/v2/queryAgentStatus -d '{}'
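For programmatic access, here is a minimal Python sketch using the requests library; it simply mirrors the curl calls above, including the empty JSON payload:

import requests

BASE_URL = "http://localhost:40800/fedml/api/v2"

# Disable, then re-enable, the client agent (empty JSON body, as in the curl examples)
requests.post(f"{BASE_URL}/disableAgent", json={})
requests.post(f"{BASE_URL}/enableAgent", json={})

# Query the current agent status and print the raw response body
response = requests.post(f"{BASE_URL}/queryAgentStatus", json={})
print(response.text)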
Bug Fixes
- [CoreEngine] Created distinct device IDs when running multiple Docker containers to simulate multiple clients or silos on one machine; the device ID is now the product ID plus a random ID.
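As an illustration of the scheme (a hypothetical helper, not FedML's actual code), combining a stable product ID with a random suffix keeps containers on one machine distinct:

import uuid

def make_device_id(product_id: str) -> str:
    # Append a short random suffix so containers sharing the same product ID differ
    return f"{product_id}-{uuid.uuid4().hex[:8]}"

# Two simulated clients on the same host now get distinct device IDs
print(make_device_id("prod-12345"))
print(make_device_id("prod-12345"))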
- [CoreEngine] Fixed a device assignment issue in get_torch_device in distributed training mode.
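One common pattern for per-worker device assignment in distributed training (a hypothetical sketch; the actual fix in get_torch_device may differ):

import torch

def assign_torch_device(use_gpu: bool, local_rank: int) -> torch.device:
    # Give each distributed worker its own GPU by local rank; fall back to CPU
    if use_gpu and torch.cuda.is_available():
        return torch.device(f"cuda:{local_rank % torch.cuda.device_count()}")
    return torch.device("cpu")

print(assign_torch_device(use_gpu=True, local_rank=0))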
- [Serving] Fixed exceptions that occurred when recovering at startup after an upgrade.
- [CoreEngine] Fixed the device ID issue when running in Docker on macOS.
- [App] Fixed an issue in the FedProx + SAGE graph regression and graph classification apps.
- [App] Fixed an issue with the heart disease app failing when running in MLOps.
- [App] Fixed an issue with the heart disease app’s performance curve.
- [App/Android] Enhanced the Android starting/stopping mechanism and fixed the following issues:
  - Fixed the status display after stopping a run.
  - When a run is stopped during an unfinished round, the MNN process now remains in the IDLE state (it previously went OFFLINE).
  - When a run is stopped after a round has completed, training now stops properly.
  - Fixed an incorrect Python server TAG in the logs, so the server can now easily be found in the log output.
Enhancements
- [Serving] Tested the inference backend and verified the response after model deployment finishes.
- [CoreEngine/Serving] Set the GPU option based on CUDA availability when running the inference backend, and optimized the MQTT connection check.
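A minimal sketch of this kind of check (a simplification, not the serving backend's exact code):

import torch

def pick_inference_device() -> str:
    # Enable the GPU option only when CUDA is actually usable on this host
    return "cuda" if torch.cuda.is_available() else "cpu"

print(pick_inference_device())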
- [CoreEngine] Stored model caches in the user's home directory when running federated learning.
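For illustration, resolving a per-user cache directory might look like the following; the .fedml/model_cache path is an assumption, not FedML's documented location:

import os

# Hypothetical cache path under the user's home directory
cache_dir = os.path.join(os.path.expanduser("~"), ".fedml", "model_cache")
os.makedirs(cache_dir, exist_ok=True)
print(cache_dir)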
- [CoreEngine] Added the device ID to the monitor message when processing inference requests.
- [CoreEngine] Reported runner exceptions, and ignored exceptions when the bootstrap section is missing from fedml_config.yaml.