Merge pull request #5678 from FederatedAI/patch-doc-update
Patch doc update
mgqa34 authored Aug 2, 2024
2 parents d4d8dd8 + b9a6848 commit d9253c4
Showing 10 changed files with 298 additions and 19 deletions.
20 changes: 11 additions & 9 deletions doc/2.0/fate/components/feature_binning.md
@@ -32,17 +32,19 @@ Principle](../../images/multiple_host_binning.png)

1. Support Quantile Binning based on quantile summary algorithm.
2. Support Bucket Binning.
3. Support manual binning based on user-defined split points (see the sketch below the table).
4. Support calculating woe and iv values.
5. Support transforming data into bin indexes or woe value (guest only).
6. Support multiple-host binning.
7. Support asymmetric binning methods on Host & Guest sides.

The table below lists supported cases with links to examples:

| Cases | Scenario |
|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Input Data with Categorical Features | [bucket binning](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_bucket.py) <br> [quantile binning](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_quantile.py) |
| Binning with User-defined split points | [manual binning](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_asymmetric.py) |
| Output Data Transformed | [bin index](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_asymmetric.py) <br> [woe value (guest-only)](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_asymmetric.py) |
| Skip Metrics Calculation | [multi_host](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_multi_host.py) |
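
As a minimal sketch of manual binning (`x0` is a placeholder feature name, and an upstream `psi_0` task is assumed, as in the updated pipeline example later in this commit):

```python
from fate_client.pipeline.components.fate import HeteroFeatureBinning

# manual binning: split_pt_dict maps each feature name to its
# user-defined, ascending split points
binning_0 = HeteroFeatureBinning("binning_0",
                                 method="manual",
                                 split_pt_dict={"x0": [0.1, 0.3, 0.5]},
                                 train_data=psi_0.outputs["output_data"],
                                 local_only=True)
```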


36 changes: 32 additions & 4 deletions doc/2.0/fate/components/feature_selection.md
@@ -28,24 +28,52 @@ Below lists their acceptable parameter values.
| IV Filter | iv_param | "iv" | "threshold", "top_k", "top_percentile" | True |
| Statistic Filter | statistic_param | "max", "min", "mean", "median", "std", "var", "coefficient_of_variance", "skewness", "kurtosis", "missing_count", "missing_ratio", quantile (e.g. "95%") | "threshold", "top_k", "top_percentile" | True/False |

## Filter Configuration

1. iv\_filter: Use iv as the criterion to select features (see the configuration sketch after this list).
    - filter_type: Supports three modes: threshold value, top-k and top-percentile.
        - threshold value: Filter out columns whose iv is smaller than the threshold. A different threshold may be set for each party.
        - top-k: Sort features from larger iv to smaller and take the top k features in the sorted result.
        - top-percentile: Sort features from larger iv to smaller and take the top percentile.
    - select_federated: If set to True, feature selection is performed in a federated manner: selection runs on the guest side, and the selection result for host features (identified by their anonymous names) is sent to the host side, which then filters its features accordingly. This param is available in iv\_filter only.
    - threshold: The threshold value for feature selection.
    - take_high: If set to True, the filter selects features with higher iv values; if set to False, it selects features with lower iv values.
    - host_filter_type: The filter type for host features; may be "threshold", "top_k", or "top_percentile". This param is available in iv\_filter only.
    - host_threshold: The threshold value for feature selection on host features. This param is available in iv\_filter only.
    - host_top_k: The top k value for feature selection on host features. This param is available in iv\_filter only.
2. statistic\_filter: Use statistic values calculated by the DataStatistic component, such as coefficient of variance, missing values, and percentile values. You can keep the columns with higher or lower statistic values as needed.
    - filter_type: Supports three modes: threshold value, top-k and top-percentile.
        - threshold value: Filter out columns whose statistic metric is smaller than the threshold. A different threshold may be set for each party.
        - top-k: Sort features from larger statistic metric to smaller and take the top k features in the sorted result.
        - top-percentile: Sort features from larger statistic metric to smaller and take the top percentile.
    - threshold: The threshold value for feature selection.
    - take_high: If set to True, the filter selects features with higher metric values; if set to False, it selects features with lower metric values.
3. manually: Indicate features that need to be filtered out or kept.
    - keep_col: The columns that need to be kept.
    - filter_out_col: The columns that need to be dropped.
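
A minimal configuration sketch combining the three filters (`x0` and `x1` are placeholder column names; upstream `psi_0`, `binning_0` and `statistics_0` tasks are assumed, as in the model-building quick start below):

```python
from fate_client.pipeline.components.fate import HeteroFeatureSelection

# iv filter (federated, top-k), statistic filter (lower values kept via
# take_high=False) and manual filter applied together
selection_0 = HeteroFeatureSelection(
    "selection_0",
    method=["iv", "statistics", "manual"],
    train_data=psi_0.outputs["output_data"],
    input_models=[binning_0.outputs["output_model"],
                  statistics_0.outputs["output_model"]],
    iv_param={"metrics": "iv", "filter_type": "top_k", "threshold": 6,
              "select_federated": True},
    statistic_param={"metrics": ["max", "mean"],
                     "filter_type": "top_k", "threshold": 5, "take_high": False},
    manual_param={"keep_col": ["x0", "x1"]})
```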

Besides, we support multi-host federated feature selection for iv
filters. Starting in ver 2.0.0-beta, all data sets will obtain anonymous header
163 changes: 163 additions & 0 deletions doc/2.0/fate/model_building_quick_start.md
@@ -0,0 +1,163 @@
## Quick Start: A Model Building Demo

1. Install `fate_client` with the extras `fate` and `fate_flow`:

```sh
python -m pip install -U pip && python -m pip install fate_client[fate,fate_flow]==2.2.0
```
After the packages are installed successfully, initialize the fate_flow service and fate_client:

```sh
mkdir fate_workspace
fate_flow init --ip 127.0.0.1 --port 9380 --home $(pwd)/fate_workspace
pipeline init --ip 127.0.0.1 --port 9380

fate_flow start
fate_flow status # make sure fate_flow service is started
```

2. Download the example data:

```sh
wget https://raw.githubusercontent.com/wiki/FederatedAI/FATE/example/data/breast_hetero_guest.csv && \
wget https://raw.githubusercontent.com/wiki/FederatedAI/FATE/example/data/breast_hetero_host.csv
```

3. Transform the example data into dataframes used in FATE:

```python
import os
from fate_client.pipeline import FateFlowPipeline

base_path = os.path.abspath(os.path.join(__file__, os.path.pardir))
guest_data_path = os.path.join(base_path, "breast_hetero_guest.csv")
host_data_path = os.path.join(base_path, "breast_hetero_host.csv")

# local pipeline for transforming csv files into FATE dataframes
data_pipeline = FateFlowPipeline().set_parties(local="0")
guest_meta = {
    "delimiter": ",", "dtype": "float64", "label_type": "int64",
    "label_name": "y", "match_id_name": "id"
}
host_meta = {
    "delimiter": ",", "input_format": "dense", "match_id_name": "id"
}
data_pipeline.transform_local_file_to_dataframe(file=guest_data_path, namespace="experiment",
                                                name="breast_hetero_guest", meta=guest_meta,
                                                head=True, extend_sid=True)
data_pipeline.transform_local_file_to_dataframe(file=host_data_path, namespace="experiment",
                                                name="breast_hetero_host", meta=host_meta,
                                                head=True, extend_sid=True)
```
4. Run the training example and save the pipeline:

```python
from fate_client.pipeline.components.fate import (
    Reader,
    PSI,
    HeteroFeatureBinning,
    HeteroFeatureSelection,
    DataSplit,
    Statistics,
    FeatureScale,
    SSHELR,
    Evaluation
)
from fate_client.pipeline import FateFlowPipeline


# create pipeline for training
pipeline = FateFlowPipeline().set_parties(guest="9999", host="10000")

# create reader task_desc
reader_0 = Reader("reader_0")
reader_0.guest.task_parameters(namespace="experiment", name="breast_hetero_guest")
reader_0.hosts[0].task_parameters(namespace="experiment", name="breast_hetero_host")

# create psi component_desc
psi_0 = PSI("psi_0", input_data=reader_0.outputs["output_data"])

# split intersected data into train and validate sets
data_split_0 = DataSplit("data_split_0", input_data=psi_0.outputs["output_data"],
                         train_size=0.7, validate_size=0.3, test_size=None, stratified=True)

# compute metrics for selection
binning_0 = HeteroFeatureBinning("binning_0", train_data=data_split_0.outputs["train_output_data"],
                                 method="bucket", n_bins=10)
statistics_0 = Statistics("statistics_0", input_data=data_split_0.outputs["train_output_data"],
                          metrics=["min", "max", "25%", "mean", "median"])

# run feature selection on the train set, then apply it to the validate set
selection_0 = HeteroFeatureSelection("selection_0",
                                     method=["iv", "statistics", "manual"],
                                     train_data=data_split_0.outputs["train_output_data"],
                                     input_models=[binning_0.outputs["output_model"],
                                                   statistics_0.outputs["output_model"]],
                                     iv_param={"metrics": "iv", "filter_type": "top_k", "threshold": 6,
                                               "select_federated": True},
                                     statistic_param={"metrics": ["max", "mean"],
                                                      "filter_type": "top_k", "threshold": 5, "take_high": False},
                                     manual_param={"keep_col": ["x0", "x1"]})
selection_1 = HeteroFeatureSelection("selection_1",
                                     test_data=data_split_0.outputs["validate_output_data"],
                                     input_model=selection_0.outputs["train_output_model"])

# scale selected data
scale_0 = FeatureScale("scale_0", train_data=selection_0.outputs["train_output_data"], method="min_max")
scale_1 = FeatureScale("scale_1", test_data=selection_1.outputs["test_output_data"],
                       input_model=scale_0.outputs["output_model"])

# train with sshe lr on the scaled data
sshe_lr_0 = SSHELR("sshe_lr_0", train_data=scale_0.outputs["train_output_data"],
                   validate_data=scale_1.outputs["test_output_data"], epochs=3)

# evaluate model output on guest side
evaluation_0 = Evaluation("evaluation_0", input_datas=[sshe_lr_0.outputs["train_output_data"]],
                          default_eval_setting="binary",
                          runtime_parties=dict(guest="9999"))

# compose training pipeline
pipeline.add_tasks([reader_0, psi_0, data_split_0,
                    binning_0, statistics_0, selection_0, selection_1,
                    scale_0, scale_1, sshe_lr_0, evaluation_0])

# compile and train
pipeline.compile()
pipeline.fit()

# print metric and model info
print(pipeline.get_task_info("sshe_lr_0").get_output_model())
print(pipeline.get_task_info("evaluation_0").get_output_metric())

# save pipeline for later usage
pipeline.dump_model("./pipeline.pkl")

```

5. Reload the trained pipeline and run prediction:

```python
from fate_client.pipeline import FateFlowPipeline
from fate_client.pipeline.components.fate import Reader

# create pipeline for predicting
predict_pipeline = FateFlowPipeline()

# reload trained pipeline
pipeline = FateFlowPipeline.load_model("./pipeline.pkl")

# deploy task for inference
pipeline.deploy([pipeline.psi_0, pipeline.selection_0, pipeline.scale_0, pipeline.sshe_lr_0])

# add input to deployed_pipeline
deployed_pipeline = pipeline.get_deployed_pipeline()
reader_1 = Reader("reader_1")
reader_1.guest.task_parameters(namespace="experiment", name="breast_hetero_guest")
reader_1.hosts[0].task_parameters(namespace="experiment", name="breast_hetero_host")
deployed_pipeline.psi_0.input_data = reader_1.outputs["output_data"]

# add task to predict pipeline
predict_pipeline.add_tasks([reader_1, deployed_pipeline])

# compile and predict
predict_pipeline.compile()
predict_pipeline.predict()
```
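
To inspect the prediction output, something like the following may work (this assumes the task-info object returned by `get_task_info` also exposes `get_output_data()`, alongside the `get_output_model()`/`get_output_metric()` accessors used in the training step above):

```python
# fetch predictions produced by the deployed sshe_lr_0 task
print(predict_pipeline.get_task_info("sshe_lr_0").get_output_data())
```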

6. More tutorials

More pipeline API guides can be found in the [FATE-Client pipeline documentation](https://github.com/FederatedAI/FATE-Client/blob/main/doc/pipeline.md)
80 changes: 80 additions & 0 deletions doc/2.0/fate/psi_quick_start.md
@@ -0,0 +1,80 @@
## PSI Quick Start

1. Install `fate_client` with the extras `fate` and `fate_flow`:

```sh
python -m pip install -U pip && python -m pip install fate_client[fate,fate_flow]==2.2.0
```
After the packages are installed successfully, initialize the fate_flow service and fate_client:

```sh
mkdir fate_workspace
fate_flow init --ip 127.0.0.1 --port 9380 --home $(pwd)/fate_workspace
pipeline init --ip 127.0.0.1 --port 9380

fate_flow start
fate_flow status # make sure fate_flow service is started
```


2. Download the example data:

```sh
wget https://raw.githubusercontent.com/wiki/FederatedAI/FATE/example/data/breast_hetero_guest.csv && \
wget https://raw.githubusercontent.com/wiki/FederatedAI/FATE/example/data/breast_hetero_host.csv
```

3. Transform the example data into dataframes used in FATE:
```python
import os
from fate_client.pipeline import FateFlowPipeline


base_path = os.path.abspath(os.path.join(__file__, os.path.pardir))
guest_data_path = os.path.join(base_path, "breast_hetero_guest.csv")
host_data_path = os.path.join(base_path, "breast_hetero_host.csv")

# local pipeline for transforming csv files into FATE dataframes
data_pipeline = FateFlowPipeline().set_parties(local="0")
guest_meta = {
    "delimiter": ",", "dtype": "float64", "label_type": "int64",
    "label_name": "y", "match_id_name": "id"
}
host_meta = {
    "delimiter": ",", "input_format": "dense", "match_id_name": "id"
}
data_pipeline.transform_local_file_to_dataframe(file=guest_data_path, namespace="experiment",
                                                name="breast_hetero_guest", meta=guest_meta,
                                                head=True, extend_sid=True)
data_pipeline.transform_local_file_to_dataframe(file=host_data_path, namespace="experiment",
                                                name="breast_hetero_host", meta=host_meta,
                                                head=True, extend_sid=True)
```
4. Run PSI:

```python
from fate_client.pipeline.components.fate import (
Reader,
PSI
)
from fate_client.pipeline import FateFlowPipeline


# create pipeline
pipeline = FateFlowPipeline().set_parties(guest="9999", host="10000")

# create reader task_desc
reader_0 = Reader("reader_0")
reader_0.guest.task_parameters(namespace="experiment", name="breast_hetero_guest")
reader_0.hosts[0].task_parameters(namespace="experiment", name="breast_hetero_host")

# create psi component_desc
psi_0 = PSI("psi_0", input_data=reader_0.outputs["output_data"])

# add tasks to pipeline
pipeline.add_tasks([reader_0, psi_0])

# compile and run
pipeline.compile()
pipeline.fit()

```
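
To check the intersection result, something like the following may work (assuming the task-info object exposes `get_output_data()`, alongside the `get_output_model()`/`get_output_metric()` accessors used in the model-building tutorial above):

```python
# fetch the intersected data produced by psi_0
print(pipeline.get_task_info("psi_0").get_output_data())
```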

5. More tutorials

More pipeline API guides can be found in the [FATE-Client pipeline documentation](https://github.com/FederatedAI/FATE-Client/blob/main/doc/pipeline.md)
2 changes: 2 additions & 0 deletions doc/README.md
@@ -3,7 +3,9 @@
### Tutorial

- [Quick Start](./2.0/fate/quick_start.md): Train & predict with FATE HeteroSecureBoost using FATE-Pipeline
- [Running PSI](./2.0/fate/psi_quick_start.md): Run PSI only using FATE-Pipeline
- [Quick Start with Homo NN](./2.0/fate/homo_quick_start.md): Train & predict with FATE HomoNN using FATE-Pipeline
- [Building Models with Hetero Components](./2.0/fate/model_building_quick_start.md): A model-building tutorial with Hetero components, covering reading data, feature engineering, and training & evaluating models

### FATE Design
- [Architecture](./architecture/README.md): Building unified and standardized API for heterogeneous computing engines interconnection
@@ -44,8 +44,8 @@ def main(config="../config.yaml", namespace=""):
     psi_0 = PSI("psi_0", input_data=reader_0.outputs["output_data"])
 
     binning_0 = HeteroFeatureBinning("binning_0",
-                                     method="quantile",
-                                     n_bins=10,
+                                     method="manual",
+                                     split_pt_dict={"x0": [0.1, 0.3, 0.5]},
                                      train_data=psi_0.outputs["output_data"],
                                      local_only=True
                                      )
@@ -53,7 +53,8 @@ def main(config=".../config.yaml", namespace=""):
                                          method=["iv"],
                                          train_data=psi_0.outputs["output_data"],
                                          input_models=[binning_0.outputs["output_model"]],
-                                         iv_param={"metrics": "iv", "filter_type": "threshold", "threshold": 0.1})
+                                         iv_param={"metrics": "iv", "filter_type": "threshold", "threshold": 0.1,
+                                                   "select_federated": True})
 
     pipeline.add_tasks([reader_0, psi_0, binning_0, selection_0])

@@ -55,7 +55,8 @@ def main(config=".../config.yaml", namespace=""):
                                          method=["iv"],
                                          train_data=psi_0.outputs["output_data"],
                                          input_models=[binning_0.outputs["output_model"]],
-                                         iv_param={"metrics": "iv", "filter_type": "threshold", "threshold": 0.1})
+                                         iv_param={"metrics": "iv", "filter_type": "threshold", "threshold": 0.1,
+                                                   "select_federated": True})
 
     lr_0 = SSHELR("lr_0",
                   learning_rate=0.05,
@@ -54,7 +54,8 @@ def main(config=".../config.yaml", namespace=""):
                                          train_data=psi_0.outputs["output_data"],
                                          input_models=[binning_0.outputs["output_model"],
                                                        statistics_0.outputs["output_model"]],
-                                         iv_param={"metrics": "iv", "filter_type": "top_percentile", "threshold": 0.8},
+                                         iv_param={"metrics": "iv", "filter_type": "top_percentile", "threshold": 0.8,
+                                                   "select_federated": True},
                                          statistic_param={"metrics": ["max", "mean"],
                                                           "filter_type": "top_k", "threshold": 5},
                                          manual_param={"keep_col": ["x0", "x1"]}
