Merge pull request #5678 from FederatedAI/patch-doc-update
Patch doc update
mgqa34 authored Aug 2, 2024
2 parents d4d8dd8 + b9a6848 commit d9253c4
Showing 10 changed files with 298 additions and 19 deletions.
20 changes: 11 additions & 9 deletions doc/2.0/fate/components/feature_binning.md
@@ -32,17 +32,19 @@ Principle](../../images/multiple_host_binning.png)

1. Support Quantile Binning based on quantile summary algorithm.
2. Support Bucket Binning.
3. Support manual binning based on user-defined split points (see the sketch below the table).
4. Support calculating woe and iv values.
5. Support transforming data into bin indexes or woe value (guest only).
6. Support multiple-host binning.
7. Support asymmetric binning methods on Host & Guest sides.

The table below lists supported cases with links to examples:

| Cases | Scenario |
|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Input Data with Categorical Features | [bucket binning](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_bucket.py) <br> [quantile binning](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_quantile.py) |
| Binning with User-defined split points | [manual binning](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_asymmetric.py) |
| Output Data Transformed | [bin index](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_asymmetric.py) <br> [woe value (guest-only)](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_asymmetric.py) |
| Skip Metrics Calculation | [multi_host](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_multi_host.py) |
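
As a minimal sketch of manual binning (`x0` is a placeholder feature name, and an upstream `psi_0` task is assumed, as in the updated pipeline example later in this commit):

```python
from fate_client.pipeline.components.fate import HeteroFeatureBinning

# manual binning: split_pt_dict maps each feature name to its
# user-defined, ascending split points
binning_0 = HeteroFeatureBinning("binning_0",
                                 method="manual",
                                 split_pt_dict={"x0": [0.1, 0.3, 0.5]},
                                 train_data=psi_0.outputs["output_data"],
                                 local_only=True)
```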


36 changes: 32 additions & 4 deletions doc/2.0/fate/components/feature_selection.md
@@ -28,24 +28,52 @@ Below lists their acceptable parameter values.
| IV Filter | iv_param | "iv" | "threshold", "top_k", "top_percentile" | True |
| Statistic Filter | statistic_param | "max", "min", "mean", "median", "std", "var", "coefficient_of_variance", "skewness", "kurtosis", "missing_count", "missing_ratio", quantile (e.g. "95%") | "threshold", "top_k", "top_percentile" | True/False |

## Filter Configuration

1. iv\_filter: Use iv as the criterion to select features (see the configuration sketch after this list).
    - filter_type: Supports three modes: threshold value, top-k and top-percentile.
        - threshold value: Filter out columns whose iv is smaller than the threshold. A different threshold may be set for each party.
        - top-k: Sort features from larger iv to smaller and take the top k features in the sorted result.
        - top-percentile: Sort features from larger iv to smaller and take the top percentile.
    - select_federated: If set to True, feature selection is performed in a federated manner: selection runs on the guest side, and the selection result for host features (identified by their anonymous names) is sent to the host side, which then filters its features accordingly. This param is available in iv\_filter only.
    - threshold: The threshold value for feature selection.
    - take_high: If set to True, the filter selects features with higher iv values; if set to False, it selects features with lower iv values.
    - host_filter_type: The filter type for host features; may be "threshold", "top_k", or "top_percentile". This param is available in iv\_filter only.
    - host_threshold: The threshold value for feature selection on host features. This param is available in iv\_filter only.
    - host_top_k: The top k value for feature selection on host features. This param is available in iv\_filter only.
2. statistic\_filter: Use statistic values calculated by the DataStatistic component, such as coefficient of variance, missing values, and percentile values. You can keep the columns with higher or lower statistic values as needed.
    - filter_type: Supports three modes: threshold value, top-k and top-percentile.
        - threshold value: Filter out columns whose statistic metric is smaller than the threshold. A different threshold may be set for each party.
        - top-k: Sort features from larger statistic metric to smaller and take the top k features in the sorted result.
        - top-percentile: Sort features from larger statistic metric to smaller and take the top percentile.
    - threshold: The threshold value for feature selection.
    - take_high: If set to True, the filter selects features with higher metric values; if set to False, it selects features with lower metric values.
3. manually: Indicate features that need to be filtered out or kept.
    - keep_col: The columns that need to be kept.
    - filter_out_col: The columns that need to be dropped.
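
A minimal configuration sketch combining the three filters (`x0` and `x1` are placeholder column names; upstream `psi_0`, `binning_0` and `statistics_0` tasks are assumed, as in the model-building quick start below):

```python
from fate_client.pipeline.components.fate import HeteroFeatureSelection

# iv filter (federated, top-k), statistic filter (lower values kept via
# take_high=False) and manual filter applied together
selection_0 = HeteroFeatureSelection(
    "selection_0",
    method=["iv", "statistics", "manual"],
    train_data=psi_0.outputs["output_data"],
    input_models=[binning_0.outputs["output_model"],
                  statistics_0.outputs["output_model"]],
    iv_param={"metrics": "iv", "filter_type": "top_k", "threshold": 6,
              "select_federated": True},
    statistic_param={"metrics": ["max", "mean"],
                     "filter_type": "top_k", "threshold": 5, "take_high": False},
    manual_param={"keep_col": ["x0", "x1"]})
```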

Besides, we support multi-host federated feature selection for iv
filters. Starting in ver 2.0.0-beta, all data sets will obtain anonymous header
163 changes: 163 additions & 0 deletions doc/2.0/fate/model_building_quick_start.md
@@ -0,0 +1,163 @@
## Quick Start: A Model Building Demo

1. Install `fate_client` with the extras `fate` and `fate_flow`:

```sh
python -m pip install -U pip && python -m pip install fate_client[fate,fate_flow]==2.2.0
```
After the packages are installed successfully, initialize the fate_flow service and fate_client:

```sh
mkdir fate_workspace
fate_flow init --ip 127.0.0.1 --port 9380 --home $(pwd)/fate_workspace
pipeline init --ip 127.0.0.1 --port 9380

fate_flow start
fate_flow status # make sure fate_flow service is started
```

2. Download the example data:

```sh
wget https://raw.githubusercontent.com/wiki/FederatedAI/FATE/example/data/breast_hetero_guest.csv && \
wget https://raw.githubusercontent.com/wiki/FederatedAI/FATE/example/data/breast_hetero_host.csv
```

3. Transform the example data into dataframes used in FATE:

```python
import os
from fate_client.pipeline import FateFlowPipeline

base_path = os.path.abspath(os.path.join(__file__, os.path.pardir))
guest_data_path = os.path.join(base_path, "breast_hetero_guest.csv")
host_data_path = os.path.join(base_path, "breast_hetero_host.csv")

# local pipeline for transforming csv files into FATE dataframes
data_pipeline = FateFlowPipeline().set_parties(local="0")
guest_meta = {
    "delimiter": ",", "dtype": "float64", "label_type": "int64",
    "label_name": "y", "match_id_name": "id"
}
host_meta = {
    "delimiter": ",", "input_format": "dense", "match_id_name": "id"
}
data_pipeline.transform_local_file_to_dataframe(file=guest_data_path, namespace="experiment",
                                                name="breast_hetero_guest", meta=guest_meta,
                                                head=True, extend_sid=True)
data_pipeline.transform_local_file_to_dataframe(file=host_data_path, namespace="experiment",
                                                name="breast_hetero_host", meta=host_meta,
                                                head=True, extend_sid=True)
```
4. Run the training example and save the pipeline:

```python
from fate_client.pipeline.components.fate import (
    Reader,
    PSI,
    HeteroFeatureBinning,
    HeteroFeatureSelection,
    DataSplit,
    Statistics,
    FeatureScale,
    SSHELR,
    Evaluation
)
from fate_client.pipeline import FateFlowPipeline


# create pipeline for training
pipeline = FateFlowPipeline().set_parties(guest="9999", host="10000")

# create reader task_desc
reader_0 = Reader("reader_0")
reader_0.guest.task_parameters(namespace="experiment", name="breast_hetero_guest")
reader_0.hosts[0].task_parameters(namespace="experiment", name="breast_hetero_host")

# create psi component_desc
psi_0 = PSI("psi_0", input_data=reader_0.outputs["output_data"])

# split intersected data into train and validate sets
data_split_0 = DataSplit("data_split_0", input_data=psi_0.outputs["output_data"],
                         train_size=0.7, validate_size=0.3, test_size=None, stratified=True)

# compute metrics for selection
binning_0 = HeteroFeatureBinning("binning_0", train_data=data_split_0.outputs["train_output_data"],
                                 method="bucket", n_bins=10)
statistics_0 = Statistics("statistics_0", input_data=data_split_0.outputs["train_output_data"],
                          metrics=["min", "max", "25%", "mean", "median"])

# run feature selection on the train set, then apply it to the validate set
selection_0 = HeteroFeatureSelection("selection_0",
                                     method=["iv", "statistics", "manual"],
                                     train_data=data_split_0.outputs["train_output_data"],
                                     input_models=[binning_0.outputs["output_model"],
                                                   statistics_0.outputs["output_model"]],
                                     iv_param={"metrics": "iv", "filter_type": "top_k", "threshold": 6,
                                               "select_federated": True},
                                     statistic_param={"metrics": ["max", "mean"],
                                                      "filter_type": "top_k", "threshold": 5, "take_high": False},
                                     manual_param={"keep_col": ["x0", "x1"]})
selection_1 = HeteroFeatureSelection("selection_1",
                                     test_data=data_split_0.outputs["validate_output_data"],
                                     input_model=selection_0.outputs["train_output_model"])

# scale selected data
scale_0 = FeatureScale("scale_0", train_data=selection_0.outputs["train_output_data"], method="min_max")
scale_1 = FeatureScale("scale_1", test_data=selection_1.outputs["test_output_data"],
                       input_model=scale_0.outputs["output_model"])

# train with sshe lr on the scaled data
sshe_lr_0 = SSHELR("sshe_lr_0", train_data=scale_0.outputs["train_output_data"],
                   validate_data=scale_1.outputs["test_output_data"], epochs=3)

# evaluate model output on guest side
evaluation_0 = Evaluation("evaluation_0", input_datas=[sshe_lr_0.outputs["train_output_data"]],
                          default_eval_setting="binary",
                          runtime_parties=dict(guest="9999"))

# compose training pipeline
pipeline.add_tasks([reader_0, psi_0, data_split_0,
                    binning_0, statistics_0, selection_0, selection_1,
                    scale_0, scale_1, sshe_lr_0, evaluation_0])

# compile and train
pipeline.compile()
pipeline.fit()

# print metric and model info
print(pipeline.get_task_info("sshe_lr_0").get_output_model())
print(pipeline.get_task_info("evaluation_0").get_output_metric())

# save pipeline for later usage
pipeline.dump_model("./pipeline.pkl")

```

5. Reload the trained pipeline and run prediction:

```python
from fate_client.pipeline import FateFlowPipeline
from fate_client.pipeline.components.fate import Reader

# create pipeline for predicting
predict_pipeline = FateFlowPipeline()

# reload trained pipeline
pipeline = FateFlowPipeline.load_model("./pipeline.pkl")

# deploy task for inference
pipeline.deploy([pipeline.psi_0, pipeline.selection_0, pipeline.scale_0, pipeline.sshe_lr_0])

# add input to deployed_pipeline
deployed_pipeline = pipeline.get_deployed_pipeline()
reader_1 = Reader("reader_1")
reader_1.guest.task_parameters(namespace="experiment", name="breast_hetero_guest")
reader_1.hosts[0].task_parameters(namespace="experiment", name="breast_hetero_host")
deployed_pipeline.psi_0.input_data = reader_1.outputs["output_data"]

# add task to predict pipeline
predict_pipeline.add_tasks([reader_1, deployed_pipeline])

# compile and predict
predict_pipeline.compile()
predict_pipeline.predict()
```
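
To inspect the prediction output, something like the following may work (this assumes the task-info object returned by `get_task_info` also exposes `get_output_data()`, alongside the `get_output_model()`/`get_output_metric()` accessors used in the training step above):

```python
# fetch predictions produced by the deployed sshe_lr_0 task
print(predict_pipeline.get_task_info("sshe_lr_0").get_output_data())
```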

6. More tutorials

More pipeline API guides can be found in the [FATE-Client pipeline documentation](https://github.com/FederatedAI/FATE-Client/blob/main/doc/pipeline.md)
80 changes: 80 additions & 0 deletions doc/2.0/fate/psi_quick_start.md
@@ -0,0 +1,80 @@
## PSI Quick Start

1. Install `fate_client` with the extras `fate` and `fate_flow`:

```sh
python -m pip install -U pip && python -m pip install fate_client[fate,fate_flow]==2.2.0
```
After the packages are installed successfully, initialize the fate_flow service and fate_client:

```sh
mkdir fate_workspace
fate_flow init --ip 127.0.0.1 --port 9380 --home $(pwd)/fate_workspace
pipeline init --ip 127.0.0.1 --port 9380

fate_flow start
fate_flow status # make sure fate_flow service is started
```


2. Download the example data:

```sh
wget https://raw.githubusercontent.com/wiki/FederatedAI/FATE/example/data/breast_hetero_guest.csv && \
wget https://raw.githubusercontent.com/wiki/FederatedAI/FATE/example/data/breast_hetero_host.csv
```

3. Transform the example data into dataframes used in FATE:
```python
import os
from fate_client.pipeline import FateFlowPipeline


base_path = os.path.abspath(os.path.join(__file__, os.path.pardir))
guest_data_path = os.path.join(base_path, "breast_hetero_guest.csv")
host_data_path = os.path.join(base_path, "breast_hetero_host.csv")

# local pipeline for transforming csv files into FATE dataframes
data_pipeline = FateFlowPipeline().set_parties(local="0")
guest_meta = {
    "delimiter": ",", "dtype": "float64", "label_type": "int64",
    "label_name": "y", "match_id_name": "id"
}
host_meta = {
    "delimiter": ",", "input_format": "dense", "match_id_name": "id"
}
data_pipeline.transform_local_file_to_dataframe(file=guest_data_path, namespace="experiment",
                                                name="breast_hetero_guest", meta=guest_meta,
                                                head=True, extend_sid=True)
data_pipeline.transform_local_file_to_dataframe(file=host_data_path, namespace="experiment",
                                                name="breast_hetero_host", meta=host_meta,
                                                head=True, extend_sid=True)
```
4. Run PSI:

```python
from fate_client.pipeline.components.fate import (
Reader,
PSI
)
from fate_client.pipeline import FateFlowPipeline


# create pipeline
pipeline = FateFlowPipeline().set_parties(guest="9999", host="10000")

# create reader task_desc
reader_0 = Reader("reader_0")
reader_0.guest.task_parameters(namespace="experiment", name="breast_hetero_guest")
reader_0.hosts[0].task_parameters(namespace="experiment", name="breast_hetero_host")

# create psi component_desc
psi_0 = PSI("psi_0", input_data=reader_0.outputs["output_data"])

# add tasks to pipeline
pipeline.add_tasks([reader_0, psi_0])

# compile and run
pipeline.compile()
pipeline.fit()

```
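
To check the intersection result, something like the following may work (assuming the task-info object exposes `get_output_data()`, alongside the `get_output_model()`/`get_output_metric()` accessors used in the model-building tutorial above):

```python
# fetch the intersected data produced by psi_0
print(pipeline.get_task_info("psi_0").get_output_data())
```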

5. More tutorials

More pipeline API guides can be found in the [FATE-Client pipeline documentation](https://github.com/FederatedAI/FATE-Client/blob/main/doc/pipeline.md)
2 changes: 2 additions & 0 deletions doc/README.md
@@ -3,7 +3,9 @@
### Tutorial

- [Quick Start](./2.0/fate/quick_start.md): Train & predict with FATE HeteroSecureBoost using FATE-Pipeline
- [Running PSI](./2.0/fate/psi_quick_start.md): Run PSI only using FATE-Pipeline
- [Quick Start with Homo NN](./2.0/fate/homo_quick_start.md): Train & predict with FATE HomoNN using FATE-Pipeline
- [Building Models with Hetero Components](./2.0/fate/model_building_quick_start.md): A model-building tutorial with Hetero components, covering reading data, feature engineering, and training & evaluating models

### FATE Design
- [Architecture](./architecture/README.md): Building unified and standardized API for heterogeneous computing engines interconnection
@@ -44,8 +44,8 @@ def main(config="../config.yaml", namespace=""):
     psi_0 = PSI("psi_0", input_data=reader_0.outputs["output_data"])
 
     binning_0 = HeteroFeatureBinning("binning_0",
-                                     method="quantile",
-                                     n_bins=10,
+                                     method="manual",
+                                     split_pt_dict={"x0": [0.1, 0.3, 0.5]},
                                      train_data=psi_0.outputs["output_data"],
                                      local_only=True
                                      )
@@ -53,7 +53,8 @@ def main(config=".../config.yaml", namespace=""):
                                          method=["iv"],
                                          train_data=psi_0.outputs["output_data"],
                                          input_models=[binning_0.outputs["output_model"]],
-                                         iv_param={"metrics": "iv", "filter_type": "threshold", "threshold": 0.1})
+                                         iv_param={"metrics": "iv", "filter_type": "threshold", "threshold": 0.1,
+                                                   "select_federated": True})
 
     pipeline.add_tasks([reader_0, psi_0, binning_0, selection_0])

@@ -55,7 +55,8 @@ def main(config=".../config.yaml", namespace=""):
                                          method=["iv"],
                                          train_data=psi_0.outputs["output_data"],
                                          input_models=[binning_0.outputs["output_model"]],
-                                         iv_param={"metrics": "iv", "filter_type": "threshold", "threshold": 0.1})
+                                         iv_param={"metrics": "iv", "filter_type": "threshold", "threshold": 0.1,
+                                                   "select_federated": True})
 
     lr_0 = SSHELR("lr_0",
                   learning_rate=0.05,
@@ -54,7 +54,8 @@ def main(config=".../config.yaml", namespace=""):
                                          train_data=psi_0.outputs["output_data"],
                                          input_models=[binning_0.outputs["output_model"],
                                                        statistics_0.outputs["output_model"]],
-                                         iv_param={"metrics": "iv", "filter_type": "top_percentile", "threshold": 0.8},
+                                         iv_param={"metrics": "iv", "filter_type": "top_percentile", "threshold": 0.8,
+                                                   "select_federated": True},
                                          statistic_param={"metrics": ["max", "mean"],
                                                           "filter_type": "top_k", "threshold": 5},
                                          manual_param={"keep_col": ["x0", "x1"]}
