From fca767391cdf77b972bf461ef6b1d59ca92e60a2 Mon Sep 17 00:00:00 2001
From: Benjamin DeCoste
Date: Fri, 19 Jun 2020 14:34:08 -0300
Subject: [PATCH] Update the README

---
 README.md                                     | 134 ++++++++++++++----
 examples/tutorials/README.md                  |   2 +-
 .../tutorials/apply_policy_on_spark_df.py     |   7 +-
 3 files changed, 113 insertions(+), 30 deletions(-)

diff --git a/README.md b/README.md
index e1a7476..e38091b 100644
--- a/README.md
+++ b/README.md
@@ -1,50 +1,134 @@
-# Cape Python
+[Cape Privacy](https://capeprivacy.com/)
-![](https://github.com/capeprivacy/cape-python/workflows/Main/badge.svg) [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![codecov](https://codecov.io/gh/capeprivacy/cape-python/branch/master/graph/badge.svg?token=L9A8HFAJK5)](https://codecov.io/gh/capeprivacy/cape-python)
+![](https://github.com/capeprivacy/cape-python/workflows/Main/badge.svg)
+[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+[![codecov](https://codecov.io/gh/capeprivacy/cape-python/branch/master/graph/badge.svg?token=L9A8HFAJK5)](https://codecov.io/gh/capeprivacy/cape-python)
+[![PyPI version](https://badge.fury.io/py/cape-privacy.svg)](https://badge.fury.io/py/cape-privacy)
+Cape Privacy offers data scientists and data engineers a policy-based interface for applying privacy-enhancing techniques
+across several popular libraries and frameworks to protect sensitive data throughout the data science life cycle.
+
+Cape Python brings Cape's policy language to Pandas and Apache Spark,
+enabling you to collaborate on privacy-preserving policy at a non-technical level.
+The supported techniques include tokenization with linkability as well as perturbation and rounding.
+You can experiment with these techniques programmatically, in Python or in human-readable policy files.
+Stay tuned for more privacy-enhancing techniques in the future!
+
+See below for instructions on how to get started or visit the [documentation](https://docs.capeprivacy.com/).
 ## Getting Started
-Make sure you have at least Python 3.6 installed. We recommend running it in a virtual environment
-such as with [venv](https://docs.python.org/3/library/venv.html) or
-[conda](https://www.anaconda.com/products/individual).
+Cape Python is available via Pypi.
+
+```sh
+pip install cape-privacy
+```
+
+Support for Apache Spark is optional. If you plan on using the library together with Apache Spark, we suggest the following instead:
-`make` will also be required to run our `Makefile` so ensure that you have that installed as well.
+```sh
+pip install cape-privacy[spark]
+```
-### Bootstrapping
+We recommend running it in a virtual environment, such as [venv](https://docs.python.org/3/library/venv.html).
-Bootstrapping your environment installs all direct dependencies of the Python API
-and ensures that the API is installed as well.
+### Installing from source
-Run the following command from a command line:
+It is also possible to install the library from source.
-```bash
-$ make bootstrap
+```sh
+git clone https://github.com/capeprivacy/cape-python.git
+cd cape-python
+make bootstrap
 ```
-#### Example
+This will also install all dependencies, including Apache Spark. Make sure you have `make` installed before running the above.
-This example does a basic plusOne transformation on a pandas dataframe with a single column called `value`. It can be
-found in the `examples` directory.
+## Example
-```python
-import cape_privacy as cape
-import numpy as np
-import pandas as pd
+*(this example is an abridged version of the tutorial found [here](./examples/tutorials/))*
-policy = cape.parse_policy("perturb_value_field.yaml")
-df = pd.DataFrame(np.ones(5,), columns=["value"])
-df = cape.apply_policy(policy, df)
+To discover what different transformations do and how you might use them, it is best to explore via the `transformations` APIs:
+
+```python
+df = pd.DataFrame({
+    "name": ["alice", "bob"],
+    "age": [34, 55],
+    "birthdate": [pd.Timestamp(1985, 2, 23), pd.Timestamp(1963, 5, 10)],
+    })
+
+tokenize = Tokenizer(
+    max_token_len=10,
+    key=b"my secret",
+)
+
+perturb_numeric = NumericPerturbation(
+    dtype=dtypes.Integer,
+    min=-10,
+    max=10,
+)
+
+df["name"] = tokenize(df["name"])
+df["age"] = perturb_numeric(df["age"])
 print(df.head())
+
+# >>
+#          name  age  birthdate
+# 0  f42c2f1964   34 1985-02-23
+# 1  2e586494b2   63 1963-05-10
 ```
-You can also pass a URL to `parse_policy`.
+These steps can be saved in policy files so you can share them and collaborate with your team:
+
+```yaml
+# my-policy.yaml
+label: my-policy
+version: 1
+rules:
+  - match:
+      name: age
+    actions:
+      - transform:
+          type: numeric-perturbation
+          dtype: Integer
+          min: -10
+          max: 10
+          seed: 4984
+  - match:
+      name: name
+    actions:
+      - transform:
+          type: tokenizer
+          max_token_len: 10
+          key: my secret
+```
+
+You can then load this policy and apply it to your data frame:
 ```python
-policy = cape.parse_policy("https://mydomain.com/policy.yaml")
+# df can be a Pandas or Spark data frame
+policy = cape.parse_policy("my-policy.yaml")
+df = cape.apply_policy(policy, df)
+
+print(df.head())
+# >>
+#          name  age  birthdate
+# 0  f42c2f1964   34 1985-02-23
+# 1  2e586494b2   63 1963-05-10
 ```
-# License
+You can see more examples and usage [here](./examples) or by visiting our [documentation](https://docs.capeprivacy.com).
+
+## Contributing and Bug Reports
+
+Please file any [feature request](https://github.com/capeprivacy/cape-python/issues/new?template=feature_request.md) or
+[bug report](https://github.com/capeprivacy/cape-python/issues/new?template=bug_report.md) as GitHub issues.
+
+## License
 Licensed under Apache License, Version 2.0 (see [LICENSE](./LICENSE) or http://www.apache.org/licenses/LICENSE-2.0). Copyright as specified in [NOTICE](./NOTICE).
+
+## About Cape
+
+[Cape Privacy](https://capeprivacy.com) helps teams share data and make decisions for safer and more powerful data science. Learn more at [capeprivacy.com](https://capeprivacy.com).
diff --git a/examples/tutorials/README.md b/examples/tutorials/README.md
index 22fd2d8..1d37966 100644
--- a/examples/tutorials/README.md
+++ b/examples/tutorials/README.md
@@ -28,7 +28,7 @@ You can also experiment with these transformations on Spark DataFrame with the `
 python spark_transformations_without_policy.py
 ```
-As you will notice, the `transformations` API for `Pandas` and `Spark` are identitical, so you can easily transfer the transformations applied in `Pandas` to `Spark`.
+As you will notice, the `transformations` API for `Pandas` and `Spark` are identical, so you can easily transfer the transformations applied in `Pandas` to `Spark`.
 ## Write your policy
diff --git a/examples/tutorials/apply_policy_on_spark_df.py b/examples/tutorials/apply_policy_on_spark_df.py
index 5cc32d0..8457857 100644
--- a/examples/tutorials/apply_policy_on_spark_df.py
+++ b/examples/tutorials/apply_policy_on_spark_df.py
@@ -1,4 +1,4 @@
-import cape_privacy
+import cape_privacy as cape
 from dataset import load_dataset
@@ -7,10 +7,9 @@ print("Original Dataset:")
 print(df.show())
 # Load the privacy policy
-policy = cape_privacy.parse_policy("mask_personal_information.yaml")
+policy = cape.parse_policy("mask_personal_information.yaml")
 # Apply the policy to the DataFrame
-# [NOTE] will be updated to `cape_privacy.apply_policy` #49 is merged
-df = cape_privacy.apply_policy(policy, df)
+df = cape.apply_policy(policy, df)
 # Output the masked dataset
 print("Masked Dataset:")
 print(df.show())
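
Reviewer note: the README added by this patch mentions "tokenization with linkability" and numeric perturbation without showing what those properties mean. The sketch below illustrates the two ideas using only the Python standard library. It is not Cape Python's implementation: the helper names `tokenize_value` and `perturb_integer` are hypothetical, and the use of HMAC-SHA256 and uniform integer noise are assumptions made for illustration.

```python
import hashlib
import hmac
import random


def tokenize_value(value: str, key: bytes, max_token_len: int = 10) -> str:
    """Hypothetical helper: derive a keyed, deterministic token from a string.

    The same (value, key) pair always yields the same token, which is what
    makes tokens linkable across rows and datasets. HMAC-SHA256 is an
    assumption here, not necessarily what the library uses.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:max_token_len]


def perturb_integer(value: int, low: int, high: int, rng: random.Random) -> int:
    """Hypothetical helper: add uniform integer noise from [low, high]."""
    return value + rng.randint(low, high)


names = ["alice", "bob", "alice"]
ages = [34, 55, 34]

# Equal inputs map to equal tokens under the same key (linkability).
tokens = [tokenize_value(name, key=b"my secret") for name in names]

# A seeded generator makes the noise reproducible between runs.
rng = random.Random(4984)
noisy_ages = [perturb_integer(age, -10, 10, rng) for age in ages]

print(tokens)
print(noisy_ages)
```

In this sketch the two "alice" rows receive the same token, so joins on the tokenized column still work, while a different key produces unrelated tokens. Each perturbed age stays within 10 of the original value.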