Update the README
bendecoste committed Jun 23, 2020
1 parent a04440a commit fca7673
Showing 3 changed files with 113 additions and 30 deletions.
134 changes: 109 additions & 25 deletions README.md
@@ -1,50 +1,134 @@
# Cape Python
[<img src="https://raw.githubusercontent.com/dropoutlabs/files/master/cape-logo.png" alt="Cape Privacy" width="500"/>](https://capeprivacy.com/)

![](https://github.com/capeprivacy/cape-python/workflows/Main/badge.svg) [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![codecov](https://codecov.io/gh/capeprivacy/cape-python/branch/master/graph/badge.svg?token=L9A8HFAJK5)](https://codecov.io/gh/capeprivacy/cape-python)
![](https://github.com/capeprivacy/cape-python/workflows/Main/badge.svg)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![codecov](https://codecov.io/gh/capeprivacy/cape-python/branch/master/graph/badge.svg?token=L9A8HFAJK5)](https://codecov.io/gh/capeprivacy/cape-python)
[![PyPI version](https://badge.fury.io/py/cape-privacy.svg)](https://badge.fury.io/py/cape-privacy)

Cape Privacy offers data scientists and data engineers a policy-based interface for applying privacy-enhancing techniques
across several popular libraries and frameworks to protect sensitive data throughout the data science life cycle.

Cape Python brings Cape's policy language to Pandas and Apache Spark,
enabling you to collaborate on privacy-preserving policy at a non-technical level.
The supported techniques include tokenization with linkability as well as perturbation and rounding.
You can experiment with these techniques programmatically, in Python or in human-readable policy files.
Stay tuned for more privacy-enhancing techniques in the future!

See below for instructions on how to get started or visit the [documentation](https://docs.capeprivacy.com/).

## Getting Started

Make sure you have at least Python 3.6 installed. We recommend running it in a virtual environment
such as with [venv](https://docs.python.org/3/library/venv.html) or
[conda](https://www.anaconda.com/products/individual).
Cape Python is available via PyPI.

```sh
pip install cape-privacy
```

Support for Apache Spark is optional. If you plan on using the library together with Apache Spark, we suggest the following instead:

`make` will also be required to run our `Makefile` so ensure that you have that installed as well.
```sh
pip install cape-privacy[spark]
```

### Bootstrapping
We recommend running it in a virtual environment, such as [venv](https://docs.python.org/3/library/venv.html).

Bootstrapping your environment installs all direct dependencies of the Python API
and ensures that the API is installed as well.
### Installing from source

Run the following command from a command line:
It is also possible to install the library from source.

```bash
$ make bootstrap
```sh
git clone https://github.com/capeprivacy/cape-python.git
cd cape-python
make bootstrap
```

#### Example
This will also install all dependencies, including Apache Spark. Make sure you have `make` installed before running the above.

This example does a basic plusOne transformation on a pandas dataframe with a single column called `value`. It can be
found in the `examples` directory.
## Example

```python
import cape_privacy as cape
import numpy as np
import pandas as pd
*(this example is an abridged version of the tutorial found [here](./examples/tutorials/))*

policy = cape.parse_policy("perturb_value_field.yaml")
df = pd.DataFrame(np.ones(5,), columns=["value"])
df = cape.apply_policy(policy, df)
To discover what different transformations do and how you might use them, it is best to explore via the `transformations` APIs:

```python
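# NOTE: the import paths below are assumed from the library's package layout;
# check the documentation for the version you have installed.
import pandas as pd
from cape_privacy.pandas import dtypes
from cape_privacy.pandas.transformations import NumericPerturbation, Tokenizer

df = pd.DataFrame({
    "name": ["alice", "bob"],
    "age": [34, 55],
    "birthdate": [pd.Timestamp(1985, 2, 23), pd.Timestamp(1963, 5, 10)],
})

# Replace each name with a keyed token of at most 10 characters
tokenize = Tokenizer(
    max_token_len=10,
    key=b"my secret",
)

# Add integer noise in the range [-10, 10] to each age
perturb_numeric = NumericPerturbation(
    dtype=dtypes.Integer,
    min=-10,
    max=10,
)

df["name"] = tokenize(df["name"])
df["age"] = perturb_numeric(df["age"])

print(df.head())

# >>
#          name  age   birthdate
# 0  f42c2f1964   34  1985-02-23
# 1  2e586494b2   63  1963-05-10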
```
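
To make "tokenization with linkability" and bounded perturbation concrete, here is a minimal standard-library sketch. It is not Cape's implementation: the README does not describe the internals of `Tokenizer` or `NumericPerturbation`, so the use of HMAC-SHA256 and uniform integer noise, along with the helper names `tokenize` and `perturb`, are assumptions made purely for illustration.

```python
import hashlib
import hmac
import random


def tokenize(value: str, key: bytes, max_token_len: int = 10) -> str:
    """Return a keyed, deterministic token for `value` (illustrative stand-in)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:max_token_len]


def perturb(values, low: int, high: int, seed=None):
    """Add an integer drawn uniformly from [low, high] to each value."""
    rng = random.Random(seed)
    return [v + rng.randint(low, high) for v in values]


names = ["alice", "bob", "alice"]
tokens = [tokenize(n, key=b"my secret") for n in names]
# Linkability: both "alice" entries map to the same token under the same key.
assert tokens[0] == tokens[2] and tokens[0] != tokens[1]

ages = perturb([34, 55], low=-10, high=10, seed=4984)
# Each perturbed age stays within 10 of its original value.
assert all(abs(a - b) <= 10 for a, b in zip(ages, [34, 55]))
```

Because the token depends only on the value and the key, records that share a name can still be joined after masking, while anyone without the key cannot easily recompute the mapping.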

You can also pass a URL to `parse_policy`.
These steps can be saved in policy files so you can share them and collaborate with your team:

```yaml
# my-policy.yaml
label: my-policy
version: 1
rules:
  - match:
      name: age
    actions:
      - transform:
          type: numeric-perturbation
          dtype: Integer
          min: -10
          max: 10
          seed: 4984
  - match:
      name: name
    actions:
      - transform:
          type: tokenizer
          max_token_len: 10
          key: my secret
```
You can then load this policy and apply it to your data frame:
```python
policy = cape.parse_policy("https://mydomain.com/policy.yaml")
# df can be a Pandas or Spark data frame
policy = cape.parse_policy("my-policy.yaml")
df = cape.apply_policy(policy, df)

print(df.head())
# >>
#          name  age   birthdate
# 0  f42c2f1964   34  1985-02-23
# 1  2e586494b2   63  1963-05-10
```
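
The way a policy's rules map onto columns can be sketched roughly as follows. This is a hypothetical, standard-library-only model of the idea, not how `cape.apply_policy` is actually implemented: the `TRANSFORMS` registry, `apply_policy_sketch`, and the two stand-in transform functions are all invented for illustration, and the policy is written as a plain dict mirroring the YAML structure above.

```python
import hashlib
import hmac
import random


def numeric_perturbation(values, min, max, seed=None, **_ignored):
    """Stand-in transform: add seeded integer noise in [min, max]."""
    rng = random.Random(seed)
    return [v + rng.randint(min, max) for v in values]


def tokenizer(values, key, max_token_len=10, **_ignored):
    """Stand-in transform: keyed HMAC-SHA256 token, truncated."""
    secret = key.encode("utf-8")
    return [
        hmac.new(secret, str(v).encode("utf-8"), hashlib.sha256).hexdigest()[:max_token_len]
        for v in values
    ]


# Hypothetical registry from a rule's `type` to the function that implements it.
TRANSFORMS = {"numeric-perturbation": numeric_perturbation, "tokenizer": tokenizer}

policy = {
    "label": "my-policy",
    "version": 1,
    "rules": [
        {"match": {"name": "age"},
         "actions": [{"transform": {"type": "numeric-perturbation", "dtype": "Integer",
                                    "min": -10, "max": 10, "seed": 4984}}]},
        {"match": {"name": "name"},
         "actions": [{"transform": {"type": "tokenizer", "max_token_len": 10,
                                    "key": "my secret"}}]},
    ],
}


def apply_policy_sketch(policy, columns):
    """Apply each rule's transforms to the column whose name it matches."""
    result = dict(columns)
    for rule in policy["rules"]:
        name = rule["match"]["name"]
        if name not in result:
            continue  # the rule refers to a column this dataset lacks
        for action in rule["actions"]:
            params = dict(action["transform"])
            transform = TRANSFORMS[params.pop("type")]
            result[name] = transform(result[name], **params)
    return result


columns = {"name": ["alice", "bob"], "age": [34, 55],
           "birthdate": ["1985-02-23", "1963-05-10"]}
masked = apply_policy_sketch(policy, columns)
```

Columns with no matching rule, such as `birthdate` here, pass through unchanged, which matches the output shown above where only `name` and `age` are altered.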

# License
You can see more examples and usage [here](./examples) or by visiting our [documentation](https://docs.capeprivacy.com).

## Contributing and Bug Reports

Please file any [feature request](https://github.com/capeprivacy/cape-python/issues/new?template=feature_request.md) or
[bug report](https://github.com/capeprivacy/cape-python/issues/new?template=bug_report.md) as GitHub issues.

## License

Licensed under Apache License, Version 2.0 (see [LICENSE](./LICENSE) or http://www.apache.org/licenses/LICENSE-2.0). Copyright as specified in [NOTICE](./NOTICE).

## About Cape

[Cape Privacy](https://capeprivacy.com) helps teams share data and make decisions for safer and more powerful data science. Learn more at [capeprivacy.com](https://capeprivacy.com).
2 changes: 1 addition & 1 deletion examples/tutorials/README.md
@@ -28,7 +28,7 @@ You can also experiment with these transformations on Spark DataFrame with the `
python spark_transformations_without_policy.py
```

As you will notice, the `transformations` API for `Pandas` and `Spark` are identitical, so you can easily transfer the transformations applied in `Pandas` to `Spark`.
As you will notice, the `transformations` API for `Pandas` and `Spark` are identical, so you can easily transfer the transformations applied in `Pandas` to `Spark`.

## Write your policy

7 changes: 3 additions & 4 deletions examples/tutorials/apply_policy_on_spark_df.py
@@ -1,4 +1,4 @@
import cape_privacy
import cape_privacy as cape
from dataset import load_dataset


@@ -7,10 +7,9 @@
print("Original Dataset:")
print(df.show())
# Load the privacy policy
policy = cape_privacy.parse_policy("mask_personal_information.yaml")
policy = cape.parse_policy("mask_personal_information.yaml")
# Apply the policy to the DataFrame
# [NOTE] will be updated to `cape_privacy.apply_policy` #49 is merged
df = cape_privacy.apply_policy(policy, df)
df = cape.apply_policy(policy, df)
# Output the masked dataset
print("Masked Dataset:")
print(df.show())
