Update the README

capeprivacy · Jun 23, 2020 · fca7673
1 parent a04440a
commit fca7673
Showing 3 changed files with 113 additions and 30 deletions.
diff --git a/README.md b/README.md
@@ -1,50 +1,134 @@
-# Cape Python
+[<img src="https://raw.githubusercontent.com/dropoutlabs/files/master/cape-logo.png" alt="Cape Privacy" width="500"/>](https://capeprivacy.com/)
 
-![](https://github.com/capeprivacy/cape-python/workflows/Main/badge.svg) [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![codecov](https://codecov.io/gh/capeprivacy/cape-python/branch/master/graph/badge.svg?token=L9A8HFAJK5)](https://codecov.io/gh/capeprivacy/cape-python)
+![](https://github.com/capeprivacy/cape-python/workflows/Main/badge.svg) 
+[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) 
+[![codecov](https://codecov.io/gh/capeprivacy/cape-python/branch/master/graph/badge.svg?token=L9A8HFAJK5)](https://codecov.io/gh/capeprivacy/cape-python)
+[![PyPI version](https://badge.fury.io/py/cape-privacy.svg)](https://badge.fury.io/py/cape-privacy)
 
+Cape Privacy offers data scientists and data engineers a policy-based interface for applying privacy-enhancing techniques 
+across several popular libraries and frameworks to protect sensitive data throughout the data science life cycle.
+
+Cape Python brings Cape's policy language to Pandas and Apache Spark, 
+enabling you to collaborate on privacy-preserving policy at a non-technical level. 
+The supported techniques include tokenization with linkability as well as perturbation and rounding.
+You can experiment with these techniques programmatically in Python or through human-readable policy files.
+Stay tuned for more privacy-enhancing techniques in the future!
+
+See below for instructions on how to get started or visit the [documentation](https://docs.capeprivacy.com/).
 
 ## Getting Started
 
-Make sure you have at least Python 3.6 installed. We recommend running it in a virtual environment
-such as with [venv](https://docs.python.org/3/library/venv.html) or
-[conda](https://www.anaconda.com/products/individual).
+Cape Python is available via PyPI.
+
+```sh
+pip install cape-privacy
+```
+
+Support for Apache Spark is optional.  If you plan on using the library together with Apache Spark, we suggest the following instead:
 
-`make` will also be required to run our `Makefile` so ensure that you have that installed as well.
+```sh
+pip install "cape-privacy[spark]"
+```
 
-### Bootstrapping
+We recommend running it in a virtual environment, such as [venv](https://docs.python.org/3/library/venv.html).
 
-Bootstrapping your environment installs all direct dependencies of the Python API
-and ensures that the API is installed as well.
+### Installing from source
 
-Run the following command from a command line:
+It is also possible to install the library from source.
 
-```bash
-$ make bootstrap
+```sh
+git clone https://github.com/capeprivacy/cape-python.git
+cd cape-python
+make bootstrap
 ```
 
-#### Example
+This will also install all dependencies, including Apache Spark. Make sure you have `make` installed before running the above.
 
-This example does a basic plusOne transformation on a pandas dataframe with a single column called `value`. It can be
-found in the `examples` directory.
+## Example
 
-```python
-import cape_privacy as cape
-import numpy as np
-import pandas as pd
+*(this example is an abridged version of the tutorial found [here](./examples/tutorials/))*
 
-policy = cape.parse_policy("perturb_value_field.yaml")
-df = pd.DataFrame(np.ones(5,), columns=["value"])
-df = cape.apply_policy(policy, df)
+To discover what different transformations do and how you might use them, it is best to explore via the `transformations` APIs:
+
+```python
+df = pd.DataFrame({
+    "name": ["alice", "bob"],
+    "age": [34, 55],
+    "birthdate": [pd.Timestamp(1985, 2, 23), pd.Timestamp(1963, 5, 10)],
+})
+
+tokenize = Tokenizer(
+    max_token_len=10,
+    key=b"my secret",
+)
+
+perturb_numeric = NumericPerturbation(
+    dtype=dtypes.Integer,
+    min=-10,
+    max=10,
+)
+
+df["name"] = tokenize(df["name"])
+df["age"] = perturb_numeric(df["age"])
 
 print(df.head())
+
+# >>
+#          name  age  birthdate
+# 0  f42c2f1964   34 1985-02-23
+# 1  2e586494b2   63 1963-05-10
 ```
 
-You can also pass a URL to `parse_policy`.
+These steps can be saved in policy files so you can share them and collaborate with your team:
+
+```yaml
+# my-policy.yaml
+label: my-policy
+version: 1
+rules:
+  - match:
+      name: age
+    actions:
+      - transform:
+          type: numeric-perturbation
+          dtype: Integer
+          min: -10
+          max: 10
+          seed: 4984
+  - match:
+      name: name
+    actions:
+      - transform:
+          type: tokenizer
+          max_token_len: 10
+          key: my secret
+```
+
+You can then load this policy and apply it to your data frame:
 
 ```python
-policy = cape.parse_policy("https://mydomain.com/policy.yaml")
+# df can be a Pandas or Spark data frame 
+policy = cape.parse_policy("my-policy.yaml")
+df = cape.apply_policy(policy, df)
+
+print(df.head())
+# >>
+#          name  age  birthdate
+# 0  f42c2f1964   34 1985-02-23
+# 1  2e586494b2   63 1963-05-10
 ```
 
-# License
+You can see more examples and usage [here](./examples) or by visiting our [documentation](https://docs.capeprivacy.com).
+
+## Contributing and Bug Reports
+
+Please file any [feature request](https://github.com/capeprivacy/cape-python/issues/new?template=feature_request.md) or
+[bug report](https://github.com/capeprivacy/cape-python/issues/new?template=bug_report.md) as a GitHub issue.
+
+## License
 
 Licensed under Apache License, Version 2.0 (see [LICENSE](./LICENSE) or http://www.apache.org/licenses/LICENSE-2.0). Copyright as specified in [NOTICE](./NOTICE).
+
+## About Cape
+
+[Cape Privacy](https://capeprivacy.com) helps teams share data and make decisions for safer and more powerful data science. Learn more at [capeprivacy.com](https://capeprivacy.com).
diff --git a/examples/tutorials/README.md b/examples/tutorials/README.md
@@ -28,7 +28,7 @@ You can also experiment with these transformations on Spark DataFrame with the `
 python spark_transformations_without_policy.py
 ```
 
-As you will notice, the `transformations` API for `Pandas` and `Spark` are identitical, so you can easily transfer the transformations applied in `Pandas` to `Spark`.
+As you will notice, the `transformations` APIs for `Pandas` and `Spark` are identical, so you can easily transfer the transformations applied in `Pandas` to `Spark`.
 
 ## Write your policy
 

diff --git a/examples/tutorials/apply_policy_on_spark_df.py b/examples/tutorials/apply_policy_on_spark_df.py
@@ -1,4 +1,4 @@
-import cape_privacy
+import cape_privacy as cape
 from dataset import load_dataset
 
 
@@ -7,10 +7,9 @@
 print("Original Dataset:")
 print(df.show())
 # Load the privacy policy
-policy = cape_privacy.parse_policy("mask_personal_information.yaml")
+policy = cape.parse_policy("mask_personal_information.yaml")
 # Apply the policy to the DataFrame
-# [NOTE] will be updated to `cape_privacy.apply_policy` #49 is merged
-df = cape_privacy.apply_policy(policy, df)
+df = cape.apply_policy(policy, df)
 # Output the masked dataset
 print("Masked Dataset:")
 print(df.show())
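
The README text added in this commit lists "tokenization with linkability" among the supported techniques. As a rough illustration of what linkability means, here is a minimal standard-library sketch. It is not Cape's implementation: the helper name `tokenize_value` and the choice of HMAC-SHA256 are assumptions made for this example only. The idea is that the same input and key always yield the same token, so tokenized columns can still be joined or grouped, while a different key yields unrelated tokens.

```python
import hashlib
import hmac


def tokenize_value(value: str, key: bytes, max_token_len: int = 10) -> str:
    """Hypothetical helper: keyed, deterministic token for a string value."""
    # HMAC-SHA256 is an assumption for this sketch, not Cape's documented scheme.
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncate the hex digest to the requested token length.
    return digest[:max_token_len]


key = b"my secret"
names = ["alice", "bob", "alice"]
tokens = [tokenize_value(name, key) for name in names]

# Equal inputs map to equal tokens, so rows remain linkable after tokenization.
print(tokens[0] == tokens[2])
# Distinct inputs map to distinct tokens (barring an unlikely collision).
print(tokens[0] != tokens[1])
```

Truncating the digest keeps tokens short at the cost of a higher collision chance, which is the trade-off a parameter like `max_token_len` would control in any scheme of this kind.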