Fixing test for pii operator (#430)
mingkang111 authored Nov 16, 2023
1 parent a1f09f0 commit 6899cbc
Showing 8 changed files with 133 additions and 29 deletions.
5 changes: 3 additions & 2 deletions ads/opctl/operator/lowcode/pii/README.md
@@ -36,11 +36,12 @@ To run pii operator locally, create and activate a new conda environment (`ads-p
- datapane
- gender_guesser
- nameparser
- oracle_ads[opctl]
- plotly
- spacy_transformers
- scrubadub
- scrubadub_spacy
- oracle_ads[opctl]
- spacy-transformers==1.2.5
- spacy==3.6.1
```
Please review the previously generated `pii.yaml` file using the `init` command, and make any necessary adjustments to the input and output file locations. By default, it assumes that the files should be located in the same folder from which the `init` command was executed.
5 changes: 3 additions & 2 deletions ads/opctl/operator/lowcode/pii/environment.yaml
@@ -9,8 +9,9 @@ dependencies:
- datapane
- gender_guesser
- nameparser
- oracle_ads[opctl]
- plotly
- spacy_transformers
- scrubadub
- scrubadub_spacy
- oracle_ads[opctl]
- spacy-transformers==1.2.5
- spacy==3.6.1
Original file line number Diff line number Diff line change
@@ -10,9 +10,9 @@ After having set up ``ads opctl`` on your desired machine using ``ads opctl conf
- Path to the input data (input_data)
- Path to the output directory, where the operator will place the processed data and report.html produced from the run (output_directory)
- Name of the column with user data (target_column)
- Name of the detector will be used in the operator (detectors)
- The detector will be used in the operator (detectors)

These details exactly match the initial pii.yaml file generated by running ``ads operator init --type pii``:
You can check :ref:`Configure Detector <config_detector>` for more details on how to configure ``detectors`` parameter. These details exactly match the initial pii.yaml file generated by running ``ads operator init --type pii``:

.. code-block:: yaml
@@ -32,10 +32,10 @@ These details exactly match the initial pii.yaml file generated by running ``ads
Optionally, you are able to specify much more. The most common additions are:

- Whether to show sensitive content in the report. (show_sensitive_content)
- Way to process the detected entity. (action)
- Whether to show sensitive content in the report (show_sensitive_content)
- Way to process the detected entity (action)

An extensive list of parameters can be found in the ``YAML Schema`` section.
An extensive list of parameters can be found in the :ref:`YAML Schema <pii-yaml-schema>`.


Run
@@ -57,7 +57,7 @@ We will go through each of these output files in turn.

**mydata-out.csv**

The name of this file can be customized based on output_directory parameters in the configuration yaml. This file contains the processed dataset.
The name of this file can be customized based on ``output_directory`` parameters in the configuration yaml. This file contains the processed dataset.

**report.html**

13 changes: 12 additions & 1 deletion docs/source/user_guide/operators/pii_operator/install.rst
@@ -7,7 +7,18 @@ The PII Operator can be installed from PyPi.

.. code-block:: bash
python3 -m pip install oracle_ads[pii]
python3 -m pip install oracle_ads[pii]==2.9
After that, the Operator is ready to go!

In order to run on a job, you will need to create and publish a conda pack with ``oracle_ads[pii]`` installed. The simplest way to do this is from a Notebook Session, running the following commands:

.. code-block:: bash
odsc conda create -n ads_pii -e
conda activate /home/datascience/conda/ads_pii_v1_0
python3 -m pip install oracle-ads[pii]==2.9
odsc conda publish -s /home/datascience/conda/ads_pii_v1_0
Ensure that you have properly configured your conda pack namespace and bucket in the Launcher -> Settings -> Object Storage Settings. For more details, see :doc:`ADS Conda Set Up <../../cli/opctl/configure>`
91 changes: 88 additions & 3 deletions docs/source/user_guide/operators/pii_operator/pii.rst
@@ -35,13 +35,98 @@ Here is an example pii.yaml with every parameter specified:
* **url**: Insert the uri for the dataset if it's on object storage using the URI pattern ``oci://<bucket>@<namespace>/path/to/data.csv``.
* **target_column**: This string specifies the name of the column where the user data is within the input data.
* **detectors**: This list contains the details for each detector and action that will be taken.
* **name**: The string specifies the name of the detector. The format should be ``<type>.<entity>``.
* **name**: The string specifies the name of the detector. The format should be ``<type>.<entity>``. Check :ref:`Configure Detector <config_detector>` for more details.
* **action**: The string specifies the way to process the detected entity. Default to mask.
* **output_directory**: This dictionary contains the details for where to put the output artifacts. The directory need not exist, but must be accessible by the Operator during runtime.
* **url**: Insert the uri for the dataset if it's on object storage using the URI pattern ``oci://<bucket>@<namespace>/subfolder/``.
* **name**: The string specifies the name of the processed data file.

* **report**: (optional) This dictionary specific details for the generated report.
* **report_filename**: Placed into output_directory location. Defaults to report.html.
* **show_sensitive_content**: Whether to show sensitive content in the report. Defaults to false.
* **report_filename**: Placed into output_directory location. Defaults to ``report.html``.
* **show_sensitive_content**: Whether to show sensitive content in the report. Defaults to ``false``.
* **show_rows**: The number of rows that shows in the report.


.. _config_detector:

Configure Detector
------------------

A detector consists of ``name`` and ``action``. The **name** parameter defines the detector that will be used, and the **action** parameter defines the way to process the entity.

Configure Name
~~~~~~~~~~~~~~

We currently support the following type of detectors:

* default
* spacy

Default
^^^^^^^

Here scrubadub's pre-defined detector is used. You can designate the name in the format of ``default.<entity>`` (e.g., ``default.phone``). Check the supported detectors from `scrubadub <https://scrubadub.readthedocs.io/en/stable/api_scrubadub_detectors.html>`_.

.. note::

If you want to de-identify `address` by this tool, `scrubadub_address` is required.
You will need to follow the `instructions`_ to install the required dependencies.

.. _instructions: https://scrubadub.readthedocs.io/en/stable/addresses.html/


spaCy
^^^^^

To use spaCy’s NER to identify entity, you can designate the name in the format of ``spacy.<model>.<entity>`` (e.g., ``spacy.en_core_web_sm.person``).
The "entity" value can correspond to any entity that spaCy recognizes. For a list of available models and entities, please refer to the `spaCy documentation <https://spacy.io/models/en>`_.
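
The two naming formats described above, ``default.<entity>`` and ``spacy.<model>.<entity>``, can be illustrated with a small parser. This is only a sketch under stated assumptions: ``DetectorName`` and ``parse_detector_name`` are hypothetical names introduced for illustration and are not part of the operator's code, which may parse names differently.

```python
from typing import NamedTuple, Optional


class DetectorName(NamedTuple):
    """Parts of a detector name (hypothetical container, not part of ADS)."""

    detector_type: str
    model: Optional[str]  # only set for spaCy detectors
    entity: str


def parse_detector_name(name: str) -> DetectorName:
    """Split ``default.<entity>`` or ``spacy.<model>.<entity>`` into its parts.

    Hypothetical helper: it mirrors the documented formats only.
    """
    parts = name.split(".")
    if not all(parts):
        # Reject empty segments such as "default." or "spacy..person".
        raise ValueError(f"Empty segment in detector name: {name!r}")
    if parts[0] == "default" and len(parts) == 2:
        return DetectorName("default", None, parts[1])
    if parts[0] == "spacy" and len(parts) == 3:
        return DetectorName("spacy", parts[1], parts[2])
    raise ValueError(f"Unrecognized detector name: {name!r}")


print(parse_detector_name("default.phone"))
print(parse_detector_name("spacy.en_core_web_sm.person"))
```

Splitting on ``.`` works here because spaCy model names such as ``en_core_web_sm`` use underscores rather than dots.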



Configure Action
~~~~~~~~~~~~~~~~

We currently support the following types of actions:

* mask
* remove
* anonymize

Mask
^^^^

The ``mask`` action is used to mask the detected entity with the name of the entity type. It replaces the entity with a placeholder. For example, with the following configured detector:

.. code-block:: yaml
name: spacy.en_core_web_sm.person
action: mask
After processing, the input text "Hi, my name is John Doe." will become "Hi, my name is {{NAME}}."

Remove
^^^^^^

The ``remove`` action is used to delete the detected entity from the text. It completely removes the entity without replacement. For example, with the following configured detector:

.. code-block:: yaml
name: spacy.en_core_web_sm.person
action: remove
After processing, the input text "Hi, my name is John Doe." will become "Hi, my name is ."


Anonymize
^^^^^^^^^

The ``anonymize`` action can be used to obfuscate the detected sensitive information.
Currently, we provide context-aware anonymization for name, email, and number-like entities.
For example, with the following configured detector:

.. code-block:: yaml
name: spacy.en_core_web_sm.person
action: anonymize
After processing, the input text "Hi, my name is John Doe." will become "Hi, my name is Joe Blow."
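
The ``mask`` and ``remove`` examples documented above can be reproduced with plain string slicing. The sketch below assumes the detector has already located the entity span; ``apply_action`` is a hypothetical helper written for illustration, not the operator's implementation. ``anonymize`` is left out because it depends on a replacement-data generator that this diff does not show.

```python
def apply_action(text: str, start: int, end: int, action: str, label: str = "NAME") -> str:
    """Apply ``mask`` or ``remove`` to the span text[start:end].

    Hypothetical helper: the span is assumed to come from a detector.
    """
    if action == "mask":
        # Replace the entity with a placeholder naming its type.
        replacement = "{{" + label + "}}"
    elif action == "remove":
        # Delete the entity without any replacement.
        replacement = ""
    else:
        raise ValueError(f"Unsupported action in this sketch: {action!r}")
    return text[:start] + replacement + text[end:]


text = "Hi, my name is John Doe."
start = text.index("John Doe")
end = start + len("John Doe")

print(apply_action(text, start, end, "mask"))    # Hi, my name is {{NAME}}.
print(apply_action(text, start, end, "remove"))  # Hi, my name is .
```

As in the documented example, ``remove`` leaves the surrounding whitespace in place, which is why the output reads "is ." with a space before the period.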
2 changes: 2 additions & 0 deletions docs/source/user_guide/operators/pii_operator/yaml_schema.rst
@@ -1,3 +1,5 @@
.. _pii-yaml-schema:

===========
YAML Schema
===========
27 changes: 14 additions & 13 deletions pyproject.toml
@@ -123,8 +123,8 @@ opctl = [
"nbconvert",
"nbformat",
"oci-cli",
"rich",
"py-cpuinfo",
"rich",
]
optuna = [
"optuna==2.9.0",
@@ -154,20 +154,20 @@ viz = [
"seaborn>=0.11.0",
]
forecast = [
"autots[additional]",
"datapane",
"prophet",
"pmdarima",
"statsmodels",
"sktime",
"optuna==2.9.0",
"oci-cli",
"shap",
"numpy",
"holidays==0.21.13",
"neuralprophet",
"numpy",
"oci-cli",
"optuna==2.9.0",
"oracle-ads[opctl]",
"oracle-automlx==23.2.3",
"autots[additional]",
"neuralprophet",
"pmdarima",
"prophet",
"shap",
"sktime",
"statsmodels",
]
pii = [
"aiohttp",
@@ -176,9 +176,10 @@ pii = [
"nameparser",
"oracle_ads[opctl]",
"plotly",
"spacy_transformers",
"scrubadub",
"scrubadub==2.0.1",
"scrubadub_spacy",
"spacy-transformers==1.2.5",
"spacy==3.6.1",
]

[project.urls]
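
The ``pii`` extra in this hunk pins ``scrubadub==2.0.1``, ``spacy-transformers==1.2.5``, and ``spacy==3.6.1``. One way to confirm that a local environment matches those pins is a standard-library check like the sketch below. ``PINS`` and ``find_pin_mismatches`` are hypothetical names introduced here and are not part of ADS.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the `pii` extra in this commit's pyproject.toml hunk.
PINS = {
    "scrubadub": "2.0.1",
    "spacy-transformers": "1.2.5",
    "spacy": "3.6.1",
}


def find_pin_mismatches(pins):
    """Return {package: installed_version_or_None} for every unmet pin."""
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[package] = installed
    return mismatches


for package, installed in find_pin_mismatches(PINS).items():
    print(f"{package}: expected {PINS[package]}, found {installed or 'not installed'}")
```

The script prints nothing when every pin is satisfied. Note that the README and ``environment.yaml`` hunks list ``scrubadub`` without a version, so a conda environment built from them may report a mismatch for that package.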
7 changes: 5 additions & 2 deletions tests/unitary/with_extras/operator/pii/test_factory.py
@@ -25,8 +25,11 @@ def test_get_default_detector(self):
@pytest.mark.parametrize(
"detector_type, entity, model",
[
("spacy", "person", "en_core_web_trf"),
("spacy", "other", "en_core_web_trf"),
("spacy", "person", "en_core_web_sm"),
("spacy", "other", "en_core_web_sm"),
# ("spacy", "org", "en_core_web_trf"),
# ("spacy", "loc", "en_core_web_md"),
# ("spacy", "date", "en_core_web_lg"),
],
)
def test_get_spacy_detector(self, detector_type, entity, model):