Commit

make release-tag: Merge branch 'master' into stable
katxiao committed May 3, 2022
2 parents fc7a4ce + 18c9345 commit 9d8bdce
Showing 32 changed files with 1,232 additions and 457 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -106,6 +106,8 @@ ENV/
# Vim
.*.swp

# other
.DS_Store
sdv/data/
docs/**/*.pkl
docs/**/*metadata.json
25 changes: 25 additions & 0 deletions HISTORY.md
@@ -1,5 +1,30 @@
# Release Notes

## 0.14.1 - 2022-05-03

This release adds a `TabularPreset`, available in the `sdv.lite` module, which allows users to easily optimize a tabular model for speed.
In this release, we also include bug fixes for sampling with conditions, an unresolved warning, and setting field distributions. Finally,
we include documentation updates for sampling and the new `TabularPreset`.

### Bugs Fixed
* Fix write to file in sampling - Issue [#732](https://github.com/sdv-dev/SDV/issues/732) by @katxiao
* Sampling with conditions={column: 0.0} for float columns doesn't work - Issue [#525](https://github.com/sdv-dev/SDV/issues/525) by @shlomihod and @tssbas
* Resolve pandas FutureWarning by replacing append with concat - Issue [#759](https://github.com/sdv-dev/SDV/issues/759) by @Deathn0t
* Field distributions bug in CopulaGAN - Issue [#747](https://github.com/sdv-dev/SDV/issues/747) by @katxiao
* Field distributions bug in GaussianCopula - Issue [#746](https://github.com/sdv-dev/SDV/issues/746) by @katxiao

### New Features
* Set default transformer to categorical_fuzzy - Issue [#768](https://github.com/sdv-dev/SDV/issues/768) by @amontanez24
* Model nulls normally when tabular preset has constraints - Issue [#764](https://github.com/sdv-dev/SDV/issues/764) by @katxiao
* Don't modify my metadata object - Issue [#754](https://github.com/sdv-dev/SDV/issues/754) by @amontanez24
* Presets should be able to handle constraints - Issue [#753](https://github.com/sdv-dev/SDV/issues/753) by @katxiao
* Change preset optimize_for --> name - Issue [#749](https://github.com/sdv-dev/SDV/issues/749) by @katxiao
* Create a speed optimized Preset - Issue [#716](https://github.com/sdv-dev/SDV/issues/716) by @katxiao

### Documentation Changes
* Add tabular preset docs - Issue [#777](https://github.com/sdv-dev/SDV/issues/777) by @katxiao
* sdv.sampling module is missing from the API - Issue [#740](https://github.com/sdv-dev/SDV/issues/740) by @katxiao

## 0.14.0 - 2022-03-21

This release updates the sampling API and splits the existing functionality into three methods - `sample`, `sample_conditions`,
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

-Copyright (c) 2018, MIT Data To AI Lab
+Copyright (c) 2022, DataCebo, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
2 changes: 1 addition & 1 deletion conda/meta.yaml
@@ -1,5 +1,5 @@
{% set name = 'sdv' %}
-{% set version = '0.14.1.dev0' %}
+{% set version = '0.14.1.dev1' %}

package:
name: "{{ name|lower }}"
Binary file removed docs/.DS_Store
Binary file not shown.
1 change: 1 addition & 0 deletions docs/api_reference/index.rst
@@ -15,6 +15,7 @@ and classes in SDV.
    timeseries/index
    metadata/index
    constraints/index
    sampling/index
    metrics/index
    evaluation
    demo
10 changes: 10 additions & 0 deletions docs/api_reference/sampling/index.rst
@@ -0,0 +1,10 @@
.. _sdv.sampling:

sdv.sampling
===============

.. toctree::
    :maxdepth: 1
    :titlesonly:

    tabular
16 changes: 16 additions & 0 deletions docs/api_reference/sampling/tabular.rst
@@ -0,0 +1,16 @@
.. _sdv.sampling.tabular:

Tabular Conditions
===================

.. currentmodule:: sdv.sampling

Condition
~~~~~~~~~

.. autosummary::
    :toctree: api/

    Condition
    Condition.get_column_values
    Condition.get_num_rows
Binary file added docs/images/google_colab.png
1 change: 0 additions & 1 deletion docs/user_guides/single_table/copulagan.rst
@@ -10,7 +10,6 @@ discover functionalities of the ``CopulaGAN`` model, including how to:
- Fit the instance to your data.
- Generate synthetic versions of your data.
- Use ``CopulaGAN`` to anonymize PII information.
- Customize the data transformations to improve the learning process.
- Specify the column distributions to improve the output quality.
- Specify hyperparameters to improve the output quality.

1 change: 0 additions & 1 deletion docs/user_guides/single_table/ctgan.rst
@@ -10,7 +10,6 @@ discover functionalities of the ``CTGAN`` model, including how to:
- Fit the instance to your data.
- Generate synthetic versions of your data.
- Use ``CTGAN`` to anonymize PII information.
- Customize the data transformations to improve the learning process.
- Specify hyperparameters to improve the output quality.

What is CTGAN?
68 changes: 0 additions & 68 deletions docs/user_guides/single_table/gaussian_copula.rst
@@ -11,7 +11,6 @@ to:
- Fit the instance to your data.
- Generate synthetic versions of your data.
- Use ``GaussianCopula`` to anonymize PII information.
- Customize the data transformations to improve the learning process.
- Specify the column distributions to improve the output quality.

What is GaussianCopula?
@@ -351,73 +350,6 @@ Now that we have discovered the basics, let's go over a few more
advanced usage examples and see the different arguments that we can pass
to our ``GaussianCopula`` Model in order to customize it to our needs.

How to set transforms to use?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One thing that you may have noticed when executing the previous steps is
that the fitting process took much longer on the
``student_placements_pii`` dataset than it took on the previous version
that did not contain the student ``address``. This happens because the
``address`` field is interpreted as a categorical variable, which the
``GaussianCopula`` `one-hot
encoded <https://en.wikipedia.org/wiki/One-hot>`__ generating 215 new
columns that it had to learn afterwards.

This transformation, which in this case was very inefficient, happens
because the Tabular Models apply `Reversible Data
Transforms <https://github.com/sdv-dev/RDT>`__ under the hood to
transform all the non-numerical variables, which the underlying models
cannot handle, into numerical representations which they can properly
work with. In the case of the ``GaussianCopula``, the default
transformation is a One-Hot encoding, which can work very well with
variables that have a small number of different values, but which is
very inefficient in cases where there is a large number of values.

For this reason, the Tabular Models have an additional argument called
``field_transformers`` that lets you select which transformer to apply to
each column. This ``field_transformers`` argument must be passed as a
``dict`` that maps the name of each field for which we want a
non-default transformer to the name of the transformer that we want to
use.

Possible transformer names are:

- ``integer``: Uses a ``NumericalTransformer`` of dtype ``int``.
- ``float``: Uses a ``NumericalTransformer`` of dtype ``float``.
- ``categorical``: Uses a ``CategoricalTransformer`` without Gaussian
  noise.
- ``categorical_fuzzy``: Uses a ``CategoricalTransformer`` adding
  Gaussian noise.
- ``one_hot_encoding``: Uses a ``OneHotEncodingTransformer``.
- ``label_encoding``: Uses a ``LabelEncodingTransformer``.
- ``boolean``: Uses a ``BooleanTransformer``.
- ``datetime``: Uses a ``DatetimeTransformer``.

**NOTE**: For additional details about each one of the transformers,
please visit `RDT <https://github.com/sdv-dev/RDT>`__

Let's now try to improve the previous fitting process by changing the
transformer that we use for the ``address`` field to something other
than the default. As an example, we will use the ``label_encoding``
transformer, which, instead of generating one column for each possible
value, replaces each value with a unique integer.
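
To make the difference between the two encodings concrete, here is a
minimal plain-Python sketch. It is not the RDT implementation: the
helper names ``one_hot_encode`` and ``label_encode`` and the sample
addresses are hypothetical and exist only for illustration.

```python
# Illustrative sketch only: plain-Python stand-ins, not the RDT transformers.


def one_hot_encode(values):
    """Return one 0/1 column per distinct value."""
    categories = sorted(set(values))
    return {
        category: [1 if value == category else 0 for value in values]
        for category in categories
    }


def label_encode(values):
    """Return a single integer column plus the value-to-integer mapping."""
    mapping = {
        category: index
        for index, category in enumerate(sorted(set(values)))
    }
    return [mapping[value] for value in values], mapping


# Hypothetical sample data: every address is distinct, as is typical for PII.
addresses = ['12 Oak St', '7 Elm Ave', '99 Pine Rd', '3 Birch Ln']

one_hot_columns = one_hot_encode(addresses)
label_column, label_mapping = label_encode(addresses)

# One-hot: the number of generated columns equals the number of distinct values.
print(len(one_hot_columns))
# Label encoding: always a single column of integers, whatever the cardinality.
print(label_column)
```

With a high-cardinality field such as ``address``, the one-hot variant
produces as many columns as there are distinct values, while the label
encoded variant stays at one column.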

.. ipython:: python
    :okwarning:

    model = GaussianCopula(
        primary_key='student_id',
        anonymize_fields={
            'address': 'address'
        },
        field_transformers={
            'address': 'label_encoding'
        }
    )
    model.fit(data_pii)
    new_data_pii = model.sample(200)
    new_data_pii.head()

Setting Bounds and Specifying Rounding for Numerical Columns
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

10 changes: 10 additions & 0 deletions docs/user_guides/single_table/models.rst
@@ -2,10 +2,20 @@

Models
======
In this section, you'll find information about using synthetic data models for single table data.

.. note::
    Is this your first time using the SDV?

    We recommend starting with the new :ref:`tabular_preset` model. This model comes pre-configured
    so you can spend less time choosing parameters or tuning a model, and more time using your
    synthetic data.


.. toctree::
    :maxdepth: 2

    tabular_preset
    gaussian_copula
    ctgan
    copulagan