make release-tag: Merge branch 'master' into stable

sdv-dev · May 3, 2022 · 9d8bdce
2 parents fc7a4ce + 18c9345
commit 9d8bdce

Showing 32 changed files with 1,232 additions and 457 deletions.
diff --git a/.gitignore b/.gitignore
@@ -106,6 +106,8 @@ ENV/
 # Vim
 .*.swp
 
+# other
+.DS_Store
 sdv/data/
 docs/**/*.pkl
 docs/**/*metadata.json

diff --git a/HISTORY.md b/HISTORY.md
@@ -1,5 +1,30 @@
 # Release Notes
 
+## 0.14.1 - 2022-05-03
+
+This release adds a `TabularPreset`, available in the `sdv.lite` module, which allows users to easily optimize a tabular model for speed.
+In this release, we also include bug fixes for sampling with conditions, an unresolved warning, and setting field distributions. Finally,
+we include documentation updates for sampling and the new `TabularPreset`.
+
+### Bugs Fixed
+* Fix write to file in sampling - Issue [#732](https://github.com/sdv-dev/SDV/issues/732) by @katxiao
+* Sampling with conditions={column: 0.0} for float columns doesn't work - Issue [#525](https://github.com/sdv-dev/SDV/issues/525) by @shlomihod and @tssbas
+* Resolve Pandas FutureWarning by replacing append with concat - Issue [#759](https://github.com/sdv-dev/SDV/issues/759) by @Deathn0t
+* Field distributions bug in CopulaGAN - Issue [#747](https://github.com/sdv-dev/SDV/issues/747) by @katxiao
+* Field distributions bug in GaussianCopula - Issue [#746](https://github.com/sdv-dev/SDV/issues/746) by @katxiao
+
+### New Features
+* Set default transformer to categorical_fuzzy - Issue [#768](https://github.com/sdv-dev/SDV/issues/768) by @amontanez24
+* Model nulls normally when tabular preset has constraints - Issue [#764](https://github.com/sdv-dev/SDV/issues/764) by @katxiao
+* Don't modify my metadata object - Issue [#754](https://github.com/sdv-dev/SDV/issues/754) by @amontanez24
+* Presets should be able to handle constraints - Issue [#753](https://github.com/sdv-dev/SDV/issues/753) by @katxiao
+* Change preset optimize_for --> name - Issue [#749](https://github.com/sdv-dev/SDV/issues/749) by @katxiao
+* Create a speed optimized Preset - Issue [#716](https://github.com/sdv-dev/SDV/issues/716) by @katxiao
+
+### Documentation Changes
+* Add tabular preset docs - Issue [#777](https://github.com/sdv-dev/SDV/issues/777) by @katxiao
+* sdv.sampling module is missing from the API - Issue [#740](https://github.com/sdv-dev/SDV/issues/740) by @katxiao
+
 ## 0.14.0 - 2022-03-21
 
 This release updates the sampling API and splits the existing functionality into three methods - `sample`, `sample_conditions`,

diff --git a/LICENSE b/LICENSE
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2018, MIT Data To AI Lab
+Copyright (c) 2022, DataCebo, Inc.
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

diff --git a/conda/meta.yaml b/conda/meta.yaml
@@ -1,5 +1,5 @@
 {% set name = 'sdv' %}
-{% set version = '0.14.1.dev0' %}
+{% set version = '0.14.1.dev1' %}
 
 package:
   name: "{{ name|lower }}"

diff --git a/docs/.DS_Store b/docs/.DS_Store
diff --git a/docs/api_reference/index.rst b/docs/api_reference/index.rst
@@ -15,6 +15,7 @@ and classes in SDV.
    timeseries/index
    metadata/index
    constraints/index
+   sampling/index
    metrics/index
    evaluation
    demo
diff --git a/docs/api_reference/sampling/index.rst b/docs/api_reference/sampling/index.rst
@@ -0,0 +1,10 @@
+.. _sdv.sampling:
+
+sdv.sampling
+===============
+
+.. toctree::
+    :maxdepth: 1
+    :titlesonly:
+
+    tabular
diff --git a/docs/api_reference/sampling/tabular.rst b/docs/api_reference/sampling/tabular.rst
@@ -0,0 +1,16 @@
+.. _sdv.sampling.tabular:
+
+Tabular Conditions
+===================
+
+.. currentmodule:: sdv.sampling
+
+Condition
+~~~~~~~~~
+
+.. autosummary::
+   :toctree: api/
+
+   Condition
+   Condition.get_column_values
+   Condition.get_num_rows
diff --git a/docs/images/google_colab.png b/docs/images/google_colab.png
diff --git a/docs/user_guides/single_table/copulagan.rst b/docs/user_guides/single_table/copulagan.rst
@@ -10,7 +10,6 @@ discover functionalities of the ``CopulaGAN`` model, including how to:
 -  Fit the instance to your data.
 -  Generate synthetic versions of your data.
 -  Use ``CopulaGAN`` to anonymize PII information.
--  Customize the data transformations to improve the learning process.
 -  Specify the column distributions to improve the output quality.
 -  Specify hyperparameters to improve the output quality.
 

diff --git a/docs/user_guides/single_table/ctgan.rst b/docs/user_guides/single_table/ctgan.rst
@@ -10,7 +10,6 @@ discover functionalities of the ``CTGAN`` model, including how to:
 -  Fit the instance to your data.
 -  Generate synthetic versions of your data.
 -  Use ``CTGAN`` to anonymize PII information.
--  Customize the data transformations to improve the learning process.
 -  Specify hyperparameters to improve the output quality.
 
 What is CTGAN?

diff --git a/docs/user_guides/single_table/gaussian_copula.rst b/docs/user_guides/single_table/gaussian_copula.rst
@@ -11,7 +11,6 @@ to:
 -  Fit the instance to your data.
 -  Generate synthetic versions of your data.
 -  Use ``GaussianCopula`` to anonymize PII information.
--  Customize the data transformations to improve the learning process.
 -  Specify the column distributions to improve the output quality.
 
 What is GaussianCopula?
@@ -351,73 +350,6 @@ Now that we have discovered the basics, let's go over a few more
 advanced usage examples and see the different arguments that we can pass
 to our ``GaussianCopula`` Model in order to customize it to our needs.
 
-How to set transforms to use?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-One thing that you may have noticed when executing the previous steps is
-that the fitting process took much longer on the
-``student_placements_pii`` dataset than it took on the previous version
-that did not contain the student ``address``. This happens because the
-``address`` field is interpreted as a categorical variable, which the
-``GaussianCopula`` `one-hot
-encoded <https://en.wikipedia.org/wiki/One-hot>`__ generating 215 new
-columns that it had to learn afterwards.
-
-This transformation, which in this case was very inefficient, happens
-because the Tabular Models apply `Reversible Data
-Transforms <https://github.com/sdv-dev/RDT>`__ under the hood to
-transform all the non-numerical variables, which the underlying models
-cannot handle, into numerical representations which they can properly
-work with. In the case of the ``GaussianCopula``, the default
-transformation is a One-Hot encoding, which can work very well with
-variables that have a small number of different values, but which is
-very inefficient in cases where there is a large number of values.
-
-For this reason, the Tabular Models have an additional argument called
-``field_transformers`` that let you select which transformer to apply to
-each column. This ``field_transformers`` argument must be passed as a
-``dict`` which contains the name of the fields for which we want to use
-a transformer different than the default, and the name of the
-transformer that we want to use.
-
-Possible transformer names are:
-
--  ``integer``: Uses a ``NumericalTransformer`` of dtype ``int``.
--  ``float``: Uses a ``NumericalTransformer`` of dtype ``float``.
--  ``categorical``: Uses a ``CategoricalTransformer`` without gaussian
-   noise.
--  ``categorical_fuzzy``: Uses a ``CategoricalTransformer`` adding
-   gaussian noise.
--  ``one_hot_encoding``: Uses a ``OneHotEncodingTransformer``.
--  ``label_encoding``: Uses a ``LabelEncodingTransformer``.
--  ``boolean``: Uses a ``BooleanTransformer``.
--  ``datetime``: Uses a ``DatetimeTransformer``.
-
-**NOTE**: For additional details about each one of the transformers,
-please visit `RDT <https://github.com/sdv-dev/RDT>`__
-
-Let's now try to improve the previous fitting process by changing the
-transformer that we use for the ``address`` field to something other
-than the default. As an example, we will use the ``label_encoding``
-transformer, which instead of generating one column for each possible
-value, it just replaces each value with a unique integer value.
-
-.. ipython:: python
-    :okwarning:
-
-    model = GaussianCopula(
-        primary_key='student_id',
-        anonymize_fields={
-            'address': 'address'
-        },
-        field_transformers={
-            'address': 'label_encoding'
-        }
-    )
-    model.fit(data_pii)
-    new_data_pii = model.sample(200)
-    new_data_pii.head()
-
 Setting Bounds and Specifying Rounding for Numerical Columns
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 

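Reviewer note on the section removed above: the cost difference it describes between one-hot and label encoding can be sketched with plain pandas. This is only an illustration under assumptions, not the RDT transformers that SDV uses internally; the ``addresses`` series is hypothetical data standing in for the ``address`` column, sized to match the 215 distinct values mentioned in the removed text.

```python
import pandas as pd

# Hypothetical high-cardinality column standing in for ``address``
# (215 distinct values, matching the count in the removed docs text).
addresses = pd.Series([f"{i} Main St" for i in range(215)], name="address")

# One-hot encoding: one new column per distinct value.
one_hot = pd.get_dummies(addresses)

# Label encoding: a single array of integer codes, one code per distinct value.
codes, uniques = pd.factorize(addresses)

print("one-hot columns:", one_hot.shape[1])
print("label-encoded columns:", 1, "with", len(uniques), "distinct codes")
```

With many distinct values, the one-hot representation grows by one column per value, while the label encoding stays a single column, which is the trade-off the removed ``field_transformers`` example was addressing.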
diff --git a/docs/user_guides/single_table/models.rst b/docs/user_guides/single_table/models.rst
@@ -2,10 +2,20 @@
 
 Models
 ======
+In this section, you'll find information about using synthetic data models for single table data.
+
+.. note::
+   Is this your first time using the SDV?
+
+   We recommend starting with the new :ref:`tabular_preset` model. This model comes pre-configured
+   so you can spend less time choosing parameters or tuning a model, and more time using your
+   synthetic data.
+
 
 .. toctree::
     :maxdepth: 2
 
+    tabular_preset
     gaussian_copula
     ctgan
     copulagan