make release-tag: Merge branch 'main' into stable
amontanez24 committed May 15, 2024
2 parents fd67bd3 + ee9a8bb commit 0bc2a9b
Showing 60 changed files with 4,950 additions and 733 deletions.
7 changes: 6 additions & 1 deletion .github/workflows/integration.yml
@@ -11,7 +11,12 @@ jobs:
     strategy:
       matrix:
         python-version: [ '3.8', '3.9', '3.10', '3.11', '3.12']
-        os: [ubuntu-latest, macos-latest, windows-latest]
+        os: [ubuntu-latest, windows-latest]
+        include:
+          - os: macos-latest
+            python-version: '3.8'
+          - os: macos-latest
+            python-version: '3.12'
     steps:
       - uses: actions/checkout@v4
       - name: Set up Python ${{ matrix.python-version }}
7 changes: 6 additions & 1 deletion .github/workflows/minimum.yml
@@ -11,7 +11,12 @@ jobs:
     strategy:
       matrix:
         python-version: [ '3.8', '3.9', '3.10', '3.11', '3.12']
-        os: [ubuntu-latest, macos-latest, windows-latest]
+        os: [ubuntu-latest, windows-latest]
+        include:
+          - os: macos-latest
+            python-version: '3.8'
+          - os: macos-latest
+            python-version: '3.12'
     steps:
       - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
7 changes: 6 additions & 1 deletion .github/workflows/unit.yml
@@ -11,7 +11,12 @@ jobs:
     strategy:
       matrix:
         python-version: [ '3.8', '3.9', '3.10', '3.11', '3.12']
-        os: [ubuntu-latest, macos-latest, windows-latest]
+        os: [ubuntu-latest, windows-latest]
+        include:
+          - os: macos-latest
+            python-version: '3.8'
+          - os: macos-latest
+            python-version: '3.12'
     steps:
       - uses: actions/checkout@v4
       - name: Set up Python ${{ matrix.python-version }}
36 changes: 36 additions & 0 deletions HISTORY.md
@@ -1,5 +1,41 @@
# Release Notes

## 1.13.0 - 2024-05-15

This release adds a utility function called `get_random_subset` that helps users get a subset of their multi-table data so that modeling can be done more quickly. Given a dictionary of table names mapped to DataFrames, the metadata, a main table, and a desired number of rows for the main table, it subsamples the data in a way that maintains referential integrity.
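The subsampling idea behind `get_random_subset` can be sketched with plain pandas (hypothetical `users`/`orders` tables; this illustrates the referential-integrity rule, not SDV's implementation):

```python
import pandas as pd

# Hypothetical parent/child tables: orders.user_id references users.user_id.
users = pd.DataFrame({'user_id': range(10), 'name': [f'user_{i}' for i in range(10)]})
orders = pd.DataFrame({'order_id': range(30), 'user_id': [i % 10 for i in range(30)]})

# Subsample the main table first, then keep only the child rows whose
# foreign keys still point at a sampled parent row.
users_subset = users.sample(n=4, random_state=0)
orders_subset = orders[orders['user_id'].isin(users_subset['user_id'])]

# Referential integrity holds: every remaining order still has its user.
assert orders_subset['user_id'].isin(users_subset['user_id']).all()
```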

This release also adds two new local file handlers: the `CSVHandler` and the `ExcelHandler`. These enable users to easily load data from, and save synthetic data to, these file types. The handlers return data and metadata in the multi-table format, so we also added the function `get_table_metadata` to get a `SingleTableMetadata` object from a `MultiTableMetadata` object.
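The handlers return data as a dictionary mapping table names to DataFrames. A minimal sketch of that loading pattern with plain pandas (the real `CSVHandler` API may differ):

```python
import tempfile
from pathlib import Path

import pandas as pd

def read_csv_folder(folder):
    """Load every CSV file in `folder` into a {table_name: DataFrame} dict."""
    return {path.stem: pd.read_csv(path) for path in Path(folder).glob('*.csv')}

# Demo against a temporary folder holding two hypothetical tables.
folder = Path(tempfile.mkdtemp())
pd.DataFrame({'user_id': [1, 2]}).to_csv(folder / 'users.csv', index=False)
pd.DataFrame({'order_id': [10]}).to_csv(folder / 'orders.csv', index=False)

data = read_csv_folder(folder)
assert set(data) == {'users', 'orders'}
```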

Finally, this release fixes some bugs that prevented synthesizers from working with data that had numerical column names.

### New Features

* Add `get_random_subset` poc utility function - Issue [#1877](https://github.com/sdv-dev/SDV/issues/1877) by @R-Palazzo
* Add usage logging - Issue [#1903](https://github.com/sdv-dev/SDV/issues/1903) by @pvk-developer
* Move function `drop_unknown_references` from `poc` to be directly under `utils` - Issue [#1947](https://github.com/sdv-dev/SDV/issues/1947) by @R-Palazzo
* Add CSVHandler - Issue [#1949](https://github.com/sdv-dev/SDV/issues/1949) by @pvk-developer
* Add ExcelHandler - Issue [#1950](https://github.com/sdv-dev/SDV/issues/1950) by @pvk-developer
* Add get_table_metadata function - Issue [#1951](https://github.com/sdv-dev/SDV/issues/1951) by @R-Palazzo
* Save usage log file as a csv - Issue [#1974](https://github.com/sdv-dev/SDV/issues/1974) by @frances-h
* Split out metadata creation from data import in the local files handlers - Issue [#1975](https://github.com/sdv-dev/SDV/issues/1975) by @pvk-developer
* Improve error message when trying to sample before fitting (single table) - Issue [#1978](https://github.com/sdv-dev/SDV/issues/1978) by @R-Palazzo

### Bugs Fixed

* Metadata detection crashes when the column names are integers (`AttributeError: 'int' object has no attribute 'lower'`) - Issue [#1933](https://github.com/sdv-dev/SDV/issues/1933) by @lajohn4747
* Synthesizers crash when column names are integers (`TypeError: unsupported operand`) - Issue [#1935](https://github.com/sdv-dev/SDV/issues/1935) by @lajohn4747
* Switch parameter order in drop_unknown_references - Issue [#1944](https://github.com/sdv-dev/SDV/issues/1944) by @R-Palazzo
* Unexpected NaN values in sequence_index when dataframe isn't reset - Issue [#1973](https://github.com/sdv-dev/SDV/issues/1973) by @fealho
* Fix pandas DtypeWarning in download_demo - Issue [#1980](https://github.com/sdv-dev/SDV/issues/1980) by @fealho

### Maintenance

* Only run unit and integration tests on oldest and latest python versions for macos - Issue [#1948](https://github.com/sdv-dev/SDV/issues/1948) by @frances-h

### Internal

* Update code to remove `FutureWarning` related to 'enforce_uniqueness' parameter - Issue [#1995](https://github.com/sdv-dev/SDV/issues/1995) by @pvk-developer

## 1.12.1 - 2024-04-19

This release makes a number of changes to how id columns are generated. By default, id columns with a regex will now have their values scrambled in the output. Id columns without a regex will be created randomly if they are numeric; if they're not numeric, they will receive a random suffix.
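The random-value and random-suffix branches described above might look roughly like this (an illustrative sketch, not SDV's code; `make_ids` is a hypothetical helper):

```python
import random

def make_ids(sample_value, num_rows, seed=0):
    """Numeric id values are drawn randomly; non-numeric ones get a random suffix."""
    rng = random.Random(seed)
    if str(sample_value).isdigit():
        return [rng.randrange(10**6) for _ in range(num_rows)]
    return [f'{sample_value}_{rng.randrange(10**6)}' for _ in range(num_rows)]

assert all(isinstance(i, int) for i in make_ids('123', 3))
assert all(i.startswith('cust_') for i in make_ids('cust', 3))
```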
12 changes: 6 additions & 6 deletions Makefile
@@ -123,12 +123,8 @@ test-integration: ## run tests quickly with the default Python
 test-readme: ## run the readme snippets
 	invoke readme
 
-.PHONY: test-tutorials
-test-tutorials: ## run the tutorial notebooks
-	invoke tutorials
-
 .PHONY: test
-test: test-unit test-integration test-readme test-tutorials ## test everything that needs test dependencies
+test: test-unit test-integration test-readme ## test everything that needs test dependencies
 
 .PHONY: test-all
 test-all: ## run tests on every Python version with tox
@@ -239,6 +235,10 @@ ifeq ($(CHANGELOG_LINES),0)
 	$(error Please insert the release notes in HISTORY.md before releasing)
 endif
 
+.PHONY: git-push
+git-push: ## Simply push the repository to github
+	git push
+
 .PHONY: check-release
 check-release: check-clean check-main check-history ## Check if the release can be made
 	@echo "A new release can be made"
@@ -265,5 +265,5 @@ release-major: check-release bumpversion-major release
 
 .PHONY: check-deps
 check-deps:
-	$(eval allow_list='cloudpickle=|graphviz=|numpy=|pandas=|tqdm=|copulas=|ctgan=|deepecho=|rdt=|sdmetrics=')
+	$(eval allow_list='cloudpickle=|graphviz=|numpy=|pandas=|tqdm=|copulas=|ctgan=|deepecho=|rdt=|sdmetrics=|platformdirs=')
 	pip freeze | grep -v "SDV.git" | grep -E $(allow_list) | sort > $(OUTPUT_FILEPATH)
5 changes: 3 additions & 2 deletions latest_requirements.txt
@@ -5,6 +5,7 @@ deepecho==0.6.0
 graphviz==0.20.3
 numpy==1.26.4
 pandas==2.2.2
-rdt==1.11.1
+platformdirs==4.2.1
+rdt==1.12.1
 sdmetrics==0.14.0
-tqdm==4.66.2
+tqdm==4.66.4
13 changes: 8 additions & 5 deletions pyproject.toml
@@ -25,11 +25,10 @@ dependencies = [
     'botocore>=1.31',
     'cloudpickle>=2.1.0',
     'graphviz>=0.13.2',
-    "numpy>=1.20.0;python_version<'3.10'",
+    "numpy>=1.21.0;python_version<'3.10'",
     "numpy>=1.23.3,<2;python_version>='3.10' and python_version<'3.12'",
     "numpy>=1.26.0,<2;python_version>='3.12'",
-    "pandas>=1.1.3;python_version<'3.10'",
-    "pandas>=1.3.4;python_version>='3.10' and python_version<'3.11'",
+    "pandas>=1.4.0;python_version<'3.11'",
     "pandas>=1.5.0;python_version>='3.11' and python_version<'3.12'",
     "pandas>=2.1.1;python_version>='3.12'",
     'tqdm>=4.29',
@@ -38,6 +37,7 @@ dependencies = [
     'deepecho>=0.6.0',
     'rdt>=1.12.0',
     'sdmetrics>=0.14.0',
+    'platformdirs>=4.0',
 ]
 
 [project.urls]
@@ -51,7 +51,9 @@ dependencies = [
 sdv = { main = 'sdv.cli.__main__:main' }
 
 [project.optional-dependencies]
+excel = ['pandas[excel]']
 test = [
+    'sdv[excel]',
     'pytest>=3.4.2',
     'pytest-cov>=2.6.0',
     'pytest-rerunfailures>=10.3,<15',
@@ -140,7 +142,8 @@ namespaces = false
     'make.bat',
     '*.jpg',
     '*.png',
-    '*.gif'
+    '*.gif',
+    'sdv_logger_config.yml'
 ]
 
 [tool.setuptools.exclude-package-data]
@@ -154,7 +157,7 @@ namespaces = false
 version = {attr = 'sdv.__version__'}
 
 [tool.bumpversion]
-current_version = "1.12.1"
+current_version = "1.13.0.dev1"
 parse = '(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(\.(?P<release>[a-z]+)(?P<candidate>\d+))?'
 serialize = [
     '{major}.{minor}.{patch}.{release}{candidate}',
9 changes: 5 additions & 4 deletions sdv/__init__.py
@@ -6,7 +6,7 @@
 
 __author__ = 'DataCebo, Inc.'
 __email__ = 'info@sdv.dev'
-__version__ = '1.12.1'
+__version__ = '1.13.0.dev1'
 
 
 import sys
@@ -16,8 +16,8 @@
 from types import ModuleType
 
 from sdv import (
-    constraints, data_processing, datasets, evaluation, io, lite, metadata, metrics, multi_table,
-    sampling, sequential, single_table, version)
+    constraints, data_processing, datasets, evaluation, io, lite, logging, metadata, metrics,
+    multi_table, sampling, sequential, single_table, version)
 
 __all__ = [
     'constraints',
@@ -26,6 +26,7 @@
     'evaluation',
     'io',
     'lite',
+    'logging',
     'metadata',
     'metrics',
     'multi_table',
@@ -94,7 +95,7 @@ def _find_addons():
            addon = entry_point.load()
        except Exception as e:  # pylint: disable=broad-exception-caught
            msg = (
-                f'Failed to load "{entry_point.name}" from "{entry_point.version}" '
+                f'Failed to load "{entry_point.name}" from "{entry_point.value}" '
                f'with error:\n{e}'
            )
            warnings.warn(msg)
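The `entry_point.version` → `entry_point.value` fix matters because `importlib.metadata.EntryPoint` carries `name`, `value`, and `group` (with `module` and `attr` parsed out of `value`); the entry point below is hypothetical:

```python
from importlib.metadata import EntryPoint

ep = EntryPoint(name='my_addon', value='my_pkg.addons:plugin', group='sdv_modules')
assert ep.value == 'my_pkg.addons:plugin'  # the loadable reference used in the message
assert ep.module == 'my_pkg.addons'
assert ep.attr == 'plugin'
```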
2 changes: 2 additions & 0 deletions sdv/_utils.py
@@ -214,6 +214,8 @@ def _validate_foreign_keys_not_null(metadata, data):
     invalid_tables = defaultdict(list)
     for table_name, table_data in data.items():
         for foreign_key in metadata._get_all_foreign_keys(table_name):
+            if foreign_key not in table_data and int(foreign_key) in table_data:
+                foreign_key = int(foreign_key)
             if table_data[foreign_key].isna().any():
                 invalid_tables[table_name].append(foreign_key)
 
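The two added lines guard against DataFrames whose column labels are integers while the metadata stores the foreign key name as a string; the mismatch is easy to reproduce:

```python
import pandas as pd

df = pd.DataFrame({0: [1, 2], 1: [3, None]})  # integer column labels

foreign_key = '1'             # metadata typically stores column names as strings
assert foreign_key not in df  # membership checks columns: the string misses it
assert int(foreign_key) in df

# Coercing the key, as the fix does, reaches the column and finds the null.
assert df[int(foreign_key)].isna().any()
```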
26 changes: 14 additions & 12 deletions sdv/data_processing/data_processor.py
@@ -412,7 +412,7 @@ def _update_transformers_by_sdtypes(self, sdtype, transformer):
         self._transformers_by_sdtype[sdtype] = transformer
 
     @staticmethod
-    def create_anonymized_transformer(sdtype, column_metadata, enforce_uniqueness,
+    def create_anonymized_transformer(sdtype, column_metadata, cardinality_rule,
                                       locales=['en_US']):
         """Create an instance of an ``AnonymizedFaker``.
@@ -424,24 +424,26 @@ def create_anonymized_transformer(sdtype, column_metadata, enforce_uniqueness,
                 Semantic data type or a ``Faker`` function name.
             column_metadata (dict):
                 A dictionary representing the rest of the metadata for the given ``sdtype``.
-            enforce_uniqueness (bool):
-                If ``True`` overwrite ``enforce_uniqueness`` with ``True`` to ensure unique
-                generation for primary keys.
+            cardinality_rule (str):
+                If ``'unique'`` enforce that every created value is unique.
+                If ``'match'`` match the cardinality of the data seen during fit.
+                If ``None`` do not consider cardinality.
+                Defaults to ``None``.
             locales (str or list):
                 Locale or list of locales to use for the AnonymizedFaker transformer.
                 Defaults to ['en_US'].
 
         Returns:
             Instance of ``rdt.transformers.pii.AnonymizedFaker``.
         """
-        kwargs = {'locales': locales}
+        kwargs = {
+            'locales': locales,
+            'cardinality_rule': cardinality_rule
+        }
         for key, value in column_metadata.items():
             if key not in ['pii', 'sdtype']:
                 kwargs[key] = value
 
-        if enforce_uniqueness:
-            kwargs['enforce_uniqueness'] = True
-
         try:
             transformer = get_anonymized_transformer(sdtype, kwargs)
         except AttributeError as error:
@@ -494,7 +496,7 @@ def _get_transformer_instance(self, sdtype, column_metadata):
         is_baseprovider = transformer.provider_name == 'BaseProvider'
         if is_lexify and is_baseprovider:  # Default settings
             return self.create_anonymized_transformer(
-                sdtype, column_metadata, False, self._locales
+                sdtype, column_metadata, None, self._locales
             )
 
         kwargs = {
@@ -598,11 +600,11 @@ def _create_config(self, data, columns_created_by_constraints):
 
             elif pii:
                 sdtypes[column] = 'pii'
-                enforce_uniqueness = bool(column in self._keys)
+                cardinality_rule = 'unique' if bool(column in self._keys) else None
                 transformers[column] = self.create_anonymized_transformer(
                     sdtype,
                     column_metadata,
-                    enforce_uniqueness,
+                    cardinality_rule,
                     self._locales
                 )
 
@@ -614,7 +616,7 @@
                 transformers[column] = self.create_anonymized_transformer(
                     sdtype=sdtype,
                     column_metadata=column_metadata,
-                    enforce_uniqueness=True,
+                    cardinality_rule='unique',
                     locales=self._locales
                 )
 
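The boolean `enforce_uniqueness` flag is replaced by a three-valued `cardinality_rule`; the selection seen in `_create_config` boils down to this (helper name hypothetical):

```python
def pick_cardinality_rule(column, keys):
    """Primary/alternate keys must be generated unique; other pii columns get
    no cardinality constraint ('match' is the third accepted value)."""
    return 'unique' if column in keys else None

assert pick_cardinality_rule('user_id', keys={'user_id'}) == 'unique'
assert pick_cardinality_rule('email', keys={'user_id'}) is None
```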
2 changes: 1 addition & 1 deletion sdv/datasets/demo.py
@@ -96,7 +96,7 @@ def _get_data(modality, output_folder_name, in_memory_directory):
     for filename, file_ in in_memory_directory.items():
         if filename.endswith('.csv'):
             table_name = Path(filename).stem
-            data[table_name] = pd.read_csv(io.StringIO(file_.decode()))
+            data[table_name] = pd.read_csv(io.StringIO(file_.decode()), low_memory=False)
 
     if modality != 'multi_table':
         data = data.popitem()[1]
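`DtypeWarning` comes from pandas' chunked, low-memory CSV parsing when a column's inferred type differs between chunks; `low_memory=False` parses each column in a single pass. A small illustration on synthetic data (not the demo files):

```python
import io

import pandas as pd

# A column that is numeric at the top but textual further down is the
# classic trigger for DtypeWarning on large, chunk-parsed files.
csv = 'col\n' + '\n'.join(['1'] * 5 + ['abc'])

df = pd.read_csv(io.StringIO(csv), low_memory=False)
assert df['col'].dtype == object  # mixed values, one consistently inferred dtype
assert df['col'].iloc[-1] == 'abc'
```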
9 changes: 9 additions & 0 deletions sdv/io/local/__init__.py
@@ -0,0 +1,9 @@
+"""Local I/O module."""
+
+from sdv.io.local.local import BaseLocalHandler, CSVHandler, ExcelHandler
+
+__all__ = (
+    'BaseLocalHandler',
+    'CSVHandler',
+    'ExcelHandler'
+)