make release-tag: Merge branch 'main' into stable
amontanez24 committed May 15, 2024
2 parents fd67bd3 + ee9a8bb commit 0bc2a9b
Showing 60 changed files with 4,950 additions and 733 deletions.
7 changes: 6 additions & 1 deletion .github/workflows/integration.yml
@@ -11,7 +11,12 @@ jobs:
     strategy:
       matrix:
         python-version: [ '3.8', '3.9', '3.10', '3.11', '3.12']
-        os: [ubuntu-latest, macos-latest, windows-latest]
+        os: [ubuntu-latest, windows-latest]
+        include:
+          - os: macos-latest
+            python-version: '3.8'
+          - os: macos-latest
+            python-version: '3.12'
     steps:
       - uses: actions/checkout@v4
       - name: Set up Python ${{ matrix.python-version }}
7 changes: 6 additions & 1 deletion .github/workflows/minimum.yml
@@ -11,7 +11,12 @@ jobs:
     strategy:
       matrix:
         python-version: [ '3.8', '3.9', '3.10', '3.11', '3.12']
-        os: [ubuntu-latest, macos-latest, windows-latest]
+        os: [ubuntu-latest, windows-latest]
+        include:
+          - os: macos-latest
+            python-version: '3.8'
+          - os: macos-latest
+            python-version: '3.12'
     steps:
       - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
7 changes: 6 additions & 1 deletion .github/workflows/unit.yml
@@ -11,7 +11,12 @@ jobs:
     strategy:
       matrix:
         python-version: [ '3.8', '3.9', '3.10', '3.11', '3.12']
-        os: [ubuntu-latest, macos-latest, windows-latest]
+        os: [ubuntu-latest, windows-latest]
+        include:
+          - os: macos-latest
+            python-version: '3.8'
+          - os: macos-latest
+            python-version: '3.12'
     steps:
       - uses: actions/checkout@v4
       - name: Set up Python ${{ matrix.python-version }}
36 changes: 36 additions & 0 deletions HISTORY.md
@@ -1,5 +1,41 @@
# Release Notes

## 1.13.0 - 2024-05-15

This release adds a utility function called `get_random_subset` that helps users get a subset of their multi-table data so that modeling can be done more quickly. Given a dictionary of table names mapped to DataFrames, the metadata, a main table, and a desired number of rows for the main table, it subsamples the data in a way that maintains referential integrity.
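The subsampling idea behind `get_random_subset` can be sketched with plain pandas (hypothetical `users`/`orders` tables; this illustrates the referential-integrity rule, not SDV's implementation):

```python
import pandas as pd

# Hypothetical parent/child tables: orders.user_id references users.user_id.
users = pd.DataFrame({'user_id': range(10), 'name': [f'user_{i}' for i in range(10)]})
orders = pd.DataFrame({'order_id': range(30), 'user_id': [i % 10 for i in range(30)]})

# Subsample the main table first, then keep only the child rows whose
# foreign keys still point at a sampled parent row.
users_subset = users.sample(n=4, random_state=0)
orders_subset = orders[orders['user_id'].isin(users_subset['user_id'])]

# Referential integrity holds: every remaining order still has its user.
assert orders_subset['user_id'].isin(users_subset['user_id']).all()
```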

This release also adds two new local file handlers: the `CSVHandler` and the `ExcelHandler`. These enable users to easily load data from, and save synthetic data to, these file types. The handlers return data and metadata in the multi-table format, so we also added the function `get_table_metadata` to get a `SingleTableMetadata` object from a `MultiTableMetadata` object.
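The handlers return data as a dictionary mapping table names to DataFrames. A minimal sketch of that loading pattern with plain pandas (the real `CSVHandler` API may differ):

```python
import tempfile
from pathlib import Path

import pandas as pd

def read_csv_folder(folder):
    """Load every CSV file in `folder` into a {table_name: DataFrame} dict."""
    return {path.stem: pd.read_csv(path) for path in Path(folder).glob('*.csv')}

# Demo against a temporary folder holding two hypothetical tables.
folder = Path(tempfile.mkdtemp())
pd.DataFrame({'user_id': [1, 2]}).to_csv(folder / 'users.csv', index=False)
pd.DataFrame({'order_id': [10]}).to_csv(folder / 'orders.csv', index=False)

data = read_csv_folder(folder)
assert set(data) == {'users', 'orders'}
```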

Finally, this release fixes some bugs that prevented synthesizers from working with data that had numerical column names.

### New Features

* Add `get_random_subset` poc utility function - Issue [#1877](https://github.com/sdv-dev/SDV/issues/1877) by @R-Palazzo
* Add usage logging - Issue [#1903](https://github.com/sdv-dev/SDV/issues/1903) by @pvk-developer
* Move function `drop_unknown_references` from `poc` to be directly under `utils` - Issue [#1947](https://github.com/sdv-dev/SDV/issues/1947) by @R-Palazzo
* Add CSVHandler - Issue [#1949](https://github.com/sdv-dev/SDV/issues/1949) by @pvk-developer
* Add ExcelHandler - Issue [#1950](https://github.com/sdv-dev/SDV/issues/1950) by @pvk-developer
* Add get_table_metadata function - Issue [#1951](https://github.com/sdv-dev/SDV/issues/1951) by @R-Palazzo
* Save usage log file as a csv - Issue [#1974](https://github.com/sdv-dev/SDV/issues/1974) by @frances-h
* Split out metadata creation from data import in the local files handlers - Issue [#1975](https://github.com/sdv-dev/SDV/issues/1975) by @pvk-developer
* Improve error message when trying to sample before fitting (single table) - Issue [#1978](https://github.com/sdv-dev/SDV/issues/1978) by @R-Palazzo

### Bugs Fixed

* Metadata detection crashes when the column names are integers (`AttributeError: 'int' object has no attribute 'lower'`) - Issue [#1933](https://github.com/sdv-dev/SDV/issues/1933) by @lajohn4747
* Synthesizers crash when column names are integers (`TypeError: unsupported operand`) - Issue [#1935](https://github.com/sdv-dev/SDV/issues/1935) by @lajohn4747
* Switch parameter order in drop_unknown_references - Issue [#1944](https://github.com/sdv-dev/SDV/issues/1944) by @R-Palazzo
* Unexpected NaN values in sequence_index when dataframe isn't reset - Issue [#1973](https://github.com/sdv-dev/SDV/issues/1973) by @fealho
* Fix pandas DtypeWarning in download_demo - Issue [#1980](https://github.com/sdv-dev/SDV/issues/1980) by @fealho

### Maintenance

* Only run unit and integration tests on oldest and latest python versions for macos - Issue [#1948](https://github.com/sdv-dev/SDV/issues/1948) by @frances-h

### Internal

* Update code to remove `FutureWarning` related to 'enforce_uniqueness' parameter - Issue [#1995](https://github.com/sdv-dev/SDV/issues/1995) by @pvk-developer

## 1.12.1 - 2024-04-19

This release makes a number of changes to how id columns are generated. By default, id columns with a regex will now have their values scrambled in the output. Id columns without a regex will be created randomly if they are numeric; if they're not numeric, they will receive a random suffix.
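The random-value and random-suffix branches described above might look roughly like this (an illustrative sketch, not SDV's code; `make_ids` is a hypothetical helper):

```python
import random

def make_ids(sample_value, num_rows, seed=0):
    """Numeric id values are drawn randomly; non-numeric ones get a random suffix."""
    rng = random.Random(seed)
    if str(sample_value).isdigit():
        return [rng.randrange(10**6) for _ in range(num_rows)]
    return [f'{sample_value}_{rng.randrange(10**6)}' for _ in range(num_rows)]

assert all(isinstance(i, int) for i in make_ids('123', 3))
assert all(i.startswith('cust_') for i in make_ids('cust', 3))
```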
12 changes: 6 additions & 6 deletions Makefile
@@ -123,12 +123,8 @@ test-integration: ## run tests quickly with the default Python
 test-readme: ## run the readme snippets
 	invoke readme
 
-.PHONY: test-tutorials
-test-tutorials: ## run the tutorial notebooks
-	invoke tutorials
-
 .PHONY: test
-test: test-unit test-integration test-readme test-tutorials ## test everything that needs test dependencies
+test: test-unit test-integration test-readme ## test everything that needs test dependencies
 
 .PHONY: test-all
 test-all: ## run tests on every Python version with tox
@@ -239,6 +235,10 @@ ifeq ($(CHANGELOG_LINES),0)
 	$(error Please insert the release notes in HISTORY.md before releasing)
 endif
 
+.PHONY: git-push
+git-push: ## Simply push the repository to github
+	git push
+
 .PHONY: check-release
 check-release: check-clean check-main check-history ## Check if the release can be made
 	@echo "A new release can be made"
@@ -265,5 +265,5 @@ release-major: check-release bumpversion-major release
 
 .PHONY: check-deps
 check-deps:
-	$(eval allow_list='cloudpickle=|graphviz=|numpy=|pandas=|tqdm=|copulas=|ctgan=|deepecho=|rdt=|sdmetrics=')
+	$(eval allow_list='cloudpickle=|graphviz=|numpy=|pandas=|tqdm=|copulas=|ctgan=|deepecho=|rdt=|sdmetrics=|platformdirs=')
 	pip freeze | grep -v "SDV.git" | grep -E $(allow_list) | sort > $(OUTPUT_FILEPATH)
5 changes: 3 additions & 2 deletions latest_requirements.txt
@@ -5,6 +5,7 @@ deepecho==0.6.0
 graphviz==0.20.3
 numpy==1.26.4
 pandas==2.2.2
-rdt==1.11.1
+platformdirs==4.2.1
+rdt==1.12.1
 sdmetrics==0.14.0
-tqdm==4.66.2
+tqdm==4.66.4
13 changes: 8 additions & 5 deletions pyproject.toml
@@ -25,11 +25,10 @@ dependencies = [
     'botocore>=1.31',
     'cloudpickle>=2.1.0',
     'graphviz>=0.13.2',
-    "numpy>=1.20.0;python_version<'3.10'",
+    "numpy>=1.21.0;python_version<'3.10'",
     "numpy>=1.23.3,<2;python_version>='3.10' and python_version<'3.12'",
     "numpy>=1.26.0,<2;python_version>='3.12'",
-    "pandas>=1.1.3;python_version<'3.10'",
-    "pandas>=1.3.4;python_version>='3.10' and python_version<'3.11'",
+    "pandas>=1.4.0;python_version<'3.11'",
     "pandas>=1.5.0;python_version>='3.11' and python_version<'3.12'",
     "pandas>=2.1.1;python_version>='3.12'",
     'tqdm>=4.29',
@@ -38,6 +37,7 @@ dependencies = [
     'deepecho>=0.6.0',
     'rdt>=1.12.0',
     'sdmetrics>=0.14.0',
+    'platformdirs>=4.0',
 ]
 
 [project.urls]
@@ -51,7 +51,9 @@ dependencies = [
 sdv = { main = 'sdv.cli.__main__:main' }
 
 [project.optional-dependencies]
+excel = ['pandas[excel]']
 test = [
+    'sdv[excel]',
     'pytest>=3.4.2',
     'pytest-cov>=2.6.0',
     'pytest-rerunfailures>=10.3,<15',
@@ -140,7 +142,8 @@ namespaces = false
     'make.bat',
     '*.jpg',
     '*.png',
-    '*.gif'
+    '*.gif',
+    'sdv_logger_config.yml'
 ]
 
 [tool.setuptools.exclude-package-data]
@@ -154,7 +157,7 @@ namespaces = false
 version = {attr = 'sdv.__version__'}
 
 [tool.bumpversion]
-current_version = "1.12.1"
+current_version = "1.13.0.dev1"
 parse = '(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(\.(?P<release>[a-z]+)(?P<candidate>\d+))?'
 serialize = [
     '{major}.{minor}.{patch}.{release}{candidate}',
9 changes: 5 additions & 4 deletions sdv/__init__.py
@@ -6,7 +6,7 @@
 
 __author__ = 'DataCebo, Inc.'
 __email__ = 'info@sdv.dev'
-__version__ = '1.12.1'
+__version__ = '1.13.0.dev1'
 
 
 import sys
@@ -16,8 +16,8 @@
 from types import ModuleType
 
 from sdv import (
-    constraints, data_processing, datasets, evaluation, io, lite, metadata, metrics, multi_table,
-    sampling, sequential, single_table, version)
+    constraints, data_processing, datasets, evaluation, io, lite, logging, metadata, metrics,
+    multi_table, sampling, sequential, single_table, version)
 
 __all__ = [
     'constraints',
@@ -26,6 +26,7 @@
     'evaluation',
     'io',
     'lite',
+    'logging',
     'metadata',
     'metrics',
     'multi_table',
@@ -94,7 +95,7 @@ def _find_addons():
            addon = entry_point.load()
        except Exception as e:  # pylint: disable=broad-exception-caught
            msg = (
-                f'Failed to load "{entry_point.name}" from "{entry_point.version}" '
+                f'Failed to load "{entry_point.name}" from "{entry_point.value}" '
                f'with error:\n{e}'
            )
            warnings.warn(msg)
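The `entry_point.version` → `entry_point.value` fix matters because `importlib.metadata.EntryPoint` carries `name`, `value`, and `group` (with `module` and `attr` parsed out of `value`); the entry point below is hypothetical:

```python
from importlib.metadata import EntryPoint

ep = EntryPoint(name='my_addon', value='my_pkg.addons:plugin', group='sdv_modules')
assert ep.value == 'my_pkg.addons:plugin'  # the loadable reference used in the message
assert ep.module == 'my_pkg.addons'
assert ep.attr == 'plugin'
```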
2 changes: 2 additions & 0 deletions sdv/_utils.py
@@ -214,6 +214,8 @@ def _validate_foreign_keys_not_null(metadata, data):
     invalid_tables = defaultdict(list)
     for table_name, table_data in data.items():
         for foreign_key in metadata._get_all_foreign_keys(table_name):
+            if foreign_key not in table_data and int(foreign_key) in table_data:
+                foreign_key = int(foreign_key)
             if table_data[foreign_key].isna().any():
                 invalid_tables[table_name].append(foreign_key)
 
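The two added lines guard against DataFrames whose column labels are integers while the metadata stores the foreign key name as a string; the mismatch is easy to reproduce:

```python
import pandas as pd

df = pd.DataFrame({0: [1, 2], 1: [3, None]})  # integer column labels

foreign_key = '1'             # metadata typically stores column names as strings
assert foreign_key not in df  # membership checks columns: the string misses it
assert int(foreign_key) in df

# Coercing the key, as the fix does, reaches the column and finds the null.
assert df[int(foreign_key)].isna().any()
```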
26 changes: 14 additions & 12 deletions sdv/data_processing/data_processor.py
@@ -412,7 +412,7 @@ def _update_transformers_by_sdtypes(self, sdtype, transformer):
         self._transformers_by_sdtype[sdtype] = transformer
 
     @staticmethod
-    def create_anonymized_transformer(sdtype, column_metadata, enforce_uniqueness,
+    def create_anonymized_transformer(sdtype, column_metadata, cardinality_rule,
                                       locales=['en_US']):
         """Create an instance of an ``AnonymizedFaker``.
@@ -424,24 +424,26 @@ def create_anonymized_transformer(sdtype, column_metadata, enforce_uniqueness,
                 Semantic data type or a ``Faker`` function name.
             column_metadata (dict):
                 A dictionary representing the rest of the metadata for the given ``sdtype``.
-            enforce_uniqueness (bool):
-                If ``True`` overwrite ``enforce_uniqueness`` with ``True`` to ensure unique
-                generation for primary keys.
+            cardinality_rule (str):
+                If ``'unique'`` enforce that every created value is unique.
+                If ``'match'`` match the cardinality of the data seen during fit.
+                If ``None`` do not consider cardinality.
+                Defaults to ``None``.
             locales (str or list):
                 Locale or list of locales to use for the AnonymizedFaker transformer.
                 Defaults to ['en_US'].
 
         Returns:
             Instance of ``rdt.transformers.pii.AnonymizedFaker``.
         """
-        kwargs = {'locales': locales}
+        kwargs = {
+            'locales': locales,
+            'cardinality_rule': cardinality_rule
+        }
         for key, value in column_metadata.items():
             if key not in ['pii', 'sdtype']:
                 kwargs[key] = value
 
-        if enforce_uniqueness:
-            kwargs['enforce_uniqueness'] = True
-
         try:
             transformer = get_anonymized_transformer(sdtype, kwargs)
         except AttributeError as error:
@@ -494,7 +496,7 @@ def _get_transformer_instance(self, sdtype, column_metadata):
         is_baseprovider = transformer.provider_name == 'BaseProvider'
         if is_lexify and is_baseprovider:  # Default settings
             return self.create_anonymized_transformer(
-                sdtype, column_metadata, False, self._locales
+                sdtype, column_metadata, None, self._locales
             )
 
         kwargs = {
@@ -598,11 +600,11 @@ def _create_config(self, data, columns_created_by_constraints):
 
             elif pii:
                 sdtypes[column] = 'pii'
-                enforce_uniqueness = bool(column in self._keys)
+                cardinality_rule = 'unique' if bool(column in self._keys) else None
                 transformers[column] = self.create_anonymized_transformer(
                     sdtype,
                     column_metadata,
-                    enforce_uniqueness,
+                    cardinality_rule,
                     self._locales
                 )
 
@@ -614,7 +616,7 @@
                 transformers[column] = self.create_anonymized_transformer(
                     sdtype=sdtype,
                     column_metadata=column_metadata,
-                    enforce_uniqueness=True,
+                    cardinality_rule='unique',
                     locales=self._locales
                 )
 
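The boolean `enforce_uniqueness` flag is replaced by a three-valued `cardinality_rule`; the selection seen in `_create_config` boils down to this (helper name hypothetical):

```python
def pick_cardinality_rule(column, keys):
    """Primary/alternate keys must be generated unique; other pii columns get
    no cardinality constraint ('match' is the third accepted value)."""
    return 'unique' if column in keys else None

assert pick_cardinality_rule('user_id', keys={'user_id'}) == 'unique'
assert pick_cardinality_rule('email', keys={'user_id'}) is None
```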
2 changes: 1 addition & 1 deletion sdv/datasets/demo.py
@@ -96,7 +96,7 @@ def _get_data(modality, output_folder_name, in_memory_directory):
     for filename, file_ in in_memory_directory.items():
         if filename.endswith('.csv'):
             table_name = Path(filename).stem
-            data[table_name] = pd.read_csv(io.StringIO(file_.decode()))
+            data[table_name] = pd.read_csv(io.StringIO(file_.decode()), low_memory=False)
 
     if modality != 'multi_table':
         data = data.popitem()[1]
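`DtypeWarning` comes from pandas' chunked, low-memory CSV parsing when a column's inferred type differs between chunks; `low_memory=False` parses each column in a single pass. A small illustration on synthetic data (not the demo files):

```python
import io

import pandas as pd

# A column that is numeric at the top but textual further down is the
# classic trigger for DtypeWarning on large, chunk-parsed files.
csv = 'col\n' + '\n'.join(['1'] * 5 + ['abc'])

df = pd.read_csv(io.StringIO(csv), low_memory=False)
assert df['col'].dtype == object  # mixed values, one consistently inferred dtype
assert df['col'].iloc[-1] == 'abc'
```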
9 changes: 9 additions & 0 deletions sdv/io/local/__init__.py
@@ -0,0 +1,9 @@
+"""Local I/O module."""
+
+from sdv.io.local.local import BaseLocalHandler, CSVHandler, ExcelHandler
+
+__all__ = (
+    'BaseLocalHandler',
+    'CSVHandler',
+    'ExcelHandler'
+)