make release-tag: Merge branch 'main' into stable
amontanez24 committed Jun 13, 2024
2 parents c428c2b + b3e4375 commit 33f2643
Showing 27 changed files with 904 additions and 195 deletions.
52 changes: 52 additions & 0 deletions .github/workflows/release_notes.yml
@@ -0,0 +1,52 @@
name: Release Notes Generator

on:
  workflow_dispatch:
    inputs:
      branch:
        description: 'Branch to merge release notes into.'
        required: true
        default: 'main'
      version:
        description: 'Version to use for the release. Must be in format: X.Y.Z.'
      date:
        description: 'Date of the release. Must be in format YYYY-MM-DD.'

jobs:
  releasenotesgeneration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python 3.10
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install requests==2.31.0
      - name: Generate release notes
        env:
          GH_ACCESS_TOKEN: ${{ secrets.GH_ACCESS_TOKEN }}
        run: >
          python scripts/release_notes_generator.py
          -v ${{ inputs.version }}
          -d ${{ inputs.date }}
      - name: Create pull request
        id: cpr
        uses: peter-evans/create-pull-request@v4
        with:
          token: ${{ secrets.GH_ACCESS_TOKEN }}
          commit-message: Release notes for v${{ inputs.version }}
          author: "github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>"
          committer: "github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>"
          title: v${{ inputs.version }} Release Notes
          body: "This is an auto-generated PR to update the release notes."
          branch: release-notes
          branch-suffix: short-commit-hash
          base: ${{ inputs.branch }}
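
For reference, a `workflow_dispatch` workflow like the one above can also be triggered programmatically. A minimal sketch using the GitHub REST API with `requests`; it assumes the file is saved as `.github/workflows/release_notes.yml` and that `TOKEN` holds a token allowed to dispatch workflows:

```python
import requests

TOKEN = '<access token>'  # placeholder; never hard-code a real token

# POST /repos/{owner}/{repo}/actions/workflows/{workflow_file}/dispatches
url = (
    'https://api.github.com/repos/sdv-dev/SDV'
    '/actions/workflows/release_notes.yml/dispatches'
)
response = requests.post(
    url,
    headers={
        'Authorization': f'Bearer {TOKEN}',
        'Accept': 'application/vnd.github+json',
    },
    json={
        'ref': 'main',
        'inputs': {'branch': 'main', 'version': '1.14.0', 'date': '2024-06-13'},
    },
)
response.raise_for_status()  # the API returns 204 No Content on success
```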
35 changes: 34 additions & 1 deletion HISTORY.md
@@ -1,6 +1,39 @@
# Release Notes

## v1.14.0 - 2024-06-13

This release provides a number of new features. A big one is the ability to fit the `HMASynthesizer` on disconnected schemas! It also enables the `PARSynthesizer` to work with constraints under certain conditions: the `PARSynthesizer` can now handle a constraint as long as the columns it involves are either all context columns or all non-context columns.
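
For illustration, the constraint rule described above might be exercised like this; a minimal sketch in which the metadata, data, and column names are all hypothetical:

```python
from sdv.sequential import PARSynthesizer

# 'nationality' and 'gender' are hypothetical context columns.
synthesizer = PARSynthesizer(metadata, context_columns=['nationality', 'gender'])

# Allowed: every column in the constraint is a context column.
synthesizer.add_constraints(constraints=[{
    'constraint_class': 'FixedCombinations',
    'constraint_parameters': {'column_names': ['nationality', 'gender']},
}])
synthesizer.fit(data)
```

A constraint mixing context and non-context columns would still be rejected.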

Additionally, a `verbose` parameter was added to the `TVAESynthesizer` to get a more detailed progress bar. Also, the `file_path` parameter of the `ExcelHandler.read()` method was renamed to `filepath`, matching the official [SDV docs](https://docs.sdv.dev/sdv/multi-table-data/data-preparation/loading-data/excel#read).
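
A minimal sketch of the new option (the `metadata` and `data` objects are assumed to already exist):

```python
from sdv.single_table import TVAESynthesizer

# verbose=True surfaces a more detailed per-epoch progress bar during training.
synthesizer = TVAESynthesizer(metadata, verbose=True)
synthesizer.fit(data)
```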

### Internal

* Add workflow to generate release notes - Issue [#2050](https://github.com/sdv-dev/SDV/issues/2050) by @amontanez24

### Bugs Fixed

* PARSynthesizer: Duplicate sequence index values when `sequence_length` is higher than real data - Issue [#2031](https://github.com/sdv-dev/SDV/issues/2031) by @lajohn4747
* PARSynthesizer model won't fit if sequence_index is missing - Issue [#1972](https://github.com/sdv-dev/SDV/issues/1972) by @lajohn4747
* `DataProcessor` never gets assigned a `table_name`. - Issue [#1964](https://github.com/sdv-dev/SDV/issues/1964) by @fealho

### New Features

* Rename `file_path` to `filepath` parameter in ExcelHandler - Issue [#2055](https://github.com/sdv-dev/SDV/issues/2055) by @amontanez24
* Enable the ability to run multi table synthesizers on disjointed table schemas - Issue [#2047](https://github.com/sdv-dev/SDV/issues/2047) by @lajohn4747
* Add header to log.csv file - Issue [#2046](https://github.com/sdv-dev/SDV/issues/2046) by @lajohn4747
* If no filepath is provided, do not create a file during `sample` - Issue [#2042](https://github.com/sdv-dev/SDV/issues/2042) by @lajohn4747
* Add verbosity to `TVAESynthesizer` - Issue [#1990](https://github.com/sdv-dev/SDV/issues/1990) by @fealho
* Allow constraints in PARSynthesizer (for all context cols, or all non-context columns) - Issue [#1936](https://github.com/sdv-dev/SDV/issues/1936) by @lajohn4747
* Improve error message when sampling on a non-CPU device - Issue [#1819](https://github.com/sdv-dev/SDV/issues/1819) by @fealho
* Better data validation message for `auto_assign_transformers` - Issue [#1509](https://github.com/sdv-dev/SDV/issues/1509) by @lajohn4747

### Miscellaneous

* Do not enforce min/max on sequence index column - Issue [#2043](https://github.com/sdv-dev/SDV/pull/2043)
* Include validation check for single table auto_assign_transformers - Issue [#2021](https://github.com/sdv-dev/SDV/pull/2021)
* Add the dummy context column to metadata and not to extra_context_column - Issue [#2019](https://github.com/sdv-dev/SDV/pull/2019)

## 1.13.1 - 2024-05-16

This release fixes the `ModuleNotFoundError` that caused the 1.13.0 release to fail.

20 changes: 12 additions & 8 deletions README.md
@@ -94,12 +94,12 @@ column and the primary key (`guest_email`).
## Synthesizing Data
Next, we can create an **SDV synthesizer**, an object that you can use to create synthetic data.
It learns patterns from the real data and replicates them to generate synthetic data. Let's use
-the `FAST_ML` preset synthesizer, which is optimized for performance.
+the [GaussianCopulaSynthesizer](https://docs.sdv.dev/sdv/single-table-data/modeling/synthesizers/gaussiancopulasynthesizer).

```python
-from sdv.lite import SingleTablePreset
+from sdv.single_table import GaussianCopulaSynthesizer

-synthesizer = SingleTablePreset(metadata, name='FAST_ML')
+synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=real_data)
```
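
The fitted synthesizer can then generate new rows; a minimal sketch (the row count is arbitrary):

```python
synthetic_data = synthesizer.sample(num_rows=500)
```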

@@ -131,11 +131,15 @@ quality_report = evaluate_quality(
```

```
-Creating report: 100%|██████████| 4/4 [00:00<00:00, 19.30it/s]
-Overall Quality Score: 89.12%
-Properties:
-Column Shapes: 90.27%
-Column Pair Trends: 87.97%
+Generating report ...
+(1/2) Evaluating Column Shapes: |████████████████| 9/9 [00:00<00:00, 1133.09it/s]|
+Column Shapes Score: 89.11%
+(2/2) Evaluating Column Pair Trends: |██████████████████████████████████████████| 36/36 [00:00<00:00, 502.88it/s]|
+Column Pair Trends Score: 88.3%
+Overall Score (Average): 88.7%
```

This object computes an overall quality score on a scale of 0 to 100% (100 being the best) as well
6 changes: 3 additions & 3 deletions latest_requirements.txt
@@ -1,11 +1,11 @@
cloudpickle==3.0.0
copulas==0.11.0
-ctgan==0.10.0
+ctgan==0.10.1
deepecho==0.6.0
graphviz==0.20.3
numpy==1.26.4
pandas==2.2.2
-platformdirs==4.2.1
+platformdirs==4.2.2
rdt==1.12.1
-sdmetrics==0.14.0
+sdmetrics==0.14.1
tqdm==4.66.4
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -158,7 +158,7 @@ namespaces = false
version = {attr = 'sdv.__version__'}

[tool.bumpversion]
current_version = "1.13.1"
current_version = "1.14.0.dev1"
parse = '(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(\.(?P<release>[a-z]+)(?P<candidate>\d+))?'
serialize = [
'{major}.{minor}.{patch}.{release}{candidate}',
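
As a quick illustration, the `parse` pattern above accepts both plain releases and pre-releases such as the new `1.14.0.dev1`; a small check using Python's `re` module:

```python
import re

# Same pattern as the [tool.bumpversion] 'parse' setting above.
PARSE = (
    r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)'
    r'(\.(?P<release>[a-z]+)(?P<candidate>\d+))?'
)
match = re.fullmatch(PARSE, '1.14.0.dev1')
print(match.group('major'), match.group('release'), match.group('candidate'))
# -> 1 dev 1
```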
153 changes: 153 additions & 0 deletions scripts/release_notes_generator.py
@@ -0,0 +1,153 @@
"""Script to generate release notes."""

import argparse
import os
from collections import defaultdict

import requests

LABEL_TO_HEADER = {
'feature request': 'New Features',
'bug': 'Bugs Fixed',
'internal': 'Internal',
'maintenance': 'Maintenance',
'customer success': 'Customer Success',
'documentation': 'Documentation',
'misc': 'Miscellaneous'
}
ISSUE_LABELS = [
'documentation',
'maintenance',
'internal',
'bug',
'feature request',
'customer success'
]
NEW_LINE = '\n'
GITHUB_URL = 'https://api.github.com/repos/sdv-dev/sdv'
GITHUB_TOKEN = os.getenv('GH_ACCESS_TOKEN')


def _get_milestone_number(milestone_title):
url = f'{GITHUB_URL}/milestones'
headers = {
'Authorization': f'Bearer {GITHUB_TOKEN}'
}
query_params = {
'milestone': milestone_title,
'state': 'all',
'per_page': 100
}
response = requests.get(url, headers=headers, params=query_params)
body = response.json()
if response.status_code != 200:
raise Exception(str(body))

milestones = body
for milestone in milestones:
if milestone.get('title') == milestone_title:
return milestone.get('number')

raise ValueError(f'Milestone {milestone_title} not found in past 100 milestones.')


def _get_issues_by_milestone(milestone):
    headers = {
        'Authorization': f'Bearer {GITHUB_TOKEN}'
    }
    # get milestone number
    milestone_number = _get_milestone_number(milestone)
    url = f'{GITHUB_URL}/issues'
    page = 1
    query_params = {
        'milestone': milestone_number,
        'state': 'all'
    }
    issues = []
    while True:
        query_params['page'] = page
        response = requests.get(url, headers=headers, params=query_params)
        body = response.json()
        if response.status_code != 200:
            raise Exception(str(body))

        issues_on_page = body
        if not issues_on_page:
            break

        issues.extend(issues_on_page)
        page += 1

    return issues


def _get_issues_by_category(release_issues):
    category_to_issues = defaultdict(list)

    for issue in release_issues:
        issue_title = issue['title']
        issue_number = issue['number']
        issue_url = issue['html_url']
        line = f'* {issue_title} - Issue [#{issue_number}]({issue_url})'
        assignee = issue.get('assignee')
        if assignee:
            login = assignee['login']
            line += f' by @{login}'

        # Check if any known label is marked on the issue
        labels = [label['name'] for label in issue['labels']]
        found_category = False
        for category in ISSUE_LABELS:
            if category in labels:
                category_to_issues[category].append(line)
                found_category = True
                break

        if not found_category:
            category_to_issues['misc'].append(line)

    return category_to_issues


def _create_release_notes(issues_by_category, version, date):
    title = f'## v{version} - {date}'
    release_notes = f'{title}{NEW_LINE}{NEW_LINE}'

    for category in ISSUE_LABELS + ['misc']:
        issues = issues_by_category.get(category)
        if issues:
            section_text = (
                f'### {LABEL_TO_HEADER[category]}{NEW_LINE}{NEW_LINE}'
                f'{NEW_LINE.join(issues)}{NEW_LINE}{NEW_LINE}'
            )

            release_notes += section_text

    return release_notes


def update_release_notes(release_notes):
    """Add the release notes for the new release to the ``HISTORY.md``."""
    file_path = 'HISTORY.md'
    with open(file_path, 'r') as history_file:
        history = history_file.read()

    # Splice the new notes in immediately after the '# Release Notes' header.
    token = '# Release Notes\n\n'
    split_index = history.find(token) + len(token)
    header = history[:split_index]
    new_notes = f'{header}{release_notes}{history[split_index:]}'

    with open(file_path, 'w') as new_history_file:
        new_history_file.write(new_notes)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-v', '--version', type=str, help='Release version number (e.g. 1.0.1)')
    parser.add_argument('-d', '--date', type=str, help='Date of release in format YYYY-MM-DD')
    args = parser.parse_args()
    release_number = args.version
    release_issues = _get_issues_by_milestone(release_number)
    issues_by_category = _get_issues_by_category(release_issues)
    release_notes = _create_release_notes(issues_by_category, release_number, args.date)
    update_release_notes(release_notes)
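
To make the flow concrete, here is a hedged sketch of driving the helpers above directly; it assumes `GH_ACCESS_TOKEN` is set in the environment and that a `1.14.0` milestone exists, and the issue titles, numbers, and assignee in the comment are placeholders:

```python
# Equivalent to: python scripts/release_notes_generator.py -v 1.14.0 -d 2024-06-13
issues = _get_issues_by_milestone('1.14.0')
by_category = _get_issues_by_category(issues)
print(_create_release_notes(by_category, '1.14.0', '2024-06-13'))
# ## v1.14.0 - 2024-06-13
#
# ### Bugs Fixed
#
# * <issue title> - Issue [#<number>](<html_url>) by @<assignee>
# ...
```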
2 changes: 1 addition & 1 deletion sdv/__init__.py
@@ -6,7 +6,7 @@

__author__ = 'DataCebo, Inc.'
__email__ = 'info@sdv.dev'
-__version__ = '1.13.1'
+__version__ = '1.14.0.dev1'


import sys
12 changes: 6 additions & 6 deletions sdv/io/local/local.py
@@ -192,16 +192,16 @@ def write(self, synthetic_data, folder_name, file_name_suffix=None, mode='x'):
class ExcelHandler(BaseLocalHandler):
    """A class for handling Excel files."""

-    def _read_excel(self, file_path, sheet_names=None):
+    def _read_excel(self, filepath, sheet_names=None):
        """Read data from Excel File and return just the data as a dictionary."""
        data = {}
        if sheet_names is None:
-            xl_file = pd.ExcelFile(file_path)
+            xl_file = pd.ExcelFile(filepath)
            sheet_names = xl_file.sheet_names

        for sheet_name in sheet_names:
            data[sheet_name] = pd.read_excel(
-                file_path,
+                filepath,
                sheet_name=sheet_name,
                parse_dates=False,
                decimal=self.decimal,
@@ -210,11 +210,11 @@ def _read_excel(self, file_path, sheet_names=None):

        return data

-    def read(self, file_path, sheet_names=None):
+    def read(self, filepath, sheet_names=None):
        """Read data from Excel files and return it along with metadata.

        Args:
-            file_path (str):
+            filepath (str):
                The path to the Excel file to read.
            sheet_names (list of str, optional):
                The names of sheets to read. If None, all sheets are read.
@@ -226,7 +226,7 @@ def read(self, file_path, sheet_names=None):
        if sheet_names is not None and not isinstance(sheet_names, list):
            raise ValueError("'sheet_names' must be None or a list of strings.")

-        return self._read_excel(file_path, sheet_names)
+        return self._read_excel(filepath, sheet_names)

    def write(self, synthetic_data, file_name, sheet_name_suffix=None, mode='w'):
        """Write synthetic data to an Excel File.
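
A brief usage sketch of the renamed keyword (the file name and sheet name below are placeholders):

```python
from sdv.io.local import ExcelHandler

handler = ExcelHandler()
data = handler.read(filepath='hotel_data.xlsx', sheet_names=['guests'])
# 'sheet_names' must be None or a list of strings, per the validation above.
```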
6 changes: 2 additions & 4 deletions sdv/lite/single_table.py
@@ -136,8 +136,7 @@ def sample_from_conditions(self, conditions, max_tries_per_batch=100,
                The batch size to use per attempt at sampling. Defaults to 10 times
                the number of rows.
            output_file_path (str or None):
-                The file to periodically write sampled rows to. Defaults to
-                a temporary file, if None.
+                The file to periodically write sampled rows to. Defaults to None.

        Returns:
            pandas.DataFrame:
@@ -168,8 +167,7 @@ def sample_remaining_columns(self, known_columns, max_tries_per_batch=100,
                The batch size to use per attempt at sampling. Defaults to 10 times
                the number of rows.
            output_file_path (str or None):
-                The file to periodically write sampled rows to. Defaults to
-                a temporary file, if None.
+                The file to periodically write sampled rows to. Defaults to None.

        Returns:
            pandas.DataFrame: