make release-tag: Merge branch 'main' into stable
amontanez24 committed Jun 13, 2024
2 parents c428c2b + b3e4375 commit 33f2643
Showing 27 changed files with 904 additions and 195 deletions.
52 changes: 52 additions & 0 deletions .github/workflows/release_notes.yml
@@ -0,0 +1,52 @@
name: Release Notes Generator

on:
  workflow_dispatch:
    inputs:
      branch:
        description: 'Branch to merge release notes into.'
        required: true
        default: 'main'
      version:
        description: 'Version to use for the release. Must be in format: X.Y.Z.'
      date:
        description: 'Date of the release. Must be in format YYYY-MM-DD.'

jobs:
  releasenotesgeneration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python 3.10
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install requests==2.31.0
      - name: Generate release notes
        env:
          GH_ACCESS_TOKEN: ${{ secrets.GH_ACCESS_TOKEN }}
        run: >
          python scripts/release_notes_generator.py
          -v ${{ inputs.version }}
          -d ${{ inputs.date }}
      - name: Create pull request
        id: cpr
        uses: peter-evans/create-pull-request@v4
        with:
          token: ${{ secrets.GH_ACCESS_TOKEN }}
          commit-message: Release notes for v${{ inputs.version }}
          author: "github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>"
          committer: "github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>"
          title: v${{ inputs.version }} Release Notes
          body: "This is an auto-generated PR to update the release notes."
          branch: release-notes
          branch-suffix: short-commit-hash
          base: ${{ inputs.branch }}
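
For reference, a `workflow_dispatch` workflow like the one above can also be triggered programmatically. A minimal sketch using the GitHub REST API with `requests`; it assumes the file is saved as `.github/workflows/release_notes.yml` and that `TOKEN` holds a token allowed to dispatch workflows:

```python
import requests

TOKEN = '<access token>'  # placeholder; never hard-code a real token

# POST /repos/{owner}/{repo}/actions/workflows/{workflow_file}/dispatches
url = (
    'https://api.github.com/repos/sdv-dev/SDV'
    '/actions/workflows/release_notes.yml/dispatches'
)
response = requests.post(
    url,
    headers={
        'Authorization': f'Bearer {TOKEN}',
        'Accept': 'application/vnd.github+json',
    },
    json={
        'ref': 'main',
        'inputs': {'branch': 'main', 'version': '1.14.0', 'date': '2024-06-13'},
    },
)
response.raise_for_status()  # the API returns 204 No Content on success
```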
35 changes: 34 additions & 1 deletion HISTORY.md
@@ -1,6 +1,39 @@
# Release Notes

## v1.14.0 - 2024-06-13

This release provides a number of new features. A big one is the ability to fit the `HMASynthesizer` on disconnected schemas! It also enables the `PARSynthesizer` to work with constraints under certain conditions: the `PARSynthesizer` can now handle a constraint as long as the columns it involves are either all context columns or all non-context columns.
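
For illustration, the constraint rule described above might be exercised like this; a minimal sketch in which the metadata, data, and column names are all hypothetical:

```python
from sdv.sequential import PARSynthesizer

# 'nationality' and 'gender' are hypothetical context columns.
synthesizer = PARSynthesizer(metadata, context_columns=['nationality', 'gender'])

# Allowed: every column in the constraint is a context column.
synthesizer.add_constraints(constraints=[{
    'constraint_class': 'FixedCombinations',
    'constraint_parameters': {'column_names': ['nationality', 'gender']},
}])
synthesizer.fit(data)
```

A constraint mixing context and non-context columns would still be rejected.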

Additionally, a `verbose` parameter was added to the `TVAESynthesizer` to get a more detailed progress bar. Also, the `file_path` parameter of the `ExcelHandler.read()` method was renamed to `filepath`, matching the official [SDV docs](https://docs.sdv.dev/sdv/multi-table-data/data-preparation/loading-data/excel#read).
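
A minimal sketch of the new option (the `metadata` and `data` objects are assumed to already exist):

```python
from sdv.single_table import TVAESynthesizer

# verbose=True surfaces a more detailed per-epoch progress bar during training.
synthesizer = TVAESynthesizer(metadata, verbose=True)
synthesizer.fit(data)
```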

### Internal

* Add workflow to generate release notes - Issue [#2050](https://github.com/sdv-dev/SDV/issues/2050) by @amontanez24

### Bugs Fixed

* PARSynthesizer: Duplicate sequence index values when `sequence_length` is higher than real data - Issue [#2031](https://github.com/sdv-dev/SDV/issues/2031) by @lajohn4747
* PARSynthesizer model won't fit if sequence_index is missing - Issue [#1972](https://github.com/sdv-dev/SDV/issues/1972) by @lajohn4747
* `DataProcessor` never gets assigned a `table_name`. - Issue [#1964](https://github.com/sdv-dev/SDV/issues/1964) by @fealho

### New Features

* Rename `file_path` to `filepath` parameter in ExcelHandler - Issue [#2055](https://github.com/sdv-dev/SDV/issues/2055) by @amontanez24
* Enable the ability to run multi table synthesizers on disjointed table schemas - Issue [#2047](https://github.com/sdv-dev/SDV/issues/2047) by @lajohn4747
* Add header to log.csv file - Issue [#2046](https://github.com/sdv-dev/SDV/issues/2046) by @lajohn4747
* If no filepath is provided, do not create a file during `sample` - Issue [#2042](https://github.com/sdv-dev/SDV/issues/2042) by @lajohn4747
* Add verbosity to `TVAESynthesizer` - Issue [#1990](https://github.com/sdv-dev/SDV/issues/1990) by @fealho
* Allow constraints in PARSynthesizer (for all context cols, or all non-context columns) - Issue [#1936](https://github.com/sdv-dev/SDV/issues/1936) by @lajohn4747
* Improve error message when sampling on a non-CPU device - Issue [#1819](https://github.com/sdv-dev/SDV/issues/1819) by @fealho
* Better data validation message for `auto_assign_transformers` - Issue [#1509](https://github.com/sdv-dev/SDV/issues/1509) by @lajohn4747

### Miscellaneous

* Do not enforce min/max on sequence index column - Issue [#2043](https://github.com/sdv-dev/SDV/pull/2043)
* Include validation check for single table auto_assign_transformers - Issue [#2021](https://github.com/sdv-dev/SDV/pull/2021)
* Add the dummy context column to metadata and not to extra_context_column - Issue [#2019](https://github.com/sdv-dev/SDV/pull/2019)

## 1.13.1 - 2024-05-16

This release fixes the `ModuleNotFoundError` that caused the 1.13.0 release to fail.

20 changes: 12 additions & 8 deletions README.md
@@ -94,12 +94,12 @@ column and the primary key (`guest_email`).
## Synthesizing Data
Next, we can create an **SDV synthesizer**, an object that you can use to create synthetic data.
It learns patterns from the real data and replicates them to generate synthetic data. Let's use
-the `FAST_ML` preset synthesizer, which is optimized for performance.
+the [GaussianCopulaSynthesizer](https://docs.sdv.dev/sdv/single-table-data/modeling/synthesizers/gaussiancopulasynthesizer).

```python
-from sdv.lite import SingleTablePreset
+from sdv.single_table import GaussianCopulaSynthesizer

-synthesizer = SingleTablePreset(metadata, name='FAST_ML')
+synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=real_data)
```
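
The fitted synthesizer can then generate new rows; a minimal sketch (the row count is arbitrary):

```python
synthetic_data = synthesizer.sample(num_rows=500)
```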

@@ -131,11 +131,15 @@ quality_report = evaluate_quality(
```

```
-Creating report: 100%|██████████| 4/4 [00:00<00:00, 19.30it/s]
-Overall Quality Score: 89.12%
-Properties:
-Column Shapes: 90.27%
-Column Pair Trends: 87.97%
+Generating report ...
+(1/2) Evaluating Column Shapes: |████████████████| 9/9 [00:00<00:00, 1133.09it/s]|
+Column Shapes Score: 89.11%
+(2/2) Evaluating Column Pair Trends: |██████████████████████████████████████████| 36/36 [00:00<00:00, 502.88it/s]|
+Column Pair Trends Score: 88.3%
+Overall Score (Average): 88.7%
```

This object computes an overall quality score on a scale of 0 to 100% (100 being the best) as well
6 changes: 3 additions & 3 deletions latest_requirements.txt
@@ -1,11 +1,11 @@
cloudpickle==3.0.0
copulas==0.11.0
-ctgan==0.10.0
+ctgan==0.10.1
deepecho==0.6.0
graphviz==0.20.3
numpy==1.26.4
pandas==2.2.2
-platformdirs==4.2.1
+platformdirs==4.2.2
rdt==1.12.1
-sdmetrics==0.14.0
+sdmetrics==0.14.1
tqdm==4.66.4
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -158,7 +158,7 @@ namespaces = false
version = {attr = 'sdv.__version__'}

[tool.bumpversion]
current_version = "1.13.1"
current_version = "1.14.0.dev1"
parse = '(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(\.(?P<release>[a-z]+)(?P<candidate>\d+))?'
serialize = [
'{major}.{minor}.{patch}.{release}{candidate}',
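
As a quick illustration, the `parse` pattern above accepts both plain releases and pre-releases such as the new `1.14.0.dev1`; a small check using Python's `re` module:

```python
import re

# Same pattern as the [tool.bumpversion] 'parse' setting above.
PARSE = (
    r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)'
    r'(\.(?P<release>[a-z]+)(?P<candidate>\d+))?'
)
match = re.fullmatch(PARSE, '1.14.0.dev1')
print(match.group('major'), match.group('release'), match.group('candidate'))
# -> 1 dev 1
```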
153 changes: 153 additions & 0 deletions scripts/release_notes_generator.py
@@ -0,0 +1,153 @@
"""Script to generate release notes."""

import argparse
import os
from collections import defaultdict

import requests

LABEL_TO_HEADER = {
'feature request': 'New Features',
'bug': 'Bugs Fixed',
'internal': 'Internal',
'maintenance': 'Maintenance',
'customer success': 'Customer Success',
'documentation': 'Documentation',
'misc': 'Miscellaneous'
}
ISSUE_LABELS = [
'documentation',
'maintenance',
'internal',
'bug',
'feature request',
'customer success'
]
NEW_LINE = '\n'
GITHUB_URL = 'https://api.github.com/repos/sdv-dev/sdv'
GITHUB_TOKEN = os.getenv('GH_ACCESS_TOKEN')


def _get_milestone_number(milestone_title):
url = f'{GITHUB_URL}/milestones'
headers = {
'Authorization': f'Bearer {GITHUB_TOKEN}'
}
query_params = {
'milestone': milestone_title,
'state': 'all',
'per_page': 100
}
response = requests.get(url, headers=headers, params=query_params)
body = response.json()
if response.status_code != 200:
raise Exception(str(body))

milestones = body
for milestone in milestones:
if milestone.get('title') == milestone_title:
return milestone.get('number')

raise ValueError(f'Milestone {milestone_title} not found in past 100 milestones.')


def _get_issues_by_milestone(milestone):
    headers = {
        'Authorization': f'Bearer {GITHUB_TOKEN}'
    }
    # get milestone number
    milestone_number = _get_milestone_number(milestone)
    url = f'{GITHUB_URL}/issues'
    page = 1
    query_params = {
        'milestone': milestone_number,
        'state': 'all'
    }
    issues = []
    while True:
        query_params['page'] = page
        response = requests.get(url, headers=headers, params=query_params)
        body = response.json()
        if response.status_code != 200:
            raise Exception(str(body))

        issues_on_page = body
        if not issues_on_page:
            break

        issues.extend(issues_on_page)
        page += 1

    return issues


def _get_issues_by_category(release_issues):
    category_to_issues = defaultdict(list)

    for issue in release_issues:
        issue_title = issue['title']
        issue_number = issue['number']
        issue_url = issue['html_url']
        line = f'* {issue_title} - Issue [#{issue_number}]({issue_url})'
        assignee = issue.get('assignee')
        if assignee:
            login = assignee['login']
            line += f' by @{login}'

        # Check if any known label is marked on the issue
        labels = [label['name'] for label in issue['labels']]
        found_category = False
        for category in ISSUE_LABELS:
            if category in labels:
                category_to_issues[category].append(line)
                found_category = True
                break

        if not found_category:
            category_to_issues['misc'].append(line)

    return category_to_issues


def _create_release_notes(issues_by_category, version, date):
    title = f'## v{version} - {date}'
    release_notes = f'{title}{NEW_LINE}{NEW_LINE}'

    for category in ISSUE_LABELS + ['misc']:
        issues = issues_by_category.get(category)
        if issues:
            section_text = (
                f'### {LABEL_TO_HEADER[category]}{NEW_LINE}{NEW_LINE}'
                f'{NEW_LINE.join(issues)}{NEW_LINE}{NEW_LINE}'
            )

            release_notes += section_text

    return release_notes


def update_release_notes(release_notes):
    """Add the release notes for the new release to the ``HISTORY.md``."""
    file_path = 'HISTORY.md'
    with open(file_path, 'r') as history_file:
        history = history_file.read()

    # Splice the new notes in immediately after the '# Release Notes' header.
    token = '# Release Notes\n\n'
    split_index = history.find(token) + len(token)
    header = history[:split_index]
    new_notes = f'{header}{release_notes}{history[split_index:]}'

    with open(file_path, 'w') as new_history_file:
        new_history_file.write(new_notes)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-v', '--version', type=str, help='Release version number (e.g. 1.0.1)')
    parser.add_argument('-d', '--date', type=str, help='Date of release in format YYYY-MM-DD')
    args = parser.parse_args()
    release_number = args.version
    release_issues = _get_issues_by_milestone(release_number)
    issues_by_category = _get_issues_by_category(release_issues)
    release_notes = _create_release_notes(issues_by_category, release_number, args.date)
    update_release_notes(release_notes)
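
To make the flow concrete, here is a hedged sketch of driving the helpers above directly; it assumes `GH_ACCESS_TOKEN` is set in the environment and that a `1.14.0` milestone exists, and the issue titles, numbers, and assignee in the comment are placeholders:

```python
# Equivalent to: python scripts/release_notes_generator.py -v 1.14.0 -d 2024-06-13
issues = _get_issues_by_milestone('1.14.0')
by_category = _get_issues_by_category(issues)
print(_create_release_notes(by_category, '1.14.0', '2024-06-13'))
# ## v1.14.0 - 2024-06-13
#
# ### Bugs Fixed
#
# * <issue title> - Issue [#<number>](<html_url>) by @<assignee>
# ...
```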
2 changes: 1 addition & 1 deletion sdv/__init__.py
@@ -6,7 +6,7 @@

__author__ = 'DataCebo, Inc.'
__email__ = 'info@sdv.dev'
-__version__ = '1.13.1'
+__version__ = '1.14.0.dev1'


import sys
12 changes: 6 additions & 6 deletions sdv/io/local/local.py
@@ -192,16 +192,16 @@ def write(self, synthetic_data, folder_name, file_name_suffix=None, mode='x'):
class ExcelHandler(BaseLocalHandler):
    """A class for handling Excel files."""

-    def _read_excel(self, file_path, sheet_names=None):
+    def _read_excel(self, filepath, sheet_names=None):
        """Read data from Excel File and return just the data as a dictionary."""
        data = {}
        if sheet_names is None:
-            xl_file = pd.ExcelFile(file_path)
+            xl_file = pd.ExcelFile(filepath)
            sheet_names = xl_file.sheet_names

        for sheet_name in sheet_names:
            data[sheet_name] = pd.read_excel(
-                file_path,
+                filepath,
                sheet_name=sheet_name,
                parse_dates=False,
                decimal=self.decimal,
@@ -210,11 +210,11 @@ def _read_excel(self, file_path, sheet_names=None):

        return data

-    def read(self, file_path, sheet_names=None):
+    def read(self, filepath, sheet_names=None):
        """Read data from Excel files and return it along with metadata.

        Args:
-            file_path (str):
+            filepath (str):
                The path to the Excel file to read.
            sheet_names (list of str, optional):
                The names of sheets to read. If None, all sheets are read.
@@ -226,7 +226,7 @@ def read(self, file_path, sheet_names=None):
        if sheet_names is not None and not isinstance(sheet_names, list):
            raise ValueError("'sheet_names' must be None or a list of strings.")

-        return self._read_excel(file_path, sheet_names)
+        return self._read_excel(filepath, sheet_names)

    def write(self, synthetic_data, file_name, sheet_name_suffix=None, mode='w'):
        """Write synthetic data to an Excel File.
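
A brief usage sketch of the renamed keyword (the file name and sheet name below are placeholders):

```python
from sdv.io.local import ExcelHandler

handler = ExcelHandler()
data = handler.read(filepath='hotel_data.xlsx', sheet_names=['guests'])
# 'sheet_names' must be None or a list of strings, per the validation above.
```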
6 changes: 2 additions & 4 deletions sdv/lite/single_table.py
@@ -136,8 +136,7 @@ def sample_from_conditions(self, conditions, max_tries_per_batch=100,
                The batch size to use per attempt at sampling. Defaults to 10 times
                the number of rows.
            output_file_path (str or None):
-                The file to periodically write sampled rows to. Defaults to
-                a temporary file, if None.
+                The file to periodically write sampled rows to. Defaults to None.

        Returns:
            pandas.DataFrame:
@@ -168,8 +167,7 @@ def sample_remaining_columns(self, known_columns, max_tries_per_batch=100,
                The batch size to use per attempt at sampling. Defaults to 10 times
                the number of rows.
            output_file_path (str or None):
-                The file to periodically write sampled rows to. Defaults to
-                a temporary file, if None.
+                The file to periodically write sampled rows to. Defaults to None.

        Returns:
            pandas.DataFrame: