Release 0.9.0 #310

nnansters · 2023-06-26T16:16:11Z

nnansters
Jun 26, 2023
Maintainer

Hello everybody!

We've just released NannyML 0.9.0! In this release we're smashing some bugs, improving some docs and introducing two new calculator types: the summary stats calculators and the data quality calculators!

Installing / upgrading

You can get this latest version by using pip:

pip install -U nannyml

Or conda:

conda install -c conda-forge nannyml
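
If you want to confirm which version ended up in your environment after upgrading, a small check along these lines can help. This is only a sketch using Python's standard `importlib.metadata` module; the helper name `installed_version` is made up for this example and is not part of NannyML.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Hypothetical helper usage: look up the distribution named 'nannyml'.
nannyml_version = installed_version("nannyml")
print(nannyml_version or "nannyml is not installed in this environment")
```

After a successful upgrade this should print a version string starting with `0.9`.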

What’s new?

Data quality calculators

We've added our first two data quality metrics to track over time: the number of missing values and the number of unseen values.

The missing values calculator returns the number of missing values in a column for a given chunk. This allows you to track this number over time and compare it to the number of missing values for that column in your reference data.
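
To make the idea of a per-chunk missing value count concrete, here is a minimal plain-Python sketch. It does not use NannyML and is not how the library implements the calculator. The fixed chunk size, the helper names and the sample `Cabin` values are all assumptions made for illustration.

```python
import math


def count_missing(values):
    """Count entries that are None or a float NaN."""
    return sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )


def missing_per_chunk(values, chunk_size):
    """Split values into consecutive fixed-size chunks and count missing entries in each."""
    return [
        count_missing(values[start:start + chunk_size])
        for start in range(0, len(values), chunk_size)
    ]


# Made-up 'Cabin' values; None and NaN stand in for missing data.
reference_cabin = ["C85", None, "E46", "C123", None, "G6", "B42", "C103"]
analysis_cabin = [None, None, "D56", float("nan"), "A6", None, None, "B78"]

reference_counts = missing_per_chunk(reference_cabin, chunk_size=4)
analysis_counts = missing_per_chunk(analysis_cabin, chunk_size=4)
print(reference_counts, analysis_counts)  # [1, 1] [3, 2]
```

Comparing the two lists shows the analysis chunks carrying more missing values than the reference chunks, which is the kind of change this metric is meant to surface.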

The unseen values calculator checks for any values in categorical features that have not occurred in your reference data.
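
Conceptually, the check boils down to comparing each chunk against the set of categories observed in the reference data. The plain-Python sketch below illustrates that idea only. It is not NannyML's implementation, the helper name and sample values are hypothetical, and skipping missing entries is an assumption made here.

```python
def count_unseen(reference_values, chunk_values):
    """Count chunk entries whose category never appeared in the reference data."""
    # Assumption: missing entries (None) are ignored rather than counted as unseen.
    seen = {v for v in reference_values if v is not None}
    return sum(1 for v in chunk_values if v is not None and v not in seen)


# Made-up 'Embarked' values for illustration.
reference_embarked = ["S", "C", "S", "S", "C"]
analysis_chunks = [
    ["S", "C", "S"],       # only categories known from reference
    ["Q", "S", "Q", "C"],  # 'Q' never occurred in reference
]

unseen_counts = [count_unseen(reference_embarked, chunk) for chunk in analysis_chunks]
print(unseen_counts)  # [0, 2]
```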

The following snippet illustrates how to set up unseen values tracking:

import nannyml as nml
from IPython.display import display

# Load the example Titanic data that ships with NannyML
reference, analysis, analysis_targets = nml.load_titanic_dataset()
display(reference.head())

# Categorical columns to check for unseen values
selected_columns = [
    'Sex', 'Ticket', 'Cabin', 'Embarked',
]
calc = nml.UnseenValuesCalculator(
    column_names=selected_columns,
)

# Fit on the reference data, then calculate on the analysis data
calc.fit(reference)
results = calc.calculate(analysis)
display(results.filter(period='all').to_df())

# Plot the results for each column separately
for column_name in results.column_names:
    results.filter(column_names=column_name).plot().show()

You can read more about the data quality calculators in the missing values calculator tutorial and the unseen values calculator tutorial.

Summary stats calculators

With these calculators you can track the evolution of summary statistics over time. Currently supported summary stats are:

Summation
Average
Standard Deviation
Median
Row count
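
As a rough illustration of what these statistics look like when computed per chunk, the sketch below uses only Python's standard `statistics` module. It is not NannyML code. The chunk values are made up, and using the sample standard deviation is an assumption of this example.

```python
import statistics


def summarize(chunk):
    """Compute the five summary statistics for one chunk of numeric values."""
    return {
        "sum": sum(chunk),
        "avg": statistics.mean(chunk),
        "std": statistics.stdev(chunk),  # sample standard deviation (assumption)
        "median": statistics.median(chunk),
        "rows": len(chunk),
    }


# Two made-up chunks of a numeric column.
chunks = [[10.0, 12.0, 14.0], [20.0, 20.0, 26.0]]
stats = [summarize(chunk) for chunk in chunks]

for index, chunk_stats in enumerate(stats):
    print(index, chunk_stats)
```

Tracking each of these values chunk by chunk is what produces the time series the calculators plot.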

NannyML will determine thresholds for the summary statistic values based on the reference period data and raise an alert when new values exceed those thresholds.
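
One way to picture reference-based thresholds is as a band around the reference values, with anything outside the band flagged as an alert. The sketch below assumes a band of the reference mean plus or minus three standard deviations. That rule, the helper names and the numbers are assumptions for illustration, not a statement of how NannyML derives its thresholds.

```python
import statistics


def fit_thresholds(reference_values, multiplier=3.0):
    """Derive lower/upper thresholds from per-chunk reference values (assumed rule)."""
    center = statistics.mean(reference_values)
    spread = statistics.pstdev(reference_values)
    return center - multiplier * spread, center + multiplier * spread


def find_alerts(values, lower, upper):
    """Flag every value that falls outside the [lower, upper] band."""
    return [not (lower <= v <= upper) for v in values]


# Made-up per-chunk medians for the reference and analysis periods.
reference_medians = [100.0, 102.0, 98.0, 101.0, 99.0]
analysis_medians = [100.5, 97.0, 110.0]

lower, upper = fit_thresholds(reference_medians)
alerts = find_alerts(analysis_medians, lower, upper)
print(alerts)  # [False, False, True]
```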

The following snippet shows how to set up monitoring of the median:

import nannyml as nml
from IPython.display import display

# Load the synthetic car loan data that ships with NannyML
reference, analysis, analysis_targets = nml.load_synthetic_car_loan_dataset()
display(reference.head())

# Numeric columns whose median we want to monitor
selected_columns = [
    'car_value', 'debt_to_income_ratio', 'driver_tenure'
]
calc = nml.SummaryStatsMedianCalculator(
    column_names=selected_columns,
)

# Fit on the reference data, then calculate on the analysis data
calc.fit(reference)
results = calc.calculate(analysis)
display(results.filter(period='all').to_df())

# Plot the results for each column separately
for column_name in results.column_names:
    results.filter(column_names=column_name).plot().show()

Read more about them in the summary stats tutorials.

What's next?

We have multiple proverbial irons in the fire, so library development has slowed down a bit. We'll be picking up the pace soon with some fundamental changes!

We hope our new functionality improves your quality of life (and deployed models). As always, any feedback is encouraged!

Reach out in our community Slack, log a bug or a feature request in our repository, or just leave us a star for positive holiday vibes!
