A data validation library for scientists, engineers, and analysts seeking correctness.
pandas data structures contain information that pandera explicitly
validates at runtime. This is useful in production-critical or reproducible
research settings. With pandera, you can:
- Check the types and properties of columns in a DataFrame or values in a Series.
- Perform more complex statistical validation like hypothesis testing.
- Seamlessly integrate with existing data analysis/processing pipelines via function decorators.
- Define schema models with the class-based API with pydantic-style syntax and validate dataframes using the typing syntax.
- Synthesize data from schema objects for property-based testing with pandas data structures.
pandera provides a flexible and expressive API for performing data validation
on tidy (long-form) and wide data to make data processing pipelines more
readable and robust.
The official documentation is hosted on ReadTheDocs: https://pandera.readthedocs.io
Using pip:
pip install pandera
Installing optional functionality:
pip install pandera[hypotheses] # hypothesis checks
pip install pandera[io] # yaml/script schema io utilities
pip install pandera[strategies] # data synthesis strategies
pip install pandera[all] # all packages
Using conda:
conda install -c conda-forge pandera-core # core library functionality
conda install -c conda-forge pandera # pandera with all extensions
import pandas as pd
import pandera as pa
# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"]
})
# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a Series as input and
        # output a boolean or a boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})
validated_df = schema(df)
print(validated_df)
#    column1  column2  column3
# 0        1     -1.3  value_1
# 1        4     -1.4  value_2
# 2        0     -2.9  value_3
# 3       10    -10.1  value_2
# 4        9    -20.4  value_1
pandera also provides an alternative API for expressing schemas inspired
by dataclasses and pydantic. The equivalent SchemaModel for the above
DataFrameSchema would be:
from pandera.typing import Series
class Schema(pa.SchemaModel):

    column1: Series[int] = pa.Field(le=10)
    column2: Series[float] = pa.Field(lt=-1.2)
    column3: Series[str] = pa.Field(str_startswith="value_")

    @pa.check("column3")
    def column_3_check(cls, series: Series[str]) -> Series[bool]:
        """Check that values have two elements after being split with '_'"""
        return series.str.split("_", expand=True).shape[1] == 2

Schema.validate(df)
git clone https://github.com/pandera-dev/pandera.git
cd pandera
pip install -r requirements-dev.txt
pip install -e .
pip install pytest
pytest tests
All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.
A detailed overview on how to contribute can be found in the contributing guide on GitHub.
Go here to submit feature requests or bugfixes.
Here are a few other alternatives for validating Python data structures.
Generic Python object data validation
pandas-specific data validation
Other tools for data validation
- pandas-centric data types, column nullability, and uniqueness are first-class concepts.
- check_input and check_output decorators enable seamless integration with existing code.
- Checks provide flexibility and performance by giving access to the pandas API by design, and offer built-in checks for common data tests.
- The Hypothesis class provides a tidy-first interface for statistical hypothesis testing.
- Checks and Hypothesis objects support both tidy and wide data validation.
- Comprehensive documentation on key functionality.
If you use pandera in the context of academic or industry research, please
consider citing the paper and/or software package.
@InProceedings{ niels_bantilan-proc-scipy-2020,
author = { {N}iels {B}antilan },
title = { pandera: {S}tatistical {D}ata {V}alidation of {P}andas {D}ataframes },
booktitle = { {P}roceedings of the 19th {P}ython in {S}cience {C}onference },
pages = { 116 - 124 },
year = { 2020 },
editor = { {M}eghann {A}garwal and {C}hris {C}alloway and {D}illon {N}iederhut and {D}avid {S}hupe },
doi = { 10.25080/Majora-342d178e-010 }
}
pandera is licensed under the MIT license and is written and
maintained by Niels Bantilan (niels@pandera.ci)