
adding testing check_data function to load_data #458

Merged
merged 12 commits into main on Mar 6, 2024

Conversation

sabinala
Contributor

@sabinala sabinala commented Jan 15, 2024

This PR adds a function that checks a dataset for formatting errors inside the load_data function, which is called whenever sample is used. I've also included a notebook demonstrating the errors it produces.

Closes #454
Closes #290
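A minimal sketch of what such a checker might look like, assuming it validates the Timestamp column and rejects missing entries as discussed later in this thread (the name check_data matches the PR description, but the exact signature and error messages here are assumptions, not the merged implementation):

```python
import pandas as pd

def check_data(df: pd.DataFrame) -> None:
    """Raise ValueError for common formatting problems in a dataset.

    Hypothetical sketch of the checks described in this PR; the real
    implementation lives in pyciemss/integration_utils/observation.py.
    """
    # The first column must be the time axis
    if df.columns[0] != "Timestamp":
        raise ValueError(
            "The first column must be named 'Timestamp' and contain the "
            "time corresponding to each row of data."
        )
    # No NaN values or empty entries anywhere in the data
    if df.isna().any().any():
        raise ValueError("The data contains NaN values or empty entries.")
```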

@sabinala sabinala self-assigned this Jan 15, 2024
@sabinala sabinala linked an issue Jan 15, 2024 that may be closed by this pull request

@sabinala sabinala removed the request for review from djinnome January 15, 2024 20:46
@sabinala sabinala added the WIP PR submitter still making changes, not ready for review label Jan 15, 2024
@sabinala
Contributor Author

Changing back to WIP to add: (1) check that there is a column of data corresponding to everything mentioned in the data mapping, and (2) a data report that tells the shape of the data and what columns are included
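The two additions described above might look roughly like this. This is a sketch under assumptions: it assumes the mapping's keys name dataset columns (as in pyciemss's data_mapping convention), and the report wording is invented:

```python
import pandas as pd

def check_mapping_columns(df: pd.DataFrame, data_mapping: dict) -> None:
    """(1) Every column mentioned in the data mapping must exist in the data.

    Hypothetical sketch; assumes the mapping's keys name dataset columns.
    """
    missing = [col for col in data_mapping if col not in df.columns]
    if missing:
        raise ValueError(
            f"Columns referenced in the data mapping are missing from the data: {missing}"
        )

def print_dataframe_report(df: pd.DataFrame) -> str:
    """(2) A small report on the shape of the data and its columns."""
    report = (
        f"Data has {df.shape[0]} rows and {df.shape[1]} columns: "
        f"{list(df.columns)}"
    )
    print(report)
    return report
```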

@SamWitty
Contributor

@sabinala , what is the status of this PR?

@sabinala
Contributor Author

@SamWitty this PR is still in progress. I need to convert the assert statements to the form "if (condition): raise (error message)", and then write tests for these errors.
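The conversion described here is mechanical; one reason to prefer it is that assertions are stripped when Python runs with -O, so an assert-based check can be silently skipped. A sketch with a hypothetical check:

```python
import pandas as pd

def check_no_nan(df: pd.DataFrame) -> None:
    # Before: assert not df.isna().any().any(), "data contains NaN values"
    # After: an explicit check that raises even under python -O,
    # where assert statements are removed entirely
    if df.isna().any().any():
        raise ValueError("The data contains NaN values or empty entries.")
```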

@sabinala sabinala added awaiting review PR submitter awaiting code review from reviewer and removed WIP PR submitter still making changes, not ready for review labels Feb 19, 2024
@sabinala
Contributor Author

sabinala commented Feb 19, 2024

@djinnome let me know if you have questions on this. I added a check_data and a print_dataframe_report function inside of load_data in pyciemss/integration_utils/observation.py, and the test test_load_data to tests/test_interfaces.py which uses some "bad datasets" added to fixtures.py. Importantly, in order for the testing to work, I had to change load_data so that it can accept a DataFrame as well as a file path (string).
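Accepting either a file path or an in-memory DataFrame comes down to a small dispatch at the top of the function. A sketch, assuming CSV input; the real load_data in pyciemss/integration_utils/observation.py also does further processing that is omitted here, and the name load_df below is hypothetical:

```python
from typing import Union

import pandas as pd

def load_df(path_or_df: Union[str, pd.DataFrame]) -> pd.DataFrame:
    """Accept either a file path (string) or a DataFrame directly."""
    if isinstance(path_or_df, pd.DataFrame):
        df = path_or_df  # already in memory, e.g. a test fixture
    elif isinstance(path_or_df, str):
        df = pd.read_csv(path_or_df)  # assumes a CSV on disk
    else:
        raise TypeError(
            f"Expected a file path or a DataFrame, got {type(path_or_df)}"
        )
    return df
```

Allowing DataFrames directly is what makes the "bad dataset" fixtures testable without writing temporary files to disk.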

@sabinala
Contributor Author

@djinnome the only issue is that I was not able to create a dataset with "NaN" values in fixtures.py for testing, but the error will still be caught by the data checker function.
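For reference, a DataFrame with NaN entries can be built directly for a fixture; this is a hypothetical sketch (the fixture and column names below are invented, not those in fixtures.py):

```python
import numpy as np
import pandas as pd

# A "bad dataset" fixture containing a NaN, of the kind test_load_data
# could feed to the data checker
bad_data_nan = pd.DataFrame(
    {"Timestamp": [0.0, 1.0, 2.0], "case": [10.0, np.nan, 14.0]}
)
```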

@sabinala
Contributor Author

@djinnome let's talk about this during our meeting later. It looks like some tests aren't passing, but the failures are coming from tests for visuals and interruptions? When I run make format and pytest tests/test_interfaces.py, I get no issues and all tests pass.

@sabinala sabinala added WIP PR submitter still making changes, not ready for review and removed awaiting review PR submitter awaiting code review from reviewer labels Feb 19, 2024
@sabinala sabinala added blocked and removed WIP PR submitter still making changes, not ready for review labels Feb 21, 2024
@sabinala
Contributor Author

This PR is ready to go, but currently blocked by #481. (Still getting this issue after having merged main into this branch and running pip install -e . or pip install -e .[tests].)

@sabinala
Contributor Author

I'm confused as to why this PR appears to be failing on linting: when I run make format, no files are changed.

@djinnome
Contributor

I think it is because local flake8 is out of sync with the flake8 on the CI. Should we upgrade local flake8 or downgrade the CI flake8 @SamWitty ?

@SamWitty
Contributor

Please update local flake8. Thanks for checking!

Cleaning up data report
@sabinala
Contributor Author

@SamWitty @djinnome I updated local flake8 with pip install --upgrade flake8 and am still failing that test:

===================================== FAILURES ======================================
_______________ test_export_PNG[schema_file2-ref_file2-trajectories] ________________

schema_file = PosixPath('/Users/altu809/Projects/pyciemss/pyciemss/visuals/schemas/trajectories.vg.json')
ref_file = PosixPath('/Users/altu809/Projects/pyciemss/tests/visuals/reference_images/trajectories.png')
name = 'trajectories'

    @pytest.mark.parametrize("schema_file, ref_file, name", schemas(ref_ext="png"))
    def test_export_PNG(schema_file, ref_file, name):
        """
        Test all default schema files against the reference files for PNG files
    
        schema_file: default schema files saved within the visuals module
        ref_file: compare the created  png to this reference file
        name: stem name of reference file
        """
        with open(schema_file) as f:
            schema = json.load(f)
    
        image = plots.ipy_display(schema, format="PNG", dpi=72).data
        save_result(image, name, "png")
    
        test_threshold = 0.04
        JS_boolean, JS_score = png_matches(image, ref_file, test_threshold)
>       assert (
            JS_boolean
        ), f"{name}: PNG Histogram divergence: Shannon Jansen value {JS_score} > {test_threshold} "
E       AssertionError: trajectories: PNG Histogram divergence: Shannon Jansen value 0.1562242136437859 > 0.04 
E       assert False

tests/visuals/test_schemas.py:148: AssertionError

@SamWitty
Contributor

SamWitty commented Feb 21, 2024

This is not a linting error. The test itself is failing. This happens sometimes, as the tests are randomized. Rerunning failing tests should work most of the time.

@sabinala sabinala added awaiting review PR submitter awaiting code review from reviewer and removed blocked labels Feb 22, 2024
@sabinala
Contributor Author

@SamWitty correct again! All tests have passed, @djinnome this is ready now for review.

"The first column must be named 'Timestamp' and contain the time corresponding to each row of data."
)

# Check that there are no NaN values or empty entries
Contributor


Is this a constraint that we want to impose? It might be better if we could handle ragged data, yes?

Contributor Author


@djinnome It would be better, but I'm not sure where to go with that. How would that propagate to calibrate? I think it's probably best to throw an error message for now, and create a new issue to handle ragged data in the future.
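Until ragged data is supported, the error can at least point at the offending entries. A sketch of a helper for that (the name and return shape are assumptions, not part of this PR):

```python
import pandas as pd

def report_missing(df: pd.DataFrame) -> list:
    """Return (row, column) locations of NaN/empty entries,
    so the raised error message can say exactly where they are."""
    mask = df.isna()
    return [(i, c) for c in df.columns for i in df.index[mask[c]]]
```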

Contributor

@djinnome djinnome left a comment


Approving. Perhaps we should create an issue for a feature request to relax the missing data constraint.

@djinnome djinnome merged commit d6838e7 into main Mar 6, 2024
5 checks passed
@sabinala sabinala deleted the 454-create-data-checker-inside-of-load_data branch March 21, 2024 17:21
Labels
awaiting review PR submitter awaiting code review from reviewer