
adding testing check_data function to load_data #458

Merged
merged 12 commits into main on Mar 6, 2024

Conversation

sabinala
Contributor

@sabinala sabinala commented Jan 15, 2024

This PR adds a function that checks a dataset for formatting errors inside the load_data function, which is called whenever sample is used. I've also included a notebook demonstrating the errors it produces.

Closes #454
Closes #290
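A minimal sketch of what such a checker might look like, assuming it validates the Timestamp column and rejects missing entries as discussed later in this thread (the name check_data matches the PR description, but the exact signature and error messages here are assumptions, not the merged implementation):

```python
import pandas as pd

def check_data(df: pd.DataFrame) -> None:
    """Raise ValueError for common formatting problems in a dataset.

    Hypothetical sketch of the checks described in this PR; the real
    implementation lives in pyciemss/integration_utils/observation.py.
    """
    # The first column must be the time axis
    if df.columns[0] != "Timestamp":
        raise ValueError(
            "The first column must be named 'Timestamp' and contain the "
            "time corresponding to each row of data."
        )
    # No NaN values or empty entries anywhere in the data
    if df.isna().any().any():
        raise ValueError("The data contains NaN values or empty entries.")
```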

@sabinala sabinala self-assigned this Jan 15, 2024
@sabinala sabinala linked an issue Jan 15, 2024 that may be closed by this pull request

@sabinala sabinala removed the request for review from djinnome January 15, 2024 20:46
@sabinala sabinala added the WIP PR submitter still making changes, not ready for review label Jan 15, 2024
@sabinala
Contributor Author

Changing back to WIP to add: (1) check that there is a column of data corresponding to everything mentioned in the data mapping, and (2) a data report that tells the shape of the data and what columns are included
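The two additions described above might look roughly like this. This is a sketch under assumptions: it assumes the mapping's keys name dataset columns (as in pyciemss's data_mapping convention), and the report wording is invented:

```python
import pandas as pd

def check_mapping_columns(df: pd.DataFrame, data_mapping: dict) -> None:
    """(1) Every column mentioned in the data mapping must exist in the data.

    Hypothetical sketch; assumes the mapping's keys name dataset columns.
    """
    missing = [col for col in data_mapping if col not in df.columns]
    if missing:
        raise ValueError(
            f"Columns referenced in the data mapping are missing from the data: {missing}"
        )

def print_dataframe_report(df: pd.DataFrame) -> str:
    """(2) A small report on the shape of the data and its columns."""
    report = (
        f"Data has {df.shape[0]} rows and {df.shape[1]} columns: "
        f"{list(df.columns)}"
    )
    print(report)
    return report
```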

@SamWitty
Contributor

@sabinala , what is the status of this PR?

@sabinala
Contributor Author

@SamWitty this PR is still in progress. I need to convert the assert statements to the form "if (condition): raise (error message)", and then write tests for these errors.
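The conversion described here is mechanical; one reason to prefer it is that assertions are stripped when Python runs with -O, so an assert-based check can be silently skipped. A sketch with a hypothetical check:

```python
import pandas as pd

def check_no_nan(df: pd.DataFrame) -> None:
    # Before: assert not df.isna().any().any(), "data contains NaN values"
    # After: an explicit check that raises even under python -O,
    # where assert statements are removed entirely
    if df.isna().any().any():
        raise ValueError("The data contains NaN values or empty entries.")
```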

@sabinala sabinala added awaiting review PR submitter awaiting code review from reviewer and removed WIP PR submitter still making changes, not ready for review labels Feb 19, 2024
@sabinala
Contributor Author

sabinala commented Feb 19, 2024

@djinnome let me know if you have questions on this. I added a check_data and a print_dataframe_report function inside of load_data in pyciemss/integration_utils/observation.py, and the test test_load_data to tests/test_interfaces.py which uses some "bad datasets" added to fixtures.py. Importantly, in order for the testing to work, I had to change load_data so that it can accept a DataFrame as well as a file path (string).
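Accepting either a file path or an in-memory DataFrame comes down to a small dispatch at the top of the function. A sketch, assuming CSV input; the real load_data in pyciemss/integration_utils/observation.py also does further processing that is omitted here, and the name load_df below is hypothetical:

```python
from typing import Union

import pandas as pd

def load_df(path_or_df: Union[str, pd.DataFrame]) -> pd.DataFrame:
    """Accept either a file path (string) or a DataFrame directly."""
    if isinstance(path_or_df, pd.DataFrame):
        df = path_or_df  # already in memory, e.g. a test fixture
    elif isinstance(path_or_df, str):
        df = pd.read_csv(path_or_df)  # assumes a CSV on disk
    else:
        raise TypeError(
            f"Expected a file path or a DataFrame, got {type(path_or_df)}"
        )
    return df
```

Allowing DataFrames directly is what makes the "bad dataset" fixtures testable without writing temporary files to disk.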

@sabinala
Contributor Author

@djinnome the only issue is that I was not able to create a dataset with "NaN" values in fixtures.py for testing, but the error will still be caught by the data checker function.
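For reference, a DataFrame with NaN entries can be built directly for a fixture; this is a hypothetical sketch (the fixture and column names below are invented, not those in fixtures.py):

```python
import numpy as np
import pandas as pd

# A "bad dataset" fixture containing a NaN, of the kind test_load_data
# could feed to the data checker
bad_data_nan = pd.DataFrame(
    {"Timestamp": [0.0, 1.0, 2.0], "case": [10.0, np.nan, 14.0]}
)
```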

@sabinala
Contributor Author

@djinnome let's talk about this during our meeting later. It looks like some tests aren't passing, but the failures are coming from tests for visuals and interruptions? When I run make format and pytest tests/test_interfaces.py, I get no issues and all tests pass.

@sabinala sabinala added WIP PR submitter still making changes, not ready for review and removed awaiting review PR submitter awaiting code review from reviewer labels Feb 19, 2024
@sabinala sabinala added blocked and removed WIP PR submitter still making changes, not ready for review labels Feb 21, 2024
@sabinala
Contributor Author

This PR is ready to go, but currently blocked by #481. (Still getting this issue after having merged main into this branch and running pip install -e . or pip install -e .[tests].)

@sabinala
Contributor Author

I'm confused as to why this PR appears to be failing on linting: when I run make format, no files are changed.

@djinnome
Contributor

I think it is because local flake8 is out of sync with the flake8 on the CI. Should we upgrade local flake8 or downgrade the CI flake8 @SamWitty ?

@SamWitty
Contributor

Please update local flake8. Thanks for checking!

Cleaning up data report
@sabinala
Contributor Author

@SamWitty @djinnome I updated local flake8 with pip install --upgrade flake8 and am still failing that test:

===================================== FAILURES ======================================
_______________ test_export_PNG[schema_file2-ref_file2-trajectories] ________________

schema_file = PosixPath('/Users/altu809/Projects/pyciemss/pyciemss/visuals/schemas/trajectories.vg.json')
ref_file = PosixPath('/Users/altu809/Projects/pyciemss/tests/visuals/reference_images/trajectories.png')
name = 'trajectories'

    @pytest.mark.parametrize("schema_file, ref_file, name", schemas(ref_ext="png"))
    def test_export_PNG(schema_file, ref_file, name):
        """
        Test all default schema files against the reference files for PNG files
    
        schema_file: default schema files saved within the visuals module
        ref_file: compare the created  png to this reference file
        name: stem name of reference file
        """
        with open(schema_file) as f:
            schema = json.load(f)
    
        image = plots.ipy_display(schema, format="PNG", dpi=72).data
        save_result(image, name, "png")
    
        test_threshold = 0.04
        JS_boolean, JS_score = png_matches(image, ref_file, test_threshold)
>       assert (
            JS_boolean
        ), f"{name}: PNG Histogram divergence: Shannon Jansen value {JS_score} > {test_threshold} "
E       AssertionError: trajectories: PNG Histogram divergence: Shannon Jansen value 0.1562242136437859 > 0.04 
E       assert False

tests/visuals/test_schemas.py:148: AssertionError

@SamWitty
Contributor

SamWitty commented Feb 21, 2024

This is not a linting error. The test itself is failing. This happens sometimes, as the tests are randomized. Rerunning failing tests should work most of the time.

@sabinala sabinala added awaiting review PR submitter awaiting code review from reviewer and removed blocked labels Feb 22, 2024
@sabinala
Contributor Author

@SamWitty correct again! All tests have passed, @djinnome this is ready now for review.

"The first column must be named 'Timestamp' and contain the time corresponding to each row of data."
)

# Check that there are no NaN values or empty entries
Contributor


Is this a constraint that we want to impose? It might be better if we could handle ragged data, yes?

Contributor Author


@djinnome It would be better, but I'm not sure where to go with that. How would that propagate to calibrate? I think it's probably best to throw an error message for now, and create a new issue to handle ragged data in the future.
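Until ragged data is supported, the error can at least point at the offending entries. A sketch of a helper for that (the name and return shape are assumptions, not part of this PR):

```python
import pandas as pd

def report_missing(df: pd.DataFrame) -> list:
    """Return (row, column) locations of NaN/empty entries,
    so the raised error message can say exactly where they are."""
    mask = df.isna()
    return [(i, c) for c in df.columns for i in df.index[mask[c]]]
```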

Contributor

@djinnome djinnome left a comment


Approving. Perhaps we should create an issue for a feature request to relax the missing data constraint.

@djinnome djinnome merged commit d6838e7 into main Mar 6, 2024
5 checks passed
@sabinala sabinala deleted the 454-create-data-checker-inside-of-load_data branch March 21, 2024 17:21
Labels
awaiting review PR submitter awaiting code review from reviewer