Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TST(string dtype): Resolve HDF5 xfails in test_put.py #60625

Closed
wants to merge 5 commits into from

Conversation

rhshadrach
Copy link
Member

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

It would be more performant to implement a way of writing the string dtypes without converting to NumPy, but this approach gets us correct behavior for now.

@rhshadrach rhshadrach added IO HDF5 read_hdf, HDFStore Strings String extension data type and string data labels Dec 30, 2024
@rhshadrach rhshadrach added this to the 2.3 milestone Dec 30, 2024
@@ -3185,6 +3186,8 @@ def write_array(
# both self._filters and EA

value = extract_array(obj, extract_numpy=True)
if isinstance(value, BaseStringArray):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this logic live in extract_array?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line from the docstring of extract_array states:

Extract the ndarray or ExtensionArray from a Series or Index.

and the extract_numpy argument:

Whether to extract the ndarray from a NumpyExtensionArray.

So I think no - I would expect that function to still return the ExtensionArray.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the tests currently cover the "string" data type going through pytables? This seems like it might mangle the NA markers?

Not super familiar with pytables so not saying this is right or wrong - just want to double check

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are certainly tests with dataframe roundtrip that contain string columns, not entirely sure if there are also tests where those columns have missing values, though

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche - I'm not seeing this particular line hit with string dtype. I added a test, including an NA value.

@WillAyd - the current behavior is to write out the underlying objects, and then infer upon loading. So if we start with string an future.infer_string=False, we get object. When that option is True, we get str.

pandas/tests/io/pytables/test_put.py Outdated Show resolved Hide resolved
pandas/tests/io/pytables/test_put.py Outdated Show resolved Hide resolved
Comment on lines 271 to 272
if using_infer_string:
msg = "Saving a MultiIndex with an extension dtype is not supported."
Copy link
Member

@jorisvandenbossche jorisvandenbossche Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if using_infer_string:
msg = "Saving a MultiIndex with an extension dtype is not supported."
if using_infer_string:
# TODO(infer_string) make this work for string dtype
msg = "Saving a MultiIndex with an extension dtype is not supported."

(just as a reminder, because ideally we still solve this)

Comment on lines -3366 to +3373
if using_string_dtype() and is_string_array(values, skipna=True):
if (
using_string_dtype()
and isinstance(values, np.ndarray)
and is_string_array(values, skipna=True)
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same code pattern happens above in SeriesFixed read() method. Not entirely sure if the same change should be applied there as well, but would expect so (but maybe not covered by any test?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, added a test



def test_put_mixed_type(setup_path, performance_warning):
df = DataFrame(
np.random.default_rng(2).standard_normal((10, 4)),
columns=Index(list("ABCD"), dtype=object),
columns=Index(list("ABCD")),
index=date_range("2000-01-01", periods=10, freq="B"),
)
df["obj1"] = "foo"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
df["obj1"] = "foo"
df["obj1"] = np.array(["foo"] * 10, dtype=object)
df["str"] = pd.Series(["a", None, "b", "c", "d"] * 2).array

To explicitly test here with both object dtype and string dtype, and with strings with missing values (this is passing locally, so that should address Will's comment at https://github.com/pandas-dev/pandas/pull/60625/files#r1900476124

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@rhshadrach
Copy link
Member Author

Closing in favor of #60663

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO HDF5 read_hdf, HDFStore Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants