Support upcoming default pandas string dtype (pandas >= 3) #930
Comments
With this type, are the values still python strings?
The values are either object-dtype with python strings (or np.nan for missing values) or a pyarrow array, depending on the
But, regardless of the exact storage, if you just want to have Python strings you can always do something like
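As a rough sketch of that idea (not necessarily the exact call the commenter had in mind), `Series.to_numpy` with `dtype=object` returns plain Python `str` objects whatever the underlying storage is. The sample values and the choice of `None` as the missing-value marker are assumptions made for illustration.

```python
import pandas as pd

# A string-dtype column; the storage may be Python objects or pyarrow
# depending on the pandas configuration.
ser = pd.Series(["a", None, "c"], dtype="string")

# Materialize as a NumPy object array of Python strings.
# na_value=None is an illustrative choice for representing missing values.
as_objects = ser.to_numpy(dtype=object, na_value=None)

print(type(as_objects[0]), as_objects.dtype)
```

`ser.astype(object)` is a similar option if a Series is wanted rather than a NumPy array.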
I want to pre-allocate a dataframe and fill in the values as they are read. That model probably doesn't work anymore for arrow-backed data more complex than the equivalent numpy array. #931 shows the possible future evolution of fastparquet where we no longer use pandas at all...
(FWIW, pandas is not going to hard require pyarrow for pandas 3.0, that decision is postponed until a later release. But regardless of that, having less pandas-specific code here certainly sounds worthwhile.) Preallocating probably won't work for the arrow-backed data indeed. But I would say you can always read the strings as you do now (preallocating an object-dtype array, I assume?) and do any conversion afterwards (or leave that to pandas).
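The "preallocate, then convert" approach suggested here could look roughly like the following. This is only a sketch: the row count and values are made up, and it does not reflect how fastparquet actually fills its columns.

```python
import numpy as np
import pandas as pd

n_rows = 3  # hypothetical row count, known up front from file metadata

# Preallocate an object-dtype buffer; np.empty fills it with None.
buf = np.empty(n_rows, dtype=object)

# Fill values as they are "read" (stand-in data for illustration).
for i, value in enumerate(["x", None, "z"]):
    buf[i] = value

# Convert once at the end; pandas chooses the storage for the string dtype.
converted = pd.Series(buf, dtype="string")
print(converted.dtype)
```

Skipping the final conversion and handing the object array to pandas as-is corresponds to the "leave that to pandas" option.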
Probably we'll continue to produce numpy object columns while we can, but we still have to deal with the I'll get back to you on the two issues, thanks for letting me know.
Pandas decided to introduce a default string dtype (which will be used by default instead of object-dtype when inferring values to be strings), see https://pandas.pydata.org/pdeps/0014-string-dtype.html for the details (and pandas-dev/pandas#54792 for progress of implementation).
This is already available in the `main` branch of pandas (and will also be in an upcoming 2.3 release) behind a feature flag, `pd.options.future.infer_string = True`.
Right now, if you enable this flag (with a nightly version of pandas) and use fastparquet to write a dataframe with a string column, the write fails with an error, because fastparquet is not yet aware of the new dtype.
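To see what the flag changes, the sketch below compares the dtype pandas infers for a list of strings with and without `future.infer_string`. It only inspects the dtype and does not call fastparquet. The helper name `inferred_dtype_with_flag` is hypothetical, and the fallback branch assumes that older pandas versions either lack the option or need pyarrow to use it.

```python
import pandas as pd

def inferred_dtype_with_flag():
    """Return the dtype inferred for strings with future.infer_string enabled."""
    try:
        with pd.option_context("future.infer_string", True):
            return pd.Series(["a", "b", None]).dtype
    except (ImportError, KeyError) as exc:
        # KeyError covers pandas' OptionError when the option does not exist;
        # ImportError covers versions where the flag requires pyarrow.
        print(f"flag not usable in this environment: {exc}")
        return None

default_dtype = pd.Series(["a", "b", None]).dtype
flag_dtype = inferred_dtype_with_flag()
print("default:", default_dtype, "| with flag:", flag_dtype)
```

A writer that only handles object-dtype columns will not recognize the second dtype, which is the situation this issue describes.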