
Fix dtype of quality metrics before and after merging #3497

Merged
23 commits merged into SpikeInterface:main on Jan 10, 2025

Conversation

@zm711 (Collaborator) commented Oct 22, 2024

MRE

```python
import pandas as pd

df = pd.DataFrame({'test': [1, 2, 3]})
new_df = pd.DataFrame(index=df.index, columns=df.columns)

df.test.dtype
# Out: dtype('int64')

new_df.test.dtype
# Out: dtype('O')
```

Basically, when you build a new dataframe from a previous dataframe's index and columns, pandas forces the dtype to be object instead of numeric.

Easy Solution

Using pd.to_numeric brings us back to numeric values.

Caveats

This switches the dtype from the pandas Float64 to the numpy float64. I don't think this is too bad, since doing queries should still be fine, no?
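As a quick sanity check (a minimal sketch with made-up values, not code from the PR), a typical threshold query gives the same answer whether the column uses the pandas nullable Float64 dtype or plain numpy float64:

```python
import pandas as pd

# Illustrative values: a threshold query behaves the same on pandas'
# nullable Float64 extension dtype and on plain numpy float64.
s_nullable = pd.Series([0.5, 1.5, 2.5], dtype="Float64")
s_numpy = s_nullable.astype("float64")

q1 = s_nullable > 1.0  # boolean extension dtype
q2 = s_numpy > 1.0     # plain numpy bool

print(list(q1))  # [False, True, True]
print(list(q2))  # [False, True, True]
```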

Testing

I added a small test for merging, but let me know if we'd prefer not to have it.

@zm711 added the qualitymetrics (Related to qualitymetrics module) label Oct 22, 2024
```python
# we can iterate through the columns and convert them back to numbers with
# pandas.to_numeric. errors="coerce" allows us to keep the NaN values.
for column in metrics.columns:
    metrics[column] = pd.to_numeric(metrics[column], errors="coerce")
```
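For illustration, here is that coerce loop applied to a toy frame built the same way as the MRE above (the column names are made up, not the actual quality-metric columns):

```python
import pandas as pd

# Toy reproduction (hypothetical column names): a frame built from another
# frame's index/columns starts with object-dtype columns.
old_metrics = pd.DataFrame({"snr": [5.0, 2.5], "firing_rate": [10.0, 3.0]})
metrics = pd.DataFrame(index=old_metrics.index, columns=old_metrics.columns)
print(metrics["snr"].dtype)  # object

metrics.loc[0] = old_metrics.loc[0]  # fill one row; the other stays NaN

# the loop from the PR: convert each column back to numeric, keeping NaN
for column in metrics.columns:
    metrics[column] = pd.to_numeric(metrics[column], errors="coerce")

print(metrics.dtypes)  # both columns are float64 now
```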
A Member commented:
this is ok for me.
pandas behavior is becoming quite cryptic for me.
Using old_metrics[col].dtype could also be used, no?

@zm711 (Collaborator, Author) replied:

Maybe. I agree pandas is making their own dtypes like NAType which don't play nicely with numpy (in my scripts I tend to just query based on numpy stuff), so I don't know for sure. I could test that later. Although for me I would prefer to coerce everything to numpy types since that's what I'm used to. None of my tables are big enough that I worry about the dtype-efficiency work pandas has been doing with the new backend.

@JoeZiminski (Collaborator) left a comment:

Hey @zm711, this looks great, good catch! Super useful test too. Weird behaviour from pandas. Some minor comments:

  • I think new_df = pd.DataFrame(index=df.index, columns=df.columns, dtype=np.float64) will have the same effect. You lose the coerce-on-error behaviour, but assuming the data is always going to be filled with NaN this shouldn't be a problem. However, it is more implicit and provides less information about the weird pandas behaviour than the loop approach.

  • The result of this operation is that all columns are np.float64, but in the original metrics as returned from compute_quality_metrics some columns are Int64Dtype. This seems to be dynamic based on contents (e.g. in the test run the presence ratios were all 1 and the dtype is Int64Dtype, but presumably it would be a float under most circumstances). num_spikes I guess will always be int. The only time I can imagine this being a problem is if some equality check is performed, e.g. num_spikes == 1, which might work for the original compute_quality_metrics output but fail after merging since the data will be float. So maybe it is simplest just to cast num_spikes -> Int64Dtype and leave the rest as float?
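The constructor alternative from the first bullet can be sketched like this (toy data, not the actual metrics table):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"snr": [5.0, 2.5], "num_spikes": [10, 3]})

# passing dtype at construction time avoids the object-dtype columns,
# at the cost of losing the errors="coerce" safety net of pd.to_numeric
new_df = pd.DataFrame(index=df.index, columns=df.columns, dtype=np.float64)

print(new_df.dtypes)              # float64 for every column
print(new_df.isna().all().all())  # the frame starts out all-NaN
```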

@zm711 commented Oct 24, 2024

Thanks so much @JoeZiminski!

> I think new_df = pd.DataFrame(index=df.index, columns=df.columns, dtype=np.float64)

I'm no pandas expert, so I'm happy to have changes here if they are better! I just don't have an intuition for what the smartest strategy is, so if you know pandas really well then I'll make the change :)

> e.g. in the test run presence ratios were all 1 and its dtype is Int64Dtype but presumably it would be a float under most circumstances.

True. This is our mistake for letting pandas infer. Presence ratio is a float between 0 and 1, but if the values are all 0s or all 1s it casts to int for memory purposes. Users should never assume it is an int, although it could be in extreme cases. It would be better for us to explicitly make it a float and take the memory hit, in my opinion.

> num_spikes I guess will always be int.

This is true, and when I scanned the table I forgot about this one. It would be better to make that one an int. I don't think a user should ever do num_spikes == x, but I could imagine a query like num_spikes >= x. I think (but correct me if I'm wrong) that for floats only the fractional part is unstable, so that float(1) >= int(1) holds. In this case testing against a minimum number of spikes should not be a problem. If I'm wrong, then I think we have to cast that one back to int in a separate step.
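The threshold-query point can be checked directly: IEEE-754 doubles represent every whole number up to 2**53 exactly, so int vs float storage cannot change the result of a comparison at realistic spike counts (a toy check, not code from the PR):

```python
import numpy as np

# whole numbers up to 2**53 are represented exactly in a 64-bit float,
# so int/float threshold comparisons agree for realistic spike counts
assert np.float64(1.0) >= np.int64(1)
assert np.float64(100.0) == 100

counts_int = np.array([1, 50, 1000], dtype=np.int64)
counts_float = counts_int.astype(np.float64)
print((counts_int >= 50) == (counts_float >= 50))  # all True
```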

> super useful test too

<3 Thanks. I figure we really need to protect ourselves from some of these small regressions. So I'm trying :)

@zm711 commented Nov 1, 2024

@alejoe91, do you have any opinions of implementing this? Happy to change to a different method if you prefer something. I think the only thing we are failing to maintain is num_spikes.

```python
assert len(metrics.index) > len(new_metrics.index)

# dtype should be fine after merge but is cast from Float64 -> float64
assert np.float64 == new_metrics["snr"].dtype
```
A Member commented:

we can add a test on int coercion if we end up using the suggestion here: https://github.com/SpikeInterface/spikeinterface/pull/3497/files#r1827487180

@zm711 commented Nov 4, 2024

So the problem is that pandas will infer the dtype, and sometimes this is actually wrong. For example, presence ratio (see above) should technically always be a float between 0.0 and 1.0, but if the values are all 1s and 0s it will be stored as an int. Then if we merge and get a fraction, the dtype is wrong. I think it might be better to hard-code int64 for num_spikes, since everything else is a float.

I basically implemented Sam's idea, but this fails unless we hard-code the dtypes of the different metrics rather than allow pandas to infer them. What do people think about me adding a line to coerce everything in the original calculator to float64 except num_spikes? Then we have that be int, and we should be fine to do the casting I do here in this code. If we make sure we know the dtypes of our metrics, we are safer.

@zm711 commented Nov 22, 2024

Okay, so the changes in this PR:

  1. we now ensure all columns have the dtype specified in our name_to_dtype dict (pandas is not allowed to infer)
  2. the test for empty units was changed because we no longer put NaNs in num_spikes (because that should be 0 for empty units) -- let me know if we should discuss
  3. merging now recasts the columns to the original dtypes
  4. a test was added for merging
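The dtype-pinning idea in point 1 can be sketched as follows; the dict below is illustrative, not the actual SpikeInterface name_to_dtype mapping:

```python
import pandas as pd

# hypothetical subset of a name-to-dtype mapping; pandas is never allowed
# to infer, so presence_ratio stays float even when every value happens
# to be an integer
name_to_dtype = {
    "num_spikes": "int64",
    "snr": "float64",
    "presence_ratio": "float64",
}

metrics = pd.DataFrame(
    {"num_spikes": [10, 3], "snr": [5.0, 2.5], "presence_ratio": [1, 1]}
)

for column in metrics.columns:
    metrics[column] = metrics[column].astype(name_to_dtype[column])

print(metrics.dtypes)  # num_spikes is int64, everything else float64
```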

@zm711 changed the title from "Fix dtype of quality metrics after merging" to "Fix dtype of quality metrics before and after merging" Nov 22, 2024
@zm711 left a comment:

I fixed the conflict. Let me know what you all think :)

Comment on lines 252 to 256:

```python
# we have one issue where the names of the columns for synchrony are based on
# what the user has input as arguments, so we need a way to handle this
# separately; everything else should be handled with the column name.
if "sync" in column:
    metrics[column] = metrics[column].astype(column_name_to_column_dtype["sync"])
```
@zm711 replied:

I would argue we keep this for backward compatibility, no? I could add a comment saying we can simplify this in a couple of versions.

@zm711 commented Jan 9, 2025

Never mind. After Chris's updates I thought this would go in cleanly other than one conflict. His changes actually require me to make additional updates to this PR...

@zm711 commented Jan 9, 2025

Now we are finally ready :)

```python
    "sync_spike_2",
    "sync_spike_4",
    "sync_spike_8",
],  # we probably shouldn't hard-code this. This is determined by the arguments in the function...
```
A Member commented:

but I think we agreed that we can just hard-code this at the QM level, so it should be ok.

Let's keep the comment until this is actually hard-coded!

A Member commented:

Actually, this was done already: #3559

```python
    "amplitude_median": float,
    "amplitude_cv_median": float,
    "amplitude_cv_range": float,
    "sync": float,
```
A Member commented:

then this could become sync_spike_2, sync_spike_4, sync_spike_8 as well

@alejoe91 (Member) commented:

@zm711 since we hard-coded the sync sizes in #3559 I simplified the PR, so we don't have to deal with them differently.

For back compatibility, I just added a check that casts the column values only if the column name is in column_name_to_column_dtype. So for example, if one computed a sync_spike_32, it won't be cast but would still work.
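That back-compatibility check can be sketched like this (hypothetical mapping and column names, not the actual SpikeInterface code):

```python
import pandas as pd

# only cast columns whose names appear in the mapping; anything the user
# computed with non-default parameters (e.g. a sync_spike_32) passes
# through with its inferred dtype and still works
column_name_to_column_dtype = {"sync_spike_2": "float64", "num_spikes": "int64"}

metrics = pd.DataFrame({"sync_spike_2": [0, 1], "sync_spike_32": [0, 1]})

for column in metrics.columns:
    if column in column_name_to_column_dtype:
        metrics[column] = metrics[column].astype(column_name_to_column_dtype[column])

print(metrics.dtypes)  # sync_spike_2 -> float64; sync_spike_32 left as int64
```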

@zm711 commented Jan 10, 2025

Okay cool. Works for me :)

@alejoe91 merged commit 82d62ca into SpikeInterface:main Jan 10, 2025
15 checks passed