
Fix dtype of quality metrics before and after merging #3497

Merged
23 commits merged into SpikeInterface:main on Jan 10, 2025

Conversation

@zm711 (Collaborator) commented Oct 22, 2024

MRE

```python
import pandas as pd

df = pd.DataFrame({'test': [1, 2, 3]})
new_df = pd.DataFrame(index=df.index, columns=df.columns)

df.test.dtype
# Out: dtype('int64')

new_df.test.dtype
# Out: dtype('O')
```

Basically, when you build a new dataframe from a previous dataframe's index and columns, pandas forces the dtype to be object instead of numeric.

Easy Solution

Using pd.to_numeric brings us back to numeric values.

Caveats

This switches the dtype from the pandas Float64 to the numpy float64. I don't think this is too bad, since doing queries should still be fine, no?
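As a quick sanity check (a minimal sketch with made-up values, not code from the PR), a typical threshold query gives the same answer whether the column uses the pandas nullable Float64 dtype or plain numpy float64:

```python
import pandas as pd

# Illustrative values: a threshold query behaves the same on pandas'
# nullable Float64 extension dtype and on plain numpy float64.
s_nullable = pd.Series([0.5, 1.5, 2.5], dtype="Float64")
s_numpy = s_nullable.astype("float64")

q1 = s_nullable > 1.0  # boolean extension dtype
q2 = s_numpy > 1.0     # plain numpy bool

print(list(q1))  # [False, True, True]
print(list(q2))  # [False, True, True]
```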

Testing

I added a small test for merging, but let me know if we'd prefer not to have it.

@zm711 added the qualitymetrics (Related to qualitymetrics module) label Oct 22, 2024
```python
# we can iterate through the columns and convert them back to numbers with
# pandas.to_numeric. errors="coerce" allows us to keep the NaN values.
for column in metrics.columns:
    metrics[column] = pd.to_numeric(metrics[column], errors="coerce")
```
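For illustration, here is that coerce loop applied to a toy frame built the same way as the MRE above (the column names are made up, not the actual quality-metric columns):

```python
import pandas as pd

# Toy reproduction (hypothetical column names): a frame built from another
# frame's index/columns starts with object-dtype columns.
old_metrics = pd.DataFrame({"snr": [5.0, 2.5], "firing_rate": [10.0, 3.0]})
metrics = pd.DataFrame(index=old_metrics.index, columns=old_metrics.columns)
print(metrics["snr"].dtype)  # object

metrics.loc[0] = old_metrics.loc[0]  # fill one row; the other stays NaN

# the loop from the PR: convert each column back to numeric, keeping NaN
for column in metrics.columns:
    metrics[column] = pd.to_numeric(metrics[column], errors="coerce")

print(metrics.dtypes)  # both columns are float64 now
```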
A Member commented:
this is ok for me.
pandas behavior is becoming quite cryptic for me.
Using old_metrics[col].dtype could also be used, no?

@zm711 (Collaborator, Author) replied:

Maybe. I agree pandas is making their own dtypes like NAType which don't play nicely with numpy (in my scripts I tend to just query based on numpy stuff), so I don't know for sure. I could test that later. Although for me I would prefer to coerce everything to numpy types since that's what I'm used to. None of my tables are big enough that I worry about the dtype-efficiency work pandas has been doing with the new backend.

@JoeZiminski (Collaborator) left a comment:

Hey @zm711, this looks great, good catch! Super useful test too. Weird behaviour from pandas. Some minor comments:

  • I think new_df = pd.DataFrame(index=df.index, columns=df.columns, dtype=np.float64) will have the same effect. You lose the coerce-on-error behaviour, but assuming the data is always going to be filled with NaN this shouldn't be a problem. However, it is more implicit and provides less information about the weird pandas behaviour than the loop approach.

  • The result of this operation is that all columns are np.float64, but in the original metrics as returned from compute_quality_metrics some columns are Int64Dtype. This seems to be dynamic based on contents (e.g. in the test run the presence ratios were all 1 and the dtype is Int64Dtype, but presumably it would be a float under most circumstances). num_spikes I guess will always be int. The only time I can imagine this being a problem is if some equality check is performed, e.g. num_spikes == 1, which might work for the original compute_quality_metrics output but fail after merging since the data will be float. So maybe it is simplest just to cast num_spikes -> Int64Dtype and leave the rest as float?
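The constructor alternative from the first bullet can be sketched like this (toy data, not the actual metrics table):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"snr": [5.0, 2.5], "num_spikes": [10, 3]})

# passing dtype at construction time avoids the object-dtype columns,
# at the cost of losing the errors="coerce" safety net of pd.to_numeric
new_df = pd.DataFrame(index=df.index, columns=df.columns, dtype=np.float64)

print(new_df.dtypes)              # float64 for every column
print(new_df.isna().all().all())  # the frame starts out all-NaN
```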

@zm711 commented Oct 24, 2024

Thanks so much @JoeZiminski!

> I think new_df = pd.DataFrame(index=df.index, columns=df.columns, dtype=np.float64)

I'm no pandas expert, so I'm happy to have changes here if they are better! I just don't have an intuition for what the smartest strategy is, so if you know pandas really well then I'll make the change :)

> e.g. in the test run presence ratios were all 1 and its dtype is Int64Dtype but presumably it would be a float under most circumstances.

True. This is our mistake for letting pandas infer. Presence ratio is a float between 0 and 1, but if the values are all 0s or all 1s it casts to int for memory purposes. Users should never assume it is an int, although it could be in extreme cases. It would be better for us to explicitly make it a float and take the memory hit, in my opinion.

> num_spikes I guess will always be int.

This is true, and when I scanned the table I forgot about this one. It would be better to make that one an int. I don't think a user should ever do num_spikes == x, but I could imagine a query like num_spikes >= x. I think (but correct me if I'm wrong) that for floats only the fractional part is unstable, so that float(1) >= int(1) holds. In this case testing against a minimum number of spikes should not be a problem. If I'm wrong, then I think we have to cast that one back to int in a separate step.
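The threshold-query point can be checked directly: IEEE-754 doubles represent every whole number up to 2**53 exactly, so int vs float storage cannot change the result of a comparison at realistic spike counts (a toy check, not code from the PR):

```python
import numpy as np

# whole numbers up to 2**53 are represented exactly in a 64-bit float,
# so int/float threshold comparisons agree for realistic spike counts
assert np.float64(1.0) >= np.int64(1)
assert np.float64(100.0) == 100

counts_int = np.array([1, 50, 1000], dtype=np.int64)
counts_float = counts_int.astype(np.float64)
print((counts_int >= 50) == (counts_float >= 50))  # all True
```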

> super useful test too

<3 Thanks. I figure we really need to protect ourselves from some of these small regressions. So I'm trying :)

@zm711 commented Nov 1, 2024

@alejoe91, do you have any opinions of implementing this? Happy to change to a different method if you prefer something. I think the only thing we are failing to maintain is num_spikes.

```python
assert len(metrics.index) > len(new_metrics.index)

# dtype should be fine after merge but is cast from Float64 -> float64
assert np.float64 == new_metrics["snr"].dtype
```
A Member commented:

we can add a test on int coercion if we end up using the suggestion here: https://github.com/SpikeInterface/spikeinterface/pull/3497/files#r1827487180

@zm711 commented Nov 4, 2024

So the problem is that pandas will infer the dtype, and sometimes this is actually wrong. For example, presence ratio (see above) should technically always be a float between 0.0 and 1.0, but if the values are all 1s and 0s it will be stored as an int. Then if we merge and get a fraction, the dtype is wrong. I think it might be better to hard-code int64 for num_spikes, since everything else is a float.

I basically implemented Sam's idea, but this fails unless we hard-code the dtypes of the different metrics rather than allow pandas to infer them. What do people think about me adding a line to coerce everything in the original calculator to float64 except num_spikes? Then we have that be int, and we should be fine to do the casting I do here in this code. If we make sure we know the dtypes of our metrics, we are safer.

@zm711 commented Nov 22, 2024

Okay, so the changes in this PR:

  1. we now ensure all columns have the dtype specified in our name_to_dtype dict (pandas is not allowed to infer)
  2. the test for empty units was changed because we no longer put NaNs in num_spikes (because that should be 0 for empty units) -- let me know if we should discuss
  3. merging now recasts the columns to the original dtypes
  4. a test was added for merging
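The dtype-pinning idea in point 1 can be sketched as follows; the dict below is illustrative, not the actual SpikeInterface name_to_dtype mapping:

```python
import pandas as pd

# hypothetical subset of a name-to-dtype mapping; pandas is never allowed
# to infer, so presence_ratio stays float even when every value happens
# to be an integer
name_to_dtype = {
    "num_spikes": "int64",
    "snr": "float64",
    "presence_ratio": "float64",
}

metrics = pd.DataFrame(
    {"num_spikes": [10, 3], "snr": [5.0, 2.5], "presence_ratio": [1, 1]}
)

for column in metrics.columns:
    metrics[column] = metrics[column].astype(name_to_dtype[column])

print(metrics.dtypes)  # num_spikes is int64, everything else float64
```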

@zm711 changed the title from "Fix dtype of quality metrics after merging" to "Fix dtype of quality metrics before and after merging" Nov 22, 2024
@zm711 left a comment:

I fixed the conflict. Let me know what you all think :)

Comment on lines 252 to 256:

```python
# we have one issue where the names of the columns for synchrony are based on
# what the user has input as arguments, so we need a way to handle this
# separately; everything else should be handled with the column name.
if "sync" in column:
    metrics[column] = metrics[column].astype(column_name_to_column_dtype["sync"])
```
@zm711 replied:

I would argue we keep this for backward compatibility, no? I could add a comment saying we can simplify this in a couple of versions.

@zm711 commented Jan 9, 2025

Never mind. After Chris's updates I thought this would go in cleanly other than one conflict. His changes actually require me to make additional updates to this PR...

@zm711 commented Jan 9, 2025

Now we are finally ready :)

```python
    "sync_spike_2",
    "sync_spike_4",
    "sync_spike_8",
],  # we probably shouldn't hard-code this. This is determined by the arguments in the function...
```
A Member commented:

but I think we agreed that we can just hard-code this at the QM level, so it should be ok.

Let's keep the comment until this is actually hard-coded!

A Member commented:

Actually, this was done already: #3559

```python
    "amplitude_median": float,
    "amplitude_cv_median": float,
    "amplitude_cv_range": float,
    "sync": float,
```
A Member commented:

then this could become sync_spike_2, sync_spike_4, sync_spike_8 as well

@alejoe91 (Member) commented:

@zm711 since we hard-coded the sync sizes in #3559 I simplified the PR, so we don't have to deal with them differently.

For back compatibility, I just added a check that casts the column values only if the column name is in column_name_to_column_dtype. So for example, if one computed a sync_spike_32, it won't be cast but would still work.
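That back-compatibility check can be sketched like this (hypothetical mapping and column names, not the actual SpikeInterface code):

```python
import pandas as pd

# only cast columns whose names appear in the mapping; anything the user
# computed with non-default parameters (e.g. a sync_spike_32) passes
# through with its inferred dtype and still works
column_name_to_column_dtype = {"sync_spike_2": "float64", "num_spikes": "int64"}

metrics = pd.DataFrame({"sync_spike_2": [0, 1], "sync_spike_32": [0, 1]})

for column in metrics.columns:
    if column in column_name_to_column_dtype:
        metrics[column] = metrics[column].astype(column_name_to_column_dtype[column])

print(metrics.dtypes)  # sync_spike_2 -> float64; sync_spike_32 left as int64
```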

@zm711 commented Jan 10, 2025

Okay cool. Works for me :)

@alejoe91 merged commit 82d62ca into SpikeInterface:main Jan 10, 2025
15 checks passed