Unify slicing behavior in SpikeInterface #1989
OK, so initially I wanted to follow numpy as much as possible. I had two cases in mind: out-of-bounds slicing, where we should clip, and slices where the start is at or beyond the stop, where numpy returns an empty array:

```python
In [1]: x = np.ones(10)

In [2]: x[15:13]
Out[2]: array([], dtype=float64)

In [3]: x[5:3]
Out[3]: array([], dtype=float64)

In [4]: x[10:10]
Out[4]: array([], dtype=float64)
```

Note that standard lists behave the same way.
But then I remembered that there is actually a standard in the matter that I think we should follow, the array API standard, and it has sections on exactly this indexing behavior. I think we should just follow that. We could start second-guessing the standard because we think we know better, but that increases the cost of interacting with the ecosystem, and it increases the cost of discussion among ourselves if a non-standard rule we disagree on comes up in the future. So, my reading of the standard, and my current position: it is fine to throw an error when `start_frame` or `end_frame` is out of bounds (as this is unspecified behavior in the standard), but we should return an empty array with the dimension of the channels if `start_frame == end_frame`.
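To make that last point concrete, here is a quick numpy illustration with toy data: slicing a 2-D (samples x channels) array with equal start and stop yields an empty array that still carries the channel dimension.

```python
import numpy as np

traces = np.zeros((100, 4))  # toy data: 100 samples, 4 channels
empty = traces[10:10, :]     # start == stop
print(empty.shape)           # (0, 4): zero samples, channel dimension preserved
```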
This is a great post and an important question with no clear best answer. I think the decision depends in part on the broader aims of spikeinterface and the intended audience. On one extreme, there is a highly restrictive API which does not allow any indexing procedure that could be interpreted as a mistake, i.e. indexing the recording out of bounds, or with `start_frame == end_frame`. The benefits of a more restrictive API are 1) it is easier to track errors and detect bugs, 2) it enforces very explicit handling of all use cases, and 3) it is easier for non-expert users to detect mistakes in their code. The downside is that it increases initial overhead when writing code, breaks with the existing data-slicing API, and is less flexible. At present, I lean towards a highly restrictive API that will throw an error if any potentially unintended indexing operation is performed. Although this might lead to some initial overhead to explicitly handle all indexing cases, I think it will pay dividends down the line with increased ease of bug-fixing and a reduced number of user-reported issues. Also, I think that SI should aim to be the de facto tool for in vivo electrophysiology analysis in Python, inevitably drawing novice Python users with its easy-to-use API. Often in practice those analysing electrophysiological data will have little computational training. Having robust checks on common indexing mistakes will make SI more robust and save a lot of headaches for new users. Fundamentally, the problem with allowing such indexing operations is that it is impossible to tell whether the indexing was done on purpose or is due to a mistake or oversight. The possible options then are to allow both intentional and unintentional indexing of this sort (i.e. a flexible API), or to allow neither (a restrictive API). I think in general it is safer to handle the intended cases at a higher level than `get_traces()`.
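For concreteness, a minimal sketch of what the restrictive option could look like; the function name and error type here are assumptions for illustration, not the actual SpikeInterface implementation:

```python
import numpy as np

def restrictive_get_traces(traces: np.ndarray, start_frame: int, end_frame: int) -> np.ndarray:
    """Reject any slice that could plausibly be a mistake (hypothetical sketch)."""
    num_samples = traces.shape[0]
    if start_frame < 0 or end_frame < 0:
        raise ValueError("start_frame and end_frame must be non-negative")
    if end_frame > num_samples:
        raise ValueError(f"end_frame ({end_frame}) exceeds num_samples ({num_samples})")
    if start_frame >= end_frame:
        raise ValueError("start_frame must be strictly smaller than end_frame")
    return traces[start_frame:end_frame]
```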
Tagging @CodyCBakerPhD and @pauladkisson as we usually model roiextractors after the conventions in spikeinterface.
Restrictive indexing makes sense to me -- I just wrote some input validation for the MultiPlaneScanImage extractor (see PR) to prevent out-of-bounds channel and plane indexing. Otherwise, when the specified channel/plane number exceeded the actual number of channels/planes, some other frame was returned (a silent error) due to the way that ScanImage stores its data (sequentially, in a round-robin format). Seems like restrictive indexing is the much safer way to approach this.
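A toy model of the silent error described above (all names here are hypothetical, not the ScanImage extractor's actual API): with round-robin storage, an out-of-range plane index does not fail, it silently returns a frame from a later cycle.

```python
import numpy as np

# Frames for n_planes planes are stored sequentially, so the frame for
# (cycle, plane) lives at index cycle * n_planes + plane.
n_planes, n_cycles = 3, 4
frames = np.arange(n_planes * n_cycles)  # stand-in for the stored frames

def get_frame(cycle: int, plane: int):
    # Without bounds checking, plane >= n_planes lands in a later cycle.
    return frames[cycle * n_planes + plane]

print(get_frame(cycle=0, plane=2))  # frame 2, as intended
print(get_frame(cycle=0, plane=5))  # frame 5 -- really (cycle=1, plane=2), no error
```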
My feeling is: we should go more strict. In short, raise if `start_frame`/`end_frame` are out of bounds (including negative). Mainly because it means less work, less code, and fewer bugs.
The point that @pauladkisson brings to the table moves me in the direction of more restrictive and away from my initial position. We indeed deal with multiple segments in SI by default, and I can see that running past the end of the first segment and getting no errors would be even more confusing in this context.
So, it seems that everybody wants to throw errors for out-of-bounds frames (including negative). I was the only one who was negative on that, but I have been convinced by you guys and by the standard leaving that behavior unspecified. Let's say we give this another week, and if we don't have any strong arguments by 2023/10/06 we run with the restrictive version. That is, we will raise errors if `start_frame` or `end_frame` is out of bounds (including negative).

What is less clear to me is what we should do for the `start_frame == end_frame` case.
Somebody else that is not me should put a date on that one, as this is too much of a hard pill to swallow : )
@h-mayorquin You're not alone! I know my comment probably won't change anything, since almost everybody prefers throwing errors, but I still wanted to give my opinion (but I respect the fact that the majority wins). I also prefer doing it the "numpy way", where `start_frame == end_frame` returns an empty array.
These are very good points, and the current discussion on documentation reminds me again how important it is to check what the other main packages are doing and follow them as closely as possible, so any divergence should be made only with very good reason. In this case, I think there is a reason to deviate from the slicing behaviour of Python / numpy. The purpose of those tools is to provide general-purpose array structures that support a very broad number of cases. I can see how, in the generalised concept of an array, indexing with equal start and stop and getting back an empty array makes sense. However, in the context of spikeinterface, the concept is not a generalised array but an extracellular ephys timeseries. In this case, it does not make sense to try and index from and to the same timepoint, especially in the case of `get_traces()`.
One thing that we did not discuss is whether we follow the numpy convention of accepting slicing with None:

```python
In [1]: np.arange(10)[None:]
Out[1]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [2]: np.arange(10)[None:None]
Out[2]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [3]: np.arange(10)[:None]
Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
```

As a proponent above of "doing what numpy does" I think we should, but given that the slicing-police side won above, maybe you also want to throw an error here? @JoeZiminski @samuelgarcia @alejoe91

@samuelgarcia I will add this to the next maintenance discussion topic so we can agree on this, write some guidelines, and maybe close this issue. This comes from the discussion in #3159, where the solution depends on the decision.
Good point @h-mayorquin. For this, I think if the user has explicitly passed `None`, it should behave like the default, i.e. slice from the start / to the end of the traces.
It is a good point that taking the opposite stance would break backwards compatibility. I prefer to comply with the array standard as much as we can, and this does. Probably everybody agrees with this one.
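A minimal sketch of that agreement (the helper name is hypothetical, not the actual implementation): an explicitly passed `None` is resolved to the defaults before any bounds checking, mirroring Python/numpy slice semantics.

```python
def resolve_frames(start_frame, end_frame, num_samples):
    # None behaves as if the argument were omitted: slice from the
    # start / to the end of the trace.
    start = 0 if start_frame is None else start_frame
    end = num_samples if end_frame is None else end_frame
    return start, end

assert resolve_frames(None, None, 1000) == (0, 1000)
assert resolve_frames(None, 500, 1000) == (0, 500)
```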
Related to #1881, #1886 and #1979.

Let's discuss how we should handle the behavior of `start_frame` / `end_frame` / `channel_ids` in `get_traces()`. Plus, let's discuss how to add generalized testing once we reach consensus.

Some proposals:

- `end_frame` > `start_frame`: passing `start_frame == end_frame` will raise an Exception
- `end_frame` < `num_samples`

For the `channel_ids` the behavior is already established, but we should consider adding tests for them when we add generalized testing.
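As a starting point for that generalized testing, something along these lines could work. This is a hypothetical pytest sketch: `ToyRecording` is a stand-in for a real extractor, and the exact error type is an assumption.

```python
import numpy as np
import pytest

class ToyRecording:
    """Stand-in for a recording extractor with the restrictive rules above."""

    def __init__(self, num_samples=1000, num_channels=4):
        self._traces = np.zeros((num_samples, num_channels))

    def get_traces(self, start_frame, end_frame):
        num_samples = self._traces.shape[0]
        if not 0 <= start_frame < end_frame <= num_samples:
            raise ValueError("invalid start_frame/end_frame combination")
        return self._traces[start_frame:end_frame]

@pytest.mark.parametrize(
    "start_frame,end_frame",
    [(10, 5), (5, 5), (-1, 10), (0, 10**9)],  # reversed, equal, negative, out of bounds
)
def test_invalid_slices_raise(start_frame, end_frame):
    with pytest.raises(ValueError):
        ToyRecording().get_traces(start_frame, end_frame)
```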