Map - imputing zeroes doesn't seem to impute enough zeroes #31

Open
bobular opened this issue Dec 1, 2023 · 9 comments

@bobular
Member

bobular commented Dec 1, 2023

Megastudy

Filters:

  • PopBio ID = VBP0000844
  • Collection start date: 2022-05-01 to 2022-11-01
  • Provider name for collection site: Canyon Rim Church

Markers: donut by species

Floater: timeline

  • unbiased specimen count vs. collection start date
  • overlay with species
  • bin width = 1 day

Should be available in this saved analysis: https://vectorbase.org/vectorbase/app/workspace/maps/A4nuqcm/import

image

Two issues

  1. There ought to be points for each species for all timepoints - where there are currently no points, the points should be at zero
  2. There exist collections with no samples at all for some time points: 2022-06-01, 2022-06-14, 2022-08-16, 2022-08-30 and 2022-09-27 - these are shown in the Collection dates floating histogram - these should also have zero points for all species
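To make the expected behaviour concrete, here is a minimal Python sketch. It is not the actual back-end code; it assumes long-format records of (date, species, count) plus a separate list of collection dates. The function name `impute_zeroes`, the argument names, the species names and the counts are all hypothetical and made up for illustration. Only the date 2022-06-01 is taken from the list above, as an example of a collection with no samples.

```python
from itertools import product

def impute_zeroes(records, collection_dates, species_vocab):
    """Return one count per (date, species) pair, filling gaps with 0.

    records: iterable of (date, species, count) tuples (hypothetical shape)
    collection_dates: every date on which a collection took place
    species_vocab: every species in the study-specific vocabulary
    """
    observed = {}
    for date, species, count in records:
        key = (date, species)
        observed[key] = observed.get(key, 0) + count
    # Every date x species combination gets a value; missing ones become 0.
    return {
        (date, species): observed.get((date, species), 0)
        for date, species in product(sorted(collection_dates), sorted(species_vocab))
    }

# Made-up example data: 2022-06-01 is a collection with no samples at all.
records = [
    ("2022-05-24", "Culex pipiens", 4),
    ("2022-05-24", "Culex tarsalis", 2),
    ("2022-06-07", "Culex pipiens", 1),
]
collection_dates = ["2022-05-24", "2022-06-01", "2022-06-07"]
species_vocab = ["Aedes vexans", "Culex pipiens", "Culex tarsalis"]

completed = impute_zeroes(records, collection_dates, species_vocab)
```

With this input every species has a point on every collection date, including three zero points on 2022-06-01.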

I notice that there is a __UNSELECTED__ series returned from the back end with all zero y values. The client didn't ask for this (it only sent three overlayValues in the requests for markers and lineplot). I don't think these are the missing zeroes.

@bobular
Member Author

bobular commented Dec 1, 2023

Looked closer at the __UNSELECTED__ stratum of the histogram (of unbiased specimen count) response. It certainly returns a count of zeroes (equal to the number of collections in the subset, it seems) but only in this __UNSELECTED__ stratum.

But with stratification turned off, one would expect the histogram to return counts for zeroes and it doesn't. (I checked this by setting the bin width to 0.1.)
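A rough Python sketch of that check, under the assumption that the histogram simply bins the per-sample counts: with a bin width of 0.1, zeroes fall into their own bin (index 0), well separated from a count of 1, so their absence is easy to spot. The `histogram` helper and the numbers are hypothetical, not taken from the real response.

```python
import math
from collections import Counter

def histogram(values, bin_width):
    """Count values per bin, keyed by bin index (0 = the bin starting at zero)."""
    return Counter(math.floor(v / bin_width) for v in values)

# Made-up counts: three real observations plus two zeroes that should be imputed.
observed_only = [1, 2, 4]
with_zeroes = observed_only + [0, 0]

hist_without = histogram(observed_only, 0.1)
hist_with = histogram(with_zeroes, 0.1)

zero_bin_without = hist_without.get(0, 0)  # no values fall in the zero bin
zero_bin_with = hist_with.get(0, 0)        # the two imputed zeroes land here
```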

@d-callan
Member

d-callan commented Dec 4, 2023

i think i know why the histo doesnt impute zeroes when the overlay is turned off, and will think about how to fix that.. the missing time points for species ill have to investigate some more..

@d-callan d-callan self-assigned this Dec 4, 2023
@d-callan d-callan transferred this issue from VEuPathDB/EdaNewIssues Dec 4, 2023
@d-callan
Member

d-callan commented Dec 11, 2023

ive spent a bit of time looking at this and thinking about it in a bit more depth. one thing thats clear so far is that we need to find all combinations of values for all study specific vocab variables regardless of whether the study specific vocab variable is in the final plot.

i think what id propose to do next is the following:

  1. make it so the method that imputes zeroes is happy to receive columns that wont be in the resulting plot, so we can use them to find combinations to impute zeroes for (for the histogram issue)
  2. add some unit tests for the R that include things like dates (to at least eliminate that as a source of the problem for the line plot)
  3. modify the data service so that itll pass the study-specific vocab variables any time we see unbiased specimen count in a plot, regardless of whether the study vocab variables are also in the plot
  4. revisit the missing points in line plot, once we know the issue ive identified is corrected
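Step 1 above could look roughly like the following Python sketch (the real method is in R, so this only illustrates the idea). Zero rows are added for every missing combination of vocabulary values, using variables that are not necessarily plotted. The function name `impute_zero_rows`, the row layout and the example values are assumptions made for illustration.

```python
from itertools import product

def impute_zero_rows(rows, vocabs, collection_ids=None,
                     id_key="collection_id", count_key="count"):
    """Add a zero-count row for every missing combination of vocabulary values.

    vocabs maps variable name -> full study-specific vocabulary. Variables
    listed here are used to find combinations even if they are not plotted.
    collection_ids may include collections that have no rows at all.
    """
    names = sorted(vocabs)
    seen = {(r[id_key],) + tuple(r[n] for n in names) for r in rows}
    if collection_ids is None:
        collection_ids = sorted({r[id_key] for r in rows})
    completed = list(rows)
    for cid in collection_ids:
        for combo in product(*(vocabs[n] for n in names)):
            if (cid,) + combo not in seen:
                row = {id_key: cid, count_key: 0}
                row.update(zip(names, combo))
                completed.append(row)
    return completed

# Made-up data: one observed row, two vocabulary variables.
rows = [{"collection_id": "c1", "species": "A", "sex": "female", "count": 5}]
vocabs = {"species": ["A", "B"], "sex": ["female", "male"]}

completed = impute_zero_rows(rows, vocabs)
# A histogram with no overlay would then bin these values, zeroes included.
histogram_values = [r["count"] for r in completed]
```

Neither species nor sex is part of the final histogram here, yet both contribute to the three imputed zeroes.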

@d-callan
Member

d-callan commented Dec 11, 2023

> There exist collections with no samples at all for some time points: 2022-06-01, 2022-06-14, 2022-08-16, 2022-08-30 and 2022-09-27 - these are shown in the Collection dates floating histogram - these should also have zero points for all species

im finding this curious.. isnt this basically fabricating collections? are we saying to introduce a collection where every sample for all species has a specimen count of 0? or am i misunderstanding?

@d-callan
Member

d-callan commented Dec 11, 2023

also, just to keep all my thoughts in one place, here are some imputing zero related questions i have. i may have already asked and had these answered last time i looked at imputing zeroes, but i dont remember.. sorry! but i think getting answers to them now while im looking at this again would be good.

  1. are (or should) all the vars on the sample entity be considered to have study-specific vocabs? and if not, why not?
  2. if there are variables which dont have study-specific vocabs, do we still want to include that variable in finding combinations to impute 0s for? i think maybe so, but i dont think we talked about it.
  3. is it possible to have a continuous variable outside of unbiased specimen counts on the sample entity? what values would we impute 0s for if we did?

@bobular
Member Author

bobular commented Dec 11, 2023

> > There exist collections with no samples at all for some time points: 2022-06-01, 2022-06-14, 2022-08-16, 2022-08-30 and 2022-09-27 - these are shown in the Collection dates floating histogram - these should also have zero points for all species
>
> im finding this curious.. isnt this basically fabricating collections? are we saying to introduce a collection where every sample for all species has a specimen count of 0? or am i misunderstanding?

It's not fabricating collections. A collection effort was undertaken in a specific location with a certain device, and at a certain time. This information is provided to us by the, er, providers. It's just that there are zeroes in all columns (it often comes in wide format).
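As an illustration of the wide format mentioned above, here is a hedged Python sketch. The column names, the `wide_to_long` helper and the data are invented; the real provider files may look different. The point is that a collection with zeroes in every species column is still an explicit row, whereas a date on which nothing was collected has no row at all.

```python
def wide_to_long(wide_rows, id_columns=("collection_id", "date")):
    """Turn one row per collection (one column per species) into
    one row per (collection, species) with an explicit count."""
    long_rows = []
    for row in wide_rows:
        ids = {k: row[k] for k in id_columns}
        for column, value in row.items():
            if column in id_columns:
                continue
            long_rows.append({**ids, "species": column, "count": value})
    return long_rows

# Made-up provider data: collection c2 happened but caught nothing.
wide_rows = [
    {"collection_id": "c1", "date": "2022-05-24", "Culex pipiens": 4, "Culex tarsalis": 0},
    {"collection_id": "c2", "date": "2022-06-01", "Culex pipiens": 0, "Culex tarsalis": 0},
]

long_rows = wide_to_long(wide_rows)
dates_with_collections = {r["date"] for r in long_rows}
```

Here 2022-06-01 appears in `dates_with_collections` because the collection exists, even though every count for it is zero.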

@d-callan
Member

How can I know the difference between that case and a collection not having happened? Dates don't have vocabularies. maybe I'm still not understanding..

@bobular
Member Author

bobular commented Dec 11, 2023

Thanks for looking into our favourite topic 🤯

> also, just to keep all my thoughts in one place, here are some imputing zero related questions i have. i may have already asked and had these answered last time i looked at imputing zeroes, but i dont remember.. sorry! but i think getting answers to them now while im looking at this again would be good.
>
> 1. are (or should) all the vars on the sample entity be considered to have study-specific vocabs? and if not, why not?

I was thinking about this last week too. I came to the conclusion that a) this could only apply to categorical variables, and b) all the other variables were effectively homogeneous/constant/single-valued within a study.

So the answer is yes, we could consider all variables the same way, but in practice it's really only going to be the species, sex and dev. stage that produce any relevant combinations.

> 2. if there are variables which dont have study-specific vocabs, do we still want to include that variable in finding combinations to impute 0s for? i think maybe so, but i dont think we talked about it.

How is this different from 1?

> 3. is it possible to have a continuous variable outside of unbiased specimen counts on the sample entity? what values would we impute 0s for if we did?

Ah, I see you've been thinking about that too...

I genuinely can't think of an example continuous variable that could live alongside unbiased specimen count. I think that's because the count variable is signifying X identical copies of identical specimens. Let's say we were collecting mosquitoes and measuring species, sex and wing length as a continuous variable. The unbiased specimen count would have to be 1 for every record - because no two wing lengths are identical. If the wing length measurement was binned/categorical then it would make sense, just like species, sex and dev. stage do now.

So in short, no I don't think we need to worry about this.
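The wing length argument can be sketched in Python as follows. The specimens, the measurements and the 3.5 mm cut-off are all made up for illustration. Grouping on the raw continuous value gives a count of 1 for every record, while grouping on a binned version gives counts that behave like species, sex and dev. stage do now.

```python
from collections import Counter

# Made-up specimens: (species, sex, wing length in mm)
specimens = [
    ("Culex pipiens", "female", 3.12),
    ("Culex pipiens", "female", 3.47),
    ("Culex pipiens", "female", 3.51),
    ("Culex pipiens", "male", 2.98),
]

# Grouping on the raw continuous value: no two records are identical.
raw_counts = Counter(specimens)

def wing_bin(length_mm, cutoff=3.5):
    """Hypothetical two-level binning of wing length."""
    return "<3.5 mm" if length_mm < cutoff else ">=3.5 mm"

# Grouping on the binned value: identical records can now pile up.
binned_counts = Counter(
    (species, sex, wing_bin(length)) for species, sex, length in specimens
)
```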

@bobular
Member Author

bobular commented Dec 11, 2023

> i think what id propose to do next is the following:

This all sounds good. I don't know if the data service "knows" enough to do 3. but hopefully it does! If there's anything the client can do, let me know.
