Duplicate detections #259

Open
peterdesmet opened this issue Nov 24, 2022 · 14 comments
Assignees
Labels
bug Something isn't working database Related to ETN database

Comments

@peterdesmet
Member

In 2014_demer (but likely in other projects), I discovered detections that are duplicates (same datetime, receiver, transmitter), except for their station_name and file (source of data):

@PieterjanVerhelst @IPauwels is this valid data? If not, @aubrivliz do you think this is an issue in the acoustic.detections_limited query?

[Image: Screenshot 2022-11-24 at 10 45 42]

@peterdesmet added the bug and database labels on Nov 24, 2022
@aubrivliz aubrivliz self-assigned this Nov 24, 2022
@aubrivliz

Is this from a specific deployment?

@peterdesmet
Member Author

No, this is across the entire project. The issue also occurs in other projects. Here's the code to get all duplicates:

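library(etn)
library(dplyr)
con <- connect_to_etn()
# Get all detections for the project
det <- get_acoustic_detections(animal_project_code = "2014_DEMER")
# Keep detections that share tag and timestamp with at least one other row
dups <-
  det %>%
  group_by(tag_serial_number, date_time) %>%
  filter(n() > 1)
dups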

File: dups.csv
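
The query above groups on tag and timestamp only, while the duplicates described in this issue also share the receiver. A minimal sketch of the stricter definition, run on made-up toy data rather than ETN data. The column names `receiver_id` and `station_name` are assumed to match the detections table, and all values are hypothetical:

```r
library(dplyr)

# Toy detections (hypothetical values), not real ETN data
det <- data.frame(
  detection_id = c(1, 2, 3, 4),
  date_time = c(
    "2014-05-01 10:00:00", "2014-05-01 10:00:00",
    "2014-05-01 10:00:00", "2014-05-02 08:30:00"
  ),
  receiver_id = c("VR2W-122322", "VR2W-122322", "VR2W-999999", "VR2W-122322"),
  tag_serial_number = c("1111", "1111", "1111", "2222"),
  station_name = c("de-12", NA, "de-14", "de-12"),
  stringsAsFactors = FALSE
)

# Duplicates: same datetime AND receiver AND transmitter
dups_strict <-
  det %>%
  group_by(date_time, receiver_id, tag_serial_number) %>%
  filter(n() > 1) %>%
  ungroup()

dups_strict
```

With this grouping, a tag heard by two different receivers at the same second (row 3 in the toy data) is not flagged as a duplicate.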

@PieterjanVerhelst
Collaborator

@peterdesmet the file inbo_data_file contains all detection data from INBO projects up to a certain date, I think 31 December 2014. This file was uploaded to ETN. We decided to do this bulk upload because it was impossible to figure out the different deployments across the various projects before that date. However, it seems that the data from 2014_Demer were also uploaded separately afterwards, leading to duplicates. @IPauwels, is this correct: did you upload those data separately to ETN?
If so, one of the duplicates can be removed, preferably the ones without a station name.
For two duplicates the station name is different (tag ID A69-1601-28295 at stations de-12 vs de-14). I have no idea how this is possible. Since it was @IPauwels's project, she is the right person to give feedback on that specific issue.
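
The rule suggested here (keep the record that has a station name, and leave conflicting station names for manual review) could be sketched as follows. This is only an illustration on made-up data: the column names are assumed to match the detections table, and the `conflict` and `keep` columns are hypothetical helpers.

```r
library(dplyr)

# Toy duplicates (hypothetical values), not real ETN data
dups <- data.frame(
  detection_id = c(1, 2, 3, 4),
  date_time = c(
    "2014-05-01 10:00:00", "2014-05-01 10:00:00",
    "2014-06-01 12:00:00", "2014-06-01 12:00:00"
  ),
  receiver_id = "VR2W-122322",
  tag_serial_number = "1111",
  station_name = c(NA, "de-12", "de-12", "de-14"),
  stringsAsFactors = FALSE
)

resolved <-
  dups %>%
  group_by(date_time, receiver_id, tag_serial_number) %>%
  # Groups with more than one non-missing station name need a human decision
  mutate(conflict = n_distinct(station_name, na.rm = TRUE) > 1) %>%
  # Within each group, put records with a station name first
  arrange(is.na(station_name), .by_group = TRUE) %>%
  # Keep the first record of each non-conflicting group
  mutate(keep = !conflict & row_number() == 1) %>%
  ungroup()

resolved
```

Rows with `keep == FALSE` and `conflict == FALSE` are candidates for removal; rows with `conflict == TRUE` (such as the de-12 vs de-14 case) are left untouched.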

@peterdesmet
Member Author

peterdesmet commented Nov 25, 2022

Thanks @PieterjanVerhelst. Unfortunately, the issue is not limited to that upload or to INBO projects:

| animal_project_code | duplicates | checked by |
| --- | --- | --- |
| 2010_phd_reubens | 2766 | |
| 2011_rivierprik | 46054 | |
| 2012_leopoldkanaal | 690512 | |
| 2013_albertkanaal | 1136927 | @IPauwels |
| 2014_demer | 8298 | @IPauwels |
| 2015_dijle | 5050 | |
| 2015_homarus | 796 | |
| 2015_phd_verhelst_cod | 479797 | @PieterjanVerhelst |
| 2015_phd_verhelst_eel | 13282 | @PieterjanVerhelst |

library(etn)
library(dplyr)
con <- connect_to_etn()
# Fetch the detections in two batches of projects
det1 <- get_acoustic_detections(animal_project_code = c(
  "2010_PHD_REUBENS",
  "2011_RIVIERPRIK",
  "2012_LEOPOLDKANAAL",
  "2013_ALBERTKANAAL"
))
det2 <- get_acoustic_detections(animal_project_code = c(
  "2014_DEMER",
  "2015_DIJLE",
  "2015_HOMARUS",
  "2015_PHD_VERHELST_COD",
  "2015_PHD_VERHELST_EEL"
))
det <- bind_rows(det1, det2)
# Keep detections that share project, tag and timestamp with another row
dups <-
  det %>%
  group_by(animal_project_code, tag_serial_number, date_time) %>%
  filter(n() > 1)
# Count the duplicate rows per project
dups %>%
  group_by(animal_project_code) %>%
  count()

@PieterjanVerhelst
Collaborator

@peterdesmet could you create a csv with duplicates for the projects 2015_PHD_VERHELST_COD and 2015_PHD_VERHELST_EEL so I can look into these? I am currently abroad with limited internet access.

@peterdesmet
Member Author

@PieterjanVerhelst here you go: dups_pj.csv.zip

I notice I already reported this issue before for 2013_albertkanaal at #259 and for 2010_phd_reubens_sync at inbo/etn-occurrences#78, where @jreubens answered that it is difficult to detect these.

I do think we'll have to tackle this at some point, as users (like me) will bump into this again and again.

@IPauwels
Collaborator

I can check this for the Demer project. Could you create a csv with all duplicates for that project too, please?
I'll also check for Albertkanaal. If in both projects, as well as in Pieterjan's, we find that only the station names and (it seems) the ID_pk differ between certain duplicates (which are duplicates in datetime AND receiver AND transmitter), then I guess we can keep the ones with the station names? And I guess these duplicates were created internally, and not because we imported them twice ... Is that possible?

@peterdesmet changed the title from "Duplicate detections, except for station_name" to "Duplicate detections, except for deployment_id (sometimes station_name)" on Nov 26, 2022
@peterdesmet
Member Author

@IPauwels here are the files with duplicates:

The station name is sometimes the same. The constant differences are detection_id (so these are actually different rows, not something created by the query) and deployment_id (of the receiver). I don't know how these were created (@aubrivliz?).

A consistent way to identify them and decide which one to keep would be good, e.g. delete all duplicate detections associated with deployment_id x, y, z.
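
A rule of that form could look roughly like the sketch below. It runs on made-up data, the deployment ids in `redundant_deployments` are hypothetical placeholders for x, y, z, and the column names are assumed to match the detections table:

```r
library(dplyr)

# Toy detections (hypothetical values), not real ETN data
det <- data.frame(
  detection_id = 1:5,
  date_time = c(
    "2014-05-01 10:00:00", "2014-05-01 10:00:00",
    "2014-05-01 11:00:00", "2014-05-01 11:00:00",
    "2014-05-01 12:00:00"
  ),
  receiver_id = "VR2W-122322",
  tag_serial_number = "1111",
  deployment_id = c(10, 20, 10, 20, 20),
  stringsAsFactors = FALSE
)

# Hypothetical list of deployments whose duplicated detections are dropped
redundant_deployments <- c(20)

cleaned <-
  det %>%
  group_by(date_time, receiver_id, tag_serial_number) %>%
  # Drop a row only if it is duplicated AND belongs to a listed deployment
  filter(!(n() > 1 & deployment_id %in% redundant_deployments)) %>%
  ungroup()

cleaned
```

Detections from a listed deployment that have no duplicate (row 5 here) are kept, so the rule only removes the redundant copies.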

@peterdesmet changed the title from "Duplicate detections, except for deployment_id (sometimes station_name)" to "Duplicate detections" on Nov 26, 2022
@peterdesmet
Member Author

Correction: the deployment_id is sometimes the same for the duplicates. 🤷‍♂️
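
One way to see which columns actually differ within each set of duplicates is to count distinct values per column and per group. A minimal sketch on made-up data, assuming the column names of the detections table and dplyr 1.0 or later for `across()`:

```r
library(dplyr)

# Toy duplicates (hypothetical values), not real ETN data
dups <- data.frame(
  detection_id = 1:4,
  date_time = c(
    "2014-05-01 10:00:00", "2014-05-01 10:00:00",
    "2014-05-01 11:00:00", "2014-05-01 11:00:00"
  ),
  receiver_id = "VR2W-122322",
  tag_serial_number = "1111",
  station_name = c("de-12", "de-12", "de-12", NA),
  deployment_id = c(10, 20, 30, 30),
  stringsAsFactors = FALSE
)

# TRUE means the column takes more than one value within the duplicate group
# (a missing value counts as a value of its own)
differing <-
  dups %>%
  group_by(date_time, receiver_id, tag_serial_number) %>%
  summarise(across(everything(), ~ n_distinct(.x) > 1), .groups = "drop")

differing
```

In the toy data the first group differs in deployment_id only, and the second in station_name only, which mirrors the mixed pattern reported here.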

@PieterjanVerhelst
Collaborator

I checked the file dups_pj.csv and, apart from the deployment_id column, the records are identical. IMO we can remove one of the duplicates, but it is better to first check how these duplicates got into the database. If one of the records holds more info (e.g. station name), we should keep that record and remove the other.
@aubrivliz would it be possible to check when the two duplicates were uploaded? That way we can hopefully identify whether a dataset was uploaded at two different times.

@aubrivliz

@PieterjanVerhelst I've checked and there are 2 files uploaded for the same deployment.
[Image]

Once with "inbo_data_file" and once with "VR2W_122322_20141121_1.csv".

But there is no trace of them in the database.

@jreubens
Collaborator

@aubrivliz I think the exercise you did for @PieterjanVerhelst should be done for all of these duplicates. For instance, I have duplicates in 'Homarus' and 'PhD Reubens' ... it would be interesting to know which files they originally come from.
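
If the source file is available per detection, that exercise could be sketched for all duplicates at once, as below. The data are made up; the column name `source_file` is an assumption (the issue description only mentions a "file" column), and the two file names are the ones reported earlier in this thread:

```r
library(dplyr)

# Toy duplicates (hypothetical values), not real ETN data
dups <- data.frame(
  detection_id = 1:4,
  date_time = c(
    "2014-05-01 10:00:00", "2014-05-01 10:00:00",
    "2014-05-01 11:00:00", "2014-05-01 11:00:00"
  ),
  receiver_id = "VR2W-122322",
  tag_serial_number = "1111",
  source_file = c(
    "inbo_data_file", "VR2W_122322_20141121_1.csv",
    "inbo_data_file", "VR2W_122322_20141121_1.csv"
  ),
  stringsAsFactors = FALSE
)

# For each duplicate group, list the files it came from,
# then count how many groups share the same combination of files
file_combos <-
  dups %>%
  group_by(date_time, receiver_id, tag_serial_number) %>%
  summarise(
    files = paste(sort(unique(source_file)), collapse = " + "),
    .groups = "drop"
  ) %>%
  count(files, name = "n_duplicate_groups")

file_combos
```

A combination of files that accounts for many duplicate groups would point to a dataset that was uploaded twice.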

@PietrH
Member

PietrH commented Jun 20, 2023

@PieterjanVerhelst @jreubens Has this been solved? Is this a different issue from #283?

@PieterjanVerhelst
Collaborator

@PietrH this issue hasn't been solved, and if by #283 you mean duplicate records instead of unique tag ids, then this is indeed the same issue.
