Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_animals(): info of 2 tags collapsed into one column #287

Open
lottepohl opened this issue Oct 9, 2023 · 15 comments
Open

get_animals(): info of 2 tags collapsed into one column #287

lottepohl opened this issue Oct 9, 2023 · 15 comments
Assignees
Labels
bug Something isn't working enhancement New feature or request

Comments

@lottepohl
Copy link

Hi everyone,

In the table retrieved with etn::get_animals(), the information of two tags is collapsed in one row if the animal was equipped with a tag containing 2 sensors (resulting in 1 unique tag_serial_number with two different acoustic_tag_ids), see here. Then, a few columns (tag_serial_number, tag_type, tag_subtype, acoustic_tag_id and acoustic_tag_id_alternative) have double entries, separated with commas.
Would it be possible to only have the tag_serial_number in the output from get_animals()? Then, the collapsing could be avoided and the column contents, when e.g. animals are tagged both with one and two sensors, would be consistent.
The remaining tag details, if needed, could be joined from the output from get_tags().
Just an idea, what do you think?

All the best,
Lotte

@PietrH
Copy link
Member

PietrH commented Dec 4, 2023

I'd love to hear some opinions on this. I'm not one to stand in the way of progress. As it stands, I would expect you to be able to do this anyway. Except...

I played around with this, and at the moment this wouldn't actually work directly in cases where there is a comma in the tag_serial_number as per #288

I also noticed the error you get isn't super informative, as it doesn't actually tell you what tags weren't found.

I'll patch get_tags() so it'll also accept comma separated serial_tag_numbers and I'll also improve the error messaging if no match was found.

I'm aiming for something like this to result in the data.frame you are looking for:

get_animals()  %>%
    pull(tag_serial_number) %>%
    stringr::str_split(stringr::fixed(",")) %>%
    unlist() %>%
    get_tags(tag_serial_number = .)

@PietrH PietrH added bug Something isn't working enhancement New feature or request labels Dec 4, 2023
@PietrH PietrH self-assigned this Dec 4, 2023
@PieterjanVerhelst
Copy link
Collaborator

@lottepohl could you provide an example table of the output you get so I can better understand what you mean?

@lottepohl
Copy link
Author

@PieterjanVerhelst sure! Here you go.
So, the columns 'tag_serial_number' , 'tag_type', 'tag_subtype', and 'acoustic_tag_id' have information collapsed from both tags, separated by a comma within each cell. In the case of the starry smoothhounds the tags (Acoustic Data Storage Tags) have the same serial number but different acoustic tag ids.
I guess when doing traditional double tagging, the tags will also have different serial numbers.

So I will reformulate my suggestion:
Instead of only retrieving only the 'tag_serial_number' when running etn::get_animals(), make 3 to 4 (enough? depending on how many tags one individual can have at most) columns: 'tag_serial_number_1', 'tag_serial_number_2', .... and 'acoustic_tag_id_1', 'acoustic_tag_id_2' etc.

This approach is similar to the different length measurements that can be retrieved with etn::get_animals() (shown in the last section of the code below).

@PieterjanVerhelst do you understand the situation better now? It's nothing super urgent, just something that would be nice to have. Because indeed, like @PietrH said, this could be done with just a couple of lines of code.

Cheers,
Lotte

library(etn)
library(dplyr)
con <- etn::connect_to_etn()

smoothhounds <- etn::get_animals(scientific_name = "Mustelus asterias")

# many columns contain double information, e.g. 'tag_serial_number' has the same info twice
smoothhounds %>%
  dplyr::select(tag_serial_number, tag_type, tag_subtype, acoustic_tag_id) %>%
  head()
#> # A tibble: 6 × 4
#>   tag_serial_number tag_type                         tag_subtype acoustic_tag_id
#>   <chr>             <chr>                            <chr>       <chr>          
#> 1 1293314,1293314   acoustic-archival,acoustic-arch… animal,ani… A69-9006-3820,…
#> 2 1293321,1293321   acoustic-archival,acoustic-arch… animal,ani… A69-9006-3835,…
#> 3 1293315,1293315   acoustic-archival,acoustic-arch… animal,ani… A69-9006-3822,…
#> 4 1293316,1293316   acoustic-archival,acoustic-arch… animal,ani… A69-9006-3825,…
#> 5 1293322,1293322   acoustic-archival,acoustic-arch… animal,ani… A69-9006-3836,…
#> 6 1293317,1293317   acoustic-archival,acoustic-arch… animal,ani… A69-9006-3826,…

# acoustic_tag_id has two different values
smoothhounds %>%
  dplyr::select(acoustic_tag_id) %>%
  head()
#> # A tibble: 6 × 1
#>   acoustic_tag_id            
#>   <chr>                      
#> 1 A69-9006-3820,A69-9006-3821
#> 2 A69-9006-3835,A69-9006-3834
#> 3 A69-9006-3822,A69-9006-3823
#> 4 A69-9006-3825,A69-9006-3824
#> 5 A69-9006-3836,A69-9006-3837
#> 6 A69-9006-3826,A69-9006-3827

# there are different columns for different length measurements
smoothhounds %>%
  dplyr::select(length1:length4_unit) %>% 
  colnames()
#>  [1] "length1"      "length1_unit" "length2_type" "length2"      "length2_unit"
#>  [6] "length3_type" "length3"      "length3_unit" "length4_type" "length4"     
#> [11] "length4_unit"

@PieterjanVerhelst
Copy link
Collaborator

Aha indeed, now I understand. I am not a fan of having multiple values in one cell, separated by a separator (here a comma). This makes data processing like filtering hard and cumbersome. Why not add two records (or the number needed) instead of collapsing everything in one record, separated with commas?

@lottepohl
Copy link
Author

@PieterjanVerhelst how I understand that would be the same as I propose? To have a column for each unique acoustic tag id, and serial number etc.?
However, to make the solution fit all possible data in ETN I guess it's easiest to have these placeholder columns for the different serial numbers and tag ids.

Cheers!

@PieterjanVerhelst
Copy link
Collaborator

Not sure, but I would suggest the following format, with the example of two transmitters (hence two tag serial numbers) each having two acoustic tag IDs with a specific sensor value as an example:

tag_serial_number acoustic_tag_id sensor_value
1 500 10
1 501 25
2 503 12
2 504 22

@lottepohl
Copy link
Author

@PieterjanVerhelst Yes I agree! However, to my understanding, the table that you show is similar to the output you get from etn::get_tags() - it lists information about tags (see code below). Above I was referring to the etn::get_animals() function.

library(etn)
library(dplyr)
con <- etn::connect_to_etn()tags <- etn::get_tags(acoustic_tag_id = c("A69-9006-3820", "A69-9006-3821", "A69-9006-3835", "A69-9006-3834"))

tags %>% dplyr::select(tag_serial_number, sensor_type, acoustic_tag_id)
#> # A tibble: 4 × 3
#>   tag_serial_number sensor_type acoustic_tag_id
#>   <chr>             <chr>       <chr>          
#> 1 1293314           pressure    A69-9006-3821  
#> 2 1293314           temperature A69-9006-3820  
#> 3 1293321           pressure    A69-9006-3835  
#> 4 1293321           temperature A69-9006-3834

@PietrH @PieterjanVerhelst could the following work?
You can query tags with etn::get_tags() using the animal_id (at the moment you can only query using the tag_type/subtype, tag_serial_number, and acoustic_tag_id). And when querying animals with etn::get_animals() you don't get any tag information directly but you can join the dataframes from etn::get_animals() and etn::get_tags().
The premise for this approach would be, though, that only tags that are already associated with an animal_id are in the database (I don't know if that's the case or not).

To illustrate, the output from etn::get_animals() would look like:

animal_id acoustic_project_code scientific_name ...
3171 ADST-Shark Mustelus asterias ...
3172 ADST-Shark Mustelus asterias ...

The output from etn::get_tags() would look like:

tag_serial_number acoustic_tag_id sensor_type animal_id ...
1293314 A69-9006-3820 pressure 3171 ...
1293314 A69-9006-3821 temperature 3171 ...
1293321 A69-9006-3835 pressure 3172 ...
1293321 A69-9006-3834 temperature 3172 ...

If desired, these two tables could be joined (dplyr::left_join()) and pivoted (dployr::pivot_wider()) and then you would end up with one row per animal with all tag information.

@PietrH
Copy link
Member

PietrH commented Dec 7, 2023

@peterdesmet There is a proposal to change the output of get_animals() and/or get_tags() currently acoutistic_tag_id's are delimited: A69-9006-3820,A69-9006-3821

Do you happen to remember why you chose to do it this way back in the day? 6b81948

@PieterjanVerhelst
Copy link
Collaborator

@lottepohl indeed, your suggestion is also what seems more easy for data processing and manipulation 👍 .

@PietrH
Copy link
Member

PietrH commented Dec 7, 2023

@lottepohl

Thank you for all your input! I'm trying to visualise your proposed flow of functions.

You can query tags with etn::get_tags() using the animal_id. And when querying animals with etn::get_animals() you don't get any tag information directly but you can join the dataframes from etn::get_animals() and etn::get_tags().

these two tables could be joined (dplyr::left_join()) and pivoted (dployr::pivot_wider()) and then you would end up with one row per animal with all tag information

Could you write me some (pseudo-)code of how you'd like this to work: how would you connect the functions in your script, and what the intermediary tables would look like?

@peterdesmet
Copy link
Member

@PietrH The reason tag information was collapsed with comma is because of naming. With get_animals() you expect one row per animal, even though that function actually returned animal/tag deployments ...

The solution that @lottepohl proposes solves this and get_animals() will always return unique animals. It does somewhat move the problem to get_tags(), but that function already returned multiple sensors related to one tag in a tidy way. Returning the animal_id might create more rows, but likely not many.

Notes:

  • We should consider get_tag_deployments() as a name for the function
  • We should add a new parameter animal_id (vector allowed)
  • The function should also return undeployed tags, i.e. rows where animal_id is NULL (can be avoided by using the animal_id parameter
  • Take into account that in theory, an animal can get a tag (date 1), it being removed (date 2) and getting a new tag (date 3). Does that work with the left_join()?
  • Note that get_animals() returns more information that are actually events: the capture, release, recapture, etc. That is stored in separate columns for each animal (but converted to tidy events in Darwin Core). Should it continue to be returned as part of get_animals() (even though it is deployment info)?

@peterdesmet
Copy link
Member

I looked a bit more into it and want to change my suggestion.

Scope

  • I have access to 20766 animals through get_animals().
  • 1654 of these have comma-separated tag information
  • The tag_serial_number for all of those is just repeated (e.g. 1278974,1278974 or 1083102,1083102,1083102) except for one animal 23107 from FISHINTEL which has 2 serial numbers 22020176,22020224,22020176,22020224.

I propose the following:

1. Collapse repeated tag info in animals

If one of the tag_ columns in get_animals() has duplicated, comma-separated information, then collapse that information to a unique value. If all values are empty (,) the result is NA

library(etn)
library(dplyr)
con <- etn::connect_to_etn()
# Before
animals <- get_animals(tag_serial_number = c(1293314, 1293321))
animals %>% select(animal_id, animal_project_code, tag_serial_number, tag_type, tag_subtype, acoustic_tag_id, acoustic_tag_id_alternative, scientific_name)
#> # A tibble: 2 × 8
#>   animal_id animal_project_code tag_serial_number tag_type           tag_subtype
#>       <int> <chr>               <chr>             <chr>              <chr>      
#> 1      3171 ADST-Shark          1293314,1293314   acoustic-archival… animal,ani…
#> 2      3172 ADST-Shark          1293321,1293321   acoustic-archival… animal,ani…
#> # ℹ 3 more variables: acoustic_tag_id <chr>, acoustic_tag_id_alternative <chr>,
#> #   scientific_name <chr>

# After
#> # A tibble: 2 × 8
#>   animal_id animal_project_code tag_serial_number tag_type          tag_subtype
#>       <int> <chr>               <chr>             <chr>             <chr>      
#> 1      3171 ADST-Shark          1293314           acoustic-archival animal     
#> 2      3172 ADST-Shark          1293321           acoustic-archival animal     
#> # ℹ 3 more variables: acoustic_tag_id <chr>, acoustic_tag_id_alternative <chr>,
#> #   scientific_name <chr>

2. One step further: remove tag_ related columns from animals, except tag_serial_number

In addition to solution 1, remove tag-related columns from get_animals() except for tag_serial_number.

#> # A tibble: 2 × 4
#>   animal_id animal_project_code tag_serial_number scientific_name  
#>       <int> <chr>               <chr>             <chr>            
#> 1      3171 ADST-Shark          1293314           Mustelus asterias
#> 2      3172 ADST-Shark          1293321           Mustelus asterias

The 4 removed columns (tag_type, tag_subtype, acoustic_tag_id, acoustic_tag_id_alternative) can all be retrieved via get_tags()

This is a less breaking change than moving animal_id to get_tags() since:

  • get_animals() can retain the parameter tag_serial_number
  • All scripts that join animals and tags on tag_serial_number continue to work (including e.g. Darwin Core SQL)
  • No changes to get_tags() (and no need for new parameters)
  • What is potentially breaking is that users looked at the returned acoustic_tag_id in get_animals() and then used it to look up a tag in get_tags(). They would now have to do this via tag_serial_number.

3. Future work

In the future, I think we should work towards a get_animals() for unique animals (no tag info), get_tags() for unique tag information, get_sensors() for sensors associated with tags (which then have unique acoustic_tag_id) and a (cf. Movebank) get_reference_data() that combines all that information into one data frame.


But for now, solutions 1 and 2 are non-breaking and (hopefully) fairly easy to implement. They solve they annoying fact that tag information is comma-separated in cases where that is not necessary. @lottepohl @PieterjanVerhelst what do you think? Are you in favour of 1 or 1+2?

@PieterjanVerhelst
Copy link
Collaborator

@peterdesmet a good suggestion which I follow, but to be able to better judge it, could you provide a full table of a before and after example?

@peterdesmet
Copy link
Member

@PieterjanVerhelst here you go:

Current implementation (notice repeated identical values in tag_serial_number, tag_type, etc.):

animals
# A tibble: 80 × 66
   animal_id animal_project_code tag_serial_number tag_type                tag_subtype acoustic_tag_id acoustic_tag_id_alte…¹
       <int> <chr>               <chr>             <chr>                   <chr>       <chr>           <chr>                 
 1      3171 ADST-Shark          1293314,1293314   acoustic-archival,acouanimal,aniA69-9006-3820,… ,                     
 2      3172 ADST-Shark          1293321,1293321   acoustic-archival,acouanimal,aniA69-9006-3835,… ,                     
 3      3173 ADST-Shark          1293315,1293315   acoustic-archival,acouanimal,aniA69-9006-3822,… ,                     
 4      3174 ADST-Shark          1293316,1293316   acoustic-archival,acouanimal,aniA69-9006-3825,… ,                     
 5      3175 ADST-Shark          1293322,1293322   acoustic-archival,acouanimal,aniA69-9006-3836,… ,                     
 6      3176 ADST-Shark          1293317,1293317   acoustic-archival,acouanimal,aniA69-9006-3826,… ,                     
 7      3177 ADST-Shark          1293318,1293318   acoustic-archival,acouanimal,aniA69-9006-3828,… ,                     
 8      3178 ADST-Shark          1293319,1293319   acoustic-archival,acouanimal,aniA69-9006-3831,… ,                     
 9      3179 ADST-Shark          1293320,1293320   acoustic-archival,acouanimal,aniA69-9006-3832,… ,                     
10      3180 ADST-Shark          1293293,1293293   acoustic-archival,acouanimal,aniA69-9006-3778,… ,                     
# ℹ 70 more rows
# ℹ abbreviated name: ¹​acoustic_tag_id_alternative
# ℹ 59 more variables: scientific_name <chr>, common_name <chr>, aphia_id <int>, animal_label <chr>, animal_nickname <chr>,
#   tagger <chr>, capture_date_time <dttm>, capture_location <chr>, capture_latitude <dbl>, capture_longitude <dbl>,
#   capture_method <chr>, capture_depth <chr>, capture_temperature_change <chr>, release_date_time <dttm>,
#   release_location <chr>, release_latitude <dbl>, release_longitude <dbl>, recapture_date_time <dttm>, length1_type <chr>,
#   length1 <dbl>, length1_unit <chr>, length2_type <chr>, length2 <dbl>, length2_unit <chr>, length3_type <chr>, …
# ℹ Use `print(n = ...)` to see more rows

Solution 1 (don't repeat identical values, acoustic_tag_id still contains comma separated values that are different):

animals
# A tibble: 80 × 66
   animal_id animal_project_code tag_serial_number tag_type          tag_subtype acoustic_tag_id       acoustic_tag_id_alte…¹
       <int> <chr>               <chr>             <chr>             <chr>       <chr>                 <chr>                 
 1      3171 ADST-Shark          1293314           acoustic-archival animal      A69-9006-3820,A69-90NA                    
 2      3172 ADST-Shark          1293321           acoustic-archival animal      A69-9006-3835,A69-90NA                    
 3      3173 ADST-Shark          1293315           acoustic-archival animal      A69-9006-3822,A69-90NA                    
 4      3174 ADST-Shark          1293316           acoustic-archival animal      A69-9006-3825,A69-90NA                    
 5      3175 ADST-Shark          1293322           acoustic-archival animal      A69-9006-3836,A69-90NA                    
 6      3176 ADST-Shark          1293317           acoustic-archival animal      A69-9006-3826,A69-90NA                    
 7      3177 ADST-Shark          1293318           acoustic-archival animal      A69-9006-3828,A69-90NA                    
 8      3178 ADST-Shark          1293319           acoustic-archival animal      A69-9006-3831,A69-90NA                    
 9      3179 ADST-Shark          1293320           acoustic-archival animal      A69-9006-3832,A69-90NA                    
10      3180 ADST-Shark          1293293           acoustic-archival animal      A69-9006-3778,A69-90NA                    
# ℹ 70 more rows
# ℹ abbreviated name: ¹​acoustic_tag_id_alternative
# ℹ 59 more variables: scientific_name <chr>, common_name <chr>, aphia_id <int>, animal_label <chr>, animal_nickname <chr>,
#   tagger <chr>, capture_date_time <dttm>, capture_location <chr>, capture_latitude <dbl>, capture_longitude <dbl>,
#   capture_method <chr>, capture_depth <chr>, capture_temperature_change <chr>, release_date_time <dttm>,
#   release_location <chr>, release_latitude <dbl>, release_longitude <dbl>, recapture_date_time <dttm>, length1_type <chr>,
#   length1 <dbl>, length1_unit <chr>, length2_type <chr>, length2 <dbl>, length2_unit <chr>, length3_type <chr>, …
# ℹ Use `print(n = ...)` to see more rows

Solution 1+2 (don't repeat identical values, only retain tag_serial_number column):

animals
# A tibble: 80 × 62
   animal_id animal_project_code tag_serial_number scientific_name   common_name aphia_id animal_label animal_nickname tagger
       <int> <chr>               <chr>             <chr>             <chr>          <int> <chr>        <chr>           <chr> 
 1      3171 ADST-Shark          1293314           Mustelus asterias Starry smo105821 7901         NA              ,     
 2      3172 ADST-Shark          1293321           Mustelus asterias Starry smo105821 7912         NA              ,     
 3      3173 ADST-Shark          1293315           Mustelus asterias Starry smo105821 7913         NA              ,     
 4      3174 ADST-Shark          1293316           Mustelus asterias Starry smo105821 7902         NA              ,     
 5      3175 ADST-Shark          1293322           Mustelus asterias Starry smo105821 7903         NA              ,     
 6      3176 ADST-Shark          1293317           Mustelus asterias Starry smo105821 7904         NA              ,     
 7      3177 ADST-Shark          1293318           Mustelus asterias Starry smo105821 7905         NA              ,     
 8      3178 ADST-Shark          1293319           Mustelus asterias Starry smo105821 7906         NA              ,     
 9      3179 ADST-Shark          1293320           Mustelus asterias Starry smo105821 7907         NA              ,     
10      3180 ADST-Shark          1293293           Mustelus asterias Starry smo105821 7908         NA              ,     
# ℹ 70 more rows
# ℹ 53 more variables: capture_date_time <dttm>, capture_location <chr>, capture_latitude <dbl>, capture_longitude <dbl>,
#   capture_method <chr>, capture_depth <chr>, capture_temperature_change <chr>, release_date_time <dttm>,
#   release_location <chr>, release_latitude <dbl>, release_longitude <dbl>, recapture_date_time <dttm>, length1_type <chr>,
#   length1 <dbl>, length1_unit <chr>, length2_type <chr>, length2 <dbl>, length2_unit <chr>, length3_type <chr>,
#   length3 <dbl>, length3_unit <chr>, length4_type <chr>, length4 <dbl>, length4_unit <chr>, weight <dbl>,
#   weight_unit <chr>, age <dbl>, age_unit <chr>, sex <chr>, life_stage <chr>, wild_or_hatchery <chr>, stock <chr>, …
# ℹ Use `print(n = ...)` to see more rows

@PieterjanVerhelst
Copy link
Collaborator

I would go for option 1+2 as get_animals() provides data on the animals. Data on tags can be obtained through get_tags() and if necessary joined to animal meta-data based on the tag_serial_number. To me this seems like a logical and straightforward methodology. @lottepohl what do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants