forked from EduardoArle/big_data_biogeography
bon_species_list.Rmd
---
title: "A species list from a higher rank taxon name"
output: html_document
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8, eval = FALSE,
echo=TRUE, warning=FALSE, message=FALSE,
tidy = TRUE, collapse = TRUE,
results = 'hold')
```
The taxize package lets you retrieve the list of daughter taxa for a higher-rank taxon name. You have to decide which taxonomic backbone to rely on. In this example we use the GBIF backbone, since we are working with GBIF data, but other backbones may suit other applications better. Options include ITIS, NCBI, WoRMS and BOLD.
# Setup
```{r}
library(rgbif)
library(taxize)
library(tidyverse)
```
# Obtain species list
For groups expected to contain more than 1000 species, see the section on large taxa below.
```{r}
# get the GBIF ID first, as done in the exercise from day 1
ID <- taxize::get_gbifid_("Peperomia", method = "backbone") %>%
  bind_rows() %>% # get_gbifid_ returns a list, convert to data.frame
  filter(matchtype == "EXACT" & status == "ACCEPTED") # keep exact, accepted matches only

# get all species within this taxon
# the downstream function will find the species for you
splist <- downstream(sci_id = ID$usagekey,
                     method = "lookup",
                     db = "gbif",
                     downto = "species",
                     intermediate = FALSE,
                     rows = NA,
                     limit = 1000)[[1]]

# You may want to remove certain species from the list again
splist <- splist %>%
  filter(!name %in% c("Amblyomma albopictum", "Dermacentor abaensis")) # replace with the names of the species to remove

# write to disk
write_csv(splist, "species_list.csv")
```
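The filter on `matchtype` and `status` can return zero rows (no exact, accepted match) or, potentially, more than one, and `downstream()` would then receive no key or several. The following is a minimal sketch of a guard against that. `pick_single_key()` is a hypothetical helper, not part of taxize, and `mock_matches` is a made-up table standing in for the `bind_rows()` output above; only the column names `usagekey`, `matchtype` and `status` are taken from the code above, and the key values are invented.

```{r}
# Hypothetical helper (not part of taxize): return the usage key of the
# single exact, accepted match, or stop with an informative message.
pick_single_key <- function(matches) {
  keep <- which(matches$matchtype == "EXACT" & matches$status == "ACCEPTED")
  hits <- matches[keep, , drop = FALSE]
  if (nrow(hits) == 0) {
    stop("no exact, accepted match found")
  }
  if (nrow(hits) > 1) {
    stop("more than one exact, accepted match; choose one by hand: ",
         paste(hits$usagekey, collapse = ", "))
  }
  hits$usagekey
}

# Made-up stand-in for the data frame returned by bind_rows() above
mock_matches <- data.frame(
  usagekey = c(101L, 102L, 103L),
  matchtype = c("EXACT", "FUZZY", "EXACT"),
  status = c("ACCEPTED", "ACCEPTED", "SYNONYM"),
  stringsAsFactors = FALSE
)

key <- pick_single_key(mock_matches)
```

With real data, the same helper could be applied to the unfiltered lookup table, and the resulting key passed to `downstream()` in place of `ID$usagekey`.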
# For large taxa with > 1000 species expected
For GBIF, `downstream` returns at most 1000 names per query. For larger groups you need to issue several queries and move the `start` argument (the offset of the first record to return) forward each time.
```{r}
# get the GBIF ID first, as done in the exercise from day 1
ID <- taxize::get_gbifid_("Peperomia", method = "backbone") %>%
  bind_rows() %>% # get_gbifid_ returns a list, convert to data.frame
  filter(matchtype == "EXACT" & status == "ACCEPTED") # keep exact, accepted matches only

# get all species within this taxon
# gbif_downstream is queried page by page, 1000 names at a time
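expected_species_number <- 1900 # rough upper bound, adjust for your taxon
page_size <- 1000 # maximum number of names per query
splist <- data.frame()
start <- 0 # offset of the first record, assumed to be zero-based
while (start < expected_species_number) {
  print(start)
  sub <- gbif_downstream(id = ID$usagekey,
                         downto = "species",
                         intermediate = FALSE,
                         start = start,
                         rows = NA,
                         limit = page_size) # page size stays fixed, only the offset moves
  splist <- bind_rows(splist, sub)
  start <- start + page_size
}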

# count the number of entries returned
nrow(splist)

# some names are duplicated, potentially because they refer to different subspecies.
# If you want, you can remove the duplicated names
splist <- splist %>%
  distinct(name, .keep_all = TRUE)

# count the number of unique species names returned
nrow(splist)

# write to disk
write_csv(splist, "species_list.csv")
```
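The loop above relies on guessing `expected_species_number` in advance. An alternative is to keep requesting pages until one comes back with fewer rows than the page size. The sketch below illustrates only that stopping rule. `fetch_page()` is a hypothetical stand-in that serves made-up names from memory so the logic can be run offline; in practice its body would be replaced by the `gbif_downstream()` call with the same `start` and `limit` values. It assumes `start` is a zero-based offset, and how `gbif_downstream()` behaves for an offset beyond the last record is not checked here.

```{r}
# Made-up data standing in for the remote backbone (2350 fake names)
all_names <- data.frame(
  name = paste("Genus sp", seq_len(2350)),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for one paged query: skips `start` rows
# (zero-based offset) and returns at most `limit` rows.
fetch_page <- function(start, limit) {
  if (start >= nrow(all_names)) {
    return(all_names[0, , drop = FALSE])
  }
  idx <- seq(start + 1, min(start + limit, nrow(all_names)))
  all_names[idx, , drop = FALSE]
}

page_size <- 1000
start <- 0
pages <- list()

repeat {
  page <- fetch_page(start = start, limit = page_size)
  pages[[length(pages) + 1]] <- page
  # a short (or empty) page means there is nothing left to fetch
  if (nrow(page) < page_size) break
  start <- start + page_size
}

# combine all pages into one data frame
splist_all <- do.call(rbind, pages)
nrow(splist_all)
```

When the total happens to be an exact multiple of `page_size`, this approach costs one extra query that returns an empty page, which seems a fair price for not having to estimate the species count beforehand.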