R crashes on large get_acoustic_detections query #278
Todo:
We'll have a look at possible solutions to speed up the detections view: indexing, perhaps implementing pagination.
@PietrH, I'm doing a meta-analysis on silver eel migration by combining the data of various studies. Downloading all the data at once not only takes a lot of time, it also gave me a 'time out' error in the past, hence I had to download the data in 2 or 3 subsets. I am running it right now just to try (it has been running since 10.47h this morning). If you want to try yourself, here is the code:
Update: download just crashed, giving the same error as above.
Hi,
Hello Claudia, thank you for including the error message, very helpful! :-) What was the exact R command you were running? Currently the package seems to struggle with queries that result in very large R objects; I'm not sure yet if this is a database or a package problem. In the meantime you might be able to work around the issue by splitting up your query, but I'll know for sure if you can let me know what exactly you were doing.
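To make that workaround concrete, here is a minimal sketch of splitting one big request into one call per project and binding the pieces. The function and argument names are the package's own; the two project codes and the purrr-based loop are just illustration, not the package's recommended approach:

library(etn)

my_con <- connect_to_etn()
# example subset of the project codes discussed in this thread
projects <- c("2015_phd_verhelst_eel", "2012_leopoldkanaal")

# one modest query per project instead of one giant one,
# row-binding the per-project results into a single table
detections <- purrr::map_dfr(
  projects,
  ~ get_acoustic_detections(
      con = my_con,
      animal_project_code = .x,
      scientific_name = "Anguilla anguilla"
    )
)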
@bart-v Could the following error be due to the server/ODBC drivers running out of memory?
The query is not complete, so who knows...
This is the query that was run:
SELECT
det.id_pk AS detection_id,
det.datetime AS date_time,
tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
det.transmitter AS acoustic_tag_id,
animal_project_code AS animal_project_code, -- exclusive to detections_limited
animal_id_pk AS animal_id, -- exclusive to detections_limited
animal_scientific_name AS scientific_name, -- exclusive to detections_limited
network_project_code AS acoustic_project_code, -- exclusive to detections_limited
det.receiver AS receiver_id,
deployment_station_name AS station_name, -- exclusive to detections_limited
deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
det.sensor_value AS sensor_value,
det.sensor_unit AS sensor_unit,
det.sensor2_value AS sensor2_value,
det.sensor2_unit AS sensor2_unit,
det.signal_to_noise_ratio AS signal_to_noise_ratio,
det.file AS source_file,
det.qc_flag AS qc_flag,
det.deployment_fk AS deployment_id
-- det.transmitter_name
-- det.transmitter_serial: via tag_device instead
-- det.station_name: deployment.station_name instead
-- det.latitude: deployment.deploy_lat instead
-- det.longitude: deployment.deploy_long instead
-- det.detection_file_id
-- det.receiver_serial
-- det.gain
-- external_id
FROM acoustic.detections_limited AS det
WHERE
'True'
AND 'True'
AND 'True'
AND LOWER(animal_project_code) IN ('2015_phd_verhelst_eel', '2012_leopoldkanaal', 'ptn-silver-eel-mondego', 'semp', 'dak', 'emmn', 'noordzeekanaal', '2011_warnow', '2017_fremur', '2011_loire', '2019_grotenete', 'esgl', '2013_albertkanaal', '2004_gudena', '2014_frome', '2014_nene', 'life4fish')
AND animal_scientific_name IN ('Anguilla anguilla')
AND 'True'
AND 'True'
AND 'True'
LIMIT ALL
I can replicate Claudia's issue with the following SQL query:
SELECT
det.id_pk AS detection_id,
det.datetime AS date_time,
tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
det.transmitter AS acoustic_tag_id,
animal_project_code AS animal_project_code, -- exclusive to detections_limited
animal_id_pk AS animal_id, -- exclusive to detections_limited
animal_scientific_name AS scientific_name, -- exclusive to detections_limited
network_project_code AS acoustic_project_code, -- exclusive to detections_limited
det.receiver AS receiver_id,
deployment_station_name AS station_name, -- exclusive to detections_limited
deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
det.sensor_value AS sensor_value,
det.sensor_unit AS sensor_unit,
det.sensor2_value AS sensor2_value,
det.sensor2_unit AS sensor2_unit,
det.signal_to_noise_ratio AS signal_to_noise_ratio,
det.file AS source_file,
det.qc_flag AS qc_flag,
det.deployment_fk AS deployment_id
-- det.transmitter_name
-- det.transmitter_serial: via tag_device instead
-- det.station_name: deployment.station_name instead
-- det.latitude: deployment.deploy_lat instead
-- det.longitude: deployment.deploy_long instead
-- det.detection_file_id
-- det.receiver_serial
-- det.gain
-- external_id
FROM acoustic.detections_limited AS det
WHERE
'True'
AND 'True'
AND 'True'
AND LOWER(animal_project_code) IN ('2013_albertkanaal')
AND 'True'
AND 'True'
AND 'True'
AND 'True'
LIMIT ALL
Executed directly via
16M records.
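As an aside, a cheap way to gauge a result size like this before pulling anything into R is to ask the database for a count with the same WHERE clause. A sketch; the connection helper is the package's own, the rest is illustration:

library(DBI)
library(etn)

my_con <- connect_to_etn()
# count the rows the big query would return, without transferring them
n <- DBI::dbGetQuery(my_con, "
  SELECT count(*) AS n
  FROM acoustic.detections_limited
  WHERE LOWER(animal_project_code) IN ('2013_albertkanaal')
")$n
n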
Wouldn't pagination require sorting the query result first, for every frame retrieved? In the past, decisions were made to keep all sorting on the client side. Is there a better way to implement pagination? I'm still wondering if anything has changed, since the query also fails on single larger datasets such as:
SELECT
det.id_pk AS detection_id,
det.datetime AS date_time,
tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
det.transmitter AS acoustic_tag_id,
animal_project_code AS animal_project_code, -- exclusive to detections_limited
animal_id_pk AS animal_id, -- exclusive to detections_limited
animal_scientific_name AS scientific_name, -- exclusive to detections_limited
network_project_code AS acoustic_project_code, -- exclusive to detections_limited
det.receiver AS receiver_id,
deployment_station_name AS station_name, -- exclusive to detections_limited
deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
det.sensor_value AS sensor_value,
det.sensor_unit AS sensor_unit,
det.sensor2_value AS sensor2_value,
det.sensor2_unit AS sensor2_unit,
det.signal_to_noise_ratio AS signal_to_noise_ratio,
det.file AS source_file,
det.qc_flag AS qc_flag,
det.deployment_fk AS deployment_id
-- det.transmitter_name
-- det.transmitter_serial: via tag_device instead
-- det.station_name: deployment.station_name instead
-- det.latitude: deployment.deploy_lat instead
-- det.longitude: deployment.deploy_long instead
-- det.detection_file_id
-- det.receiver_serial
-- det.gain
-- external_id
FROM acoustic.detections_limited AS det
WHERE
LOWER(animal_project_code) IN ('2013_albertkanaal')
I believe this query should result in a 6.4M-row table.
I believe pagination is the right answer.
I believe there are significant trade-offs: it gets more inefficient as the query result gets bigger, and it's inconsistent due to the random order in which rows currently get returned, so we might get duplication and omission in the results.
I'm thinking of using cursors or keyset pagination instead of OFFSET/LIMIT to get around the inconsistency problem and to reduce server load.
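For illustration, a minimal sketch of keyset pagination on id_pk, assuming id_pk is unique and increasing; the table name comes from the queries above, while the page size and the two selected columns are arbitrary choices for the sketch:

library(DBI)
library(etn)

my_con <- connect_to_etn()
page_size <- 10000
last_id <- 0
pages <- list()

repeat {
  # each round trip asks only for the rows after the last key we saw
  page <- DBI::dbGetQuery(my_con, glue::glue_sql("
    SELECT det.id_pk AS detection_id, det.datetime AS date_time
    FROM acoustic.detections_limited AS det
    WHERE det.id_pk > {last_id}
    ORDER BY det.id_pk
    LIMIT {page_size}
  ", .con = my_con))
  if (nrow(page) == 0) break            # no rows left: we are done
  pages[[length(pages) + 1]] <- page
  last_id <- max(page$detection_id)     # key of the last row seen so far
}
detections <- do.call(rbind, pages)

Because each page is anchored on the key of the previous page rather than on an OFFSET, rows that shift position between requests can't be duplicated or skipped, and the database never has to skip over already-returned rows.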
I've implemented paging on the following query:
SELECT
det.id_pk AS detection_id,
det.datetime AS date_time,
tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
det.transmitter AS acoustic_tag_id,
animal_project_code AS animal_project_code, -- exclusive to detections_limited
animal_id_pk AS animal_id, -- exclusive to detections_limited
animal_scientific_name AS scientific_name, -- exclusive to detections_limited
network_project_code AS acoustic_project_code, -- exclusive to detections_limited
det.receiver AS receiver_id,
deployment_station_name AS station_name, -- exclusive to detections_limited
deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
det.sensor_value AS sensor_value,
det.sensor_unit AS sensor_unit,
det.sensor2_value AS sensor2_value,
det.sensor2_unit AS sensor2_unit,
det.signal_to_noise_ratio AS signal_to_noise_ratio,
det.file AS source_file,
det.qc_flag AS qc_flag,
det.deployment_fk AS deployment_id
-- det.transmitter_name
-- det.transmitter_serial: via tag_device instead
-- det.station_name: deployment.station_name instead
-- det.latitude: deployment.deploy_lat instead
-- det.longitude: deployment.deploy_long instead
-- det.detection_file_id
-- det.receiver_serial
-- det.gain
-- external_id
FROM acoustic.detections_limited AS det
WHERE
LOWER(network_project_code) IN ('demer', 'dijle')
This is lifted from one of the old unit tests, so I know it used to work. Currently this query fails without paging, and takes 29.8 minutes with paging. I used keyset pagination on id_pk, assuming it would be indexed, but even then it shouldn't be as slow as it is. Half an hour is unacceptably slow for a query like this.
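One way to check that indexing assumption, assuming a Postgres backend (which the LIMIT ALL syntax suggests), is to ask the planner for its plan and look at how the id_pk ordering is served:

library(DBI)
library(etn)

my_con <- connect_to_etn()
# an 'Index Scan' in the plan means the id_pk ordering is backed by an index;
# a 'Seq Scan' plus an explicit Sort means every keyset page scans the table
# (the plan of the expanded view may be complex, but that is the line to look for)
plan <- DBI::dbGetQuery(my_con, "
  EXPLAIN
  SELECT det.id_pk
  FROM acoustic.detections_limited AS det
  ORDER BY det.id_pk
  LIMIT 1000
")
print(plan)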
OK, we'll need some time to investigate and implement.
We made some changes to the underlying view; can you verify there is a major performance improvement?
I'm getting this error:
On this query, for a single animal project code:
SELECT
det.id_pk AS detection_id,
det.datetime AS date_time,
tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
det.transmitter AS acoustic_tag_id,
animal_project_code AS animal_project_code, -- exclusive to detections_limited
animal_id_pk AS animal_id, -- exclusive to detections_limited
animal_scientific_name AS scientific_name, -- exclusive to detections_limited
network_project_code AS acoustic_project_code, -- exclusive to detections_limited
det.receiver AS receiver_id,
deployment_station_name AS station_name, -- exclusive to detections_limited
deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
det.sensor_value AS sensor_value,
det.sensor_unit AS sensor_unit,
det.sensor2_value AS sensor2_value,
det.sensor2_unit AS sensor2_unit,
det.signal_to_noise_ratio AS signal_to_noise_ratio,
det.file AS source_file,
det.qc_flag AS qc_flag,
det.deployment_fk AS deployment_id
-- det.transmitter_name
-- det.transmitter_serial: via tag_device instead
-- det.station_name: deployment.station_name instead
-- det.latitude: deployment.deploy_lat instead
-- det.longitude: deployment.deploy_long instead
-- det.detection_file_id
-- det.receiver_serial
-- det.gain
-- external_id
FROM acoustic.detections_limited AS det
WHERE
LOWER(animal_project_code) IN ('2013_albertkanaal')
To reproduce in R:
library(etn)
my_con <- connect_to_etn(username = "pieter.huybrechts@inbo.be",
password = askpass::askpass())
query <-
glue::glue_sql(.con = my_con,
"
SELECT
det.id_pk AS detection_id,
det.datetime AS date_time,
tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
det.transmitter AS acoustic_tag_id,
animal_project_code AS animal_project_code, -- exclusive to detections_limited
animal_id_pk AS animal_id, -- exclusive to detections_limited
animal_scientific_name AS scientific_name, -- exclusive to detections_limited
network_project_code AS acoustic_project_code, -- exclusive to detections_limited
det.receiver AS receiver_id,
deployment_station_name AS station_name, -- exclusive to detections_limited
deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
det.sensor_value AS sensor_value,
det.sensor_unit AS sensor_unit,
det.sensor2_value AS sensor2_value,
det.sensor2_unit AS sensor2_unit,
det.signal_to_noise_ratio AS signal_to_noise_ratio,
det.file AS source_file,
det.qc_flag AS qc_flag,
det.deployment_fk AS deployment_id
-- det.transmitter_name
-- det.transmitter_serial: via tag_device instead
-- det.station_name: deployment.station_name instead
-- det.latitude: deployment.deploy_lat instead
-- det.longitude: deployment.deploy_long instead
-- det.detection_file_id
-- det.receiver_serial
-- det.gain
-- external_id
FROM acoustic.detections_limited AS det
WHERE
LOWER(animal_project_code) IN ('2013_albertkanaal')
")
# fetch the full result in one go
detections <- DBI::dbGetQuery(my_con, query)
# or: send the query, then fetch and clear the result in separate steps
result <- DBI::dbSendQuery(my_con, query, immediate = FALSE)
DBI::dbFetch(result)
DBI::dbClearResult(result)
I will continue testing.
Just FYI: I am running RStudio with the code I posted earlier above. Still running since 8.45h...
I'm still crashing RStudio Server via get_acoustic_detections().
As far as I can see now, the query itself is done quite fast, but it is the readout that takes a long time and uses a lot of resources. Proposed solution: indeed, paging will prevent blowing up your memory, and that of the database.
I eventually ran into an error...
I'm able to get '2013_albertkanaal' to work via the built-in paging in DBI; it's still slow, and I can't keep the object in memory without RStudio crashing. But we can get around that by writing to a temporary file and reading it again when done. However, this still used a lot of memory, several GB, so I'm not sure how well it'll scale to even bigger queries. I'll experiment a bit further and publish a branch so my work isn't lost.
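For reference, a minimal sketch of that approach: chunked fetching with DBI::dbSendQuery()/DBI::dbFetch() plus a temporary CSV, so the full result never sits in R's memory at once. The chunk size and the readr-based file handling are my own choices here, not the package's code; query is the SQL string built in the reproduction example above:

library(DBI)
library(etn)

my_con <- connect_to_etn()
result <- DBI::dbSendQuery(my_con, query)
tmp <- tempfile(fileext = ".csv")

first_chunk <- TRUE
while (!DBI::dbHasCompleted(result)) {
  chunk <- DBI::dbFetch(result, n = 1000)               # 1000 rows per round trip
  readr::write_csv(chunk, tmp, append = !first_chunk)   # header only on the first chunk
  first_chunk <- FALSE
}
DBI::dbClearResult(result)

# read the spooled result back in as a single data frame
detections <- readr::read_csv(tmp, show_col_types = FALSE)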
I've split up sending the query to the database from fetching the result; the result is also fetched paginated. This speeds up queries like the ones above. I suppose we could split up the animal_project_codes and query the database once per animal_project_code, then combine the results, but I'm not sure this is such a great solution.
I haven't got a good solution for you yet @PieterjanVerhelst, but this is getting closer:
# using the parsing branch of etn
animal_projects <- c(
"2015_phd_verhelst_eel",
"2012_leopoldkanaal",
"PTN-Silver-eel-Mondego",
"SEMP",
"DAK",
"EMMN",
"Noordzeekanaal",
"2011_Warnow",
"2017_Fremur",
"2011_Loire",
"2019_Grotenete",
"ESGL",
"2013_albertkanaal",
"2004_Gudena",
"2014_Frome",
"2014_Nene",
"life4fish"
)
my_con <- connect_to_etn()
huge_result <-
purrr::map(animal_projects,
~ get_acoustic_detections(animal_project_code = .x,
con = my_con,
scientific_name = "Anguilla anguilla",
progress = TRUE), # progress for individual queries
.progress = FALSE) # progress for the whole thing
temp_file <- tempfile()
# write the result to a single csv, and read it in again for col parsing
purrr::walk(huge_result, ~readr::write_csv(.x, temp_file, append = TRUE))
huge_result_parsed <- readr::read_csv(temp_file) # probably need to add col_names, handle parsing issues
This is however rather rough, and will not give you a nicely sorted table with the right column names like get_acoustic_detections() would. With the experimental code in the parsing branch, we can get results for every individual animal_project_code. Instead of what I did above, it's probably better to write each project out to its own file, including a header, and then to combine those files, for example with vroom.
@PietrH thanks for the info! Indeed, in the past I was also not able to extract the dataset in one go. I did it in three parts, with each part calling data from a few animal_project_codes.
@PieterjanVerhelst I completely agree that we need to look ahead to systems that can handle much more data than we currently have. With the weekend behind me, I had a second look and have gotten your query down to just under 5 minutes. This is how I do it:
# open connection
my_con <- connect_to_etn()
# list projects to query
animal_projects <- c(
"2015_phd_verhelst_eel",
"2012_leopoldkanaal",
"PTN-Silver-eel-Mondego",
"SEMP",
"DAK",
"EMMN",
"Noordzeekanaal",
"2011_Warnow",
"2017_Fremur",
"2011_Loire",
"2019_Grotenete",
"ESGL",
"2013_albertkanaal",
"2004_Gudena",
"2014_Frome",
"2014_Nene",
"life4fish"
)
# create temp folder to write to
dir_to_write_to <- tempdir()
# query the projects one by one, save them as csv files into temp folder
for(current_animal_project in animal_projects){
message(paste("working on", current_animal_project))
data.table::fwrite(
get_acoustic_detections(animal_project_code = current_animal_project,
con = my_con,
scientific_name = "Anguilla anguilla",
progress = TRUE),
file.path(dir_to_write_to, paste0(current_animal_project, ".csv"))
)
}
# read the csv files from the temp folder into a single data.frame
anguilla_detections <-
vroom::vroom(fs::dir_ls(path = dir_to_write_to, glob="*csv"),
show_col_types = FALSE)
# inspect the results
dplyr::glimpse(anguilla_detections)
This is what it looks like in the console (the video is just under 5 minutes): etn-paging.mp4
My first thought on this is: what are the next steps in the data processing you are doing?
When I personally check the data, I always start from the raw data and perform various aggregations, creating numerous summaries to understand what is in the data. It is only after that step that one can identify which aggregation she or he needs. Therefore, I still believe we need access to the raw data before performing aggregations.
@Stijn-VLIZ I reran a test. In the paging branch, we are not reading the whole result at once but 1000 lines at a time, writing those out to a temp file, and then, as a last step, reading that temp file into memory. I've never seen any failures on reading this temp file; the failures always occur on the database itself running out of memory. EDIT: Let's discuss this, I've sent you an email to look for a timeslot to meet.
@PieterjanVerhelst More RAM was added to the server (months ago); does your query now work?
@PietrH the data downloads now. Update: also the following code returns data 👏!
I'm closing this for now as the performance issue seems to be fixed. We can reopen if it turns up again.
I seem to have run into a bug where etn crashes the whole R session when calling get_acoustic_detections() on an animal_project_code that will result in many detections. This has been replicated by @peterdesmet; from what I understand, get_acoustic_detections() used to work on this animal_project_code.
Running this code results in:
I welcome any ideas or speculation on the cause of this bug.
=============================
My session info: