Improve performance (for big datasets) #673
Comments
Wow, yes, we have really been trying to think about the bottlenecks. We have known that the processing was very slow, so this is really helpful! @michaelquinn32 thoughts?
Hey @elinw, yes I saw that in the original code.
Well, to create a new branch you would fork the project and create the branch in your repo.
That's awesome Henry. Thanks! We would love a pull request. I would be happy to review your code. As @elinw pointed out, create a branch with your change and we can go from there.
I'm surprised by that. Thanks!
Hey @michaelquinn32, I had a chance to look at this properly rather than just re-writing the function that was going slow. First off I should clarify that it is not `dplyr::across` that is slow; amazingly, it is the call to `as.data.frame()`. Anyway, I have made an initial PR that simply changes `as.data.frame()` to `as_tibble()`.
Let's look at the timing for both grouped and ungrouped data with the `as_tibble()` change. I'm wondering what we were thinking when we used `as.data.frame()` there -- was there some kind of edge case? It's right in the middle of a bunch of tidyverse code. Should we be coercing to a tibble after we check that the input to skim is a data frame or coercible to a data frame?
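To make the comparison concrete, here is a minimal benchmark sketch (not from the thread; the column counts and loop counts are illustrative) timing the two coercions on a wide data frame:

```r
# Sketch: compare repeated coercion cost of as.data.frame() vs as_tibble()
# on a wide data frame. Sizes here are illustrative, not from the thread.
set.seed(1)
wide <- as.data.frame(replicate(200, rnorm(1000)))  # 1000 rows x 200 cols

t_df <- system.time(for (i in 1:50) as.data.frame(wide))["elapsed"]
cat("as.data.frame:", t_df, "s\n")

if (requireNamespace("tibble", quietly = TRUE)) {
  t_tbl <- system.time(for (i in 1:50) tibble::as_tibble(wide))["elapsed"]
  cat("as_tibble:    ", t_tbl, "s\n")
}
```

Running something like this on both grouped and ungrouped inputs would show whether the coercion itself, rather than the summarizing, dominates.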
Even when we merge that change, let's keep this issue open for the general topic of making skimr faster for different situations.
Also, let's consider performance without histograms as the base, since we know they have a big impact.
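One way to get that baseline is skimr's exported `skim_without_charts()`, which produces the same summaries minus the inline histograms/sparklines. A small sketch (synthetic data, sizes illustrative):

```r
# Sketch: time skim() with and without inline histograms on synthetic data.
set.seed(2)
dat <- as.data.frame(replicate(50, rnorm(10000)))  # 10,000 rows x 50 cols

if (requireNamespace("skimr", quietly = TRUE)) {
  print(system.time(skimr::skim(dat)))                 # with charts
  print(system.time(skimr::skim_without_charts(dat)))  # charts skipped
}
```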
Hi, I love skimr and use it a lot. Are there any updates on the performance issue? My datasets are 50-300+ variables and 8-60 million observations, and skimr takes a long time to do summaries. Thank you!
Thanks for using skimr! We do support summaries of data.table objects. Have you tried that? Is it faster? Otherwise, scaling up to that size is a big challenge for the design of skimr, and I'm not exactly sure what we could do. We've both been very stretched for time recently, so a big change (like supporting parallel backends, etc.) might be a ways away still.
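Not suggested in the thread, but a common stopgap for data this tall is to skim a random sample of rows; counts and means stay representative, while quantile-type summaries become approximate. A sketch with stand-in data:

```r
# Sketch: skim a row sample instead of the full table. The data frame here
# is a small stand-in for a multi-million-row dataset.
set.seed(3)
dat <- data.frame(x = rnorm(500000), y = runif(500000))

n_sample <- 100000
idx <- sample(nrow(dat), min(n_sample, nrow(dat)))
sub <- dat[idx, , drop = FALSE]

if (requireNamespace("skimr", quietly = TRUE)) {
  print(skimr::skim(sub))  # fast, approximate summaries
}
```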
First of all, thanks for making such a useful package!
However, I have been trying to `skim` a dataset with lots of columns (thousands) and noticed very poor performance. The full dataset with >20,000 columns ran overnight without completing, so performance appears to scale much worse than linearly with the number of columns. I found similar issues being discussed in #370 and elsewhere.
I had a look to see where the bottleneck was, and the `skim_by_type` methods and the `build_results` function they call are both very slow when there are lots of columns. This looks to me like an issue with `dplyr::across` when running `summarize` with lots of column / function pairs. Notwithstanding that, I refactored `skim_by_type` to improve performance by a factor of 25 for a 100,000 x 1,500 dataset. The larger dataset I am working with (which previously did not complete overnight) runs in ~1 minute. I may be missing something here, so apologies in advance if so.
I am happy to make a branch to demonstrate this properly / open for improvement, but for now see below for a reproducible example showing the relative performance of the refactored `skim_by_type` function, which should be able to replace all 3 of the current methods: