-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Can provide a normal geom_* function which can inherit from the ggplot2 data? #2
Comments
I don't understand what you mean. Could you provide some sample code of how would you expect this to work? |
Thank you for your response. The current geom_matrix_raster requires a manually provided matrix and cannot inherit data directly from the ggplot function. Suppose I already have a long-formatted data frame, data <- matrix(rnorm(100L), nrow = 10L)
data <- reshape2::melt(data, varnames = c("y", "x"))
library(ggplot2)
library(ggmatrix)
ggplot(data) +
geom_matrix_raster(aes(x, y, value)) # this will fail |
Based on my testing, I’ve also tested ggalign (which internally uses A minor example of how ggalign works here: library(ggalign)
set.seed(123)
small_mat <- matrix(rnorm(81), nrow = 9)
rownames(small_mat) <- paste0("row", seq_len(nrow(small_mat)))
colnames(small_mat) <- paste0("column", seq_len(ncol(small_mat)))
# initialize the heatmap layout, we can regard it as a normal ggplot object
ggheatmap(small_mat) +
# we can directly modify geoms, scales and other ggplot2 components
scale_fill_viridis_c() +
# add annotation in the top
hmanno("top") +
# in the top annotation, we add a dendrogram, and split observations into 3 groups
align_dendro(aes(color = branch), k = 3) +
# in the dendrogram we add a point geom
geom_point(aes(color = branch, y = y)) +
# change color mapping for the dendrogram
scale_color_brewer(palette = "Dark2") |
I think it is worth explaining why this pkg works with a matrix before making further changes. I originally avoided the need for melt() for performance reasons: A matrix is just a long vector with a dimesion attribute. This package takes a matrix as input and places a copy on a data frame. Just the values, no x and no y, in column-wise order. Because of how R works internally, that copy doesn't actually happen, since the matrix data is never modified and R uses copy-on-write. All the code considered how to minimize memory allocations and moves/copies, because that's where ggplot wastes time. When you use melt, you create a data frame with as many rows as the number of matrix elements and three columns. This is at least two allocations as large as your matrix ($x, $y) + a copy of the values. ggplot goes through all that data several times (checking plot limits, looking for missing values...). This adds significant overhead. If you want my overhead analysis, that includes many pull requests (some already merged) and probably too much detail, you can check: The first post already shows a big difference between using geom_raster() vs the combination of annotation_raster() with converting the matrix to a nativeRaster object (checkout Maybe you can adapt the code from the cutting corners strategy explained in the first post on that issue to replace your slow geom_raster calls in your package with a faster solution. Your benchmark is using a 1000*1000 matrix. A double takes 8 bytes, so that matrix takes 8MB. However your benchmark shows you are allocating more than 2GB. That's a 250x overhead. Data temporarily copied, read and written so many times by your package, by ggplot, and by some of its dependencies that the plot takes several seconds to happen. |
Using your package and ggplot:
Any function using a matrix as input can benefit from knowing it's shape and knowing that the values are given in column-major order, so for instance if I know I have to draw the values in a rectangle shape and I know where the first value is drawn and the size of each value I can easily find where any value of the matrix will be drawn
At this point you have lost the rectangular shape constraint. It is risky/complicated for any "general purpose function that accepts a data frame" to assume that the rows of the given data frame come from a melted matrix. The data frame is treated as a set of individual observations and each observation's position in the plot will have to be individually calculated. ggplot can't assume the data frame has nm observations corresponding to a rectangle shape nor that those observations are sorted in a specific order* During the ggplot_build phase:
My estimation based on my performance benchmarking on the ggplot issue I linked before is that at least 30% of the plotting time is spent here
This happens when applying the scales. I have several pull requests in both ggplot and farver packages to reduce the amount of conversions, but they have been pending for review for two years. I also don't need that anymore so finding time myself to respond to the code review wouldn't be that easy. During plot rendering phase:
This conversion does not assume that all rows and columns are present, or that they come from a rectangle shaped matrix. Basically we have to undo the melt without knowing that the data came from a melt operation.
I wish we could skip the character colour and go straight to native colour format SummaryI can't take a data frame as input and preserve performance because taking the data frame implies going through all of ggplot positional scale mapping. The positional scale mapping is useful for instance to set plot limits correctly. I assume in my package you want to draw the whole matrix and then I skip all position calculations in ggplot. If you wanted to plot a part of the matrix I would expect you to give me only that subset of the matrix as input because I can't use ggplot to do the filtering and keep performance as is. In your package you may be able to make assumptions about the data frame that you pass to ggplot, since your package is also responsible for melting the matrix, I believe. You can take a copy of my .R file, put it in your package and modify it so it fits your needs and benefits from your data frame assumptions |
Thank you for your detailed explanation. I’ll take a look at your |
No description provided.
The text was updated successfully, but these errors were encountered: