Let geoms pass parameters to scales. ScaleContinuous methods to map values to colours #5031
ScaleContinuous maps values to palette colours as follows:

- unique values are found
- unique values are mapped to colours
- colours are matched to the original vector

If most values are unique, we can be faster by simply mapping all values to colours, without finding and matching unique values first. In some scenarios the geom can guess or know if that is going to be the case.

The goal of this commit is to let the geom tell the ScaleContinuous scale how the mapping from values to colours should be done. By default the existing "unique" approach is used. The geom may now specify `scale_params = list(fill=list(mapping_method = "raw"))` to tell the scale corresponding to the fill aesthetic to use a "raw" approach of mapping values to colours without finding unique values first.

Besides the default "unique" and the new "raw" mapping methods, we also allow the geom to ask for the "binned" approach, where the geom specifies a number of intervals to use and the mapping process is as follows:

- values are binned in N intervals
- intervals are mapped to colours

This approach is "lossy" (we have a maximum of N different colours), but it can be much faster and show almost no difference with respect to the other mapping methods.
An alternative is to limit the number of colours used to those that an observer can distinguish, and to switch automatically to binning when the data contain more distinct values than that. There is no point in using more hue or lightness values than can be perceived as different (this is the basis of JPEG).
Hi Sergio, I think having extra options to make colour mapping more efficient is a good thing. However, putting scale parameters under the control of geoms seems to me like it goes against the grammar of graphics. I think you're spot on with this point here:
Is there a good reason to implement this at the geom level of things?
You are right. However, the issue with implementing this suggestion is that the mapping from numbers to colours is not always related just to hue, or just to brightness; it may be a combination of an arbitrary number of gradients. So it's not that easy to tell in advance, without mapping them, what the threshold is at which two values become the same colour. And if you have already spent the time mapping all the values to colours, then there is not much more to optimize... :-)
To be honest, I thought about this when I was writing the pull request. I will have to change it so it becomes a scale argument. There is a scenario (not related to this PR) where it makes sense for the geom to set a scale parameter: A geom may prefer to render colours in a character format (like

I will rewrite this pull request whenever I have time (probably in a couple of weeks)
After further reading, most of what I suggest here can already be done with
This is the base of one of the pillars of:
Currently:
The user can freely combine geoms with scales, and use multiple geoms with the same scale. The independence between geoms and scales lets most geoms work with most scales and that is a great core part of ggplot2.
While this independence is great, there are some implementation details left to ggplot2 that leave room for improvement in terms of performance. In particular, as shown on #4989, a plot that can be built and rendered in <6 seconds may take ~45 seconds in ggplot2.
One of the major bottlenecks is the mapping of column values into aesthetic values, done by the `scale$map()` method. What can we do to make it faster?

Make the underlying implementation of mapping values to colours faster
The first point is mostly taken care of in other pull requests in the `scales` and `farver` packages and it covers things like:

Letting the scales know how the geom expects the aesthetic values to be
The second point needs some communication between geoms and scales. To give an example: when a geom maps data to a fill or colour aesthetic, the scale transforms column values into a character vector ("#ff0000",...). Some geoms do not use character colours, but rather native colours (for nativeRaster objects, in integer format), and they must do the format conversion when rendering (e.g. https://github.com/zeehio/ggmatrix/blob/98445bf28caaca1022c03a542b8b4541034566a2/R/geom_matrix_raster.R#L123). If the geom can tell the scale that it would rather have colours in native format, and if the scale can tell the same to the palette, the intermediate character representation of colours can be avoided, with significant performance benefits. This pull request defines a way for geoms to communicate with scales, but the example described in this paragraph will be tackled in a future pull request.
Changing how values are mapped to colours
The third point is how values are mapped to colours, and it is what this pull request is concerned with. The pull request focuses on `ScaleContinuous` because it is one of the most common scales, but similar adjustments could be applied to other scales if desired.

`ScaleContinuous` maps values to palette colours as follows (ggplot2/R/scale-.r, lines 608 to 610 in 63125db):

- unique values are found
- unique values are mapped to colours
- colours are matched to the original vector
When most values are unique, this mapping could be faster by simply mapping all values to colours, without finding and matching unique values first. In some cases the geom can guess or know whether that is going to be the case.
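To make the difference concrete, here is a minimal base R sketch of the "unique" and "raw" approaches. It is not the ggplot2 implementation: `pal`, `map_unique` and `map_raw` are hypothetical names, and the palette is a stand-in built with `grDevices::colorRamp()` that expects values already rescaled to [0, 1].

```r
# Hypothetical stand-in for a continuous palette: maps numbers in [0, 1]
# to hex colour strings. This is NOT ggplot2's palette code.
ramp <- grDevices::colorRamp(c("#132B43", "#56B1F7"))
pal <- function(x) {
  rgb_mat <- ramp(x)  # one row per value; columns are R, G, B in 0..255
  grDevices::rgb(rgb_mat[, 1], rgb_mat[, 2], rgb_mat[, 3], maxColorValue = 255)
}

# "unique": find unique values, map them, then match back to the input
map_unique <- function(x, pal) {
  uniq <- unique(x)
  pal(uniq)[match(x, uniq)]
}

# "raw": map every value directly, skipping unique() and match()
map_raw <- function(x, pal) pal(x)

x <- c(0, 0.5, 0.5, 1, 0)
identical(map_unique(x, pal), map_raw(x, pal))  # both give the same colours
```

Both functions return the same colours; they differ only in cost. "unique" calls the palette fewer times when values repeat, while "raw" avoids the `unique()` and `match()` passes, which are wasted work when almost no values repeat.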
This pull request establishes a way for geoms to communicate parameters to scales, and specifically uses those parameters to define three different `mapping_method`s. By default the current "unique" approach is used. The geom may specify "raw" or "binned" instead.

The geom defines a new method `scale_params=` that will typically be a list (or a function that takes the computed params and returns that list). The list is named with the aesthetics, and for each aesthetic it provides a list of options.

For instance, the geom may now specify `scale_params = list(fill=list(mapping_method = "raw"))` to tell the scale corresponding to the `fill` aesthetic to use the "raw" mapping method, that is, without finding unique values first. The "raw" method is usually faster than the current "unique" method when, for instance, the data consists of doubles without duplicate values.

Besides the default "unique" and the new "raw" mapping methods, we also allow the geom to ask for the "binned" mapping method, where the geom specifies a number of intervals to use, `scale_params = list(fill=list(mapping_method = "binned", mapping_method_bins = 256))`, and the mapping process is as follows:

- values are binned in N intervals
- intervals are mapped to colours

This approach is "lossy" (we have a maximum of N different colours), but it can be much faster and show almost no difference with respect to the other mapping methods.
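As an illustration of how a scale might consume those per-aesthetic options, here is a rough base R sketch. `map_values` and `pal` are hypothetical names, not part of this pull request; the input is assumed to be already rescaled to [0, 1], and representing each bin by the colour of its midpoint is an assumption, since the description above only says that intervals are mapped to colours.

```r
# Hypothetical stand-in palette: grey levels for values in [0, 1]
pal <- function(x) grDevices::grey(x)

# Hypothetical helper: maps x (already rescaled to [0, 1]) to colours,
# following the options a geom could pass for one aesthetic.
map_values <- function(x, pal, params = list()) {
  method <- if (is.null(params$mapping_method)) "unique" else params$mapping_method
  switch(method,
    unique = {
      uniq <- unique(x)
      pal(uniq)[match(x, uniq)]
    },
    raw = pal(x),
    binned = {
      n <- params$mapping_method_bins
      breaks <- seq(0, 1, length.out = n + 1)
      # Index of the interval each value falls into, always within 1..n
      bin <- findInterval(x, breaks, rightmost.closed = TRUE, all.inside = TRUE)
      # Assumption: each interval is coloured by its midpoint
      mids <- (breaks[-1] + breaks[-(n + 1)]) / 2
      pal(mids)[bin]
    },
    stop("unknown mapping_method: ", method)
  )
}

scale_params <- list(fill = list(mapping_method = "binned", mapping_method_bins = 256))
x <- runif(1e5)
cols <- map_values(x, pal, scale_params$fill)
length(unique(cols))  # never more than mapping_method_bins
```

In the "binned" branch the palette is called only N times regardless of the length of `x`, which is where the speed-up would come from.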
Questions/Discussion
Should `"mapping_method"` be just a scale argument?

Yes... with a "but maybe". Yes, that makes sense. If the `"mapping_method"` is a relevant argument for the scale, it could be one of the `scale_*_gradient(...)` arguments. However, it seems a rather "internal" argument, and it won't be easy for a regular user to see its effect. An alternative could be to sample the vector we want to map and, based on the density of unique values in the sample, choose either "unique" or "raw". However, by letting the geom hint the scale, we can let the scale use a more efficient default mapping method in some scenarios.
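The sampling alternative could look roughly like the sketch below. `choose_mapping_method`, `sample_size` and `threshold` are hypothetical names with illustrative, untuned defaults; nothing like this is implemented in the pull request.

```r
# Hypothetical heuristic: inspect a random sample and pick "raw" when most
# sampled values are distinct, "unique" otherwise.
# sample_size and threshold are illustrative values, not tuned.
choose_mapping_method <- function(x, sample_size = 1000, threshold = 0.5) {
  if (length(x) == 0) return("unique")
  s <- x[sample.int(length(x), min(length(x), sample_size))]
  frac_unique <- length(unique(s)) / length(s)
  if (frac_unique > threshold) "raw" else "unique"
}

choose_mapping_method(rep(1:5, 1000))                 # few distinct values
choose_mapping_method(seq(0, 1, length.out = 10000))  # all values distinct
```

One caveat with this kind of heuristic: a small sample tends to look more unique than the full vector, because repeated values may simply not be drawn twice, so the threshold would need care.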