Weighted version of the KLL sketch? #157
The only option we have thought about would be restricted to positive integer weights (>= 1). My interpretation of weights in a quantiles context is that an item with a weight of 2 is equivalent to updating the sketch with two identical items, each with a weight of 1. Is this your understanding?
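The integer-weight semantics described above can be illustrated with a naive, exact stand-in for a sketch (this is a hypothetical reference model, not the actual KLL code): updating with `(item, w)` behaves exactly like `w` unit-weight updates of the same item.

```python
import bisect

class NaiveQuantiles:
    """Exact quantiles over a sorted list; a stand-in for a real sketch."""
    def __init__(self):
        self._items = []

    def update(self, item, weight=1):
        # Integer weight w >= 1 behaves like w identical unit-weight updates.
        assert isinstance(weight, int) and weight >= 1
        for _ in range(weight):
            bisect.insort(self._items, item)

    def get_quantile(self, rank):
        # rank in [0, 1]; returns the item at that normalized rank.
        idx = min(int(rank * len(self._items)), len(self._items) - 1)
        return self._items[idx]

# Weight-2 update vs. two unit-weight updates give the same quantiles.
a, b = NaiveQuantiles(), NaiveQuantiles()
a.update(5.0, weight=2)
a.update(1.0)
b.update(5.0); b.update(5.0); b.update(1.0)
assert a.get_quantile(0.5) == b.get_quantile(0.5)
```

A real weighted KLL would of course not expand weights into repeated updates; this only pins down the intended semantics.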
Yes, I think the outcome would be the same in this case. However, for this to be used in XGBoost it would require real-valued weights. I found the paper I had in mind when talking to Zohar Karnin a couple of years ago, which includes a weighted extension of KLL (Section 4). I'll ping @trivialfis in case he wants to chime in.
@thvasilo: Is it possible to assume that all the weights in the data sum to at least 1 (or some other known constant)? There are no assumptions on individual weights or on weights being integral here, but it does mean you need to know something about the overall scaling of the weights. There are several possible implementations that incorporate weighting. If you can make this assumption, you get the simplest (and perhaps most performant) implementation, but it is certainly possible to handle weights without it as well.
@DanielTing The XGBoost use case would be for batch training, so we can assume we know all weights in advance and can normalize them to sum to 1. I'm having a chat with the first author of the linked paper tomorrow; I'll update with any progress. I plan to work on this codebase, so I hope we can come up with something we can contribute back.
Hello,
XGBoost is considering using the KLL sketch to represent feature value histograms.
One potential blocker is the need for a weighted version of the sketch: this would allow us to use weighted data points and adjust their feature contributions accordingly (see Appendix A of the XGBoost paper).
I remember discussing the possibility of using data weights with KLL in the past; is that still an option?
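For context on why XGBoost needs weights here: Appendix A of the XGBoost paper defines a weighted rank function over feature values, with each point weighted by its second-order gradient (hessian). A minimal sketch of that rank function, with hypothetical names and toy data:

```python
def weighted_rank(points, z):
    """points: list of (feature_value, hessian_weight) pairs.
    Returns the normalized weight of points with feature value below z,
    i.e. the weighted rank of candidate split z."""
    total = sum(h for _, h in points)
    below = sum(h for x, h in points if x < z)
    return below / total

# Toy (value, hessian) pairs; total weight is 6.0.
pts = [(1.0, 1.0), (2.0, 3.0), (4.0, 2.0)]
weighted_rank(pts, 2.0)  # weight below 2.0 is 1.0 of 6.0
```

Candidate split points are then chosen so that consecutive candidates differ in weighted rank by at most a target epsilon, which is why the quantile sketch itself must support weights.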