
Weighted version of the KLL sketch? #157

Open
thvasilo opened this issue Jun 9, 2020 · 4 comments


thvasilo commented Jun 9, 2020

Hello,

The XGBoost project is considering using the KLL sketch to represent feature value histograms.

One potential blocker is the need for a weighted version of the sketch. This would allow us to use weighted data points and adjust their feature contributions accordingly (see Appendix A of the XGBoost paper).
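To illustrate why the weights are real-valued: in Appendix A of the XGBoost paper, the rank of a split candidate is the total weight (second-order gradient) of the points below it, divided by the total weight overall. A rough, illustrative snippet along these lines (not actual XGBoost or DataSketches code):

```java
public class WeightedRankExample {

  // Illustration only (not XGBoost or DataSketches code): the weighted rank
  // from Appendix A of the XGBoost paper. Each feature value x_i carries a
  // real-valued weight h_i (its second-order gradient), and the rank of a
  // candidate split z is the weight below z divided by the total weight.
  static double weightedRank(double[] values, double[] weights, double z) {
    double below = 0.0;
    double total = 0.0;
    for (int i = 0; i < values.length; i++) {
      total += weights[i];
      if (values[i] < z) {
        below += weights[i];
      }
    }
    return below / total;
  }

  public static void main(String[] args) {
    double[] x = {1.0, 2.0, 3.0};
    double[] h = {0.25, 0.5, 0.25};
    System.out.println(weightedRank(x, h, 2.5)); // prints 0.75
  }
}
```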

I remember discussing the possibility of using data weights with KLL in the past; is that still an option?


leerho commented Jun 9, 2020

The only option we have thought about would be restricted to positive integer weights (>= 1).

My interpretation of weights in a quantiles context is that an item with a weight of 2 is equivalent to updating the sketch with two identical items with a weight of 1. Is this your understanding?
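Concretely, with the current (unweighted) sketch, that interpretation amounts to repeating the update. A minimal Java sketch of the idea, assuming the KllFloatsSketch API with a single-item update(float) (the exact constructor and method names may vary across library versions):

```java
import org.apache.datasketches.kll.KllFloatsSketch;

public class IntegerWeightedKllExample {

  // An item with integer weight w >= 1 is treated as w identical items of
  // weight 1, i.e. the sketch is simply updated w times. Only practical for
  // small weights; a native weighted update would avoid the repeated calls.
  static void updateWeighted(KllFloatsSketch sketch, float item, long weight) {
    if (weight < 1) {
      throw new IllegalArgumentException("weight must be a positive integer");
    }
    for (long i = 0; i < weight; i++) {
      sketch.update(item);
    }
  }

  public static void main(String[] args) {
    KllFloatsSketch sketch = new KllFloatsSketch();
    updateWeighted(sketch, 1.0f, 3); // one item of weight 3
    updateWeighted(sketch, 5.0f, 1); // one item of weight 1
    System.out.println("estimated median: " + sketch.getQuantile(0.5));
  }
}
```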


thvasilo commented Jun 9, 2020

Yes, I think the outcome would be the same in that case. However, for this to be used in XGBoost it would require real-valued weights.

I found the paper I had in mind when talking to Zohar Karnin a couple of years ago; it includes a weighted extension of KLL (Section 4).

I'll ping @trivialfis in case he wants to chime in.

@DanielTing

@thvasilo: Is it possible to assume that all the weights in the data sum to at least 1 (or to some other known constant)? There are no assumptions here on individual weights or on weights being integral, but it does mean you need to know something about the overall scaling of the weights.

There are several possible implementations that incorporate weighting. If you can assume this, then you get the simplest (and perhaps most performant) implementation, but it's certainly possible to handle weights without this assumption as well.

@thvasilo

@DanielTing The XGBoost use case would be for batch training, so we can assume we know all the weights in advance and normalize them so they sum to 1.

I'm having a chat with the first author of the linked paper tomorrow; I'll update with any progress. I plan to work on this codebase, so I hope we can come up with something we can contribute back.
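For what it's worth, the normalization pass we have in mind is just a rescaling over the known batch of weights, independent of any sketch API (shown in plain Java, since the weighted update itself doesn't exist yet):

```java
public class WeightNormalizationExample {

  // Batch setting: all weights are known up front, so they can be rescaled
  // to sum to 1 before being handed to a (future) weighted sketch.
  static double[] normalizeWeights(double[] weights) {
    double total = 0.0;
    for (double w : weights) {
      total += w;
    }
    if (total <= 0.0) {
      throw new IllegalArgumentException("weights must sum to a positive value");
    }
    double[] normalized = new double[weights.length];
    for (int i = 0; i < weights.length; i++) {
      normalized[i] = weights[i] / total;
    }
    return normalized;
  }

  public static void main(String[] args) {
    double[] w = normalizeWeights(new double[] {2.0, 3.0, 5.0});
    System.out.println(w[0] + " " + w[1] + " " + w[2]); // 0.2 0.3 0.5
  }
}
```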
