Replies: 6 comments
-
Hi! Yes, you are right! Those two parameters control the size of the model. The size of a gradient boosted decision tree model is determined by the number of nodes across all of its trees, so you can control the model size by limiting the number of nodes in each tree and the total number of trees trained. The most straightforward way to control the model size is with the parameters that limit the nodes per tree and the number of trees.
If model size is a really big concern, you can also train linear models, which will be smaller. I'm going to keep this issue open until we add documentation to our website explaining this! Also, does your dataset contain text columns?
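For a rough sense of how those limits translate into bytes, here is a minimal sketch (not Tangram's code; the numbers and the per-node size are made up) that estimates the in-memory size of a boosted tree model:

```rust
// Back-of-envelope estimate: model size ~ number of trees * nodes per tree * bytes per node.
fn estimate_model_size_bytes(num_trees: usize, nodes_per_tree: usize, node_size_bytes: usize) -> usize {
    num_trees * nodes_per_tree * node_size_bytes
}

fn main() {
    // Hypothetical numbers: 100 rounds (one tree per round for a binary classifier),
    // 63 nodes per tree, 32 bytes per node.
    let bytes = estimate_model_size_bytes(100, 63, 32);
    println!("approximate size: {} bytes (~{} KiB)", bytes, bytes / 1024);
}
```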
-
Thanks for the very detailed answer; it helps clarify the effect of the hyperparameters! It would be great to have this information added to the docs. The dataset I'm looking at contains 30 float columns and a binary enum column as the target.
-
Great! I'll make sure to add it to the docs :) The reason I asked about text columns is that, by default, we create a large number of features from them, which can greatly increase model size. We are adding support to customize that now.
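To illustrate why text columns can inflate the feature count, here is a small sketch (not Tangram's actual feature engineering) in which every distinct unigram and bigram from a text column becomes its own feature:

```rust
use std::collections::HashSet;

// Count the distinct unigram and bigram features produced by a text column.
fn ngram_feature_count(texts: &[&str]) -> usize {
    let mut features = HashSet::new();
    for text in texts {
        let words: Vec<&str> = text.split_whitespace().collect();
        for word in &words {
            features.insert(word.to_string()); // one feature per distinct word
        }
        for pair in words.windows(2) {
            features.insert(format!("{} {}", pair[0], pair[1])); // one feature per distinct bigram
        }
    }
    features.len()
}

fn main() {
    let texts = ["the quick brown fox", "the quick red fox", "a lazy dog"];
    // Even a handful of short strings already produce many more features than
    // a single float column would (which contributes just one).
    println!("distinct text features: {}", ngram_feature_count(&texts));
}
```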
-
It seems like a tree is essentially a `Vec<Node>`, so the binary classifier should have a size of approximately the total number of nodes across all trees times the size of a `Node`. Patch:

```diff
diff --git a/crates/tree/lib.rs b/crates/tree/lib.rs
index fe030f8..1691bbe 100644
--- a/crates/tree/lib.rs
+++ b/crates/tree/lib.rs
@@ -124,6 +124,11 @@ pub struct Tree {
 	pub nodes: Vec<Node>,
 }
 
+#[test]
+fn node_size() {
+	assert_eq!(std::mem::size_of::<Node>(), 0);
+}
+
 impl Tree {
 	/// Make a prediction.
 	pub fn predict(&self, example: &[tangram_table::TableValue]) -> f32 {
```
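For context, asserting that `size_of::<Node>()` equals 0 is just a quick way to make `cargo test` print the real size in the failure message. A self-contained version of the same trick, using a hypothetical node struct rather than Tangram's actual `Node`, looks like this:

```rust
// A hypothetical node layout, not Tangram's actual `Node` type.
struct ExampleNode {
    split_feature: u32,
    split_value: f32,
    left_child: u32,
    right_child: u32,
    leaf_value: f32,
}

#[test]
fn example_node_size() {
    // Deliberately fails; the failure message reports the actual size
    // (20 bytes for this layout), e.g. `left: 20, right: 0`.
    assert_eq!(std::mem::size_of::<ExampleNode>(), 0);
}
```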
-
So, the total number of nodes in any given tree is bounded by the hyperparameters that limit tree growth. Note that the serialized size of the model is not exactly the same as its in-memory size.
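For reference, here is a hedged sketch of the standard binary-tree bounds behind that statement (generic hyperparameter names, not necessarily Tangram's): a tree limited to depth d has at most 2^(d + 1) - 1 nodes, and a tree in which every split has two children and which has L leaves has exactly 2L - 1 nodes.

```rust
// Upper bound on nodes for a binary tree limited by depth: 2^(d + 1) - 1.
fn max_nodes_by_depth(max_depth: u32) -> u64 {
    2u64.pow(max_depth + 1) - 1
}

// Node count for a full binary tree with a given number of leaves: 2L - 1.
fn nodes_from_leaves(leaf_count: u64) -> u64 {
    2 * leaf_count - 1
}

fn main() {
    println!("depth 6   -> at most {} nodes", max_nodes_by_depth(6));
    println!("32 leaves -> {} nodes", nodes_from_leaves(32));
}
```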
-
So, since Tangram seems to be using a binary serialization format, I would expect the serialized size to be similar to the in-memory size (maybe minus the padding, and plus the data for the report). I was just trying to estimate what model sizes I should expect, so the exact sizes are not necessary. Thank you!
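As a rough way to check that intuition, here is a sketch using the `bincode` crate as a stand-in (Tangram's actual serialization format may differ), reusing the hypothetical node layout from the earlier sketch to compare in-memory size with binary-serialized size:

```rust
use serde::Serialize;

// A hypothetical node layout, used only to compare sizes.
#[derive(Serialize)]
struct ExampleNode {
    split_feature: u32,
    split_value: f32,
    left_child: u32,
    right_child: u32,
    leaf_value: f32,
}

fn main() {
    let node = ExampleNode {
        split_feature: 3,
        split_value: 0.5,
        left_child: 1,
        right_child: 2,
        leaf_value: -0.1,
    };
    // bincode 1.x writes fixed-width little-endian fields, so the encoded size
    // is close to the in-memory size (no padding, no field names).
    let encoded = bincode::serialize(&node).unwrap();
    println!(
        "in-memory: {} bytes, serialized: {} bytes",
        std::mem::size_of::<ExampleNode>(),
        encoded.len()
    );
}
```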
-
Which hyperparameters are the most important ones for minimizing the size of a Gradient Boosted Tree model? From my experiments so far, it seems like `min_examples_per_node` and `max_rounds` have the biggest effect.