
Implement predict_proba as per #138 #211

Open

wants to merge 16 commits into base: development
Conversation

@Mec-iS (Collaborator) commented Oct 31, 2022

See #138

The last test, fit_predict_probabilities, is failing; I don't know whether it's because of the implementation or because I'm missing something in replicating the test in sklearn:

from sklearn.ensemble import RandomForestClassifier

# x and y are the same inputs used in the fit_predict_probabilities test
clf = RandomForestClassifier(criterion="gini")
clf.fit(x, y)
print(clf.predict_proba(x))

[[0.99 0.01]
 [0.82 0.18]
 [0.97 0.03]
 [0.8  0.2 ]
 [0.99 0.01]
 [0.9  0.1 ]
 [0.99 0.01]
 [0.91 0.09]
 [0.23 0.77]
 [0.4  0.6 ]
 [0.   1.  ]
 [0.   1.  ]
 [0.   1.  ]
 [0.   1.  ]
 [0.   1.  ]
 [0.   1.  ]
 [0.01 0.99]
 [0.02 0.98]
 [0.   1.  ]
 [0.01 0.99]]


@Mec-iS Mec-iS changed the title Implent predict_proba as per #138 Implement predict_proba as per #138 Oct 31, 2022
@dlrobson

Hey guys! My team and I have just started using this library for our project. This functionality would be awesome for what we're doing. Is there any update on the progress of this?

@morenol (Collaborator) commented Jan 27, 2023

> Hey guys! My team and I have just started using this library for our project. This functionality would be awesome for what we're doing. Is there any update on the progress of this?

Hey, I don't think there have been any more updates; none of the contributors have had time to finish it.

@Mec-iS (Collaborator, Author) commented Jan 27, 2023

> Hey guys! My team and I have just started using this library for our project. This functionality would be awesome for what we're doing. Is there any update on the progress of this?

It would be nice to have somebody with an understanding of this feature; please help if you know how to implement it. It seems there are some differences between the results returned by smartcore and the ones returned by sklearn.

@lars-frogner

Hi! Thanks for providing this nice library; we are finding the random forest implementations really useful.

I am very interested in getting predicted class probabilities from the random forest classifier, so I have been looking into this issue.

As far as I can tell, the way sklearn does it is to record the per-class sample counts in each node when fitting a decision tree. The class probabilities returned by DecisionTreeClassifier.predict_proba are then the per-class sample counts of the predicted node divided by the total sample count of that node. The predict_proba method of RandomForestClassifier then calls predict_proba for each tree and averages the resulting probabilities.
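For illustration, here is a minimal Rust sketch of that scheme. The names (NodeCounts, forest_predict_proba) are hypothetical and not smartcore's actual API; the point is just the counting and averaging logic:

// Hypothetical sketch of sklearn's counting-and-averaging scheme; not smartcore's API.

/// Per-class sample counts recorded in a tree node during fitting.
struct NodeCounts {
    // class_counts[c] = number of training samples of class c that reached this node
    class_counts: Vec<usize>,
}

impl NodeCounts {
    /// Class probabilities for a sample that lands in this node:
    /// each class count divided by the total count in the node.
    fn proba(&self) -> Vec<f64> {
        let total = self.class_counts.iter().sum::<usize>() as f64;
        self.class_counts.iter().map(|&c| c as f64 / total).collect()
    }
}

/// Forest-level predict_proba: average the per-tree probability vectors.
fn forest_predict_proba(per_tree_probas: &[Vec<f64>]) -> Vec<f64> {
    let n_trees = per_tree_probas.len() as f64;
    let mut avg = vec![0.0; per_tree_probas[0].len()];
    for tree_proba in per_tree_probas {
        for (a, p) in avg.iter_mut().zip(tree_proba) {
            *a += p / n_trees;
        }
    }
    avg
}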

I have implemented this in a separate smartcore fork, and it gives results that are quite close to the RandomForestClassifier::predict_proba implementation in this branch. The latter gives a bit more coarse-grained results since it uses the predicted classes rather than the underlying probabilities.

Since the results are pretty similar to before, they still deviate by up to 10% from sklearn's class probabilities for the input used in the fit_predict_probabilities test. But there are other inputs that give the exact same probabilities as sklearn. So I suspect the differences are not due to the predict_proba implementation, but rather a result of different splitting policies when building a decision tree.

Here is a small test case showing a difference in splitting between smartcore and sklearn:

This sklearn test passes:

from sklearn.tree import DecisionTreeClassifier

X = [
    [1., 1., 0.],
    [1., 1., 0.],
    [1., 1., 1.],
    [1., 0., 0.],
    [1., 0., 1.],
]

y = [1, 1, 0, 0, 1]

dt = DecisionTreeClassifier()
dt.fit(X, y)

assert dt.tree_.node_count == 7

The corresponding smartcore test fails, since the fit results in a tree with only a single node.

// Assuming the development-branch module layout:
use smartcore::linalg::basic::matrix::DenseMatrix;
use smartcore::tree::decision_tree_classifier::{
    DecisionTreeClassifier, DecisionTreeClassifierParameters,
};

let x = DenseMatrix::from_2d_array(&[
    &[1., 1., 0.],
    &[1., 1., 0.],
    &[1., 1., 1.],
    &[1., 0., 0.],
    &[1., 0., 1.],
])
.unwrap();

let y = vec![1, 1, 0, 0, 1];

// We use the same defaults as sklearn
let classifier =
    DecisionTreeClassifier::fit(&x, &y, DecisionTreeClassifierParameters::default())
        .unwrap();

assert_eq!(classifier.nodes().len(), 7);

This might be old news, but I think it shows that we can't expect to get the same probabilities as sklearn, at least not without first replicating their exact splitting policy (which seems more complicated).

Since we have now tried two different predict_proba implementations that give similar results, and both agree with sklearn for inputs where the splitting is more straightforward, it seems safe to me to proceed with merging one of them. The implementation in my fork has the advantage of providing predict_proba for DecisionTreeClassifier in addition to RandomForestClassifier, and the probabilities are probably a bit more precise. The disadvantage is having to store num_features * num_classes counts in every decision tree. An option could be to add a keep_counts parameter to DecisionTreeClassifierParameters and let predict_proba fail if the counts were not kept, similarly to keep_samples for RandomForestClassifier; a rough sketch of that idea follows below.
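To make the keep_counts idea a bit more concrete, here is a rough sketch. All names below (ProbaParams, FittedTree, etc.) are hypothetical and not part of smartcore today; the flag would gate whether per-node class counts are stored during fit, and predict_proba would return an error when they were not kept:

// Hypothetical sketch of the keep_counts idea; not smartcore's current API.

/// Illustrative parameter gating whether per-node class counts are stored during fit.
pub struct ProbaParams {
    /// When true, per-node class counts are kept so predict_proba is available later.
    pub keep_counts: bool,
}

/// Illustrative fitted tree holding the optional counts.
pub struct FittedTree {
    // Some(counts) only if keep_counts was true at fit time;
    // counts[node][class] = training samples of `class` that reached `node`.
    class_counts: Option<Vec<Vec<usize>>>,
}

impl FittedTree {
    /// Class probabilities for the node a sample was predicted into,
    /// or an error if counts were not kept during fit.
    pub fn predict_proba(&self, node_idx: usize) -> Result<Vec<f64>, String> {
        let counts = self
            .class_counts
            .as_ref()
            .ok_or_else(|| "fit was run with keep_counts = false".to_string())?;
        let node = &counts[node_idx];
        let total = node.iter().sum::<usize>() as f64;
        Ok(node.iter().map(|&c| c as f64 / total).collect())
    }
}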

@lars-frogner

Hi! It would be really nice to move forwards on this. Do you have any thoughts, @Mec-iS, @montanalow, @morenol?
