Hello everyone. I have several questions about this library. I have read the documentation and searched the support forums, but I have not been able to resolve them.

1. Why do you support RandomForestClassifier but not DecisionTreeClassifier, given that RandomForestClassifier is built on DecisionTreeClassifier? (There is no ulterior motive behind this question; it is pure curiosity.)
2. In Supported Algorithms, in the "On CPU" section, there is a subsection called "Other tasks", and I can see that the method train_test_split appears there with the note "Only dense data is supported". What does this mean?
3. In the same "Other tasks" subsection, the method GridSearchCV appears. Does this mean it is not supported? For example, I enabled verbose output with the logging library as described in the documentation (sketch below), and I know it works because I can see the log message that the train_test_split method produces.
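This is roughly how I enable the verbose output (a minimal sketch following what the documentation shows; the 'sklearnex' logger name is the one from the docs):

```python
import logging

# Enable sklearnex's verbose mode: each accelerated call then logs whether it
# ran on the optimized backend or fell back to stock scikit-learn.
logging.getLogger('sklearnex').setLevel(logging.INFO)
```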
However, for the following code:

```python
from sklearn.compose import ColumnTransformer
from sklearn.metrics import (accuracy_score, f1_score, make_scorer,
                             precision_score, recall_score)
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (MinMaxScaler, OneHotEncoder, RobustScaler,
                                   TargetEncoder)
from sklearn.tree import DecisionTreeClassifier

scoring = {'accuracy': make_scorer(accuracy_score),
           'precision': make_scorer(precision_score, average='macro'),
           'recall': make_scorer(recall_score, average='macro'),
           'f1': make_scorer(f1_score, average='macro')}

# Define preprocessing for numerical variables.
numeric_transformers = [RobustScaler(copy=False), MinMaxScaler(copy=False), 'passthrough']
numeric_transformers_robust = [RobustScaler(copy=False)]

# Define preprocessing for categorical variables.
categorical_transformers = [OneHotEncoder(dtype='int8'),
                            TargetEncoder(target_type='binary', random_state=1998)]

# Create the preprocessor; the actual transformer applied to each column is
# selected through the parameter grid below.
preprocessor = ColumnTransformer(
    transformers=[
        ('feature1', 'passthrough', ['feature1']),
        ('feature2', 'passthrough', ['feature2']),
        ('feature3', 'passthrough', ['feature3']),
        ('feature4', 'passthrough', ['feature4']),
        ('feature5', 'passthrough', ['feature5']),
        ('feature6', 'passthrough', ['feature6']),
        ('feature7', 'passthrough', ['feature7']),
        ('feature8', 'passthrough', ['feature8'])])

# Create the pipeline that combines the preprocessor with the classifier.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', DecisionTreeClassifier())])

# Define the parameter grid for the grid search.
param_grid = {
    'preprocessor__feature1': numeric_transformers_robust,
    'preprocessor__feature2': numeric_transformers_robust,
    'preprocessor__feature3': numeric_transformers,
    'preprocessor__feature4': numeric_transformers,
    'preprocessor__feature5': numeric_transformers,
    'preprocessor__feature6': numeric_transformers_robust,
    'preprocessor__feature7': numeric_transformers_robust,
    'preprocessor__feature8': categorical_transformers,
    'classifier__criterion': ['entropy'],
    'classifier__splitter': ['best'],
    'classifier__max_depth': [49, 52, 57, 58, 59, 61, 62, 63, 64, 65, 100],
    'classifier__min_samples_split': [3, 6, 7, 8, 9, 11],
    'classifier__min_samples_leaf': [1, 6, 7, 8, 9, 11],
    # 'classifier__max_features': ['auto', 'sqrt', 'log2', None],
    'classifier__random_state': [1998]
}

# Create the GridSearchCV object with KFold cross-validation.
grid_search_kfold = GridSearchCV(clf, param_grid, cv=KFold(n_splits=10),
                                 scoring=scoring,
                                 n_jobs=-1, verbose=1,
                                 refit='f1')

# Fit GridSearchCV to the data.
grid_search_kfold.fit(X_train, y_train)
```

When I run it, nothing from the library appears in the logs; the only output is GridSearchCV's own verbose messages, and only when n_jobs != 1. I am using Jupyter Notebook, and I know that n_jobs is ignored by this library and that d4p.daalinit(n_threads) has to be used instead; I saw this in #1164.
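For reference, a minimal sketch of the thread-count workaround from #1164 (the thread count 8 is just an example value):

```python
import daal4py as d4p

# Pin the number of threads used internally by oneDAL; GridSearchCV's n_jobs
# only controls scikit-learn's own joblib parallelism, not the library's threads.
d4p.daalinit(8)
```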
4. The System requirements and supported configurations section, in the "For CPU" subsection, says that all x86 processors are supported as long as they have the SSE2, SSE4.2, AVX2, or AVX-512 instruction sets. Does this mean that an AMD processor with these instruction sets would work? And is it mandatory for an Intel processor with these instruction sets to have integrated graphics for this library to work? The only related information I could find is in #932. Thanks, and sorry for my "Google Translator" English.
@gbullido
1. The trees have implementation differences in our accelerated variant. Support could possibly be extended to decision trees, although it is of lower priority: simple decision trees are usually not a bottleneck, and the stock scikit-learn implementation is sufficient.
2. Sparse data, for example, is not supported; the current accelerated version supports only dense data.
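A minimal sketch of what this means in practice (assuming the standard patch_sklearn() workflow; the array shapes are arbitrary):

```python
import numpy as np
from scipy import sparse

from sklearnex import patch_sklearn
patch_sklearn()  # swap in accelerated versions of supported scikit-learn functions

from sklearn.model_selection import train_test_split

X_dense = np.random.rand(1000, 10)      # dense ndarray: eligible for the accelerated path
X_sparse = sparse.csr_matrix(X_dense)   # sparse matrix: falls back to stock scikit-learn
y = np.random.randint(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X_dense, y, random_state=0)
```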
3. Yes, it is not supported: there is no point in accelerating this particular call, since GridSearchCV carries no compute load of its own. The stock scikit-learn version is simply used.
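Note that the search running on stock scikit-learn does not prevent the estimators fitted inside it from being accelerated when they are supported. A sketch (the dataset and parameter values are illustrative):

```python
from sklearnex import patch_sklearn
patch_sklearn()

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier  # patched, accelerated estimator
from sklearn.model_selection import GridSearchCV     # stock scikit-learn

X, y = make_classification(n_samples=1000, random_state=0)

# GridSearchCV only orchestrates; every fit it performs uses the accelerated forest.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {'n_estimators': [50, 100]}, cv=3)
search.fit(X, y)
```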
4. Yes, AMD would work. Integrated graphics is required only if you want to offload computation to the GPU. For your use case, any x86 processor will work.
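For reference, a sketch of how GPU offload is opted into (assuming the config_context/target_offload mechanism described in the sklearnex docs and a machine with a supported GPU; on a CPU-only machine you simply never enter this context):

```python
import numpy as np

from sklearnex import config_context, patch_sklearn
patch_sklearn()

from sklearn.cluster import KMeans

X = np.random.rand(1000, 4)

# Offload is explicit: outside this context everything runs on the CPU.
with config_context(target_offload="gpu:0"):
    KMeans(n_clusters=3, random_state=0).fit(X)
```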