
Improving preprocessing #1320

Open · aPovidlo wants to merge 70 commits into master
Conversation

@aPovidlo (Collaborator) commented Aug 13, 2024

This is a 🔨 code refactoring.

Summary

Significant Updates in Data Storage and Preprocessing

Major Updates:

  • Enhanced logging: Added more detailed logs in DEBUG mode during preprocessing.
  • New functionality: categorical features can now be marked explicitly when creating data with InputData.from_numpy(...), InputData.from_dataframe(...), and InputData.from_csv(...) (see the sketch after this list).
  • New class: Introduced OptimizedFeatures, which stores data with optimal dtypes for improved efficiency.
  • Preprocessing improvement: Added a new stage called reduce_memory_size to optimize memory usage.
  • API enhancements: Updated PredefinedModel to allow copying parameters from DataPreprocessor.
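
A minimal sketch of how the new categorical marking might look in use. The keyword name `categorical_idx` and the exact `from_dataframe` call shape are assumptions based on this PR's description, not confirmed API:

```python
import pandas as pd
from fedot.core.data.data import InputData
from fedot.core.repository.tasks import Task, TaskTypesEnum

df = pd.DataFrame({
    'age': [25, 32, 47],
    'city': ['London', 'Paris', 'London'],  # categorical column
    'target': [0, 1, 0],
})

data = InputData.from_dataframe(
    df[['age', 'city']], df['target'],
    task=Task(TaskTypesEnum.classification),
    categorical_idx=[1],  # assumed keyword: mark column 1 ('city') as categorical
)
```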

Minor Updates:

  • Improved logic for detecting categorical data (a generic sketch follows this list).
  • Updated encoders and imputers to align with the new changes.
  • Revised tests to incorporate the new features.
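
For illustration, a generic sketch of a common categorical-detection heuristic (non-numeric dtype or low cardinality); the threshold and the helper itself are illustrative assumptions, not the PR's actual logic:

```python
import pandas as pd

def detect_categorical_columns(df: pd.DataFrame, max_unique: int = 13) -> list:
    """Return indices of columns that look categorical."""
    categorical = []
    for idx, col in enumerate(df.columns):
        non_numeric = not pd.api.types.is_numeric_dtype(df[col])
        low_cardinality = df[col].nunique(dropna=True) <= max_unique
        if non_numeric or low_cardinality:
            categorical.append(idx)
    return categorical

df = pd.DataFrame({'age': [25, 32, 47, 51], 'city': ['a', 'b', 'a', 'c']})
print(detect_categorical_columns(df))  # [0, 1]: 'age' also has low cardinality here
```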


@pep8speaks commented Aug 13, 2024

Hello @aPovidlo! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 35:121: E501 line too long (121 > 120 characters)

Line 32:1: F401 'fedot.preprocessing.data_types.ID_TO_TYPE' imported but unused

Line 90:14: W292 no newline at end of file

Comment last updated at 2024-09-08 19:41:58 UTC

@github-actions bot (Contributor) commented Aug 13, 2024

Code in this pull request still contains PEP8 errors; please write the /fix-pep8 command in a comment below to create a commit with automatic fixes.


Resolved (outdated) review threads on: fedot/core/data/data.py, fedot/api/api_utils/api_data.py, fedot/preprocessing/data_types.py, fedot/preprocessing/preprocessing.py, fedot/preprocessing/categorical.py
…_fit_transform by adding cat and num idx in get_dataset func
…by switching XGBoost to CatBoost, due to "Experimental support for categorical data is not implemented for current tree method yet." for XGBoost, and checking feature ids with size
@aPovidlo requested a review from DRMPN on August 22, 2024 at 15:56
@aPovidlo (Collaborator, Author) commented:

I brought the changes from #1318 into this PR, fixed their shortcomings, and added tests in this PR for the new features from #1318.

@DRMPN (Collaborator) left a comment:

lgtm 👍

@Lopa10ko (Collaborator) commented Aug 23, 2024

I don't know why the integration tests failed to start in GitHub Actions, but locally a great many tests fail because of the change in the preprocessing approach.

This is ignoring `Experimental support for categorical data is not implemented for current tree method yet`.

[screenshot: local test failures]

@aPovidlo (Collaborator, Author) commented:

> I don't know why the integration tests failed to start in GitHub Actions, but locally a great many tests fail because of the change in the preprocessing approach.

Thank you! I'll definitely look into this. In the meantime I've been making other small changes and testing them on my own examples.

@aPovidlo (Collaborator, Author) commented:

@nicl-nno I also think it would be useful to run a profiler to measure peak memory usage. I suspect that reduce_memory may need to be moved before categorical encoding is applied, or that the process should be updated slightly so that it converts arrays to int8 right away (a sketch of one way to measure this follows).
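
A minimal, self-contained sketch of measuring peak memory with the standard library's tracemalloc; the array workload below is a stand-in for the real preprocessing call being profiled:

```python
import numpy as np
import tracemalloc

tracemalloc.start()
# stand-in workload: replace with the preprocessing step being profiled
features = np.random.rand(1_000, 1_000)   # float64, ~8 MiB
features = features.astype(np.float32)    # e.g. a dtype-reduction step
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f'current: {current / 2**20:.1f} MiB, peak: {peak / 2**20:.1f} MiB')
```

The gap between current and peak is exactly what moving the memory reduction earlier would shrink: during the astype call both the float64 and float32 copies are alive at once.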

@aPovidlo (Collaborator, Author) commented Sep 8, 2024

@Lopa10ko Could I ask you to run the integration tests again?

@Lopa10ko (Collaborator) commented Sep 9, 2024

> @Lopa10ko Could I ask you to run the integration tests again?

[screenshot: integration test run results]

@aPovidlo (Collaborator, Author) commented Sep 9, 2024

> @Lopa10ko Could I ask you to run the integration tests again?
>
> [screenshot: integration test run results]

Thank you! I'll get to it as soon as I have some free time.

```python
else:
    self.log.debug('-- Reduce memory in features')
    data.features = reduce_mem_usage(data.features, data.supplementary_data.col_type_ids['features'])
```

@Lopa10ko (Collaborator) commented Sep 9, 2024:

It seems there is a persistent error that causes the test_fill_nan_without_categorical integration test to crash. The reason for this is that data.features should be of type np.ndarray, but new functions, such as reduce_mem_usage, internally work with pd.DataFrame.

If you need to maintain the data type returned by reduce_mem_usage, I recommend the following approach:

Suggested change:

```python
data.features = data.features.to_numpy()
```

This also solves problems with test_text_preprocessing.py::test_clean_text_preprocessing and test_real_cases.py::test_spam_detection_problem.
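
A minimal illustration of the type mismatch described above, assuming a reduce_mem_usage-style helper that downcasts dtypes through pandas (this is a sketch, not the PR's actual implementation):

```python
import numpy as np
import pandas as pd

def reduce_mem_usage_sketch(features: np.ndarray) -> np.ndarray:
    df = pd.DataFrame(features)
    # downcast each column to the smallest integer dtype that fits its values
    for col in df.columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    # without this conversion the caller would silently receive a pd.DataFrame
    return df.to_numpy()

features = np.array([[1, 200], [3, 40_000]], dtype=np.int64)
assert isinstance(reduce_mem_usage_sketch(features), np.ndarray)
```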

```python
@copy_doc(BasePreprocessor.reduce_memory_size)
def reduce_memory_size(self, data: InputData) -> InputData:
    if isinstance(data, InputData):
        if data.task.task_type == TaskTypesEnum.ts_forecasting:
```
Collaborator comment:

At this point the value of the data.supplementary_data.col_type_ids field is set to None, resulting in integration test failures for test_classification.py::test_image_classification_quality, test_classification.py::test_cnn_methods, and test_data.py::test_data_from_image.
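
One possible defensive guard for this failure mode (a sketch, assuming col_type_ids is a dict when populated, as the snippet above suggests; not necessarily the fix adopted in the PR):

```python
# skip memory reduction when column types have not been inferred yet
col_type_ids = data.supplementary_data.col_type_ids
if col_type_ids is not None and col_type_ids.get('features') is not None:
    data.features = reduce_mem_usage(data.features, col_type_ids['features'])
```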

```python
if isinstance(features, np.ndarray):
    transformed_categorical = self.encoder.transform(features[:, self.categorical_ids]).toarray()
    # Stack transformed categorical and non-categorical data, ignore if none
    non_categorical_features = features[:, self.non_categorical_ids]
```
Collaborator comment:

What was the rationale behind choosing np.ndarray as the type for self.non_categorical_ids?

```python
self.categorical_ids: np.ndarray = np.array([])
self.non_categorical_ids: np.ndarray = np.array([])
self.encoded_ids: np.ndarray = np.array([])
self.new_numerical_idx: np.ndarray = np.array([])
```

The integration test test_pipeline_preprocessing.py::test_data_preprocessor_performs_optional_data_preprocessing_only_once failed when the default value of self.non_categorical_ids was changed from [] to np.array([]).

It's also a bit strange that the changes to the type of these id attributes are limited to OneHotEncodingImplementation.

@aPovidlo (Collaborator, Author) replied:

@Lopa10ko Initially it was assumed that the input would be of type np.ndarray. I noticed that PyCharm flags the type mismatch, so I decided to change it. Perhaps the test should be changed as well, or the types.

@Lopa10ko (Collaborator) commented Sep 9, 2024:

The issue is that numpy index arrays must have a consistent integer or boolean data type. However, the test checks something entirely different. The easiest solution is either to leave the lists as they are or to set an explicit dtype; a minimal demonstration follows.
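
A short demonstration of the indexing pitfall: an empty np.array() defaults to float64, which numpy rejects as an index, while an explicit integer dtype works:

```python
import numpy as np

features = np.arange(12).reshape(3, 4)

bad_ids = np.array([])              # dtype=float64 by default
good_ids = np.array([], dtype=int)

print(features[:, good_ids].shape)  # (3, 0): empty selection is fine
try:
    features[:, bad_ids]            # IndexError: float arrays cannot be indices
except IndexError as err:
    print(err)
```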
