
Improving preprocessing #1320

Open · aPovidlo wants to merge 70 commits into master
Conversation

@aPovidlo (Collaborator) commented Aug 13, 2024

This is a 🔨 code refactoring.

Summary

Significant Updates in Data Storage and Preprocessing

Major Updates:

  • Enhanced logging: Added more detailed logs in DEBUG mode during preprocessing.
  • New functionality: categorical features can now be marked explicitly when creating data with InputData.from_numpy(...), InputData.from_dataframe(...), and InputData.from_csv(...) (see the sketch after this list).
  • New class: Introduced OptimizedFeatures, which stores data with optimal dtypes for improved efficiency.
  • Preprocessing improvement: Added a new stage called reduce_memory_size to optimize memory usage.
  • API enhancements: Updated PredefinedModel to allow copying parameters from DataPreprocessor.
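
A minimal sketch of how the new categorical marking might look in use. The keyword name `categorical_idx` and the exact `from_dataframe` call shape are assumptions based on this PR's description, not confirmed API:

```python
import pandas as pd
from fedot.core.data.data import InputData
from fedot.core.repository.tasks import Task, TaskTypesEnum

df = pd.DataFrame({
    'age': [25, 32, 47],
    'city': ['London', 'Paris', 'London'],  # categorical column
    'target': [0, 1, 0],
})

data = InputData.from_dataframe(
    df[['age', 'city']], df['target'],
    task=Task(TaskTypesEnum.classification),
    categorical_idx=[1],  # assumed keyword: mark column 1 ('city') as categorical
)
```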

Minor Updates:

  • Improved logic for detecting categorical data (a generic sketch follows this list).
  • Updated encoders and imputers to align with the new changes.
  • Revised tests to incorporate the new features.
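
For illustration, a generic sketch of a common categorical-detection heuristic (non-numeric dtype or low cardinality); the threshold and the helper itself are illustrative assumptions, not the PR's actual logic:

```python
import pandas as pd

def detect_categorical_columns(df: pd.DataFrame, max_unique: int = 13) -> list:
    """Return indices of columns that look categorical."""
    categorical = []
    for idx, col in enumerate(df.columns):
        non_numeric = not pd.api.types.is_numeric_dtype(df[col])
        low_cardinality = df[col].nunique(dropna=True) <= max_unique
        if non_numeric or low_cardinality:
            categorical.append(idx)
    return categorical

df = pd.DataFrame({'age': [25, 32, 47, 51], 'city': ['a', 'b', 'a', 'c']})
print(detect_categorical_columns(df))  # [0, 1]: 'age' also has low cardinality here
```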


@pep8speaks commented Aug 13, 2024

Hello @aPovidlo! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 35:121: E501 line too long (121 > 120 characters)

Line 32:1: F401 'fedot.preprocessing.data_types.ID_TO_TYPE' imported but unused

Line 90:14: W292 no newline at end of file

Comment last updated at 2024-09-08 19:41:58 UTC

@github-actions bot (Contributor) commented Aug 13, 2024

Code in this pull request still contains PEP8 errors; please write the /fix-pep8 command in a comment below to create a commit with automatic fixes.


Resolved (outdated) review threads on: fedot/core/data/data.py, fedot/api/api_utils/api_data.py, fedot/preprocessing/data_types.py, fedot/preprocessing/preprocessing.py, fedot/preprocessing/categorical.py
…_fit_transform by adding cat and num idx in get_dataset func
…by switching XGBoost to CatBoost, due to "Experimental support for categorical data is not implemented for current tree method yet." for XGBoost, and checking feature ids with size
@aPovidlo requested a review from DRMPN on August 22, 2024 at 15:56
@aPovidlo (Collaborator, Author) commented:

I brought the changes from #1318 into this PR, fixed their shortcomings, and added tests in this PR for the new features from #1318.

@DRMPN (Collaborator) left a comment:

lgtm 👍

@Lopa10ko (Collaborator) commented Aug 23, 2024

I don't know why the integration tests failed to start in GitHub Actions, but locally a great many tests fail because of the change in the preprocessing approach.

This is ignoring `Experimental support for categorical data is not implemented for current tree method yet`.

[screenshot: local test failures]

@aPovidlo (Collaborator, Author) commented:

> I don't know why the integration tests failed to start in GitHub Actions, but locally a great many tests fail because of the change in the preprocessing approach.

Thank you! I'll definitely look into this. In the meantime I've been making other small changes and testing them on my own examples.

@aPovidlo (Collaborator, Author) commented:

@nicl-nno I also think it would be useful to run a profiler to measure peak memory usage. I suspect that reduce_memory may need to be moved before categorical encoding is applied, or that the process should be updated slightly so that it converts arrays to int8 right away (a sketch of one way to measure this follows).
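
A minimal, self-contained sketch of measuring peak memory with the standard library's tracemalloc; the array workload below is a stand-in for the real preprocessing call being profiled:

```python
import numpy as np
import tracemalloc

tracemalloc.start()
# stand-in workload: replace with the preprocessing step being profiled
features = np.random.rand(1_000, 1_000)   # float64, ~8 MiB
features = features.astype(np.float32)    # e.g. a dtype-reduction step
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f'current: {current / 2**20:.1f} MiB, peak: {peak / 2**20:.1f} MiB')
```

The gap between current and peak is exactly what moving the memory reduction earlier would shrink: during the astype call both the float64 and float32 copies are alive at once.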

@aPovidlo (Collaborator, Author) commented Sep 8, 2024

@Lopa10ko Could I ask you to run the integration tests again?

@Lopa10ko (Collaborator) commented Sep 9, 2024

> @Lopa10ko Could I ask you to run the integration tests again?

[screenshot: integration test run results]

@aPovidlo (Collaborator, Author) commented Sep 9, 2024

> @Lopa10ko Could I ask you to run the integration tests again?
>
> [screenshot: integration test run results]

Thank you! I'll get to it as soon as I have some free time.

```python
else:
    self.log.debug('-- Reduce memory in features')
    data.features = reduce_mem_usage(data.features, data.supplementary_data.col_type_ids['features'])
```

@Lopa10ko (Collaborator) commented Sep 9, 2024:

It seems there is a persistent error that causes the test_fill_nan_without_categorical integration test to crash. The reason for this is that data.features should be of type np.ndarray, but new functions, such as reduce_mem_usage, internally work with pd.DataFrame.

If you need to maintain the data type returned by reduce_mem_usage, I recommend the following approach:

Suggested change:

```python
data.features = data.features.to_numpy()
```

This also solves problems with test_text_preprocessing.py::test_clean_text_preprocessing and test_real_cases.py::test_spam_detection_problem.
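
A minimal illustration of the type mismatch described above, assuming a reduce_mem_usage-style helper that downcasts dtypes through pandas (this is a sketch, not the PR's actual implementation):

```python
import numpy as np
import pandas as pd

def reduce_mem_usage_sketch(features: np.ndarray) -> np.ndarray:
    df = pd.DataFrame(features)
    # downcast each column to the smallest integer dtype that fits its values
    for col in df.columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    # without this conversion the caller would silently receive a pd.DataFrame
    return df.to_numpy()

features = np.array([[1, 200], [3, 40_000]], dtype=np.int64)
assert isinstance(reduce_mem_usage_sketch(features), np.ndarray)
```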

```python
@copy_doc(BasePreprocessor.reduce_memory_size)
def reduce_memory_size(self, data: InputData) -> InputData:
    if isinstance(data, InputData):
        if data.task.task_type == TaskTypesEnum.ts_forecasting:
```
Collaborator comment:

At this point the value of the data.supplementary_data.col_type_ids field is set to None, resulting in integration test failures for test_classification.py::test_image_classification_quality, test_classification.py::test_cnn_methods, and test_data.py::test_data_from_image.
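
One possible defensive guard for this failure mode (a sketch, assuming col_type_ids is a dict when populated, as the snippet above suggests; not necessarily the fix adopted in the PR):

```python
# skip memory reduction when column types have not been inferred yet
col_type_ids = data.supplementary_data.col_type_ids
if col_type_ids is not None and col_type_ids.get('features') is not None:
    data.features = reduce_mem_usage(data.features, col_type_ids['features'])
```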

```python
if isinstance(features, np.ndarray):
    transformed_categorical = self.encoder.transform(features[:, self.categorical_ids]).toarray()
    # Stack transformed categorical and non-categorical data, ignore if none
    non_categorical_features = features[:, self.non_categorical_ids]
```
Collaborator comment:

What was the rationale behind choosing np.ndarray as the type for self.non_categorical_ids?

```python
self.categorical_ids: np.ndarray = np.array([])
self.non_categorical_ids: np.ndarray = np.array([])
self.encoded_ids: np.ndarray = np.array([])
self.new_numerical_idx: np.ndarray = np.array([])
```

The integration test test_pipeline_preprocessing.py::test_data_preprocessor_performs_optional_data_preprocessing_only_once failed when the default value of self.non_categorical_ids was changed from [] to np.array([]).

It's also a bit strange that the changes to the type of these id attributes are limited to OneHotEncodingImplementation.

@aPovidlo (Collaborator, Author) replied:

@Lopa10ko Initially it was assumed that the input would be of type np.ndarray. I noticed that PyCharm flags the type mismatch, so I decided to change it. Perhaps the test should be changed as well, or the types.

@Lopa10ko (Collaborator) commented Sep 9, 2024:

The issue is that numpy index arrays must have a consistent integer or boolean data type. However, the test checks something entirely different. The easiest solution is either to leave the lists as they are or to set an explicit dtype; a minimal demonstration follows.
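
A short demonstration of the indexing pitfall: an empty np.array() defaults to float64, which numpy rejects as an index, while an explicit integer dtype works:

```python
import numpy as np

features = np.arange(12).reshape(3, 4)

bad_ids = np.array([])              # dtype=float64 by default
good_ids = np.array([], dtype=int)

print(features[:, good_ids].shape)  # (3, 0): empty selection is fine
try:
    features[:, bad_ids]            # IndexError: float arrays cannot be indices
except IndexError as err:
    print(err)
```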
