Multi process schema inference #6

psmyth94 · 2024-11-12T13:59:11Z

Adds a new SchemaManager class that infers the schema during downloads, updates the load_dataset function so that existing keyword arguments are not overwritten, and improves error messages and logging.

New Features:

Added SchemaManager class to handle schema gathering during downloads and integrated it into the src/biosets/download module. (src/biosets/download/__init__.py, src/biosets/download/schema_manager.py) [1] [2]

Function Enhancements:

Updated load_dataset function to check for existing keyword arguments before updating with new ones. (src/biosets/load.py) [1] [2]
Modified _split_generators in biodata.py to use SchemaManager for schema management when certain conditions are met. (src/biosets/packaged_modules/biodata/biodata.py)

Error Messaging and Logging:

Fixed log messages that were not separated by a line break when a progress bar is present. (src/biosets/packaged_modules/biodata/biodata.py) [1] [2] [3] [4] [5] [6] [7] [8] [9]

Updated the load_dataset function to prevent overwriting existing keyword arguments when updating new_kwargs. This change ensures that only non-existing keyword arguments are added from dataset_builder, builder_config, and builder_info.
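The non-overwriting merge described above can be sketched as follows. This is a minimal illustration, not the code in src/biosets/load.py: the helper name merge_missing_kwargs and the example dictionaries are hypothetical, and the real function draws its values from dataset_builder, builder_config, and builder_info.

```python
def merge_missing_kwargs(existing, *sources):
    """Return a copy of `existing` extended only with keys it does not already have.

    Hypothetical helper: earlier sources take precedence over later ones,
    and nothing in `existing` is ever replaced.
    """
    merged = dict(existing)
    for source in sources:
        for key, value in source.items():
            # setdefault leaves values that are already present untouched.
            merged.setdefault(key, value)
    return merged


# Illustrative values only.
user_kwargs = {"sep": "\t", "batch_size": 100}
builder_defaults = {"sep": ",", "encoding": "utf-8"}
config_defaults = {"batch_size": 1000, "encoding": "latin-1"}

new_kwargs = merge_missing_kwargs(user_kwargs, builder_defaults, config_defaults)
print(new_kwargs)
```

Copying the input before merging keeps the caller's dictionary unchanged, which makes the precedence order easy to reason about.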

Separated the import of tqdm from py_utils and updated it to import directly from datasets.utils. Also added the import for thread_map from tqdm.contrib.concurrent to enable concurrent processing.

Added a newline character at the beginning of various warning messages to improve readability in the logs. This change ensures that the messages are clearly separated from preceding log entries.
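A small sketch of why the leading newline helps. It assumes a progress bar has drawn on the current line without terminating it; the logger name, the message text, and the in-memory stream are made up for the example, and the real handler and format in biodata.py may differ.

```python
import io
import logging


def make_logger(stream):
    """Build an isolated logger that writes bare messages to `stream`."""
    logger = logging.getLogger("biosets_sketch")  # hypothetical logger name
    logger.setLevel(logging.WARNING)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
logger = make_logger(stream)

# Simulate a progress bar that has written to the line without a newline.
stream.write("Downloading:  40%|####      |")

# The leading "\n" moves the warning onto its own line.
logger.warning("\nColumn 'age' is missing from one of the files")

output = stream.getvalue()
print(output)
```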

Integrated SchemaManager to handle dl_manager when features are not specified and add_missing_columns or zero_as_missing is enabled. This change ensures proper schema management and improves the data loading process by gathering schema information during download.
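The idea of gathering a schema per file and merging the results can be sketched with the standard library alone. This is not the SchemaManager implementation: the function names, the three-type promotion order, and the use of ThreadPoolExecutor on in-memory CSV text are all assumptions made for the illustration.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

# Assumed promotion order: a wider type wins when shards disagree.
_RANK = {"int64": 0, "float64": 1, "string": 2}


def should_gather_schema(features, add_missing_columns, zero_as_missing):
    """Mirror the condition in the description above."""
    return features is None and (add_missing_columns or zero_as_missing)


def infer_type(values):
    """Pick the narrowest of int64, float64, string that fits all non-empty values."""
    for caster, name in ((int, "int64"), (float, "float64")):
        try:
            for value in values:
                if value != "":
                    caster(value)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(text):
    """Infer {column: type} from the CSV text of one shard."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    return {name: infer_type([row[i] for row in body]) for i, name in enumerate(header)}


def merge_schemas(schemas):
    """Union the columns of all shards, keeping the widest type per column."""
    merged = {}
    for schema in schemas:
        for name, dtype in schema.items():
            current = merged.get(name)
            if current is None or _RANK[dtype] > _RANK[current]:
                merged[name] = dtype
    return merged


def gather_schema(shards, max_workers=4):
    """Infer each shard's schema concurrently, then merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        schemas = list(pool.map(infer_schema, shards))
    return merge_schemas(schemas)


# Two shards whose columns and types differ.
shards = ["id,age\n1,30\n2,41\n", "id,age,weight\n3,29.5,70\n"]
if should_gather_schema(features=None, add_missing_columns=True, zero_as_missing=False):
    schema = gather_schema(shards)
    print(schema)
```

Merging after inference means each file can be processed independently, which is what makes the per-file step easy to run in parallel.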

psmyth94 · 2024-11-12T14:14:00Z

Need tests for sharded files with different schemas

Ensure that None features are not updated into the features dictionary. Set features to None if the resulting dictionary is empty.
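A minimal sketch of that rule, assuming features are collected as one optional dictionary per file. The function name merge_features and the string type labels are hypothetical.

```python
def merge_features(per_file_features):
    """Merge per-file feature dicts, skipping None entries.

    Returns None instead of an empty dict so callers can tell
    "nothing was inferred" apart from a real schema.
    """
    merged = {}
    for features in per_file_features:
        if features is None:
            # Files that produced no features must not reach dict.update.
            continue
        merged.update(features)
    return merged or None


print(merge_features([None, {"id": "int64"}, {"age": "float64"}]))
print(merge_features([None, None]))
```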

Extend schema manager to handle TSV and TXT files using Polars. This ensures that schemas for these file types are correctly processed and converted to Arrow schemas.

Refactor the handling of dataset builder arguments to exclude VAR_KEYWORD parameters. This ensures that only relevant arguments are considered when matching with load_dataset arguments.

Update the handling of builder config arguments to exclude VAR_KEYWORD parameters. This ensures that only relevant arguments are considered when matching with load_dataset arguments.

Update the handling of builder info arguments to exclude VAR_KEYWORD parameters. This ensures that only relevant arguments are considered when matching with load_dataset arguments.
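Excluding VAR_KEYWORD parameters can be sketched with the standard inspect module. The helper names and the example_builder signature below are hypothetical; only inspect.signature and inspect.Parameter.VAR_KEYWORD are real library features.

```python
import inspect


def named_parameters(func):
    """List the parameter names of `func`, leaving out any **kwargs catch-all."""
    signature = inspect.signature(func)
    return [
        name
        for name, param in signature.parameters.items()
        if param.kind != inspect.Parameter.VAR_KEYWORD
    ]


def select_kwargs(func, kwargs):
    """Keep only the entries of `kwargs` that match a named parameter of `func`."""
    allowed = set(named_parameters(func))
    return {key: value for key, value in kwargs.items() if key in allowed}


def example_builder(path, sep=",", **config_kwargs):
    """Stand-in for a builder whose signature ends in a **kwargs catch-all."""


print(named_parameters(example_builder))
print(select_kwargs(example_builder, {"sep": "\t", "config_kwargs": {}, "split": "train"}))
```

Without the filter, the catch-all's own name (config_kwargs here) would be treated as if it were a real argument when matching against load_dataset arguments.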

Improve the handling of builder kwargs by updating them with new kwargs or attributes from builder_info. This ensures that only relevant arguments are considered when matching with load_dataset arguments.

Include builder_kwargs in new_kwargs to ensure that builder-specific arguments are passed correctly when calling load_dataset.

Update the TARGET_COLUMN constant to encoded_labels for better clarity and consistency in the codebase.

Rename drop_* attributes to encode_labels and *_kwargs attributes to data_builder_kwargs, sample_metadata_builder_kwargs, and feature_metadata_builder_kwargs for better clarity and consistency. Update the handling of these kwargs to ensure proper argument passing.

Refactored the _get_builder_kwargs method to return additional values (config_path and module_path) and to use a new_builder_kwargs dictionary instead of modifying the existing builder_kwargs. This improves code clarity and ensures that the original builder_kwargs is not altered.

Added a new attribute _missing_label_value to handle cases where labels are missing. Updated the _label_map initialization to set appropriate values for _missing_label_value based on the presence of positive and negative labels. Modified the bin_labels assignment to use the new _missing_label_value.

Added support for sample and feature metadata generators in the _split_generators method. Refactored the code to handle the initialization and feature inference for these generators. Updated the splits to include metadata generator kwargs. This enhances the flexibility and functionality of the data loading process.

Modified the _add_sample_metadata method to include a data_generator parameter. Added validation to ensure the number of rows in the sample metadata matches the number of rows in the data table when no sample column is present. This ensures consistency and correctness in the metadata handling process.
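The row-count check can be sketched as below. The function name, its parameters, and the error text are hypothetical; the real method works on tables rather than bare counts.

```python
def validate_sample_metadata(num_data_rows, num_metadata_rows, sample_column=None):
    """Raise if metadata must be matched by position but the row counts differ.

    When a sample column exists, rows can be joined by key, so the counts
    are allowed to differ in this sketch.
    """
    if sample_column is None and num_data_rows != num_metadata_rows:
        raise ValueError(
            f"Sample metadata has {num_metadata_rows} rows but the data table has "
            f"{num_data_rows}; without a sample column they must match."
        )


validate_sample_metadata(3, 3)                             # same length: accepted
validate_sample_metadata(3, 5, sample_column="sample_id")  # joined by key: accepted
```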

Refactored the _prepare_labels method to improve label handling and added support for encoding labels. Introduced sample_metadata_generator and feature_metadata_generator in the _generate_tables method to handle metadata generation more efficiently. This update ensures better consistency and flexibility in processing data and metadata.

Updated the _load_metadata method to include additional parameters (table, key, sample_metadata_generator_iter) for better handling of sample metadata. Modified the _add_sample_metadata method to accept a generator parameter. This change improves the flexibility and accuracy of metadata processing in the BioData class.

Refactored feature handling to use self.config.features instead of self.info.features. Updated the logic to add missing columns and cast the table schema based on self.config.features. This change ensures better consistency and flexibility in managing feature metadata.
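Adding missing columns and casting to a target schema can be sketched on a plain dictionary of columns. This is only an illustration: the real code operates on Arrow tables, and the function name, the use of Python callables as types, and the choice to drop columns not listed in the features are assumptions.

```python
def align_to_features(table, features):
    """Return a new column dict holding exactly the columns named in `features`.

    Columns missing from `table` are filled with None; present values are
    cast with the callable given for that feature.
    """
    num_rows = len(next(iter(table.values()), []))
    aligned = {}
    for name, caster in features.items():
        column = table.get(name, [None] * num_rows)
        aligned[name] = [None if value is None else caster(value) for value in column]
    return aligned


# Illustrative target schema and a table that lacks the "weight" column.
features = {"id": int, "age": float, "weight": float}
table = {"id": ["1", "2"], "age": ["30", "41"]}

aligned = align_to_features(table, features)
print(aligned)
```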

Refactored the _load_metadata method to include additional parameters (table, file_key, gen_table_iter) for improved metadata handling. Updated the logic to handle multiple metadata files and ensure consistency in metadata processing. This change enhances the flexibility and accuracy of metadata management in the BioData class.

Updated the schema handling logic to use self.config.features instead of self.info.features. This change ensures that the schema is correctly aligned with the configuration features, improving consistency and flexibility in feature management.

Refactored the schema handling logic to use the _create_target_feature method for creating target features. This change simplifies the code and ensures consistency in the creation of target features based on the provided schema and configuration.

Renamed the 'separator' attribute to 'sep' in the CsvConfig class for consistency with other parameter names. Updated the pl_scan_csv_kwargs property and relevant method calls to reflect this change. This improves code readability and consistency.
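A sketch of how a config attribute named sep can still feed a reader that expects a differently named option. The class below is hypothetical and much smaller than the real CsvConfig; it assumes the property translates sep into the separator keyword that Polars' scan_csv accepts.

```python
from dataclasses import dataclass


@dataclass
class CsvConfigSketch:
    """Hypothetical stand-in for CsvConfig with only two options."""

    sep: str = ","
    has_header: bool = True

    @property
    def pl_scan_csv_kwargs(self):
        # Assumption: the attribute is `sep`, but the reader keyword is `separator`.
        return {"separator": self.sep, "has_header": self.has_header}


config = CsvConfigSketch(sep="\t")
print(config.pl_scan_csv_kwargs)
```

Keeping the translation inside one property means the rest of the code can use sep consistently and only this one place needs to know the reader's keyword names.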

Updated the test data and metadata files to include additional samples and features. Refactored the test cases to handle the new data structure and ensure consistency in the test results. This change improves the coverage and accuracy of the tests for the BioData class.

psmyth94 added 6 commits November 12, 2024 08:50

add existing_kwargs to track kwargs keys

fba2930

fix(load.py): avoid overwriting existing kwargs

3a92a0b

fix(biodata.py): update tqdm import and add thread_map

9eb1efc

fix(biodata.py): add newline to warning messages

e02277d

ruff fix

03acf35

psmyth94 added 22 commits November 13, 2024 15:15

fix(schema_manager): handle None features and empty dict

f846afc

feat(schema_manager): add support for TSV and TXT files

69e8e73

refactor(load.py): improve dataset builder argument handling

324be1b

refactor(load.py): exclude VAR_KEYWORD in config args

6b8894b

refactor(load.py): exclude VAR_KEYWORD in builder info args

4d51139

refactor(load.py): update builder kwargs handling

4477508

refactor(load.py): add builder_kwargs to new_kwargs

b3242ce

refactor(biodata.py): rename TARGET_COLUMN to encoded_labels

57c5b17

refactor(biodata.py): rename drop_* and *_kwargs attributes

b92d397

psmyth94 merged commit 01dafcf into main Nov 14, 2024
9 checks passed
