Multi process schema inference #6
Merged
Updated the load_dataset function to prevent overwriting existing keyword arguments when updating new_kwargs. This change ensures that only non-existing keyword arguments are added from dataset_builder, builder_config, and builder_info.
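A minimal sketch of the non-overwriting update described above, assuming new_kwargs and the values gathered from dataset_builder, builder_config, and builder_info can be treated as plain dictionaries. The helper name add_missing_kwargs and the sample values are hypothetical and do not come from the PR.

```python
def add_missing_kwargs(new_kwargs, *sources):
    """Copy keys from each source into new_kwargs only if they are absent."""
    for source in sources:
        for key, value in source.items():
            # setdefault leaves an existing entry untouched.
            new_kwargs.setdefault(key, value)
    return new_kwargs


# Hypothetical stand-ins for values read from the builder objects.
new_kwargs = {"sep": "\t"}
add_missing_kwargs(
    new_kwargs,
    {"sep": ",", "batch_size": 1000},
    {"batch_size": 500, "name": "default"},
)
print(new_kwargs)
```

Earlier sources win over later ones here, so the order in which the sources are passed decides which default is kept.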
Separated the import of tqdm from py_utils and updated it to import directly from datasets.utils. Also added the import for thread_map from tqdm.contrib.concurrent to enable concurrent processing.
Added a newline character at the beginning of various warning messages to improve readability in the logs. This change ensures that the messages are clearly separated from preceding log entries.
Integrated SchemaManager to handle dl_manager when features are not specified and add_missing_columns or zero_as_missing is enabled. This change ensures proper schema management and improves the data loading process by gathering schema information during download.
Need tests for sharded files with different schemas.
Ensure that None features are not updated into the features dictionary. Set features to None if the resulting dictionary is empty.
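One way to read this commit is sketched below. It assumes features are held in dictionaries keyed by column name, and it skips both a missing dictionary and individual None entries; the function name merge_features is hypothetical and the actual code in the PR may differ.

```python
def merge_features(*feature_dicts):
    """Merge feature dictionaries, ignoring None values, or return None."""
    merged = {}
    for features in feature_dicts:
        if not features:
            # Nothing to merge from a None or empty dictionary.
            continue
        for name, feature in features.items():
            if feature is not None:
                merged[name] = feature
    # An empty result is reported as None so that callers can fall back
    # to schema inference instead of receiving an empty schema.
    return merged or None


print(merge_features(None, {"age": "int64", "label": None}))
print(merge_features(None, {}))
```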
Extend schema manager to handle TSV and TXT files using Polars. This ensures that schemas for these file types are correctly processed and converted to Arrow schemas.
Refactor the handling of dataset builder arguments to exclude VAR_KEYWORD parameters. This ensures that only relevant arguments are considered when matching with load_dataset arguments.
Update the handling of builder config arguments to exclude VAR_KEYWORD parameters. This ensures that only relevant arguments are considered when matching with load_dataset arguments.
Update the handling of builder info arguments to exclude VAR_KEYWORD parameters. This ensures that only relevant arguments are considered when matching with load_dataset arguments.
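The three commits above share one idea, which can be sketched with the standard inspect module: list the named parameters of a callable and skip the **kwargs catch-all before matching against load_dataset arguments. ExampleConfig, named_parameters, and load_kwargs are hypothetical names used only for illustration.

```python
import inspect


def named_parameters(func):
    """Return the parameter names of func, skipping the **kwargs catch-all."""
    return [
        name
        for name, param in inspect.signature(func).parameters.items()
        if param.kind is not inspect.Parameter.VAR_KEYWORD
    ]


# Hypothetical builder config, not the class used in the PR.
class ExampleConfig:
    def __init__(self, name, sep=",", **config_kwargs):
        self.name = name
        self.sep = sep


load_kwargs = {"sep": "\t", "config_kwargs": {"x": 1}, "streaming": True}
accepted = named_parameters(ExampleConfig)
matched = {k: v for k, v in load_kwargs.items() if k in accepted}
print(accepted)
print(matched)
```

Without the filter, a key that happens to share the name of the catch-all parameter (config_kwargs here) would be treated as a real argument.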
Improve the handling of builder kwargs by updating them with new kwargs or attributes from builder_info. This ensures that only relevant arguments are considered when matching with load_dataset arguments.
Include builder_kwargs in new_kwargs to ensure that builder-specific arguments are passed correctly when calling load_dataset.
Update the TARGET_COLUMN constant to encoded_labels for better clarity and consistency in the codebase.
Rename drop_* attributes to encode_labels and *_kwargs attributes to data_builder_kwargs, sample_metadata_builder_kwargs, and feature_metadata_builder_kwargs for better clarity and consistency.
Rename drop_* attributes to encode_labels and *_kwargs attributes to data_builder_kwargs, sample_metadata_builder_kwargs, and feature_metadata_builder_kwargs for better clarity and consistency. Update the handling of these kwargs to ensure proper argument passing.
Refactored the _get_builder_kwargs method to return additional values (config_path and module_path) and to use a new_builder_kwargs dictionary instead of modifying the existing builder_kwargs. This improves code clarity and ensures that the original builder_kwargs is not altered.
Added a new attribute _missing_label_value to handle cases where labels are missing. Updated the _label_map initialization to set appropriate values for _missing_label_value based on the presence of positive and negative labels. Modified the bin_labels assignment to use the new _missing_label_value.
Added support for sample and feature metadata generators in the _split_generators method. Refactored the code to handle the initialization and feature inference for these generators. Updated the splits to include metadata generator kwargs. This enhances the flexibility and functionality of the data loading process.
Modified the _add_sample_metadata method to include a data_generator parameter. Added validation to ensure the number of rows in the sample metadata matches the number of rows in the data table when no sample column is present. This ensures consistency and correctness in the metadata handling process.
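The row-count validation can be sketched as follows, assuming both tables are available as sequences of rows and that a sample column, when present, allows a join instead of positional alignment. The function name check_metadata_rows and the error wording are hypothetical.

```python
def check_metadata_rows(data_rows, metadata_rows, sample_column=None):
    """Raise if metadata cannot be aligned with the data by position."""
    if sample_column is not None:
        # Rows can be matched on the sample column, so counts may differ.
        return
    if len(data_rows) != len(metadata_rows):
        raise ValueError(
            f"Sample metadata has {len(metadata_rows)} rows but the data "
            f"table has {len(data_rows)}; a sample column is needed to "
            "align them."
        )


data = [[0.1, 0.2], [0.3, 0.4]]
check_metadata_rows(data, [{"age": 30}, {"age": 41}])  # same length, passes
check_metadata_rows(data, [{"id": "s1"}], sample_column="id")  # join possible
```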
Refactored the _prepare_labels method to improve label handling and added support for encoding labels. Introduced sample_metadata_generator and feature_metadata_generator in the _generate_tables method to handle metadata generation more efficiently. This update ensures better consistency and flexibility in processing data and metadata.
Updated the _load_metadata method to include additional parameters (table, key, sample_metadata_generator_iter) for better handling of sample metadata. Modified the _add_sample_metadata method to accept a generator parameter. This change improves the flexibility and accuracy of metadata processing in the BioData class.
Refactored feature handling to use self.config.features instead of self.info.features. Updated the logic to add missing columns and cast the table schema based on self.config.features. This change ensures better consistency and flexibility in managing feature metadata.
Refactored the _load_metadata method to include additional parameters (table, file_key, gen_table_iter) for improved metadata handling. Updated the logic to handle multiple metadata files and ensure consistency in metadata processing. This change enhances the flexibility and accuracy of metadata management in the BioData class.
Updated the schema handling logic to use self.config.features instead of self.info.features. This change ensures that the schema is correctly aligned with the configuration features, improving consistency and flexibility in feature management.
Refactored the schema handling logic to use the _create_target_feature method for creating target features. This change simplifies the code and ensures consistency in the creation of target features based on the provided schema and configuration.
Renamed the 'separator' attribute to 'sep' in the CsvConfig class for consistency with other parameter names. Updated the pl_scan_csv_kwargs property and relevant method calls to reflect this change. This improves code readability and consistency.
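A rough sketch of what such a rename could look like, assuming the config is a dataclass and that pl_scan_csv_kwargs translates the short name back to the separator argument that polars.scan_csv expects. CsvConfigSketch and its has_header field are hypothetical and only stand in for the real CsvConfig.

```python
from dataclasses import dataclass


@dataclass
class CsvConfigSketch:
    sep: str = ","
    has_header: bool = True

    @property
    def pl_scan_csv_kwargs(self):
        # polars.scan_csv names this argument "separator", so the
        # short config name is translated here.
        return {"separator": self.sep, "has_header": self.has_header}


config = CsvConfigSketch(sep="\t")
print(config.pl_scan_csv_kwargs)
```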
Updated the test data and metadata files to include additional samples and features. Refactored the test cases to handle the new data structure and ensure consistency in the test results. This change improves the coverage and accuracy of the tests for the BioData class.
This PR adds a new SchemaManager class to infer the schema during downloads, updates the load_dataset function to respect existing keyword arguments, and makes various improvements to error messages and logging.

New Features:
- SchemaManager class to handle schema gathering during downloads, integrated into the src/biosets/download module. (src/biosets/download/__init__.py, src/biosets/download/schema_manager.py)

Function Enhancements:
- load_dataset function now checks for existing keyword arguments before updating with new ones. (src/biosets/load.py)
- _split_generators in biodata.py uses SchemaManager for schema management when certain conditions are met. (src/biosets/packaged_modules/biodata/biodata.py)

Error Messaging and Logging:
- Improvements to error messages and logging. (src/biosets/packaged_modules/biodata/biodata.py)