Multi process schema inference #6

Merged
28 commits merged on Nov 14, 2024

Conversation

psmyth94
Owner

Adds a new SchemaManager class that infers the schema during downloads, updates the load_dataset function to respect existing keyword arguments, and makes various improvements to error messages and logging.

New Features:

  • Added SchemaManager class to handle schema gathering during downloads and integrated it into the src/biosets/download module. (src/biosets/download/__init__.py, src/biosets/download/schema_manager.py)

Function Enhancements:

  • Updated load_dataset function to check for existing keyword arguments before updating with new ones. (src/biosets/load.py)
  • Modified _split_generators in biodata.py to use SchemaManager for schema management when certain conditions are met. (src/biosets/packaged_modules/biodata/biodata.py)

Error Messaging and Logging:

  • Fixed log messages that were printed without a line break when a progress bar is present. (src/biosets/packaged_modules/biodata/biodata.py)

Updated the load_dataset function to prevent overwriting existing
keyword arguments when updating new_kwargs. This change ensures that
only non-existing keyword arguments are added from dataset_builder,
builder_config, and builder_info.
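
The non-overwriting update described in this commit can be sketched roughly as follows. The helper name `merge_missing_kwargs` and the example keys are hypothetical and are not taken from src/biosets/load.py; this only illustrates the "add only what is missing" behavior.

```python
def merge_missing_kwargs(new_kwargs, *sources):
    """Return a copy of new_kwargs filled in from sources without overwriting.

    Keys already present in new_kwargs always win; among sources,
    earlier ones take precedence over later ones.
    """
    merged = dict(new_kwargs)
    for source in sources:
        for key, value in source.items():
            # setdefault only writes when the key is absent.
            merged.setdefault(key, value)
    return merged


# Hypothetical stand-ins for values taken from the caller, the
# dataset builder, and the builder info.
user_kwargs = {"sep": "\t"}
builder_defaults = {"sep": ",", "batch_size": 1000}
info_defaults = {"batch_size": 500, "description": "demo"}

result = merge_missing_kwargs(user_kwargs, builder_defaults, info_defaults)
```

Here the caller's `sep` survives, and `batch_size` comes from the first source that defines it.
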
Separated the import of tqdm from py_utils and updated it to import
directly from datasets.utils. Also added the import for thread_map
from tqdm.contrib.concurrent to enable concurrent processing.
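
For context, `thread_map` from tqdm.contrib.concurrent behaves like an order-preserving `map` run on a thread pool, with a progress bar added. The sketch below swaps in the standard library's `ThreadPoolExecutor` so it runs without tqdm; `infer_columns` and the in-memory shards are hypothetical stand-ins for per-file schema inference.

```python
from concurrent.futures import ThreadPoolExecutor


def infer_columns(text):
    """Return the header columns of a comma-separated text blob."""
    return text.splitlines()[0].split(",")


# Two in-memory "shards" with different headers (illustrative data only).
shards = ["a,b\n1,2\n", "a,c\n3,4\n"]

with ThreadPoolExecutor(max_workers=2) as pool:
    # Executor.map yields results in input order, like the built-in map.
    per_shard_columns = list(pool.map(infer_columns, shards))
```
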
Added a newline character at the beginning of various warning messages
to improve readability in the logs. This change ensures that the
messages are clearly separated from preceding log entries.
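
A small sketch of why the leading newline helps, using only the standard `logging` module. The logger name, the fake progress-bar text, and the warning message are made up for illustration; the point is that a progress bar leaves the cursor mid-line, so a message without a leading `\n` would be appended to the bar.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("biosets_newline_demo")  # hypothetical logger name
logger.setLevel(logging.WARNING)
logger.propagate = False
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)

# Simulate a progress bar that has written to the stream without a newline.
stream.write("Downloading:  50%|#####     |")

# The leading "\n" pushes the warning onto its own line.
logger.warning("\nColumn 'x' is missing from shard 2.")

output = stream.getvalue()
```
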
Integrated SchemaManager to handle dl_manager when features are not
specified and add_missing_columns or zero_as_missing is enabled. This
change ensures proper schema management and improves the data loading
process by gathering schema information during download.
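
The condition stated in this commit, plus one plausible way to combine per-file schemas, could look like the sketch below. Both function names are hypothetical, schemas are modeled as plain `dict`s rather than Arrow schemas, and raising on conflicting types is an assumption, not necessarily what SchemaManager does.

```python
def should_gather_schema(features, add_missing_columns=False, zero_as_missing=False):
    """True when no features were given and one of the two options is enabled."""
    return features is None and bool(add_missing_columns or zero_as_missing)


def union_schemas(shard_schemas):
    """Union column -> dtype mappings, keeping first-seen column order."""
    merged = {}
    for schema in shard_schemas:
        for column, dtype in schema.items():
            if column not in merged:
                merged[column] = dtype
            elif merged[column] != dtype:
                # Assumed policy for this sketch: refuse conflicting types.
                raise ValueError(
                    f"Column {column!r} has conflicting types: "
                    f"{merged[column]} vs {dtype}"
                )
    return merged


combined = union_schemas([
    {"sample": "string", "gene_a": "int64"},
    {"sample": "string", "gene_b": "int64"},
])
```
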
@psmyth94
Owner Author

Need tests for sharded files with different schemas

Ensure that None features are not updated into the features dictionary.
Set features to None if the resulting dictionary is empty.
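
A minimal sketch of the behavior this commit describes, assuming features are held in plain dictionaries. The helper name `merge_features` is hypothetical.

```python
def merge_features(*feature_dicts):
    """Merge feature dictionaries, skipping None entries.

    Returns None instead of an empty dictionary.
    """
    merged = {}
    for features in feature_dicts:
        if not features:
            # Skip sources that are None or empty.
            continue
        for name, feature in features.items():
            if feature is not None:
                merged[name] = feature
    return merged or None


kept = merge_features({"a": "int64", "b": None}, None)
empty = merge_features({"a": None})
```
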
Extend schema manager to handle TSV and TXT files using Polars. This ensures
that schemas for these file types are correctly processed and converted to
Arrow schemas.
Refactor the handling of dataset builder arguments to exclude VAR_KEYWORD
parameters. This ensures that only relevant arguments are considered when
matching with load_dataset arguments.
Update the handling of builder config arguments to exclude VAR_KEYWORD
parameters. This ensures that only relevant arguments are considered
when matching with load_dataset arguments.
Update the handling of builder info arguments to exclude VAR_KEYWORD
parameters. This ensures that only relevant arguments are considered
when matching with load_dataset arguments.
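
The three commits above share one idea: when inspecting a callable's signature, skip the catch-all `**kwargs` parameter. A sketch using the standard `inspect` module follows; `example_builder` and the helper names are hypothetical.

```python
import inspect


def named_parameters(func):
    """Return parameter names of func, excluding any **kwargs catch-all."""
    return [
        name
        for name, param in inspect.signature(func).parameters.items()
        if param.kind is not inspect.Parameter.VAR_KEYWORD
    ]


def select_matching(func, candidate_kwargs):
    """Keep only the candidate kwargs that func names explicitly."""
    names = set(named_parameters(func))
    return {k: v for k, v in candidate_kwargs.items() if k in names}


def example_builder(path, sep=",", **kwargs):
    """Stand-in for a builder whose signature ends in **kwargs."""


names = named_parameters(example_builder)
matched = select_matching(example_builder, {"sep": "\t", "kwargs": 1, "other": 2})
```

Without the filter, a literal key named `kwargs` would be treated as a real argument of the builder.
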
Improve the handling of builder kwargs by updating them with new kwargs
or attributes from builder_info. This ensures that only relevant arguments
are considered when matching with load_dataset arguments.
Include builder_kwargs in new_kwargs to ensure that builder-specific
arguments are passed correctly when calling load_dataset.
Update the TARGET_COLUMN constant to encoded_labels for better clarity
and consistency in the codebase.
Rename drop_* attributes to encode_labels and *_kwargs attributes to
data_builder_kwargs, sample_metadata_builder_kwargs, and
feature_metadata_builder_kwargs for better clarity and consistency.
Rename drop_* attributes to encode_labels and *_kwargs attributes to
data_builder_kwargs, sample_metadata_builder_kwargs, and
feature_metadata_builder_kwargs for better clarity and consistency.
Update the handling of these kwargs to ensure proper argument passing.
Refactored the _get_builder_kwargs method to return additional values
(config_path and module_path) and to use a new_builder_kwargs dictionary
instead of modifying the existing builder_kwargs. This improves code
clarity and ensures that the original builder_kwargs is not altered.
Added a new attribute _missing_label_value to handle cases where labels
are missing. Updated the _label_map initialization to set appropriate
values for _missing_label_value based on the presence of positive and
negative labels. Modified the bin_labels assignment to use the new
_missing_label_value.
Added support for sample and feature metadata generators in the
_split_generators method. Refactored the code to handle the initialization
and feature inference for these generators. Updated the splits to include
metadata generator kwargs. This enhances the flexibility and functionality
of the data loading process.
Modified the _add_sample_metadata method to include a data_generator
parameter. Added validation to ensure the number of rows in the sample
metadata matches the number of rows in the data table when no sample
column is present. This ensures consistency and correctness in the
metadata handling process.
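
The row-count validation could be sketched as below. The function name, its parameters, and the error message are hypothetical; the real check lives in `_add_sample_metadata` and operates on tables rather than bare counts.

```python
def validate_row_counts(n_data_rows, n_metadata_rows, has_sample_column):
    """Require equal row counts when rows can only be aligned by position."""
    if has_sample_column:
        # Rows can be matched on the sample column, so counts may differ.
        return
    if n_data_rows != n_metadata_rows:
        raise ValueError(
            f"Sample metadata has {n_metadata_rows} rows but the data table "
            f"has {n_data_rows}; without a sample column they must match."
        )


validate_row_counts(3, 3, has_sample_column=False)  # passes silently
```
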
Refactored the _prepare_labels method to improve label handling and added
support for encoding labels. Introduced sample_metadata_generator and
feature_metadata_generator in the _generate_tables method to handle
metadata generation more efficiently. This update ensures better
consistency and flexibility in processing data and metadata.
Updated the _load_metadata method to include additional parameters
(table, key, sample_metadata_generator_iter) for better handling of
sample metadata. Modified the _add_sample_metadata method to accept a
generator parameter. This change improves the flexibility and accuracy
of metadata processing in the BioData class.
Refactored feature handling to use self.config.features instead of
self.info.features. Updated the logic to add missing columns and cast
the table schema based on self.config.features. This change ensures
better consistency and flexibility in managing feature metadata.
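
As a rough illustration of "add missing columns and cast", the sketch below uses plain Python containers in place of Arrow tables. `align_table` is a hypothetical helper, `features` maps column names to Python casting callables, and filling missing columns with `None` is an assumption.

```python
def align_table(table, features, n_rows):
    """Return a table with exactly the columns in features, cast to their types.

    table:    dict of column name -> list of values
    features: dict of column name -> callable used to cast each value
    """
    aligned = {}
    for name, caster in features.items():
        if name in table:
            aligned[name] = [None if v is None else caster(v) for v in table[name]]
        else:
            # Column absent from this shard: fill with nulls (assumed policy).
            aligned[name] = [None] * n_rows
    return aligned


aligned = align_table({"a": ["1", "2"]}, {"a": int, "b": float}, n_rows=2)
```
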
Refactored the _load_metadata method to include additional parameters
(table, file_key, gen_table_iter) for improved metadata handling. Updated
the logic to handle multiple metadata files and ensure consistency in
metadata processing. This change enhances the flexibility and accuracy of
metadata management in the BioData class.
Updated the schema handling logic to use self.config.features instead of
self.info.features. This change ensures that the schema is correctly
aligned with the configuration features, improving consistency and
flexibility in feature management.
Refactored the schema handling logic to use the _create_target_feature
method for creating target features. This change simplifies the code and
ensures consistency in the creation of target features based on the
provided schema and configuration.
Renamed the 'separator' attribute to 'sep' in the CsvConfig class for
consistency with other parameter names. Updated the pl_scan_csv_kwargs
property and relevant method calls to reflect this change. This improves
code readability and consistency.
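
One way such a rename can work is for the config to expose `sep` while the kwargs property translates it to the `separator` argument that Polars' `scan_csv` expects. The class below is a hypothetical stand-in for CsvConfig, and its fields and mapping are assumptions.

```python
from dataclasses import dataclass


@dataclass
class CsvConfigSketch:
    """Illustrative stand-in for CsvConfig, not the real class."""

    sep: str = ","
    has_header: bool = True

    @property
    def pl_scan_csv_kwargs(self):
        # Polars' scan_csv names the delimiter argument `separator`.
        return {"separator": self.sep, "has_header": self.has_header}


tsv_kwargs = CsvConfigSketch(sep="\t").pl_scan_csv_kwargs
```
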
Updated the test data and metadata files to include additional samples
and features. Refactored the test cases to handle the new data structure
and ensure consistency in the test results. This change improves the
coverage and accuracy of the tests for the BioData class.
@psmyth94 psmyth94 merged commit 01dafcf into main Nov 14, 2024
9 checks passed