Bug: Type Mismatch in Dataset Mapping #7135

Open
marko1616 opened this issue Sep 3, 2024 · 3 comments

Comments

@marko1616

Issue: Type Mismatch in Dataset Mapping

Description

The map function in the datasets library does not reflect a type change made by the mapping function. After mapping a function that converts an integer label to a string, the label values in the resulting dataset are still integers rather than strings.

Reproduction Code

Below is a Python script that demonstrates the problem:

from datasets import Dataset

# Original data
data = {
    'text': ['Hello', 'world', 'this', 'is', 'a', 'test'],
    'label': [0, 1, 0, 1, 1, 0]
}

# Creating a Dataset object
dataset = Dataset.from_dict(data)

# Mapping function to convert label to string
def add_one(example):
    example['label'] = str(example['label'])
    return example

# Applying the mapping function
dataset = dataset.map(add_one)

# Iterating over the dataset to show results
for item in dataset:
    print(item)
    print(type(item['label']))

Expected Output

After applying the mapping function, the expected output should have the label field as strings:

{'text': 'Hello', 'label': '0'}
<class 'str'>
{'text': 'world', 'label': '1'}
<class 'str'>
{'text': 'this', 'label': '0'}
<class 'str'>
{'text': 'is', 'label': '1'}
<class 'str'>
{'text': 'a', 'label': '1'}
<class 'str'>
{'text': 'test', 'label': '0'}
<class 'str'>

Actual Output

The actual output still shows the label field values as integers:

{'text': 'Hello', 'label': 0}
<class 'int'>
{'text': 'world', 'label': 1}
<class 'int'>
{'text': 'this', 'label': 0}
<class 'int'>
{'text': 'is', 'label': 1}
<class 'int'>
{'text': 'a', 'label': 1}
<class 'int'>
{'text': 'test', 'label': 0}
<class 'int'>

Why necessary

In image processing, we often need to convert a PIL image to a tensor while keeping the same column name.

Thanks to every dev who reviews this issue. 🤗

@marko1616
Author

By the way, the following code works as expected, which shows the inconsistency.

from datasets import Dataset

# Original data
data = {
    'text': ['Hello', 'world', 'this', 'is', 'a', 'test'],
    'label': [0, 1, 0, 1, 1, 0]
}

# Creating a Dataset object
dataset = Dataset.from_dict(data)

# Mapping function that increments the label (the type stays int)
def add_one(example):
    example['label'] += 1
    return example

# Applying the mapping function
dataset = dataset.map(add_one)

# Iterating over the dataset to show results
for item in dataset:
    print(item)
    print(type(item['label']))

@Dref360
Contributor

Dref360 commented Sep 5, 2024

Hello, thanks for submitting an issue.

FWIU, the issue is that datasets tries to limit casting (ref), so it will try to convert your strings back to int to preserve the Features.

A quick solution would be to use dataset.cast_column or to supply features when calling dataset.map.

from datasets import Features, Value

# Using Dataset.cast_column
dataset = dataset.cast_column('label', Value('string'))

# Alternative: supply features to map
dataset = dataset.map(add_one, features=Features({**dataset.features, 'label': Value('string')}))

@marko1616
Author

marko1616 commented Sep 5, 2024

LGTM! Thanks for the review.

Just to clarify, is this intended behavior, or is it something that might be addressed in a future update?
I'll leave this issue open until it's fixed if this is not the intended behavior.
