Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 Encoding is not respected by make_dataset_description #1356

Open
scott-huberty opened this issue Dec 19, 2024 · 8 comments · May be fixed by #1357
Open

UTF-8 Encoding is not respected by make_dataset_description #1356

scott-huberty opened this issue Dec 19, 2024 · 8 comments · May be fixed by #1357
Labels
Milestone

Comments

@scott-huberty
Copy link
Contributor

scott-huberty commented Dec 19, 2024

Description of the problem

Given the wide use of encoding=utf-8-sig in mne-bids, I assume that the example below is a bug, and mne-bids does want to support characters like ł (latin small letter with stroke) that appear in some people's names

Steps to reproduce

import tempfile
from pathlib import Path

import mne_bids

text = "Michał is a name that includes the character ł."
with tempfile.TemporaryDirectory() as temp_dir:
    mne_bids.make_dataset_description(
        path=temp_dir,
        name=text,
        authors=["Michał"],
    )
    with Path(f"{temp_dir}/dataset_description.json").open(encoding="utf-8") as fid:
        dataset_description = fid.read()
    # this works
    with Path("./test.txt").open("w", encoding="utf-8") as temp_file:
        temp_file.write(text)
    with Path("./test.txt").open("r", encoding="utf-8") as temp_file:
        expected_text = temp_file.read()

print("WHAT MNE BIDS WROTE TO DATASET_DESCRIPTION.JSON:")
print(dataset_description)
print(" ")
print(f"WHAT I WOULD EXPECT: {expected_text}")

Expected results

WHAT I WOULD EXPECT: Michał is a name that includes the character ł.

Actual results

{
    "Name": "Micha\u0142 is a name that includes the character \u0142.",
    "BIDSVersion": "1.7.0",
    "DatasetType": "raw",
    "Authors": [
        "Micha\u0142"
    ]
}

Additional information

I took a quick look at the code but haven't figured out yet what is going wrong ☹️

@hoechenberger
Copy link
Member

That's odd, as we do write JSON files in UTF-8 encoding (as the JSON standard requires):

mne-bids/mne_bids/utils.py

Lines 236 to 239 in 46f284b

json_output = json.dumps(dictionary, indent=4)
with open(fname, "w", encoding="utf-8") as fid:
fid.write(json_output)
fid.write("\n")

Which operating system are you using?

@scott-huberty
Copy link
Contributor Author

Which operating system are you using?

I'm on an Intel Mac! And using the development version of MNE-BIDS and MNE-Python v1.9:

Platform             macOS-14.6.1-x86_64-i386-64bit
Python               3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:43:39) [Clang 11.1.0 ]
Executable           /Users/scotterik/miniforge3/envs/mnedev_310/bin/python
CPU                  Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
Memory               32.0 GiB

Core
├☑ mne               1.9.0 (latest release)
├☑ numpy             1.26.4 (OpenBLAS 0.3.26 with 4 threads)
├☑ scipy             1.12.0
└☑ matplotlib        3.8.3 (backend=MacOSX)

Numerical (optional)
├☑ sklearn           1.4.1.post1
├☑ numba             0.59.0
├☑ nibabel           5.2.1
├☑ nilearn           0.10.3
├☑ dipy              1.9.0
├☑ openmeeg          2.5.7
├☑ pandas            2.2.1
├☑ h5io              0.2.2
├☑ h5py              3.10.0
└☐ unavailable       cupy

Visualization (optional)
├☑ pyvista           0.43.4 (OpenGL 4.1 INTEL-22.5.11 via Intel(R) Iris(TM) Plus Graphics OpenGL Engine)
├☑ pyvistaqt         0.11.0
├☑ vtk               9.2.6
├☑ qtpy              2.4.1 (PyQt5=5.15.8)
├☑ ipympl            0.9.3
├☑ pyqtgraph         0.13.4
├☑ mne-qt-browser    0.6.1
├☑ ipywidgets        8.1.2
├☑ trame_client      2.16.3
├☑ trame_server      2.17.2
├☑ trame_vtk         2.8.5
└☑ trame_vuetify     2.4.3

Ecosystem (optional)
├☑ mne-bids          0.1.0.dev1495+g8695f58.d20241219
├☑ mne-connectivity  0.6.0
├☑ neo               0.13.0
├☑ eeglabio          0.0.2-4
├☑ edfio             0.4.0
├☑ mffpy             0.8.0
├☑ pybv              0.7.5
└☐ unavailable       mne-nirs, mne-features, mne-icalabel, mne-bids-pipeline

@hoechenberger
Copy link
Member

hoechenberger commented Dec 19, 2024

Ecosystem (optional)
├☑ mne-bids 0.1.0.dev1495+g8695f58.d20241219

Possibly unrelated but something's wrong with your installation. The version string isn't right. Be sure to fetch all tags.

@hoechenberger
Copy link
Member

Reproducible with:

import mne_bids
mne_bids.write._write_json("/tmp/foo.json", {"name": "Michał"}, overwrite=True)
❯ cat /tmp/foo.json
{
    "name": "Micha\u0142"
}

@hoechenberger
Copy link
Member

The bug is in our call to json.dumps(). It has the default argument ensure_ascii=True:

https://docs.python.org/3/library/json.html#json.dumps

Calling

    json_output = json.dumps(dictionary, indent=4, ensure_ascii=False)

fixes the issue

@larsoner @sappelhoff I'm afraid we'll find this problem among different parts of the MNE codebase…

@larsoner
Copy link
Member

@larsoner @sappelhoff I'm afraid we'll find this problem among different parts of the MNE codebase…

At least in MNE-Python I think we want the default ensure_ascii=True so that writing to FIF string (which uses latin-1 encoding) works. And then on read these get reencoded so we're all good:

>>> x = """
... {
...     "name": "Micha\u0142"
... }
... """
>>> import json
>>> json.loads(x)
{'name': 'Michał'}

So really the scope is a bit more limited I think: we should check for places where we actually write to a .json file or something else a user should read. Maybe in MNE-BIDS-Pipeline for example there are some of these, not sure...

@hoechenberger
Copy link
Member

So really the scope is a bit more limited I think: we should check for places where we actually write to a .json file or something else a user should read. Maybe in MNE-BIDS-Pipeline for example there are some of these, not sure...

Yes this is exactly what I had in mind!

@hoechenberger hoechenberger added this to the 0.17 milestone Dec 19, 2024
@scott-huberty
Copy link
Contributor Author

Thx @hoechenberger and @larsoner !

Given that I need this to be fixed for a project, I can take a shot at it (maybe over xmas break). But no problem if someone beats me to it 🙂

@scott-huberty scott-huberty linked a pull request Dec 19, 2024 that will close this issue
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants