Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework UTFString API to only keep UTF-8 #184

Merged
merged 6 commits into from
Dec 27, 2023
Merged

Conversation

robUx4
Copy link
Contributor

@robUx4 robUx4 commented Dec 26, 2023

In EBML and in most systems, UTF-8 is used by default. Converting from wchar_t should be the exception and not done for every unicode string we read/write.

Some API's with wchar_t are gone. They are not used in VLC (which just deals wtih UTF-8) and in some corresponding to_wide functions in mkvtoolnix seem to be unused and can also be removed.

The length() reported is different. It now reports the length in bytes of the UTF-8 string which is the length needed to write the buffer in EBML anyway.

This PR also fixes #180.

We don't need to know about many UCS2 characters are in a string, Matroska is using UTF-8.
The internal std::string and std::wstring will be handled automatically.
@robUx4 robUx4 added api-break breaks the API (e.g. programs using it will have to adjust their source code) abi-break breaks the ABI (e.g. programs linked against the library have to be recompiled) labels Dec 26, 2023
Matroska only deals with UTF-8. The only reason to use wchar_t is when converting
to/from host strings that are not in UTF-8 (usually Windows). And we only need the
conversion in one way.

This will avoid plenty of uneeded conversions when only dealing with UTF-8.
If the string buffer comes from an EBML file we should be able to copy the same
buffer without interpreting it, even if it's bogus. It may be intentional.

Fixes Matroska-Org#180
In C++20 it would use char8_t.
Copy link
Contributor

@mbunkus mbunkus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this very much; thanks. Tested successfully with MKVToolNix (after a small amount of fixes that also work with v1.x). Please merge.

@robUx4 robUx4 merged commit 58b3385 into Matroska-Org:master Dec 27, 2023
20 checks passed
@robUx4 robUx4 deleted the unistring branch December 27, 2023 13:51
robUx4 added a commit to Matroska-Org/libmatroska that referenced this pull request Dec 28, 2023
robUx4 added a commit to Matroska-Org/libmatroska that referenced this pull request Dec 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
abi-break breaks the ABI (e.g. programs linked against the library have to be recompiled) api-break breaks the API (e.g. programs using it will have to adjust their source code)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FIx handling of broken UTF-8 data
2 participants