Fix support for non-ASCII characters in Oracle CLOBs #1184

Open
wants to merge 5 commits into
base: master

vadz
Member

@vadz vadz commented Nov 19, 2024

Replaces #1183.

No real changes, just make it possible to use these helpers in the unit
test to be added.

This commit is best viewed using Git's --color-moved option.

This will allow using it for creating functions as well as procedures
(and also procedures with a name other than "soci_test" if this is ever
needed).

We need to read the entire contents of the CLOB in Oracle backend and
not just the number of bytes corresponding to its length in characters
as returned by OCILobGetLength() because this may (and will) be strictly
less than its full size in bytes for any encoding using multiple bytes
per character, such as the de facto standard UTF-8.

Also make reading CLOBs more efficient by doing what Oracle
documentation suggests and using the LOB chunk size for reading.

Finally, add a unit test checking that using non-ASCII strings in UTF-8
(which had to be enabled for the CI) with CLOBs does work.

This commit is best viewed ignoring whitespace-only changes.
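
To illustrate why the length in characters cannot be used as a byte count, here is a minimal sketch that does not use OCI at all. `fake_lob`, `utf8_length` and `read_all` are hypothetical names invented for this example: `fake_lob` only mimics a source that hands out bytes sequentially, and `read_all` keeps reading chunk by chunk until nothing is left instead of stopping after "length in characters" bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical stand-in for a CLOB (this is NOT the OCI API): it only
// knows how to copy up to `capacity` bytes from its current position.
class fake_lob
{
public:
    explicit fake_lob(std::string bytes) : bytes_(std::move(bytes)) {}

    // Returns the number of bytes copied, 0 once the data is exhausted.
    std::size_t read_some(char* buf, std::size_t capacity)
    {
        std::size_t const n = std::min(capacity, bytes_.size() - pos_);
        std::memcpy(buf, bytes_.data() + pos_, n);
        pos_ += n;
        return n;
    }

private:
    std::string bytes_;
    std::size_t pos_ = 0;
};

// Number of characters (code points) in a valid UTF-8 string: every byte
// that is not a continuation byte (10xxxxxx) starts a new character.
std::size_t utf8_length(std::string const& s)
{
    std::size_t n = 0;
    for (unsigned char c : s)
    {
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// Read until the source reports no more data, one chunk at a time.
std::string read_all(fake_lob& lob, std::size_t chunk_size)
{
    assert(chunk_size > 0);

    std::string result;
    std::string chunk(chunk_size, '\0');
    for (;;)
    {
        std::size_t const n = lob.read_some(&chunk[0], chunk.size());
        if (n == 0)
            break;
        result.append(chunk, 0, n);
    }
    return result;
}
```

With a string such as "héllo €" encoded in UTF-8, `utf8_length()` reports fewer characters than `size()` reports bytes, so a reader that stopped after "length in characters" bytes would cut the value short. In the real backend the chunk size would presumably come from the LOB itself rather than being passed in by the caller as it is here.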
@avpalienko

There might be an issue with the Fixed-Width Client-Side Character Set. In this case, Oracle states that the output value of amtp is in characters. However, I haven't been able to create a test case for it.

In "streaming mode" the offset is ignored, except for the first call, so
don't bother updating it.

Also restructure the code in a slightly simpler way.

Read directly into the provided string instead of reading into a
temporary buffer and then copying into the string.
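
The idea of reading directly into the destination string can be sketched as follows. This is only an illustration under stated assumptions, not the code from this PR: `fake_stream` and `read_into` are hypothetical names, and `fake_stream` stands in for a LOB read in streaming mode by keeping track of its own position, so the caller never supplies an offset. Whether the real code appends to the string or replaces its contents is not stated above; this sketch appends.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical sequential source (this is NOT the OCI API): it remembers
// its own position, so no offset is passed on any call.
class fake_stream
{
public:
    explicit fake_stream(std::string bytes) : bytes_(std::move(bytes)) {}

    // Returns the number of bytes copied, 0 once the data is exhausted.
    std::size_t read_some(char* buf, std::size_t capacity)
    {
        std::size_t const n = std::min(capacity, bytes_.size() - pos_);
        std::memcpy(buf, bytes_.data() + pos_, n);
        pos_ += n;
        return n;
    }

private:
    std::string bytes_;
    std::size_t pos_ = 0;
};

// Append everything left in the stream to `out`, writing straight into
// the string's own storage instead of going through a temporary buffer.
void read_into(fake_stream& in, std::size_t chunk_size, std::string& out)
{
    assert(chunk_size > 0);

    for (;;)
    {
        std::size_t const old_size = out.size();
        out.resize(old_size + chunk_size);   // make room for one more chunk
        std::size_t const n = in.read_some(&out[old_size], chunk_size);
        out.resize(old_size + n);            // drop the part not filled in
        if (n == 0)
            break;
    }
}
```

Growing the string by one chunk and then shrinking it to the number of bytes actually read avoids the extra copy, at the cost of one possibly wasted `resize()` on the last iteration.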
@vadz
Member Author

vadz commented Nov 20, 2024

> There might be an issue with the Fixed-Width Client-Side Character Set. In this case, Oracle states that the output value of amtp is in characters.

Oracle documentation says

The output amount indicates how many bytes were read into the buffer bufp.

so I don't think it's ever in characters (except on input; using different units for an in/out parameter deserves some kind of prize for the worst API design ever).

@avpalienko

> > There might be an issue with the Fixed-Width Client-Side Character Set. In this case, Oracle states that the output value of amtp is in characters.
>
> Oracle documentation says
>
> > The output amount indicates how many bytes were read into the buffer bufp.
>
> so I don't think it's ever in characters (except on input; using different units for an in/out parameter deserves some kind of prize for the worst API design ever).

I am basing this on the following:
[image]

@vadz
Member Author

vadz commented Nov 20, 2024

Hmm, yes. I wonder if they really include UTF-16 and UTF-32 in "fixed width encodings"; I have a feeling that they might be speaking about pre-Unicode fixed-width encodings only (e.g. CP1252). In any case, I can't test this either: setting NLS_LANG to .AL32UTF16 or anything like this results in a failure to create the Oracle environment.

@avpalienko

> Hmm, yes. I wonder if they really include UTF-16 and UTF-32 in "fixed width encodings"; I have a feeling that they might be speaking about pre-Unicode fixed-width encodings only (e.g. CP1252). In any case, I can't test this either: setting NLS_LANG to .AL32UTF16 or anything like this results in a failure to create the Oracle environment.

Currently, only AL16UTF16 is relevant, but SOCI is not ready to use it.

@vadz
Member Author

vadz commented Nov 20, 2024

> Currently, only AL16UTF16 is relevant, but SOCI is not ready to use it.

I get an error (without any details) from OCISessionBegin() with this charset too. Maybe we could support UTF-16 with #1179 but I'm not really interested in this, to be honest.

@avpalienko

> > Currently, only AL16UTF16 is relevant, but SOCI is not ready to use it.
>
> I get an error (without any details) from OCISessionBegin() with this charset too. Maybe we could support UTF-16 with #1179 but I'm not really interested in this, to be honest.

The username and password must be in Unicode in this case, and SQL texts must also be in Unicode. I worked on it today but didn't achieve any results.
SOCI connection parameters include charset and ncharset, but I'm not sure if they are equivalent to the NLS_LANG setting.
