
Oracle exception "ORA-12704: character set mismatch" if hash validation includes NVARCHAR columns #1406

Open
skusam opened this issue Jan 21, 2025 · 3 comments



skusam commented Jan 21, 2025

During a row validation using the --hash "*" option we observed that the SQL sent to Oracle behind the scenes raises an exception if the table being validated contains NVARCHAR columns.

The reason seems to be that the coalesce function in the SQL generated by DVT, which replaces null values with a non-empty string such as 'DEFAULT_REPLACEMENT_STRING', requires both parameters to have the same datatype. For an NVARCHAR column, the first coalesce parameter is of type NVARCHAR using the configured NCHAR character set for that database, while the second parameter (the replacement string) is of type CHAR using the default database character set, which may differ from the NCHAR character set.

Example code to reproduce this exception:

CREATE TABLE nchartest (id NUMBER(5) PRIMARY KEY, content nvarchar2(10));

INSERT INTO nchartest VALUES (1, 'one');
INSERT INTO nchartest VALUES (2, 'two');
INSERT INTO nchartest VALUES (3, null);

SELECT id, coalesce(content, 'DEFAULT_REPLACEMENT_STRING') FROM nchartest;

The SELECT statement causes the observed character-set-mismatch exception. The exception disappears if either the value of the content column is converted to the default database character set using to_char, or the replacement string is marked as an NCHAR value with a leading N. So both of the following statements work:

SELECT id, coalesce(to_char(content), 'DEFAULT_REPLACEMENT_STRING') FROM nchartest;

or

SELECT id, coalesce(content, N'DEFAULT_REPLACEMENT_STRING') FROM nchartest;

Can one of these solutions be applied to the SQL generated by DVT whenever COALESCE is called on such NVARCHAR columns? Is there any other solution or workaround to make concat or hash validations work with NCHAR columns?


nj1973 commented Jan 22, 2025

Research

We could use a special token in the query_builder.py ifnull() method when the expression is for an Oracle NCHAR/NVARCHAR column. In place of the usual "DEFAULT_REPLACEMENT_STRING" we could use something like "NCHAR_REPLACEMENT_STRING".

We would then override the Oracle Coalesce method with something like below:

+def _coalesce(t, op):
+    # Translate both Coalesce arguments into SQLAlchemy expressions.
+    arg1 = t.translate(op.arg[0])
+    arg2 = t.translate(op.arg[1])
+    # If the replacement literal carries the special NCHAR token, rewrite it
+    # as an N-prefixed literal so it uses the national character set.
+    if arg2.value == consts.NCHAR_REPLACEMENT_STRING:
+        arg2 = sa.sql.literal_column(f"N'{arg2.value}'")
+    sa_args = [arg1, arg2]
+    return sa.func.coalesce(*sa_args)

...

         # null handling
+        ops.Coalesce: _coalesce,

We would also need the raw Oracle data type in ConfigManager. I've looked into this before for BLOB/CLOB columns; the code is on branch 1335-postgresql-jsonb-data-type-support and could also be used for this work.

sundar-mudupalli-work commented

Hi,

I could be making this more complicated, but here is my research.

DVT makes assumptions about the Oracle installation, and that is why the current code works. Specifically, for row hash validation we generate the following expression in Oracle: lower(standard_hash(convert(t3.concat__all, 'UTF8'), 'SHA256')) AS hash__all. Oracle discourages the use of the convert function here, explaining that the only supported usage is when NLS_CHARACTERSET, the character set of t3.concat__all (a VARCHAR value), is UTF8. UTF8 is the default value for NLS_CHARACTERSET, but upgrades from older installations could have a different character set. When the Oracle installation has a different NLS_CHARACTERSET this may still work, but it is not supported.

Oracle has two encodings: NLS_CHARACTERSET for the VARCHAR data type and NLS_NCHAR_CHARACTERSET for the NVARCHAR2 data type. NLS_CHARACTERSET can have many different values; the default is AL32UTF8, which is ASCII-compatible and can store all Unicode characters. NLS_CHARACTERSET could also be, for example, US7ASCII or WE8ISO8859P1, which support only ASCII and Western European characters respectively. NLS_NCHAR_CHARACTERSET must be either AL16UTF16 or UTF8, since NCHARs must be able to represent all Unicode characters, a superset of any database character set.
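
For reference, both settings can be read from the data dictionary:

-- Show the database (VARCHAR2) and national (NVARCHAR2) character sets.
SELECT parameter, value
FROM nls_database_parameters
WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');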

One approach to address this is to convert all Oracle strings to NVARCHAR2(2000), do all the manipulation we need, convert the result to UTF8 and apply standard_hash as before. You can cast with CAST(string_column AS NVARCHAR2(2000)); the maximum size is 4000 bytes, or 2000 characters. The hashing is done by LOWER(STANDARD_HASH(UTL_RAW.CONVERT(UTL_RAW.CAST_TO_RAW(concat_all), 'AL32UTF8', 'AL16UTF16'), 'SHA256')). The 'AL16UTF16' parameter above is the value of NLS_NCHAR_CHARACTERSET for the database. To retrieve this value, issue: select value from nls_database_parameters where parameter = 'NLS_NCHAR_CHARACTERSET'; The same logic can be applied to custom queries and should work regardless of the specific datatype.
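
To make this concrete, here is a rough, untested sketch of the suggestion applied to the nchartest table from the original report. The N'DEFAULT_REPLACEMENT_STRING' literal and the 'AL16UTF16' value are assumptions; the latter must match the database's actual NLS_NCHAR_CHARACTERSET, as returned by the query above.

-- Hypothetical sketch only: cast to the national character set, replace
-- nulls with an N'...' literal, then hash a canonical UTF-8 byte stream.
SELECT id,
       LOWER(STANDARD_HASH(
         UTL_RAW.CONVERT(
           UTL_RAW.CAST_TO_RAW(
             COALESCE(CAST(content AS NVARCHAR2(2000)), N'DEFAULT_REPLACEMENT_STRING')),
           'AL32UTF8',   -- to_charset: hash canonical UTF-8 bytes
           'AL16UTF16'), -- from_charset: the NLS_NCHAR_CHARACTERSET value
         'SHA256')) AS hash__all
FROM nchartest;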

Hope this helps

Sundar Mudupalli


nj1973 commented Jan 30, 2025

Interesting discovery regarding convert(). That was introduced in commit 2e24eae.

This is the issue that added it: #875

Even the part of the Oracle doc regarding RAW conversions recommends its use for characters in "neither the database nor the national character set".

In relation to this issue, are you thinking that if we always convert to the national character set then we don't need the code to get the raw data types? I believe I would still need the code to force the replacement literal to an NCHAR string.

I feel the risk level has increased with the suggested change. We've gone from ensuring we provide a valid replacement literal to refining character set processing in addition to fixing the replacement literal.

This is a lower-priority issue for the customer that raised it, so we have time to discuss it more deeply and give it more thought.
