cancel
Showing results for 
Search instead for 
Did you mean: 

HANA - UTF8 characters being garbled when using the Windows ODBC driver

tony_marston25
Explorer
0 Kudos

I have installed SAP Hana Express Edition (virtual machine method) on my Windows PC so that I can use it with my PHP application. I have created a couple of databases and loaded data into several tables, and I can access these OK with my PHP application. EXCEPT when it comes to using UTF8 characters. I can create a record without error, but when I try and read it back I get nothing. When I view the data with an ODBC Query Tool the UTF8 characters have become garbled. When I view the same data in Eclipse they show up as garbled.

If I create the same record in Eclipse and view it in Eclipse the UTF8 characters are shown correctly.

Eclipse uses the JDBC driver whereas both the ODBC Query Tool and my PHP application use the Windows ODBC driver supplied by SAP, therefore it looks like the problem is with the Windows ODBC driver.

Note the following:

- all my string columns have been defined as NVARCHAR

- I have added CHAR_AS_UTF8=TRUE to the ODBC connection properties

- my ODBC driver version is 2.06.64.28100

Is there some other secret setting that I am missing?

jeff_albion
Employee
Employee
0 Kudos

Hi Anthony,

Can you explain a bit more about how PHP is inserting the data and the script you're using? Are you using Unified ODBC or PDO ODBC?

As per the ODBC driver specification, the ODBC driver on Windows supports wide ODBC calls with a UCS-2 text encoding. So if you're using a SQL text literal to insert - i.e.

INSERT INTO T1 (c1) VALUES ('😊'));

Depending on the codepoint you're trying to represent (i.e. if it's outside of the BMP, like the example above), it wouldn't be supported.

You can directly insert non-BMP codepoints into HANA via Windows with a SQL text literal using the BINTOSTR() function and specify the code-point in CESU-8 format.

INSERT INTO T1 (c1) VALUES (BINTOSTR('eda0bdedb88a'));

If you're using ODBC bound parameters (i.e. SQLBindCol), then you can bind the data directly as a char* (SQL_C_CHAR). I think in PHP this is covered via odbc_prepare() / odbc_execute() so are you using these methods to insert, or are you using odbc_execute() directly?

Best regards,

Jeff

tony_marston25
Explorer
0 Kudos

I am using the odbc_exec() function. I do not use PDO or prepared statements. The query that is being executed looks like the following:

UPDATE unit_of_measure SET `uom_symbol`='€' WHERE uom_id='EUR';

where uom_symbol is defined as NVARCHAR(4).

The same code works in MySQL and SQL Server. HANA can handle the '£' symbol but not '€'.

jeff_albion
Employee
Employee

Hi Anthony,

I have reproduced this issue locally now, with '€' (U+20AC) in the PHP script. I'm taking a look at the ODBC traces and see for the update statement you provided, and it is logged with the erroneous data on input:

SQLExecDirect 2021-06-04 10:28:37.838000
StatementHandle  [in]    : 0x000001e9d4e11c50
StatementText    [in]    : 'UPDATE unit_of_measure SET uom_symbol = '€' WHERE uom_id = 'EUR';'
TextLength       [in]    : SQL_NTS
enc                      : UCS2 (LE) (native)


Since U+20AC is 0xE2 0x82 0xAC in UTF-8 in the PHP source, I can see it's being incorrectly encoded as three separate codepoints by PHP:

U+00E2 - LATIN SMALL LETTER A WITH CIRCUMFLEX
U+0082 - BREAK PERMITTED HERE
U+00AC - NOT SIGN

So PHP odbc module is not formatting the UCS-2 text correctly on Windows for the HANA ODBC driver from my test UTF-8 encoded PHP script to pass along Unicode values in SQL literals.

To properly send UCS-2 (BMP) Unicode data in ODBC on Windows in SQL text literals, PHP should be converting the UTF-8 string in the PHP source to UCS-2, then using the Unicode ODBC call SQLExecDirectW with a SQLWCHAR* parameter (where SQLWCHAR is typdef'ed to an unsigned short) to properly pass in non-7-bit ASCII for literals stored in the SQL literal text.

It unfortunately doesn't seem like PHP does any of this via odbc_exec though - it uses the SQLExecDirect (which upgrades to SQLExecDirectW internally in the Windows Driver Manager) with a C-style cast to SQLCHAR* (php_odbc.c:1316), effectively a char*:

rc = SQLExecDirect(result->stmt, (SQLCHAR *) query, SQL_NTS);

So this appears to be a PHP bug in the odbc module where it isn't Unicode-aware. This seems to be a well-known problem in the PHP community...

The same code works in MySQL and SQL Server.

That might be a case of luck for those drivers where the format on input is assumed to be UTF-8 (or Windows-1252) and isn't interpreted, even though ANSI SQLExecDirect only accepts non-UTF-8 by design. The HANA ODBC driver is Unicode-aware, so the SQLExecDirectW is being called with an UCS-2 assumption, but incorrectly formatted input.

Best regards,
Jeff

lbreddemann
Active Contributor

Great analysis and write up, Jeff!

tony_marston25
Explorer

Thanks for that great work. I shall now add this information to my PHP bug report at https://bugs.php.net/bug.php?id=80874

Accepted Solutions (0)

Answers (0)