When reading in ABAP, do UTF-8 files containing double-byte characters require a BOM?

Former Member · ‎11-15-2016

Hi all,

SAP NetWeaver 7.4, ERP 6.0 EHP7.

I have a UTF-8 file that contains the character 'Ñ' represented as a double-byte character - 0xC391.

The file is placed on the application server and an ABAP program then opens and reads the file. The OPEN statement specifies ENCODING UTF-8.

The file does not have a BOM at the start. When the program reads the records containing 'Ñ', it takes each of the two bytes separately and incorrectly interprets them as 'Ã‘'.

I have inserted a call to method CL_ABAP_FILE_UTILITIES=>CHECK_UTF8 and the file is recognised as being UTF-8.

If I open the file in Notepad and save it with encoding UTF-8, a BOM is inserted in the first three bytes of the file. If I then use this file as input the program correctly interprets the two bytes as 'Ñ'.
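For reference, the three bytes in question can be inspected outside SAP. The following is only a minimal Python sketch, not ABAP and not part of the program described here; the sample record "ESPAÑA" and the helper name has_utf8_bom are made up for illustration.

```python
import codecs

# Hypothetical record containing 'Ñ', encoded as UTF-8 (assumption for illustration).
payload = "ESPAÑA\n".encode("utf-8")


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM bytes EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


without_bom = payload
with_bom = codecs.BOM_UTF8 + payload  # what an editor that adds a BOM would write

# The 'utf-8-sig' codec drops a leading BOM if there is one and otherwise
# behaves like plain 'utf-8', so both variants decode to the same text.
text_a = without_bom.decode("utf-8-sig")
text_b = with_bom.decode("utf-8-sig")

print(has_utf8_bom(without_bom), has_utf8_bom(with_bom))
print(text_a == text_b)
```

In this sketch the BOM changes nothing about how the two bytes of 'Ñ' are decoded; it is only a marker at the start of the data.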

It is optional to have a BOM in a UTF-8 file and I have not come across any documentation stating that a BOM is required in a UTF-8 file for it to be correctly interpreted in ABAP.

Is it necessary to have a BOM in a UTF-8 file if it contains double-byte characters?

Are there any ways of dealing with this situation in ABAP rather than having to insert a BOM before processing the file?

Thanks in advance.

Billy Johnson

raymond_giuseppi · ‎11-15-2016

BOM is optional.

To me, '0xC391' looks like UTF-16 for '쎑'. 'Ñ' (LATIN CAPITAL LETTER N WITH TILDE) should be '0xC3' '0x91' in UTF-8. Could you check your data/file format again?
Did you use options like TEXT MODE and SKIPPING BYTE-ORDER MARK in the OPEN DATASET statement?
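
To illustrate the two readings of the same value, here is a small Python sketch (not ABAP, and the variable names are only illustrative): the pair of bytes C3 91 is 'Ñ' when decoded as UTF-8, while the single 16-bit unit 0xC391 is a different character when read as big-endian UTF-16.

```python
n_tilde = "\u00d1"  # 'Ñ', LATIN CAPITAL LETTER N WITH TILDE

# UTF-8 encodes U+00D1 as the two bytes C3 91.
utf8_bytes = n_tilde.encode("utf-8")

# The same two bytes read as one big-endian UTF-16 code unit give U+C391.
two_bytes = bytes([0xC3, 0x91])
as_utf16 = two_bytes.decode("utf-16-be")
as_utf8 = two_bytes.decode("utf-8")

print(utf8_bytes.hex(" ").upper())   # C3 91
print(f"U+{ord(as_utf16):04X}")      # U+C391
print(f"U+{ord(as_utf8):04X}")       # U+00D1
```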

Regards,
Raymond

Former Member · ‎11-15-2016

Hello Raymond,

Apologies for misleading you. It is '0xC3' '0x91' rather than '0xC391'.

TEXT MODE was specified and it was tried with and without 'SKIPPING BYTE-ORDER MARK'.

Regards

Billy

Sandra_Rossi · ‎11-15-2016

"It is '0xC3' '0x91' rather than '0xC391'" : it's the same thing, i.e. 2 bytes

raymond_giuseppi · ‎11-16-2016

Into which kind of data object do you READ the DATASET, and what do you see in AL11 when you click on the file?

Regards,
Raymond

matt · ‎11-16-2016

Could it be he means 00C30091? Or similar?

Sandra_Rossi · ‎11-15-2016

Ã is the Unicode character U+00C3 (0xC383 in UTF-8 - https://fr.wikipedia.org/wiki/%C3%83 ), and also 0xC3 in Latin 1, etc.

As you have run CHECK_UTF8 successfully, I guess your file is UTF-8 and is read correctly, but there's a subsequent misinterpretation.

Which byte values do you get in the ABAP debugger right after READ DATASET?
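
To show what a subsequent misinterpretation could look like at byte level, here is a rough Python sketch (not ABAP). It assumes the single-byte code page involved is Windows-1252, which is a guess based on the 'Ã‘' shown in the question; the helper hex_dump is hypothetical.

```python
def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


raw = bytes([0xC3, 0x91])  # the two bytes from the file

# Correct reading: one character, 'Ñ' (U+00D1).
as_utf8 = raw.decode("utf-8")

# Misreading each byte on its own in Windows-1252 (assumed code page):
# 0xC3 becomes 'Ã' (U+00C3) and 0x91 becomes a left single quote (U+2018).
as_cp1252 = raw.decode("cp1252")

# If that misread text were written out again as UTF-8, the bytes would change.
reencoded = as_cp1252.encode("utf-8")

print(hex_dump(raw))        # C3 91
print(len(as_utf8))         # 1
print(len(as_cp1252))       # 2
print(hex_dump(reencoded))  # C3 83 E2 80 98
```

Comparing the hex values at each step (file, variable after the read, output) should show where one character turns into two.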
