Former Member

When reading in ABAP, do UTF-8 files containing double-byte characters require a BOM?

Hi all,

SAP NetWeaver 7.4, ERP 6.0 EHP7.

I have a UTF-8 file that contains the character 'Ñ' represented as a double-byte character - 0xC391.

The file is placed on the application server and an ABAP program then opens and reads the file. The OPEN statement specifies ENCODING UTF-8.

The file does not have a BOM at the start, and when the program reads the records containing 'Ñ', it takes each of the two bytes separately and misinterprets them as two characters, the first being 'Ã'.

I have inserted a call to method CL_ABAP_FILE_UTILITIES=>CHECK_UTF8 and the file is recognised as being UTF-8.

If I open the file in Notepad and save it with encoding UTF-8, a BOM is inserted in the first three bytes of the file. If I then use this file as input the program correctly interprets the two bytes as 'Ñ'.

It is optional to have a BOM in a UTF-8 file and I have not come across any documentation stating that a BOM is required in a UTF-8 file for it to be correctly interpreted in ABAP.
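
To make the byte-level picture concrete, here is a minimal sketch in Python (not ABAP, and it says nothing about how OPEN DATASET behaves). It only illustrates that the UTF-8 BOM is the three bytes EF BB BF and that a UTF-8 decoder produces the same 'Ñ' with or without it. The helper name `decode_utf8` is hypothetical.

```python
import codecs

text = "Ñ"
plain = text.encode("utf-8")         # b'\xc3\x91', no BOM
with_bom = codecs.BOM_UTF8 + plain   # BOM (EF BB BF) followed by the same two bytes


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    # 'utf-8-sig' accepts input both with and without a BOM.
    return data.decode("utf-8-sig")


print(plain.hex(" "))      # bytes of the file content without a BOM
print(with_bom.hex(" "))   # bytes of the file content with a BOM
print(decode_utf8(plain) == decode_utf8(with_bom))
```

The only difference between the two inputs is the three-byte prefix; the encoded character itself is identical.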

Is it necessary to have a BOM in a UTF-8 file if it contains double-byte characters?

Are there any ways of dealing with this situation in ABAP rather than having to insert a BOM before processing the file?

Thanks in advance.

Billy Johnson

3 Answers

  • Nov 15, 2016 at 03:13 PM

    BOM is optional.

    • To me, '0xC391' looks like UTF-16 for '쎑'. 'Ñ' (LATIN CAPITAL LETTER N WITH TILDE) should be '0xC3' '0x91' in UTF-8. Could you check your data/file format again?
    • Did you use options like TEXT MODE and SKIPPING BYTE-ORDER MARK in the OPEN DATASET statement?

    Regards,
    Raymond


  • Former Member
    Nov 15, 2016 at 03:56 PM

    Hello Raymond,

    Apologies for misleading you. It is '0xC3' '0x91' rather than '0xC391'.
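
The distinction between the two notations can be checked with a short Python sketch (an illustration only, not ABAP): the same two bytes give different characters depending on whether they are read as a UTF-8 sequence or as a single big-endian UTF-16 code unit.

```python
raw = bytes([0xC3, 0x91])  # the two bytes as they appear in the file

as_utf8 = raw.decode("utf-8")          # read as a two-byte UTF-8 sequence
as_utf16_be = raw.decode("utf-16-be")  # read as one 16-bit UTF-16 code unit

print(f"UTF-8:     U+{ord(as_utf8):04X}")
print(f"UTF-16 BE: U+{ord(as_utf16_be):04X}")
```

Read as UTF-8 the bytes give U+00D1 ('Ñ'); read as UTF-16 they give U+C391, the Hangul syllable mentioned in the first answer.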

    TEXT MODE was specified and it was tried with and without 'SKIPPING BYTE-ORDER MARK'.

    Regards

    Billy


  • Nov 15, 2016 at 09:01 PM

    Ã is the Unicode character U+00C3 (0xC383 in UTF-8 - https://fr.wikipedia.org/wiki/%C3%83 ), and also 0xC3 in Latin 1, etc.
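
As a rough illustration of that kind of misinterpretation (a Python sketch, assuming the second decoding step uses Latin-1; the code page actually involved in the ABAP system is not known from this thread):

```python
utf8_bytes = "Ñ".encode("utf-8")         # the two bytes 0xC3 0x91

# Treating each byte as its own Latin-1 character yields two characters.
misread = utf8_bytes.decode("latin-1")

# 'Ã' (U+00C3) is one byte in Latin-1 but two bytes in UTF-8.
a_tilde_latin1 = "Ã".encode("latin-1")
a_tilde_utf8 = "Ã".encode("utf-8")

print([f"U+{ord(ch):04X}" for ch in misread])
print(a_tilde_latin1.hex(" "), "|", a_tilde_utf8.hex(" "))
```

The first misread character is 'Ã'; the second is U+0091, a control character in Latin-1, which many tools display as a blank or a placeholder.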

    As you have run CHECK_UTF8 successfully, I guess your file is UTF-8 and is read correctly, but there's a subsequent misinterpretation.

    Which byte values do you get in the ABAP debugger right after READ DATASET?
