When reading in ABAP, do UTF-8 files containing double-byte characters require a BOM?

Nov 15, 2016 at 02:30 PM

Hi all,

SAP NetWeaver 7.4, ERP 6.0 EHP7.

I have a UTF-8 file that contains the character 'Ñ' represented as a double-byte character - 0xC391.

The file is placed on the application server and an ABAP program then opens and reads the file. The OPEN statement specifies ENCODING UTF-8.

The file does not have a BOM at the start and when it reads the records containing 'Ñ', it takes each of the two bytes separately and incorrectly interprets them as 'Ñ'.
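The symptom described above matches what happens when the two UTF-8 bytes are decoded with a single-byte code page. The following is a minimal sketch in Python, used here only to show the byte arithmetic (it is not ABAP and says nothing about what the ABAP runtime does internally). Latin-1 and Windows-1252 are assumptions about which code page might be involved.

```python
# The two bytes that encode 'Ñ' (U+00D1) in UTF-8.
raw = bytes([0xC3, 0x91])

# Decoded as UTF-8, the pair forms a single character.
as_utf8 = raw.decode("utf-8")

# Decoded as Latin-1 (an assumption), each byte becomes its own character:
# 0xC3 -> 'Ã' (U+00C3) and 0x91 -> an invisible C1 control character (U+0091).
as_latin1 = raw.decode("latin-1")

# Windows-1252 (also an assumption) maps 0x91 to a left single quotation mark.
as_cp1252 = raw.decode("cp1252")

print(repr(as_utf8), repr(as_latin1), repr(as_cp1252))
```

If the wrongly read field contains two characters whose first one is 'Ã', a byte-wise interpretation of this kind is a likely cause.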

I have inserted a call to method CL_ABAP_FILE_UTILITIES=>CHECK_UTF8 and the file is recognised as being UTF-8.

If I open the file in Notepad and save it with encoding UTF-8, a BOM is inserted in the first three bytes of the file. If I then use this file as input the program correctly interprets the two bytes as 'Ñ'.
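For reference, the three bytes that Notepad adds are 0xEF 0xBB 0xBF, and the bytes of the text itself are unchanged. The sketch below, again in Python purely for illustration, shows that a file with and without the BOM carries the same payload. The helper name `strip_utf8_bom` is hypothetical and the sample content is made up.

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical file contents: the same text saved with and without a BOM.
without_bom = "Ñ".encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom

print(with_bom.hex(), without_bom.hex())

# After removing the signature, both inputs hold identical bytes.
same_payload = strip_utf8_bom(with_bom) == strip_utf8_bom(without_bom)

# Python's "utf-8-sig" codec accepts input with or without the BOM.
text_a = with_bom.decode("utf-8-sig")
text_b = without_bom.decode("utf-8-sig")
```

So the BOM only acts as a signature at the start of the file; it does not change how 0xC3 0x91 is encoded.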

It is optional to have a BOM in a UTF-8 file and I have not come across any documentation stating that a BOM is required in a UTF-8 file for it to be correctly interpreted in ABAP.

Is it necessary to have a BOM in a UTF-8 file if it contains double-byte characters?

Are there any ways of dealing with this situation in ABAP rather than having to insert a BOM before processing the file?

Thanks in advance.

Billy Johnson


3 Answers

Raymond Giuseppi
Nov 15, 2016 at 03:13 PM

BOM is optional.

  • To me, '0xC391' looks like UTF-16 for '쎑'. 'Ñ' (LATIN CAPITAL LETTER N WITH TILDE) should be '0xC3' '0x91' in UTF-8. Could you check your data/file format again?
  • Did you use options like TEXT MODE and SKIPPING BYTE-ORDER MARK in the OPEN DATASET statement?
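The difference between one 16-bit value and two separate bytes can be checked quickly. This is a small Python sketch for illustration only; big-endian byte order is assumed for the UTF-16 reading.

```python
# The same two bytes, read under two different encodings.
raw = bytes([0xC3, 0x91])

# As one big-endian UTF-16 code unit (assumption: big-endian), the value is U+C391.
as_utf16_be = raw.decode("utf-16-be")

# As a UTF-8 two-byte sequence, the value is U+00D1.
as_utf8 = raw.decode("utf-8")

print(hex(ord(as_utf16_be)), hex(ord(as_utf8)))
```

The bytes on disk are identical in both cases; only the encoding used to interpret them differs.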

Regards,
Raymond

Billy Johnson Nov 15, 2016 at 03:56 PM

Hello Raymond,

Apologies for misleading you. It is '0xC3' '0x91' rather than '0xC391'.

TEXT MODE was specified and it was tried with and without 'SKIPPING BYTE-ORDER MARK'.

Regards

Billy


"It is '0xC3' '0x91' rather than '0xC391'" : it's the same thing, i.e. 2 bytes


Could it be he means 00C30091? Or similar?


Into which kind of data object do you READ the DATASET, and what do you see in AL11 when you click on the file?

Regards,
Raymond

Sandra Rossi Nov 15, 2016 at 09:01 PM

à is the Unicode character U+00C3 (0xC383 in UTF-8 - https://fr.wikipedia.org/wiki/%C3%83 ), and also 0xC3 in Latin 1, etc.

As you have run CHECK_UTF8 successfully, I guess your file is UTF-8 and is read correctly, but there's a subsequent misinterpretation.

Which byte values do you get in the ABAP debugger right after READ DATASET?
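As a rough guide to what those debugger values could mean, the sketch below lists the 16-bit code units for a correct read and for a byte-wise misread. It is Python for illustration only, and it rests on two assumptions: that the character field is shown as UTF-16 code units, and that they are written big-endian here for readability (the debugger may show them in the platform's byte order). The helper name `utf16_units` is hypothetical.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as upper-case hex strings."""
    data = text.encode("utf-16-be")
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]


file_bytes = bytes([0xC3, 0x91])

# Case 1: the two bytes were decoded as UTF-8, giving one character.
correct = utf16_units(file_bytes.decode("utf-8"))

# Case 2: each byte was taken as its own character (Latin-1 assumed).
misread = utf16_units(file_bytes.decode("latin-1"))

print(correct)  # ['00D1']
print(misread)  # ['00C3', '0091']
```

A single unit 00D1 would suggest the file was read correctly and the problem lies in a later step, while 00C3 followed by 0091 would suggest the bytes were already split at read time.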
