
When reading in ABAP, do UTF-8 files containing double-byte characters require a BOM?

Former Member

Hi all,

SAP NetWeaver 7.4, ERP 6.0 EHP7.

I have a UTF-8 file that contains the character 'Ñ' represented as a double-byte character - 0xC391.

The file is placed on the application server and an ABAP program then opens and reads the file. The OPEN statement specifies ENCODING UTF-8.

The file does not have a BOM at the start, and when the program reads the records containing 'Ñ', it takes each of the two bytes separately and incorrectly interprets them as two characters (starting with 'Ã') instead of 'Ñ'.

I have inserted a call to method CL_ABAP_FILE_UTILITIES=>CHECK_UTF8 and the file is recognised as being UTF-8.

If I open the file in Notepad and save it with encoding UTF-8, a BOM is inserted in the first three bytes of the file. If I then use this file as input the program correctly interprets the two bytes as 'Ñ'.

It is optional to have a BOM in a UTF-8 file and I have not come across any documentation stating that a BOM is required in a UTF-8 file for it to be correctly interpreted in ABAP.
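
For reference, the UTF-8 BOM is the three bytes EF BB BF, and a conforming UTF-8 decoder does not need it to decode multi-byte sequences. The following is a minimal sketch in Python (not ABAP; used here only to illustrate the bytes involved). The helper name `decode_utf8` is made up for this illustration.

```python
import codecs

# The UTF-8 BOM is the three bytes EF BB BF.
BOM = codecs.BOM_UTF8


def decode_utf8(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    if raw.startswith(BOM):
        raw = raw[len(BOM):]
    return raw.decode("utf-8")


without_bom = b"\xc3\x91"      # 'Ñ' encoded as two UTF-8 bytes
with_bom = BOM + without_bom   # the same content with a BOM prepended

# Both inputs decode to the same single character.
print(decode_utf8(without_bom))
print(decode_utf8(with_bom))
```

Both calls produce the same one-character string, which suggests the BOM itself is not what makes the bytes C3 91 decodable; it only acts as a hint for software that guesses the encoding.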

Is it necessary to have a BOM in a UTF-8 file if it contains double-byte characters?

Are there any ways of dealing with this situation in ABAP rather than having to insert a BOM before processing the file?

Thanks in advance.

Billy Johnson

6 REPLIES

raymond_giuseppi
Active Contributor

BOM is optional.

  • To me, '0xC391' looks like UTF-16 for '쎑'. 'Ñ' (LATIN CAPITAL LETTER N WITH TILDE) should be '0xC3' '0x91' in UTF-8. Could you check your data/file format again?
  • Did you use options such as TEXT MODE and SKIPPING BYTE-ORDER MARK in the OPEN DATASET statement?
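
The two readings of the value in the first point can be checked outside ABAP. This is a small Python sketch (illustration only, not part of the poster's program) that decodes the same two bytes once as a single UTF-16 big-endian code unit and once as a UTF-8 sequence.

```python
raw = b"\xc3\x91"

# Read as one UTF-16 (big-endian) code unit: U+C391, a Hangul syllable.
as_utf16 = raw.decode("utf-16-be")

# Read as a two-byte UTF-8 sequence: U+00D1, 'Ñ'.
as_utf8 = raw.decode("utf-8")

print(f"UTF-16BE: U+{ord(as_utf16):04X}")
print(f"UTF-8:    U+{ord(as_utf8):04X}")

# For comparison, 'Ñ' in UTF-16 big-endian is the bytes 00 D1.
n_tilde_utf16 = "\u00d1".encode("utf-16-be")
print(n_tilde_utf16.hex())
```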

Regards,
Raymond

Former Member

Hello Raymond,

Apologies for misleading you. It is '0xC3' '0x91' rather than '0xC391'.

TEXT MODE was specified and it was tried with and without 'SKIPPING BYTE-ORDER MARK'.

Regards

Billy

0 Kudos

"It is '0xC3' '0x91' rather than '0xC391'" : it's the same thing, i.e. 2 bytes

0 Kudos

Into what kind of data object do you READ the DATASET, and what do you see in AL11 when you click on the file?

Regards,
Raymond

matt
Active Contributor

Could it be he means 00C30091? Or similar?

Sandra_Rossi
Active Contributor

à is the Unicode character U+00C3 (0xC383 in UTF-8 - https://fr.wikipedia.org/wiki/%C3%83 ), and also 0xC3 in Latin 1, etc.

As you have run CHECK_UTF8 successfully, I guess your file is UTF-8 and is read correctly, but there's a subsequent misinterpretation.

Which byte values do you get in the ABAP debugger right after READ DATASET?
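
To show what such a misinterpretation looks like at byte level, here is a short Python sketch (illustration only; it makes no claim about what the ABAP runtime actually does). It assumes the UTF-8 bytes of 'Ñ' are wrongly decoded with a single-byte code page and then re-encoded, and prints the byte values one would then expect to see.

```python
raw = "\u00d1".encode("utf-8")          # 'Ñ' -> bytes C3 91

# Wrong decoding with a single-byte code page: one character per byte.
as_latin1 = raw.decode("latin-1")       # 'Ã' followed by control char U+0091
as_cp1252 = raw.decode("cp1252")        # 'Ã' followed by U+2018

# Byte values after the misread text is stored again.
reencoded_utf8 = as_latin1.encode("utf-8")       # double-encoded UTF-8
reencoded_utf16 = as_latin1.encode("utf-16-be")  # two UTF-16 code units

print(raw.hex())              # correct UTF-8 bytes of 'Ñ'
print(reencoded_utf8.hex())   # starts with c383, the UTF-8 form of 'Ã'
print(reencoded_utf16.hex())  # 00c3 followed by 0091
```

If the debugger shows a single code unit 00D1, the file was decoded correctly and the problem lies later in the processing; two code units 00C3 0091 would point to the bytes having been read with a single-byte code page.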