The ability for Full Text Indexing using OCR technology...

i810173
Product and Topic Expert

Can TREX do OCR-based searches on image files (PDF, TIFF or others)?

If it cannot, can it index (and subsequently search) PDF or TIFF files that include embedded OCR information?

Accepted Solutions (1)

Former Member

TREX can handle text from documents in numerous formats, including Microsoft Office and Adobe (PDF) formats, in more than 30 languages.

Refer here: http://help.sap.com/saphelp_nw2004s/helpdata/en/a4/929d4206b70931e10000000a1550b0/content.htm

Answers (1)

KarstenH
Advisor

Hi Carlos,

TREX does not contain OCR functionality. Neither does SAP NetWeaver Enterprise Search or KMC. One option, on a project basis, is to use the TREX Python extensions to route these kinds of files through third-party OCR software before indexing.
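A minimal sketch of that routing idea, in plain Python. It does not use the actual TREX Python extension API, which is not shown in this thread. The function name `text_for_indexing`, the extension set, and the two callables `read_text` and `run_ocr` are all hypothetical placeholders for whatever text filter and third-party OCR tool a project would plug in.

```python
from pathlib import Path
from typing import Callable

# Assumption: extensions treated as pure image files that always need OCR.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".png"}


def text_for_indexing(
    path: str,
    read_text: Callable[[str], str],  # hypothetical: normal text extraction
    run_ocr: Callable[[str], str],    # hypothetical: third-party OCR wrapper
) -> str:
    """Return the text that should be handed to the indexer for `path`."""
    suffix = Path(path).suffix.lower()

    # Image files carry no extractable text, so they always go through OCR.
    if suffix in IMAGE_EXTENSIONS:
        return run_ocr(path)

    text = read_text(path)

    # A scanned PDF without a text layer yields no text; fall back to OCR.
    if suffix == ".pdf" and not text.strip():
        return run_ocr(path)

    return text
```

Passing the OCR step in as a callable keeps the routing decision separate from the OCR product, so the same logic could sit in front of any engine.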

Concerning "embedded OCR info":

1 - Do you mean that the actual file is still a bitmap, but the OCRed info is written to an attribute? Or something else?

2 - In the context of which SAP solution are you using TREX?

Regards, Karsten

i810173
Product and Topic Expert

Thank you Karsten!

By "embedded OCR info" I mean a "bitmap" file (an image file) that includes the OCRed info written as an attribute inside the same file.

I do not actually know how the file data is stored, but I do know that after running the OCR functionality in Adobe Acrobat Professional on a scanned document, I end up with a "searchable" PDF file (the file still shows the image, but the OCR information is somehow saved into the same file).

The same happens with TIFF files and the "Microsoft Office Imaging" application (when the OCR functionality is run, the TIFF file ends up being "searchable").

My second question is whether TREX will use the OCRed information embedded in these files, that is, whether it will include the OCR text stored within the files in its index.

KarstenH
Advisor

Hi Carlos,

I will still assume this is TREX in a KMC context.

There is no simple flag to set to search the OCRed text.

What you'd have to do in KM is:

- Check the TIF or JPEG or PDF into KM

- Upload the OCRed text to a custom property (type text). I know that steps like this have been automated by consulting in other projects, but I do not know how this works in detail.

- Configure KM search to include this attribute in full-text searches (I am not sure whether this works for custom properties; you may have to use "description")

- Ensure that the files, if they are image files, are not excluded from KM crawling

Possibly, you will also have to make an additional entry in TREXValidMimeTypes.ini to include JPG and/or TIF here.
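As an illustration only, the preparation step behind the list above could look like the following sketch. It pairs a checked-in file with its OCR text under a property name; it does not call any KM or TREX API. The function `build_km_record`, the record layout, and the MIME type table are assumptions made for this example, and the actual upload to KM would be project-specific.

```python
from pathlib import PurePosixPath

# Assumption: the file types mentioned in this thread and their MIME types.
MIME_TYPES = {
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".pdf": "application/pdf",
}


def build_km_record(file_name: str, ocr_text: str,
                    property_name: str = "description") -> dict:
    """Pair a file with its OCR text as a text property (hypothetical layout)."""
    suffix = PurePosixPath(file_name).suffix.lower()
    if suffix not in MIME_TYPES:
        raise ValueError(f"unsupported file type: {suffix or file_name}")

    # Collapse line breaks and repeated spaces left behind by the OCR engine.
    cleaned = " ".join(ocr_text.split())

    return {
        "resource": file_name,
        "mime_type": MIME_TYPES[suffix],
        "properties": {property_name: cleaned},
    }
```

The default property name follows the fallback suggested above ("description"); a custom property name can be passed instead if KM search can be configured to include it.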

Regards, Karsten