on 06-22-2007 6:49 AM
Can TREX do OCR-based searches on image files (PDF, TIFF or others)?
If it cannot, can it index (and subsequently search) PDF or TIFF files that include embedded OCR information?
TREX can handle text from documents in numerous formats, including Microsoft Office and Adobe (PDF) formats, and in more than 30 languages.
Refer here: http://help.sap.com/saphelp_nw2004s/helpdata/en/a4/929d4206b70931e10000000a1550b0/content.htm
Hi Carlos,
TREX does not contain OCR functionality, and neither does SAP NetWeaver Enterprise Search or KMC. One option, on a project basis, is to use the TREX Python extensions to route such files through third-party OCR software before indexing.
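The routing idea can be sketched roughly as follows. This is only an illustration in plain Python and does not use the actual TREX Python extension interface, which is not shown in this thread. The names `needs_ocr`, `text_for_indexing`, and the `ocr` callable are hypothetical; `ocr` stands in for whatever third-party OCR product is used.

```python
import mimetypes
from typing import Callable, Optional

# Assumption for this sketch: these MIME types are treated as image-only content.
IMAGE_MIME_TYPES = {"image/tiff", "image/jpeg", "image/png"}


def needs_ocr(path: str) -> bool:
    """Return True if the MIME type guessed from the file extension is an image type."""
    mime, _ = mimetypes.guess_type(path)
    return mime in IMAGE_MIME_TYPES


def text_for_indexing(path: str, ocr: Callable[[str], str]) -> Optional[str]:
    """Return OCR text for image files, or None to leave the file to normal indexing.

    `ocr` is a hypothetical wrapper around the external OCR software:
    it takes a file path and returns the recognized text.
    """
    if needs_ocr(path):
        return ocr(path)
    return None
```

Note that a scanned PDF has the same MIME type as a text PDF, so an extension-based check like this one cannot tell the two apart.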
Concerning "embedded OCR info":
1 - Do you mean that the actual file is still a bitmap, but the OCRed info is written to an attribute? Or...?
2 - In the context of which SAP solution are you using TREX?
Regards, Karsten
Thank you Karsten!
With regard to "embedded OCR info", I mean a "bitmap" file (an image file) that includes the OCRed info written as an attribute inside the same file.
I actually do not know how the file data is stored, but I do know that after running the OCR functionality in Adobe Acrobat Professional on a scanned document, I end up with a "searchable" PDF file (the file still shows the image, but the OCR information is somehow saved into the same file).
The same happens with TIFF files and the "Microsoft Office Imaging" application (when the OCR functionality is run, the TIFF file ends up being "searchable").
My second question is whether TREX will use the OCRed information embedded in these files, that is, whether it will include that content in its index.
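As a rough way to check whether a given PDF carries such a text layer at all, one can look for a font resource in the raw file, since text in a PDF (including invisible OCR text) is drawn with a font. The sketch below is a heuristic only: the function names are made up for illustration, newer PDFs may keep these dictionaries in compressed object streams (which would produce a false negative), and it says nothing about whether TREX actually extracts that layer.

```python
def pdf_probably_has_text_layer(data: bytes) -> bool:
    """Heuristic: a PDF containing text normally references at least one /Font resource."""
    return b"/Font" in data


def file_probably_has_text_layer(path: str) -> bool:
    """Read a PDF file from disk and apply the heuristic above."""
    with open(path, "rb") as handle:
        return pdf_probably_has_text_layer(handle.read())
```

A PDF for which this returns False is most likely image-only and would need OCR before any full-text index could find its contents.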
Hi Carlos,
I will still assume this is TREX in a KMC context.
There is no simple flag to set to search the OCRed text.
What you'd have to do in KM is:
- Check the TIF or JPEG or PDF into KM
- Upload the OCRed text to a custom property (type text). I know that steps like this have been automated by consulting in other projects, but I do not know how this works in detail.
- Configure KM search to include this attribute in full-text searches (I am not sure whether this works for custom properties; you may have to use "description")
- Ensure that image files (if they are) are not excluded from KM crawling
You may also have to add an entry to TREXValidMimeTypes.ini to include JPG and/or TIF.
Regards, Karsten