Mechanism of Document Information Extraction

Hi,

I'm trying to use SAP Document Information Extraction to extract data from documents.

I have read the answer to the question at the following URL and tested the results with several documents. I would appreciate it if you could answer the two questions below.

https://answers.sap.com/questions/13508298/how-to-train-sap-document-information-extraction.html

I created my own schema and then created a template using one annotated sample document. When I used the template on that same document, the extraction was 100% correct. But when I tried it on a document with the same format but different contents, several fields were not extracted from the right spot (the spot annotated in the sample document). According to the answer at the URL, the "template" feature is used to process incoming documents of known templates (the same layout that was used to define the template?), so all fields should be extracted from the same spot or coordinates in the document. However, that does not seem to be the case. If the fields are not extracted based on the annotated coordinates, what is the mechanism behind the extraction? Is there a learning mechanism rather than simply remembering the coordinates?

Also, I noticed that multiple sample documents (5 max?) can be uploaded and annotated within one template and, according to the URL, all the samples should share the same format. What is the benefit of uploading multiple samples with the same format? For example, does it give higher accuracy or reliability for the extracted fields, or more robustness to misalignment or skew of the document?

Thank you for your help.

Best Regards.

Accepted Solutions (0)

Answers (3)

tomasz_janasz
Product and Topic Expert

Hi Ludovic,

Our support team would need to see the template and the sample document. Please raise a ticket with the component CA-ML-BDP-TEM and attach the template (you can export it via the UI).

https://support.sap.com/

Best regards,
Tomasz

Ludovic_MOOS
Explorer

tomasz.janasz

OK, I've just submitted the ticket.

Thanks.

tomasz_janasz
Product and Topic Expert

Hi Ludovic,

Please note: you can apply the Template feature to only one layout, i.e. combining different layouts within one template will cause issues.

If you use a template for one of the standard document types (e.g. invoices or purchase orders), the pre-trained Global Model will also kick in to support the extraction of the values. If you do not want that to happen, you need to avoid using Default Extractors. You can define this in the schema that you use for creating your template:

https://help.sap.com/docs/DOCUMENT_INFORMATION_EXTRACTION/5fa7265b9ff64d73bac7cec61ee55ae6/020ab638c...

Best regards,

Tomasz

Ludovic_MOOS
Explorer

Hi tomasz.janasz,

Thanks.

Indeed, I'm aware that I can apply the Template feature to only one layout. In my case, I've created a custom template based on a custom schema. My issue is that I made an annotation on page 1 of my template document through the DOX UI. Day one: DOX extracts the corresponding text perfectly well (on page 1). Day two: DOX extracts data from page 3! Since then, I can't get it to go back to extracting data from page 1. Why are the results so inconsistent?

tomasz_janasz
Product and Topic Expert

Hi Gao,

The current templating feature of Document Information Extraction is coordinates-based. This means that you specify the location on the document where you expect the key-value pair to reside. That is also why you do not need to annotate more than 5 samples: additional samples do not add value.
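
To illustrate what "coordinates-based" means in practice, here is a minimal Python sketch. It is not the actual DOX implementation: the `Word` and `Region` types, the normalized 0..1 coordinate convention and the `extract_region` helper are all assumptions made purely for illustration. It shows why a value that moves outside the annotated box, or sits on a different page, would not be picked up.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Word:
    """One OCR word with a bounding box (hypothetical structure)."""
    text: str
    page: int
    x: float  # left edge, normalized to 0..1 of the page width
    y: float  # top edge, normalized to 0..1 of the page height
    w: float
    h: float


@dataclass(frozen=True)
class Region:
    """The box annotated in the template for one field (hypothetical)."""
    page: int
    x: float
    y: float
    w: float
    h: float


def center_inside(word: Word, region: Region) -> bool:
    """True if the word is on the region's page and its center lies in the box."""
    if word.page != region.page:
        return False
    cx = word.x + word.w / 2
    cy = word.y + word.h / 2
    return (region.x <= cx <= region.x + region.w
            and region.y <= cy <= region.y + region.h)


def extract_region(words: List[Word], region: Region) -> str:
    """Join all words whose center falls inside the annotated box."""
    hits = [w for w in words if center_inside(w, region)]
    hits.sort(key=lambda w: (w.y, w.x))  # rough reading order: top to bottom, then left to right
    return " ".join(w.text for w in hits)


# Annotated box for an invoice number in the top right corner of page 1.
invoice_box = Region(page=1, x=0.65, y=0.08, w=0.25, h=0.06)

sample_words = [
    Word("INV-001", page=1, x=0.70, y=0.10, w=0.12, h=0.02),  # inside the box
    Word("Total", page=1, x=0.10, y=0.80, w=0.08, h=0.02),    # elsewhere on page 1
    Word("INV-999", page=3, x=0.70, y=0.10, w=0.12, h=0.02),  # same spot, other page
]

print(extract_region(sample_words, invoice_box))
```

With this toy data only the word on page 1 inside the box is returned. If the same value were printed a little lower on the page in another document, a purely positional lookup like this would return an empty string.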

If you use the template for incoming documents of the same layout, you need to specify the template ID or use the template auto-detect function. Please refer to the corresponding help documentation:

https://help.sap.com/viewer/5fa7265b9ff64d73bac7cec61ee55ae6/SHIP/en-US/b722fe7170af4dd8b171f8394f43...
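
As a rough sketch of the two options, the Python snippet below builds the job options for an upload. The field names (`clientId`, `documentType`, `schemaId`, `templateId`) and the special value `"detect"` for auto-detection are assumptions based on my reading of the REST API; the helper `build_job_options` and all sample values are hypothetical. Please check the help page above for the exact payload.

```python
import json
from typing import Optional


def build_job_options(client_id: str,
                      document_type: str,
                      schema_id: str,
                      template_id: Optional[str] = None) -> str:
    """Return the options JSON for a document upload (assumed field names).

    If no template ID is given, fall back to "detect", which is assumed
    here to trigger template auto-detection.
    """
    options = {
        "clientId": client_id,
        "documentType": document_type,
        "schemaId": schema_id,
        "templateId": template_id if template_id else "detect",
    }
    return json.dumps(options)


# Explicit template: placeholder IDs, replace with your own.
print(build_job_options("my_client", "custom", "my-schema-id", "my-template-id"))

# No template ID given: rely on auto-detection (assumed "detect" value).
print(build_job_options("my_client", "custom", "my-schema-id"))
```

The resulting string would typically be sent together with the file when the extraction job is created.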

If you still experience poor extraction results with a template, please raise a support ticket with the component CA-ML-BDP under https://launchpad.support.sap.com/. Please provide a sample document and the exported template.

Best regards,

Tomasz (from the product team)

Ludovic_MOOS
Explorer

tomasz.janasz

Hi,

I've been using DOX for quite some time, and from time to time I face a very intriguing issue that can block my bot. I'm using DOX to extract data from several PDF files using the same template. The service seems to be working just fine, but when I look at the extraction results, I notice that DOX extracted data from the wrong page (the coordinates of the annotation zone are right but the page is wrong). This can result in inaccurate extracted data or can even block my bot when I apply a RegEx to a specific extracted text.

Any clue what can cause this and how to fix it?

LM.
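
This does not explain the root cause, but a defensive check on the bot side might at least keep it from failing on such results. The Python sketch below assumes that each extracted header field in the result JSON carries a `name`, a `value` and a `page` under `extraction.headerFields`; that structure, the helper names and the sample data are assumptions for illustration, so adapt them to what your results actually look like.

```python
import re
from typing import Any, Dict, List, Optional, Tuple


def fields_on_wrong_page(result: Dict[str, Any],
                         expected_pages: Dict[str, int]) -> List[Tuple[str, int, Any]]:
    """List (field name, expected page, actual page) for every mismatch."""
    mismatches = []
    for field in result.get("extraction", {}).get("headerFields", []):
        name = field.get("name")
        expected = expected_pages.get(name)
        actual = field.get("page")
        if expected is not None and actual != expected:
            mismatches.append((name, expected, actual))
    return mismatches


def safe_search(pattern: str, field: Dict[str, Any]) -> Optional[re.Match]:
    """Apply a regex only if the field has a string value; otherwise return None."""
    value = field.get("value")
    if not isinstance(value, str):
        return None
    return re.search(pattern, value)


# Hypothetical result in which "orderNumber" came back from page 3 instead of page 1.
sample_result = {
    "extraction": {
        "headerFields": [
            {"name": "orderNumber", "value": "see annex", "page": 3},
            {"name": "orderDate", "value": "2023-01-15", "page": 1},
        ]
    }
}

problems = fields_on_wrong_page(sample_result, {"orderNumber": 1, "orderDate": 1})
if problems:
    # Route the document to manual review instead of running the regex blindly.
    print("Fields extracted from an unexpected page:", problems)
```

The bot can then skip or flag documents with mismatches, and the logged page numbers are useful evidence to attach to a support ticket.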