Solved: Get Data out of PDF

former_member185171 · ‎01-08-2016

Hello,

As a result of a Social Security declaration, we receive PDF files from the Federal Administration.

We must link these files to the corresponding personnel number. This number can be found in the PDF file.

Can anyone tell us how we can automate the detection of this number in the PDF? How can we read this 'unstructured' data in an ABAP program?

The document is not the result of an Adobe interactive form from SAP.

Thanks for helping.

Kris

raymond_giuseppi · ‎01-12-2016

Don't expect an easy, pure ABAP solution. One is conceivable only for text content, and it would need a very good knowledge of the PDF technical specification plus links to some Adobe libraries. What is feasible depends on what the PDF actually contains (bitmap, compressed image, text blocks, tags, tables). Try searching the Adobe forums.

Better to look for add-ons or applications: is there already an OCR/dematerialization tool in your system, used for scanning received invoices, delivery documents, letters, mail or faxes, from a vendor such as SAP/OpenText, ReadSoft or one of the many others?

Regards,

Raymond

matt · ‎01-08-2016

The bit you are interested in may be held as a bitmap - if that's the case, you'll need some kind of OCR. If you are lucky and it is held as text, there are many free tools that convert PDF to text. You could run one of those on your application server to do the conversion. Then it's just(!) a question of parsing the text output.
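The "just parse the text output" step might look like the following minimal Python sketch. It assumes the converted text contains a label followed by an 8-digit number; both the label wording and the digit count are placeholders, since the thread never shows the real layout.

```python
import re
from typing import Optional

# Assumed layout: a label followed by an 8-digit number. The label text and
# the digit count are hypothetical; adjust them to the real documents.
PERSONNEL_NO = re.compile(r"Personnel\s+number\s*[:.]?\s*(\d{8})\b", re.IGNORECASE)


def find_personnel_number(text: str) -> Optional[str]:
    """Return the first personnel number found in the converted text, or None."""
    match = PERSONNEL_NO.search(text)
    return match.group(1) if match else None


# Made-up sample standing in for the output of a PDF-to-text converter.
sample = "Declaration 2016/Q1\nPersonnel number: 00012345\nPage 1 of 2\n"
print(find_personnel_number(sample))
```

Returning None instead of raising lets the calling job route unmatched files to manual handling.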

SilvioMiranda · ‎01-11-2016

Maybe you should take a look at Adobe Interactive Forms... It works very well for generating PDF forms. Maybe you can use it to read PDFs too...

former_member185171 · ‎01-11-2016

Hi,

we already looked at this possibility. But it doesn't work because the PDF is not the result of an interactive form. That is exactly the problem.

Jelena · ‎01-11-2016

This is not feasible by using only an ABAP program if that's what you are asking. Matthew already answered regarding the conversion, so not sure what other replies are expected...

matt · ‎01-12-2016

Perhaps my answer wasn't the one wanted.

Former Member · ‎01-12-2016

OCR, or any other conversion tool (image or PDF to text), cannot promise a 100% correct result. Characters may be confused during conversion.

For example:

it may read I as 1

it may read O as 0

and vice versa.

Is it reliable?
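For a field that is known to be purely numeric, the I/1 and O/0 confusion can be partly contained by mapping look-alike letters back to digits and rejecting anything that still fails validation. A minimal Python sketch follows; the mapping table and the 8-digit length are assumptions, not facts about these documents.

```python
from typing import Optional

# Letters that OCR output may contain in place of digits. This mapping is an
# illustrative assumption, not an exhaustive list.
CONFUSABLE = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})


def normalise_numeric_field(raw: str, length: int = 8) -> Optional[str]:
    """Map look-alike letters to digits; return None unless the result is
    all digits and has the expected length (8 is a hypothetical default)."""
    candidate = raw.strip().translate(CONFUSABLE)
    if len(candidate) == length and candidate.isdigit():
        return candidate
    return None
```

This only works in one direction (letters to digits) and only for fields known to be numeric; free text would need a different check.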

Edit the PDF file attributes to include the required number at the time the PDF file is generated, and

then try to read the PDF file attributes in your ABAP program.

It's just a guess...

matt · ‎01-12-2016

Yes, OCR would be problematic - but not insurmountable. For example, decent OCR tools are trainable - and presumably the PDFs from a single source are going to have some kind of consistency.

However, the main thrust of my advice is to convert the PDF to text. PDFs contain a mixture of image and text data (the latter with formatting information). So long as the data the OP is interested in is in the PDF as text data, there is no ambiguity. There are free tools that run on Unix to convert PDF to text. Take a few sample files and convert them, and see if the result is consistently parsable.
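The "convert a few samples and see whether they parse consistently" check could be sketched in Python as below. It assumes the pdftotext command-line tool from Poppler is installed on the server; the thread does not name a specific tool, so that choice is an assumption, as is the pattern passed in by the caller.

```python
import re
import subprocess
from typing import Dict, List


def pdf_to_text(path: str) -> str:
    """Convert one PDF to text by calling pdftotext (assumed to be installed)
    and capturing its standard output ("-" as the output file name)."""
    result = subprocess.run(
        ["pdftotext", "-layout", "-enc", "UTF-8", path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def inconsistent_samples(texts: Dict[str, str], pattern: str) -> List[str]:
    """Return the names of samples in which the pattern does not match exactly once."""
    regex = re.compile(pattern)
    return [name for name, text in texts.items() if len(regex.findall(text)) != 1]
```

A possible use is to build `texts` with `{path: pdf_to_text(path) for path in sample_paths}` and pass it to `inconsistent_samples`; an empty result suggests the layout is stable enough to parse.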

Former Member · ‎01-12-2016

It would be fine if things go in this direction, but what about the performance of converting PDF to text and searching for a specific element when the PDF runs to hundreds of pages?

matt · ‎01-12-2016

Got an alternative? If there's only one way of doing it and that isn't fast, then you've a choice - suck it up, or don't bother.

In any case, text searches can be super fast, and these are PDFs from social security: a few pages at most. Hardly the complete works of Shakespeare. The lady doth protest too much, methinks!

Former Member · ‎01-12-2016

I don't argue very often, but please consider my point of view.

I'm just thinking of playing with the PDF attributes/properties (the PDF document information). Edit the property details each time a file is created, put the ID in a specific field, and read it from ABAP. There should be no performance issue. This is just an idea; anyone can correct me if I am wrong.

Shakespeare - it's OK

Jelena · ‎01-12-2016


Avirat Patel wrote:

I'm just thinking of playing with the PDF attributes/properties (the PDF document information). Edit the property details each time a file is created, put the ID in a specific field, and read it from ABAP.

As I understand, OP gets these files from "the Federal Administration", so he/she does not control the file creation.

From my experience, government institutions are not always accommodating in that regard, so most likely this PDF is just what it is. But if OP has not asked the other party if it's possible to get the data in another format (e.g. XML) or include ID in the file name then he/she should've done that before posting on SCN. (I assumed he/she already did, but maybe it's incorrect.)

P.S. 2 pages of comments already yet OP is MIA... Makes one wonder who actually needs this more.

matt · ‎01-12-2016

I'm just thinking of playing with the PDF attributes/properties (the PDF document information). Edit the property details each time a file is created, put the ID in a specific field, and read it from ABAP.

Actually - that's a really good idea. I doubt they can influence what metadata is included in the PDF, as it comes from the government. But perhaps it's already there!

former_member185171 · ‎01-13-2016

Hi,

since I was out of office yesterday, I could not reply. But I'm very interested.

We did ask all these questions about the content of the file and about the naming of the file.

But as you put it: we get this file from federal administration and that's it. We can regret that, but that's the way the cookie crumbles.

So far, we are still trying to find out whether this can be automated at all.

Has anyone considered other approaches?

Can we read this file using PI? Or would it need to contain more structured data?
Can we store it in a HANA table and run a text search on it?

Thanks anyway for the reactions.

matt · ‎01-13-2016


Kris Claes wrote:

Has anyone considered other approaches?

Can we read this file using PI? Or would it need to contain more structured data?
Can we store it in a HANA table and run a text search on it?

As far as I am aware, PI has no built-in PDF-to-text converter. So whether you send it to PI or put it into HANA, the basic problem remains: you first need to convert it to text. The ways of converting to text have already been suggested.

Ryan-Crosby · ‎01-13-2016

Hi Kris,

Matthew is correct that there is no built-in PDF-to-text converter in PI. However, there are some open source APIs out there, such as Apache PDFBox, that will allow you to extract text data from a PDF document. In that case you could use those libraries to create a custom adapter module in PI that extracts the content of interest (SSN) and generates an appropriate message to pass to your backend system for processing/linking as you described. Convert the payload of the file into XSTRING or base64 format and include the extracted content so you can link the file with the appropriate SSN. You might be able to use attachments instead, but I have never tried that, so I could not say.

Regards,

Ryan Crosby
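The message shape described above (original file as base64 plus the extracted number) could look roughly like this. It is a Python sketch of the payload only, not of a PI adapter module, which would be written in Java; the JSON format and both field names are hypothetical.

```python
import base64
import json


def build_message(pdf_bytes: bytes, extracted_id: str) -> str:
    """Wrap the base64-encoded PDF and the extracted number in one JSON message."""
    return json.dumps({
        "extracted_id": extracted_id,  # hypothetical field name
        "payload_base64": base64.b64encode(pdf_bytes).decode("ascii"),  # hypothetical field name
    })
```

The backend can then decode `payload_base64` back to the original bytes and attach the file to the record identified by `extracted_id`.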

Former Member · ‎01-12-2016

Kris,

What is the file name of the PDF? If the personnel number is part of the file name, then it is just a matter of extracting it. If not, you could request that the file be sent to you WITH the personnel number in the file name itself.

I have not used any OCR (or similar) software to read data into my ABAP programs, so this might be an irrelevant question, but what if the format of the PDFs you receive changes in future? Would that pose a problem?

-Amit.

former_member185171 · ‎01-13-2016

Amit,

the federal administration apparently can't change the name. That's a pity.

Thanks anyway.

Kris

raymond_giuseppi · ‎01-13-2016

I'm not sure whether you are from Belgium, but if so, did you check the available web services at https://www.socialsecurity.be/site_fr/employer/infos/index.htm?

Regards,

Raymond

former_member652893 · ‎02-03-2020

PDF data extraction is possible in UiPath, using OCR techniques and general PDF data-reading techniques.