01-08-2016 2:43 PM
Hello,
as a result of a Social Security declaration, we receive PDF files from the Federal Administration.
We must link these files to the corresponding personnel number. This number can be found in the PDF file.
Can anyone tell us how we can automate the detection of this number in the PDF? How can we read this 'unstructured' data in an ABAP program?
The document is not the result of an Adobe interactive form from SAP.
Thanks for helping.
Kris
01-12-2016 7:36 AM
Don't expect an easy, pure ABAP solution that works whatever the PDF actually contains (bitmap, compressed image, text blocks, tags, tables). Even for text alone, you would need a good knowledge of the PDF technical specification and a link to some Adobe libraries; search the Adobe forums.
Better to look for add-ons or applications: is there already an OCR/dematerialization tool in your system, used for scanning received invoices, delivery documents, letters, mail or faxes, from a company such as SAP/OpenText, ReadSoft or one of many others?
Regards,
Raymond
01-08-2016 3:32 PM
The bit you are interested in may be held as a bitmap - if that's the case, you'll need some kind of OCR. If you are lucky and it is held as text, there are many free tools that convert PDF to text. You could run one of those on your application server to do the conversion. Then it's just(!) a question of parsing the text output.
01-11-2016 10:46 AM
Maybe you should take a look at Adobe Interactive Forms... It works very well for generating PDF forms. Maybe you can use it to read PDFs too...
01-11-2016 12:47 PM
Hi,
we already looked at this possibility. But it doesn't work because the PDF is not the result of an interactive form. That is exactly the problem.
01-11-2016 11:29 PM
This is not feasible by using only an ABAP program if that's what you are asking. Matthew already answered regarding the conversion, so not sure what other replies are expected...
01-12-2016 6:47 AM
01-12-2016 7:04 AM
Using OCR or any other conversion tool (from an image or PDF to text) does not promise a 100% correct result; confusion can arise during the conversion.
For example:
it may read I as 1
it may read O as 0
and vice versa.
Is it reliable?
Instead, you could write the required number into the PDF file attributes at the time the PDF file is generated, and
then try to read the PDF file attributes in your ABAP program.
It's just a guess.
01-12-2016 8:33 AM
Yes, OCR would be problematic, but not insurmountable. For example, decent OCRs are trainable, and presumably the PDFs from a single source are going to have some kind of consistency.
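As a rough illustration of using that consistency: if the field being read is known to contain only digits, the I/1 and O/0 confusions mentioned above can be corrected after OCR. This is a minimal sketch in Python, assuming the personnel number is purely numeric (the thread does not state its format); the function name and the list of confusable characters are made up for illustration.

```python
from typing import Optional

# Characters that OCR engines commonly confuse with digits.
# Illustrative list only; extend it based on errors seen in real samples.
OCR_DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})

def normalize_numeric_field(raw: str) -> Optional[str]:
    """Map look-alike letters to digits; return None if non-digits remain."""
    candidate = raw.strip().translate(OCR_DIGIT_FIXES)
    if candidate.isascii() and candidate.isdigit():
        return candidate
    return None  # still ambiguous: better to flag for manual review
```

Anything that still contains non-digit characters is rejected rather than guessed at, so it can be routed to a person.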
However, the main thrust of my advice is to convert the PDF to text. PDFs contain a mixture of image and text data (the latter with formatting information). So long as the data the OP is interested in is in the PDF as text data, there is no ambiguity. There are free tools that run on Unix to convert PDF to text. Take a few sample files and convert them, and see if the result is consistently parsable.
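The convert-then-parse approach could look roughly like the sketch below. It assumes the `pdftotext` command from poppler-utils is installed on the server, and that the number appears after a label such as "Personnel number" as eight digits. Both the label and the length are hypothetical and would have to be taken from the sample files.

```python
import re
import subprocess
from typing import Optional

# Hypothetical pattern: a label followed by an 8-digit number.
# Adjust the label and digit count to match the real documents.
PERSONNEL_NO = re.compile(
    r"Personnel\s*(?:number|no\.?)\s*[:\-]?\s*(\d{8})\b", re.IGNORECASE
)

def pdf_to_text(path: str) -> str:
    """Run pdftotext and return the extracted text ('-' means stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        check=True, capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")

def find_personnel_number(text: str) -> Optional[str]:
    """Return the first number matching the pattern, or None."""
    match = PERSONNEL_NO.search(text)
    return match.group(1) if match else None
```

Running `find_personnel_number(pdf_to_text("sample.pdf"))` over a handful of sample files should quickly show whether the output is consistently parsable. From ABAP, the result could be picked up as a file or via an external command call.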
01-12-2016 9:13 AM
It would be fine if things go in this direction, but what about the performance of converting the PDF to text and searching for a specific element when the PDF is hundreds of pages long?
01-12-2016 12:51 PM
Got an alternative? If there's only one way of doing it and that isn't fast, then you've a choice - suck it up, or don't bother.
In any case, text searches can be super fast, and these are PDFs from social security, a few pages at most. Hardly the complete works of Shakespeare. The lady doth protest too much, methinks!
01-12-2016 3:51 PM
I don't argue very often, but just consider my point of view.
I’m just thinking of playing with the PDF attributes/properties (PDF document information). Just try to edit the property details each time a file is created, attach the ID in a specific field and read it from ABAP. There should not be any performance issue. This is just an idea; anyone can correct me if I am wrong.
Shakespeare - it's OK
01-12-2016 4:26 PM
Avirat Patel wrote:
I’m just thinking of playing with the PDF attributes/properties (PDF document information). Just try to edit the property details each time a file is created, attach the ID in a specific field and read it from ABAP.
As I understand, OP gets these files from "the Federal Administration", so he/she does not control the file creation.
From my experience, government institutions are not always accommodating in that regard, so most likely this PDF is just what it is. But if OP has not asked the other party if it's possible to get the data in another format (e.g. XML) or include ID in the file name then he/she should've done that before posting on SCN. (I assumed he/she already did, but maybe it's incorrect.)
P.S. 2 pages of comments already yet OP is MIA... Makes one wonder who actually needs this more.
01-12-2016 7:27 PM
I’m just thinking of playing with the PDF attributes/properties (PDF document information). Just try to edit the property details each time a file is created, attach the ID in a specific field and read it from ABAP.
Actually - that's a really good idea. I doubt they can influence what metadata is included in the PDF, as it comes from the government. But perhaps it's already there!
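One way to check whether useful metadata is already there is to dump the document information from a few sample files. The sketch below assumes the `pdfinfo` command from poppler-utils is available; the function names are made up for illustration, and whether any field actually carries the personnel number is unknown until real files are inspected.

```python
import subprocess

def parse_pdfinfo(output: str) -> dict:
    """Turn 'Key:   value' lines from pdfinfo into a dictionary."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            info[key.strip()] = value.strip()
    return info

def read_pdf_metadata(path: str) -> dict:
    """Run pdfinfo on a file and return its document information fields."""
    result = subprocess.run(["pdfinfo", path], check=True, capture_output=True)
    return parse_pdfinfo(result.stdout.decode("utf-8", errors="replace"))
```

Fields such as Title, Subject or Keywords are the obvious places to look first.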
01-13-2016 10:19 AM
Hi,
since I was out of office yesterday, I could not reply. But I'm very interested.
We did ask all these questions about the content of the file and about the naming of the file.
But as you put it: we get this file from federal administration and that's it. We can regret that, but that's the way the cookie crumbles.
And up to now we are trying to find out whether or not it is possible to automate this issue.
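Has anyone ever considered other ways of thinking:
- can we read this file using PI? Or do we need it to contain more structured data?
- can we perform some searches saving it in a HANA table and do a text-search?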
Thanks anyway for the reactions.
01-13-2016 10:33 AM
Kris Claes wrote:
Has anyone ever considered other ways of thinking:
- can we read this file using PI? Or do we need it to contain more structured data?
- can we perform some searches saving it in a HANA table and do a text-search?
As far as I am aware, PI has no built-in PDF to text converter. So whether you send it to PI or put it into HANA, the basic problem remains: you first need to convert to text. The ways of converting to text have already been suggested.
01-13-2016 3:22 PM
Hi Kris,
Matthew is correct that there is no built-in PDF to text converter in PI. However, there are open source APIs such as Apache PDFBox that will allow you to extract text data from a PDF document. You could use those libraries to create a custom adapter module in PI that extracts the content of interest (SSN) and generates an appropriate message to pass to your backend system for processing/linking as you described. Convert the payload of the file into XSTRING or base64 format and include the extracted content so you can link it with the appropriate SSN. You might be able to use attachments instead, but I have never tried that, so I could not say.
Regards,
Ryan Crosby
01-12-2016 4:10 PM
Kris,
What is the file name of the PDF? If the personnel number is part of the file name, then it is just a matter of extracting it. If not, you can request that the file be sent to you WITH the personnel number in the file name itself.
I have not used any OCR (or similar) software to read data into my ABAP programs, so this might be an irrelevant question, but what if the format of the PDF that you receive changes in the future? Would that pose a problem?
-Amit.
01-13-2016 10:22 AM
Amit,
the federal administration apparently can't change the name. That's a pity.
Thanks anyway.
Kris
01-13-2016 1:04 PM
Not sure if you are from Belgium, but did you check for available web services at https://www.socialsecurity.be/site_fr/employer/infos/index.htm?
Regards,
Raymond
02-03-2020 11:22 AM
PDF data extraction is possible in UiPath using OCR and general PDF data-reading techniques.