Extracting specific text from pdf files (unstructured data) to a HANA table

Former Member · ‎08-22-2016

Hi,

I have some PDF files that contain data and images. Each of these files contains a reference number in the form (Ref: 00.00.00001).

I need to extract this reference number into a column of a HANA table, for all the PDF files placed in a directory.

For this purpose, I have uploaded the PDF files to HANA using a Python script. The entire content of each PDF file goes into a single BLOB column of a HANA table.

Now I need to search within this BLOB column (the PDF file content), extract the reference number, and put it in another column.

I am not sure how to do this. Can you please guide me on how this can be done?

Is it possible to do this in HANA via a text mining or text analysis technique, or in any other way? I am new to text mining and text analysis in HANA.
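To make the goal concrete, here is a rough Python sketch of the extraction step on its own, outside HANA. It assumes the PDF content has already been converted to plain text (the raw bytes in the BLOB are usually compressed, so a pattern search on them directly will generally not find anything). The pattern is also an assumption, inferred only from the one example above: two digits, two digits, five digits, separated by dots. The names REF_PATTERN and extract_ref are made up for illustration.

```python
import re
from typing import Optional

# Assumed format, inferred from the single example "Ref: 00.00.00001":
# two digits, two digits, five digits, separated by dots.
REF_PATTERN = re.compile(r"Ref:\s*(\d{2}\.\d{2}\.\d{5})")


def extract_ref(text: str) -> Optional[str]:
    """Return the first reference number found in the text, or None."""
    match = REF_PATTERN.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical text, standing in for the decoded content of one PDF.
    sample = "Some data and images (Ref: 00.00.00001) and more content"
    print(extract_ref(sample))
```

The returned value is what should end up in the separate reference number column; if the real numbers vary in length, the pattern would need to be loosened.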

Regards,

Amandeep Singh

pfefferf · ‎08-22-2016

You can use text analysis with CGUL rules.

A very similar case is described in a blog

Regards,

Florian

SergioG_TX · ‎08-22-2016

Amandeep,

I have not done this myself, but I took one of the openSAP courses on text mining. Here is the official documentation:

SAP HANA Advanced Data Processing – SAP Help Portal Page

For another reference, please check out the openSAP course on text analytics:

Text Analytics with SAP HANA Platform - Anthony Waite, Yolande Meessen, Bill Miller, and Michael Wie...
