Former Member

Extracting specific text from pdf files (unstructured data) to a HANA table

Hi,

I have some PDF files that contain text and images. Each of these files contains a reference number in the form (Ref: 00.00.00001).

I need to extract this reference number from each PDF file in the directory and store it in a column of a HANA table.

For this purpose, I have uploaded the PDF files to HANA using a Python script. The full content of each PDF file goes into a single BLOB column of a HANA table.

Now I need to search within this BLOB column (the PDF file content), extract the reference number, and put it in another column.

I am not sure how to do this. Can you please guide me on how it can be done?

Is it possible to do this in HANA with a text mining or text analysis technique, or in some other way? I am new to text mining and text analysis in HANA.

Regards,

Amandeep Singh
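
As a starting point for the extraction step described in the question, the reference number can be prototyped with an ordinary regular expression in Python. This is a minimal sketch and not a HANA feature: it assumes the number always looks like the single sample given (two digits, two digits, five digits, separated by dots), and it works on text that has already been extracted from the PDF. PDF page content is usually stored in compressed streams, so searching the raw BLOB bytes will generally not find the string; a PDF library such as pypdf would be needed for the text extraction, which is an assumption beyond what the thread describes. The names `REF_PATTERN` and `find_reference` are hypothetical.

```python
import re

# Assumed format, inferred from the one sample "Ref: 00.00.00001":
# two digits, two digits, five digits, separated by dots.
REF_PATTERN = re.compile(r"Ref:\s*(\d{2}\.\d{2}\.\d{5})")


def find_reference(text):
    """Return the first reference number found in text, or None if absent."""
    match = REF_PATTERN.search(text)
    return match.group(1) if match else None


# Hypothetical sample text standing in for text extracted from one PDF.
sample = "Inspection report (Ref: 00.00.00001) issued for review."
print(find_reference(sample))
```

If the real reference numbers vary in length, the digit counts in the pattern would need to be relaxed accordingly.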


2 Answers

  • Posted on Aug 22, 2016 at 01:49 PM

    Amandeep,

    I have not done this myself, but I took one of the openSAP courses on text mining. Here is the official documentation:

    SAP HANA Advanced Data Processing – SAP Help Portal Page

    For another reference, please check out the openSAP course on text analytics:

    Text Analytics with SAP HANA Platform - Anthony Waite, Yolande Meessen, Bill Miller, and Michael Wiesner | openSAP


  • Posted on Aug 22, 2016 at 02:00 PM

    You can use text analysis with CGUL rules.

    A very similar case is described in the blog "SAP HANA: Understanding regular expression operators supported in CGUL for Text Analytics".

    Regards,

    Florian
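
If the extraction is done in the loading script rather than with CGUL rules, the step "read the BLOB, extract the reference number, write it to another column" could look roughly like the sketch below. It is written against the generic Python DB-API and demonstrated with an in-memory `sqlite3` database as a stand-in for a HANA connection. The table `PDF_DOCS`, its columns `ID`, `CONTENT` and `REF_NO`, the function `fill_reference_column` and the `to_text` callback are all hypothetical names. With a real PDF, `to_text` would have to call a PDF text extractor, and how the BLOB value is returned depends on the database driver.

```python
import re
import sqlite3

# Assumed reference format, inferred from the sample "Ref: 00.00.00001".
REF_PATTERN = re.compile(r"Ref:\s*(\d{2}\.\d{2}\.\d{5})")


def fill_reference_column(conn, to_text):
    """Extract the reference number from each stored document into REF_NO.

    conn    -- a DB-API connection (sqlite3 here, only as a stand-in)
    to_text -- callable that turns one BLOB value into plain text
    Returns the number of rows that were updated.
    """
    cur = conn.cursor()
    cur.execute("SELECT ID, CONTENT FROM PDF_DOCS")
    rows = cur.fetchall()
    updated = 0
    for doc_id, blob in rows:
        match = REF_PATTERN.search(to_text(blob))
        if match:
            cur.execute(
                "UPDATE PDF_DOCS SET REF_NO = ? WHERE ID = ?",
                (match.group(1), doc_id),
            )
            updated += 1
    conn.commit()
    return updated


# Demo with plain bytes instead of real PDF content.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PDF_DOCS (ID INTEGER PRIMARY KEY, CONTENT BLOB, REF_NO TEXT)"
)
conn.execute(
    "INSERT INTO PDF_DOCS (ID, CONTENT) VALUES (?, ?)",
    (1, b"Report (Ref: 12.34.56789)"),
)
conn.execute(
    "INSERT INTO PDF_DOCS (ID, CONTENT) VALUES (?, ?)",
    (2, b"No reference here"),
)
updated = fill_reference_column(conn, lambda blob: blob.decode("latin-1"))
result = conn.execute("SELECT ID, REF_NO FROM PDF_DOCS ORDER BY ID").fetchall()
print(updated, result)
```

Rows without a match are left untouched, so documents that lack a reference number can be found afterwards by looking for an empty `REF_NO`.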

