Former Member

Extracting specific text from pdf files (unstructured data) to a HANA table

Hi,

I have some PDF files that contain data and images. Each file includes a reference number in the form (Ref: 00.00.00001).

I need to extract this reference number from the PDF files in a directory into a column of a HANA table.

To do this, I have uploaded the PDF files to HANA using a Python script. The entire content of each PDF file goes into a single BLOB column of a HANA table.

Now I need to search within this BLOB column (the PDF file content), extract the reference number, and put it in another column.

I am not sure how to do this. Can you please guide me on how this can be done?

Is it possible to do this in HANA with text mining, text analysis, or some other technique? I am new to text mining and text analysis in HANA.

Regards,

Amandeep Singh

2 Answers

  • Aug 22, 2016 at 01:49 PM

    Amandeep,

    I have not done this myself, but I took one of the openSAP courses on text mining. Here is the official documentation:

    SAP HANA Advanced Data Processing – SAP Help Portal Page

    For another reference, please check out the openSAP course on text analytics:

    Text Analytics with SAP HANA Platform - Anthony Waite, Yolande Meessen, Bill Miller, and Michael Wiesner | openSAP

  • Aug 22, 2016 at 02:00 PM

    You can use text analysis with CGUL rules.

    A very similar case is described in the blog SAP HANA: Understanding regular expression operators supported in CGUL for Text Analytics
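
To make the matching idea concrete, here is a minimal sketch in plain Python. It is not CGUL and does not use HANA text analysis. It assumes the text has already been extracted from the PDF (the BLOB holds the PDF binary, not plain text) and that every reference follows the 2.2.5-digit layout of the example (Ref: 00.00.00001). The names REF_PATTERN and extract_ref are hypothetical and only for illustration.

```python
import re

# Assumed pattern: "Ref:" followed by three dot-separated digit groups
# (2, 2 and 5 digits), modelled on the single example "Ref: 00.00.00001".
REF_PATTERN = re.compile(r"Ref:\s*(\d{2}\.\d{2}\.\d{5})")


def extract_ref(text):
    """Return the first reference number found in text, or None if absent."""
    match = REF_PATTERN.search(text)
    return match.group(1) if match else None


# Example with text assumed to be already extracted from a PDF.
sample = "Some heading (Ref: 00.00.00001) followed by more content"
print(extract_ref(sample))
```

A CGUL rule for text analysis would need to describe the same token sequence; the blog mentioned above covers the regular expression operators available there.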

    Regards,

    Florian
