
Problem with IMPORT FROM when the file uses the SOH character as field delimiter - created using AWS HIVE

Former Member

Hi All,

    I'm stuck on a nagging problem. I'm using AWS HIVE to output the results of a HIVE query to a file that I'm going to load into SAP HANA. The problem is that HIVE writes its output files using the SOH character (^A, \01, 0x01) as the field delimiter, and I have no idea how to specify this non-printable character in the IMPORT FROM statement. Here is an example:

IMPORT FROM CSV FILE '/wiki-data/year=2013/month=04/000000'
INTO "WIKIPEDIA"."pagecounts"
WITH RECORD DELIMITED BY '\n'
FIELD DELIMITED BY '\01';

I've tried '^A', '\a', '\01', and '0x01', and I'm stumped.
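
For reference, a quick Python check like the following confirms what the delimiter byte actually is in the raw file (a minimal sketch; the path is just the example file above):

# Dump the raw bytes of the first record of the Hive output file to
# confirm that the field delimiter really is the 0x01 (SOH) byte.
with open('/wiki-data/year=2013/month=04/000000', 'rb') as f:
    first_line = f.readline()

print(repr(first_line))                          # SOH shows up as \x01
print('SOH count:', first_line.count(b'\x01'))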

Any ideas?

Regards,

Bill

Accepted Solutions (0)

Answers (1)


Former Member

FWIW - I used the following sed command to replace the SOH characters with commas

sed "s/\x01/,/g" 000000 > 000000.csv

I have two challenges with this approach:

  1. It introduces another process into my "Big Data" flow using AWS HIVE.
  2. Not all of my records load! The file has 30,000 rows and only 2,000 load. To make matters worse, it's not even the first 2,000 rows that get loaded.

Needless to say, '\x01' didn't work directly in the IMPORT FROM statement either.

The reason I didn't get the expected result is that some of my data values occasionally contain a comma. The good news is that I could use the '|' pipe character instead, as there were no pipes in the source file.
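
Before settling on a replacement character, it's worth checking that the candidate delimiter never appears in the raw data. A rough Python sketch (the file name is the one from my example; the candidate list is arbitrary):

# Count how often each candidate delimiter already appears in the raw,
# SOH-delimited Hive output, so you can pick one that never occurs in the data.
candidates = [b',', b'|', b'\t', b';']
counts = {c: 0 for c in candidates}

with open('000000', 'rb') as f:
    for line in f:
        for c in candidates:
            counts[c] += line.count(c)

for c in candidates:
    verdict = 'safe to use' if counts[c] == 0 else 'appears %d times' % counts[c]
    print(repr(c), verdict)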

So, the updated sed command achieved the desired result:

sed "s/\x01/,/g" 000000 > 000000.csv

I then changed my IMPORT FROM command to the following:

IMPORT FROM CSV FILE '/wiki-data/year=2013/month=04/000000.csv'
INTO "WIKIPEDIA"."pagecounts"
WITH RECORD DELIMITED BY '\n'
FIELD DELIMITED BY '|';

As a result, I got all 30,000 records. I'd still like to be able to process the file directly - any help would be appreciated. At least I'm not blocked for now.
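
For anyone following along, the sed step could also be done in Python, which makes it easy to flag rows that look wrong before the import (that would have caught my comma problem immediately). A rough sketch only; the expected column count is a placeholder:

# Convert the SOH-delimited Hive output to a pipe-delimited CSV and warn
# about any row whose field count is unexpected or that already contains
# the replacement character.
EXPECTED_FIELDS = 4                      # placeholder: number of columns in the target table

with open('000000', 'rb') as src, open('000000.csv', 'wb') as dst:
    for lineno, line in enumerate(src, start=1):
        fields = line.rstrip(b'\n').split(b'\x01')
        if len(fields) != EXPECTED_FIELDS or any(b'|' in f for f in fields):
            print('suspect row %d' % lineno)
        dst.write(b'|'.join(fields) + b'\n')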

Regards,

Bill

Former Member

Hi Bill,

I've not tried this myself but perhaps you can try and define your HIVE table as EXTERNAL using something like:

CREATE EXTERNAL TABLE EXT_TABLE1 (COLA INT, COLB STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse';

[NOTE: THIS IS HIVE SQL NOT HANA SQL]

When you insert rows into this table they should be delimited by '|', so fingers crossed the file created in HDFS will import into HANA quite easily.

I've mostly been using HBASE and Impala instead of HIVE.

I use Python to connect to HANA and HBASE or IMPALA to transfer the data in chunks.

It's a bit more work to set up than just using CSV imports, but you have more control over the data flow (e.g. you can perform a MERGE DELTA along the way).

FYI: So far I've transferred up to 60 Million records into HANA using this method.
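
The transfer loop itself is nothing fancy; roughly along these lines (a sketch only, assuming an ODBC connection to HANA via pyodbc; fetch_batches() stands in for whatever reads from your source system, and the DSN, credentials, and table name are just placeholders):

# Chunked transfer into HANA: insert each batch, commit, and merge the
# delta store every few batches so it doesn't grow out of control.
# fetch_batches() is a hypothetical helper that yields lists of row tuples
# from the source (HBASE / Impala).
import pyodbc

hana = pyodbc.connect('DSN=HANA_ONE;UID=MYUSER;PWD=MYPASSWORD')
cur = hana.cursor()

for i, batch in enumerate(fetch_batches(size=10000), start=1):
    cur.executemany('INSERT INTO "MYSCHEMA"."STAGING" VALUES (?, ?)', batch)
    hana.commit()
    if i % 10 == 0:
        cur.execute('MERGE DELTA OF "MYSCHEMA"."STAGING"')
hana.commit()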

If you want to go down that route then this is a good place to start:

http://scn.sap.com/community/developer-center/hana/blog/2012/06/08/sap-hana-and-python-yes-sir

If I ever have enough free time I hope to see if it's possible to get SQOOP to manage transfers between HADOOP and HANA.

It's a shame BO Data Services isn't available to try on AWS; otherwise that'd be my first choice.


All the best

Aron

Former Member

Hi Aron,

    Thanks for your reply - I was going to try the external table approach as you suggested, with an S3 bucket for the file location. Right now, I'm OK with the sed approach - it took about 26 seconds to rip through 256 MB of Hive data and replace the SOH with '|' characters for 5.5 million rows.

    HANA One imported the resulting pipe-delimited file in 49 seconds - not bad. All told, I now have 523 MB loaded into my column table: 11.7 million rows taking up 203 MB of RAM, for approximately 2.6x compression in memory compared to on disk.

   The problem I have with SQOOP-based solutions on AWS is that it's expensive to keep the cluster up and running with the Hive database. In my current solution, my data sits on S3 storage and I spin up a 6-node cluster when I need to do some "quick" queries, as I can't load all the data into HANA One at one time. I then pump the results of my Hive query out to S3 storage and tear down the cluster as soon as I've checked that the output is as expected.

   However, what's really needed is for the IMPORT FROM statement to support non-printable characters as field delimiters; that would give the best flexibility.

Thanks,

Bill

Former Member

Hi Bill,

What you are doing makes sense. It's always good to keep the costs down.

AWS is fantastic for the flexibility it offers. It's just so easy to spin up nodes and terminate them as required (whether using AWS Elastic MapReduce or EC2 to run a HADOOP cluster).

I’m definitely hooked.    🙂

Sadly though it’s quite expensive if you need to leave these instances running.

Hopefully someone else will be able to give you an answer on the HANA import problem with non-printable characters.

In the meantime, other than the external table approach, you might also try calling the sed command at the end of your MapReduce task to avoid the subsequent manual step.

It'd also be very cool to have an optional step to check whether your AWS HANA One box is 'started'. If it is, you could automatically transfer the file and call the import; otherwise just store the results on S3 for later.
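
Something along these lines would do it, sketched here with boto3 (the region and instance ID are placeholders):

# Check whether the HANA One EC2 instance is running before deciding
# whether to transfer the file and kick off the import, or just leave
# the results on S3 for later.
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
resp = ec2.describe_instances(InstanceIds=['i-0123456789abcdef0'])
state = resp['Reservations'][0]['Instances'][0]['State']['Name']

if state == 'running':
    print('HANA One is up: transfer the file and call IMPORT FROM')
else:
    print('HANA One is %s: store the results on S3 for later' % state)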

If you wanted to go even more over the top, you could even execute commands to start and stop your AWS HANA instance. 🙂

It doesn't sound like an issue for you yet, but if your uncompressed file gets too big you may find the import struggles. I've hit problems importing 7 GB files with my 17.1 GB AWS HANA box.

In this case I needed to partition the table in HANA to allow MERGE DELTAS to complete.
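
A complementary workaround (not what I did here) is simply to split the big uncompressed file into smaller pieces and run the import once per piece. A rough Python sketch, with placeholder file names and chunk size:

# Split a large pipe-delimited CSV into smaller files so that each
# IMPORT FROM call only has to deal with a manageable chunk.
LINES_PER_CHUNK = 1000000                # placeholder chunk size

with open('000000.csv', 'rb') as src:
    part, count = 0, 0
    dst = open('000000.part%03d.csv' % part, 'wb')
    for line in src:
        dst.write(line)
        count += 1
        if count == LINES_PER_CHUNK:
            dst.close()
            part, count = part + 1, 0
            dst = open('000000.part%03d.csv' % part, 'wb')
    dst.close()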

BTW: If you've ever been tempted to set up a permanent HADOOP cluster at home (or under your desk at work), you can get some pretty good, cheap hardware these days. I know a guy running a 2-node cluster for $720 ($30 p/month over 24 months), which operates at similar speeds to a 3-node cluster on AWS, at a tiny fraction of the cost. The more experimenting I do with HADOOP, the more I'm considering this option.

All the best on your AWS HIVE / HANA integration experiments.

Aron