
Join of two very large files

Former Member
0 Kudos

Hi,

We are trying to join two large files (one 23 GB, ~180M lines; the other 7 GB, ~55M lines).

The join is done on a single field, and both files contain large object (CLOB) fields.

After 13 hours, ~40M records were inserted into the target table (about 20%).

Does this make sense?

Is there a better approach to deal with large files?

Thanks,

Amir

Accepted Solutions (0)

Answers (6)

Former Member
0 Kudos

Hi,

Thank you all for your input.

I will try your suggestions and update.

Amir

Former Member
0 Kudos

You could probably try converting the CLOB to varchar and loading that. I would use dataflows to create a staging table in the database before the actual load.
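
Not your actual setup, just a minimal sketch of that staging idea (the file layout, table name, and the 4,000-character cap are all assumptions): read the file, cut the large text field down to a bounded varchar-style column, and load it into a staging table in batches.

```python
# Hedged sketch of the staging idea: load the file into a database table first,
# truncating the CLOB-style field to a bounded text column. SQLite stands in
# for the real database; the pipe-delimited layout (id|attr|big_text) is assumed.
import sqlite3

VARCHAR_LIMIT = 4000  # assumed cap when converting the CLOB data to varchar

conn = sqlite3.connect("staging.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS stage_big_file (
        id      INTEGER PRIMARY KEY,
        attr    TEXT,
        big_txt TEXT          -- truncated stand-in for the original CLOB
    )
""")

with open("big_file.txt", encoding="utf-8") as fh:
    batch = []
    for line in fh:
        rec_id, attr, big_txt = line.rstrip("\n").split("|", 2)
        batch.append((int(rec_id), attr, big_txt[:VARCHAR_LIMIT]))
        if len(batch) >= 10_000:          # stage in batches, not row by row
            conn.executemany("INSERT INTO stage_big_file VALUES (?, ?, ?)", batch)
            conn.commit()
            batch.clear()
    if batch:
        conn.executemany("INSERT INTO stage_big_file VALUES (?, ?, ?)", batch)
        conn.commit()
conn.close()
```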

Arun

kamal264
Active Participant
0 Kudos

Amir,

The first prerequisite is to load the file data into database tables and then apply the join.

If it were a memory issue, you could think about a loop or splitting the data, but you are facing a time issue and using CLOBs (which lock out the performance-related options on the target table).

One thing you can do is keep the CLOBs in one table with a reference key (ROWID), join the remaining attributes from the other table, perform whatever functions you want, and then join the output table back to the CLOB table using the reference key (ROWID).
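
A minimal sketch of that layout, with made-up table and column names and SQLite standing in for the real database: the wide text lives in its own side table keyed by a row id, the join and any transformations touch only the slim attribute tables, and the CLOB is re-attached at the very end via the reference key.

```python
# Sketch of the split-CLOB design: keep the wide text in its own table keyed by
# a row id, join/transform only the slim attribute tables, then rejoin at the end.
import sqlite3

conn = sqlite3.connect("work.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS clob_side  (row_id INTEGER PRIMARY KEY, big_txt TEXT);
    CREATE TABLE IF NOT EXISTS attrs_a    (row_id INTEGER PRIMARY KEY, join_key TEXT, a1 TEXT);
    CREATE TABLE IF NOT EXISTS attrs_b    (join_key TEXT, b1 TEXT);
    CREATE TABLE IF NOT EXISTS joined_out (row_id INTEGER PRIMARY KEY, a1 TEXT, b1 TEXT);
    CREATE TABLE IF NOT EXISTS final_out  (row_id INTEGER PRIMARY KEY, a1 TEXT, b1 TEXT, big_txt TEXT);
""")

# 1) Join only the slim attribute tables on the business key.
conn.execute("""
    INSERT INTO joined_out (row_id, a1, b1)
    SELECT a.row_id, a.a1, b.b1
    FROM attrs_a a
    JOIN attrs_b b ON b.join_key = a.join_key
""")

# 2) Re-attach the CLOB column at the very end, via the reference key (row id).
conn.execute("""
    INSERT INTO final_out (row_id, a1, b1, big_txt)
    SELECT j.row_id, j.a1, j.b1, c.big_txt
    FROM joined_out j
    JOIN clob_side c ON c.row_id = j.row_id
""")
conn.commit()
conn.close()
```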

~Kamal

Former Member
0 Kudos

Hello

Joining large datasets in memory (which is what would happen with files) can take plenty of server resources. If your job server is limited on memory or CPU, or the files are read across a network, then those are areas to look at.

Inserting CLOB data might be what's slowing down your process, because it forces the commit size to 1 (did you get a warning?), which can really slow down inserts.

Can you explain the exact logic in your dataflow?  Sometimes joins can be replaced with lookups, which might speed things up.
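
A rough illustration of that commit-size point, outside Data Services and with a hypothetical target table: committing every row (effectively what a commit size of 1 means) versus committing once per batch.

```python
# Rough illustration of why a commit size of 1 hurts: one transaction per row
# versus one transaction per 10,000-row batch. Table/column names are made up.
import sqlite3

def load_row_by_row(conn, rows):
    # Shown for contrast: this is what a commit size of 1 amounts to.
    for r in rows:
        conn.execute("INSERT INTO tgt VALUES (?, ?)", r)
        conn.commit()

def load_in_batches(conn, rows, batch_size=10_000):
    # Commit once per batch instead of once per row.
    batch = []
    for r in rows:
        batch.append(r)
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO tgt VALUES (?, ?)", batch)
            conn.commit()
            batch.clear()
    if batch:
        conn.executemany("INSERT INTO tgt VALUES (?, ?)", batch)
        conn.commit()

conn = sqlite3.connect("target.db")
conn.execute("CREATE TABLE IF NOT EXISTS tgt (id INTEGER, payload TEXT)")
rows = ((i, f"row {i}") for i in range(100_000))
load_in_batches(conn, rows)
conn.close()
```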

Michael

Former Member
0 Kudos

Hello Amir,

I'm not entirely sure, but please check whether bulk loading will help.

Thanks,

Yusuf.

former_member200473
Contributor
0 Kudos

Hi Amir,

To the best of my knowledge, here are some alternatives.

Scenario 1-

Step 1: Load the files into tables in the same database, then import those tables into the same datastore.

Step 2: Join these tables (I hope you are using an inner join); with the "push_down()" function you can push the join criteria down to the database, and at the same time you can select "Run Join as a separate process" (though I don't think that is required if there is nothing other than the join in that dataflow). A rough sketch of what the pushed-down statement amounts to follows after these steps.

Step 3: Make sure you are also using the same datastore for the target (if that is not possible, you can use a database link or the Data_Transfer transform).

Step 4: Make sure you are using the pageable cache (set through the dataflow properties) and increase the cache size as required via the job server options in the Server Manager.

Step 5: Scenario 1 ends here.
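
For what it's worth, a fully pushed-down join boils down to a single INSERT ... SELECT that the database runs on its own; a minimal sketch of that statement (table and column names are assumptions, not the poster's actual schema), issued here from Python against SQLite:

```python
# Sketch of what a full pushdown amounts to: the database does the join and the
# load in one statement, with no rows flowing through the job server.
import sqlite3

conn = sqlite3.connect("same_database.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS src_big   (join_key TEXT, big_txt TEXT);
    CREATE TABLE IF NOT EXISTS src_small (join_key TEXT, attr TEXT);
    CREATE TABLE IF NOT EXISTS target    (join_key TEXT, attr TEXT, big_txt TEXT);
""")
conn.execute("""
    INSERT INTO target (join_key, attr, big_txt)
    SELECT b.join_key, s.attr, b.big_txt
    FROM src_big b
    JOIN src_small s ON s.join_key = b.join_key
""")
conn.commit()
conn.close()
```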

Scenario 2-

Step 1: Using the Data_Transfer transform, you can bring these tables into the same datastore and then continue from Step 2 of Scenario 1.

My conclusion: I feel you should go with Scenario 1, because you have mentioned that the files are very large and that could cause a Data Services out-of-memory error (not sure).

Please share your experience with us after applying these tricks.

Regards,

Shiva Sahu

Former Member
0 Kudos

Shiva

There is no function called push_down(); I assume you mean pushdown_sql(), which probably wouldn't be required for a simple join over two columns. If the data were loaded into the same database and the optimiser generated a query joining the two tables (which is normal unless you have an odd join condition), then running the join as a separate process wouldn't add anything, because the database would be doing the join; the same goes for setting the cache to pageable.

Michael

former_member200473
Contributor
0 Kudos

Hi Michael,

Yes, I totally agree with your point. I had meant to suggest pushdown_sql() (by mistake I wrote push_down()), in case the source and the target are in different databases.

As for "Run as a separate process", I already mentioned that it will not be required if we are not using any logic apart from the join.

But I really appreciate your suggestion and look forward to the same in future.

Shiva