
Join of two very large files

Former Member
0 Kudos

Hi,

We are trying to join two large files (one 23 GB, ~180M lines; the other 7 GB, ~55M lines).

The join is done on a single field, and both files contain large object (CLOB) fields.

After 13 hours, ~40M records were inserted into the target table (about 20%).

Does this make sense?

Is there a better approach to deal with large files?

Thanks,

Amir

Accepted Solutions (0)

Answers (6)

Former Member
0 Kudos

Hi,

Thank you all for your input.

I will try your suggestions and update.

Amir

Former Member
0 Kudos

You could probably try converting the CLOB to varchar and loading that. I would use dataflows to create a staging table in the database before the actual load.
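
Not your actual setup, just a minimal sketch of that staging idea (the file layout, table name, and the 4,000-character cap are all assumptions): read the file, cut the large text field down to a bounded varchar-style column, and load it into a staging table in batches.

```python
# Hedged sketch of the staging idea: load the file into a database table first,
# truncating the CLOB-style field to a bounded text column. SQLite stands in
# for the real database; the pipe-delimited layout (id|attr|big_text) is assumed.
import sqlite3

VARCHAR_LIMIT = 4000  # assumed cap when converting the CLOB data to varchar

conn = sqlite3.connect("staging.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS stage_big_file (
        id      INTEGER PRIMARY KEY,
        attr    TEXT,
        big_txt TEXT          -- truncated stand-in for the original CLOB
    )
""")

with open("big_file.txt", encoding="utf-8") as fh:
    batch = []
    for line in fh:
        rec_id, attr, big_txt = line.rstrip("\n").split("|", 2)
        batch.append((int(rec_id), attr, big_txt[:VARCHAR_LIMIT]))
        if len(batch) >= 10_000:          # stage in batches, not row by row
            conn.executemany("INSERT INTO stage_big_file VALUES (?, ?, ?)", batch)
            conn.commit()
            batch.clear()
    if batch:
        conn.executemany("INSERT INTO stage_big_file VALUES (?, ?, ?)", batch)
        conn.commit()
conn.close()
```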

Arun

kamal264
Active Participant
0 Kudos

Amir,

The first prerequisite is to load the file data into database tables and then apply the join.

If it were a memory issue, you could think about a loop or splitting the data, but you are facing a time issue and using CLOBs (which lock out the performance-related options on the target table).

One thing you can do is keep the CLOBs in one table with a reference key (ROWID), join the remaining attributes from the other table, perform whatever functions you want, and then join the output table back to the CLOB table using the reference key (ROWID).
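
A minimal sketch of that layout, with made-up table and column names and SQLite standing in for the real database: the wide text lives in its own side table keyed by a row id, the join and any transformations touch only the slim attribute tables, and the CLOB is re-attached at the very end via the reference key.

```python
# Sketch of the split-CLOB design: keep the wide text in its own table keyed by
# a row id, join/transform only the slim attribute tables, then rejoin at the end.
import sqlite3

conn = sqlite3.connect("work.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS clob_side  (row_id INTEGER PRIMARY KEY, big_txt TEXT);
    CREATE TABLE IF NOT EXISTS attrs_a    (row_id INTEGER PRIMARY KEY, join_key TEXT, a1 TEXT);
    CREATE TABLE IF NOT EXISTS attrs_b    (join_key TEXT, b1 TEXT);
    CREATE TABLE IF NOT EXISTS joined_out (row_id INTEGER PRIMARY KEY, a1 TEXT, b1 TEXT);
    CREATE TABLE IF NOT EXISTS final_out  (row_id INTEGER PRIMARY KEY, a1 TEXT, b1 TEXT, big_txt TEXT);
""")

# 1) Join only the slim attribute tables on the business key.
conn.execute("""
    INSERT INTO joined_out (row_id, a1, b1)
    SELECT a.row_id, a.a1, b.b1
    FROM attrs_a a
    JOIN attrs_b b ON b.join_key = a.join_key
""")

# 2) Re-attach the CLOB column at the very end, via the reference key (row id).
conn.execute("""
    INSERT INTO final_out (row_id, a1, b1, big_txt)
    SELECT j.row_id, j.a1, j.b1, c.big_txt
    FROM joined_out j
    JOIN clob_side c ON c.row_id = j.row_id
""")
conn.commit()
conn.close()
```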

~Kamal

Former Member
0 Kudos

Hello

Joining large datasets in memory (which is what would happen with files) can take plenty of server resources. If your job server is limited on memory or CPU, or the files are read across a network, then those are areas to look at.

Inserting CLOB data might be what's slowing down your process, because it forces the commit size to 1 (did you get a warning?), which can really slow down inserts.

Can you explain the exact logic in your dataflow?  Sometimes joins can be replaced with lookups, which might speed things up.
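
A rough illustration of that commit-size point, outside Data Services and with a hypothetical target table: committing every row (effectively what a commit size of 1 means) versus committing once per batch.

```python
# Rough illustration of why a commit size of 1 hurts: one transaction per row
# versus one transaction per 10,000-row batch. Table/column names are made up.
import sqlite3

def load_row_by_row(conn, rows):
    # Shown for contrast: this is what a commit size of 1 amounts to.
    for r in rows:
        conn.execute("INSERT INTO tgt VALUES (?, ?)", r)
        conn.commit()

def load_in_batches(conn, rows, batch_size=10_000):
    # Commit once per batch instead of once per row.
    batch = []
    for r in rows:
        batch.append(r)
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO tgt VALUES (?, ?)", batch)
            conn.commit()
            batch.clear()
    if batch:
        conn.executemany("INSERT INTO tgt VALUES (?, ?)", batch)
        conn.commit()

conn = sqlite3.connect("target.db")
conn.execute("CREATE TABLE IF NOT EXISTS tgt (id INTEGER, payload TEXT)")
rows = ((i, f"row {i}") for i in range(100_000))
load_in_batches(conn, rows)
conn.close()
```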

Michael

Former Member
0 Kudos

Hello Amir,

I'm not entirely sure, but please check whether bulk loading will help.

Thanks,

Yusuf.

former_member200473
Contributor
0 Kudos

Hi Amir,

To the best of my knowledge, here are some alternatives.

Scenario 1-

Step 1: Load the files into tables in the same database, then import those tables into the same datastore.

Step 2: Join these tables (I hope you are using an inner join); with the "push_down()" function you can push the join criteria down to the database, and at the same time you can select "Run Join as a separate process" (though I don't think that is required if there is nothing other than the join in that dataflow). A rough sketch of what the pushed-down statement amounts to follows after these steps.

Step 3: Make sure you are also using the same datastore for the target (if that is not possible, you can use a database link or the Data_Transfer transform).

Step 4: Make sure you are using the pageable cache (set through the dataflow properties) and increase the cache size as required via the job server options in the Server Manager.

Step 5: Scenario 1 ends here.
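
For what it's worth, a fully pushed-down join boils down to a single INSERT ... SELECT that the database runs on its own; a minimal sketch of that statement (table and column names are assumptions, not the poster's actual schema), issued here from Python against SQLite:

```python
# Sketch of what a full pushdown amounts to: the database does the join and the
# load in one statement, with no rows flowing through the job server.
import sqlite3

conn = sqlite3.connect("same_database.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS src_big   (join_key TEXT, big_txt TEXT);
    CREATE TABLE IF NOT EXISTS src_small (join_key TEXT, attr TEXT);
    CREATE TABLE IF NOT EXISTS target    (join_key TEXT, attr TEXT, big_txt TEXT);
""")
conn.execute("""
    INSERT INTO target (join_key, attr, big_txt)
    SELECT b.join_key, s.attr, b.big_txt
    FROM src_big b
    JOIN src_small s ON s.join_key = b.join_key
""")
conn.commit()
conn.close()
```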

Scenario 2-

Step 1: Using the Data_Transfer transform, you can bring these tables into the same datastore and then continue from Step 2 of Scenario 1.

My conclusion: I feel you should go with Scenario 1, because you have mentioned that the files are very large and that could cause a Data Services out-of-memory error (not sure).

Please share your experience with us after applying these tricks.

Regards,

Shiva Sahu

Former Member
0 Kudos

Shiva

There is no function called push_down(); I assume you mean pushdown_sql(), which probably wouldn't be required for a simple join over two columns. If the data were loaded into the same database and the optimiser generated a query joining the two tables (which is normal unless you have an odd join condition), then running the join as a separate process wouldn't add anything, because the database would be doing the join; the same goes for setting the cache to pageable.

Michael

former_member200473
Contributor
0 Kudos

Hi Michael,

Yes, I totally agree with your point. I had meant to suggest pushdown_sql() (by mistake I wrote push_down()), in case the source and the target are in different databases.

As for "Run as a separate process", I already mentioned that it will not be required if we are not using any logic apart from the join.

But I really appreciate your suggestion and look forward to the same in future.

Shiva