
Slow performance loading Flat File

Oct 27, 2017 at 07:47 PM

Former Member

Dear All

I'm doing a Proof of Concept with SAP Data Services and I'm getting slow performance on a simple task.
Any tips or clues on how to improve the time for this task would be welcome.
I have already changed parameters in DSConfig.txt and other parallelism options, with some improvement, but still far from the times of the other ETL tools.
I noticed that during processing the CPU goes to 97% and memory usage reaches up to 11 GB, still leaving 5 GB of free memory.

Goal:
Load 10 files of 10,000,000 lines each (4 integer fields), grouping by 3 fields, counting rows and writing the ordered output to 1 file.
Each file is about 200 MB in size.
Source and destination files are on the server.

Issue:
Low performance reading files compared with other ETL tools:
Pentaho: 50s
IBM DataStage: 37s
PowerCenter: 47s
SAP Data Services: 126s

Tested on Linux: reading the same 10 files (100,000,000 lines) with cat takes only 7s.

System Environment

Software:

VMware ESX 5.5 on SUSE Linux Enterprise 12
SAP Data Services 4.2 SP9
Repository: SQL Anywhere 17

Hardware:

Dell PowerEdge M630 blades with 2 x Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz and 512 GB of RAM
Storage: IBM Storwize V7000
Switch: Brocade 5100
16 GB of RAM allocated to the Data Services VM

Developed two jobs: a simple one with few components, and one with a separate dataflow per file.

The second job uses more objects and runs faster, but it is still slower than the other ETL tools.
----------------------------------------------------------------------------------------------------------------------

1) First Job (JB_POC)

Developed a single dataflow with the components below:

Object Properties
Degree of parallelism: 10
Cache type: in_memory_cache

Flat File Source Properties

Query

select ... from FF_Arquivo1
group by FF_Arquivo1.Field2, FF_Arquivo1.Field3, FF_Arquivo1.Field4
order by FF_Arquivo1.Field2 ASC, FF_Arquivo1.Field3 ASC, FF_Arquivo1.Field4 ASC
Column name = Mapping (Datatype)
Field2 = FF_Arquivo1.Field2 (INT)
Field3 = FF_Arquivo1.Field3 (INT)
Field4 = FF_Arquivo1.Field4 (INT)
Field5 = count(*) (INT)

Total time 157s
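
To make the workload explicit, the logic of this dataflow corresponds roughly to the standalone Python sketch below. It is only an illustration of what the job computes, not of how DS executes it; the file names arquivo1.txt ... arquivo10.txt, the ";" delimiter and the column order are my own assumptions.

# Illustrative sketch only: group 10 delimited files by Field2, Field3, Field4,
# count the rows per group and write one ordered output file.
# File names, the ";" delimiter and the column order are assumptions.
from collections import Counter

counts = Counter()
for i in range(1, 11):
    with open(f"arquivo{i}.txt") as f:
        for line in f:
            # assumed layout: Field1;Field2;Field3;Field4 (all integers)
            _, f2, f3, f4 = (int(x) for x in line.rstrip("\n").split(";"))
            counts[(f2, f3, f4)] += 1

with open("resultado.txt", "w") as out:
    for (f2, f3, f4), n in sorted(counts.items()):
        out.write(f"{f2};{f3};{f4};{n}\n")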

-----------------------------------------------------------------------------------------------------------------------
2) Second Job (JB_POC_MULTIFILES)

Developed one workflow (WF_MULTIFILES) with 10 dataflows. Each dataflow (DF_POC_ArquivoX) reads a specific file, groups and counts its rows, and writes the result to a log file.
Another dataflow (DF_POC_GROUP) consolidates the 10 files, again grouping and counting.

WorkFlow (WF_MULTIFILES)

Object Properties
Degree of parallelism: 1
Cache type: in_memory_cache

Total time 126s
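
Structurally, this second job corresponds roughly to the Python sketch below: one partial aggregation per file running in parallel, followed by a consolidation step. Again purely illustrative, with the same assumed file names and delimiter as the sketch above.

# Illustrative sketch of the WF_MULTIFILES idea: aggregate each file in its own
# worker process, then consolidate the partial results in a final step.
# File names, delimiter and column order are assumptions.
from collections import Counter
from multiprocessing import Pool

def aggregate_file(path):
    counts = Counter()
    with open(path) as f:
        for line in f:
            _, f2, f3, f4 = (int(x) for x in line.rstrip("\n").split(";"))
            counts[(f2, f3, f4)] += 1
    return counts

if __name__ == "__main__":
    paths = [f"arquivo{i}.txt" for i in range(1, 11)]
    with Pool(len(paths)) as pool:
        partials = pool.map(aggregate_file, paths)   # the DF_POC_ArquivoX step
    total = Counter()
    for part in partials:                            # the DF_POC_GROUP step
        total.update(part)                           # sums the partial counts
    with open("resultado.txt", "w") as out:
        for (f2, f3, f4), n in sorted(total.items()):
            out.write(f"{f2};{f3};{f4};{n}\n")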

Thanks in advance.

Best Regards


3 Answers

Dirk Venken
Oct 28, 2017 at 10:27 AM

Don't try sorts, aggregations... on files. DS will have to read the complete file contents into memory to perform those operations.

Copy your files to a staging database first. You can do that with a Data_Transfer as the first transform in your data flow. The logic of the Query transform will then be executed in the database, and only the results will stream through DS memory to the output file.

Former Member

We really appreciate your reply, but in this case we have a requirement not to use any staging area or database to support this job.

Is there a way to first load all the flat file data into DS memory and only then start the next transformation? Do you think this could help improve performance?

Thanks,


I am sorry, but I don't see any reason for such a requirement. If you want to get the most out of the tool, use it according to best practices. In DS, that often simply means "let the database do the hard work".

Former Member Oct 31, 2017 at 01:35 PM

Dirk, thanks for your reply. We understand your point, and using a DBMS would generally be the way to do it. But in this particular case the requirement is precisely to measure performance when handling files, and since the other tools managed to do it as required, using a DBMS is not an option.

Anyway, the whole process takes 119s. A job that reads the files and discards the data (a Map_Operation with discard) takes 90s, while doing it at the OS level (cat *.txt > newfile) takes only 8s. So the problem is not sorting or aggregating the data but reading the flat files into memory: DS takes about 90s to read the data and 30s to sort and aggregate.

Is there any configuration parameter or method we can use to speed up reading the flat files?
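
To show what we mean by read overhead: outside of DS, the raw read and the parsing can be timed separately with a rough Python sketch like the one below (same assumed file names and ";" delimiter as before; it is only meant to illustrate where the time goes, not to reproduce DS behaviour).

# Rough timing sketch to separate raw I/O from parsing cost on the same files.
# File names and the ";" delimiter are assumptions.
import time

paths = [f"arquivo{i}.txt" for i in range(1, 11)]

t0 = time.perf_counter()
for p in paths:                      # raw byte read, roughly comparable to cat
    with open(p, "rb") as f:
        while f.read(1 << 20):
            pass
t1 = time.perf_counter()
for p in paths:                      # read plus split into lines and fields
    with open(p) as f:
        for line in f:
            line.rstrip("\n").split(";")
t2 = time.perf_counter()

print(f"raw read:     {t1 - t0:.1f} s")
print(f"read + parse: {t2 - t1:.1f} s")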

Thanks.

Former Member Feb 28 at 04:08 PM

Hi,

Can you tell me what the end result was? I am having similar issues.

Thanks
