Technology Blogs by Members
Explore a vibrant mix of technical expertise, industry insights, and tech buzz in member blogs covering SAP products, technology, and events. Get in the mix!
cancel
Showing results for 
Search instead for 
Did you mean: 
_Hemant_Sharma
Explorer
Hi SAP,

In this blog I am going to put forth a detailed discussion between the two functions in SAP Cloud Platform Intergration Data Services (CPI-DS) which can be said are two sides of the same coin, namely -"JOIN" (Inner Join) and "LOOKUP".

In my working with CPI-DS these are two are most common functions that I have used to build various dataflows to load MasterData and KeyFigure data across various business lines.

A little bit of background on the functionality of both functions.

"JOIN" function:

The "JOIN" function in CPI-DS is a complete tab on its own when creating a dataflow as visible under the Transform Details. "JOIN" helps to bring common results by matching same columns from Table A and Table B. There are two types of Joins which are available in CPI-DS:

  1. Left Outer Join - In this Join, let's say we Table A on the Left Side input and Table B on the Right-Side input.  A Left Outer Join will bring all the records from the Table A and only the matching records from Table B which is a left join will always bring all the records from Left Side Input and only matching records from Right Side Input.

  2. Inner Join - An inner join will only bring records which are matching in both tables - Table A & Table B regardless of whether the tables are inserted into the Left or Right-Side Input.


When defining the Join function, we have to fulfill 3 things:

  1. Select Left and Right-Side Input tables you want to use.

  2. Select the type of join whether you want to use Left Outer join or Inner join.

  3. Give proper matching condition in the Join Condition box and save it.


We can define multiple Join Pairs under the Join Tab. Each Join should be relevant and should make logical sense.

Additionally, there is a feature under the Join Tab called Join Rank. This Join Rank comes in handy when a single dataflow has multiple joins defined and it's required that there is order of priority set to the joins as to which join will be executed from first to last. Join Rank also as an option called "Cache" which allows data to be Cached according to the option set against it. (The Cache options include - Automatic, Yes, and No)

"LOOKUP" function:

LookUp fucntion allows result to be retrieved from a second table based on the comparison of columns between the original table and the lookup table. Additionally, apart from defining comparison between matching columns between the tables, CPI-DS also allows matching of specific Strings/ Characters present in the columns of the lookup table (For example: If a user only wants to LookUp records where in the column "PRODUCT DESCRIPTION" of the lookup table there exists String 'CHEMCALS'. This will only allow records from the column "PRODUCT DESCRIPTION" which has 'CHEMICALS' record in it. (Please note that the any given String is Case sensitive)).

LookUp too has three different types of cache functions:

  1. NO_CACHE - This reads value from lookup table without caching records.

  2. PRE_LOAD_CACHE - This loads the result column and the compare column into memory after applying filters but before executing the function. (By default, when applying LookUp function you can see that PRE_LOAD_CACHE is selected).

  3. DEMAND_LOAD_CACHE - This loads the result column, and the compare column values into memory as the function identifies them.


As mentioned previously, Lookup and Join (Inner Join) function are two sides of the same coin. What I mean here is that both functions are used to validate data sets of first table (most likely a Source Table from SAP or NON-SAP Source) with a secondary table ( most likely in a reference or index table) so that only required data can be loaded into the system, garbage data can be eliminated, hence decreasing un-necessary processing and increasing efficiency of each dataflow in-terms of runtime and system resources utilized.

Scenario - I will be using a very simple scenario to analyse the use case of both functions. Let's say XYZ Company has multiple business lines - A,B,C,D,&E for which they want to create separate planning areas to perform demand and supply planning individually without bearing of business lines on each other. IBP team is able to define various Master Datas like Product, Location, Customer etc by identifying filters when loading data from source tables for each planning area. The catch is that the transactional source tables consist of data from across all the 5 business lines and IBP team needs to find a way to bring only relevant data for every business line. Now to tackle this IBP team decides to create extraction tables for each Master Data in MySQL and use that as a reference table when loading data to KeyFigure.

(Note: Join and LookUp functions used is enabling the "Cache" feature in all sub-scenarios mentioned further. For sake of simplicity PRODUCT MD load tasks are used as below to discuss the sub-scenarios)


TSK_SQL_PRODUCT_TASKWITHJOIN - Cache set to "YES" for the both source tables


 


 


TSK_SQL_PRODUCT_TASKWITHLOOKUP  -Cache set to "PRE_LOAD_CACHE"



 

Sub-Scenario A: In this scenario we are loading data in both tasks without adding any other function or logic before Join or Lookup function.


"Image A"



"Image B"


We can see below in the Monitor Log for TSK_SQL_PRODUCT_TASKWITHJOIN when there is absolutely no other logic or function to be performed before the Join as in "Image A" the Absolute Time taken for it to complete the run is 7.421 secs.


Similarly, we can see below for TSK_SQL_PRODUCT_TASKWITHLOOKUP when there is absolutely no other logic or function to be performed before the LookUp as in "Image B" (Note: After the LookUp the "Default Value" in this case 'N/A' has to be filtered out) the Absolute Time taken for it to complete the run is 8.359 secs.



Findings - We can see in this case the difference between the two is negligible and Lookup took a second extra as it has to filter out Default Value 'N/A' as we don't want to load such records.

Sub-Scenario B: But what happens when we employ simple function just before the Lookup and Join function Here, we are taking a scenario where the data has to be cleansed using ltrim and rtrim functions as below.


Image C



Image D


We can see below in the Monitor Log for TSK_SQL_PRODUCT_TASKWITHJOIN when there is simple logic to be performed before the Join as in "Image C" the Absolute Time taken for it to complete the run is 10.468 secs.


Similarly, we can see below in the Monitor Log for TSK_SQL_PRODUCT_TASKWITHLOOKUP when there is simple logic to be performed before the Join as in "Image D" the Absolute Time taken for it to complete the run is 8.843 secs.


Findings - Again in this case we can observe the difference between the Join and Lookup is hardly 2 seconds which can be called negligible, but this time Lookup function was the faster task to completion.

Let's explore sub-scenarios where there are more than 2 tables involved in the Join and Lookup functions as below. (Here for sake of simplicity, I will be loading data to Location Product MD)


Image E



Image F


We can see below in the Monitor Log for TSK_SQL_LOCATIONPRODUCT_TASKWITHJOIN when there are more than 2 tables involved in an Inner Join as in "Image E" the Join function starts building a cartesian product between the 3 tables used here. This results in the creation of more than 337 million lines of record and counting as seen in the Monitor Log below. At some point of time this task will fail if it keeps performing the cartesian product process as it will timeout and/or lose connection to the server. Hence, making the Join Function unreliable in such scenarios.


We can see below in the Monitor Log for TSK_SQL_LOCATIONPRODUCT_TASKWITHLOOKUP when there are more than 2 tables involved in a LookUp as in "Image F" the Absolute Time taken for it to complete the run is 9.281 secs and unlike the Join function which most likely will fail even if left to run for a full day.


Findings - So we observed that Joins in cases where more than 2 tables are involved is going to build a cartesian product between the tables creating millions of lines of records. This results in a waste of system resources, time and money. While the Join function performs the same task within seconds and succeeds to load the data.

In conclusion, although Join and LookUp can be used interchangeable, in some complex scenarios it's best practice to pro-actively use LookUp function over Inner Join when building dataflows involving 2 or more tables. At the end of the day a Data Integration specialist should look to use the most efficient logics and functions to build dataflows which help build an efficient system and also save resources and time in the process.

I hope, I have added some value via this blog. Thank You for your time and patience for going through this write up.
2 Comments
Labels in this area