
SAP HANA Partitioning

matma24
Participant

Hello guys,

I've been working with HANA for a little while now, but I'm not sure about some of the quirks of how HANA does things. So I'd like to ask a few questions (and describe how I see/understand each one) and get some feedback on whether my thinking is correct or not.

1. How do inserts work and when does locking happen?

As far as I know, in a traditional DB locks are put on a page - which simply means that if I'm trying to update two records (in two different sessions/transactions) that are stored in the same page, the second transaction will wait for the lock to be freed up (which might cause some delay). In HANA, however, there is an additional data structure - the Delta Storage.

Delta Storage is basically the place where all inserts/updates are stored for the time being; when the Delta Storage grows to a certain size, its content (optimized for write operations) is moved into the Main Storage (optimized for reads) and persisted on disk.

Locks happen only when we are trying to update the same record by ID (one transaction getting access to that place in memory while the second one waits for it to be freed up) - is this basically the only moment when locking happens?

What are object locks? I've found that we actually have a lot of lock types. Most of them are row locks, which seem really straightforward, but we also have some "object locks" on procedures and functions, and I'm not really sure what's happening there.
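For context, both lock types can be looked at directly; a sketch, assuming the standard HANA monitoring views M_RECORD_LOCKS and M_OBJECT_LOCKS (exact column names may vary slightly by revision):

```sql
-- Record (row) locks currently held, per table
SELECT SCHEMA_NAME, TABLE_NAME, LOCK_MODE, ACQUIRED_TIME
FROM M_RECORD_LOCKS;

-- Object-level locks, e.g. on tables/procedures during DDL or execution
SELECT SCHEMA_NAME, OBJECT_NAME, OBJECT_TYPE, LOCK_MODE
FROM M_OBJECT_LOCKS;
```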

2. How does partitioning a table influence performance?

For the time being I would like to consider Hash partitioning.

We have a table that is partitioned by hash. Each partition acts as a "separate table", which basically means each partition gets its own Delta Storage and Main Storage. This can improve inserts because, theoretically, while the content of one Delta Storage (in partition 1) is being moved to Main Storage, we can still write data to the other partitions, since they have their own Delta Storages. If we don't partition the table, then everything lands in the same Delta Storage, so the performance of large inserts/updates is heavily influenced by the merge operation - because while the content of the Delta is being moved to Main Storage, are all write operations "halted" for the time being?

I've tested it - I created two tables with the same schema, one with no partitions and one with 10 partitions, and inserted 500 million rows into each of them (5 sessions, 100 million rows each). No surprise: inserting into the partitioned table was faster - around 4.2x faster, to be precise.

However, partitions come with a cost - they can heavily degrade read performance if we are not able to provide a predicate in the WHERE clause on the columns used in the hash. This simply means that if I have a table with ID, FIRST_NAME and LAST_NAME, where ID is the primary key and is used in the hash, then putting this ID in a WHERE or JOIN is fine, because HANA knows which partition holds the requested data. But when I search by LAST_NAME, HANA needs to visit all partitions separately and union the results, instead of going to just one place as with an unpartitioned table.
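Whether HANA can actually prune partitions for a given predicate can be checked with EXPLAIN PLAN; a sketch against the example table (the table name "PEOPLE" and the statement names are made up for illustration):

```sql
-- Pruned: the hash column (ID) appears in the predicate
EXPLAIN PLAN SET STATEMENT_NAME = 'by_id' FOR
SELECT * FROM "PEOPLE" WHERE ID = 42;

-- Not pruned: LAST_NAME is not part of the hash, all partitions are visited
EXPLAIN PLAN SET STATEMENT_NAME = 'by_name' FOR
SELECT * FROM "PEOPLE" WHERE LAST_NAME = 'Smith';

-- Compare the resulting plans
SELECT STATEMENT_NAME, OPERATOR_NAME, TABLE_NAME
FROM EXPLAIN_PLAN_TABLE
WHERE STATEMENT_NAME IN ('by_id', 'by_name');
```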

3. Are triggers a good idea?

In our processing, after two of the tables are changed - there are inserts/updates going on there - we run a few triggers (between 7 and 12, depending on the operation). This seems to impact performance by a lot, which makes sense, because each operation (update/insert) is really 8 to 13 operations that need to happen in the background. So it "holds" the transaction from finishing, and that's why there is such a big delay before running another statement on the "main" tables?

4. Insert performance during big data loads.

Let's assume that I'm not able to partition my table, so I can't improve performance that way. Is the "simplest" thing I can do to improve insert efficiency on a non-partitioned table to increase the Delta Storage space? In theory I would then be able to keep more data there at the same time, so the merge operation wouldn't happen as often - and from the documentation I can see this operation is the most time-consuming one.

Would be really grateful for any hints/ tips/ confirmation/ critique and clarification.

Best regards,

Mat I

Accepted Solutions (1)


lbreddemann
Active Contributor

Oh, gee, just one little question there, right? 🙂

Your questions cover a wide range of DBMS implementation topics.
In order to get a good understanding of how, and possibly why, HANA implements things in one way or another, I recommend getting familiar with the fundamentals. In "How to know all these things" I linked a few sources that I found very helpful. I especially want to point out the OpenHPI course "In-Memory Data Management" as well as "The Red Book", both of which are available for free.

Now, let's see what I can add to your questions.

1. How do inserts work and when does locking happen?

Locking is a mechanism to make sure that access to shared resources (records, DB objects like tables or views, storage structures, memory areas, etc.) happens in an orderly fashion. That is to say: we want to be able to know in advance what state the resource will be in after a series of actions. This is where the term "transaction protocol" comes into play as the playbook for how to share "stuff" in the system among independent processes/actors/sessions.

Locks are used in this context to indicate which changes are possible by which party at what point in time.

With this in place, let's see which resources may need coordination in HANA:

- Records? Sure, we want ACID transaction support for our SQL transactions, so record accesses need to be coordinated. These have locks.

- Columns? Tables? Views? Yes, again, as we don't want to allow things like changing the column definition while another session is still using the data. So, here are our object locks.

- Storage structures? You mentioned Delta Storage which is one of the HANA column store data structures that is used to capture changed data. This structure is a lock-less append-only data structure for the user - which is the SQL session running an UPDATE/INSERT/DELETE, etc. This means, for the action of adding a change record into this delta storage, no additional lock needs to be acquired.
However, this data structure stops being available for sessions when a DELTA MERGE occurs. In this scenario, a second delta store structure is used going forward, and the old one is only read by the delta merge process itself.
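For reference, a delta merge can also be requested manually, and its runs are visible in a monitoring view; a minimal sketch, assuming a column table named "MY_TABLE" (exact column names may vary by HANA revision):

```sql
-- Request a delta merge for one table (handled per partition)
MERGE DELTA OF "MY_TABLE";

-- Recent merge runs, including why they happened (MOTIVATION)
SELECT TABLE_NAME, PART_ID, MOTIVATION, EXECUTION_TIME
FROM M_DELTA_MERGE_STATISTICS
WHERE TABLE_NAME = 'MY_TABLE'
ORDER BY START_TIME DESC;
```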

There are plenty of other locking mechanisms in place, e.g. mutexes or semaphores/critical regions mostly for even more HANA internal processes, like memory management.

2. How does partitioning a table influence performance?

Wow, what a simple question ... not really.

One way to answer would of course be "a lot" and leave it at that.

The main driving factors for the performance of data change operations in column store tables are

- available CPUs

- available table partitions

- commit frequency

- delta merge frequency

Savepoints that can/will occur during the data change operations can also have a big impact, but for the sake of keeping this simpler, let's go with these four factors.

As this question is not new, there does exist a blog post about this: Merging right. Some tests with data loading in HANA from Down Under.

The fact that there is a single delta store for several sessions is less critical (the sessions are not waiting for a "delta store lock") than the fact that the memory structures for this delta store "belong" to a single NUMA node and that the different threads trying to access it are subject to scheduling by the operating system.

When a delta merge happens, as explained in my blog post, data changes can continue, since a secondary delta store structure is put in place for this reason - with the caveat that only new change transactions get assigned to the new delta store. Committing data changes in between can help with performance and resource consumption.
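The "committing in between" point can be sketched as a batched load - a hypothetical example (table name "MY_TABLE" is made up, and this assumes the SQLScript context permits COMMIT inside the block), where each chunk becomes its own transaction instead of one huge one:

```sql
DO BEGIN
	FOR i IN 0..9 DO
		-- each iteration loads one 10-million-row chunk
		INSERT INTO "MY_TABLE"
		SELECT ELEMENT_NUMBER AS id, 1 AS job_id
		FROM SERIES_GENERATE_INTEGER(1, :i * 10000000, (:i + 1) * 10000000);
		COMMIT;  -- close the transaction per chunk
	END FOR;
END;
```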

Read cost of partitions

You summed up one of the potential drawbacks of splitting tables into partitions well. If I have four files in which I store my information, and no dictionary for the information, then I have to go through every file. That's right.

This is, partly, what makes designing a partitioning scheme complex. Write patterns in a DBMS are typically much better understood than read patterns. So, commonly, finding the best compromise between all writers and readers becomes an iterative process.

3. Are triggers a good idea?

In my opinion: No!

Typically, triggers are thought of as some form of event-based business logic driver. But the data in the tables and the events the triggers react to are far too low level to represent actual business events well.

But not only are triggers a poor development artifact (how would you test and debug a series of triggers in a production environment with thousands of active sessions?), they also tend to be very non-obvious.

When interacting with the database, there is nothing telling you which triggers you, well, trigger when performing any activity. They are like black-magic-f**ckery happening in the background with no obvious indication of their involvement. You update table A and tables B, C, D, and Z get updated/deleted/loaded into memory or copied. Why? You cannot know unless you explicitly look for triggers on the tables you just touched.

How would you do that? The standard HANA tools don't easily show triggers, and in tools like DBeaver you have to open the table definition to learn whether any triggers are involved.
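One way to at least make them findable is the catalog view TRIGGERS; a sketch (assuming a table named "MY_TABLE" - column names per my recollection of the catalog):

```sql
-- Which triggers fire on a given table, when, and on which events?
SELECT TRIGGER_NAME, TRIGGER_ACTION_TIME, TRIGGER_EVENT
FROM SYS.TRIGGERS
WHERE SUBJECT_TABLE_NAME = 'MY_TABLE';
```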

Are there certain scenarios that can reasonably be built with triggers? I'd say, logging data changes or versioning data cover the range of scenarios fairly well.

Again - this is my opinion; If you like triggers, fine with me.

4. Insert performance during big data loads.

Nope, "increasing delta storage" is not an option you have. HANA manages the delta store for each column/table automatically. What you can do is influence under which conditions a delta merge occurs. Details on that are in the already mentioned blog post "Merging right".
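For illustration only: the auto-merge conditions live in the mergedog section of indexserver.ini and can be adjusted system-wide, or a table can be opted out of auto merge entirely. The decision-function value below is a placeholder, not a recommendation - the actual expression syntax is covered in the blog post and the documentation:

```sql
-- Adjust when the automatic merge considers a table (placeholder value!)
ALTER SYSTEM ALTER CONFIGURATION ('indexserver.ini', 'SYSTEM')
SET ('mergedog', 'auto_merge_decision_func') = '<your decision function>'
WITH RECONFIGURE;

-- Or opt a single table out of auto merge and merge it explicitly instead
ALTER TABLE "MY_TABLE" DISABLE AUTOMERGE;
```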

There you go, now you know.

Lars

matma24
Participant

Hello Lars, thank you very much for your responses!

I'd like to ask a few follow-up questions if that's fine with you :). I haven't gone through the resources you provided yet - I will do that tomorrow - so I'm not sure if the answers are already in there.


1. How do inserts work and when does locking happen?

"You mentioned Delta Storage which is one of the HANA column store data structures that is used to capture changed data. This structure is a lock-less append-only data structure for the user - which is the SQL session running an UPDATE/INSERT/DELETE, etc."

- Does your answer insinuate that every session has its own Delta Storage for a table?

- What do you really mean by "lock-less"? Since I'm able to update a record with a specific ID in parallel, how can I do that properly without record locks? How can I be sure which update comes first? If every session really had its own Delta Storage, then it might make sense, but otherwise it seems unclear to me.

- Since you pointed out that there are record locks, when exactly do they occur? In one part of the answer you stated that Delta Storage is "lock-less" (and I assume that Delta Storage is the place where all INSERTs/UPDATEs/DELETEs are stored for the time being before merging), so how do record locks make sense here? If I've updated a record with a specific ID and it sits in "lock-less" Delta Storage, then even if another query wants to update the same ID, it won't get locked. The only possible place that comes to my mind is actually during the merge, if Delta Storages from different sessions are "fighting" for the same records.

2. How does partitioning a table influence performance?

From the test I listed in my original question, can I assume that the 4.2x faster inserts into the partitioned table (5 sessions, 100 million inserts in each of them) actually come strictly from delta merge frequency?

Since I ran it on the same machine with the same resources and the same load, the only thing from the points you listed that makes sense is delta merge frequency. Basically, because I have 10 partitions and each of them has its own Delta Storage, I can assume the merge isn't happening as often - simply because with one partition I put all 100 million rows into the same Delta Storage, but with 10 partitions I put 10 million into each of them (unrealistic with HASH, but let's assume that's true for simplicity)?

3. Are triggers a good idea?

Actually, I don't like them myself either; unfortunately they are commonly used in my company, so I would say I "inherited" them as they are. I've been fighting to remove them and simply run a cron job that does the same thing as those triggers, but for a larger amount of data. Simply speaking, instead of running those triggers 10000x, I'd like to run a procedure once with 10000 IDs as my "data selection IDs".
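What I have in mind as a replacement would look roughly like this - all object names here (PROCESS_CHANGES, CHANGED_IDS, TARGET_TABLE) are made up for illustration:

```sql
-- One set-based operation instead of 10000 per-row trigger firings
CREATE PROCEDURE "PROCESS_CHANGES" AS
BEGIN
	-- CHANGED_IDS is the worklist of IDs collected since the last run
	UPDATE "TARGET_TABLE"
	SET "STATUS" = 'PROCESSED'
	WHERE "ID" IN (SELECT "ID" FROM "CHANGED_IDS");

	-- clear the worklist once processed
	DELETE FROM "CHANGED_IDS";
END;
```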

Nevertheless, your answers are very good, so I'm accepting the reply as it is, but I hope for a few follow-ups in the comments :).

Mat I

lbreddemann
Active Contributor

- Does your answer insinuate that every session has its own Delta Storage for a table?

Nope, that's not what I wrote and not what it is. Every column (!) has its own delta storage. Not table, but column.
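This per-column granularity can be seen in a monitoring view; a sketch, assuming M_CS_ALL_COLUMNS and a table named "MY_TABLE":

```sql
-- Main vs. delta memory footprint, one row per column of the table
SELECT COLUMN_NAME, MEMORY_SIZE_IN_MAIN, MEMORY_SIZE_IN_DELTA
FROM M_CS_ALL_COLUMNS
WHERE TABLE_NAME = 'MY_TABLE';
```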

- What do you really mean by "lock-less" ?

By that, I mean that when values get entered into the delta stores, no additional locks need to be held/waited for that to work. In very basic terms: when you throw stuff on a pile, do you need to get "a lock" first so that you can be sure that the stuff lands on the pile? Nope, you can just "add" stuff to the pile.

I recommend reading up on lock-less data structures if you want to dive deeper into this.

- Since you pointed out that there are record locks, when exactly they occur?
Whenever it's required for the current "isolation level". Depending on what degree of avoidance of transaction errors you need, even reading records may lead to locks - or even table locks.

For your questions, I want to point out that "LOCKS" is a concept that occurs on many independent layers in the DBMS implementation. The "record"-locks you may see on SQL-level have nothing to do with how delta stores are lock-less.

- From the test I listed in my original question, can I assume that the 4.2x faster inserts into the partitioned table (5 sessions, 100 million inserts in each of them) actually come strictly from delta merge frequency?

No way to tell without actually looking at your scenario and the system setup.
"Assuming" that things happen or not has nothing to do with performance analysis. One has to measure - period.

Nevertheless, your answers are very good, so I'm accepting the reply as it is, but I hope for a few follow-ups in the comments :).

Thanks, and you got your follow-ups.
For further questions, individual question posts with a tighter focus and actual example data/code may get you more replies.

All-out question lists like yours that cover many aspects can easily scare off people who may have a good answer for one or two of the topics.

matma24
Participant

Thanks again!

The simple example of partitions that I described:

-- 10-way hash-partitioned table: each partition gets its own delta and main storage
CREATE COLUMN TABLE "TEST_PARTITION"(
	ID BIGINT,
	JOB_ID INT,
	PRIMARY KEY (ID, JOB_ID)
) PARTITION BY HASH (ID, JOB_ID) PARTITIONS 10;

-- load 100 million rows in one transaction (run in 5 parallel sessions for the test)
DO BEGIN
	lt_numbers = SELECT ELEMENT_NUMBER AS id,
		1 AS job_id
	FROM SERIES_GENERATE_INTEGER(1, 0, 100000000);

	INSERT INTO "TEST_PARTITION"
	SELECT *
	FROM :lt_numbers;
END;


-- same schema, but unpartitioned: a single delta and main storage for the table
CREATE COLUMN TABLE "TEST_NO_PARTITION"(
	ID BIGINT,
	JOB_ID INT,
	PRIMARY KEY (ID, JOB_ID)
);

-- identical load as above, against the unpartitioned table
DO BEGIN
	lt_numbers = SELECT ELEMENT_NUMBER AS id,
		1 AS job_id
	FROM SERIES_GENERATE_INTEGER(1, 0, 100000000);

	INSERT INTO "TEST_NO_PARTITION"
	SELECT *
	FROM :lt_numbers;
END;

Load on the system was exactly the same (nothing was happening other than running that).

Merge statistics from M_DELTA_MERGE_STATISTICS:
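Roughly gathered with a query like this (exact column semantics may vary by revision):

```sql
-- Count and total runtime of merges per test table
SELECT TABLE_NAME, COUNT(*) AS MERGE_COUNT,
	SUM(EXECUTION_TIME) AS TOTAL_MERGE_TIME
FROM M_DELTA_MERGE_STATISTICS
WHERE TABLE_NAME IN ('TEST_PARTITION', 'TEST_NO_PARTITION')
GROUP BY TABLE_NAME;
```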

I'm not really sure then where this 4.2x better performance came from.

From your amazing blog post about merging I could infer that less frequent merges can lead to better performance. But collectively, for the same amount of rows, the partitioned table had 110 merges while the non-partitioned one had 12 merges; the inserts into the two tables happened at different points in time (5 sessions, 100 million records, nothing else running on the system), and since the Delta Store is lock-free, I'm not really finding any reason why the difference is that big.

Answers (0)