on 09-26-2013 9:41 AM
Hi
I'm learning HANA and I have a question about the usage of indexes and query plans. I use HANA SPS6 Developer Edition VM on Cloudshare for my tests.
I populated a column table with 15,000,000 of rows and I'm trying to test some queries. A simple example:
select * from "LEONID"."CDRs" where NUM_ORIG = '9054052020680';
This query returns one row, even though values in NUM_ORIG column (defined as VARCHAR(15)) doesn't have to be unique. The query consistently takes 15 ms. to finish.
Then, I create an index on NUM_ORIG column:
create index idx_CDRs_NUM_ORIG on "LEONID"."CDRs"(NUM_ORIG);
From INDEXES view, I can see that HANA chooses "INVERTED VALUE" index type for the index. Then, I run my query again, and it consistently returns in 1 ms. Apparently, the index get used because it is the only change that made between the tests. If I drop the index, the query response time returns to 15 ms.
To my great surprise, I don't find any mention of index usage in the query plan. Actually, the query plans for the query with and without indexes looks exactly the same. Is it an expected behavior? Should I monitor the index usage in other place? See the output from EXPLAIN_PLAN_TABLE below, I omitted some obvious information to make it slightly more compact:
OPERATOR_NAME,OPERATOR_DETAILS,TABLE_SIZE,OUTPUT_SIZE,SUBTREE_COST,OPERATOR_ID,PARENT_OPERATOR_ID,LEVEL,POSITION
"COLUMN SEARCH","CDRs.CDR_ID, CDRs.CC_ORIG, CDRs.AC_ORIG, CDRs.NUM_ORIG, CDRs.CC_DEST, CDRs.AC_DEST, CDRs.NUM_DEST, CDRs.STARTTIME, CDRs.ENDTIME, CDRs.DURATION (LATE MATERIALIZATION)",?,1,0.000008691042685,1,?,1,1
" COLUMN TABLE","FILTER CONDITION: CDRs.NUM_ORIG = '9054052020680'",15000000,1,?,2,1,2,1
I tried to examine the query plan also using SAP HANA Studio, but still see the same plan with and without index and there is no mention of index usage in the plan.
Another related question: is there any best practices for indexing in HANA? I'm asking about it because in other SAP column-based RDBMS, Sybase IQ, the approach for indexing is quite different from traditional row-based RDBMSs, such as Sybase ASE, Oracle or SQL Server.
Thanks in advance
Hi,
Check this,
http://www.saphana.com/docs/DOC-3169
http://www.saphana.com/community/blogs/blog/2011/09/15/any-attribute-as-an-index
Syntax
CREATE [UNIQUE] [BTREE | CPBTREE] INDEX <index_name> ON <table_name> (<column_name_order>, ...) [ASC | DESC]
Syntax Elements
<index_name> ::= [<schema_name>.]<identifier> <column_name_order> ::= <column_name> [ASC | DESC]
Regards
NK
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Hi all, this is an interesting discussion. Just wanted to jump in here with a little bit of feedback and some observations from the system I am working on.
I did a similar test as Leonid regarding indices on a table with no keys, and had a volume of around 700,000 records.
SELECT *
FROM "MOLJUS02"."MARA"
WHERE "MATNR" = 'TRM108274';
First run time was about 75ms, but subsequent executions were around 8ms
CREATE INDEX IDX_MOLJUS02_MARA_MATNR on "MOLJUS02"."MARA" ("MATNR");
SELECT *
FROM "MOLJUS02"."MARA"
WHERE "MATNR" = 'TRM108274';
First run time was about 75ms, but subsequent executions were around 8ms
So what I observed is that the runtime was not improved at all after introducing an index.
I decided to take this a step further to dive into a more detailed testing. My interest here is because in the system I am working on now, there are a lot of tables that are non SLT-delivered. These also happen not to have any primary keys defined, and therefore could not make use of the Inverted Index mentioned in this thread. In discussions with a colleague, it was indicated that introducing primary keys on the tables may benefit the runtimes of the models built on top of these.
Aside from the bad design principle of not having keys defined to ensure data contstraints, I was trying to reach the conclusion to the answer of "do primary keys speed up queries?", as seen from a pure SELECT perspective as well as within an Analytic View (star schema) perspective.
Purpose: Determine efficiencies gained in performance by using table with primary keys defined
Hypothesis: Tables with primary keys will perform better with SELECT and JOIN operations due to the use of the inverted index.
Testing methodology
Table Sizes
MARA (keys MANDT/MATNR) – 703,000
KNVV (keys MADNT, KUNNR, VKORG, SPART, VTWEG) – 726,000
CE12000 (Fact – many keys) – 1.1 billion records
SELECT Results
Query | With Keys | No Keys | Variance from keys to non-keys |
MARA (single column key) | 6.6ms | 11ms | +5.4ms (81%) |
KNVV (multi column key) | 4ms | 5ms | +1ms (25%) |
KNVV (smaller query set) | 2ms | 3ms | +1ms (33%) |
Model Results
Dimension and query | With Keys | No Keys | Variance from keys to non-keys |
MARA - aggregated | 784ms | 823ms | +39ms (5%) |
MARA – detailed | 5,497ms | 5,427ms | -70ms (~) |
KNVV – aggregated | 780ms | 750ms | -30ms (3.85%) |
KNVV - detailed | 1,685ms | 1,682ms | -2ms (~) |
MARA - filter applied | 249ms | 246ms | -3ms (~) |
So analyzing what I have seen, the conclusion that I have come to is that the presence of a primary key and thus the inverted index actually had no effect on the runtimes when compared to the equivalent data structures that had no keys/indices defined.
Any thoughts on these results or ways to point out how it could be done better? Ultimate goal is performance optimization, and right now this (keys and indices) doesn't look like a very effective angle for improving Analytic View performance.
Regards,
Justin
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Hey Justin,
Unfortunately the use of indexes hasn't been well documented. Hoping and team will help with this in 2014. I don't have time for a detailed analysis right now, but here's what I found:
- Indexes will almost never help with Analytic Views because they run in the OLAP engine and are used for aggregations.
- Indexes can help with SQL, Attribute Views and Calc Views, in certain situations. This is situations where the default index on a column isn't well sorted to return the information you wanted.
The latter case is what's happening in 's case. If you do something similar and create a large table with individual document IDs, then an INDEX will help with SELECT * performance.
It can equally help where you have joins between two tables and are looking to find individual records. I have never found indexes to help with OLAP-style aggregations.
Hope this helps you.
John
Hey there,
just real quickly...
- I am planning to put out something on indexes next year. Currently there's just no time for me to prepare this properly.
- It actually doesn't matter which engine accesses the table, as the engine doesn't determine the index usage. There is a thin functional layer around the actual columns (attribute engine - ok, it is something called 'engine'...) that takes requests for column data and decides per column, per request whether or not an index should be used for this access.
This is also the reason why the explain plan cannot tell you about upfront: the SQL layer doesn't know about the column store indexes.
Very roughly, indexes on column store tables can be useful, when
a) the table is really big and scanning takes a lot of time
b) the number of records we are looking for is relatively small
However, inverted indexes don't help with sorting at all - due to the necessary mix-up of at least one main and one delta store and the fact that the sort order of the whole table can change during compression optimization.
As mentioned above, I've something half ready on this topic and will try to push it out next year.
Cheers, Lars
Hey Justin,
the thing here is: OLAP queries are the anti-use case for inverted indexes.
For OLAP queries, we want aggregation, volume scanning and dimension joining.
Inverted indexes really rather help with very specific filter conditions on large tables that happen often.
For the primary key definitions: don't think primary keys as performance enhancements. They are not. Primary key carry semantic meaning in your data model, they provide identity to each record.
- Lars
Thanks for the details Lars, much appreciated. I think I have to take a look back at the details around joins and select from the HPI course for some more info.
One quick question here on the following statement
"Inverted indexes really rather help with very specific filter conditions on large tables that happen often"
Couldn't this mean the same as applying a filter on an attribute view (dimension)? Could this not speed up the reads on that specific table before moving into the join phase of an OLAP query?
Regards and Happy New Year!
Justin
Thanks for quick responses. Unfortunately, answers provided so far don't answer my question. The concept of inverted index is clear to me. My question is why I don't see any mention of index usage in the query plan, even though the index is apparently used by my query.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Lars, thank you very much for the information. I must admit that I'm very surprised about it, all RDBMSs that I have worked with until now has such a capability. I hope that SAP will fix the issue soon, otherwise query plans are not very useful.
Allow me to ask some related questions:
1. I'd like to understand the usage of optimizer statistics in HANA. Does HANA collects the optimizer statistics by itself? I see that CREATE STATISTICS statement does exist in HANA, so in which cases it should be used? I tried to run CREATE STATISTICS on some columns in my test table, but STATISTICS view is still empty. Is it possible to see optimizer estimations vs. actual values somehow?
2. What types of table joins are offered by HANA? In most of modern "traditional" databases, like SAP Sybase ASE, SAP SQL Anywhere, Oracle, SQL Server, nested-loop, hash and sort-merge joins are available, while SAP Sybase IQ offers also additional join types. I just can't find it in HANA documentation and it is not clear from query plans.
I can split my questions above to different discussion topics, if you like.
Thanks in advance
Leonid Gvirtz
Leonid Gvirtz wrote:
Lars, thank you very much for the information. I must admit that I'm very surprised about it, all RDBMSs that I have worked with until now has such a capability. I hope that SAP will fix the issue soon, otherwise query plans are not very useful.
Well, lots of the information are available in the PlanViz and even if you today don't explicitly see the usage of an inverted index on column store tables, you can already clearly see where time is spend and how much data is moved between different processing steps (plan operations - POP).
So it's not as bad as you put it and sure enough we're working on improving the options to analyze query execution.
Leonid Gvirtz wrote:
Allow me to ask some related questions:
1. I'd like to understand the usage of optimizer statistics in HANA. Does HANA collects the optimizer statistics by itself? I see that CREATE STATISTICS statement does exist in HANA, so in which cases it should be used? I tried to run CREATE STATISTICS on some columns in my test table, but STATISTICS view is still empty. Is it possible to see optimizer estimations vs. actual values somehow?
The CREATE STATISTICS command is there to provide statistic information for objects that can be used in a SQL query as a source table and that don't provide the information by themselves.
Currently this applies to smart data data access tables as these tables not stored within SAP HANA.
For row store and column store tables, the SAP HANA core data structures automatically provide the information required to perform query optimization (like # of distinct values, # of total records, min/max values etc.). No additional statistics collection is required for this.
And sure enough you can check the estimation against the actual values.
Look like you missed out on PlanViz quite a bit...
Leonid Gvirtz wrote:
2. What types of table joins are offered by HANA? In most of modern "traditional" databases, like SAP Sybase ASE, SAP SQL Anywhere, Oracle, SQL Server, nested-loop, hash and sort-merge joins are available, while SAP Sybase IQ offers also additional join types. I just can't find it in HANA documentation and it is not clear from query plans.
Actually more join algorithms are available on each of those (bitmap join, merge join cartesian, semi-anti join... ) but the three you mentioned are clearly the most well known ones.
Anyhow, most of those join algorithms aim at single thread record level execution and are not particularly cache conscious.
In SAP HANA the join functions are implemented with massive parallel execution on in memory data structures in mind. And you're right, these algorithms have not been explained in the documentation (yet).
Again, documentation is work in progress and for the 2 years of SAP HANA being generally available the set of documentation grew considerably.
Looking back at the documentation of Oracle 7 or MS SQL Server 6.5 I don't find extensive explanations of join algorithms there either.
Anyway, even with the short live span of SAP HANA there actually *is* lots of information available outside the official documentation documents.
You may want to check the saphana.com site for this:
JOIN: http://www.saphana.com/docs/DOC-3170
PARALLEL JOIN: http://www.saphana.com/docs/DOC-3174
oh and look:
INDICES: http://www.saphana.com/docs/DOC-3169
Leonid Gvirtz wrote:
I can split my questions above to different discussion topics, if you like.
Thanks in advance
Leonid Gvirtz
No worries about splitting the post.
But now it's up to you to educate yourself. So have fun reading and watching the videos
Personally I find this thread has been answered and won't further engage in this one.
For new, specific questions, feel free to open a new thread.
- Lars
Hi Leonid,
DO you know how Column Store works in HANA?
If no, then you can learn more about it by attending OpenHPI sessio at:
https://openhpi.de/course/imdb2013
If you know then i guess you know about Attribute Vector and Value IDs.
HANA creates Inverted Index which maps all record IDs to Value IDs.
To monitor indexes, you can check INDEXES and INDEX_COLUMNS Views in SYS schema
You can get information about logical index structure.
Also check M_RS_INDEXES, you can get information for physical storage of indexes on rowstore columns.
Regards,
Vivek
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
User | Count |
---|---|
85 | |
10 | |
10 | |
9 | |
7 | |
6 | |
5 | |
5 | |
4 | |
3 |
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.