Former Member

### Verifying update statistics run on tables

Hi,

I executed the command UPDATE STATISTICS for some table columns supplying a 'sample definition'. It ran successfully and returned with an OK message. My question is -

Where should I look to find the details of the statistics that have been created as a result of this run? More specifically, I am looking for histograms for the columns; or any other kind of detailed information that MaxDB stores as part of table column statistics.

~maximus

P.S.- Some background:

System details - MaxDB version - 7.6.05.09, Platform - Linux, 64-bit

I am a novice user of MaxDB. I have installed MaxDB, created a database, created tables, and loaded them with data. Now I want to update the statistics for the columns of these tables.

I looked in the table OPTIMIZERSTATISTICS, but it only gives the number of distinct values, not the histogram. In fact, even the PAGECOUNT field in this table never contains anything but a '?'.


Former Member
Oct 30, 2008 at 12:12 PM

Hi Lars,

I was not sure about which parameters you were interested in. I executed PARAM_GETDIRECTALL at the DBMCLI prompt and chose some parameters that I thought might be relevant. Please correct me if this is not what you are looking for.

System details - MaxDB version - 7.6.05.09, Platform - Linux, 64-bit

Database Parameters -

KERNELVERSION                           KERNEL    7.6.05   BUILD 009-123-191-997
INSTANCE_TYPE                           OLTP
OPTIM_MAX_MERGE                         500
OPTIM_INV_ONLY                          YES
OPTIM_CACHE                             NO
OPTIM_JOIN_FETCH                        0
JOIN_SEARCH_LEVEL                       0
JOIN_MAXTAB_LEVEL4                      16
JOIN_MAXTAB_LEVEL9                      5
TRACE_JOIN                              0
TRACE_KSQL                              0
OPTIMIZE_AGGREGATION                    YES
OPTIMIZE_FETCH_REVERSE                  YES
OPTIMIZE_STAR_JOIN                      YES
OPTIMIZE_JOIN_ONEPHASE                  YES
OPTIMIZE_JOIN_OUTER                     YES
OPTIMIZE_MIN_MAX                        YES
OPTIMIZE_FIRST_ROWS                     YES
OPTIMIZE_OPERATOR_JOIN                  YES
OPTIMIZE_JOIN_HASHTABLE                 YES
OPTIMIZE_JOIN_HASH_MINIMAL_RATIO        1
OPTIMIZE_OPERATOR_JOIN_COSTFUNC         YES
OPTIMIZE_JOIN_PARALLEL_MINSIZE          1000000
OPTIMIZE_JOIN_PARALLEL_SERVERS          0
OPTIMIZE_JOIN_OPERATOR_SORT             YES
OPTIMIZE_QUAL_ON_INDEX                  YES
OPTIMIZE_QUERYREWRITE                   OPERATOR
UPDSTATWANTED                           1


DDL statements

create table t1 (c1 integer primary key, c2 integer, c3 integer)
insert into t1 (c1)  (select rowno from tables a, tables b, tables c where rowno <=1000000)
update t1 set c2=mod(c1, 14)
update t1 set c3= c1

create table t2 (c1 integer primary key, c2 integer, c3 integer)
insert into t2 (c1)  (select rowno from tables a, tables b, tables c where rowno <=1000000)
update t2 set c2=0 where c1<=100
update t2 set c2=1 where 100<c1 and c1<=300
update t2 set c2=2 where 300<c1 and c1<=700
update t2 set c2=3 where 700<c1 and c1<=1500
update t2 set c2=4 where 1500<c1 and c1<=3100
update t2 set c2=5 where 3100<c1 and c1<=6300
update t2 set c2=6 where 6300<c1 and c1<=12700
update t2 set c2=7 where 12700<c1 and c1<=25500
update t2 set c2=8 where 25500<c1 and c1<=51100
update t2 set c2=9 where 51100<c1 and c1<=100000
update t2 set c2=10 where 100000<c1 and c1<=200000
update t2 set c2=11 where 200000<c1 and c1<=400000
update t2 set c2=12 where 400000<c1 and c1<=800000
update t2 set c2=13 where 800000<c1
update t2 set c3= c1

update statistics column (*) for t1
update statistics column (*) for t2


As can be seen, in the table t2, there are 100 rows with c2=0, 200 rows with c2=1, and so on.
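The bucket sizes above roughly double from one c2 value to the next. As a quick sanity check of the data distribution (plain Python arithmetic outside the database, not MaxDB code), the per-bucket row counts derived from the UPDATE statements are:

```python
# Row counts per c2 value in t2, derived from the UPDATE statements above.
# c1 runs from 1 to 1,000,000; each bucket covers (lo, hi].
bounds = [0, 100, 300, 700, 1500, 3100, 6300, 12700, 25500, 51100,
          100000, 200000, 400000, 800000, 1000000]

counts = {c2: hi - lo for c2, (lo, hi) in enumerate(zip(bounds, bounds[1:]))}

print(counts[0], counts[1], counts[2])   # first three buckets: 100 200 400
print(sum(counts.values()))              # all rows accounted for: 1000000
```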

I tried EXPLAIN-ing the following query

explain join (result) select * from t1, t1 t3, t2 where t1.c1=t3.c1 and t1.c1=t2.c1 and t1.c2=1 and t2.c2<3
| STRATEGY             | TABLENAME     | ADDITIONAL ROWS            | ACCUMULATED_COSTS          |
| ---------------------| ------------- | -------------------------- | -------------------------- |
|                      | T1            |          998527.9999999999 |                       4304 |
| JOIN VIA SINGLE KEY  | T3            |          998527.9999999999 |                       8608 |
| JOIN VIA SINGLE KEY  | T2            |          998527.9999999999 |          13126.22624434389 |



The other columns (omitted here) have comparable values for all three tables. Note that the estimated ADDITIONAL ROWS for every table is approximately 1 million, i.e. the size of the whole table.

In the 3-way join above, the optimizer chooses the join order [t1, t3, t2]. But I would expect the ideal order to be [t2, t1, t3]: the condition c2<3 retrieves only 700 rows, so t2 should be scanned first. That result should then be joined with t1 (which has the next-smallest result on its own, since the condition c2=1 yields 1,000,000/14 ≈ 71,000 rows), and finally with t3. Is this reasoning valid?
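The cardinality reasoning behind the expected order can be spelled out numerically (plain Python arithmetic for illustration, not MaxDB internals):

```python
# Back-of-the-envelope cardinalities behind the expected join order [t2, t1, t3].
total = 1_000_000

# t2.c2 < 3 selects only the three smallest buckets: 100 + 200 + 400 rows.
t2_rows = 100 + 200 + 400

# t1.c2 = 1 selects one of 14 equally frequent values.
t1_rows = total // 14

print(t2_rows)   # 700
print(t1_rows)   # 71428
```

So scanning t2 first would feed roughly 700 rows into the join, versus roughly 71,000 when starting from t1.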

Warm regards,

Maximus

Edited by: maximus on Oct 30, 2008 1:12 PM

• Hi Maximus,

Ok, I tried to reproduce your examples.

The first thing here is obvious: c2 is not indexed, so although conditions like 't2.c2<3' reduce the result set a lot, MaxDB cannot exploit this knowledge: it can only use the primary key or a table scan for data access.

So, for this unindexed c2 there won't be any evaluation, as the column cannot be used for data access at all.

Create an index on c2 and you'll see that the estimations become much better.

Concerning the 3-way-join: I just can't reproduce this.

That's what I get:


explain join select * from t1, t1 t3, t2  where t1.c1=t3.c1 and t1.c1=t2.c1 and t1.c2=1 and t2.c2<3;

LARS   T2         997473           1           1                   8673.67826086956  997473           4281
JOIN VIA SINGLE KEY  LARS   T1         999141           1           1                   12468.4125        997473           8794.45248868778
JOIN VIA SINGLE KEY         T3         999141           1           1                   13853.7916666667  997473           13307.9049773756


That's the correct join order.

regards,

Lars

• Former Member
Oct 21, 2008 at 01:19 PM

From Melanie's answer, it is clear to me that MaxDB does not store histograms as part of its statistics. Then how does the optimizer calculate the cardinality of a condition?

e.g. In the query

SELECT * FROM t1, t2 WHERE t1.f1 = t2.f2 AND t1.f3 BETWEEN <val-1> AND <val-n>

the optimizer will try to calculate the selectivity of the BETWEEN condition. But in the absence of a histogram, it does not have accurate selectivities for the individual values <val-i> between <val-1> and <val-n>.

Does the optimizer assume a uniform distribution of the values? In other words, if there are N distinct values for the field t1.f3 and the range <val-1> to <val-n> contains n values, does the optimizer assume that the selectivity of the condition is n/N ?
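The uniform-distribution assumption described above would amount to the following estimate (an illustrative Python sketch of the n/N rule being asked about, not MaxDB's actual code):

```python
def uniform_range_selectivity(n_in_range: int, n_distinct: int) -> float:
    """Selectivity a purely statistics-based optimizer would assume for a
    range predicate under a uniform value distribution: n/N."""
    return n_in_range / n_distinct

# e.g. if 3 of 14 distinct values fall into the BETWEEN range:
print(uniform_range_selectivity(3, 14))   # ~0.214
```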

Warm regards,

Maximus

P.S.- Thanks Melanie and Lars! The answers were very useful.

• > How does the optimizer calculate the cardinality of a condition?

You should really read the information about the optimizer in the wiki: [SQL Optimizer](https://wiki.sdn.sap.com/wiki/x/DSg).

MaxDB does not use stored histograms but performs evaluation, a.k.a. sampling, at optimization time.

> Does the optimizer assume a uniform distribution of the values?

No, it does not (see above) - evaluations are used.

Basically, MaxDB performs test-reads into the B*Tree to figure out how big (in terms of pages that need to be read) a specific range would be.
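The idea of such test-reads can be sketched roughly like this. This is a toy Python model of a sorted, paged structure; the function name, page layout, and page size are my assumptions for illustration, not MaxDB internals:

```python
import bisect

def estimate_range_pages(sorted_keys, page_size, lo, hi):
    """Toy model: locate the range boundaries [lo, hi] in a sorted key list
    split into fixed-size pages, and estimate how many leaf pages a range
    scan would have to read."""
    first = bisect.bisect_left(sorted_keys, lo)
    last = bisect.bisect_right(sorted_keys, hi)
    # round the covered entry count up to whole pages, minimum one page
    return max(1, (last - first + page_size - 1) // page_size)

keys = list(range(1, 1_000_001))
print(estimate_range_pages(keys, page_size=100, lo=1, hi=700))   # small range → 7 pages
```

The point is that the cost of a range is derived from how much of the tree it spans, not from any stored distribution data.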

Before you ask - there is no way to see the results of this sampling process in a reasonable way.

Anyhow, for SQL tuning you may want to take a closer look at EXPLAIN SEQUENCE and EXPLAIN JOIN.

regards,

Lars

• Oct 20, 2008 at 07:09 AM

Hi,

if you are new to MaxDB, you might want to read our MaxDB performance guide and the support guide.

Both can be found here:

This section specifically explains the statistics:

https://wiki.sdn.sap.com/wiki/x/Dyg

We don't have any histograms - so you won't find any.

Best regards,

Melanie

• Oct 20, 2008 at 11:22 AM

Hi Maximus,

Melanie is of course completely right: read the documentation and the wiki first!

Anyhow, here are just a few steps to demonstrate what kind of information is gathered when:

## Let's create a test table:

create table test (kcol1 varchar(10) primary key, nkcol2 integer, nkcol3 varchar(20))
//
insert into test (kcol1)  (select rowno from tables a, tables b, tables c where rowno <=10000)
//
update test set nkcol2=mod(kcol1, 100)
//
update test set nkcol3= kcol1


Ok, kcol1 is obviously the primary key for this table and has 10,000 distinct values.

nkcol3 is a direct copy of it, so it also has 10,000 distinct values, whereas nkcol2 has only 100 distinct values due to the mod(x, 100) function used in the update.
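These distinct-value counts can be double-checked outside the database with plain Python (this just mirrors the SQL above; it is not how MaxDB computes them):

```python
# Mirror the test-table contents: kcol1 = 1..10000 (as strings in the table),
# nkcol2 = mod(kcol1, 100), nkcol3 = copy of kcol1.
keys = list(range(1, 10_001))
nkcol2 = {k % 100 for k in keys}   # mirrors mod(kcol1, 100)
nkcol3 = {str(k) for k in keys}    # direct copy of the key

print(len(keys), len(nkcol2), len(nkcol3))   # 10000 100 10000
```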

## Let's check the stats right now:

sqlcli DB760=>
sqlcli DB760=> \mu
Multiline mode switched ON
sqlcli DB760=> select t.tablename, f.entrycount, treeindexsize+treeleavessize+lobsize SIZE_IN_PAGES
=> from files f, tables t
=> where
=> t.tableid = f.fileid
=> and f.type='TABLE'
=> and t.tablename='TEST'
=> and t.schemaname=user;
| TABLENAME  | ENTRYCOUNT         | SIZE_IN_PAGES       |
| ---------- | ------------------ | ------------------- |
| TEST       |              10000 |                 368 |

1 row selected (122.726 msec)


So, the first thing I looked at here is actually not statistics at all, but the current size values from the file directory.

I did this just to see what statistical values we could expect.

By the way: to run multiline statements like the one above in sqlcli, you have to enter \mu first to activate multiline mode. To mark the end of a statement in multiline mode, you have to terminate it with a ';', just as you do in Oracle's sqlplus.

sqlcli DB760=> select * from optimizerinformation where tablename='TEST' and schemaname=user;
| SCHEMANAME | TABLENAME  | COLUMNNAME | INDEXNAME  | DISTINCTVALUECOUNT   | CALCULATEDSIZE   |
| ---------- | ---------- | ---------- | ---------- | -------------------- | ---------------- |
| LARS       | TEST       | KCOL1      | ?          |                    0 | ?                |
| LARS       | TEST       | ?          | ?          |                    0 |                8 |

2 rows selected (24.137 msec)


As we see, even without ever having run an UPDATE STATISTICS, there are two entries in the table OPTIMIZERINFORMATION (if you check the documentation, you'll find that OPTIMIZERSTATISTICS is nowadays deprecated).

We have one line for the table (you'll notice the NULL value '?' in the field COLUMNNAME) and one for the primary key column KCOL1. Unfortunately, the values we see for DISTINCTVALUECOUNT and CALCULATEDSIZE are pretty much nonsense, as we know from the first query.

ATTENTION: here you can see why it is such a big problem for the optimizer if the statistics do not fit reality. Although MaxDB uses statistics only for join statements, this table would be estimated to be very small...

## Let's gather some stats:

sqlcli DB760=> update statistics test;

0 rows affected (81.886 msec)

sqlcli DB760=> select * from optimizerinformation where tablename='TEST' and schemaname=user;
| SCHEMANAME  | TABLENAME  | COLUMNNAME  | INDEXNAME  | DISTINCTVALUECOUNT       | CALCULATEDSIZE     |
| ----------- | ---------- | ----------- | ---------- | ------------------------ | ------------------ |
| LARS        | TEST       | KCOL1       | ?          |                    10000 | ?                  |
| LARS        | TEST       | ?           | ?          |                    10000 |                360 |

2 rows selected (24.013 msec)


This looks much better, now!

But what about the other columns? For all non-indexed, non-primary-key columns, you have to specify explicitly that you want to gather statistics on them. Usually this is not necessary, as such columns won't ever be used for a join transition:

sqlcli DB760=> update statistics column (*) for test;
0 rows affected (140.526 msec)

sqlcli DB760=> select * from optimizerinformation where tablename='TEST' and schemaname=user;
| SCHEMANAME | TABLENAME  | COLUMNNAME | INDEXNAME | DISTINCTVALUECOUNT | CALCULATEDSIZE  |
| ---------- | ---------- | ---------- | --------- | ------------------ | --------------- |
| LARS       | TEST       | KCOL1      | ?         |              10000 | ?               |
| LARS       | TEST       | NKCOL2     | ?         |                100 | ?               |
| LARS       | TEST       | NKCOL3     | ?         |              10000 | ?               |
| LARS       | TEST       | ?          | ?         |              10000 |             360 |

4 rows selected (24.795 msec)


As we see, OPTIMIZERINFORMATION does indeed show the different results of an UPDATE STATISTICS command - you just have to know where to look.

regards,

Lars

• Former Member
Oct 30, 2008 at 06:40 AM

I performed the following experiment -

Create 2 tables t1(c1,c2) and t2(c1,c2) (c1 is the key in both tables).

I filled 1 million records into t1 so that c1 = 1, 2, ..., 1000000 and c2 = c1 mod 14. Thus column c2 has 14 values, all occurring with the same frequency in table t1.

I filled 1 million records into t2 so that c1 = 1, 2, ..., 1000000 and c2=0 for the first 100 records, c2=1 for the next 200 records, c2=2 for the next 400 records, and so on. As a result, column c2 also has 14 values, but with vastly differing frequencies of occurrence in table t2.

Now, when I executed an EXPLAIN JOIN on the 2 queries -

Q1:
select * from t1, t1 t3 where t1.c1=t3.c1 and t1.c2>6;

Q2:
select * from t2, t2 t4 where t2.c1=t4.c1 and t2.c2>6;


I got very similar cost estimates (see below)

explain join (result) select * from t1, t1 t3 where t1.c1=t3.c1 and t1.c2>6
| STRATEGY                                 | TABLENAME                        | ACCUMULATED_COSTS          |
| ---------------------------------------- | -------------------------------- | -------------------------- |
|                                          | T1                               |                       4304 |
| JOIN VIA SINGLE KEY                      | T3                               |                       8608 |

explain join (result) select * from t2, t2 t4 where t2.c1=t4.c1 and t2.c2>6
| STRATEGY                                 | TABLENAME                        | ACCUMULATED_COSTS          |
| ---------------------------------------- | -------------------------------- | -------------------------- |
|                                          | T2                               |                       4522 |
| JOIN VIA SINGLE KEY                      | T4                               |                       9044 |


(The other columns, omitted here due to space constraints, also had almost identical values.)

I feel the optimizer should ideally have estimated the cost of Q2 to be much higher than that of Q1.
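The true cardinalities behind that expectation can be checked with a little arithmetic (plain Python for illustration; the bucket sizes come from the data distribution described above):

```python
# Rows satisfying c2 > 6 in each table, from the data distributions above.
total = 1_000_000

# t1: 14 equally frequent values, 7 of which (7..13) pass c2 > 6.
q1_rows = 7 * (total // 14)

# t2: buckets 7..13 pass; sizes follow from the bucket boundaries.
bucket = {7: 12800, 8: 25600, 9: 48900, 10: 100000,
          11: 200000, 12: 400000, 13: 200000}
q2_rows = sum(bucket.values())

print(q1_rows, q2_rows)   # 499996 987300
```

So c2>6 qualifies roughly half of t1, but nearly all of t2 - which is why one might expect very different cost estimates.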

Based on this, I am tempted to conclude that the MaxDB optimizer either (a) does not consider the distribution of values, or (b) is not provided this data by the statistics and is therefore helpless anyway.

Is the experiment correctly designed and is the conclusion valid? Please comment.

Aside: I also tried other queries with these tables - notably 3-way joins. In each case, when I expected the optimizer to pick a particular least-cost join order, it did not, and made a naïve choice instead (as was evident in the EXPLAIN SEQUENCE output).

Regards,

Maximus

P.S.- Thanks once again Lars for the tips. I did try to read most of the relevant documentation, as you suggested.

• Hi Maximus,

Let us know:

- database version

- parameter setup

- exact DDL/DML used to produce the testcase

You did not mention that you gathered statistics after the dataload.

If you did it anyhow, then what was the exact command used for that?

As already stated, MaxDB does not store histograms or data-distribution information, but samples the B*Trees during strategy evaluation.

Concerning the execution plans: both join orders are correct here. So even if the displayed ACCUMULATED_COSTS do not match your expectation, the right plan is still chosen.

The statement and plan for the wrong join order are missing in your post - so currently it's just your impression...

regards,

Lars

• Former Member
Nov 05, 2008 at 07:31 AM

I tried again after building the indexes. Now it gives the right join order, which answers my question.

Thanks a lot, Lars, for the help!

Warm regards,

Maximus