
SAP HANA vs. MSSQL on Amazon AWS

Former Member
0 Kudos

Hi,

I'm testing the performance of HANA vs. MSSQL on Amazon AWS.

I have a little testing application written in VB.NET that connects to HANA through ODBC (.NET System.Data.Odbc) and to MSSQL through SqlClient (.NET System.Data.SqlClient).

I'm using 3 machines on AWS:

AppSrv:    t1.micro          (Application)

HANA:     m2.4xlarge      (HANA DB)

MSSQL:   m2.4xlarge     (MSSQL DB)


On both HANA and MSSQL I have a table called "FORSELECTER" (ROWSTORE) with 1 million rows.

Here's the structure of the table:

Name          SQL Data Type   Dim
CODE          NVARCHAR        4000
DESCRIPTION   NVARCHAR        4000
PRICE         INTEGER
SCP0          NVARCHAR        4000
SCP1          NVARCHAR        4000
...           NVARCHAR        4000
...           NVARCHAR        4000
SCP30         NVARCHAR        4000

No indexes have been used.

Doing a "SELECT TOP 1 *" would give you:

CODE      DESCRIPTION   PRICE     SCP0   SCP1   SCP..   SCP...   SCP30
TC43440   TD43440       2728999   XXX    XXX    XXX     XXX      XXX

Taking some code from my testing application:

Here are the objects I use for connecting to the DB (HANA or MSSQL)

...

   Private connection As System.Data.Common.DbConnection = Nothing

   Private command As System.Data.Common.DbCommand = Nothing

   Private reader As System.Data.Common.DbDataReader

...

I initialize them differently depending on the DB I'm working with at that moment:

...

      Select Case mUsedDbAPI

         Case DbAPIEnum.Odbc

            '

            connection = New System.Data.Odbc.OdbcConnection(sConStrHANA)

            command = New System.Data.Odbc.OdbcCommand()

         Case DbAPIEnum.SqlClient

            connection = New System.Data.SqlClient.SqlConnection(sConStrMSSQL)

            command = New System.Data.SqlClient.SqlCommand()

      End Select

...

For testing purposes I also tried not to use inheritance:

...

   Private hConnection As System.Data.Odbc.OdbcConnection = Nothing

   Private hCommand As System.Data.Odbc.OdbcCommand = Nothing

...

      hConnection = New System.Data.Odbc.OdbcConnection(sConStrHANA)

      hCommand = New System.Data.Odbc.OdbcCommand

Attached you'll find a text file (results.log) with some SQL statements and, for each, the rows affected (Rows), the time in seconds (s) and the speed (Rows/s). Copy-pasting from that file:

**************************

Odbc (this means I'm connecting to HANA using ODBC with inheritance)

SELECT *

FROM FORSELECTER

WHERE  SUBSTRING(CODE, 3, 4) = '3991' (111 Rows) (0.1716022 s) (646.844853970404 Rows/s)

**************************

SqlClient (this means I'm connecting to MSSQL using SqlClient)

SELECT *

FROM FORSELECTER

WHERE  SUBSTRING(CODE, 3, 4) = '3991' (111 Rows) (0.0312004 s) (3557.64669683722 Rows/s)

**************************

HANA SPECIFIC (this means I'm connecting to HANA , NO inheritance)

SELECT *

FROM FORSELECTER

WHERE  SUBSTRING(CODE, 3, 4) = '3991' (111 Rows) (0.1248184 s) (889.291963364376 Rows/s)

As you can see in these and the other examples in the attached file, MSSQL is faster than HANA.

Since HANA works in memory and MSSQL works on disk (I avoided localized selects, cleaned the buffer, etc.), there must be something I am missing.

Any clues? Maybe some settings or other things?

Accepted Solutions (0)

Answers (4)


Former Member
0 Kudos

This is a very poor use case for comparison. The reason that column stores are generally faster than row stores is because, in the real world, SELECT * operations are not very common. If you are executing a SELECT *, then it will perform better on a row store than a column store. In a row store, all columns in a row (SELECT *) can be read with one I/O operation - not so with a column store. However, in reality, we find it's common to have tables with, for example, 100 columns, and you're only selecting 5 of them. So, when the number of columns selected is comparatively low, and the number of rows selected is higher, this is when a column store really shines.
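For illustration, this is roughly the shape of query where the column store pays off (just a sketch against your FORSELECTER table; the filter value is made up). Only two of the ~33 columns have to be touched, so far less data is read than a row store would need:

   SELECT CODE, PRICE
   FROM FORSELECTER
   WHERE PRICE > 1000000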

As for the in-memory portion, a simple read from a fully buffered table in a single-user database will not show true-to-life performance of a disk-based database. So, in essence, you're comparing the best case for SQL Server vs the worst case for HANA.

Cheers,

David.

Former Member
0 Kudos

I have to disagree with David here. There are no "poor use cases for comparison" between row and column store performance.

The reality is that HANA must aim to perform SQL at least as well as a disk-based row store, even when a table is set up as a column store. A custom application shouldn't be forced to change its SQL to get to comparable performance on a column store. It's the same for ERP on HANA, where we can't completely rewrite ERP to get to comparable performance.

Sure, the "SELECT *" puts the column store at a natural disadvantage, especially for a large number of columns, but that's SAP's problem to fix and not that of the user. The column store should always be the default, except for special cases like customization tables that don't change a lot and for which no analytics are required.

The value of HANA will come from analytical applications where you can expect significant performance improvements on the column store, not just comparable OLTP performance. As some have said here, even conventional disk-based database systems often satisfy requests from their in-memory buffers, so they are fast too. What conventional disk-based database systems can't do is perform parallelized aggregations of millions of records inside the database. But for that, it is often necessary to rewrite some of the logic and push it down to the database layer using SQL Script, giving up logic in Visual Basic or Java.
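As a purely illustrative sketch (procedure name, parameter name and grouping column are just examples), such a push-down in SQL Script could look roughly like this:

   CREATE PROCEDURE PRICE_TOTALS (OUT totals TABLE (SCP0 NVARCHAR(4000), TOTAL_PRICE BIGINT))
   LANGUAGE SQLSCRIPT READS SQL DATA AS
   BEGIN
      -- the aggregation runs inside HANA instead of row-by-row in the VB.NET client
      totals = SELECT SCP0, SUM(PRICE) AS TOTAL_PRICE
               FROM FORSELECTER
               GROUP BY SCP0;
   END;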

Users shouldn't expect breakthrough performance on the column store if you do a "SELECT *". Worst case, HANA may in some circumstances actually not be faster than a conventional disk-based system. Sure, you use an artificial number of wide columns, which would probably not be found in a good database design, but I guess you wanted to test the hypothesis that HANA can't be faster in these circumstances. However, in this case the difference seems a little too large (170 ms vs. 30 ms) for a relatively simple query. It could be an issue with SUBSTRING preventing the query from parallelizing, as Daniel suggests, or an issue in the ODBC driver. Or it may indeed be that there is optimization potential for the large number of wide columns you have used.

Have you executed the query in HANA Studio, bypassing ODBC, to see whether the runtime looks any better? I also like Ravi's tests 3 and 4. I'm not sure whether indexes will make a big difference; I was told that even our largest customer projects have not required indexes, simply because the in-memory performance is so good.

Cheers,

Michael

Former Member
0 Kudos

Well, I must point out I'm not comparing the two RDBMS "in general". I am comparing them in a specific situation in which selecting all the columns in a table is heavily used, and I'm quite sure I'm working in the real world.

I am evaluating HANA to see if it has features that can improve our business, and so far the row store is what I have tested, as SELECT * is what we mainly do.

I'm still testing though; as I said, I will post all my results here.

Former Member
0 Kudos

Hi David,

Fully agree that you are indeed comparing apples to apples. HANA, row store, fully in memory vs. MSSQL, row store, fully on disk. And the results don't look too encouraging for HANA for this particular scenario.

Two things to keep in mind:

  • HANA is not just another apple (read: just another relational database). It shines for some usage scenarios (like the aggregation functions mentioned by Ravindra; try a SUM on your PRICE column combined with a GROUP BY on any other column, of course using the column store in HANA - see the sketch after this list), but does not provide big improvements for others. You happened to test one (SELECT * in combination with row store) that is not a particular strength of HANA. Daniel, David and Ravindra pointed that out already.
  • The AWS deployment option is a low-cost option for developers. It doesn't make use of many of the low-level tweaks HANA uses to make the best use of the hardware. CPU caches, parallel processing, ... - everything is a little bit crippled on AWS, especially on those ec2.* instances (the cc2 instances for HANA One use a bare-metal hypervisor that lets HANA talk much more directly to the hardware; I would expect the results to be better for all tests you might be doing).
    At SAP, we are aware of the performance implications of the AWS deployment - and that's why the EULA (End-User License Agreement) you had to accept to get access to those instances explicitly mentions that "You may not ... (c) publish any results of benchmark tests run on the Software or involving the Software,".
    I'm not telling you to "stop what you are doing, it violates your license" - I'm just asking you to take the results with the appropriate amount of salt.
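The aggregation test from the first bullet would, as a rough sketch, be something like this (first converting the table to column store; SCP0 is just an example grouping column):

   ALTER TABLE FORSELECTER ALTER TYPE COLUMN;

   SELECT SCP0, SUM(PRICE)
   FROM FORSELECTER
   GROUP BY SCP0;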

cheers

--Juergen

Former Member
0 Kudos

Oops... I didn't really read the whole EULA... thank you for pointing that out.

I'm testing cc2 instances as well, but it seems I cannot publish those results then.

Though what I've published before is not meant to give others a comparison, but to ask whether I forgot something like a setting, a command for enabling some kind of performance uplift, etc.

Thank you for your answer,

Cheers,

David.

Former Member
0 Kudos

Hi David,

I didn't want to scare you away. Feel free to continue with your tests and feel free to publish your results here. Nobody will go after you (maybe don't mention "benchmark"), and the information you generate is indeed helpful. When you experiment more with HANA, you will see that the way you ask your question can have a dramatic impact on response times...

--Juergen

former_member184768
Active Contributor
0 Kudos

Hi David,

In my opinion, SELECT * in a traditional database would be easier (and faster), as the entire block is read into the buffer. But if the row length exceeds the contiguous blocks, leading to row chaining / row migration, then performance will suffer.

In HANA, constructing the row for SELECT * might go through a similar operation to row chaining, where the row could be assembled by reading from memory locations that may not be contiguous due to the columnar storage. I am not completely sure how the data is actually written to memory blocks in HANA; maybe somebody from SAP can help us understand this.

Regarding indexes on HANA, my query performance test cases have shown improvements. SAP also recommends creating indexes if the search columns are not part of the primary key, and there is a default index on the primary key columns. So all in all, an index should not be a bad idea.
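For reference, a plain secondary index in HANA is created with standard SQL, for example (index name chosen arbitrarily here):

   CREATE INDEX IDX_FORSELECTER_CODE ON FORSELECTER (CODE);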

This discussion is really good, as it also raises the question of whether HANA can be used for traditional OLTP applications. I am sure there are use cases for that, but considering the cost, I think the usage may not be as wide as for existing databases.

Regards,

Ravi

Former Member
0 Kudos

Hi Juergen,

sorry, but I don't understand how HANA row store vs. SQL Server row store can be a good comparison. The default - unless it's absolutely clear that there will be no aggregates on the table - should be the column store for HANA. And as Daniel has pointed out, only the HANA column store will provide the performance, thanks to parallelizing the queries. It's SAP's job to compensate for the natural overhead of "SELECT *" on the column store. It should also be understood that there are use cases where we just can't do a good enough job and won't get to performance comparable to a disk-based row store. Why we get worse performance in this particular case, where only 111 of a million records are selected, is the real question here.

Cheers,

Michael

Former Member
0 Kudos

Hi David (Mandujano),

can you say more about the specific use case? I just find it odd to have a table with what looks like 30 placeholders of 4000 characters each. And why do you have a lookup on a substring?

The whole thing looks like a generic highly de-normalized database schema, so I'm sure you have a reason to design it that way. Just curious ...

Thanks,

Michael

Former Member
0 Kudos

Well, the DB we use is denormalized on purpose to speed up data retrieval.

Actually, the real tables we work on have Y*X columns, Y being the granularity of detail and X the number of different data items we are interested in. We lay them out this way so that we get all the requested data with fewer I/Os.

The queries here are just the very first testing ones. I wanted to break out of the index (CODE is the primary key), which is why I used the SUBSTRING: to see how a table scan was performing.
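In other words, a lookup like the first statement below should be able to use the primary key on CODE, while the second one (the tested case) forces a full scan because the predicate is a function of CODE (the CODE value is taken from the sample row above):

   SELECT * FROM FORSELECTER WHERE CODE = 'TC43440'

   SELECT * FROM FORSELECTER WHERE SUBSTRING(CODE, 3, 4) = '3991'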

Former Member
0 Kudos

Hi David,

ok, I think I understand your objective. A HANA column store table with parallelized queries against a long table scan on a disk-based row store is quite an interesting comparison.

If I calculate correctly, your database is about 33 fields x 4000 characters x 1m records, so about 132 GBytes. Sounds about right? HANA compression will probably bring it down significantly, but I guess you are pushing against the 60 GB limit of your Amazon instance.

In the work that we have done with startups, a lot of companies were quite happy with AWS, but some have preferred physical non-virtualized HANA instances because of large data sets. Obviously, for multi-core parallelized scans HANA uses awareness of the underlying hardware for optimizations, for example size of hash tables aligned with processor caches. The virtualization layer of Amazon tries to emulate the hardware, but obviously can't do a perfect job here.

Whether this is a situation where you push against today's architectural limits of Amazon virtualization technology is hard to say. It's certainly possible that a disk-based long table scan performs better because HANA just can't utilize the hardware optimizations like on a real physical non-virtualized HANA instance.

As I said, the best is to gradually reduce possible issues, and I would start with executing the query in HANA Studio. I doubt that ODBC performance is an issue, but it's worth checking. Next, we would have to look at the query performance itself. Daniel is very familiar with query performance optimizations, so he may be able to help more.

Thanks,

Michael

danielculp
Explorer
0 Kudos

Hi David,

In your concrete case column store should actually perform better, as it is well optimized for executing simple table scans in parallel.

Best Regards

Daniel

former_member184768
Active Contributor
0 Kudos

Hi David,

Can I recommend / ask for clarification on a few things:

1) Do you have a primary key defined for the table, and does the WHERE clause use the primary key columns? This information is needed to check whether the primary key index is available in HANA and MSSQL and how it performs.

2) In earlier versions of traditional databases, if a function was applied to the primary key column in the WHERE clause, the index search was skipped. I am not sure, but that may also be the case in HANA.

3) Can you please try some of the aggregation functions, like count(column_name), min(), max() or sum(), and compare the performance (see the sketch below)?

4) In my opinion, constructing the row, or decompressing the column store to form a row, might be an expensive operation. But the search on a particular column should be fast, provided it is indexed or the search is performed without any data conversion / data pattern search.

If you can try some of the tests mentioned above, it would be helpful for all of us.
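For point 3, a minimal sketch of the kind of statements meant (any column can be substituted):

   SELECT COUNT(CODE) FROM FORSELECTER

   SELECT MIN(PRICE), MAX(PRICE), SUM(PRICE) FROM FORSELECTER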

Regards,

Ravi

Former Member
0 Kudos

Hi Ravi,

1), 2) There is no primary key or index in this first run of tests. I will of course try with them, but in this first run I wanted to see how the two RDBMS dealt with table scans.

3) 4) Will do.

Thank you very much for your answer, I'll post the next test results.

Former Member
0 Kudos

Well - all databases work in memory. But that's a longer story.

It's a row store - I'd only consider using these in HANA for a small subset of tables. HANA really flies with the column store.

Suggest you try comparing the column store in SQL Server 2012 with HANA. I ran some tests of this against IQ. IQ won hands down.

Former Member
0 Kudos

Thank you for your answer

HANA is not just a columnar DB, it's also an in-memory DB, so all the data in users' tables is held in memory. Data is saved by the persistence layer at fixed points in time.

MSSQL, talking about users' tables, holds in RAM the "recent" data requested and the "nearby" data. If I clear the MSSQL cache by running DBCC DROPCLEANBUFFERS, I know that the next query will fetch data from disk.
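For reference, the standard way to force the next read from disk is along these lines (running a CHECKPOINT first writes out dirty pages so the clean buffers can actually be dropped):

   CHECKPOINT
   DBCC DROPCLEANBUFFERS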

I will certainly give the column store a try, but the column store is not always the best choice; it depends on the data you need to extract (all columns, few columns, 90% or 10% of the rows in a table, etc.).

former_member184768
Active Contributor
0 Kudos

Hi David,

My two cents:

The in-memory behavior of HANA is similar to the database buffer cache of any other database. The data is available in main memory, but after a restart of the HANA DB only the objects that are queried are loaded into main memory from persistent storage. Tables can also be "unloaded" from main memory.
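For column store tables this can even be done explicitly with SQL, for example (using FORSELECTER just as an example name, assuming a column store copy of it):

   LOAD FORSELECTER ALL
   UNLOAD FORSELECTER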

Row store or column store for an object depends on the use case. I think OLTP applications should use the row store, as the number of inserts/updates is very high. Also, the query access pattern of OLTP applications (a small number of records but ALL columns) is more suitable for row store objects.

Data warehousing applications, with bulk writes and massive reads on selective columns, are better suited to the column store.

Hence the comparison between different databases should be done with the same storage type, to have a level playing field.

Regards,

Ravi