Former Member

Overall question concerning the concept of column-oriented databases like HANA

Hello everybody

I am currently learning the concepts of column-store databases like HANA from scratch, and there is one point I don't understand. There is the "dictionary" with the distinct list of values and their corresponding integers, and then there is the so-called "attribute vector" in which each position of the tuple (or row) is assigned one of these integers. So far so good.
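
To check that I have these two structures right, here is roughly how I picture them (just a toy sketch in Python, not the actual HANA internals; the column values are made up):

surnames = ["Miller", "Smith", "Miller", "Baker", "Smith"]

# Dictionary: sorted distinct values; the value-id is the position in this list.
dictionary = sorted(set(surnames))          # ['Baker', 'Miller', 'Smith']

# Attribute vector: for every row position, store the value-id instead of the string.
attribute_vector = [dictionary.index(v) for v in surnames]   # [1, 2, 1, 0, 2]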

I do understand that the dictionary may be sorted by the value (e.g. the string surname, forename etc.). If someone does a full table scan on the surname, the dictionary can be searched using binary search, which returns the corresponding integer for that surname. So far so good.
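
So the first step of the lookup would be something like this (again only a sketch, using Python's bisect for the binary search):

from bisect import bisect_left

def value_id_for(value, dictionary):
    # Binary-search the sorted dictionary and return the value-id, or None if absent.
    pos = bisect_left(dictionary, value)
    if pos < len(dictionary) and dictionary[pos] == value:
        return pos
    return None

value_id_for("Smith", ["Baker", "Miller", "Smith"])   # -> 2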

We now look the integer up in the attribute vector to determine the position of the tuple/row. And this is what I don't understand:

Is the attribute vector also sorted by the integer so that binary search can be used? If it is not sorted by this integer, this would again be a full table/column scan, wouldn't it? Or is the whole column searched from top to bottom, and the performance gain comes only from the reduced size of the column compared to the whole table in row-based databases?

And please don't give me a link to Open HPI, I am already working through that.

Thanks


1 Answer

  • Jan 12, 2017 at 03:46 PM

    Hi there

    The attribute vector indeed is searched sequentially, so it is effectively a full column scan (this might be different when there is additional compression on the attribute vector, so all of this discussion is on a rather conceptual level and the actual HANA implementation may differ).
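
    Conceptually, that second step is just a linear pass over the attribute vector, roughly like this little Python sketch (purely illustrative, not HANA code; the names are made up):

    def rows_matching(attribute_vector, value_id):
        # Sequentially scan the attribute vector and collect matching row positions.
        return [row for row, vid in enumerate(attribute_vector) if vid == value_id]

    rows_matching([1, 2, 1, 0, 2], 2)   # -> [1, 4], the rows whose value-id is 2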

    Now, why is this full scan of the attribute vector still faster? There are several things that help here:

    • Since we now work on a single column instead of a table cache page, the DB does not need to 'peel out' the interesting column and ignore the remaining data. So a lot more relevant data can be fetched from RAM into the CPU cache at once.
      Since the RAM-to-CPU-cache transfer is a major runtime contributor, the effect of this is quite big.
    • Since the data in the CPU cache is simply a long array of integers and not the actual data values, it is possible to use SIMD (single instruction, multiple data) instructions to find matching entries. These instructions allow the CPU to process arrays of data in parallel instead of looping over single entries.
      So, instead of looping over single records, the column store can loop over large chunks of the attribute vector (see the sketch right after this list).
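
    As an illustration of that 'chunk at a time' idea, here is a rough analogue using NumPy, where one vectorized comparison stands in for the SIMD compare over a chunk of the column (this is just an analogy, not how HANA's scan code actually looks):

    import numpy as np

    # Attribute vector as a compact integer array (value-ids, not the actual strings).
    attribute_vector = np.array([1, 2, 1, 0, 2, 1, 2], dtype=np.int32)

    # One vectorized comparison checks many entries 'at once', similar in spirit
    # to a SIMD compare on a chunk of the attribute vector.
    matches = attribute_vector == 2          # boolean mask over the whole column
    matching_rows = np.flatnonzero(matches)  # -> array([1, 4, 6])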

    This approach looks a lot like a 'brute force' search, and it is, but it is very well supported at the CPU level, and several CPU cores can work on different columns in parallel, making queries with multiple conditions even faster.

    Hope that answers your question.
