Compare Timestamps of rows of the same table

Former Member

Hello everybody,

I'm currently working on my master's thesis, whose topic is "big data technologies for fraud detection".

I want to build a system which consists of Hadoop and SAP HANA. Hadoop has to import, parse, and persist the logfiles; the result of this process is one huge logfile, which is eventually exported to the SAP HANA appliance.

And here is my problem.

The format of the logfile looks like this: Date, Time, User, IP, Timestamp

I want to figure out which user changed his IP address in less than 5 seconds. I tried to use a calculation view and to build a semi-cross product by joining the logfile with itself on username. The problem with this join is that it returns every pair twice, as (A | B) and (B | A), because both orientations match the condition (username = username AND ip != ip).

So how can I get only the distinct pairs of the cross product?

Regards,

Suerte

Accepted Solutions (1)

lbreddemann
Active Contributor

Hi Felix,

How about this approach?

create column table log (mydate date, mytime time, user_name varchar(20), ip varchar(15), mytimestamp timestamp)

insert into log values (current_date, current_time, 'LARS', '192.158.123.4', current_timestamp)

...

insert into log values (current_date, current_time, 'PAUL', '192.158.123.42', current_timestamp)

...

select * from log

/*

MYDATE    |MYTIME  |USER_NAME|IP            |MYTIMESTAMP

2014-02-18|16:12:09|LARS     |192.158.123.4 |2014-02-18 16:12:09.923

2014-02-18|16:12:13|LARS     |192.158.123.4 |2014-02-18 16:12:13.968

2014-02-18|16:12:17|LARS     |192.158.123.4 |2014-02-18 16:12:17.064

2014-02-18|16:12:28|PAUL     |192.158.123.4 |2014-02-18 16:12:28.062

2014-02-18|16:12:31|PAUL     |192.158.123.42|2014-02-18 16:12:31.899

*/

select *, seconds_between(a.mytimestamp, b.mytimestamp) as dist
from log a left outer join log b
  on a.user_name = b.user_name
where a.mytimestamp > b.mytimestamp
  and abs(seconds_between(a.mytimestamp, b.mytimestamp)) >= 2
  and a.ip <> b.ip

/*

MYDATE    |MYTIME  |USER_NAME|IP            |MYTIMESTAMP            |MYDATE    |MYTIME  |USER_NAME|IP           |MYTIMESTAMP            |DIST

2014-02-18|16:12:31|PAUL     |192.158.123.42|2014-02-18 16:12:31.899|2014-02-18|16:12:28|PAUL     |192.158.123.4|2014-02-18 16:12:28.062|-3

*/
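As a quick way to sanity-check this self-join logic outside HANA, here is a small SQLite sketch in Python (the table and data mirror the sample rows above; `julianday()` differences stand in for HANA's SECONDS_BETWEEN, which SQLite doesn't have):

```python
import sqlite3

# In-memory stand-in for the HANA "log" table, loaded with the sample rows above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE log (user_name TEXT, ip TEXT, mytimestamp TEXT)")
con.executemany("INSERT INTO log VALUES (?, ?, ?)", [
    ("LARS", "192.158.123.4",  "2014-02-18 16:12:09.923"),
    ("LARS", "192.158.123.4",  "2014-02-18 16:12:13.968"),
    ("LARS", "192.158.123.4",  "2014-02-18 16:12:17.064"),
    ("PAUL", "192.158.123.4",  "2014-02-18 16:12:28.062"),
    ("PAUL", "192.158.123.42", "2014-02-18 16:12:31.899"),
])

# The asymmetric condition a.mytimestamp > b.mytimestamp keeps only one
# orientation of each pair, so (A|B) and (B|A) are not both returned.
pairs = con.execute("""
    SELECT a.user_name, a.ip, b.ip,
           (julianday(a.mytimestamp) - julianday(b.mytimestamp)) * 86400 AS dist
    FROM log a LEFT OUTER JOIN log b
      ON a.user_name = b.user_name
    WHERE a.mytimestamp > b.mytimestamp
      AND ABS((julianday(a.mytimestamp) - julianday(b.mytimestamp)) * 86400) >= 2
      AND a.ip <> b.ip
""").fetchall()

for user, ip_a, ip_b, dist in pairs:
    print(user, ip_b, "->", ip_a, round(dist, 3))  # only PAUL's IP change shows up
```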


Not sure if this is what you are looking for, but it seems to work reasonably well.


- Lars

Former Member

Hi Lars,

This is a good suggestion, but I want to use mostly CE functions. Currently I'm working with a calculation view via the GUI. I thought there might be a way to get this done faster by using CE functions in a scripted view.

Do you maybe have an idea how to solve this problem with CE functions?

Regards,

Felix

lbreddemann
Active Contributor

What's the reason for using CE functions only?

Former Member

Because I want to prevent SAP HANA from jumping to the SQL optimizer and thereby losing performance.

But maybe the size of my data will prevent me from building a cross product of my table. In the end, the table will contain around 10 million entries. I think a self-join on username will increase the number of rows quadratically.

So maybe there is another way to process the data.

Is there a possibility to access the next row, to compare the username and the timestamp?

Regards,
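As a rough check on the growth concern above (a sketch with made-up user distributions): a self-join on username produces on the order of n² candidate pairs per user (quadratic rather than exponential growth), so the total depends heavily on how the 10 million entries are spread across users:

```python
# Rough size estimate for a self-join on user_name: a user with n rows
# contributes n * (n - 1) / 2 pairs once mirrored duplicates are removed.
def self_join_pairs(rows_per_user):
    return sum(n * (n - 1) // 2 for n in rows_per_user.values())

# Hypothetical distribution: 10,000 users with 1,000 rows each = 10M rows total.
uniform = {f"user{i}": 1_000 for i in range(10_000)}
print(self_join_pairs(uniform))   # 4,995,000,000 - about 5 billion pairs

# The same 10M rows concentrated in 10 heavy users is far worse.
skewed = {f"user{i}": 1_000_000 for i in range(10)}
print(self_join_pairs(skewed))    # 4,999,995,000,000 - about 5 trillion pairs
```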

lbreddemann
Active Contributor

Sorry Felix, but the notion of "SQL optimizer = slow performance" is plain wrong.

If you check the execution plan for the statement I gave you, you'll see that all processing happens in the column store.

For things like "next row" you would typically look at window functions which are most often executed by the row engine (and yes, that might slow you down then).

As always: before making assumptions on performance and system behavior, just check what's actually happening.

SAP HANA is constantly evolving and what has been true three revisions ago might be false now.

- Lars

Former Member

Hi Lars,

In future, I'll first have a look at the execution plan before talking about performance.

Your SQL statement works fine, and I get back exactly what I modelled into my synthetically generated logfiles. But maybe the window function will be the best solution. Could you please give me a solution, or a hint how this function could be used to solve my problem?

regards,

lbreddemann
Active Contributor

Well, try it out and let us know how you proceed.

The SQL using Window functions could look like this:


select * from (
  select mydate, mytime, user_name, ip, mytimestamp,
         abs(seconds_between(mytimestamp,
             lag(mytimestamp, 1) over (partition by user_name order by mytimestamp asc))) as diff
  from log
  order by user_name, mytimestamp asc)
where diff >= 2

Good luck with that.

- Lars

Former Member

Nice one Lars,

Thank you very much!

This helps me very much. I only added the second IP (another LAG) to the subquery, so that I can check IP <> IP2 in the outer select.

This is what it looks like in the end:


SELECT * FROM (
  SELECT "DATE", "TIME", "NAME", "ORIGINALCLIENTIP", "TIMESTAMP",
         ABS(SECONDS_BETWEEN("TIMESTAMP",
             LAG("TIMESTAMP", 1) OVER (PARTITION BY "NAME" ORDER BY "TIMESTAMP" ASC))) AS diff,
         LAG("ORIGINALCLIENTIP", 1) OVER (PARTITION BY "NAME" ORDER BY "TIMESTAMP" ASC) AS "ORIGINALCLIENTIP2"
  FROM "FXBR"."LOGS"
  )
WHERE diff < 5
  AND "ORIGINALCLIENTIP" != '-'
  AND "ORIGINALCLIENTIP2" != '-'
  AND "ORIGINALCLIENTIP" <> "ORIGINALCLIENTIP2"

Thanks!!
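This LAG-based approach can also be sanity-checked outside HANA with a small SQLite sketch in Python (table name and data are made up; SQLite >= 3.25 supports window functions, PARTITION BY keeps LAG from comparing rows across different users, and `julianday()` differences replace SECONDS_BETWEEN):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE logs (name TEXT, clientip TEXT, ts TEXT)")
con.executemany("INSERT INTO logs VALUES (?, ?, ?)", [
    ("FELIX", "10.0.0.1", "2014-02-18 16:12:09.000"),
    ("FELIX", "10.0.0.2", "2014-02-18 16:12:12.000"),  # IP change after 3 s
    ("FELIX", "10.0.0.2", "2014-02-18 16:12:40.000"),  # same IP, no change
    ("PAUL",  "10.0.0.9", "2014-02-18 16:12:13.000"),  # other user's first row
])

# LAG ... OVER (PARTITION BY name ORDER BY ts) fetches the same user's
# previous row; each user's first row gets NULL and drops out of the filter.
hits = con.execute("""
    SELECT * FROM (
        SELECT name, clientip, ts,
               ABS((julianday(ts) - julianday(
                    LAG(ts, 1) OVER (PARTITION BY name ORDER BY ts))) * 86400) AS diff,
               LAG(clientip, 1) OVER (PARTITION BY name ORDER BY ts) AS prev_ip
        FROM logs
    )
    WHERE diff < 5 AND clientip <> prev_ip
""").fetchall()

for name, ip, ts, diff, prev_ip in hits:
    print(f"{name}: {prev_ip} -> {ip} after {diff:.0f} s")
```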

Former Member

In my opinion it's still fair to follow the "old adage" of not mixing CE functions and SQL syntax...

From my experience it's a mixed bag - depending on complexity, data volumes, etc. - sometimes mixing slows things down, and other times it's almost the same performance as pure CE functions or pure SQL. So, as a governing rule: generally avoid mixing, due to the uncertain performance impact.

In any case, back to the original problem, Felix - I believe you could still accomplish this using Lars' code, but with CE functions. You'd join on username via CE_JOIN (a bit of a cross join), and then in CE_PROJECTION you'd add filters that satisfy the additional join conditions.

A cross join sounds expensive, but I've found it to be a great trick for accomplishing rather odd logic with good performance (you can check out my document on efficient cross/theta joins using CE_JOIN, as well as Former Member's write-up on efficient time-based reporting, for examples of leveraging cross joins). Just a thought.

Links to those docs:

http://scn.sap.com/docs/DOC-41218

http://scn.sap.com/docs/DOC-50420

Former Member

Hi Jody Hesh,

I will have a look at both techniques.

Thanks for your advice.

lbreddemann
Active Contributor

Hi Jody

Fair point.

My angle towards the decision on the choice of development 'language' is a bit different here.

I tend to prefer solutions that are actually maintainable and I don't see that this is realistic with CE_-functions.

It just takes too much effort to get the syntax and table/field references all correct.

While in SQL you can - at least partly - express intention and what data you want to see, CE_-functions force you to tell the database what to do and how to do it.

And concerning performance: I believe (as in not tested recently) that mixing the operator classes still is not recommendable, agreed.

However, making performance decisions without measuring will lead to the wrong decision in 66% of cases (the other two outcomes being that performance got worse or that it stayed the same).

And that is why I asked about the reason for the CE_-operator usage.

- Lars

Former Member

Lars Breddemann wrote:

Hi Jody

Fair point.

My angle towards the decision on the choice of development 'language' is a bit different here.

I tend to prefer solutions that are actually maintainable and I don't see that this is realistic with CE_-functions.

It just takes too much effort to get the syntax and table/field references all correct.

This is a good point. I have a bad habit of referring to scripted CE functions, which indeed are less syntax-friendly than SQL. My intent, though, is to include graphical nodes in these concepts, which in my opinion are more maintainable than SQL. The case above (as with all cases except those requiring the almost-never-used CE_VERTICAL_UNION) could certainly be accomplished graphically. I'll try to be more conscientious about this in my comments!

While in SQL you can - at least partly - express intention and what data you want to see, CE_-functions force you to tell the database what to do and how to do it.

I wouldn't entirely agree. In both SQL and CE functions you express various operations like joins, aggregations, projections, etc. With SQL, every operation you specify is executed, albeit in some optimized fashion. With CE operators (graphically or functions), HANA will prune fields, joins, and branches where possible depending on the (resulting instantiated model which satisfies the) client query - which in my mind is more declarative in nature.

And concerning performance: I believe (as in not tested recently) that mixing the operator classes still is not recommendable, agreed.

However, making performance decisions without measuring will lead to the wrong decision in 66% of cases (the other two outcomes being that performance got worse or that it stayed the same).

And that is why I asked about the reason for the CE_-operator usage.

Sure. But given the graphical option, I'd say the right decision is made 66% of the time (i.e. the same performance with a more maintainable model is the better option).


lbreddemann
Active Contributor

Hi Felix

Something else: you store the IP4 address as a string.

You really shouldn't do this.

Instead invest a little effort into the loading of the data and your SQL coding and keep the 4 bytes separated (e.g. byte1, byte2, byte3, byte4).

That way you get much better compression, as each byte column can contain at most 256 distinct values in the worst case (which you won't even hit, due to the way IP addresses are allocated).

By storing the string you always store the combination of the bytes + three dots which really doesn't add anything to your query.

With this you end up with a much larger dictionary, and a value vector that compresses much worse due to less repetitions.

The smart thing to do here is to store the data in separate bytes. Just make sure to query the combination of bytes correctly, using the tuple notation:

... WHERE (y.byte1, y.byte2, y.byte3, y.byte4) = (x.byte1, x.byte2, x.byte3, x.byte4)

This way even opens the opportunity for a higher degree of parallelism as each byte column can be filtered independent from the others.

Without having tested this on a realistic data volume, I'd say: I would be surprised if this didn't improve the performance of your solution.
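For the load-time split described here, a minimal Python sketch (the function name and validation are my own, hypothetical loader additions) of turning the dotted string into four small integer values:

```python
def split_ipv4(ip):
    """Split a dotted IPv4 string into its four byte values,
    ready to be loaded into four separate small-integer columns."""
    parts = ip.split(".")
    if len(parts) != 4:
        raise ValueError(f"not an IPv4 address: {ip!r}")
    octets = tuple(int(p) for p in parts)
    if any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"octet out of range: {ip!r}")
    return octets

# Each resulting column holds at most 256 distinct values, so the
# column-store dictionary stays tiny and the value vector compresses well.
print(split_ipv4("192.158.123.42"))  # -> (192, 158, 123, 42)
```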

- Lars

Answers (0)