cancel
Showing results for 
Search instead for 
Did you mean: 

Data aggregation in blocks

0 Kudos

Hi,

I want to implement a solution where I have a data set like below -

I want to aggregate the data based on lat-long, if the same lat-long is sent in a row, these rows should get aggregated and return into a single row. Below is the output I am expecting -

However, the catch here is, if the lat-long combination is sent again at later point in time, it shouldn't get aggregated. It should have a new row for that(for eg. rows with s_no 1 and 5 have the same lat-long, but are not aggregated.

I implemented the solution using loops but since the data set is huge, the performance is bad.

Can anyone suggest a better solution for this(which doesn't include loops/cursors)?

Below are the sql queries to create the table and insert data -

create table dev_tab (id nvarchar(10), lat decimal(12,9), long decimal(12,9), timestamp timestamp);

insert into dev_tab (id,lat,long,timestamp) values ('g1',48.762438,9.186757,'2018-09-11 07:00:00');

insert into dev_tab (id,lat,long,timestamp) values ('g1',48.762438,9.186757,'2018-09-11 10:00:00');

insert into dev_tab (id,lat,long,timestamp) values ('g1',46.943718,8.999521,'2018-09-11 14:00:00');

insert into dev_tab (id,lat,long,timestamp) values ('g1',44.291991,0.938082,'2018-09-12 05:00:00');

insert into dev_tab (id,lat,long,timestamp) values ('g1',44.291991,0.938082,'2018-09-12 07:00:00');

insert into dev_tab (id,lat,long,timestamp) values ('g1',12.885067,77.553435,'2018-09-12 08:00:00');

insert into dev_tab (id,lat,long,timestamp) values ('g1',48.762438,9.186757,'2018-09-13 10:00:00');


Thanks in advance.

Regards,
Prachi

lbreddemann
Active Contributor
0 Kudos

Please provide the create table statement and the insert commands so that one can test solutions on your data set.

0 Kudos

Hi Lars,

Updated.

Thanks,

Prachi

Accepted Solutions (1)

Accepted Solutions (1)

lbreddemann
Active Contributor
0 Kudos

Thanks for putting in some test data; that makes it possible to work on the question.

What you have here is a classic SQL problem, often referred to as "islands and gaps problem".

I recommend reading up on this as there are several different approaches each with pros and cons. Google is your friend here or simply grab the Joe Celko all-time classic "SQL for smarties" (you should have this book if your job regularly involves writing SQL anyway).

For this question I simply went with the google-it approach and found this page: https://www.red-gate.com/simple-talk/sql/t-sql-programming/the-sql-of-gaps-and-islands-in-sequences/

Yes, it's about MS SQL Server and no, I don't care about that as the principle works the same in most SQL based DBMS, such as HANA.


I set up a table with your data:

select id, lat, long, timestamp
from dev_tab 
order by timestamp;

/*
ID  LAT         LONG        TIMESTAMP               
g1  48.762438   9.186757    11/09/2018 7:00:00.0 AM 
g1  48.762438   9.186757    11/09/2018 10:00:00.0 AM
g1  46.943718   8.999521    11/09/2018 2:00:00.0 PM 
g1  44.291991   0.938082    12/09/2018 5:00:00.0 AM 
g1  44.291991   0.938082    12/09/2018 7:00:00.0 AM 
g1  12.885067   77.553435   12/09/2018 8:00:00.0 AM 
g1  48.762438   9.186757    13/09/2018 10:00:00.0 AM
*/

And then followed the example. Basically the approach is to find the "borders" of your groups (see the "startpoints" and "endpoints" queries) and then to join these two sets.

with src as (select id, lat, long, timestamp
                  , row_number () over (order by timestamp) as seq       
				from dev_tab 
				order by timestamp
             ),
startpoints as (
    select id, seq, lat, long, timestamp, row_number() over (order by seq) as rownum
    from src src_a
    where not exists (
                       select * 
                       from src src_b 
                       where (src_b.lat, src_b.long) = (src_a.lat, src_b.long)
                       and src_b.seq = src_A.seq - 1
                       )
    ),
endpoints as (
    select id, seq, lat, long, timestamp, row_number() over (order by seq) as rownum
    from src src_a
    where not exists (
                       select * 
                       from src src_b 
                       where (src_b.lat, src_b.long) = (src_a.lat, src_b.long)
                       and src_b.seq = src_A.seq + 1
                       )
    )               				
select
     s.id, s.seq, s.lat, s.long, s.timestamp as start_timestamp, e.timestamp as end_timestamp 
from startpoints s 
left outer join	endpoints e
on (s.lat, s.long) = (e.lat, e.long)
and s.rownum = e.rownum;

ID  SEQ LAT         LONG        START_TIMESTAMP             END_TIMESTAMP           
g1  1   48.762438   9.186757    11/09/2018 7:00:00.0 AM     11/09/2018 10:00:00.0 AM
g1  3   46.943718   8.999521    11/09/2018 2:00:00.0 PM     11/09/2018 2:00:00.0 PM 
g1  4   44.291991   0.938082    12/09/2018 5:00:00.0 AM     12/09/2018 7:00:00.0 AM 
g1  6   12.885067   77.553435   12/09/2018 8:00:00.0 AM     12/09/2018 8:00:00.0 AM 
g1  7   48.762438   9.186757    13/09/2018 10:00:00.0 AM    13/09/2018 10:00:00.0 AM

The result is nearly what you asked for. The result set as you see it shows the END_TIMESTAMP of each "island". But in your problem statement, you wanted to see the START_TIMESTAMP of the next group.

Since this last bit is just one additional WINDOW expression for the final SELECT, I leave it to you to find that one out 🙂

0 Kudos

Hi Lars,

Thank you so much for helping me out on this. As you are aware that for END_TIMESTAMP, I want to see the START_TIMESTAMP of the next group. I used WINDOW function below as suggested by you -

lead(s.timestamp) over (partition by s.id order by s.seq) end_timestamp

With this function for END_TIMESTAMP, I don't think I need "endpoints" query at all. Please correct me if I am wrong.


Regards,
Prachi

Answers (0)