on 09-13-2018 2:25 PM
Hi,
I want to implement a solution where I have a data set like below -
I want to aggregate the data based on lat-long, if the same lat-long is sent in a row, these rows should get aggregated and return into a single row. Below is the output I am expecting -
However, the catch here is, if the lat-long combination is sent again at later point in time, it shouldn't get aggregated. It should have a new row for that(for eg. rows with s_no 1 and 5 have the same lat-long, but are not aggregated.
I implemented the solution using loops but since the data set is huge, the performance is bad.
Can anyone suggest a better solution for this(which doesn't include loops/cursors)?
Below are the sql queries to create the table and insert data -
create table dev_tab (id nvarchar(10), lat decimal(12,9), long decimal(12,9), timestamp timestamp);
insert into dev_tab (id,lat,long,timestamp) values ('g1',48.762438,9.186757,'2018-09-11 07:00:00');
insert into dev_tab (id,lat,long,timestamp) values ('g1',48.762438,9.186757,'2018-09-11 10:00:00');
insert into dev_tab (id,lat,long,timestamp) values ('g1',46.943718,8.999521,'2018-09-11 14:00:00');
insert into dev_tab (id,lat,long,timestamp) values ('g1',44.291991,0.938082,'2018-09-12 05:00:00');
insert into dev_tab (id,lat,long,timestamp) values ('g1',44.291991,0.938082,'2018-09-12 07:00:00');
insert into dev_tab (id,lat,long,timestamp) values ('g1',12.885067,77.553435,'2018-09-12 08:00:00');
insert into dev_tab (id,lat,long,timestamp) values ('g1',48.762438,9.186757,'2018-09-13 10:00:00');
Thanks in advance.
Regards,
Prachi
Thanks for putting in some test data; that makes it possible to work on the question.
What you have here is a classic SQL problem, often referred to as "islands and gaps problem".
I recommend reading up on this as there are several different approaches each with pros and cons. Google is your friend here or simply grab the Joe Celko all-time classic "SQL for smarties" (you should have this book if your job regularly involves writing SQL anyway).
For this question I simply went with the google-it approach and found this page: https://www.red-gate.com/simple-talk/sql/t-sql-programming/the-sql-of-gaps-and-islands-in-sequences/
Yes, it's about MS SQL Server and no, I don't care about that as the principle works the same in most SQL based DBMS, such as HANA.
I set up a table with your data:
select id, lat, long, timestamp
from dev_tab
order by timestamp;
/*
ID LAT LONG TIMESTAMP
g1 48.762438 9.186757 11/09/2018 7:00:00.0 AM
g1 48.762438 9.186757 11/09/2018 10:00:00.0 AM
g1 46.943718 8.999521 11/09/2018 2:00:00.0 PM
g1 44.291991 0.938082 12/09/2018 5:00:00.0 AM
g1 44.291991 0.938082 12/09/2018 7:00:00.0 AM
g1 12.885067 77.553435 12/09/2018 8:00:00.0 AM
g1 48.762438 9.186757 13/09/2018 10:00:00.0 AM
*/
And then followed the example. Basically the approach is to find the "borders" of your groups (see the "startpoints" and "endpoints" queries) and then to join these two sets.
with src as (select id, lat, long, timestamp
, row_number () over (order by timestamp) as seq
from dev_tab
order by timestamp
),
startpoints as (
select id, seq, lat, long, timestamp, row_number() over (order by seq) as rownum
from src src_a
where not exists (
select *
from src src_b
where (src_b.lat, src_b.long) = (src_a.lat, src_b.long)
and src_b.seq = src_A.seq - 1
)
),
endpoints as (
select id, seq, lat, long, timestamp, row_number() over (order by seq) as rownum
from src src_a
where not exists (
select *
from src src_b
where (src_b.lat, src_b.long) = (src_a.lat, src_b.long)
and src_b.seq = src_A.seq + 1
)
)
select
s.id, s.seq, s.lat, s.long, s.timestamp as start_timestamp, e.timestamp as end_timestamp
from startpoints s
left outer join endpoints e
on (s.lat, s.long) = (e.lat, e.long)
and s.rownum = e.rownum;
ID SEQ LAT LONG START_TIMESTAMP END_TIMESTAMP
g1 1 48.762438 9.186757 11/09/2018 7:00:00.0 AM 11/09/2018 10:00:00.0 AM
g1 3 46.943718 8.999521 11/09/2018 2:00:00.0 PM 11/09/2018 2:00:00.0 PM
g1 4 44.291991 0.938082 12/09/2018 5:00:00.0 AM 12/09/2018 7:00:00.0 AM
g1 6 12.885067 77.553435 12/09/2018 8:00:00.0 AM 12/09/2018 8:00:00.0 AM
g1 7 48.762438 9.186757 13/09/2018 10:00:00.0 AM 13/09/2018 10:00:00.0 AM
The result is nearly what you asked for. The result set as you see it shows the END_TIMESTAMP of each "island". But in your problem statement, you wanted to see the START_TIMESTAMP of the next group.
Since this last bit is just one additional WINDOW expression for the final SELECT, I leave it to you to find that one out 🙂
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Hi Lars,
Thank you so much for helping me out on this. As you are aware that for END_TIMESTAMP, I want to see the START_TIMESTAMP of the next group. I used WINDOW function below as suggested by you -
lead(s.timestamp) over (partition by s.id order by s.seq) end_timestamp
With this function for END_TIMESTAMP, I don't think I need "endpoints" query at all. Please correct me if I am wrong.
Regards,
Prachi
User | Count |
---|---|
101 | |
13 | |
13 | |
11 | |
11 | |
7 | |
6 | |
5 | |
4 | |
4 |
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.