cesarchavez
asked on
Get distinct values when using inner join
HI Experts!
I have a problem that is similar to one i asked previously
I have two tables tablea and tableb
tablea
thetime, x
12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
12:12:07 7
12:12:08 8
tableb
id starttime endtime
100 12:12:01 12:12:04
101 12:12:03 12:12:06
the desired query output is:
thetime, x,
12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
I have something like this:
select a.thetime, a.x
from tablea a inner join tableb b on a.thetime between b.starttime and b.endtime
but this produces repeated records for for overlaps in the time, so i get two entries at 12:12:03 and 12:12:04
Thanks
I have a problem that is similar to one i asked previously
I have two tables tablea and tableb
tablea
thetime, x
12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
12:12:07 7
12:12:08 8
tableb
id starttime endtime
100 12:12:01 12:12:04
101 12:12:03 12:12:06
the desired query output is:
thetime, x,
12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
I have something like this:
select a.thetime, a.x
from tablea a inner join tableb b on a.thetime between b.starttime and b.endtime
but this produces repeated records for for overlaps in the time, so i get two entries at 12:12:03 and 12:12:04
Thanks
when questions get simplified, there is a tendency for the cycle to solution to be long.
The problem is that there is overlap in the tableb starttime/data, so to eradicate the repetition this has to be addressed by some method of simplification.
Proposals for doing so might change a lot depending on the overall query - can you post the whole query?
perhaps this will work for you?
select a.thetime, a.x
from tablea a
inner join tableb b
on a.thetime between (select min() starttime from tableb)
and (select max() endtime from tableb )
but I have to admit it looks 'weird' now I've typed it out
anyway, a better view into your overall query might reveal a better path.
The problem is that there is overlap in the tableb starttime/data, so to eradicate the repetition this has to be addressed by some method of simplification.
Proposals for doing so might change a lot depending on the overall query - can you post the whole query?
perhaps this will work for you?
select a.thetime, a.x
from tablea a
inner join tableb b
on a.thetime between (select min() starttime from tableb)
and (select max() endtime from tableb )
but I have to admit it looks 'weird' now I've typed it out
anyway, a better view into your overall query might reveal a better path.
ASKER
The only real difference is that there are hundreds of variables that are represented by x in table A and hundreds of time periods in table B. Unfortunately your solution will not work as there could be times in between where the data should not be included.
for example
tablea
thetime, x
12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
12:12:07 7
12:12:08 8
tableb
id starttime endtime
100 12:12:01 12:12:03
101 12:12:05 12:12:06
the desired query output is:
thetime, x,
12:12:01 1
12:12:02 2
12:12:03 3
12:12:05 5
12:12:06 6
for example
tablea
thetime, x
12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
12:12:07 7
12:12:08 8
tableb
id starttime endtime
100 12:12:01 12:12:03
101 12:12:05 12:12:06
the desired query output is:
thetime, x,
12:12:01 1
12:12:02 2
12:12:03 3
12:12:05 5
12:12:06 6
ASKER
I did post the comment 1 minute after i posted the question... but i just realised that the simple solution doesnt work when you have hundreds of variables and a massive data set (>100GB)
please ignore my original post - not sure what I was thinking - certainly not workable and ahas syntax errors - sorry
tableb - in question
id starttime endtime
100 12:12:01 12:12:04
101 12:12:03 12:12:06
=======================
tableb - in above comment
id starttime endtime
100 12:12:01 12:12:03
101 12:12:05 12:12:06
=======================
you have changed the data along the way. Is the data like the original question, or like comment immediately above?
id starttime endtime
100 12:12:01 12:12:04
101 12:12:03 12:12:06
=======================
tableb - in above comment
id starttime endtime
100 12:12:01 12:12:03
101 12:12:05 12:12:06
=======================
you have changed the data along the way. Is the data like the original question, or like comment immediately above?
also, what are the datatypes of thetime, starttime, endtime
are they date?
are they date?
ASKER
table b has a lot of records and im just showing examples. so:
and in a different situation i might have
for example
tablea
thetime, x
12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
12:12:07 7
12:12:08 8
tableb
id starttime endtime
100 12:12:01 12:12:03
101 12:12:05 12:12:06
the desired query output is:
thetime, x,
12:12:01 1
12:12:02 2
12:12:03 3
12:12:05 5
12:12:06 6
and in a different situation i might have
tablea
thetime, x
12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
12:12:07 7
12:12:08 8
tableb
id starttime endtime
100 12:12:01 12:12:04
101 12:12:03 12:12:06
the desired query output is:
thetime, x,
12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
ASKER
they are all timestamps
if the tableb startime/endtime do not overlap, then the repetition stops
this is Ok
id starttime endtime
100 12:12:01 12:12:03
101 12:12:05 12:12:06
this is NOT
id starttime endtime
100 12:12:01 12:12:04
101 12:12:03 12:12:06
how many records are in tableb? can you create a table which has just the non-overlapping periods with key back to tableb?
is subquery factorization feasible here?
WITH x AS ()
this is Ok
id starttime endtime
100 12:12:01 12:12:03
101 12:12:05 12:12:06
this is NOT
id starttime endtime
100 12:12:01 12:12:04
101 12:12:03 12:12:06
how many records are in tableb? can you create a table which has just the non-overlapping periods with key back to tableb?
is subquery factorization feasible here?
WITH x AS ()
ASKER
This seems a little crazy to me. Im not asking you to redifine my problem. I have overlapping periods and the query needs to handle them.
Ceasar, not trying to redefine or to drive you crazy - just trying to understand and locate something feasible for you. Only tthe information provided here is what I know about the issues.
Anyway, here is an attempt to remove the overlapping TableB time-periods which in limited testing appears to work. However I cannot guarantee it will perform very quickly over a very large set of data.
Anyway, here is an attempt to remove the overlapping TableB time-periods which in limited testing appears to work. However I cannot guarantee it will perform very quickly over a very large set of data.
select
a.thetime
, a.x
from tableA a
inner join (
select
starttime
, case
when endtime > lead(starttime,1,endtime) over (order by starttime)
then lead(starttime,1,endtime) over (order by starttime) - 1/(24*60*60)
else endtime
end as adj_endtime
from tableB
) b on a.thetime >= b.starttime and a.thetime <= b.adj_endtime
order by
a.thetime
the testing, done here http://sqlfiddle.com/#!4/bc44f/3 produces the expected results:
THETIME X
12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
HOWEVER, I cannot judge if adjusting the endtime to produce this result will affect other calculations in your overall query. To make that judgement would require knowing quite a lot more about that whole query.
Paul
THETIME X
12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
HOWEVER, I cannot judge if adjusting the endtime to produce this result will affect other calculations in your overall query. To make that judgement would require knowing quite a lot more about that whole query.
Paul
ASKER
Thanks for your efforts, i appreciate them, but this doesn't work at all. its not that some periods overlap and i'm looking for a workaround. I could do that myself. Its that the periods are random, some overlap, some have three embeded periods. I could have 20 periods all defined as exactly the same time.
It may be that there is no answer... but i find that hard to believe.
It may be that there is no answer... but i find that hard to believe.
select DISTINCT a.thetime, a.x
from tablea a inner join tableb b on a.thetime between b.starttime and b.endtime
select a.thetime, max(a.x)
from tablea a inner join tableb b on a.thetime between b.starttime and b.endtime
group by a.thetime
these are you standard choices.
+ I tried something else based on the sample data provided - but it's unacceptable.
I fear to respond further for causing a redefinition of the question but only you are in possession of the full facts - this web page contains the only facts known to other readers.
You have just revealed new facts:
the periods are random,
some overlap,
some have three embedded periods.
could have multiple periods all defined as exactly the same time.
and - don't get angry please - you haven't really explained what the overall objective is.
I believe you are spreading data over time-frames and when doing this have to live with the 'row repetition' then 'simplify'.
The simplification might be a 'group by' (high chance of this). I think we probably agree it isn't going to be achieved by starting with 'select distinct' at the top of a giant query, so each part of the query needs to be assessed for correctness and fit.
If you take just the minimum data required from this A to B join and place it into a subquery using group by (possibly a with as, maybe a temp table) - then you might gain the overall desired outcome.
>>It may be that there is no answer
it's quite possible that the answer isn't one you really want.
Appears to me that the 'repetition' brought about by joining A to B will need simplification and my best guess is you will achieve that through 'group by'.
from tablea a inner join tableb b on a.thetime between b.starttime and b.endtime
select a.thetime, max(a.x)
from tablea a inner join tableb b on a.thetime between b.starttime and b.endtime
group by a.thetime
these are you standard choices.
+ I tried something else based on the sample data provided - but it's unacceptable.
I fear to respond further for causing a redefinition of the question but only you are in possession of the full facts - this web page contains the only facts known to other readers.
You have just revealed new facts:
the periods are random,
some overlap,
some have three embedded periods.
could have multiple periods all defined as exactly the same time.
and - don't get angry please - you haven't really explained what the overall objective is.
I believe you are spreading data over time-frames and when doing this have to live with the 'row repetition' then 'simplify'.
The simplification might be a 'group by' (high chance of this). I think we probably agree it isn't going to be achieved by starting with 'select distinct' at the top of a giant query, so each part of the query needs to be assessed for correctness and fit.
If you take just the minimum data required from this A to B join and place it into a subquery using group by (possibly a with as, maybe a temp table) - then you might gain the overall desired outcome.
>>It may be that there is no answer
it's quite possible that the answer isn't one you really want.
Appears to me that the 'repetition' brought about by joining A to B will need simplification and my best guess is you will achieve that through 'group by'.
ASKER
Thanks for your explanation... i was really thinking that there might be a relatively straight forward solution. However im not quite sure i understand. What if i told you there were 4 embedded periods... do you think that should change the query? i really dont think that i have to provide every single possible situation. A query is meant to provide a general solution.
:) it's a frustration of communication by web page I'm afraid., and I truly regret to say I don't understand this bit: " 4 embedded periods." in tableA? or tableB?
would this change the query = sounds like it
>>dont think that i have to provide every single possible situation
but you clearly know enough about queries to know that if exceptions aren't catered for the query outcome will disappoint (like my earlier attempt did)
It's a pity I can't describe adequately what it's like at the other side of this question - I can sense you are frustrated because I can't follow you - but equally I'm frustrated because I cannot see the big picture , or test anything against real data or provide concrete assistance.
Not sure what I can do right now to help.
would this change the query = sounds like it
>>dont think that i have to provide every single possible situation
but you clearly know enough about queries to know that if exceptions aren't catered for the query outcome will disappoint (like my earlier attempt did)
It's a pity I can't describe adequately what it's like at the other side of this question - I can sense you are frustrated because I can't follow you - but equally I'm frustrated because I cannot see the big picture , or test anything against real data or provide concrete assistance.
Not sure what I can do right now to help.
I too am having a hard time trying to figure out the requirements.
If I do understand it, it looks like you want the rows from tablea that exist in the ranges by second in tableb.
If that's correct, try the code below. I'm not sure how it will perform on large tables but it is the path I started down.
If it fails, please add to the test case.
If I do understand it, it looks like you want the rows from tablea that exist in the ranges by second in tableb.
If that's correct, try the code below. I'm not sure how it will perform on large tables but it is the path I started down.
If it fails, please add to the test case.
drop table tab1 purge;
create table tab1(thetime timestamp, x number);
insert into tab1 values(to_date('12:12:01','HH:MI:SS'),1);
insert into tab1 values(to_date('12:12:02','HH:MI:SS'),2);
insert into tab1 values(to_date('12:12:03','HH:MI:SS'),3);
insert into tab1 values(to_date('12:12:04','HH:MI:SS'),4);
insert into tab1 values(to_date('12:12:05','HH:MI:SS'),5);
insert into tab1 values(to_date('12:12:06','HH:MI:SS'),6);
insert into tab1 values(to_date('12:12:07','HH:MI:SS'),7);
insert into tab1 values(to_date('12:12:08','HH:MI:SS'),8);
commit;
drop table tab2 purge;
create table tab2(id number, starttime timestamp, endtime timestamp);
insert into tab2 values(100,to_date('12:12:01','HH:MI:SS'),to_date('12:12:03','HH:MI:SS'));
insert into tab2 values(101,to_date('12:12:05','HH:MI:SS'),to_date('12:12:06','HH:MI:SS'));
commit;
select thetime,x from tab1
where thetime in (
select starttime + numtodsinterval(column_value-1,'second')
from tab2,
table (
cast(
multiset( select level from dual connect by level <= ((cast(endtime as date) - cast(starttime as date))*86400)+1)
as sys.odcinumberlist
)
)
)
/
>>they are all timestamps<<
Are they all at one second intervals?
Are they all at one second intervals?
ASKER
thanks for the suggestions... ill try them out this afternoon.
no they are not at one second intervals. Sorry i was just trying to simplify the problem.
The times in table a are unique but could be at any frequency. slightwv i think you understand. all im looking for is a query that will return the data that exists between a range of times defined in table b. The ranges could overlap or not. the query i provided at the start does almost what i want. it gives me all the data between the ranges. It doesnt require any definition of whether the data is guaranteed second by second... it just returns the rows that are at times between the intervals. The only problem with it is that if the intervals overlap i get repeated records which i dont want.
I need a distinct set of records with no repeated values between any combination of ranges. If the data is between a range then it is included. If not.. then it is not included.
To me its sort of like i have a union all and what i want is a union.
no they are not at one second intervals. Sorry i was just trying to simplify the problem.
The times in table a are unique but could be at any frequency. slightwv i think you understand. all im looking for is a query that will return the data that exists between a range of times defined in table b. The ranges could overlap or not. the query i provided at the start does almost what i want. it gives me all the data between the ranges. It doesnt require any definition of whether the data is guaranteed second by second... it just returns the rows that are at times between the intervals. The only problem with it is that if the intervals overlap i get repeated records which i dont want.
I need a distinct set of records with no repeated values between any combination of ranges. If the data is between a range then it is included. If not.. then it is not included.
To me its sort of like i have a union all and what i want is a union.
>> the data that exists between a range of times defined
Down to what lowest level of granularity? Timestamps in Oracle have milliseconds.
So do you want the range to be down to milliseconds?
>>the query i provided at the start does almost what i want
It's not pretty but would a select distinct work?
Down to what lowest level of granularity? Timestamps in Oracle have milliseconds.
So do you want the range to be down to milliseconds?
>>the query i provided at the start does almost what i want
It's not pretty but would a select distinct work?
>>all im looking for is a query that will return the data that exists between a range of times
select a.thetime, a.x
from tablea a
where a.thetime >= (select min(starttime) from tableb)
and a.thetime <= (select max(endtime) from tableb)
take the join out the picture, just use tableb as the reference point for lowest and highest date boundaries, this way no duplication of tablea rows can occur
select a.thetime, a.x
from tablea a
where a.thetime >= (select min(starttime) from tableb)
and a.thetime <= (select max(endtime) from tableb)
take the join out the picture, just use tableb as the reference point for lowest and highest date boundaries, this way no duplication of tablea rows can occur
I have no idea what happened to the comment I posted earlier, but it basically stated the following [words in brackets are new additions]:
While using select distinct can be a performance hog [as slightwv notes not pretty], it should provide the output you need.
select DISTINCT a.thetime, a.x
from tab1 a inner join tab2 b on a.thetime between b.starttime and b.endtime
order by 1, 2;
Note - PortletPaul already suggested this approach but I wonder if you have tried it yet.
While using select distinct can be a performance hog [as slightwv notes not pretty], it should provide the output you need.
select DISTINCT a.thetime, a.x
from tab1 a inner join tab2 b on a.thetime between b.starttime and b.endtime
order by 1, 2;
Note - PortletPaul already suggested this approach but I wonder if you have tried it yet.
>>where a.thetime >= (select min(starttime) from tableb) and a.thetime <= (select max(endtime) from tableb)
That won't work if the dates have a gap.
Think of if in numbers:
tableb:
id,startnum,endnum
1,1,4
2,6,7
You would return anything between 1 and 7.
What is wanted is:
between (1 and 4) or between (6 and 7)
That won't work if the dates have a gap.
Think of if in numbers:
tableb:
id,startnum,endnum
1,1,4
2,6,7
You would return anything between 1 and 7.
What is wanted is:
between (1 and 4) or between (6 and 7)
>>all im looking for is a query that will return the data that exists between A range of times
emphasis added
What is wanted is:
between (1 and 4) or between (6 and 7)
is it?
>>the desired query output is:
thetime, x,
12:12:01 1 <-- MIN(b.starttime)
12:12:02 2
12:12:03 3
12:12:05 5
12:12:06 6 <-- MAX(b.endtime)
http://sqlfiddle.com/#!4/bc44f/11
and, yes, select distinct was proposed some while back
but my natural inclination is to avoid this for large sets :: i.e. I agree, it's ugly :)
emphasis added
What is wanted is:
between (1 and 4) or between (6 and 7)
is it?
>>the desired query output is:
thetime, x,
12:12:01 1 <-- MIN(b.starttime)
12:12:02 2
12:12:03 3
12:12:05 5
12:12:06 6 <-- MAX(b.endtime)
http://sqlfiddle.com/#!4/bc44f/11
and, yes, select distinct was proposed some while back
but my natural inclination is to avoid this for large sets :: i.e. I agree, it's ugly :)
>>is it?
Yes.
Per http:#a39154070
Notice "12:12:04 4" is not between the ranges in tableb and not in the desired results.
Yes.
Per http:#a39154070
Notice "12:12:04 4" is not between the ranges in tableb and not in the desired results.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
>>taking that into account, one last attempt (I hope)
Didn't run it but I agree it looks like it will work and is sooooo much better than the path I went down.
Didn't run it but I agree it looks like it will work and is sooooo much better than the path I went down.
:)
took us a while - and I couldn't sleep thinking there just had to be a better way
might sleep now
oh url error above, it should have been: http://sqlfiddle.com/#!4/207a9/2
took us a while - and I couldn't sleep thinking there just had to be a better way
might sleep now
oh url error above, it should have been: http://sqlfiddle.com/#!4/207a9/2
ASKER
genius
ASKER