Link to home
Start Free TrialLog in
Avatar of cesarchavez
cesarchavezFlag for Afghanistan

asked on

Get distinct values when using inner join

HI Experts!

I have a problem that is similar to one i asked previously

I have two tables tablea and tableb

tablea

thetime,  x

12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
12:12:07 7
12:12:08 8


tableb

id     starttime endtime
100  12:12:01 12:12:04
101  12:12:03 12:12:06


the desired query output is:

 thetime, x,
12:12:01 1  
12:12:02 2  
12:12:03 3  
12:12:04 4  
12:12:05 5  
12:12:06 6  

I have something like this:

select a.thetime, a.x
from tablea a inner join tableb b on a.thetime between b.starttime and b.endtime

but this produces repeated records for for overlaps in the time, so i get two entries at 12:12:03 and 12:12:04

Thanks
Avatar of cesarchavez
cesarchavez
Flag of Afghanistan image

ASKER

i cant just do a group by to the output.  This is a simplified view of the problem and i need a query that will operate efficiently on large data sets.
Avatar of PortletPaul
when questions get simplified, there is a tendency for the cycle to solution to be long.

The problem is that there is overlap in the tableb  starttime/data, so to eradicate the repetition this has to be addressed by some method of simplification.

Proposals for doing so might change a lot depending on the overall query - can you post the whole query?
perhaps this will work for you?

select a.thetime, a.x
from tablea a
inner join tableb b
      on a.thetime between (select min() starttime from tableb)
            and (select max() endtime  from tableb )


but I have to admit it looks 'weird' now I've typed it out

anyway, a better view into your overall query might reveal a better path.
The only real difference is that there are hundreds of variables that are represented by x in table A and hundreds of time periods in table B.  Unfortunately your solution will not work as there could be times in between where the data should not be included.

for example

tablea

thetime,  x

12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
12:12:07 7
12:12:08 8


tableb

id     starttime endtime
100  12:12:01 12:12:03
101  12:12:05 12:12:06


the desired query output is:

 thetime, x,
12:12:01 1  
12:12:02 2  
12:12:03 3  
12:12:05 5  
12:12:06 6
I did post the comment 1 minute after i posted the question... but i just realised that the simple solution doesnt work when you have hundreds of variables and a massive data set (>100GB)
please ignore my original post - not sure what I was thinking - certainly not workable and ahas syntax errors - sorry
tableb - in question

id     starttime endtime
100  12:12:01 12:12:04
101  12:12:03 12:12:06

=======================
tableb - in above comment

id     starttime endtime
100  12:12:01 12:12:03
101  12:12:05 12:12:06
=======================

you have changed the data along the way. Is the data like the original question, or like comment immediately above?
also, what are the datatypes of thetime, starttime, endtime
are they date?
table b has a lot of records and im just showing examples. so:

for example

tablea

thetime,  x

12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
12:12:07 7
12:12:08 8


tableb

id     starttime endtime
100  12:12:01 12:12:03
101  12:12:05 12:12:06


the desired query output is:

 thetime, x, 
12:12:01 1  
12:12:02 2  
12:12:03 3  
12:12:05 5  
12:12:06 6

Open in new window


and in a different situation i might have

tablea

thetime,  x

12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
12:12:07 7
12:12:08 8


tableb

id     starttime endtime
100  12:12:01 12:12:04
101  12:12:03 12:12:06


the desired query output is:

 thetime, x, 
12:12:01 1  
12:12:02 2  
12:12:03 3  
12:12:04 4  
12:12:05 5  
12:12:06 6  

Open in new window

they are all timestamps
if the tableb startime/endtime do not overlap, then the repetition stops

this is Ok
id     starttime endtime
100  12:12:01 12:12:03
101  12:12:05 12:12:06

this is NOT
id     starttime endtime
100  12:12:01 12:12:04
101  12:12:03 12:12:06

how many records are in tableb? can you create a table which has just the non-overlapping periods with key back to tableb?

is subquery factorization feasible here?
WITH x AS ()
This seems a little crazy to me.  Im not asking you to redifine my problem.  I have overlapping periods and the query needs to handle them.
Ceasar, not trying to redefine or to drive you crazy - just trying to understand and locate something feasible for you. Only tthe information provided here is what I know about the issues.

Anyway, here is an attempt to remove the overlapping TableB time-periods which in limited testing appears to work. However I cannot guarantee it will perform very quickly over a very large set of data.
select
  a.thetime
, a.x
from tableA a 
inner join (
            select
              starttime
            , case
                when endtime > lead(starttime,1,endtime) over (order by starttime) 
                  then lead(starttime,1,endtime) over (order by starttime) - 1/(24*60*60) 
                else endtime 
              end as adj_endtime
            from tableB
            ) b on a.thetime >= b.starttime and a.thetime <= b.adj_endtime
order by
  a.thetime

Open in new window

the testing, done here http://sqlfiddle.com/#!4/bc44f/3 produces the expected results:

THETIME      X
12:12:01      1
12:12:02      2
12:12:03      3
12:12:04      4
12:12:05      5
12:12:06      6

HOWEVER, I cannot judge if adjusting the endtime to produce this result will affect other calculations in your overall query.  To make that judgement would require knowing quite a lot more about that whole query.

Paul
Thanks for your efforts, i appreciate them, but this doesn't work at all.  its not that some periods overlap and i'm looking for a workaround. I could do that myself.  Its that the periods are random, some overlap, some have three embeded periods.  I could have 20 periods all defined as exactly the same time.  

 It may be that there is no answer... but i find that hard to believe.
select DISTINCT a.thetime, a.x
from tablea a inner join tableb b on a.thetime between b.starttime and b.endtime

select a.thetime, max(a.x)
from tablea a inner join tableb b on a.thetime between b.starttime and b.endtime
group by a.thetime

these are you standard choices.
+ I tried something else based on the sample data provided - but it's unacceptable.


I fear to respond further for causing a redefinition of the question but only you are in possession of the full facts - this web page contains the only facts known to other readers.

You have just revealed new facts:
the periods are random,
some overlap,
some have three embedded periods.
could have multiple periods all defined as exactly the same time.  

and - don't get angry please - you haven't really explained what the overall objective is.

I believe you are spreading data over time-frames and when doing this have to live with the 'row repetition' then 'simplify'.

The simplification might be a 'group by' (high chance of this). I think we probably agree it isn't going to be achieved by starting with 'select distinct' at the top of a giant query, so each part of the query needs to be assessed for correctness and fit.

If you take just the minimum data required from this A to B join and place it into a subquery using group by (possibly a with as, maybe a temp table) - then you might gain the overall desired outcome.

>>It may be that there is no answer
it's quite possible that the answer isn't one you really want.
Appears to me that the 'repetition' brought about by joining A to B will need simplification and my best guess is you will achieve that through 'group by'.
Thanks for your explanation... i was really thinking that there might be a relatively straight forward solution. However im not quite sure i understand.  What if i told you there were 4 embedded periods... do you think that should change the query? i really dont think that i have to provide every single possible situation.  A query is meant to provide a general solution.
:) it's a frustration of communication by web page I'm afraid., and I truly regret to say I don't understand this bit: " 4 embedded periods." in tableA? or tableB?
would this change the query = sounds like it

>>dont think that i have to provide every single possible situation
but you clearly know enough about queries to know that if exceptions aren't catered for the query outcome will disappoint (like my earlier attempt did)

It's a pity I can't describe adequately what it's like at the other side of this question - I can sense you are frustrated because I can't follow you - but equally I'm frustrated because I cannot see the big picture , or test anything against real data or provide concrete assistance.

Not sure what I can do right now to help.
Avatar of slightwv (䄆 Netminder)
slightwv (䄆 Netminder)

I too am having a hard time trying to figure out the requirements.

If I do understand it, it looks like you want the rows from tablea that exist in the ranges by second in tableb.

If that's correct, try the code below.  I'm not sure how it will perform on large tables but it is the path I started down.

If it fails, please add to the test case.

drop table tab1 purge;
create table tab1(thetime timestamp, x number);

insert into tab1 values(to_date('12:12:01','HH:MI:SS'),1);
insert into tab1 values(to_date('12:12:02','HH:MI:SS'),2);
insert into tab1 values(to_date('12:12:03','HH:MI:SS'),3);
insert into tab1 values(to_date('12:12:04','HH:MI:SS'),4);
insert into tab1 values(to_date('12:12:05','HH:MI:SS'),5);
insert into tab1 values(to_date('12:12:06','HH:MI:SS'),6);
insert into tab1 values(to_date('12:12:07','HH:MI:SS'),7);
insert into tab1 values(to_date('12:12:08','HH:MI:SS'),8);
commit;

drop table tab2 purge;
create table tab2(id number, starttime timestamp, endtime timestamp);

insert into tab2 values(100,to_date('12:12:01','HH:MI:SS'),to_date('12:12:03','HH:MI:SS'));
insert into tab2 values(101,to_date('12:12:05','HH:MI:SS'),to_date('12:12:06','HH:MI:SS'));
commit;

select thetime,x from tab1
where thetime in (
select starttime + numtodsinterval(column_value-1,'second')
from tab2,
table (
	cast(
		multiset( select level from dual connect by level <= ((cast(endtime as date) - cast(starttime as date))*86400)+1)
  		as sys.odcinumberlist
	)
)
)
/

Open in new window

>>they are all timestamps<<
Are they all at one second intervals?
thanks for the suggestions... ill try them out this afternoon.

no they are not at one second intervals.  Sorry i was just trying to simplify the problem.

The times in table a are unique but could be at any frequency.  slightwv i think you understand.  all im looking for is a query that will return the data that exists between a range of times defined in table b.  The ranges could overlap or not.  the query i provided at the start does almost what i want.  it gives me all the data between the ranges.  It doesnt require any definition of whether the data is guaranteed second by second... it just returns the rows that are at times between the intervals. The only problem with it is that if the intervals overlap i get repeated records which i dont want.

I need a distinct set of records with no repeated values between any combination of ranges.  If the data is between a range then it is included.  If not.. then it is not included.

To me its sort of like i have a union all and what i want is a union.
>> the data that exists between a range of times defined

Down to what lowest level of granularity?  Timestamps in Oracle have milliseconds.

So do you want the range to be down to milliseconds?

>>the query i provided at the start does almost what i want

It's not pretty but would a select distinct work?
>>all im looking for is a query that will return the data that exists between a range of times

select a.thetime, a.x
from tablea a
where a.thetime  >= (select min(starttime) from tableb)
and a.thetime  <= (select max(endtime) from tableb)

take the join out the picture, just use tableb as the reference point for lowest and highest date boundaries, this way no duplication of tablea rows can occur
I have no idea what happened to the comment I posted earlier, but it basically stated the following [words in brackets are new additions]:
While using select distinct can be a performance hog [as slightwv notes not pretty], it should provide the output you need.
select DISTINCT a.thetime, a.x
from tab1 a inner join tab2 b on a.thetime between b.starttime and b.endtime
order by 1, 2;
Note - PortletPaul already suggested this approach but I wonder if you have tried it yet.
>>where a.thetime  >= (select min(starttime) from tableb)  and a.thetime  <= (select max(endtime) from tableb)

That won't work if the dates have a gap.

Think of if in numbers:

tableb:
id,startnum,endnum
1,1,4
2,6,7

You would return anything between 1 and 7.  
What is wanted is:
between (1 and 4) or between (6 and 7)
>>all im looking for is a query that will return the data that exists between A range of times
emphasis added

What is wanted is:
between (1 and 4) or between (6 and 7)
is it?

>>the desired query output is:

 thetime, x,
12:12:01 1  <-- MIN(b.starttime)
12:12:02 2  
12:12:03 3  
12:12:05 5  
12:12:06 6  <-- MAX(b.endtime)

http://sqlfiddle.com/#!4/bc44f/11

and, yes, select distinct was proposed some while back
but my natural inclination is to avoid this for large sets :: i.e. I agree, it's ugly :)
>>is it?

Yes.

Per http:#a39154070

Notice "12:12:04 4" is not between the ranges in tableb and not in the desired results.
ASKER CERTIFIED SOLUTION
Avatar of PortletPaul
PortletPaul
Flag of Australia image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
>>taking that into account, one last attempt (I hope)

Didn't run it but I agree it looks like it will work and is sooooo much better than the path I went down.
:)

took us a while - and I couldn't sleep thinking there just had to be a better way
might sleep now

oh url error above, it should have been: http://sqlfiddle.com/#!4/207a9/2
genius