?
Solved

Get distinct values when using inner join

Posted on 2013-05-09
30
Medium Priority
?
613 Views
Last Modified: 2013-05-13
HI Experts!

I have a problem that is similar to one i asked previously

I have two tables tablea and tableb

tablea

thetime,  x

12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
12:12:07 7
12:12:08 8


tableb

id     starttime endtime
100  12:12:01 12:12:04
101  12:12:03 12:12:06


the desired query output is:

 thetime, x,
12:12:01 1  
12:12:02 2  
12:12:03 3  
12:12:04 4  
12:12:05 5  
12:12:06 6  

I have something like this:

select a.thetime, a.x
from tablea a inner join tableb b on a.thetime between b.starttime and b.endtime

but this produces repeated records for for overlaps in the time, so i get two entries at 12:12:03 and 12:12:04

Thanks
0
Comment
Question by:cesarchavez
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 13
  • 10
  • 5
  • +1
30 Comments
 

Author Comment

by:cesarchavez
ID: 39153947
i cant just do a group by to the output.  This is a simplified view of the problem and i need a query that will operate efficiently on large data sets.
0
 
LVL 49

Expert Comment

by:PortletPaul
ID: 39154057
when questions get simplified, there is a tendency for the cycle to solution to be long.

The problem is that there is overlap in the tableb  starttime/data, so to eradicate the repetition this has to be addressed by some method of simplification.

Proposals for doing so might change a lot depending on the overall query - can you post the whole query?
perhaps this will work for you?

select a.thetime, a.x
from tablea a
inner join tableb b
      on a.thetime between (select min() starttime from tableb)
            and (select max() endtime  from tableb )


but I have to admit it looks 'weird' now I've typed it out

anyway, a better view into your overall query might reveal a better path.
0
 

Author Comment

by:cesarchavez
ID: 39154070
The only real difference is that there are hundreds of variables that are represented by x in table A and hundreds of time periods in table B.  Unfortunately your solution will not work as there could be times in between where the data should not be included.

for example

tablea

thetime,  x

12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
12:12:07 7
12:12:08 8


tableb

id     starttime endtime
100  12:12:01 12:12:03
101  12:12:05 12:12:06


the desired query output is:

 thetime, x,
12:12:01 1  
12:12:02 2  
12:12:03 3  
12:12:05 5  
12:12:06 6
0
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 

Author Comment

by:cesarchavez
ID: 39154074
I did post the comment 1 minute after i posted the question... but i just realised that the simple solution doesnt work when you have hundreds of variables and a massive data set (>100GB)
0
 
LVL 49

Expert Comment

by:PortletPaul
ID: 39154082
please ignore my original post - not sure what I was thinking - certainly not workable and ahas syntax errors - sorry
0
 
LVL 49

Expert Comment

by:PortletPaul
ID: 39154115
tableb - in question

id     starttime endtime
100  12:12:01 12:12:04
101  12:12:03 12:12:06

=======================
tableb - in above comment

id     starttime endtime
100  12:12:01 12:12:03
101  12:12:05 12:12:06
=======================

you have changed the data along the way. Is the data like the original question, or like comment immediately above?
0
 
LVL 49

Expert Comment

by:PortletPaul
ID: 39154118
also, what are the datatypes of thetime, starttime, endtime
are they date?
0
 

Author Comment

by:cesarchavez
ID: 39154124
table b has a lot of records and im just showing examples. so:

for example

tablea

thetime,  x

12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
12:12:07 7
12:12:08 8


tableb

id     starttime endtime
100  12:12:01 12:12:03
101  12:12:05 12:12:06


the desired query output is:

 thetime, x, 
12:12:01 1  
12:12:02 2  
12:12:03 3  
12:12:05 5  
12:12:06 6

Open in new window


and in a different situation i might have

tablea

thetime,  x

12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
12:12:07 7
12:12:08 8


tableb

id     starttime endtime
100  12:12:01 12:12:04
101  12:12:03 12:12:06


the desired query output is:

 thetime, x, 
12:12:01 1  
12:12:02 2  
12:12:03 3  
12:12:04 4  
12:12:05 5  
12:12:06 6  

Open in new window

0
 

Author Comment

by:cesarchavez
ID: 39154125
they are all timestamps
0
 
LVL 49

Expert Comment

by:PortletPaul
ID: 39154150
if the tableb startime/endtime do not overlap, then the repetition stops

this is Ok
id     starttime endtime
100  12:12:01 12:12:03
101  12:12:05 12:12:06

this is NOT
id     starttime endtime
100  12:12:01 12:12:04
101  12:12:03 12:12:06

how many records are in tableb? can you create a table which has just the non-overlapping periods with key back to tableb?

is subquery factorization feasible here?
WITH x AS ()
0
 

Author Comment

by:cesarchavez
ID: 39154218
This seems a little crazy to me.  Im not asking you to redifine my problem.  I have overlapping periods and the query needs to handle them.
0
 
LVL 49

Expert Comment

by:PortletPaul
ID: 39154402
Ceasar, not trying to redefine or to drive you crazy - just trying to understand and locate something feasible for you. Only tthe information provided here is what I know about the issues.

Anyway, here is an attempt to remove the overlapping TableB time-periods which in limited testing appears to work. However I cannot guarantee it will perform very quickly over a very large set of data.
select
  a.thetime
, a.x
from tableA a 
inner join (
            select
              starttime
            , case
                when endtime > lead(starttime,1,endtime) over (order by starttime) 
                  then lead(starttime,1,endtime) over (order by starttime) - 1/(24*60*60) 
                else endtime 
              end as adj_endtime
            from tableB
            ) b on a.thetime >= b.starttime and a.thetime <= b.adj_endtime
order by
  a.thetime

Open in new window

0
 
LVL 49

Expert Comment

by:PortletPaul
ID: 39154411
the testing, done here http://sqlfiddle.com/#!4/bc44f/3 produces the expected results:

THETIME      X
12:12:01      1
12:12:02      2
12:12:03      3
12:12:04      4
12:12:05      5
12:12:06      6

HOWEVER, I cannot judge if adjusting the endtime to produce this result will affect other calculations in your overall query.  To make that judgement would require knowing quite a lot more about that whole query.

Paul
0
 

Author Comment

by:cesarchavez
ID: 39154455
Thanks for your efforts, i appreciate them, but this doesn't work at all.  its not that some periods overlap and i'm looking for a workaround. I could do that myself.  Its that the periods are random, some overlap, some have three embeded periods.  I could have 20 periods all defined as exactly the same time.  

 It may be that there is no answer... but i find that hard to believe.
0
 
LVL 49

Expert Comment

by:PortletPaul
ID: 39154488
select DISTINCT a.thetime, a.x
from tablea a inner join tableb b on a.thetime between b.starttime and b.endtime

select a.thetime, max(a.x)
from tablea a inner join tableb b on a.thetime between b.starttime and b.endtime
group by a.thetime

these are you standard choices.
+ I tried something else based on the sample data provided - but it's unacceptable.


I fear to respond further for causing a redefinition of the question but only you are in possession of the full facts - this web page contains the only facts known to other readers.

You have just revealed new facts:
the periods are random,
some overlap,
some have three embedded periods.
could have multiple periods all defined as exactly the same time.  

and - don't get angry please - you haven't really explained what the overall objective is.

I believe you are spreading data over time-frames and when doing this have to live with the 'row repetition' then 'simplify'.

The simplification might be a 'group by' (high chance of this). I think we probably agree it isn't going to be achieved by starting with 'select distinct' at the top of a giant query, so each part of the query needs to be assessed for correctness and fit.

If you take just the minimum data required from this A to B join and place it into a subquery using group by (possibly a with as, maybe a temp table) - then you might gain the overall desired outcome.

>>It may be that there is no answer
it's quite possible that the answer isn't one you really want.
Appears to me that the 'repetition' brought about by joining A to B will need simplification and my best guess is you will achieve that through 'group by'.
0
 

Author Comment

by:cesarchavez
ID: 39154502
Thanks for your explanation... i was really thinking that there might be a relatively straight forward solution. However im not quite sure i understand.  What if i told you there were 4 embedded periods... do you think that should change the query? i really dont think that i have to provide every single possible situation.  A query is meant to provide a general solution.
0
 
LVL 49

Expert Comment

by:PortletPaul
ID: 39154536
:) it's a frustration of communication by web page I'm afraid., and I truly regret to say I don't understand this bit: " 4 embedded periods." in tableA? or tableB?
would this change the query = sounds like it

>>dont think that i have to provide every single possible situation
but you clearly know enough about queries to know that if exceptions aren't catered for the query outcome will disappoint (like my earlier attempt did)

It's a pity I can't describe adequately what it's like at the other side of this question - I can sense you are frustrated because I can't follow you - but equally I'm frustrated because I cannot see the big picture , or test anything against real data or provide concrete assistance.

Not sure what I can do right now to help.
0
 
LVL 77

Expert Comment

by:slightwv (䄆 Netminder)
ID: 39155266
I too am having a hard time trying to figure out the requirements.

If I do understand it, it looks like you want the rows from tablea that exist in the ranges by second in tableb.

If that's correct, try the code below.  I'm not sure how it will perform on large tables but it is the path I started down.

If it fails, please add to the test case.

drop table tab1 purge;
create table tab1(thetime timestamp, x number);

insert into tab1 values(to_date('12:12:01','HH:MI:SS'),1);
insert into tab1 values(to_date('12:12:02','HH:MI:SS'),2);
insert into tab1 values(to_date('12:12:03','HH:MI:SS'),3);
insert into tab1 values(to_date('12:12:04','HH:MI:SS'),4);
insert into tab1 values(to_date('12:12:05','HH:MI:SS'),5);
insert into tab1 values(to_date('12:12:06','HH:MI:SS'),6);
insert into tab1 values(to_date('12:12:07','HH:MI:SS'),7);
insert into tab1 values(to_date('12:12:08','HH:MI:SS'),8);
commit;

drop table tab2 purge;
create table tab2(id number, starttime timestamp, endtime timestamp);

insert into tab2 values(100,to_date('12:12:01','HH:MI:SS'),to_date('12:12:03','HH:MI:SS'));
insert into tab2 values(101,to_date('12:12:05','HH:MI:SS'),to_date('12:12:06','HH:MI:SS'));
commit;

select thetime,x from tab1
where thetime in (
select starttime + numtodsinterval(column_value-1,'second')
from tab2,
table (
	cast(
		multiset( select level from dual connect by level <= ((cast(endtime as date) - cast(starttime as date))*86400)+1)
  		as sys.odcinumberlist
	)
)
)
/

Open in new window

0
 
LVL 32

Expert Comment

by:awking00
ID: 39155612
>>they are all timestamps<<
Are they all at one second intervals?
0
 

Author Comment

by:cesarchavez
ID: 39156209
thanks for the suggestions... ill try them out this afternoon.

no they are not at one second intervals.  Sorry i was just trying to simplify the problem.

The times in table a are unique but could be at any frequency.  slightwv i think you understand.  all im looking for is a query that will return the data that exists between a range of times defined in table b.  The ranges could overlap or not.  the query i provided at the start does almost what i want.  it gives me all the data between the ranges.  It doesnt require any definition of whether the data is guaranteed second by second... it just returns the rows that are at times between the intervals. The only problem with it is that if the intervals overlap i get repeated records which i dont want.

I need a distinct set of records with no repeated values between any combination of ranges.  If the data is between a range then it is included.  If not.. then it is not included.

To me its sort of like i have a union all and what i want is a union.
0
 
LVL 77

Expert Comment

by:slightwv (䄆 Netminder)
ID: 39156219
>> the data that exists between a range of times defined

Down to what lowest level of granularity?  Timestamps in Oracle have milliseconds.

So do you want the range to be down to milliseconds?

>>the query i provided at the start does almost what i want

It's not pretty but would a select distinct work?
0
 
LVL 49

Expert Comment

by:PortletPaul
ID: 39156289
>>all im looking for is a query that will return the data that exists between a range of times

select a.thetime, a.x
from tablea a
where a.thetime  >= (select min(starttime) from tableb)
and a.thetime  <= (select max(endtime) from tableb)

take the join out the picture, just use tableb as the reference point for lowest and highest date boundaries, this way no duplication of tablea rows can occur
0
 
LVL 32

Expert Comment

by:awking00
ID: 39156305
I have no idea what happened to the comment I posted earlier, but it basically stated the following [words in brackets are new additions]:
While using select distinct can be a performance hog [as slightwv notes not pretty], it should provide the output you need.
select DISTINCT a.thetime, a.x
from tab1 a inner join tab2 b on a.thetime between b.starttime and b.endtime
order by 1, 2;
Note - PortletPaul already suggested this approach but I wonder if you have tried it yet.
0
 
LVL 77

Expert Comment

by:slightwv (䄆 Netminder)
ID: 39156471
>>where a.thetime  >= (select min(starttime) from tableb)  and a.thetime  <= (select max(endtime) from tableb)

That won't work if the dates have a gap.

Think of if in numbers:

tableb:
id,startnum,endnum
1,1,4
2,6,7

You would return anything between 1 and 7.  
What is wanted is:
between (1 and 4) or between (6 and 7)
0
 
LVL 49

Expert Comment

by:PortletPaul
ID: 39156519
>>all im looking for is a query that will return the data that exists between A range of times
emphasis added

What is wanted is:
between (1 and 4) or between (6 and 7)
is it?

>>the desired query output is:

 thetime, x,
12:12:01 1  <-- MIN(b.starttime)
12:12:02 2  
12:12:03 3  
12:12:05 5  
12:12:06 6  <-- MAX(b.endtime)

http://sqlfiddle.com/#!4/bc44f/11

and, yes, select distinct was proposed some while back
but my natural inclination is to avoid this for large sets :: i.e. I agree, it's ugly :)
0
 
LVL 77

Expert Comment

by:slightwv (䄆 Netminder)
ID: 39156541
>>is it?

Yes.

Per http:#a39154070

Notice "12:12:04 4" is not between the ranges in tableb and not in the desired results.
0
 
LVL 49

Accepted Solution

by:
PortletPaul earned 2000 total points
ID: 39156618
different sets of data, my sqlfiddle was established from the original, but noted.

taking that into account, one last attempt (I hope)

this result
THETIME      X
12:12:01      1
12:12:02      2
12:12:03      3
12:12:05      5
12:12:06      6

from:
select to_char(a.thetime,'HH:MM:SS') as thetime, a.x
from tablea a 
where exists (
               select 1 from tableB b
               where a.thetime >= b.starttime and a.thetime <= b.endtime
             )
order by a.x

Open in new window

revised data at  http:#a39154070 Z
(note an order was added it may not be needed)
0
 
LVL 77

Expert Comment

by:slightwv (䄆 Netminder)
ID: 39156627
>>taking that into account, one last attempt (I hope)

Didn't run it but I agree it looks like it will work and is sooooo much better than the path I went down.
0
 
LVL 49

Expert Comment

by:PortletPaul
ID: 39156637
:)

took us a while - and I couldn't sleep thinking there just had to be a better way
might sleep now

oh url error above, it should have been: http://sqlfiddle.com/#!4/207a9/2
0
 

Author Closing Comment

by:cesarchavez
ID: 39162748
genius
0

Featured Post

Get your Conversational Ransomware Defense e‑book

This e-book gives you an insight into the ransomware threat and reviews the fundamentals of top-notch ransomware preparedness and recovery. To help you protect yourself and your organization. The initial infection may be inevitable, so the best protection is to be fully prepared.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Truncate is a DDL Command where as Delete is a DML Command. Both will delete data from table, but what is the difference between these below statements truncate table <table_name> ?? delete from <table_name> ?? The first command cannot be …
Shell script to create broker configuration file using current broker Configuration, solely for purpose of backup on Linux. Script may need to be modified depending on OS-installation. Please deploy and verify the script in a test environment.
Via a live example, show how to take different types of Oracle backups using RMAN.
This video shows how to Export data from an Oracle database using the Datapump Export Utility.  The corresponding Datapump Import utility is also discussed and demonstrated.
Suggested Courses

765 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question