Solved

Get distinct values when using inner join

Posted on 2013-05-09
30
579 Views
Last Modified: 2013-05-13
HI Experts!

I have a problem that is similar to one i asked previously

I have two tables tablea and tableb

tablea

thetime,  x

12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
12:12:07 7
12:12:08 8


tableb

id     starttime endtime
100  12:12:01 12:12:04
101  12:12:03 12:12:06


the desired query output is:

 thetime, x,
12:12:01 1  
12:12:02 2  
12:12:03 3  
12:12:04 4  
12:12:05 5  
12:12:06 6  

I have something like this:

select a.thetime, a.x
from tablea a inner join tableb b on a.thetime between b.starttime and b.endtime

but this produces repeated records for for overlaps in the time, so i get two entries at 12:12:03 and 12:12:04

Thanks
0
Comment
Question by:cesarchavez
  • 13
  • 10
  • 5
  • +1
30 Comments
 

Author Comment

by:cesarchavez
Comment Utility
i cant just do a group by to the output.  This is a simplified view of the problem and i need a query that will operate efficiently on large data sets.
0
 
LVL 48

Expert Comment

by:PortletPaul
Comment Utility
when questions get simplified, there is a tendency for the cycle to solution to be long.

The problem is that there is overlap in the tableb  starttime/data, so to eradicate the repetition this has to be addressed by some method of simplification.

Proposals for doing so might change a lot depending on the overall query - can you post the whole query?
perhaps this will work for you?

select a.thetime, a.x
from tablea a
inner join tableb b
      on a.thetime between (select min() starttime from tableb)
            and (select max() endtime  from tableb )


but I have to admit it looks 'weird' now I've typed it out

anyway, a better view into your overall query might reveal a better path.
0
 

Author Comment

by:cesarchavez
Comment Utility
The only real difference is that there are hundreds of variables that are represented by x in table A and hundreds of time periods in table B.  Unfortunately your solution will not work as there could be times in between where the data should not be included.

for example

tablea

thetime,  x

12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
12:12:07 7
12:12:08 8


tableb

id     starttime endtime
100  12:12:01 12:12:03
101  12:12:05 12:12:06


the desired query output is:

 thetime, x,
12:12:01 1  
12:12:02 2  
12:12:03 3  
12:12:05 5  
12:12:06 6
0
 

Author Comment

by:cesarchavez
Comment Utility
I did post the comment 1 minute after i posted the question... but i just realised that the simple solution doesnt work when you have hundreds of variables and a massive data set (>100GB)
0
 
LVL 48

Expert Comment

by:PortletPaul
Comment Utility
please ignore my original post - not sure what I was thinking - certainly not workable and ahas syntax errors - sorry
0
 
LVL 48

Expert Comment

by:PortletPaul
Comment Utility
tableb - in question

id     starttime endtime
100  12:12:01 12:12:04
101  12:12:03 12:12:06

=======================
tableb - in above comment

id     starttime endtime
100  12:12:01 12:12:03
101  12:12:05 12:12:06
=======================

you have changed the data along the way. Is the data like the original question, or like comment immediately above?
0
 
LVL 48

Expert Comment

by:PortletPaul
Comment Utility
also, what are the datatypes of thetime, starttime, endtime
are they date?
0
 

Author Comment

by:cesarchavez
Comment Utility
table b has a lot of records and im just showing examples. so:

for example

tablea

thetime,  x

12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
12:12:07 7
12:12:08 8


tableb

id     starttime endtime
100  12:12:01 12:12:03
101  12:12:05 12:12:06


the desired query output is:

 thetime, x, 
12:12:01 1  
12:12:02 2  
12:12:03 3  
12:12:05 5  
12:12:06 6

Open in new window


and in a different situation i might have

tablea

thetime,  x

12:12:01 1
12:12:02 2
12:12:03 3
12:12:04 4
12:12:05 5
12:12:06 6
12:12:07 7
12:12:08 8


tableb

id     starttime endtime
100  12:12:01 12:12:04
101  12:12:03 12:12:06


the desired query output is:

 thetime, x, 
12:12:01 1  
12:12:02 2  
12:12:03 3  
12:12:04 4  
12:12:05 5  
12:12:06 6  

Open in new window

0
 

Author Comment

by:cesarchavez
Comment Utility
they are all timestamps
0
 
LVL 48

Expert Comment

by:PortletPaul
Comment Utility
if the tableb startime/endtime do not overlap, then the repetition stops

this is Ok
id     starttime endtime
100  12:12:01 12:12:03
101  12:12:05 12:12:06

this is NOT
id     starttime endtime
100  12:12:01 12:12:04
101  12:12:03 12:12:06

how many records are in tableb? can you create a table which has just the non-overlapping periods with key back to tableb?

is subquery factorization feasible here?
WITH x AS ()
0
 

Author Comment

by:cesarchavez
Comment Utility
This seems a little crazy to me.  Im not asking you to redifine my problem.  I have overlapping periods and the query needs to handle them.
0
 
LVL 48

Expert Comment

by:PortletPaul
Comment Utility
Ceasar, not trying to redefine or to drive you crazy - just trying to understand and locate something feasible for you. Only tthe information provided here is what I know about the issues.

Anyway, here is an attempt to remove the overlapping TableB time-periods which in limited testing appears to work. However I cannot guarantee it will perform very quickly over a very large set of data.
select
  a.thetime
, a.x
from tableA a 
inner join (
            select
              starttime
            , case
                when endtime > lead(starttime,1,endtime) over (order by starttime) 
                  then lead(starttime,1,endtime) over (order by starttime) - 1/(24*60*60) 
                else endtime 
              end as adj_endtime
            from tableB
            ) b on a.thetime >= b.starttime and a.thetime <= b.adj_endtime
order by
  a.thetime

Open in new window

0
 
LVL 48

Expert Comment

by:PortletPaul
Comment Utility
the testing, done here http://sqlfiddle.com/#!4/bc44f/3 produces the expected results:

THETIME      X
12:12:01      1
12:12:02      2
12:12:03      3
12:12:04      4
12:12:05      5
12:12:06      6

HOWEVER, I cannot judge if adjusting the endtime to produce this result will affect other calculations in your overall query.  To make that judgement would require knowing quite a lot more about that whole query.

Paul
0
 

Author Comment

by:cesarchavez
Comment Utility
Thanks for your efforts, i appreciate them, but this doesn't work at all.  its not that some periods overlap and i'm looking for a workaround. I could do that myself.  Its that the periods are random, some overlap, some have three embeded periods.  I could have 20 periods all defined as exactly the same time.  

 It may be that there is no answer... but i find that hard to believe.
0
 
LVL 48

Expert Comment

by:PortletPaul
Comment Utility
select DISTINCT a.thetime, a.x
from tablea a inner join tableb b on a.thetime between b.starttime and b.endtime

select a.thetime, max(a.x)
from tablea a inner join tableb b on a.thetime between b.starttime and b.endtime
group by a.thetime

these are you standard choices.
+ I tried something else based on the sample data provided - but it's unacceptable.


I fear to respond further for causing a redefinition of the question but only you are in possession of the full facts - this web page contains the only facts known to other readers.

You have just revealed new facts:
the periods are random,
some overlap,
some have three embedded periods.
could have multiple periods all defined as exactly the same time.  

and - don't get angry please - you haven't really explained what the overall objective is.

I believe you are spreading data over time-frames and when doing this have to live with the 'row repetition' then 'simplify'.

The simplification might be a 'group by' (high chance of this). I think we probably agree it isn't going to be achieved by starting with 'select distinct' at the top of a giant query, so each part of the query needs to be assessed for correctness and fit.

If you take just the minimum data required from this A to B join and place it into a subquery using group by (possibly a with as, maybe a temp table) - then you might gain the overall desired outcome.

>>It may be that there is no answer
it's quite possible that the answer isn't one you really want.
Appears to me that the 'repetition' brought about by joining A to B will need simplification and my best guess is you will achieve that through 'group by'.
0
PRTG Network Monitor: Intuitive Network Monitoring

Network Monitoring is essential to ensure that computer systems and network devices are running. Use PRTG to monitor LANs, servers, websites, applications and devices, bandwidth, virtual environments, remote systems, IoT, and many more. PRTG is easy to set up & use.

 

Author Comment

by:cesarchavez
Comment Utility
Thanks for your explanation... i was really thinking that there might be a relatively straight forward solution. However im not quite sure i understand.  What if i told you there were 4 embedded periods... do you think that should change the query? i really dont think that i have to provide every single possible situation.  A query is meant to provide a general solution.
0
 
LVL 48

Expert Comment

by:PortletPaul
Comment Utility
:) it's a frustration of communication by web page I'm afraid., and I truly regret to say I don't understand this bit: " 4 embedded periods." in tableA? or tableB?
would this change the query = sounds like it

>>dont think that i have to provide every single possible situation
but you clearly know enough about queries to know that if exceptions aren't catered for the query outcome will disappoint (like my earlier attempt did)

It's a pity I can't describe adequately what it's like at the other side of this question - I can sense you are frustrated because I can't follow you - but equally I'm frustrated because I cannot see the big picture , or test anything against real data or provide concrete assistance.

Not sure what I can do right now to help.
0
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
Comment Utility
I too am having a hard time trying to figure out the requirements.

If I do understand it, it looks like you want the rows from tablea that exist in the ranges by second in tableb.

If that's correct, try the code below.  I'm not sure how it will perform on large tables but it is the path I started down.

If it fails, please add to the test case.

drop table tab1 purge;
create table tab1(thetime timestamp, x number);

insert into tab1 values(to_date('12:12:01','HH:MI:SS'),1);
insert into tab1 values(to_date('12:12:02','HH:MI:SS'),2);
insert into tab1 values(to_date('12:12:03','HH:MI:SS'),3);
insert into tab1 values(to_date('12:12:04','HH:MI:SS'),4);
insert into tab1 values(to_date('12:12:05','HH:MI:SS'),5);
insert into tab1 values(to_date('12:12:06','HH:MI:SS'),6);
insert into tab1 values(to_date('12:12:07','HH:MI:SS'),7);
insert into tab1 values(to_date('12:12:08','HH:MI:SS'),8);
commit;

drop table tab2 purge;
create table tab2(id number, starttime timestamp, endtime timestamp);

insert into tab2 values(100,to_date('12:12:01','HH:MI:SS'),to_date('12:12:03','HH:MI:SS'));
insert into tab2 values(101,to_date('12:12:05','HH:MI:SS'),to_date('12:12:06','HH:MI:SS'));
commit;

select thetime,x from tab1
where thetime in (
select starttime + numtodsinterval(column_value-1,'second')
from tab2,
table (
	cast(
		multiset( select level from dual connect by level <= ((cast(endtime as date) - cast(starttime as date))*86400)+1)
  		as sys.odcinumberlist
	)
)
)
/

Open in new window

0
 
LVL 31

Expert Comment

by:awking00
Comment Utility
>>they are all timestamps<<
Are they all at one second intervals?
0
 

Author Comment

by:cesarchavez
Comment Utility
thanks for the suggestions... ill try them out this afternoon.

no they are not at one second intervals.  Sorry i was just trying to simplify the problem.

The times in table a are unique but could be at any frequency.  slightwv i think you understand.  all im looking for is a query that will return the data that exists between a range of times defined in table b.  The ranges could overlap or not.  the query i provided at the start does almost what i want.  it gives me all the data between the ranges.  It doesnt require any definition of whether the data is guaranteed second by second... it just returns the rows that are at times between the intervals. The only problem with it is that if the intervals overlap i get repeated records which i dont want.

I need a distinct set of records with no repeated values between any combination of ranges.  If the data is between a range then it is included.  If not.. then it is not included.

To me its sort of like i have a union all and what i want is a union.
0
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
Comment Utility
>> the data that exists between a range of times defined

Down to what lowest level of granularity?  Timestamps in Oracle have milliseconds.

So do you want the range to be down to milliseconds?

>>the query i provided at the start does almost what i want

It's not pretty but would a select distinct work?
0
 
LVL 48

Expert Comment

by:PortletPaul
Comment Utility
>>all im looking for is a query that will return the data that exists between a range of times

select a.thetime, a.x
from tablea a
where a.thetime  >= (select min(starttime) from tableb)
and a.thetime  <= (select max(endtime) from tableb)

take the join out the picture, just use tableb as the reference point for lowest and highest date boundaries, this way no duplication of tablea rows can occur
0
 
LVL 31

Expert Comment

by:awking00
Comment Utility
I have no idea what happened to the comment I posted earlier, but it basically stated the following [words in brackets are new additions]:
While using select distinct can be a performance hog [as slightwv notes not pretty], it should provide the output you need.
select DISTINCT a.thetime, a.x
from tab1 a inner join tab2 b on a.thetime between b.starttime and b.endtime
order by 1, 2;
Note - PortletPaul already suggested this approach but I wonder if you have tried it yet.
0
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
Comment Utility
>>where a.thetime  >= (select min(starttime) from tableb)  and a.thetime  <= (select max(endtime) from tableb)

That won't work if the dates have a gap.

Think of if in numbers:

tableb:
id,startnum,endnum
1,1,4
2,6,7

You would return anything between 1 and 7.  
What is wanted is:
between (1 and 4) or between (6 and 7)
0
 
LVL 48

Expert Comment

by:PortletPaul
Comment Utility
>>all im looking for is a query that will return the data that exists between A range of times
emphasis added

What is wanted is:
between (1 and 4) or between (6 and 7)
is it?

>>the desired query output is:

 thetime, x,
12:12:01 1  <-- MIN(b.starttime)
12:12:02 2  
12:12:03 3  
12:12:05 5  
12:12:06 6  <-- MAX(b.endtime)

http://sqlfiddle.com/#!4/bc44f/11

and, yes, select distinct was proposed some while back
but my natural inclination is to avoid this for large sets :: i.e. I agree, it's ugly :)
0
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
Comment Utility
>>is it?

Yes.

Per http:#a39154070

Notice "12:12:04 4" is not between the ranges in tableb and not in the desired results.
0
 
LVL 48

Accepted Solution

by:
PortletPaul earned 500 total points
Comment Utility
different sets of data, my sqlfiddle was established from the original, but noted.

taking that into account, one last attempt (I hope)

this result
THETIME      X
12:12:01      1
12:12:02      2
12:12:03      3
12:12:05      5
12:12:06      6

from:
select to_char(a.thetime,'HH:MM:SS') as thetime, a.x
from tablea a 
where exists (
               select 1 from tableB b
               where a.thetime >= b.starttime and a.thetime <= b.endtime
             )
order by a.x

Open in new window

revised data at  http:#a39154070 Z
(note an order was added it may not be needed)
0
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
Comment Utility
>>taking that into account, one last attempt (I hope)

Didn't run it but I agree it looks like it will work and is sooooo much better than the path I went down.
0
 
LVL 48

Expert Comment

by:PortletPaul
Comment Utility
:)

took us a while - and I couldn't sleep thinking there just had to be a better way
might sleep now

oh url error above, it should have been: http://sqlfiddle.com/#!4/207a9/2
0
 

Author Closing Comment

by:cesarchavez
Comment Utility
genius
0

Featured Post

PRTG Network Monitor: Intuitive Network Monitoring

Network Monitoring is essential to ensure that computer systems and network devices are running. Use PRTG to monitor LANs, servers, websites, applications and devices, bandwidth, virtual environments, remote systems, IoT, and many more. PRTG is easy to set up & use.

Join & Write a Comment

Suggested Solutions

Working with Network Access Control Lists in Oracle 11g (part 2) Part 1: http://www.e-e.com/A_8429.html Previously, I introduced the basics of network ACL's including how to create, delete and modify entries to allow and deny access.  For many…
Cursors in Oracle: A cursor is used to process individual rows returned by database system for a query. In oracle every SQL statement executed by the oracle server has a private area. This area contains information about the SQL statement and the…
Via a live example show how to connect to RMAN, make basic configuration settings changes and then take a backup of a demo database
This video shows how to copy an entire tablespace from one database to another database using Transportable Tablespace functionality.

762 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

15 Experts available now in Live!

Get 1:1 Help Now