Solved

Unexpected behavior with GROUP BY

Posted on 2001-07-05
16
459 Views
Last Modified: 2010-08-05
When I run the following SQL:

SELECT   ca_arbiters.md_license_no, cc_issue_arb.examdt
FROM     cc_issue_arb, ca_arbiters
WHERE    cc_issue_arb.md_license_number = ca_arbiters.md_license_no

I get:

1234567     1/1/2001
7654321     12/1/2002       ...which is fine

When I run:

SELECT   ca_arbiters.md_license_no, cc_issue_arb.examdt
FROM     cc_issue_arb, ca_arbiters
WHERE    cc_issue_arb.md_license_number = ca_arbiters.md_license_no
GROUP BY ca_arbiters.md_license_no

I get:

1234567     1/1/2001
7654321     1/1/2001
1234567     12/1/2002
7654321     12/1/2002      ...which is strange

My original intention was to use the following SQL with an aggregate. For the examples above I culled out the aggregate, even though the net effect is the same (ie I still get 4 records, 2 with the incorrect date):

SELECT   ca_arbiters.md_license_no, cc_issue_arb.examdt, (Count(*)) as num_licenses
FROM     cc_issue_arb, ca_arbiters
WHERE    cc_issue_arb.md_license_number = ca_arbiters.md_license_no
GROUP BY ca_arbiters.md_license_no

I thought I would receive:

1234567     1/1/2001     1
7654321     12/1/2002   1      

 ...though I got results similar to EX 2 above (the count was correct though).

I thought the above was what the Sybase manual suggested I would receive. Can someone describe why I am getting what appears to be a Cartesian product?

Thanks,
-Chuck
0
Comment
Question by:chuck_forbes
  • 7
  • 5
  • 3
  • +1
16 Comments
 
LVL 3

Expert Comment

by:mathavra
ID: 6256798
The correct SQL statement should look like below:

SELECT   ca_arbiters.md_license_no, cc_issue_arb.examdt, (Count(*)) as num_licenses
FROM     cc_issue_arb, ca_arbiters
WHERE    cc_issue_arb.md_license_number = ca_arbiters.md_license_no
GROUP BY ca_arbiters.md_license_no, cc_issue_arb.examdt

Group by statements should include all the Scalar values
from the SELECT statements.


The following from your question is a expected behavior. In some of the other databases, the following query would error.

SELECT   ca_arbiters.md_license_no, cc_issue_arb.examdt
FROM     cc_issue_arb, ca_arbiters
WHERE    cc_issue_arb.md_license_number = ca_arbiters.md_license_no
GROUP BY ca_arbiters.md_license_no

I get:

1234567     1/1/2001
7654321     1/1/2001
1234567     12/1/2002
7654321     12/1/2002      ...which is strange

Again, you should add cc_issue_arb.examdt in the group by clause.

0
 
LVL 1

Author Comment

by:chuck_forbes
ID: 6256881
In the text  which accompanies Sybase, it mentions that the syntax I was using is valid. Assume the foolowing data in a table, Table1

id    name           date_changed
1     Chuck          1/1/2001
2     mathavra     2/8/1999
3     Chuck          2/1/2001
4     Chuck          3/1/2001

The output I would expect from the following query:

select name, date_changed, (Count(*)) As num_records
from table1
group by name

Is:

Chuck        1/1/2001   3
Chuck        2/1/2001   3
Chuck        3/1/2001   3
mathavra   2/8/1999   1

At least, this is what I discerned from pg 1-193 of the "Sybase SQL Server Reference Manual: Volume I - Commands, Functions, and Topics".

-cf
0
 
LVL 3

Expert Comment

by:mathavra
ID: 6257150
In Sybase, the syntax is a valid syntax. But the query would not produce as you expect. May be there is a mis print in the manual. I am trying to find that manual. Did you find this manual online?.
0
 
LVL 1

Author Comment

by:chuck_forbes
ID: 6257220
I don't think there's a misprint. The manual may be somewhat outdated, though. (it arrived with Sybase SQL server release 10.0). I don't think that matters.

The example reads as follows:

"SELECT type, title_id, avg(price)
FROM titles
GROUP BY type

type                       title_id
business             BU1032          13.73
business             BU1111          13.73
business             BU2075          13.73
business             BU7832          13.73
mod_cook           MC2222          11.49
mod_cook           MC3021          11.49
UNDECIDED      MC3026          NULL
popular_comp    PC1035          21.48
popular_comp    PC8888          21.48
popular_comp    PC9999          21.48

The transact-SQL extended column, title_id (in the SELECT list, not an aggregate and not in the GROUP BY clause) causes all qualified rows to display in each qualified group, even though GROUP BY should produce a single row per group. The average price per group, displayed on each row of each group, is identical to Example 1."

(If you have online documentation, search for 'extended column')

The SQL from Example 1 reads as follows:

"SELECT type, avg(price)
FROM titles
GROUP BY type"


-Chuck
0
 
LVL 3

Expert Comment

by:mathavra
ID: 6257318
I found it online.

Seems like they clearly explained what would happen if the non-aggregate column is not added in the group by clause.

Lets go step by step:

Step1 :
-------

Example 1:

select type, avg(price)
from titles
group by type

 type                  
----------------------  -----------   UNDECIDED                     NULL   business                     13.73   mod_cook                     11.49   popular_comp                 21.48   psychology                   13.50   trad_cook                    15.96  
6 rows affected)  

In the above query, it groups the results by type. Since the column "type" (non-aggregate column) is in the SELECT clause and as well in the GROUP BY clause, it groups all the rows by type and finds the average per "type" group.

----------------------------------------------------
So, the result set has one type row and its average.
----------------------------------------------------

Step 2:
-------

Lets look at the query no. 2 in the book.

select type, title_id, avg(price)
from titles
group by type

  type        title_id            
------------ -------- ----------------  
business     BU1032              13.73  
business     BU1111              13.73  
business     BU2075              13.73  
business     BU7832              13.73  
mod_cook     MC2222              11.49  
mod_cook     MC3021              11.49  
UNDECIDED    MC3026               NULL  
popular_comp PC1035              21.48  
popular_comp PC8888              21.48  
popular_comp PC9999              21.48    
psychology   PS1372              13.50    
psychology   PS2091              13.50    
psychology   PS2106              13.50    
psychology   PS3333              13.50    
psychology   PS7777              13.50    
trad_cook    TC3218              15.96    
trad_cook    TC4203              15.96    
trad_cook    TC7777              15.96

(18 rows affected)  

Sybase explanation:
The Transact-SQL extended column, title_id (in the select list, not an aggregate and not in the group by clause) causes all qualified rows to display in each qualified group, even though group by should produce a single row per group. The average price per group, displayed on each row of each group, is identical to Example 1.

They clearly explained that since the non-aggregate column "title_id", one of the non-aggregate column is not in group by clause, IT CAUSES ALL QUALIFIED ROWS FOR THAT TYPE TO BE DUPLICATED (cartesian) WITH THE RESULT. You can see the average is same for the group.

The following query

select title_id from titles
where type = "business"

returns as below:

title_id
---------
BU1032              
BU1111              
BU2075              
BU7832              

(4 rows affected)

So, always the results will be the "cartesian product" of the result with all "non-aggregate" column values for that group.

Please let me know if you need more clarification. I can provide my own example to explain it.
0
 
LVL 1

Author Comment

by:chuck_forbes
ID: 6257420
It doesn't say duplicated, it says displayed. If you look at both query outputs, there are four records for business. In my query, I initially had only one record per license_no, but after the introduction of the GROUP BY, I had two records per license_no.
0
 
LVL 3

Expert Comment

by:mathavra
ID: 6257454
Sorry for using the word "duplicated". It is displayed with the "CARTESIAN" product. Anyways,

Which 2 queries you are talking about when you meant the following sentence:

"If you look at both query outputs, there are four records
for business"

0
 
LVL 1

Author Comment

by:chuck_forbes
ID: 6257538
Man, mathavra, my previous post sounded evil. Sorry, I was just writing you back as fast as I could. No evil intended.



select type, title_id, avg(price)
                       from titles
                       group by type

                        type        title_id            
                       ------------ -------- ----------------  
                       business     BU1032              13.73  
                       business     BU1111              13.73  
                       business     BU2075              13.73  
                       business     BU7832              13.73  
                       mod_cook     MC2222              11.49  
                       ........




select title_id from titles
                       where type = "business"

                       returns as below:

                       title_id
                       ---------
                       BU1032              
                       BU1111              
                       BU2075              
                       BU7832              

                       (4 rows affected)

-cf
0
Threat Intelligence Starter Resources

Integrating threat intelligence can be challenging, and not all companies are ready. These resources can help you build awareness and prepare for defense.

 
LVL 3

Accepted Solution

by:
mathavra earned 100 total points
ID: 6257614
The reason why I have that query is to show how many rows are there for the type "business".

If you run the query without title_id in the select clause, you are suppose to get only one row.

Type        
--------------------
business     13.73  
                   
 
The moment you add title_id in the select clause without adding it also in the group by will display the same above "one row" with all the possible "title_id" for that type. i.e. 4 possible title_ids displayed by the second query. So the output becomes

type        title_id  
------------ -------- ----------------  
business     BU1032              13.73  
business     BU1111              13.73  
business     BU2075              13.73  
business     BU7832              13.73

-----------------------------------------------------------
Take another type for example "popular_comp":

The proper group by query

QUERY1
------
select type, avg(price) from titles
where type = "popular_comp"
group by type

returns:

popular_comp      21.48  


Find how many titles are available for that type "popular_comp":

QUERY2
------
select title_id from titles where
type = "popular_comp"

will return the following three rows:

PC1035
PC8888
PC9999              

NOW when you add title_id in the select clause of the first query as below:

QUERY3
------
select type, title_id, avg(price) from titles
wheere type = "popular_comp"
group by type

It will display that one row result with the cartesian of all the possible titles (shown in the second query) for that type which would be three rows as below.

popular_comp PC1035              21.48  
popular_comp PC8888              21.48  
popular_comp PC9999              21.48    

0
 
LVL 1

Author Comment

by:chuck_forbes
ID: 6257825
The documentation doesn't mention anything about cartesian products, though. In my post, I was suggesting that I was seeing similar behaviors. The way I see it, the following two queries are the same:

select type,
title_id,
avg(price)
from titles
group by type

select type,
title_id,
(SELECT avg(t1.price) FROM titles t1 where t1.type = titles.type)
from titles

If the book example was behaving similarly to my original post above, you wouldn't get:

type        title_id            
------------ -------- ----------------  
business     BU1032              13.73  
business     BU1111              13.73  
business     BU2075              13.73  
business     BU7832              13.73  
mod_cook     MC2222              11.49  
mod_cook     MC3021              11.49  
UNDECIDED    MC3026               NULL  
popular_comp PC1035              21.48  
popular_comp PC8888              21.48  
..........

but rather:

type        title_id            
------------ -------- ----------------  
business     BU1032              13.73  
business     BU1111              13.73  
business     BU2075              13.73  
business     BU7832              13.73  
business     MC2222              13.73  
business     MC3021              13.73  
business    MC3026               13.73  
business     PC1035              13.73  
business      PC8888              13.73  
..........


And the only difference between my example and the books is that I have two tables vs their one.

-cf
0
 
LVL 5

Expert Comment

by:amitpagarwal
ID: 6258224
its become such a big page now .. couldn't go through all the comments .. chucks .. where do we stand now ?
0
 
LVL 3

Expert Comment

by:ahoor
ID: 6259036
What happens is that the avg uses the column in the group by, that's why you have a different value for each type.

Can't explain it any clearer really... what is the mis-understanding here?
0
 
LVL 1

Author Comment

by:chuck_forbes
ID: 6260013
The example in the Sybase manual makes sense. If I have the following information in a table:

store            product          price

store_a       product_1     10.00
store_a       product_2     20.00
store_a       product_3     30.00
store_b       product_1     10.00
store_b       product_2     20.00

and use the SQL:

select store, product, avg(price)
from table1
group by store

I'd get

store_a       product_1     20.00
store_a       product_2     20.00
store_a       product_3     20.00
store_b       product_1     15.00
store_b       product_2     15.00

Which makes sense:


What I have is something like the following:

Table1
ID
123-45-6789
987-65-4321
         

Table2
ID                          mod_date
123-45-6789      1/1/2001
987-65-4321      12/31/2001

When I run the following query:

SELECT Table1.ID, Table2.mod_date, (Count(*)) As CountOF
FROM Table1, Table2,
WHERE Table1.ID = Table2.ID
GROUP BY Table1.ID

I do NOT get:

Table1.ID           Table2.mod_date        CountOf
123-45-6789     1/1/2001                               1
987-65-4321     12/31/2001                           1

Instead, I get

Table1.ID           Table2.mod_date        CountOf
123-45-6789     1/1/2001                               1
987-65-4321     1/1/2001                               1
123-45-6789     12/31/2001                           1
987-65-4321     12/31/2001                           1

I'm just having a really difficult time understanding why this occurs.

-Chuck
0
 
LVL 3

Expert Comment

by:ahoor
ID: 6260681
Chuck the main differnce between the Sybase example and your q
0
 
LVL 3

Expert Comment

by:ahoor
ID: 6260692
Chuck
The main differnce between the Sybase example and your query is that you use 2 tables... so the group by makes a carthesian product between the 2 tables to calculate the count and then uses the join.
If you have one table there is no carthesian product occurring. That must be it.

Arjan
0
 
LVL 1

Author Comment

by:chuck_forbes
ID: 6267211
Thanks for the reponses. This one still escapes me, although I've received a fair share of comments here and elsewhere which emphasize a pathway which leads to the Cartesian result I was receiving. I' m just not seeing the machinations that Sybase is going through in this case.

-cf
0

Featured Post

How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

Join & Write a Comment

Suggested Solutions

In this article, I show you step by step with screenshots to assist you - HOW TO: Deploy and Install the VMware vCenter Server Appliance 6.5 (VCSA 6.5), with some helpful tips along the way.
Use of TCL script on Cisco devices:  - create file and merge it with running configuration to apply configuration changes
Get a first impression of how PRTG looks and learn how it works.   This video is a short introduction to PRTG, as an initial overview or as a quick start for new PRTG users.
Polish reports in Access so they look terrific. Take yourself to another level. Equations, Back Color, Alternate Back Color. Write easy VBA Code. Tighten space to use less pages. Launch report from a menu, considering criteria only when it is filled…

757 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

18 Experts available now in Live!

Get 1:1 Help Now