Link to home
Start Free TrialLog in
Avatar of chuck_forbes
chuck_forbes

asked on

Unexpected behavior with GROUP BY

When I run the following SQL:

SELECT   ca_arbiters.md_license_no, cc_issue_arb.examdt
FROM     cc_issue_arb, ca_arbiters
WHERE    cc_issue_arb.md_license_number = ca_arbiters.md_license_no

I get:

1234567     1/1/2001
7654321     12/1/2002       ...which is fine

When I run:

SELECT   ca_arbiters.md_license_no, cc_issue_arb.examdt
FROM     cc_issue_arb, ca_arbiters
WHERE    cc_issue_arb.md_license_number = ca_arbiters.md_license_no
GROUP BY ca_arbiters.md_license_no

I get:

1234567     1/1/2001
7654321     1/1/2001
1234567     12/1/2002
7654321     12/1/2002      ...which is strange

My original intention was to use the following SQL with an aggregate. For the examples above I culled out the aggregate, even though the net effect is the same (ie I still get 4 records, 2 with the incorrect date):

SELECT   ca_arbiters.md_license_no, cc_issue_arb.examdt, (Count(*)) as num_licenses
FROM     cc_issue_arb, ca_arbiters
WHERE    cc_issue_arb.md_license_number = ca_arbiters.md_license_no
GROUP BY ca_arbiters.md_license_no

I thought I would receive:

1234567     1/1/2001     1
7654321     12/1/2002   1      

 ...though I got results similar to EX 2 above (the count was correct though).

I thought the above was what the Sybase manual suggested I would receive. Can someone describe why I am getting what appears to be a Cartesian product?

Thanks,
-Chuck
Avatar of mathavra
mathavra

The correct SQL statement should look like below:

SELECT   ca_arbiters.md_license_no, cc_issue_arb.examdt, (Count(*)) as num_licenses
FROM     cc_issue_arb, ca_arbiters
WHERE    cc_issue_arb.md_license_number = ca_arbiters.md_license_no
GROUP BY ca_arbiters.md_license_no, cc_issue_arb.examdt

Group by statements should include all the Scalar values
from the SELECT statements.


The following from your question is a expected behavior. In some of the other databases, the following query would error.

SELECT   ca_arbiters.md_license_no, cc_issue_arb.examdt
FROM     cc_issue_arb, ca_arbiters
WHERE    cc_issue_arb.md_license_number = ca_arbiters.md_license_no
GROUP BY ca_arbiters.md_license_no

I get:

1234567     1/1/2001
7654321     1/1/2001
1234567     12/1/2002
7654321     12/1/2002      ...which is strange

Again, you should add cc_issue_arb.examdt in the group by clause.

Avatar of chuck_forbes

ASKER

In the text  which accompanies Sybase, it mentions that the syntax I was using is valid. Assume the foolowing data in a table, Table1

id    name           date_changed
1     Chuck          1/1/2001
2     mathavra     2/8/1999
3     Chuck          2/1/2001
4     Chuck          3/1/2001

The output I would expect from the following query:

select name, date_changed, (Count(*)) As num_records
from table1
group by name

Is:

Chuck        1/1/2001   3
Chuck        2/1/2001   3
Chuck        3/1/2001   3
mathavra   2/8/1999   1

At least, this is what I discerned from pg 1-193 of the "Sybase SQL Server Reference Manual: Volume I - Commands, Functions, and Topics".

-cf
In Sybase, the syntax is a valid syntax. But the query would not produce as you expect. May be there is a mis print in the manual. I am trying to find that manual. Did you find this manual online?.
I don't think there's a misprint. The manual may be somewhat outdated, though. (it arrived with Sybase SQL server release 10.0). I don't think that matters.

The example reads as follows:

"SELECT type, title_id, avg(price)
FROM titles
GROUP BY type

type                       title_id
business             BU1032          13.73
business             BU1111          13.73
business             BU2075          13.73
business             BU7832          13.73
mod_cook           MC2222          11.49
mod_cook           MC3021          11.49
UNDECIDED      MC3026          NULL
popular_comp    PC1035          21.48
popular_comp    PC8888          21.48
popular_comp    PC9999          21.48

The transact-SQL extended column, title_id (in the SELECT list, not an aggregate and not in the GROUP BY clause) causes all qualified rows to display in each qualified group, even though GROUP BY should produce a single row per group. The average price per group, displayed on each row of each group, is identical to Example 1."

(If you have online documentation, search for 'extended column')

The SQL from Example 1 reads as follows:

"SELECT type, avg(price)
FROM titles
GROUP BY type"


-Chuck
I found it online.

Seems like they clearly explained what would happen if the non-aggregate column is not added in the group by clause.

Lets go step by step:

Step1 :
-------

Example 1:

select type, avg(price)
from titles
group by type

 type                  
----------------------  -----------   UNDECIDED                     NULL   business                     13.73   mod_cook                     11.49   popular_comp                 21.48   psychology                   13.50   trad_cook                    15.96  
6 rows affected)  

In the above query, it groups the results by type. Since the column "type" (non-aggregate column) is in the SELECT clause and as well in the GROUP BY clause, it groups all the rows by type and finds the average per "type" group.

----------------------------------------------------
So, the result set has one type row and its average.
----------------------------------------------------

Step 2:
-------

Lets look at the query no. 2 in the book.

select type, title_id, avg(price)
from titles
group by type

  type        title_id            
------------ -------- ----------------  
business     BU1032              13.73  
business     BU1111              13.73  
business     BU2075              13.73  
business     BU7832              13.73  
mod_cook     MC2222              11.49  
mod_cook     MC3021              11.49  
UNDECIDED    MC3026               NULL  
popular_comp PC1035              21.48  
popular_comp PC8888              21.48  
popular_comp PC9999              21.48    
psychology   PS1372              13.50    
psychology   PS2091              13.50    
psychology   PS2106              13.50    
psychology   PS3333              13.50    
psychology   PS7777              13.50    
trad_cook    TC3218              15.96    
trad_cook    TC4203              15.96    
trad_cook    TC7777              15.96

(18 rows affected)  

Sybase explanation:
The Transact-SQL extended column, title_id (in the select list, not an aggregate and not in the group by clause) causes all qualified rows to display in each qualified group, even though group by should produce a single row per group. The average price per group, displayed on each row of each group, is identical to Example 1.

They clearly explained that since the non-aggregate column "title_id", one of the non-aggregate column is not in group by clause, IT CAUSES ALL QUALIFIED ROWS FOR THAT TYPE TO BE DUPLICATED (cartesian) WITH THE RESULT. You can see the average is same for the group.

The following query

select title_id from titles
where type = "business"

returns as below:

title_id
---------
BU1032              
BU1111              
BU2075              
BU7832              

(4 rows affected)

So, always the results will be the "cartesian product" of the result with all "non-aggregate" column values for that group.

Please let me know if you need more clarification. I can provide my own example to explain it.
It doesn't say duplicated, it says displayed. If you look at both query outputs, there are four records for business. In my query, I initially had only one record per license_no, but after the introduction of the GROUP BY, I had two records per license_no.
Sorry for using the word "duplicated". It is displayed with the "CARTESIAN" product. Anyways,

Which 2 queries you are talking about when you meant the following sentence:

"If you look at both query outputs, there are four records
for business"

Man, mathavra, my previous post sounded evil. Sorry, I was just writing you back as fast as I could. No evil intended.



select type, title_id, avg(price)
                       from titles
                       group by type

                        type        title_id            
                       ------------ -------- ----------------  
                       business     BU1032              13.73  
                       business     BU1111              13.73  
                       business     BU2075              13.73  
                       business     BU7832              13.73  
                       mod_cook     MC2222              11.49  
                       ........




select title_id from titles
                       where type = "business"

                       returns as below:

                       title_id
                       ---------
                       BU1032              
                       BU1111              
                       BU2075              
                       BU7832              

                       (4 rows affected)

-cf
ASKER CERTIFIED SOLUTION
Avatar of mathavra
mathavra

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
The documentation doesn't mention anything about cartesian products, though. In my post, I was suggesting that I was seeing similar behaviors. The way I see it, the following two queries are the same:

select type,
title_id,
avg(price)
from titles
group by type

select type,
title_id,
(SELECT avg(t1.price) FROM titles t1 where t1.type = titles.type)
from titles

If the book example was behaving similarly to my original post above, you wouldn't get:

type        title_id            
------------ -------- ----------------  
business     BU1032              13.73  
business     BU1111              13.73  
business     BU2075              13.73  
business     BU7832              13.73  
mod_cook     MC2222              11.49  
mod_cook     MC3021              11.49  
UNDECIDED    MC3026               NULL  
popular_comp PC1035              21.48  
popular_comp PC8888              21.48  
..........

but rather:

type        title_id            
------------ -------- ----------------  
business     BU1032              13.73  
business     BU1111              13.73  
business     BU2075              13.73  
business     BU7832              13.73  
business     MC2222              13.73  
business     MC3021              13.73  
business    MC3026               13.73  
business     PC1035              13.73  
business      PC8888              13.73  
..........


And the only difference between my example and the books is that I have two tables vs their one.

-cf
its become such a big page now .. couldn't go through all the comments .. chucks .. where do we stand now ?
What happens is that the avg uses the column in the group by, that's why you have a different value for each type.

Can't explain it any clearer really... what is the mis-understanding here?
The example in the Sybase manual makes sense. If I have the following information in a table:

store            product          price

store_a       product_1     10.00
store_a       product_2     20.00
store_a       product_3     30.00
store_b       product_1     10.00
store_b       product_2     20.00

and use the SQL:

select store, product, avg(price)
from table1
group by store

I'd get

store_a       product_1     20.00
store_a       product_2     20.00
store_a       product_3     20.00
store_b       product_1     15.00
store_b       product_2     15.00

Which makes sense:


What I have is something like the following:

Table1
ID
123-45-6789
987-65-4321
         

Table2
ID                          mod_date
123-45-6789      1/1/2001
987-65-4321      12/31/2001

When I run the following query:

SELECT Table1.ID, Table2.mod_date, (Count(*)) As CountOF
FROM Table1, Table2,
WHERE Table1.ID = Table2.ID
GROUP BY Table1.ID

I do NOT get:

Table1.ID           Table2.mod_date        CountOf
123-45-6789     1/1/2001                               1
987-65-4321     12/31/2001                           1

Instead, I get

Table1.ID           Table2.mod_date        CountOf
123-45-6789     1/1/2001                               1
987-65-4321     1/1/2001                               1
123-45-6789     12/31/2001                           1
987-65-4321     12/31/2001                           1

I'm just having a really difficult time understanding why this occurs.

-Chuck
Chuck the main differnce between the Sybase example and your q
Chuck
The main differnce between the Sybase example and your query is that you use 2 tables... so the group by makes a carthesian product between the 2 tables to calculate the count and then uses the join.
If you have one table there is no carthesian product occurring. That must be it.

Arjan
Thanks for the reponses. This one still escapes me, although I've received a fair share of comments here and elsewhere which emphasize a pathway which leads to the Cartesian result I was receiving. I' m just not seeing the machinations that Sybase is going through in this case.

-cf