asked on

Unexpected behavior with GROUP BY

When I run the following SQL:

SELECT ca_arbiters.md_license_no, cc_issue_arb.examdt
FROM cc_issue_arb, ca_arbiters
WHERE cc_issue_arb.md_license_number = ca_arbiters.md_license_no

I get:

1234567 1/1/2001
7654321 12/1/2002 ...which is fine

When I run:

SELECT ca_arbiters.md_license_no, cc_issue_arb.examdt
FROM cc_issue_arb, ca_arbiters
WHERE cc_issue_arb.md_license_number = ca_arbiters.md_license_no
GROUP BY ca_arbiters.md_license_no

I get:

1234567 1/1/2001
7654321 1/1/2001
1234567 12/1/2002
7654321 12/1/2002 ...which is strange

My original intention was to use the following SQL with an aggregate. For the examples above I culled out the aggregate, even though the net effect is the same (ie I still get 4 records, 2 with the incorrect date):

SELECT ca_arbiters.md_license_no, cc_issue_arb.examdt, (Count(*)) as num_licenses
FROM cc_issue_arb, ca_arbiters
WHERE cc_issue_arb.md_license_number = ca_arbiters.md_license_no
GROUP BY ca_arbiters.md_license_no

I thought I would receive:

1234567 1/1/2001 1
7654321 12/1/2002 1

...though I got results similar to EX 2 above (the count was correct though).

I thought the above was what the Sybase manual suggested I would receive. Can someone describe why I am getting what appears to be a Cartesian product?

Thanks,
-Chuck

mathavra

The correct SQL statement should look like below:

SELECT ca_arbiters.md_license_no, cc_issue_arb.examdt, (Count(*)) as num_licenses
FROM cc_issue_arb, ca_arbiters
WHERE cc_issue_arb.md_license_number = ca_arbiters.md_license_no
GROUP BY ca_arbiters.md_license_no, cc_issue_arb.examdt

Group by statements should include all the Scalar values
from the SELECT statements.

The following from your question is a expected behavior. In some of the other databases, the following query would error.

SELECT ca_arbiters.md_license_no, cc_issue_arb.examdt
FROM cc_issue_arb, ca_arbiters
WHERE cc_issue_arb.md_license_number = ca_arbiters.md_license_no
GROUP BY ca_arbiters.md_license_no

I get:

1234567 1/1/2001
7654321 1/1/2001
1234567 12/1/2002
7654321 12/1/2002 ...which is strange

Again, you should add cc_issue_arb.examdt in the group by clause.

chuck_forbes

ASKER

In the text which accompanies Sybase, it mentions that the syntax I was using is valid. Assume the foolowing data in a table, Table1

id name date_changed
1 Chuck 1/1/2001
2 mathavra 2/8/1999
3 Chuck 2/1/2001
4 Chuck 3/1/2001

The output I would expect from the following query:

select name, date_changed, (Count(*)) As num_records
from table1
group by name

Is:

Chuck 1/1/2001 3
Chuck 2/1/2001 3
Chuck 3/1/2001 3
mathavra 2/8/1999 1

At least, this is what I discerned from pg 1-193 of the "Sybase SQL Server Reference Manual: Volume I - Commands, Functions, and Topics".

-cf

mathavra

In Sybase, the syntax is a valid syntax. But the query would not produce as you expect. May be there is a mis print in the manual. I am trying to find that manual. Did you find this manual online?.

chuck_forbes

ASKER

I don't think there's a misprint. The manual may be somewhat outdated, though. (it arrived with Sybase SQL server release 10.0). I don't think that matters.

The example reads as follows:

"SELECT type, title_id, avg(price)
FROM titles
GROUP BY type

type title_id
business BU1032 13.73
business BU1111 13.73
business BU2075 13.73
business BU7832 13.73
mod_cook MC2222 11.49
mod_cook MC3021 11.49
UNDECIDED MC3026 NULL
popular_comp PC1035 21.48
popular_comp PC8888 21.48
popular_comp PC9999 21.48

The transact-SQL extended column, title_id (in the SELECT list, not an aggregate and not in the GROUP BY clause) causes all qualified rows to display in each qualified group, even though GROUP BY should produce a single row per group. The average price per group, displayed on each row of each group, is identical to Example 1."

(If you have online documentation, search for 'extended column')

The SQL from Example 1 reads as follows:

"SELECT type, avg(price)
FROM titles
GROUP BY type"

-Chuck

mathavra

I found it online.

Seems like they clearly explained what would happen if the non-aggregate column is not added in the group by clause.

Lets go step by step:

Step1 :
-------

Example 1:

select type, avg(price)
from titles
group by type

type
---------------------- ----------- UNDECIDED NULL business 13.73 mod_cook 11.49 popular_comp 21.48 psychology 13.50 trad_cook 15.96
6 rows affected)

In the above query, it groups the results by type. Since the column "type" (non-aggregate column) is in the SELECT clause and as well in the GROUP BY clause, it groups all the rows by type and finds the average per "type" group.

----------------------------------------------------
So, the result set has one type row and its average.
----------------------------------------------------

Step 2:
-------

Lets look at the query no. 2 in the book.

select type, title_id, avg(price)
from titles
group by type

type title_id
------------ -------- ----------------
business BU1032 13.73
business BU1111 13.73
business BU2075 13.73
business BU7832 13.73
mod_cook MC2222 11.49
mod_cook MC3021 11.49
UNDECIDED MC3026 NULL
popular_comp PC1035 21.48
popular_comp PC8888 21.48
popular_comp PC9999 21.48
psychology PS1372 13.50
psychology PS2091 13.50
psychology PS2106 13.50
psychology PS3333 13.50
psychology PS7777 13.50
trad_cook TC3218 15.96
trad_cook TC4203 15.96
trad_cook TC7777 15.96

(18 rows affected)

Sybase explanation:
The Transact-SQL extended column, title_id (in the select list, not an aggregate and not in the group by clause) causes all qualified rows to display in each qualified group, even though group by should produce a single row per group. The average price per group, displayed on each row of each group, is identical to Example 1.

They clearly explained that since the non-aggregate column "title_id", one of the non-aggregate column is not in group by clause, IT CAUSES ALL QUALIFIED ROWS FOR THAT TYPE TO BE DUPLICATED (cartesian) WITH THE RESULT. You can see the average is same for the group.

The following query

select title_id from titles
where type = "business"

returns as below:

title_id
---------
BU1032
BU1111
BU2075
BU7832

(4 rows affected)

So, always the results will be the "cartesian product" of the result with all "non-aggregate" column values for that group.

Please let me know if you need more clarification. I can provide my own example to explain it.

chuck_forbes

ASKER

It doesn't say duplicated, it says displayed. If you look at both query outputs, there are four records for business. In my query, I initially had only one record per license_no, but after the introduction of the GROUP BY, I had two records per license_no.

mathavra

Sorry for using the word "duplicated". It is displayed with the "CARTESIAN" product. Anyways,

Which 2 queries you are talking about when you meant the following sentence:

"If you look at both query outputs, there are four records
for business"

chuck_forbes

ASKER

Man, mathavra, my previous post sounded evil. Sorry, I was just writing you back as fast as I could. No evil intended.

select type, title_id, avg(price)
from titles
group by type

type title_id
------------ -------- ----------------
business BU1032 13.73
business BU1111 13.73
business BU2075 13.73
business BU7832 13.73
mod_cook MC2222 11.49
........

select title_id from titles
where type = "business"

returns as below:

title_id
---------
BU1032
BU1111
BU2075
BU7832

(4 rows affected)

-cf

ASKER CERTIFIED SOLUTION

mathavra

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

chuck_forbes

ASKER

The documentation doesn't mention anything about cartesian products, though. In my post, I was suggesting that I was seeing similar behaviors. The way I see it, the following two queries are the same:

select type,
title_id,
avg(price)
from titles
group by type

select type,
title_id,
(SELECT avg(t1.price) FROM titles t1 where t1.type = titles.type)
from titles

If the book example was behaving similarly to my original post above, you wouldn't get:

type title_id
------------ -------- ----------------
business BU1032 13.73
business BU1111 13.73
business BU2075 13.73
business BU7832 13.73
mod_cook MC2222 11.49
mod_cook MC3021 11.49
UNDECIDED MC3026 NULL
popular_comp PC1035 21.48
popular_comp PC8888 21.48
..........

but rather:

type title_id
------------ -------- ----------------
business BU1032 13.73
business BU1111 13.73
business BU2075 13.73
business BU7832 13.73
business MC2222 13.73
business MC3021 13.73
business MC3026 13.73
business PC1035 13.73
business PC8888 13.73
..........

And the only difference between my example and the books is that I have two tables vs their one.

-cf

amitpagarwal

its become such a big page now .. couldn't go through all the comments .. chucks .. where do we stand now ?

ahoor

What happens is that the avg uses the column in the group by, that's why you have a different value for each type.

Can't explain it any clearer really... what is the mis-understanding here?

chuck_forbes

ASKER

The example in the Sybase manual makes sense. If I have the following information in a table:

store product price

store_a product_1 10.00
store_a product_2 20.00
store_a product_3 30.00
store_b product_1 10.00
store_b product_2 20.00

and use the SQL:

select store, product, avg(price)
from table1
group by store

I'd get

store_a product_1 20.00
store_a product_2 20.00
store_a product_3 20.00
store_b product_1 15.00
store_b product_2 15.00

Which makes sense:

What I have is something like the following:

Table1
ID
123-45-6789
987-65-4321

Table2
ID mod_date
123-45-6789 1/1/2001
987-65-4321 12/31/2001

When I run the following query:

SELECT Table1.ID, Table2.mod_date, (Count(*)) As CountOF
FROM Table1, Table2,
WHERE Table1.ID = Table2.ID
GROUP BY Table1.ID

I do NOT get:

Table1.ID Table2.mod_date CountOf
123-45-6789 1/1/2001 1
987-65-4321 12/31/2001 1

Instead, I get

Table1.ID Table2.mod_date CountOf
123-45-6789 1/1/2001 1
987-65-4321 1/1/2001 1
123-45-6789 12/31/2001 1
987-65-4321 12/31/2001 1

I'm just having a really difficult time understanding why this occurs.

-Chuck

ahoor

Chuck the main differnce between the Sybase example and your q

ahoor

Chuck
The main differnce between the Sybase example and your query is that you use 2 tables... so the group by makes a carthesian product between the 2 tables to calculate the count and then uses the join.
If you have one table there is no carthesian product occurring. That must be it.

Arjan

chuck_forbes

ASKER

Thanks for the reponses. This one still escapes me, although I've received a fair share of comments here and elsewhere which emphasize a pathway which leads to the Cartesian result I was receiving. I' m just not seeing the machinations that Sybase is going through in this case.

-cf