Chapter 33 – Searching with Groups¶
Note
You are reading a digital copy of SQL-99 Complete, Really, a book that documents the SQL-99 standard.
For more information specific to CrateDB, check out the CrateDB Reference.
This chapter deals with three optional points of a SELECT
statement: the
GROUP BY
clause, the set functions {AVG, COUNT, MAX, MIN, SUM, EVERY,
ANY, SOME, GROUPING
} and the HAVING
clause. Often these things appear
together: the common factor is summaries, or amalgams, of Columns – with
groups, rather than with details. We group together where values are equal. For
example, confronted with the detail list {Smith Smith Smith Jones Jones}, we
could summarize it to be: “three Smiths, two Joneses”. Such a summary is known
in SQL as a grouped Table.
Table of Contents
GROUP BY Clause¶
The GROUP BY
clause is an optional portion of the SELECT
statement.
It defines a grouped Table. The required syntax for the GROUP BY
clause is:
GROUP BY <grouping specification>
<grouping specification> ::=
<grouping Column reference list> |
ROLLUP (<grouping Column reference list>) |
CUBE (<grouping Column reference list>) |
GROUPING SETS (<grouping set list>) |
() |
<grouping set>,<grouping set list>
<grouping Column reference list> ::=
<Column reference> [ COLLATE <Collation name> ] [ ,... ]
<grouping set list> ::=
<grouping set> [ {,<grouping set>}... ]
<grouping set> ::=
<grouping Column reference> |
(<grouping column reference list>) |
ROLLUP (<grouping Column reference list>) |
CUBE (<grouping Column reference list>) |
()
GROUP BY
defines a grouped Table: a set of groups of rows, where each group
consists of the rows in which all the values of the grouping Column(s) are
equal (the group is the entire Table if you use the HAVING
clause without a
preceding GROUP BY
clause). Here are three SELECT
statements; each
contains a GROUP BY
clause:
SELECT column_1 FROM Table_1
GROUP BY column_1;
SELECT column_1 FROM Table_1
WHERE column_1 BETWEEN 10 AND 50
GROUP BY column_1;
SELECT column_1 FROM Table_1
WHERE column_1 BETWEEN 10 AND 50
GROUP BY column_1 HAVING SUM(column_1)>12;
These examples illustrate that the GROUP BY
clause contains a list of
<Column reference>s (the grouping Columns) and that it comes at a certain place
within the various optional or compulsory clauses of a SELECT
statement. In
each case, the grouping <Column reference> names a Column that belongs to a
Table named in the SELECT ... FROM
clause – the argument of the GROUP
BY
clause is a list of (optionally qualified) <Column name>s and only <Column
name>s; SQL doesn’t allow you to use <literal>s, Column expressions, or any
operator/function except COLLATE
in a GROUP BY
clause. In addition,
when a SELECT
statement includes GROUP BY
, the statement’s select list
may consist only of references to Columns that are single-valued per group –
this means that the select list can’t include a reference to an interim result
Column that isn’t also included in the GROUP BY
clause unless that Column
is an argument for one of the set functions (AVG, COUNT, MAX, MIN, SUM,
EVERY, ANY, SOME
; each of which reduce the collection of values from a Column
to a single value). These are severe restrictions, but we’ll show you ways to
get around some of the restrictions.
The GROUP BY
clause allows you to summarize SQL-data. For an example of how
it works, let’s look at some basic queries on one of the sample Tables we
defined in our chapter on “Simple Search Conditions”: the PAYROLL
Table,
which looks like this:
PAYROLL |
||||
---|---|---|---|---|
|
|
|
|
|
1 |
6.00 |
10TH FLOOR |
1989-10-31 |
10:15:00 |
2 |
5.00 |
16TH FLOOR |
1989-09-30 |
10:20:00 |
3 |
5.00 |
WAREHOUSE |
1989-09-30 |
10:30:00 |
4 |
8.00 |
BASEMENT |
1989-10-15 |
12:00:10 |
10 |
16.00 |
16TH FLOOR |
1989-09-30 |
12:30:00 |
11 |
16.00 |
16TH FLOOR |
1989-10-15 |
13:15:10 |
20 |
9.00 |
WAREHOUSE |
1989-10-15 |
14:00:00 |
28 |
? |
16TH FLOOR |
1989-09-15 |
14:10:00 |
35 |
9.00 |
10TH FLOOR |
1989-10-31 |
14:20:00 |
40 |
16.00 |
10TH FLOOR |
1989-10-31 |
14:35:07 |
The simplest GROUP BY
example on PAYROLL
groups the LOCATION
Column into its four different values. Here’s the SELECT
statement:
SELECT location FROM Payroll
GROUP BY location;
To get the result for this SELECT
statement, your DBMS will first
evaluate the FROM
clause to construct an interim result Table (in
this case, the entire PAYROLL
Table), then pass this interim result
to the GROUP BY
clause. When it evaluates the GROUP BY
clause,
it breaks the interim result into groups which have the same values in
LOCATION
(the grouping Column), then passes this interim result to
the select list. When it evaluates the select list, your DBMS throws out
all Columns which aren’t named in the select list. The result, then, is:
|
10TH FLOOR |
16TH FLOOR |
WAREHOUSE |
BASEMENT |
A slightly more complicated example groups PAYROLL
's LOCATION
and
RATE
Columns. Here’s the SELECT
statement:
SELECT location, rate FROM Payroll
GROUP BY location, rate;
To get the result for this SELECT
statement, your DBMS will follow the same
steps we described for the last example – up until it reaches the interim
result that contains groups which have the same values in LOCATION
(the
first grouping Column). At this point, instead of passing the result to the
select list, the DBMS will now break the new interim result into groups which
have the same values in LOCATION
and RATE
(the second grouping Column).
This interim result is then passed on to the select list for evaluation. The
final result is (note that the NULL
value in the RATE
Column is in a
group of its own; for GROUP BY
all NULL
s form a single group):
|
|
10TH FLOOR |
6.00 |
10TH FLOOR |
9.00 |
10TH FLOOR |
16.00 |
16TH FLOOR |
5.00 |
16TH FLOOR |
16.00 |
16TH FLOOR |
? |
WAREHOUSE |
5.00 |
WAREHOUSE |
9.00 |
BASEMENT |
8.00 |
One last example: grouping with a WHERE
clause. Here’s a SELECT
statement:
SELECT location, rate FROM Payroll
WHERE rate > 6.00
GROUP BY location, rate;
To get the result for this SELECT
statement, your DBMS will get a copy of
the PAYROLL
Table (evaluate the FROM
clause), remove any rows where the
RATE
value is less than or equal to 6 (evaluate the WHERE
clause),
break the result into groups which have the same values in LOCATION
and
RATE
(evaluate the GROUP BY
clause) and remove any groups which aren’t
named in the select list. The result is:
|
|
10TH FLOOR |
9.00 |
10TH FLOOR |
16.00 |
16TH FLOOR |
16.00 |
16TH FLOOR |
? |
WAREHOUSE |
9.00 |
BASEMENT |
8.00 |
Rules For Grouping Columns¶
In the GROUP BY
clause:
Each Column in a
GROUP BY
clause must unambiguously name a Column that belongs to a Table named in theSELECT
statement’sFROM
clause. The name may be qualified, i.e.: it may be a <Column reference>. Such a Column is called a grouping Column: its values will be grouped for the final result.The grouping Column <data type> may not be
BLOB
,CLOB
,NCLOB
orARRAY
.If the grouping Column <data type> is
CHAR
orVARCHAR
, then the <Column reference> may be accompanied byCOLLATE
<Collation name>. This addition was added to SQL with SQL-92 and may not be supported by all DBMSs. The idea is that you should be able to match for upper|lower case, or usePAD SPACES
when you’re defining a group’s values. Other thanCOLLATE
, all SQL operators and functions are illegal in aGROUP BY
clause.
In the select list:
You must follow “The Single-Value Rule” – every Column named in the select list must also be a grouping Column, unless it is an argument for one of the set functions.
The select list may include derived Columns – <literal>s, scalar functions, and <Column name>s within expressions (only the
GROUP BY
clause disallows expressions).
Here are some examples of grouped SELECT
statements, legal and not
(a
is a Column of Table T
):
SELECT a FROM T GROUP BY a;
-- legal: grouping Column = select Column
SELECT a || a FROM T GROUP BY a;
-- legal: a is grouped and it applies to both instances in select list
SELECT MAX(a) AS b FROM T GROUP BY a;
-- legal: a need not be in select list
SELECT a+5 FROM T GROUP BY a;
-- legal: expression in select list refers to grouping Column
SELECT 5 FROM T GROUP BY a;
-- legal: the <literal> isn't a reference to a Column of T, so it
doesn't have to be a grouping Column: you'll get a bunch of "5"s
SELECT a*5 AS b FROM T GROUP BY b;
-- illegal: a is not a grouping Column and b isn't evaluated until the
select list is; by then it's too late
SELECT a,max(a) FROM T;
-- illegal: GROUP BY "implied", see set functions
SELECT a+5 FROM T GROUP BY a+5;
-- illegal: expression in GROUP BY
Caution
The superficial similarity of the GROUP BY
clause and the ORDER BY
clause often misleads people. The big difference is that grouping is done on
the input (that is, the Tables named in the FROM
clause), while ordering
is done on the output (that is, the Columns named in the select list). So,
although you can say “ORDER BY
integer” (only in SQL-92 though) and
“ORDER BY
expression”, it makes no sense to say “GROUP BY
integer” or
“GROUP BY
expression”. On the other hand, grouping Columns don’t have to
be in the select list, as sorted Columns must.
Caution
The SQL Standard doesn’t specify how many Columns can be grouped or what the
size of a grouping Column may be (except that it can’t be a large object
string), but most DBMSs allow fairly small numbers and sizes. When people
create Tables, they often allot hundreds of characters for VARCHAR
Columns (like “address” or “comment”), thinking that there is plenty of room
in the row. Later they find that their DBMS can’t use those Columns In a
GROUP BY
clause, due to their size. Moral: Don’t make Columns bigger than
they have to be.
The Single-Value Rule¶
The rationale for this rule is as follows. Suppose you have a list of cities and countries. If this SQL statement were legal:
SELECT city, country FROM Cities_And_Countries
GROUP BY country;
what value would come out in the CITY
Column? There are plausible answers
such as “any city will do provided it’s in the country”, or “the DBMS should
assume that we want to group by both country and city”. The problem with those
answers is that they try to compensate for a formal user error. What the user
really needs to know is “that does not compute”.
The rule does not mean that all values must be distinct. We could multiply everything times zero, yielding zeros in every Column in the select list, and we would still be specifying “single-valued per group”. The true meaning is that there must be, for each group, one (and only one) value which is appropriate as an answer for the query. Sometimes the DBMS doesn’t realize this, as in a SQL statement like:
SELECT capitalcity, country FROM Cities_And_Countries
GROUP BY country;
but if that’s the case, we can easily tell the DBMS this is so by adding a grouping Column to the statement:
SELECT capitalcity, country FROM Cities_And_Countries
GROUP BY country, capitalcity;
The single-value rule is sensible and it is Standard SQL. Don’t be misled by a “textbook” describing the one DBMS which does not follow the Standard.
Grouping by Expressions¶
We said earlier that, while you can use “GROUP BY
Column list” in your
SELECT
statements, you can’t use “GROUP BY
expression list”. This means
that these two SQL statements are both illegal:
SELECT EXTRACT(MONTH FROM bdate) FROM Table_1
GROUP BY EXTRACT(MONTH FROM bdate);
SELECT EXTRACT(MONTH FROM bdate) AS ebdate FROM Table_1
GROUP BY ebdate;
Looks like we can’t group by MONTH
: we have to group by the whole date.
Bummer. Luckily, there is a way out if you have SQL-92 and some patience:
SELECT ebdate
FROM (SELECT EXTRACT(MONTH FROM bdate) AS ebdate FROM Table_1)
GROUP BY ebdate;
This example uses a Table subquery in the FROM
clause, and puts all the
necessary calculations inside that Table subquery. The outer SELECT
does a
grouping of the result – which is now legal, since the EBDATE
Column is
clearly identified as a Column belonging to the Table named in the FROM
clause. If your DBMS doesn’t support Table subqueries in the FROM
clause,
try making a View, or a temporary Table, with the same logic. Whatever you do
will be slow, because two Tables are being produced, one for the temporary
Table and one for the grouping.
New Syntax¶
Until now, all of our examples have shown GROUP BY
followed by one or more
<Column reference>s. This was the only available option until SQL3, which adds
this syntax:
GROUP BY ROLLUP(grouping Columns)
GROUP BY CUBE(grouping Columns)
GROUP BY ()
GROUP BY GROUPING SETS grouping Column
GROUP BY GROUPING SETS (grouping Columns)
GROUP BY GROUPING SETS ROLLUP(grouping Columns)
GROUP BY GROUPING SETS CUBE(grouping Columns)
GROUP BY GROUPING SETS()
GROUP BY (grouping Columns),<grouping set list>
GROUP BY ROLLUP(grouping Columns),<grouping set list>
GROUP BY CUBE(grouping Columns),<grouping set list>
GROUP BY (),<grouping set list>
GROUP BY GROUPING SETS
effectively allows groups within groups – in one
pass, it generates multiple aggregated groups that would otherwise require a
set operator to put the different result sets together. For example:
SELECT A.id,B.name,COUNT(*)
FROM Table_1 AS A, Table_2 as B
WHERE A.number = B.number
GROUP BY GROUPING SETS (A.id,(A.id,B.name));
In this example, the GROUP BY
clause determines the first requirement –
groups of IDs – by grouping the A.ID
values from the TABLE_1
Table. It
then determines the second requirement – number of IDs by ID
and NAME
– by grouping the A.ID
values from TABLE_1
with the B.NAME
values
from TABLE_2
.
The grouping forms like “GROUP BY
(grouping Columns), <grouping set list>”,
also known as concatenated grouping, puts groups together.
A <grouping specification> of () (called grand total in the Standard) is equivalent to grouping the entire result Table; i.e.: to the result returned by:
SELECT select list FROM Tables
WHERE conditions
HAVING conditions
A <grouping specification> of ROLLUP
means get one group out of all
grouping Columns, by rolling each group into the next until only one
remains. It is equivalent to the result returned by:
SELECT select list FROM Tables
WHERE conditions
GROUP BY GROUPING SETS (
(ROLLUP col_1,ROLLUP col_2,...,ROLLUP col_n),
(ROLLUP col_1,ROLLUP col_2,...,ROLLUP col_n-1),
(ROLLUP col_1,ROLLUP col_2,...,ROLLUP col_n-2),
...
(ROLLUP col_1),
() )
HAVING conditions
A <grouping specification> of CUBE
means get one group out of all grouping
Columns, by grouping all possible combinations of the grouping Columns. It is
equivalent to the result returned by:
SELECT select list FROM Tables
WHERE conditions
GROUP BY GROUPING SETS (
(CUBE col_1,CUBE col_2,...,CUBE col_n-2,CUBE col_n-1,CUBE col_n),
(CUBE col_1,CUBE col_2,...,CUBE col_n-2,CUBE col_n-1),
(CUBE col_1,CUBE col_2,...,CUBE col_n-2,CUBE col_n),
(CUBE col_1,CUBE col_2,...,CUBE col_n-2),
...
(CUBE col_1),
(CUBE col_2,...,CUBE col_n-2,CUBE col_n-1,CUBE col_n),
(CUBE col_2,...,CUBE col_n-2,CUBE col_n-1),
(CUBE col_2,...,CUBE col_n-2,CUBE col_n),
(CUBE col_2,...,CUBE col_n-2),
...
(CUBE col_2),
...
(CUBE col_3),
...
(CUBE col_n),
() )
HAVING conditions
As an example of these new options, assume there is a Table, TABLE_1
, that
looks like this:
|
||
---|---|---|
|
|
|
1 |
A |
.55 |
1 |
A |
.55 |
1 |
B |
1.00 |
1 |
B |
1.35 |
2 |
A |
6.00 |
2 |
A |
1.77 |
Let’s do an ordinary GROUP BY
, with only <Column reference>s as grouping
Columns, on TABLE_1
:
SELECT column_1,
column_2,
SUM(column_3) AS "SUM"
FROM Table_1
GROUP BY column_1,column_2;
The result is:
|
|
|
1 |
A |
1.10 |
1 |
B |
2.35 |
2 |
A |
7.77 |
Now let’s do a GROUP BY
with ROLLUP
:
SELECT column_1,
column_2,
SUM(column_3) AS "SUM"
FROM Table_1
GROUP BY ROLLUP(column_1,column_2);
This time, the result is:
|
|
|
1 |
A |
1.10 |
1 |
B |
2.35 |
1 |
|
3.45 |
2 |
A |
7.77 |
2 |
|
7.77 |
|
|
11.22 |
In addition to the same groups returned by the ordinary GROUP BY
, GROUP
BY ROLLUP
gets a group of each group: the group of COLUMN_1
“1” values
(with a NULL
for the grouping of “A” and “B” in COLUMN_2
), the group of
COLUMN_1
“2” values (with a NULL
for the grouping of “A” in
COLUMN_2
) and the group of COLUMN_1
“1” and “2” values (with a NULL
for the grouping of “A” and “B” in COLUMN_2
).
Finally, let’s do a GROUP BY
with CUBE
:
SELECT column_1,
column_2,
SUM(column_3) AS "SUM"
FROM Table_1
GROUP BY CUBE(column_1,column_2);
This time, the result is:
|
|
|
1 |
A |
1.10 |
1 |
B |
2.35 |
1 |
|
3.45 |
2 |
A |
7.77 |
2 |
|
7.77 |
|
A |
8.87 |
|
B |
2.35 |
|
|
11.22 |
In addition to the same groups returned by the GROUP BY ROLLUP
, GROUP BY
CUBE
gets a group of each combination of groups: the group of COLUMN_1
“1” values, the group of COLUMN_1
“2” values, the group of COLUMN_2
“A”
values (with a NULL
for the grouping of “1” and “2” in COLUMN_1
) and
the group of COLUMN_2
“B” values (with a NULL
for the grouping of “1”
in COLUMN_1
).
If you want to restrict your code to Core SQL, don’t use ROLLUP
or
CUBE
, and don’t add a COLLATE
clause to any grouping Column reference.
Set Functions¶
The SQL set functions – more commonly called aggregate functions – are
AVG
, COUNT
, MAX
, MIN
, SUM
, EVERY
, ANY
, SOME
.
Each one takes the collection of values from a Column and reduces the
collection to a single value. The required syntax for a set function
specification is:
<set function specification> ::=
<general set function> |
COUNT(*) |
<grouping operation>
<general set function> ::=
<set function type> ([ {DISTINCT | ALL} ] Column expression)
<set function type> ::=
COUNT | MAX | MIN | SUM | AVG | EVERY | ANY | SOME
<grouping operation> ::=
GROUPING (<Column reference>)
The Column expression that follows a <general set function> can be a <Column
reference>, a <literal>, a scalar function or an arithmetic expression – in
short, it can be any expression (other than a query or subquery or another set
function) which evaluates to a Column, as long as that Column doesn’t have a
<data type> of BLOB
, CLOB
or NCLOB
. Each <general set function>
operates on the entire collection of non-null values in its Column argument –
grouping rules apply because grouping is implied – and can optionally be
qualified with either DISTINCT
or ALL
. If you specify DISTINCT
,
your DBMS will eliminate any duplicate values from the Column before applying
the set function. If you specify ALL
, your DBMS will apply the set function
to every non-null value in the Column. The default is ALL
. When a set
function eliminates NULL
s, your DBMS will return the SQLSTATE warning
01003 "warning-null value eliminated in set function."
Here’s an example Table; we’ll use it for the explanations of the set functions that follow:
|
---|
|
10 |
20 |
10 |
20 |
30 |
|
The COUNT
function has two forms: it can have an asterisk as an argument
or a Column expression as an argument. It returns an integer.
COUNT(*)
returns the number of rows in the argument Table, rather than the number of values in any particular Column, so this SQL statement returns6
:SELECT COUNT(*) FROM Table_1
COUNT(Column)
andCOUNT(ALL Column)
are equivalent: both return the number of non-null values in “Column”, so these SQL statements both return5
:SELECT COUNT(column_1) FROM Table_1; SELECT COUNT(ALL column_1) FROM Table_1;
COUNT(DISTINCT Column)
returns the number of unique, non-null values in “Column”, so this SQL statement returns3
:SELECT COUNT(DISTINCT column_1) FROM Table_1;
The MAX
function returns the maximum value and can’t be used with Columns
that have a <data type> of ROW
, REF
, BLOB
, CLOB
or NCLOB
,
or with a UDT. The <data type> and size of the result will be the same as the
<data type> and size of the expression itself.
MAX(Column)
,MAX(ALL Column)
andMAX(DISTINCT Column)
are equivalent: all three return the largest of the non-null values in “Column”, so these SQL statements all return30
:SELECT MAX(column_1) FROM Table_1; SELECT MAX(ALL column_1) FROM Table_1; SELECT MAX(DISTINCT column_1) FROM Table_1;
It’s rare to find explicit ALL
or DISTINCT
with MAX
.
The MIN
function returns the minimum value and can’t be used with Columns
that have a <data type> of ROW
, REF
, BLOB
, CLOB
or NCLOB
,
or with a UDT. The <data type> and size of the result will be the same as the
<data type> and size of the expression itself.
MIN(Column)
,MIN(ALL Column)
andMIN(DISTINCT Column)
are equivalent: all three return the smallest of the non-null values in “Column”, so these SQL statements all return10
:SELECT MIN(column_1) FROM Table_1; SELECT MIN(ALL column_1) FROM Table_1; SELECT MIN(DISTINCT column_1) FROM Table_1;
As with MAX
, it’s rare to find explicit ALL
or DISTINCT
with
MIN
.
The SUM
function returns a total and can’t be used with Columns that have a
<data type> of ROW
, REF
, BLOB
, CLOB
or NCLOB
, or with a
UDT. The <data type> of the expression must be numeric or INTERVAL
; in the
first case, the result <data type> will also be numeric, with the same scale as
the expression, but (possibly) a larger precision. That’s necessary – consider
what would happen if you had a DECIMAL(2)
Column containing two values:
51
, 51
. The result of SUM
would be 102
, which is
DECIMAL(3)
, so the DBMS has to be able to increase the precision for the
result. If you’re not sure that your DBMS will give SUM
a great enough
precision, increase the precision of the expression using a CAST
function.
In the second case, the result <data type> will be INTERVAL
with the same
precision as the expression.
SUM(Column)
andSUM(ALL Column)
are equivalent: both return the total of the non-null values in “Column”, so these SQL statements both return90
:SELECT SUM(column_1) FROM Table_1; SELECT SUM(ALL column_1) FROM Table_1;
SUM(DISTINCT Column)
returns the total of the unique, non-null values in “Column”, so this SQL statement returns60
:SELECT SUM(DISTINCT column_1) FROM Table_1;
The AVG
function returns an average and can’t be used with Columns that
have a <data type> of ROW
, REF
, BLOB
, CLOB
or NCLOB
, or
with a UDT. The <data type> of the expression must be numeric or INTERVAL
;
in the first case, the result <data type> will also be numeric, with scale and
precision equal to or greater than the expression’s scale and precision (most
DBMSs increase the scale but leave the precision alone). In the second case,
the result <data type> will be INTERVAL
with the same precision as the
expression.
AVG(Column)
andAVG(ALL Column)
are equivalent: both return the average of the non-null values in “Column”, so these SQL statements both return18
:SELECT AVG(column_1) FROM Table_1; SELECT AVG(ALL column_1) FROM Table_1;
AVG(DISTINCT Column)
returns the average of the unique, non-null values in “Column”, so this SQL statement returns20
:SELECT AVG(DISTINCT column_1) FROM Table_1;
The EVERY
function, new to SQL with SQL3, returns a truth value and can
only be used with Columns that have a <data type> of BOOLEAN
.
EVERY(Column)
,EVERY(ALL Column)
andEVERY(DISTINCT Column)
are equivalent: all three returnFALSE
if any of the values in “Column” areFALSE
and all three returnTRUE
if no value in “Column” isFALSE
. For these two Tables:TABLE_1
TABLE_2
COLUMN_1
COLUMN_1
TRUE
TRUE
TRUE
TRUE
FALSE
UNKNOWN
FALSE
UNKNOWN
these SQL statements all return
FALSE
:SELECT EVERY(column_1) FROM Table_1; SELECT EVERY(ALL column_1) FROM Table_1; SELECT EVERY(DISTINCT column_1) FROM Table_1;
And these SQL statements all return
TRUE
:SELECT EVERY(column_1) FROM Table_2; SELECT EVERY(ALL column_1) FROM Table_2; SELECT EVERY(DISTINCT column_1) FROM Table_2;
The ANY
function (and its, synonym, SOME
), new to SQL with SQL3,
returns a truth value and can only be used with Columns that have a <data type>
of BOOLEAN
.
ANY(Column)
,ANY(ALL Column)
andANY(DISTINCT Column)
are equivalent: all three returnTRUE
if any of the values in “Column” areTRUE
and all three returnFALSE
if no value in “Column” isTRUE
. For these two Tables:TABLE_1
TABLE_2`
COLUMN_1
COLUMN_1
TRUE
FALSE
TRUE
FALSE
FALSE
FALSE
UNKNOWN
UNKNOWN
these SQL statements all return
TRUE
:SELECT ANY(column_1) FROM Table_1; SELECT ANY(ALL column_1) FROM Table_1; SELECT ANY(DISTINCT column_1) FROM Table_1;
And these SQL statements all return FALSE:
SELECT ANY(column_1) FROM Table_2; SELECT ANY(ALL column_1) FROM Table_2; SELECT ANY(DISTINCT column_1) FROM Table_2;
The GROUPING
function, new to SQL with SQL3, operates on an argument that
must be a <Column reference>, where the Column referred to is a grouping Column
for the query. You use it in conjunction with the various grouping sets options
of the GROUP BY
clause. It returns either an integer or the value of some
Column – the Standard is unclear on this point.
DISTINCT Set Functions¶
Since DISTINCT
<set function> is usually a slow process, it’s worthwhile to
look for ways to avoid it. With MIN
, MAX
, EVERY
, ANY
and
SOME
, DISTINCT
means nothing, so there’s no problem omitting it there.
If the Column happens to be a primary key or unique key, then DISTINCT
will
have no effect because the values are all distinct anyway, so again, no problem
omitting it there. (Although a unique key Column might have a million
NULL
s in it, that won’t matter because the set functions ignore NULL
s
anyway). The only things you must be sure of, if you remove the <keyword>
DISTINCT
because the Column is primary or unique, are (a) that the
PRIMARY KEY/UNIQUE
Constraint is NOT DEFERRABLE
and (b) that there is
no chance that the database definition will ever change.
Tip
If you use ODBC then you should check whether your DBMS supports the
SQLRowCount
function for results sets; if it does, then you won’t have to
use COUNT(*)
to find out how many rows answer a query. This is
particularly a time-saver if the query contains DISTINCT
, since
COUNT(DISTINCT ...)
is a particularly slow function.
Set Functions and the “Ignore NULLs” Policy¶
All the set functions ignore NULL
s (the COUNT(*)
function looks like
an exception but if you think of it as a shorthand for COUNT(1)
you will
realize that no NULL
s are ever involved there). In other words, when we
calculate SUM(x)
we are not calculating the sum of all values of x
, but
the sum of all known values of x
. This policy is the result of practical
experience: some early SQL DBMSs didn’t ignore NULL
s, but now they all do.
Now, this is a little inconsistent, because in most arithmetic expressions that
involve NULL
, the result is NULL
(e.g.: “5 + NULL
yields
NULL
”). This leads to a TRAP: SUM(a)+SUM(b)
is not the same as
SUM(a+b)
! Consider these two rows:
|
|
|
row#1 |
5 |
|
row#2 |
5 |
5 |
Since the SUM
of {5,5
} is 10
, and the SUM
of {NULL,5
} is
5
, the result of SUM(a)+SUM(b)
is 15
. However, since (5+NULL)
is NULL
, and (5+5)
is 10
, and the SUM
of {NULL,10
} is
10
, then SUM(a+b)
is 10
! So which answer is correct: 15
or
10
? The cautious person would perhaps reply: both answers are wrong – the
result should be NULL
. However, we have already posited that in the context
of set functions we want to ignore NULL
s. In this context, the best answer
is the one that ignores the most NULL
s – to wit: 15
. Therefore the
correct expression to use is SUM(a)+SUM(b)
, not SUM(a+b)
. A clinching
argument is that SUM(a)+SUM(b)
involves fewer “add” operations than
SUM(a+b)
, and is therefore less subject to the cumulative effects of
rounding. Similar considerations apply for all set functions with expressions
that contain +
or -
or ||
operators.
Because of the “ignore NULL
s” policy, it should be rare for a set function
to return NULL
. The only cases are: for MIN
or MAX
or SUM
or
AVG
, if every one of the Column values is NULL
, or if the SELECT
returns no rows at all, the function returns NULL
(so in fact these
functions could return NULL
even if the Column is defined as NOT NULL
,
yielding another case where you shouldn’t be fooled by a NOT NULL
Constraint definition.) The COUNT
function can never return NULL
– if
there are no rows returned, then the count is, naturally enough, zero.
Set Functions in Subqueries¶
Who makes more than the average salary? What is the cheapest book? That’s the kind of question that requires a comparison with a subquery that contains a set function. In other words, you need a condition with this general format:
... WHERE <value> <comparison-operator> (SELECT <set function> ...)
For example:
SELECT ... FROM ... WHERE ... = (SELECT MIN(price) FROM book);
This is one of the cases where it’s much easier to use a subquery than a join.
If you try to do something more complex, you will probably run into some
restrictions because set functions within subqueries are hard to implement.
We’ll give just one example of a restriction (from SQL-92) – if the set
function’s argument is an “outer reference” Column (i.e.: the name of a Column
in the outer enclosing query), then that must be the only <Column reference>
within that argument and the set function has to appear either in a select
list or within a subquery that belongs to a HAVING
clause. For example,
this complex query uses illegal syntax:
SELECT * FROM Table_1 WHERE 0 = (
SELECT SUM(Table_1.column_1+Table_2.column_1) FROM Table_2);
It would be even more illegal if the set function appeared in the WHERE
clause, or if there was a third level of subquery nesting. Nobody can remember
such a complex rule, and there are more like it, so the cautious programmer
simply avoids trying anything fancy when using set functions within
subqueries.
Retrieval with a Set Function¶
Here’s some examples of set functions, using the sample database we defined in our chapter on “Simple Search Conditions”. To find the total number of employees with payroll records (retrieve the number of records in a Table):
SELECT COUNT(*) AS pay_count
FROM Payroll;
The result is:
|
10 |
COUNT(*)
counts all rows of a Table, regardless of NULL
s.
To find the number of employees with known pay rates (retrieve the number of non-null values in a Column):
SELECT COUNT(rate) AS pay_count
FROM Payroll;
The result is:
|
9 |
COUNT(column)
eliminates NULL
s before counting a Column’s values.
To find the number of pay rates (retrieve the number of unique values in a Column):
SELECT COUNT(DISTINCT rate) AS pay_count
FROM Payroll;
The result is:
|
5 |
The DISTINCT
option eliminates duplicates before a set function is
processed.
To find the sum of the pay rates by location (group a Table into like values combined with a set function):
SELECT location,
SUM(rate) AS sum_rate
FROM Payroll
GROUP BY location;
The result is:
|
|
10TH FLOOR |
31.00 |
16TH FLOOR |
37.00 |
BASEMENT |
8.00 |
WAREHOUSE |
14.00 |
To find the number of employees in each department:
SELECT COUNT(empnum) AS emp_count,
dept
FROM Employee
WHERE dept<'D'
GROUP BY dept;
The result is:
|
|
2 |
A |
4 |
B |
1 |
C |
If you want to restrict your code to Core SQL, don’t use the set functions
EVERY
, ANY
, SOME
or GROUPING
, don’t use a set function unless it operates only
on a <Column reference> that refers to a Column belonging to a Table named in
the FROM
clause, and when counting, always use COUNT(*)
: don’t use
COUNT(Column)
or COUNT(ALL Column)
at all.
HAVING Clause¶
What departments have four employees? In which branches is the smallest book’s
size less than 40mm? Those are two questions that require comparison of an
absolute value with a set function, and thus represent the commonest situations
where we’d find it useful to bring in a HAVING
clause. The HAVING
clause is an optional portion of the SELECT
statement. It, too, defines a
grouped Table. The required syntax for the HAVING
clause is:
HAVING <search condition>
The HAVING
clause defines a grouped Table that contains only those groups
for which its search condition is TRUE
and usually follows a GROUP BY
clause. HAVING
operates on the interim result produced by the evaluation of
the SELECT
statement at that stage:
If the statement is “
SELECT ... FROM ... HAVING ...
”,HAVING
operates on the entireFROM
Table, treating it as a single group. (GROUP BY
is implied, but the result has no grouping Columns.)If the statement is “
SELECT ... FROM ... WHERE ... HAVING ...
”,HAVING
operates on the rows of theFROM
Table that areTRUE
for theWHERE
conditions, again, treating the interim result as a single group.If the statement is “
SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ...
”,HAVING
operates on the interim groups returned byGROUP BY
.
The HAVING
clause’s search condition may include AND
s, OR
s,
relational operators, scalar and arithmetic expressions, and so on – much like
a WHERE
clause’s search condition. The difference between a WHERE
condition and a HAVING
condition is implied by the clause order. The DBMS
evaluates WHERE
before it does the grouping; it evaluates HAVING
after it does the grouping – thus, WHERE
filters rows and HAVING
filters groups. It’s usually more efficient to filter before grouping (because
then there will be fewer rows to group), so it’s best to put most simple
conditions in the WHERE
clause and use HAVING
only for these
situations:
If one of the condition’s operands is a set function. Since a set function is not evaluated until you group, it is impossible to use one in a
WHERE
clause.If there is a quirk in a particular DBMS’s optimizer. For one DBMS,
GROUP BY a HAVING a = 7
works well because (since the grouping Column and the condition Column are the same) the two clauses can be evaluated simultaneously. For most DBMSs, putting a condition in theHAVING
clause is a hint that you want to avoid using any indexes.
Operands in the HAVING
clause are subject to the same restrictions as in
the select list:
Column expressions in both must be single-valued per group.
Column references must be unambiguous.
If a
SELECT
statement containsHAVING
without a precedingGROUP BY
clause, the select list can’t include any references to Columns belonging to a Table named in theFROM
clause unless those references are used with a set function.If
HAVING
includes a subquery, it can’t include outer Column references unless those references are to grouping Columns or are used with a set function.
Let’s look closely at an SQL statement which contains a HAVING
clause:
SELECT column_1,
COUNT(column_2)
FROM Table_1
GROUP BY column_1
HAVING COUNT(column_1) >= 5;
In this SQL statement, the expression COUNT(column_2)
is okay because,
although COLUMN_2
is not a grouping Column, it appears within a set
function. Suppose that TABLE_1
looks like this:
|
|
---|---|
|
|
1 |
0 |
1 |
1 |
1 |
2 |
2 |
3 |
2 |
4 |
2 |
5 |
2 |
6 |
2 |
7 |
2 |
8 |
2 |
9 |
TABLE_1
has 3 rows where COLUMN_1
is 1 and 7 rows where COLUMN_1
is
2. The COLUMN_1 = 1
group does not meet the HAVING
clause’s condition
– the smallest acceptable count is 5. However, the COLUMN_1 = 2
group does
meet HAVING
’s condition, and so the result of our SQL statement is:
|
|
2 |
7 |
Here is another way to get the same result, without using a HAVING
clause:
SELECT column_1,
c_count
FROM (SELECT column_1,
COUNT(column_2) AS c_count
FROM Table_1
GROUP BY column_1)
WHERE c_count >=5;
It is always possible to eliminate a HAVING
clause and use a query
expression in the FROM
clause instead. But most programmers prefer to stick
with HAVING
because it is the traditional and familiar tool for solving
this sort of problem.
HAVING without GROUP BY¶
This SQL statement means “if there are more than 3 a
's in the whole Table,
display how many a
's there are”:
SELECT COUNT(a) FROM Somethings
HAVING COUNT(a) > 5;
As is usual, because there is a set function in the SELECT
statement, there
is an implied GROUP BY ()
. Therefore grouping rules apply: the select list
in such an SQL statement may contain only single-valued Columns.
Retrieval Using Grouping¶
Here’s one more example of grouping, using the sample database we defined in our chapter on “Simple Search Conditions”. To find the departments with less than two employees (group a Table, then eliminate all groups which do not fulfill a condition):
SELECT dept FROM Employee
GROUP BY dept HAVING COUNT(*)<2;
The result is:
|
C |
D |
Views of Groups¶
It’s a straightforward exercise to make a View which is based on a grouping operation – the result is a grouped View. Here’s an example:
CREATE VIEW Employee_Groups AS
SELECT department_id, COUNT(employee_id) AS count_employee_id
FROM Employees
GROUP BY department_id;
Now, whoever uses the EMPLOYEE_GROUPS
View will see an ordinary Table with
two Columns: DEPARTMENT_ID
and COUNT_EMPLOYEE_ID
. But the user’s view
window won’t be totally transparent.
Problem 1
Consider what happens for this SQL statement:
SELECT * FROM Employee_Groups
WHERE count_employee_id = 7 AND department_id = 'XX';
Remember (from our chapter on “Tables”) that your DBMS may try to “transform” this query into a query on the underlying Base table. The best transform is:
SELECT department_id, COUNT(employee_id) FROM Employees
WHERE department_id = 'XX'
GROUP BY department_id HAVING COUNT(employee_id) = 7;
Observe that one, and only one, of the conditions of the original WHERE
clause has been split out and put in a new HAVING
clause. Based on timings,
we believe that some DBMSs don’t do this. We believe they put both conditions
in the HAVING clause. That’s much less efficient.
Problem 2
Consider what happens for this SQL statement:
SELECT MAX(count_employee_id) FROM Employee_Groups;
Your DBMS might attempt to transform this into:
SELECT department_id, MAX(COUNT(employee_id)) FROM Employees
GROUP BY department_id;
that is, into a sensible question along the lines of “which department has most
employees”. But MAX(COUNT(employee_id))
is an illegal expression (you can’t
nest set functions) and the result will be some error message defined by your
DBMS. This is legal: the SQL Standard doesn’t demand that a DBMS should handle
groups within groups, or set functions within set functions.
Tip
That’s why we named this View and its Column EMPLOYEE_GROUPS
and
COUNT_EMPLOYEE_ID
– the names tip off the user that a COUNT
in a
grouped View is involved. That lessens the confusion if the DBMS doesn’t
generate a clear error message.
Dialects¶
The older (pre-1992) DBMSs had these restrictions on grouping:
Any set functions in the
HAVING
clause had to appear in the select list.HAVING
was illegal if there was noGROUP BY
clause.Groups within Views and subqueries worked only under a full moon (or some undefined equivalent!).
Nowadays, SQL DBMSs support grouping, set functions and HAVING
clauses, in
a manner pretty much as we have described here, except for the new SQL3
features.
It’s also fairly easy to find DBMSs which will accept non-standard syntax.
Sybase ignores the Single-Value Rule so you can use any Column you like in the
select list. Oracle includes basic statistical-analysis set functions: STDDEV
(standard deviation) and VARIANCE. Several DBMSs allow “GROUP BY
expression” or even “GROUP BY
ordinal position, as if the GROUP BY
clause is analogous to the ORDER BY
clause.