Chapter 6 – Characters¶
Note
You are reading a digital copy of SQL-99 Complete, Really, a book that documents the SQL-99 standard.
For more information specific to CrateDB, check out the CrateDB Reference.
“Now we know our Alphabet all the way from A to Z[ed]”
(Grade 1 Tune)
“Now we know our ABC all the way from A to Z[ee]”
(same, USA variant)
If you remember one of the above jingles, you’ve probably got a few things to unlearn before you can use SQL character strings properly. Many people think that the alphabet is something from grade 1: that they know it already. If you fall into this category, be patient with us. By the end of this chapter, you’ll understand why we feel it’s necessary to devote so much space to what appears to be a fairly simple topic.
Before we can begin to address the question “what does SQL do with characters?”, we have first have to ask: “what should it do?”. That’s why this book has two chapters on characters and character strings. The first one, the one you’re reading, explains what characters are, what the important character sets are and how to relate or switch … the universal things. The next chapter gets into what the database solutions are. After you’ve read this chapter, you’ll see why the “solutions” are varied, random and wrong.
Table of Contents
What is a Character?¶
Consider two vertical bars with a horizontal bar between them: H. This H is a
glyph. To us, this glyph is a symbol – it’s the letter “Aitch” and it appears
in English words. Thus, a “letter” is a combination of a glyph and a meaning.
(The importance of this will become apparent as you go through this chapter.)
We know that H is the eighth letter of an alphabet. Or – in SQL terms – H is
the eighth symbol in a character repertoire. While an alphabet is a familiar
example of a character repertoire, any computer will also support a repertoire
containing digits, punctuators and special characters as well as glyphless
(non-printing) characters such as the \0
which terminates strings in C.
However, no computer can handle the letter H by itself. It can only handle numbers, so we have a convention that the letter H will be represented by some number, such as 72. If we all agree that “72 is the code for H … 74 is the code for J …” and so on for every character in the repertoire, we can make a computer character set. Every character set has two parts: a character repertoire and an agreement on how the repertoire’s characters will be encoded as numbers. (The SQL Standard calls the encoding scheme itself a Form-of-use.)
Internationally, there are many character sets, and many names for them. When using SQL, you’ll be concerned mainly with the ones that the International Standards Organization (ISO) has published standards for.
ISO 646 (1965) also known as US-ASCII, DIN 66003, etc.
ISO 8859-1 (1985) also known as LATIN 1 and Windows Code Page 1252.
ISO 8859-2 (1987) also known as LATIN 2 and Windows Code Page 1250.
The numeric codes between 0 and 127 have the same meaning in every ISO standard; differences exist only for codes greater than 127. In each standard, the codes between 0 and 127 include punctuation, digits, special characters and (what’s most important for this discussion) the upper case letters A to Z (for example, the code for H is indeed 72) and the lower case letters a to z. Thus, all character sets start with Latin.
Latin¶
Cicero never saw a W or a lower-case j, but medieval and modern Latin have 26 letters, each of which has two forms: uppercase and lowercase. These are the simple Latin letters:
ABCDEFGHIJKLMNOPQRSTUVWXYZ -- simple upper-case Latin letters
abcdefghijklmnopqrstuvwxyz -- simple lower-case Latin letters
These letters appear in all standard computer character sets including 7-bit ASCII and EBCDIC. They even appear in character sets for languages which don’t use Latin. For instance, character sets used in Russia (such as KOI8R or ISO 8859-5) contain all the Latin letters with the same codes as 7-bit ASCII has; the Russian Cyrillic letters follow the Latin letters.
Because they are universal, the Latin letters are vital for computer languages. In SQL, all <keyword>s and <regular identifier>s (or names) are made with Latin letters (though they may also contain digits and the underline character).
The letters of the Latin alphabet are in alphabetic order. That might sound … well, “Duh” … but it helps to point out the most fundamental point about a collation, or collating sequence – namely, that a collation represents a fixed order, and that order is known to most members of the general public. They expect to see words and names sorted that way (though they don’t expect A to sort before a – that’s a computer thing).
English¶
English uses the Latin alphabet but adds rules when sorting. To set the scene for the complexities that follow, here are the rules for English sorting. (We´ll naively pretend that when we write á l´anglaise, we don´t need even a soupçon of those pesky “accented letters” fot names like El Niño – we´ll look at English only. All examples come from The Concise Oxford Dictionary Of Current English.)
Preprocess (Map to Simple Latin Upper Case Letters)¶
Preprocessing involves (a) stripping accents off all accented characters, e.g., Ò becomes O, (b) eliminating all special characters such as hyphen or space or apostrophe or dot, and (c) changing ligatures such as Æ to double-characters such as AE.
Tiebreakers¶
Tiebreakers apply only if two words are otherwise equal.
If a special character (such as a hyphen or apostrophe) was stripped in the preprocess stage, the word containing it follows. For example: “cant” before “can’t” before “Cant” and “recollect” before “re-collect” and “nones” before “non est” and “francophone” before “franc tireur”.
If a word contained an accented character, it follows. For example: “lame” before “lamé”.
If a word contains a capital letter, it doesn’t matter. For example: “Negus” before “negus”, “AR” before “Ar” – but, “dan” before “Dan”, “ma” before “MA”. (We acknowledge that some people believe it does matter: we report only what actually happens.)
About 50% of English-world phone books have additional rules:
Mc and Mac appear before M (sometimes M’ does too, but only in Ireland). For example: “McDonald” before “Macdonald” before “Maastricht”.
Ste and St appear as if they were spelled “Saint”. For example: “St John” before “Saan Stores Limited”.
In these cases, we effectively combine a group of letters into one, and place the result in the hidden interstices between O and M, or between R and S. It looks odd, but many languages have two-letter combinations (called digraphs) which fit between primary letters of the alphabet.
From all this, we see that English sorting involves, in addition to the basic A-to-Z stuff, several special collation rules, including mapping, ligatures (“one character maps to two”), tie breaking and digraphs (“two characters map to one”). Nobody makes a fuss about English because the special cases are rare. But special cases in non-English languages are not rare at all – they are the norm, so it’s the non-English collation rules that get the attention.
French¶
French words have:
a or e with Grave accent |
ès à |
a or e with Acute accent |
Chrétien résumé |
a or i or o with Circumflex |
hôte côte |
c with Cedilla |
soupçon ça |
i or e or o with Diaeresis |
naïf coöpërant |
In simple cases, the French sorting rules are like the English rules.
Accented characters map to their Latin upper case equivalents (“résumé‚” comes after “rester” but before “retard”).
Words with accented characters come after words with none (“à” comes after “a” – but that’s only a tie-breaker, “à” comes before “abbé`”).
Then things get complicated.
For words with multiple accented characters, the word with the last accented character follows. For example: “pêche” before “péché”.
France and French Canada don’t capitalize accented letters the same way. In France the upper case of the word “résumé” is “RESUME” – there are no capital letters with accents, which is why the old PC-DOS Character set (IBM extended ASCII) could dispense with the letters É, Ó, etc. But French Canadians capitalize the word “résumé” and keep the accents: “RÉSUMÉ”.
This shows us that we have more characters, another collation rule and conflicting ideas for changing lower case to upper case.
German¶
German words have:
a or o or u with umlaut |
Mädchen östlicher Führer |
sharp S or Eszet |
bißchen |
Sharp S’s similarity to a Greek Beta is just coincidence: it is really a ligature which maps to “ss”. This is the only character in any Latin alphabet which is always lower case.
There are three ways to sort characters with umlauts, depending on the German-speaking country and the application.
In the German DIN-1 official standard, Ä = A and Ö = O and Ü = U. That is, accented characters map to unaccented equivalents. This is the standard used for dictionaries, book indexes or any lists of words. Thus, Führer = Fuhrer
In German DIN-2 official standard, Ä = AE and Ö = OE and Ü = UE. That is, accented characters map to unaccented character plus E. This is the standard used for phone books, voter lists, or any lists of names in Germany. Thus, Göbbels = Goebbels and Düsseldorf = Duesseldorf.
In the Austrian unofficial standard, Ä > AZ and Ö > OZ and Ü > UZ. That is, accented character are treated as a separate letter between A and B or O and P or U and V. This is the standard used for lists of names in Austria. Thus, Müller > Mzilikaze.
The Swiss use DIN-1 for all sorting.
The German tiebreaker rules are: unaccented before accented (this rule is universal), Eszet before umlaut, umlaut before any other accent. The following table shows three lists sorted according to DIN-1 (Zürich phone book), DIN-2 (Berlin phone book), and Austrian collation (Vienna phone book).
DIN-1 |
DIN-2 |
Austrian |
---|---|---|
Moeller |
Moeller |
Moeller |
Moffat |
Möller |
Moffat |
Möller |
Moffat |
Morse |
Morse |
Morse |
Möller |
Möse |
Möse |
Möse |
Motler |
Motler |
Motler |
At this point, you may be wondering what happens if you click “use German collating sequence” on some Windows dialog box. Well, Microsoft seems to prefer DIN 1, but the point is: if you just click, you’ll get a Microsoft choice which is wrong at least half the time. As we said at the beginning of this chapter: you have to know what the problem is before you know what the solutions are worth.
Spanish¶
Spanish words have:
tilde over n |
cañon |
acute accent over a u o i |
corazón |
Some circles (such as the Spanish-speaking community in the United States) use the English style rules to sort Spanish words: all accented characters map to their simple Latin equivalents, period. However, the following discussion on Spanish collation refers to the traditional rules which apply in most situations.
CH comes between C and D
LL comes between L and M
N comes before Ñ
Note
(From now on, we’ll use the formula “<x> follows <y>” to mean “<x> is treated as a separate letter which follows <y> in the alphabet”, as in: CH follows C.
The Spanish CH and LL digraphs are truly separate letters. They get separate headings in the dictionaries and the Spanish alphabet really has 28 letters: a b c ch d e f g h i j k l ll m n ñ o p q r s t u v x y z.
Here is an example of a sorted Spanish list:
calza |
calle |
cantor |
cañon |
culto |
che |
Dutch, Italian, Portuguese, Basque, Catalan¶
These languages present no special problems. Dutch does have one digraph, IJ, which is sometimes used for Y, but it is acceptable to use the letters IJ and sort with the digraph IJ. Acute accents are common in Portuguese, grave accents in Italian; however, they all map to simple Latin letters and there are no special collation rules.
Nordic¶
The Scandinavian alphabets have a common history and a few similar features, the most notable of which is their tendency to add letters at the end of the alphabet (after Z).
Central Europe¶
So far, we’ve discussed eleven alphabets and collations without having to change the standard computer character set. Every one of the symbols in the alphabets we’ve discussed appears in ISO 8859-1, usually called the LATIN 1 character set. They also appear in ISO 8859-1’s most famous derivative, Windows code page 1252, which is the default code page on computers in North America and in all European countries west of a line drawn through Finland, Germany, Austria and Italy. We are now going to cross that line, and briefly discuss ISO 8859-2, usually called the LATIN 2 character set (or, in Windows, code page 1250).
The Central European character set supports languages that use Latin-based characters with macrons, slashes, haceks, ogoneks, breves, bars and carons. Many of these do not map to a simple Latin letter when sorting.
The many extra letters created by adding these marks to the simple Latin
letters made it impossible to fit them all into the 8-bit LATIN 1 character
set, since there can only be a maximum of 256 different symbols in an 8-bit
set. (In fact, LATIN 1 actually provides fewer than 256 characters because
0x00
to 0x1f
and 0x7f
to 0x9f
are reserved for control
“characters”.) Therefore, the most convenient thing to do was group the
Latin-based East European character sets together into LATIN 2.
There has been lots of confusion in the past and several national character set encodings are still in common use, so it’s a bit idealist to write as if the ISO standards were already universal; however, there’s no other useful assumption that a small shop can make.
Baltic¶
The Baltic countries have alphabets containing yet more characters, culminating in the world’s longest Latin-based alphabet – Latvian. Again, the addition of the extra letters made it necessary to set up another 8-bit character set. This one is ISO 8859-4, usually called the BALTIC character set (or, in Windows, code page 1257). The Baltic collations are the only ones in which the simple Latin letters do not follow what we usually think of as alphabetic order.
Turkish¶
The Turkish character set is defined by ISO 8859-9, usually called the LATIN 5 character set (or, in Windows, code page 1254). There is an irritating capitalization problem in Turkish (irritating to us that is, we assume it doesn’t bother the Turks much): the capital of i is a dotted capital “I” and the capital of i is I. Because of this single character, it is impossible to capitalize simple-Latin-letter strings without being sure that the strings are not written in Turkish.
Cyrillic¶
The main Cyrillic character sets can be represented by a single Windows code page – 1251, a derivative of ISO 8859-5. In Ukrainian, the soft sign appears at the end of the alphabet, instead of the position (between YERU and E) used by other Cyrillic languages. This affects the order of a few words, so two collations are necessary for completeness: a Ukrainian collation and a generalized-Cyrillic (Bulgarian Russian Serbian) collation, which differs only with respect to the position of this one letter.
Note
The Cyrillic letters A B E I J K M H O P C S T Y X are only similar to Latin letters in one respect: their appearance. Remember that we defined a character as “a combination of a glyph and a meaning” – this is why. Similarity don’t mean equivalence. These Cyrillic letters have codes of their own, they never map to Latin letters.
Greek¶
Greek is the oldest European character set and contains a few archaic features (for example, breath marks) which don’t really indicate anything nowadays. On the other hand, we’d just as often want to use the Greek letters for classical texts as for modern Greek. So it’s good that they’re all found in ISO 8859-7, usually called the GREEK character set (or, in Windows, code page 1253).
The Rest of the World¶
We have now described all the alphabets of Europe and given you some idea of their collation quirks. We think that’s the minimum requirement for a book that mentions character sets. We also think that it’s the maximum! With some relief, we note that other alphabets and other writing systems (syllabaries, hieroglyphs etc.) are unavailable on a typically-configured computer in “the western world”, so we won’t talk about them in any detail. Three other systems deserve a mention, though.
Arabic¶
In Arabic, a character is displayable in several different forms – depending on whether it is alone, or is the first, middle or last character in a word. Thus, search and sort features are needed that can recognize multiple encodings of the same symbol.
Chinese¶
In Chinese, there are at least 13000 different signs, which are implemented with double byte character sets (usually “BIG5” in Taiwan, or “GB” in Red China); ideally the arrangement is that characters are sorted phonetically (Level 1) or according to radical and number of strokes (Level 2).
Japanese¶
In Japanese, several systems are in simultaneous use. For example, the word for “water” can be a single character (Kanji), two characters (Hiragana), two characters (Katakana) or four Latin letters (Romaji). The double character sets are usually encoded using escape characters. That saves space, but makes it impossible to determine a text’s length or find its nth letter without starting at the front and working through the text.
What’s in it for Me?¶
Now that you’ve read this far, you know everything you’ll need to write your own Serbian Sort or make party chat about the relative utility of 8-bit-code subsets. Of course, you’re a high-level programmer – you haven’t got any intention of doing either. Instead, you’re going to want to ask:
Is it safe to regard character sets and collations as black boxes, for which you only need to know what the inputs are without worrying about the process or the outputs?
Are my OS and DBMS vendors aware of the problems and do they handle all cases correctly?
Answers (we hope you got this, if not read the chapter again):
No.
No.
If you got both answers right, congratulations – you’ve understood the problem. In the next chapter, we’ll look at how an SQL implementation solves it, but first we’ll look at the Character sets that you can expect your DBMS to provide.
Predefined Character sets¶
In SQL, a Character set may be a Character set defined by a national or international standard, by your DBMS or by a user of SQL-data.
Standard-defined Character sets consist of a set of characters predefined by
some standards body and have a default Collation that is the order of the
characters in the relevant standard. The default Collation has the PAD
SPACE
characteristic. The SQL Standard requires a DBMS to support, at a
minimum, these standard-defined Character sets: SQL_CHARACTER
,
GRAPHIC_IRV
(also called ASCII_GRAPHIC
), LATIN1
, ISO8BIT
(also
called ASCII_FULL
) and UNICODE
(also called ISO10646
).
Implementation-defined Character sets consist of a set of characters predefined
by your DBMS and have a default Collation that is also defined by your DBMS.
The default Collation may have either the PAD SPACE
characteristic or the
NO PAD
characteristic. The SQL Standard requires a DBMS to support, at a
minimum, this implementation-defined Character set SQL_TEXT
.
SQL_CHARACTER¶
The SQL_CHARACTER
Character set is an 8-bit Character set that consists of
the 83 <SQL language character>s. These are found in codepage 1252.
SQL_CHARACTER
characters and their codepage 1252 Form-of-use codes
decimal code 032 |
<space> |
|
decimal code 034 |
“ |
<double quote> |
decimal code 037 |
% |
<percent> |
decimal code 038 |
& |
<ampersand> |
decimal code 039 |
‘ |
<single quote> |
decimal code 040 |
( |
<left parenthesis> |
decimal code 041 |
) |
<right parenthesis> |
decimal code 042 |
<asterisk> |
|
decimal code 043 |
<plus sign> |
|
decimal code 044 |
, |
<comma> |
decimal code 045 |
<minus sign> |
|
decimal code 046 |
. |
<period> |
decimal code 047 |
/ |
<solidus> |
decimal code 048 |
0 |
|
decimal code 049 |
1 |
|
decimal code 050 |
2 |
|
decimal code 051 |
3 |
|
decimal code 052 |
4 |
|
decimal code 053 |
5 |
|
decimal code 054 |
6 |
|
decimal code 055 |
7 |
|
decimal code 056 |
8 |
|
decimal code 057 |
9 |
|
decimal code 058 |
: |
<colon> |
decimal code 059 |
; |
<semicolon> |
decimal code 060 |
< |
<less than operator> |
decimal code 061 |
= |
<equals operator> |
decimal code 062 |
> |
<greater than operator> |
decimal code 063 |
? |
<question mark> |
decimal code 065 |
A |
|
decimal code 066 |
B |
|
decimal code 067 |
C |
|
decimal code 068 |
D |
|
decimal code 069 |
E |
|
decimal code 070 |
F |
|
decimal code 071 |
G |
|
decimal code 072 |
H |
|
decimal code 073 |
I |
|
decimal code 074 |
J |
|
decimal code 075 |
K |
|
decimal code 076 |
L |
|
decimal code 077 |
M |
|
decimal code 078 |
N |
|
decimal code 079 |
O |
|
decimal code 080 |
P |
|
decimal code 081 |
Q |
|
decimal code 082 |
R |
|
decimal code 083 |
S |
|
decimal code 084 |
T |
|
decimal code 085 |
U |
|
decimal code 086 |
V |
|
decimal code 087 |
W |
|
decimal code 088 |
X |
|
decimal code 089 |
Y |
|
decimal code 090 |
Z |
|
decimal code 095 |
_ |
<underscore> |
decimal code 097 |
a |
|
decimal code 098 |
b |
|
decimal code 099 |
c |
|
decimal code 100 |
d |
|
decimal code 101 |
e |
|
decimal code 102 |
f |
|
decimal code 103 |
g |
|
decimal code 104 |
h |
|
decimal code 105 |
i |
|
decimal code 106 |
j |
|
decimal code 107 |
k |
|
decimal code 108 |
l |
|
decimal code 109 |
m |
|
decimal code 110 |
n |
|
decimal code 111 |
o |
|
decimal code 112 |
p |
|
decimal code 113 |
q |
|
decimal code 114 |
r |
|
decimal code 115 |
s |
|
decimal code 116 |
t |
|
decimal code 117 |
u |
|
decimal code 118 |
v |
|
decimal code 119 |
w |
|
decimal code 120 |
x |
|
decimal code 121 |
y |
|
decimal code 122 |
z |
|
decimal code 124 |
<vertical bar> |
[NON-PORTABLE] The default Collation for the SQL_CHARACTER
Character set is
non-standard because the SQL Standard requires implementors to define
SQL_CHARACTER
’s default Collation. Most DBMSs will sort the characters of
SQL_CHARACTER
in the decimal order shown above. [OCELOT Implementation] The
OCELOT DBMS that comes with this book has a default Collation (called
SQL_CHARACTER
) for the SQL_CHARACTER
Character set that sorts the
characters in their codepage 1252 decimal order.
GRAPHIC_IRV and ASCII_GRAPHIC¶
The GRAPHIC_IRV
Character set (ASCII_GRAPHIC
is a synonym) is an 8-bit
Character set that consists of 95 characters: all of the SQL_CHARACTER
characters, plus an additional 12 characters from codepage 1252. The default
Collation for the GRAPHIC_IRV
Character set sorts the characters in their
codepage 1252 decimal order.
GRAPHIC_IRV
characters, and their codepage 1252 Form-of-use codes, are
(in default Collation order)
decimal code 032 |
<space> |
|
decimal code 033 |
! |
<exclamation mark> |
decimal code 034 |
“ |
<double quote> |
decimal code 035 |
# |
<number sign> |
decimal code 036 |
$ |
<dollar sign> |
decimal code 037 |
% |
<percent> |
decimal code 038 |
& |
<ampersand> |
decimal code 039 |
‘ |
<single quote> |
decimal code 040 |
( |
<left parenthesis> |
decimal code 041 |
) |
<right parenthesis> |
decimal code 042 |
<asterisk> |
|
decimal code 043 |
<plus sign> |
|
decimal code 044 |
, |
<comma> |
decimal code 045 |
<minus sign> |
|
decimal code 046 |
. |
<period> |
decimal code 047 |
/ |
<solidus> |
decimal code 048 |
0 |
|
decimal code 049 |
1 |
|
decimal code 050 |
2 |
|
decimal code 051 |
3 |
|
decimal code 052 |
4 |
|
decimal code 053 |
5 |
|
decimal code 054 |
6 |
|
decimal code 055 |
7 |
|
decimal code 056 |
8 |
|
decimal code 057 |
9 |
|
decimal code 058 |
: |
<colon> |
decimal code 059 |
; |
<semicolon> |
decimal code 060 |
< |
<less than operator> |
decimal code 061 |
= |
<equals operator> |
decimal code 062 |
> |
<greater than operator> |
decimal code 063 |
? |
<question mark> |
decimal code 064 |
@ |
<commercial at sign> |
decimal code 065 |
A |
|
decimal code 066 |
B |
|
decimal code 067 |
C |
|
decimal code 068 |
D |
|
decimal code 069 |
E |
|
decimal code 070 |
F |
|
decimal code 071 |
G |
|
decimal code 072 |
H |
|
decimal code 073 |
I |
|
decimal code 074 |
J |
|
decimal code 075 |
K |
|
decimal code 076 |
L |
|
decimal code 077 |
M |
|
decimal code 078 |
N |
|
decimal code 079 |
O |
|
decimal code 080 |
P |
|
decimal code 081 |
Q |
|
decimal code 082 |
R |
|
decimal code 083 |
S |
|
decimal code 084 |
T |
|
decimal code 085 |
U |
|
decimal code 086 |
V |
|
decimal code 087 |
W |
|
decimal code 088 |
X |
|
decimal code 089 |
Y |
|
decimal code 090 |
Z |
|
decimal code 091 |
[ |
<left square bracket> |
decimal code 092 |
<backslash> |
|
decimal code 093 |
] |
<right square bracket> |
decimal code 094 |
^ |
<circumflex accent mark> |
decimal code 095 |
_ |
<underline character> |
decimal code 096 |
` |
<grave accent mark> |
decimal code 097 |
a |
|
decimal code 098 |
b |
|
decimal code 099 |
c |
|
decimal code 100 |
d |
|
decimal code 101 |
e |
|
decimal code 102 |
f |
|
decimal code 103 |
g |
|
decimal code 104 |
h |
|
decimal code 105 |
i |
|
decimal code 106 |
j |
|
decimal code 107 |
k |
|
decimal code 108 |
l |
|
decimal code 109 |
m |
|
decimal code 110 |
n |
|
decimal code 111 |
o |
|
decimal code 112 |
p |
|
decimal code 113 |
q |
|
decimal code 114 |
r |
|
decimal code 115 |
s |
|
decimal code 116 |
t |
|
decimal code 117 |
u |
|
decimal code 118 |
v |
|
decimal code 119 |
w |
|
decimal code 120 |
x |
|
decimal code 121 |
y |
|
decimal code 122 |
z |
|
decimal code 123 |
{ |
<left brace> |
decimal code 124 |
<vertical bar> |
|
decimal code 125 |
} |
<right brace> |
decimal code 126 |
~ |
<tilde> |
[NON-PORTABLE] The default Collation for the GRAPHIC_IRV
Character set has
a non-standard name because the SQL Standard requires implementors to define
what a legal <Collation name> is. [OCELOT Implementation] The OCELOT DBMS that
comes with this book has a default Collation called GRAPHIC_IRV
for the
GRAPHIC_IRV
Character set.
LATIN1¶
The LATIN1
Character set is an 8-bit Character set that consists of all the
characters commonly used in Danish, Dutch, English, Faeroese, Finnish, French,
German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish
– a total of 191 characters: all of the GRAPHIC_IRV
characters, plus an
additional 96 characters from codepage 1252. The default Collation for the
LATIN1 Character set sorts the characters in their codepage 1252 decimal order.
The additional LATIN1
characters, and their codepage 1252 decimal order.
Additional LATIN1
characters and their codepage 1252 Form-of-use
codes (in default Collation order)
decimal code 160 |
<no-break space> |
|
decimal code 161 |
¡ |
<inverted exclamation mark> |
decimal code 162 |
¢ |
<cent sign> |
decimal code 163 |
£ |
<pound sign> |
decimal code 164 |
¤ |
<currency sign> |
decimal code 165 |
¥ |
<yen sign> |
decimal code 166 |
¦ |
<broken bar> |
decimal code 167 |
§ |
<section sign> |
decimal code 168 |
¨ |
<diaeresis> |
decimal code 169 |
© |
<copyright sign> |
decimal code 170 |
ª |
<feminine ordinal indicator> |
decimal code 171 |
« |
<left angle quotation mark> |
decimal code 172 |
¬ |
<not sign> |
decimal code 173 |
<soft hyphen> |
|
decimal code 174 |
® |
<registered trade mark sign> |
decimal code 175 |
¯ |
<macron> |
decimal code 176 |
° |
<degree sign> |
decimal code 177 |
± |
<plus-minus sign> |
decimal code 178 |
2 |
<superscript two> |
decimal code 179 |
3 |
<superscript three> |
decimal code 180 |
´ |
<acute accent> |
decimal code 181 |
µ |
<micro sign> |
decimal code 182 |
¶ |
<pilcrow sign> |
decimal code 183 |
<middle dot> |
|
decimal code 184 |
¸ |
<cedilla> |
decimal code 185 |
1 |
<superscript one> |
decimal code 186 |
º |
<masculine ordinal indicator> |
decimal code 187 |
› |
<right angle quotation mark> |
decimal code 188 |
¼ |
<fraction one quarter> |
decimal code 189 |
½ |
<fraction one half> |
decimal code 190 |
¾ |
<fraction three quarters> |
decimal code 191 |
¿ |
<inverted question mark> |
decimal code 192 |
À |
<A accent grave> |
decimal code 193 |
Á |
<A acute accent> |
decimal code 194 |
 |
<A circumflex> |
decimal code 195 |
à |
<A tilde> |
decimal code 196 |
Ä |
<A umlaut> |
decimal code 197 |
Å |
<A circle> |
decimal code 198 |
Æ |
<AE ligature> |
decimal code 199 |
Ç |
<C cedilla> |
decimal code 200 |
È |
<E accent grave> |
decimal code 201 |
É |
< E acute accent> |
decimal code 202 |
Ê |
<E carat> |
decimal code 203 |
Ë |
<E umlaut> |
decimal code 204 |
Ì |
<I accent grave> |
decimal code 205 |
Í |
<I acute accent> |
decimal code 206 |
Ī |
<I circumflex> |
decimal code 207 |
Ï |
<I diaeresis> |
decimal code 208 |
Ð |
<Icelandic ETH> |
decimal code 209 |
Ñ |
<N tilde> |
decimal code 210 |
Ò |
<O accent grave> |
decimal code 211 |
Ó |
<O acute accent> |
decimal code 212 |
Ô |
<O circumflex> |
decimal code 213 |
Õ |
<O tilde> |
decimal code 214 |
Ö |
<O umlaut> |
decimal code 215 |
x |
<multiplication sign> |
decimal code 216 |
ø |
<O stroke> |
decimal code 217 |
Ù |
<U accent grave> |
decimal code 218 |
Ú |
<U acute accent> |
decimal code 219 |
Û |
<U circumflex> |
decimal code 220 |
Ü |
<U umlaut> |
decimal code 221 |
Ý |
<Y acute accent> |
decimal code 222 |
Þ |
<THORN> |
decimal code 223 |
ß |
<sharp s> |
decimal code 224 |
à |
<a accent grave> |
decimal code 225 |
á |
<a acute accent> |
decimal code 226 |
â |
<a circumflex> |
decimal code 227 |
ã |
<a tilde> |
decimal code 228 |
ä |
<a diaeresis> |
decimal code 229 |
å |
<a circle> |
decimal code 230 |
æ |
<ae> |
decimal code 231 |
ç |
<c cedilla> |
decimal code 232 |
è |
<e accent grave> |
decimal code 233 |
é |
<e acute accent> |
decimal code 234 |
ê |
<e circumflex> |
decimal code 235 |
ë |
<e umlaut> |
decimal code 236 |
ì |
< i accent grave> |
decimal code 237 |
í |
<i acute accent> |
decimal code 238 |
î |
<i circumflex> |
decimal code 239 |
ï |
<i umlaut> |
decimal code 240 |
ð |
<Icelandic eth> |
decimal code 241 |
ñ |
<n tilde> |
decimal code 242 |
ò |
<o accent grave> |
decimal code 243 |
ó |
<o acute accent> |
decimal code 244 |
ô |
<o circumflex> |
decimal code 245 |
õ |
<o tilde> |
decimal code 246 |
ö |
<o umlaut> |
decimal code 247 |
÷ |
<division sign> |
decimal code 248 |
ø |
<o stroke> |
decimal code 249 |
ù |
<u accent grave> |
decimal code 250 |
ú |
<u acute accent> |
decimal code 251 |
û |
<u circumflex> |
decimal code 252 |
ü |
<u diaeresis> |
decimal code 253 |
ý |
<y acute accent> |
decimal code 254 |
þ |
<thorn> |
decimal code 255 |
ÿ |
<y umlaut> |
[NON-PORTABLE] The default Collation for the LATIN1
Character set has a
non- standard name because the SQL Standard requires implementors to define
what a legal <Collation name> is. [OCELOT Implementation] The OCELOT DBMS that
comes with this book has a default Collation called LATIN1
for the
LATIN1
Character set.
ISO8BIT and ASCII_FULL¶
The ISO8BIT
Character set (ASCII_FULL
is a synonym) is an 8-bit
Character set that consists of 256 characters: all of the LATIN1
characters, plus the rest of the characters (including control characters and
graphic characters) from codepage 1252. The default Collation for the
ISO8BIT
Character set sorts the characters in their codepage 1252 decimal
order.
ISO8BIT
characters and their codepage 1252 Form-of-use codes, are (in
default Collation order)
decimal code 000 |
control character; not displayed |
decimal code 001 |
control character; not displayed |
decimal code 002 |
control character; not displayed |
decimal code 003 |
control character; not displayed |
decimal code 004 |
control character; not displayed |
decimal code 005 |
control character; not displayed |
decimal code 006 |
control character; not displayed |
decimal code 007 |
control character; not displayed |
decimal code 008 |
control character; not displayed |
decimal code 009 |
control character; not displayed |
decimal code 010 |
control character; not displayed |
decimal code 011 |
control character; not displayed |
decimal code 012 |
control character; not displayed |
decimal code 013 |
control character; not displayed |
decimal code 014 |
control character; not displayed |
decimal code 015 |
control character; not displayed |
decimal code 016 |
control character; not displayed |
decimal code 017 |
control character; not displayed |
decimal code 018 |
control character; not displayed |
decimal code 019 |
control character; not displayed |
decimal code 020 |
control character; not displayed |
decimal code 021 |
control character; not displayed |
decimal code 022 |
control character; not displayed |
decimal code 023 |
control character; not displayed |
decimal code 024 |
control character; not displayed |
decimal code 025 |
control character; not displayed |
decimal code 026 |
control character; not displayed |
decimal code 027 |
control character; not displayed |
decimal code 028 |
control character; not displayed |
decimal code 029 |
control character; not displayed |
decimal code 030 |
control character; not displayed |
decimal code 031 |
control character; not displayed |
decimal codes 032 through 126, from LATIN1 |
|
decimal code 127 |
control character; not displayed |
decimal code 128 |
control character; not displayed |
decimal code 129 |
control character; not displayed |
decimal code 130 |
control character; not displayed |
decimal code 131 |
control character; not displayed |
decimal code 132 |
control character; not displayed |
decimal code 133 |
control character; not displayed |
decimal code 134 |
control character; not displayed |
decimal code 135 |
control character; not displayed |
decimal code 136 |
control character; not displayed |
decimal code 137 |
control character; not displayed |
decimal code 138 |
control character; not displayed |
decimal code 139 |
control character; not displayed |
decimal code 140 |
control character; not displayed |
decimal code 141 |
control character; not displayed |
decimal code 142 |
control character; not displayed |
decimal code 143 |
control character; not displayed |
decimal code 144 |
control character; not displayed |
decimal code 145 |
control character; not displayed |
decimal code 146 |
control character; not displayed |
decimal code 147 |
control character; not displayed |
decimal code 148 |
control character; not displayed |
decimal code 149 |
control character; not displayed |
decimal code 150 |
control character; not displayed |
decimal code 151 |
control character; not displayed |
decimal code 152 |
control character; not displayed |
decimal code 153 |
control character; not displayed |
decimal code 154 |
control character; not displayed |
decimal code 155 |
control character; not displayed |
decimal code 156 |
control character; not displayed |
decimal code 157 |
control character; not displayed |
decimal code 158 |
control character; not displayed |
decimal code 159 |
control character; not displayed |
decimal codes 160 through 256, from LATIN1 |
[NON-PORTABLE] The default Collation for the ISO8BIT
Character set has a
non- standard name because the SQL Standard requires implementors to define
what a legal <Collation name> is. [OCELOT Implementation] The OCELOT DBMS that
comes with this book has a default Collation called ISO8BIT
for the
ISO8BIT
Character set.
UNICODE and ISO10646¶
The UNICODE
Character set (ISO10646
is a synonym) is a 16-bit Character
set that consists of every character represented by the Unicode specification
(specifically, by The Unicode Standard Version 2.1). Since there are 38,887
characters in this set, we won’t include a list in this book. It suffices to
say that the UNICODE
Character set includes all of the ISO8BIT
characters, plus many others. The default Collation for the UNICODE
Character set sorts the characters in their Unicode Form-of-use code order.
[NON-PORTABLE] The default Collation for the UNICODE
Character set has a
non- standard name because the SQL Standard requires implementors to define
what a legal <Collation name> is. [OCELOT Implementation] The OCELOT DBMS that
comes with this book has a default Collation called SQL_TEXT
for the
UNICODE
Character set.
SQL_TEXT¶
The SQL Standard requires every DBMS to provide an Ur-Character set named
SQL_TEXT
. The complete set of SQL_TEXT
characters is implementation
defined, but SQL_TEXT
must contain every <SQL language character> plus
every other character that the DBMS supports – that is, SQL_TEXT
(at a
minimum) includes all of the `UNICODE
characters and is therefore at least
a 16-bit Character set.
[NON-PORTABLE] The default Collation for the SQL_TEXT
Character set must be
called SQL_TEXT
too, but is otherwise non-standard because the SQL Standard
requires implementors to define the default Collation and Form-of-use of
SQL_TEXT
. [OCELOT Implementation] The OCELOT DBMS that comes with this book
sorts the characters of SQL_TEXT
in their Unicode Form-of-use code order.
There are some curious consequences to making SQL_TEXT
(and UNICODE
) a
16-bit, rather than an 8-bit Character set.
The consequences are the result of the fact that all <identifier>s and other
character strings in INFORMATION_SCHEMA
, all <SQL-server name>s, all
<Connection name>s, all <AuthorizationID>s, and the results of many functions
(e.g., USER
, CURRENT_USER
, SESSION_USER
, SYSTEM_USER
) are
SQL_TEXT
character strings.
SQL_TEXT
, however, is normally not the default Character set (the default
Character set is almost always an 8-bit Character set) – and SQL specifically
prohibits comparisons between strings that belong to different Character sets.
The rules require that, for such comparisons, one string be translated to the
other string’s Character set first – see TRANSLATE
function in our chapter
on character strings.
So, will you have special trouble when you want to display
INFORMATION_SCHEMA
strings, <AuthorizationID>s, <SQL-server name>s,
<Connection name>s or the like? There are two possibilities, and the answer is
usually “No” to both.
Possibility 1¶
Consider the following SQL statements:
SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES;
SELECT SESSION_USER FROM INFORMATION_SCHEMA.TABLES;
The SQL Standard allows a DBMS to decide whether there should be an automatic
translation from SQL_TEXT
to the default Character set when an SQL
statement assigns an SQL_TEXT
string to a host language variable or
parameter. In such cases, most DBMSs do the automatic translation for you,
e.g., OCELOT DBMS that comes with this book does. So, to display the result of
a SELECT
from INFORMATION_SCHEMA
or the result from a function that
returns an SQL_TEXT
string, you won’t need to do anything special. It won’t
normally be necessary to use the TRANSLATE
function to translate the
retrieved values from 16-bit SQL_TEXT
characters into 8-bit, readable
format or vice versa. Just write your SQL statement as if the Character set
difference doesn’t exist. (In the example above, by the way, the ASCII
<identifier> TABLE_NAME
would also be automatically converted to a
SQL_TEXT
string by your DBMS before the search was done.)
Possibility 2¶
Consider this SQL statement:
SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME 'ocelot';
The SQL Standard does not allow a DBMS to decide whether there should be an
automatic translation from SQL_TEXT
to the default Character set when an
SQL statement involves a comparison between a SQL_TEXT
string and a default
Character set string. But the rules do allow a DBMS to provide a predefined
Translation for this purpose. In such cases, most DBMSs will use this option to
do an automatic translation for you, e.g., the OCELOT DBMS that comes with this
book does. So, to compare SQL_TEXT
strings with default Character set
strings, you won’t need to do anything special. It won’t normally be necessary
to use the TRANSLATE
function to translate the retrieved values from 16-bit
SQL_TEXT
characters into 8-bit, readable format or vice versa. Just write
your SQL statement as if the Character set difference doesn’t exist.
A second consequence of the fact that SQL_TEXT
is 16-bit and the default
Character set is usually 8-bit involves the use of <AuthorizationID>s,
<SQL-server name>s and <Connection name>s. Although <AuthorizationID>s,
<SQL-server name>s and <Connection name>s are stored as 16-bit SQL_TEXT
strings, the SQL-Connection arguments in CONNECT TO are inevitably in 8-bit
sets, since ASCII is used to type them. If, however, these same arguments are
assigned to host language variable parameters, you will see them as 16-bit
strings. Therefore, when you have an expression like
... WHERE <AuthorizationID> = 'x' ...
you know that the <character string literal> 'x'
is a 16-bit string because
<Authorization ID>s are SQL_TEXT
strings. But the same <literal> in a
CONNECT
statement, e.g.,
CONNECT TO 'x';
has to be an 8-bit string because it is typed in ASCII.
One last comment, about parsing SQL_TEXT
strings. Consider the following
expression, which concatenates a character string with a value from a Column:
... _INFORMATION_SCHEMA.SQL_TEXT 'A ' || column_1
The default Character set for an SQL statement is usually an 8-bit Character
set, but the string 'A'`
in this example is explicitly introduced as 16-bit
SQL_TEXT
. Thus, what appears to be two 8-bit characters – A and <space> –
is, in fact, one 16-bit character – the Unicode value 4120h.
Parsing of SQL statements is always based on 8-bit values regardless of the
Character set of a string, so your DBMS will consider the SQL_TEXT
string
'A'
to begin after the first single quote mark and end before the second
single quote mark; i.e.: although the characters inside the single quotes are
16-bit, the delimiting single quote marks themselves are 8-bit. Thus, when the
parser reaches the second single quote mark it knows that the string is over –
despite the fact that in 16-bit, a single quote mark character has not yet been
found.