The LC_COLLATE category provides a collation sequence definition for
numerous utilities in the *XCU* specification (/sort
<../xcu/sort.html>/, /uniq <../xcu/uniq.html>/, and so forth),
regular expression matching (see Regular Expressions
<re.html#tag_007>) and the /strcoll() <../xsh/strcoll.html>/,
/strxfrm() <../xsh/strxfrm.html>/, /wcscoll() <../xsh/wcscoll.html>/
and /wcsxfrm() <../xsh/wcsxfrm.html>/ functions in the *XSH*
specification.
A collation sequence definition defines the relative order between
collating elements (characters and multi-character collating
elements) in the locale. This order is expressed in terms of
collation values; that is, by assigning each element one or more
collation values (also known as collation weights). This does not
imply that implementations assign such values, but that ordering of
strings using the resultant collation definition in the locale will
behave as if such assignment is done and used in the collation
process. At least the following capabilities are provided:
1. *Multi-character collating elements*. Specification of
multi-character collating elements (that is, sequences of two
or more characters to be collated as an entity).
2. *User-defined ordering of collating elements*. Each collating
element is assigned a collation value defining its order in
the character (or basic) collation sequence. This ordering is
used by regular expressions and pattern matching and, unless
collation weights are explicitly specified, also as the
collation weight to be used in sorting.
3. *Multiple weights and equivalence classes*. Collating elements
can be assigned one or more (up to the limit
{COLL_WEIGHTS_MAX}) collating weights for use in sorting. The
first weight is hereafter referred to as the primary weight.
4. *One-to-Many mapping*. A single character is mapped into a
string of collating elements.
5. *Equivalence class definition*. Two or more collating elements
have the same collation value (primary weight).
6. *Ordering by weights*. When two strings are compared to
determine their relative order, the two strings are first
broken up into a series of collating elements; the elements in
each successive pair of elements are then compared according
to the relative primary weights for the elements. If equal,
and more than one weight has been assigned, then the pairs of
collating elements are recompared according to the relative
subsequent weights, until either a pair of collating elements
compare unequal or the weights are exhausted.
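The level-by-level comparison in capability 6 can be sketched in Python. This is a minimal illustration, not how any implementation stores its tables: the WEIGHTS table, its numeric values, and the compare() helper are assumptions made for the example, and every character is treated as a single collating element.

```python
# Hypothetical weight table: element -> (primary weight, secondary weight).
# Letters that differ only in case share a primary weight, so they form
# an equivalence class at the first level.
WEIGHTS = {
    "a": (1, 1), "A": (1, 2),
    "b": (2, 1), "B": (2, 2),
    "c": (3, 1), "C": (3, 2),
}


def compare(s1, s2, weights=WEIGHTS):
    """Return -1, 0 or 1 by comparing one weight level at a time."""
    levels = len(next(iter(weights.values())))
    for level in range(levels):
        k1 = [weights[ch][level] for ch in s1]
        k2 = [weights[ch][level] for ch in s2]
        if k1 != k2:
            # The first level that differs decides the order.
            return -1 if k1 < k2 else 1
    # All levels exhausted without a difference.
    return 0
```

With this table, "ab" and "Ab" are equal on primary weights, and only the secondary weights place "ab" first.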
The following keywords are recognised in a collation sequence
definition. They are described in detail in the following sections.
*collating-element*
Define a collating-element symbol representing a multi-character
collating element. This keyword is optional.
*collating-symbol*
Define a collating symbol for use in collation order statements.
This keyword is optional.
*order_start*
Define collation rules. This statement is followed by one or
more collation order statements, assigning character collation
values and collation weights to collating elements.
*order_end*
Specify the end of the collation-order statements.
*copy*
Specify the name of an existing locale to be used as the
definition of this category. If this keyword is specified, no
other keyword can be specified.
The collating-element Keyword
In addition to the collating elements in the character set, the
*collating-element* keyword is used to define multi-character
collating elements. The syntax is:
"collating-element %s from \"%s\"\n", </collating-symbol/>,
</string/>
The </collating-symbol/> operand is a symbolic name, enclosed
between angle brackets (< and >), and must not duplicate any
symbolic name in the current charmap file (if any), or any other
symbolic name defined in this collation definition. The string
operand is a string of two or more characters that collates as an
entity. A </collating-element/> defined via this keyword is only
recognised with the LC_COLLATE category.
*Example*:
collating-element <ch> from "<c><h>"
collating-element <e-acute> from "<acute><e>"
collating-element <ll> from "ll"
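As a rough illustration of what "collates as an entity" means, the following Python sketch splits a string into collating elements before any weights are looked up. The split_elements() helper and the longest-match rule are assumptions for the example; the text above does not prescribe a matching strategy.

```python
def split_elements(text, multi=("ch", "ll")):
    """Split text into collating elements, preferring the longest match.

    multi lists the multi-character collating elements (hypothetical here).
    """
    ordered = sorted(multi, key=len, reverse=True)
    elements = []
    i = 0
    while i < len(text):
        for seq in ordered:
            if text.startswith(seq, i):
                # A multi-character element is consumed as one unit.
                elements.append(seq)
                i += len(seq)
                break
        else:
            # No multi-character element starts here: take one character.
            elements.append(text[i])
            i += 1
    return elements
```

For example, split_elements("chill") yields the three elements "ch", "i" and "ll".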
The collating-symbol Keyword
This keyword is used to define symbols for use in collation
sequence statements; that is, between the *order_start* and the
*order_end* keywords. The syntax is:
"collating-symbol %s\n", </collating-symbol/>
The </collating-symbol/> is a symbolic name, enclosed between angle
brackets (< and >), and must not duplicate any symbolic name in the
current charmap file (if any), or any other symbolic name defined in
this collation definition. A </collating-symbol/> defined via this
keyword is only recognised with the LC_COLLATE category.
*Example*:
collating-symbol <UPPER_CASE>
collating-symbol <HIGH>
The *collating-symbol* keyword defines a symbolic name that can be
associated with a relative position in the character order sequence.
While such a symbolic name does not represent any collating element,
it can be used as a weight.
The order_start Keyword
The *order_start* keyword must precede collation order entries and
also defines the number of weights for this collation sequence
definition and other collation rules.
The syntax of the *order_start* keyword is:
"order_start %s;%s;...;%s\n", </sort-rules/>, </sort-rules/>
The operands to the *order_start* keyword are optional. If present,
the operands define rules to be applied when strings are compared.
The number of operands defines how many weights each element is
assigned; if no operands are present, one *forward* operand is
assumed. If present, the first operand defines rules to be applied
when comparing strings using the first (primary) weight; the second
when comparing strings using the second weight, and so on. Operands
are separated by semicolons (;). Each operand consists of one or
more collation directives, separated by commas (,). If the number of
operands exceeds the {COLL_WEIGHTS_MAX} limit, the utility will
issue a warning message. The following directives will be supported:
*forward*
Specifies that comparison operations for the weight level
proceed from start of string towards the end of string.
*backward*
Specifies that comparison operations for the weight level
proceed from end of string towards the beginning of string.
*position*
Specifies that comparison operations for the weight level will
consider the relative position of elements in the strings not
subject to *IGNORE*. The string containing an element not
subject to *IGNORE* after the fewest collating elements subject
to *IGNORE* from the start of the compare will collate first. If
both strings contain a character not subject to *IGNORE* in the
same relative position, the collating values assigned to the
elements will determine the ordering. In case of equality,
subsequent characters not subject to *IGNORE* are considered in
the same manner.
The directives *forward* and *backward* are mutually exclusive.
*Example*:
order_start forward;backward
If no operands are specified, a single *forward* operand is assumed.
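The difference between *forward* and *backward* at the second level can be sketched in Python. The weight tables and the sort_key() helper below are assumptions for illustration only: accented and unaccented letters share a primary weight and differ in their secondary weight, as is commonly done for French.

```python
O_CIRC = "\u00f4"   # o with circumflex
E_ACUTE = "\u00e9"  # e with acute accent

# Hypothetical weights: accents only matter at the secondary level.
PRIMARY = {"c": 1, "o": 2, O_CIRC: 2, "t": 3, "e": 4, E_ACUTE: 4}
SECONDARY = {"c": 1, "o": 1, O_CIRC: 2, "t": 1, "e": 1, E_ACUTE: 2}


def sort_key(word, second_level="forward"):
    """Build a two-level key; the second level may be scanned backward."""
    primary = [PRIMARY[ch] for ch in word]
    secondary = [SECONDARY[ch] for ch in word]
    if second_level == "backward":
        # Compare secondary weights from the end of the string.
        secondary.reverse()
    return (primary, secondary)


cote = "cote"
cote_acute = "cot" + E_ACUTE
cote_circ = "c" + O_CIRC + "te"
cote_both = "c" + O_CIRC + "t" + E_ACUTE
words = [cote_both, cote_acute, cote_circ, cote]

forward_sorted = sorted(words, key=sort_key)
backward_sorted = sorted(words, key=lambda w: sort_key(w, "backward"))
```

All four words are equal at the primary level. With order_start forward;forward the accent nearest the start of the word decides, while with order_start forward;backward the accent nearest the end decides, so the two middle words swap places.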
The character (and collating element) order is defined by the order
in which characters and elements are specified between the
*order_start* and *order_end* keywords. This character order is used
in range expressions in regular expressions (see Regular Expressions
<re.html#tag_007>). Weights assigned to the characters and elements
define the collation sequence; in the absence of weights, the
character order is also the collation sequence.
The *position* keyword provides the capability to consider, in a
compare, the relative position of characters not subject to
*IGNORE*. As an example, consider the two strings "o-ring" and
"or-ing". Assuming the hyphen is subject to *IGNORE* on the first
pass, the two strings will compare equal, and the position of the
hyphen is immaterial. On second pass, all characters except the
hyphen are subject to *IGNORE*, and in the normal case the two
strings would again compare equal. By taking position into account,
the first collates before the second.
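The "o-ring" example can be modelled with a short Python sketch. The level_key() and second_pass() helpers and the weight value 1 are assumptions for illustration; pairing each surviving weight with its element index is one simple way to make an element that follows fewer ignored elements collate first.

```python
IGNORE = None


def level_key(weights, position=False):
    """Build the comparison key for one weight level.

    weights holds one entry per collating element; IGNORE marks elements
    that are subject to IGNORE at this level.
    """
    if not position:
        # Ignored elements are simply dropped.
        return [w for w in weights if w is not IGNORE]
    # With position, remember where each surviving element sits, so the
    # number of ignored elements before it takes part in the comparison.
    return [(i, w) for i, w in enumerate(weights) if w is not IGNORE]


def second_pass(text):
    """Hypothetical second level: only the hyphen carries a weight."""
    return [1 if ch == "-" else IGNORE for ch in text]
```

Without *position*, both strings reduce to the same key on the second pass; with it, the hyphen in "o-ring" is found after one ignored element and the hyphen in "or-ing" after two, so "o-ring" collates first.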
Collation Order
The *order_start* keyword is followed by collating identifier
entries. The syntax for the collating element entries is:
"%s %s;%s;...;%s\n", </collating-identifier/>, </weight/>,
</weight/>, ...
Each /collating-identifier/ consists of either a character (in any
of the forms defined in Locale Definition <#tag_005_003>), a
</collating-element/>, a </collating-symbol/>, an ellipsis or the
special symbol *UNDEFINED*. The order in which collating elements
are specified determines the character order sequence, such that
each collating element compares less than the elements following it.
The NUL character compares lower than any other character.
A </collating-element/> is used to specify multi-character collating
elements, and indicates that the character sequence specified via
the </collating-element/> is to be collated as a unit and in the
relative order specified by its place.
A </collating-symbol/> is used to define a position in the relative
order for use in weights. No weights are specified with a
</collating-symbol/>.
The ellipsis symbol specifies that a sequence of characters will
collate according to their encoded character values. It is
interpreted as indicating that all characters with a coded character
set value higher than the value of the character in the preceding
line, and lower than the coded character set value for the character
in the following line, in the current coded character set, will be
placed in the character collation order between the previous and the
following character in ascending order according to their coded
character set values. An initial ellipsis is interpreted as if the
preceding line specified the NUL character, and a trailing ellipsis
as if the following line specified the highest coded character set
value in the current coded character set. An ellipsis is treated as
invalid if the preceding or following lines do not specify
characters in the current coded character set. The use of the
ellipsis symbol ties the definition to a specific coded character
set and may preclude the definition from being portable between
implementations.
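A small Python sketch of how an ellipsis between two characters could be expanded. The expand_ellipsis() helper is hypothetical, and Unicode code points stand in for the values of the current coded character set, which is an assumption of the example.

```python
def expand_ellipsis(prev_char, next_char):
    """Return the characters strictly between prev_char and next_char,
    in ascending order of their coded values."""
    return [chr(code) for code in range(ord(prev_char) + 1, ord(next_char))]
```

For example, an ellipsis between the lines for a and e stands for b, c and d in that order. Because the result depends on the encoding, the same definition may expand differently under another coded character set.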
The symbol *UNDEFINED* is interpreted as including all coded
character set values not specified explicitly or via the ellipsis
symbol. Such characters are inserted in the character collation
order at the point indicated by the symbol, and in ascending order
according to their coded character set values. If no *UNDEFINED*
symbol is specified, and the current coded character set contains
characters not specified in this section, the utility will issue a
warning message and place such characters at the end of the
character collation order.
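The placement of unspecified characters can be sketched as follows. The place_undefined() helper is hypothetical, the order list holds single characters plus the marker string "UNDEFINED", and the warning message required when the marker is absent is left out.

```python
def place_undefined(order, charset):
    """Insert characters of charset that are not listed in order at the
    UNDEFINED marker, or at the end if there is no marker."""
    listed = {entry for entry in order if entry != "UNDEFINED"}
    # Unlisted characters go in ascending order of their coded values.
    missing = sorted(ch for ch in charset if ch not in listed)
    if "UNDEFINED" in order:
        i = order.index("UNDEFINED")
        return order[:i] + missing + order[i + 1:]
    return order + missing
```

With the marker first, the unlisted characters c and d of the character set "abcd" precede a and b; without a marker they are appended after the listed characters.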
The optional operands for each collation-element are used to define
the primary, secondary, or subsequent weights for the collating
element. The first operand specifies the relative primary weight,
the second the relative secondary weight, and so on. Two or more
collation-elements can be assigned the same weight; they belong to
the same equivalence class if they have the same primary weight. Collation behaves as
if, for each weight level, elements subject to *IGNORE* are removed,
unless the *position* collation directive is specified for the
corresponding level with the *order_start* keyword. Then each
successive pair of elements is compared according to the relative
weights for the elements. If the two strings compare equal, the
process is repeated for the next weight level, up to the limit
{COLL_WEIGHTS_MAX}.
Weights are expressed as characters (in any of the forms specified
in Locale Definition <#tag_005_003>), </collating-symbol/>s,
</collating-element/>s, an ellipsis, or the special symbol *IGNORE*.
A single character, a </collating-symbol/> or a
</collating-element/> represents the relative position in the
character collating sequence of the character or symbol, rather than
the character or characters themselves. Thus, rather than assigning
absolute values to weights, a particular weight is expressed using
the relative order value assigned to a collating element based on
its order in the character collation sequence.
One-to-many mapping is indicated by specifying two or more
concatenated characters or symbolic names. For example, if the
character <eszet> is given the string <s><s> as a weight,
comparisons are performed as if all occurrences of the character
<eszet> are replaced by <s><s> (assuming that <s> has the collating
weight <s>). If it is necessary to define <eszet> and <s><s> as an
equivalence class, then a collating element must be defined for the
string ss.
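The one-to-many case can be illustrated in Python by letting each character map to a list of weights. The PRIMARY table, its numeric values, and the primary_key() helper are assumptions for the example.

```python
ESZET = "\u00df"  # the <eszet> character

# Hypothetical primary weights. Each value is a list so that a single
# character can contribute more than one weight to the key.
PRIMARY = {"a": [1], "s": [19], "t": [20], ESZET: [19, 19]}


def primary_key(text):
    """Concatenate the primary weights of every character in text."""
    key = []
    for ch in text:
        key.extend(PRIMARY[ch])
    return key
```

Under these weights a string containing <eszet> produces the same primary key as the string with <s><s> in its place, which is the "as if replaced" behaviour described above.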
All characters specified via an ellipsis will by default be assigned
unique weights, equal to the relative order of characters.
Characters specified via an explicit or implicit *UNDEFINED* special
symbol will by default be assigned the same primary weight (that is,
belong to the same equivalence class). An ellipsis symbol as a
weight is interpreted to mean that each character in the sequence
has unique weights, equal to the relative order of their character
in the character collation sequence. The use of the ellipsis as a
weight is treated as an error if the collating element is neither an
ellipsis nor the special symbol *UNDEFINED*.
The special keyword *IGNORE* as a weight indicates that when strings
are compared using the weights at the level where *IGNORE* is
specified, the collating element is ignored; that is, as if the
string did not contain the collating element. In regular expressions
and pattern matching, all characters that are subject to *IGNORE* in
their primary weight form an equivalence class.
An empty operand is interpreted as the collating element itself.
For example, the order statement:
<a> <a>;<a>
is equal to:
<a>
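The defaulting of empty operands can be sketched with a small parser for a single order statement. The parse_order_line() helper is hypothetical and deliberately simple: it assumes two weight levels by default and does not handle semicolons inside quoted strings.

```python
def parse_order_line(line, levels=2):
    """Split one collation order statement into (identifier, weights).

    Missing or empty operands default to the collating element itself.
    """
    parts = line.split(None, 1)
    ident = parts[0]
    operands = parts[1].split(";") if len(parts) > 1 else []
    operands = [op.strip() for op in operands]
    # Pad with empty operands up to the number of weight levels.
    operands += [""] * (levels - len(operands))
    weights = [op if op else ident for op in operands]
    return ident, weights
```

Both forms of the statement shown above parse to the identifier <a> with the weights <a> and <a>.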
An ellipsis can be used as an operand if the collating element was
an ellipsis, and is interpreted as the value of each character
defined by the ellipsis.
The collation order as defined in this section defines the
interpretation of bracket expressions in regular expressions (see RE
Bracket Expression <re.html#tag_007_003_005>).
*Example*:
order_start forward;backward
UNDEFINED IGNORE;IGNORE
<LOW>
<space> <LOW>;<space>
... <LOW>;...
<a> <a>;<a>
<a-acute> <a>;<a-acute>
<a-grave> <a>;<a-grave>
<A> <a>;<A>
<A-acute> <a>;<A-acute>
<A-grave> <a>;<A-grave>
<ch> <ch>;<ch>
<Ch> <ch>;<Ch>
<s> <s>;<s>
<eszet> "<s><s>";"<eszet><eszet>"
order_end
This example is interpreted as follows:
1. The *UNDEFINED* means that all characters not specified in
this definition (explicitly or via the ellipsis) are ignored
for collation purposes; for regular expression purposes they
are ordered first.
2. All characters between <space> and <a> have the same primary
equivalence class and individual secondary weights based on
their ordinal encoded values.
3. All characters based on the upper- or lower-case character a
belong to the same primary equivalence class.
4. The multi-character collating element <ch> is represented by
the collating symbol <ch> and belongs to the same primary
equivalence class as the multi-character collating element <Ch>.
The order_end Keyword
The collating order entries must be terminated with an *order_end*
keyword.
The collation sequence definition of the POSIX locale follows; the
code listing depicts the /localedef <../xcu/localedef.html>/ input.
LC_COLLATE
# This is the POSIX locale definition for the LC_COLLATE category.
# The order is the same as in the ASCII codeset.
order_start forward
<NUL>
<SOH>
<STX>
<ETX>
<EOT>
<ENQ>
<ACK>
<alert>
<backspace>
<tab>
<newline>
<vertical-tab>
<form-feed>
<carriage-return>
<SO>
<SI>
<DLE>
<DC1>
<DC2>
<DC3>
<DC4>
<NAK>
<SYN>
<ETB>
<CAN>
<EM>
<SUB>
<ESC>
<IS4>
<IS3>
<IS2>
<IS1>
<space>
<exclamation-mark>
<quotation-mark>
<number-sign>
<dollar-sign>
<percent-sign>
<ampersand>
<apostrophe>
<left-parenthesis>
<right-parenthesis>
<asterisk>
<plus-sign>
<comma>
<hyphen>
<period>
<slash>
<zero>
<one>
<two>
<three>
<four>
<five>
<six>
<seven>
<eight>
<nine>
<colon>
<semicolon>
<less-than-sign>
<equals-sign>
<greater-than-sign>
<question-mark>
<commercial-at>
<A>
<B>
<C>
<D>
<E>
<F>
<G>
<H>
<I>
<J>
<K>
<L>
<M>
<N>
<O>
<P>
<Q>
<R>
<S>
<T>
<U>
<V>
<W>
<X>
<Y>
<Z>
<left-square-bracket>
<backslash>
<right-square-bracket>
<circumflex>
<underscore>
<grave-accent>
<a>
<b>
<c>
<d>
<e>
<f>
<g>
<h>
<i>
<j>
<k>
<l>
<m>
<n>
<o>
<p>
<q>
<r>
<s>
<t>
<u>
<v>
<w>
<x>
<y>
<z>
<left-curly-bracket>
<vertical-line>
<right-curly-bracket>
<tilde>
<DEL>
order_end
#
END LC_COLLATE
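The effect of this definition can be observed from a program. The following Python sketch selects the "C" locale for LC_COLLATE, which is expected to behave like the POSIX locale, and sorts with locale.strcoll(), Python's wrapper around the collation function named above. The posix_sorted() helper is a name made up for the example.

```python
import functools
import locale

# Select the "C" locale for collation only.
locale.setlocale(locale.LC_COLLATE, "C")


def posix_sorted(words):
    """Sort strings with strcoll() under the current LC_COLLATE locale."""
    return sorted(words, key=functools.cmp_to_key(locale.strcoll))
```

For ASCII input the result follows the listing above: digits first, then upper-case letters, then the underscore, then lower-case letters.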