UnixSort

From WorkOutWiki2008

    The LC_COLLATE category provides a collation sequence definition for
    numerous utilities in the *XCU* specification (/sort
    <../xcu/sort.html>/, /uniq <../xcu/uniq.html>/, and so forth),
    regular expression matching (see Regular Expressions
    <re.html#tag_007>) and the /strcoll() <../xsh/strcoll.html>/,
    /strxfrm() <../xsh/strxfrm.html>/, /wcscoll() <../xsh/wcscoll.html>/
    and /wcsxfrm() <../xsh/wcsxfrm.html>/ functions in the *XSH*
    specification.
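
    As a rough illustration of how the collation functions named above
    are reached from a program, the following Python sketch uses the
    standard locale module, which wraps strcoll() and strxfrm(). It
    assumes the "C" (POSIX) locale; the word list is made up for the
    example.

```python
import functools
import locale

# Assumption: the "C" (POSIX) locale is selected for LC_COLLATE.
locale.setlocale(locale.LC_COLLATE, "C")

words = ["b", "A", "a", "B"]  # hypothetical input

# strcoll() compares two strings under the current LC_COLLATE rules.
by_strcoll = sorted(words, key=functools.cmp_to_key(locale.strcoll))

# strxfrm() transforms a string into a key; comparing the keys directly
# gives the same order as strcoll() on the original strings.
by_strxfrm = sorted(words, key=locale.strxfrm)
```

    Under a different LC_COLLATE setting the same calls would follow
    that locale's collation sequence definition instead.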

    A collation sequence definition defines the relative order between
    collating elements (characters and multi-character collating
    elements) in the locale. This order is expressed in terms of
    collation values; that is, by assigning each element one or more
    collation values (also known as collation weights). This does not
    imply that implementations assign such values, but that ordering of
    strings using the resultant collation definition in the locale will
    behave as if such assignment is done and used in the collation
    process. At least the following capabilities are provided:

       1. *Multi-character collating elements*. Specification of
          multi-character collating elements (that is, sequences of two
          or more characters to be collated as an entity).

       2. *User-defined ordering of collating elements*. Each collating
          element is assigned a collation value defining its order in
          the character (or basic) collation sequence. This ordering is
          used by regular expressions and pattern matching and, unless
          collation weights are explicitly specified, also as the
          collation weight to be used in sorting.

       3. *Multiple weights and equivalence classes*. Collating elements
          can be assigned one or more (up to the limit
          {COLL_WEIGHTS_MAX}) collating weights for use in sorting. The
          first weight is hereafter referred to as the primary weight.

       4. *One-to-Many mapping*. A single character is mapped into a
          string of collating elements.

       5. *Equivalence class definition*. Two or more collating elements
          have the same collation value (primary weight).

       6. *Ordering by weights*. When two strings are compared to
          determine their relative order, the two strings are first
          broken up into a series of collating elements; the elements in
          each successive pair of elements are then compared according
          to the relative primary weights for the elements. If equal,
          and more than one weight has been assigned, then the pairs of
          collating elements are recompared according to the relative
          subsequent weights, until either a pair of collating elements
          compare unequal or the weights are exhausted.
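
    The level-by-level comparison described in item 6 can be sketched
    in Python as follows. This is a minimal sketch, not how any
    particular implementation works: the weight table and the function
    name compare are hypothetical, and every collating element is
    assumed to be a single character.

```python
# Hypothetical two-level weight table: (primary, secondary) per element.
WEIGHTS = {
    "a": (1, 1),
    "A": (1, 2),
    "b": (2, 1),
    "B": (2, 2),
}

def compare(s1, s2, levels=2):
    """Return -1, 0 or 1 by comparing weights one level at a time."""
    for level in range(levels):
        w1 = [WEIGHTS[ch][level] for ch in s1]
        w2 = [WEIGHTS[ch][level] for ch in s2]
        if w1 != w2:
            # The first level that differs decides the order.
            return -1 if w1 < w2 else 1
    # All levels compared equal: the weights are exhausted.
    return 0
```

    With this table, "ab" and "Ab" are equal on primary weights and are
    separated only by the secondary weights.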

    The following keywords are recognised in a collation sequence
    definition. They are described in detail in the following sections.

    *collating-element*
        Define a collating-element symbol representing a multi-character
        collating element. This keyword is optional. 
    *collating-symbol*
        Define a collating symbol for use in collation order statements.
        This keyword is optional. 
    *order_start*
        Define collation rules. This statement is followed by one or
        more collation order statements, assigning character collation
        values and collation weights to collating elements. 
    *order_end*
        Specify the end of the collation-order statements. 
    *copy*
        Specify the name of an existing locale to be used as the
        definition of this category. If this keyword is specified, no
        other keyword can be specified. 


               The collating-element Keyword

    In addition to the collating elements in the character set, the
    *collating-element* keyword is used to define multi-character
    collating elements. The syntax is:

        "collating-element %s from \"%s\"\n", </collating-symbol/>,
        </string/> 

    The </collating-symbol/> operand is a symbolic name, enclosed
    between angle brackets (< and >), and must not duplicate any
    symbolic name in the current charmap file (if any), or any other
    symbolic name defined in this collation definition. The string
    operand is a string of two or more characters that collates as an
    entity. A </collating-element/> defined via this keyword is only
    recognised with the LC_COLLATE category.

    *Example*:

collating-element <ch> from "<c><h>"
collating-element <e-acute> from "<acute><e>"
collating-element <ll> from "ll"
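
    Once multi-character collating elements are declared, a string has
    to be broken into collating elements before weights are looked up.
    The Python sketch below shows one way this might be done. The list
    MULTI, the function name split_elements and the longest-match rule
    are assumptions made for illustration.

```python
# Hypothetical multi-character collating elements, standing in for
# declarations such as <ch> from "<c><h>" and <ll> from "ll".
MULTI = ["ch", "ll"]

def split_elements(text, multi=MULTI):
    """Split text into collating elements, preferring the longest match."""
    candidates = sorted(multi, key=len, reverse=True)
    elements = []
    i = 0
    while i < len(text):
        for cand in candidates:
            if text.startswith(cand, i):
                elements.append(cand)
                i += len(cand)
                break
        else:
            # No multi-character element starts here: take one character.
            elements.append(text[i])
            i += 1
    return elements
```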


               The collating-symbol Keyword

    This keyword will be used to define symbols for use in collation
    sequence statements; that is, between the *order_start* and the
    *order_end* keywords. The syntax is:

        "collating-symbol %s\n", </collating-symbol/> 


    The </collating-symbol/> is a symbolic name, enclosed between angle
    brackets (< and >), and must not duplicate any symbolic name in the
    current charmap file (if any), or any other symbolic name defined in
    this collation definition. A </collating-symbol/> defined via this
    keyword is only recognised with the LC_COLLATE category.

    *Example*:

collating-symbol <UPPER_CASE>
collating-symbol <HIGH>

    The *collating-symbol* keyword defines a symbolic name that can be
    associated with a relative position in the character order sequence.
    While such a symbolic name does not represent any collating element,
    it can be used as a weight.


               The order_start Keyword

    The *order_start* keyword must precede collation order entries and
    also defines the number of weights for this collation sequence
    definition and other collation rules.

    The syntax of the *order_start* keyword is:

        "order_start %s;%s;...;%s\n", </sort-rules/>, </sort-rules/> 

    The operands to the *order_start* keyword are optional. If present,
    the operands define rules to be applied when strings are compared.
    The number of operands defines how many weights each element is
    assigned; if no operands are present, one *forward* operand is
    assumed. If present, the first operand defines rules to be applied
    when comparing strings using the first (primary) weight; the second
    when comparing strings using the second weight, and so on. Operands
    are separated by semicolons (;). Each operand consists of one or
    more collation directives, separated by commas (,). If the number of
    operands exceeds the {COLL_WEIGHTS_MAX} limit, the utility will
    issue a warning message. The following directives will be supported:

    *forward*
        Specifies that comparison operations for the weight level
        proceed from start of string towards the end of string. 
    *backward*
        Specifies that comparison operations for the weight level
        proceed from end of string towards the beginning of string. 
    *position*
        Specifies that comparison operations for the weight level will
        consider the relative position of elements in the strings not
        subject to *IGNORE*. The string containing an element not
        subject to *IGNORE* after the fewest collating elements subject
        to *IGNORE* from the start of the compare will collate first. If
        both strings contain a character not subject to *IGNORE* in the
        same relative position, the collating values assigned to the
        elements will determine the ordering. In case of equality,
        subsequent characters not subject to *IGNORE* are considered in
        the same manner. 

    The directives *forward* and *backward* are mutually exclusive.

    *Example*:

order_start    forward;backward

    If no operands are specified, a single *forward* operand is assumed.
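
    The difference between the *forward* and *backward* directives can
    be sketched in Python: the same weights are scanned from opposite
    ends of the string. The function name level_key and the weights
    used in the usage note are hypothetical.

```python
def level_key(weights, direction):
    """Return the weights of one level in the order that level scans them."""
    if direction == "forward":
        return list(weights)
    if direction == "backward":
        return list(reversed(weights))
    raise ValueError(direction)
```

    For instance, take two four-element strings whose second-level
    weights are [0, 1, 0, 0] and [0, 0, 0, 1], where 1 marks an accented
    element. Scanned forward, the second string collates first; scanned
    backward, the first one does.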

    The character (and collating element) order is defined by the order
    in which characters and elements are specified between the
    *order_start* and *order_end* keywords. This character order is used
    in range expressions in regular expressions (see Regular Expressions
    <re.html#tag_007>). Weights assigned to the characters and elements
    define the collation sequence; in the absence of weights, the
    character order is also the collation sequence.

    The *position* keyword provides the capability to consider, in a
    compare, the relative position of characters not subject to
    *IGNORE*. As an example, consider the two strings "o-ring" and
    "or-ing". Assuming the hyphen is subject to *IGNORE* on the first
    pass, the two strings will compare equal, and the position of the
    hyphen is immaterial. On second pass, all characters except the
    hyphen are subject to *IGNORE*, and in the normal case the two
    strings would again compare equal. By taking position into account,
    the first collates before the second.
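
    The "o-ring" example can be sketched in Python as follows. This is
    a simplified model, assuming single-character collating elements:
    the IGNORE marker, the helper names and the second-pass weights are
    hypothetical, and the position of an element is approximated by its
    index in the string.

```python
IGNORE = None  # marker for an element that is ignored at this level

def plain_key(weights):
    """Drop ignored elements, as happens without the position directive."""
    return [w for w in weights if w is not IGNORE]

def position_key(weights):
    """Keep each non-ignored weight together with its element index."""
    return [(i, w) for i, w in enumerate(weights) if w is not IGNORE]

def second_pass(text):
    # Hypothetical second-level weights: only the hyphen carries a weight.
    return [1 if ch == "-" else IGNORE for ch in text]
```

    Without position, both strings reduce to the same key on the second
    pass; with position, the hyphen in "o-ring" is reached earlier, so
    that string collates first.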


               Collation Order

    The *order_start* keyword is followed by collating identifier
    entries. The syntax for the collating element entries is:

        "%s %s;%s;...;%s\n", </collating-identifier/>, </weight/>,
        </weight/>, ... 

    Each /collating-identifier/ consists of either a character (in any
    of the forms defined in Locale Definition <#tag_005_003>), a
    </collating-element/>, a </collating-symbol/>, an ellipsis or the
    special symbol *UNDEFINED*. The order in which collating elements
    are specified determines the character order sequence, such that
    each collating element compares less than the elements following it.
    The NUL character compares lower than any other character.

    A </collating-element/> is used to specify multi-character collating
    elements, and indicates that the character sequence specified via
    the </collating-element/> is to be collated as a unit and in the
    relative order specified by its place.

    A </collating-symbol/> is used to define a position in the relative
    order for use in weights. No weights are specified with a
    </collating-symbol/>.

    The ellipsis symbol specifies that a sequence of characters will
    collate according to their encoded character values. It is
    interpreted as indicating that all characters with a coded character
    set value higher than the value of the character in the preceding
    line, and lower than the coded character set value for the character
    in the following line, in the current coded character set, will be
    placed in the character collation order between the previous and the
    following character in ascending order according to their coded
    character set values. An initial ellipsis is interpreted as if the
    preceding line specified the NUL character, and a trailing ellipsis
    as if the following line specified the highest coded character set
    value in the current coded character set. An ellipsis is treated as
    invalid if the preceding or following lines do not specify
    characters in the current coded character set. The use of the
    ellipsis symbol ties the definition to a specific coded character
    set and may preclude the definition from being portable between
    implementations.
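
    The expansion of an ellipsis between two explicitly listed
    characters can be sketched in Python. The function name
    expand_ellipsis is hypothetical, and Unicode code points stand in
    for the coded character set values of the current codeset.

```python
def expand_ellipsis(prev_char, next_char):
    """Characters strictly between prev_char and next_char, ascending."""
    return [chr(c) for c in range(ord(prev_char) + 1, ord(next_char))]
```

    Because the result depends on the encoded values, the same ellipsis
    can expand differently under another coded character set.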

    The symbol *UNDEFINED* is interpreted as including all coded
    character set values not specified explicitly or via the ellipsis
    symbol. Such characters are inserted in the character collation
    order at the point indicated by the symbol, and in ascending order
    according to their coded character set values. If no *UNDEFINED*
    symbol is specified, and the current coded character set contains
    characters not specified in this section, the utility will issue a
    warning message and place such characters at the end of the
    character collation order.
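
    The placement of characters covered by *UNDEFINED* can be sketched
    in Python. The function name full_order and its parameters are
    hypothetical; undefined_at is the index in the explicit order at
    which the *UNDEFINED* symbol appears, and Unicode code points stand
    in for coded character set values.

```python
def full_order(explicit, charset, undefined_at):
    """Insert unlisted characters at undefined_at, in ascending code order."""
    listed = set(explicit)
    rest = sorted(ch for ch in charset if ch not in listed)
    return explicit[:undefined_at] + rest + explicit[undefined_at:]
```

    Passing undefined_at=len(explicit) models the fallback case, where
    unspecified characters are placed at the end of the order.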

    The optional operands for each collation-element are used to define
    the primary, secondary, or subsequent weights for the collating
    element. The first operand specifies the relative primary weight,
    the second the relative secondary weight, and so on. Two or more
    collation-elements can be assigned the same weight; they belong to
    the same equivalence class if they have the same primary weight.
    Collation behaves as
    if, for each weight level, elements subject to *IGNORE* are removed,
    unless the *position* collation directive is specified for the
    corresponding level with the *order_start* keyword. Then each
    successive pair of elements is compared according to the relative
    weights for the elements. If the two strings compare equal, the
    process is repeated for the next weight level, up to the limit
    {COLL_WEIGHTS_MAX}.

    Weights are expressed as characters (in any of the forms specified
    in Locale Definition <#tag_005_003>), </collating-symbol/>s,
    </collating-element/>s, an ellipsis, or the special symbol *IGNORE*.
    A single character, a </collating-symbol/> or a
    </collating-element/> represent the relative position in the
    character collating sequence of the character or symbol, rather than
    the character or characters themselves. Thus, rather than assigning
    absolute values to weights, a particular weight is expressed using
    the relative order value assigned to a collating element based on
    its order in the character collation sequence.
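
    The idea that a weight names a relative position, not an absolute
    value, can be sketched in Python. The entries below are
    hypothetical; each weight is resolved to the position of the
    collating element it refers to.

```python
# Hypothetical order entries: (collating identifier, primary, secondary).
ENTRIES = [
    ("<a>", "<a>", "<a>"),
    ("<a-acute>", "<a>", "<a-acute>"),
    ("<A>", "<a>", "<A>"),
    ("<b>", "<b>", "<b>"),
]

# Position in the character collation sequence, by order of appearance.
POSITION = {ident: n for n, (ident, _, _) in enumerate(ENTRIES)}

# Each weight is resolved to the position of the element it names.
RESOLVED = {
    ident: (POSITION[primary], POSITION[secondary])
    for ident, primary, secondary in ENTRIES
}
```

    Here <a>, <a-acute> and <A> share the primary weight of <a> and
    differ only in their secondary weights.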

    One-to-many mapping is indicated by specifying two or more
    concatenated characters or symbolic names. For example, if the
    character <eszet> is given the string <s><s> as a weight,
    comparisons are performed as if all occurrences of the character
    <eszet> are replaced by <s><s> (assuming that <s> has the collating
    weight <s>). If it is necessary to define <eszet> and <s><s> as an
    equivalence class, then a collating element must be defined for the
    string ss.
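
    The one-to-many mapping for <eszet> can be sketched in Python. The
    weight table and the function name primary_key are hypothetical;
    the point is only that one character contributes two weights to the
    key.

```python
# Hypothetical primary weights; the eszet maps to the weights of <s><s>.
PRIMARY = {
    "a": [1],
    "e": [2],
    "s": [3],
    "t": [4],
    "ß": [3, 3],
}

def primary_key(text):
    """Concatenate the primary weights of every character in text."""
    key = []
    for ch in text:
        key.extend(PRIMARY[ch])
    return key
```

    With this table the keys for "aß" and "ass" are identical, so the
    two strings compare equal on the primary level.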

    All characters specified via an ellipsis will by default be assigned
    unique weights, equal to the relative order of characters.
    Characters specified via an explicit or implicit *UNDEFINED* special
    symbol will by default be assigned the same primary weight (that is,
    belong to the same equivalence class). An ellipsis symbol as a
    weight is interpreted to mean that each character in the sequence
    has unique weights, equal to the relative order of their character
    in the character collation sequence. The use of the ellipsis as a
    weight is treated as an error if the collating element is neither an
    ellipsis nor the special symbol *UNDEFINED*.

    The special keyword *IGNORE* as a weight indicates that when strings
    are compared using the weights at the level where *IGNORE* is
    specified, the collating element is ignored; that is, as if the
    string did not contain the collating element. In regular expressions
    and pattern matching, all characters that are subject to *IGNORE* in
    their primary weight form an equivalence class.

    An empty operand is interpreted as the collating element itself.

    For example, the order statement:

<a>    <a>;<a>

    is equal to:

<a>

    An ellipsis can be used as an operand if the collating element was
    an ellipsis, and is interpreted as the value of each character
    defined by the ellipsis.

    The collation order as defined in this section defines the
    interpretation of bracket expressions in regular expressions (see RE
    Bracket Expression <re.html#tag_007_003_005>).

    *Example*:

        order_start     forward;backward
        UNDEFINED       IGNORE;IGNORE
        <LOW>
        <space>         <LOW>;<space>
        ...     <LOW>;...
        <a>     <a>;<a>
        <a-acute>       <a>;<a-acute>
        <a-grave>       <a>;<a-grave>
        <A>     <a>;<A>
        <A-acute>       <a>;<A-acute>
        <A-grave>       <a>;<A-grave>
        <ch>    <ch>;<ch>
        <Ch>    <ch>;<Ch>
        <s>     <s>;<s>
        <eszet>         "<s><s>";"<eszet><eszet>"
        order_end


    This example is interpreted as follows:

       1. The *UNDEFINED* means that all characters not specified in
          this definition (explicitly or via the ellipsis) are ignored
          for collation purposes; for regular expression purposes they
          are ordered first.

       2. All characters between <space> and <a> have the same primary
          equivalence class and individual secondary weights based on
          their ordinal encoded values.

       3. All characters based on the upper- or lower-case character a
          belong to the same primary equivalence class.

       4. The multi-character collating element <ch> is represented by
          the collating symbol <ch> and belongs to the same primary
          equivalence class as the multi-character collating element <Ch>.



               The order_end Keyword

    The collating order entries must be terminated with an *order_end*
    keyword.

    The collation sequence definition of the POSIX locale follows; the
    code listing depicts the /localedef <../xcu/localedef.html>/ input.

LC_COLLATE
# This is the POSIX locale definition for the LC_COLLATE category.
# The order is the same as in the ASCII codeset.
order_start forward
<NUL>
<SOH>
<STX>
<ETX>
<EOT>
<ENQ>
<ACK>
<alert>
<backspace>
<tab>
<newline>
<vertical-tab>
<form-feed>
<carriage-return>
<SO>
<SI>
<DLE>
<DC1>
<DC2>
<DC3>
<DC4>
<NAK>
<SYN>
<ETB>
<CAN>
<EM>
<SUB>
<ESC>
<IS4>
<IS3>
<IS2>
<IS1>
<space>
<exclamation-mark>
<quotation-mark>
<number-sign>
<dollar-sign>
<percent-sign>
<ampersand>
<apostrophe>
<left-parenthesis>
<right-parenthesis>
<asterisk>
<plus-sign>
<comma>
<hyphen>
<period>
<slash>
<zero>
<one>
<two>
<three>
<four>
<five>
<six>
<seven>
<eight>
<nine>
<colon>
<semicolon>
<less-than-sign>
<equals-sign>
<greater-than-sign>
<question-mark>
<commercial-at>
<A>
<B>
<C>
<D>
<E>
<F>
<G>
<H>
<I>
<J>
<K>
<L>
<M>
<N>
<O>
<P>
<Q>
<R>
<S>
<T>
<U>
<V>
<W>
<X>
<Y>
<Z>
<left-square-bracket>
<backslash>
<right-square-bracket>
<circumflex>
<underscore>
<grave-accent>
<a>
<b>
<c>
<d>
<e>
<f>
<g>
<h>
<i>
<j>
<k>
<l>
<m>
<n>
<o>
<p>
<q>
<r>
<s>
<t>
<u>
<v>
<w>
<x>
<y>
<z>
<left-curly-bracket>
<vertical-line>
<right-curly-bracket>
<tilde>
<DEL>
order_end
#
END LC_COLLATE
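
    Since the listing above follows the ASCII codeset order with a
    single forward level, its effect on ASCII strings can be sketched
    in Python by comparing code values. The function name posix_sort
    and the sample strings are made up for the example.

```python
def posix_sort(strings):
    """Sort ASCII strings by code value, one character at a time."""
    return sorted(strings, key=lambda s: [ord(ch) for ch in s])
```

    Digits sort before upper-case letters, which sort before the
    underscore and the lower-case letters, matching the order of the
    entries in the listing.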