.\"     $NetBSD: nls.7,v 1.15 2009/04/09 02:51:54 joerg Exp $
.\"
.\" Copyright (c) 2003 The NetBSD Foundation, Inc.
.\" All rights reserved.
.\"
.\" This code is derived from software contributed to The NetBSD Foundation
.\" by Gregory McGarry.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
.\" ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
.\" TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
.\" PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
.\" BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
.\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
.\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
.\" INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
.\" CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
.\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
.\" POSSIBILITY OF SUCH DAMAGE.
.\"
.Dd November 24, 2013
.Dt NLS 7
.Os
.Sh NAME
.Nm NLS
.Nd Native Language Support Overview
.Sh DESCRIPTION
Native Language Support (NLS) provides commands for a single
worldwide operating system base.
An internationalized system has no built-in assumptions or dependencies
on language-specific or cultural-specific conventions such as:
.Pp
.Bl -bullet -offset indent -compact
.It
Character classifications
.It
Character comparison rules
.It
Character collation order
.It
Numeric and monetary formatting
.It
Date and time formatting
.It
Message-text language
.It
Character sets
.El
.Pp
All information pertaining to cultural conventions and language is
obtained at program run time.
.Pp
.Dq Internationalization
(often abbreviated
.Dq i18n )
refers to the operation by which system software is developed to support
multiple cultural-specific and language-specific conventions.
This is a generalization process by which the system is untied from
calling only English strings or other English-specific conventions.
.Dq Localization
(often abbreviated
.Dq l10n )
refers to the operations by which the user environment is customized to
handle its input and output appropriate for specific language and cultural
conventions.
This is a specialization process, by which generic methods already
implemented in an internationalized system are used in specific ways.
The formal description of cultural conventions for some country, together
with all associated translations targeted to the native language, is
called the
.Dq locale .
.Pp
.Dx
provides extensive support to programmers and system developers to
enable internationalized software to be developed.
.Dx
also supplies a large variety of locales for system localization.
.Ss Localization of Information
All locale information is accessible to programs at run time so that
data is processed and displayed correctly for specific cultural
conventions and language.
.Pp
A locale is divided into categories.
A category is a group of language-specific and culture-specific conventions
as outlined in the list above.
ISO C specifies the following six standard categories supported by
.Dx :
.Pp
.Bl -tag -compact -width ".Ev LC_MONETARY"
.It Ev LC_COLLATE
string-collation order information
.It Ev LC_CTYPE
character classification, case conversion, and other character attributes
.It Ev LC_MESSAGES
the format for affirmative and negative responses
.It Ev LC_MONETARY
rules and symbols for formatting monetary numeric information
.It Ev LC_NUMERIC
rules and symbols for formatting nonmonetary numeric information
.It Ev LC_TIME
rules and symbols for formatting time and date information
.El
.Pp
Localization of the system is achieved by setting appropriate values
in environment variables to identify which locale should be used.
The environment variables have the same names as their respective
locale categories.
Additionally, the
.Ev LANG ,
.Ev LC_ALL ,
and
.Ev NLSPATH
environment variables are used.
The
.Ev NLSPATH
environment variable specifies a colon-separated list of directory names
where the message catalog files of the NLS database are located.
The
.Ev LC_ALL
and
.Ev LANG
environment variables also determine the current locale.
.Pp
The values of these environment variables contains a string format as:
.Bd -literal
	language[_territory][.codeset][@modifier]
.Ed
.Pp
Valid values for the language field come from the ISO639 standard which
defines two-character codes for many languages.
Some common language codes are:
.Bl -column "PERSIAN (farsi)" "Sy Code" "OCEANIC/INDONESIAN"
.It Sy Language Name Ta Sy Code Ta Sy Language Family
.It ABKHAZIAN    Ta AB   Ta IBERO-CAUCASIAN
.It AFAN (OROMO) Ta OM   Ta HAMITIC
.It AFAR         Ta AA   Ta HAMITIC
.It AFRIKAANS    Ta AF   Ta GERMANIC
.It ALBANIAN     Ta SQ   Ta INDO-EUROPEAN (OTHER)
.It AMHARIC      Ta AM   Ta SEMITIC
.It ARABIC       Ta AR   Ta SEMITIC
.It ARMENIAN     Ta HY   Ta INDO-EUROPEAN (OTHER)
.It ASSAMESE     Ta AS   Ta INDIAN
.It AYMARA       Ta AY   Ta AMERINDIAN
.It AZERBAIJANI  Ta AZ   Ta TURKIC/ALTAIC
.It BASHKIR      Ta BA   Ta TURKIC/ALTAIC
.It BASQUE       Ta EU   Ta BASQUE
.It BENGALI      Ta BN   Ta INDIAN
.It BHUTANI      Ta DZ   Ta ASIAN
.It BIHARI       Ta BH   Ta INDIAN
.It BISLAMA      Ta BI   Ta ""
.It BRETON       Ta BR   Ta CELTIC
.It BULGARIAN    Ta BG   Ta SLAVIC
.It BURMESE      Ta MY   Ta ASIAN
.It BYELORUSSIAN Ta BE   Ta SLAVIC
.It CAMBODIAN    Ta KM   Ta ASIAN
.It CATALAN      Ta CA   Ta ROMANCE
.It CHINESE      Ta ZH   Ta ASIAN
.It CORSICAN     Ta CO   Ta ROMANCE
.It CROATIAN     Ta HR   Ta SLAVIC
.It CZECH        Ta CS   Ta SLAVIC
.It DANISH       Ta DA   Ta GERMANIC
.It DUTCH        Ta NL   Ta GERMANIC
.It ENGLISH      Ta EN   Ta GERMANIC
.It ESPERANTO    Ta EO   Ta INTERNATIONAL AUX.
.It ESTONIAN     Ta ET   Ta FINNO-UGRIC
.It FAROESE      Ta FO   Ta GERMANIC
.It FIJI         Ta FJ   Ta OCEANIC/INDONESIAN
.It FINNISH      Ta FI   Ta FINNO-UGRIC
.It FRENCH       Ta FR   Ta ROMANCE
.It FRISIAN      Ta FY   Ta GERMANIC
.It GALICIAN     Ta GL   Ta ROMANCE
.It GEORGIAN     Ta KA   Ta IBERO-CAUCASIAN
.It GERMAN       Ta DE   Ta GERMANIC
.It GREEK        Ta EL   Ta LATIN/GREEK
.It GREENLANDIC  Ta KL   Ta ESKIMO
.It GUARANI      Ta GN   Ta AMERINDIAN
.It GUJARATI     Ta GU   Ta INDIAN
.It HAUSA        Ta HA   Ta NEGRO-AFRICAN
.It HEBREW       Ta HE   Ta SEMITIC
.It HINDI        Ta HI   Ta INDIAN
.It HUNGARIAN    Ta HU   Ta FINNO-UGRIC
.It ICELANDIC    Ta IS   Ta GERMANIC
.It INDONESIAN   Ta ID   Ta OCEANIC/INDONESIAN
.It INTERLINGUA  Ta IA   Ta INTERNATIONAL AUX.
.It INTERLINGUE  Ta IE   Ta INTERNATIONAL AUX.
.It INUKTITUT    Ta IU   Ta ""
.It INUPIAK      Ta IK   Ta ESKIMO
.It IRISH        Ta GA   Ta CELTIC
.It ITALIAN      Ta IT   Ta ROMANCE
.It JAPANESE     Ta JA   Ta ASIAN
.It JAVANESE     Ta JV   Ta OCEANIC/INDONESIAN
.It KANNADA      Ta KN   Ta DRAVIDIAN
.It KASHMIRI     Ta KS   Ta INDIAN
.It KAZAKH       Ta KK   Ta TURKIC/ALTAIC
.It KINYARWANDA  Ta RW   Ta NEGRO-AFRICAN
.It KIRGHIZ      Ta KY   Ta TURKIC/ALTAIC
.It KURUNDI      Ta RN   Ta NEGRO-AFRICAN
.It KOREAN       Ta KO   Ta ASIAN
.It KURDISH      Ta KU   Ta IRANIAN
.It LAOTHIAN     Ta LO   Ta ASIAN
.It LATIN        Ta LA   Ta LATIN/GREEK
.It LATVIAN      Ta LV   Ta BALTIC
.It LINGALA      Ta LN   Ta NEGRO-AFRICAN
.It LITHUANIAN   Ta LT   Ta BALTIC
.It MACEDONIAN   Ta MK   Ta SLAVIC
.It MALAGASY     Ta MG   Ta OCEANIC/INDONESIAN
.It MALAY        Ta MS   Ta OCEANIC/INDONESIAN
.It MALAYALAM    Ta ML   Ta DRAVIDIAN
.It MALTESE      Ta MT   Ta SEMITIC
.It MAORI        Ta MI   Ta OCEANIC/INDONESIAN
.It MARATHI      Ta MR   Ta INDIAN
.It MOLDAVIAN    Ta MO   Ta ROMANCE
.It MONGOLIAN    Ta MN   Ta ""
.It NAURU        Ta NA   Ta ""
.It NEPALI       Ta NE   Ta INDIAN
.It NORWEGIAN    Ta NO   Ta GERMANIC
.It OCCITAN      Ta OC   Ta ROMANCE
.It ORIYA        Ta OR   Ta INDIAN
.It PASHTO       Ta PS   Ta IRANIAN
.It PERSIAN (farsi) Ta FA   Ta IRANIAN
.It POLISH       Ta PL   Ta SLAVIC
.It PORTUGUESE   Ta PT   Ta ROMANCE
.It PUNJABI      Ta PA   Ta INDIAN
.It QUECHUA      Ta QU   Ta AMERINDIAN
.It RHAETO-ROMANCE Ta RM   Ta ROMANCE
.It ROMANIAN     Ta RO   Ta ROMANCE
.It RUSSIAN      Ta RU   Ta SLAVIC
.It SAMOAN       Ta SM   Ta OCEANIC/INDONESIAN
.It SANGHO       Ta SG   Ta NEGRO-AFRICAN
.It SANSKRIT     Ta SA   Ta INDIAN
.It SCOTS GAELIC Ta GD   Ta CELTIC
.It SERBIAN      Ta SR   Ta SLAVIC
.It SERBO-CROATIAN Ta SH   Ta SLAVIC
.It SESOTHO      Ta ST   Ta NEGRO-AFRICAN
.It SETSWANA     Ta TN   Ta NEGRO-AFRICAN
.It SHONA        Ta SN   Ta NEGRO-AFRICAN
.It SINDHI       Ta SD   Ta INDIAN
.It SINGHALESE   Ta SI   Ta INDIAN
.It SISWATI      Ta SS   Ta NEGRO-AFRICAN
.It SLOVAK       Ta SK   Ta SLAVIC
.It SLOVENIAN    Ta SL   Ta SLAVIC
.It SOMALI       Ta SO   Ta HAMITIC
.It SPANISH      Ta ES   Ta ROMANCE
.It SUNDANESE    Ta SU   Ta OCEANIC/INDONESIAN
.It SWAHILI      Ta SW   Ta NEGRO-AFRICAN
.It SWEDISH      Ta SV   Ta GERMANIC
.It TAGALOG      Ta TL   Ta OCEANIC/INDONESIAN
.It TAJIK        Ta TG   Ta IRANIAN
.It TAMIL        Ta TA   Ta DRAVIDIAN
.It TATAR        Ta TT   Ta TURKIC/ALTAIC
.It TELUGU       Ta TE   Ta DRAVIDIAN
.It THAI         Ta TH   Ta ASIAN
.It TIBETAN      Ta BO   Ta ASIAN
.It TIGRINYA     Ta TI   Ta SEMITIC
.It TONGA        Ta TO   Ta OCEANIC/INDONESIAN
.It TSONGA       Ta TS   Ta NEGRO-AFRICAN
.It TURKISH      Ta TR   Ta TURKIC/ALTAIC
.It TURKMEN      Ta TK   Ta TURKIC/ALTAIC
.It TWI          Ta TW   Ta NEGRO-AFRICAN
.It UIGUR        Ta UG   Ta ""
.It UKRAINIAN    Ta UK   Ta SLAVIC
.It URDU         Ta UR   Ta INDIAN
.It UZBEK        Ta UZ   Ta TURKIC/ALTAIC
.It VIETNAMESE   Ta VI   Ta ASIAN
.It VOLAPUK      Ta VO   Ta INTERNATIONAL AUX.
.It WELSH        Ta CY   Ta CELTIC
.It WOLOF        Ta WO   Ta NEGRO-AFRICAN
.It XHOSA        Ta XH   Ta NEGRO-AFRICAN
.It YIDDISH      Ta YI   Ta GERMANIC
.It YORUBA       Ta YO   Ta NEGRO-AFRICAN
.It ZHUANG       Ta ZA   Ta ""
.It ZULU         Ta ZU   Ta NEGRO-AFRICAN
.El
.Pp
For example, the locale for the Danish language spoken in Denmark
using the ISO 8859-1 character set is da_DK.ISO8859-1.
The da stands for the Danish language and the DK stands for Denmark.
The short form of da_DK is sufficient to indicate this locale.
.Pp
The environment variable settings are queried by their priority level
in the following manner:
.Bl -bullet
.It
If the
.Ev LC_ALL
environment variable is set, all six categories use the locale it
specifies.
.It
If the
.Ev LC_ALL
environment variable is not set, each individual category uses the
locale specified by its corresponding environment variable.
.It
If the
.Ev LC_ALL
environment variable is not set, and a value for a particular
.Ev LC_*
environment variable is not set, the value of the
.Ev LANG
environment variable specifies the default locale for all categories.
Only the
.Ev LANG
environment variable should be set in /etc/profile, since it makes it
most easy for the user to override the system default using the individual
.Ev LC_*
variables.
.It
If the
.Ev LC_ALL
environment variable is not set, a value for a particular
.Ev LC_*
environment variable is not set, and the value of the
.Ev LANG
environment variable is not set, the locale for that specific
category defaults to the C locale.
The C or POSIX locale assumes the ASCII character set and defines
information for the six categories.
.El
.Ss Character Sets
A character is any symbol used for the organization, control, or
representation of data.
A group of such symbols used to describe a
particular language make up a character set.
It is the encoding values in a character set that provide
the interface between the system and its input and output devices.
.Pp
The following character sets are supported in
.Dx :
.Bl -tag -width ISO_8859_family
.It ASCII
The American Standard Code for Information Exchange (ASCII) standard
specifies 128 Roman characters and control codes, encoded in a 7-bit
character encoding scheme.
.It ISO 8859 family
Industry-standard character sets specified by the ISO/IEC 8859
standard.
The standard is divided into 15 numbered parts, with each
part specifying broad script similarities.
Examples include Western European, Central European, Arabic, Cyrillic,
Hebrew, Greek, and Turkish.
The character sets use an 8-bit character encoding scheme which is
compatible with the ASCII character set.
.It Unicode
The Unicode character set is the full set of known abstract characters of
all real-world scripts.  It can be used in environments where multiple
scripts must be processed simultaneously.
Unicode is compatible with ISO 8859-1 (Western European) and ASCII.
Many character encoding schemes are available for Unicode, including UTF-8,
UTF-16 and UTF-32.
These encoding schemes are multi-byte encodings.
The UTF-8 encoding scheme uses 8-bit, variable-width encodings which is
compatible with ASCII.
The UTF-16 encoding scheme uses 16-bit, variable-width encodings.
The UTF-32 encoding scheme using 32-bit, fixed-width encodings.
.El
.Ss Font Sets
A font set contains the glyphs to be displayed on the screen for a
corresponding character in a character set.
A display must support a suitable font to display a character set.
If suitable fonts are available to the X server, then X clients can
include support for different character sets.
.Xr xterm 1
includes support for Unicode with UTF-8 encoding.
.Xr xfd 1
is useful for displaying all the characters in an X font.
.Pp
The
.Dx
.Xr syscons 4
console provides support for loading a variety of fonts using the
.Xr vidcontrol 1
utility. Available fonts can be found in
.Pa /usr/share/syscons/fonts .
.Ss Internationalization for Programmers
To facilitate translations of messages into various languages and to
make the translated messages available to the program based on a
user's locale, it is necessary to keep messages separate from the
programs and provide them in the form of message catalogs that a
program can access at run time.
.Pp
Access to locale information is provided through the
.Xr setlocale 3
and
.Xr nl_langinfo 3
interfaces.
See their respective man pages for further information.
.Pp
Message source files containing application messages are created by
the programmer and converted to message catalogs.
These catalogs are used by the application to retrieve and display
messages, as needed.
.Pp
.Dx
supports two message catalog interfaces: the X/Open
.Xr catgets 3
interface and the Uniforum
.Xr gettext 3
interface.
The
.Xr catgets 3
interface has the advantage that it belongs to a standard which is
well supported.
Unfortunately the interface is complicated to use and
maintenance of the catalogs is difficult.
The implementation also doesn't support different character sets.
The
.Xr gettext 3
interface has not been standardized yet, however it is being supported
by an increasing number of systems.
It also provides many additional tools which make programming and
catalog maintenance much easier.
.Ss Support for Multi-byte Encodings
Some character sets with multi-byte encodings may be difficult to decode,
or may contain state (i.e., adjacent characters are dependent).
ISO C specifies a set of functions using 'wide characters' which can handle
multi-byte encodings properly.
The behaviour of these functions is affected
by the
.Ev LC_CTYPE
category of the current locale.
.Pp
A wide character is specified in ISO C
as being a fixed number of bits wide and is stateless.
There are two types for wide characters:
.Em wchar_t
and
.Em wint_t .
.Em wchar_t
is a type which can contain one wide character and operates like 'char'
type does for one character.
.Em wint_t
can contain one wide character or WEOF (wide EOF).
.Pp
There are functions that operate on
.Em wchar_t ,
and substitute for functions operating on 'char'.
See
.Xr wmemchr 3
and
.Xr towlower 3
for details.
There are some additional functions that operate on
.Em wchar_t .
See
.Xr wctype 3
and
.Xr wctrans 3
for details.
.Pp
Wide characters should be used for all I/O processing which may rely
on locale-specific strings.
The two primary issues requiring special use of wide characters are:
.Bl -bullet -offset indent
.It
All I/O is performed using multibyte characters.
Input data is converted into wide characters immediately after
reading and data for output is converted from wide characters to
multi-byte encoding immediately before writing.
Conversion is controlled by the
.Xr mbstowcs 3 ,
.Xr mbsrtowcs 3 ,
.Xr wcstombs 3 ,
.Xr wcsrtombs 3 ,
.Xr mblen 3 ,
.Xr mbrlen 3 ,
and
.Xr mbsinit 3 .
.It
Wide characters are used directly for I/O, using
.Xr getwchar 3 ,
.Xr fgetwc 3 ,
.Xr getwc 3 ,
.Xr ungetwc 3 ,
.Xr fgetws 3 ,
.Xr putwchar 3 ,
.Xr fputwc 3 ,
.Xr putwc 3 ,
and
.Xr fputws 3 .
They are also used for formatted I/O functions for wide characters
such as
.Xr fwscanf 3 ,
.Xr wscanf 3 ,
.Xr swscanf 3 ,
.Xr fwprintf 3 ,
.Xr wprintf 3 ,
.Xr swprintf 3 ,
.Xr vfwprintf 3 ,
.Xr vwprintf 3 ,
and
.Xr vswprintf 3 ,
and wide character identifier of %lc, %C, %ls, %S for conventional
formatted I/O functions.
.El
.Sh SEE ALSO
.Xr gencat 1 ,
.Xr vidcontrol 1 ,
.Xr xfd 1 ,
.Xr xterm 1 ,
.Xr catgets 3 ,
.Xr gettext 3 Pq Pa devel/gettext ,
.Xr nl_langinfo 3 ,
.Xr setlocale 3
.Sh BUGS
This man page is incomplete.