.\"	$NetBSD: nls.7,v 1.15 2009/04/09 02:51:54 joerg Exp $
.\"
.\" Copyright (c) 2003 The NetBSD Foundation, Inc.
.\" All rights reserved.
.\"
.\" This code is derived from software contributed to The NetBSD Foundation
.\" by Gregory McGarry.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
.\" ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
.\" TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
.\" PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
.\" BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
.\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
.\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
.\" INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
.\" CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
.\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
.\" POSSIBILITY OF SUCH DAMAGE.
.\"
.Dd November 24, 2013
.Dt NLS 7
.Os
.Sh NAME
.Nm NLS
.Nd Native Language Support Overview
.Sh DESCRIPTION
Native Language Support (NLS) provides commands for a single worldwide
operating system base.
An internationalized system has no built-in assumptions about, or
dependencies on, language-specific or culture-specific conventions such as:
.Pp
.Bl -bullet -offset indent -compact
.It
Character classifications
.It
Character comparison rules
.It
Character collation order
.It
Numeric and monetary formatting
.It
Date and time formatting
.It
Message-text language
.It
Character sets
.El
.Pp
All information pertaining to cultural conventions and language is
obtained at program run time.
.Pp
.Dq Internationalization
(often abbreviated
.Dq i18n )
refers to the process by which system software is developed to support
multiple culture-specific and language-specific conventions.
This is a generalization process by which the system is freed from
relying solely on English strings or other English-specific conventions.
.Dq Localization
(often abbreviated
.Dq l10n )
refers to the process by which the user environment is customized to
handle its input and output as appropriate for specific language and
cultural conventions.
This is a specialization process by which generic methods already
implemented in an internationalized system are used in specific ways.
The formal description of cultural conventions for some country, together
with all associated translations targeted to the native language, is
called the
.Dq locale .
.Pp
.Dx
provides extensive support to programmers and system developers to
enable internationalized software to be developed.
.Dx
also supplies a large variety of locales for system localization.
.Ss Localization of Information
All locale information is accessible to programs at run time so that
data is processed and displayed correctly for specific cultural
conventions and language.
.Pp
A locale is divided into categories.
A category is a group of language-specific and culture-specific
conventions as outlined in the list above.
ISO C and POSIX specify the following six standard categories supported by
.Dx :
.Pp
.Bl -tag -compact -width ".Ev LC_MONETARY"
.It Ev LC_COLLATE
string-collation order information
.It Ev LC_CTYPE
character classification, case conversion, and other character attributes
.It Ev LC_MESSAGES
the format for affirmative and negative responses
.It Ev LC_MONETARY
rules and symbols for formatting monetary numeric information
.It Ev LC_NUMERIC
rules and symbols for formatting nonmonetary numeric information
.It Ev LC_TIME
rules and symbols for formatting time and date information
.El
.Pp
Localization of the system is achieved by setting appropriate values
in environment variables to identify which locale should be used.
The environment variables have the same names as their respective
locale categories.
Additionally, the
.Ev LANG ,
.Ev LC_ALL ,
and
.Ev NLSPATH
environment variables are used.
The
.Ev NLSPATH
environment variable specifies a colon-separated list of directory names
where the message catalog files of the NLS database are located.
The
.Ev LC_ALL
and
.Ev LANG
environment variables also determine the current locale.
.Pp
The values of the locale environment variables are strings of the form:
.Bd -literal
language[_territory][.codeset][@modifier]
.Ed
.Pp
Valid values for the language field come from the ISO 639 standard, which
defines two-character codes for many languages.
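.Pp
As a rough illustration of the format above, a locale name can be split
into its fields as sketched below.
The
.Vt struct locale_parts
type and the
.Fn parse_locale_name
function are hypothetical names chosen for this example and are not part
of any system library; the field sizes are arbitrary, and over-long
fields are silently truncated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the four fields of a locale name. */
struct locale_parts {
	char language[16];
	char territory[16];
	char codeset[32];
	char modifier[32];
};

/* Copy len bytes of src into dst, truncating to fit, and terminate. */
static void
copy_field(char *dst, size_t dstsize, const char *src, size_t len)
{
	if (len >= dstsize)
		len = dstsize - 1;
	memcpy(dst, src, len);
	dst[len] = '\0';
}

/*
 * Split a name of the form language[_territory][.codeset][@modifier].
 * Fields are peeled off from the right so that a '_' inside the codeset
 * is not mistaken for the territory separator.
 * Returns 0 on success, or -1 if the language field is empty.
 */
static int
parse_locale_name(const char *name, struct locale_parts *out)
{
	const char *end, *sep;

	assert(name != NULL && out != NULL);
	memset(out, 0, sizeof(*out));
	end = name + strlen(name);

	/* The modifier runs from '@' to the end of the string. */
	sep = strchr(name, '@');
	if (sep != NULL) {
		copy_field(out->modifier, sizeof(out->modifier),
		    sep + 1, (size_t)(end - (sep + 1)));
		end = sep;
	}

	/* The codeset runs from '.' to the start of the modifier. */
	sep = memchr(name, '.', (size_t)(end - name));
	if (sep != NULL) {
		copy_field(out->codeset, sizeof(out->codeset),
		    sep + 1, (size_t)(end - (sep + 1)));
		end = sep;
	}

	/* The territory runs from '_' to the start of the codeset. */
	sep = memchr(name, '_', (size_t)(end - name));
	if (sep != NULL) {
		copy_field(out->territory, sizeof(out->territory),
		    sep + 1, (size_t)(end - (sep + 1)));
		end = sep;
	}

	/* Whatever remains at the front is the language. */
	if (end == name)
		return -1;
	copy_field(out->language, sizeof(out->language),
	    name, (size_t)(end - name));
	return 0;
}
```

Given the name da_DK.ISO8859-1, this sketch yields the language da, the
territory DK, and the codeset ISO8859-1, with an empty modifier.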
Some common language codes are:
.Bl -column "PERSIAN (farsi)" "Sy Code" "OCEANIC/INDONESIAN"
.It Sy Language Name Ta Sy Code Ta Sy Language Family
.It ABKHAZIAN Ta AB Ta IBERO-CAUCASIAN
.It AFAN (OROMO) Ta OM Ta HAMITIC
.It AFAR Ta AA Ta HAMITIC
.It AFRIKAANS Ta AF Ta GERMANIC
.It ALBANIAN Ta SQ Ta INDO-EUROPEAN (OTHER)
.It AMHARIC Ta AM Ta SEMITIC
.It ARABIC Ta AR Ta SEMITIC
.It ARMENIAN Ta HY Ta INDO-EUROPEAN (OTHER)
.It ASSAMESE Ta AS Ta INDIAN
.It AYMARA Ta AY Ta AMERINDIAN
.It AZERBAIJANI Ta AZ Ta TURKIC/ALTAIC
.It BASHKIR Ta BA Ta TURKIC/ALTAIC
.It BASQUE Ta EU Ta BASQUE
.It BENGALI Ta BN Ta INDIAN
.It BHUTANI Ta DZ Ta ASIAN
.It BIHARI Ta BH Ta INDIAN
.It BISLAMA Ta BI Ta ""
.It BRETON Ta BR Ta CELTIC
.It BULGARIAN Ta BG Ta SLAVIC
.It BURMESE Ta MY Ta ASIAN
.It BYELORUSSIAN Ta BE Ta SLAVIC
.It CAMBODIAN Ta KM Ta ASIAN
.It CATALAN Ta CA Ta ROMANCE
.It CHINESE Ta ZH Ta ASIAN
.It CORSICAN Ta CO Ta ROMANCE
.It CROATIAN Ta HR Ta SLAVIC
.It CZECH Ta CS Ta SLAVIC
.It DANISH Ta DA Ta GERMANIC
.It DUTCH Ta NL Ta GERMANIC
.It ENGLISH Ta EN Ta GERMANIC
.It ESPERANTO Ta EO Ta INTERNATIONAL AUX.
.It ESTONIAN Ta ET Ta FINNO-UGRIC
.It FAROESE Ta FO Ta GERMANIC
.It FIJI Ta FJ Ta OCEANIC/INDONESIAN
.It FINNISH Ta FI Ta FINNO-UGRIC
.It FRENCH Ta FR Ta ROMANCE
.It FRISIAN Ta FY Ta GERMANIC
.It GALICIAN Ta GL Ta ROMANCE
.It GEORGIAN Ta KA Ta IBERO-CAUCASIAN
.It GERMAN Ta DE Ta GERMANIC
.It GREEK Ta EL Ta LATIN/GREEK
.It GREENLANDIC Ta KL Ta ESKIMO
.It GUARANI Ta GN Ta AMERINDIAN
.It GUJARATI Ta GU Ta INDIAN
.It HAUSA Ta HA Ta NEGRO-AFRICAN
.It HEBREW Ta HE Ta SEMITIC
.It HINDI Ta HI Ta INDIAN
.It HUNGARIAN Ta HU Ta FINNO-UGRIC
.It ICELANDIC Ta IS Ta GERMANIC
.It INDONESIAN Ta ID Ta OCEANIC/INDONESIAN
.It INTERLINGUA Ta IA Ta INTERNATIONAL AUX.
.It INTERLINGUE Ta IE Ta INTERNATIONAL AUX.
.It INUKTITUT Ta IU Ta ""
.It INUPIAK Ta IK Ta ESKIMO
.It IRISH Ta GA Ta CELTIC
.It ITALIAN Ta IT Ta ROMANCE
.It JAPANESE Ta JA Ta ASIAN
.It JAVANESE Ta JV Ta OCEANIC/INDONESIAN
.It KANNADA Ta KN Ta DRAVIDIAN
.It KASHMIRI Ta KS Ta INDIAN
.It KAZAKH Ta KK Ta TURKIC/ALTAIC
.It KINYARWANDA Ta RW Ta NEGRO-AFRICAN
.It KIRGHIZ Ta KY Ta TURKIC/ALTAIC
.It KIRUNDI Ta RN Ta NEGRO-AFRICAN
.It KOREAN Ta KO Ta ASIAN
.It KURDISH Ta KU Ta IRANIAN
.It LAOTHIAN Ta LO Ta ASIAN
.It LATIN Ta LA Ta LATIN/GREEK
.It LATVIAN Ta LV Ta BALTIC
.It LINGALA Ta LN Ta NEGRO-AFRICAN
.It LITHUANIAN Ta LT Ta BALTIC
.It MACEDONIAN Ta MK Ta SLAVIC
.It MALAGASY Ta MG Ta OCEANIC/INDONESIAN
.It MALAY Ta MS Ta OCEANIC/INDONESIAN
.It MALAYALAM Ta ML Ta DRAVIDIAN
.It MALTESE Ta MT Ta SEMITIC
.It MAORI Ta MI Ta OCEANIC/INDONESIAN
.It MARATHI Ta MR Ta INDIAN
.It MOLDAVIAN Ta MO Ta ROMANCE
.It MONGOLIAN Ta MN Ta ""
.It NAURU Ta NA Ta ""
.It NEPALI Ta NE Ta INDIAN
.It NORWEGIAN Ta NO Ta GERMANIC
.It OCCITAN Ta OC Ta ROMANCE
.It ORIYA Ta OR Ta INDIAN
.It PASHTO Ta PS Ta IRANIAN
.It PERSIAN (farsi) Ta FA Ta IRANIAN
.It POLISH Ta PL Ta SLAVIC
.It PORTUGUESE Ta PT Ta ROMANCE
.It PUNJABI Ta PA Ta INDIAN
.It QUECHUA Ta QU Ta AMERINDIAN
.It RHAETO-ROMANCE Ta RM Ta ROMANCE
.It ROMANIAN Ta RO Ta ROMANCE
.It RUSSIAN Ta RU Ta SLAVIC
.It SAMOAN Ta SM Ta OCEANIC/INDONESIAN
.It SANGHO Ta SG Ta NEGRO-AFRICAN
.It SANSKRIT Ta SA Ta INDIAN
.It SCOTS GAELIC Ta GD Ta CELTIC
.It SERBIAN Ta SR Ta SLAVIC
.It SERBO-CROATIAN Ta SH Ta SLAVIC
.It SESOTHO Ta ST Ta NEGRO-AFRICAN
.It SETSWANA Ta TN Ta NEGRO-AFRICAN
.It SHONA Ta SN Ta NEGRO-AFRICAN
.It SINDHI Ta SD Ta INDIAN
.It SINGHALESE Ta SI Ta INDIAN
.It SISWATI Ta SS Ta NEGRO-AFRICAN
.It SLOVAK Ta SK Ta SLAVIC
.It SLOVENIAN Ta SL Ta SLAVIC
.It SOMALI Ta SO Ta HAMITIC
.It SPANISH Ta ES Ta ROMANCE
.It SUNDANESE Ta SU Ta OCEANIC/INDONESIAN
.It SWAHILI Ta SW Ta NEGRO-AFRICAN
.It SWEDISH Ta SV Ta GERMANIC
.It TAGALOG Ta TL Ta OCEANIC/INDONESIAN
.It TAJIK Ta TG Ta IRANIAN
.It TAMIL Ta TA Ta
DRAVIDIAN
.It TATAR Ta TT Ta TURKIC/ALTAIC
.It TELUGU Ta TE Ta DRAVIDIAN
.It THAI Ta TH Ta ASIAN
.It TIBETAN Ta BO Ta ASIAN
.It TIGRINYA Ta TI Ta SEMITIC
.It TONGA Ta TO Ta OCEANIC/INDONESIAN
.It TSONGA Ta TS Ta NEGRO-AFRICAN
.It TURKISH Ta TR Ta TURKIC/ALTAIC
.It TURKMEN Ta TK Ta TURKIC/ALTAIC
.It TWI Ta TW Ta NEGRO-AFRICAN
.It UIGUR Ta UG Ta ""
.It UKRAINIAN Ta UK Ta SLAVIC
.It URDU Ta UR Ta INDIAN
.It UZBEK Ta UZ Ta TURKIC/ALTAIC
.It VIETNAMESE Ta VI Ta ASIAN
.It VOLAPUK Ta VO Ta INTERNATIONAL AUX.
.It WELSH Ta CY Ta CELTIC
.It WOLOF Ta WO Ta NEGRO-AFRICAN
.It XHOSA Ta XH Ta NEGRO-AFRICAN
.It YIDDISH Ta YI Ta GERMANIC
.It YORUBA Ta YO Ta NEGRO-AFRICAN
.It ZHUANG Ta ZA Ta ""
.It ZULU Ta ZU Ta NEGRO-AFRICAN
.El
.Pp
For example, the locale for the Danish language spoken in Denmark
using the ISO 8859-1 character set is da_DK.ISO8859-1.
The da stands for the Danish language and the DK stands for Denmark.
The short form da_DK is sufficient to indicate this locale.
.Pp
The environment variables are consulted in the following order of
priority:
.Bl -bullet
.It
If the
.Ev LC_ALL
environment variable is set, all six categories use the locale it
specifies.
.It
If the
.Ev LC_ALL
environment variable is not set, each individual category uses the
locale specified by its corresponding environment variable.
.It
If the
.Ev LC_ALL
environment variable is not set, and a value for a particular
.Ev LC_*
environment variable is not set, the value of the
.Ev LANG
environment variable specifies the default locale for all categories.
Only the
.Ev LANG
environment variable should be set in
.Pa /etc/profile ,
since this makes it easiest for the user to override the system default
using the individual
.Ev LC_*
variables.
.It
If the
.Ev LC_ALL
environment variable is not set, a value for a particular
.Ev LC_*
environment variable is not set, and the value of the
.Ev LANG
environment variable is not set, the locale for that specific category
defaults to the C locale.
The C or POSIX locale assumes the ASCII character set and defines
information for the six categories.
.El
.Ss Character Sets
A character is any symbol used for the organization, control, or
representation of data.
A group of such symbols used to describe a particular language makes up
a character set.
It is the encoding values in a character set that provide the interface
between the system and its input and output devices.
.Pp
The following character sets are supported in
.Dx :
.Bl -tag -width ISO_8859_family
.It ASCII
The American Standard Code for Information Interchange (ASCII) standard
specifies 128 Roman characters and control codes, encoded in a 7-bit
character encoding scheme.
.It ISO 8859 family
Industry-standard character sets specified by the ISO/IEC 8859 standard.
The standard is divided into 15 numbered parts, with each part covering
a broad group of similar scripts.
Examples include Western European, Central European, Arabic, Cyrillic,
Hebrew, Greek, and Turkish.
The character sets use an 8-bit character encoding scheme which is
compatible with the ASCII character set.
.It Unicode
The Unicode character set aims to cover the abstract characters of all
real-world scripts.
It can be used in environments where multiple scripts must be processed
simultaneously.
Unicode is compatible with ISO 8859-1 (Western European) and ASCII.
Many character encoding schemes are available for Unicode, including
UTF-8, UTF-16 and UTF-32.
These encoding schemes are multi-byte encodings.
The UTF-8 encoding scheme uses 8-bit code units in a variable-width
encoding that is compatible with ASCII.
The UTF-16 encoding scheme uses 16-bit code units in a variable-width
encoding.
The UTF-32 encoding scheme uses 32-bit code units in a fixed-width
encoding.
.El
.Ss Font Sets
A font set contains the glyphs to be displayed on the screen for the
corresponding characters in a character set.
A display must support a suitable font to display a character set.
If suitable fonts are available to the X server, then X clients can
include support for different character sets.
.Xr xterm 1
includes support for Unicode with UTF-8 encoding.
.Xr xfd 1
is useful for displaying all the characters in an X font.
.Pp
The
.Dx
.Xr syscons 4
console provides support for loading a variety of fonts using the
.Xr vidcontrol 1
utility.
Available fonts can be found in
.Pa /usr/share/syscons/fonts .
.Ss Internationalization for Programmers
To facilitate translation of messages into various languages and to
make the translated messages available to the program based on a user's
locale, it is necessary to keep messages separate from the programs and
provide them in the form of message catalogs that a program can access
at run time.
.Pp
Access to locale information is provided through the
.Xr setlocale 3
and
.Xr nl_langinfo 3
interfaces.
See their respective man pages for further information.
.Pp
Message source files containing application messages are created by
the programmer and converted to message catalogs.
These catalogs are used by the application to retrieve and display
messages as needed.
.Pp
.Dx
supports two message catalog interfaces: the X/Open
.Xr catgets 3
interface and the Uniforum
.Xr gettext 3
interface.
The
.Xr catgets 3
interface has the advantage that it belongs to a standard which is
well supported.
Unfortunately, the interface is complicated to use and maintenance of
the catalogs is difficult.
The implementation also does not support different character sets.
The
.Xr gettext 3
interface has not been standardized yet; however, it is supported by an
increasing number of systems.
It also provides many additional tools which make programming and
catalog maintenance much easier.
.Ss Support for Multi-byte Encodings
Some character sets with multi-byte encodings may be difficult to
decode, or may contain state (i.e., adjacent characters are dependent).
ISO C specifies a set of functions using
.Dq wide characters
which can handle multi-byte encodings properly.
The behaviour of these functions is affected by the
.Ev LC_CTYPE
category of the current locale.
.Pp
A wide character is specified in ISO C as being a fixed number of bits
wide and is stateless.
There are two types for wide characters:
.Em wchar_t
and
.Em wint_t .
.Em wchar_t
is a type which can contain one wide character and operates as the
.Em char
type does for one character.
.Em wint_t
can contain one wide character or WEOF (wide EOF).
.Pp
There are functions that operate on
.Em wchar_t
and substitute for functions operating on
.Em char .
See
.Xr wmemchr 3
and
.Xr towlower 3
for details.
There are some additional functions that operate on
.Em wchar_t .
See
.Xr wctype 3
and
.Xr wctrans 3
for details.
.Pp
Wide characters should be used for all I/O processing which may rely
on locale-specific strings.
The two primary approaches are:
.Bl -bullet -offset indent
.It
All I/O is performed using multibyte characters.
Input data is converted into wide characters immediately after reading,
and data for output is converted from wide characters to multi-byte
encoding immediately before writing.
Conversion is handled by
.Xr mbstowcs 3 ,
.Xr mbsrtowcs 3 ,
.Xr wcstombs 3 ,
.Xr wcsrtombs 3 ,
.Xr mblen 3 ,
.Xr mbrlen 3 ,
and
.Xr mbsinit 3 .
.It
Wide characters are used directly for I/O, using
.Xr getwchar 3 ,
.Xr fgetwc 3 ,
.Xr getwc 3 ,
.Xr ungetwc 3 ,
.Xr fgetws 3 ,
.Xr putwchar 3 ,
.Xr fputwc 3 ,
.Xr putwc 3 ,
and
.Xr fputws 3 .
They are also used with the formatted I/O functions for wide characters
such as
.Xr fwscanf 3 ,
.Xr wscanf 3 ,
.Xr swscanf 3 ,
.Xr fwprintf 3 ,
.Xr wprintf 3 ,
.Xr swprintf 3 ,
.Xr vfwprintf 3 ,
.Xr vwprintf 3 ,
and
.Xr vswprintf 3 ,
and with the wide-character conversion specifiers %lc, %C, %ls, and %S
of the conventional formatted I/O functions.
.El
.Sh SEE ALSO
.Xr gencat 1 ,
.Xr vidcontrol 1 ,
.Xr xfd 1 ,
.Xr xterm 1 ,
.Xr catgets 3 ,
.Xr gettext 3 Pq Pa devel/gettext ,
.Xr nl_langinfo 3 ,
.Xr setlocale 3
.Sh BUGS
This man page is incomplete.
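.Sh EXAMPLES
The order in which the
.Ev LC_ALL ,
.Ev LC_* ,
and
.Ev LANG
environment variables are consulted, as described under
.Sx Localization of Information ,
can be sketched in C as follows.
This is only an illustration of the lookup order and not the system's
implementation.
The
.Fn env_nonempty
and
.Fn effective_locale
names are hypothetical, and treating an empty value as unset is an
assumption of this sketch, made to match the POSIX wording.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the value of an environment variable, or NULL if unset or empty. */
static const char *
env_nonempty(const char *name)
{
	const char *value = getenv(name);

	return (value != NULL && strlen(value) > 0) ? value : NULL;
}

/*
 * Hypothetical helper: return the locale name in effect for one category,
 * given the name of its environment variable (for example "LC_TIME").
 * Lookup order: LC_ALL, the category's own variable, LANG, then "C".
 */
static const char *
effective_locale(const char *category_var)
{
	const char *value;

	assert(category_var != NULL);

	/* LC_ALL overrides every category. */
	if ((value = env_nonempty("LC_ALL")) != NULL)
		return value;
	/* Otherwise the category's own variable applies. */
	if ((value = env_nonempty(category_var)) != NULL)
		return value;
	/* LANG supplies the default for categories not set individually. */
	if ((value = env_nonempty("LANG")) != NULL)
		return value;
	/* With nothing set, the category falls back to the C locale. */
	return "C";
}
```

Programs do not normally need such a helper: calling
.Xr setlocale 3
with an empty locale string makes the C library perform this lookup itself.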