Unicode and TeX

Unicode is a character coding scheme that can express the text of the world's languages, as well as important symbols (including those of mathematics). Any coding scheme that is directly applicable to TeX must fit in single bytes (and so can express at most 256 characters); a Unicode character may require several bytes, and the scheme as a whole can express a very large number of characters.

For “old-style” applications (TeX or PDFTeX) to deal with Unicode input, the sequence of bytes that makes up a Unicode character is processed by a set of macros and converted to an 8-bit number representing a character in an appropriate font (in practice, only the standard UTF-8 byte sequences are supported). The TeX code that reads these bytes is complicated, but works well enough; the inputenc package, part of the LaTeX distribution, has a utf8 option. The separate package ucs provides wider, but less robust, coverage via an inputenc option utf8x. Broadly, the difference is that utf8 deals with “standard LaTeX fonts” (those for which LaTeX has a defined encoding), while utf8x deals with pretty much anything for which it knows a mapping from a Unicode range to a font. As a general rule, you should not use ucs/utf8x until you have convinced yourself that inputenc/utf8 cannot do the job for you.
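
As an illustration of the inputenc/utf8 route, here is a minimal sketch, assuming the source file is saved as UTF-8 and is compiled with latex or pdflatex. The choice of the T1 font encoding and the sample words are assumptions made for the example, not something the packages require.

```latex
% Minimal sketch: UTF-8 input with an 8-bit engine (latex or pdflatex).
% Assumption: this file itself is saved in UTF-8.
\documentclass{article}
\usepackage[T1]{fontenc}    % 8-bit font encoding with accented Latin letters
\usepackage[utf8]{inputenc} % macros that decode the UTF-8 byte sequences
\begin{document}
Café, naïve, Größe, Łódź.
\end{document}
```

Every character used here has a slot in the T1 encoding, which is why the standard utf8 option is enough. LaTeX releases from 2018 onwards assume UTF-8 input by default, so with those releases the inputenc line is harmless but no longer needed.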

“Modern” TeX-alike applications, XeTeX and LuaTeX, read their input using UTF-8 representations of Unicode as standard. They also use TrueType or OpenType fonts for output. Each such font has tables that tell the application which part(s) of the Unicode space it covers; these tables enable the engines to decide which font to use for which character (assuming there is any choice at all).
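
For comparison, here is a minimal sketch of a document for these engines, to be compiled with xelatex or lualatex. It uses the fontspec package to select a system font; fontspec is not mentioned above, and the font name is only an assumption (any installed TrueType or OpenType font that covers the scripts used would do).

```latex
% Minimal sketch: compile with xelatex or lualatex, not pdflatex.
% No inputenc is needed: these engines read UTF-8 natively.
\documentclass{article}
\usepackage{fontspec}       % font selection for XeTeX and LuaTeX
% Assumption: "DejaVu Serif" is installed; substitute any font
% that covers the scripts used below.
\setmainfont{DejaVu Serif}
\begin{document}
Café, Ελληνικά, Русский.
\end{document}
```

If the chosen font lacks a character, that character will normally be missing from the output, so the font's Unicode coverage is what matters here.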

inputenc.sty
Part of the macros/latex/base distribution
ucs.sty
macros/latex/contrib/unicode

This answer last edited: 2011-03-07

This question on the Web: http://www.tex.ac.uk/cgi-bin/texfaq2html?label=unicode