uniconv
program decodes scripts in one encoding and encodes them
in another encoding.
The script is a 16-, 8- or 7-bit byte stream.
The converted text is sent to the
standard output, even for 16-bit encodings, unless an
output file is specified by the
-out
option.
The
-decode
and
-encode
options are optional; the default converter is utf-8.
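The decode-then-encode pipeline described above can be sketched in Python. This only illustrates the idea and is not uniconv's implementation: the function name `convert` is hypothetical, and the codec names are Python's, which may differ from the names uniconv accepts. Both parameters default to utf-8, mirroring the default of the -decode and -encode options.

```python
def convert(data: bytes, decode: str = "utf-8", encode: str = "utf-8") -> bytes:
    """Decode bytes with one codec and re-encode them with another.

    Hypothetical helper; both codecs default to utf-8.
    """
    text = data.decode(decode)   # bytes -> Unicode text
    return text.encode(encode)   # Unicode text -> bytes


# KOI8-R input re-encoded as UTF-8 (Python codec names, not uniconv's).
koi8_bytes = "\u043c\u0438\u0440".encode("koi8_r")
utf8_bytes = convert(koi8_bytes, decode="koi8_r")
print(utf8_bytes.decode("utf-8"))
```

With both arguments left at their defaults, valid utf-8 input passes through unchanged.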
The program reads the Unicode map helper files (*.my) from the default
directory /usr/share/yudit/data.
Simple 1-to-1 encodings can be added on the fly by adding
a my-file, or by setting the yudit.datapath property
in ~/.yudit/yudit.properties or /usr/share/yudit/config/yudit.properties.
By default /usr/share/yudit/data is searched.
My-files can be created by a program called
makeumap.
The files can be converted between dos/unix/mac line-ending variants
with the
-fromdos, -frommac, -todos, -tomac
options. The default (when none is specified) is Unix.
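As a rough illustration of what converting between line-ending variants involves, here is a minimal Python sketch. It assumes an 8-bit or 7-bit byte stream and the conventional endings (DOS: CR LF, Mac: CR, Unix: LF); the function names are hypothetical, and this is not necessarily how uniconv implements these options.

```python
def to_unix(data: bytes) -> bytes:
    """Normalize DOS (CR LF) and old Mac (CR) line endings to Unix (LF)."""
    # Replace CR LF first so a DOS ending does not become two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def from_unix(data: bytes, target: str) -> bytes:
    """Convert Unix (LF) line endings to the 'dos', 'mac' or 'unix' variant."""
    endings = {"unix": b"\n", "dos": b"\r\n", "mac": b"\r"}
    return data.replace(b"\n", endings[target])


print(from_unix(to_unix(b"one\r\ntwo\r\n"), "mac"))
```

Going through the Unix form as an intermediate keeps each direction a simple one-step replacement.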
ENCODING
If you received this program through the Yudit distribution, then as of
today you can convert between the encodings below.
utf-8
Yudit recommends this format for
international information exchange. ASCII
text will get through intact, while other
Unicode characters will get their 8th bit set
and the length of the code will depend on
the value of their code point.
This is the only transformation format that can
encode both 16-bit (ucs-2) and 31-bit (ucs-4) unicode.
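The relationship between code point and encoded length can be seen with Python's built-in utf-8 codec. This is only an illustration; the helper name `utf8_profile` is hypothetical.

```python
def utf8_profile(ch: str):
    """Return (encoded length, True if every byte has its 8th bit set)."""
    encoded = ch.encode("utf-8")
    return len(encoded), all(b & 0x80 for b in encoded)


# ASCII, Latin-1 range, rest of the 16-bit range, beyond 16 bits.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    length, high = utf8_profile(ch)
    print(f"U+{ord(ch):04X}: {length} byte(s), 8th bit set: {high}")
```

ASCII characters stay one byte with the 8th bit clear, while every byte of a longer sequence has it set.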
utf-7
This is the recommended format for
international information exchange when
only 7-bit transport can be used. It can only handle
16-bit (ucs-2) unicode.
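The 7-bit property can be checked with Python's own utf-7 codec. This sketch says nothing about uniconv's internals; the helper name `is_seven_bit` is hypothetical.

```python
def is_seven_bit(data: bytes) -> bool:
    """True if no byte in the stream has its 8th bit set."""
    return all(b < 0x80 for b in data)


text = "Hi \u65e5\u672c"            # ASCII plus two 16-bit characters
encoded = text.encode("utf-7")
print(encoded, is_seven_bit(encoded))
print(encoded.decode("utf-7") == text)
```

The same text in utf-8 would fail the check, because its multi-byte sequences use the 8th bit.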
iso8859-1
This is the ISO 8859-1 character encoding
format. It is also known as "Latin-1"
encoding.
iso8859-2
This is the ISO 8859-2 character encoding
format. It is also known as "Central European"
encoding.
iso8859-5
This is the ISO 8859-5 character encoding format. It is also known as
"Cyrillic" encoding.
iso8859-7
This is the ISO 8859-7 character encoding format. It is also known as
"Greek" encoding.
iso8859-9
This is the ISO 8859-9 character encoding format. It is also known as
"Turkish" encoding.
koi8-r
This is the KOI8-R character encoding format. It is mainly used in
Russia.
cp-1251
This is the CP1251 Cyrillic character encoding format. It is mainly used in
Microsoft Windows and some web sites.
euc-jp
This is a Japanese character encoding format. It is an 8-bit encoding
format, mainly used on UNIX systems.
shift-jis
This is a Japanese character encoding format.
It is an 8-bit encoding format, mainly used in MS-DOS/Windows.
iso-2022-jp
This is a Japanese 7-bit character encoding format.
Email messages in iso-2022-jp can be decoded and encoded with this format.
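The 7-bit versus 8-bit distinction between the Japanese encodings can be observed with Python's standard codecs. The codec names below are Python's, not uniconv's, and `uses_eighth_bit` is a hypothetical helper.

```python
def uses_eighth_bit(data: bytes) -> bool:
    """True if any byte in the stream has its 8th bit set."""
    return any(b & 0x80 for b in data)


text = "\u65e5\u672c\u8a9e"   # three kanji
for codec in ("iso2022_jp", "euc_jp", "shift_jis"):
    encoded = text.encode(codec)
    print(codec, len(encoded), uses_eighth_bit(encoded))
```

Only the ISO 2022 variant stays within 7 bits; it switches character sets with escape sequences instead of using the 8th bit.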
iso-2022-x11
This is a Japanese character encoding format.
It is also known as "COMPOUND_TEXT" encoding
for the X Window System. This is a 7-bit
encoding format. It can be derived from the
ISO 2022-JP format with some differences.
ksc-5601-x11
This is a Korean character encoding format used by the
X Window System (COMPOUND_TEXT encoding) to
encode Korean (KS X 1001) and
US-ASCII. This is a 7-bit encoding
format compliant with the ISO-2022 specification for encoding
multiple character sets. Please note that this is
DIFFERENT from ISO-2022-KR (defined in IETF RFC 1557).
euc-kr
This is an 8-bit multibyte encoding for Korean.
It encodes US-ASCII (7-bit) in the single-byte range
and characters in KS X 1001 (formerly KS C 5601)
in the double-byte range with the MSB on (8-bit). It is used on
Unix and the Internet. Korean versions
of MS-DOS, MacOS and MS-Windows use a compatible
(in most cases, identical) variant of this encoding.
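The single-byte and double-byte ranges described above can be inspected with Python's euc_kr codec. This is an illustrative sketch only; `byte_classes` is a hypothetical helper.

```python
def byte_classes(text: str, codec: str = "euc_kr"):
    """Return (character, encoded length, all bytes have MSB set) per character."""
    rows = []
    for ch in text:
        encoded = ch.encode(codec)
        rows.append((ch, len(encoded), all(b & 0x80 for b in encoded)))
    return rows


# One US-ASCII letter and one Hangul syllable (U+D55C).
for row in byte_classes("A\ud55c"):
    print(row)
```

The ASCII letter occupies one byte with the MSB clear, and the Hangul syllable occupies two bytes that both have the MSB set.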
johab
This is a Korean encoding specified in KS X
1001 (KS C 5601-1992), Annex 3 as a supplementary
encoding. It was widely used in Korean MS-DOS until the mid-1990s.
It can encode all Hangul syllables (11,172) of modern
Korean as well as all the special symbols and Hanja
(Chinese ideograms used in Korea) defined in KS X 1001.
uhc
A variant of EUC-KR used in Korean MS-Windows
95/98 (a proprietary Microsoft encoding, CP949). Its
character repertoire includes all modern syllables of
Hangul, the Korean script, as well as all the special symbols
and Hanja (Chinese ideograms used in Korea) defined in KS
X 1001.
gb-2312-x11
This is a Chinese character encoding format based upon GB 2312.
It is a 7-bit encoding format.
gb-2312
This is a Chinese character encoding format based upon GB 2312.
It is an 8-bit encoding format.
big-5
This is a Chinese character encoding format based upon BIG5 encoding.
It is an 8-bit encoding format.
hz
This is a Chinese character encoding format based upon "Hanzi" encoding.
It is a 7-bit encoding format.
viscii
This is a Vietnamese character encoding format.
ucs-2-be
This converts 16-bit unicode (ucs-2) streams. This format handles the
big-endian variant.
Yudit does not recommend this format.
ucs-2-le
This converts 16-bit unicode (ucs-2) streams. This format handles the
little-endian variant.
Yudit does not recommend this format.
ucs-2
This converts 16-bit unicode (ucs-2) streams.
The input byte order is recognized by the first two bytes, the BOM
(byte order mark) U+FEFF. This format is used in Windows NT for
documents such as Notepad .txt files.
Yudit does not recommend this format.
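Byte-order detection from the leading U+FEFF can be sketched as follows. The function names are hypothetical, the fallback to big-endian when no mark is present is an assumption of this sketch, and Python's utf-16 codecs are used as a stand-in for ucs-2 (they accept a superset of it).

```python
def detect_byte_order(data: bytes):
    """Return 'big', 'little', or None when no byte order mark is present."""
    if data[:2] == b"\xfe\xff":      # U+FEFF stored high byte first
        return "big"
    if data[:2] == b"\xff\xfe":      # U+FEFF stored low byte first
        return "little"
    return None


def decode_ucs2(data: bytes, default: str = "big") -> str:
    """Decode a 16-bit stream, honoring a leading byte order mark if any."""
    order = detect_byte_order(data)
    body = data[2:] if order else data           # strip the mark itself
    codec = "utf-16-be" if (order or default) == "big" else "utf-16-le"
    return body.decode(codec)


print(decode_ucs2(b"\xff\xfeA\x00"), decode_ucs2(b"\xfe\xff\x00A"))
```

The same two-byte payload decodes to the same character in either order once the mark has been read.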
java
This converts \uxxxx character escapes. When encoding, all characters
above U+0080 will be escaped with a string like '\u0080'. When decoding,
the same format is decoded but, in addition, utf-8 format is also
recognized, so it can also be used to recover data accidentally saved
with the wrong encoding.
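A minimal Python sketch of the \uxxxx escaping described above. The function names are hypothetical, and two choices are assumptions of this sketch rather than documented uniconv behavior: U+0080 itself is escaped (following the '\u0080' example), and only 16-bit code points are handled.

```python
import re


def java_encode(text: str) -> str:
    """Escape every character from U+0080 upward as \\uxxxx (assumed threshold)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                   # ASCII passes through unchanged
        elif cp <= 0xFFFF:
            out.append("\\u%04x" % cp)
        else:
            raise ValueError("this sketch handles 16-bit code points only")
    return "".join(out)


def java_decode(data: bytes) -> str:
    """Decode \\uxxxx escapes; raw utf-8 in the input is accepted as well."""
    text = data.decode("utf-8")
    return re.sub(r"\\u([0-9a-fA-F]{4})",
                  lambda m: chr(int(m.group(1), 16)), text)


print(java_encode("caf\u00e9"))
print(java_decode(b"caf\\u00e9") == java_decode("caf\u00e9".encode("utf-8")))
```

Because the decoder first reads its input as utf-8, escaped and unescaped forms of the same text decode to the same result.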
FILES
~/.yudit/yudit.properties or /usr/share/yudit/config/yudit.properties
can contain a yudit.datapath property. This is where the map files are kept.
By default /usr/share/yudit/data is searched.
SEE ALSO
makeumap
AUTHOR
This program was written by gsinai@yudit.org (Gaspar Sinai),
Tokyo, 2 January, 2001.