WHATWG

Encoding — 符号化方式

Living Standard — 最終更新 2017 年 2 月 22 日

Participate:
GitHub whatwg/encoding (file an issue, open issues)
IRC: #whatwg on Freenode
Commits:
GitHub whatwg/encoding/commits
Snapshot as of this commit
@encodings
Tests:
web-platform-tests encoding/ (ongoing work)
各国語翻訳(非規範的)
日本語(このページ)

1. 序

~UTF-8~encodingは、統一的な符号化文字集合である~Unicodeの交換に最も適切な~encodingである。 よって,この仕様は、新たな[ ~protocolと形式 ], および[ 新たな文脈~下で流布される既存の形式 ]に対し、~UTF-8~encodingを要求する(また,定義する)。 ◎ The UTF-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the UTF-8 encoding.

~encodingには,他のもの(旧来の~encoding)もあり,過去にある程度までは定義されているが、~UA間で常に同じように実装されているとは限らない。 また、常に同じ~labelを利用しているわけでもなく、~encodingの中の未定義の区画, あるいは かつての~proprietaryな区画についての扱いも,しばしば異なっている。 この仕様は、新たな~UAが~encoding実装をリバースエンジニアせずに済むように,また, 既存の~UAが一つに収束し得るようにするため、これらの隔たりを埋めることに取組む。 ◎ The other (legacy) encodings have been defined to some extent in the past. However, user agents have not always implemented them in the same way, have not always used the same labels, and often differ in dealing with undefined and former proprietary areas of encodings. This specification addresses those gaps so that new user agents do not have to reverse engineer encoding implementations and existing user agents can converge.

特に,この仕様は、それらの~encodingと,そのそれぞれにおける[ ~byte列と`~scalar値$ 列を相互に変換する~algo ], および[ 一連の`~label$を識別する正準的な名前 ]を定義する。 また、~encodingの各種~algoのうち一部を JavaScript に公開する~APIも定義する。 ◎ In particular, this specification defines all those encodings, their algorithms to go from bytes to scalar values and back, and their canonical names and identifying labels. This specification also defines an API to expose part of the encoding algorithms to JavaScript.

~UAは,すでに IANA Character Sets registry に挙げられている~labelからも 有意に逸脱している。 旧来の~encodingを これ以上~拡散させないため、この仕様は,前述の詳細について網羅的であり, registry はもう不要である。 特に,この仕様は、~encodingを拡張するための仕組みは提供しない。 ◎ User agents have also significantly deviated from the labels listed in the IANA Character Sets registry. To stop spreading legacy encodings further, this specification is exhaustive about the aforementioned details and therefore has no need for the registry. In particular, this specification does not provide a mechanism for extending any aspect of encodings.

2. ~securityに関する背景

~encodingには、いくつかの~security上の課題がある — 生産側と消費側の間で,[ 利用中の~encoding, あるいは所与の~encodingの実装-法 ]について合意されてないときに。 例えば 2011 年には、次のような攻撃が報告されている: そこでは、[ 攻撃者が何らかの~fieldを制御し得るような, JSON 資源 ]内の `22^X ~trail~byteを “隠す” ために,`Shift_JIS$n の~lead~byte `82^X が利用された。 生産側からは,これが違法な~byte対であっても問題が見えない一方で、消費側では,この~byte対を単独の `FFFD^U として~decodeする~~結果、全体的な解釈が変わってしまう — `0022^U は重要な区切子なので。 [ `~scalar値$に対し複数~byteを利用する~encoding ]の~decoderには、今や,違法な~byte対の事例では,[ 範囲 `0000^U 〜 `007F^U に入る~scalar値 ]を “隠せない” ようにすることが要求される — 前述の~byte対に対しては、出力が[ `FFFD^U `0022^U ]になるように。 ◎ There is a set of encoding security issues when the producer and consumer do not agree on the encoding in use, or on the way a given encoding is to be implemented. For instance, an attack was reported in 2011 where a Shift_JIS lead byte 0x82 was used to “mask” a 0x22 trail byte in a JSON resource of which an attacker could control some field. The producer did not see the problem even though this is an illegal byte combination. The consumer decoded it as a single U+FFFD and therefore changed the overall interpretation as U+0022 is an important delimiter. Decoders of encodings that use multiple bytes for scalar values now require that in case of an illegal byte combination, a scalar value in the range U+0000 to U+007F, inclusive, cannot be “masked”. For the aforementioned sequence the output would be U+FFFD U+0022.
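The masking rule described above can be illustrated with a toy two-byte decoder. This is a minimal sketch under stated assumptions, not the Shift_JIS decoder defined by this specification: the function name `decode_toy`, its `table` parameter, and the simplification that every byte of 0x80 or above is a lead byte are all invented for illustration. The point it shows is that an ASCII trail byte following an invalid lead byte is not swallowed but decoded again on its own.

```python
REPLACEMENT = "\ufffd"


def decode_toy(data: bytes, table: dict) -> str:
    """Toy two-byte decoder (hypothetical): bytes below 0x80 are ASCII,
    any other byte is treated as a lead byte that needs a trail byte."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            out.append(chr(lead))
            i += 1
            continue
        if i + 1 == len(data):
            # Lead byte at the very end of the input: emit U+FFFD.
            out.append(REPLACEMENT)
            i += 1
            continue
        trail = data[i + 1]
        mapped = table.get((lead, trail))
        if mapped is not None:
            out.append(mapped)
            i += 2
            continue
        out.append(REPLACEMENT)
        # An ASCII trail byte must not be masked: consume only the lead
        # byte so that the trail byte is decoded as itself next time round.
        i += 1 if trail < 0x80 else 2
    return "".join(out)


# A JSON-like payload in which an attacker placed 0x82 before the closing quote.
payload = b'{"name":"\x82"}'
decoded = decode_toy(payload, {})
```

With this rule the closing U+0022 survives, so the string delimiter is still visible to the consumer.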

これは、より~~一般的には,[ ~lead~byteが伴われないときに,`~ASCII~byte$を`~ASCII~cp$でない何かに対応付ける ]ような~encodingにおける課題である。 これらは, “~ASCII非互換” の~encodingであり、あいにく,すでに流布された内容のために要求されるが、[ `ISO-2022-JP$n, `UTF-16BE$n, `UTF-16LE$n ]以外のものは,~supportされない。 (その種の 他の~encoding`~label$についても、未知の~encodingへ~fallbackせずに,`replacement$n ~encodingに対応付けてよいかどうかの究明が 進行中にある 。) 攻撃の一~例として、注意深く細工された内容を資源の中へ注入して,利用者に~encodingを上書きするよう促す~~結果、例えば~scriptの実行に至らせるものがある。 ◎ This is a larger issue for encodings that map anything that is an ASCII byte to something that is not an ASCII code point, when there is no lead byte present. These are “ASCII-incompatible” encodings and other than ISO-2022-JP, UTF-16BE, and UTF-16LE, which are unfortunately required due to deployed content, they are not supported. (Investigation is ongoing whether more labels of other such encodings can be mapped to the replacement encoding, rather than the unknown encoding fallback.) An example attack is injecting carefully crafted content into a resource and then encouraging the user to override the encoding, resulting in e.g. script execution.

HTML や HTML の~form特色機能に見出される URL に利用される~encoderも、その~encodingにより表現できない~scalar値がある場合には,若干の情報喪失につながり得る。 例えば,資源が `windows-1252$n ~encodingを利用しているとき、~serverは,末端利用者が~formに手入力した “💩” と “&#128169;” とを判別できなくなる。 ◎ Encoders used by URLs found in HTML and HTML’s form feature can also result in slight information loss when an encoding is used that cannot represent all scalar values. E.g. when a resource uses the windows-1252 encoding a server will not be able to distinguish between an end user entering “💩” and “&#128169;” into a form.
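The information loss can be reproduced with a short sketch. It assumes that Python's built-in `cp1252` codec and its `xmlcharrefreplace` error handler are an acceptable stand-in for the windows-1252 encoder of this specification and its handling of unrepresentable scalar values; neither is named by this specification.

```python
# The character U+1F4A9 as typed directly by the end user.
typed_character = "\U0001F4A9"
# The numeric character reference typed literally, character by character.
typed_reference = "&#128169;"

# Stand-in (assumption) for a legacy encoder that writes unrepresentable
# scalar values as decimal numeric character references.
from_character = typed_character.encode("cp1252", errors="xmlcharrefreplace")
from_reference = typed_reference.encode("cp1252", errors="xmlcharrefreplace")

# Both inputs produce the same bytes, so the server cannot tell them apart.
same_bytes = from_character == from_reference
```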

ここに概説した問題は、~UTF-8を排他的に利用しているときは,霧消する。 それが、今やすべてに対し~UTF-8~encodingが義務付けられている,数ある理由の一つである。 ◎ The problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that is now the mandatory encoding for all things.

注記: ~browser UI 節も見よ。 ◎ See also the Browser UI chapter.

3. 各種用語

この仕様は Infra Standard `INFRA$r に依存する。 ◎ This specification depends on the Infra Standard. [INFRA]

16 進数には "0x" が前置される。 ◎ Hexadecimal numbers are prefixed with "0x".

算術式の中のすべての数値は整数であり、各種~演算は次の記号で表現される:

記号 意味
~PLUS 加算
~MINUS 減算
~INCBY 左辺~値に対する右辺~値による加算
~DECBY 左辺~値に対する右辺~値による減算
~MUL 乗算
~DIV 除算
~MOD 除算の剰余( modulo )
~Lshift 論理~左~shift
~Rshift 論理~右~shift
~bAND ~bit AND
~bOR ~bit OR
floor( %x ) %x を超えない最大の整数

【 The symbols ~INCBY and ~DECBY are additions made by the translator. 】

◎ In equations, all numbers are integers, addition is represented by "+", subtraction by "−", multiplication by "×", division by "/", calculating the remainder of a division (also known as modulo) by "%", logical left shifts by "<<", logical right shifts by ">>", bitwise AND by "&", and bitwise OR by "|". floor(x) is the largest integer not greater than x.

論理~右~shiftの演算~対象の精度は、少なくとも 21 ~bit以上で~MUST。 ◎ For logical right shifts operands must have at least twenty-one bits precision.

`~token@ は、`~byte$や`~cp$などの, 1 個の~data片である。 ◎ A token is a piece of data, such as a byte or code point.

`~stream@ は、有順序`~token$列を表現する。 `~EoS@ は、`~stream$にそれ以上 読取れる`~token$は無いことを表す,特別な`~token$である。 ◎ A stream represents an ordered sequence of tokens. End-of-stream is a special token that signifies no more tokens are in the stream.

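【 A stream and a token sequence are the same thing as a ("static") data structure, but the word "stream" implies that access is constrained ("in time order") to the front (when it is given as input) or to the end (when it is an output destination). 】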

`~stream$から`~token$を 読取る ときは、次を走らせ~MUST:

  1. ~IF[ ~streamは空である ] ⇒ ~RET `~EoS$
  2. ~streamの先頭から 1 個の~tokenを除去する
  3. ~RET 前~段で除去した~token
◎ When a token is read from a stream, the first token in the stream must be returned and subsequently removed, and end-of-stream must be returned otherwise.

1 個~以上の`~token$を`~stream$に `前付加する@ ときは、それらの~tokenを,~streamの先頭に, 所与の順序を保ったまま挿入し~MUST。 ◎ When one or more tokens are prepended to a stream, those tokens must be inserted, in given order, before the first token in the stream.

~token列 "&#128169;" を~stream " hello world" の先頭に挿入した結果は,~stream "&#128169; hello world" になり、次回に読取られる~tokenは & になる。 ◎ Inserting the sequence of tokens &#128169; in a stream " hello world", results in a stream "&#128169; hello world". The next token to be read would be &.

1 個~以上の`~token$を`~stream$に `~pushする@ ときは、それらの~tokenを,~streamの末尾に, 所与の順序を保ったまま付加し~MUST。 ◎ When one or more tokens are pushed to a stream, those tokens must be inserted, in given order, after the last token in the stream.
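The three stream operations (read, prepend, push) can be sketched as a small class. This is a minimal sketch: the class name `Stream`, the `END_OF_STREAM` sentinel, and the `tokens()` helper are hypothetical names chosen for illustration, and a `collections.deque` is only one possible backing store.

```python
from collections import deque

# Sentinel object standing in for the special end-of-stream token.
END_OF_STREAM = object()


class Stream:
    """Ordered sequence of tokens with read / prepend / push."""

    def __init__(self, tokens=()):
        self._tokens = deque(tokens)

    def read(self):
        # Return and remove the first token, or end-of-stream if empty.
        if not self._tokens:
            return END_OF_STREAM
        return self._tokens.popleft()

    def prepend(self, tokens):
        # deque.extendleft inserts items one by one at the front, which
        # reverses them, so reverse first to keep the given order.
        self._tokens.extendleft(reversed(list(tokens)))

    def push(self, tokens):
        # Append after the last token, in the given order.
        self._tokens.extend(tokens)

    def tokens(self):
        # Helper for inspection only; not an operation of the specification.
        return list(self._tokens)


# The example from the text: prepend a token sequence to " hello world".
stream = Stream(" hello world")
stream.prepend("&#128169;")
after_prepend = "".join(stream.tokens())
next_token = stream.read()
```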

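【Notational conventions specific to this translation】

For the meaning and detailed definitions of the symbols used in the algorithm descriptions of this translation (此れ, ~LET, ~ON, ~IF, ~RET, and so on), see ~SYMBOL_DEF_REF. In addition, the following notation is used:

Notation Meaning
~byte列 [ %value1, %value2, … ] A new instance of a `~token$ sequence consisting of `~byte$s whose numeric values equal %value1, %value2, …, in the given order. When the brackets are empty, written “~byte列 [] ”, it means an empty `~token$ sequence.
~cp [ %value ] A new instance of a `~token$ sequence consisting of a single `~cp$ whose numeric value equals %value.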
4. ~encoding

`~encoding@ は、`~scalar値$ 列から~byte列への対応関係 【~encode】, および逆方向への対応関係 【~decode】 を定義する。 それぞれの`~encoding$には、 `名前@ および, 1 個~以上の `~label@ があてがわれている。 ◎ An encoding defines a mapping from a scalar value sequence to a byte sequence (and vice versa). Each encoding has a name, and one or more labels.

4.1. ~encoderと~decoder

各種 `~encoding$には、 `~decoder@ と `~encoder@ が結付けられる。 各 `~decoder$ / 各 `~encoder$には、 `~handler@ が結付けられる。 `~handler$は、 ( `~stream$, 1 個の`~token$ ) を入力にとり,次のいずれかを返す~algoである:

  • `完遂@
  • 空でない`~token$列

    【 The handlers of most decoders return a token sequence consisting of one code point; only the handler of the `~Big5~decoder$ can return two code points. 】

  • ~optionで`~cp$も伴い得る,`~error@

    【 Only the handlers of encoders return an error with a code point, and they always do so. 】

  • `継続@
◎ Each encoding has an associated decoder and most of them have an associated encoder. Each decoder and encoder have a handler algorithm. A handler algorithm takes an input stream and a token, and returns finished, one or more tokens, error optionally with a code point, or continue.

注記: [ `replacement$n, `UTF-16BE$n, `UTF-16LE$n ]`~encoding$には、`~encoder$はない。 ◎ The replacement, UTF-16BE, and UTF-16LE encodings have no encoder.

以下で用いられる `~error~mode@ は:

  • `~decoder$に対しては,[ `replacement^l(既定), `fatal^l ]のいずれかをとる。
  • `~encoder$に対しては,[ `fatal^l(既定) , `html^l ]のいずれかをとる。
◎ An error mode as used below is "replacement" (default) or "fatal" for a decoder and "fatal" (default) or "html" for an encoder.

注記: XML 処理器は、その`~decoder$の`~error~mode$を `fatal^l に設定することになる。 `XML$r ◎ An XML processor would set error mode to "fatal". [XML]

注記: `~error~mode$に `html^l が存在する理由は、 URL や HTML form においては,`~error$に際し旧来の`~encoder$を終了させないようにする取扱いを要するためである。 `html^l `~error~mode$は、合法な入力と判別できない列を~~出力させ,~~検知されることなく~dataを失わせる。 これを防ぐため、開発者には `UTF-8$n `~encoding$の利用が強く奨励される。 `URL$r `HTML$r ◎ html exists as error mode due to URLs and HTML forms requiring a non-terminating legacy encoder. The "html" error mode causes a sequence to be emitted that cannot be distinguished from legitimate input and can therefore lead to silent data loss. Developers are strongly encouraged to use the UTF-8 encoding to prevent this from happening. [URL] [HTML]

`~encoding$の[ `~decoder$ / `~encoder$ ] %~coder を `走らす@ ときは、所与の:

  • 入力`~stream$ : %入力
  • 出力`~stream$ : %出力
  • `~error~mode$ : %~mode (省略可)

に対し,次を走らす:

◎ To run an encoding’s decoder or encoder encoderDecoder with input stream input, output stream output, and optional error mode mode, run these steps:
  1. ~IF[ %~mode は与えられていない ] ⇒ %~mode ~SET %~coder に応じて ⇒ `~decoder$であるならば `replacement^l ~BR `~encoder$であるならば `fatal^l ◎ If mode is not given, set it to "replacement", if encoderDecoder is a decoder, and "fatal" otherwise.
  2. %~coder~instance ~LET 新たな %~coder の~instance ◎ Let encoderDecoderInstance be a new encoderDecoder.
  3. ~WHILE 無条件: ◎ While true:

    1. %結果 ~LET 次を与える下で,`~tokenを処理-$した結果 ⇒ ( %入力 から`読取った結果$, %~coder~instance, %入力, %出力, %~mode ) ◎ Let result be the result of processing the result of reading from input for encoderDecoderInstance, input, output, and mode.
    2. ~IF[ %結果 ~NEQ `継続$ ] ⇒ ~RET %結果 ◎ If result is not continue, return result. ◎ Otherwise, do nothing.

`~tokenを処理-@ するときは、所与の:

  • `~token$ : %~token
  • `~encoding$の[ `~encoder$/`~decoder$ ]の~instance : %~coder~instance
  • 入力`~stream$ : %入力
  • 出力`~stream$ : %出力
  • `~error~mode$ : %~mode(省略可)

に対し,次を走らす:

◎ To process a token token for an encoding’s encoder or decoder instance encoderDecoderInstance, stream input, output stream output, and optional error mode mode, run these steps:
  1. ~IF[ %~mode は与えられていない ] ⇒ %~mode ~SET %~coder~instance に応じて ⇒ `~decoder$の~instanceであるならば `replacement^l / `~encoder$の~instanceであるならば `fatal^l ◎ If mode is not given, set it to "replacement", if encoderDecoderInstance is a decoder instance, and "fatal" otherwise.
  2. %結果 ~LET ( %入力, %~token ) に対し, %~coder~instance による`~handler$を走らせた結果 ◎ Let result be the result of running encoderDecoderInstance’s handler on input and token.
  3. ~IF[ %結果 ~IN { `継続$, `完遂$ } ] ⇒ ~RET %結果 ◎ If result is continue or finished, return result.
  4. ~IF[ %結果 に 1 個~以上の`~token$がある ] ⇒ %結果 を %出力 に`~pushする$ ◎ Otherwise, if result is one or more tokens, push result to output.
  5. ~ELIF[ %結果 ~EQ `~error$ ] ⇒ %~mode に応じて: ◎ Otherwise, if result is error, switch on mode and run the associated steps:

    `replacement^l
    `FFFD^U を %出力 に`~pushする$ ◎ Push U+FFFD to output.
    `html^l
    ~cp列[ `0026^U, `0023^U, [ %結果 の`~cp$を基数 10 により最短で表現する`~ASCII数字$列 ], `003B^U ] 【 "&#数字列;" 】 を %入力 に`前付加する$ ◎ Prepend U+0026, U+0023, followed by the shortest sequence of ASCII digits representing result’s code point in base ten, followed by U+003B to input.
    `fatal^l
    ~RET `~error$ ◎ Return error.
  6. ~RET `継続$ ◎ Return continue.
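The run and process steps above can be sketched as follows. This is a minimal sketch under stated assumptions: `AsciiDecoder` and `AsciiEncoder` are hypothetical toy coders invented for illustration (they are not encodings of this specification), streams are modeled as plain `collections.deque` objects, and the names `run`, `process`, `Error`, `FINISHED`, and `CONTINUE` are illustrative.

```python
from collections import deque

END_OF_STREAM = object()
FINISHED, CONTINUE = "finished", "continue"


class Error:
    """Handler result 'error', optionally carrying a code point."""

    def __init__(self, code_point=None):
        self.code_point = code_point


class AsciiDecoder:
    """Hypothetical decoder: ASCII bytes map to themselves, others are errors."""
    is_decoder = True

    def handler(self, stream, token):
        if token is END_OF_STREAM:
            return FINISHED
        return [token] if token < 0x80 else Error()


class AsciiEncoder:
    """Hypothetical encoder: ASCII code points map to themselves."""
    is_decoder = False

    def handler(self, stream, token):
        if token is END_OF_STREAM:
            return FINISHED
        return [token] if token < 0x80 else Error(token)


def process(token, coder, input_, output, mode):
    result = coder.handler(input_, token)
    if result in (CONTINUE, FINISHED):
        return result
    if isinstance(result, Error):
        if mode == "replacement":
            output.append(0xFFFD)
        elif mode == "html":
            # Prepend "&#" + decimal digits + ";" to the input stream.
            reference = "&#%d;" % result.code_point
            input_.extendleft(reversed([ord(c) for c in reference]))
        else:  # "fatal"
            return result
    else:
        output.extend(result)  # one or more tokens
    return CONTINUE


def run(coder_class, input_, output, mode=None):
    if mode is None:
        mode = "replacement" if coder_class.is_decoder else "fatal"
    coder = coder_class()
    while True:
        token = input_.popleft() if input_ else END_OF_STREAM
        result = process(token, coder, input_, output, mode)
        if result != CONTINUE:
            return result


decoded = deque()
decode_status = run(AsciiDecoder, deque(b"a\xffb"), decoded)
encoded = deque()
encode_status = run(AsciiEncoder, deque([0x61, 0x1F4A9]), encoded, "html")
fatal_status = run(AsciiEncoder, deque([0xE9]), deque())
```

In the "html" case the reference is prepended to the input rather than pushed to the output, so its ASCII code points pass through the encoder's handler like any other input.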

4.2. 名前と~label

下の一覧に、~UAが~supportし~MUST,すべての`~encoding$とそれらの`~label$を挙げる。 ~UA は、他の`~encoding$や`~label$を~supportしては~MUST_NOT。 ◎ The table below lists all encodings and their labels user agents must support. User agents must not support any other encodings or labels.

作者は、 `UTF-8$n `~encoding$を利用した上で,その利用が識別されるように[ `~ASCII大小無視$ で `utf-8^lb に~~合致する`~label$ ]を利用し~MUST。 ◎ Authors must use the UTF-8 encoding and must use the ASCII case-insensitive "utf-8" label to identify it.

新たな[ ~protocolと形式 ], あるいは[ 新たな文脈~下で流布される既存の形式 ]には、 `UTF-8$n `~encoding$が排他的に利用され~MUST。 これらの[ ~protocolや形式 ]の`~encoding$の[ `名前$や`~label$ ]は、 `utf-8^lb として公開され~MUST。 ◎ New protocols and formats, as well as existing formats deployed in new contexts, must use the UTF-8 encoding exclusively. If these protocols and formats need to expose the encoding’s name or label, they must expose it as "utf-8".

文字列 %~label から `~encodingを取得-@ するときは、次を走らす: ◎ To get an encoding from a string label, run these steps:

  1. %~label から頭部と尾部の`~ASCII空白$を除去する ◎ Remove any leading and trailing ASCII whitespace from label.
  2. ~IF[ %~label が下の一覧のいずれかの`~label$に`~ASCII大小無視$で合致する ] ⇒ ~RET 合致した`~label$に対応する`~encoding$ ◎ If label is an ASCII case-insensitive match for any of the labels listed in the table below, return the corresponding encoding, and failure otherwise.
  3. ~RET `失敗^i ◎ ↑

注記: 配備済みの内容と互換にする必要から、この[ `~label$を`~encoding$に対応付ける~algo ]は, Unicode Technical Standard #22, 1.4 節 によるものよりもずっと単純かつ制約的なものである。 ◎ This is a much simpler and more restrictive algorithm of mapping labels to encodings than section 1.4 of Unicode Technical Standard #22 prescribes, as that is found to be necessary to be compatible with deployed content.
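The label lookup can be sketched as below. This is a minimal sketch: the function name `get_encoding` is illustrative, `LABELS` holds only a small excerpt of the table that follows, and `None` stands in for failure. ASCII whitespace is taken to be U+0009, U+000A, U+000C, U+000D, and U+0020, as defined by the Infra Standard.

```python
ASCII_WHITESPACE = "\t\n\x0c\r "

# Excerpt of the label table only; the full table is given in this section.
LABELS = {
    "unicode-1-1-utf-8": "UTF-8",
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
    "ascii": "windows-1252",
    "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
    "koi8-r": "KOI8-R",
    "shift_jis": "Shift_JIS",
    "sjis": "Shift_JIS",
    "iso-2022-kr": "replacement",
    "utf-16": "UTF-16LE",
    "utf-16le": "UTF-16LE",
}


def ascii_lowercase(text: str) -> str:
    # Fold only A-Z, so that non-ASCII characters never match ASCII labels.
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in text)


def get_encoding(label: str):
    # Step 1: strip leading and trailing ASCII whitespace only.
    label = label.strip(ASCII_WHITESPACE)
    # Steps 2 and 3: ASCII case-insensitive match, otherwise failure (None).
    return LABELS.get(ascii_lowercase(label))
```

The hand-written `ascii_lowercase` is a deliberate choice: Python's `str.lower()` also folds some non-ASCII characters (for example U+212A KELVIN SIGN becomes "k"), which would accept labels that an ASCII case-insensitive match rejects.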

`名前$◎Name `~label$◎Labels
~~標準の~encoding ◎ The Encoding
`UTF-8$n `unicode-1-1-utf-8^lb `utf-8^lb `utf8^lb
旧来の単byte~encoding ◎ Legacy single-byte encodings
`IBM866$n `866^lb `cp866^lb `csibm866^lb `ibm866^lb
`ISO-8859-2$n `csisolatin2^lb `iso-8859-2^lb `iso-ir-101^lb `iso8859-2^lb `iso88592^lb `iso_8859-2^lb `iso_8859-2:1987^lb `l2^lb `latin2^lb
`ISO-8859-3$n `csisolatin3^lb `iso-8859-3^lb `iso-ir-109^lb `iso8859-3^lb `iso88593^lb `iso_8859-3^lb `iso_8859-3:1988^lb `l3^lb `latin3^lb
`ISO-8859-4$n `csisolatin4^lb `iso-8859-4^lb `iso-ir-110^lb `iso8859-4^lb `iso88594^lb `iso_8859-4^lb `iso_8859-4:1988^lb `l4^lb `latin4^lb
`ISO-8859-5$n `csisolatincyrillic^lb `cyrillic^lb `iso-8859-5^lb `iso-ir-144^lb `iso8859-5^lb `iso88595^lb `iso_8859-5^lb `iso_8859-5:1988^lb
`ISO-8859-6$n `arabic^lb `asmo-708^lb `csiso88596e^lb `csiso88596i^lb `csisolatinarabic^lb `ecma-114^lb `iso-8859-6^lb `iso-8859-6-e^lb `iso-8859-6-i^lb `iso-ir-127^lb `iso8859-6^lb `iso88596^lb `iso_8859-6^lb `iso_8859-6:1987^lb
`ISO-8859-7$n `csisolatingreek^lb `ecma-118^lb `elot_928^lb `greek^lb `greek8^lb `iso-8859-7^lb `iso-ir-126^lb `iso8859-7^lb `iso88597^lb `iso_8859-7^lb `iso_8859-7:1987^lb `sun_eu_greek^lb
`ISO-8859-8$n `csiso88598e^lb `csisolatinhebrew^lb `hebrew^lb `iso-8859-8^lb `iso-8859-8-e^lb `iso-ir-138^lb `iso8859-8^lb `iso88598^lb `iso_8859-8^lb `iso_8859-8:1988^lb `visual^lb
`ISO-8859-8-I$n `csiso88598i^lb `iso-8859-8-i^lb `logical^lb
`ISO-8859-10$n `csisolatin6^lb `iso-8859-10^lb `iso-ir-157^lb `iso8859-10^lb `iso885910^lb `l6^lb `latin6^lb
`ISO-8859-13$n `iso-8859-13^lb `iso8859-13^lb `iso885913^lb
`ISO-8859-14$n `iso-8859-14^lb `iso8859-14^lb `iso885914^lb
`ISO-8859-15$n `csisolatin9^lb `iso-8859-15^lb `iso8859-15^lb `iso885915^lb `iso_8859-15^lb `l9^lb
`ISO-8859-16$n `iso-8859-16^lb
`KOI8-R$n `cskoi8r^lb `koi^lb `koi8^lb `koi8-r^lb `koi8_r^lb
`KOI8-U$n `koi8-ru^lb `koi8-u^lb
`macintosh$n `csmacintosh^lb `mac^lb `macintosh^lb `x-mac-roman^lb
`windows-874$n `dos-874^lb `iso-8859-11^lb `iso8859-11^lb `iso885911^lb `tis-620^lb `windows-874^lb
`windows-1250$n `cp1250^lb `windows-1250^lb `x-cp1250^lb
`windows-1251$n `cp1251^lb `windows-1251^lb `x-cp1251^lb
`windows-1252$n `ansi_x3.4-1968^lb `ascii^lb `cp1252^lb `cp819^lb `csisolatin1^lb `ibm819^lb `iso-8859-1^lb `iso-ir-100^lb `iso8859-1^lb `iso88591^lb `iso_8859-1^lb `iso_8859-1:1987^lb `l1^lb `latin1^lb `us-ascii^lb `windows-1252^lb `x-cp1252^lb
`windows-1253$n `cp1253^lb `windows-1253^lb `x-cp1253^lb
`windows-1254$n `cp1254^lb `csisolatin5^lb `iso-8859-9^lb `iso-ir-148^lb `iso8859-9^lb `iso88599^lb `iso_8859-9^lb `iso_8859-9:1989^lb `l5^lb `latin5^lb `windows-1254^lb `x-cp1254^lb
`windows-1255$n `cp1255^lb `windows-1255^lb `x-cp1255^lb
`windows-1256$n `cp1256^lb `windows-1256^lb `x-cp1256^lb
`windows-1257$n `cp1257^lb `windows-1257^lb `x-cp1257^lb
`windows-1258$n `cp1258^lb `windows-1258^lb `x-cp1258^lb
`x-mac-cyrillic$n `x-mac-cyrillic^lb `x-mac-ukrainian^lb
旧来の複byte Chinese (簡体字) ~encoding ◎ Legacy multi-byte Chinese (simplified) encodings
`GBK$n `chinese^lb `csgb2312^lb `csiso58gb231280^lb `gb2312^lb `gb_2312^lb `gb_2312-80^lb `gbk^lb `iso-ir-58^lb `x-gbk^lb
`gb18030$n `gb18030^lb
旧来の複byte Chinese (繁体字)~encoding ◎ Legacy multi-byte Chinese (traditional) encodings
`Big5$n `big5^lb `big5-hkscs^lb `cn-big5^lb `csbig5^lb `x-x-big5^lb
旧来の複byte Japanese ~encoding ◎ Legacy multi-byte Japanese encodings
`EUC-JP$n `cseucpkdfmtjapanese^lb `euc-jp^lb `x-euc-jp^lb
`ISO-2022-JP$n `csiso2022jp^lb `iso-2022-jp^lb
`Shift_JIS$n `csshiftjis^lb `ms932^lb `ms_kanji^lb `shift-jis^lb `shift_jis^lb `sjis^lb `windows-31j^lb `x-sjis^lb
旧来の複byte Korean ~encoding ◎ Legacy multi-byte Korean encodings
`EUC-KR$n `cseuckr^lb `csksc56011987^lb `euc-kr^lb `iso-ir-149^lb `korean^lb `ks_c_5601-1987^lb `ks_c_5601-1989^lb `ksc5601^lb `ksc_5601^lb `windows-949^lb
旧来のその他の~encoding ◎ Legacy miscellaneous encodings
`replacement$n `csiso2022kr^lb `hz-gb-2312^lb `iso-2022-cn^lb `iso-2022-cn-ext^lb `iso-2022-kr^lb
`UTF-16BE$n `utf-16be^lb
`UTF-16LE$n `utf-16^lb `utf-16le^lb
`x-user-defined$n `x-user-defined^lb

注記: すべての`~encoding$とそれらの`~label$は、規範的でない資源 `encodings.json$ からも入手できる。 ◎ All encodings and their labels are also available as non-normative encodings.json resource.

【 A name can be regarded as a canonical label: every name in the list above other than replacement is also valid as a label (its lowercased form is included in the corresponding set of labels). 】

4.3. 出力~encoding

`~encoding$ %~encoding から `出力~encodingを取得-@ するときは、次を走らす: ◎ To get an output encoding from an encoding encoding, run these steps:

  1. ~IF[ %~encoding ~IN { `replacement$n, `UTF-16BE$n, `UTF-16LE$n } ] ⇒ ~RET `UTF-8$n ◎ If encoding is replacement, UTF-16BE, or UTF-16LE, return UTF-8.
  2. ~RET %~encoding ◎ Return encoding.

注記: `出力~encodingを取得-$する~algoは、それを必要とする[ URL の構文解析 / HTML ~form提出 ]にて有用になる。 ◎ The get an output encoding algorithm is useful for URL parsing and HTML form submission, which both need exactly this.
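The two steps above amount to a small mapping, sketched here with encodings represented by their names as strings (a representation chosen for illustration; `get_output_encoding` is a hypothetical function name).

```python
# Encodings that have no encoder, so output falls back to UTF-8.
NO_ENCODER = ("replacement", "UTF-16BE", "UTF-16LE")


def get_output_encoding(encoding: str) -> str:
    if encoding in NO_ENCODER:
        return "UTF-8"
    return encoding
```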

5. 索引

ほとんどの旧来の`~encoding$では、 【~encodingごとに固有の】 `索引@ が利用される。 `索引$とは、一連の~entryからなる有順序~listであり、各~entryは[ ~pointerと, それに対応する~cp ]からなる組である。 `索引$の中では、~pointerは一意であり,~cpは重複し得る。 ◎ Most legacy encodings make use of an index. An index is an ordered list of entries, each entry consisting of a pointer and a corresponding code point. Within an index pointers are unique and code points can be duplicated.

注記: 効率的な実装は、`~encoding$ごとに,その`~decoder$と`~encoder$のそれぞれに最適化された, 2 つの`索引$を備えることになるであろう。 ◎ An efficient implementation likely has two indexes per encoding. One optimized for its decoder and one for its encoder.

`索引$ 【の~dataを供する下記の資源】 から,~pointerとそれに対応する~cpを見出すためには:

  1. まず、 %行~list をその資源の内容を `000A^U で一連の “行” に分割した結果とする。
  2. %行~list から[ 空行 / `0023^U で開始される行 ]をすべて除去する。
  3. %行~list の各~行に対し,行を `0009^U で分割したときの:

    • 最初の項が~pointer( 10 進表記)を与える。
    • 次の項が対応する~cp( 16 進表記)を与える。
    • 他の項は関係ない。
◎ To find the pointers and their corresponding code points in an index, let lines be the result of splitting the resource’s contents on U+000A. Then remove each item in lines that is the empty string or starts with U+0023. Then the pointers and their corresponding code points are found by splitting each item in lines on U+0009. The first subitem is the pointer (as a decimal number) and the second is the corresponding code point (as a hexadecimal number). Other subitems are not relevant.
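The parsing steps above can be sketched as follows. This is a minimal sketch: `parse_index` is a hypothetical function name and `SAMPLE` is a made-up text that only imitates the layout described above (comment lines starting with U+0023, then tab-separated fields); it is not an excerpt of a real index resource.

```python
def parse_index(text: str) -> list:
    """Return a list of (pointer, code point) pairs in file order."""
    entries = []
    for line in text.split("\n"):
        # Skip empty lines and lines starting with U+0023 (#).
        if line == "" or line.startswith("#"):
            continue
        fields = line.split("\t")
        pointer = int(fields[0].strip(), 10)  # decimal pointer
        code_point = int(fields[1], 16)       # hexadecimal, "0x" prefix allowed
        # Any further fields are not relevant.
        entries.append((pointer, code_point))
    return entries


# Made-up sample in the described layout (assumption, for illustration).
SAMPLE = (
    "# Identifier: example\n"
    "# Date: example\n"
    "\n"
    "  0\t0x3000\texample description\n"
    "  1\t0x3001\texample description\n"
)
parsed = parse_index(SAMPLE)
```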

注記: 各`索引$には、変更の有無を表すため, Identifier と Date 【識別子と日付】 が含まれている。 Identifier が変化していれば、`索引$も変更されている。 ◎ To signify changes an index includes an Identifier and a Date. If an Identifier has changed, so has the index.

%索引 の中で %~pointer が指す `索引~cp@ とは、 %索引 内に %~pointer が[ 在るならば,それに対応する~cp / 無ければ ~NULL ]である。 ◎ The index code point for pointer in index is the code point corresponding to pointer in index, or null if pointer is not in index.

%索引 の中で %~cp を指す `索引~pointer@ とは、 %索引 内に %~cp に対応する~pointerが[ 在るならば,それらのうちの 最初の ~pointer / 無ければ ~NULL ]である。 ◎ The index pointer for code point in index is the first pointer corresponding to code point in index, or null if code point is not in index.
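The two lookups can be sketched over an index held as a list of (pointer, code point) pairs ordered by pointer. This is a minimal sketch: the function names and `TOY_INDEX` (whose values are invented) are assumptions for illustration. A linear scan keeps the "first pointer" rule obvious; an efficient implementation would more likely keep two separate lookup tables, one per direction.

```python
def index_code_point(pointer, index):
    # Pointers are unique, so at most one entry matches.
    for p, code_point in index:
        if p == pointer:
            return code_point
    return None


def index_pointer(code_point, index):
    # Code points can be duplicated: return the FIRST matching pointer.
    for pointer, cp in index:
        if cp == code_point:
            return pointer
    return None


# Invented values; U+3000 appears twice to show the duplicate rule.
TOY_INDEX = [(0, 0x3000), (1, 0x3001), (5, 0x3000)]
```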

注記: 各 索引には,規範的でない視覚化があり、`索引~jis0208$には, `Shift_JIS$n 視覚化も別にある。 加えて,基本多言語面( BMP / Basic Multilingual Plane / `0000^U 〜 `FFFF^U )における被覆域の視覚化もある。 (いずれも,`索引~gb18030範囲集$は除く。) ◎ There is a non-normative visualization for each index other than index gb18030 ranges. index jis0208 also has an alternative Shift_JIS visualization. Additionally, there is visualization of the Basic Multilingual Plane coverage of each index other than index gb18030 ranges.

視覚化における凡例 ◎ The legend for the visualizations is:
表示 説明
対応する~cpなし。 ◎ Unmapped
~UTF-8で 2 ~byte。 ◎ Two bytes in UTF-8
~UTF-8で 2 ~byte, かつ ~cpは、前の~pointerの~cpの直後に続く。 ◎ Two bytes in UTF-8, code point follows immediately the code point of previous pointer
~UTF-8で 3 ~byte(私用領域でない) ◎ Three bytes in UTF-8 (non-PUA)
~UTF-8で 3 ~byte(私用領域でない), かつ ~cpは、前の~pointerの~cpの直後に続く。 ◎ Three bytes in UTF-8 (non-PUA), code point follows immediately the code point of previous pointer
私用領域 ◎ Private Use
私用領域, かつ ~cpは、前の~pointerの~cpの直後に続く。 ◎ Private Use, code point follows immediately the code point of previous pointer
~UTF-8で 4 ~byte ◎ Four bytes in UTF-8
~UTF-8で 4 ~byte, かつ ~cpは、前の~pointerの~cpの直後に続く。 ◎ Four bytes in UTF-8, code point follows immediately the code point of previous pointer
先に現れているものと重複する~cpに対応する。 ◎ Duplicate code point already mapped at an earlier index
~CJK互換漢字( CJK Compatibility Ideograph ) ◎ CJK Compatibility Ideograph
~CJK統合漢字拡張 A ◎ CJK Unified Ideographs Extension A

以下は、この仕様で定義される`索引$のうち,`単byte索引$でないものであり、それぞれに自前の~tableがある: `視覚化/被覆域の~tableは巨大なことに注意^tnote ◎ These are the indexes defined by this specification, excluding index single-byte, which have their own table:

`名前$ `索引$ 視覚化 基本多言語面( BMP )の被覆域
備考
`索引~Big5@ `big5$idx
これは、香港増補字符集( Hong Kong Supplementary Character Set ), および他の共通の拡張と一式で、~Big5標準に合致する。 ◎ This matches the Big5 standard in combination with the Hong Kong Supplementary Character Set and other common extensions.
`索引~EUC-KR@ `euc-kr$idx
これは、 KS X 1001 標準と 統合~Hangul~code( Unified Hangul Code )に合致する。 両者を併せたものは、 Windows Codepage 949 としてより広く知られている。 これ全体で、~Unicodeの~Hangul音節文字( Hangul Syllables )~blockを覆う。 ~Hangul~blockのうち,視覚化における左上隅が~pointer 9026 にあるものは、~Unicode順に並ぶ。 それとは別に,この索引における残りの~Hangul音節文字も、それらだけで見れば~Unicode順に並ぶ。 ◎ This matches the KS X 1001 standard and the Unified Hangul Code, more commonly known together as Windows Codepage 949. It covers the Hangul Syllables block of Unicode in its entirety. The Hangul block whose top left corner in the visualization is at pointer 9026 is in the Unicode order. Taken separately, the rest of the Hangul syllables in this index are in the Unicode order, too.
`索引~gb18030@ `gb18030$idx
これは、各~cpが 2 ~byteに~encodeされる GB18030-2005 標準に合致する — ただし,配備済みの内容と互換にする必要から、 `A3^X `A0^X は `3000^U に対応付けられる。 この索引~全体で、~Unicodeの~CJK統合漢字( CJK Unified Ideographs )~blockを覆う。 その~block内の~entryのうち,視覚化における(最初の) `3000^U より上または左にあるものは、~Unicode順に並ぶ。 ◎ This matches the GB18030-2005 standard for code points encoded as two bytes, except for 0xA3 0xA0 which maps to U+3000 to be compatible with deployed content. This index covers the CJK Unified Ideographs block of Unicode in its entirety. Entries from that block that are above or to the left of (the first) U+3000 in the visualization are in the Unicode order.
`索引~gb18030範囲集@ index-gb18030-ranges.txt
この`索引$は、他のすべてと働き方が異なる。 すべての~cpを挙げていくと項目数が 100 万を超えてしまうが、 207 個の範囲と自明な限界検査の組合せによれば,きれいに表現できる。 したがって、 4 ~byte に~encodeされる~cpについては, GB18030-2005 標準に表面的にしか合致しない。 下の[ `索引~gb18030範囲集~cp$ / `索引~gb18030範囲集~pointer$ ]も見よ。 ◎ This index works different from all others. Listing all code points would result in over a million items whereas they can be represented neatly in 207 ranges combined with trivial limit checks. It therefore only superficially matches the GB18030-2005 standard for code points encoded as four bytes. See also index gb18030 ranges code point and index gb18030 ranges pointer below.
`索引~jis0208@ `jis0208$idx
索引~Shift_JIS視覚化
IBM と NEC によるかつての~proprietary拡張も含まれている, JIS X 0208 標準。 ◎ This is the JIS X 0208 standard including formerly proprietary extensions from IBM and NEC.
`索引~jis0212@ `jis0212$idx
JIS X 0212 標準。 これは、広く~supportされていないので,`~EUC-JP~decoder$でのみ利用される(~encoderからは利用されない)。 ◎ This is the JIS X 0212 standard. It is only used by the EUC-JP decoder due to lack of widespread support elsewhere.

%~pointer が指す `索引~gb18030範囲集~cp@ は、次の手続きが返す~cpである: ◎ The index gb18030 ranges code point for pointer is the return value of these steps:

  1. ~IF[ 39419 ~LT %~pointer ~LT 189000 ]~OR[ 1237575 ~LT %~pointer ] ⇒ ~RET ~NULL ◎ If pointer is greater than 39419 and less than 189000, or pointer is greater than 1237575, return null.
  2. ~IF[ %~pointer ~EQ 7457 ] ⇒ ~RET ~cp `E7C7^U ◎ If pointer is 7457, return code point U+E7C7.
  3. %~offset ~LET `索引~gb18030範囲集$ の中で %~pointer を超えない最後の~pointer ◎ Let offset be the last pointer in index gb18030 ranges that is equal to or less than pointer and let code point offset be its corresponding code point.
  4. %~cp~offset ~LET %~offset が指している~cp ◎ ↑
  5. ~RET 値が[ %~cp~offset ~PLUS %~pointer ~MINUS %~offset ]なる~cp ◎ Return a code point whose value is code point offset + pointer − offset.

%~cp を指す `索引~gb18030範囲集~pointer@ は、次の手続きが返す~pointerである: ◎ The index gb18030 ranges pointer for code point is the return value of these steps:

  1. ~IF[ %~cp ~EQ `E7C7^U ] ⇒ ~RET ~pointer 7457 ◎ If code point is U+E7C7, return pointer 7457.
  2. %~offset ~LET `索引~gb18030範囲集$ の中で %~cp を超えない最後の~cp ◎ Let offset be the last code point in index gb18030 ranges that is equal to or less than code point and let pointer offset be its corresponding pointer.
  3. %~pointer~offset ~LET %~offset に対応する~pointer ◎ ↑
  4. ~RET 値が[ %~pointer~offset ~PLUS %~cp ~MINUS %~offset ]なる~pointer ◎ Return a pointer whose value is pointer offset + code point − offset.
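Both range lookups can be sketched with a binary search. This is a minimal sketch under stated assumptions: `RANGES` contains only three rows recalled from the ranges index (the first two and the last) and should be checked against the actual resource; the rows in between are omitted, so results for pointers or code points that fall in the omitted part are not meaningful. The function names are hypothetical, and `ranges_pointer` assumes its caller never passes a code point below U+0080.

```python
from bisect import bisect_right

# Excerpt (assumption): (pointer, code point) rows of the ranges index.
RANGES = [(0, 0x0080), (36, 0x00A5), (189000, 0x10000)]
_POINTERS = [pointer for pointer, _ in RANGES]
_CODE_POINTS = [code_point for _, code_point in RANGES]


def ranges_code_point(pointer):
    # Step 1: limit checks for the gap and the upper bound.
    if 39419 < pointer < 189000 or pointer > 1237575:
        return None
    # Step 2: the single special case.
    if pointer == 7457:
        return 0xE7C7
    # Steps 3 and 4: last row whose pointer is <= the given pointer.
    offset, code_point_offset = RANGES[bisect_right(_POINTERS, pointer) - 1]
    # Step 5.
    return code_point_offset + pointer - offset


def ranges_pointer(code_point):
    # Assumes code_point >= 0x80 (caller handles ASCII beforehand).
    if code_point == 0xE7C7:
        return 7457
    # Last row whose code point is <= the given code point.
    pointer_offset, offset = RANGES[bisect_right(_CODE_POINTS, code_point) - 1]
    return pointer_offset + code_point - offset
```

The upper limit is consistent with the last row: 189000 + (0x10FFFF − 0x10000) = 1237575.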

%~cp を指す `索引~Shift_JIS~pointer@ は、次の手続きが返す~pointerである: ◎ The index Shift_JIS pointer for code point is the return value of these steps:

  1. %索引 ~LET `索引~jis0208$ から,[ ~pointerが範囲 { 8272 〜 8835 } に入る~entry ]すべてを除外した索引 ◎ Let index be index jis0208 excluding all entries whose pointer is in the range 8272 to 8835, inclusive.

    `索引~jis0208$は、重複する~cpを包含するので、これらの~entryの除外により,後続の~cpが利用されるようになる。 ◎ The index jis0208 contains duplicate code points so the exclusion of these entries causes later code points to be used.

  2. ~RET %索引 の中で %~cp を指す`索引~pointer$ ◎ Return the index pointer for code point in index.
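The effect of excluding pointers 8272 to 8835 can be sketched with a toy index. This is a minimal sketch: `index_shift_jis_pointer` is a hypothetical function name, and the pointers and code points in `TOY_JIS0208` are invented to show a duplicate, not taken from the real index jis0208.

```python
def index_shift_jis_pointer(code_point, jis0208):
    # Step 1: drop every entry whose pointer is in 8272..8835, inclusive.
    filtered = [(p, cp) for p, cp in jis0208 if not 8272 <= p <= 8835]
    # Step 2: ordinary index pointer lookup (first match wins).
    for pointer, cp in filtered:
        if cp == code_point:
            return pointer
    return None


# Invented entries: the same code point at an excluded and a later pointer.
TOY_JIS0208 = [(0, 0x3000), (8300, 0x2170), (10716, 0x2170)]
```

Without the exclusion the lookup for the duplicated code point would return 8300; with it, the later pointer 10716 is used.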

%~cp を指す `索引~Big5~pointer@ は、次の手続きが返す~pointerである: ◎ The index Big5 pointer for code point is the return value of these steps:

  1. %索引 ~LET `索引~Big5$から[ ~pointerが ( (`A1^X ~MINUS `81^X) ~MUL 157 ) より小さい~entry ]すべてを除外した索引 ◎ Let index be index Big5 excluding all entries whose pointer is less than (0xA1 - 0x81) × 157.

    注記: 香港増補字符集( Hong Kong Supplementary Character Set )拡張を~literalとして返さないようにする。 ◎ Avoid returning Hong Kong Supplementary Character Set extensions literally.

  2. ~IF[ %~cp ~IN { `2550^U, `255E^U, `2561^U, `256A^U, `5341^U, `5345^U } ] ⇒ ~RET %索引 の中で %~cp に対応する 最後の ~pointer ◎ If code point is U+2550, U+255E, U+2561, U+256A, U+5341, or U+5345, return the last pointer corresponding to code point in index.

    注記: 他にも重複している~cpはあるが、それらに対しては,最初の ~pointerが利用されることになる。 ◎ There are other duplicate code points, but for those the first pointer is to be used.

  3. ~RET %索引 の中で %~cp を指す`索引~pointer$ ◎ Return the index pointer for code point in index.
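The Big5 pointer lookup can be sketched in the same way. This is a minimal sketch: the function and constant names are hypothetical, `TOY_BIG5` holds invented entries, and the index is assumed to be ordered by pointer. The threshold works out to (0xA1 − 0x81) × 157 = 32 × 157 = 5024.

```python
# Entries below this pointer are excluded from encoding.
HKSCS_LIMIT = (0xA1 - 0x81) * 157

# Code points for which the LAST matching pointer is used.
LAST_POINTER_CODE_POINTS = {0x2550, 0x255E, 0x2561, 0x256A, 0x5341, 0x5345}


def index_big5_pointer(code_point, big5):
    # Step 1: exclude entries whose pointer is below the threshold.
    matches = [p for p, cp in big5 if p >= HKSCS_LIMIT and cp == code_point]
    if not matches:
        return None
    # Step 2: six code points use their last pointer.
    if code_point in LAST_POINTER_CODE_POINTS:
        return matches[-1]
    # Step 3: every other code point uses its first pointer.
    return matches[0]


# Invented entries, ordered by pointer.
TOY_BIG5 = [
    (100, 0x5341),
    (5100, 0x5341),
    (5200, 0x5341),
    (5300, 0x4E00),
    (5400, 0x4E00),
    (200, 0x9999),
]
```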

注記: すべての`索引$は規範的でない資源 `indexes.json$ からも入手できる( `索引~gb18030範囲集$ の形式は、範囲を表現できるようにするため,若干~異なるものにされている)。 ◎ All indexes are also available as non-normative indexes.json resource. (index gb18030 ranges has a slightly different format here, to be able to represent ranges.)

6. 他の仕様のための~hook

注記: 次に挙げる各種~algoは、他の仕様からの~~利用が意図されている:

  • `~decode$
  • `~UTF-8~decode$
  • `~BOMを取扱わずに~UTF-8~decode$
  • `~BOMも失敗-も取扱わずに~UTF-8~decode$
  • `~encode$
  • `~UTF-8~encode$

新たな形式には、`~UTF-8~decode$が利用されることになる。 最初に`~label$を`~encoding$に~~変換するときは、`~encodingを取得-$する~algoを利用できる。

◎ The algorithms decode, UTF-8 decode, UTF-8 decode without BOM, UTF-8 decode without BOM or fail, encode, and UTF-8 encode are intended for usage by other specifications. UTF-8 decode is to be used by new formats. The get an encoding algorithm can be used first to turn a label into an encoding.

~fallback~encoding %~encoding を利用して,~byte~stream %~stream を `~decode@ するときは、次を走らす: ◎ To decode a byte stream stream using fallback encoding encoding, run these steps:

  1. %buffer ~LET ~byte列 [] ◎ Let buffer be an empty byte sequence.
  2. %~BOMseen~flag ~LET ~OFF ◎ Let BOM seen flag be unset.
  3. 次を 3 回 繰返す ⇒ %~stream から`読取った結果$を %buffer に付加する — ただし,`~EoS$ が返されたときは、付加せずに繰返しを終える ◎ Read bytes from stream into buffer until either buffer contains three bytes or read returns end-of-stream.
  4. ~IF[ 次の表の中で, 1 列目に示された~byte列が %buffer の先頭の~byte列に合致する行がある ] ⇒ %~encoding ~SET その行の 2 列目に与えられる `~encoding$ ~BR %~BOMseen~flag ~SET ~ON ◎ For each of the rows in the table below, starting with the first one and going down, if the first bytes of buffer match all the bytes given in the first column, then set encoding to the encoding given in the cell in the second column of that row and set BOM seen flag.

    ~byte~order-mark( ~BOM )◎Byte order mark ~encoding◎Encoding
    `EF^X `BB^X `BF^X `UTF-8$n
    `FE^X `FF^X `UTF-16BE$n
    `FF^X `FE^X `UTF-16LE$n

    注記: 配備済みの内容と互換性をとるため、~byte~order-mark(~BOM)は他より~~優先される。 HTTP が利用される文脈~下では、これは, `Content-Type` ~headerの意味論に対する違反である。 ◎ For compatibility with deployed content, the byte order mark (also known as BOM) is more authoritative than anything else. In a context where HTTP is used this is in violation of the semantics of the `Content-Type` header.

  5. ~IF[ %~BOMseen~flag ~EQ ~OFF ] ⇒ %buffer を %~stream に`前付加する$ ◎ If BOM seen flag is unset prepend buffer to stream.
  6. ~ELIF [ %~encoding ~NEQ `UTF-8$n ]~AND[ %buffer の長さ ~EQ 3 ] ⇒ %buffer の最後の~byteを %~stream に`前付加する$ ◎ Otherwise, if BOM seen flag is set, encoding is not UTF-8, and buffer contains three bytes, prepend the last byte of buffer to stream.
  7. %出力 ~LET ~cp`~stream$ ◎ Let output be a code point stream.
  8. ( %~stream, %出力 ) を与える下で, %~encoding の`~decoder$を`走らす$ ◎ Run encoding’s decoder with stream and output.
  9. ~RET %出力 ◎ Return output.
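Steps 1 to 6 of the decode algorithm (BOM sniffing) can be sketched on a complete byte string instead of a stream. This is a minimal sketch: `sniff_bom` is a hypothetical function name, encodings are represented by their names, and returning the remaining bytes replaces the prepend steps. Slicing off exactly the length of the matched BOM has the same effect as reading three bytes and prepending the third one back for the two-byte UTF-16 marks.

```python
# Rows of the table above, checked from the first row down.
BOMS = [
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]


def sniff_bom(data: bytes, fallback: str):
    """Return (encoding to use, bytes left for the decoder)."""
    buffer = data[:3]  # at most three bytes are inspected
    for bom, encoding in BOMS:
        if buffer.startswith(bom):
            # BOM seen: it overrides the fallback and is not decoded.
            return encoding, data[len(bom):]
    # No BOM: keep the fallback encoding and all of the bytes.
    return fallback, data
```

The returned pair would then be handed to the decoder of the chosen encoding in the default "replacement" error mode.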

~byte~stream %~stream を `~UTF-8~decode@ するときは、次を走らす: ◎ To UTF-8 decode a byte stream stream, run these steps:

  1. %buffer ~LET ~byte列 [] ◎ Let buffer be an empty byte sequence.
  2. 次を 3 回 繰返す ⇒ %~stream から`読取った結果$を %buffer に付加する 【 — ただし,`~EoS$ が返されたときは、付加せずに繰返しを終える 】 ◎ Read three bytes from stream into buffer.
  3. ~IF[ %buffer ~NEQ ~byte列 [ `EF^X, `BB^X, `BF^X ] ] ⇒ %buffer を %~stream に`前付加する$ ◎ If buffer does not match 0xEF 0xBB 0xBF, prepend buffer to stream.
  4. %出力 ~LET ~cp`~stream$ ◎ Let output be a code point stream.
  5. ( %~stream, %出力 ) を与える下で, `UTF-8$n の`~decoder$を`走らす$ ◎ Run UTF-8’s decoder with stream and output.
  6. ~RET %出力 ◎ Return output.

~byte~stream %~stream を `~BOMを取扱わずに~UTF-8~decode@ するときは、次を走らす: ◎ To UTF-8 decode without BOM a byte stream stream, run these steps:

  1. %出力 ~LET ~cp`~stream$ ◎ Let output be a code point stream.
  2. ( %~stream, %出力 ) を与える下で, `UTF-8$n の`~decoder$を`走らす$ ◎ Run UTF-8’s decoder with stream and output.
  3. ~RET %出力 ◎ Return output.

~byte~stream %~stream を `~BOMも失敗-も取扱わずに~UTF-8~decode@ するときは、次を走らす: ◎ To UTF-8 decode without BOM or fail a byte stream stream, run these steps:

  1. %出力 ~LET ~cp`~stream$ ◎ Let output be a code point stream.
  2. ( %~stream, %出力, `fatal^l ) を与える下で, `UTF-8$n の`~decoder$を`走らす$ ◎ Let potentialError be the result of running UTF-8’s decoder with stream, output, and "fatal".
  3. ~IF[ 前~段の結果 ~EQ `~error$ ] ⇒ ~RET `失敗^i ◎ If potentialError is error, return failure.
  4. ~RET %出力 ◎ Return output.
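
From script, behavior close to "UTF-8 decode without BOM or fail" can be approximated with the `TextDecoder` API defined in section 7. The sketch below is only an approximation under stated assumptions: `utf8DecodeWithoutBOMOrFail` is a hypothetical name, the input is a whole `Uint8Array`, and `null` stands in for failure.

```javascript
// Hypothetical wrapper: strict UTF-8 decoding that keeps a leading U+FEFF.
function utf8DecodeWithoutBOMOrFail(bytes) {
  // fatal: true      -> malformed input throws instead of yielding U+FFFD
  // ignoreBOM: true  -> a leading BOM is not stripped from the output
  const decoder = new TextDecoder("utf-8", { fatal: true, ignoreBOM: true });
  try {
    return decoder.decode(bytes);
  } catch (e) {
    return null; // any thrown error is treated as "failure" in this sketch
  }
}
```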

~encoding %~encoding を利用して ~cp~stream %~stream を `~encode@ するときは、次を走らす: ◎ To encode a code point stream stream using encoding encoding, run these steps:

  1. ~Assert: %~encoding ~NIN { `replacement$n, `UTF-16BE$n, `UTF-16LE$n } ◎ Assert: encoding is not replacement, UTF-16BE or UTF-16LE.
  2. %出力 ~LET ~byte`~stream$ ◎ Let output be a byte stream.
  3. ( %~stream, %出力, `html^l ) を与える下で, %~encoding の`~encoder$を`走らす$ ◎ Run encoding’s encoder with stream, output, and "html".
  4. ~RET %出力 ◎ Return output.

注記: これは、主に URL と HTML ~formのための旧来の~hookである。 `~UTF-8~encode$を被せた方が、決して`~error$を誘発させないので安全である。 `URL$r `HTML$r ◎ This is mostly a legacy hook for URLs and HTML forms. Layering UTF-8 encode on top is safe as it never triggers errors. [URL] [HTML]

~cp~stream %~stream を `~UTF-8~encode@ するときは、次を走らす:

  1. ~RET `UTF-8$n を~encodingに利用して %~stream を`~encode$した結果
◎ To UTF-8 encode a code point stream stream, return the result of encoding stream using encoding UTF-8.

7. ~API

この節では Web IDL の各種用語が用いられる。 非~browser~UAに対しては、この~APIの~supportは要求されない。 `WEBIDL$r ◎ This section uses terminology from Web IDL. Non-browser user agents are not required to support this API.

次の例は、 `TextEncoder$I ~objを利用して,文字列の配列を `ArrayBuffer$I に~encodeする。 結果は次を内容とする `Uint8Array$I になる: 先頭が( `Uint32Array$I としての)文字列の個数,その後は: 最初の文字列の( `Uint32Array$I としての)長さ, `UTF-8$n に~encodeされたその文字列~data, 2 番目の文字列の( `Uint32Array$I としての)長さ, その文字列~data, 等々と続く。
◎ The following example uses the TextEncoder object to encode an array of strings into an ArrayBuffer. The result is a Uint8Array containing the number of strings (as a Uint32Array), followed by the length of the first string (as a Uint32Array), the UTF-8 encoded string data, the length of the second string (as a Uint32Array), the string data, and so on.

function encodeArrayOfStrings(%strings) {
  var %encoder, %encoded, %len, %bytes, %view, %offset;

  %encoder = new TextEncoder();
  %encoded = [];

  %len = Uint32Array.BYTES_PER_ELEMENT;
  for (var %i = 0; %i < %strings.length; %i++) {
    %len += Uint32Array.BYTES_PER_ELEMENT;
    %encoded[%i] = %encoder.encode(%strings[%i]);
    %len += %encoded[%i].byteLength;
  }

  %bytes = new Uint8Array(%len);
  %view = new DataView(%bytes.buffer);
  %offset = 0;

  %view.setUint32(%offset, %strings.length);
  %offset += Uint32Array.BYTES_PER_ELEMENT;
  for (var %i = 0; %i < %encoded.length; %i += 1) {
    %len = %encoded[%i].byteLength;
    %view.setUint32(%offset, %len);
    %offset += Uint32Array.BYTES_PER_ELEMENT;
    %bytes.set(%encoded[%i], %offset);
    %offset += %len;
  }
  return %bytes.buffer;
}

次の例は、[[ 前の例, または `UTF-8$n 以外の~encodingに等価な~algo ]により生産される形式に~encodeされた~data ]を含んでいる `ArrayBuffer$I を~decodeして、元の,一連の文字列からなる配列に戻す。 ◎ The following example decodes an ArrayBuffer containing data encoded in the format produced by the previous example, or an equivalent algorithm for encodings other than UTF-8, back into an array of strings.

function decodeArrayOfStrings(%buffer, %encoding) {
  var %decoder, %view, %offset, %num_strings, %strings, %len;

  %decoder = new TextDecoder(%encoding);
  %view = new DataView(%buffer);
  %offset = 0;
  %strings = [];

  %num_strings = %view.getUint32(%offset);
  %offset += Uint32Array.BYTES_PER_ELEMENT;
  for (var %i = 0; %i < %num_strings; %i++) {
    %len = %view.getUint32(%offset);
    %offset += Uint32Array.BYTES_PER_ELEMENT;
    %strings[%i] = %decoder.decode(
      new DataView(%view.buffer, %offset, %len));
    %offset += %len;
  }
  return %strings;
}

7.1. ~interface `TextDecoder^I

dictionary `TextDecoderOptions@I {
  boolean fatal = false;
  boolean ignoreBOM = false;
};

dictionary `TextDecodeOptions@I {
  boolean stream = false;
};

[Constructor(
    optional DOMString %label = "utf-8",
    optional `TextDecoderOptions$I %options
),
 Exposed=(Window,Worker)]
interface `TextDecoder@I {
  readonly attribute DOMString `encoding$m;
  readonly attribute boolean `fatal$m;
  readonly attribute boolean `ignoreBOM$m;
  USVString `decode$m(
      optional BufferSource %input,
      optional `TextDecodeOptions$I %options
  );
};

各 `TextDecoder$I ~objには、次のものが結付けられる(括弧内は初期~値):

  • `~encoding^ec
  • `~decoder^ec

    【 `~encoding^ecに対応する`~decoder$の,~instance。 `~decoder$には,内部状態を保持する変数たちを伴うものもあるので、~objごとに~instanceを要する。 】

  • `~stream^ec
  • `~BOMignore~flag^ec( ~OFF )
  • `~BOMseen~flag^ec( ~OFF )
  • `~error~mode^ec( `replacement^l )
  • `~no_flush~flag^ec( ~OFF )
◎ A TextDecoder object has an associated encoding, decoder, stream, ignore BOM flag (initially unset), BOM seen flag (initially unset), error mode (initially "replacement"), and do not flush flag (initially unset).

各 `TextDecoder$I ~objには、~streamを `直列化-@ する~algoも結付けられる。 それは、所与の`~stream$ %~stream に対し,次を走らす: ◎ A TextDecoder object also has an associated serialize stream algorithm, that given a stream stream, runs these steps:

  1. %出力 ~LET 空~文字列 ◎ Let output be the empty string.
  2. ~WHILE 無条件: ◎ While true:

    1. %~token ~LET %~stream から`読取った結果$ ◎ Let token be the result of reading from stream.
    2. ~IF[ `~encoding^ec ~IN { `UTF-8$n, `UTF-16BE$n, `UTF-16LE$n } ]~AND[ `~BOMignore~flag^ec ~EQ ~OFF ]~AND[ `~BOMseen~flag^ec ~EQ ~OFF ]: ◎ If encoding is UTF-8, UTF-16BE, or UTF-16LE, and ignore BOM flag and BOM seen flag are unset, then run these subsubsteps:

      1. ~IF[ %~token ~EQ `FEFF^U ] ⇒ `~BOMseen~flag^ec ~SET ~ON ◎ If token is U+FEFF, then set BOM seen flag.
      2. ~ELIF[ %~token ~NEQ `~EoS$ ] ⇒ `~BOMseen~flag^ec ~SET ~ON ~BR %~token を %出力 に付加する ◎ Otherwise, if token is not end-of-stream, then set BOM seen flag and append token to output.
      3. ~ELSE ⇒ ~RET %出力 ◎ Otherwise, return output.
    3. ~ELIF[ %~token ~NEQ `~EoS$ ] ⇒ %~token を %出力 に付加する ◎ Otherwise, if token is not end-of-stream, then append token to output.
    4. ~ELSE ⇒ ~RET %出力 ◎ Otherwise, return output.

注記: この~algoは、~APIの利用者に より多くの制御を供するため,~platformの他の場所で利用される`~decode$ ~algoとは、~BOMの取扱いの点で意図的に異なるものにされている。 ◎ This algorithm is intentionally different with respect to BOM handling from the decode algorithm used by the rest of the platform to give API users more control.
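
The effect of the ignore BOM flag on serialization can be observed directly. The following sketch decodes the same bytes twice; the variable names are illustrative only.

```javascript
// UTF-8 BOM (EF BB BF) followed by "hi".
const bomBytes = new Uint8Array([0xEF, 0xBB, 0xBF, 0x68, 0x69]);

// Default: the first U+FEFF is consumed and the BOM seen flag is set.
const stripped = new TextDecoder("utf-8").decode(bomBytes);

// ignoreBOM: true sets the ignore BOM flag, so U+FEFF stays in the output.
const kept = new TextDecoder("utf-8", { ignoreBOM: true }).decode(bomBytes);
```

Only the first token is affected: once the BOM seen flag is set, any later U+FEFF is appended to the output like any other code point.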


%decoder = new `TextDecoder$m([%label = "utf-8" [, %options]])
新たな `TextDecoder$I ~obj を返す。 ◎ Returns a new TextDecoder object.
%label が`~label$でない, または %label が`replacement$nである場合、 `RangeError^E が投出される。 ◎ If label is either not a label or is a label for replacement, throws a RangeError.
%decoder . `encoding$m
`~encoding^ecの`名前$を小文字~化して返す。 ◎ Returns encoding’s name, lowercased.
%decoder . `fatal$m
`~error~mode^ecが `fatal^l ならば ~T を, 他の場合は ~F を返す。 ◎ Returns true if error mode is "fatal", and false otherwise.
%decoder . `ignoreBOM$m
`~BOMignore~flag^ecが ~ON ならば ~T を, 他の場合は ~F を返す。 ◎ Returns true if ignore BOM flag is set, and false otherwise.
%decoder . `decode([input [, options]])$m

%input を `~encoding^ecの`~decoder$にかけた結果を返す。 ~streamを断片化して処理するときは、 %options の `stream^m ~memberを ~T にした下で,この~methodを 0 回~以上~呼出してから, %options を省略して(または その `stream^m ~memberを ~F にして) 1 回だけ呼出すことで行える。 後者の呼出時に %input もないならば、両~引数とも省略するのが最も簡明になる。 ◎ Returns the result of running encoding’s decoder. The method can be invoked zero or more times with options’s stream set to true, and then once without options’s stream (or set to false), to process a fragmented stream. If the invocation without options’s stream (or set to false) has no input, it’s clearest to omit both arguments.

var %string = "", %decoder = new TextDecoder(%encoding), %buffer;
while(%buffer = next_chunk()) {
  %string += %decoder.decode(%buffer, {stream:true});
}
%string += %decoder.decode(); // ~EoS
`~error~mode^ec ~EQ `fatal^l の下で, `~encoding^ecの`~decoder$が`~error$を返した場合、 `TypeError^E が投出される。 ◎ If the error mode is "fatal" and encoding’s decoder returns error, throws a TypeError.
`TextDecoder(label, options)@m

この構築子の被呼出時には、次を走らせ~MUST: ◎ The TextDecoder(label, options) constructor, when invoked, must run these steps:

  1. %~encoding ~LET %label から`~encodingを取得-$した結果 ◎ Let encoding be the result of getting an encoding from label.
  2. ~IF[ %~encoding ~IN { `失敗^i, `replacement$n } ] ⇒ ~THROW `RangeError^E ◎ If encoding is failure or replacement, then throw a RangeError.
  3. %dec ~LET 新たな `TextDecoder$I ~obj ◎ Let dec be a new TextDecoder object.
  4. %dec の `~encoding^ec ~SET %~encoding ◎ Set dec’s encoding to encoding.
  5. ~IF[ %options の `fatal^m ~member ~EQ ~T ] ⇒ %dec の `~error~mode^ec ~SET `fatal^l ◎ If options’s fatal member is true, then set dec’s error mode to "fatal".
  6. ~IF[ %options の `ignoreBOM^m ~member ~EQ ~T ] ⇒ %dec の `~BOMignore~flag^ec ~SET ~ON ◎ If options’s ignoreBOM member is true, then set dec’s ignore BOM flag.
  7. ~RET %dec ◎ Return dec.
`encoding@m
取得子は、此れの`~encoding^ecの`名前$を`~ASCII小文字~化$した結果を返さ~MUST。 ◎ The encoding attribute’s getter must return encoding’s name in ASCII lowercase.
`fatal@m
取得子は、[ 此れの`~error~mode^ec ~EQ `fatal^l ならば ~T / ~ELSE_ ~F ]を返さ~MUST。 ◎ The fatal attribute’s getter must return true if error mode is "fatal", and false otherwise.
`ignoreBOM@m
取得子は、[ 此れの`~BOMignore~flag^ec ~EQ ~ON ならば ~T / ~ELSE_ ~F ]を返さ~MUST。 ◎ The ignoreBOM attribute’s getter must return true if ignore BOM flag is set, and false otherwise.
`decode(input, options)@m

被呼出時には、次を走らせ~MUST: ◎ The decode(input, options) method, when invoked, must run these steps:

  1. ~IF[ 此れの`~no_flush~flag^ec ~EQ ~OFF ] ⇒ 此れの`~decoder^ec ~SET 新たな[ 此れの`~encoding^ecの`~decoder$ ] ~BR 此れの`~stream^ec ~SET 新たな`~stream$ ~BR 此れの`~BOMseen~flag^ec ~SET ~OFF ◎ If the do not flush flag is unset, set decoder to a new encoding’s decoder, set stream to a new stream, and unset the BOM seen flag.
  2. 此れの`~no_flush~flag^ec ~SET[ %options の `stream^m ~EQ ~T ならば ~ON / ~ELSE_ ~OFF ] ◎ If options’s stream is true, set the do not flush flag, and unset the do not flush flag otherwise.
  3. ~IF[ %input は与えられている ] ⇒ %input の複製を 此れの`~stream^ecに`~pushする$ ◎ If input is given, push a copy of input to stream.
  4. %出力 ~LET 新たな`~stream$ ◎ Let output be a new stream.
  5. ~WHILE 無条件: ◎ While true:

    1. %~token ~LET 此れの`~stream^ecから`読取った結果$ ◎ Let token be the result of reading from stream.
    2. ~IF[ %~token ~EQ `~EoS$ ]~AND[ 此れの`~no_flush~flag^ec ~EQ ~ON ] ⇒ ~RET %出力 を`直列化-$した結果 ◎ If token is end-of-stream and the do not flush flag is set, then return output, serialized.

      ~streamingでは、[ `~no_flush~flag^ec ~EQ ~ON ]のときに,ここで`~EoS$を取扱うことなく,その~flagを ~OFF にしない仕方で働く。 この仕方により、後続の呼出時には,`~decoder^ecは この~algoの最初の段で一新されず、その状態は保全される。 ◎ The way streaming works is to not handle end-of-stream here when the do not flush flag is set and to not unset that flag. That way in a subsequent invocation decoder is not set anew in the first step of the algorithm and its state is preserved.

    3. %結果 ~LET 次を与える下で,`~tokenを処理-$した結果 ⇒ ( %~token, 此れの`~decoder^ec, 此れの`~stream^ec, %出力, 此れの`~error~mode^ec ) ◎ Otherwise, run these subsubsteps: ◎ Let result be the result of processing token for decoder, stream, output, and error mode.
    4. ~IF[ %結果 ~EQ `完遂$ ] ⇒ ~RET %出力 を`直列化-$した結果 ◎ If result is finished, then return output, serialized.
    5. ~IF[ %結果 ~EQ `~error$ ] ⇒ ~THROW `TypeError^E ◎ Otherwise, if result is error, throw a TypeError. ◎ Otherwise, do nothing.
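
The role of the do not flush flag is easiest to see when a multi-byte sequence is split across chunks. The sketch below is a concrete instance of the steps above; the chunk contents and variable names are illustrative.

```javascript
// U+20AC (euro sign) is E2 82 AC in UTF-8; here it straddles two chunks.
const streamDecoder = new TextDecoder("utf-8");
const chunks = [
  new Uint8Array([0xE2, 0x82]),  // incomplete sequence
  new Uint8Array([0xAC, 0x21]),  // final byte of the sequence, then "!"
];

let text = "";
for (const chunk of chunks) {
  // stream: true keeps the decoder state between calls (do not flush flag set).
  text += streamDecoder.decode(chunk, { stream: true });
}
// The final call without arguments flushes and handles end-of-stream.
text += streamDecoder.decode();
```

Decoding each chunk with a fresh decoder, or without `stream: true`, would instead treat the first chunk as a truncated sequence and emit U+FFFD.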

7.2. ~interface `TextEncoder^I

[Constructor, Exposed=(Window,Worker)]
interface `TextEncoder@I {
  readonly attribute DOMString `~encoding0$m;
  [NewObject] Uint8Array `encode$m(optional USVString %input = "");
};

各 `TextEncoder$I ~objには、 `~encoder^ec が結付けられる: ◎ A TextEncoder object has an associated encoder.

注記: `TextEncoder$I ~objの構築子には、 %label 引数はなく,~supportするのは `UTF-8$n のみである。 また、~scalar値~bufferを要する`~encoder$は無いので, `stream^m の~optionもない。 ◎ A TextEncoder object offers no label argument as it only supports UTF-8. It also offers no stream option as no encoder requires buffering of scalar values.


%encoder = new `TextEncoder()$m
新たな `TextEncoder$I ~obj を返す。 ◎ Returns a new TextEncoder object.
%encoder . `~encoding0$m
`utf-8^l を返す。 ◎ Returns "utf-8".
%encoder . `encode([input = ""])$m
%input を `UTF-8$n の`~encoder$にかけた結果を返す。 ◎ Returns the result of running UTF-8’s encoder.
`TextEncoder()@m

この構築子の被呼出時には、次を走らせ~MUST: ◎ The TextEncoder() constructor, when invoked, must run these steps:

  1. %enc ~LET 新たな `TextEncoder$I ~obj ◎ Let enc be a new TextEncoder object.
  2. %enc の `~encoder^ec ~SET `UTF-8$n の`~encoder$ ◎ Set enc’s encoder to UTF-8’s encoder.
  3. ~RET %enc ◎ Return enc.
`~encoding0@m
取得子は、 `utf-8^l を返さ~MUST。 ◎ The encoding attribute’s getter must return "utf-8".
`encode(input)@m

被呼出時には、次を走らせ~MUST: ◎ The encode(input) method, when invoked, must run these steps:

  1. %入力 ~LET %input を`~stream$に変換した結果 ◎ Convert input to a stream.
  2. %出力 ~LET 新たな`~stream$ ◎ Let output be a new stream.
  3. ~WHILE 無条件 : ◎ While true, run these substeps:

    1. %~token ~LET %入力 から`読取った結果$ ◎ Let token be the result of reading from input.
    2. %結果 ~LET 次を与える下で,`~tokenを処理-$した結果 ⇒ ( %~token, 此れの`~encoder^ec, %入力, %出力 ) ◎ Let result be the result of processing token for encoder, input, output.
    3. ~IF[ %結果 ~EQ `完遂$ ] ⇒ ~RET [[ %出力 を~byte列に変換した結果 ]を包含する `ArrayBuffer$I ]を包装するような,新たな `Uint8Array$I ~obj ◎ If result is finished, convert output into a byte sequence, and then return a Uint8Array object wrapping an ArrayBuffer containing output.

    注記: `UTF-8$n が`~error$を返すことはない。 ◎ UTF-8 cannot return error.
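
A short illustration of the `encode()` method follows. Because the argument is a `USVString`, a lone surrogate is replaced by U+FFFD before encoding, which is one reason the UTF-8 encoder never reports an error. Variable names are illustrative.

```javascript
const utf8Encoder = new TextEncoder();

// U+20AC encodes to three bytes: E2 82 AC.
const euroBytes = utf8Encoder.encode("\u20AC");

// A lone surrogate is converted to U+FFFD first, which encodes to EF BF BD.
const loneSurrogateBytes = utf8Encoder.encode("\uD800");

// Omitting the argument encodes the empty string.
const emptyBytes = utf8Encoder.encode();
```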

8. ~~標準の~encoding

【 この “~~標準の” は “The” の対訳であり、およそ, “規範とされるべき唯一無二の” を意味する。 】

8.1. ~UTF-8

8.1.1. ~UTF-8~decoder

`UTF-8$n の`~decoder$の各~instanceには、次のものが結付けられる ⇒ `~UTF-8~cp@(初期~時 0 )~BR `~UTF-8出現~byte数@(初期~時 0 )~BR `~UTF-8要~byte数@(初期~時 0 )~BR `~UTF-8下限@(初期~時 `80^X )~BR `~UTF-8上限@(初期~時 `BF^X ) ◎ UTF-8’s decoder has an associated UTF-8 code point, UTF-8 bytes seen, and UTF-8 bytes needed (all initially 0), a UTF-8 lower boundary (initially 0x80), and a UTF-8 upper boundary (initially 0xBF).

`UTF-8$n の`~decoder$の`~handler$は、所与の ( %~stream, %~byte ) に対し,次を走らす: ◎ UTF-8’s decoder’s handler, given a stream and byte, runs these steps:

  1. ~IF[ %~byte ~EQ `~EoS$ ]~AND[ `~UTF-8要~byte数$ ~NEQ 0 ] ⇒ `~UTF-8要~byte数$ ~SET 0 ~BR ~RET `~error$ ◎ If byte is end-of-stream and UTF-8 bytes needed is not 0, set UTF-8 bytes needed to 0 and return error.
  2. ~IF[ %~byte ~EQ `~EoS$ ] ⇒ ~RET `完遂$ ◎ If byte is end-of-stream, return finished.
  3. ~IF[ `~UTF-8要~byte数$ ~EQ 0 ]: ◎ If UTF-8 bytes needed is 0, based on byte:

    1. %~byte に応じて: ◎ ↑

      `00^X 〜 `7F^X
      ~RET ~cp [ %~byte ] ◎ Return a code point whose value is byte.
      `C2^X 〜 `DF^X
      1. `~UTF-8要~byte数$ ~SET 1 ◎ Set UTF-8 bytes needed to 1.
      2. `~UTF-8~cp$ ~SET %~byte ~bAND `1F^X ( %~byte の下位 5 ~bit ) ◎ Set UTF-8 code point to byte & 0x1F. ◎ The five least significant bits of byte.
      `E0^X 〜 `EF^X
      1. ~IF[ %~byte ~EQ `E0^X ] ⇒ `~UTF-8下限$ ~SET `A0^X ◎ If byte is 0xE0, set UTF-8 lower boundary to 0xA0.
      2. ~IF[ %~byte ~EQ `ED^X ] ⇒ `~UTF-8上限$ ~SET `9F^X ◎ If byte is 0xED, set UTF-8 upper boundary to 0x9F.
      3. `~UTF-8要~byte数$ ~SET 2 ◎ Set UTF-8 bytes needed to 2.
      4. `~UTF-8~cp$ ~SET %~byte ~bAND `F^X ( %~byte の下位 4 ~bit ) ◎ Set UTF-8 code point to byte & 0xF. ◎ The four least significant bits of byte.
      `F0^X 〜 `F4^X
      1. ~IF[ %~byte ~EQ `F0^X ] ⇒ `~UTF-8下限$ ~SET `90^X ◎ If byte is 0xF0, set UTF-8 lower boundary to 0x90.
      2. ~IF[ %~byte ~EQ `F4^X ] ⇒ `~UTF-8上限$ ~SET `8F^X ◎ If byte is 0xF4, set UTF-8 upper boundary to 0x8F.
      3. `~UTF-8要~byte数$ ~SET 3 ◎ Set UTF-8 bytes needed to 3.
      4. `~UTF-8~cp$ ~SET %~byte ~bAND `7^X ( %~byte の下位 3 ~bit ) ◎ Set UTF-8 code point to byte & 0x7. ◎ The three least significant bits of byte.
      ~OTHER◎Otherwise
      ~RET `~error$ ◎ Return error.
    2. ~RET `継続$ ◎ Return continue.
  4. ~IF[ %~byte ~NIN { `~UTF-8下限$ 〜 `~UTF-8上限$ } ]: ◎ If byte is not in the range UTF-8 lower boundary to UTF-8 upper boundary, inclusive, run these substeps:

    1. ( `~UTF-8~cp$, `~UTF-8要~byte数$, `~UTF-8出現~byte数$ ) ~SET ( 0, 0, 0 ) ~BR ( `~UTF-8下限$, `~UTF-8上限$ ) ~SET ( `80^X, `BF^X ) ◎ Set UTF-8 code point, UTF-8 bytes needed, and UTF-8 bytes seen to 0, set UTF-8 lower boundary to 0x80, and set UTF-8 upper boundary to 0xBF.
    2. %~byte を %~stream に`前付加する$ ◎ Prepend byte to stream.
    3. ~RET `~error$ ◎ Return error.
  5. ( `~UTF-8下限$, `~UTF-8上限$ ) ~SET ( `80^X, `BF^X ) ◎ Set UTF-8 lower boundary to 0x80 and UTF-8 upper boundary to 0xBF.
  6. `~UTF-8~cp$ ~SET (`~UTF-8~cp$ ~Lshift 6) ~bOR (%~byte ~bAND `3F^X) ◎ Set UTF-8 code point to (UTF-8 code point << 6) | (byte & 0x3F)

    `~UTF-8~cp$内の既存の~bitを左へ 6 ~bit ~shiftして,~~空いた下位 6 ~bitに %~byte の下位 6 ~bitをあてがう。 ◎ Shift the existing bits of UTF-8 code point left by six places and set the newly-vacated six least significant bits to the six least significant bits of byte.

  7. `~UTF-8出現~byte数$ ~INCBY 1 ◎ Increase UTF-8 bytes seen by one.
  8. ~IF[ `~UTF-8出現~byte数$ ~NEQ `~UTF-8要~byte数$ ] ⇒ ~RET `継続$ ◎ If UTF-8 bytes seen is not equal to UTF-8 bytes needed, return continue.
  9. %~cp ~LET `~UTF-8~cp$ ◎ Let code point be UTF-8 code point.
  10. ( `~UTF-8~cp$, `~UTF-8要~byte数$, `~UTF-8出現~byte数$ ) ~SET ( 0, 0, 0 ) ◎ Set UTF-8 code point, UTF-8 bytes needed, and UTF-8 bytes seen to 0.
  11. ~RET ~cp [ %~cp ] ◎ Return a code point whose value is code point.

注記: 上の`~UTF-8~decoder$における拘束は、~Unicode標準の “Best Practices for Using U+FFFD” に準ずる。 他のふるまいは Encoding 標準の下では許可されない(同じ結果が得られるなら,他の~algoでも もちろん~~十分であり、むしろ奨励される)。 `UNICODE$r ◎ The constraints in the UTF-8 decoder above match “Best Practices for Using U+FFFD” from the Unicode standard. No other behavior is permitted per the Encoding Standard (other algorithms that achieve the same result are obviously fine, even encouraged). [UNICODE]
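
The handler above can be transcribed into JavaScript fairly directly. The sketch below is illustrative only: `utf8DecodeSketch` is a hypothetical name, it works on a whole `Uint8Array` instead of a stream, and it assumes the "replacement" error mode, so each error becomes U+FFFD. "Prepend byte to stream" is modeled by stepping the read index back.

```javascript
// Hypothetical transcription of the UTF-8 decoder handler (replacement mode).
function utf8DecodeSketch(bytes) {
  let codePoint = 0, seen = 0, needed = 0, lower = 0x80, upper = 0xBF;
  const out = [];
  let i = 0;
  while (i < bytes.length) {
    const byte = bytes[i++];
    if (needed === 0) {
      if (byte <= 0x7F) {
        out.push(byte);
      } else if (byte >= 0xC2 && byte <= 0xDF) {
        needed = 1; codePoint = byte & 0x1F;
      } else if (byte >= 0xE0 && byte <= 0xEF) {
        if (byte === 0xE0) lower = 0xA0; // reject overlong forms
        if (byte === 0xED) upper = 0x9F; // reject surrogates
        needed = 2; codePoint = byte & 0x0F;
      } else if (byte >= 0xF0 && byte <= 0xF4) {
        if (byte === 0xF0) lower = 0x90; // reject overlong forms
        if (byte === 0xF4) upper = 0x8F; // stay at or below U+10FFFF
        needed = 3; codePoint = byte & 0x07;
      } else {
        out.push(0xFFFD); // error: not a valid lead byte
      }
      continue;
    }
    if (byte < lower || byte > upper) {
      codePoint = needed = seen = 0; lower = 0x80; upper = 0xBF;
      i--;              // "prepend byte to stream": process it again
      out.push(0xFFFD); // error for the aborted sequence
      continue;
    }
    lower = 0x80; upper = 0xBF;
    codePoint = (codePoint << 6) | (byte & 0x3F);
    if (++seen !== needed) continue;
    out.push(codePoint);
    codePoint = needed = seen = 0;
  }
  if (needed !== 0) out.push(0xFFFD); // truncated sequence at end-of-stream
  return String.fromCodePoint(...out);
}
```

The spread into `String.fromCodePoint` is fine for short inputs such as these; a production decoder would build the string incrementally.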

8.1.2. ~UTF-8~encoder

`UTF-8$n の`~encoder$の`~handler$は、所与の ( %~stream, %~cp ) に対し,次を走らす: ◎ UTF-8’s encoder’s handler, given a stream and code point, runs these steps:

  1. ~IF[ %~cp ~EQ `~EoS$ ] ⇒ ~RET `完遂$ ◎ If code point is end-of-stream, return finished.
  2. ~IF[ %~cp ~IN `~ASCII~cp$ ] ⇒ ~RET ~byte列 [ %~cp ] ◎ If code point is an ASCII code point, return a byte whose value is code point.
  3. ( %count, %~offset ) ~SET %~cp が属する範囲に応じて,次で与えられる値: ◎ Set count and offset based on the range code point is in:

    `0080^U 〜 `07FF^U
    ( 1, `C0^X )
    `0800^U 〜 `FFFF^U
    ( 2, `E0^X )
    `10000^U 〜 `10FFFF^U
    ( 3, `F0^X )
    ◎ U+0080 to U+07FF, inclusive • 1 and 0xC0 ◎ U+0800 to U+FFFF, inclusive • 2 and 0xE0 ◎ U+10000 to U+10FFFF, inclusive • 3 and 0xF0
  4. %~byte列 ~LET ~byte列 [ ( %~cp ~Rshift ( 6 ~MUL %count ) ) ~PLUS %~offset ] ◎ Let bytes be a byte sequence whose first byte is (code point >> (6 × count)) + offset.
  5. ~WHILE %count ~GT 0 : ◎ Run these substeps while count is greater than 0:

    1. %temp ~SET %~cp ~Rshift ( 6 ~MUL ( %count ~MINUS 1 ) ) ◎ Set temp to code point >> (6 × (count − 1)).
    2. ( `80^X ~bOR ( %temp ~bAND `3F^X ) ) を %~byte列 に付加する ◎ Append to bytes 0x80 | (temp & 0x3F).
    3. %count ~DECBY 1 ◎ Decrease count by one.
  6. ~RET %~byte列 ◎ Return bytes bytes, in order.

この~algoは、~Unicode標準に述べられるものと一致する結果を得るが、完全さのためここに含められている。 `UNICODE$r ◎ This algorithm has identical results to the one described in the Unicode standard. It is included here for completeness. [UNICODE]
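
As a worked illustration, the encoder handler above can be written as a function from one scalar value to its bytes. `utf8EncodeCodePoint` is a hypothetical name, and the input is assumed to be a scalar value already (no surrogates), as the handler itself assumes.

```javascript
// Hypothetical transcription of the UTF-8 encoder handler for one code point.
function utf8EncodeCodePoint(codePoint) {
  if (codePoint <= 0x7F) return [codePoint]; // ASCII code point
  let count, offset;
  if (codePoint <= 0x7FF) { count = 1; offset = 0xC0; }
  else if (codePoint <= 0xFFFF) { count = 2; offset = 0xE0; }
  else { count = 3; offset = 0xF0; }
  // Lead byte: the top bits of the code point plus the length marker.
  const bytes = [(codePoint >> (6 * count)) + offset];
  while (count > 0) {
    // Each continuation byte carries the next six bits, prefixed with 10.
    const temp = codePoint >> (6 * (count - 1));
    bytes.push(0x80 | (temp & 0x3F));
    count--;
  }
  return bytes;
}
```

For U+20AC this gives a lead byte of (0x20AC >> 12) + 0xE0 = 0xE2, followed by 0x82 and 0xAC.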

9. 旧来の単byte~encoding

各~byteが[ 1個の~cpに対応するか, または対応する~cpは無い ]ような`~encoding$を `単byte~encoding@ と呼ぶ。 すべての`単byte~encoding$が、同じ[ `~decoder$, `~encoder$ ]を共有する。 `単byte~decoder$/`単byte~encoder$から参照される `単byte索引@ は、利用される`単byte~encoding$に依存し,次の一覧で定義される。 [ `ISO-8859-8^n, `ISO-8859-8-I^n ]を除くすべての`単byte~encoding$は、それぞれに一意な`索引$を持つ。 ◎ An encoding where each byte is either a single code point or nothing, is a single-byte encoding. Single-byte encodings share the decoder and encoder. Index single-byte, as referenced by the single-byte decoder and single-byte encoder, is defined by the following table, and depends on the single-byte encoding in use. All but two single-byte encodings have a unique index.

【 被覆域の~tableは巨大なことに注意。 】

`名前$ `索引$ 視覚化 基本多言語面( BMP )の被覆域
`IBM866@n `ibm866$idx
`ISO-8859-2@n `iso-8859-2$idx
`ISO-8859-3@n `iso-8859-3$idx
`ISO-8859-4@n `iso-8859-4$idx
`ISO-8859-5@n `iso-8859-5$idx
`ISO-8859-6@n `iso-8859-6$idx
`ISO-8859-7@n `iso-8859-7$idx
`ISO-8859-8@n `iso-8859-8$idx
`ISO-8859-8-I@n `iso-8859-8$idx
`ISO-8859-10@n `iso-8859-10$idx
`ISO-8859-13@n `iso-8859-13$idx
`ISO-8859-14@n `iso-8859-14$idx
`ISO-8859-15@n `iso-8859-15$idx
`ISO-8859-16@n `iso-8859-16$idx
`KOI8-R@n `koi8-r$idx
`KOI8-U@n `koi8-u$idx
`macintosh@n `macintosh$idx
`windows-874@n `windows-874$idx
`windows-1250@n `windows-1250$idx
`windows-1251@n `windows-1251$idx
`windows-1252@n `windows-1252$idx
`windows-1253@n `windows-1253$idx
`windows-1254@n `windows-1254$idx
`windows-1255@n `windows-1255$idx
`windows-1256@n `windows-1256$idx
`windows-1257@n `windows-1257$idx
`windows-1258@n `windows-1258$idx
`x-mac-cyrillic@n `x-mac-cyrillic$idx

注記: ~layout方向に波及することから、 `ISO-8859-8$n と `ISO-8859-8-I$n の`~encoding$の`名前$は異なるものにされている。 歴史的に、このことは `ISO-8859-6$n と "iso-8859-6-i" についても該当していたが、それは今や成立しない。 ◎ ISO-8859-8 and ISO-8859-8-I are distinct encoding names, because ISO-8859-8 has influence on the layout direction. And although historically this might have been the case for ISO-8859-6 and "iso-8859-6-i" as well, that is no longer true.

9.1. 単byte~decoder

`単byte~encoding$の`~decoder$の`~handler$は、所与の ( %~stream, %~byte ) に対し,次を走らす: ◎ Single-byte encodings’ decoder’s handler, given a stream and byte, runs these steps:

  1. ~IF[ %~byte ~EQ `~EoS$ ] ⇒ ~RET `完遂$ ◎ If byte is end-of-stream, return finished.
  2. ~IF[ %~byte ~IN `~ASCII~byte$ ] ⇒ ~RET ~cp [ %~byte ] ◎ If byte is an ASCII byte, return a code point whose value is byte.
  3. %~cp ~LET `単byte索引$ の中で ( %~byte ~MINUS `80^X ) が指す`索引~cp$ ◎ Let code point be the index code point for byte − 0x80 in index single-byte.
  4. ~IF[ %~cp ~EQ ~NULL ] ⇒ ~RET `~error$ ◎ If code point is null, return error.
  5. ~RET ~cp [ %~cp ] ◎ Return a code point whose value is code point.
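
The decoder steps above amount to a table lookup offset by 0x80. The sketch below assumes a deliberately tiny, illustrative excerpt of an index held in a `Map` (pointer to code point); `sampleDecodeIndex` and `singleByteDecode` are hypothetical names, and `null` stands in for error.

```javascript
// Illustrative two-entry excerpt, not a complete index:
// pointer 0x00 (byte 0x80) -> U+20AC, pointer 0x69 (byte 0xE9) -> U+00E9.
const sampleDecodeIndex = new Map([[0x00, 0x20AC], [0x69, 0x00E9]]);

// Hypothetical transcription of the single-byte decoder handler for one byte.
function singleByteDecode(byte, index) {
  if (byte <= 0x7F) return byte;            // ASCII bytes map to themselves
  const codePoint = index.get(byte - 0x80); // index code point for byte - 0x80
  return codePoint === undefined ? null : codePoint; // null: no mapping (error)
}
```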

9.2. 単byte~encoder

`単byte~encoding$ の`~encoder$の`~handler$は、所与の ( %~stream, %~cp ) に対し,次を走らす: ◎ Single-byte encodings’ encoder’s handler, given a stream and code point, runs these steps:

  1. ~IF[ %~cp ~EQ `~EoS$ ] ⇒ ~RET `完遂$ ◎ If code point is end-of-stream, return finished.
  2. ~IF[ %~cp ~IN `~ASCII~cp$ ] ⇒ ~RET ~byte列 [ %~cp ] ◎ If code point is an ASCII code point, return a byte whose value is code point.
  3. %~pointer ~LET `単byte索引$ の中で %~cp を指す`索引~pointer$ ◎ Let pointer be the index pointer for code point in index single-byte.
  4. ~IF[ %~pointer ~EQ ~NULL ] ⇒ ~RET %~cp を伴う`~error$ ◎ If pointer is null, return error with code point.
  5. ~RET ~byte列 [ %~pointer ~PLUS `80^X ] ◎ Return a byte whose value is pointer + 0x80.
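
The encoder is the reverse lookup: find the first pointer whose entry equals the code point, then add 0x80. The sketch below again assumes a tiny illustrative index excerpt; `sampleEncodeIndex`, `indexPointer`, and `singleByteEncode` are hypothetical names, and `null` stands in for "error with code point".

```javascript
// Illustrative two-entry excerpt (pointer -> code point), not a complete index.
const sampleEncodeIndex = new Map([[0x00, 0x20AC], [0x69, 0x00E9]]);

// First pointer whose entry is the given code point, or null if there is none.
function indexPointer(codePoint, index) {
  for (const [pointer, cp] of index) {
    if (cp === codePoint) return pointer;
  }
  return null;
}

// Hypothetical transcription of the single-byte encoder handler.
function singleByteEncode(codePoint, index) {
  if (codePoint <= 0x7F) return codePoint; // ASCII code point
  const pointer = indexPointer(codePoint, index);
  return pointer === null ? null : pointer + 0x80;
}
```

A linear scan keeps the sketch short; a real implementation would more likely precompute a reverse map.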

10. 旧来の複byte Chinese (簡体字) ~encoding

10.1. ~GBK

10.1.1. ~GBK~decoder

`GBK$nの`~decoder$は,`gb18030$nの`~decoder$である。 ◎ GBK’s decoder is gb18030’s decoder.

10.1.2. ~GBK~encoder

`GBK$nの`~encoder$は,[ `~GBK~flag$ ~SET ~ON ]にされた`gb18030$nの`~encoder$である。 ◎ GBK’s encoder is gb18030’s encoder with its GBK flag set.

注記: `GBK$nを`gb18030$nに対する全くの別名にしないのは、 `GBK$nの`~encoder$により生成された内容を,旧来の~serverや他の消費者をなるべく壊すことなく,保守的に移行するためである。 ◎ Not fully aliasing GBK with gb18030 is a conservative move to decrease the chances of breaking legacy servers and other consumers of content generated with GBK’s encoder.

10.2. ~gb18030

10.2.1. ~gb18030~decoder

`gb18030$n の`~decoder$の各~instanceには、次のものが結付けられる ⇒ `~gb1@(初期~時 0 )~BR `~gb2@(初期~時 0 )~BR `~gb3@(初期~時 0 ) ◎ gb18030’s decoder has an associated gb18030 first, gb18030 second, and gb18030 third (all initially 0x00).

`gb18030$n の`~decoder$の`~handler$は、所与の ( %~stream, %~byte ) に対し,次を走らす: ◎ gb18030’s decoder’s handler, given a stream and byte, runs these steps:

  1. ~IF[ %~byte ~EQ `~EoS$ ]: ◎ ↓

    1. ~IF[ ( `~gb1$, `~gb2$, `~gb3$ ) ~EQ ( `00^X, `00^X, `00^X ) ] ⇒ ~RET `完遂$ ◎ If byte is end-of-stream and gb18030 first, gb18030 second, and gb18030 third are 0x00, return finished.
    2. ( `~gb1$, `~gb2$, `~gb3$ ) ~SET ( `00^X, `00^X, `00^X ) ~BR ~RET `~error$ ◎ If byte is end-of-stream, and gb18030 first, gb18030 second, or gb18030 third is not 0x00, set gb18030 first, gb18030 second, and gb18030 third to 0x00, and return error.
  2. ~IF[ `~gb3$ ~NEQ `00^X ]: ◎ If gb18030 third is not 0x00, run these substeps:

    1. %~cp ~LET ~NULL ◎ Let code point be null.
    2. ~IF[ %~byte ~IN { `30^X 〜 `39^X } ] ⇒ %~cp ~SET [ (( `~gb1$ ~MINUS `81^X ) ~MUL ( 10 ~MUL 126 ~MUL 10 )) ~PLUS (( `~gb2$ ~MINUS `30^X ) ~MUL ( 10 ~MUL 126 )) ~PLUS (( `~gb3$ ~MINUS `81^X ) ~MUL 10 ) ~PLUS ( %~byte ~MINUS `30^X ) ]が指す`索引~gb18030範囲集~cp$ ◎ If byte is in the range 0x30 to 0x39, inclusive, set code point to the index gb18030 ranges code point for ((gb18030 first − 0x81) × (10 × 126 × 10)) + ((gb18030 second − 0x30) × (10 × 126)) + ((gb18030 third − 0x81) × 10) + byte − 0x30.
    3. %buffer ~LET ~byte列 [ `~gb2$, `~gb3$, %~byte ] ◎ Let buffer be a byte sequence consisting of gb18030 second, gb18030 third, and byte, in order.
    4. ( `~gb1$, `~gb2$, `~gb3$ ) ~SET ( `00^X, `00^X, `00^X ) ◎ Set gb18030 first, gb18030 second, and gb18030 third to 0x00.
    5. ~IF[ %~cp ~EQ ~NULL ] ⇒ %buffer を %~stream に`前付加する$ ~BR ~RET `~error$ ◎ If code point is null, prepend buffer to stream and return error.
    6. ~RET ~cp [ %~cp ] ◎ Return a code point whose value is code point.
  3. ~IF[ `~gb2$ ~NEQ `00^X ]: ◎ If gb18030 second is not 0x00, run these substeps:

    1. ~IF[ %~byte ~IN { `81^X 〜 `FE^X } ] ⇒ `~gb3$ ~SET %~byte ~BR ~RET `継続$ ◎ If byte is in the range 0x81 to 0xFE, inclusive, set gb18030 third to byte and return continue.
    2. ~byte列 [ `~gb2$, %~byte ] を %~stream に`前付加する$ ~BR ( `~gb1$, `~gb2$ ) ~SET ( `00^X, `00^X ) ~BR ~RET `~error$ ◎ Prepend gb18030 second followed by byte to stream, set gb18030 first and gb18030 second to 0x00, and return error.
  4. ~IF[ `~gb1$ ~NEQ `00^X ]: ◎ If gb18030 first is not 0x00, run these substeps:

    1. ~IF[ %~byte ~IN { `30^X 〜 `39^X } ] ⇒ `~gb2$ ~SET %~byte ~BR ~RET `継続$ ◎ If byte is in the range 0x30 to 0x39, inclusive, set gb18030 second to byte and return continue.
    2. %~lead ~LET `~gb1$ ~BR %~pointer ~LET ~NULL ~BR `~gb1$ ~SET `00^X ◎ Let lead be gb18030 first, let pointer be null, and set gb18030 first to 0x00.
    3. %~offset ~LET [ %~byte ~IN { `00^X 〜 `7E^X } ならば `40^X / ~ELSE_ `41^X ] ◎ Let offset be 0x40 if byte is less than 0x7F and 0x41 otherwise.
    4. ~IF[ %~byte ~IN { `40^X 〜 `7E^X, `80^X 〜 `FE^X } ] ⇒ %~pointer ~SET ( %~lead ~MINUS `81^X ) ~MUL 190 ~PLUS ( %~byte ~MINUS %~offset ) ◎ If byte is in the range 0x40 to 0x7E, inclusive, or 0x80 to 0xFE, inclusive, set pointer to (lead − 0x81) × 190 + (byte − offset).
    5. %~cp ~LET [ %~pointer ~EQ ~NULL ならば ~NULL / ~ELSE_ `索引~gb18030$ の中で %~pointer が指す`索引~cp$ ] ◎ Let code point be null if pointer is null and the index code point for pointer in index gb18030 otherwise.
    6. ~IF[ %~cp ~EQ ~NULL ]~AND[ %~byte ~IN `~ASCII~byte$ ] ⇒ %~byte を %~stream に`前付加する$ ◎ If code point is null and byte is an ASCII byte, prepend byte to stream.
    7. ~IF[ %~cp ~EQ ~NULL ] ⇒ ~RET `~error$ ◎ If code point is null, return error.
    8. ~RET ~cp [ %~cp ] ◎ Return a code point whose value is code point.
  5. %~byte に応じて: ◎ ↓

    `~ASCII~byte$
    ~RET ~cp [ %~byte ] ◎ If byte is an ASCII byte, return a code point whose value is byte.
    `80^X
    ~RET ~cp [ `20AC^U ] ◎ If byte is 0x80, return code point U+20AC.
    `81^X 〜 `FE^X
    `~gb1$ ~SET %~byte ~BR ~RET `継続$ ◎ If byte is in the range 0x81 to 0xFE, inclusive, set gb18030 first to byte and return continue.
    その他( `FF^X )
    ~RET `~error$ ◎ Return error.
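
The pointer arithmetic in the decoder is the part most easily gotten wrong, so it is sketched here in isolation. Both function names are hypothetical, the index lookups themselves are omitted, and the four-byte helper assumes its arguments have already passed the range checks in the steps above.

```javascript
// Pointer into index gb18030 ranges for a four-byte sequence.
// Assumes b1 and b3 are in 0x81..0xFE and b2 and b4 are in 0x30..0x39.
function gb18030FourBytePointer(b1, b2, b3, b4) {
  return (b1 - 0x81) * (10 * 126 * 10)
       + (b2 - 0x30) * (10 * 126)
       + (b3 - 0x81) * 10
       + (b4 - 0x30);
}

// Pointer into index gb18030 for a two-byte sequence, or null when the
// trail byte is outside 0x40..0x7E and 0x80..0xFE.
function gb18030TwoBytePointer(lead, byte) {
  const inRange = (byte >= 0x40 && byte <= 0x7E) || (byte >= 0x80 && byte <= 0xFE);
  if (!inRange) return null;
  const offset = byte < 0x7F ? 0x40 : 0x41; // 0x7F is skipped, hence two offsets
  return (lead - 0x81) * 190 + (byte - offset);
}
```

For example, the byte pair 0xA3 0xA0 mentioned in the encoder section gives (0xA3 − 0x81) × 190 + (0xA0 − 0x41) = 6555.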

10.2.2. ~gb18030~encoder

`gb18030$nの `~encoder$の各~instanceには、次のものが結付けられる ⇒ `~GBK~flag@(初期~時 ~OFF ) ◎ gb18030’s encoder has an associated GBK flag (initially unset).

`gb18030$n の`~encoder$の`~handler$は、所与の ( %~stream, %~cp ) に対し,次を走らす: ◎ gb18030’s encoder’s handler, given a stream and code point, runs these steps:

  1. ~IF[ %~cp ~EQ `~EoS$ ] ⇒ ~RET `完遂$ ◎ If code point is end-of-stream, return finished.
  2. ~IF[ %~cp ~IN `~ASCII~cp$ ] ⇒ ~RET ~byte列 [ %~cp ] ◎ If code point is an ASCII code point, return a byte whose value is code point.
  3. ~IF[ %~cp ~EQ `E5E5^U ] ⇒ ~RET %~cp を伴う`~error$ ◎ If code point is U+E5E5, return error with code point.

    注記: 配備済みの内容との互換性をとるため、 `索引~gb18030$ は[ `A3^X `A0^X ]を `E5E5^U ではなく `3000^U に対応付けている。 したがって往来できない。 ◎ Index gb18030 maps 0xA3 0xA0 to U+3000 rather than U+E5E5 for compatibility with deployed content. Therefore it cannot roundtrip.

  4. ~IF[ `~GBK~flag$ ~EQ ~ON ]~AND[ %~cp ~EQ `20AC^U ] ⇒ ~RET ~byte列 [ `80^X ] ◎ If the GBK flag is set and code point is U+20AC, return byte 0x80.
  5. %~pointer ~LET `索引~gb18030$ の中で %~cp を指す`索引~pointer$ ◎ Let pointer be the index pointer for code point in index gb18030.
  6. ~IF[ %~pointer ~NEQ ~NULL ]: ◎ If pointer is not null, run these substeps:

    1. %~lead ~LET floor( %~pointer ~DIV 190 ) ~PLUS `81^X ◎ Let lead be floor(pointer / 190) + 0x81.
    2. %~trail ~LET %~pointer ~MOD 190 ◎ Let trail be pointer % 190.
    3. %~offset ~LET [ %~trail ~IN { `00^X 〜 `3E^X } ならば `40^X / ~ELSE_ `41^X ] ◎ Let offset be 0x40 if trail is less than 0x3F and 0x41 otherwise.
    4. ~RET ~byte列 [ %~lead, ( %~trail ~PLUS %~offset ) ] ◎ Return two bytes whose values are lead and trail + offset.
  7. ~IF[ `~GBK~flag$ ~EQ ~ON ] ⇒ ~RET %~cp を伴う`~error$ ◎ If GBK flag is set, return error with code point.
  8. %~pointer ~SET %~cp を指す`索引~gb18030範囲集~pointer$ ◎ Set pointer to the index gb18030 ranges pointer for code point.
  9. %byte1 ~LET floor( %~pointer ~DIV ( 10 ~MUL 126 ~MUL 10 )) ◎ Let byte1 be floor(pointer / (10 × 126 × 10)).
  10. %~pointer ~SET %~pointer ~MOD ( 10 ~MUL 126 ~MUL 10 ) ◎ Set pointer to pointer % (10 × 126 × 10).
  11. %byte2 ~LET floor( %~pointer ~DIV ( 10 ~MUL 126 ) ) ◎ Let byte2 be floor(pointer / (10 × 126)).
  12. %~pointer ~SET %~pointer ~MOD ( 10 ~MUL 126 ) ◎ Set pointer to pointer % (10 × 126).
  13. %byte3 ~LET floor( %~pointer ~DIV 10 ) ◎ Let byte3 be floor(pointer / 10).
  14. %byte4 ~LET %~pointer ~MOD 10 ◎ Let byte4 be pointer % 10.
  15. ~RET ~byte列 [ ( %byte1 ~PLUS `81^X ), ( %byte2 ~PLUS `30^X ), ( %byte3 ~PLUS `81^X ), ( %byte4 ~PLUS `30^X ) ] ◎ Return four bytes whose values are byte1 + 0x81, byte2 + 0x30, byte3 + 0x81, byte4 + 0x30.
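
The encoder performs the inverse arithmetic. The sketch below isolates steps 6 and 9 to 15; the function names are hypothetical and the pointers are assumed to have been obtained from the respective indexes already.

```javascript
// Two bytes for a pointer found in index gb18030 (step 6).
function gb18030TwoBytes(pointer) {
  const lead = Math.floor(pointer / 190) + 0x81;
  const trail = pointer % 190;
  const offset = trail < 0x3F ? 0x40 : 0x41; // skip 0x7F in the trail range
  return [lead, trail + offset];
}

// Four bytes for an index gb18030 ranges pointer (steps 9 to 15).
function gb18030FourBytes(pointer) {
  const byte1 = Math.floor(pointer / (10 * 126 * 10));
  pointer %= 10 * 126 * 10;
  const byte2 = Math.floor(pointer / (10 * 126));
  pointer %= 10 * 126;
  const byte3 = Math.floor(pointer / 10);
  const byte4 = pointer % 10;
  return [byte1 + 0x81, byte2 + 0x30, byte3 + 0x81, byte4 + 0x30];
}
```

The two helpers mirror the decoder: pointer 6555 yields 0xA3 0xA0, and a ranges pointer of 0 yields 0x81 0x30 0x81 0x30.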

11. 旧来の複byte Chinese (繁体字)~encoding

11.1. ~Big5

11.1.1. ~Big5~decoder

`Big5$n の`~decoder$の各~instanceには、次のものが結付けられる ⇒ `~Big5~lead@(初期~時 `00^X ) ◎ Big5’s decoder has an associated Big5 lead (initially 0x00).

`Big5$n の`~decoder$の`~handler$は、所与の ( %~stream, %~byte ) に対し,次を走らす: ◎ Big5’s decoder’s handler, given a stream and byte, runs these steps:

  1. ~IF[ %~byte ~EQ `~EoS$ ]:

    1. ~IF[ `~Big5~lead$ ~NEQ `00^X ] ⇒ `~Big5~lead$ ~SET `00^X ~BR ~RET `~error$
    2. ~RET `完遂$
    ◎ If byte is end-of-stream and Big5 lead is not 0x00, set Big5 lead to 0x00 and return error. ◎ If byte is end-of-stream and Big5 lead is 0x00, return finished.
  2. ~IF[ `~Big5~lead$ ~NEQ `00^X ]: ◎ If Big5 lead is not 0x00, let lead be Big5 lead, let pointer be null, set Big5 lead to 0x00, and then run these substeps:

    1. %~lead ~LET `~Big5~lead$ ~BR %~pointer ~LET ~NULL ~BR `~Big5~lead$ ~SET `00^X ◎ ↑
    2. %~offset ~LET [ %~byte ~IN { `00^X 〜 `7E^X } ならば `40^X / ~ELSE_ `62^X ] ◎ Let offset be 0x40 if byte is less than 0x7F and 0x62 otherwise.
    3. ~IF[ %~byte ~IN { `40^X 〜 `7E^X, `A1^X 〜 `FE^X } ] ⇒ %~pointer ~SET ( %~lead ~MINUS `81^X ) ~MUL 157 ~PLUS ( %~byte ~MINUS %~offset ) ◎ If byte is in the range 0x40 to 0x7E, inclusive, or 0xA1 to 0xFE, inclusive, set pointer to (lead − 0x81) × 157 + (byte − offset).
    4. ~IF[ 下の表の中で, 1 列目が %~pointer に等しい行がある ] ⇒ ~RET 同じ行の 2 列目の 2 個の ~cpからなる`~token$列 ◎ If there is a row in the table below whose first column is pointer, return the two code points listed in its second column (the third column is irrelevant):

      ~pointer◎Pointer ~cp◎Code points 説明(この段には関係ない)◎Notes
      1133 `00CA^U `0304^U Ê̄ (LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND MACRON)
      1135 `00CA^U `030C^U Ê̌ (LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND CARON)
      1164 `00EA^U `0304^U ê̄ (LATIN SMALL LETTER E WITH CIRCUMFLEX AND MACRON)
      1166 `00EA^U `030C^U ê̌ (LATIN SMALL LETTER E WITH CIRCUMFLEX AND CARON)

      注記: `索引$ は単独の~cpに制限されるので、これらの~pointerにはこの表が利用される。 ◎ Since indexes are limited to single code points this table is used for these pointers.

    5. %~cp ~LET [ %~pointer ~EQ ~NULL ならば ~NULL / ~ELSE_ `索引~Big5$ の中で %~pointer が指す`索引~cp$ ] ◎ Let code point be null if pointer is null and the index code point for pointer in index Big5 otherwise.
    6. ~IF[ %~cp ~EQ ~NULL ]~AND[ %~byte ~IN `~ASCII~byte$ ] ⇒ %~byte を %~stream に`前付加する$ ◎ If code point is null and byte is an ASCII byte, prepend byte to stream.
    7. ~IF[ %~cp ~EQ ~NULL ] ⇒ ~RET `~error$ ◎ If code point is null, return error.
    8. ~RET ~cp [ %~cp ] ◎ Return a code point whose value is code point.
  3. ~IF[ %~byte ~IN `~ASCII~byte$ ] ⇒ ~RET ~cp [ %~byte ] ◎ If byte is an ASCII byte, return a code point whose value is byte.
  4. ~IF[ %~byte ~IN { `81^X 〜 `FE^X } ] ⇒ `~Big5~lead$ ~SET %~byte ~BR ~RET `継続$ ◎ If byte is in the range 0x81 to 0xFE, inclusive, set Big5 lead to byte and return continue.
  5. ~RET `~error$ ◎ Return error.
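
The pointer computation of the Big5 decoder (steps 2.2 and 2.3) and the two-code-point table (step 2.4) can be sketched as below. The names `big5_pointer` and `BIG5_TWO_CODE_POINTS` are hypothetical; the lookup in index Big5 itself is left out.

```python
from typing import Optional

# Pointers that decode to two code points (the table in step 2.4).
BIG5_TWO_CODE_POINTS = {
    1133: (0x00CA, 0x0304),
    1135: (0x00CA, 0x030C),
    1164: (0x00EA, 0x0304),
    1166: (0x00EA, 0x030C),
}


def big5_pointer(lead: int, byte: int) -> Optional[int]:
    """Return the pointer for a lead/trail pair, or None if the trail is invalid."""
    if not (0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE):
        return None  # pointer stays null
    # The offset closes the gap between 0x7E and 0xA1 in the trail byte range.
    offset = 0x40 if byte < 0x7F else 0x62
    return (lead - 0x81) * 157 + (byte - offset)


print(big5_pointer(0x88, 0x62))
```

Each lead byte covers 157 trail positions: 63 in the range 0x40 to 0x7E and 94 in the range 0xA1 to 0xFE.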

11.1.2. ~Big5~encoder

`Big5$n の`~encoder$の`~handler$は、所与の ( %~stream, %~cp ) に対し,次を走らす: ◎ Big5’s encoder’s handler, given a stream and code point, runs these steps:

  1. ~IF[ %~cp ~EQ `~EoS$ ] ⇒ ~RET `完遂$ ◎ If code point is end-of-stream, return finished.
  2. ~IF[ %~cp ~IN `~ASCII~cp$ ] ⇒ ~RET ~byte列 [ %~cp ] ◎ If code point is an ASCII code point, return a byte whose value is code point.
  3. %~pointer ~LET %~cp を指す`索引~Big5~pointer$ ◎ Let pointer be the index Big5 pointer for code point.
  4. ~IF[ %~pointer ~EQ ~NULL ] ⇒ ~RET %~cp を伴う`~error$ ◎ If pointer is null, return error with code point.
  5. %~lead ~LET floor( %~pointer ~DIV 157 ) ~PLUS `81^X ◎ Let lead be floor(pointer / 157) + 0x81.
  6. %~trail ~LET %~pointer ~MOD 157 ◎ Let trail be pointer % 157.
  7. %~offset ~LET [ %~trail ~IN { `00^X 〜 `3E^X } ならば `40^X / ~ELSE_ `62^X ] ◎ Let offset be 0x40 if trail is less than 0x3F and 0x62 otherwise.
  8. ~RET ~byte列 [ %~lead, ( %~trail ~PLUS %~offset) ] ◎ Return two bytes whose values are lead and trail + offset.
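
Steps 5 to 8 of the Big5 encoder invert the decoder's pointer arithmetic. A minimal sketch, assuming the pointer has already been found and is not null (the function name `big5_bytes` is hypothetical):

```python
def big5_bytes(pointer: int) -> bytes:
    """Turn a non-null index Big5 pointer into a lead/trail byte pair."""
    lead = pointer // 157 + 0x81
    trail = pointer % 157
    # Trail values 0x00 to 0x3E map to bytes 0x40 to 0x7E, the rest to 0xA1 to 0xFE.
    offset = 0x40 if trail < 0x3F else 0x62
    return bytes([lead, trail + offset])


print(big5_bytes(1133).hex())
```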

12. 旧来の複byte Japanese ~encoding

12.1. ~EUC-JP

12.1.1. ~EUC-JP~decoder

`EUC-JP$n の`~decoder$の各~instanceには、次のものが結付けられる ⇒ `~EUC-JP~jis0212~flag@(初期~時 ~OFF )~BR `~EUC-JP~lead@(初期~時 `00^X) ◎ EUC-JP’s decoder has an associated EUC-JP jis0212 flag (initially unset) and EUC-JP lead (initially 0x00).

`EUC-JP$n の`~decoder$の`~handler$は、所与の ( %~stream, %~byte ) に対し,次を走らす: ◎ EUC-JP’s decoder’s handler, given a stream and byte, runs these steps:

  1. ~IF[ %~byte ~EQ `~EoS$ ]: ◎ ↓

    1. ~IF[ `~EUC-JP~lead$ ~NEQ `00^X ] ⇒ `~EUC-JP~lead$ ~SET `00^X ~BR ~RET `~error$ ◎ If byte is end-of-stream and EUC-JP lead is not 0x00, set EUC-JP lead to 0x00, and return error.
    2. ~ELSE ⇒ ~RET `完遂$ ◎ If byte is end-of-stream and EUC-JP lead is 0x00, return finished.
  2. ~IF[ `~EUC-JP~lead$ ~EQ `8E^X ]~AND[ %~byte ~IN { `A1^X 〜 `DF^X } ] ⇒ `~EUC-JP~lead$ ~SET `00^X ~BR ~RET ~cp [ `FF61^X ~MINUS `A1^X ~PLUS %~byte ] ◎ If EUC-JP lead is 0x8E and byte is in the range 0xA1 to 0xDF, inclusive, set EUC-JP lead to 0x00 and return a code point whose value is 0xFF61 − 0xA1 + byte.
  3. ~IF[ `~EUC-JP~lead$ ~EQ `8F^X ]~AND[ %~byte ~IN { `A1^X 〜 `FE^X } ] ⇒ `~EUC-JP~jis0212~flag$ ~SET ~ON ~BR `~EUC-JP~lead$ ~SET %~byte ~BR ~RET `継続$ ◎ If EUC-JP lead is 0x8F and byte is in the range 0xA1 to 0xFE, inclusive, set the EUC-JP jis0212 flag, set EUC-JP lead to byte, and return continue.
  4. ~IF[ `~EUC-JP~lead$ ~NEQ `00^X ]: ◎ If EUC-JP lead is not 0x00, let lead be EUC-JP lead, set EUC-JP lead to 0x00, and run these substeps:

    1. %~lead ~LET `~EUC-JP~lead$ ~BR `~EUC-JP~lead$ ~SET `00^X ◎ ↑
    2. %~cp ~LET ~NULL ◎ Let code point be null.
    3. ~IF[ %~lead, %~byte がいずれも ~IN { `A1^X 〜 `FE^X } ] ⇒ %索引 ~LET [ `~EUC-JP~jis0212~flag$ ~EQ ~OFF ならば `索引~jis0208$ / ~ON ならば `索引~jis0212$ ] ~BR %~cp ~SET %索引 の中で ( ( %~lead ~MINUS `A1^X ) ~MUL 94 ~PLUS %~byte ~MINUS `A1^X ) が指す`索引~cp$ ◎ If lead and byte are both in the range 0xA1 to 0xFE, inclusive, set code point to the index code point for (lead − 0xA1) × 94 + byte − 0xA1 in index jis0208 if the EUC-JP jis0212 flag is unset and in index jis0212 otherwise.
    4. `~EUC-JP~jis0212~flag$ ~SET ~OFF ◎ Unset the EUC-JP jis0212 flag.
    5. ~IF[ %~byte ~NIN { `A1^X 〜 `FE^X } ] ⇒ %~byte を %~stream に`前付加する$ ◎ If byte is not in the range 0xA1 to 0xFE, inclusive, prepend byte to stream.
    6. ~IF[ %~cp ~EQ ~NULL ] ⇒ ~RET `~error$ ◎ If code point is null, return error.
    7. ~RET ~cp [ %~cp ] ◎ Return a code point whose value is code point.
  5. ~IF[ %~byte ~IN `~ASCII~byte$ ] ⇒ ~RET ~cp [ %~byte ] ◎ If byte is an ASCII byte, return a code point whose value is byte.
  6. ~IF[ %~byte ~IN { `8E^X, `8F^X, `A1^X 〜 `FE^X } ] ⇒ `~EUC-JP~lead$ ~SET %~byte ~BR ~RET `継続$ ◎ If byte is 0x8E, 0x8F, or in the range 0xA1 to 0xFE, inclusive, set EUC-JP lead to byte and return continue.
  7. ~RET `~error$ ◎ Return error.
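
The two pieces of arithmetic in the EUC-JP decoder, the half-width katakana mapping (step 2) and the pointer computation (step 4.3), can be sketched as follows. Both function names are hypothetical, and the lookup in index jis0208 or index jis0212 is not shown.

```python
from typing import Optional


def eucjp_halfwidth_katakana(byte: int) -> int:
    """Code point for a byte in 0xA1..0xDF that follows the lead byte 0x8E."""
    return 0xFF61 - 0xA1 + byte


def eucjp_pointer(lead: int, byte: int) -> Optional[int]:
    """Pointer for a two-byte sequence, or None if either byte is out of range."""
    if 0xA1 <= lead <= 0xFE and 0xA1 <= byte <= 0xFE:
        return (lead - 0xA1) * 94 + byte - 0xA1
    return None


print(hex(eucjp_halfwidth_katakana(0xB1)), eucjp_pointer(0xA4, 0xA2))
```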

12.1.2. ~EUC-JP~encoder

`EUC-JP$n の`~encoder$の`~handler$は、所与の ( %~stream, %~cp ) に対し,次を走らす: ◎ EUC-JP’s encoder’s handler, given a stream and code point, runs these steps:

  1. %~cp に応じて: ◎ ↓

    `~EoS$
    ~RET `完遂$ ◎ If code point is end-of-stream, return finished.
    `~ASCII~cp$
    ~RET ~byte列 [ %~cp ] ◎ If code point is an ASCII code point, return a byte whose value is code point.
    `00A5^U
    ~RET ~byte列 [ `5C^X ] ◎ If code point is U+00A5, return byte 0x5C.
    `203E^U
    ~RET ~byte列 [ `7E^X ] ◎ If code point is U+203E, return byte 0x7E.
    `FF61^U 〜 `FF9F^U
    ~RET ~byte列 [ `8E^X, ( %~cp ~MINUS `FF61^X ~PLUS `A1^X ) ] ◎ If code point is in the range U+FF61 to U+FF9F, inclusive, return two bytes whose values are 0x8E and code point − 0xFF61 + 0xA1.
    ~OTHER
    何もしない
  2. ~IF[ %~cp ~EQ `2212^U ] ⇒ %~cp ~SET `FF0D^U ◎ If code point is U+2212, set it to U+FF0D.
  3. %~pointer ~LET `索引~jis0208$ の中で %~cp を指す`索引~pointer$ ◎ Let pointer be the index pointer for code point in index jis0208.

    注記: %~pointer は、 ~NULL でなければ,`索引~jis0208$と~pointer演算の資質に因り 8836 未満になる。 ◎ If pointer is non-null, it is less than 8836 due to the nature of index jis0208 and the index pointer operation.

  4. ~IF[ %~pointer ~EQ ~NULL ] ⇒ ~RET %~cp を伴う`~error$ ◎ If pointer is null, return error with code point.
  5. %~lead ~LET floor( %~pointer ~DIV 94 ) ~PLUS `A1^X ◎ Let lead be floor(pointer / 94) + 0xA1.
  6. %~trail ~LET ( %~pointer ~MOD 94 ) ~PLUS `A1^X ◎ Let trail be pointer % 94 + 0xA1.
  7. ~RET ~byte列 [ %~lead, %~trail ] ◎ Return two bytes whose values are lead and trail.
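
The byte arithmetic of the EUC-JP encoder can be sketched as below, assuming the pointer from index jis0208 is already known and non-null. The function names are hypothetical.

```python
def eucjp_katakana_bytes(code_point: int) -> bytes:
    """Two bytes for a code point in U+FF61..U+FF9F (step 1)."""
    return bytes([0x8E, code_point - 0xFF61 + 0xA1])


def eucjp_bytes(pointer: int) -> bytes:
    """Lead and trail bytes for a non-null index jis0208 pointer (steps 5 to 7)."""
    lead = pointer // 94 + 0xA1
    trail = pointer % 94 + 0xA1
    return bytes([lead, trail])


print(eucjp_bytes(283).hex(), eucjp_katakana_bytes(0xFF71).hex())
```

Because the pointer is less than 8836 (94 × 94), the lead byte never exceeds 0xFE.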

12.2. ~ISO-2022-JP

12.2.1. ~ISO-2022-JP~decoder

`ISO-2022-JP$n の`~decoder$の各~instanceには、次のものが結付けられる ⇒ `~ISO-2022-JP~decoder状態@(初期~時 `ASCII$i)~BR `~ISO-2022-JP~decoder出力~状態@(初期~時 `ASCII$i )~BR `~ISO-2022-JP~lead@(初期~時 `00^X )~BR `~ISO-2022-JP出力~flag@(初期~時 ~OFF ) ◎ ISO-2022-JP’s decoder has an associated ISO-2022-JP decoder state (initially ASCII), ISO-2022-JP decoder output state (initially ASCII), ISO-2022-JP lead (initially 0x00), and ISO-2022-JP output flag (initially unset).

`ISO-2022-JP$n の`~decoder$の`~handler$は、所与の ( %~stream, %~byte ) に対し,`~ISO-2022-JP~decoder状態$に応じて 次を走らす: ◎ ISO-2022-JP’s decoder’s handler, given a stream and byte, runs these steps, switching on ISO-2022-JP decoder state:

`ASCII@i

%~byte に応じて: ◎ Based on byte:

`1B^X
`~ISO-2022-JP~decoder状態$ ~SET `~escape開始$i ~BR ~RET `継続$ ◎ Set ISO-2022-JP decoder state to escape start and return continue.
`~ASCII~byte$ — ただし, `0E^X, `0F^X, `1B^X は除く
`~ISO-2022-JP出力~flag$ ~SET ~OFF ~BR ~RET ~cp [ %~byte ] ◎ Unset the ISO-2022-JP output flag and return a code point whose value is byte.
`~EoS$
~RET `完遂$ ◎ Return finished.
~OTHER
`~ISO-2022-JP出力~flag$ ~SET ~OFF ~BR ~RET `~error$ ◎ Unset the ISO-2022-JP output flag and return error.
`Roman@i

%~byte に応じて: ◎ Based on byte:

`1B^X
`~ISO-2022-JP~decoder状態$ ~SET `~escape開始$i ~BR ~RET `継続$ ◎ Set ISO-2022-JP decoder state to escape start and return continue.
`5C^X
`~ISO-2022-JP出力~flag$ ~SET ~OFF ~BR ~RET ~cp [ `00A5^U ] ◎ Unset the ISO-2022-JP output flag and return code point U+00A5.
`7E^X
`~ISO-2022-JP出力~flag$ ~SET ~OFF ~BR ~RET ~cp [ `203E^U ] ◎ Unset the ISO-2022-JP output flag and return code point U+203E.
`~ASCII~byte$ — ただし, `0E^X, `0F^X, `1B^X, `5C^X, `7E^X は除く
`~ISO-2022-JP出力~flag$ ~SET ~OFF ~BR ~RET ~cp [ %~byte ] ◎ Unset the ISO-2022-JP output flag and return a code point whose value is byte.
`~EoS$
~RET `完遂$ ◎ Return finished.
~OTHER
`~ISO-2022-JP出力~flag$ ~SET ~OFF ~BR ~RET `~error$ ◎ Unset the ISO-2022-JP output flag and return error.
`Katakana@i

%~byte に応じて: ◎ Based on byte:

`1B^X
`~ISO-2022-JP~decoder状態$ ~SET `~escape開始$i ~BR ~RET `継続$ ◎ Set ISO-2022-JP decoder state to escape start and return continue.
`21^X 〜 `5F^X
`~ISO-2022-JP出力~flag$ ~SET ~OFF ~BR ~RET ~cp [ `FF61^X ~MINUS `21^X ~PLUS %~byte ] ◎ Unset the ISO-2022-JP output flag and return a code point whose value is 0xFF61 − 0x21 + byte.
`~EoS$
~RET `完遂$ ◎ Return finished.
~OTHER
`~ISO-2022-JP出力~flag$ ~SET ~OFF ~BR ~RET `~error$ ◎ Unset the ISO-2022-JP output flag and return error.
`~lead~byte@i

%~byte に応じて: ◎ Based on byte:

`1B^X
`~ISO-2022-JP~decoder状態$ ~SET `~escape開始$i ~BR ~RET `継続$ ◎ Set ISO-2022-JP decoder state to escape start and return continue.
`21^X 〜 `7E^X
`~ISO-2022-JP出力~flag$ ~SET ~OFF ~BR `~ISO-2022-JP~lead$ ~SET %~byte ~BR `~ISO-2022-JP~decoder状態$ ~SET `~trail~byte$i ~BR ~RET `継続$ ◎ Unset the ISO-2022-JP output flag, set ISO-2022-JP lead to byte, ISO-2022-JP decoder state to trail byte, and return continue.
`~EoS$
~RET `完遂$ ◎ Return finished.
~OTHER
`~ISO-2022-JP出力~flag$ ~SET ~OFF ~BR ~RET `~error$ ◎ Unset the ISO-2022-JP output flag and return error.
`~trail~byte@i

%~byte に応じて: ◎ Based on byte:

`1B^X

`~ISO-2022-JP~decoder状態$ ~SET `~escape開始$i ~BR ~RET `~error$ ◎ Set ISO-2022-JP decoder state to escape start and return error.

`21^X 〜 `7E^X
  1. `~ISO-2022-JP~decoder状態$ ~SET `~lead~byte$i ◎ Set the ISO-2022-JP decoder state to lead byte.
  2. %~pointer ~LET ( `~ISO-2022-JP~lead$ ~MINUS `21^X ) ~MUL 94 ~PLUS %~byte ~MINUS `21^X ◎ Let pointer be (ISO-2022-JP lead − 0x21) × 94 + byte − 0x21.
  3. %~cp ~LET `索引~jis0208$ の中で %~pointer が指す`索引~cp$ ◎ Let code point be the index code point for pointer in index jis0208.
  4. ~IF[ %~cp ~EQ ~NULL ] ⇒ ~RET `~error$ ◎ If code point is null, return error.
  5. ~RET ~cp [ %~cp ] ◎ Return a code point whose value is code point.
`~EoS$
`~ISO-2022-JP~decoder状態$ ~SET `~lead~byte$i ~BR %~byte を %~stream に`前付加する$ ~BR ~RET `~error$ ◎ Set the ISO-2022-JP decoder state to lead byte, prepend byte to stream, and return error.
~OTHER
`~ISO-2022-JP~decoder状態$ ~SET `~lead~byte$i ~BR ~RET `~error$ ◎ Set ISO-2022-JP decoder state to lead byte and return error.
`~escape開始@i
  1. ~IF[ %~byte ~IN { `24^X, `28^X } ] ⇒ `~ISO-2022-JP~lead$ ~SET %~byte ~BR `~ISO-2022-JP~decoder状態$ ~SET `~escape$i ~BR ~RET `継続$ ◎ If byte is either 0x24 or 0x28, set ISO-2022-JP lead to byte, ISO-2022-JP decoder state to escape, and return continue.
  2. %~byte を %~stream に`前付加する$ ◎ Prepend byte to stream.
  3. `~ISO-2022-JP出力~flag$ ~SET ~OFF ~BR `~ISO-2022-JP~decoder状態$ ~SET `~ISO-2022-JP~decoder出力~状態$ ~BR ~RET `~error$ ◎ Unset the ISO-2022-JP output flag, set ISO-2022-JP decoder state to ISO-2022-JP decoder output state, and return error.
`~escape@i
  1. %~lead ~LET `~ISO-2022-JP~lead$ ~BR `~ISO-2022-JP~lead$ ~SET `00^X ◎ Let lead be ISO-2022-JP lead and set ISO-2022-JP lead to 0x00.
  2. %状態 ~LET ~NULL ◎ Let state be null.
  3. ~IF[ %~lead ~EQ `28^X ]~AND[ %~byte ~EQ `42^X ] ⇒ %状態 ~SET `ASCII$i ◎ If lead is 0x28 and byte is 0x42, set state to ASCII.
  4. ~IF[ %~lead ~EQ `28^X ]~AND[ %~byte ~EQ `4A^X ] ⇒ %状態 ~SET `Roman$i ◎ If lead is 0x28 and byte is 0x4A, set state to Roman.
  5. ~IF[ %~lead ~EQ `28^X ]~AND[ %~byte ~EQ `49^X ] ⇒ %状態 ~SET `Katakana$i ◎ If lead is 0x28 and byte is 0x49, set state to Katakana.
  6. ~IF[ %~lead ~EQ `24^X ]~AND[ %~byte ~IN { `40^X, `42^X } ] ⇒ %状態 ~SET `~lead~byte$i ◎ If lead is 0x24 and byte is either 0x40 or 0x42, set state to lead byte.
  7. ~IF[ %状態 ~NEQ ~NULL ]: ◎ If state is non-null, run these substeps:

    1. `~ISO-2022-JP~decoder状態$ ~SET %状態 ~BR `~ISO-2022-JP~decoder出力~状態$ ~SET %状態 ◎ Set ISO-2022-JP decoder state and ISO-2022-JP decoder output state to state.
    2. %出力~flag ~LET `~ISO-2022-JP出力~flag$ ◎ Let output flag be the ISO-2022-JP output flag.
    3. `~ISO-2022-JP出力~flag$ ~SET ~ON ◎ Set the ISO-2022-JP output flag.
    4. ~RET [ %出力~flag ~EQ ~OFF ならば `継続$ / ~ELSE_ `~error$ ] ◎ Return continue, if output flag is unset, and error otherwise.
  8. ~byte列 [ %~lead, %~byte ] を %~stream に`前付加する$ ◎ Prepend lead and byte to stream.
  9. `~ISO-2022-JP出力~flag$ ~SET ~OFF ~BR `~ISO-2022-JP~decoder状態$ ~SET `~ISO-2022-JP~decoder出力~状態$ ~BR ~RET `~error$ ◎ Unset the ISO-2022-JP output flag, set ISO-2022-JP decoder state to ISO-2022-JP decoder output state and return error.
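
Steps 3 to 7 of the escape state amount to a table lookup on the two bytes that follow 0x1B, plus a check of the output flag. A minimal sketch, with hypothetical names `ESCAPE_STATES`, `resolve_escape`, and `escape_result`:

```python
from typing import Optional

# (lead, byte) after 0x1B -> decoder state selected by the escape sequence.
ESCAPE_STATES = {
    (0x28, 0x42): "ASCII",      # ESC ( B
    (0x28, 0x4A): "Roman",      # ESC ( J
    (0x28, 0x49): "Katakana",   # ESC ( I
    (0x24, 0x40): "lead byte",  # ESC $ @
    (0x24, 0x42): "lead byte",  # ESC $ B
}


def resolve_escape(lead: int, byte: int) -> Optional[str]:
    """Return the new decoder state, or None for an unrecognized sequence."""
    return ESCAPE_STATES.get((lead, byte))


def escape_result(output_flag_was_set: bool) -> str:
    """Step 7.4: a second escape sequence with no output in between is an error."""
    return "error" if output_flag_was_set else "continue"


print(resolve_escape(0x24, 0x42), escape_result(False))
```

When `resolve_escape` yields None, steps 8 and 9 apply: both bytes are prepended to the stream, the decoder falls back to its output state, and an error is returned.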

12.2.2. ~ISO-2022-JP~encoder

`ISO-2022-JP$n の`~encoder$の各~instanceには、次のものが結付けられる ⇒ `~ISO-2022-JP~encoder状態@ — これは,[ `~ASCII@i / `~Roman@i / `~jis0208@i ]のいずれかをとり得る(初期~時 `~ASCII$i )。 ◎ ISO-2022-JP’s encoder has an associated ISO-2022-JP encoder state which is ASCII, Roman, or jis0208 (initially ASCII).

`ISO-2022-JP$n の`~encoder$の`~handler$は、所与の ( %~stream, %~cp ) に対し,次を走らす: ◎ ISO-2022-JP’s encoder’s handler, given a stream and code point, runs these steps:

  1. ~IF[ %~cp ~EQ `~EoS$ ]: ◎ ↓

    1. ~IF[ `~ISO-2022-JP~encoder状態$ ~NEQ `~ASCII$i ] ⇒ %~cp を %~stream に`前付加する$ ~BR `~ISO-2022-JP~encoder状態$ ~SET `~ASCII$i ~BR ~RET ~byte列 [ `1B^X, `28^X, `42^X ] ◎ If code point is end-of-stream and ISO-2022-JP encoder state is not ASCII, prepend code point to stream, set ISO-2022-JP encoder state to ASCII, and return three bytes 0x1B 0x28 0x42.
    2. ~RET `完遂$ ◎ If code point is end-of-stream and ISO-2022-JP encoder state is ASCII, return finished.
  2. ~IF[ `~ISO-2022-JP~encoder状態$ ~IN { `~ASCII$i, `~Roman$i } ]~AND[ %~cp ~IN { `000E^U, `000F^U, `001B^U } ] ⇒ ~RET `FFFD^U を伴う`~error$ ◎ If ISO-2022-JP encoder state is ASCII or Roman, and code point is U+000E, U+000F, or U+001B, return error with U+FFFD.

    攻撃を防ぐため、これは %~cp ではなく, `FFFD^U を返す。 ◎ This returns U+FFFD rather than the code point to prevent attacks.

  3. ~IF[ `~ISO-2022-JP~encoder状態$ ~EQ `~ASCII$i ]~AND[ %~cp ~IN `~ASCII~cp$ ] ⇒ ~RET ~byte列 [ %~cp ] ◎ If ISO-2022-JP encoder state is ASCII and code point is an ASCII code point, return a byte whose value is code point.
  4. ~IF[ `~ISO-2022-JP~encoder状態$ ~EQ `~Roman$i ] ⇒ %~cp に応じて: ◎ If ISO-2022-JP encoder state is Roman and code point is an ASCII code point, excluding U+005C and U+007E, or is U+00A5 or U+203E, run these substeps:

    `~ASCII~cp$ — ただし, `005C^U, `007E^U は除外する
    ~RET ~byte列 [ %~cp ] ◎ If code point is an ASCII code point, return a byte whose value is code point.
    `00A5^U
    ~RET ~byte列 [ `5C^X ] ◎ If code point is U+00A5, return byte 0x5C.
    `203E^U
    ~RET ~byte列 [ `7E^X ] ◎ If code point is U+203E, return byte 0x7E.
    ~OTHER
    何もしない
  5. ~IF[ %~cp ~IN `~ASCII~cp$ ]~AND[ `~ISO-2022-JP~encoder状態$ ~NEQ `~ASCII$i ] ⇒ %~cp を %~stream に`前付加する$ ~BR `~ISO-2022-JP~encoder状態$ ~SET `~ASCII$i ~BR ~RET ~byte列 [ `1B^X, `28^X, `42^X ] ◎ If code point is an ASCII code point, and ISO-2022-JP encoder state is not ASCII, prepend code point to stream, set ISO-2022-JP encoder state to ASCII, and return three bytes 0x1B 0x28 0x42.
  6. ~IF[ %~cp ~IN { `00A5^U, `203E^U } ]~AND[ `~ISO-2022-JP~encoder状態$ ~NEQ `~Roman$i ] ⇒ %~cp を %~stream に`前付加する$ ~BR `~ISO-2022-JP~encoder状態$ ~SET `~Roman$i ~BR ~RET ~byte列 [ `1B^X, `28^X, `4A^X ] ◎ If code point is either U+00A5 or U+203E, and ISO-2022-JP encoder state is not Roman, prepend code point to stream, set ISO-2022-JP encoder state to Roman, and return three bytes 0x1B 0x28 0x4A.
  7. ~IF[ %~cp ~EQ `2212^U ] ⇒ %~cp ~SET `FF0D^U ◎ If code point is U+2212, set it to U+FF0D.
  8. %~pointer ~LET `索引~jis0208$ の中で %~cp を指す`索引~pointer$ ◎ Let pointer be the index pointer for code point in index jis0208.

    注記: %~pointer は、 ~NULL でなければ,`索引~jis0208$と~pointer演算の資質に因り 8836 未満になる。 ◎ If pointer is non-null, it is less than 8836 due to the nature of index jis0208 and the index pointer operation.

  9. ~IF[ %~pointer ~EQ ~NULL ] ⇒ ~RET %~cp を伴う`~error$ ◎ If pointer is null, return error with code point.
  10. ~IF[ `~ISO-2022-JP~encoder状態$ ~NEQ `~jis0208$i ] ⇒ %~cp を %~stream に`前付加する$ ~BR `~ISO-2022-JP~encoder状態$ ~SET `~jis0208$i ~BR ~RET ~byte列 [ `1B^X, `24^X, `42^X ] ◎ If ISO-2022-JP encoder state is not jis0208, prepend code point to stream, set ISO-2022-JP encoder state to jis0208, and return three bytes 0x1B 0x24 0x42.
  11. %~lead ~LET floor( %~pointer ~DIV 94 ) ~PLUS `21^X ◎ Let lead be floor(pointer / 94) + 0x21.
  12. %~trail ~LET ( %~pointer ~MOD 94 ) ~PLUS `21^X ◎ Let trail be pointer % 94 + 0x21.
  13. ~RET ~byte列 [ %~lead, %~trail ] ◎ Return two bytes whose values are lead and trail.
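
The encoder steps above can be sketched as a single loop. This is an illustration under stated assumptions, not a conforming implementation: `iso2022jp_encode` is a hypothetical name, index jis0208 is replaced by a caller-supplied dictionary `index_pointer` mapping code points to pointers, errors are raised as `ValueError` instead of being passed to an error mode, and "prepend code point to stream" is modeled by re-running the inner loop on the same code point.

```python
def iso2022jp_encode(text: str, index_pointer: dict) -> bytes:
    state = "ASCII"
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        while True:  # one pass per handler invocation for this code point
            # Step 2: refuse shift and escape characters.
            if state in ("ASCII", "Roman") and cp in (0x0E, 0x0F, 0x1B):
                raise ValueError("error with U+FFFD")
            # Step 3.
            if state == "ASCII" and cp < 0x80:
                out.append(cp)
                break
            # Step 4.
            if state == "Roman":
                if cp < 0x80 and cp not in (0x5C, 0x7E):
                    out.append(cp)
                    break
                if cp == 0x00A5:
                    out.append(0x5C)
                    break
                if cp == 0x203E:
                    out.append(0x7E)
                    break
            # Step 5: switch to ASCII, then handle the code point again.
            if cp < 0x80 and state != "ASCII":
                state = "ASCII"
                out += b"\x1b(B"
                continue
            # Step 6: switch to Roman, then handle the code point again.
            if cp in (0x00A5, 0x203E) and state != "Roman":
                state = "Roman"
                out += b"\x1b(J"
                continue
            # Steps 7 to 9.
            if cp == 0x2212:
                cp = 0xFF0D
            pointer = index_pointer.get(cp)
            if pointer is None:
                raise ValueError("error with code point U+%04X" % cp)
            # Step 10: switch to jis0208, then handle the code point again.
            if state != "jis0208":
                state = "jis0208"
                out += b"\x1b$B"
                continue
            # Steps 11 to 13.
            out += bytes([pointer // 94 + 0x21, pointer % 94 + 0x21])
            break
    # Step 1: at end-of-stream, return to ASCII if needed.
    if state != "ASCII":
        out += b"\x1b(B"
    return bytes(out)


# Hypothetical one-entry index: U+3042 at pointer 283.
print(iso2022jp_encode("a\u3042b", {0x3042: 283}))
```

Every state change emits its escape sequence first and only then encodes the pending code point, which keeps the output valid even if encoding stops right after the switch.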

12.3. ~Shift_JIS

12.3.1. ~Shift_JIS~decoder

`Shift_JIS$n の`~decoder$の各~instanceには、次のものが結付けられる ⇒ `~Shift_JIS~lead@(初期~時 `00^X ) ◎ Shift_JIS’s decoder has an associated Shift_JIS lead (initially 0x00).

`Shift_JIS$n の`~decoder$の`~handler$は、所与の ( %~stream, %~byte ) に対し,次を走らす: ◎ Shift_JIS’s decoder’s handler, given a stream and byte, runs these steps:

  1. ~IF[ %~byte ~EQ `~EoS$ ]: ◎ ↓

    1. ~IF[ `~Shift_JIS~lead$ ~NEQ `00^X ] ⇒ `~Shift_JIS~lead$ ~SET `00^X ~BR ~RET `~error$ ◎ If byte is end-of-stream and Shift_JIS lead is not 0x00, set Shift_JIS lead to 0x00 and return error.
    2. ~ELSE ⇒ ~RET `完遂$ ◎ If byte is end-of-stream and Shift_JIS lead is 0x00, return finished.
  2. ~IF[ `~Shift_JIS~lead$ ~NEQ `00^X ]: ◎ If Shift_JIS lead is not 0x00, let lead be Shift_JIS lead, let pointer be null, set Shift_JIS lead to 0x00, and then run these substeps:

    1. %~lead ~LET `~Shift_JIS~lead$ ~BR %~pointer ~LET ~NULL ~BR `~Shift_JIS~lead$ ~SET `00^X ◎ ↑
    2. %~offset ~LET [ %~byte ~IN { `00^X 〜 `7E^X } ならば `40^X / ~ELSE_ `41^X ] ◎ Let offset be 0x40, if byte is less than 0x7F, and 0x41 otherwise.
    3. %~lead~offset ~LET [ %~lead ~IN { `00^X 〜 `9F^X } ならば `81^X / ~ELSE_ `C1^X ] ◎ Let lead offset be 0x81, if lead is less than 0xA0, and 0xC1 otherwise.
    4. ~IF[ %~byte ~IN { `40^X 〜 `7E^X, `80^X 〜 `FC^X } ] ⇒ %~pointer ~SET ( %~lead ~MINUS %~lead~offset ) ~MUL 188 ~PLUS %~byte ~MINUS %~offset ◎ If byte is in the range 0x40 to 0x7E, inclusive, or 0x80 to 0xFC, inclusive, set pointer to (lead − lead offset) × 188 + byte − offset.
    5. ~IF[ %~pointer ~IN { 8836 〜 10715 } ] ⇒ ~RET ~cp [ `E000^X ~MINUS 8836 ~PLUS %~pointer ] ◎ If pointer is in the range 8836 to 10715, inclusive, return a code point whose value is 0xE000 − 8836 + pointer.

      注記: これは EUDC として周知の,旧来の Windows によるものと相互運用可能にする。 ◎ This is interoperable legacy from Windows known as EUDC.

      【 EUDC — いわゆる外字のための機能。 】【 8836 = 94 ~MUL 94 は~Shift_JIS( JIS X 0208 )の 区点番号 の総数。 結果の~cpは~Unicode私用領域に入る。 】

    6. %~cp ~LET [ %~pointer ~EQ ~NULL ならば ~NULL / ~ELSE_ `索引~jis0208$ の中で %~pointer が指す`索引~cp$ ] ◎ Let code point be null, if pointer is null, and the index code point for pointer in index jis0208 otherwise.
    7. ~IF[ %~cp ~EQ ~NULL ]~AND[ %~byte ~IN `~ASCII~byte$ ] ⇒ %~byte を %~stream に`前付加する$ ◎ If code point is null and byte is an ASCII byte, prepend byte to stream.
    8. ~IF[ %~cp ~EQ ~NULL ] ⇒ ~RET `~error$ ◎ If code point is null, return error.
    9. ~RET ~cp [ %~cp ] ◎ Return a code point whose value is code point.
  3. ~IF[ %~byte ~IN { `~ASCII~byte$, `80^X} ] ⇒ ~RET ~cp [ %~byte ] ◎ If byte is an ASCII byte or 0x80, return a code point whose value is byte.
  4. ~IF[ %~byte ~IN { `A1^X 〜 `DF^X } ] ⇒ ~RET ~cp [ `FF61^X ~PLUS ( %~byte ~MINUS `A1^X ) ] ◎ If byte is in the range 0xA1 to 0xDF, inclusive, return a code point whose value is 0xFF61 − 0xA1 + byte.
  5. ~IF[ %~byte ~IN { `81^X 〜 `9F^X, `E0^X 〜 `FC^X } ] ⇒ `~Shift_JIS~lead$ ~SET %~byte ~BR ~RET `継続$ ◎ If byte is in the range 0x81 to 0x9F, inclusive, or 0xE0 to 0xFC, inclusive, set Shift_JIS lead to byte and return continue.
  6. ~RET `~error$ ◎ Return error.
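
The pointer computation (steps 2.2 to 2.4) and the EUDC mapping (step 2.5) of the Shift_JIS decoder can be sketched as follows. The function names are hypothetical, and the lookup in index jis0208 is omitted.

```python
from typing import Optional


def shift_jis_pointer(lead: int, byte: int) -> Optional[int]:
    """Pointer for a lead/trail pair, or None if the trail byte is invalid."""
    if not (0x40 <= byte <= 0x7E or 0x80 <= byte <= 0xFC):
        return None
    offset = 0x40 if byte < 0x7F else 0x41            # skips the unused 0x7F
    lead_offset = 0x81 if lead < 0xA0 else 0xC1       # skips 0xA0..0xDF
    return (lead - lead_offset) * 188 + byte - offset


def eudc_code_point(pointer: int) -> Optional[int]:
    """Private-use code point for pointers 8836..10715, else None."""
    if 8836 <= pointer <= 10715:
        return 0xE000 - 8836 + pointer
    return None


print(shift_jis_pointer(0x82, 0xA0), hex(eudc_code_point(8836)))
```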

12.3.2. ~Shift_JIS~encoder

`Shift_JIS$n の`~encoder$の`~handler$は、所与の ( %~stream, %~cp ) に対し,次を走らす: ◎ Shift_JIS’s encoder’s handler, given a stream and code point, runs these steps:

  1. %~cp に応じて: ◎ ↓

    `~EoS$
    ~RET `完遂$ ◎ If code point is end-of-stream, return finished.
    `~ASCII~cp$
    `0080^U
    ~RET ~byte列 [ %~cp ] ◎ If code point is an ASCII code point or U+0080, return a byte whose value is code point.
    `00A5^U
    ~RET ~byte列 [ `5C^X ] ◎ If code point is U+00A5, return byte 0x5C.
    `203E^U
    ~RET ~byte列 [ `7E^X ] ◎ If code point is U+203E, return byte 0x7E.
    `FF61^U 〜 `FF9F^U
    ~RET ~byte列 [ ( %~cp ~MINUS `FF61^X ) ~PLUS `A1^X ] ◎ If code point is in the range U+FF61 to U+FF9F, inclusive, return a byte whose value is code point − 0xFF61 + 0xA1.
    ~OTHER
    何もしない
  2. ~IF[ %~cp ~EQ `2212^U ] ⇒ %~cp ~SET `FF0D^U ◎ If code point is U+2212, set it to U+FF0D.
  3. %~pointer ~LET %~cp を指す`索引~Shift_JIS~pointer$ ◎ Let pointer be the index Shift_JIS pointer for code point.
  4. ~IF[ %~pointer ~EQ ~NULL ] ⇒ ~RET %~cp を伴う`~error$ ◎ If pointer is null, return error with code point.
  5. %~lead ~LET floor( %~pointer ~DIV 188 ) ◎ Let lead be floor(pointer / 188).
  6. %~lead~offset ~LET [ %~lead ~IN { `00^X 〜 `1E^X } ならば `81^X / ~ELSE_ `C1^X ] ◎ Let lead offset be 0x81, if lead is less than 0x1F, and 0xC1 otherwise.
  7. %~trail ~LET %~pointer ~MOD 188 ◎ Let trail be pointer % 188.
  8. %~offset ~LET [ %~trail ~IN { `00^X 〜 `3E^X } ならば `40^X / ~ELSE_ `41^X ] ◎ Let offset be 0x40, if trail is less than 0x3F, and 0x41 otherwise.
  9. ~RET ~byte列 [ ( %~lead ~PLUS %~lead~offset ), ( %~trail ~PLUS %~offset ) ] ◎ Return two bytes whose values are lead + lead offset and trail + offset.
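
Steps 5 to 9 of the Shift_JIS encoder invert the decoder's pointer arithmetic. A minimal sketch, assuming a non-null pointer (the name `shift_jis_bytes` is hypothetical):

```python
def shift_jis_bytes(pointer: int) -> bytes:
    """Lead and trail bytes for a non-null index Shift_JIS pointer."""
    lead = pointer // 188
    lead_offset = 0x81 if lead < 0x1F else 0xC1  # lead bytes 0x81..0x9F, then 0xE0..
    trail = pointer % 188
    offset = 0x40 if trail < 0x3F else 0x41      # trail bytes 0x40..0x7E, then 0x80..
    return bytes([lead + lead_offset, trail + offset])


print(shift_jis_bytes(283).hex())
```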

13. 旧来の複byte Korean ~encoding

13.1. ~EUC-KR

13.1.1. ~EUC-KR~decoder

`EUC-KR$n の`~decoder$の各~instanceには、次のものが結付けられる ⇒ `~EUC-KR~lead@(初期~時 `00^X ) ◎ EUC-KR’s decoder has an associated EUC-KR lead (initially 0x00).

`EUC-KR$n の`~decoder$の`~handler$は、所与の ( %~stream, %~byte ) に対し,次を走らす: ◎ EUC-KR’s decoder’s handler, given a stream and byte, runs these steps:

  1. ~IF[ %~byte ~EQ `~EoS$ ]: ◎ ↓

    1. ~IF[ `~EUC-KR~lead$ ~NEQ `00^X ] ⇒ `~EUC-KR~lead$ ~SET `00^X ~BR ~RET `~error$ ◎ If byte is end-of-stream and EUC-KR lead is not 0x00, set EUC-KR lead to 0x00 and return error.
    2. ~ELSE ⇒ ~RET `完遂$ ◎ If byte is end-of-stream and EUC-KR lead is 0x00, return finished.
  2. ~IF[ `~EUC-KR~lead$ ~NEQ `00^X ]: ◎ If EUC-KR lead is not 0x00, let lead be EUC-KR lead, let pointer be null, set EUC-KR lead to 0x00, and then run these substeps:

    1. %~lead ~LET `~EUC-KR~lead$ ~BR %~pointer ~LET ~NULL ~BR `~EUC-KR~lead$ ~SET `00^X ◎ ↑
    2. ~IF[ %~byte ~IN { `41^X 〜 `FE^X } ] ⇒ %~pointer ~SET ( %~lead ~MINUS `81^X ) ~MUL 190 ~PLUS ( %~byte ~MINUS `41^X ) ◎ If byte is in the range 0x41 to 0xFE, inclusive, set pointer to (lead − 0x81) × 190 + (byte − 0x41).
    3. %~cp ~LET [ %~pointer ~EQ ~NULL ならば ~NULL / ~ELSE_ `索引~EUC-KR$ の中で %~pointer が指す`索引~cp$ ] ◎ Let code point be null, if pointer is null, and the index code point for pointer in index EUC-KR otherwise.
    4. ~IF[ %~cp ~EQ ~NULL ]~AND[ %~byte ~IN `~ASCII~byte$ ] ⇒ %~byte を %~stream に`前付加する$ ◎ If code point is null and byte is an ASCII byte, prepend byte to stream.
    5. ~IF[ %~cp ~EQ ~NULL ] ⇒ ~RET `~error$ ◎ If code point is null, return error.
    6. ~RET ~cp [ %~cp ] ◎ Return a code point whose value is code point.
  3. ~IF[ %~byte ~IN `~ASCII~byte$ ] ⇒ ~RET ~cp [ %~byte ] ◎ If byte is an ASCII byte, return a code point whose value is byte.
  4. ~IF[ %~byte ~IN { `81^X 〜 `FE^X } ] ⇒ `~EUC-KR~lead$ ~SET %~byte ~BR ~RET `継続$ ◎ If byte is in the range 0x81 to 0xFE, inclusive, set EUC-KR lead to byte and return continue.
  5. ~RET `~error$ ◎ Return error.

13.1.2. ~EUC-KR~encoder

`EUC-KR$n の`~encoder$の`~handler$は、所与の ( %~stream, %~cp ) に対し,次を走らす: ◎ EUC-KR’s encoder’s handler, given a stream and code point, runs these steps:

  1. ~IF[ %~cp ~EQ `~EoS$ ] ⇒ ~RET `完遂$ ◎ If code point is end-of-stream, return finished.
  2. ~IF[ %~cp ~IN `~ASCII~cp$ ] ⇒ ~RET ~byte列 [ %~cp ] ◎ If code point is an ASCII code point, return a byte whose value is code point.
  3. %~pointer ~LET `索引~EUC-KR$ の中で %~cp を指す`索引~pointer$ ◎ Let pointer be the index pointer for code point in index EUC-KR.
  4. ~IF[ %~pointer ~EQ ~NULL ] ⇒ ~RET %~cp を伴う`~error$ ◎ If pointer is null, return error with code point.
  5. %~lead ~LET floor( %~pointer ~DIV 190 ) ~PLUS `81^X ◎ Let lead be floor(pointer / 190) + 0x81.
  6. %~trail ~LET ( %~pointer ~MOD 190 ) ~PLUS `41^X ◎ Let trail be pointer % 190 + 0x41.
  7. ~RET ~byte列 [ %~lead, %~trail ] ◎ Return two bytes whose values are lead and trail.
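
The EUC-KR pointer arithmetic is the simplest of the multi-byte encodings in this document, since it uses one fixed trail offset. A sketch of both directions, with hypothetical function names and without the lookup in index EUC-KR:

```python
from typing import Optional


def euc_kr_pointer(lead: int, byte: int) -> Optional[int]:
    """Decoder step 2.2: pointer for a lead/trail pair, or None."""
    if 0x41 <= byte <= 0xFE:
        return (lead - 0x81) * 190 + (byte - 0x41)
    return None


def euc_kr_bytes(pointer: int) -> bytes:
    """Encoder steps 5 to 7: lead and trail bytes for a non-null pointer."""
    return bytes([pointer // 190 + 0x81, pointer % 190 + 0x41])


pointer = euc_kr_pointer(0xB0, 0xA1)
print(pointer, euc_kr_bytes(pointer).hex())
```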

14. その他の旧来の~encoding

14.1. ~replacement

注記: `replacement$n `~encoding$は、~serverと~clientにおける `~encoding$の~supportの不一致を突く,ある種の攻撃を防ぐためのものである。 ◎ The replacement encoding exists to prevent certain attacks that abuse a mismatch between encodings supported on the server and the client.

14.1.1. ~replacement~decoder

`replacement$n の`~decoder$の各~instanceには、次のものが結付けられる ⇒ `~replacementによる~errorはすでに返した~flag@(初期~時 ~OFF ) ◎ replacement’s decoder has an associated replacement error returned flag (initially unset).

`replacement$n の`~decoder$の`~handler$は、所与の ( %~stream, %~byte ) に対し,次を走らす: ◎ replacement’s decoder’s handler, given a stream and byte, runs these steps:

  1. ~IF[ %~byte ~EQ `~EoS$ ] ⇒ ~RET `完遂$ ◎ If byte is end-of-stream, return finished.
  2. ~IF[ `~replacementによる~errorはすでに返した~flag$ ~EQ ~OFF ] ⇒ `~replacementによる~errorはすでに返した~flag$ ~SET ~ON ~BR ~RET `~error$ ◎ If replacement error returned flag is unset, set the replacement error returned flag and return error.
  3. ~RET `完遂$ ◎ Return finished.

【 `replacement$n には、`~encoder$はない。 】
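
The effect of the replacement decoder can be sketched as below. The sketch assumes an error mode in which each error becomes U+FFFD, and the name `replacement_decode` is hypothetical.

```python
def replacement_decode(data: bytes) -> str:
    """Decode with the replacement decoder: at most one U+FFFD, nothing else."""
    error_returned = False
    out = []
    for _byte in data:
        if not error_returned:
            # Step 2: the first byte produces the single error.
            error_returned = True
            out.append("\uFFFD")
            continue
        # Step 3: any further byte finishes decoding.
        break
    return "".join(out)


print(repr(replacement_decode(b"<script>")))
```

Empty input yields an empty string; any non-empty input yields exactly one U+FFFD, so none of the original bytes reach the consumer.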

14.2. ~UTF-16BEと~UTF-16LEに共通の基盤

14.2.1. 共用~UTF-16~decoder

注記: ~byte~order-mark( ~BOM )は`~label$より優先される。 それは,配備済みの内容において、どの`~label$よりも正確であることが見出されているので。 したがって それは、`共用~UTF-16~decoder$の一部ではなく,`~decode$ ~algoの一部になる。 ◎ A byte order mark has priority over a label as it has been found to be more accurate in deployed content. Therefore it is not part of the shared UTF-16 decoder algorithm but rather the decode algorithm.

`共用~UTF-16~decoder$の各~instanceには、次のものが結付けられる ⇒ `~UTF-16~lead~byte@(初期~時 ~NULL )~BR `~UTF-16~lead~surrogate@(初期~時 ~NULL )~BR `~UTF-16BE~decoder~flag@(初期~時 ~OFF ) ◎ shared UTF-16 decoder has an associated UTF-16 lead byte and UTF-16 lead surrogate (both initially null), and UTF-16BE decoder flag (initially unset).

`共用~UTF-16~decoder$の`~handler$は、所与の ( %~stream, %~byte ) に対し,次を走らす: ◎ shared UTF-16 decoder’s handler, given a stream and byte, runs these steps:

  1. ~IF[ %~byte ~EQ `~EoS$ ]: ◎ ↓

    1. ~IF[ `~UTF-16~lead~byte$ ~NEQ ~NULL ]~OR[ `~UTF-16~lead~surrogate$ ~NEQ ~NULL ] ⇒ ( `~UTF-16~lead~byte$, `~UTF-16~lead~surrogate$ ) ~SET ( ~NULL, ~NULL ) ~BR ~RET `~error$ ◎ If byte is end-of-stream and either UTF-16 lead byte or UTF-16 lead surrogate is not null, set UTF-16 lead byte and UTF-16 lead surrogate to null, and return error.
    2. ~ELSE ⇒ ~RET `完遂$ ◎ If byte is end-of-stream and UTF-16 lead byte and UTF-16 lead surrogate are null, return finished.
  2. ~IF[ `~UTF-16~lead~byte$ ~EQ ~NULL ] ⇒ `~UTF-16~lead~byte$ ~SET %~byte ~BR ~RET `継続$ ◎ If UTF-16 lead byte is null, set UTF-16 lead byte to byte and return continue.
  3. %~cu ~LET `~UTF-16BE~decoder~flag$に応じて,次で与えられる値: ◎ Let code unit be the result of:

    ~ON◎UTF-16BE decoder flag is set
    ( `~UTF-16~lead~byte$ ~Lshift 8 ) ~PLUS %~byte ◎ (UTF-16 lead byte << 8) + byte.
    ~OFF◎UTF-16BE decoder flag is unset
    ( %~byte ~Lshift 8 ) ~PLUS `~UTF-16~lead~byte$ ◎ (byte << 8) + UTF-16 lead byte.
  4. `~UTF-16~lead~byte$ ~SET ~NULL ◎ Then set UTF-16 lead byte to null.
  5. ~IF[ `~UTF-16~lead~surrogate$ ~NEQ ~NULL ]: ◎ If UTF-16 lead surrogate is not null, let lead surrogate be UTF-16 lead surrogate, set UTF-16 lead surrogate to null, and then run these substeps:

    1. %~lead~surrogate ~LET `~UTF-16~lead~surrogate$ ~BR `~UTF-16~lead~surrogate$ ~SET ~NULL ◎ ↑
    2. ~IF[ %~cu ~IN { `DC00^U 〜 `DFFF^U } ] ⇒ ~RET ~cp [ `10000^X ~PLUS ( ( %~lead~surrogate ~MINUS `D800^X ) ~Lshift 10 ) ~PLUS ( %~cu ~MINUS `DC00^X ) ] ◎ If code unit is in the range U+DC00 to U+DFFF, inclusive, return a code point whose value is 0x10000 + ((lead surrogate − 0xD800) << 10) + (code unit − 0xDC00).
    3. %byte1 ~LET %~cu ~Rshift 8 ◎ Let bytes be the return value of running these subsubsteps: ◎ Let byte1 be code unit >> 8.
    4. %byte2 ~LET %~cu ~bAND `00FF^X ◎ Let byte2 be code unit & 0x00FF.
    5. [ `~UTF-16BE~decoder~flag$に応じて,次で与えられる~byte列 ]を %~stream に`前付加する$: ◎ Then return the bytes in order, switching on UTF-16BE decoder flag:

      ~ON◎Set
      ~byte列 [ %byte1, %byte2 ] ◎ byte1, then byte2.
      ~OFF◎Unset
      ~byte列 [ %byte2, %byte1 ] ◎ byte2, then byte1.
    6. ~RET `~error$ ◎ Prepend the bytes to stream and return error.
  6. ~IF[ %~cu ~IN { `D800^U 〜 `DBFF^U } ] ⇒ `~UTF-16~lead~surrogate$ ~SET %~cu ~BR ~RET `継続$ ◎ If code unit is in the range U+D800 to U+DBFF, inclusive, set UTF-16 lead surrogate to code unit and return continue.
  7. ~IF[ %~cu ~IN { `DC00^U 〜 `DFFF^U } ] ⇒ ~RET `~error$ ◎ If code unit is in the range U+DC00 to U+DFFF, inclusive, return error.
  8. ~RET ~cp [ %~cu ] ◎ Return code point code unit.
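
The shared UTF-16 decoder is self-contained (it needs no index), so the whole handler can be sketched in a few lines. The sketch assumes an error mode in which each error becomes U+FFFD and does no BOM handling; `decode_utf16` and its parameters are hypothetical names.

```python
from collections import deque


def decode_utf16(data: bytes, big_endian: bool = False) -> str:
    stream = deque(data)
    lead_byte = None
    lead_surrogate = None
    out = []
    while True:
        if not stream:
            # Step 1: a dangling byte or lead surrogate is an error.
            if lead_byte is not None or lead_surrogate is not None:
                out.append("\uFFFD")
            break
        byte = stream.popleft()
        if lead_byte is None:  # step 2
            lead_byte = byte
            continue
        # Step 3: combine the two bytes according to byte order.
        if big_endian:
            code_unit = (lead_byte << 8) + byte
        else:
            code_unit = (byte << 8) + lead_byte
        lead_byte = None  # step 4
        if lead_surrogate is not None:  # step 5
            lead, lead_surrogate = lead_surrogate, None
            if 0xDC00 <= code_unit <= 0xDFFF:
                out.append(chr(0x10000 + ((lead - 0xD800) << 10) + (code_unit - 0xDC00)))
                continue
            # Unpaired lead surrogate: push the code unit back and report an error.
            byte1, byte2 = code_unit >> 8, code_unit & 0x00FF
            pair = [byte1, byte2] if big_endian else [byte2, byte1]
            stream.extendleft(reversed(pair))
            out.append("\uFFFD")
            continue
        if 0xD800 <= code_unit <= 0xDBFF:  # step 6
            lead_surrogate = code_unit
            continue
        if 0xDC00 <= code_unit <= 0xDFFF:  # step 7
            out.append("\uFFFD")
            continue
        out.append(chr(code_unit))  # step 8
    return "".join(out)


print(decode_utf16(bytes([0x3D, 0xD8, 0x00, 0xDE])) == "\U0001F600")
```

Pushing the code unit back after an unpaired lead surrogate means the following code unit is decoded on its own rather than being swallowed by the error.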

14.3. ~UTF-16BE

14.3.1. ~UTF-16BE~decoder

`UTF-16BE$n の`~decoder$は、[ `~UTF-16BE~decoder~flag$ ~SET ~ON ]にされた`共用~UTF-16~decoder$である。 ◎ UTF-16BE’s decoder is shared UTF-16 decoder with its UTF-16BE decoder flag set.

14.4. ~UTF-16LE

注記: 配備済みの内容に対処するため、 `utf-16^lb, `utf-16le^lb のいずれも `UTF-16LE$n のための`~label$にされている。 ◎ Both "utf-16" and "utf-16le" are labels for UTF-16LE to deal with deployed content.

14.4.1. ~UTF-16LE~decoder

`UTF-16LE$n の`~decoder$は、`共用~UTF-16~decoder$である。 ◎ UTF-16LE’s decoder is shared UTF-16 decoder.

14.5. ~x-user-defined

注記: これは技術的には`単byte~encoding$であるが、~algo的に実装し得るので,別々に定義される。 ◎ While technically this is a single-byte encoding, it is defined separately as it can be implemented algorithmically.

14.5.1. ~x-user-defined~decoder

`x-user-defined$n の`~decoder$の`~handler$は、所与の ( %~stream, %~byte ) に対し,次を走らす: ◎ x-user-defined’s decoder’s handler, given a stream and byte, runs these steps:

  1. %~byte に応じて: ◎ ↓

    `~EoS$
    ~RET `完遂$ ◎ If byte is end-of-stream, return finished.
    `~ASCII~byte$
    ~RET ~cp [ %~byte ] ◎ If byte is an ASCII byte, return a code point whose value is byte.
    ~OTHER
    ~RET ~cp [ `F780^X ~PLUS %~byte ~MINUS `80^X ] ◎ Return a code point whose value is 0xF780 + byte − 0x80.

14.5.2. ~x-user-defined~encoder

`x-user-defined$n の`~encoder$の`~handler$は、所与の ( %~stream, %~cp ) に対し,次を走らす: ◎ x-user-defined’s encoder’s handler, given a stream and code point, runs these steps:

  1. %~cp に応じて: ◎ ↓

    `~EoS$
    ~RET `完遂$ ◎ If code point is end-of-stream, return finished.
    `~ASCII~cp$
    ~RET ~byte列 [ %~cp ] ◎ If code point is an ASCII code point, return a byte whose value is code point.
    `F780^U 〜 `F7FF^U
    ~RET ~byte列 [ %~cp ~MINUS `F780^X ~PLUS `80^X ] ◎ If code point is in the range U+F780 to U+F7FF, inclusive, return a byte whose value is code point − 0xF780 + 0x80.
    ~OTHER
    ~RET %~cp を伴う`~error$ ◎ Return error with code point.
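
The encoder cases above can be sketched in the same way. The function name `encode_x_user_defined` is hypothetical. Raising `UnicodeEncodeError` to stand for "error with code point" is an assumption of this sketch; how a caller reacts to that error is not covered by this handler.

```python
def encode_x_user_defined(text: str) -> bytes:
    """Sketch of the x-user-defined encoder: one code point in, one byte out."""
    out = bytearray()
    for index, char in enumerate(text):
        code_point = ord(char)
        if code_point <= 0x7F:
            # ASCII code point: the byte has the same value.
            out.append(code_point)
        elif 0xF780 <= code_point <= 0xF7FF:
            # U+F780..U+F7FF map back onto bytes 0x80..0xFF.
            out.append(code_point - 0xF780 + 0x80)
        else:
            # Anything else cannot be represented: error with code point.
            raise UnicodeEncodeError(
                "x-user-defined", text, index, index + 1,
                "code point U+%04X is not encodable" % code_point,
            )
    return bytes(out)
```

Only ASCII code points and U+F780 to U+F7FF are encodable, so the encoder covers exactly the 256 byte values that the decoder can produce.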

15. ~browser UI

~browserには、資源の~encodingに対する上書きを可能化させないことが奨励される。 にもかかわらず,その種の特色機能が在する場合、前述の ~security上の課題 から,~browserは, `UTF-16BE$n/`UTF-16LE$n を~optionとして提供する~SHOULDでない。 ~browserは、資源が `UTF-16BE$n/`UTF-16LE$n を利用して~decodeされた場合でも,この特色機能を不能化する~SHOULDである。 ◎ Browsers are encouraged to not enable overriding the encoding of a resource. If such a feature is nonetheless present, browsers should not offer either UTF-16BE or UTF-16LE as option due to aforementioned security issues. Browsers also should disable this feature if the resource was decoded using either UTF-16BE or UTF-16LE.

謝辞

年月に渡り、~encodingを相互運用可能なものにするために,たくさんの方々が助力され、この標準の目標へ近付けてきた。 同様に多くの方々の助力が,この標準を現在の姿に仕立て上げてきた。 特に,次の方々に感謝する: ◎ There have been a lot of people that have helped make encodings more interoperable over the years and thereby furthered the goals of this standard. Likewise many people have helped making this standard what it is today.

With that, many thanks to Adam Rice, Alan Chaney, Alexander Shtuchkin, Allen Wirfs-Brock, Aneesh Agrawal, Arkadiusz Michalski, Asmus Freytag, Ben Noordhuis, Boris Zbarsky, Bruno Haible, Cameron McCormack, Charles McCathieNeville, David Carlisle, Domenic Denicola, Dominique Hazaël-Massieux, Doug Ewell, Erik van der Poel, 譚永鋒 (Frank Yung-Fong Tang), Geoffrey Sneddon, Glenn Maynard, Gordon P. Hemsley, Henri Sivonen, Ian Hickson, James Graham, Jeffrey Yasskin, John Tamplin, Joshua Bell, 村井純 (Jun Murai), 신정식 (Jungshik Shin), Jxck, 강 성훈 (Kang Seonghoon), 川幡太一 (Kawabata Taichi), Ken Lunde, Ken Whistler, Kenneth Russell, 田村健人 (Kent Tamura), Leif Halvard Silli, Makoto Kato, Mark Callow, Mark Crispin, Mark Davis, Martin Dürst, Masatoshi Kimura, Ms2ger, Nigel Megitt, Nigel Tao, Norbert Lindenberg, Øistein E. Andersen, Peter Krefting, Philip Jägenstedt, Philip Taylor, Richard Ishida, Robbert Broersma, Robert Mustacchi, Ryan Dahl, Shawn Steele, Simon Montagu, Simon Pieters, Simon Sapin, 寺田健 (Takeshi Terada), Vyacheslav Matva, and 成瀬ゆい (Yui Naruse) for being awesome.

この標準は、 Anne van Kesteren ( Mozilla, annevk@annevk.nl )により書かれた。 当初の API 節は、 Joshua Bell ( Google )により書かれた。 ◎ This standard is written by Anne van Kesteren (Mozilla, annevk@annevk.nl). The API chapter was initially written by Joshua Bell (Google).

Per CC0, to the extent possible under law, the editors have waived all copyright and related or neighboring rights to this work.

参照文献

文献(規範)

[INFRA]
Anne van Kesteren; Domenic Denicola. Infra Standard. Living Standard.
https://infra.spec.whatwg.org/
[UNICODE]
The Unicode Standard.
http://www.unicode.org/versions/latest/
[WEBIDL]
Cameron McCormack; Boris Zbarsky; Tobie Langel. Web IDL.
https://heycam.github.io/webidl/

文献(参考)

[HTML]
Anne van Kesteren; et al. HTML Standard. Living Standard.
https://html.spec.whatwg.org/multipage/
[URL]
Anne van Kesteren. URL Standard. Living Standard.
https://url.spec.whatwg.org/
[XML]
Tim Bray; et al. Extensible Markup Language (XML) 1.0 (Fifth Edition). 26 November 2008. REC.
https://www.w3.org/TR/xml