The Byte Order Mark Problem in UTF-8 Encoding

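The other day a colleague handed me a SQL Server database script file. Running the file produced a syntax error, yet when I copied its contents into SQL Server Management Studio and executed them there, everything worked. It was really strange, and only after a long investigation did I realize that the cause was the UTF-8 BOM (Byte Order Mark).
The following is excerpted from Wikipedia.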
The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in. [1]
Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving these encodings from arbitrary sources needs to know which byte order the integers are encoded in. The BOM gives the producer of the text a way to describe the text stream's endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself. Once the receiving computer has consumed the text stream, it presumably processes the characters in its own native byte order and no longer needs the BOM. Hence the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment.
The UTF-8 representation of the BOM is the byte sequence 0xEF, 0xBB, 0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.
The Unicode Standard does permit the BOM in UTF-8, [2] but does not require or recommend its use. [3] Byte order has no meaning in UTF-8 [4] so in UTF-8 the BOM serves only to identify a text stream or file as UTF-8.
Many Windows programs (including Windows Notepad) add BOMs to UTF-8 files by default [citation needed].
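Because Unicode can be encoded in 16-bit or 32-bit units, a computer has to know the byte order when processing such text, and the BOM is what identifies the byte order of the byte stream. Byte order has no meaning for UTF-8, so a BOM tells a UTF-8 reader nothing about byte order and only marks the data as UTF-8. Even so, the Unicode Standard allows a BOM in UTF-8. It sits at the very start of the file and consists of the three bytes 0xEF, 0xBB, 0xBF.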
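To make the three bytes concrete, here is a minimal Python sketch. The helper name has_utf8_bom and the sample SQL text are my own illustrations, not something from the original script; codecs.BOM_UTF8 is the standard-library constant for the UTF-8 BOM.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# Encoding the BOM character U+FEFF as UTF-8 yields exactly the three bytes.
bom_bytes = "\ufeff".encode("utf-8")
print(bom_bytes.hex())  # efbbbf

# Hypothetical file contents, with and without a leading BOM.
print(has_utf8_bom(b"\xef\xbb\xbfSELECT 1;"))  # True
print(has_utf8_bom(b"SELECT 1;"))              # False
```

To check a real file, read its first bytes in binary mode (open(path, "rb")) and pass them to the helper.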
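A BOM is not recommended in UTF-8, where it carries no byte-order information, but many Windows programs, Windows Notepad included, save UTF-8 files in the with-BOM form, that is, they prepend the three bytes 0xEF 0xBB 0xBF to the file.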
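So when you edit a UTF-8 file, I recommend not using Notepad or similar tools. The saved file is still UTF-8, but it is no longer byte-for-byte the UTF-8 file it was before saving. Using such a file can lead to encoding-related problems like the one I described at the start of this post.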
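The kind of problem this causes can be reproduced in a few lines of Python. This is only a sketch with made-up sample bytes, not the actual script from my case: decoding with the plain "utf-8" codec keeps the BOM as an invisible U+FEFF character at the front of the text, while Python's "utf-8-sig" codec drops it.

```python
# Hypothetical script contents as saved by an editor that adds a BOM.
raw = b"\xef\xbb\xbfSELECT 1;"

plain = raw.decode("utf-8")         # BOM survives as a leading U+FEFF
stripped = raw.decode("utf-8-sig")  # BOM is removed during decoding

print(plain.startswith("\ufeff"))   # True
print(len(plain), len(stripped))    # 10 9
print(stripped == "SELECT 1;")      # True
```

A tool that does not expect the extra leading character sees something other than SELECT as the first token, which is consistent with a syntax error that disappears when the visible text is copied and pasted.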
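How to remove the BOM from a UTF-8 encoded file: in Notepad++, choose "Encoding in UTF-8 without BOM" from the Encoding menu. Alternatively, delete the first three bytes of the file with any hex editor. Simpler still, in Vim turn off the BOM option for the file (:set nobomb) and save it.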
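The hex-editor approach can also be scripted. The following is a minimal Python sketch, assuming the whole file fits in memory; the function name strip_utf8_bom and the file name script.sql are hypothetical.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path: str) -> bool:
    """Remove a leading UTF-8 BOM from the file in place.

    Returns True if a BOM was found and removed, False if the file
    had no BOM and was left untouched.
    """
    p = Path(path)
    data = p.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    # Write back everything after the three BOM bytes.
    p.write_bytes(data[len(codecs.BOM_UTF8):])
    return True
```

Usage would look like strip_utf8_bom("script.sql"). Because it rewrites the file in place, it is worth keeping a backup copy when trying it on anything important.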
