Determining a file's encoding from its byte values


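A plain text file normally begins directly with its content, but some encodings place a marker at the very start of the file that identifies the encoding (a byte order mark, BOM).
For a text file saved as UTF-8 with a BOM, the first three bytes are -17, -69, -65 when read as signed Java bytes (0xEF, 0xBB, 0xBF in hexadecimal).
Checking these bytes tells us whether the file is UTF-8 encoded. Note that a UTF-8 file saved without a BOM will not be recognized this way.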

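// Requires: java.io.File, java.io.FileInputStream, java.io.InputStream
File file = new File(path);
byte[] b = new byte[3];
int n;
try (InputStream ios = new FileInputStream(file)) {
  n = ios.read(b); // may return fewer than 3 bytes for a very short file
}
// Java bytes are signed: -17, -69, -65 are 0xEF, 0xBB, 0xBF
if (n == 3 && b[0] == -17 && b[1] == -69 && b[2] == -65)
  System.out.println(file.getName() + ": UTF-8");
else
  // No UTF-8 BOM was found; GBK is only a guess at this point.
  System.out.println(file.getName() + ": GBK");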
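
The negative numbers in the check above come from Java's signed `byte` type. As a minimal sketch (the class name `SignedBomBytes` and the helper `toUnsigned` are made up for illustration), masking each byte with `0xFF` shows that they are the same values usually written as EF BB BF:

```java
public class SignedBomBytes {
    // Convert a signed Java byte (-128..127) to its unsigned value (0..255).
    public static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte[] bom = {-17, -69, -65};
        for (byte b : bom) {
            // Prints: -17 -> 0xEF, -69 -> 0xBB, -65 -> 0xBF (one per line)
            System.out.printf("%d -> 0x%02X%n", b, toUnsigned(b));
        }
    }
}
```

Comparing `(b[0] & 0xFF) == 0xEF` is therefore equivalent to `b[0] == -17`, and is often easier to read next to a hex table.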

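If the project cannot control the encoding of the text files it has to identify (for example, HTML or XML text uploaded by users), an existing open-source project can be used for detection.
The most standard approach is to check the first few bytes of the text. The leading bytes and the corresponding charset/encoding are listed in the table below.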

EF BB BF     UTF-8
FE FF        UTF-16/UCS-2, big endian
FF FE        UTF-16/UCS-2, little endian
FF FE 00 00  UTF-32/UCS-4, little endian
00 00 FE FF  UTF-32/UCS-4, big endian
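
These byte sequences are the BOM character U+FEFF encoded in each scheme. The following sketch (the class name `BomBytes` and method `bomHex` are illustrative, not from the original) encodes U+FEFF with the JDK's standard charsets and prints the resulting bytes. The UTF-32 rows follow the same pattern but are left out here, to stick to the `StandardCharsets` constants that have long been available:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomBytes {
    // Encode U+FEFF with the given charset and format the bytes as hex.
    public static String bomHex(Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : "\uFEFF".getBytes(cs)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("UTF-8    " + bomHex(StandardCharsets.UTF_8));    // EF BB BF
        System.out.println("UTF-16BE " + bomHex(StandardCharsets.UTF_16BE)); // FE FF
        System.out.println("UTF-16LE " + bomHex(StandardCharsets.UTF_16LE)); // FF FE
    }
}
```

The output makes the endianness easy to remember: big endian stores the high byte FE first, little endian stores the low byte FF first.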

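// Read the first four bytes of the file as unsigned values (0..255).
int[] head = new int[4];
try (InputStream inputStream = new FileInputStream(path)) {
    for (int i = 0; i < 4; i++) {
        head[i] = inputStream.read(); // returns -1 once the end of the file is reached
    }
}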

String code = "ANSI"; // default when no BOM is found
if (head[0]==0xef && head[1]==0xbb && head[2]==0xbf) {
    code = "UTF-8";
// The UTF-32 checks must come before the UTF-16 ones,
// because the UTF-32 little-endian BOM also starts with FF FE.
} else if (head[0]==0xff && head[1]==0xfe && head[2]==0x0 && head[3]==0x0) {
    code = "UTF-32/UCS-4, little endian";
} else if (head[0]==0x0 && head[1]==0x0 && head[2]==0xfe && head[3]==0xff) {
    code = "UTF-32/UCS-4, big endian";
} else if (head[0]==0xfe && head[1]==0xff) {
    code = "UTF-16/UCS-2, big endian";
} else if (head[0]==0xff && head[1]==0xfe) {
    code = "UTF-16/UCS-2, little endian";
}
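
Putting the two fragments together, a self-contained version might look like the sketch below. The class name `BomDetector`, its method names, and the returned labels are assumptions made for this example. It returns `null` when no BOM is present instead of guessing a legacy encoding, and it tests the longer UTF-32 patterns before the UTF-16 ones.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BomDetector {
    // Returns the encoding implied by a BOM in the first 'len' bytes, or null if none.
    public static String detect(byte[] head, int len) {
        if (startsWith(head, len, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        // UTF-32 LE shares its first two bytes with UTF-16 LE, so test it first.
        if (startsWith(head, len, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, len, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, len, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, len, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, int len, int... prefix) {
        if (len < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false; // compare as unsigned
        }
        return true;
    }

    // Reads up to four leading bytes of the file and checks them for a BOM.
    public static String detectFile(Path path) throws IOException {
        byte[] head = new byte[4];
        int len = 0;
        try (InputStream in = Files.newInputStream(path)) {
            int n;
            while (len < head.length
                    && (n = in.read(head, len, head.length - len)) != -1) {
                len += n;
            }
        }
        return detect(head, len);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BomDetector <file>");
            return;
        }
        String result = detectFile(Paths.get(args[0]));
        System.out.println(result != null ? result : "no BOM found");
    }
}
```

One ambiguity remains by design: a UTF-16 little-endian file whose first character is U+0000 also begins with FF FE 00 00 and would be reported as UTF-32LE. Such files are rare in practice, but the result should be treated as a strong hint rather than proof.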
