Character Sets and Character Encodings [Revised]

This topic has already been discussed by countless people; what follows is only a personal summary, not a tutorial.
潘孙友, 2010-12-31, Zunyi

  
1. Character Sets
2. Character Encodings
3. Windows Platform
  3.1 Codepage
  3.2 Encoding conversion (API)
  3.3 Encoding conversion (CRT) [@loop]
4. Linux/unix Platform
  4.1 iconv
  4.2 ICU

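1. Character Sets
A character set is a collection that describes and defines a group of characters; common ones include GB 2312, GBK, GB 18030, and Unicode. A character set is just a specification, a convention, and you can define your own.
For example, bank IT systems often define small internal character sets, such as an X character set or an N character set, to make validating fields easier.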

X character set, 86 characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
. , - _ ( ) / = ' + : ? ! " % & * < > ; @ #
(cr) (lf) (space)
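
A field check against such a set could look roughly like the sketch below. The function name `is_x_charset` is hypothetical, and the sketch assumes the punctuation in the list above is meant in its plain ASCII form.

```cpp
#include <string>

// Hypothetical helper (not from any real banking system): returns true when
// every byte of `field` belongs to the X character set listed above.
// Assumption: the punctuation marks are the plain ASCII ones.
bool is_x_charset(const std::string& field)
{
    static const std::string allowed =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        ".,-_()/='+:?!\"%&*<>;@#"
        "\r\n ";
    for (char c : field) {
        // Any byte outside the allowed list makes the whole field invalid
        if (allowed.find(c) == std::string::npos) return false;
    }
    return true;
}
```

Because the set is small and fixed, a plain membership test like this is enough; no knowledge of any particular encoding is needed as long as the field is ASCII-compatible.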

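Different character sets overlap, and a set with more characters may contain a smaller one entirely. China's GB 2312, GBK, and GB 18030 were extended step by step over the years, so GB 18030 is a superset of the earlier two. The largest character set today is Unicode, which covers the characters of almost every language in the world.
2. Character Encodings
A character encoding presupposes a character set: an encoding is a concrete representation of a character set. One character set can have several encodings, as long as each of them can cover every character in the set.
For example, the Unicode character set has several concrete encodings, such as utf-8, utf-16, and utf-32. gb 2312, on the other hand, has only one common encoding, which is also called GB 2312 (which is why people so often confuse the encoding with the character set). Someone may object that utf-8 can represent every character in gb 2312, so isn't utf-8 an encoding of the gb 2312 character set? In practice that is true, but utf-8 was not designed for gb 2312.
Different encodings suit different situations according to their properties. Taking the Unicode encodings as an example: utf-8 stores text that is mostly ASCII in as little space as possible and recovers well from errors (because of how the encoding is designed, a corrupted byte affects only the character it belongs to, whereas with the multi-byte code units of utf-16/32 a lost byte misaligns everything after it), so it suits transmission and storage. utf-32 represents every character with the same number of bytes, and utf-16 does so for every character in the Basic Multilingual Plane (characters beyond it take a surrogate pair of two 16-bit units); both are easy to process inside a program and are generally used as a system's internal string format.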
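
To make the variable-length nature of utf-8 concrete, here is a minimal sketch of encoding a single Unicode code point. The function name `encode_utf8` is made up for illustration and is not part of any library mentioned in this post.

```cpp
#include <string>

// Illustrative helper (not from the original post): encode one Unicode
// code point as UTF-8. Returns an empty string for invalid input
// (surrogates or values above U+10FFFF).
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {
        // 1 byte: 0xxxxxxx (plain ASCII)
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Lead bytes and continuation bytes (10xxxxxx) have distinct bit patterns, which is what lets a decoder find the next character boundary after a damaged byte.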
3. Windows Platform
3.1 Codepage
On Windows, a code page identifies a character encoding. Which code pages are supported depends on which ones are installed on your system. GetACP returns the system's current ANSI code page.

WINBASEAPI UINT WINAPI GetACP(void);

The following snippet lists the code pages the system supports.

	// Try every code page identifier up to 65001 (CP_UTF8) and print the valid ones
	for (UINT i = 0; i <= 65001; i++)
	{
		CPINFOEXA cpinfo;
		if (IsValidCodePage(i)) {
			if (GetCPInfoExA(i, 0, &cpinfo))
				printf("%u=[%s]\n", cpinfo.CodePage, cpinfo.CodePageName);
		}
	}

For the full list, see http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx
3.2 Encoding conversion (API)
Windows provides two functions for converting between character encodings.

WINBASEAPI
int
WINAPI
MultiByteToWideChar(
    __in UINT     CodePage,
    __in DWORD    dwFlags,
    __in_bcount(cbMultiByte) LPCSTR   lpMultiByteStr,
    __in int      cbMultiByte,
    __out_ecount_opt(cchWideChar) __transfer(lpMultiByteStr) LPWSTR  lpWideCharStr,
    __in int      cchWideChar);

WINBASEAPI
int
WINAPI
WideCharToMultiByte(
    __in UINT     CodePage,
    __in DWORD    dwFlags,
    __in_ecount(cchWideChar) LPCWSTR  lpWideCharStr,
    __in int      cchWideChar,
    __out_bcount_opt(cbMultiByte) __transfer(lpWideCharStr) LPSTR   lpMultiByteStr,
    __in int      cbMultiByte,
    __in_opt LPCSTR   lpDefaultChar,
    __out_opt LPBOOL  lpUsedDefaultChar);

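Windows should be using Unicode throughout internally. The W-suffixed system APIs take strings of wchar_t (unsigned short), while the A-series APIs convert their arguments to Unicode internally and then call the corresponding W function.
To convert between two encodings on Windows, you generally convert the source encoding to Unicode first and then convert the Unicode string to the target encoding. Taking gbk to utf-8 as an example: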
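
Since wchar_t is a 16-bit type on Windows, a wide string there holds UTF-16 code units, and a character outside the Basic Multilingual Plane occupies two of them. A minimal, portable sketch of that encoding follows; the function name `encode_utf16` is made up for illustration and uses char16_t rather than wchar_t so that it behaves the same on every platform.

```cpp
#include <string>

// Illustrative helper (not from the original post): encode one Unicode
// code point as UTF-16 code units. Returns an empty string for invalid
// input (surrogates or values above U+10FFFF).
std::u16string encode_utf16(char32_t cp)
{
    std::u16string out;
    if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
    if (cp <= 0xFFFF) {
        // BMP character: one 16-bit unit
        out += static_cast<char16_t>(cp);
    } else if (cp <= 0x10FFFF) {
        // Supplementary character: split the 20-bit offset into a surrogate pair
        char32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    }
    return out;
}
```

This is why the length of a wide string counts code units, not characters.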

std::string ctk_gbk2utf8(const char*s)
{
	s = s?s:"";
	std::wstring unicodestr;
	std::string utf8str;
	// GBK (code page 936) -> UTF-16; with a length of -1 the returned count includes the terminating NUL
	int n = MultiByteToWideChar(936, 0, s, -1, NULL, 0);
	if (n <= 0) return utf8str;
	unicodestr.resize(n);
	MultiByteToWideChar(936, 0, s, -1, &unicodestr[0], n);

	// UTF-16 -> UTF-8
	n = WideCharToMultiByte(CP_UTF8, 0, unicodestr.c_str(), -1, 0, 0, 0, 0 );
	if (n <= 0) return utf8str;
	utf8str.resize(n);
	WideCharToMultiByte(CP_UTF8, 0, unicodestr.c_str(), -1, &utf8str[0], n, 0, 0 );
	utf8str.resize(n - 1);	// drop the terminating NUL that was counted in n

	return utf8str;
}

std::string ctk_utf82gbk(const char*s)
{
	s = s?s:"";
	std::wstring unicodestr;
	std::string gbkstr;
	// UTF-8 -> UTF-16; with a length of -1 the returned count includes the terminating NUL
	int n = MultiByteToWideChar(CP_UTF8, 0, s, -1, NULL, 0);
	if (n <= 0) return gbkstr;
	unicodestr.resize(n);
	MultiByteToWideChar(CP_UTF8, 0, s, -1, &unicodestr[0], n);

	// UTF-16 -> GBK (code page 936)
	n = WideCharToMultiByte(936, 0, unicodestr.c_str(), -1, 0, 0, 0, 0 );
	if (n <= 0) return gbkstr;
	gbkstr.resize(n);
	WideCharToMultiByte(936, 0, unicodestr.c_str(), -1, &gbkstr[0], n, 0, 0 );
	gbkstr.resize(n - 1);	// drop the terminating NUL that was counted in n

	return gbkstr;
}

Test code:

	const char*gbk  = "    . hello world!";
	printf("gbk=%s\n", gbk);

	std::string utf8str = ctk_gbk2utf8(gbk);
	printf("utf8str=%s\n", utf8str.c_str());

	std::string gbkstr = ctk_utf82gbk(utf8str.c_str());
	printf("gbkstr=%s\n", gbkstr.c_str());

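To handle other languages such as Japanese or Korean, replace 936 with the corresponding code page (for example, 932 for Japanese Shift-JIS or 949 for Korean). CP_UTF8 above can also be written as 65001. In general, converting with the system APIs works quite well.
Correction: an earlier version of this post said that the mbstowcs and wcstombs functions from the C/C++ standard library cannot handle character sets other than the current code page on Windows, and that setlocale does not help. I had taken that from other articles on the web without trying it myself, and it is wrong.
3.3 Encoding conversion (CRT)
You can also convert encodings with the mbstowcs and wcstombs functions from the C/C++ standard library. You can test this with the two groups of functions below, one implemented with the Windows API and one with mbstowcs and wcstombs; the results are the same.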

std::string ctk_gbk2utf8(const char*s)
{
	s = s?s:"";
	std::wstring unicodestr;
	std::string utf8str;
	// GBK -> UTF-16 -> UTF-8; each count n includes the terminating NUL
	int n = MultiByteToWideChar(936, 0, s, -1, NULL, 0);
	if (n <= 0) return utf8str;
	unicodestr.resize(n);
	MultiByteToWideChar(936, 0, s, -1, &unicodestr[0], n);
	n = WideCharToMultiByte(CP_UTF8, 0, unicodestr.c_str(), -1, 0, 0, 0, 0 );
	if (n <= 0) return utf8str;
	utf8str.resize(n);
	WideCharToMultiByte(CP_UTF8, 0, unicodestr.c_str(), -1, &utf8str[0], n, 0, 0 );
	utf8str.resize(n - 1);	// drop the terminating NUL
	return utf8str;
}

std::string ctk_gbk2big5(const char*s)
{
	s = s?s:"";
	std::wstring unicodestr;
	std::string dststr;
	// GBK (936) -> UTF-16 -> Big5 (950); each count n includes the terminating NUL
	int n = MultiByteToWideChar(936, 0, s, -1, NULL, 0);
	if (n <= 0) return dststr;
	unicodestr.resize(n);
	MultiByteToWideChar(936, 0, s, -1, &unicodestr[0], n);
	n = WideCharToMultiByte(950, 0, unicodestr.c_str(), -1, 0, 0, 0, 0 );
	if (n <= 0) return dststr;
	dststr.resize(n);
	WideCharToMultiByte(950, 0, unicodestr.c_str(), -1, &dststr[0], n, 0, 0 );
	dststr.resize(n - 1);	// drop the terminating NUL
	return dststr;
}

std::string ctk_gbk2big5_crt(const char*s)
{
	s = s?s:"";
	std::string srcstr = s;
	// Remember the current locale so it can be restored afterwards
	std::string curLocale = setlocale(LC_ALL, NULL);
	setlocale(LC_ALL, ".936");	// decode the input as GBK
	size_t newSize = srcstr.length() + 1;
	std::wstring unicodestr(newSize, L'\0');
	size_t n = mbstowcs(&unicodestr[0], srcstr.c_str(), newSize);
	unicodestr.resize(n == (size_t)-1 ? 0 : n);	// keep only the converted characters
	setlocale(LC_ALL, ".950");	// encode the output as Big5
	newSize = unicodestr.length()*2 + 1;	// at most 2 bytes per character
	std::string newstr(newSize, '\0');
	n = wcstombs(&newstr[0], unicodestr.c_str(), newSize);
	newstr.resize(n == (size_t)-1 ? 0 : n);	// trim the unused tail of the buffer
	setlocale(LC_ALL, curLocale.c_str());
	return newstr;
}

std::string ctk_big52gbk_crt(const char*s)
{
	s = s?s:"";
	std::string srcstr = s;
	// Remember the current locale so it can be restored afterwards
	std::string curLocale = setlocale(LC_ALL, NULL);
	setlocale(LC_ALL, ".950");	// decode the input as Big5

	size_t newSize = srcstr.length() + 1;
	std::wstring unicodestr(newSize, L'\0');
	size_t n = mbstowcs(&unicodestr[0], srcstr.c_str(), newSize);
	unicodestr.resize(n == (size_t)-1 ? 0 : n);	// keep only the converted characters
	setlocale(LC_ALL, ".936");	// encode the output as GBK
	newSize = unicodestr.length()*2 + 1;	// at most 2 bytes per character
	std::string newstr(newSize, '\0');
	n = wcstombs(&newstr[0], unicodestr.c_str(), newSize);
	newstr.resize(n == (size_t)-1 ? 0 : n);	// trim the unused tail of the buffer
	setlocale(LC_ALL, curLocale.c_str());
	return newstr;
}

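4. Linux/unix Platform
4.1 iconv
On these platforms, character encoding conversion is generally done with iconv, which may be either the standalone iconv library or the version that ships with glibc. Encoding problems come up more easily on linux/unix than on windows; people often complain about garbled names on files transferred over ftp (incidentally, IE versions up to and including IE 7 show garbled text when accessing a utf-8 ftp server directly, and a network capture tool shows that part of the utf-8 string IE sends is wrong).
For now I will only talk about encoding problems inside a program and leave locale and the like aside (although they are related).
Here is a transcoding function I "wrapped" myself.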

char *ctk_iconv(const char *fromStr, const int fromLen, char**toStr,  const char *fromCode, const char *toCode)
{
	char *buffer;
	iconv_t cd;
	const char *inbuf = NULL; 
	size_t inbytesleft = 0;
	char *outbuf = NULL;
	size_t outbytesleft = 0;
	int errorCode = 0;
	int bufferSize=0;
	size_t ret = 0;
	int done = 0;

	if (fromStr==NULL || fromStr[0]=='\0' || fromLen <=0 ) return NULL;
	if (fromCode==NULL || fromCode[0]=='\0' ) return NULL;
	if (toCode==NULL || toCode[0]=='\0' ) return NULL;

	memset(&cd, 0x00, sizeof(iconv_t));
	inbuf = fromStr;
	inbytesleft = fromLen;

	errorCode = 0;
	bufferSize = fromLen*4+1;
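	buffer = (char*)malloc(sizeof(char)*bufferSize);
	if (buffer == NULL) return NULL;
	memset(buffer, 0x00, bufferSize);

	outbuf = buffer;
	outbytesleft = bufferSize - 1;	// keep the last byte as the terminating NUL

	if ( (iconv_t)-1  == ( cd = iconv_open(toCode, fromCode) ) ) {
		free(buffer);	// do not leak the buffer when the encoding pair is unsupported
		return NULL;
	}	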

	while ( inbytesleft >0 && done !=1 ) {
		ret = iconv(cd, (char**)&inbuf, &inbytesleft, &outbuf, &outbytesleft);
		if ( (size_t)-1  == ret ) {
			errorCode = errno;
			switch(errorCode) 
			{
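			case EILSEQ:
			{
				// Invalid input sequence: copy the offending byte through unchanged and continue
				if (outbytesleft > 0)
				{
					memcpy(outbuf, inbuf, 1);
					outbuf += 1;
					outbytesleft -= 1;
					inbuf += 1;
					inbytesleft -= 1;
				}
				else
				{
					done = 1;	// output buffer is full: stop instead of looping forever
				}
			}
			break;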
			case EINVAL:
			{
				done = 1;
			}
			break;
			case E2BIG:
			{
				done = 1;
				break;
			}
			break;
			default:
				done = 1;
			}
		}
	}
	if ( NULL != toStr)
		*toStr = buffer;
	iconv_close(cd);
	return buffer;
}

std::string ctk_iconv_gbk2utf8(const char*s)
{
	s = s ? s:"";
	char *utf8str = NULL;
	ctk_iconv(s, strlen(s), &utf8str,  "gbk", "utf-8");
	std::string result("");
	if (utf8str!=NULL)
	{
		result = utf8str;
		free(utf8str);
	}
	return result;
}

std::string ctk_iconv_utf82gbk(const char*s)
{
	s = s ? s:"";
	char *gbkstr = NULL;
	ctk_iconv(s, strlen(s), &gbkstr, "utf-8", "gbk");
	std::string result("");
	if (gbkstr!=NULL)
	{
		result = gbkstr;
		free(gbkstr);
	}
	return result;
}
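
As an alternative to the fixed-size output buffer used by ctk_iconv, the sketch below grows the buffer whenever iconv reports E2BIG and reports invalid input with an exception instead of copying bytes through. The name `iconv_convert` is hypothetical; the sketch assumes the glibc/POSIX signature of iconv (a `char**` input pointer) and does not flush the shift state of stateful encodings.

```cpp
#include <iconv.h>
#include <cerrno>
#include <stdexcept>
#include <string>

// Hypothetical alternative to ctk_iconv: converts `in` from `fromCode` to
// `toCode`, doubling the output buffer on E2BIG. Throws std::runtime_error
// when the encoding pair is unsupported or the input is invalid.
std::string iconv_convert(const std::string& in, const char* fromCode, const char* toCode)
{
    iconv_t cd = iconv_open(toCode, fromCode);
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    std::string out(in.size() + 16, '\0');
    char* src = const_cast<char*>(in.data());  // iconv does not modify the input bytes
    size_t srcLeft = in.size();
    size_t used = 0;                           // bytes of `out` written so far

    while (srcLeft > 0) {
        char* dst = &out[used];
        size_t dstLeft = out.size() - used;
        size_t rc = iconv(cd, &src, &srcLeft, &dst, &dstLeft);
        used = out.size() - dstLeft;
        if (rc == (size_t)-1) {
            if (errno == E2BIG) {
                out.resize(out.size() * 2);    // not enough room: grow and retry
            } else {
                iconv_close(cd);               // EILSEQ or EINVAL: bad or truncated input
                throw std::runtime_error("iconv: invalid input sequence");
            }
        }
    }
    iconv_close(cd);
    out.resize(used);                          // trim to the bytes actually produced
    return out;
}
```

A call such as iconv_convert(text, "gbk", "utf-8") would then play the same role as ctk_iconv_gbk2utf8, with the difference that bad input is reported rather than passed through.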

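iconv also works very well on windows.
4.2 ICU
IBM's ICU is also very good at encoding conversion and turns up all over the place; php 6 planned to build its internal Unicode support on it. I have no real experience with it, so I will not say much. http://www-01.ibm.com/software/globalization/icu/index.html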
