Comparing TEXT substring-match (LIKE '%...%') search speed between EUCJP and UTF8 in PostgreSQL


Motivation

I envy how little space ASCII takes... For Japanese text, EUCJP or SJIS is more compact than UTF8! Could that make database searches faster as well?

Data used

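I use Wikipedia.
I extract the article bodies by following the article "wp2txtでwikipediaのコーパスを作るまでの道のり" (roughly, "The road to building a Wikipedia corpus with wp2txt"). Running
wp2txt --input-file jawiki-latest-pages-articles.xml --num-threads=8 --no-list --no-heading --no-title --no-marker
writes out the text, which I then preprocess and encode in Ruby as shown below.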

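txt = File.read("original_text_file")  # placeholder path to the wp2txt output
# Strip various characters:
#   \t  because the file is loaded as TSV;  \r  just in case
#   \\  so that loading into SQL does not trip over backslashes
# and collapse runs of blank lines (\n{2,}) into a single newline.
txt.delete!("\t\r\\")
txt.gsub!(/\n{2,}/, "\n")

# Drop anything EUC-JP cannot represent so both files hold the same characters.
# The options are passed as keywords (**opts) so the call also works on Ruby 3.x.
opts = { undef: :replace, invalid: :replace, replace: "" }
utf8_txt = txt.encode(Encoding::EUC_JP, **opts).encode(Encoding::UTF_8, **opts)
eucjp_txt = utf8_txt.encode(Encoding::EUC_JP, **opts)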
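
As a rough check of what this round trip does, here is a minimal sketch on a tiny sample string. The sample text and the temporary file names are assumptions for illustration only, not the data used in this article. It shows that a character with no EUC-JP mapping (the emoji) is dropped, and that the same Japanese text takes fewer bytes on disk in EUC-JP than in UTF-8.

```ruby
require "tmpdir"

opts = { undef: :replace, invalid: :replace, replace: "" }

# Hypothetical sample: 8 Japanese characters, one emoji, one newline.
# The emoji has no EUC-JP mapping, so the round trip removes it.
sample = "日本語のテキスト😀\n"

utf8_txt  = sample.encode(Encoding::EUC_JP, **opts).encode(Encoding::UTF_8, **opts)
eucjp_txt = utf8_txt.encode(Encoding::EUC_JP, **opts)

# Write both variants as raw bytes (no transcoding) and compare file sizes.
sizes = Dir.mktmpdir do |dir|
  utf8_path  = File.join(dir, "utf8.txt")
  eucjp_path = File.join(dir, "eucjp.txt")
  File.binwrite(utf8_path, utf8_txt)
  File.binwrite(eucjp_path, eucjp_txt)
  { utf8: File.size(utf8_path), eucjp: File.size(eucjp_path) }
end

puts utf8_txt
puts "UTF-8: #{sizes[:utf8]} bytes, EUC-JP: #{sizes[:eucjp]} bytes"
```

Each kana or kanji here is 3 bytes in UTF-8 and 2 bytes in EUC-JP, so the sample shrinks from 25 bytes to 17.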

Environment

OS: Ubuntu 18.04.1 LTS
PostgreSQL: 10.5 (Ubuntu 10.5-0ubuntu0.18.04)
- No tuning whatsoever
- Indexes? What are those? (none are created)

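Preparation

SJIS can be specified as the client-side encoding for input and output, but it cannot be used as the encoding for data stored in the database, so I use UTF8 and EUCJP.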

$ sudo  localedef -f EUC-JP -i ja_JP ja_JP.EUC-JP

postgres=# CREATE DATABASE utf8 WITH ENCODING 'UTF8';
CREATE DATABASE
postgres=# CREATE DATABASE "eucjp" WITH TEMPLATE="template0" ENCODING='EUC_JP' LC_COLLATE='C' LC_CTYPE='C';
CREATE DATABASE
postgres=# \l
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   
-----------+----------+----------+-------------+-------------+-----------------------
 eucjp     | postgres | EUC_JP   | C           | C           | 
 postgres  | postgres | UTF8     | ja_JP.UTF-8 | ja_JP.UTF-8 | 
 template0 | postgres | UTF8     | ja_JP.UTF-8 | ja_JP.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | ja_JP.UTF-8 | ja_JP.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 utf8      | postgres | UTF8     | ja_JP.UTF-8 | ja_JP.UTF-8 | 
(5 rows)

postgres=# \c utf8
You are now connected to database "utf8" as user "postgres".
utf8=# CREATE TABLE utf8text ( text text );
CREATE TABLE
utf8=# \encoding
UTF8
utf8=# COPY utf8text FROM '/path/to/utf8.txt';
COPY 10191461
utf8=# \c eucjp
You are now connected to database "eucjp" as user "postgres".
eucjp=# CREATE TABLE eucjptext ( text text );
CREATE TABLE
eucjp=# \encoding EUC-JP
eucjp=# \encoding
EUC_JP
eucjp=# COPY eucjptext FROM '/path/to/eucjp.txt';
COPY 10191461

-- Checking the database sizes, just out of curiosity
eucjp=# SELECT datname, pg_size_pretty(pg_database_size(datname)) FROM pg_database;
  datname  | pg_size_pretty 
-----------+----------------
 postgres  | 7629 kB
 template1 | 7497 kB
 template0 | 7497 kB
 utf8      | 2961 MB
 eucjp     | 2142 MB
(5 rows)

Measurement

postgres=# \timing
Timing is on.
postgres=# \c utf8 
You are now connected to database "utf8" as user "postgres".
-- Run each of the following commands 5 times
utf8=# SELECT count(text) FROM utf8text WHERE text LIKE '%に%';
utf8=# SELECT count(text) FROM utf8text WHERE text LIKE '%ナレッジコミュニティ%';

utf8=# \c eucjp 
You are now connected to database "eucjp" as user "postgres".
-- Run each of the following commands 5 times
eucjp=# SELECT count(text) FROM eucjptext WHERE text LIKE '%に%';
eucjp=# SELECT count(text) FROM eucjptext WHERE text LIKE '%ナレッジコミュニティ%';
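
The measurements in this article were taken by hand with psql's \timing. As a hedged sketch of how the "run each query 5 times" step could be scripted, the helper below times a block repeatedly. The names time_runs, like_query and run_query are hypothetical, and run_query is only a stand-in: a real run would replace it with an actual database call (for example through the pg gem, which is an assumption and was not used in this article).

```ruby
# Run the given block `runs` times and return each elapsed time in milliseconds.
def time_runs(runs = 5)
  Array.new(runs) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000.0
  end
end

# Build the query shape used in this article.
# The term is interpolated as-is, so only pass trusted literals.
def like_query(table, term)
  "SELECT count(text) FROM #{table} WHERE text LIKE '%#{term}%';"
end

queries = %w[に ナレッジコミュニティ].map { |term| like_query("utf8text", term) }

# Stand-in for the real database call (assumption): it does no real work,
# so the printed times are meaningless until it is replaced.
run_query = ->(sql) { sql.length }

queries.each do |sql|
  times = time_runs(5) { run_query.call(sql) }
  puts format("%s  mean %.3f ms", sql, times.sum / times.size)
end
```

Timing on the client side like this includes the round trip to the server, which is also what \timing reports.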

Results

All times are in milliseconds [ms].

'%に%'

	UTF8	EUCJP
Run 1	1880.681	1532.657
Run 2	1876.743	1588.650
Run 3	1942.774	1549.886
Run 4	1883.708	1576.893
Run 5	1904.838	1603.320
Average	1897.7488	1570.2812

'%ナレッジコミュニティ%'

	UTF8	EUCJP
Run 1	4053.864	2562.225
Run 2	3983.899	2510.445
Run 3	3926.061	2475.307
Run 4	3952.371	2457.570
Run 5	3946.132	2465.966
Average	3972.4654	2494.3026
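
To make the averages easy to re-check, here is a small sketch that recomputes them from the five runs listed in the tables and expresses the gap as a percentage. The variable and key names (timings, summary, saving_pct) are made up for this example.

```ruby
# Timings in ms, copied from the result tables in this article.
timings = {
  "%に%" => {
    utf8:  [1880.681, 1876.743, 1942.774, 1883.708, 1904.838],
    eucjp: [1532.657, 1588.650, 1549.886, 1576.893, 1603.320],
  },
  "%ナレッジコミュニティ%" => {
    utf8:  [4053.864, 3983.899, 3926.061, 3952.371, 3946.132],
    eucjp: [2562.225, 2510.445, 2475.307, 2457.570, 2465.966],
  },
}

# Arithmetic mean of a list of numbers.
def mean(values)
  values.sum / values.size
end

# For each pattern: both means, plus how much less time EUCJP took than UTF8.
summary = timings.map do |pattern, runs|
  utf8_mean  = mean(runs[:utf8])
  eucjp_mean = mean(runs[:eucjp])
  saving_pct = (1.0 - eucjp_mean / utf8_mean) * 100
  [pattern, { utf8: utf8_mean, eucjp: eucjp_mean, saving_pct: saving_pct }]
end.to_h

summary.each do |pattern, s|
  printf("%s: UTF8 %.1f ms, EUCJP %.1f ms, EUCJP took %.1f%% less time\n",
         pattern, s[:utf8], s[:eucjp], s[:saving_pct])
end
```

With these numbers, EUCJP took roughly 17% less time for the one-character pattern and roughly 37% less for the longer one.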

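The difference turned out to be fairly noticeable: on average, EUCJP took about 17% less time than UTF8 for '%に%' and about 37% less for '%ナレッジコミュニティ%'.
Still, once you factor in the overhead of converting the search term's encoding on input and output, I suppose UTF8 remains the safe choice.

While looking into this I came across something called KEEPONLYALNUM; I probably ran these measurements with it left ON.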

Author And Source

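More information on this topic (PostgreSQLでEUCJPとUTF8のTEXT中間一致検索速度を比較) can be found at https://qiita.com/horyu/items/3181daa6665441b84066

Attribution: information about the original author is available at the original URL. Copyright belongs to the original author.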
