PostgreSQL TPC-H Test


When benchmarking a PostgreSQL database, pgbench is probably the first choice. But the default pgbench transactions are rather OLTP-like (it's a TPC-B-like stress test), so it's not very usable when you need to test a DSS/DWH-like workload. I needed to do that recently, and I wasn't very happy with the available tools. Luckily, the TPC Council provides DBGEN, a tool that makes it quite easy to generate data and scripts - and it's not very difficult to make that work with PostgreSQL. So let's see how to make it work.
I've found several tools that claim to implement a TPC-H-like benchmark, but they were unusable for various reasons. For example, DBT-3 is a bit old (last update in 2005) and its dbgen command keeps failing for strange reasons.
The tool from the TPC Council works quite well, and although it does not support PostgreSQL out of the box, it's not very difficult to make it work. You can get it right at the TPC-H page (see the right column, below the TPC-H specification). Download the tgz package (37MB).

compiling


After extracting it, you have to prepare a Makefile - enter the dbgen directory, copy makefile.suite to Makefile and modify these four variables (around line 109):
CC=gcc
DATABASE=ORACLE
MACHINE=LINUX
WORKLOAD=TPCH

If you're not running Linux, set MACHINE accordingly. Then just build the tool using make.
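The copy-and-edit step can be scripted. The sketch below writes a small stand-in for the variable block of makefile.suite (so the snippet runs anywhere) and then applies the same sed you would run against the real file - with the actual tarball, start from `cp makefile.suite Makefile` instead of the heredoc:

```shell
# Stand-in for the variable block of makefile.suite; with the real
# tarball you would do: cp makefile.suite Makefile
cat > Makefile <<'EOF'
CC      =
DATABASE=
MACHINE =
WORKLOAD =
EOF

# set the four variables non-interactively
sed -i -e 's/^CC *=.*/CC      = gcc/' \
       -e 's/^DATABASE *=.*/DATABASE= ORACLE/' \
       -e 's/^MACHINE *=.*/MACHINE = LINUX/' \
       -e 's/^WORKLOAD *=.*/WORKLOAD = TPCH/' Makefile

cat Makefile
# then build the tool: make
```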

generating data


So now we have a "dbgen" tool that generates data in a CSV format. Just like pgbench, it has a scaling factor that determines the amount of data generated - the default value (1) means about 1GB of raw data, i.e. about 2GB after loading into the database. So let's create 10GB of raw data:
$ ./dbgen -s 10

That gives us eight .tbl files in a CSV format, each containing data for one table. The problem is that each row ends with an extra "|" separator, so PostgreSQL fails to load it. But that's quite easily fixable with sed - just run this:
for i in *.tbl; do sed 's/|$//' "$i" > "${i%.tbl}.csv"; echo "$i"; done
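To see what the fix does, here it is applied to a single made-up row (region-style values, not actual dbgen output):

```shell
# the trailing '|' that dbgen emits is stripped
echo '1|AFRICA|special deposits|' | sed 's/|$//'
# prints: 1|AFRICA|special deposits
```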

Now we have eight CSV files that may be loaded into the database. But we have to create it first.

populating the database


Although the TPC-H specification describes the database structure, the create scripts are not part of the package. I've prepared a create script that creates all the tables and an alter script that adds the foreign keys (after the database is populated). So create a database and the tables:
$ createdb tpch
$ psql tpch < tpch-create.sql
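For reference, this is roughly what the create script contains, shown for the two smallest tables (column names and types follow the TPC-H specification; the real tpch-create.sql covers all eight tables):

```shell
# Emit sample DDL for region and nation, per the TPC-H spec;
# tpch-create.sql contains analogous statements for all eight tables.
cat > tpch-create-sample.sql <<'EOF'
CREATE TABLE region (
    r_regionkey  INTEGER       NOT NULL,
    r_name       CHAR(25)      NOT NULL,
    r_comment    VARCHAR(152)
);

CREATE TABLE nation (
    n_nationkey  INTEGER       NOT NULL,
    n_name       CHAR(25)      NOT NULL,
    n_regionkey  INTEGER       NOT NULL,
    n_comment    VARCHAR(152)
);
EOF
cat tpch-create-sample.sql
```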

Now it's time to populate the database with the generated data. Don't forget that COPY requires absolute paths, so you may use a script like this:
dir=`pwd`
opts="-h localhost tpch"

psql $opts -c "COPY part FROM '$dir/part.csv'
                               WITH (FORMAT csv, DELIMITER '|')"

psql $opts -c "COPY region FROM '$dir/region.csv'
                               WITH (FORMAT csv, DELIMITER '|')"

psql $opts -c "COPY nation FROM '$dir/nation.csv'
                               WITH (FORMAT csv, DELIMITER '|')"

psql $opts -c "COPY supplier FROM '$dir/supplier.csv'
                               WITH (FORMAT csv, DELIMITER '|')"

psql $opts -c "COPY customer FROM '$dir/customer.csv'
                               WITH (FORMAT csv, DELIMITER '|')"

psql $opts -c "COPY partsupp FROM '$dir/partsupp.csv'
                               WITH (FORMAT csv, DELIMITER '|')"

psql $opts -c "COPY orders FROM '$dir/orders.csv'
                               WITH (FORMAT csv, DELIMITER '|')"

psql $opts -c "COPY lineitem FROM '$dir/lineitem.csv'
                               WITH (FORMAT csv, DELIMITER '|')"

and finally create the foreign keys
$ psql tpch < tpch-alter.sql

So now we have a database populated with data, so let's prepare some queries.

generating queries


TPC-H describes 22 queries (or rather templates), stored in the "queries" directory. Then there is a "qgen" tool that generates queries from the templates (mostly filling them with random parameter values). We have to fix some minor problems before running it.
First, most of the queries use a ROWCOUNT directive (or another db-specific variant) instead of LIMIT. This is quite easy to fix. Second, there are about five queries that use correlated aggregate subqueries with huge outer tables - that does not work very well in PostgreSQL, but it's easy to rewrite them as a JOIN. So instead of a query like this
select
	sum(l_extendedprice) / 7.0 as avg_yearly
from
	lineitem,
	part
where
	p_partkey = l_partkey
	and p_brand = ':1'
	and p_container = ':2'
	and l_quantity < (
		select
			0.2 * avg(l_quantity)
		from
			lineitem
		where
			l_partkey = p_partkey
	)
LIMIT 1;

you get a query like this
select
	sum(l_extendedprice) / 7.0 as avg_yearly
from
	lineitem,
	part,
	(SELECT l_partkey AS agg_partkey,
                0.2 * avg(l_quantity) AS avg_quantity
          FROM lineitem GROUP BY l_partkey) part_agg
where
	p_partkey = l_partkey
	and agg_partkey = l_partkey
	and p_brand = ':1'
	and p_container = ':2'
	and l_quantity < avg_quantity
LIMIT 1;

that produces the same result and is much faster (actually, the first query never finished for me). I'm not quite sure how strict TPC-H is about the query format, but my goal was not to do a fully compliant TPC-H benchmark - rather to get a good starting point for a DSS benchmark.
The modified queries are available here - just extract it into a new 'queries-pg' directory.
$ mkdir queries-pg
$ tar -C queries-pg -xzvf ~/tpch-queries.tgz

and you're ready to generate the workload. This
$ DSS_QUERY=queries-pg ./qgen > workload.sql

gives you a script that executes one query for each template. If you want to generate a workload with only some of the queries, list the query IDs at the command line like this
$ DSS_QUERY=queries-pg ./qgen 2 3 7 > workload.sql

That will give you a workload with queries 2.sql, 3.sql and 7.sql only. If you want to generate more queries, you may do this
for r in `seq 1 10`
do
  rn=$((`cat /dev/urandom|od -N3 -An -i` % 10000))
  DSS_QUERY=queries-pg ./qgen -r $rn >> workload.sql
done

Notice how qgen is initialized with a different random seed each time (by default it uses a timestamp, so executing it multiple times within a second gives exactly the same queries).

running the workload


So we have a populated database and we know how to generate a workload, but how to run it? I've been thinking about the pgbench "custom scripts" feature, but it would be quite difficult to generate all the various random values.
I'm using a very simple approach - decide how many "clients" I want to run (with a DSS benchmark this tends to be a quite low number, e.g. 4), prepare a corresponding number of workload scripts (e.g. workload-1.sql .. workload-4.sql) and then do something like this
# start the processes
for c in `seq 1 4`
do
  /usr/bin/time -f "total=%e" -o result-$c.log \
                psql tpch < workload-$c.sql > /dev/null 2>&1 &
done;

# wait for the processes
for p in `jobs -p`
do
   wait $p;
done;

This produces a bunch of result logs with the number of seconds for each workload. Obviously there are a lot of other things you can do - collect data from various tools (iostat, vmstat, ...) and so on.
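To turn those result logs into a single number, the per-client times can be aggregated - the slowest client determines the wall-clock time of the whole run. A sketch, with made-up timings standing in for the real logs produced by /usr/bin/time above:

```shell
tmp=$(mktemp -d)
# stand-in logs with made-up timings; the real ones come from
# /usr/bin/time -f "total=%e" -o result-$c.log above
printf 'total=1234.56\n' > "$tmp/result-1.log"
printf 'total=1410.02\n' > "$tmp/result-2.log"

# slowest client = elapsed time of the whole benchmark run
grep -h '^total=' "$tmp"/result-*.log | cut -d= -f2 | sort -n | tail -n 1
# prints: 1410.02
```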
Keep in mind that those queries have to chew through a lot of data (depending on the scale factor you've chosen), and when something goes wrong (e.g. a bad plan is chosen), it will take forever. That's why I usually create a batch of scripts with one query per file, and then kill the query if it does not finish within a given time frame (e.g. 15 minutes). The timeout command may be used to do that
# prepare queries and execute them with 60 seconds timeout
for q in `seq 1 22`
do
  DSS_QUERY=queries-pg ./qgen $q > q-$q.sql
  timeout 60 psql tpch < q-$q.sql
done
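When a query is killed this way, timeout exits with status 124, which makes it easy to log which queries hit the limit. Demonstrated here with sleep standing in for a slow query:

```shell
# sleep 5 stands in for a slow query; the 1-second limit kills it
status=0
timeout 1 sleep 5 || status=$?
if [ "$status" -eq 124 ]; then
  echo "query timed out"
fi
# prints: query timed out
```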

The random number is not needed anymore as the queries tend to take more than one second.
from: http://www.fuzzy.cz/en/articles/dss-tpc-h-benchmark-with-postgresql/