通过TPC-H生成测试数据集并导入HIVE

Hive-Testbench:https://github.com/hortonworks/hive-testbench/
Tpcds-Kit:https://github.com/gregrahn/tpcds-kit
官网:http://www.tpc.org/

针对数据库不同的使用场景TPC组织提供了多种数据集,主要的TPC数据集有如下几种
TPC-C:模拟一个库存-订单系统以及其上的多用户并发事务;
TPC-DI:模拟多种类型的大数据源的ETL过程;
TPC-DS:模拟大型零售业务的系统,该系统主要用于BI和决策支持,数据量和OLAP查询复杂度都很高,是与真实场景非常接近的一个测试集;
TPC-E:模拟证券经纪人的系统,该系统主要用于提供大量查询的OLTP服务;
TPC-H:可以近似视为TPC-DS的简化版本,但它的数据表数据特征单一,已经不能精准反映当今数据库系统的真实性能。

官网下载页面

这里通过官网的下载导入,将下载的文件解压后,修改dbgen中makefile.suite为makefile文件并进行修改

################
## CHANGE NAME OF ANSI COMPILER HERE
################
CC      =  gcc
# Current values for DATABASE are: INFORMIX, DB2, TDAT (Teradata)
#                                  SQLSERVER, SYBASE, ORACLE, VECTORWISE
# Current values for MACHINE are:  ATT, DOS, HP, IBM, ICL, MVS, 
#                                  SGI, SUN, U2200, VMS, LINUX, WIN32 
# Current values for WORKLOAD are:  TPCH
DATABASE= SQLSERVER
MACHINE = LINUX
WORKLOAD = TPCH

由于官方数据库支持列表并没有原生支持Hive,所以通过SQLSERVER进行修改为HQL风格,修改tpcd.h文件

#ifdef  SQLSERVER
#define GEN_QUERY_PLAN  "explain;"
#define START_TRAN      "start transaction;\n"
#define END_TRAN        "commit;\n"
#define SET_OUTPUT      ""
#define SET_ROWCOUNT    "limit %d;\m"
#define SET_DBASE       "use %s;\n"
#endif

先通过make命令进行编译,再执行./dbgen -s 6生成数据。6为指定数据大小为5个SF,SF为TPC中描述数据量的单位,1个SF约等于1GB

[root@node03 dbgen]# ll -h *.tbl
-rw-r--r-- 1 root root 140M 6月   8 13:30 customer.tbl
-rw-r--r-- 1 root root 4.4G 6月   8 13:30 lineitem.tbl
-rw-r--r-- 1 root root 2.2K 6月   8 13:30 nation.tbl
-rw-r--r-- 1 root root 997M 6月   8 13:30 orders.tbl
-rw-r--r-- 1 root root 688M 6月   8 13:30 partsupp.tbl
-rw-r--r-- 1 root root 139M 6月   8 13:30 part.tbl
-rw-r--r-- 1 root root  389 6月   8 13:30 region.tbl
-rw-r--r-- 1 root root 8.2M 6月   8 13:30 supplier.tbl

建表语句在dss.ddl文件中,但是它的格式与HQL不符,因此需要另外改写建表语句。

create database tpch;
use tpch;

create external table lineitem (
  l_orderkey int,
  l_partkey int,
  l_suppkey int,
  l_linenumber int,
  l_quantity double,
  l_extendedprice double,
  l_discount double,
  l_tax double,
  l_returnflag string,
  l_linestatus string,
  l_shipdate string,
  l_commitdate string,
  l_receiptdate string,
  l_shipinstruct string,
  l_shipmode string,
  l_comment string)
row format delimited
fields terminated by '|'
stored as textfile;

create external table nation (
  n_nationkey int,
  n_name string,
  n_regionkey int,
  n_comment string)
row format delimited
fields terminated by '|'
stored as textfile;

create external table region (
  r_regionkey int,
  r_name string,
  r_comment string)
row format delimited
fields terminated by '|'
stored as textfile;

create external table part (
  p_partkey int,
  p_name string,
  p_mfgr string,
  p_brand string,
  p_type string,
  p_size int,
  p_container string,
  p_retailprice double,
  p_comment string)
row format delimited
fields terminated by '|'
stored as textfile;

create external table supplier (
  s_suppkey int,
  s_name string,
  s_address string,
  s_nationkey int,
  s_phone string,
  s_acctbal double,
  s_comment string)
row format delimited
fields terminated by '|'
stored as textfile;

create external table partsupp (
  ps_partkey int,
  ps_suppkey int,
  ps_availqty int,
  ps_supplycost double,
  ps_comment string)
row format delimited
fields terminated by '|'
stored as textfile;

create external table customer (
  c_custkey int,
  c_name string,
  c_address string,
  c_nationkey int,
  c_phone string,
  c_acctbal double,
  c_mktsegment string,
  c_comment string)
row format delimited
fields terminated by '|'
stored as textfile;

create external table orders (
  o_orderkey int,
  o_custkey int,
  o_orderstatus string,
  o_totalprice double,
  o_orderdate date,
  o_orderpriority string,
  o_clerk string,
  o_shippriority int,
  o_comment string)
row format delimited
fields terminated by '|'
stored as textfile;

use tpch;
load data inpath "/tpch/supplier.tbl" into table supplier;
load data inpath "/tpch/region.tbl" into table region;
load data inpath "/tpch/partsupp.tbl" into table partsupp;
load data inpath "/tpch/part.tbl" into table part;
load data inpath "/tpch/orders.tbl" into table orders;
load data inpath "/tpch/nation.tbl" into table nation;
load data inpath "/tpch/lineitem.tbl" into table lineitem;
load data inpath "/tpch/customer.tbl" into table customer; 
8个表的Schema,其行数也是SF的倍数。

HQL风格查询语句:https://issues.apache.org/jira/browse/HIVE-600 
TPCH 22条SQL语句分析:https://yq.aliyun.com/articles/149715 
参考来源:https://www.jianshu.com/p/59155803d67b

已发布

分类

,

作者:

标签

评论

发表回复

您的电子邮箱地址不会被公开。 必填项已用*标注