Hbase의 noSQL Data를 분석이나 Service에서 그대로 사용하려면 개발 공수가 많이 든다.
그래서 주로 SQL on Hadoop을 쓴다.
나는 Hbase를 Data Ware House로 구성하고 SQL On Hadoop으로 Hive + Presto를 사용했는데
먼저 HBase -> Hive Porting 하는 방법을 기술한다.
1. HBase에 테이블이 존재한다고 가정한다.
hbase(main):001:0> list
TABLE
cost_tbl
tbil_aws_d
tbil_aws_s
tbil_azr_day_s
tbil_comn_c
tbil_oprt_dtl_h
tbil_pfx_h
tbil_svg_set_inst_l
tbil_svg_set_l
tbil_tag_h
tic_tbil_aws_d
toc_tbil_aws_d
tpms_azr_sbsc_l
13 row(s) in 0.2380 seconds
우리는 cost_tbl 을 Hive로 포팅할거다.
2. Hive 접속
[hadoop@ip-10-251-157-14 ~]$ hive shell
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: true
hive>CREATE EXTERNAL TABLE cost_tbl(id int, InvoiceID string, PayerAccountId string, LinkedAccountId string, RecordType string, RecordId string, ProductName string, RateId string, SubscriptionId string, PricingPlanId string, UsageType string, Operation string, AvailabilityZone string, ReservedInstance string, ItemDescription string, UsageStartDate string, UsageEndDate string, UsageQuantity string, BlendedRate string, BlendedCost string, UnBlendedRate string, UnBlendedCost string, ResourceId string, createdBy string, Name string, SERVICE string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES("hbase.columns.mapping" = ":key, info:InvoiceID, info:PayerAccountId, info:LinkedAccountId, info:RecordType, info:RecordId, info:ProductName, info:RateId, info:SubscriptionId, info:PricingPlanId, info:UsageType, info:Operation, info:AvailabilityZone, info:ReservedInstance, info:ItemDescription, info:UsageStartDate, info:UsageEndDate, info:UsageQuantity, info:BlendedRate, info:BlendedCost, info:UnBlendedRate, info:UnBlendedCost, info:ResourceId, info:createdBy, info:Name, info:SERVICE") TBLPROPERTIES("hbase.table.name"="cost_tbl");
OK
Time taken: 1.177 seconds
hive> desc cost_tbl;
OK
id int
invoiceid string
payeraccountid string
linkedaccountid string
recordtype string
recordid string
productname string
rateid string
subscriptionid string
pricingplanid string
usagetype string
operation string
availabilityzone string
reservedinstance string
itemdescription string
usagestartdate string
usageenddate string
usagequantity string
blendedrate string
blendedcost string
unblendedrate string
unblendedcost string
resourceid string
createdby string
name string
service string
Time taken: 0.108 seconds, Fetched: 26 row(s)
hive> select * from cost_tbl limit 10;
OK
100 150344444 8.18841E+11 8.18841E+11 LineItem 1.08176E+26 Amazon Simple Storage Service 282256825 947325337 31516415 USE1-APN2-AWS-Out-Bytes HeadBucket N $0.02 per GB - US East (Northern Virginia) data transfer to Asia Pacific (Seoul) 2018-09-01 3:00 2018-09-01 4:00 3.492E-07 0.02 0.000000007 0.02 0.000000007 bespin-poc-cloudability Root:818840791441
1000 150344444 8.18841E+11 8.18841E+11 LineItem 1.08758E+26 Amazon Simple Storage Service 282257538 947325337 31516415 APN2-USE1-AWS-Out-Bytes ReadLocation N $0.080 per GB - Asia Pacific (Seoul) data transfer to US East (Northern Virginia) 2018-09-03 0:00 2018-09-03 1:00 1.322E-07 0.08 1.06E-08 0.08 1.06E-08 cf-templates-1chkwotxnn4ka-ap-northeast-2 Root:818840791441
10000 150344444 8.18841E+11 8.18841E+11 LineItem 1.11278E+26 Amazon Relational Database Service 282256794 947325337 31516415 APN2-DataTransfer-In-Bytes Not Applicable N $0.000 per GB - data transfer in per month 2018-09-21 4:00 2018-09-21 5:00 5.662E-07 0 0 0 0 arn:aws:rds:ap-northeast-2:818840791441:db:seoul-demo-bespin-maria01 IAMUser:AIDAJOE7RT57DQMT26DY4:jonglok.kim
100000 150344444 8.18841E+11 86558720570 LineItem 1.12122E+26 Amazon DynamoDB 276961633 979142853 30710885 APN2-ReadCapacityUnit-Hrs CommittedThroughput ap-northeast-2 N $0.00 per hour for 25 units of read capacity for a month (free tier) 2018-09-26 3:00 2018-09-26 4:00 1 0 0 0 0 arn:aws:dynamodb:ap-northeast-2:086558720570:table/my-favorite-movies-table
100001 150344444 8.18841E+11 86558720570 LineItem 1.12121E+26 Amazon DynamoDB 276961501 979142853 30710885 APN2-WriteCapacityUnit-Hrs CommittedThroughput ap-northeast-2 N $0.00 per hour for 25 units of write capacity for a month (free tier) 2018-09-26 3:00 2018-09-26 4:00 1 0 0 0 0 arn:aws:dynamodb:ap-northeast-2:086558720570:table/my-favorite-movies-table
100002 150344444 8.18841E+11 86558720570 LineItem 1.12121E+26 Amazon DynamoDB 276961501 979142853 30710885 APN2-WriteCapacityUnit-Hrs CommittedThroughput ap-northeast-2 N $0.00 per hour for 25 units of write capacity for a month (free tier) 2018-09-26 3:00 2018-09-26 4:00 2 0 0 0 0 arn:aws:dynamodb:ap-northeast-2:086558720570:table/fa003490
100003 150344444 8.18841E+11 86558720570 LineItem 1.12121E+26 Amazon DynamoDB 276961501 979142853 30710885 APN2-WriteCapacityUnit-Hrs CommittedThroughput ap-northeast-2 N $0.00 per hour for 25 units of write capacity for a month (free tier) 2018-09-26 3:00 2018-09-26 4:00 10 0 0 0 0 arn:aws:dynamodb:ap-northeast-2:086558720570:table/Movies
100004 150344444 8.18841E+11 86558720570 LineItem 1.1214E+26 Amazon Elastic Compute Cloud 285152638 993428303 32138069 APN2-BoxUsage:m4.large RunInstances ap-northeast-2a N $0.123 per On Demand Linux m4.large Instance Hour 2018-09-26 3:00 2018-09-26 4:00 1 0.051144633 0.051144633 0.123 0.123 i-0bd6c2adfb0804bdd AssumedRole:AROAI7IA3IKU7V6AZJGM2:AutoScaling k8s-node2
100005 150344444 8.18841E+11 86558720570 LineItem 1.1214E+26 Amazon Elastic Compute Cloud 285152638 993428303 32138069 APN2-BoxUsage:m4.large RunInstances ap-northeast-2a N $0.123 per On Demand Linux m4.large Instance Hour 2018-09-26 3:00 2018-09-26 4:00 1 0.051144633 0.051144633 0.123 0.123 i-08e5dfd2d1885289c AssumedRole:AROAI7IA3IKU7V6AZJGM2:AutoScaling k8s-node1
100006 150344444 8.18841E+11 86558720570 LineItem 1.1214E+26 Amazon Elastic Compute Cloud 285152638 993428303 32138069 APN2-BoxUsage:m4.large RunInstances ap-northeast-2a N $0.123 per On Demand Linux m4.large Instance Hour 2018-09-26 3:00 2018-09-26 4:00 1 0.051144633 0.051144633 0.123 0.123 i-010b4116837ec37a3 Root:086558720570 k8s-master
Time taken: 2.243 seconds, Fetched: 10 row(s)
HBase 가 Dynamo 에 저장된걸 가져오는 것 같다...
hive> describe formatted cost_tbl;
OK
# col_name data_type comment
id int
invoiceid string
payeraccountid string
linkedaccountid string
recordtype string
recordid string
productname string
rateid string
subscriptionid string
pricingplanid string
usagetype string
operation string
availabilityzone string
reservedinstance string
itemdescription string
usagestartdate string
usageenddate string
usagequantity string
blendedrate string
blendedcost string
unblendedrate string
unblendedcost string
resourceid string
createdby string
name string
service string
# Detailed Table Information
Database: default
Owner: root
CreateTime: Mon Feb 18 02:38:48 UTC 2019
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://ip-10-251-157-177.ap-northeast-2.compute.internal:8020/user/hive/warehouse/cost_tbl
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}
EXTERNAL TRUE
hbase.table.name cost_tbl
numFiles 0
numRows 0
rawDataSize 0
storage_handler org.apache.hadoop.hive.hbase.HBaseStorageHandler
totalSize 0
transient_lastDdlTime 1550457528
# Storage Information
SerDe Library: org.apache.hadoop.hive.hbase.HBaseSerDe
InputFormat: null
OutputFormat: null
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
hbase.columns.mapping :key, info:InvoiceID, info:PayerAccountId, info:LinkedAccountId, info:RecordType, info:RecordId, info:ProductName, info:RateId, info:SubscriptionId, info:PricingPlanId, info:UsageType, info:Operation, info:AvailabilityZone, info:ReservedInstance, info:ItemDescription, info:UsageStartDate, info:UsageEndDate, info:UsageQuantity, info:BlendedRate, info:BlendedCost, info:UnBlendedRate, info:UnBlendedCost, info:ResourceId, info:createdBy, info:Name, info:SERVICE
serialization.format 1
Time taken: 0.703 seconds, Fetched: 59 row(s)
hive>