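This post shows how to bulk load data into HBase using importtsv.
The basics are covered at http://hbase.apache.org/0.94/book/ops_mgt.html#importtsv .
However, following that page as written did not work for me.
First, prepare the CSV data and put it into HDFS.
Before that, a directory has to be created in HDFS. Subdirectories are not created automatically: a plain -mkdir fails if the parent directory does not exist.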
Filesystem Size Used Available Use%
hdfs://ip-10-251-156-185.ap-northeast-2.compute.internal:8020 69.5 G 15.8 M 69.2 G 0%
So create the parent directory first, then the subdirectory.
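A minimal sketch of the directory step, assuming the `hdfs` CLI is on the PATH of a cluster node and that `/data/cost` (the path that shows up in the file listing in this post) is the target. The `-p` flag is not something the steps above used; it is a standard option that creates missing parent directories in one call, so it may save the two-step creation.

```shell
# Target directory; /data/cost matches the path used in this post.
HDFS_DIR=/data/cost

if command -v hdfs >/dev/null 2>&1; then
  # Without -p, /data must already exist before /data/cost can be created.
  # With -p, any missing parent directories are created in the same call.
  hdfs dfs -mkdir -p "$HDFS_DIR"
  # -d lists the directory entry itself rather than its contents.
  hdfs dfs -ls -d "$HDFS_DIR"
else
  echo "hdfs CLI not found; run this on a cluster node" >&2
fi
```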
Now put the data into HDFS.
-rw-r--r-- 1 hadoop hadoop 1029 2019-02-08 04:07 /data/cost/aws_billing_2017_08.csv
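The upload that produces a listing like the one above can be sketched as follows. The local file location is an assumption (the CSV is taken to be in the current working directory); the HDFS path and file name come from the listing.

```shell
# Assumption: the CSV sits in the current local directory.
LOCAL_CSV=aws_billing_2017_08.csv
HDFS_DIR=/data/cost

if command -v hdfs >/dev/null 2>&1 && [ -f "$LOCAL_CSV" ]; then
  # Copy the local file into HDFS, then list the directory to confirm.
  hdfs dfs -put "$LOCAL_CSV" "$HDFS_DIR/"
  hdfs dfs -ls "$HDFS_DIR"
else
  echo "skipping upload: hdfs CLI or $LOCAL_CSV not available" >&2
fi
```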
Create the table in HBase. I used a single column family.
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
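A sketch of the table creation, using the table name `cost_tbl` and the column family `info` that appear in the scan output later in this post. It pipes the statement into the HBase shell with `-n` (non-interactive mode) instead of typing it at the prompt; that is a convenience chosen here, not necessarily how the table was originally created.

```shell
TABLE=cost_tbl   # table name as it appears in the scan output
CF=info          # the single column family

# HBase shell statement that creates the table with one column family.
CREATE_CMD="create '$TABLE', '$CF'"
echo "$CREATE_CMD"

if command -v hbase >/dev/null 2>&1; then
  # -n runs the HBase shell non-interactively, reading commands from stdin.
  echo "$CREATE_CMD" | hbase shell -n
fi
```

Typing the printed statement at the interactive `hbase(main)` prompt has the same effect.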
※ I am using AWS EMR, which provides a hadoop account when the cluster is created.
I ran this as root instead. This issue still needs to be raised with AWS and resolved.
Run the bulk load with importtsv. I timed it to see how long it takes.
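The load step can be sketched as below. The column order in `COLUMNS` is an assumption: the four qualifiers are the ones visible in the scan output, but their order has to match the actual field order of the CSV, with `HBASE_ROW_KEY` marking the field used as the row key. The input path and table name are the ones used in this post.

```shell
TABLE=cost_tbl
INPUT=/data/cost/aws_billing_2017_08.csv
# Assumption: field order of the CSV. Adjust to match the real file.
COLUMNS=HBASE_ROW_KEY,info:invoiceID,info:PayerAccountId,info:RecordType,info:RecordID

if command -v hbase >/dev/null 2>&1; then
  # importtsv.separator switches the delimiter from the default tab to a comma.
  # "time" reports how long the MapReduce job takes end to end.
  time hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    -Dimporttsv.separator=',' \
    -Dimporttsv.columns="$COLUMNS" \
    "$TABLE" "$INPUT"
else
  echo "hbase CLI not found; run this on a cluster node" >&2
fi
```

Run this way, ImportTsv writes to the table through regular puts. Adding `-Dimporttsv.bulk.output=<hdfs dir>` makes it generate HFiles instead, which are then loaded in a separate step with the completebulkload tool.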
The file is about 1 KB, and inserting its 25 rows takes about 30 seconds.
A scan in HBase shows that all 25 rows were loaded correctly.
hbase(main):001:0> scan 'cost_tbl'
ROW COLUMN+CELL
1 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
1 column=info:RecordID, timestamp=1549599279681, value=PayerLineItem
1 column=info:RecordType, timestamp=1549599279681, value=
1 column=info:invoiceID, timestamp=1549599279681, value=109868405
10 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
10 column=info:RecordID, timestamp=1549599279681, value=PayerLineItem
10 column=info:RecordType, timestamp=1549599279681, value=
10 column=info:invoiceID, timestamp=1549599279681, value=109868405
11 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
11 column=info:RecordID, timestamp=1549599279681, value=PayerLineItem
11 column=info:RecordType, timestamp=1549599279681, value=
11 column=info:invoiceID, timestamp=1549599279681, value=109868405
12 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
12 column=info:RecordID, timestamp=1549599279681, value=LinkedLineItem
12 column=info:RecordType, timestamp=1549599279681, value=8.19E+11
12 column=info:invoiceID, timestamp=1549599279681, value=109868405
13 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
13 column=info:RecordID, timestamp=1549599279681, value=LinkedLineItem
13 column=info:RecordType, timestamp=1549599279681, value=8.19E+11
13 column=info:invoiceID, timestamp=1549599279681, value=109868405
14 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
14 column=info:RecordID, timestamp=1549599279681, value=LinkedLineItem
14 column=info:RecordType, timestamp=1549599279681, value=8.19E+11
14 column=info:invoiceID, timestamp=1549599279681, value=109868405
15 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
15 column=info:RecordID, timestamp=1549599279681, value=LinkedLineItem
15 column=info:RecordType, timestamp=1549599279681, value=8.19E+11
15 column=info:invoiceID, timestamp=1549599279681, value=109868405
16 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
16 column=info:RecordID, timestamp=1549599279681, value=LinkedLineItem
16 column=info:RecordType, timestamp=1549599279681, value=8.19E+11
16 column=info:invoiceID, timestamp=1549599279681, value=109868405
17 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
17 column=info:RecordID, timestamp=1549599279681, value=LinkedLineItem
17 column=info:RecordType, timestamp=1549599279681, value=8.19E+11
17 column=info:invoiceID, timestamp=1549599279681, value=109868405
18 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
18 column=info:RecordID, timestamp=1549599279681, value=LinkedLineItem
18 column=info:RecordType, timestamp=1549599279681, value=8.19E+11
18 column=info:invoiceID, timestamp=1549599279681, value=109868405
19 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
19 column=info:RecordID, timestamp=1549599279681, value=LinkedLineItem
19 column=info:RecordType, timestamp=1549599279681, value=8.19E+11
19 column=info:invoiceID, timestamp=1549599279681, value=109868405
2 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
2 column=info:RecordID, timestamp=1549599279681, value=PayerLineItem
2 column=info:RecordType, timestamp=1549599279681, value=
2 column=info:invoiceID, timestamp=1549599279681, value=109868405
20 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
20 column=info:RecordID, timestamp=1549599279681, value=LinkedLineItem
20 column=info:RecordType, timestamp=1549599279681, value=8.19E+11
20 column=info:invoiceID, timestamp=1549599279681, value=109868405
21 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
21 column=info:RecordID, timestamp=1549599279681, value=LinkedLineItem
21 column=info:RecordType, timestamp=1549599279681, value=8.19E+11
21 column=info:invoiceID, timestamp=1549599279681, value=109868405
22 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
22 column=info:RecordID, timestamp=1549599279681, value=LinkedLineItem
22 column=info:RecordType, timestamp=1549599279681, value=8.19E+11
22 column=info:invoiceID, timestamp=1549599279681, value=109868405
23 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
23 column=info:RecordID, timestamp=1549599279681, value=InvoiceTotal
23 column=info:RecordType, timestamp=1549599279681, value=
23 column=info:invoiceID, timestamp=1549599279681, value=109868405
24 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
24 column=info:RecordID, timestamp=1549599279681, value=AccountTotal
24 column=info:RecordType, timestamp=1549599279681, value=8.19E+11
24 column=info:invoiceID, timestamp=1549599279681, value=
25 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
25 column=info:RecordID, timestamp=1549599279681, value=StatementTotal
25 column=info:RecordType, timestamp=1549599279681, value=
25 column=info:invoiceID, timestamp=1549599279681, value=
3 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
3 column=info:RecordID, timestamp=1549599279681, value=PayerLineItem
3 column=info:RecordType, timestamp=1549599279681, value=
3 column=info:invoiceID, timestamp=1549599279681, value=109868405
4 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
4 column=info:RecordID, timestamp=1549599279681, value=PayerLineItem
4 column=info:RecordType, timestamp=1549599279681, value=
4 column=info:invoiceID, timestamp=1549599279681, value=109868405
5 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
5 column=info:RecordID, timestamp=1549599279681, value=PayerLineItem
5 column=info:RecordType, timestamp=1549599279681, value=
5 column=info:invoiceID, timestamp=1549599279681, value=109868405
6 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
6 column=info:RecordID, timestamp=1549599279681, value=PayerLineItem
6 column=info:RecordType, timestamp=1549599279681, value=
6 column=info:invoiceID, timestamp=1549599279681, value=109868405
7 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
7 column=info:RecordID, timestamp=1549599279681, value=PayerLineItem
7 column=info:RecordType, timestamp=1549599279681, value=
7 column=info:invoiceID, timestamp=1549599279681, value=109868405
8 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
8 column=info:RecordID, timestamp=1549599279681, value=PayerLineItem
8 column=info:RecordType, timestamp=1549599279681, value=
8 column=info:invoiceID, timestamp=1549599279681, value=109868405
9 column=info:PayerAccountId, timestamp=1549599279681, value=8.19E+11
9 column=info:RecordID, timestamp=1549599279681, value=PayerLineItem
9 column=info:RecordType, timestamp=1549599279681, value=
9 column=info:invoiceID, timestamp=1549599279681, value=109868405
25 row(s) in 0.4720 seconds
Loading a 300 GB CSV file took 3 minutes.