This post is split into two parts: the first covers the HBase Java API, and the second covers integrating HBase with MapReduce.
Part 1: The HBase Java API
HBase is written in Java, so it can naturally be driven from Java code through its Java API: querying an HBase table, writing a single record into a table, and so on. First we need the basic HBase Maven dependencies. I work in the IntelliJ IDEA IDE, and the dependencies to add are as follows:
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.sunwangdong.hadoop.test</groupId>
    <artifactId>jkxy</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.7.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-common</artifactId>
            <version>2.7.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.7.1</version>
        </dependency>
        <!-- the HBase artifacts: hbase, hbase-client and hbase-server -->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase</artifactId>
            <version>1.2.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>1.2.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>1.2.6</version>
        </dependency>
    </dependencies>
</project>
With these dependencies in place, let's go through the basic operations one by one. First, creating an HBase table:
private static void createTable(HBaseAdmin hBaseAdmin) throws IOException
{
if(!hBaseAdmin.tableExists(TABLE_NAME)) //check whether a table named "hello" already exists
{
HTableDescriptor hTableDescriptor = new HTableDescriptor(TABLE_NAME); //table name
HColumnDescriptor hColumnDescriptor = new HColumnDescriptor(FAMILY_NAME); //column family
hTableDescriptor.addFamily(hColumnDescriptor); //add the column family to the table
hBaseAdmin.createTable(hTableDescriptor); //create the table
}
}
Two descriptor classes do the work here. HTableDescriptor describes a table; its constructor takes the table name. HColumnDescriptor describes a column family, and HTableDescriptor.addFamily(HColumnDescriptor) attaches that family to the table. Finally, HBaseAdmin is the administrative client for HBase table metadata; it provides methods to create tables, delete tables, list tables and so on, and HBaseAdmin.createTable(HTableDescriptor) actually creates the table.
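Since listing tables was just mentioned, here is a minimal sketch of that operation, assuming the same HBaseAdmin instance that the other helpers receive (the helper name listTables and its placement are my own addition, not part of the original code):
//a minimal sketch: list every table HBase knows about and print its name
private static void listTables(HBaseAdmin hBaseAdmin) throws IOException
{
    HTableDescriptor[] tables = hBaseAdmin.listTables(); //one descriptor per existing table
    for(HTableDescriptor table : tables)
    {
        System.out.println(table.getNameAsString()); //table name as a String
    }
}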
Next, dropping a table:
private static void dropTable(HBaseAdmin hBaseAdmin) throws IOException
{
if(hBaseAdmin.tableExists(TABLE_NAME)) //check whether the table exists
{
hBaseAdmin.disableTable(TABLE_NAME); //the table must be disabled first
hBaseAdmin.deleteTable(TABLE_NAME); //and only then deleted
}
}
First check whether the table exists in HBase; if it does, call deleteTable to remove it. Before the delete can succeed, the table has to be disabled (taken offline) with disableTable, and only then deleted.
Next, reading data. There are two ways, scan and get, corresponding to HBase's scan and get operations.
private static void scanTable(HTable hTable) throws IOException
{
System.out.println("Scan results:");
Scan scan = new Scan();
ResultScanner results = hTable.getScanner(scan);
for(Result result : results)
{
byte[] value = result.getValue(FAMILY_NAME.getBytes(),COLUMN_NAME.getBytes());
System.out.println(new String(value));
}
}
Because scan walks the whole table, it can return many rows, which is why the result comes back as a ResultScanner, obtained from HTable.getScanner(Scan). Note that HTable is the class used to talk to a single HBase table, but it is not thread-safe: if several threads share one HTable instance, its write buffer can be corrupted. Also note that every value returned is a byte array, so it has to be converted to a String before printing.
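As an aside, here is a hedged sketch of the connection-based client API that HBase 1.x recommends for multi-threaded use; it is not how the code below is written. The long-lived, thread-safe object is the Connection, and each thread asks it for its own lightweight Table (same conf and constants as in the surrounding code):
//imports assumed: org.apache.hadoop.hbase.TableName and org.apache.hadoop.hbase.client.Connection, ConnectionFactory, Table
Connection connection = ConnectionFactory.createConnection(conf); //thread-safe, share one per application
Table table = connection.getTable(TableName.valueOf(TABLE_NAME)); //cheap, create one per thread
try
{
    Result result = table.get(new Get(ROW_KEY.getBytes()));
    System.out.println(new String(result.getValue(FAMILY_NAME.getBytes(), COLUMN_NAME.getBytes())));
}
finally
{
    table.close(); //close the Table when the thread is done with it
    connection.close(); //close the shared Connection once, on shutdown
}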
Next is the get method. It is similar to scan, except that get fetches only the data under a single rowkey rather than the whole table.
private static void getRecord(HTable hTable) throws IOException
{
Get get = new Get(ROW_KEY.getBytes());
Result result = hTable.get(get);
byte[] value = result.getValue(FAMILY_NAME.getBytes(),COLUMN_NAME.getBytes());
System.out.println("查询结果为:" + new String(value));
}
Here you specify a column of a particular rowkey, and if multiple versions exist you may also need to give a timestamp, because a cell in HBase is addressed by rowkey, column family, column qualifier and timestamp, and carries a value. The query returns a Result; the value inside it is again a byte array, so it is converted to a String for printing.
Finally, put, which inserts or updates a single record in HBase.
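As a hedged variation on the getRecord helper above (the timestamp value is purely illustrative), the Get can be narrowed to one column and one explicit timestamp:
Get get = new Get(ROW_KEY.getBytes());
get.addColumn(FAMILY_NAME.getBytes(), COLUMN_NAME.getBytes()); //only this family:qualifier
get.setTimeStamp(1500781227480L); //only the cell version written at this timestamp (illustrative value); declares IOException
Result result = hTable.get(get);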
private static void putRecord(HTable hTable) throws IOException
{
Put put = new Put(ROW_KEY.getBytes());
put.add(FAMILY_NAME.getBytes(),COLUMN_NAME.getBytes(),"25".getBytes());
hTable.put(put);
System.out.println("insert a record!");
}
The Put class performs writes against a single row. As with reads, a record is located by its rowkey, column family and column qualifier, carries a value, and can optionally be given an explicit timestamp as well.
The complete code:
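Here is a hedged sketch of the same put with an explicit timestamp; the timestamp and the value "26" are illustrative, and addColumn is the non-deprecated replacement for add in HBase 1.x:
Put put = new Put(ROW_KEY.getBytes());
put.addColumn(FAMILY_NAME.getBytes(), COLUMN_NAME.getBytes(), 1500781227480L, "26".getBytes()); //family, qualifier, timestamp, value
hTable.put(put);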
package com.hbase;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.*;
import java.io.IOException;
/**
* Created by sunwangdong on 2017/7/20.
*/
public class HbaseTest
{
//public HBaseAdmin admin = null;
public static final String TABLE_NAME="hello"; //table name
public static final String FAMILY_NAME="info"; //column family
public static final String COLUMN_NAME="age"; //column qualifier
public static final String ROW_KEY="xiaoming"; //row key
public static void main(String[] args) throws IOException
{
Configuration conf = HBaseConfiguration.create(); //create a Configuration
conf.set("hbase.rootdir","hdfs://localhost:9000/hbase");
//conf.set("hbase.zookeeper.quorum","");
HBaseAdmin hBaseAdmin = new HBaseAdmin(conf);
//createTable(hBaseAdmin); //create the table
HTable hTable = new HTable(conf,TABLE_NAME.getBytes());
//putRecord(hTable); //insert a record
putRecord2(hTable,"88");
getRecord(hTable); //read the record back
scanTable(hTable); //scan the whole table
//dropTable(hBaseAdmin); //drop the table
}
private static void scanTable(HTable hTable) throws IOException
{
System.out.println("Scan results:");
Scan scan = new Scan();
ResultScanner results = hTable.getScanner(scan);
for(Result result : results)
{
byte[] value = result.getValue(FAMILY_NAME.getBytes(),COLUMN_NAME.getBytes());
System.out.println(new String(value));
}
}
private static void getRecord(HTable hTable) throws IOException
{
Get get = new Get(ROW_KEY.getBytes());
Result result = hTable.get(get);
byte[] value = result.getValue(FAMILY_NAME.getBytes(),COLUMN_NAME.getBytes());
System.out.println("查询结果为:" + new String(value));
}
private static void putRecord(HTable hTable) throws IOException
{
Put put = new Put(ROW_KEY.getBytes());
put.add(FAMILY_NAME.getBytes(),COLUMN_NAME.getBytes(),"25".getBytes());
hTable.put(put);
System.out.println("insert a record!");
}
private static void putRecord2(HTable hTable,String value) throws IOException
{
Put put = new Put(ROW_KEY.getBytes());
put.add(FAMILY_NAME.getBytes(),COLUMN_NAME.getBytes(),value.getBytes());
hTable.put(put);
System.out.println("insert or update a record!");
}
private static void createTable(HBaseAdmin hBaseAdmin) throws IOException
{
if(!hBaseAdmin.tableExists(TABLE_NAME)) //check whether a table named "hello" already exists
{
HTableDescriptor hTableDescriptor = new HTableDescriptor(TABLE_NAME); //table name
HColumnDescriptor hColumnDescriptor = new HColumnDescriptor(FAMILY_NAME); //column family
hTableDescriptor.addFamily(hColumnDescriptor); //add the column family to the table
hBaseAdmin.createTable(hTableDescriptor); //create the table
}
}
private static void dropTable(HBaseAdmin hBaseAdmin) throws IOException
{
if(hBaseAdmin.tableExists(TABLE_NAME)) //check whether the table exists
{
hBaseAdmin.disableTable(TABLE_NAME); //the table must be disabled first
hBaseAdmin.deleteTable(TABLE_NAME); //and only then deleted
}
}
}
Note that main creates a Configuration and sets hbase.rootdir on it; this path has to match the one configured in hbase-site.xml, otherwise the program will not run.
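As a hedged aside, the more common way to configure an HBase client is to point it at the ZooKeeper quorum (the commented-out line in main hints at this) rather than at hbase.rootdir; the host and port below are illustrative and should come from your own hbase-site.xml:
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "localhost"); //comma-separated ZooKeeper hosts (illustrative)
conf.set("hbase.zookeeper.property.clientPort", "2181"); //ZooKeeper client port (default)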
Part 2: Integrating HBase with MapReduce
Integrating HBase with MapReduce here means importing data from HDFS into an HBase table. There are two main ways to do this, and one of them is a MapReduce job that splits each line of the HDFS file into columns and writes them into HBase. Consider the following example: the data table below is stored in HDFS at the path /t1/t1
1 zhangsan 10 male NULL
2 lisi NULL NULL NULL
3 wangwu NULL NULL NULL
4 zhaoliu NULL NULL 1993
Within each line, the columns are separated by "\t". We implement the import with a custom MapReduce job.
First, the map function:
public static class HdfsToHBaseMapper extends Mapper<LongWritable,Text,Text,Text>
{
private Text outKey = new Text();
private Text outValue = new Text();
public void map(LongWritable key,Text value,Context context) throws IOException, InterruptedException
{
String[] splits = value.toString().split("\t");
outKey.set(splits[0]);
outValue.set(splits[1] + "\t" + splits[2] + "\t" + splits[3] + "\t" + splits[4]);
context.write(outKey,outValue);
}
}
Then the reduce function, which aggregates the values and writes them into HBase:
public static class HdfsToHBaseReducer extends TableReducer<Text,Text,NullWritable>
{
public void reduce(Text k2, Iterable<Text> v2s,Context context) throws IOException, InterruptedException
{
Put put = new Put(k2.toString().getBytes()); //use only the key's valid bytes (Text.getBytes() may carry stale trailing bytes)
for(Text v2 : v2s)
{
String[] splis = v2.toString().split("\t");
if(splis[0] != null && !"NULL".equals(splis[0]))
{
put.addColumn("f1".getBytes(),"name".getBytes(),splis[0].getBytes());
}
if(splis[1] != null && !"NULL".equals(splis[1]))
{
put.addColumn("f1".getBytes(),"age".getBytes(),splis[1].getBytes());
}
if(splis[2] != null && !"NULL".equals(splis[2]))
{
put.addColumn("f1".getBytes(),"gender".getBytes(),splis[2].getBytes());
}
if(splis[3] != null && !"NULL".equals(splis[3]))
{
put.addColumn("f1".getBytes(),"birthday".getBytes(),splis[3].getBytes());
}
}
context.write(NullWritable.get(),put);
}
}
The complete code:
package com.hbase;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapred.TableOutputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import java.io.IOException;
/**
* Created by sunwangdong on 2017/7/23.
*/
public class HdfsToHBase
{
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException
{
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.rootdir","hdfs://localhost:9000/hbase"); //约束hbase.root的路径,与hadoop的配置文件一致
conf.set(TableOutputFormat.OUTPUT_TABLE ,args[1]);
Job job = Job.getInstance(conf,HdfsToHBase.class.getSimpleName()); //create the job, named after this class
TableMapReduceUtil.addDependencyJars(job);
job.setJarByClass(HdfsToHBase.class);
job.setMapperClass(HdfsToHBaseMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setReducerClass(HdfsToHBaseReducer.class);
FileInputFormat.addInputPath(job,new Path(args[0]));
job.setOutputFormatClass(org.apache.hadoop.hbase.mapreduce.TableOutputFormat.class); //fully qualified, because the mapred TableOutputFormat is the one imported above
Boolean b = job.waitForCompletion(true);
if(!b)
{
System.err.println("failed");
}
else
System.out.println("finished!");
}
public static class HdfsToHBaseMapper extends Mapper<LongWritable,Text,Text,Text>
{
private Text outKey = new Text();
private Text outValue = new Text();
public void map(LongWritable key,Text value,Context context) throws IOException, InterruptedException
{
String[] splits = value.toString().split("\t");
outKey.set(splits[0]);
outValue.set(splits[1] + "\t" + splits[2] + "\t" + splits[3] + "\t" + splits[4]);
context.write(outKey,outValue);
}
}
public static class HdfsToHBaseReducer extends TableReducer<Text,Text,NullWritable>
{
public void reduce(Text k2, Iterable<Text> v2s,Context context) throws IOException, InterruptedException
{
Put put = new Put(k2.toString().getBytes()); //use only the key's valid bytes (Text.getBytes() may carry stale trailing bytes)
for(Text v2 : v2s)
{
String[] splis = v2.toString().split("\t");
if(splis[0] != null && !"NULL".equals(splis[0]))
{
put.addColumn("f1".getBytes(),"name".getBytes(),splis[0].getBytes());
}
if(splis[1] != null && !"NULL".equals(splis[1]))
{
put.addColumn("f1".getBytes(),"age".getBytes(),splis[1].getBytes());
}
if(splis[2] != null && !"NULL".equals(splis[2]))
{
put.addColumn("f1".getBytes(),"gender".getBytes(),splis[2].getBytes());
}
if(splis[3] != null && !"NULL".equals(splis[3]))
{
put.addColumn("f1".getBytes(),"birthday".getBytes(),splis[3].getBytes());
}
}
context.write(NullWritable.get(),put);
}
}
}
Next, package the code into a jar. Note that a job that talks to HBase has to be launched with hadoop jar. Since I build in IDEA, I generate the jar directly from IDEA; before running it, the META-INF/LICENSE entry inside the jar has to be removed:
zip -d ****.jar META-INF/LICENSE
The code above never calls create to build the target table, so the table has to be created in the HBase shell beforehand:
hbase(main):006:0> create 'table1','f1'
0 row(s) in 1.4270 seconds
=> Hbase::Table - table1
With the table in place, the jar can be run with the hadoop command:
hadoop jar ./***.jar com.hbase.HdfsToHBase /t1/t1 table1
You can then watch the MapReduce job run on Hadoop:
localhost:jkxy_jar12 sunwangdong$ hadoop jar ./jkxy.jar com.hbase.HdfsToHBase /t1/t1 table1
17/07/23 11:40:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/23 11:40:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/07/23 11:40:07 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/07/23 11:40:08 INFO input.FileInputFormat: Total input paths to process : 1
17/07/23 11:40:08 INFO mapreduce.JobSubmitter: number of splits:1
17/07/23 11:40:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1500776717808_0003
17/07/23 11:40:08 INFO impl.YarnClientImpl: Submitted application application_1500776717808_0003
17/07/23 11:40:08 INFO mapreduce.Job: The url to track the job: http://sunwangdongMacBook-Pro.local:8088/proxy/application_1500776717808_0003/
17/07/23 11:40:08 INFO mapreduce.Job: Running job: job_1500776717808_0003
17/07/23 11:40:16 INFO mapreduce.Job: Job job_1500776717808_0003 running in uber mode : false
17/07/23 11:40:16 INFO mapreduce.Job: map 0% reduce 0%
17/07/23 11:40:22 INFO mapreduce.Job: map 100% reduce 0%
17/07/23 11:40:29 INFO mapreduce.Job: map 100% reduce 100%
17/07/23 11:40:29 INFO mapreduce.Job: Job job_1500776717808_0003 completed successfully
17/07/23 11:40:29 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=109
FILE: Number of bytes written=250471
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=187
HDFS: Number of bytes written=0
HDFS: Number of read operations=2
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3069
Total time spent by all reduces in occupied slots (ms)=3801
Total time spent by all map tasks (ms)=3069
Total time spent by all reduce tasks (ms)=3801
Total vcore-seconds taken by all map tasks=3069
Total vcore-seconds taken by all reduce tasks=3801
Total megabyte-seconds taken by all map tasks=3142656
Total megabyte-seconds taken by all reduce tasks=3892224
Map-Reduce Framework
Map input records=4
Map output records=4
Map output bytes=95
Map output materialized bytes=109
Input split bytes=92
Combine input records=0
Combine output records=0
Reduce input groups=4
Reduce shuffle bytes=109
Reduce input records=4
Reduce output records=4
Spilled Records=8
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=130
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=347602944
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=95
File Output Format Counters
Bytes Written=0
finished!
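Finally, scanning the table in the HBase shell confirms that the rows were imported: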
hbase(main):010:0> scan 'table1'
ROW COLUMN+CELL
1 column=f1:age, timestamp=1500781227480, value=10
1 column=f1:gender, timestamp=1500781227480, value=male
1 column=f1:name, timestamp=1500781227480, value=zhangsan
2 column=f1:name, timestamp=1500781227480, value=lisi
3 column=f1:name, timestamp=1500781227480, value=wangwu
4 column=f1:birthday, timestamp=1500781227480, value=1993
4 column=f1:name, timestamp=1500781227480, value=zhaoliu
4 row(s) in 0.0750 seconds