[HBase] HBase Optimization: Disabling the WAL and Using HFiles

1. WAL: write-ahead log

The WAL exists for disaster recovery: if a server crashes, the data that was still in memory (not yet flushed to disk) can be recovered by replaying the log. If writing to the WAL fails, the whole write operation is considered failed.
Consequently, with the WAL enabled, write performance drops;
with the WAL disabled, any data still in memory that has not been flushed to disk is lost if the server suddenly crashes.

Workaround: disable the WAL and manually flush the MemStore data to disk instead.

// Disable the WAL for this Put
put.setDurability(Durability.SKIP_WAL) 

After the write operations, call flushTable in place of the WAL mechanism:

  def flushTable(table: String, conf: Configuration): Unit = {
    var connection: Connection = null
    var admin: Admin = null

    try {
      connection = ConnectionFactory.createConnection(conf)
      admin = connection.getAdmin
      // Flush the table's MemStore to HFiles on disk
      admin.flush(TableName.valueOf(table))
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      if (null != admin) {
        admin.close()
      }
      if (null != connection) {
        connection.close()
      }
    }
  }
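
A minimal usage sketch combining the two pieces above. The table name, column family, and row content are hypothetical examples, not part of the original code:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Durability, Put}
    import org.apache.hadoop.hbase.util.Bytes

    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    // "demo_table" and column family "o" are hypothetical
    val htable = connection.getTable(TableName.valueOf("demo_table"))

    val put = new Put(Bytes.toBytes("rowkey-001"))
    put.addColumn(Bytes.toBytes("o"), Bytes.toBytes("ip"), Bytes.toBytes("1.2.3.4"))
    put.setDurability(Durability.SKIP_WAL) // skip the WAL for this write
    htable.put(put)
    htable.close()

    // Manually flush the MemStore afterwards, since the WAL no longer provides durability
    flushTable("demo_table", conf)
    connection.close()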

2. Using HFiles (bulk load)

https://hbase.apache.org/2.1/book.html#arch.bulk.load
Use Spark to write the DataFrame/RDD data directly as HFiles, then load those files into HBase.

Notes:
<1> Among all loading approaches, HFile bulk loading is the fastest, but it comes with a precondition: the data is being imported for the first time and the table is empty. If the table already contains data, loading HFiles into it can trigger split operations.

<2> The output key and value types must be <ImmutableBytesWritable, KeyValue> or <ImmutableBytesWritable, Put>.

// 1. The source DataFrame (logDF)
// mapPartitions is used here to process one partition at a time
val hbaseInfoRDD = logDF.rdd.mapPartitions(partition => {

      partition.flatMap(x=>{
        val ip = x.getAs[String]("ip")
        val country = x.getAs[String]("country")
        val province = x.getAs[String]("province")
        val city = x.getAs[String]("city")
        val formattime = x.getAs[String]("formattime")
        val method = x.getAs[String]("method")
        val url = x.getAs[String]("url")
        val protocal = x.getAs[String]("protocal")
        val status = x.getAs[String]("status")
        val bytessent = x.getAs[String]("bytessent")
        val referer = x.getAs[String]("referer")
        val browsername = x.getAs[String]("browsername")
        val browserversion = x.getAs[String]("browserversion")
        val osname = x.getAs[String]("osname")
        val osversion = x.getAs[String]("osversion")
        val ua = x.getAs[String]("ua")

        val columns = scala.collection.mutable.HashMap[String,String]()
        columns.put("ip",ip)
        columns.put("country",country)
        columns.put("province",province)
        columns.put("city",city)
        columns.put("formattime",formattime)
        columns.put("method",method)
        columns.put("url",url)
        columns.put("protocal",protocal)
        columns.put("status",status)
        columns.put("bytessent",bytessent)
        columns.put("referer",referer)
        columns.put("browsername",browsername)
        columns.put("browserversion",browserversion)
        columns.put("osname",osname)
        columns.put("osversion",osversion)

        val rowkey = getRowKey(day, referer+url+ip+ua)  // build the HBase rowkey
        val rk = Bytes.toBytes(rowkey)
       
        val list = new ListBuffer[((String,String),KeyValue)]()
        // one KeyValue per column of the column family for this rowkey
        for((k,v) <- columns) {
          val keyValue = new KeyValue(rk, "o".getBytes, Bytes.toBytes(k),Bytes.toBytes(v))
          list += (rowkey,k) -> keyValue
        }

        list.toList
      })
    }).sortByKey()
      .map(x => (new ImmutableBytesWritable(Bytes.toBytes(x._1._1)), x._2))

    val conf = new Configuration()
    conf.set("hbase.rootdir","hdfs://hadoop000:8020/hbase")
    conf.set("hbase.zookeeper.quorum","hadoop000:2181")

    val tableName = createTable(day, conf)

    // set the table the data will be written to
    conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)

    // NewAPIHadoopJob is presumably org.apache.hadoop.mapreduce.Job imported under an alias
    val job = NewAPIHadoopJob.getInstance(conf)
    val table = new HTable(conf, tableName)
    HFileOutputFormat2.configureIncrementalLoad(job, table.getTableDescriptor, table.getRegionLocator)

    val output = "hdfs://hadoop000:8020/etl/access3/hbase"
    val outputPath = new Path(output)
    hbaseInfoRDD.saveAsNewAPIHadoopFile(
      output,
      classOf[ImmutableBytesWritable],
      classOf[KeyValue],
      classOf[HFileOutputFormat2],
      job.getConfiguration
    )

    if(FileSystem.get(conf).exists(outputPath)) {
      val load = new LoadIncrementalHFiles(job.getConfiguration)
      load.doBulkLoad(outputPath, table)

      FileSystem.get(conf).delete(outputPath, true)
    }

    logInfo(s"Job completed successfully... $day")
    spark.stop()
  


[Analysis]:
The whole process consists of two steps:
<1> Prepare the data with a MapReduce job
Using HFileOutputFormat2:
HFileOutputFormat2.configureIncrementalLoad(
job,
table.getTableDescriptor,
table.getRegionLocator)

This is one of the overloads of configureIncrementalLoad.
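
A hedged sketch of the same configuration step done through the Connection API rather than the deprecated HTable constructor; the exact overload availability depends on the HBase client version, so treat this as an assumption:

    import org.apache.hadoop.hbase.TableName
    import org.apache.hadoop.hbase.client.ConnectionFactory
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2

    // assumes `conf`, `job` and `tableName` are defined as in the listing above
    val conn = ConnectionFactory.createConnection(conf)
    val hTable = conn.getTable(TableName.valueOf(tableName))
    val regionLocator = conn.getRegionLocator(TableName.valueOf(tableName))
    // overload taking the Table and RegionLocator obtained from the Connection
    HFileOutputFormat2.configureIncrementalLoad(job, hTable, regionLocator)
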
<2> Load the data
(1) First save the prepared data to the specified output path with saveAsNewAPIHadoopFile:

    val output = "hdfs://hadoop000:8020/etl/access3/hbase"
    val outputPath = new Path(output)
    hbaseInfoRDD.saveAsNewAPIHadoopFile(
      output,
      classOf[ImmutableBytesWritable],
      classOf[KeyValue],
      classOf[HFileOutputFormat2],
      job.getConfiguration
    )

(2) Then load the files under the output path into the corresponding HBase table:

    if(FileSystem.get(conf).exists(outputPath)) {
      val load = new LoadIncrementalHFiles(job.getConfiguration)
      load.doBulkLoad(outputPath, table)

      FileSystem.get(conf).delete(outputPath, true)
    }

3. Writing a DataFrame to HBase from Spark (Hortonworks SHC)

Straight to the code:

  /**
    * Save a DataFrame to HBase
    *
    * @param df         the DataFrame to write
    * @param namespace  HBase namespace
    * @param table      table name
    * @param rowkey     name of the rowkey column
    * @param column     name of the data column
    */
  private def saveToHbase(df: DataFrame, namespace:String, table:String, rowkey:String, column:String): Unit = {

    def catalog = s"""{
                     |"table":{"namespace":"$namespace", "name":"$table"},
                     |"rowkey":"$rowkey",
                     |"columns":{
                     |  "$rowkey":{"cf":"rowkey", "col":"$rowkey", "type":"string"},
                     |  "$column":{"cf":"t",      "col":"$column", "type":"string"}
                     |}
                     |}""".stripMargin

    println(s"""start to persist in HBASE: table:$table,column:$column....""")
    df.write
      .mode(SaveMode.Overwrite)
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()
    println(s"""persist in HBASE success...""")
  }
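
A hedged usage sketch of the helper above; the SparkSession, DataFrame, namespace, and column names are made-up examples, and the SHC connector (shc-core) must be on the classpath:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("shc-demo").getOrCreate()
    import spark.implicits._

    // hypothetical two-column DataFrame: one rowkey column and one data column
    val userDF = Seq(("user-001", "Beijing"), ("user-002", "Shanghai")).toDF("uid", "city")

    // writes into namespace "default", table "user_city",
    // with "uid" as the rowkey and "city" stored in column family "t"
    saveToHbase(userDF, namespace = "default", table = "user_city", rowkey = "uid", column = "city")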

Reading data back from HBase:
The withCatalog method:

def withCatalog(cat: String): DataFrame = {
  sqlContext
    .read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
}

Querying the data:

val df = withCatalog(catalog)
val s = df.filter((($"col0" <= "row050" && $"col0" > "row040") ||
  $"col0" === "row005" ||
  $"col0" === "row020" ||
  $"col0" ===  "r20" ||
  $"col0" <= "row005") &&
  ($"col4" === 1 ||
  $"col4" === 42))
  .select("col0", "col1", "col4")
s.show

Note: null fields in the DataFrame must be converted to empty strings (or a similar placeholder) before inserting; otherwise the write will fail.
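
A minimal sketch of one way to do that conversion with Spark's built-in DataFrameNaFunctions before the write; the column names below are hypothetical:

    // replace nulls in all string columns with an empty string
    val cleaned = df.na.fill("")

    // or only for specific columns (column names here are assumptions)
    val cleanedCols = df.na.fill("", Seq("city", "referer"))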
