A Spark Streaming + Spark SQL Case Study

This post covers the following:

  • An analysis of the Streaming + SQL implementation technique
  • A hands-on Streaming + SQL implementation

I. Spark Streaming + Spark SQL: technical analysis

We use Spark Streaming + Spark SQL to compute online the hottest items within each category of an e-commerce site, for example the three hottest phones in the phone category and the three hottest TVs in the TV category. This example is of real practical significance in production environments.
Implementation technique: Spark Streaming + Spark SQL. Spark Streaming can use ML, SQL, GraphX and other Spark functionality because interfaces such as foreachRDD and the transformations operate on RDDs underneath. With the RDD as the cornerstone, all of Spark's other functionality can be used directly, as simply as calling an API. Assume the data here has the format "user item category", for example "Rocky Samsung Android", as sketched below.
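
To make the key format concrete, here is a minimal sketch of how one log line is keyed, using the sample line from the text above:

val clickLog = "Rocky Samsung Android"       // user = Rocky, item = Samsung, category = Android
val parts = clickLog.split(" ")
val keyed = (parts(2) + "_" + parts(1), 1)   // ("Android_Samsung", 1): key = category_item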

II. Spark Streaming + Spark SQL: hands-on implementation
1. The code is as follows:

import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OnlineTheTop3ItemForEachCategory2DB {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("OnlineForeachRDD2DB").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    ssc.checkpoint("/root/Documents/SparkApps?checkpoint")

    val userClickLogsDStream = ssc.socketTextStream("Master", 9999)
    // Map each log line of the form "user item category" to the pair ("category_item", 1)
    val formattedUserClickLogDStream = userClickLogsDStream.map(clickLog =>
      (clickLog.split(" ")(2) + "_" + clickLog.split(" ")(1), 1))

    // 60-second window sliding every 20 seconds; the inverse function (_ - _)
    // lets Spark update the window incrementally instead of recomputing it
    val categoryUserClickLogsDStream = formattedUserClickLogDStream.reduceByKeyAndWindow(
      _ + _, _ - _, Seconds(60), Seconds(20))
    categoryUserClickLogsDStream.foreachRDD { rdd =>
      if (rdd.isEmpty()) {
        println("No data inputted!!!")
      } else {
        // Split the "category_item" key back into its two parts
        val categoryItemRow = rdd.map { reducedItem =>
          val category = reducedItem._1.split("_")(0)
          val item = reducedItem._1.split("_")(1)
          val click_count = reducedItem._2
          Row(category, item, click_count)
        }
        val structType = StructType(Array(
          StructField("category", StringType, true),
          StructField("item", StringType, true),
          StructField("click_count", IntegerType, true)
        ))

        // Build a HiveContext from the RDD's SparkContext
        val hiveContext = new HiveContext(rdd.context)

        val categoryItemDF = hiveContext.createDataFrame(categoryItemRow, structType)

        // Register a temp table so the ranking can be expressed in SQL (Spark 1.x API)
        categoryItemDF.registerTempTable("categoryItemTable")

        // Top 3 items per category via the row_number() window function
        val resultDataFrame = hiveContext.sql("SELECT category,item,click_count FROM (SELECT category,item,click_count,row_number() " +
          "OVER (PARTITION BY category ORDER BY click_count DESC) rank FROM categoryItemTable) subquery " +
          "WHERE rank <= 3")

        resultDataFrame.show()

        val resultRowRDD = resultDataFrame.rdd


        resultRowRDD.foreachPartition { partitionOfRecords =>
          if (partitionOfRecords.isEmpty) {
            println("This RDD is not empty, but this partition is empty")
          } else {
            // Reuse one pooled connection per partition rather than one per record
            val connection = ConnectionPool.getConnection()
            partitionOfRecords.foreach { record =>
              // click_count is numeric, so it is inserted without quotes
              val sql = "insert into categorytop3(category,item,client_count) values('" +
                record.getAs("category") + "','" + record.getAs("item") + "'," +
                record.getAs("click_count") + ")"
              val stmt = connection.createStatement()
              stmt.executeUpdate(sql)
              stmt.close()
            }
            ConnectionPool.returnConnection(connection)
          }
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }


}
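
The code references a ConnectionPool helper that is not shown in the post. Here is a minimal sketch of what it could look like; the JDBC URL, database name, user, and password are placeholder assumptions, not values from the original, and it assumes the MySQL JDBC driver is on the classpath:

import java.sql.{Connection, DriverManager}
import java.util.concurrent.ConcurrentLinkedQueue

// Minimal connection pool sketch: hand out cached connections when available,
// otherwise open a new one; returned connections go back into the queue.
object ConnectionPool {
  private val pool = new ConcurrentLinkedQueue[Connection]()
  Class.forName("com.mysql.jdbc.Driver")

  def getConnection(): Connection = {
    val cached = pool.poll()
    if (cached != null) cached
    else DriverManager.getConnection("jdbc:mysql://Master:3306/spark", "root", "password")
  }

  def returnConnection(connection: Connection): Unit = pool.offer(connection)
}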

2. Next, package the code, deploy it to the cluster, and write a shell script:

/usr/local/spark/bin/spark-submit --files /usr/local/hive/conf/hive-site.xml \
  --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.35-bin.jar \
  /root/Documents/SparkApps/SparkStreamingApps.jar
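
If the jar's manifest does not declare a main class, spark-submit also needs the entry point; a hedged variant using the object name from the code above:

/usr/local/spark/bin/spark-submit --class OnlineTheTop3ItemForEachCategory2DB \
  --files /usr/local/hive/conf/hive-site.xml \
  --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.35-bin.jar \
  /root/Documents/SparkApps/SparkStreamingApps.jar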

3. Start the cluster and the Hive metastore service:

hive --service metastore & 

4. Create the table in MySQL:

create table categorytop3 (category varchar(500),item varchar(2000),client_count int);

5. Start the job and open the nc -lk 9999 service, then feed it click logs, as shown below.
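
A hedged example, run on the host named Master in the code: start nc, then type click logs in the "user item category" format (the sample line is the one from Section I; any lines in the same format will do):

nc -lk 9999
Rocky Samsung Android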

6. Observe the results:

This post is based on my summary notes from the DT大数据梦工厂 (DT Big Data Dream Factory) Spark course. The related course videos are available at:
Baidu netdisk link: http://pan.baidu.com/s/1slvODe1 (if the link expires or you need further resources, contact QQ 460507491 or WeChat: DT1219477246).
