A Preliminary Summary of Spark Streaming

For details, see the Spark Programming Guide and these articles:
https://aiyanbo.gitbooks.io/spark-programming-guide-zh-cn/content/spark-streaming/index.html
https://www.ibm.com/developerworks/cn/opensource/os-cn-spark-streaming/index.html
https://my.oschina.net/nenusoul/blog/849149
Spark Streaming is a real-time computation framework built on top of Spark; it extends Spark with the ability to process large-scale streaming data.

First, Spark Streaming slices the live input stream into blocks on a fixed time interval Δt (e.g. 1 second). Each block is treated as an RDD and processed with ordinary RDD operations; every block spawns one Spark job, so the results are likewise produced block by block. In Spark Streaming you program against the interface of the DStream (a sequence of RDDs representing the stream), which closely mirrors the RDD API.
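
A minimal sketch of this block-per-interval model, assuming a local socket source on a placeholder host and port: foreachRDD exposes the RDD behind each 1-second block, and each block triggers its own Spark job.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DStreamBlocks").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))   // Δt = 1 second
val lines = ssc.socketTextStream("localhost", 9999)
// Each interval, the DStream yields one RDD; count the records in that block.
lines.foreachRDD { (rdd, time) =>
  println(s"Block at $time: ${rdd.count()} records")
}
ssc.start()
ssc.awaitTermination()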

As Spark Streaming's original goal suggests, its rich API and in-memory execution engine let users combine stream processing with batch processing and interactive queries, so it suits applications that need to analyze historical and real-time data together. It is also perfectly adequate when latency requirements are not especially strict, and the RDD lineage (data-reuse) mechanism provides efficient fault tolerance.

Spark Streaming is similar to Apache Storm in that both process streaming data. According to its official documentation, Spark Streaming offers high throughput and strong fault tolerance. It supports many input sources, e.g. Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Incoming data can be transformed with Spark's high-level primitives such as map, reduce, join, and window, and the results can be stored in many places, such as HDFS or a database. Spark Streaming also integrates cleanly with MLlib (machine learning) and GraphX.
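
As a sketch of the window primitive mentioned above (the socket source and the HDFS output prefix are placeholder assumptions): count words over a sliding 30-second window recomputed every 10 seconds, then persist each result. Window and slide durations must be multiples of the batch interval.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowedWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
// Sum (word, 1) pairs over the last 30 seconds, sliding every 10 seconds.
val windowedCounts = words.map((_, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
// Write one directory of output per window; the path prefix is a placeholder.
windowedCounts.saveAsTextFiles("hdfs:///tmp/wordcounts")
ssc.start()
ssc.awaitTermination()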

In Spark Streaming the unit of processing is a batch rather than a single record, yet data arrives record by record, so the system needs an interval over which records accumulate before being processed together. That interval is the batch interval, the core concept and key parameter of Spark Streaming: it determines how often jobs are submitted and how much latency processing incurs, and it also affects throughput and performance.
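
The interval is fixed once, when the StreamingContext is created; a minimal sketch of the trade-off (app name and master are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("IntervalDemo").setMaster("local[2]")
// Seconds(5): a job is submitted every 5 seconds, so end-to-end latency is
// at least 5 s; a smaller interval lowers latency but leaves each batch
// less data over which to amortize scheduling overhead.
val ssc = new StreamingContext(conf, Seconds(5))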

Official documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming is built on top of Spark, so once Spark is installed you can use Spark Streaming; Storm, by contrast, has to be installed separately. For Spark installation, see my blog post: https://blog.csdn.net/CowBoySoBusy/article/details/83017574
[Figure 1]
[Figure 2]

Word-count example
Spark source code: https://github.com/apache/spark
Scala source for the streaming examples (including the word count below):
https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming


// scalastyle:off println
package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
 *
 * Usage: NetworkWordCount <hostname> <port>
 * <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive data.
 *
 * To run this on your local machine, you need to first run a Netcat server
 *    `$ nc -lk 9999`
 * and then run the example
 *    `$ bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999`
 */
object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    // Create the context with a 1 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
// scalastyle:on println

The core logic is just three lines:

// read lines of text from the TCP source given by <hostname> and <port>
val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
// split each line into words
val words = lines.flatMap(_.split(" "))
// pair each word with a count of 1, then sum the counts per word
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

How to run it:
1. Using spark-submit (mostly for production)
Run ./spark-submit by itself to see the detailed usage, as shown below:
[Figure 3: spark-submit usage]

To run this on your local machine, first start a Netcat server:

nc -lk 9999

and then run the example:

bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999

If nc is not installed, install it with:

sudo apt-get -y install netcat-traditional

Verify the installation:

nc -h

[Figure 4: nc -h output]

Then, with spark-submit, run the following command (adjust the paths and parameters to your own setup):

 ./spark-submit --master local[4] \
 --class org.apache.spark.examples.streaming.NetworkWordCount \
 --name NetworkWordCount \
 /home/zq/spark-2.3.2-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.2.jar localhost 9999
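
With nc running, type a few words into its terminal; every second the Spark terminal should print per-batch counts resembling the following (the timestamp is illustrative):

-------------------------------------------
Time: 1540000000000 ms
-------------------------------------------
(hello,2)
(spark,1)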

2. Using spark-shell (mostly for testing)

./spark-shell --master local[4]

import org.apache.spark.streaming.{Seconds, StreamingContext}
// micro-batch processing, one batch per second
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
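
Note that in spark-shell the StreamingContext is built from the shell's existing SparkContext sc rather than from a new SparkConf. Also, awaitTermination() blocks the shell; when experimenting you can omit it and later call ssc.stop(stopSparkContext = false) to stop the stream while keeping sc usable.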

3. Using an external file (PySpark)
test.py:
from __future__ import print_function
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    # Create a local SparkContext with 4 worker threads
    sc = SparkContext("local[4]", "streamwordcount")

    # Create a StreamingContext with a 2-second batch interval
    ssc = StreamingContext(sc, 2)

    lines = ssc.socketTextStream("localhost", 9999)

    # Split the text received within each 2-second batch into words
    words = lines.flatMap(lambda line: line.split(" "))

    pairs = words.map(lambda word: (word, 1))
    wordCounts = pairs.reduceByKey(lambda x, y: x + y)

    wordCounts.pprint()

    # Start the Spark Streaming application
    ssc.start()
    ssc.awaitTermination()
Submit it with:

./spark-submit /home/zq/Desktop/test.py
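
As with the Scala example, start nc -lk 9999 in another terminal before submitting, then type words into it; pprint() prints the counts for each 2-second batch.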
