Preface:
Why isn't Scala as popular as Java?
1. Java appeared in the mid-1990s and took the early lead on the web; it has frameworks for every domain and occupies every corner of the IT world. Scala appeared after 2000.
In fact, when we talk about Java we are largely talking about the JVM, and Scala also runs on the JVM.
2. Java is easy to learn, while Scala has a steeper learning curve. Scala is the first-choice language for big data, with outstanding built-in support for functional programming.
Prediction: within the next ten years, no language will shake Java's position.
Pattern matching gives concise implementations, and the type system copes robustly with change. Both are everywhere in the Spark source code.
They are the watershed of Scala proficiency!!!
Pattern matching is similar to Java's switch/case, but Java's switch only matches on values.
Scala pattern matching can match not only values but also types, collections, and more.
==========Matching on values============
scala> def bigData(data:String){
| data match{
| case "Spark" => println("Wow!!!")
| case "Hadoop" => println("Ok")
| case _ => println("Something others") //default
| }
| }
bigData: (data: String)Unit
scala> bigData("d")
Something others
scala> bigData("Spark")
Wow!!!
scala> def bigData(data:String){
| data match{
| case "Spark" => println("Wow!!!")
| case "Hadoop" => println("Ok")
| case _ if data == "Flink" => println("Cool")
| case _ => println("Something others")
| }
| }
bigData: (data: String)Unit
scala> bigData("Spark")
Wow!!!
scala> bigData("Flink")
Cool
==========Matching on values, binding the match to a variable name============
scala> def bigData(data:String){
| data match{
| case "Spark" => println("Wow!!!")
| case "Hadoop" => println("Ok")
| case data_ if data_ == "Flink" => println("Cool")
| case _ => println("Something others")
| }
| }
bigData: (data: String)Unit
scala> bigData("Flink")
Cool
==========Matching on types============
scala> import java.io._
import java.io._
scala> def exception(e:Exception){
| e match{
| case e:FileNotFoundException =>println("File not found:"+e)
| case _:Exception=>println("Hahaha")
| }
| }
exception: (e: Exception)Unit
scala> exception(new FileNotFoundException("OOP hahah"))
File not found:java.io.FileNotFoundException: OOP hahah
==========Matching on Arrays============
scala> def data(array:Array[String]){
| array match{
| case Array("Scala")=>println("Scala")
| case Array(spark,hadoop,flink) => println(spark+":"+hadoop+":"+flink) // bind the three elements to names
| case Array("Spark",_*)=>println("Spark...") // any array whose first element is "Spark"
| case _=>println("Unknown")
| }
| }
data: (array: Array[String])Unit
scala> data(Array("Spark"))
Spark...
scala> data(Array("Scala"))
Scala
scala> data(Array("Scala","Spark","Kafka"))
Scala:Spark:Kafka
==========case class============
A case class is like a JavaBean: its members are usually read-only, and case classes are typically used as message bodies for communication in concurrent programming.
A val field exposes only a getter to the outside by default.
For instantiation, a companion object is generated automatically for every case class, and it defines apply methods for you.
scala> case class Person(name:String)
defined class Person
scala> Person("Spark")
res35: Person = Person(Spark) // Scala automatically generated a companion object for Person with an apply method
// extending the match to a class hierarchy
scala> class Person
defined class Person
scala> case class Worker(name:String,salary:Double) extends Person
defined class Worker
scala> case class Student(name:String,score:Double) extends Person
defined class Student
scala> def sayHi(person:Person){
| person match{
| case Student(name,score)=>println("name:"+name+";score:"+score)
| case Worker(name,salary)=>println("name:"+name+";salary:"+salary)
| case _=>println("Something others")
| }
| }
sayHi: (person: Person)Unit
scala> sayHi(Worker("Spark",30))
name:Spark;salary:30.0
scala> sayHi(Student("Tom",100))
name:Tom;score:100.0
Things you will have to use with pattern matching:
Option: used to express whether a variable has a value.
Some: wraps a value that is present (its counterpart is None).
A case class produces a new instance each time it is used.
A case object is a global value and is itself the single instance.
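A minimal sketch of these ideas, with made-up message types for illustration (not from the course or Spark):
sealed trait Message
case class Submit(job: String) extends Message   // a new instance per message
case object Shutdown extends Message             // a singleton; the value is the instance

object OptionAndCaseObjectDemo {
  def describe(msg: Message): String = msg match {
    case Submit(job) => "submit: " + job
    case Shutdown    => "shutting down"
  }

  def main(args: Array[String]): Unit = {
    // Option is matched with Some/None instead of checking for null
    val maybeName: Option[String] = Some("Spark")
    maybeName match {
      case Some(name) => println("Got " + name)
      case None       => println("Nothing there")
    }
    println(describe(Submit("wordcount")))
    println(describe(Shutdown))
  }
}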
Type parameters: the hardest part!!! They are the core of Scala and extremely useful; type parameters appear everywhere in the Spark source code.
==========Generic classes and generic methods: really just type parameters============
scala> class Person[T](val content:T)
defined class Person
scala> class Person[T](val content:T){
| def getContent(id:T) = id+" _ "+content
| }
defined class Person
scala> val p = new Person[String]("Spark")
p: Person[String] = Person@815e0c
scala> p.getContent(100) // Person was created with String, so passing an Int is an error
<console>:10: error: type mismatch;
found : Int(100)
required: String
p.getContent(100)
^
scala> val p = new Person[String](2.3) // declared as String, so a Double of course won't work
<console>:8: error: type mismatch;
found : Double(2.3)
required: String
val p = new Person[String](2.3)
^
Once a generic is instantiated with a concrete type, it is fixed to that type.
==========Upper bounds============
For example, when a company hires big-data engineers it may require that every candidate at least knows Spark; that requirement is a bound.
A plain type parameter can stand for any type; an upper bound (or lower bound) restricts it.
With an upper bound, every type used must be the upper bound itself or a subclass of the upper bound.
// i.e. the class must be XXXX or one of its subclasses, so condc can call any method of XXXX
def fuc1(path:String,condc:Class[_ <: XXXX])
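A runnable sketch of an upper bound, using made-up classes rather than the placeholder above:
class Engineer { def code(): String = "writes code" }
class SparkEngineer extends Engineer { override def code(): String = "writes Spark jobs" }

// T must be Engineer or a subclass of it, so member can call any Engineer method.
class Team[T <: Engineer](val member: T) {
  def work(): String = member.code()
}

object UpperBoundDemo extends App {
  println(new Team(new SparkEngineer).work())  // OK: SparkEngineer <: Engineer
  // new Team("someone")                       // does not compile: String is not an Engineer
}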
==========Lower bounds============
Not used very often; the syntax is >:
The type used must be the lower bound itself or a superclass of the lower bound.
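A minimal sketch of a lower bound, again with made-up classes:
class Animal
class Dog extends Animal

object LowerBoundDemo extends App {
  // U must be Dog itself or a superclass of Dog (Animal, AnyRef, Any).
  def addToShelter[U >: Dog](animals: List[U], extra: U): List[U] = extra :: animals

  val dogs: List[Dog] = List(new Dog)
  val mixed: List[Animal] = addToShelter[Animal](dogs, new Animal)  // result widened to Animal
  println(mixed.size)  // 2
}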
==========View Bounds!!!============
Sometimes you want to use a class that satisfies neither the upper bound nor the lower bound. A view bound applies an implicit conversion first: if the converted type falls within the required bound, then something with no direct relationship to the bound can still be passed in, and everything runs correctly afterwards. Implicit conversions are covered in the next lesson.
View bound syntax: <%, which allows an implicit conversion on the type.
Inside SparkContext:
private implicit def arrayToArrayWritable[T <% Writable: ClassTag](arr: Traversable[T])
  : ArrayWritable = {
  def anyToWritable[U <% Writable](u: U): Writable = u
  new ArrayWritable(classTag[T].runtimeClass.asInstanceOf[Class[Writable]],
    arr.map(x => anyToWritable(x)).toArray)
}
T must be a subtype of Writable or convertible to Writable.
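A smaller view-bound sketch outside of Spark, using the classic Pair example (note that <% is deprecated in newer Scala versions in favor of an explicit implicit parameter):
// Int is not a subtype of Ordered[Int], but the standard implicit conversion
// (Int => RichInt, which is an Ordered[Int]) satisfies the view bound T <% Ordered[T].
class Pair[T <% Ordered[T]](val first: T, val second: T) {
  def smaller: T = if (first < second) first else second
}

object ViewBoundDemo extends App {
  println(new Pair(3, 8).smaller)               // 3
  println(new Pair("Spark", "Hadoop").smaller)  // Hadoop
}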
Context Bounds: writing T : SomeType requires an implicit value of type SomeType[T]; that implicit value is injected from the surrounding context, and the injection happens automatically.
scala> class Compare[T:Ordering](val n1:T,val n2:T)
defined class Compare
scala> class Compare[T:Ordering](val n1:T,val n2:T){
| def bigger(implicit ordered:Ordering[T]) = if(ordered.compare(n1,n2)>0) n1 else n2}
defined class Compare
scala> new Compare[Int](8,3).bigger
res5: Int = 8
scala> new Compare[String]("Spark","Hadoop").bigger
res6: String = Spark
scala> Ordering[String]
res7: scala.math.Ordering[String] = scala.math.Ordering$String$@1e0e8e5
scala> Ordering[Int]
res8: scala.math.Ordering[Int] = scala.math.Ordering$Int$@19cdef7
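The context bound above is shorthand for an extra implicit parameter; a rough, hand-written equivalent (using implicitly) looks like this:
// Roughly what the compiler generates for class Compare[T: Ordering]:
class CompareDesugared[T](val n1: T, val n2: T)(implicit val ord: Ordering[T]) {
  def bigger: T = if (ord.compare(n1, n2) > 0) n1 else n2
  // implicitly[Ordering[T]] summons the same implicit value from the context
  def biggerViaImplicitly: T = if (implicitly[Ordering[T]].compare(n1, n2) > 0) n1 else n2
}

object ContextBoundDemo extends App {
  println(new CompareDesugared(8, 3).bigger)                            // 8
  println(new CompareDesugared("Spark", "Hadoop").biggerViaImplicitly)  // Spark
}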
==========Covariance / Contravariance============
They describe how subtyping of a type parameter carries over to the parameterized class: + marks covariance, - marks contravariance.
scala> class Person[+T]
defined class Person
This is the covariant case: when type S is a subtype of type A, Queue[S] is also considered a subtype of Queue[A], i.e. Queue[S] can be generalized to Queue[A]. The parameterized type varies in the same direction as its type parameter, hence "covariant".
This is the contravariant case: when type S is a subtype of type A, Queue[A] is conversely considered a subtype of Queue[S]. The parameterized type varies in the opposite direction of its type parameter, hence "contravariant".
Writing Dependency[_] is essentially equivalent to Dependency[T] with an anonymous type parameter (an existential/wildcard type); the syntax just looks more concise.
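A minimal variance sketch with made-up classes:
class Fruit
class Apple extends Fruit

class CovariantBox[+T]            // CovariantBox[Apple] is a subtype of CovariantBox[Fruit]
class ContravariantPrinter[-T] {  // ContravariantPrinter[Fruit] is a subtype of ContravariantPrinter[Apple]
  def printIt(value: T): Unit = println(value)
}

object VarianceDemo extends App {
  val box: CovariantBox[Fruit] = new CovariantBox[Apple]                      // allowed by +T
  val printer: ContravariantPrinter[Apple] = new ContravariantPrinter[Fruit]  // allowed by -T
  printer.printIt(new Apple)
}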
==========T:ClassTag============
T : ClassTag is still a type parameter, but the concrete type is not known at compile time; it is only determined at runtime (useful for deferred execution, where the type is not known up front).
scala> def mkArray[T : ClassTag](elems: T*) = Array[T](elems: _*)
mkArray: [T](elems: T*)(implicit evidence$1: scala.reflect.ClassTag[T])Array[T]

scala> mkArray(42, 13)
res0: Array[Int] = Array(42, 13)

scala> mkArray("Japan","Brazil","Germany")
res1: Array[String] = Array(Japan, Brazil, Germany)
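To see why the ClassTag is needed, a small sketch in plain Scala (nothing Spark-specific):
import scala.reflect.ClassTag

object ClassTagDemo extends App {
  // Without a ClassTag this would not compile: T is erased at runtime,
  // so the JVM cannot create an Array[T] from the type parameter alone.
  // def buildBroken[T](n: Int) = new Array[T](n)

  // With the context bound, the runtime class of T is passed along implicitly.
  def build[T: ClassTag](n: Int): Array[T] = new Array[T](n)

  println(build[Int](3).mkString(","))  // 0,0,0
  println(build[String](2).length)      // 2
}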
Homework:
Read the source code of RDD, HadoopRDD, SparkContext, Master, and Worker in Spark, and analyze all the pattern matching and type parameters used there.
***********RDD*************
~~~1. Pattern matching~~~
/**
* Zips this RDD with another one, returning key-value pairs with the first element in each RDD,
* second element in each RDD, etc. Assumes that the two RDDs have the *same number of
* partitions* and the *same number of elements in each partition* (e.g. one was made through
* a map on the other).
*/
def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
  zipPartitions(other, preservesPartitioning = false) { (thisIter, otherIter) =>
    new Iterator[(T, U)] {
      def hasNext: Boolean = (thisIter.hasNext, otherIter.hasNext) match {
        case (true, true) => true
        case (false, false) => false
        case _ => throw new SparkException("Can only zip RDDs with " +
          "same number of elements in each partition")
      }
      def next(): (T, U) = (thisIter.next(), otherIter.next())
    }
  }
}
The zip function calls next on the two iterators being zipped in step; if the partitions do not contain the same number of elements it throws an exception.
val mergeResult = (index: Int, taskResult: Option[T]) => {
  if (taskResult.isDefined) {
    jobResult = jobResult match {
      case Some(value) => Some(f(value, taskResult.get))
      case None => taskResult
    }
  }
}
Merging each task's result into the overall job result.
/**
* Return whether this RDD is marked for local checkpointing.
* Exposed for testing.
*/
private[rdd] def isLocallyCheckpointed: Boolean = {
checkpointData match {
case Some(_: LocalRDDCheckpointData[T]) => true
case _ => false
}
}
~~~2. Type system~~~
abstract class RDD[T: ClassTag](
Type parameter on the class (a generic class).
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]
Type parameter in a method signature (a generic method).
/** An Option holding our checkpoint RDD, if we are checkpointed */
private def checkpointRDD: Option[CheckpointRDD[T]] = checkpointData.flatMap(_.checkpointRDD)
Type parameter in a member's type (a generic parameter).
/**
* Save this RDD as a compressed text file, using string representations of elements.
*/
def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit = withScope {
  // https://issues.apache.org/jira/browse/SPARK-2075
  val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
  val textClassTag = implicitly[ClassTag[Text]]
  val r = this.mapPartitions { iter =>
    val text = new Text()
    iter.map { x =>
      text.set(x.toString)
      (NullWritable.get(), text)
    }
  }
  RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path, codec)
}
***********HadoopRDD *************
~~~1. Pattern matching~~~
protected def getInputFormat(conf: JobConf): InputFormat[K, V] = {
  val newInputFormat = ReflectionUtils.newInstance(inputFormatClass.asInstanceOf[Class[_]], conf)
    .asInstanceOf[InputFormat[K, V]]
  newInputFormat match {
    case c: Configurable => c.setConf(conf)
    case _ =>
  }
  newInputFormat
}
Match on the runtime type and, if it is Configurable, set the configuration.
// Sets the thread local variable for the file's name
split.inputSplit.value match {
  case fs: FileSplit => SqlNewHadoopRDDState.setInputFileName(fs.getPath.toString)
  case _ => SqlNewHadoopRDDState.unsetInputFileName()
}
val locs: Option[Seq[String]] = HadoopRDD.SPLIT_INFO_REFLECTIONS match {
case Some(c) =>
try {
val lsplit = c.inputSplitWithLocationInfo.cast(hsplit)
val infos = c.getLocationInfo.invoke(lsplit).asInstanceOf[Array[AnyRef]]
Some(HadoopRDD.convertSplitLocationInfo(infos))
} catch {
case e: Exception =>
logDebug("Failed to use InputSplitWithLocations.", e)
None
}
case None => None
}
~~~2. Type system~~~
@DeveloperApi
class HadoopRDD[K, V](
    sc: SparkContext,
    broadcastedConf: Broadcast[SerializableConfiguration],
    initLocalJobConfFuncOpt: Option[JobConf => Unit],
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int)
  extends RDD[(K, V)](sc, Nil) with Logging {
Upper bound (inputFormatClass must be InputFormat[K, V] or a subclass).
***********SparkContext *************
~~~1. Pattern matching~~~
/**
* The number of driver cores to use for execution in local mode, 0 otherwise.
*/
private[spark] def numDriverCores(master: String): Int = {
  def convertToInt(threads: String): Int = {
    if (threads == "*") Runtime.getRuntime.availableProcessors() else threads.toInt
  }
  master match {
    case "local" => 1
    case SparkMasterRegex.LOCAL_N_REGEX(threads) => convertToInt(threads)
    case SparkMasterRegex.LOCAL_N_FAILURES_REGEX(threads, _) => convertToInt(threads)
    case _ => 0 // driver is not used for execution
  }
}
Matching the master string to decide how many driver cores to use.
val schemeCorrectedPath = uri.getScheme match {
  case null | "local" => new File(path).getCanonicalFile.toURI.toString
  case _ => path
}
~~~2. Type system~~~
val constructors = {
  val listenerClass = Utils.classForName(className)
  listenerClass.getConstructors.asInstanceOf[Array[Constructor[_ <: SparkListener]]]
}
Upper bound.
private[spark] def clean[F <: AnyRef](f: F, checkSerializable: Boolean = true): F = {
  ClosureCleaner.clean(f, checkSerializable)
  f
}
Upper bound.
private implicit def arrayToArrayWritable[T <% Writable: ClassTag](arr: Traversable[T])
: ArrayWritable = {
def anyToWritable[U <% Writable](u: U): Writable = u
new ArrayWritable(classTag[T].runtimeClass.asInstanceOf[Class[Writable]],
arr.map(x => anyToWritable(x)).toArray)
}
Implicit conversion (a view bound); I don't fully understand it yet and will come back to it after Lesson 5.
***********Master*************
~~~1. Pattern matching~~~
private def removeDriver(
    driverId: String,
    finalState: DriverState,
    exception: Option[Exception]) {
  drivers.find(d => d.id == driverId) match {
    case Some(driver) =>
      logInfo(s"Removing driver: $driverId")
      drivers -= driver
      if (completedDrivers.size >= RETAINED_DRIVERS) {
        val toRemove = math.max(RETAINED_DRIVERS / 10, 1)
        completedDrivers.trimStart(toRemove)
      }
      completedDrivers += driver
      persistenceEngine.removeDriver(driver)
      driver.state = finalState
      driver.exception = exception
      driver.worker.foreach(w => w.removeDriver(driver))
      schedule()
    case None =>
      logWarning(s"Asked to remove unknown driver: $driverId")
  }
}
case ExecutorStateChanged(appId, execId, state, message, exitStatus) => {
  val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
  execOption match {
    case Some(exec) => {
      val appInfo = idToApp(appId)
      val oldState = exec.state
      exec.state = state
      // ... (excerpt truncated)

case DriverStateChanged(driverId, state, exception) => {
  state match {
    case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
      removeDriver(driverId, state, exception)
    case _ =>
      throw new Exception(s"Received unexpected state update for driver $driverId: $state")
  }
}

case Heartbeat(workerId, worker) => {
  idToWorker.get(workerId) match {
    case Some(workerInfo) =>
      workerInfo.lastHeartbeat = System.currentTimeMillis()
    case None =>
      if (workers.map(_.id).contains(workerId)) {
        logWarning(s"Got heartbeat from unregistered worker $workerId." +
          " Asking it to re-register.")
        worker.send(ReconnectWorker(masterUrl))
      } else {
        logWarning(s"Got heartbeat from unregistered worker $workerId." +
          " This worker was never registered, so ignoring the heartbeat.")
      }
  }
}
***********Worker*************
~~~1. Pattern matching~~~
master match {
  case Some(masterRef) =>
    // registered == false && master != None means we lost the connection to master, so
    // masterRef cannot be used and we need to recreate it again. Note: we must not set
    // master to None due to the above comments.
    if (registerMasterFutures != null) {
      registerMasterFutures.foreach(_.cancel(true))
    }
    val masterAddress = masterRef.address
    registerMasterFutures = Array(registerMasterThreadPool.submit(new Runnable {
      override def run(): Unit = {
        // ... (excerpt truncated)

private def registerWithMaster() {
  // onDisconnected may be triggered multiple times, so don't attempt registration
  // if there are outstanding registration attempts scheduled.
  registrationRetryTimer match {
    case None =>
      registered = false
      registerMasterFutures = tryRegisterAllMasters()
      connectionAttemptCount = 0
      registrationRetryTimer = Some(forwordMessageScheduler.scheduleAtFixedRate(
        new Runnable {
          override def run(): Unit = Utils.tryLogNonFatalError {
            Option(self).foreach(_.send(ReregisterWithMaster))
          }
        },
        INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
        INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
        TimeUnit.SECONDS))
    case Some(_) =>
      logInfo("Not spawning another attempt to register with the master, since there is an" +
        " attempt scheduled already.")
  }
}
This post is from the "一枝花傲寒" blog. Please do not repost.