Big Data Learning: Scala Pattern Matching and the Type System

Preface:

Why isn't Scala as popular as Java?

1. Java appeared in 1994 and seized the early ground of the internet era; it has frameworks for every domain and occupies every corner of the IT world. Scala only appeared after 2000.

In fact, when we talk about Java we are largely talking about the JVM, and Scala also runs on the JVM.

2. Java is easy to learn, while Scala is somewhat harder; yet Scala is the first-choice language for big data, with excellent built-in support for functional programming.

Prediction: within the next ten years, no language will shake Java's position.

Pattern matching makes implementations concise; the type system copes robustly with all kinds of change. Both are everywhere in the Spark source code.

They are the dividing line of Scala proficiency!!!

Pattern matching resembles Java's switch/case, but Java only matches on values.

Scala pattern matching can match not only on values but also on types, collections, and more.

==========Matching on values============

scala> def bigData(data:String){

     | data match{

     | case "Spark" => println("Wow!!!")

     | case "Hadoop" => println("Ok")

     | case _ => println("Something others")                     //default

     | }

     | }

bigData: (data: String)Unit

scala> bigData("d")

Something others

scala> bigData("Spark")

Wow!!!

scala> def bigData(data:String){

     | data match{

     | case "Spark" => println("Wow!!!")

     | case "Hadoop" => println("Ok")

     | case _ if data == "Flink" => println("Cool")

     | case _ => println("Something others")

     | }

     | }

bigData: (data: String)Unit

scala> bigData("Spark")

Wow!!!

scala> bigData("Flink")

Cool

==========Matching on values, binding the matched value to a variable============

scala> def bigData(data:String){

     | data match{

     | case "Spark" => println("Wow!!!")

     | case "Hadoop" => println("Ok")

     | case data_ if data_ == "Flink" => println("Cool")

     | case _ => println("Something others")

     | }

     | }

bigData: (data: String)Unit

scala> bigData("Flink")

Cool

==========Matching on types============

scala> import java.io._

import java.io._

scala> def exception(e:Exception){

     | e match{

     | case e:FileNotFoundException =>println("File not found:"+e)

     | case _:Exception=>println("Hahaha")

     | }

     | }

exception: (e: Exception)Unit

scala> exception(new FileNotFoundException("OOP hahah"))

File not found:java.io.FileNotFoundException: OOP hahah

==========Matching on Arrays============

scala> def data(array:Array[String]){

     | array match{

     | case Array("Scala")=>println("Scala")

     | case Array(spark,hadoop,flink) => println(spark+":"+hadoop+":"+flink)    // the three elements are bound to three variables

     | case Array("Spark",_*)=>println("Spark...")      //Spark开头的数组

     | case _=>println("Unkown")

     | }

     | }

data: (array: Array[String])Unit

scala> data(Array("Spark"))

Spark...

scala> data(Array("Scala"))

Scala

scala> data(Array("Scala","Spark","Kafka"))

Scala:Spark:Kafka

==========case class============

A case class is roughly the Scala equivalent of a JavaBean; its members are usually read-only, and case classes are typically used as message bodies in concurrent programming.

A val field exposes only a getter to the outside.

For instantiation, a companion object is automatically generated for the case class, and it defines apply methods.

scala> case class Person(name:String)

defined class Person

scala> Person("Spark")

res35: Person = Person(Spark)       // Scala automatically generated a companion object for Person, with an apply method

// Going further: matching case classes in a class hierarchy

scala> class Person

defined class Person

scala> case class Worker(name:String,salary:Double) extends Person

defined class Worker

scala> case class Student(name:String,score:Double) extends Person

defined class Student

scala> def sayHi(person:Person){

     | person match{

     | case Student(name,score)=>println("name:"+name+";score:"+score)

     | case Worker(name,salary)=>println("name:"+name+";salary:"+salary)

     | case _=>println("Something others")

     | }

     | }

sayHi: (person: Person)Unit

scala> sayHi(Worker("Spark",30))

name:Spark;salary:30.0

scala> sayHi(Student("Tom",100))

name:Tom;score:100.0

Companions you will inevitably use with pattern matching:

Option: expresses whether a value is present or not.

Some: wraps a value that is present (None means absence).

A case class produces a new instance every time it is used.

A case object is a global value; it is itself the single instance.
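A minimal sketch tying these together; the Submit/Stop message types and the describe helper are invented here purely for illustration:

case class Submit(job: String)   // carries data, new instance each time
case object Stop                 // a singleton message with no data

def react(msg: Any): String = msg match {
  case Submit(job) => "submitting " + job
  case Stop        => "stopping"
  case _           => "unknown message"
}

// Option says whether a value is present; match on Some/None instead of checking null.
def describe(opt: Option[Int]): String = opt match {
  case Some(n) => "got " + n
  case None    => "nothing"
}

println(react(Submit("wordcount")))  // submitting wordcount
println(describe(Some(42)))          // got 42
println(describe(None))              // nothing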

Type parameters are the hardest part!!! They are the core of Scala and extremely useful; type parameters appear all over the Spark source code.

==========Generic classes and generic methods: essentially type parameters============

scala> class Person[T](val content:T)

defined class Person

scala> class Person[T](val content:T){

     | def getContent(id:T) = id+" _ "+content

     | }

defined class Person

scala> val p = new Person[String]("Spark")

p: Person[String] = Person@815e0c

scala> p.getContent(100)                  // Person was created with String, so passing an Int is an error

<console>:10: error: type mismatch;

 found   : Int(100)

 required: String

              p.getContent(100)

                           ^

scala> val p = new Person[String](2.3)      // declared as String, so a Double will of course not work

<console>:8: error: type mismatch;

 found   : Double(2.3)

 required: String

       val p = new Person[String](2.3)

                                  ^

Once a type parameter is bound to a concrete type, it must be that type everywhere it is used.

==========Upper bounds============

For example, when a company hires big data engineers, requiring that every engineer must at least know Spark is a bound.

A bare type parameter accepts any type. If we put an upper (or lower) bound on it,

then the type argument must be the upper bound itself or a subclass of the upper bound.

// the class must be XXXX or one of its subclasses, so condc can call everything XXXX defines

def fuc1(path:String,condc:Class[_ <: XXXX])  
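A minimal upper-bound sketch; the Engineer and SparkEngineer classes are invented here for illustration:

class Engineer { def code(): String = "writes code" }
class SparkEngineer extends Engineer { def tuneSpark(): String = "tunes Spark jobs" }

// T must be Engineer or a subclass, so inside the method we can call Engineer's methods.
def hire[T <: Engineer](candidate: T): String = candidate.code()

println(hire(new SparkEngineer))  // writes code
// hire("someone")                // does not compile: String is not an Engineer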

==========Lower bounds============

Not used nearly as often. The syntax is >:

The type argument must be the lower bound itself or a supertype of the lower bound.
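A minimal lower-bound sketch, reusing the invented Engineer classes from the sketch above:

class Engineer                        // as in the upper-bound sketch
class SparkEngineer extends Engineer

// U must be SparkEngineer itself or one of its supertypes (Engineer, AnyRef, Any, ...).
def addToTeam[U >: SparkEngineer](member: U, team: List[U]): List[U] = member :: team

val team: List[Engineer] = addToTeam(new Engineer, List(new SparkEngineer))
println(team.size)  // 2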

==========View Bounds!!!============

Sometimes the type you actually want to use satisfies neither the upper nor the lower bound. With a view bound, the compiler checks whether an implicit conversion can turn that type into one that does satisfy the bound; something with no relationship to the bound can then still be passed in, converted, and used as the bounded type. Implicit conversions are covered in the next lesson.

View bound syntax: <%, which allows an implicit conversion on the type.

Inside SparkContext:

private implicit def arrayToArrayWritable[T <% Writable: ClassTag](arr: Traversable[T])
    : ArrayWritable = {
    def anyToWritable[U <% Writable](u: U): Writable = u

    new ArrayWritable(classTag[T].runtimeClass.asInstanceOf[Class[Writable]],
        arr.map(x => anyToWritable(x)).toArray)
  }

T must be a subtype of Writable, or be convertible to Writable via an implicit conversion.
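A minimal view-bound sketch; the Meters type and the conversion are invented for illustration, and <% assumes a Scala version that still supports view bounds:

import scala.language.implicitConversions

class Meters(val value: Double)

// Implicit conversion: a plain Double can be viewed as Meters.
implicit def doubleToMeters(d: Double): Meters = new Meters(d)

// T does not have to be Meters; it only has to be convertible to Meters.
def describeLength[T <% Meters](t: T): String = "length: " + t.value + " m"

println(describeLength(new Meters(3.0)))  // length: 3.0 m
println(describeLength(4.5))              // length: 4.5 m, via Double => Meters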

A context bound T : X requires an implicit value of type X[T] in scope; that implicit value is injected from the surrounding context, and the injection happens automatically.

Context Bounds:

scala> class Compare[T:Ordering](val n1:T,val n2:T)

defined class Compare

scala> class Compare[T:Ordering](val n1:T,val n2:T){

     | def bigger(implicit ordered:Ordering[T]) = if(ordered.compare(n1,n2)>0)n1

 else n2}

defined class Compare

scala> new Compare[Int](8,3).bigger

res5: Int = 8

scala> new Compare[String]("Spark","Hadoop").bigger

res6: String = Spark

scala> Ordering[String]

res7: scala.math.Ordering[String] = scala.math.Ordering$String$@1e0e8e5

scala> Ordering[Int]

res8: scala.math.Ordering[Int] = scala.math.Ordering$Int$@19cdef7

==========Covariance / Contravariance============

How the parent/child inheritance relationship carries over to the generic type: + means covariant, - means contravariant.

scala> class Person[+T]

defined class Person

This is the covariant case: if type S is a subtype of type A, then Queue[S] is also considered a subtype of Queue[A], i.e. Queue[S] can be generalized to Queue[A]. The parameterized type varies in the same direction as its type parameter, hence "covariant".

This is the contravariant case: if type S is a subtype of type A, then Queue[A] is, conversely, considered a subtype of Queue[S]. The parameterized type varies in the opposite direction to its type parameter, hence "contravariant".
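A minimal sketch of both annotations; Animal, Dog, Box and Printer are invented for illustration:

class Animal
class Dog extends Animal

class Box[+T]        // covariant: Box[Dog] can be used where Box[Animal] is expected
class Printer[-T]    // contravariant: Printer[Animal] can be used where Printer[Dog] is expected

val animals: Box[Animal] = new Box[Dog]             // compiles thanks to +T
val dogPrinter: Printer[Dog] = new Printer[Animal]  // compiles thanks to -T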

Writing Dependency[_] is equivalent to Dependency[T] when the type parameter is never referred to; the syntax just looks more concise.
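A small sketch of this wildcard form, reusing the Person[T] class defined earlier; the show helper is invented for illustration:

class Person[T](val content: T)   // the generic Person from the section above

// Person[_] accepts a Person of any content type when T itself does not matter here.
def show(p: Person[_]): String = "content = " + p.content

println(show(new Person[String]("Spark")))  // content = Spark
println(show(new Person[Int](100)))         // content = 100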

==========T:ClassTag ============

T : ClassTag also denotes a type parameter, but one whose concrete type is not known at compile time; it is only determined at runtime (for instance with deferred execution, the type is not known up front).

scala> def mkArray[T : ClassTag](elems: T*) = Array[T](elems: _*)
mkArray: [T](elems: T*)(implicit evidence$1: scala.reflect.ClassTag[T])Array[T]

scala> mkArray(42, 13)
res0: Array[Int] = Array(42, 13)

scala> mkArray("Japan","Brazil","Germany")
res1: Array[String] = Array(Japan, Brazil, Germany)


Homework:

Read the Spark source code for RDD, HadoopRDD, SparkContext, Master, and Worker, and analyze every use of pattern matching and type parameters in them.

***********RDD*************

~~~1. Pattern matching~~~

  /**

   * Zips this RDD with another one, returning key-value pairs with the first element in each RDD,

   * second element in each RDD, etc. Assumes that the two RDDs have the *same number of

   * partitions* and the *same number of elements in each partition* (e.g. one was made through

   * a map on the other).

   */

  def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    zipPartitions(other, preservesPartitioning = false) { (thisIter, otherIter) =>
      new Iterator[(T, U)] {
        def hasNext: Boolean = (thisIter.hasNext, otherIter.hasNext) match {
          case (true, true) => true
          case (false, false) => false
          case _ => throw new SparkException("Can only zip RDDs with " +
            "same number of elements in each partition")
        }
        def next(): (T, U) = (thisIter.next(), otherIter.next())
      }
    }
  }

The zip function advances the two iterators in lockstep with next; if the two partitions do not have the same number of elements, it throws an exception.
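A hedged usage sketch of zip, assuming a local Spark installation; the app name and master setting here are arbitrary:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("zip-demo").setMaster("local[2]"))

val a = sc.parallelize(Seq(1, 2, 3), 3)
val b = sc.parallelize(Seq("one", "two", "three"), 3)

// Both RDDs have 3 partitions with one element each, so zip pairs them up.
println(a.zip(b).collect().mkString(", "))  // (1,one), (2,two), (3,three)

sc.stop()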

    val mergeResult = (index: Int, taskResult: Option[T]) => {
      if (taskResult.isDefined) {
        jobResult = jobResult match {
          case Some(value) => Some(f(value, taskResult.get))
          case None => taskResult
        }
      }
    }

Merging each task's result into the running job result.

/**

   * Return whether this RDD is marked for local checkpointing.

   * Exposed for testing.

   */

  private[rdd] def isLocallyCheckpointed: Boolean = {

    checkpointData match {

      case Some(_: LocalRDDCheckpointData[T]) => true

      case _ => false

    }

  }

~~~2. Type system~~~

abstract class RDD[T: ClassTag](

A generic class (the type parameter is declared at class level).

  @DeveloperApi
  def compute(split: Partition, context: TaskContext): Iterator[T]

A generic method (the type parameter appears in the method signature).

  /** An Option holding our checkpoint RDD, if we are checkpointed */

  private def checkpointRDD: Option[CheckpointRDD[T]] = checkpointData.flatMap(_.checkpointRDD)

A generic member (the type parameter appears in the member's type).

  /**

   * Save this RDD as a compressed text file, using string representations of elements.

   */

  def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit = withScope {
    // https://issues.apache.org/jira/browse/SPARK-2075
    val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
    val textClassTag = implicitly[ClassTag[Text]]
    val r = this.mapPartitions { iter =>
      val text = new Text()
      iter.map { x =>
        text.set(x.toString)
        (NullWritable.get(), text)
      }
    }
    RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
      .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path, codec)
  }

***********HadoopRDD *************

~~~1. Pattern matching~~~

  protected def getInputFormat(conf: JobConf): InputFormat[K, V] = {
    val newInputFormat = ReflectionUtils.newInstance(inputFormatClass.asInstanceOf[Class[_]], conf)
      .asInstanceOf[InputFormat[K, V]]
    newInputFormat match {
      case c: Configurable => c.setConf(conf)
      case _ =>
    }
    newInputFormat
  }

Matching on type: if the newly created InputFormat is Configurable, set its configuration.

// Sets the thread local variable for the file's name

      split.inputSplit.value match {
        case fs: FileSplit => SqlNewHadoopRDDState.setInputFileName(fs.getPath.toString)
        case _ => SqlNewHadoopRDDState.unsetInputFileName()
      }

val locs: Option[Seq[String]] = HadoopRDD.SPLIT_INFO_REFLECTIONS match {

      case Some(c) =>

        try {

          val lsplit = c.inputSplitWithLocationInfo.cast(hsplit)

          val infos = c.getLocationInfo.invoke(lsplit).asInstanceOf[Array[AnyRef]]

          Some(HadoopRDD.convertSplitLocationInfo(infos))

        } catch {

          case e: Exception =>

            logDebug("Failed to use InputSplitWithLocations.", e)

            None

        }

      case None => None

    }

~~~2. Type system~~~

@DeveloperApi
class HadoopRDD[K, V](
    sc: SparkContext,
    broadcastedConf: Broadcast[SerializableConfiguration],
    initLocalJobConfFuncOpt: Option[JobConf => Unit],
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int)
  extends RDD[(K, V)](sc, Nil) with Logging {

Upper bound (Class[_ <: InputFormat[K, V]]).

***********SparkContext *************

~~~1. Pattern matching~~~

/**

   * The number of driver cores to use for execution in local mode, 0 otherwise.

   */

  private[spark] def numDriverCores(master: String): Int = {
    def convertToInt(threads: String): Int = {
      if (threads == "*") Runtime.getRuntime.availableProcessors() else threads.toInt
    }
    master match {
      case "local" => 1
      case SparkMasterRegex.LOCAL_N_REGEX(threads) => convertToInt(threads)
      case SparkMasterRegex.LOCAL_N_FAILURES_REGEX(threads, _) => convertToInt(threads)
      case _ => 0 // driver is not used for execution
    }
  }

Matching on the master URL to decide how many driver cores to use; note the regex extractors (LOCAL_N_REGEX) used directly as patterns.
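A minimal sketch of that technique, matching a string against a regex extractor; the regex and the cores helper here are invented for illustration:

// A regex with a capture group can be used directly as a pattern.
val LocalN = """local\[([0-9]+|\*)\]""".r

def cores(master: String): Int = master match {
  case "local"         => 1
  case LocalN("*")     => Runtime.getRuntime.availableProcessors()
  case LocalN(threads) => threads.toInt
  case _               => 0
}

println(cores("local[4]"))  // 4
println(cores("local[*]"))  // number of available processors
println(cores("yarn"))      // 0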

    val schemeCorrectedPath = uri.getScheme match {
      case null | "local" => new File(path).getCanonicalFile.toURI.toString
      case _ => path
    }

~~~2. Type system~~~

        val constructors = {
          val listenerClass = Utils.classForName(className)
          listenerClass.getConstructors.asInstanceOf[Array[Constructor[_ <: SparkListener]]]
        }

Upper bound.

  private[spark] def clean[F <: AnyRef](f: F, checkSerializable: Boolean = true): F = {
    ClosureCleaner.clean(f, checkSerializable)
    f
  }

Upper bound.

private implicit def arrayToArrayWritable[T <% Writable: ClassTag](arr: Traversable[T])

    : ArrayWritable = {

    def anyToWritable[U <% Writable](u: U): Writable = u

    new ArrayWritable(classTag[T].runtimeClass.asInstanceOf[Class[Writable]],

        arr.map(x => anyToWritable(x)).toArray)

  }

Implicit conversion; I do not fully understand it yet, and will come back to this after lesson 5.

***********Master*************

~~~1. Pattern matching~~~

  private def removeDriver(
      driverId: String,
      finalState: DriverState,
      exception: Option[Exception]) {
    drivers.find(d => d.id == driverId) match {
      case Some(driver) =>
        logInfo(s"Removing driver: $driverId")
        drivers -= driver
        if (completedDrivers.size >= RETAINED_DRIVERS) {
          val toRemove = math.max(RETAINED_DRIVERS / 10, 1)
          completedDrivers.trimStart(toRemove)
        }
        completedDrivers += driver
        persistenceEngine.removeDriver(driver)
        driver.state = finalState
        driver.exception = exception
        driver.worker.foreach(w => w.removeDriver(driver))
        schedule()
      case None =>
        logWarning(s"Asked to remove unknown driver: $driverId")
    }
  }

    case ExecutorStateChanged(appId, execId, state, message, exitStatus) => {
      val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
      execOption match {
        case Some(exec) => {
          val appInfo = idToApp(appId)
          val oldState = exec.state
          exec.state = state

    case DriverStateChanged(driverId, state, exception) => {
      state match {
        case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
          removeDriver(driverId, state, exception)
        case _ =>
          throw new Exception(s"Received unexpected state update for driver $driverId: $state")
      }
    }


    case Heartbeat(workerId, worker) => {
      idToWorker.get(workerId) match {
        case Some(workerInfo) =>
          workerInfo.lastHeartbeat = System.currentTimeMillis()
        case None =>
          if (workers.map(_.id).contains(workerId)) {
            logWarning(s"Got heartbeat from unregistered worker $workerId." +
              " Asking it to re-register.")
            worker.send(ReconnectWorker(masterUrl))
          } else {
            logWarning(s"Got heartbeat from unregistered worker $workerId." +
              " This worker was never registered, so ignoring the heartbeat.")
          }
      }
    }

***********Worker*************

~~~1. Pattern matching~~~

        master match {
          case Some(masterRef) =>
            // registered == false && master != None means we lost the connection to master, so
            // masterRef cannot be used and we need to recreate it again. Note: we must not set
            // master to None due to the above comments.
            if (registerMasterFutures != null) {
              registerMasterFutures.foreach(_.cancel(true))
            }
            val masterAddress = masterRef.address
            registerMasterFutures = Array(registerMasterThreadPool.submit(new Runnable {
              override def run(): Unit = {

  private def registerWithMaster() {
    // onDisconnected may be triggered multiple times, so don't attempt registration
    // if there are outstanding registration attempts scheduled.
    registrationRetryTimer match {
      case None =>
        registered = false
        registerMasterFutures = tryRegisterAllMasters()
        connectionAttemptCount = 0
        registrationRetryTimer = Some(forwordMessageScheduler.scheduleAtFixedRate(
          new Runnable {
            override def run(): Unit = Utils.tryLogNonFatalError {
              Option(self).foreach(_.send(ReregisterWithMaster))
            }
          },
          INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
          INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
          TimeUnit.SECONDS))
      case Some(_) =>
        logInfo("Not spawning another attempt to register with the master, since there is an" +
          " attempt scheduled already.")
    }
  }

