Four ways to create a DataFrame in Spark SQL

There are four ways to create a DataFrame in Spark SQL:

Method 1:
Define a case class and use it as the element type of the RDD. After importing spark.implicits._, you can call toDF directly on the RDD to get a DataFrame. Code:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameDemo1 {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName(this.getClass.getSimpleName).master("local[*]").getOrCreate()
    val rdd: RDD[String] = spark.sparkContext.parallelize(List("xiaohong 19 90", "xiaozhang 20 90", "huahua 21 99"))
    val boyRdd: RDD[Boy] = rdd.map(line => {
      val fields: Array[String] = line.split(" ")
      Boy(fields(0), fields(1).toInt, fields(2).toInt)
    })
    // bring the implicit conversions (including toDF) into scope
    import spark.implicits._
    val dataFrame: DataFrame = boyRdd.toDF()
    // register a temporary view so the DataFrame can also be queried with SQL
    dataFrame.createTempView("v_boy")
    dataFrame.show()
    spark.stop()
  }
}
// defined at the top level (not inside main) so that toDF can resolve the implicit encoder for Boy
case class Boy(name: String, age: Int, score: Int)
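
The demo registers a temporary view v_boy but never queries it. As a purely illustrative follow-up (to be placed inside main, before spark.stop()), the view could be queried with Spark SQL like this:

    // illustrative query against the temp view registered above
    val adults: DataFrame = spark.sql("SELECT name, score FROM v_boy WHERE age >= 20")
    adults.show()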

Method 2:
Use SparkSession's createDataFrame method. It is overloaded; the overload used here takes two parameters: rowRDD, an RDD whose element type is Row, and schema, a StructType describing the columns. Code:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

/*
Two ways to get a DataFrame:
1. associate a case class with the RDD (see DataFrameDemo1)
2. RDD[Row] + an explicit schema (this demo)
 */
object DataFrameDemo2 {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName(this.getClass.getSimpleName).master("local[*]").getOrCreate()
    val rdd: RDD[String] = spark.sparkContext.parallelize(List("xiaohong 19 90", "xiaozhang 20 90", "huahua 21 99"))
    val rowRdd: RDD[Row] = rdd.map(line => {
      val fields: Array[String] = line.split(" ")
      Row(fields(0), fields(1).toInt, fields(2).toInt)
    })
    val schema = StructType(
      List(
        StructField("name", StringType),
        StructField("age", IntegerType),
        StructField("score", IntegerType)
      )
    )
    val dataFrame: DataFrame = spark.createDataFrame(rowRdd, schema)
    dataFrame.show()
    spark.stop()
  }
}
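
A note on this design: the schema is an ordinary StructType value, so it does not have to be written out by hand at compile time; it can just as well be assembled at runtime. A minimal sketch under that assumption, reusing the rowRdd from the demo above (the column list is illustrative):

    // illustrative only: build the same schema from a runtime list of (name, type) pairs
    val columns = Seq("name" -> StringType, "age" -> IntegerType, "score" -> IntegerType)
    val dynamicSchema: StructType = StructType(columns.map { case (n, t) => StructField(n, t) })
    val df: DataFrame = spark.createDataFrame(rowRdd, dynamicSchema)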

Method 3:
Store tuples in the RDD; after importing spark.implicits._ you can call toDF directly on it (passing column names if desired) to get a DataFrame. Code:


import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameDemo3 {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName(this.getClass.getSimpleName).master("local[*]").getOrCreate()
    val rdd: RDD[String] = spark.sparkContext.parallelize(List("xiaohong 19 90", "xiaozhang 20 90", "huahua 21 99"))
    val tupleRdd: RDD[(String, Int, Int)] = rdd.map(line => {
      val fields: Array[String] = line.split(" ")
      (fields(0), fields(1).toInt, fields(2).toInt)
    })
    import spark.implicits._
    val dataFrame: DataFrame = tupleRdd.toDF("name", "age", "score")
    dataFrame.show()
    spark.stop()
  }
}
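
If toDF() is called on a tuple RDD without arguments, the columns fall back to the default names _1, _2 and _3; names can also be assigned afterwards by calling toDF on the DataFrame itself. A short illustrative continuation of the demo above:

    // without explicit names the tuple fields become columns _1, _2, _3 ...
    val unnamed: DataFrame = tupleRdd.toDF()
    // ... and calling toDF on the DataFrame renames them
    val renamed: DataFrame = unnamed.toDF("name", "age", "score")
    renamed.printSchema()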

Method 4:
Define a JavaBean class and store instances of it in an RDD, then call SparkSession's createDataFrame method, passing the RDD and the JavaBean's Class object, to get a DataFrame. The Man class imported in the demo is not shown in the original post; a hypothetical sketch of it, followed by the demo code, is given below.
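
A guess at what the Man JavaBean might look like, written here as a Scala class with @BeanProperty so that the getters createDataFrame relies on for reflection-based schema inference are generated (the field names and types are assumed from how the demo constructs it):

import scala.beans.BeanProperty

// hypothetical sketch of the Man JavaBean used by DataFrameDemo4
class Man(@BeanProperty var name: String,
          @BeanProperty var age: Int,
          @BeanProperty var score: Int) extends Serializable {
  def this() = this(null, 0, 0) // no-arg constructor, as JavaBean convention expects
}

And the demo code: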

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
import sparksql.Man

object DataFrameDemo4 {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName(this.getClass.getSimpleName).master("local[*]").getOrCreate()
    val rdd: RDD[String] = spark.sparkContext.parallelize(List("xiaohong 19 90", "xiaozhang 20 90", "huahua 21 99"))
    val beanRdd: RDD[Man] = rdd.map(line => {
      val fields: Array[String] = line.split(" ")
      new Man(fields(0), fields(1).toInt, fields(2).toInt)
    })
    // createDataFrame inspects Man's JavaBean getters via reflection to derive the schema
    val dataFrame: DataFrame = spark.createDataFrame(beanRdd, classOf[Man])
    dataFrame.show()
    spark.stop()
  }
}
