Four ways to create a DataFrame in Spark SQL:

Method 1:
Define a case class and use it as the element type of the RDD, import spark.implicits._, and then call the RDD's toDF method directly to produce a DataFrame. The code is as follows:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameDemo1 {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName(this.getClass.getSimpleName)
      .master("local[*]")
      .getOrCreate()
    // Raw text lines: name, age and score separated by spaces
    val rdd: RDD[String] = spark.sparkContext.parallelize(
      List("xiaohong 19 90", "xiaozhang 20 90", "huahua 21 99"))
    // Map each line onto the Boy case class
    val boyRdd: RDD[Boy] = rdd.map(line => {
      val fields: Array[String] = line.split(" ")
      Boy(fields(0), fields(1).toInt, fields(2).toInt)
    })
    // The implicit conversions add toDF to RDDs of case-class elements
    import spark.implicits._
    val dataFrame: DataFrame = boyRdd.toDF()
    dataFrame.createTempView("v_boy")
    dataFrame.show()
    spark.stop()
  }
}

case class Boy(name: String, age: Int, score: Int)
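The example registers the temporary view v_boy but never queries it. As a minimal sketch (the filter condition is only illustrative and not part of the original), a SQL query against that view could be placed before spark.stop():

spark.sql("SELECT name, age, score FROM v_boy WHERE score >= 90").show()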
Method 2:
Use the createDataFrame method of SparkSession. The method is overloaded; the overload used here takes two arguments: rowRDD, an RDD whose elements are of type Row, and schema, of type StructType. The code is as follows:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

/*
 * Two ways to obtain a DataFrame:
 * 1. associate a case class with the RDD
 * 2. RDD[Row] + schema
 */
object DataFrameDemo2 {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName(this.getClass.getSimpleName)
      .master("local[*]")
      .getOrCreate()
    val rdd: RDD[String] = spark.sparkContext.parallelize(
      List("xiaohong 19 90", "xiaozhang 20 90", "huahua 21 99"))
    // Wrap each line in a generic Row; column names and types come from the schema
    val rowRdd: RDD[Row] = rdd.map(line => {
      val fields: Array[String] = line.split(" ")
      Row(fields(0), fields(1).toInt, fields(2).toInt)
    })
    // The schema names and types each column
    val schema = StructType(
      List(
        StructField("name", StringType),
        StructField("age", IntegerType),
        StructField("score", IntegerType)
      )
    )
    val dataFrame: DataFrame = spark.createDataFrame(rowRdd, schema)
    dataFrame.show()
    spark.stop()
  }
}
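StructField also takes a nullable flag, which defaults to true. If, for example, name must never be null, the schema could be declared with explicit nullability; this is a variation on the schema above, not part of the original example:

val strictSchema = StructType(
  List(
    StructField("name", StringType, nullable = false),
    StructField("age", IntegerType, nullable = true),
    StructField("score", IntegerType, nullable = true)
  )
)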
Method 3:
Store tuples in the RDD; after importing spark.implicits._, toDF can be called directly (passing the column names) to produce a DataFrame. The code is as follows:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameDemo3 {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName(this.getClass.getSimpleName)
      .master("local[*]")
      .getOrCreate()
    val rdd: RDD[String] = spark.sparkContext.parallelize(
      List("xiaohong 19 90", "xiaozhang 20 90", "huahua 21 99"))
    // Store each record as a tuple instead of a case class
    val tupleRdd: RDD[(String, Int, Int)] = rdd.map(line => {
      val fields: Array[String] = line.split(" ")
      (fields(0), fields(1).toInt, fields(2).toInt)
    })
    import spark.implicits._
    // Column names are supplied here because tuples carry no field names
    val dataFrame: DataFrame = tupleRdd.toDF("name", "age", "score")
    dataFrame.show()
    spark.stop()
  }
}
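If toDF() is called on a tuple RDD without arguments, the columns fall back to the default names _1, _2 and _3; they can still be renamed afterwards, for example:

val unnamed: DataFrame = tupleRdd.toDF()  // columns: _1, _2, _3
val renamed: DataFrame = unnamed.withColumnRenamed("_1", "name")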
Method 4:
Define a JavaBean class and store instances of it in an RDD, then call createDataFrame on the SparkSession, passing the RDD and the JavaBean's Class object (classOf[Man]) as the two arguments; this also produces a DataFrame. The code is as follows:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
import sparksql.Man

object DataFrameDemo4 {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName(this.getClass.getSimpleName)
      .master("local[*]")
      .getOrCreate()
    val rdd: RDD[String] = spark.sparkContext.parallelize(
      List("xiaohong 19 90", "xiaozhang 20 90", "huahua 21 99"))
    // Build the JavaBean instances from the raw lines
    val beanRdd: RDD[Man] = rdd.map(line => {
      val fields: Array[String] = line.split(" ")
      new Man(fields(0), fields(1).toInt, fields(2).toInt)
    })
    // The schema is inferred from the bean's getter methods
    val dataFrame: DataFrame = spark.createDataFrame(beanRdd, classOf[Man])
    dataFrame.show()
    spark.stop()
  }
}
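The Man class is imported from the sparksql package but not shown here. For this overload of createDataFrame, Spark infers the schema from the bean's getters, so the class must follow JavaBean conventions. A minimal Scala sketch of such a bean (an assumption about its shape, not the author's actual class) could look like this:

package sparksql

import scala.beans.BeanProperty

// @BeanProperty generates the getName/setName-style accessors that
// Spark's bean-based schema inference relies on
class Man(@BeanProperty var name: String,
          @BeanProperty var age: Int,
          @BeanProperty var score: Int) extends Serializable {
  def this() = this(null, 0, 0) // no-arg constructor, per JavaBean convention
}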