There are three main ways to create a DataFrame.
First, prepare some test RDD data:
scala> val rdd=sc.makeRDD(List("Mina,19","Andy,30","Michael,29"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[10] at makeRDD at <console>:24
Note that an RDD only gains the toDF and toDS methods after import spark.implicits._:
scala> import spark.implicits._
import spark.implicits._
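In the spark-shell, the spark SparkSession (and therefore these implicits) is already available. In a standalone program you would create the SparkSession yourself before importing its implicits; a minimal sketch, in which the app name and master are placeholder values:

import org.apache.spark.sql.SparkSession
// build the SparkSession first; the implicits live on this instance
val spark = SparkSession.builder()
  .appName("rdd-df-ds-demo")   // placeholder application name
  .master("local[*]")          // placeholder master, for local testing only
  .getOrCreate()
import spark.implicits._       // toDF / toDS now become available on RDDs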
(1) Converting an RDD to a DataFrame
1. toDF
scala> rdd.map{x=>val par=x.split(",");(par(0),par(1).toInt)}.toDF("name","age")
res3: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> res3.show
+-------+---+
| name|age|
+-------+---+
| Mina| 19|
| Andy| 30|
|Michael| 29|
+-------+---+
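If toDF is called without column names, the columns fall back to the tuple's positional names _1 and _2; a minimal sketch:

// without explicit names the columns default to _1, _2, ...
val dfDefault = rdd.map{x => val par = x.split(","); (par(0), par(1).toInt)}.toDF()
dfDefault.printSchema()   // shows _1: string and _2: integer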
2. Inferring the schema via reflection
Spark SQL can automatically convert an RDD of case class instances into a DataFrame: the case class defines the table structure, and its fields become the column names via reflection. Case classes can also contain complex types such as Seqs or Arrays (a short sketch with a Seq field follows the example below).
scala> case class Person(name:String,age:Int)
defined class Person
scala> val df = rdd.map{x => val par = x.split(",");Person(par(0),par(1).toInt)}.toDF
df: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> df.show
+-------+---+
| name|age|
+-------+---+
| Mina| 19|
| Andy| 30|
|Michael| 29|
+-------+---+
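As a minimal sketch of the complex types mentioned above, a case class with a Seq field maps to an array column (PersonWithTags and its data are made up for illustration):

// a case class containing a Seq; the corresponding column becomes array<string>
case class PersonWithTags(name: String, tags: Seq[String])
val tagDF = sc.makeRDD(List(PersonWithTags("Mina", Seq("scala", "spark")))).toDF
tagDF.printSchema()   // name: string, tags: array (element: string)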
3. Specifying the schema programmatically
If a case class cannot be defined in advance, a DataFrame can be built in three steps: construct the schema as a StructType, convert the original RDD to an RDD[Row], and apply the schema with createDataFrame.
scala> import org.apache.spark.sql._
import org.apache.spark.sql._
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
// In practice the column names are generated dynamically at runtime; the column types are then chosen based on those names, and a StructType is built from them.
scala> val schemaString = "name age"
schemaString: String = name age
scala> val field = schemaString.split(" ").map(fieldName => fieldName match { case "name" => StructField(fieldName, StringType, nullable = true); case "age" => StructField(fieldName, IntegerType, nullable = true) })
field: Array[org.apache.spark.sql.types.StructField] = Array(StructField(name,StringType,true), StructField(age,IntegerType,true))
scala> val schema = StructType(field)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,IntegerType,true))
scala> val rowRDD = rdd.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1).toInt))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[2] at map at <console>:35
scala> val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> peopleDF.show
+-------+---+
| name|age|
+-------+---+
| Mina| 19|
| Andy| 30|
|Michael| 29|
+-------+---+
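When neither the column names nor their types are known ahead of time, a common simplification (used in the Spark SQL programming guide) is to treat every dynamically generated column as a string; a minimal sketch reusing schemaString:

// treat every dynamically named column as StringType when types are unknown
val stringFields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
val stringSchema = StructType(stringFields)
val stringRowRDD = rdd.map(_.split(",")).map(attrs => Row(attrs(0), attrs(1).trim))
val stringDF = spark.createDataFrame(stringRowRDD, stringSchema)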
(2) Converting a DataFrame to an RDD
df.rdd returns an RDD[Row]. To read values out of a Row, you can use
getString(0) or getAs[String]("name");
see the Row class for more accessor methods (a short sketch follows the example below).
scala> df.rdd.map(x=>x.getAs[String]("name")).collect
res8: Array[String] = Array(Mina, Andy, Michael)
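Besides getAs, a Row can also be read positionally or destructured with pattern matching; a minimal sketch:

// positional access on Row
df.rdd.map(row => row.getString(0)).collect
// pattern matching on Row (org.apache.spark.sql.Row was imported above)
df.rdd.map { case Row(name: String, age: Int) => s"$name is $age" }.collect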
(1) Converting an RDD to a Dataset
// Person was already defined above
case class Person(name:String,age:Int)
// spark.implicits._ was already imported above
import spark.implicits._
scala> val ds = rdd.map{x => val par = x.split(",");Person(par(0),par(1).trim().toInt)}.toDS
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]
scala> ds.show
+-------+---+
| name|age|
+-------+---+
| Mina| 19|
| Andy| 30|
|Michael| 29|
+-------+---+
(2) Converting a Dataset to an RDD
ds.rdd returns an RDD[Person], so the Person fields can be read directly:
scala> ds.rdd
res12: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[20] at rdd at <console>:40
scala> ds.rdd.map(_.name).collect
res13: Array[String] = Array(Mina, Andy, Michael)
(1) Converting a Dataset to a DataFrame
scala> ds.toDF
res14: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> ds.toDF.show
+-------+---+
| name|age|
+-------+---+
| Mina| 19|
| Andy| 30|
|Michael| 29|
+-------+---+
(2) Converting a DataFrame to a Dataset
case class Person(name:String,age:Int) defines the case class.
df.as[Person] requires that the DataFrame's column names and number of columns match the case class fields one to one (a sketch for handling mismatched column names follows the example below).
scala> df.as[Person]
res16: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]
scala> df.as[Person].show
+-------+---+
| name|age|
+-------+---+
| Mina| 19|
| Andy| 30|
|Michael| 29|
+-------+---+
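If the DataFrame's column names do not line up with the case class fields, the columns can be renamed before calling as[Person]; a minimal sketch in which rawDF and its columns n and a are hypothetical names used only for illustration:

// rename mismatched columns before converting to a typed Dataset
val fixedDS = rawDF
  .withColumnRenamed("n", "name")
  .withColumnRenamed("a", "age")
  .as[Person]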