Spark SQL introduces a data abstraction called Dataset, which is an intelligent wrapper around RDD.
The official documentation describes Dataset in considerable detail.
Let's start with the explanation in the Dataset source code itself:
/**
* A Dataset is a strongly typed collection of domain-specific objects that can be transformed
* in parallel using functional or relational operations. Each Dataset also has an untyped view
* called a `DataFrame`, which is a Dataset of [[Row]].
*
* Operations available on Datasets are divided into transformations and actions. Transformations
* are the ones that produce new Datasets, and actions are the ones that trigger computation and
* return results. Example transformations include map, filter, select, and aggregate (`groupBy`).
* Example actions count, show, or writing data out to file systems.
*
* Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally,
* a Dataset represents a logical plan that describes the computation required to produce the data.
* When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a
* physical plan for efficient execution in a parallel and distributed manner. To explore the
* logical plan as well as optimized physical plan, use the `explain` function.
*
* To efficiently support domain-specific objects, an [[Encoder]] is required. The encoder maps
* the domain specific type `T` to Spark's internal type system. For example, given a class `Person`
* with two fields, `name` (string) and `age` (int), an encoder is used to tell Spark to generate
* code at runtime to serialize the `Person` object into a binary structure. This binary structure
* often has much lower memory footprint as well as are optimized for efficiency in data processing
* (e.g. in a columnar format). To understand the internal binary representation for data, use the
* `schema` function.
*
* There are typically two ways to create a Dataset. The most common way is by pointing Spark
* to some files on storage systems, using the `read` function available on a `SparkSession`.
* {{{
* val people = spark.read.parquet("...").as[Person] // Scala
* Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java
* }}}
*
* Datasets can also be created through transformations available on existing Datasets. For example,
* the following creates a new Dataset by applying a filter on the existing one:
* {{{
* val names = people.map(_.name) // in Scala; names is a Dataset[String]
* Dataset<String> names = people.map((Person p) -> p.name, Encoders.STRING); // in Java
* }}}
*
* Dataset operations can also be untyped, through various domain-specific-language (DSL)
* functions defined in: Dataset (this class), [[Column]], and [[functions]]. These operations
* are very similar to the operations available in the data frame abstraction in R or Python.
*
* To select a column from the Dataset, use `apply` method in Scala and `col` in Java.
* {{{
* val ageCol = people("age") // in Scala
* Column ageCol = people.col("age"); // in Java
* }}}
*
* Note that the [[Column]] type can also be manipulated through its various functions.
* {{{
* // The following creates a new column that increases everybody's age by 10.
* people("age") + 10 // in Scala
* people.col("age").plus(10); // in Java
* }}}
*
* A more concrete example in Scala:
* {{{
* // To create Dataset[Row] using SparkSession
* val people = spark.read.parquet("...")
* val department = spark.read.parquet("...")
*
* people.filter("age > 30")
* .join(department, people("deptId") === department("id"))
* .groupBy(department("name"), people("gender"))
* .agg(avg(people("salary")), max(people("age")))
* }}}
*
* and in Java:
* {{{
* // To create Dataset using SparkSession
* Dataset<Row> people = spark.read().parquet("...");
* Dataset<Row> department = spark.read().parquet("...");
*
* people.filter(people.col("age").gt(30))
* .join(department, people.col("deptId").equalTo(department.col("id")))
* .groupBy(department.col("name"), people.col("gender"))
* .agg(avg(people.col("salary")), max(people.col("age")));
* }}}
*
* @groupname basic Basic Dataset functions
* @groupname action Actions
* @groupname untypedrel Untyped transformations
* @groupname typedrel Typed transformations
*
* @since 1.6.0
*/
@InterfaceStability.Stable
class Dataset[T] private[sql](
    @transient val sparkSession: SparkSession,
    @DeveloperApi @InterfaceStability.Unstable @transient val queryExecution: QueryExecution,
    encoder: Encoder[T])
  extends Serializable {

  queryExecution.assertAnalyzed()

  // Note for Spark contributors: if adding or updating any action in `Dataset`, please make sure
  // you wrap it with `withNewExecutionId` if this action doesn't call other actions.

  def this(sparkSession: SparkSession, logicalPlan: LogicalPlan, encoder: Encoder[T]) = {
    this(sparkSession, sparkSession.sessionState.executePlan(logicalPlan), encoder)
  }

  def this(sqlContext: SQLContext, logicalPlan: LogicalPlan, encoder: Encoder[T]) = {
    this(sqlContext.sparkSession, logicalPlan, encoder)
  }

  @transient private[sql] val logicalPlan: LogicalPlan = {
    // For various commands (like DDL) and queries with side effects, we force query execution
    // to happen right away to let these side effects take place eagerly.
    queryExecution.analyzed match {
      case c: Command =>
        LocalRelation(c.output, queryExecution.executedPlan.executeCollect())
      case u @ Union(children) if children.forall(_.isInstanceOf[Command]) =>
        LocalRelation(u.output, queryExecution.executedPlan.executeCollect())
      case _ =>
        queryExecution.analyzed
    }
  }
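
Note the queryExecution.assertAnalyzed() call in the constructor: a Dataset is analyzed eagerly at the moment it is defined, even though execution itself stays lazy. A minimal sketch of what that means in practice, assuming a SparkSession named `spark` and a hypothetical temporary view `people` with `name` and `age` columns:

// Analysis (resolving tables and columns) happens as soon as the Dataset is created ...
val ok = spark.table("people").select("name", "age")   // no Spark job runs yet

// ... so an unresolvable column fails right here with an AnalysisException,
// long before any action is invoked:
// spark.table("people").select("no_such_column")

// Execution is still lazy: only an action such as count() actually runs a job
ok.count()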
In short, the scaladoc above says: a Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel with functional or relational operations, and its untyped view is the DataFrame, i.e. a Dataset of Row. Operations are divided into transformations, which are lazy and merely build up a logical plan, and actions, which trigger the query optimizer to produce a physical plan and execute it in a parallel, distributed manner (use `explain` to inspect the plans). To support domain objects efficiently, an Encoder maps the domain type `T` into Spark's compact internal binary format (inspect it with `schema`). A Dataset is typically created either by reading files through `SparkSession.read` or by transforming an existing Dataset, and untyped, DataFrame-style DSL operations are also available through `Column` and `functions`, for example selecting a column with `people("age")` in Scala or `people.col("age")` in Java.
1. Dataset is a new API introduced in Spark 1.6. It is also a distributed dataset; compared with an RDD it keeps much more descriptive information (a schema), and conceptually it is equivalent to a two-dimensional table in a relational database.
2. Because a Dataset carries this extra descriptive information, Spark can optimize the computation at runtime.
3. The data in a Dataset is strongly typed, and richer lambda expressions can be used, which compensates for some of the shortcomings of functional programming and makes the API more convenient to use (see the first sketch after this list).
4. In Scala, a DataFrame is simply a Dataset[Row].
5. Characteristics of a Dataset:
   1. A list of partitions
   2. A function that runs on each split (partition)
   3. Dependencies on other datasets
   4. For key-value data, a partitioner is used during shuffles
   5. When reading data from HDFS, the preferred (data-local) locations are taken into account
   6. The execution plan is optimized
   7. Smarter data sources are supported
6. Calling methods on a Dataset first produces a logical plan; the plan is then optimized by Spark's optimizer into a physical plan, which is finally submitted to the cluster for execution (see the second sketch after this list).
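
A minimal sketch of the typed API from points 3 and 4 above, assuming a local SparkSession and a hypothetical Person case class (the data is made up for illustration):

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("dataset-demo").master("local[*]").getOrCreate()
import spark.implicits._          // encoders for Dataset[Person], Dataset[String], and the $"..." syntax

// A strongly typed Dataset: transformations take ordinary Scala lambdas,
// and field access is checked at compile time
val people: Dataset[Person] = Seq(Person("Alice", 34), Person("Bob", 28)).toDS()
val adults: Dataset[Person] = people.filter(p => p.age >= 30)
val names:  Dataset[String] = adults.map(_.name)

// DataFrame is just an alias for Dataset[Row]; the two views convert freely
val df: DataFrame               = people.toDF()
val typedAgain: Dataset[Person] = df.as[Person]

names.show()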
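
And a small sketch of point 6, reusing `spark` and `people` from the sketch above: explain(true) prints the parsed and analyzed logical plans, the optimized logical plan, and the physical plan, while nothing is executed until an action runs.

// Transformations only build a logical plan; no Spark job has run at this point
val query = people
  .filter(p => p.age >= 30)
  .groupBy($"name")
  .count()

// Show how the optimizer rewrote the query: logical plans plus the chosen physical plan
query.explain(true)

// Only an action triggers execution of the physical plan on the cluster
query.show()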