Spark SQL: Dataset Explained

Spark SQL introduces a data abstraction called Dataset, which is an intelligent, optimizable wrapper around RDD.
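
Before going through the documentation, here is a minimal sketch of how an ordinary RDD can be wrapped into a Dataset. The application name, master setting, the Person case class and the sample data are only illustrative assumptions, not anything from the Spark docs:

import org.apache.spark.sql.SparkSession

object DatasetIntro {

  // Hypothetical domain class used only for this illustration.
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-intro")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._   // brings the encoders and the toDS() helper into scope

    // Start from an ordinary RDD ...
    val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 31), Person("Bob", 25)))

    // ... and wrap it into a strongly typed, optimizable Dataset[Person].
    val people = rdd.toDS()
    people.filter(_.age > 30).show()

    spark.stop()
  }
}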

The official documentation already gives a detailed introduction to Dataset:

(Figure 1: screenshot of the official Dataset documentation)

Next, let's look at the doc comment in the Dataset source code:

/**
 * A Dataset is a strongly typed collection of domain-specific objects that can be transformed
 * in parallel using functional or relational operations. Each Dataset also has an untyped view
 * called a `DataFrame`, which is a Dataset of [[Row]].
 *
 * Operations available on Datasets are divided into transformations and actions. Transformations
 * are the ones that produce new Datasets, and actions are the ones that trigger computation and
 * return results. Example transformations include map, filter, select, and aggregate (`groupBy`).
 * Example actions include count, show, or writing data out to file systems.
 *
 * Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally,
 * a Dataset represents a logical plan that describes the computation required to produce the data.
 * When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a
 * physical plan for efficient execution in a parallel and distributed manner. To explore the
 * logical plan as well as optimized physical plan, use the `explain` function.
 *
 * To efficiently support domain-specific objects, an [[Encoder]] is required. The encoder maps
 * the domain specific type `T` to Spark's internal type system. For example, given a class `Person`
 * with two fields, `name` (string) and `age` (int), an encoder is used to tell Spark to generate
 * code at runtime to serialize the `Person` object into a binary structure. This binary structure
 * often has much lower memory footprint as well as are optimized for efficiency in data processing
 * (e.g. in a columnar format). To understand the internal binary representation for data, use the
 * `schema` function.
 *
 * There are typically two ways to create a Dataset. The most common way is by pointing Spark
 * to some files on storage systems, using the `read` function available on a `SparkSession`.
 * {{{
 *   val people = spark.read.parquet("...").as[Person]  // Scala
 *   Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java
 * }}}
 *
 * Datasets can also be created through transformations available on existing Datasets. For example,
 * the following creates a new Dataset by applying a filter on the existing one:
 * {{{
 *   val names = people.map(_.name)  // in Scala; names is a Dataset[String]
 *   Dataset<String> names = people.map((Person p) -> p.name, Encoders.STRING);
 * }}}
 *
 * Dataset operations can also be untyped, through various domain-specific-language (DSL)
 * functions defined in: Dataset (this class), [[Column]], and [[functions]]. These operations
 * are very similar to the operations available in the data frame abstraction in R or Python.
 *
 * To select a column from the Dataset, use `apply` method in Scala and `col` in Java.
 * {{{
 *   val ageCol = people("age")  // in Scala
 *   Column ageCol = people.col("age"); // in Java
 * }}}
 *
 * Note that the [[Column]] type can also be manipulated through its various functions.
 * {{{
 *   // The following creates a new column that increases everybody's age by 10.
 *   people("age") + 10  // in Scala
 *   people.col("age").plus(10);  // in Java
 * }}}
 *
 * A more concrete example in Scala:
 * {{{
 *   // To create Dataset[Row] using SparkSession
 *   val people = spark.read.parquet("...")
 *   val department = spark.read.parquet("...")
 *
 *   people.filter("age > 30")
 *     .join(department, people("deptId") === department("id"))
 *     .groupBy(department("name"), people("gender"))
 *     .agg(avg(people("salary")), max(people("age")))
 * }}}
 *
 * and in Java:
 * {{{
 *   // To create Dataset<Row> using SparkSession
 *   Dataset<Row> people = spark.read().parquet("...");
 *   Dataset<Row> department = spark.read().parquet("...");
 *
 *   people.filter(people.col("age").gt(30))
 *     .join(department, people.col("deptId").equalTo(department.col("id")))
 *     .groupBy(department.col("name"), people.col("gender"))
 *     .agg(avg(people.col("salary")), max(people.col("age")));
 * }}}
 *
 * @groupname basic Basic Dataset functions
 * @groupname action Actions
 * @groupname untypedrel Untyped transformations
 * @groupname typedrel Typed transformations
 *
 * @since 1.6.0
 */
@InterfaceStability.Stable
class Dataset[T] private[sql](
    @transient val sparkSession: SparkSession,
    @DeveloperApi @InterfaceStability.Unstable @transient val queryExecution: QueryExecution,
    encoder: Encoder[T])
  extends Serializable {

  queryExecution.assertAnalyzed()

  // Note for Spark contributors: if adding or updating any action in `Dataset`, please make sure
  // you wrap it with `withNewExecutionId` if this actions doesn't call other action.

  def this(sparkSession: SparkSession, logicalPlan: LogicalPlan, encoder: Encoder[T]) = {
    this(sparkSession, sparkSession.sessionState.executePlan(logicalPlan), encoder)
  }

  def this(sqlContext: SQLContext, logicalPlan: LogicalPlan, encoder: Encoder[T]) = {
    this(sqlContext.sparkSession, logicalPlan, encoder)
  }

  @transient private[sql] val logicalPlan: LogicalPlan = {
    // For various commands (like DDL) and queries with side effects, we force query execution
    // to happen right away to let these side effects take place eagerly.
    queryExecution.analyzed match {
      case c: Command =>
        LocalRelation(c.output, queryExecution.executedPlan.executeCollect())
      case u @ Union(children) if children.forall(_.isInstanceOf[Command]) =>
        LocalRelation(u.output, queryExecution.executedPlan.executeCollect())
      case _ =>
        queryExecution.analyzed
    }
  }

Paraphrasing the doc comment above:

A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Every Dataset also has an untyped view called a DataFrame, which is simply a Dataset of [[Row]].

The operations available on a Dataset fall into transformations and actions. Transformations (such as map, filter, select and the groupBy family of aggregations) produce new Datasets; actions (such as count, show, or writing data out to a file system) trigger computation and return results.

Datasets are lazy: computation only happens when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation needed to produce the data. When an action is called, Spark's query optimizer optimizes this logical plan and generates a physical plan that can be executed efficiently in a parallel, distributed manner. The explain function shows both the logical plan and the optimized physical plan.

To support domain-specific objects efficiently, an [[Encoder]] is required. The encoder maps the domain type T onto Spark's internal type system. For example, for a class Person with the fields name (string) and age (int), the encoder tells Spark how to generate code at runtime that serializes a Person object into a binary structure. This binary structure usually has a much smaller memory footprint and is optimized for data processing (for example, a columnar format). The schema function shows the internal representation of the data.

There are typically two ways to create a Dataset. The most common one is to point Spark at files on a storage system via the read function on a SparkSession (followed by .as[Person] in Scala, or .as(Encoders.bean(Person.class)) in Java). A Dataset can also be created by applying transformations to an existing Dataset, for example mapping people to their names, which yields a Dataset[String].

Dataset operations can also be untyped, through the domain-specific-language (DSL) functions defined on Dataset itself, on [[Column]] and in [[functions]]; these are very similar to the DataFrame operations available in R or Python. A column is selected with apply in Scala (people("age")) or col in Java (people.col("age")), and a Column can be manipulated further, for example people("age") + 10 creates a new column with everybody's age increased by 10. The Scala and Java examples at the end of the doc comment show how these pieces combine into filters, joins, groupBy and aggregations.
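
The two functions mentioned above, schema and explain, are easy to try out. The following is only a minimal sketch meant for spark-shell (where the SparkSession spark is already defined); the Person case class and the sample rows are made-up illustrations, and the Dataset is built from a local Seq because the parquet path in the doc comment is elided:

import spark.implicits._

case class Person(name: String, age: Int)

val people = Seq(Person("Alice", 31), Person("Bob", 25)).toDS()

// The encoder maps Person onto Spark's internal type system;
// printSchema() shows the resulting structure (name: string, age: int).
people.printSchema()

// explain(true) prints the parsed, analyzed and optimized logical plans
// as well as the physical plan chosen for distributed execution.
people.filter(_.age > 30).explain(true)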

Summary

1. Dataset is a new API introduced in Spark 1.6. It is also a distributed collection, but compared with RDD it carries much more descriptive information; conceptually it is equivalent to a two-dimensional table in a relational database.
2. Because a Dataset carries this extra descriptive information, Spark can optimize the computation at runtime.
3. The data in a Dataset is strongly typed and can be processed with rich lambda expressions, which makes up for some of the drawbacks of pure functional programming and is more convenient to use (see the sketch after this list).
4. In Scala, a DataFrame is simply a Dataset[Row].
5. Characteristics of a Dataset:

	1. A list of partitions
	2. A function applied to each partition (split)
	3. A list of dependencies on other Datasets
	4. A partitioner for key-value data when shuffling
	5. Awareness of the preferred (optimal) locations when reading data from HDFS
	6. An optimized execution plan
	7. Support for smarter data sources

6. Calling a method on a Dataset first builds a logical plan, which Spark's optimizer then optimizes into a physical plan that is finally submitted to the cluster for execution, as shown in the sketch below.
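
To make points 3, 4 and 6 concrete, here is a minimal Scala sketch, again intended for spark-shell (where spark is already defined); the Person case class, the column name age and the sample rows are assumptions made up for illustration:

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

case class Person(name: String, age: Int)

// Strongly typed side: compile-time checked fields and lambda expressions (point 3).
val ds: Dataset[Person] = Seq(Person("Alice", 31), Person("Bob", 25)).toDS()
val adults: Dataset[Person] = ds.filter(_.age > 30)

// Untyped side: in Scala, DataFrame is just a type alias for Dataset[Row] (point 4).
val df: DataFrame = ds.toDF()
val adultsDf: DataFrame = df.filter($"age" > 30)

// Nothing has been computed yet; each call above only extended a logical plan.
// explain(true) prints the logical plans and the optimized physical plan that is
// submitted to the cluster once an action such as show() or count() runs (point 6).
adultsDf.explain(true)
adultsDf.show()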
