Spark SQL is a Spark module for structured data processing. One use of Spark SQL is to execute SQL queries. When running SQL from within another programming language, the results are returned as a Dataset/DataFrame.

A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, the ability to use powerful lambda functions) together with the benefits of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.).
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs. In the Java API, a DataFrame is represented by a Dataset of Rows, so we refer to Dataset<Row> as a DataFrame below.
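The code examples that follow assume an existing SparkSession named spark, which is the entry point to all Spark SQL functionality. A minimal sketch of creating one (the app name and the config option below are placeholders) could look like:

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;  // used by the col("...") examples below

SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark SQL basic example")
  .config("spark.some.config.option", "some-value")  // optional placeholder config
  .getOrCreate();

With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources; for example, by reading a JSON file: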
Dataset<Row> df = spark.read().json("examples/src/main/resources/people.json");
// Print the schema in a tree format
df.printSchema();
// Select only the "name" column
df.select("name").show();
// Select everybody, but increment the age by 1
df.select(col("name"), col("age").plus(1)).show();
// Select people older than 21
df.filter(col("age").gt(21)).show();
// Count people by age
df.groupBy("age").count().show();
For a complete list of the types of operations that can be performed on a Dataset, refer to the API Documentation. The sql function on a SparkSession lets applications run SQL queries programmatically and returns the result as a Dataset<Row>:
df.createOrReplaceTempView("people");
Dataset<Row> sqlDF = spark.sql("SELECT * FROM people");
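The result returned by spark.sql is itself a DataFrame, so the operations shown earlier apply to it as well; for example (a minimal illustration):

sqlDF.show();
sqlDF.groupBy("age").count().show();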
Datasets use specialized Encoders to serialize objects, allowing Spark to perform many operations (such as filtering, sorting, and hashing) without deserializing the bytes back into objects.
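The examples below use a Person JavaBean that is not shown in this excerpt; here is a minimal sketch of what such a class could look like (the field names match the sample data used throughout, and the long type for age matches the schema Spark infers from the JSON file):

import java.io.Serializable;

public static class Person implements Serializable {
  private String name;
  private long age;

  public String getName() { return name; }
  public void setName(String name) { this.name = name; }
  public long getAge() { return age; }
  public void setAge(long age) { this.age = age; }
}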
// Create an instance of a Bean class
Person person = new Person();
person.setName("Andy");
person.setAge(32);
// Encoders are created for Java beans
Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> javaBeanDS = spark.createDataset(
  Collections.singletonList(person),
  personEncoder
);
// Encoders for most common types are provided in class Encoders
Encoder<Long> longEncoder = Encoders.LONG();
Dataset<Long> primitiveDS = spark.createDataset(Arrays.asList(1L, 2L, 3L), longEncoder);
Dataset<Long> transformedDS = primitiveDS.map(
  (MapFunction<Long, Long>) value -> value + 1L,
  longEncoder);
transformedDS.collect(); // Returns [2, 3, 4]
// DataFrames can be converted to a Dataset by providing a class. Mapping based on name
String path = "examples/src/main/resources/people.json";
Dataset<Person> peopleDS = spark.read().json(path).as(personEncoder);
Spark SQL supports two different methods for converting existing RDDs into Datasets. The first method uses reflection to infer the schema of an RDD that contains specific types of objects; the second method lets you construct a schema programmatically and apply it to an existing RDD.

Spark SQL supports automatically converting an RDD of JavaBeans into a DataFrame. The BeanInfo, obtained via reflection, defines the schema of the table.
// Create an RDD of Person objects from a text file
JavaRDD<Person> peopleRDD = spark.read()
  .textFile("examples/src/main/resources/people.txt")
  .javaRDD()
  .map(line -> {
    String[] parts = line.split(",");
    Person person = new Person();
    person.setName(parts[0]);
    person.setAge(Integer.parseInt(parts[1].trim()));
    return person;
  });
// Apply a schema to an RDD of JavaBeans to get a DataFrame
Dataset<Row> peopleDF = spark.createDataFrame(peopleRDD, Person.class);
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people");
// SQL statements can be run by using the sql methods provided by spark
Dataset<Row> teenagersDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19");
// The columns of a row in the result can be accessed by field index
Encoder<String> stringEncoder = Encoders.STRING();
Dataset<String> teenagerNamesByIndexDF = teenagersDF.map(
  (MapFunction<Row, String>) row -> "Name: " + row.getString(0),
  stringEncoder);
// or by field name
Dataset<String> teenagerNamesByFieldDF = teenagersDF.map(
  (MapFunction<Row, String>) row -> "Name: " + row.<String>getAs("name"),
  stringEncoder);
When JavaBean classes cannot be defined ahead of time, a Dataset<Row> can be created programmatically in three steps: create an RDD of Rows from the original RDD; create the schema, represented by a StructType, matching the structure of those Rows; and apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession.
// Create an RDD
JavaRDD<String> peopleRDD = spark.sparkContext()
  .textFile("examples/src/main/resources/people.txt", 1)
  .toJavaRDD();
// The schema is encoded in a string
String schemaString = "name age";
// Generate the schema based on the string of schema
List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
  StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
  fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);
// Convert records of the RDD (people) to Rows
JavaRDD<Row> rowRDD = peopleRDD.map((Function<String, Row>) record -> {
  String[] attributes = record.split(",");
  return RowFactory.create(attributes[0], attributes[1].trim());
});
// Apply the schema to the RDD
Dataset<Row> peopleDataFrame = spark.createDataFrame(rowRDD, schema);
// Creates a temporary view using the DataFrame
peopleDataFrame.createOrReplaceTempView("people");
// SQL can be run over a temporary view created using DataFrames
Dataset<Row> results = spark.sql("SELECT name FROM people");
// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
Dataset<String> namesDS = results.map(
  (MapFunction<Row, String>) row -> "Name: " + row.getString(0),
  Encoders.STRING());
Scalar functions are functions that return a single value per row, as opposed to aggregate functions, which return a value for a group of rows. Spark SQL supports a variety of built-in scalar functions, and it also supports user-defined scalar functions.
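As an illustration of a user-defined scalar function, the sketch below registers a hypothetical plus_one UDF via spark.udf().register and calls it from SQL (the function name and the people view are assumptions carried over from the earlier examples):

import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Register a scalar UDF that adds 1 to its input; "plus_one" is an illustrative name
spark.udf().register("plus_one", (UDF1<Long, Long>) x -> x + 1, DataTypes.LongType);
spark.sql("SELECT name, plus_one(age) FROM people WHERE age IS NOT NULL").show();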
Aggregate functions are functions that return a single value over a group of rows. The built-in aggregate functions provide common aggregations such as count(), count_distinct(), avg(), max(), min(), etc. Users are not limited to the predefined aggregate functions and can create their own; for more details, see the documentation on user-defined aggregate functions.
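As a sketch of a user-defined aggregate function (assuming Spark 3.x, where a typed Aggregator can be registered as an untyped UDAF through functions.udaf), here is a hypothetical my_sum aggregator that sums a long column:

import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.expressions.Aggregator;
import org.apache.spark.sql.functions;

// A minimal Aggregator: input, buffer, and output are all Long
public static class MySum extends Aggregator<Long, Long, Long> {
  @Override public Long zero() { return 0L; }                        // initial buffer value
  @Override public Long reduce(Long buffer, Long value) {            // fold one input value into the buffer
    return value == null ? buffer : buffer + value;
  }
  @Override public Long merge(Long b1, Long b2) { return b1 + b2; }  // combine buffers from different partitions
  @Override public Long finish(Long reduction) { return reduction; } // produce the final result
  @Override public Encoder<Long> bufferEncoder() { return Encoders.LONG(); }
  @Override public Encoder<Long> outputEncoder() { return Encoders.LONG(); }
}

// Register the aggregator and call it from SQL; "my_sum" is an illustrative name
spark.udf().register("my_sum", functions.udaf(new MySum(), Encoders.LONG()));
spark.sql("SELECT my_sum(age) FROM people").show();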