Spark SQL - JSON Datasets

Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset<Row>. This conversion can be performed with SparkSession.read().json() on either a Dataset<String> or a JSON file.

Note that the file expected here is not a typical JSON file: each line must contain a separate, self-contained, valid JSON object. For more information, see JSON Lines text format, also called newline-delimited JSON.

For a regular multi-line JSON file, set the multiLine option to true.
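As a minimal sketch of the multiLine option, the snippet below writes a single pretty-printed JSON object (spanning several lines) to a temporary file and reads it back; the file path and record contents are made up for illustration:

```java
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MultiLineJsonExample {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("MultiLineJson")
        .master("local[*]")
        .getOrCreate();

    // A single JSON object spanning several lines: the default one-record-per-line
    // reader cannot parse this, but multiLine mode can.
    Path file = Files.createTempFile("people", ".json"); // hypothetical input file
    Files.writeString(file,
        "{\n  \"name\": \"Michael\",\n  \"age\": 30\n}\n");

    Dataset<Row> people = spark.read()
        .option("multiLine", true)
        .json(file.toString());

    people.show();

    spark.stop();
  }
}
```

Without the option, the same file would produce rows of `_corrupt_record`, since each physical line is not a valid JSON object on its own.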

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
Dataset<Row> people = spark.read().json("examples/src/main/resources/people.json");

// The inferred schema can be visualized using the printSchema() method
people.printSchema();
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

// Creates a temporary view using the DataFrame
people.createOrReplaceTempView("people");

// SQL statements can be run by using the sql methods provided by spark
Dataset<Row> namesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19");
namesDF.show();
// +------+
// |  name|
// +------+
// |Justin|
// +------+

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a Dataset storing one JSON object per string.
List<String> jsonData = Arrays.asList(
        "{\"name\":\"Yin\",\"address\":{\"city\":\"Columbus\",\"state\":\"Ohio\"}}");
Dataset<String> anotherPeopleDataset = spark.createDataset(jsonData, Encoders.STRING());
Dataset<Row> anotherPeople = spark.read().json(anotherPeopleDataset);
anotherPeople.show();
// +---------------+----+
// |        address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+
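Since the inferred schema above contains a nested struct (address), it may help to show how nested fields are queried. The self-contained sketch below rebuilds the same one-row dataset and selects a nested field with dot notation; the view name is chosen arbitrarily here:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class NestedJsonExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("NestedJson")
        .master("local[*]")
        .getOrCreate();

    // Same one-record JSON dataset as above, with a nested address object
    List<String> jsonData = Arrays.asList(
        "{\"name\":\"Yin\",\"address\":{\"city\":\"Columbus\",\"state\":\"Ohio\"}}");
    Dataset<Row> people = spark.read()
        .json(spark.createDataset(jsonData, Encoders.STRING()));

    // Nested struct fields are addressed with dot notation in SQL
    people.createOrReplaceTempView("people_nested"); // hypothetical view name
    Dataset<Row> cities =
        spark.sql("SELECT name, address.city AS city FROM people_nested");
    cities.show();

    spark.stop();
  }
}
```

The same dot notation works in the DataFrame API, e.g. people.select("name", "address.city").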
