Creating a DataFrame with a SparkSQL Schema

Creating a DataFrame via a case class

Converting an RDD to a DataFrame through a case class is the method we use most often; DF.registerTempTable then registers the DataFrame as a table so it can be queried with SQL. In early versions (1.4.1, built against Scala 2.10), however, a case class could not have more than 22 fields, and tuples carry the same limit. So when a record has many fields, a case class is an inconvenient way to create a DataFrame.
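For context, here is a minimal sketch of the case class route, modeled on the official people.txt example quoted below (the Person class name is illustrative):

// Case-class route: the schema is inferred from Person's fields by reflection.
case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._  // enables .toDF() on RDDs

val peopleDF = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()

peopleDF.registerTempTable("people")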

Creating a DataFrame with a SparkSQL schema

Spark provides another way to create a DataFrame: specifying the schema programmatically. Below is the example from the official documentation:

// Source: http://spark.apache.org/docs/1.4.1/sql-programming-guide.html#inferring-the-schema-using-reflection

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create an RDD
val people = sc.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Import Row.
import org.apache.spark.sql.Row

// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Generate the schema based on the string of schema
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Apply the schema to the RDD.
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)

// Register the DataFrames as a table.
peopleDataFrame.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val results = sqlContext.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index or by field name.
results.map(t => "Name: " + t(0)).collect().foreach(println)

When I copied this example into my own code, I got an error:

failure: Lost task 17.3 in stage 0.0 (TID 52, zdh7en): java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.sql.types.UTF8String
  at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$1$$anonfun$apply$48.apply(Cast.scala:354)
  at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:111)

Roughly, this means a type mismatch. Looking at the code, I saw that I had copied the following line over without any changes; my fields have several different types, but this line declares every field as String:

val schema = StructType(schemaString.split(",").map(fieldName => StructField(fieldName, StringType, true)))

So it should be written like this instead, giving each field its actual type:

// Requires: import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType, DecimalType}
val schema = StructType(Array(
  StructField("x1", IntegerType, true),
  StructField("x2", IntegerType, true),
  StructField("x3", StringType, true),
  StructField("x4", DecimalType(18, 8), true)))

Note that for the decimal type you cannot simply write DecimalType on its own; the precision and scale in parentheses must be specified, e.g. DecimalType(18, 8). One more point: if your code has already created an hc (HiveContext) object, there is no need to create a separate sqlContext, since HiveContext extends SQLContext.
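A minimal sketch of that last point, assuming sc is an existing SparkContext:

// HiveContext extends SQLContext, so hc works anywhere sqlContext is used above.
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val peopleDataFrame = hc.createDataFrame(rowRDD, schema)
peopleDataFrame.registerTempTable("people")
hc.sql("SELECT name FROM people").show()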
