Enterprise Spark Case Study -- Hotel Data Analysis in Practice (Submission)

Level 1: Data Cleaning -- Filter Records with Too Few Fields and Convert the Birth Date:

package com.yy


 

import org.apache.spark.rdd.RDD

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object edu{

    /**********Begin**********/

    // Relevant code can be filled in here

    case class Person(
        id: String, Name: String, CtfTp: String, CtfId: String, Gender: String,
        Birthday: String, Address: String, Zip: String, Duty: String, Mobile: String,
        Tel: String, Fax: String, EMail: String, Nation: String, Taste: String,
        Education: String, Company: String, Family: String, Version: String,
        Hotel: String, Grade: String, Duration: String, City: String)

    /**********End**********/

    def main(args: Array[String]): Unit = {

        val spark = SparkSession

        .builder()

        .appName("Spark SQL")

        .master("local")

        .config("spark.some.config.option", "some-value")

        .getOrCreate()

        val rdd = spark.sparkContext.textFile("file:///root/files/part-00000-4ead9570-10e5-44dc-80ad-860cb072a9ff-c000.csv")

        /**********Begin**********/

        // Clean out dirty data (rows with fewer than 23 fields are treated as dirty)

        val rdd1: RDD[String] = rdd.filter(x => {
            val e = x.split(",", -1)  // limit -1 keeps trailing empty fields in the count
            e.length == 23            // keep only rows with exactly 23 fields
        })

        /**********End**********/
    }
}
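The level title also asks for a birth-date conversion, which the code excerpt does not show. Below is a minimal sketch of both cleaning steps as plain Scala helpers, assuming 23 comma-separated fields per row and an 8-digit `yyyyMMdd` birthday string (both formats are assumptions; adjust them to the actual dataset):

```scala
object CleanHelpers {
  // A row is kept only if splitting on commas yields exactly 23 fields;
  // the -1 limit keeps trailing empty strings so the count is honest.
  def hasAllFields(line: String): Boolean =
    line.split(",", -1).length == 23

  // Assumed input format yyyyMMdd, e.g. "19820316" -> "1982-03-16".
  // Anything that does not match the pattern is returned unchanged.
  def normalizeBirthday(raw: String): String =
    if (raw != null && raw.matches("\\d{8}"))
      s"${raw.substring(0, 4)}-${raw.substring(4, 6)}-${raw.substring(6, 8)}"
    else raw
}
```

These helpers plug directly into the RDD pipeline, e.g. `rdd.filter(CleanHelpers.hasAllFields)`.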

  
