2025 Big Data Development Provincial Competition, Sample Set 1: Offline Data Processing Answers

Provincial Competition Sample Set 1, Module 1: Data Extraction

This module extracts data from MySQL into the ODS layer, where the metrics are later computed. The problem calls for a full (rather than incremental) extraction, with a new etl_date field added as the partition column; its value is the day before the competition.

import org.apache.spark.sql.SparkSession

import java.util.Properties

object Task1 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Task")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "hdfs://localhost:9000/user/hive/warehouse")
      .config("hive.metastore.uris", "thrift://localhost:9083")
      .config("spark.sql.storeAssignmentPolicy", "LEGACY")
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .enableHiveSupport()
      .getOrCreate()

    // JDBC connection settings for the source MySQL database
    val url = "jdbc:mysql://localhost:3306/spark_test"
    val properties = new Properties()
    properties.put("user", "root")
    properties.put("password", "123456")
    properties.put("driver", "com.mysql.jdbc.Driver") // option key is "driver", not "Driver"

    // Full extraction: read the whole source table and register it as a temp view
    spark.read.jdbc(url, "supermarket_spark", properties).createTempView("word")

    // Overwrite the ODS table, writing into a static etl_date partition
    // (the day before the competition, zero-padded yyyy-MM-dd)
    spark.sql(
      """
        |INSERT OVERWRITE TABLE ods_1.supermarket_spark PARTITION (etl_date = '2025-07-10')
        |SELECT * FROM word
        |""".stripMargin)
  }
}
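Not required by the problem, but a quick sanity check after the insert confirms that the full load landed in the expected partition. A minimal sketch, run inside the same main method right after the INSERT OVERWRITE (table name and partition value taken from the code above):

    // Sanity check: list partitions, then count the rows just written.
    spark.sql("SHOW PARTITIONS ods_1.supermarket_spark").show(false)
    spark.sql("SELECT COUNT(*) FROM ods_1.supermarket_spark WHERE etl_date = '2025-07-10'").show()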

Module 2: Data Cleaning

In the data cleaning module, the problem requires filling in hour, minute, and second values for time fields (HH:mm:ss format) that are null, along with some routine insert/update/delete/query work; see the problem document for the exact requirements.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object Task2 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Task2")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()
    // Cleaning logic goes here; see the null-time fill sketch after this block.
  }
}
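The exact transformations depend on the problem document, but the null-time rule can be sketched as below. This is a minimal sketch placed inside Task2's main method; the column name sale_time and the fill value (the current time) are assumptions for illustration, not taken from the problem, and the source table is the ODS table from Module 1.

    // Sketch only: fill null HH:mm:ss values. The column "sale_time" and the
    // fill value (current time) are hypothetical, not from the problem document.
    val cleaned = spark.table("ods_1.supermarket_spark")
      .withColumn("sale_time",
        when(col("sale_time").isNull, date_format(current_timestamp(), "HH:mm:ss"))
          .otherwise(col("sale_time")))
    cleaned.createTempView("cleaned")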
