Iceberg: COW模式下的MERGE INTO的执行流程

MergeInto命令

MERGE INTO target_table t
USING source_table s
ON s.id = t.id                //这里是JOIN的关联条件
WHEN MATCHED AND s.opType = 'delete' THEN DELETE // WHEN条件是对当前行进行打标的匹配条件
WHEN MATCHED AND s.opType = 'update' THEN UPDATE SET id = s.id, name = s.name
WHEN NOT MATCHED AND s.opType = 'insert' THEN INSERT (key, value) VALUES (key, value)

如上是一条MERGE INTO语句，经过Spark Analyzer解析时，会发现它是MERGE INTO命令，因此将解析target_table对应生成的SparkTable实例封装成RowLevelOperationTable的实例，它会绑定一个SparkCopyOnWriteOperation的实例，并且实现了创建ScanBuilder和WriteBuilder的方法。

ScanBuilder和WriteBuilder是Spark中定义的接口，分别用于构建读数据器（Scan）和写数据器（BatchWrite）。

Iceberg基于Spark 3.x提供的外部Catalog及相关的读写接口，实现了对于Iceberg表（存储格式）的数据读写。

下面以SparkCopyOnWriteOperation跟踪分析如何利用Spark写出数据为Iceberg表格式。

Iceberg行级更新的操作，目前支持UPDATE / DELETE / MERGE INTO三个语法。

预备知识

SparkTable定义

public class SparkTable
    implements org.apache.spark.sql.connector.catalog.Table, // 继承自Spark的接口
        SupportsRead,
        SupportsWrite,
        SupportsDelete, // 支持删除
        SupportsRowLevelOperations, // 支持行级的数据更新
        SupportsMetadataColumns {
  private final Table icebergTable;
  private final Long snapshotId;
  private final boolean refreshEagerly;
  private final Set<TableCapability> capabilities;
  private String branch;
  private StructType lazyTableSchema = null;
  private SparkSession lazySpark = null;

  public SparkTable(Table icebergTable, Long snapshotId, boolean refreshEagerly) {
    this.icebergTable = icebergTable;
    this.snapshotId = snapshotId;
    this.refreshEagerly = refreshEagerly;

    boolean acceptAnySchema =
        PropertyUtil.propertyAsBoolean(
            icebergTable.properties(),
            TableProperties.SPARK_WRITE_ACCEPT_ANY_SCHEMA,
            TableProperties.SPARK_WRITE_ACCEPT_ANY_SCHEMA_DEFAULT);
    this.capabilities = acceptAnySchema ? CAPABILITIES_WITH_ACCEPT_ANY_SCHEMA : CAPABILITIES;
  }
  
  /**
   * 该表支持读取，因此实现了此方法返回一个ScanBuilder实例
   */
  @Override
  public ScanBuilder newScanBuilder(CaseInsensitiveStringMap options) {
    if (options.containsKey(SparkReadOptions.FILE_SCAN_TASK_SET_ID)) {
      // skip planning the job and fetch already staged file scan tasks
      // 如果设置了此参数，则会在读取数据后，将此次生成的Iceberg ScanTasks缓存在本地进程中的ScanTaskSetManager实例里，
      // 后面再对同相同的FileSet集合（或scan file的任务集合）构建时，可以避免重复构建任务集，
      // 起到缓存的作用
      return new SparkFilesScanBuilder(sparkSession(), icebergTable, options);
    }

    if (options.containsKey(SparkReadOptions.SCAN_TASK_SET_ID)) {
      // 作用同上
      return new SparkStagedScanBuilder(sparkSession(), icebergTable, options);
    }

    if (refreshEagerly) {
      icebergTable.refresh();
    }
    // 可以支持基于branch或是基于SnapshotId创建SparkTable
    // 如果基于SnapshotID，则需要显示地解析SnapshotId归属的branch
    CaseInsensitiveStringMap scanOptions =
        branch != null ? options : addSnapshotId(options, snapshotId);
    return new SparkScanBuilder(
        sparkSession(), icebergTable, branch, snapshotSchema(), scanOptions);
  }
  
  /**
   * 该表支持写，因此实现了此方法返回一个WriteBuilder实例
   */
  @Override
  public WriteBuilder newWriteBuilder(LogicalWriteInfo info) {
    Preconditions.checkArgument(
        snapshotId == null, "Cannot write to table at a specific snapshot: %s", snapshotId);

    return new SparkWriteBuilder(sparkSession(), icebergTable, branch, info);
  }

Spark Analyzer Resolving Table

假设我们有如下配置，定义了一个新的用户自定义的catalog，其name为iceberg。并通过spark.sql.catalog.iceberg指定了这个catalog的实现类org.apache.iceberg.spark.SparkCatalog，其类型为hive（意味着会在Iceberg的侧使用HiveCatalog解析库、表），meta存储地址为thrift://metastore-host:port。

spark.sql.catalog.iceberg = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg.type = hive
spark.sql.catalog.iceberg.uri = thrift://metastore-host:port

当我们执行SELECT * FROM iceberg.test_tbl时，在SQL解析过程中，会通过如下的过程来解析CatalogName和TableName，并创建对应的Catalog和Table实例，即对应Iceberg中的实现类SparkCatalog和SparkTable。

  /**
   * 如果当前Plan是还未解析的表视图或是表，或是INSERT INTO语句，则应用此Rule，
   * 查看绑定的目标表名是否是SQL层级的临时视图或是Session级别的全局视图表名，最终返回一个新的SubqueryAlias的实例
   */
  object ResolveTempViews extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
      case u @ UnresolvedRelation(ident) =>
        lookupTempView(ident).getOrElse(u)
      case i @ InsertIntoStatement(UnresolvedRelation(ident), _, _, _, _) =>
        lookupTempView(ident)
          .map(view => i.copy(table = view))
          .getOrElse(i)
      case u @ UnresolvedTable(ident) =>
        lookupTempView(ident).foreach { _ =>
          u.failAnalysis(s"${ident.quoted} is a temp view not table.")
        }
        u
      case u @ UnresolvedTableOrView(ident) =>
        lookupTempView(ident).map(_ => ResolvedView(ident.asIdentifier)).getOrElse(u)
    }

    def lookupTempView(identifier: Seq[String]): Option[LogicalPlan] = {
      // Permanent View can't refer to temp views, no need to lookup at all.
      if (isResolvingView) return None

      identifier match {
        case Seq(part1) => v1SessionCatalog.lookupTempView(part1)
        case Seq(part1, part2) => v1SessionCatalog.lookupGlobalTempView(part1, part2)
        case _ => None
      }
    }
  }

  /**
   * Resolve table relations with concrete relations from v2 catalog.
   *
   * [[ResolveRelations]] still resolves v1 tables.
   */
  object ResolveTables extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = ResolveTempViews(plan).resolveOperatorsUp {
      case u: UnresolvedRelation =>
        lookupV2Relation(u.multipartIdentifier)
          .map { rel =>
            val ident = rel.identifier.get
            SubqueryAlias(rel.catalog.get.name +: ident.namespace :+ ident.name, rel)
          }.getOrElse(u)

      case u @ UnresolvedTable(NonSessionCatalogAndIdentifier(catalog, ident)) =>
        // NonSessionCatalogAndIdentifier的unapply方法，会尝试解析catalog，并通过Spark.CatalogManager::catalog(name)解析并构建Catalog实例
        // 这里就是一个Iceberg中定义的SparkCatalog实现类，然后通过工具类的方法加载表，并创建SparkTable实例。
        CatalogV2Util.loadTable(catalog, ident)
          .map(ResolvedTable(catalog.asTableCatalog, ident, _))
          .getOrElse(u)

      case u @ UnresolvedTableOrView(NonSessionCatalogAndIdentifier(catalog, ident)) =>
        CatalogV2Util.loadTable(catalog, ident)
          .map(ResolvedTable(catalog.asTableCatalog, ident, _))
          .getOrElse(u)

      case i @ InsertIntoStatement(u: UnresolvedRelation, _, _, _, _) if i.query.resolved =>
        lookupV2Relation(u.multipartIdentifier)
          .map(v2Relation => i.copy(table = v2Relation))
          .getOrElse(i)

      case alter @ AlterTable(_, _, u: UnresolvedV2Relation, _) =>
        CatalogV2Util.loadRelation(u.catalog, u.tableName)
          .map(rel => alter.copy(table = rel))
          .getOrElse(alter)

      case u: UnresolvedV2Relation =>
        CatalogV2Util.loadRelation(u.catalog, u.tableName).getOrElse(u)
    }
    
    /**
     * Performs the lookup of DataSourceV2 Tables from v2 catalog.
     */
    private def lookupV2Relation(identifier: Seq[String]): Option[DataSourceV2Relation] =
      expandRelationName(identifier) match {
        case NonSessionCatalogAndIdentifier(catalog, ident) =>
          CatalogV2Util.loadTable(catalog, ident) match {
            case Some(table) =>
              Some(DataSourceV2Relation.create(table, Some(catalog), Some(ident)))
            case None => None
          }
        case _ => None
      }
  }

SparkCatalog路由加载表的过程到HiveCatalog

SparkCatalog由于继承自TableCatalog，因此拥有 Table loadTable(Identifier ident) throws NoSuchTableException方法，在Spark内部进行SQL解析时，可以调用此方法，生成用户自定义的Table实例。

SparkCatalog定位于Spark与Iceberg之前的桥梁，最终的实现效果是将某个Catalog的解析并创建表的任务，路由给Iceberg中的Catalog实现，例如HiveCatalog。

/**
 * 实现了Spark中的如下接口：
 * public interface TableCatalog extends CatalogPlugin
 * 可以在SQL解析过程时，应用`ResolveTables`规则时，通过Catalog + Identifier创建
 */
public class SparkCatalog extends BaseCatalog {
  /**
   * 由于继承自CatalogPlugin接口类，因此需要重写initialize(...)方法，以初始化SparkCatalog实例
   */
  @Override
  public final void initialize(String name, CaseInsensitiveStringMap options) {
    this.cacheEnabled =
        PropertyUtil.propertyAsBoolean(
            options, CatalogProperties.CACHE_ENABLED, CatalogProperties.CACHE_ENABLED_DEFAULT);

    long cacheExpirationIntervalMs =
        PropertyUtil.propertyAsLong(
            options,
            CatalogProperties.CACHE_EXPIRATION_INTERVAL_MS,
            CatalogProperties.CACHE_EXPIRATION_INTERVAL_MS_DEFAULT);

    // An expiration interval of 0ms effectively disables caching.
    // Do not wrap with CachingCatalog.
    if (cacheExpirationIntervalMs == 0) {
      this.cacheEnabled = false;
    }
    // 创建Iceberg支持的catalog实例，一共支持如下几个
    //   public static final String ICEBERG_CATALOG_TYPE_HADOOP = "hadoop";
    // public static final String ICEBERG_CATALOG_TYPE_HIVE = "hive";
    // public static final String ICEBERG_CATALOG_TYPE_REST = "rest";

    // public static final String ICEBERG_CATALOG_HADOOP = "org.apache.iceberg.hadoop.HadoopCatalog";
    // public static final String ICEBERG_CATALOG_HIVE = "org.apache.iceberg.hive.HiveCatalog";
    // public static final String ICEBERG_CATALOG_REST = "org.apache.iceberg.rest.RESTCatalog";
    // 默认情况下，我们创建的是HiveCatalog，后续调用loadTable(...)方法创建SparkTable时，则通过HiveCatalog::loadTable(name)方法生成
    Catalog catalog = buildIcebergCatalog(name, options);

    this.catalogName = name;
    SparkSession sparkSession = SparkSession.active();
    this.useTimestampsWithoutZone =
        SparkUtil.useTimestampWithoutZoneInNewTables(sparkSession.conf());
    this.tables =
        new HadoopTables(SparkUtil.hadoopConfCatalogOverrides(SparkSession.active(), name));
    this.icebergCatalog =
        cacheEnabled ? CachingCatalog.wrap(catalog, cacheExpirationIntervalMs) : catalog;
    // 支持通过参数的方式，指定默认的namespace，默认值为default
    if (catalog instanceof SupportsNamespaces) {
      this.asNamespaceCatalog = (SupportsNamespaces) catalog;
      if (options.containsKey("default-namespace")) {
        this.defaultNamespace =
            Splitter.on('.').splitToList(options.get("default-namespace")).toArray(new String[0]);
      }
    }

    EnvironmentContext.put(EnvironmentContext.ENGINE_NAME, "spark");
    EnvironmentContext.put(
        EnvironmentContext.ENGINE_VERSION, sparkSession.sparkContext().version());
    EnvironmentContext.put(CatalogProperties.APP_ID, sparkSession.sparkContext().applicationId());
  }

  @Override
  public Table loadTable(Identifier ident) throws NoSuchTableException {
    // 基于标识符创建SparkTable实例
    try {
      return load(ident);
    } catch (org.apache.iceberg.exceptions.NoSuchTableException e) {
      throw new NoSuchTableException(ident);
    }
  }

  @Override
  public Table loadTable(Identifier ident, String version) throws NoSuchTableException {
    Table table = loadTable(ident);
    // ...
  }

  @Override
  public Table loadTable(Identifier ident, long timestamp) throws NoSuchTableException {
    Table table = loadTable(ident);
    // ...
  }
}

SparkCatalog::buildIcebergCatalog

  public static Catalog buildIcebergCatalog(String name, Map<String, String> options, Object conf) {
    String catalogImpl = options.get(CatalogProperties.CATALOG_IMPL);
    if (catalogImpl == null) {
      String catalogType =
          PropertyUtil.propertyAsString(options, ICEBERG_CATALOG_TYPE, ICEBERG_CATALOG_TYPE_HIVE);
      switch (catalogType.toLowerCase(Locale.ENGLISH)) {
        case ICEBERG_CATALOG_TYPE_HIVE:
          catalogImpl = ICEBERG_CATALOG_HIVE;
          break;
        case ICEBERG_CATALOG_TYPE_HADOOP:
          catalogImpl = ICEBERG_CATALOG_HADOOP;
          break;
        case ICEBERG_CATALOG_TYPE_REST:
          catalogImpl = ICEBERG_CATALOG_REST;
          break;
        default:
          throw new UnsupportedOperationException("Unknown catalog type: " + catalogType);
      }
    } else {
      String catalogType = options.get(ICEBERG_CATALOG_TYPE);
      Preconditions.checkArgument(
          catalogType == null,
          "Cannot create catalog %s, both type and catalog-impl are set: type=%s, catalog-impl=%s",
          name,
          catalogType,
          catalogImpl);
    }

    return CatalogUtil.loadCatalog(catalogImpl, name, options, conf);
  }

CatalogUtil::loadCatalog

这里以Hive为例，解析如何加载Custom Catalog

  public static Catalog loadCatalog(
      String impl, String catalogName, Map<String, String> properties, Object hadoopConf) {
    // impl = ICEBERG_CATALOG_HIVE
    // catalogName = hive
    // properties = spark.sql.catalog.[catalogName].x
    // 其中properties指的是catalogName对应的配置选项，是从Spark.SQLConf解析得到的
    Preconditions.checkNotNull(impl, "Cannot initialize custom Catalog, impl class name is null");
    DynConstructors.Ctor<Catalog> ctor;
    try {
      // 通过默认的impl名字，通过Refect机制，调用无参的构造函数，生成对应的类的实例
      ctor = DynConstructors.builder(Catalog.class).impl(impl).buildChecked();
    } catch (NoSuchMethodException e) {
      throw new IllegalArgumentException(
          String.format("Cannot initialize Catalog implementation %s: %s", impl, e.getMessage()),
          e);
    }

    Catalog catalog;
    try {
      catalog = ctor.newInstance();

    } catch (ClassCastException e) {
      throw new IllegalArgumentException(
          String.format("Cannot initialize Catalog, %s does not implement Catalog.", impl), e);
    }

    configureHadoopConf(catalog, hadoopConf);
    // 通过properties，来助力catalog对象的初始化过程
    catalog.initialize(catalogName, properties);
    return catalog;
  }

HiveCatalog::initialize

  @Override
  public void initialize(String inputName, Map<String, String> properties) {
    this.catalogProperties = ImmutableMap.copyOf(properties);
    this.name = inputName;
    if (conf == null) {
      LOG.warn("No Hadoop Configuration was set, using the default environment Configuration");
      this.conf = new Configuration();
    }
    // 解析指定的Hive metastore地址
    if (properties.containsKey(CatalogProperties.URI)) {
      this.conf.set(HiveConf.ConfVars.METASTOREURIS.varname, properties.get(CatalogProperties.URI));
    }
    // 解析指定的metastore的工作目录
    if (properties.containsKey(CatalogProperties.WAREHOUSE_LOCATION)) {
      this.conf.set(
          HiveConf.ConfVars.METASTOREWAREHOUSE.varname,
          LocationUtil.stripTrailingSlash(properties.get(CatalogProperties.WAREHOUSE_LOCATION)));
    }

    this.listAllTables =
        Boolean.parseBoolean(properties.getOrDefault(LIST_ALL_TABLES, LIST_ALL_TABLES_DEFAULT));
    // 解析指定的读写文件的 接口实现类，如果不指定则默认为HadoopFileIO
    // 否则加载用户自定义的实现类
    String fileIOImpl = properties.get(CatalogProperties.FILE_IO_IMPL);
    this.fileIO =
        fileIOImpl == null
            ? new HadoopFileIO(conf)
            : CatalogUtil.loadFileIO(fileIOImpl, properties, conf);
    // 初始化元数据交互的客户端，使用默认使用HiveClientPool
    this.clients = new CachedClientPool(conf, properties);
  }

HiveCatalog加载SparkTable

在生成SparkCatalog时，会根据spark.sql.catalog.iceberg.type这个配置，知道我们要创建的表在Iceberg中的类型是hive，因此需要通过HiveCatalog加载表。

HiveCatalog::loadTable

  @Override
  public Table loadTable(TableIdentifier identifier) {
    Table result;
    if (isValidIdentifier(identifier)) {
      // 对于HiveCatalog来说，所有的identifier都是合法的，因此会通过下面的方法得到对应类型的TableOperations实例
      // 在Iceberg世界中，实际上是不存在表的，只是利用表的概念，将TableMetadata进行了抽象，
      // 因此Iceberg中的Table都必须绑定一个TableOperations实例，来读取TableMetadata数据
      // 例如在这里会对应生成HiveTableOperations
      TableOperations ops = newTableOps(identifier);
      if (ops.current() == null) {
        // the identifier may be valid for both tables and metadata tables
        if (isValidMetadataIdentifier(identifier)) {
          result = loadMetadataTable(identifier);

        } else {
          throw new NoSuchTableException("Table does not exist: %s", identifier);
        }

      } else {
        // 
        result = new BaseTable(ops, fullTableName(name(), identifier), metricsReporter());
      }

    } else if (isValidMetadataIdentifier(identifier)) {
      result = loadMetadataTable(identifier);

    } else {
      throw new NoSuchTableException("Invalid table identifier: %s", identifier);
    }

    LOG.info("Table loaded by catalog: {}", result);
    return result;
  }

Scan的定义、构建及执行流程

MERGE INTO重写时，会为目标表生成一个DataSourceV2Relation的逻辑计划实例，以读取目标表中的相关数据，因此在过滤表达式下推优化时，会同时构建COW模式下的Scan实例。
Scan是Spark中定义的读取数据的接口。

object V2ScanRelationPushDown extends Rule[LogicalPlan] {
  import DataSourceV2Implicits._

  override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
    case ScanOperation(project, filters, relation: DataSourceV2Relation) =>
      // 调用这里的Table是一个RowLevelOperationTable的实例，同时它绑定了一个SparkCopyOnWriteOperation实例
      // 因此底层实际上调用的是SparkCopyOnWriteOperation::newScanBuilder方法
      val scanBuilder = relation.table.asReadable.newScanBuilder(relation.options)

      val normalizedFilters = DataSourceStrategy.normalizeExprs(filters, relation.output)
      val (normalizedFiltersWithSubquery, normalizedFiltersWithoutSubquery) =
        normalizedFilters.partition(SubqueryExpression.hasSubquery)

      // `pushedFilters` will be pushed down and evaluated in the underlying data sources.
      // `postScanFilters` need to be evaluated after the scan.
      // `postScanFilters` and `pushedFilters` can overlap, e.g. the parquet row group filter.
      val (pushedFilters, postScanFiltersWithoutSubquery) = PushDownUtils.pushFilters(
        scanBuilder, normalizedFiltersWithoutSubquery)
      val postScanFilters = postScanFiltersWithoutSubquery ++ normalizedFiltersWithSubquery

      val normalizedProjects = DataSourceStrategy
        .normalizeExprs(project, relation.output)
        .asInstanceOf[Seq[NamedExpression]]
      // 列裁剪，同时调用scanBuilder.build()方法，生成一个读取具体数据源的Scan实例
      val (scan, output) = PushDownUtils.pruneColumns(
        scanBuilder, relation, normalizedProjects, postScanFilters)
      logInfo(
        s"""
           |Pushing operators to ${relation.name}
           |Pushed Filters: ${pushedFilters.mkString(", ")}
           |Post-Scan Filters: ${postScanFilters.mkString(",")}
           |Output: ${output.mkString(", ")}
         """.stripMargin)

      val wrappedScan = scan match {
        case v1: V1Scan =>
          val translated = filters.flatMap(DataSourceStrategy.translateFilter(_, true))
          V1ScanWrapper(v1, translated, pushedFilters)
        case _ => scan
      }
      // 这里生成一个读取数据的逻辑计划
      val scanRelation = DataSourceV2ScanRelation(relation.table, wrappedScan, output)

      val projectionOverSchema = ProjectionOverSchema(output.toStructType)
      val projectionFunc = (expr: Expression) => expr transformDown {
        case projectionOverSchema(newExpr) => newExpr
      }

      val filterCondition = postScanFilters.reduceLeftOption(And)
      val newFilterCondition = filterCondition.map(projectionFunc)
      val withFilter = newFilterCondition.map(Filter(_, scanRelation)).getOrElse(scanRelation)

      val withProjection = if (withFilter.output != project) {
        val newProjects = normalizedProjects
          .map(projectionFunc)
          .asInstanceOf[Seq[NamedExpression]]
        Project(newProjects, withFilter)
      } else {
        withFilter
      }
      // 返回最终的逻辑计划
      withProjection
  }
}

Iceberg中实现的Scan

从前面我们知道，在Spark进行Filter Pushdown优化时，会调用Table::newScanBuilder方法构建一个具体的数据描述器（Scan），实际上是会最终调用Iceberg中如下的方法：

class SparkCopyOnWriteOperation implements RowLevelOperation {

  @Override
  public ScanBuilder newScanBuilder(CaseInsensitiveStringMap options) {
    if (lazyScanBuilder == null) {
      lazyScanBuilder =
          new SparkScanBuilder(spark, table, branch, options) {
            @Override
            public Scan build() {
              // 构建COW模式的Scan实例
              Scan scan = super.buildCopyOnWriteScan();
              SparkCopyOnWriteOperation.this.configuredScan = scan;
              return scan;
            }
          };
    }

    return lazyScanBuilder;
  }
}

如下是对SparkCopyOnWriteOperation::buildCopyOnWriteScan方法的完整定义：

  public Scan buildCopyOnWriteScan() {
    // table变量，是一个BaseTable实例，因为从Spark的代码流转到Iceberg侧时，使用的都是Iceberg中定义的类
    // 这里是从当前表找到最新的Snapshot
    Snapshot snapshot = SnapshotUtil.latestSnapshot(table, readConf.branch());

    if (snapshot == null) {
      return new SparkCopyOnWriteScan(
          spark, table, readConf, schemaWithMetadataColumns(), filterExpressions);
    }

    Schema expectedSchema = schemaWithMetadataColumns();
    // Snapshot存在，说明有数据，因此需要生成Scan实例
    //   default BatchScan newBatchScan() {
    //     return new BatchScanAdapter(newScan());
    //   }
    // 由于这里的table类型为BaseTable，因此会调用newScan()方法生成DataTableScan的实例，而BatchScan则是一个代理类
    BatchScan scan =
        table
            .newBatchScan()
            .useSnapshot(snapshot.snapshotId())
            .ignoreResiduals()
            .caseSensitive(caseSensitive)
            .filter(filterExpression())
            .project(expectedSchema);

    scan = configureSplitPlanning(scan);
    // 返回一个实现了Spark中的Scan接口的实例
    return new SparkCopyOnWriteScan(
        spark, table, scan, snapshot, readConf, expectedSchema, filterExpressions);
  }

SparkCopyOnWriteScan负责生成Spark.Batch

我们知道SparkCopyOnWriteScan实现的Spark中的Scan接口，而Scan是一个逻辑上的数据读取器，就像逻辑计划那样，因此还需要通过它的Scan::toBatch方法，创建一个直接可执行的实体类对象

class SparkCopyOnWriteScan extends SparkPartitioningAwareScan<FileScanTask>

  @Override
  public Batch toBatch() {
    // 返回一个Spark可操作的Batch实例，负责对待读取的数据划分Batches
    // 注意这里在创建SparkBatch实例时，taskGroups()的调用，这个方法实际上是调用Iceberg的接口，搜索此次Scan任务需要读取的所有数据。
    return new SparkBatch(
        sparkContext, table, readConf, groupingKeyType(), taskGroups(), expectedSchema, hashCode());
  }
}

SparkCopyOnWriteScan::taskGroups基于SnapshotScan::planFiles方法实现

public abstract class SnapshotScan<ThisT, T extends ScanTask, G extends ScanTaskGroup<T>>

  @Override
  public CloseableIterable<T> planFiles() {
    // 获取要读取的Snapshot
    Snapshot snapshot = snapshot();

    if (snapshot == null) {
      LOG.info("Scanning empty table {}", table());
      return CloseableIterable.empty();
    }

    LOG.info(
        "Scanning table {} snapshot {} created at {} with filter {}",
        table(),
        snapshot.snapshotId(),
        DateTimeUtil.formatTimestampMillis(snapshot.timestampMillis()),
        ExpressionUtil.toSanitizedString(filter()));

    Listeners.notifyAll(new ScanEvent(table().name(), snapshot.snapshotId(), filter(), schema()));
    List<Integer> projectedFieldIds = Lists.newArrayList(TypeUtil.getProjectedIds(schema()));
    List<String> projectedFieldNames =
        projectedFieldIds.stream().map(schema()::findColumnName).collect(Collectors.toList());

    Timer.Timed planningDuration = scanMetrics().totalPlanningDuration().start();

    return CloseableIterable.whenComplete(
        doPlanFiles(), // doPlanFiles()方法会通过Iceberg的接口，搜索所有要读取的data文件和delete文件
        () -> {
          planningDuration.stop();
          Map<String, String> metadata = Maps.newHashMap(context().options());
          metadata.putAll(EnvironmentContext.get());
          ScanReport scanReport =
              ImmutableScanReport.builder()
                  .schemaId(schema().schemaId())
                  .projectedFieldIds(projectedFieldIds)
                  .projectedFieldNames(projectedFieldNames)
                  .tableName(table().name())
                  .snapshotId(snapshot.snapshotId())
                  .filter(ExpressionUtil.sanitize(filter()))
                  .scanMetrics(ScanMetricsResult.fromScanMetrics(scanMetrics()))
                  .metadata(metadata)
                  .build();
          context().metricsReporter().report(scanReport);
        });
  }
}

SparkBatch负责生成Partitions及Partition Reader

SparkBatch继承自Spark中的Batch接口

class SparkBatch implements Batch {
  private final JavaSparkContext sparkContext;
  private final Table table;
  private final String branch;
  private final SparkReadConf readConf;
  private final Types.StructType groupingKeyType;
  // 保存了由SparkCopyOnWriteScan::taskGroups()方法生成的所有要读取的Iceberg管理的data文件和delete文件，
  // 这些文件按对应的分区数据进行分组，并且一个分区的数据文件可能被划分到多个groups
  private final List<? extends ScanTaskGroup<?>> taskGroups;
  private final Schema expectedSchema;
  private final boolean caseSensitive;
  private final boolean localityEnabled;
  private final int scanHashCode;

  @Override
  public InputPartition[] planInputPartitions() {
    // 负责对要读取的数据进行分区
    // broadcast the table metadata as input partitions will be sent to executors
    Broadcast<Table> tableBroadcast =
        sparkContext.broadcast(SerializableTableWithSize.copyOf(table));
    String expectedSchemaString = SchemaParser.toJson(expectedSchema);
    // 一个Group就对应Spark中的一个Partition
    InputPartition[] partitions = new InputPartition[taskGroups.size()];

    Tasks.range(partitions.length)
        .stopOnFailure()
        .executeWith(localityEnabled ? ThreadPools.getWorkerPool() : null)
        .run(
            index ->
                partitions[index] =
                    new SparkInputPartition(
                        groupingKeyType, // 一个taskGroup包含的文件拥有相同的Grouping key
                        taskGroups.get(index),
                        tableBroadcast,
                        branch,
                        expectedSchemaString,
                        caseSensitive,
                        localityEnabled));

    return partitions;
  }

  @Override
  public PartitionReaderFactory createReaderFactory() {
    // 负责创建读取数据的Reader，支持列式读取和行式读取
    if (useParquetBatchReads()) {
      int batchSize = readConf.parquetBatchSize();
      return new SparkColumnarReaderFactory(batchSize);

    } else if (useOrcBatchReads()) {
      int batchSize = readConf.orcBatchSize();
      return new SparkColumnarReaderFactory(batchSize);

    } else {
      return new SparkRowReaderFactory();
    }
  }
}

从SparkBatch构建数据读取的物理执行计划

前文提到的Spark中有关数据的读写接口，都是由DataSourceV2中定义的，因此对于数据读取的逻辑计划（DataSourceV2ScanRelation），会先转换成物理执行计划BatchScanExec。

case class BatchScanExec(
    output: Seq[AttributeReference],
    @transient scan: Scan) extends DataSourceV2ScanExecBase {
  // scan，对应于Iceberg中的SparkCopyOnWriteScan
  // 因此batch变量是一个SparkBatch实例
  @transient lazy val batch = scan.toBatch

  // TODO: unify the equal/hashCode implementation for all data source v2 query plans.
  override def equals(other: Any): Boolean = other match {
    case other: BatchScanExec => this.batch == other.batch
    case _ => false
  }

  override def hashCode(): Int = batch.hashCode()
  // 调用SparkBatch::planInputPartitions生成partitions信息
  @transient override lazy val partitions: Seq[InputPartition] = batch.planInputPartitions()
  // 调用SparkBatch::createReaderFactory生成Reader工厂对象
  override lazy val readerFactory: PartitionReaderFactory = batch.createReaderFactory()

  override lazy val inputRDD: RDD[InternalRow] = {
    // 执行时，净当前的物理执行计划，转换为一个RDD，并传递给所有的RDD以及Reader工厂
    new DataSourceRDD(sparkContext, partitions, readerFactory, supportsColumnar)
  }

  override def doCanonicalize(): BatchScanExec = {
    this.copy(output = output.map(QueryPlan.normalizeExpressions(_, output)))
  }
}

DataSourceRDD计算时实例化Reader并完成读数据

这里需要重点关注的是compute方法，在Spark中，每一个Partition都对应一个Task，这个Task负责最终调用compute方法，触发当前分区上的计算逻辑。

// columnar scan.
class DataSourceRDD(
    sc: SparkContext,
    @transient private val inputPartitions: Seq[InputPartition],
    partitionReaderFactory: PartitionReaderFactory,
    columnarReads: Boolean)
  extends RDD[InternalRow](sc, Nil) {

  override protected def getPartitions: Array[Partition] = {
    inputPartitions.zipWithIndex.map {
      case (inputPartition, index) => new DataSourceRDDPartition(index, inputPartition)
    }.toArray
  }

  private def castPartition(split: Partition): DataSourceRDDPartition = split match {
    case p: DataSourceRDDPartition => p
    case _ => throw new SparkException(s"[BUG] Not a DataSourceRDDPartition: $split")
  }

  override def compute(split: Partition, context: TaskContext): Iterator[InternalRow] = {
    // partition对应于Iceberg中的一个TaskGroup，而一个TaskGroup的数据文件拥有相同的Partition data
    val inputPartition = castPartition(split).inputPartition
    val (iter, reader) = if (columnarReads) {
      // 列读
      // batchReader实际上是一个BatchDataReader的实例
      val batchReader = partitionReaderFactory.createColumnarReader(inputPartition)
      val iter = new MetricsBatchIterator(new PartitionIterator[ColumnarBatch](batchReader))
      (iter, batchReader)
    } else {
      // 行读
      val rowReader = partitionReaderFactory.createReader(inputPartition)
      val iter = new MetricsRowIterator(new PartitionIterator[InternalRow](rowReader))
      (iter, rowReader)
    }
    context.addTaskCompletionListener[Unit](_ => reader.close())
    // TODO: SPARK-25083 remove the type erasure hack in data source scan
    new InterruptibleIterator(context, iter.asInstanceOf[Iterator[InternalRow]])
  }

  override def getPreferredLocations(split: Partition): Seq[String] = {
    castPartition(split).inputPartition.preferredLocations()
  }
}

BatchDataReader读取数据

BatchDataReader继承自Spark中的PartitionReader接口

class BatchDataReader extends BaseBatchReader<FileScanTask>
    implements PartitionReader<ColumnarBatch> {
  
  // 返回一个迭代器，可以在FileScanTask包含的所有data文件和delete文件，
  @Override
  protected CloseableIterator<ColumnarBatch> open(FileScanTask task) {
    String filePath = task.file().path().toString();
    LOG.debug("Opening data file {}", filePath);

    // update the current file for Spark's filename() function
    InputFileBlockHolder.set(filePath, task.start(), task.length());

    Map<Integer, ?> idToConstant = constantsMap(task, expectedSchema());

    InputFile inputFile = getInputFile(filePath);
    Preconditions.checkNotNull(inputFile, "Could not find InputFile associated with FileScanTask");
    // 创建一个SparkDeleteFilter实例，它负责收集等值删除文件 和 位置删除文件，并建立删除数据记录的索引，
    // 如此在每遍历一个data file时，就可以根据索引信息，确定当前的record是不是存活的。
    SparkDeleteFilter deleteFilter =
        task.deletes().isEmpty()
            ? null
            : new SparkDeleteFilter(filePath, task.deletes(), counter());
    // newBatchIterable()方法会根据inputFile的类型，创建相应的文件读取器，例如为Parquet创建VectorizedParquetReader
    return newBatchIterable(
            inputFile,
            task.file().format(),
            task.start(),
            task.length(),
            task.residual(),
            idToConstant,
            deleteFilter)
        .iterator();
  }
}

读取数据转换成Spark中的ColumnarBatch

不论是Parquet/Orc文件，最底层都是通过ColumnarBatchReader负责真正的数据读取与过滤

public class ColumnarBatchReader extends BaseBatchReader<ColumnarBatch> {
  private final boolean hasIsDeletedColumn;
  private DeleteFilter<InternalRow> deletes = null;
  private long rowStartPosInBatch = 0;

  public ColumnarBatchReader(List<VectorizedReader<?>> readers) {
    super(readers);
    this.hasIsDeletedColumn =
        readers.stream().anyMatch(reader -> reader instanceof DeletedVectorReader);
  }

  @Override
  public void setRowGroupInfo(
      PageReadStore pageStore, Map<ColumnPath, ColumnChunkMetaData> metaData, long rowPosition) {
    super.setRowGroupInfo(pageStore, metaData, rowPosition);
    this.rowStartPosInBatch = rowPosition;
  }

  public void setDeleteFilter(DeleteFilter<InternalRow> deleteFilter) {
    this.deletes = deleteFilter;
  }

  @Override
  public final ColumnarBatch read(ColumnarBatch reuse, int numRowsToRead) {
    if (reuse == null) {
      closeVectors();
    }
    // 通过内部类ColumnBatchLoader代理完成数据的读取与结果转换
    ColumnarBatch columnarBatch = new ColumnBatchLoader(numRowsToRead).loadDataToColumnBatch();
    rowStartPosInBatch += numRowsToRead;
    return columnarBatch;
  }

  private class ColumnBatchLoader {
    private final int numRowsToRead;
    // the rowId mapping to skip deleted rows for all column vectors inside a batch, it is null when
    // there is no deletes
    private int[] rowIdMapping;
    // the array to indicate if a row is deleted or not, it is null when there is no "_deleted"
    // metadata column
    private boolean[] isDeleted;

    /**
     * Build a row id mapping inside a batch, which skips deleted rows. Here is an example of how we
     * delete 2 rows in a batch with 8 rows in total. [0,1,2,3,4,5,6,7] -- Original status of the
     * row id mapping array [F,F,F,F,F,F,F,F] -- Original status of the isDeleted array Position
     * delete 2, 6 [0,1,3,4,5,7,-,-] -- After applying position deletes [Set Num records to 6]
     * [F,F,T,F,F,F,T,F] -- After applying position deletes
     *
     * @param deletedRowPositions a set of deleted row positions
     * @return the mapping array and the new num of rows in a batch, null if no row is deleted
     */
    Pair<int[], Integer> buildPosDelRowIdMapping(PositionDeleteIndex deletedRowPositions) {
      if (deletedRowPositions == null) {
        return null;
      }

      int[] posDelRowIdMapping = new int[numRowsToRead];
      int originalRowId = 0;
      int currentRowId = 0;
      while (originalRowId < numRowsToRead) {
        if (!deletedRowPositions.isDeleted(originalRowId + rowStartPosInBatch)) {
          posDelRowIdMapping[currentRowId] = originalRowId;
          currentRowId++;
        } else {
          if (hasIsDeletedColumn) {
            isDeleted[originalRowId] = true;
          }

          deletes.incrementDeleteCount();
        }
        originalRowId++;
      }

      if (currentRowId == numRowsToRead) {
        // there is no delete in this batch
        return null;
      } else {
        return Pair.of(posDelRowIdMapping, currentRowId);
      }
    }

    int[] initEqDeleteRowIdMapping() {
      int[] eqDeleteRowIdMapping = null;
      if (hasEqDeletes()) {
        eqDeleteRowIdMapping = new int[numRowsToRead];
        for (int i = 0; i < numRowsToRead; i++) {
          eqDeleteRowIdMapping[i] = i;
        }
      }

      return eqDeleteRowIdMapping;
    }

    /**
     * Filter out the equality deleted rows. Here is an example, [0,1,2,3,4,5,6,7] -- Original
     * status of the row id mapping array [F,F,F,F,F,F,F,F] -- Original status of the isDeleted
     * array Position delete 2, 6 [0,1,3,4,5,7,-,-] -- After applying position deletes [Set Num
     * records to 6] [F,F,T,F,F,F,T,F] -- After applying position deletes Equality delete 1 <= x <=
     * 3 [0,4,5,7,-,-,-,-] -- After applying equality deletes [Set Num records to 4]
     * [F,T,T,T,F,F,T,F] -- After applying equality deletes
     *
     * @param columnarBatch the {@link ColumnarBatch} to apply the equality delete
     */
    void applyEqDelete(ColumnarBatch columnarBatch) {
      Iterator<InternalRow> it = columnarBatch.rowIterator();
      int rowId = 0;
      int currentRowId = 0;
      while (it.hasNext()) {
        InternalRow row = it.next();
        if (deletes.eqDeletedRowFilter().test(row)) {
          // the row is NOT deleted
          // skip deleted rows by pointing to the next undeleted row Id
          rowIdMapping[currentRowId] = rowIdMapping[rowId];
          currentRowId++;
        } else {
          if (hasIsDeletedColumn) {
            isDeleted[rowIdMapping[rowId]] = true;
          }

          deletes.incrementDeleteCount();
        }

        rowId++;
      }

      columnarBatch.setNumRows(currentRowId);
    }
  }
}

Write的执行过程

从前面的章节可以看到，在构建Scan的过程中，会同时搜集data files和delete files，因此在调用Reader实例读取每一个TaskGroup中的数据文件时，同时会应用DeleteFilter，来过滤掉那些被删除的记录。

这个过程实际上就是一个\Merge On Read的过程。

而MERGE INTO的Write过程，在我之前的文章有解析，大体的思路就是将从target_table Scan得到的、经过删除过滤后的数据集，与source_table中的数据JOIN；从而产生带有变更标记的结果数据集（每个被标记为INSERT/UPDATE/DELETE）；在写出数据到文件时，就可以根据每一行的标记确定写出行为，最终只会产生Data Files，数据文件更加干净。

你可能感兴趣的:(Iceberg,spark,CopyOnWrite)

Azkaban各种类型的Job编写 __元昊__
一、概述原生的Azkaban支持的plugin类型有以下这些：command：Linuxshell命令行任务gobblin：通用数据采集工具hadoopJava：运行hadoopMR任务java：原生java任务hive：支持执行hiveSQLpig：pig脚本任务spark：spark任务hdfsToTeradata：把数据从hdfs导入TeradatateradataToHdfs：把数据从Te
关于HDP的20道高级运维面试题编织幻境的妖运维
1.描述HDP的主要组件及其作用。HDP（HortonworksDataPlatform）的主要组件包括Hadoop框架、HDFS、MapReduce、YARN以及Hadoop生态系统中的其他关键工具，如Spark、Flink、Hive、HBase等。以下是对这些组件及其作用的具体描述：Hadoop框架:Hadoop是一个开源的分布式计算框架，用Java语言编写，用于存储和处理大规模数据集。它广义
【Hadoop】使用Scala与Spark连接ClickHouse进行数据处理音乐学家方大刚 Scala Hadoop hadoop scala spark
风不懂不懂得叶的梦月不听不听闻窗里琴声意难穷水不见不曾见绿消红霜不知不知晓将别人怎道珍重落叶有风才敢做一个会飞的梦孤窗有月才敢登高在夜里从容桃花有水才怕身是客身是客此景不能久TieYann(铁阳)、薄彩生《不知晓》在大数据分析和处理领域，ApacheSpark是一个广泛使用的高性能、通用的计算框架，而ClickHouse作为一个高性能的列式数据库，特别适合在线分析处理（OLAP）。结合Scala语
Spark面试整理-Spark是什么？不务正业的猿面试 Spark spark 大数据分布式
ApacheSpark是一个开源的分布式计算系统，它提供了一个用于大规模数据处理的快速、通用、易于使用的平台。它最初是在加州大学伯克利分校的AMPLab开发的，并于2010年开源。自那时起，Spark已经成为大数据处理中最受欢迎和广泛使用的框架之一。下面是Spark的一些关键特点：速度：Spark使用了先进的DAG（有向无环图）执行引擎，可以支持循环数据流和内存计算。这使得Spark在数据处理方面
Spark Q&A 耐心的农夫2020
Q:在读取文件的时候，如何忽略空gzip文件?A:从Spark2.1开始，你可以通过启用spark.sql.files.ignoreCorruptFiles选项来忽略损毁的文件。可以将下面的选项添加到你的spark-submit或者pyspark命令中。--confspark.sql.files.ignoreCorruptFiles=true另外spark支持的选项可以通过在spark-shell
linux安装单机版spark3.5.0 爱上雪茄大数据 JAVA知识 spark 大数据分布式
一、spark介绍是一种通用的大数据计算框架，正如传统大数据技术Hadoop的MapReduce、Hive引擎，以及Storm流式实时计算引擎等.Spark主要用于大数据的计算二、spark下载spark3.5.0三、spark环境变量配置exportJAVA_HOME=/usr/local/jdk1.8.0_391exportJRE_HOME=/usr/local/jdk1.8.0_391/jr
Spark的数据结构——RDD bluedraam_pp Spark spark 数据结构大数据
RDD的5个特征下面来说一下RDD这东西，它是ResilientDistributedDatasets的简写。咱们来看看RDD在源码的解释。Alistofpartitions:在大数据领域，大数据都是分割成若干个部分，放到多个服务器上，这样就能做到多线程的处理数据，这对处理大数据量是非常重要的。分区意味着，可以使用多个线程了处理。Afunctionforcomputingeachsplit：作用在
大数据开发（Spark面试真题-卷一） Key-Key 大数据 spark 面试
大数据开发（Spark面试真题）1、什么是SparkStreaming？简要描述其工作原理。2、什么是Spark内存管理机制？请解释其中的主要概念，并说明其作用。3、请解释一下Spark中的shuffle是什么，以及为什么shuffle操作开销较大？4、请解释一下Spark中的RDD持久化（Caching）是什么以及为什么要使用持久化？5、请解释一下Spark中ResilientDistribut
基于HBase和Spark构建企业级数据处理平台 weixin_34071713 大数据数据库爬虫
摘要：在中国HBase技术社区第十届Meetup杭州站上，阿里云数据库技术专家李伟为大家分享了如何基于当下流行的HBase和Spark体系构建企业级数据处理平台，并且针对于一些具体落地场景进行了介绍。演讲嘉宾简介：李伟（花名：沐远），阿里云数据库技术专家。专注于大数据分布式计算和数据库领域，具有6年分布式开发经验，先后研发Spark及自主研发内存计算，目前为广大公有云用户提供专业的云HBase数据
lightGBM专题4:pyspark平台下lightgbm模型保存 I_belong_to_jesus 大数据
之前的文章（pysparklightGBM1和pysparklightGBM2）介绍了pyspark下lightGBM算法的实现，本文将重点介绍下如何保存训练好的模型，直接上代码：frompyspark.sqlimportSparkSessionfrompyspark.ml.featureimportStringIndexer#配置spark,创建SparkSession对象spark=Spark
大数据开发（Spark面试真题-卷六） Key-Key 大数据 spark 面试
大数据开发（Spark面试真题）1、SparkHashPartitioner和RangePartitioner的实现？2、SparkDAGScheduler、TaskScheduler、SchedulerBackend实现原理？3、介绍下Sparkclient提交application后，接下来的流程？4、Spark的cache和persist的区别？它们是transformation算子还是ac
大数据开发（Hadoop面试真题-卷二） Key-Key 大数据 hadoop 面试
大数据开发（Hadoop面试真题）1、在大规模数据处理过程中使用编写MapReduce程序存在什么缺点？如何解决这些问题？2、请解释一下HDFS架构中NameNode和DataNode之间是如何通信的？3、请解释一下Hadoop的工作原理及其组成部分？4、HDFS读写流程是什么样子？5、Hadoop中fsimage和edit的区别是什么？6、Spark为什么比MapReduce更快？7、详细描述一
Spark从入门到精通29:Spark SQL：工作原理剖析以及性能优化勇于自信
SparkSQL工作原理剖析1.编写SQL语句只要是在数据库类型的技术里面，例如MySQL、Oracle等，包括现在大数据领域的数据仓库，例如Hive。它的基本的SQL执行的模型，都是类似的，首先都是要生成一条SQL语句执行计划。执行计划即从哪里查询，在哪个文件，从文件中查询哪些数据，此外，复杂的SQL还包括查询时是否对表中的数据进行过滤和筛选等等。2.UnresolvedLogicalPlan未
大数据开发（Hadoop面试真题-卷九） Key-Key 大数据 hadoop 面试
大数据开发（Hadoop面试真题）1、Hivecount(distinct)有几个reduce，海量数据会有什么问题？2、既然HBase底层数据是存储在HDFS上，为什么不直接使用HDFS，而还要用HBase?3、Sparkmapjoin的实现原理？4、Spark的stage如何划分？在源码中是怎么判断属于ShuffleMapStage或ResultStage的？5、SparkreduceByKe
Spark Streaming（二）：DStream数据源雪飘千里
1、输入DStream和Receiver输入（Receiver）DStream代表了来自数据源的输入数据流，在之前的wordcount例子中，lines就是一个输入DStream（JavaReceiverInputDStream），代表了从netcat（nc）服务接收到的数据流。除了文件数据流之外，所有的输入DStream都会绑定一个Receiver对象，该对象是一个关键的组件，用来从数据源接收数
Spark常见问题汇总 midNightParis spark spark
注意：如果Driver写好了代码，eclipse或者程序上传后，没有开始处理数据，或者快速结束任务，也没有在控制台中打印错误，那么请进入spark的web页面，查看一下你的任务，找到每个分区日志的stderr，查看是否有错误，一般情况下一旦驱动提交了，报错的情况只能在任务日志里面查看是否有错误情况了1、OperationcategoryREADisnotsupportedinstatestandb
SparkShop开源可商用，匹配小程序H5和PC端带分销功能！行动之上源码免费下载小程序
SparkShop(星火商城)B2C商城是基于thinkphp6+elementui的开源免费可商用的高性能商城系统；包含小程序商城、H5商城、公众号商城、PC商城、App，支持页面diy、秒杀、优惠券、积分、分销、会员等级。营销功能采用插件化的方式方便扩展、二次开发源码下载地址你别走吖Σ(っ°Д°;)っ(chaobiji.cn)
【Hadoop】在spark读取clickhouse中数据方大刚233 Hadoop Scala hadoop spark clickhouse
读取clickhouse数据库数据importscala.collection.mutable.ArrayBufferimportjava.util.Propertiesimportorg.apache.spark.sql.SaveModeimportorg.apache.spark.sql.SparkSessiondefgetCKJdbcProperties(batchSize:String="
Spark-sql Adaptive Execution动态调整分区数量，调整输出文件数不想起的昵称 hive spark hive 数据仓库
背景：在数仓任务中，经常要解决小文件的问题。有时间为了解决小文件问题，我们把spark.sql.shuffle.partitions这个参数调整的很小，但是随着时间的推移，数据量越来越大，当初设置的参数就不合适了，那有没有一个可以自我伸缩的参数呢？看看这个参数如何运用：我们的spark-sql版本：[hadoop@666~]$spark-sql--versionWelcometo______/__
hive join中出现的数据暴增（数据重复）不想起的昵称 hive 大数据 hadoop hive
什么是join过程中导致的数据暴增？例如：给左表的每个用户打上是否是新用户的标签，左表的用户数为100，但是关联右表之后，得到的用户数为200甚至更多什么原因导致的数据暴增呢？我们来看一下案例：spark-sql>withtest1as>(select'10001'asuid,'xiaomi'asqid>unionall>select'10002'asuid,'huawei'asqid>union
hive四种常见的join 不想起的昵称 hive 大数据 hadoop hdfs hive
1.左连接leftjoinspark-sql>withtest1as(>select1asuser_id,'xiaoming'asname>unionall>select2asuser_id,'xiaolan'asname>unionall>select3asuser_id,'xiaoxin'asname>),>>test2as(>select1asuser_id,19asage>unionall
Spark整合hive（保姆级教程）万家林 spark hive spark hadoop
准备工作：1、需要安装配置好hive，如果不会安装可以跳转到Linux下编写脚本自动安装hive2、需要安装配置好spark，如果不会安装可以跳转到Spark安装与配置（单机版）3、需要安装配置好Hadoop，如果不会安装可以跳转到Linux安装配置Hadoop2.6操作步骤：1、将hive的conf目录下的hive-site.xml拷贝到spark的conf目录下（也可以建立软连接）cp/opt
在 Spark 数据导入中的一些实践细节 NebulaGraph
best-practices-import-data-spark-nebula-graph本文由合合信息大数据团队柳佳浩撰写1.前言图谱业务随着时间的推移愈发的复杂化，逐渐体现出了性能上的瓶颈：单机不足以支持更大的图谱。然而，从性能上来看，Neo4j的原生图存储有着不可替代的性能优势，这一点是之前调研的JanusGraph、Dgraph等都难以逾越的鸿沟。即使JanusGraph在OLAP上面非常
Spark开发_简单DataFrame判空赋值逻辑 Matrix70 Spark开发_工作 spark 大数据分布式
valtable1="实时转存数据"valtable2="历史存hdf数据"valdfin1=inputRDD(table1).asInstanceOf[org.apache.spark.sql.DataFrame]valdfin=if(!dfin1.isEmpty)dfin1elseinputRDD(table2).asInstanceOf[org.apache.spark.sql.DataFr
Spark SQL编程指南 <>= spark
SparkSQL编程指南SparkSQL是用于结构化数据处理的一个模块。同SparkRDD不同地方在于SparkSQL的API可以给Spark计算引擎提供更多地信息，例如:数据结构、计算算子等。在内部Spark可以通过这些信息有针对对任务做优化和调整。这里有几种方式和SparkSQL进行交互，例如DatasetAPI和SQL等，这两种API可以混合使用。SparkSQL的一个用途是执行SQL查询。
Pandas将单列XML格式数据转化为字典再拆分成多列列表拆分成多列 aoyi1337 python
单列XML扩展成多列遇到了个需求是需要把XML格式的数据拆分成多列的一个需求，本来需要使用spark进行处理的，但是没想到什么优雅的解决方案，所以打算先使用pandas找找感觉。样例数据如下所示。df=pd.DataFrame([{"uid":1,"detail":'家电无失败'},{"uid":2,"detail":'无失败'},{"uid":3,"detail":'1337点卡成功'}])然后
航班数据预测与分析林坰大数据 spark 航班数据分析杜艳辉
流程：数据来源：数据集预览（原始数据500w行，使用excel打不开，因此使用notepad++打开）：。。。数据清洗：数据存储到HDFS：使用pyspark对数据进行分析：//数据导入frompysparkimportSparkContextfrompyspark.sqlimportSQLContextsc=SparkContext()sqlContext=SQLContext(sc)airpo
再聊阴影裁剪与高性能视锥剔除 unity
【USparkle专栏】如果你深怀绝技，爱“搞点研究”，乐于分享也博采众长，我们期待你的加入，让智慧的火花碰撞交织，让知识的传递生生不息！一、实际需求因为项目的树与草都采用ComputeShader剔除的GPUInstance绘制，所以需要自己实现阴影投递物的裁剪方法。也就是每一帧具体让哪些物体绘制ShadowMap。该计算的精确性会很影响树（有大量顶点又需要用AlphaTest镂空）的渲染性能。
spark为什么比mapreduce快？后端
spark为什么比mapreduce快？首先澄清几个误区：1：两者都是基于内存计算的，任何计算框架都肯定是基于内存的，所以网上说的spark是基于内存计算所以快，显然是错误的2;DAG计算模型减少的是磁盘I/O次数（相比于mapreduce计算模型而言），而不是shuffle次数，因为shuffle是根据数据重组的次数而定，所以shuffle次数不能减少所以总结spark比mapreduce快的原
[CDH] Spark 属性、内存、CPU相关知识梳理枪枪枪 Spark spark scala big data
version：2.4.0-cdh6.3.0文章目录sparkproperties常用配置sparktasksparktask使用的cpu核数sparkarchitecturesparkmemorysparkonyarn问题1：什么情况下使用spark.executor.memoryOverhead问题2:什么情况下使用spark.executor.memory小总结：归根结底，spark中的cp
Spring中@Value注解，需要注意的地方无量 spring bean @Value xml
Spring 3以后,支持@Value注解的方式获取properties文件中的配置值，简化了读取配置文件的复杂操作 1、在applicationContext.xml文件(或引用文件中)中配置properties文件 <bean id="appProperty" class="org.springframework.beans.fac
mongoDB 分片开窍的石头 mongodb
mongoDB的分片。要mongos查询数据时候先查询configsvr看数据在那台shard上，configsvr上边放的是metar信息，指的是那条数据在那个片上。由此可以看出mongo在做分片的时候咱们至少要有一个configsvr,和两个以上的shard（片）信息。第一步启动两台以上的mongo服务 &nb
OVER(PARTITION BY)函数用法 0624chenhong oracle
这篇写得很好，引自 http://www.cnblogs.com/lanzi/archive/2010/10/26/1861338.html OVER(PARTITION BY)函数用法 2010年10月26日 OVER(PARTITION BY)函数介绍开窗函数 &nb
Android开发中，ADB server didn't ACK 解决方法一炮送你回车库 Android开发
首先通知：凡是安装360、豌豆荚、腾讯管家的全部卸载，然后再尝试。一直没搞明白这个问题咋出现的，但今天看到一个方法，搞定了！原来是豌豆荚占用了 5037 端口导致。参见原文章：一个豌豆荚引发的血案——关于ADB server didn't ACK的问题简单来讲，首先将Windows任务进程中的豌豆荚干掉，如果还是不行，再继续按下列步骤排查。 &nb
canvas中的像素绘制问题换个号韩国红果果 JavaScript canvas
pixl的绘制，1.如果绘制点正处于相邻像素交叉线，绘制x像素的线宽，则从交叉线分别向前向后绘制x/2个像素，如果x/2是整数，则刚好填满x个像素，如果是小数，则先把整数格填满，再去绘制剩下的小数部分，绘制时，是将小数部分的颜色用来除以一个像素的宽度，颜色会变淡。所以要用整数坐标来画的话（即绘制点正处于相邻像素交叉线时），线宽必须是2的整数倍。否则会出现不饱满的像素。 2.如果绘制点为一个像素的
编码乱码问题灵静志远 java jvm jsp 编码
1、JVM中单个字符占用的字节长度跟编码方式有关，而默认编码方式又跟平台是一一对应的或说平台决定了默认字符编码方式；2、对于单个字符：ISO-8859-1单字节编码，GBK双字节编码，UTF-8三字节编码；因此中文平台(中文平台默认字符集编码GBK)下一个中文字符占2个字节，而英文平台(英文平台默认字符集编码Cp1252(类似于ISO-8859-1))。 3、getBytes()、getByte
java 求几个月后的日期 darkranger calendar getinstance
Date plandate = planDate.toDate(); SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd"); Calendar cal = Calendar.getInstance(); cal.setTime(plandate); // 取得三个月后时间 cal.add(Calendar.M
数据库设计的三大范式（通俗易懂） aijuans 数据库复习
关系数据库中的关系必须满足一定的要求。满足不同程度要求的为不同范式。数据库的设计范式是数据库设计所需要满足的规范。只有理解数据库的设计范式，才能设计出高效率、优雅的数据库，否则可能会设计出错误的数据库. 目前，主要有六种范式：第一范式、第二范式、第三范式、BC范式、第四范式和第五范式。满足最低要求的叫第一范式，简称1NF。在第一范式基础上进一步满足一些要求的为第二范式，简称2NF。其余依此类推。
想学工作流怎么入手 atongyeye jbpm
工作流在工作中变得越来越重要，很多朋友想学工作流却不知如何入手。很多朋友习惯性的这看一点，那了解一点，既不系统，也容易半途而废。好比学武功，最好的办法是有一本武功秘籍。研究明白，则犹如打通任督二脉。系统学习工作流，很重要的一本书《JBPM工作流开发指南》。本人苦苦学习两个月，基本上可以解决大部分流程问题。整理一下学习思路，有兴趣的朋友可以参考下。 1 首先要
Context和SQLiteOpenHelper创建数据库百合不是茶 android Context创建数据库
一直以为安卓数据库的创建就是使用SQLiteOpenHelper创建,但是最近在android的一本书上看到了Context也可以创建数据库,下面我们一起分析这两种方式创建数据库的方式和区别,重点在SQLiteOpenHelper 一:SQLiteOpenHelper创建数据库: 1,SQLi
浅谈group by和distinct bijian1013 oracle 数据库 group by distinct
group by和distinct只了去重意义一样，但是group by应用范围更广泛些，如分组汇总或者从聚合函数里筛选数据等。譬如：统计每id数并且只显示数大于3 select id ,count(id) from ta
vi opertion 征客丶 mac opration vi
进入 command mode （命令行模式）按 esc 键再按 shift + 冒号注：以下命令中带 $ 【在命令行模式下进行】，不带 $ 【在非命令行模式下进行】一、文件操作 1.1、强制退出不保存 $ q! 1.2、保存 $ w 1.3、保存并退出 $ wq 1.4、刷新或重新加载已打开的文件 $ e 二、光标移动 2.1、跳到指定行数字
【Spark十四】深入Spark RDD第三部分RDD基本API bit1129 spark
对于K/V类型的RDD,如下操作是什么含义？ val rdd = sc.parallelize(List(("A",3),("C",6),("A",1),("B",5)) rdd.reduceByKey(_+_).collect reduceByKey在这里的操作，是把
java类加载机制 BlueSkator java 虚拟机
java类加载机制 1.java类加载器的树状结构引导类加载器 ^ | 扩展类加载器 ^ | 系统类加载器 java使用代理模式来完成类加载，java的类加载器也有类似于继承的关系，引导类是最顶层的加载器，它是所有类的根加载器，它负责加载java核心库。当一个类加载器接到装载类到虚拟机的请求时，通常会代理给父类加载器，若已经是根加载器了，就自己完成加载。虚拟机区分一个Cla
动态添加文本框 BreakingBad 文本框
<script> var num=1; function AddInput() { var str=""; str+="<input
读《研磨设计模式》-代码笔记-单例模式 bylijinnan java 设计模式
声明：本文只为方便我个人查阅和理解，详细的分析以及源代码请移步原作者的博客http://chjavach.iteye.com/ public class Singleton { } /* * 懒汉模式。注意，getInstance如果在多线程环境中调用，需要加上synchronized，否则存在线程不安全问题 */ class LazySingleton
iOS应用打包发布常见问题 chenhbc ios iOS发布 iOS上传 iOS打包
这个月公司安排我一个人做iOS客户端开发，由于急着用，我先发布一个版本，由于第一次发布iOS应用，期间出了不少问题，记录于此。 1、使用Application Loader 发布时报错：Communication error.please use diagnostic mode to check connectivity.you need to have outbound acc
工作流复杂拓扑结构处理新思路 comsci 设计模式工作算法企业应用 OO
我们走的设计路线和国外的产品不太一样，不一样在哪里呢？国外的流程的设计思路是通过事先定义一整套规则(类似XPDL)来约束和控制流程图的复杂度(我对国外的产品了解不够多，仅仅是在有限的了解程度上面提出这样的看法)，从而避免在流程引擎中处理这些复杂的图的问题，而我们却没有通过事先定义这样的复杂的规则来约束和降低用户自定义流程图的灵活性，这样一来，在引擎和流程流转控制这一个层面就会遇到很
oracle 11g新特性Flashback data archive daizj oracle
1. 什么是flashback data archive Flashback data archive是oracle 11g中引入的一个新特性。Flashback archive是一个新的数据库对象，用于存储一个或多表的历史数据。Flashback archive是一个逻辑对象，概念上类似于表空间。实际上flashback archive可以看作是存储一个或多个表的所有事务变化的逻辑空间。
多叉树:2-3-4树 dieslrae 树
平衡树多叉树,每个节点最多有4个子节点和3个数据项,2,3,4的含义是指一个节点可能含有的子节点的个数,效率比红黑树稍差.一般不允许出现重复关键字值.2-3-4树有以下特征: 1、有一个数据项的节点总是有2个子节点(称为2-节点) 2、有两个数据项的节点总是有3个子节点(称为3-节
C语言学习七动态分配 malloc的使用 dcj3sjt126com c language malloc
/* 2013年3月15日15:16:24 malloc 就memory(内存) allocate(分配)的缩写本程序没有实际含义，只是理解使用 */ # include <stdio.h> # include <malloc.h> int main(void) { int i = 5; //分配了4个字节静态分配 int * p
Objective-C编码规范[译] dcj3sjt126com 代码规范
原文链接 : The official raywenderlich.com Objective-C style guide 原文作者 : raywenderlich.com Team 译文出自 : raywenderlich.com Objective-C编码规范译者 : Sam Lau
0.性能优化-目录 frank1234 性能优化
从今天开始笔者陆续发表一些性能测试相关的文章，主要是对自己前段时间学习的总结，由于水平有限，性能测试领域很深，本人理解的也比较浅，欢迎各位大咖批评指正。主要内容包括：一、性能测试指标吞吐量、TPS、响应时间、负载、可扩展性、PV、思考时间 http://frank1234.iteye.com/blog/2180305 二、性能测试策略生产环境相同基准测试预热等 htt
Java父类取得子类传递的泛型参数Class类型 happyqing java 泛型父类子类 Class
import java.lang.reflect.ParameterizedType; import java.lang.reflect.Type; import org.junit.Test; abstract class BaseDao<T> { public void getType() { //Class<E> clazz =
跟我学SpringMVC目录汇总贴、PDF下载、源码下载 jinnianshilongnian springMVC
----广告-------------------------------------------------------------- 网站核心商详页开发掌握Java技术，掌握并发/异步工具使用，熟悉spring、ibatis框架；掌握数据库技术，表设计和索引优化，分库分表/读写分离；了解缓存技术，熟练使用如Redis/Memcached等主流技术；了解Ngin
the HTTP rewrite module requires the PCRE library 流浪鱼 rewrite
./configure: error: the HTTP rewrite module requires the PCRE library. 模块依赖性Nginx需要依赖下面3个包 1. gzip 模块需要 zlib 库 ( 下载: http://www.zlib.net/ ) 2. rewrite 模块需要 pcre 库 ( 下载: http://www.pcre.org/ ) 3. s
第12章 Ajax（中） onestopweb Ajax
index.html <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/
Optimize query with Query Stripping in Web Intelligence blueoxygen BO
http://wiki.sdn.sap.com/wiki/display/BOBJ/Optimize+query+with+Query+Stripping+in+Web+Intelligence and a very straightfoward video http://www.sdn.sap.com/irj/scn/events?rid=/library/uuid/40ec3a0c-936
Java开发者写SQL时常犯的10个错误 tomcat_oracle java sql
1、不用PreparedStatements 　　有意思的是，在JDBC出现了许多年后的今天，这个错误依然出现在博客、论坛和邮件列表中，即便要记住和理解它是一件很简单的事。开发者不使用PreparedStatements的原因可能有如下几个：　　他们对PreparedStatements不了解　　他们认为使用PreparedStatements太慢了　　他们认为写Prepar
世纪互联与结盟有感阿尔萨斯
10月10日，世纪互联与（Foxcon）签约成立合资公司，有感。全球电子制造业巨头（全球500强企业）与世纪互联共同看好IDC、云计算等业务在中国的增长空间，双方迅速果断出手，在资本层面上达成合作，此举体现了全球电子制造业巨头对世纪互联IDC业务的欣赏与信任，另一方面反映出世纪互联目前良好的运营状况与广阔的发展前景。众所周知，精于电子产品制造（世界第一），对于世纪互联而言，能够与结盟