Trino: A Look at SQL Submission and the Query Flow on Partitioned Tables

SQL Submission & Registration

A new query is submitted to the coordinator via POST /v1/statement (handled by the QueuedStatementResource shown below). The query is first placed into the QueryManager's cache pool, and the response tells the client the next URI it should visit.
After a successful submission, the client immediately starts calling the queued/{queryId}/{slug}/{token} REST API to poll the query's execution status.
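
To make the protocol concrete, here is a minimal client sketch that submits a statement and prints the coordinator's first response. It assumes a coordinator at http://localhost:8080 and uses the standard X-Trino-User header; the returned JSON carries the nextUri the client must keep polling (error handling and result paging omitted):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SubmitAndPoll
{
    public static void main(String[] args) throws Exception
    {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest submit = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/v1/statement"))
                .header("X-Trino-User", "test")
                .POST(HttpRequest.BodyPublishers.ofString("SELECT 1"))
                .build();
        HttpResponse<String> response = client.send(submit, HttpResponse.BodyHandlers.ofString());
        // The body contains a "nextUri" of the form /v1/statement/queued/{queryId}/{slug}/{token};
        // GET it repeatedly until the query reaches a final state and the results are drained.
        System.out.println(response.body());
    }
}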

public class QueuedStatementResource {
    @ResourceSecurity(AUTHENTICATED_USER)
    @POST
    @Produces(APPLICATION_JSON)
    public Response postStatement(
            String statement,
            @Context HttpServletRequest servletRequest,
            @Context HttpHeaders httpHeaders,
            @Context UriInfo uriInfo)
    {
        if (isNullOrEmpty(statement)) {
            throw badRequest(BAD_REQUEST, "SQL statement is empty");
        }
        // Register the new query. This only creates a Query instance and adds it to the QueryManager's cache pool
        Query query = registerQuery(statement, servletRequest, httpHeaders);

        return createQueryResultsResponse(query.getQueryResults(query.getLastToken(), uriInfo));
    }

    private Query registerQuery(String statement, HttpServletRequest servletRequest, HttpHeaders httpHeaders)
    {
        Optional<String> remoteAddress = Optional.ofNullable(servletRequest.getRemoteAddr());
        Optional<Identity> identity = Optional.ofNullable((Identity) servletRequest.getAttribute(AUTHENTICATED_IDENTITY));
        MultivaluedMap<String, String> headers = httpHeaders.getRequestHeaders();

        SessionContext sessionContext = sessionContextFactory.createSessionContext(headers, alternateHeaderName, remoteAddress, identity);
        // create a Query instance that tracks all state for this SQL statement's lifecycle
        Query query = new Query(statement, sessionContext, dispatchManager, queryInfoUrlFactory);
        // register the Query instance with the QueryManager
        queryManager.registerQuery(query);

        // let authentication filter know that identity lifecycle has been handed off
        servletRequest.setAttribute(AUTHENTICATED_IDENTITY, null);

        return query;
    }
}

The Query class

Maintains the runtime state of a SQL statement; the status of the SQL during execution can be read through this class. It also handles interaction with the client, providing management operations over the SQL task.

    private static final class Query
    {
        private final String query;
        private final SessionContext sessionContext;
        private final DispatchManager dispatchManager;
        private final QueryId queryId;
        private final Optional<URI> queryInfoUrl;
        private final Slug slug = Slug.createNew();
        private final AtomicLong lastToken = new AtomicLong();

        private final long initTime = System.nanoTime();
        private final AtomicReference<Boolean> submissionGate = new AtomicReference<>();
        private final SettableFuture<Void> creationFuture = SettableFuture.create();

        public Query(String query, SessionContext sessionContext, DispatchManager dispatchManager, QueryInfoUrlFactory queryInfoUrlFactory)
        {
            this.query = requireNonNull(query, "query is null");
            this.sessionContext = requireNonNull(sessionContext, "sessionContext is null");
            this.dispatchManager = requireNonNull(dispatchManager, "dispatchManager is null");
            this.queryId = dispatchManager.createQueryId();
            requireNonNull(queryInfoUrlFactory, "queryInfoUrlFactory is null");
            this.queryInfoUrl = queryInfoUrlFactory.getQueryInfoUrl(queryId);
        }

        public boolean isCreated()
        {
            return creationFuture.isDone();
        }
        
        private ListenableFuture<Void> waitForDispatched()
        {
            // Invoked only when the `queued/{queryId}/{slug}/{token}` REST API is called to fetch the task's status; that call is what triggers submission of this SQL task
            submitIfNeeded();
            if (!creationFuture.isDone()) {
                return nonCancellationPropagating(creationFuture);
            }
            // otherwise, wait for the query to finish
            return dispatchManager.waitForDispatched(queryId);
        }

        private void submitIfNeeded()
        {
            if (submissionGate.compareAndSet(null, true)) {
                // attempt to submit the SQL task to the DispatchManager
                creationFuture.setFuture(dispatchManager.createQuery(queryId, slug, sessionContext, query));
            }
        }
        
        public QueryResults getQueryResults(long token, UriInfo uriInfo)
        {
            // the client fetches results here
        }

        public void cancel()
        {
            creationFuture.addListener(() -> dispatchManager.cancelQuery(queryId), directExecutor());
        }

        public void destroy()
        {
            sessionContext.getIdentity().destroy();
        }
}

The QueuedStatementResource.QueryManager class

Maintains all live Query instances, giving the REST API fast lookup of a Query. It also enforces the client submission timeout; see the check in tryAbandonSubmissionWithTimeout(clientTimeout).
From the server's point of view, a task only counts as submitted once Query::waitForDispatched() has fired. If a client submits a SQL statement and then disappears, it will never call the REST API to poll the execution status, so that method is never triggered, and the whole interval is counted as submission time.
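
A standalone sketch of the submission-gate idea behind these checks, assuming the semantics of the submissionGate and initTime fields shown in the Query class above: the gate stays null until either the first poll submits the query or the timeout abandons it.

import java.time.Duration;
import java.util.concurrent.atomic.AtomicReference;

final class SubmissionGateDemo
{
    // null = undecided, true = submitted, false = abandoned
    private final AtomicReference<Boolean> submissionGate = new AtomicReference<>();
    private final long initNanos = System.nanoTime();

    boolean submitIfNeeded()
    {
        return submissionGate.compareAndSet(null, true);
    }

    boolean tryAbandonWithTimeout(Duration timeout)
    {
        return System.nanoTime() - initNanos > timeout.toNanos()
                && submissionGate.compareAndSet(null, false);
    }
}

Because both paths race on the same compareAndSet, a query ends up either submitted or abandoned, never both.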

    @ThreadSafe
    private static class QueryManager
    {
        private final ConcurrentMap<QueryId, Query> queries = new ConcurrentHashMap<>();
        private final ScheduledExecutorService scheduledExecutorService = newSingleThreadScheduledExecutor(daemonThreadsNamed("drain-state-query-manager"));

        private final Duration querySubmissionTimeout;

        public QueryManager(Duration querySubmissionTimeout)
        {
            this.querySubmissionTimeout = requireNonNull(querySubmissionTimeout, "querySubmissionTimeout is null");
        }

        public void initialize(DispatchManager dispatchManager)
        {
            scheduledExecutorService.scheduleWithFixedDelay(() -> syncWith(dispatchManager), 200, 200, MILLISECONDS);
        }

        private void syncWith(DispatchManager dispatchManager)
        {
            queries.forEach((queryId, query) -> {
                if (shouldBePurged(dispatchManager, query)) {
                    removeQuery(queryId);
                }
            });
        }

        private boolean shouldBePurged(DispatchManager dispatchManager, Query query)
        {
            if (query.isSubmissionAbandoned()) {
                // Query submission was explicitly abandoned
                return true;
            }
            if (query.tryAbandonSubmissionWithTimeout(querySubmissionTimeout)) {
                // Query took too long to be submitted by the client
                return true;
            }
            if (query.isCreated() && !dispatchManager.isQueryRegistered(query.getQueryId())) {
                // Query was created in the DispatchManager, and DispatchManager has already purged the query
                return true;
            }
            return false;
        }

        private void removeQuery(QueryId queryId)
        {
            Optional.ofNullable(queries.remove(queryId))
                    .ifPresent(QueryManager::destroyQuietly);
        }

        public void registerQuery(Query query)
        {
            Query existingQuery = queries.putIfAbsent(query.getQueryId(), query);
            checkState(existingQuery == null, "Query already registered");
        }

        @Nullable
        public Query getQuery(QueryId queryId)
        {
            return queries.get(queryId);
        }
    }

Dispatching the Query instance

The SQL task is only submitted when the client polls for the query's execution status; the submission goes to the DispatchManager.

public class DispatchManager {
    public ListenableFuture<Void> createQuery(QueryId queryId, Slug slug, SessionContext sessionContext, String query)
    {
        requireNonNull(queryId, "queryId is null");
        requireNonNull(sessionContext, "sessionContext is null");
        requireNonNull(query, "query is null");
        checkArgument(!query.isEmpty(), "query must not be empty string");
        checkArgument(queryTracker.tryGetQuery(queryId).isEmpty(), "query %s already exists", queryId);

        // It is important to return a future implementation which ignores cancellation request.
        // Using NonCancellationPropagatingFuture is not enough; it does not propagate cancel to wrapped future
        // but it would still return true on call to isCancelled() after cancel() is called on it.
        DispatchQueryCreationFuture queryCreationFuture = new DispatchQueryCreationFuture();
        // create asynchronously
        dispatchExecutor.execute(() -> {
            try {
                createQueryInternal(queryId, slug, sessionContext, query, resourceGroupManager);
            }
            finally {
                queryCreationFuture.set(null);
            }
        });
        return queryCreationFuture;
    }
    private <C> void createQueryInternal(QueryId queryId, Slug slug, SessionContext sessionContext, String query, ResourceGroupManager<C> resourceGroupManager)
    {
        Session session = null;
        PreparedQuery preparedQuery = null;
        try {
            if (query.length() > maxQueryLength) {
                int queryLength = query.length();
                query = query.substring(0, maxQueryLength);
                throw new TrinoException(QUERY_TEXT_TOO_LARGE, format("Query text length (%s) exceeds the maximum length (%s)", queryLength, maxQueryLength));
            }

            // decode session
            session = sessionSupplier.createSession(queryId, sessionContext);

            // check query execute permissions
            accessControl.checkCanExecuteQuery(sessionContext.getIdentity());

            // prepare query
            // parse the user's SQL, producing an AST instance
            preparedQuery = queryPreparer.prepareQuery(session, query);

            // select resource group
            Optional<String> queryType = getQueryType(preparedQuery.getStatement()).map(Enum::name);
            // If no resource-group selection policy is configured, the SQL is assigned to the global group by default, and all queries share the cluster
            SelectionContext<C> selectionContext = resourceGroupManager.selectGroup(new SelectionCriteria(
                    sessionContext.getIdentity().getPrincipal().isPresent(),
                    sessionContext.getIdentity().getUser(),
                    sessionContext.getIdentity().getGroups(),
                    sessionContext.getSource(),
                    sessionContext.getClientTags(),
                    sessionContext.getResourceEstimates(),
                    queryType));

            // apply system default session properties (does not override user set properties)
            session = sessionPropertyDefaults.newSessionWithDefaultProperties(session, queryType, selectionContext.getResourceGroupId());

            // mark existing transaction as active
            transactionManager.activateTransaction(session, isTransactionControlStatement(preparedQuery.getStatement()), accessControl);
            // Wrap query and preparedQuery into a DispatchQuery instance (in practice a LocalDispatchQuery); this step is asynchronous.
            // It exposes the following methods so upper layers can observe the dispatch state:
            //     ListenableFuture getDispatchedFuture();
            //     DispatchInfo getDispatchInfo();
            //
            // Nearly every phase of SQL execution in Trino is asynchronous; to manage the Query lifecycle
            // correctly under that asynchrony, each phase creates a corresponding instance, such as the DispatchQuery here.
            DispatchQuery dispatchQuery = dispatchQueryFactory.createDispatchQuery(
                    session,
                    query,
                    preparedQuery,
                    slug,
                    selectionContext.getResourceGroupId());
            // Once the DispatchQuery is created, it is added to the QueryTracker, which manages the SQL's execution lifecycle
            boolean queryAdded = queryCreated(dispatchQuery);
            if (queryAdded && !dispatchQuery.isDone()) {
                // The query was added to the QueryTracker but the dispatchQuery has not finished creation, so submit it to the resource group to wait for scheduling
                try {
                    resourceGroupManager.submit(dispatchQuery, selectionContext, dispatchExecutor);
                }
                catch (Throwable e) {
                    // dispatch query has already been registered, so just fail it directly
                    dispatchQuery.fail(e);
                }
            }
        }
        catch (Throwable throwable) {
            // creation must never fail, so register a failed query in this case
            if (session == null) {
                session = Session.builder(sessionPropertyManager)
                        .setQueryId(queryId)
                        .setIdentity(sessionContext.getIdentity())
                        .setSource(sessionContext.getSource().orElse(null))
                        .build();
            }
            // If anything goes wrong, a FailedDispatchQuery instance is created to record the failure details.
            Optional<String> preparedSql = Optional.ofNullable(preparedQuery).flatMap(PreparedQuery::getPrepareSql);
            DispatchQuery failedDispatchQuery = failedDispatchQueryFactory.createFailedDispatchQuery(session, query, preparedSql, Optional.empty(), throwable);
            queryCreated(failedDispatchQuery);
        }
    }
}

Creating the LocalDispatchQuery instance

public class LocalDispatchQueryFactory
        implements DispatchQueryFactory
{
    @Override
    public DispatchQuery createDispatchQuery(
            Session session,
            String query,
            PreparedQuery preparedQuery,
            Slug slug,
            ResourceGroupId resourceGroup)
    {
        WarningCollector warningCollector = warningCollectorFactory.create();
        // create a new state machine for the newly submitted query
        QueryStateMachine stateMachine = QueryStateMachine.begin(
                query,
                preparedQuery.getPrepareSql(),
                session,
                locationFactory.createQueryLocation(session.getQueryId()),
                resourceGroup,
                isTransactionControlStatement(preparedQuery.getStatement()),
                transactionManager,
                accessControl,
                executor,
                metadata,
                warningCollector,
                getQueryType(preparedQuery.getStatement()));

        // It is important that `queryCreatedEvent` is called here. Moving it past the `executor.submit` below
        // can result in delivering query-created event after query analysis has already started.
        // That can result in misbehaviour of plugins called during analysis phase (e.g. access control auditing)
        // which depend on the contract that event was already delivered.
        //
        // Note that for immediate and in-order delivery of query events we depend on synchronous nature of
        // QueryMonitor and EventListenerManager.
        queryMonitor.queryCreatedEvent(stateMachine.getBasicQueryInfo(Optional.empty()));
        // asynchronously create the QueryExecution instance, in practice a SqlQueryExecution
        ListenableFuture<QueryExecution> queryExecutionFuture = executor.submit(() -> {
            QueryExecutionFactory<?> queryExecutionFactory = executionFactories.get(preparedQuery.getStatement().getClass());
            if (queryExecutionFactory == null) {
                throw new TrinoException(NOT_SUPPORTED, "Unsupported statement type: " + preparedQuery.getStatement().getClass().getSimpleName());
            }

            try {
                // create the QueryExecution
                return queryExecutionFactory.createQueryExecution(preparedQuery, stateMachine, slug, warningCollector);
            }
            catch (Throwable e) {
                if (e instanceof Error) {
                    if (e instanceof StackOverflowError) {
                        log.error(e, "Unhandled StackOverFlowError; should be handled earlier; to investigate full stacktrace you may need to enable -XX:MaxJavaStackTraceDepth=0 JVM flag");
                    }
                    else {
                        log.error(e, "Unhandled Error");
                    }
                    // wrapping as RuntimeException to guard us from problem that code downstream which investigates queryExecutionFuture may not necessarily handle
                    // Error subclass of Throwable well.
                    RuntimeException wrappedError = new RuntimeException(e);
                    stateMachine.transitionToFailed(wrappedError);
                    throw wrappedError;
                }
                stateMachine.transitionToFailed(e);
                throw e;
            }
        });
        // Return the LocalDispatchQuery instance. Note that it receives the queryExecutionFuture,
        // meaning the instance only counts as fully created once queryExecutionFuture.isDone().
        return new LocalDispatchQuery(
                stateMachine,
                queryExecutionFuture,
                queryMonitor,
                clusterSizeMonitor,
                executor,
                // queryManager is a SqlQueryManager instance; it holds a reference to the QueryTracker, so the SQL task's lifecycle can be managed at a higher level
                queryManager::createQuery);
    }
}
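
Much of this pipeline is ListenableFuture chaining. As a reminder of the primitive involved (this is plain Guava, not Trino code), reacting to a future the way LocalDispatchQuery reacts to queryExecutionFuture looks like this:

import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.ListeningExecutorService;
import com.google.common.util.concurrent.MoreExecutors;
import java.util.concurrent.Executors;

public class FutureDemo
{
    public static void main(String[] args)
    {
        ListeningExecutorService executor = MoreExecutors.listeningDecorator(Executors.newSingleThreadExecutor());
        // stands in for the asynchronous creation of the QueryExecution
        ListenableFuture<String> future = executor.submit(() -> "query execution created");
        Futures.addCallback(future, new FutureCallback<String>()
        {
            @Override
            public void onSuccess(String result)
            {
                System.out.println("start execution: " + result);
            }

            @Override
            public void onFailure(Throwable t)
            {
                t.printStackTrace();
            }
        }, MoreExecutors.directExecutor());
        executor.shutdown();
    }
}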

Creating the SqlQueryExecution instance

The object is created via SqlQueryExecutionFactory.createQueryExecution().

@ThreadSafe
public class SqlQueryExecution
        implements QueryExecution
{
    private SqlQueryExecution(
            PreparedQuery preparedQuery,
            QueryStateMachine stateMachine,
            Slug slug,
            PlannerContext plannerContext,
            AnalyzerFactory analyzerFactory,
            SplitSourceFactory splitSourceFactory,
            NodePartitioningManager nodePartitioningManager,
            NodeScheduler nodeScheduler,
            List<PlanOptimizer> planOptimizers,
            PlanFragmenter planFragmenter,
            RemoteTaskFactory remoteTaskFactory,
            int scheduleSplitBatchSize,
            ExecutorService queryExecutor,
            ScheduledExecutorService schedulerExecutor,
            FailureDetector failureDetector,
            NodeTaskMap nodeTaskMap,
            ExecutionPolicy executionPolicy,
            SplitSchedulerStats schedulerStats,
            StatsCalculator statsCalculator,
            CostCalculator costCalculator,
            DynamicFilterService dynamicFilterService,
            WarningCollector warningCollector,
            TableExecuteContextManager tableExecuteContextManager,
            TypeAnalyzer typeAnalyzer,
            TaskManager coordinatorTaskManager)
    {
        try (SetThreadName ignored = new SetThreadName("Query-%s", stateMachine.getQueryId())) {
            this.slug = requireNonNull(slug, "slug is null");
            this.plannerContext = requireNonNull(plannerContext, "plannerContext is null");
            this.splitSourceFactory = requireNonNull(splitSourceFactory, "splitSourceFactory is null");
            this.nodePartitioningManager = requireNonNull(nodePartitioningManager, "nodePartitioningManager is null");
            this.nodeScheduler = requireNonNull(nodeScheduler, "nodeScheduler is null");
            this.planOptimizers = requireNonNull(planOptimizers, "planOptimizers is null");
            this.planFragmenter = requireNonNull(planFragmenter, "planFragmenter is null");
            this.queryExecutor = requireNonNull(queryExecutor, "queryExecutor is null");
            this.schedulerExecutor = requireNonNull(schedulerExecutor, "schedulerExecutor is null");
            this.failureDetector = requireNonNull(failureDetector, "failureDetector is null");
            this.nodeTaskMap = requireNonNull(nodeTaskMap, "nodeTaskMap is null");
            this.executionPolicy = requireNonNull(executionPolicy, "executionPolicy is null");
            this.schedulerStats = requireNonNull(schedulerStats, "schedulerStats is null");
            this.statsCalculator = requireNonNull(statsCalculator, "statsCalculator is null");
            this.costCalculator = requireNonNull(costCalculator, "costCalculator is null");
            this.dynamicFilterService = requireNonNull(dynamicFilterService, "dynamicFilterService is null");
            this.tableExecuteContextManager = requireNonNull(tableExecuteContextManager, "tableExecuteContextManager is null");

            checkArgument(scheduleSplitBatchSize > 0, "scheduleSplitBatchSize must be greater than 0");
            this.scheduleSplitBatchSize = scheduleSplitBatchSize;
            // keep a reference to the state machine
            this.stateMachine = requireNonNull(stateMachine, "stateMachine is null");

            // analyze query
            // preparedQuery holds the Statement (AST) produced by parsing the SQL text, so the analysis here works from that object
            this.analysis = analyze(preparedQuery, stateMachine, warningCollector, analyzerFactory);
            // Register a listener with the state machine: once it reaches a done state, unregister from the
            // dynamicFilterService. What that service does is covered in detail in another article.
            stateMachine.addStateChangeListener(state -> {
                if (!state.isDone()) {
                    return;
                }
                unregisterDynamicFilteringQuery(
                        dynamicFilterService.getDynamicFilteringStats(stateMachine.getQueryId(), stateMachine.getSession()));

                tableExecuteContextManager.unregisterTableExecuteContextForQuery(stateMachine.getQueryId());
            });

            // when the query finishes cache the final query info, and clear the reference to the output stage
            AtomicReference<SqlQueryScheduler> queryScheduler = this.queryScheduler;
            stateMachine.addStateChangeListener(state -> {
                if (!state.isDone()) {
                    return;
                }

                // query is now done, so abort any work that is still running
                // failure is one kind of done state
                SqlQueryScheduler scheduler = queryScheduler.get();
                if (scheduler != null) {
                    scheduler.abort();
                }
            });

            this.remoteTaskFactory = new MemoryTrackingRemoteTaskFactory(requireNonNull(remoteTaskFactory, "remoteTaskFactory is null"), stateMachine);
            this.typeAnalyzer = requireNonNull(typeAnalyzer, "typeAnalyzer is null");
            this.coordinatorTaskManager = requireNonNull(coordinatorTaskManager, "coordinatorTaskManager is null");
        }
    }
}

Creating and executing the LocalDispatchQuery

This is, in effect, the execution of the QueryExecution instance. Getting there still requires passing through resource-group selection;
those details are not the focus here, so we skip them. It is enough to know that the resource group ultimately calls the LocalDispatchQuery::startWaitingForResources method.

Resource availability check

public class LocalDispatchQuery
        implements DispatchQuery
{
    @Override
    public void startWaitingForResources()
    {
        // transition the state machine to WAITING_FOR_RESOURCES
        if (stateMachine.transitionToWaitingForResources()) {
            waitForMinimumWorkers();
        }
    }

    private void waitForMinimumWorkers()
    {
        // The queryExecution only starts once enough worker nodes are available. Since the defaults are unchanged,
        // the condition here is: as soon as 1 worker is up, startExecution(queryExecution) is triggered.
        // wait for query execution to finish construction
        addSuccessCallback(queryExecutionFuture, queryExecution -> {
            Session session = stateMachine.getSession();
            int executionMinCount = 1; // always wait for 1 node to be up
            if (queryExecution.shouldWaitForMinWorkers()) {
                executionMinCount = getRequiredWorkers(session);
            }
            ListenableFuture<Void> minimumWorkerFuture = clusterSizeMonitor.waitForMinimumWorkers(executionMinCount, getRequiredWorkersMaxWait(session));
            // when worker requirement is met, start the execution
            addSuccessCallback(minimumWorkerFuture, () -> startExecution(queryExecution));
            addExceptionCallback(minimumWorkerFuture, throwable -> queryExecutor.execute(() -> stateMachine.transitionToFailed(throwable)));

            // cancel minimumWorkerFuture if query fails for some reason or is cancelled by user
            stateMachine.addStateChangeListener(state -> {
                if (state.isDone()) {
                    minimumWorkerFuture.cancel(true);
                }
            });
        });
    }
    
    private void startExecution(QueryExecution queryExecution)
    {
        queryExecutor.execute(() -> {
            // transition the state machine to DISPATCHING
            if (stateMachine.transitionToDispatching()) {
                try {
                    // Hand off to querySubmitter, i.e. the queryManager::createQuery method mentioned earlier, which ultimately routes to SqlQueryExecution::start
                    querySubmitter.accept(queryExecution);
                    if (notificationSentOrGuaranteed.compareAndSet(false, true)) {
                        queryExecution.addFinalQueryInfoListener(queryMonitor::queryCompletedEvent);
                    }
                }
                catch (Throwable t) {
                    // this should never happen but be safe
                    stateMachine.transitionToFailed(t);
                    log.error(t, "query submitter threw exception");
                    throw t;
                }
                finally {
                    submitted.set(null);
                }
            }
        });
    }
}

SqlQueryExecution::start()

@ThreadSafe
public class SqlQueryExecution
        implements QueryExecution
{
    @Override
    public void start()
    {
        try (SetThreadName ignored = new SetThreadName("Query-%s", stateMachine.getQueryId())) {
            try {
                // transition the state machine to PLANNING
                if (!stateMachine.transitionToPlanning()) {
                    // query already started or finished
                    return;
                }
                // Register a listener: if the state machine is found in the FAILED state during planning, interrupt the planning thread to abort PLANNING
                AtomicReference<Thread> planningThread = new AtomicReference<>(currentThread());
                stateMachine.getStateChange(PLANNING).addListener(() -> {
                    if (stateMachine.getQueryState() == FAILED) {
                        synchronized (this) {
                            Thread thread = planningThread.get();
                            if (thread != null) {
                                thread.interrupt();
                            }
                        }
                    }
                }, directExecutor());

                try {
                    // optimize the logical plan tree and split it into PlanFragments, so the plan fragments can be scheduled for execution
                    PlanRoot plan = planQuery();
                    // DynamicFilterService needs plan for query to be registered.
                    // Query should be registered before dynamic filter suppliers are requested in distribution planning.
                    // register with the dynamic-filtering service
                    registerDynamicFilteringQuery(plan);
                    // Schedule the plan for execution. Internally this creates the SqlQueryScheduler instance that
                    // distributes the PlanFragments and manages their state; the process is asynchronous.
                    planDistribution(plan);
                }
                finally {
                    synchronized (this) {
                        planningThread.set(null);
                        // Clear the interrupted flag in case there was a race condition where
                        // the planning thread was interrupted right after planning completes above
                        Thread.interrupted();
                    }
                }

                tableExecuteContextManager.registerTableExecuteContextForQuery(getQueryId());
                // transition the state machine to STARTING
                if (!stateMachine.transitionToStarting()) {
                    // query already started or finished
                    return;
                }

                // if query is not finished, start the scheduler, otherwise cancel it
                SqlQueryScheduler scheduler = queryScheduler.get();

                if (!stateMachine.isDone()) {
                    // call SqlQueryScheduler::start() to begin scheduling and execution
                    scheduler.start();
                }
            }
            catch (Throwable e) {
                fail(e);
                throwIfInstanceOf(e, Error.class);
            }
        }
    }
}
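
Pulling the transitions together, the query walks through roughly the following states in this flow. This is a sketch modeled on Trino's QueryState enum; the names are assumed from the transition methods above and may differ slightly across versions:

enum QueryLifecycleSketch
{
    QUEUED,                 // registered with the QueryManager
    WAITING_FOR_RESOURCES,  // LocalDispatchQuery::startWaitingForResources
    DISPATCHING,            // LocalDispatchQuery::startExecution
    PLANNING,               // SqlQueryExecution::start
    STARTING,               // scheduler about to run
    RUNNING,
    FINISHING,
    FINISHED,
    FAILED                  // reachable from any state
}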

SqlQueryExecution::planDistribution

@ThreadSafe
public class SqlQueryExecution
        implements QueryExecution
{
    private void planDistribution(PlanRoot plan)
    {
        // if query was canceled, skip creating scheduler
        if (stateMachine.isDone()) {
            return;
        }

        // record output field
        PlanFragment rootFragment = plan.getRoot().getFragment();
        stateMachine.setColumns(
                ((OutputNode) rootFragment.getRoot()).getColumnNames(),
                rootFragment.getTypes());

        // build the stage execution objects (this doesn't schedule execution)
        SqlQueryScheduler scheduler = new SqlQueryScheduler(
                stateMachine,
                plan.getRoot(),
                nodePartitioningManager,
                nodeScheduler,
                remoteTaskFactory,
                plan.isSummarizeTaskInfos(),
                scheduleSplitBatchSize,
                queryExecutor,
                schedulerExecutor,
                failureDetector,
                nodeTaskMap,
                executionPolicy,
                schedulerStats,
                dynamicFilterService,
                tableExecuteContextManager,
                plannerContext.getMetadata(),
                splitSourceFactory,
                coordinatorTaskManager);

        queryScheduler.set(scheduler);

        // if query was canceled during scheduler creation, abort the scheduler
        // directly since the callback may have already fired
        if (stateMachine.isDone()) {
            scheduler.abort();
            queryScheduler.set(null);
        }
    }
}

Creating the StageManager instance, and a SqlStage instance for each PlanFragment

A SqlStage tracks the lifecycle and maintains the state of every task that belongs to it.

public class SqlQueryScheduler
{
    private static class StageManager
    {
        private static StageManager create(
                QueryStateMachine queryStateMachine,
                Session session,
                Metadata metadata,
                RemoteTaskFactory taskFactory,
                NodeTaskMap nodeTaskMap,
                ExecutorService executor,
                SplitSchedulerStats schedulerStats,
                SubPlan planTree,
                boolean summarizeTaskInfo)
        {
            ImmutableMap.Builder<StageId, SqlStage> stages = ImmutableMap.builder();
            ImmutableList.Builder<SqlStage> coordinatorStagesInTopologicalOrder = ImmutableList.builder();
            ImmutableList.Builder<SqlStage> distributedStagesInTopologicalOrder = ImmutableList.builder();
            StageId rootStageId = null;
            ImmutableMap.Builder<StageId, Set<StageId>> children = ImmutableMap.builder();
            ImmutableMap.Builder<StageId, StageId> parents = ImmutableMap.builder();
            // traverse the plan tree from the root, top-down and breadth-first, collecting all SubPlans
            for (SubPlan planNode : Traverser.forTree(SubPlan::getChildren).breadthFirst(planTree)) {
                PlanFragment fragment = planNode.getFragment();
                // A SubPlan (PlanFragment) corresponds to a stage (close to the concept in Spark);
                // the StageId takes the form {queryId}-{fragmentId}
                SqlStage stage = createSqlStage(
                        getStageId(session.getQueryId(), fragment.getId()),
                        fragment,
                        extractTableInfo(session, metadata, fragment),
                        taskFactory,
                        session,
                        summarizeTaskInfo,
                        nodeTaskMap,
                        executor,
                        schedulerStats);
                StageId stageId = stage.getStageId();
                stages.put(stageId, stage);
                // maintain all stages in topological order
                if (fragment.getPartitioning().isCoordinatorOnly()) {
                    coordinatorStagesInTopologicalOrder.add(stage);
                }
                else {
                    distributedStagesInTopologicalOrder.add(stage);
                }
                // the traversal is top-down, so the first stage visited is the top-most stage, i.e. the root stage
                if (rootStageId == null) {
                    rootStageId = stageId;
                }
                // record the dependencies between stages
                Set<StageId> childStageIds = planNode.getChildren().stream()
                        .map(childStage -> getStageId(session.getQueryId(), childStage.getFragment().getId()))
                        .collect(toImmutableSet());
                children.put(stageId, childStageIds);
                childStageIds.forEach(child -> parents.put(child, stageId));
            }
          
            StageManager stageManager = new StageManager(
                    queryStateMachine,
                    stages.build(),
                    coordinatorStagesInTopologicalOrder.build(),
                    distributedStagesInTopologicalOrder.build(),
                    rootStageId,
                    children.build(),
                    parents.build());
            stageManager.initialize();
            return stageManager;
        }
    }
}
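
The traversal order matters here: breadth-first from the root guarantees that the first fragment visited becomes the root stage and that parents are registered before their children. A tiny standalone illustration of the same Guava Traverser call over a toy tree (integers stand in for fragments; these are not Trino types):

import com.google.common.collect.ImmutableList;
import com.google.common.graph.Traverser;
import java.util.List;
import java.util.Map;

public class TraversalDemo
{
    public static void main(String[] args)
    {
        // toy plan tree: fragment 0 is the root and consumes fragments 1 and 2
        Map<Integer, List<Integer>> children = Map.of(0, List.of(1, 2), 1, List.of(), 2, List.of());
        List<Integer> order = ImmutableList.copyOf(Traverser.<Integer>forTree(children::get).breadthFirst(0));
        System.out.println(order); // [0, 1, 2]: the first fragment visited becomes the root stage
    }
}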

SqlQueryScheduler::start()

public class SqlQueryScheduler
{
    public synchronized void start()
    {
        if (started) {
            return;
        }
        started = true;

        if (queryStateMachine.isDone()) {
            return;
        }

        // when query is done or any time a stage completes, attempt to transition query to "final query info ready"
        queryStateMachine.addStateChangeListener(state -> {
            if (!state.isDone()) {
                return;
            }

            DistributedStagesScheduler distributedStagesScheduler;
            // synchronize to wait on distributed scheduler creation if it is currently in process
            synchronized (this) {
                distributedStagesScheduler = this.distributedStagesScheduler.get();
            }

            if (state == QueryState.FINISHED) {
                    // once the state machine reaches FINISHED, cancel all stages still being scheduled
                coordinatorStagesScheduler.cancel();
                if (distributedStagesScheduler != null) {
                    distributedStagesScheduler.cancel();
                }
                    // finish via the StageManager
                stageManager.finish();
            }
            else if (state == QueryState.FAILED) {
                coordinatorStagesScheduler.abort();
                if (distributedStagesScheduler != null) {
                    distributedStagesScheduler.abort();
                }
                stageManager.abort();
            }

            queryStateMachine.updateQueryInfo(Optional.ofNullable(getStageInfo()));
        });
        // schedule the stages for execution
        coordinatorStagesScheduler.schedule();

        Optional<DistributedStagesScheduler> distributedStagesScheduler = createDistributedStagesScheduler(currentAttempt.get());
        distributedStagesScheduler.ifPresent(scheduler -> distributedStagesSchedulingTask = executor.submit(scheduler::schedule, null));
    }
}

DistributedStagesScheduler::schedule()

Creating the DistributedStagesScheduler

Responsible for scheduling the execution of all SqlStages.

        public static PipelinedDistributedStagesScheduler create(
                QueryStateMachine queryStateMachine, // the query-level state machine
                SplitSchedulerStats schedulerStats, // records split scheduling statistics
                NodeScheduler nodeScheduler,        // assigns each split a suitable worker node
                NodePartitioningManager nodePartitioningManager, // provides the information used to partition data pages
                StageManager stageManager,          // manages all stages
                CoordinatorStagesScheduler coordinatorStagesScheduler, // schedules coordinator-only stages onto the coordinator node
                ExecutionPolicy executionPolicy,    // execution policy: AllAtOnceExecutionPolicy or PhasedExecutionPolicy
                FailureDetector failureDetector,
                ScheduledExecutorService executor,  // thread pool used for stage scheduling
                SplitSourceFactory splitSourceFactory, // factory that creates source splits
                int splitBatchSize,                    // maximum number of splits scheduled in one batch
                DynamicFilterService dynamicFilterService,
                TableExecuteContextManager tableExecuteContextManager,
                RetryPolicy retryPolicy,
                int attempt)
        {
            // The DistributedStagesScheduler schedules the stages, and its state is distinct from the QueryStateMachine's,
            // so a separate state machine is created here to track the PipelinedDistributedStagesScheduler's state
            DistributedStagesSchedulerStateMachine stateMachine = new DistributedStagesSchedulerStateMachine(queryStateMachine.getQueryId(), executor);
            // Cache NodePartitionMap instances in a map keyed by PartitioningHandle; PlanFragments can share the same
            // PartitioningHandle instance, so this avoids generating a NodePartitionMap repeatedly
            Map<PartitioningHandle, NodePartitionMap> partitioningCacheMap = new HashMap<>();
            // Build a NodePartitionMap from the PartitioningHandle supplied by the specific connector:
            //     the NodePartitionMap records the worker node -> partition id mapping; it can be provided by the
            //     connector (e.g. IcebergPartitioningHandle) or by the system default implementation, SystemPartitioningHandle.
            // How the NodePartitionMap instance is built is shown in a later subsection.
            Function<PartitioningHandle, NodePartitionMap> partitioningCache = partitioningHandle ->
                    partitioningCacheMap.computeIfAbsent(partitioningHandle, handle -> nodePartitioningManager.getNodePartitioningMap(queryStateMachine.getSession(), handle));
            // For each PlanFragment, create the bucket -> partition id mapping.
            // A bucket (similar to the concept in Hive) further subdivides the data within a partition, so one partition can contain multiple buckets
            Map<PlanFragmentId, Optional<int[]>> bucketToPartitionMap = createBucketToPartitionMap(
                    coordinatorStagesScheduler.getBucketToPartitionForStagesConsumedByCoordinator(),
                    stageManager,
                    partitioningCache);
            // Create an OutputBufferManager for each PlanFragment, used to create and maintain that fragment's output buffers.
            // Depending on the PartitioningHandle, the OutputBufferManager creates different kinds of OutputBuffers; there are three:
            // BufferType type =
            //   if partitioningHandle.equals(FIXED_BROADCAST_DISTRIBUTION) then BROADCAST;
            //   else if partitioningHandle.equals(FIXED_ARBITRARY_DISTRIBUTION) then ARBITRARY;
            //   else PARTITIONED;
            Map<PlanFragmentId, OutputBufferManager> outputBufferManagers = createOutputBufferManagers(
                    coordinatorStagesScheduler.getOutputBuffersForStagesConsumedByCoordinator(),
                    stageManager,
                    bucketToPartitionMap);

            TaskLifecycleListener coordinatorTaskLifecycleListener = coordinatorStagesScheduler.getTaskLifecycleListener();
            if (retryPolicy != RetryPolicy.NONE) {
                // when retries are enabled only close exchange clients on coordinator when the query is finished
                TaskLifecycleListenerBridge taskLifecycleListenerBridge = new TaskLifecycleListenerBridge(coordinatorTaskLifecycleListener);
                coordinatorTaskLifecycleListener = taskLifecycleListenerBridge;
                stateMachine.addStateChangeListener(state -> {
                    if (state == DistributedStagesSchedulerState.FINISHED) {
                        taskLifecycleListenerBridge.notifyNoMoreSourceTasks();
                    }
                });
            }
            // Create a PipelinedStageExecution for every stage to be scheduled; each instance manages the lifecycle of its own stage
            Map<StageId, PipelinedStageExecution> stageExecutions = new HashMap<>();
            for (SqlStage stage : stageManager.getDistributedStagesInTopologicalOrder()) {
                Optional<SqlStage> parentStage = stageManager.getParent(stage.getStageId());
                // TaskLifecycleListener provides the hook through which tasks are created for a stage
                TaskLifecycleListener taskLifecycleListener;
                if (parentStage.isEmpty() || parentStage.get().getFragment().getPartitioning().isCoordinatorOnly()) {
                    // output will be consumed by coordinator
                    // When there is no parent stage (root) or the parent fragment's partitioning is coordinator-only, bind this stage's task lifecycle to the coordinator
                    taskLifecycleListener = coordinatorTaskLifecycleListener;
                }
                else {
                    // for a non-root stage, reuse the listener already bound to the parent stage's execution
                    StageId parentStageId = parentStage.get().getStageId();
                    PipelinedStageExecution parentStageExecution = requireNonNull(stageExecutions.get(parentStageId), () -> "execution is null for stage: " + parentStageId);
                    taskLifecycleListener = parentStageExecution.getTaskLifecycleListener();
                }

                PlanFragment fragment = stage.getFragment();
                // Create the PipelinedStageExecution that schedules and executes this stage: it creates a RemoteTask per partition and dispatches it to the matching worker node
                PipelinedStageExecution stageExecution = createPipelinedStageExecution(
                        stageManager.get(fragment.getId()),
                        outputBufferManagers,
                        taskLifecycleListener,
                        failureDetector,
                        executor,
                        bucketToPartitionMap.get(fragment.getId()),
                        attempt);
                stageExecutions.put(stage.getStageId(), stageExecution);
            }

            ImmutableMap.Builder<StageId, StageScheduler> stageSchedulers = ImmutableMap.builder();
            for (PipelinedStageExecution stageExecution : stageExecutions.values()) {
                List<PipelinedStageExecution> children = stageManager.getChildren(stageExecution.getStageId()).stream()
                        .map(stage -> requireNonNull(stageExecutions.get(stage.getStageId()), () -> "stage execution not found for stage: " + stage))
                        .collect(toImmutableList());
                // For each StageExecution, create the matching StageScheduler that drives that stage's scheduling. Trino ships several implementations:
                //  FixedSourcePartitionedScheduler
                //  FixedCountScheduler
                //  () -> SourcePartitionedScheduler
                StageScheduler scheduler = createStageScheduler(
                        queryStateMachine,
                        stageExecution,
                        splitSourceFactory,
                        children,
                        partitioningCache,
                        nodeScheduler,
                        nodePartitioningManager,
                        splitBatchSize,
                        dynamicFilterService,
                        executor,
                        tableExecuteContextManager);
                stageSchedulers.put(stageExecution.getStageId(), scheduler);
            }
            // create the PipelinedDistributedStagesScheduler, responsible for scheduling all stages
            PipelinedDistributedStagesScheduler distributedStagesScheduler = new PipelinedDistributedStagesScheduler(
                    stateMachine,
                    queryStateMachine,
                    schedulerStats,
                    stageManager,
                    executionPolicy.createExecutionSchedule(stageExecutions.values()),
                    stageSchedulers.build(),
                    ImmutableMap.copyOf(stageExecutions),
                    dynamicFilterService);
            distributedStagesScheduler.initialize();
            return distributedStagesScheduler;
        }
        
        /**
         * Compute the bucketToPartition mapping for each PlanFragment.
         */
        private static Map<PlanFragmentId, Optional<int[]>> createBucketToPartitionMap(
                Map<PlanFragmentId, Optional<int[]>> bucketToPartitionForStagesConsumedByCoordinator,
                StageManager stageManager,
                Function<PartitioningHandle, NodePartitionMap> partitioningCache)
        {
            ImmutableMap.Builder<PlanFragmentId, Optional<int[]>> result = ImmutableMap.builder();
            // can be ignored here; only populated for stages consumed by the coordinator
            result.putAll(bucketToPartitionForStagesConsumedByCoordinator);
            for (SqlStage stage : stageManager.getDistributedStagesInTopologicalOrder()) {
                PlanFragment fragment = stage.getFragment();
                Optional<int[]> bucketToPartition = getBucketToPartition(fragment.getPartitioning(), partitioningCache, fragment.getRoot(), fragment.getRemoteSourceNodes());
                for (SqlStage childStage : stageManager.getChildren(stage.getStageId())) {
                    result.put(childStage.getFragment().getId(), bucketToPartition);
                }
            }
            return result.build();
        }
        
        private static Optional<int[]> getBucketToPartition(
                PartitioningHandle partitioningHandle,
                Function<PartitioningHandle, NodePartitionMap> partitioningCache,
                PlanNode fragmentRoot,
                List<RemoteSourceNode> remoteSourceNodes)
        {
            if (partitioningHandle.equals(SOURCE_DISTRIBUTION) || partitioningHandle.equals(SCALED_WRITER_DISTRIBUTION)) {
                // SOURCE_DISTRIBUTION denotes a table-scan fragment and SCALED_WRITER_DISTRIBUTION a table-write fragment,
                // so a PlanFragment of these types has exactly one bucket
                return Optional.of(new int[1]);
            }
            else if (searchFrom(fragmentRoot).where(node -> node instanceof TableScanNode).findFirst().isPresent()) {
                if (remoteSourceNodes.stream().allMatch(node -> node.getExchangeType() == REPLICATE)) {
                    return Optional.empty();
                }
                else {
                    // remote source requires nodePartitionMap
                    // A remote-source operator reads partitioned data produced by an upstream PlanFragment, so the
                    // bucket-to-partition mapping must come from the bound partitioningHandle; partitioningCache was initialized earlier
                    NodePartitionMap nodePartitionMap = partitioningCache.apply(partitioningHandle);
                    return Optional.of(nodePartitionMap.getBucketToPartition());
                }
            }
            else {
                // other types, e.g. ARBITRARY_DISTRIBUTION or FIXED_HASH_DISTRIBUTION, are computed much like the remote-source case
                NodePartitionMap nodePartitionMap = partitioningCache.apply(partitioningHandle);
                List<InternalNode> partitionToNode = nodePartitionMap.getPartitionToNode();
                // todo this should asynchronously wait a standard timeout period before failing
                checkCondition(!partitionToNode.isEmpty(), NO_NODES_AVAILABLE, "No worker nodes available");
                return Optional.of(nodePartitionMap.getBucketToPartition());
            }
        }
}

Creating the NodePartitionMap instance: the bucket -> partition -> node mapping

This instance holds two important data structures:
partitionToNode: the mapping from data partition to worker node
bucketToPartition: the mapping from data bucket to data partition
With these two maps, the bucket id of every row can be computed from its partitioning key, which tells us the partition the row belongs to,
and in turn the worker node it should be routed to.

    public NodePartitionMap getNodePartitioningMap(Session session, PartitioningHandle partitioningHandle)
    {
        requireNonNull(session, "session is null");
        requireNonNull(partitioningHandle, "partitioningHandle is null");
        if (partitioningHandle.getConnectorHandle() instanceof SystemPartitioningHandle) {
            // return the system default implementation
            return ((SystemPartitioningHandle) partitioningHandle.getConnectorHandle()).getNodePartitionMap(session, nodeScheduler);
        }
        // Obtain the connector's own bucket -> node mapping. A connector can implement the interface to define the number
        // of buckets and how buckets map to worker nodes. Since we are discussing the Iceberg connector here, the instance
        // comes from the createArbitraryBucketToNode(...) path.
        ConnectorBucketNodeMap connectorBucketNodeMap = getConnectorBucketNodeMap(session, partitioningHandle);
        // safety check for crazy partitioning
        checkArgument(connectorBucketNodeMap.getBucketCount() < 1_000_000, "Too many buckets in partitioning: %s", connectorBucketNodeMap.getBucketCount());

        List<InternalNode> bucketToNode;
        if (connectorBucketNodeMap.hasFixedMapping()) {
            bucketToNode = getFixedMapping(connectorBucketNodeMap);
        }
        else {
            CatalogName catalogName = partitioningHandle.getConnectorId()
                    .orElseThrow(() -> new IllegalArgumentException("No connector ID for partitioning handle: " + partitioningHandle));
            // Create a bucket to node mapping. Consecutive buckets are assigned
            // to shuffled nodes (e.g "1 -> node2, 2 -> node1, 3 -> node2, 4 -> node1, ...").
            // The invariant here is: number of buckets >= number of available workers.
            // The Iceberg connector only defines the bucket count (equal to the number of active workers); it does not define a bucket-to-node mapping
            bucketToNode = createArbitraryBucketToNode(
                    nodeScheduler.createNodeSelector(session, Optional.of(catalogName)).allNodes(),
                    connectorBucketNodeMap.getBucketCount());
        }
        // The bucket -> worker mapping was built above; now build the bucket -> partition relationship.
        // Create an array sized to the number of buckets, where bucketToPartition[i] holds the corresponding partition id
        int[] bucketToPartition = new int[connectorBucketNodeMap.getBucketCount()];
        // a BiMap keeps both keys and values unique, which means each worker node maps to exactly one partition
        BiMap<InternalNode, Integer> nodeToPartition = HashBiMap.create();
        int nextPartitionId = 0; // initial value
        for (int bucket = 0; bucket < bucketToNode.size(); bucket++) {
            InternalNode node = bucketToNode.get(bucket);
            // bucketToNode may contain duplicate values, i.e. several buckets mapped to the same worker node
            Integer partitionId = nodeToPartition.get(node);
            if (partitionId == null) {
                // If no partitionId exists yet, we found a new worker, so allocate the next partition id.
                // It is easy to see that inside Trino, one worker node is one partition.
                partitionId = nextPartitionId++;
                nodeToPartition.put(node, partitionId);
            }
            // record the bucket id -> partition id mapping
            bucketToPartition[bucket] = partitionId;
        }
        // collect the worker nodes, ordered by partition id
        List<InternalNode> partitionToNode = IntStream.range(0, nodeToPartition.size())
                .mapToObj(partitionId -> nodeToPartition.inverse().get(partitionId))
                .collect(toImmutableList());
        // return the instance
        return new NodePartitionMap(partitionToNode, bucketToPartition, getSplitToBucket(session, partitioningHandle));
    }

Creating the StageScheduler instance, responsible for scheduling a single stage

    private static class PipelinedDistributedStagesScheduler
            implements DistributedStagesScheduler
    {
        private static StageScheduler createStageScheduler(
                QueryStateMachine queryStateMachine,
                PipelinedStageExecution stageExecution,
                SplitSourceFactory splitSourceFactory,
                List<PipelinedStageExecution> childStageExecutions,
                Function<PartitioningHandle, NodePartitionMap> partitioningCache,
                NodeScheduler nodeScheduler,
                NodePartitioningManager nodePartitioningManager,
                int splitBatchSize,
                DynamicFilterService dynamicFilterService,
                ScheduledExecutorService executor,
                TableExecuteContextManager tableExecuteContextManager)
        {
            Session session = queryStateMachine.getSession();
            PlanFragment fragment = stageExecution.getFragment();
            PartitioningHandle partitioningHandle = fragment.getPartitioning();
            // For the current fragment, try to create a SplitSource for each TableScanNode; it slices the source data into a series of splits.
            // For the Iceberg connector this means slicing DataFiles into batches of IcebergSplits.
            // Calls on the SplitSource interface are ultimately delegated to IcebergSplitSource.
            // Creating it also involves the SplitManager, which is not analyzed here.
            Map<PlanNodeId, SplitSource> splitSources = splitSourceFactory.createSplitSources(session, fragment);
            if (!splitSources.isEmpty()) {
                queryStateMachine.addStateChangeListener(new StateChangeListener<>()
                {
                    private final AtomicReference<Collection<SplitSource>> splitSourcesReference = new AtomicReference<>(splitSources.values());

                    @Override
                    public void stateChanged(QueryState newState)
                    {
                        if (newState.isDone()) {
                            // ensure split sources are closed and release memory
                            Collection<SplitSource> sources = splitSourcesReference.getAndSet(null);
                            if (sources != null) {
                                closeSplitSources(sources);
                            }
                        }
                    }
                });
            }
            if (partitioningHandle.equals(SOURCE_DISTRIBUTION)) {
                // If this PlanFragment's partitioning is SOURCE_DISTRIBUTION, the fragment is an upstream (leaf) SubPlan that loads data from the source
                // nodes are selected dynamically based on the constraints of the splits and the system load
                Entry<PlanNodeId, SplitSource> entry = getOnlyElement(splitSources.entrySet());
                PlanNodeId planNodeId = entry.getKey();
                SplitSource splitSource = entry.getValue();
                Optional<CatalogName> catalogName = Optional.of(splitSource.getCatalogName())
                        .filter(catalog -> !isInternalSystemConnector(catalog));
                NodeSelector nodeSelector = nodeScheduler.createNodeSelector(session, catalogName);
                // placementPolicy assigns each split a suitable worker node, based on the nodeSelector implementation
                SplitPlacementPolicy placementPolicy = new DynamicSplitPlacementPolicy(nodeSelector, stageExecution::getAllTasks);

                checkArgument(!fragment.getStageExecutionDescriptor().isStageGroupedExecution());
                // return an object that wraps a SourcePartitionedScheduler instance
                return newSourcePartitionedSchedulerAsStageScheduler(
                        stageExecution,
                        planNodeId,
                        splitSource,
                        placementPolicy,
                        splitBatchSize,
                        dynamicFilterService,
                        tableExecuteContextManager,
                        () -> childStageExecutions.stream().anyMatch(PipelinedStageExecution::isAnyTaskBlocked));
            }
            else if (partitioningHandle.equals(SCALED_WRITER_DISTRIBUTION)) {
                // ...
                return scheduler;
            }
            else {
                // Not a fragment containing only a table scan; a JOIN-style fragment, for example, has three possible shapes:
                //    left is Source, right is RemoteSource
                //    left is RemoteSource, right is RemoteSource
                //    left is Source, right is Source
                if (!splitSources.isEmpty()) {
                    // contains local source
                    List<PlanNodeId> schedulingOrder = fragment.getPartitionedSources();
                    Optional<CatalogName> catalogName = partitioningHandle.getConnectorId();
                    checkArgument(catalogName.isPresent(), "No connector ID for partitioning handle: %s", partitioningHandle);
                    List<ConnectorPartitionHandle> connectorPartitionHandles;
                    boolean groupedExecutionForStage = fragment.getStageExecutionDescriptor().isStageGroupedExecution();
                    // If a stage is marked grouped it must be partitioned, so a group can be equated with a bucket:
                    // the splits in one group all map to the same partition, and therefore to the same worker node
                    if (groupedExecutionForStage) {
                        connectorPartitionHandles = nodePartitioningManager.listPartitionHandles(session, partitioningHandle);
                        checkState(!ImmutableList.of(NOT_PARTITIONED).equals(connectorPartitionHandles));
                    }
                    else {
                        // not grouped execution
                        connectorPartitionHandles = ImmutableList.of(NOT_PARTITIONED);
                    }

                    BucketNodeMap bucketNodeMap;
                    List<InternalNode> stageNodeList;
                    if (fragment.getRemoteSourceNodes().stream().allMatch(node -> node.getExchangeType() == REPLICATE)) {
                        // no remote source
                        boolean dynamicLifespanSchedule = fragment.getStageExecutionDescriptor().isDynamicLifespanSchedule();
                        bucketNodeMap = nodePartitioningManager.getBucketNodeMap(session, partitioningHandle, dynamicLifespanSchedule);

                        // verify execution is consistent with planner's decision on dynamic lifespan schedule
                        verify(bucketNodeMap.isDynamic() == dynamicLifespanSchedule);
                        // All remote sources are REPLICATE (broadcast), so every available worker node is eligible
                        // to run this stage; shuffle the full node list
                        stageNodeList = new ArrayList<>(nodeScheduler.createNodeSelector(session, catalogName).allNodes());
                        Collections.shuffle(stageNodeList);
                    }
                    else {
                        // cannot use dynamic lifespan schedule
                        verify(!fragment.getStageExecutionDescriptor().isDynamicLifespanSchedule());

                        // remote source requires nodePartitionMap
                        NodePartitionMap nodePartitionMap = partitioningCache.apply(partitioningHandle);
                        if (groupedExecutionForStage) {
                            // A grouped stage needs M distinct ConnectorPartitionHandle instances for computing bucket ids,
                            // where M == the number of buckets, to guarantee every bucket id maps to a different partition.
                            checkState(connectorPartitionHandles.size() == nodePartitionMap.getBucketToPartition().length);
                        }
                        stageNodeList = nodePartitionMap.getPartitionToNode();
                        bucketNodeMap = nodePartitionMap.asBucketNodeMap();
                    }
                    // Here the number of buckets is fixed, so the number of source partitions is fixed as well; create a FixedSourcePartitionedScheduler
                    return new FixedSourcePartitionedScheduler(
                            stageExecution,
                            splitSources,
                            fragment.getStageExecutionDescriptor(),
                            schedulingOrder,
                            stageNodeList,
                            bucketNodeMap,
                            splitBatchSize,
                            getConcurrentLifespansPerNode(session),
                            nodeScheduler.createNodeSelector(session, catalogName),
                            connectorPartitionHandles,
                            dynamicFilterService,
                            tableExecuteContextManager);
                }
                else {
                    // all sources are remote
                    // All plan nodes are RemoteSources, i.e. the data to read comes from the upstream stages'
                    // OutputBuffers, so the number of partitions is dictated by the upstream and the number of
                    // tasks to create for this stage is fixed as well.
                    // For example, when #buckets == #partitions == #nodes, tasks for different partitions of
                    // the same stage go to different worker nodes; but with more buckets than nodes, several
                    // partitions of one stage may run on the same node at once
                    NodePartitionMap nodePartitionMap = partitioningCache.apply(partitioningHandle);
                    List<InternalNode> partitionToNode = nodePartitionMap.getPartitionToNode();
                    // todo this should asynchronously wait a standard timeout period before failing
                    checkCondition(!partitionToNode.isEmpty(), NO_NODES_AVAILABLE, "No worker nodes available");
                    return new FixedCountScheduler(stageExecution, partitionToNode);
                }
            }
        }
    }

Scheduling by the DistributedStagesScheduler

    private static class PipelinedDistributedStagesScheduler
            implements DistributedStagesScheduler
    {
        @Override
        public void schedule()
        {
            // scheduling starts; guard against schedule() being invoked twice
            checkState(started.compareAndSet(false, true), "already started");

            try (SetThreadName ignored = new SetThreadName("Query-%s", queryStateMachine.getQueryId())) {
                while (!executionSchedule.isFinished()) {
                    List<ListenableFuture<Void>> blockedStages = new ArrayList<>();
                    // Get the stages to schedule; under the default policy all stages are scheduled to run at once, regardless of inter-stage dependencies
                    for (PipelinedStageExecution stageExecution : executionSchedule.getStagesToSchedule()) {
                        // the StageExecution instance drives the scheduling of its bound stage
                        stageExecution.beginScheduling();

                        // perform some scheduling work, asynchronously
                        ScheduleResult result = stageSchedulers.get(stageExecution.getStageId())
                                .schedule();

                        // modify parent and children based on the results of the scheduling
                        if (result.isFinished()) {
                            // the stage has finished scheduling, so mark scheduling complete
                            stageExecution.schedulingComplete();
                        }
                        else if (!result.getBlocked().isDone()) {
                            // the stage is BLOCKED, possibly because upstream stages have produced no output yet
                            blockedStages.add(result.getBlocked());
                        }
                        schedulerStats.getSplitsScheduledPerIteration().add(result.getSplitsScheduled());
                        if (result.getBlockedReason().isPresent()) {
                            switch (result.getBlockedReason().get()) {
                                case WRITER_SCALING:
                                    // no-op
                                    break;
                                case WAITING_FOR_SOURCE:
                                    schedulerStats.getWaitingForSource().update(1);
                                    break;
                                case SPLIT_QUEUES_FULL:
                                    schedulerStats.getSplitQueuesFull().update(1);
                                    break;
                                case MIXED_SPLIT_QUEUES_FULL_AND_WAITING_FOR_SOURCE:
                                case NO_ACTIVE_DRIVER_GROUP:
                                    break;
                                default:
                                    throw new UnsupportedOperationException("Unknown blocked reason: " + result.getBlockedReason().get());
                            }
                        }
                    }

                    // wait for a state change and then schedule again; if some stages are still BLOCKED, bound the wait with a timeout
                    if (!blockedStages.isEmpty()) {
                        try (TimeStat.BlockTimer timer = schedulerStats.getSleepTime().time()) {
                            tryGetFutureValue(whenAnyComplete(blockedStages), 1, SECONDS);
                        }
                        for (ListenableFuture<Void> blockedStage : blockedStages) {
                            blockedStage.cancel(true);
                        }
                    }
                }

                for (PipelinedStageExecution stageExecution : stageExecutions.values()) {
                    PipelinedStageExecution.State state = stageExecution.getState();
                    if (state != SCHEDULED && state != RUNNING && state != FLUSHING && !state.isDone()) {
                        throw new TrinoException(GENERIC_INTERNAL_ERROR, format("Scheduling is complete, but stage %s is in state %s", stageExecution.getStageId(), state));
                    }
                }
            }
            catch (Throwable t) {
                fail(t, Optional.empty());
            }
            finally {
                RuntimeException closeError = new RuntimeException();
                for (StageScheduler scheduler : stageSchedulers.values()) {
                    try {
                        scheduler.close();
                    }
                    catch (Throwable t) {
                        fail(t, Optional.empty());
                        // Self-suppression not permitted
                        if (closeError != t) {
                            closeError.addSuppressed(t);
                        }
                    }
                }
            }
        }
    }
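
One detail worth pausing on: the loop above waits at most one second for any blocked stage to make progress, then cancels the futures it waited on. In Trino those futures are wrapped with nonCancellationPropagating (as seen later in SourcePartitionedScheduler), so the cancel merely detaches stale listeners rather than cancelling the stages themselves. Below is a standalone sketch of the same bounded-wait-then-detach pattern, using plain java.util.concurrent instead of airlift's MoreFutures:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedWaitSketch
{
    public static void main(String[] args) throws Exception
    {
        CompletableFuture<Void> stageA = new CompletableFuture<>(); // stays blocked
        CompletableFuture<Void> stageB = CompletableFuture.runAsync(() -> {}); // unblocks almost immediately
        List<CompletableFuture<Void>> blockedStages = List.of(stageA, stageB);

        try {
            // wait at most 1 second for ANY blocked stage to make progress
            CompletableFuture.anyOf(blockedStages.toArray(new CompletableFuture[0])).get(1, TimeUnit.SECONDS);
        }
        catch (TimeoutException ignored) {
            // no progress within the window: re-run the scheduling loop anyway
        }

        // drop our interest in the remaining futures so listeners don't accumulate across iterations
        // (Trino cancels nonCancellationPropagating wrappers, so the underlying stages are unaffected)
        blockedStages.forEach(stage -> stage.cancel(true));
    }
}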

FixedSourcePartitionedScheduler::schedule()

GROUP_WIDE: the split's lifecycle is scoped to a task group. Such splits are produced only by intermediate stages/PlanFragments and read upstream data that has already been partitioned (so a group can simply be thought of as one data partition). When multiple SourceSchedulers are scheduling splits, scheduling within a single source proceeds in GroupId order, and only one group's splits may be scheduled at a time; other SourceSchedulers, however, can schedule different groups in parallel.

TASK_WIDE: the split's lifecycle is scoped to the task; for now, think of it as the lifecycle of the splits in a table-scan stage. SourceSchedulers do not affect one another; each decides whether to schedule new splits based solely on the remaining resources of the current SqlTask.
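
To make the two lifecycles concrete, here is a minimal, self-contained model (hypothetical types, not Trino's real Lifespan class) of how splits map onto a scheduling scope under each mode:

import java.util.List;

// Illustrative sketch only: under TASK_WIDE all splits of a source share one task-level scope,
// while under GROUP_WIDE each bucket (i.e. each data partition) gets its own driver group.
public class LifespanModel
{
    record Split(String source, int bucket) {}

    static String taskWideScope(Split split)
    {
        return "task-wide"; // every split shares the task's lifecycle
    }

    static String groupWideScope(Split split)
    {
        return "group-" + split.bucket(); // one driver group per bucket/partition
    }

    public static void main(String[] args)
    {
        List<Split> splits = List.of(new Split("lineitem", 0), new Split("lineitem", 1), new Split("orders", 1));
        for (Split split : splits) {
            System.out.printf("%s -> %s / %s%n", split, taskWideScope(split), groupWideScope(split));
        }
    }
}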

public class FixedSourcePartitionedScheduler
        implements StageScheduler
{
    @Override
    public ScheduleResult schedule()
    {
        // schedule a task on every node in the distribution
        List<RemoteTask> newTasks = ImmutableList.of();
        if (scheduledTasks.isEmpty()) {
            // if this stage has not scheduled any tasks yet, create a RemoteTask for every partition
            ImmutableList.Builder<RemoteTask> newTasksBuilder = ImmutableList.builder();
            for (InternalNode node : nodes) { // iterate over all worker nodes available to this stage
                // each node corresponds to exactly one partition
                Optional<RemoteTask> task = stageExecution.scheduleTask(node, partitionIdAllocator.getNextId(), ImmutableMultimap.of(), ImmutableMultimap.of());
                if (task.isPresent()) {
                    scheduledTasks.put(node, task.get());
                    newTasksBuilder.add(task.get());
                }
            }
            newTasks = newTasksBuilder.build();
        }

        boolean allBlocked = true;
        List<ListenableFuture<Void>> blocked = new ArrayList<>();
        BlockedReason blockedReason = BlockedReason.NO_ACTIVE_DRIVER_GROUP;

        if (groupedLifespanScheduler.isPresent()) {
            // Start new driver groups on the first scheduler if necessary,
            // i.e. when previous ones have finished execution (not finished scheduling).
            //
            // Invoke schedule method to get a new SettableFuture every time.
            // Reusing previously returned SettableFuture could lead to the ListenableFuture retaining too many listeners.
            blocked.add(groupedLifespanScheduler.get().schedule(sourceSchedulers.get(0)));
        }

        int splitsScheduled = 0;
        // sourceSchedulers holds one scheduler per source (SourcePartitionedScheduler instances); each schedules its own splits
        Iterator<SourceScheduler> schedulerIterator = sourceSchedulers.iterator();
        List<Lifespan> driverGroupsToStart = ImmutableList.of();
        boolean shouldInvokeNoMoreDriverGroups = false;
        while (schedulerIterator.hasNext()) {
            SourceScheduler sourceScheduler = schedulerIterator.next();
            // With grouped scheduling, splits are scheduled group by group: only after one SourceScheduler has
            // finished scheduling a group's splits may the next SourceScheduler schedule that group, while the
            // other groups stay BLOCKED; meanwhile the first SourceScheduler moves on to scheduling other groups
            for (Lifespan lifespan : driverGroupsToStart) {
                sourceScheduler.startLifespan(lifespan, partitionHandleFor(lifespan));
            }
            if (shouldInvokeNoMoreDriverGroups) {
                sourceScheduler.noMoreLifespans();
            }
            // invokes the underlying SourcePartitionedScheduler::schedule() (see the next section)
            ScheduleResult schedule = sourceScheduler.schedule();
            // accumulate the total number of splits scheduled for this stage
            splitsScheduled += schedule.getSplitsScheduled();
            if (schedule.getBlockedReason().isPresent()) {
                blocked.add(schedule.getBlocked());
                blockedReason = blockedReason.combineWith(schedule.getBlockedReason().get());
            }
            else {
                verify(schedule.getBlocked().isDone(), "blockedReason not provided when scheduler is blocked");
                allBlocked = false;
            }
            // if the SourceScheduler is an AsGroupedSourceScheduler, drainCompletedLifespans() always returns the corresponding Lifespan objects
            driverGroupsToStart = sourceScheduler.drainCompletedLifespans();

            if (schedule.isFinished()) {
                stageExecution.schedulingComplete(sourceScheduler.getPlanNodeId());
                schedulerIterator.remove();
                sourceScheduler.close();
                shouldInvokeNoMoreDriverGroups = true;
            }
            else {
                shouldInvokeNoMoreDriverGroups = false;
            }
        }

        if (allBlocked) {
            // every SourcePartitionedScheduler is BLOCKED, so report the blocked state
            return new ScheduleResult(sourceSchedulers.isEmpty(), newTasks, whenAnyComplete(blocked), blockedReason, splitsScheduled);
        }
        else {
            // some SourcePartitionedScheduler is still running, so just report the splits scheduled so far
            return new ScheduleResult(sourceSchedulers.isEmpty(), newTasks, splitsScheduled);
        }
    }
}
SourcePartitionedScheduler::schedule()

The lowest-level split scheduler, responsible for scheduling and executing SOURCE-type stages.

public class SourcePartitionedScheduler
        implements SourceScheduler
{
    @Override
    public synchronized ScheduleResult schedule()
    {
        dropListenersFromWhenFinishedOrNewLifespansAdded();

        int overallSplitAssignmentCount = 0;
        ImmutableSet.Builder<RemoteTask> overallNewTasks = ImmutableSet.builder();
        List<ListenableFuture<?>> overallBlockedFutures = new ArrayList<>();
        boolean anyBlockedOnPlacements = false;
        boolean anyBlockedOnNextSplitBatch = false;
        boolean anyNotBlocked = false;
        // iterate over every ScheduleGroup; each ScheduleGroup corresponds to one Lifespan (i.e. one driver group)
        for (Entry<Lifespan, ScheduleGroup> entry : scheduleGroups.entrySet()) {
            Lifespan lifespan = entry.getKey();
            ScheduleGroup scheduleGroup = entry.getValue();
            Set<Split> pendingSplits = scheduleGroup.pendingSplits;

            if (scheduleGroup.state == ScheduleGroupState.NO_MORE_SPLITS || scheduleGroup.state == ScheduleGroupState.DONE) {
                verify(scheduleGroup.nextSplitBatchFuture == null);
            }
            else if (pendingSplits.isEmpty()) {
                // try to get the next batch
                if (scheduleGroup.nextSplitBatchFuture == null) {
                    // In practice this goes through IcebergConnectorSplitSource to fetch the next batch of
                    // splits to schedule; note that splitBatchSize - pendingSplits.size() caps how many splits
                    // may be fetched at once
                    scheduleGroup.nextSplitBatchFuture = splitSource.getNextBatch(scheduleGroup.partitionHandle, lifespan, splitBatchSize - pendingSplits.size());

                    long start = System.nanoTime();
                    addSuccessCallback(scheduleGroup.nextSplitBatchFuture, () -> stageExecution.recordGetSplitTime(start));
                }

                if (scheduleGroup.nextSplitBatchFuture.isDone()) {
                    // the nextSplitBatchFuture completing means the splits are in hand, so they can be scheduled right away
                    SplitBatch nextSplits = getFutureValue(scheduleGroup.nextSplitBatchFuture);
                    scheduleGroup.nextSplitBatchFuture = null;
                    // add all fetched splits to the pending queue
                    pendingSplits.addAll(nextSplits.getSplits());
                    if (nextSplits.isLastBatch()) {
                        // This is the last batch of splits to schedule. If the source produced no splits at
                        // all, an EmptySplit is enqueued so that drivers still get instantiated on the worker
                        if (scheduleGroup.state == ScheduleGroupState.INITIALIZED && pendingSplits.isEmpty()) {
                            // Add an empty split in case no splits have been produced for the source.
                            // For source operators, they never take input, but they may produce output.
                            // This is well handled by the execution engine.
                            // However, there are certain non-source operators that may produce output without any input,
                            // for example, 1) an AggregationOperator, 2) a HashAggregationOperator where one of the grouping sets is ().
                            // Scheduling an empty split kicks off necessary driver instantiation to make this work.
                            pendingSplits.add(new Split(
                                    splitSource.getCatalogName(),
                                    new EmptySplit(splitSource.getCatalogName()),
                                    lifespan));
                        }
                        // tell this schedule group that there is nothing more to schedule
                        scheduleGroup.state = ScheduleGroupState.NO_MORE_SPLITS;
                    }
                }
                else {
                    overallBlockedFutures.add(scheduleGroup.nextSplitBatchFuture);
                    anyBlockedOnNextSplitBatch = true;
                    continue;
                }
            }

            Multimap<InternalNode, Split> splitAssignment = ImmutableMultimap.of();
            if (!pendingSplits.isEmpty()) {
                if (!scheduleGroup.placementFuture.isDone()) {
                    anyBlockedOnPlacements = true;
                    continue;
                }

                if (scheduleGroup.state == ScheduleGroupState.INITIALIZED) {
                    scheduleGroup.state = ScheduleGroupState.SPLITS_ADDED;
                }
                if (state == State.INITIALIZED) {
                    state = State.SPLITS_ADDED;
                }

                // calculate placements for splits, i.e. decide which worker node each split is sent to
                SplitPlacementResult splitPlacementResult = splitPlacementPolicy.computeAssignments(pendingSplits);
                splitAssignment = splitPlacementResult.getAssignments();

                // remove splits with successful placements
                splitAssignment.values().forEach(pendingSplits::remove); // AbstractSet.removeAll performs terribly here.
                overallSplitAssignmentCount += splitAssignment.size();

                // if not completely placed, mark scheduleGroup as blocked on placement
                if (!pendingSplits.isEmpty()) {
                    scheduleGroup.placementFuture = splitPlacementResult.getBlocked();
                    overallBlockedFutures.add(scheduleGroup.placementFuture);
                    anyBlockedOnPlacements = true;
                }
            }

            // if no new splits will be assigned, update state and attach completion event
            Multimap<InternalNode, Lifespan> noMoreSplitsNotification = ImmutableMultimap.of();
            if (pendingSplits.isEmpty() && scheduleGroup.state == ScheduleGroupState.NO_MORE_SPLITS) {
                scheduleGroup.state = ScheduleGroupState.DONE;
                if (!lifespan.isTaskWide()) {
                    InternalNode node = ((BucketedSplitPlacementPolicy) splitPlacementPolicy).getNodeForBucket(lifespan.getId());
                    noMoreSplitsNotification = ImmutableMultimap.of(node, lifespan);
                }
            }

            // assign the splits with successful placements
            overallNewTasks.addAll(assignSplits(splitAssignment, noMoreSplitsNotification));

            // Assert that "placement future is not done" implies "pendingSplits is not empty".
            // The other way around is not true. One obvious reason is (un)lucky timing, where the placement is unblocked between `computeAssignments` and this line.
            // However, there are other reasons that could lead to this.
            // Note that `computeAssignments` is quite broken:
            // 1. It always returns a completed future when there are no tasks, regardless of whether all nodes are blocked.
            // 2. The returned future will only be completed when a node with an assigned task becomes unblocked. Other nodes don't trigger future completion.
            // As a result, to avoid busy loops caused by 1, we check pendingSplits.isEmpty() instead of placementFuture.isDone() here.
            if (scheduleGroup.nextSplitBatchFuture == null && scheduleGroup.pendingSplits.isEmpty() && scheduleGroup.state != ScheduleGroupState.DONE) {
                anyNotBlocked = true;
            }
        }

        // * `splitSource.isFinished` invocation may fail after `splitSource.close` has been invoked.
        //   If state is NO_MORE_SPLITS/FINISHED, splitSource.isFinished has previously returned true, and splitSource is closed now.
        // * Even if `splitSource.isFinished()` return true, it is not necessarily safe to tear down the split source.
        //   * If anyBlockedOnNextSplitBatch is true, it means we have not checked out the recently completed nextSplitBatch futures,
        //     which may contain recently published splits. We must not ignore those.
        //   * If any scheduleGroup is still in DISCOVERING_SPLITS state, it means it hasn't realized that there will be no more splits.
        //     Next time it invokes getNextBatch, it will realize that. However, the invocation will fail if we tear down splitSource now.
        if ((state == State.NO_MORE_SPLITS || state == State.FINISHED) || (noMoreScheduleGroups && scheduleGroups.isEmpty() && splitSource.isFinished())) {
            switch (state) {
                case INITIALIZED:
                    // We have not scheduled a single split so far.
                    // But this shouldn't be possible. See usage of EmptySplit in this method.
                    throw new IllegalStateException("At least 1 split should have been scheduled for this plan node");
                case SPLITS_ADDED:
                    state = State.NO_MORE_SPLITS;

                    Optional<List<Object>> tableExecuteSplitsInfo = splitSource.getTableExecuteSplitsInfo();

                    // Here we assume that we can get non-empty tableExecuteSplitsInfo only for queries which facilitate single split source.
                    // TODO support grouped execution
                    tableExecuteSplitsInfo.ifPresent(info -> {
                        TableExecuteContext tableExecuteContext = tableExecuteContextManager.getTableExecuteContextForQuery(stageExecution.getStageId().getQueryId());
                        tableExecuteContext.setSplitsInfo(info);
                    });

                    splitSource.close();
                    // fall through
                case NO_MORE_SPLITS:
                    state = State.FINISHED;
                    whenFinishedOrNewLifespanAdded.set(null);
                    // fall through
                case FINISHED:
                    splitSource.getMetrics().ifPresent(stageExecution::updateConnectorMetrics);
                    return new ScheduleResult(
                            true,
                            overallNewTasks.build(),
                            overallSplitAssignmentCount);
            }
            throw new IllegalStateException("Unknown state");
        }

        if (anyNotBlocked) {
            return new ScheduleResult(false, overallNewTasks.build(), overallSplitAssignmentCount);
        }

        if (anyBlockedOnNextSplitBatch
                && scheduledTasks.isEmpty()
                && dynamicFilterService.isCollectingTaskNeeded(stageExecution.getStageId().getQueryId(), stageExecution.getFragment())) {
            // schedule a task for collecting dynamic filters in case probe split generator is waiting for them
            createTaskOnRandomNode().ifPresent(overallNewTasks::add);
        }

        boolean anySourceTaskBlocked = this.anySourceTaskBlocked.getAsBoolean();
        if (anySourceTaskBlocked) {
            // Dynamic filters might not be collected due to build side source tasks being blocked on full buffer.
            // In such case probe split generation that is waiting for dynamic filters should be unblocked to prevent deadlock.
            dynamicFilterService.unblockStageDynamicFilters(stageExecution.getStageId().getQueryId(), stageExecution.getAttemptId(), stageExecution.getFragment());
        }

        if (groupedExecution) {
            overallNewTasks.addAll(finalizeTaskCreationIfNecessary());
        }
        else if (anyBlockedOnPlacements && anySourceTaskBlocked) {
            // In a broadcast join, output buffers of the tasks in build source stage have to
            // hold onto all data produced before probe side task scheduling finishes,
            // even if the data is acknowledged by all known consumers. This is because
            // new consumers may be added until the probe side task scheduling finishes.
            //
            // As a result, the following line is necessary to prevent deadlock
            // due to neither build nor probe can make any progress.
            // The build side blocks due to a full output buffer.
            // In the meantime the probe side split cannot be consumed since
            // builder side hash table construction has not finished.
            overallNewTasks.addAll(finalizeTaskCreationIfNecessary());
        }

        ScheduleResult.BlockedReason blockedReason;
        if (anyBlockedOnNextSplitBatch) {
            blockedReason = anyBlockedOnPlacements ? MIXED_SPLIT_QUEUES_FULL_AND_WAITING_FOR_SOURCE : WAITING_FOR_SOURCE;
        }
        else {
            blockedReason = anyBlockedOnPlacements ? SPLIT_QUEUES_FULL : NO_ACTIVE_DRIVER_GROUP;
        }

        overallBlockedFutures.add(whenFinishedOrNewLifespanAdded);
        return new ScheduleResult(
                false,
                overallNewTasks.build(),
                nonCancellationPropagating(asVoid(whenAnyComplete(overallBlockedFutures))),
                blockedReason,
                overallSplitAssignmentCount);
    }
}

Creating the SqlTask

A SqlTask runs on a worker node; each SqlTask corresponds to one partition of one stage and processes all of that partition's splits.
The client (the coordinator) requests the corresponding worker node to create the task instance via /v1/task/{taskId}.

    @ResourceSecurity(INTERNAL_ONLY)
    @POST
    @Path("{taskId}")
    @Consumes(MediaType.APPLICATION_JSON)
    @Produces(MediaType.APPLICATION_JSON)
    public void createOrUpdateTask(
            @PathParam("taskId") TaskId taskId,
            TaskUpdateRequest taskUpdateRequest,
            @Context UriInfo uriInfo,
            @Suspended AsyncResponse asyncResponse)
    {
        requireNonNull(taskUpdateRequest, "taskUpdateRequest is null");

        Session session = taskUpdateRequest.getSession().toSession(sessionPropertyManager, taskUpdateRequest.getExtraCredentials());

        if (injectFailure(session.getTraceToken(), taskId, RequestType.CREATE_OR_UPDATE_TASK, asyncResponse)) {
            return;
        }
        // create (or update) the task
        TaskInfo taskInfo = taskManager.updateTask(session,
                taskId,
                taskUpdateRequest.getFragment(),
                taskUpdateRequest.getSources(),
                taskUpdateRequest.getOutputIds(),
                taskUpdateRequest.getDynamicFilterDomains());

        if (shouldSummarize(uriInfo)) {
            taskInfo = taskInfo.summarize();
        }

        asyncResponse.resume(Response.ok().entity(taskInfo).build());
    }

SqlTaskManager::updateTask

public class SqlTaskManager
        implements TaskManager, Closeable
{
    private final LoadingCache<TaskId, SqlTask> tasks = CacheBuilder.newBuilder().build(CacheLoader.from(
                taskId -> createSqlTask(
                        taskId,
                        locationFactory.createLocalTaskLocation(taskId),
                        nodeInfo.getNodeId(),
                        queryContexts.getUnchecked(taskId.getQueryId()),
                        sqlTaskExecutionFactory,
                        taskNotificationExecutor,
                        sqlTask -> finishedTaskStats.merge(sqlTask.getIoStats()),
                        maxBufferSize,
                        maxBroadcastBufferSize,
                        failedTasks)));
                        
    @Override
    public TaskInfo updateTask(
            Session session,
            TaskId taskId,
            Optional<PlanFragment> fragment,
            List<TaskSource> sources,
            OutputBuffers outputBuffers,
            Map<DynamicFilterId, Domain> dynamicFilterDomains)
    {
        try {
            return versionEmbedder.embedVersion(() -> doUpdateTask(session, taskId, fragment, sources, outputBuffers, dynamicFilterDomains)).call();
        }
        catch (Exception e) {
            throwIfUnchecked(e);
            // impossible, doUpdateTask does not throw checked exceptions
            throw new RuntimeException(e);
        }
    }

    private TaskInfo doUpdateTask(
            Session session,
            TaskId taskId,
            Optional<PlanFragment> fragment,
            List<TaskSource> sources,
            OutputBuffers outputBuffers,
            Map<DynamicFilterId, Domain> dynamicFilterDomains)
    {
        requireNonNull(session, "session is null");
        requireNonNull(taskId, "taskId is null");
        requireNonNull(fragment, "fragment is null");
        requireNonNull(sources, "sources is null");
        requireNonNull(outputBuffers, "outputBuffers is null");

        SqlTask sqlTask = tasks.getUnchecked(taskId); // loads the SqlTask, creating it on first access (see the LoadingCache above)
        QueryContext queryContext = sqlTask.getQueryContext();
        if (!queryContext.isMemoryLimitsInitialized()) {
            // initialize this query's per-node memory limits if they have not been set up yet
            long sessionQueryMaxMemoryPerNode = getQueryMaxMemoryPerNode(session).toBytes();
            long sessionQueryTotalMaxMemoryPerNode = getQueryMaxTotalMemoryPerNode(session).toBytes();
            // Session properties are only allowed to decrease memory limits, not increase them
            queryContext.initializeMemoryLimits(
                    resourceOvercommit(session),
                    min(sessionQueryMaxMemoryPerNode, queryMaxMemoryPerNode),
                    min(sessionQueryTotalMaxMemoryPerNode, queryMaxTotalMemoryPerNode));
        }
        // Record the SqlTask's heartbeat, which is simply the current system time.
        // Every lookup or update refreshes the heartbeat, so the last heartbeat time tells us whether the task has lost contact
        sqlTask.recordHeartbeat();
        // update the runtime parameters of the SqlTask instance
        return sqlTask.updateTask(session, fragment, sources, outputBuffers, dynamicFilterDomains);
    }
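
Note that updateTask never creates a task explicitly: tasks.getUnchecked(taskId) relies on Guava's LoadingCache, whose loader builds the SqlTask on first access and returns the cached instance afterwards. A minimal, self-contained demonstration of this create-on-first-read behavior (placeholder strings stand in for real SqlTask objects):

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

public class LoadingCacheDemo
{
    public static void main(String[] args)
    {
        LoadingCache<String, String> tasks = CacheBuilder.newBuilder()
                .build(CacheLoader.from(taskId -> {
                    System.out.println("creating task " + taskId);
                    return "SqlTask(" + taskId + ")";
                }));

        // the first access triggers the loader; the second returns the cached instance
        System.out.println(tasks.getUnchecked("20240101_000001_00000_abcde.1.0"));
        System.out.println(tasks.getUnchecked("20240101_000001_00000_abcde.1.0"));
    }
}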

Updating the SqlTask

The update path runs when the SqlTask instance is first created, and again whenever the coordinator dispatches new splits.

public class SqlTask
{
    /**
     * All parameters of this method come from the TaskUpdateRequest object sent by the client, so by the time
     * the worker-side execution is generated, this fragment's inputs (splits) and outputs (outputBuffers) are
     * already determined.
     * session: the client-side session information for this SQL execution
     * fragment: the logical plan fragment this SqlTask is to execute
     * sources: the split descriptors of the data sources this SqlTask reads; each split reads either a remote source or a table
     * outputBuffers: the SqlTask's output buffer queues, fixed when the PipelinedStageExecution instance was
     *     created on the coordinator side. There are three kinds:
     *     BroadcastOutputBufferManager: broadcasts the output data; there is a single partition and hence a single buffer
     *     ScaledOutputBufferManager: scales the number of output buffers dynamically, so the buffer count is not
     *                                fixed; used by data-writing tasks
     *     PartitionedOutputBufferManager: creates one output buffer per partition, so each buffer corresponds to
     *                                     one partition ID consumed by the downstream stage. This is the variant
     *                                     used in the execution flow analyzed here.
     */
    public TaskInfo updateTask(
            Session session,
            Optional<PlanFragment> fragment,
            List<TaskSource> sources,
            OutputBuffers outputBuffers,
            Map<DynamicFilterId, Domain> dynamicFilterDomains)
    {
        try {
            // trace token must be set first to make sure failure injection for getTaskResults requests works as expected
            session.getTraceToken().ifPresent(traceToken::set);

            // The LazyOutput buffer does not support write methods, so the actual
            // output buffer must be established before drivers are created (e.g.
            // a VALUES query).
            outputBuffer.setOutputBuffers(outputBuffers);

            // assure the task execution is only created once
            SqlTaskExecution taskExecution;
            synchronized (this) {
                // is task already complete?
                TaskHolder taskHolder = taskHolderReference.get();
                if (taskHolder.isFinished()) {
                    return taskHolder.getFinalTaskInfo();
                }
                taskExecution = taskHolder.getTaskExecution();
                if (taskExecution == null) {
                    checkState(fragment.isPresent(), "fragment must be present");
                    // create the SqlTaskExecution instance, responsible for planning and executing the fragment on this worker node
                    taskExecution = sqlTaskExecutionFactory.create(
                            session,
                            queryContext,
                            taskStateMachine,
                            outputBuffer,
                            fragment.get(),
                            this::notifyStatusChanged);
                    taskHolderReference.compareAndSet(taskHolder, new TaskHolder(taskExecution));
                    needsPlan.set(false);
                }
            }

            if (taskExecution != null) {
                // once a taskExecution instance exists, add the new source splits to its pending queues
                taskExecution.addSources(sources);
                // also update the dynamic-filter domains (value sets that can be used to filter data while processing splits)
                taskExecution.getTaskContext().addDynamicFilter(dynamicFilterDomains);
            }
        }
        catch (Error e) {
            failed(e);
            throw e;
        }
        catch (RuntimeException e) {
            failed(e);
        }

        return getTaskInfo();
    }
}
Creating the SqlTaskExecution

While this instance is being created, the truly executable physical plan, a LocalExecutionPlan, is generated.

public class SqlTaskExecutionFactory
{
    public SqlTaskExecution create(
            Session session,
            QueryContext queryContext,
            TaskStateMachine taskStateMachine,
            OutputBuffer outputBuffer,
            PlanFragment fragment,
            Runnable notifyStatusChanged)
    {
        // create the TaskContext instance, which tracks the SqlTask's runtime state, e.g. various metrics
        TaskContext taskContext = queryContext.addTaskContext(
                taskStateMachine,
                session,
                notifyStatusChanged,
                perOperatorCpuTimerEnabled,
                cpuTimerEnabled);

        LocalExecutionPlan localExecutionPlan;
        try (SetThreadName ignored = new SetThreadName("Task-%s", taskStateMachine.getTaskId())) {
            try {
                // planner is a LocalExecutionPlanner instance; it converts the logical PlanFragment into a
                // locally executable physical plan, the LocalExecutionPlan
                localExecutionPlan = planner.plan(
                        taskContext,
                        fragment.getRoot(),
                        TypeProvider.copyOf(fragment.getSymbols()),
                        fragment.getPartitioningScheme(),
                        fragment.getStageExecutionDescriptor(),
                        fragment.getPartitionedSources(),
                        outputBuffer);
            }
            catch (Throwable e) {
                // planning failed
                taskStateMachine.failed(e);
                throwIfUnchecked(e);
                throw new RuntimeException(e);
            }
        }
        return createSqlTaskExecution(
                taskStateMachine,
                taskContext,
                outputBuffer,
                localExecutionPlan,
                taskExecutor,
                taskNotificationExecutor,
                splitMonitor);
    }
}
LocalExecutionPlanner::plan

Converts the PlanFragment into a locally executable physical plan, the LocalExecutionPlan.

public class LocalExecutionPlanner
{
    public LocalExecutionPlan plan(
            TaskContext taskContext,
            PlanNode plan,
            TypeProvider types,
            PartitioningScheme partitioningScheme,
            StageExecutionDescriptor stageExecutionDescriptor,
            List<PlanNodeId> partitionedSourceOrder,
            OutputBuffer outputBuffer)
    {
        // get the output layout of this fragment
        List<Symbol> outputLayout = partitioningScheme.getOutputLayout();
        if (partitioningScheme.getPartitioning().getHandle().equals(FIXED_BROADCAST_DISTRIBUTION) ||
                partitioningScheme.getPartitioning().getHandle().equals(FIXED_ARBITRARY_DISTRIBUTION) ||
                partitioningScheme.getPartitioning().getHandle().equals(SCALED_WRITER_DISTRIBUTION) ||
                partitioningScheme.getPartitioning().getHandle().equals(SINGLE_DISTRIBUTION) ||
                partitioningScheme.getPartitioning().getHandle().equals(COORDINATOR_DISTRIBUTION)) {
            // these distribution types need no partition function, so take the plain TaskOutputFactory path (not the partitioned case analyzed here)
            return plan(taskContext, stageExecutionDescriptor, plan, outputLayout, types, partitionedSourceOrder, new TaskOutputFactory(outputBuffer));
        }

        // We can convert the symbols directly into channels, because the root must be a sink and therefore the layout is fixed
        List<Integer> partitionChannels;
        List<Optional<NullableValue>> partitionConstants;
        List<Type> partitionChannelTypes;
        if (partitioningScheme.getHashColumn().isPresent()) {
            partitionChannels = ImmutableList.of(outputLayout.indexOf(partitioningScheme.getHashColumn().get()));
            partitionConstants = ImmutableList.of(Optional.empty());
            partitionChannelTypes = ImmutableList.of(BIGINT);
        }
        else {
            // collect the indexes of the partitioning columns; constant partition values get -1
            partitionChannels = partitioningScheme.getPartitioning().getArguments().stream()
                    .map(argument -> {
                        if (argument.isConstant()) {
                            return -1;
                        }
                        return outputLayout.indexOf(argument.getColumn());
                    })
                    .collect(toImmutableList());
            // collect the constant partition values
            partitionConstants = partitioningScheme.getPartitioning().getArguments().stream()
                    .map(argument -> {
                        if (argument.isConstant()) {
                            return Optional.of(argument.getConstant());
                        }
                        return Optional.<NullableValue>empty();
                    })
                    .collect(toImmutableList());
            // collect the types of the partitioning columns
            partitionChannelTypes = partitioningScheme.getPartitioning().getArguments().stream()
                    .map(argument -> {
                        if (argument.isConstant()) {
                            return argument.getConstant().getType();
                        }
                        return types.get(argument.getColumn());
                    })
                    .collect(toImmutableList());
        }
        // Obtain the function that computes partition IDs; for a typical read query this is a
        // BucketPartitionFunction instance. PartitionFunction exposes getPartition(Page page, int position),
        // which computes the partition ID for a given row of a data page. The built-in strategies are:
        //    SINGLE: a single partition can only have one bucket
        //    HASH: HashBucketFunction; hashes the partitioning columns, then takes the hash modulo the number
        //          of partitions to get the row's partition ID
        //    ROUND_ROBIN: assigns rows to partition IDs in round-robin order
        PartitionFunction partitionFunction = nodePartitioningManager.getPartitionFunction(taskContext.getSession(), partitioningScheme, partitionChannelTypes);
        OptionalInt nullChannel = OptionalInt.empty();
        Set<Symbol> partitioningColumns = partitioningScheme.getPartitioning().getColumns();

        // partitioningColumns expected to have one column in the normal case, and zero columns when partitioning on a constant
        // For a constant partition no extra column is needed; when partitioning columns are specified, one
        // extra channel holds each row's partition value (e.g. the hash value under the HASH strategy)
        checkArgument(!partitioningScheme.isReplicateNullsAndAny() || partitioningColumns.size() <= 1);
        if (partitioningScheme.isReplicateNullsAndAny() && partitioningColumns.size() == 1) {
            nullChannel = OptionalInt.of(outputLayout.indexOf(getOnlyElement(partitioningColumns)));
        }
        
        return plan(
                taskContext,
                stageExecutionDescriptor,
                plan,
                outputLayout,
                types,
                partitionedSourceOrder,
                // Create a PartitionedOutputFactory, which provides the method for creating
                // PartitionedOutputOperator instances. The PartitionedOutputOperator is the last in this
                // PlanFragment's chain of operators: it computes each row's partition ID from the
                // partitioning columns of the data page and puts the data into the OutputBuffer
                operatorFactories.partitionedOutput(
                        taskContext,
                        partitionFunction,
                        partitionChannels,
                        partitionConstants,
                        partitioningScheme.isReplicateNullsAndAny(),
                        nullChannel,
                        outputBuffer,
                        maxPagePartitioningBufferSize));
    }

    public LocalExecutionPlan plan(
            TaskContext taskContext,
            StageExecutionDescriptor stageExecutionDescriptor,
            PlanNode plan, // the root node of the PlanFragment, i.e. the topmost node of this logical sub-plan
            List<Symbol> outputLayout, // the layout of this PlanFragment's output, i.e. the output column symbols
            TypeProvider types,        // describes the type of each symbol
            List<PlanNodeId> partitionedSourceOrder, // the PlanNodeIds of all partitioned sources; a JOIN, for example, has left and right sources
            OutputFactory outputOperatorFactory)
    {
        Session session = taskContext.getSession();
        // holds the runtime context of the local physical execution plan
        LocalExecutionPlanContext context = new LocalExecutionPlanContext(taskContext, types);
        // Visit the plan tree starting from the fragment's root node to build the physical execution plan tree, which is effectively a set of operators (a driver) applied to each split
        PhysicalOperation physicalOperation = plan.accept(new Visitor(session, stageExecutionDescriptor), context);
        // align the logical plan's outputLayout with the physical execution plan's output layout
        Function<Page, Page> pagePreprocessor = enforceLoadedLayoutProcessor(outputLayout, physicalOperation.getLayout());
        // collect the types of the logical plan's output columns
        List<Type> outputTypes = outputLayout.stream()
                .map(types::get)
                .collect(toImmutableList());
        // Create a new driver factory: chain the OutputOperator after physicalOperation into one physical
        // pipeline with a newly assigned PipelineId, where physicalOperation acts as the pipeline's source
        // operator and the OutputOperator as its output operator
        context.addDriverFactory(
                context.isInputDriver(),
                true, // marks the new driver factory as an output driver
                new PhysicalOperation(
                        outputOperatorFactory.createOutputOperator(
                                context.getNextOperatorId(),
                                plan.getId(),
                                outputTypes,
                                pagePreprocessor,
                                new PagesSerdeFactory(plannerContext.getBlockEncodingSerde(), isExchangeCompressionEnabled(session))),
                        physicalOperation),
                context.getDriverInstanceCount());

        // notify operator factories that planning has completed
        context.getDriverFactories().stream()
                .map(DriverFactory::getOperatorFactories)
                .flatMap(List::stream)
                .filter(LocalPlannerAware.class::isInstance)
                .map(LocalPlannerAware.class::cast)
                .forEach(LocalPlannerAware::localPlannerComplete);

        return new LocalExecutionPlan(context.getDriverFactories(), partitionedSourceOrder, stageExecutionDescriptor);
    }
}
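
To make the HASH strategy above concrete, here is a simplified stand-in for what a hash-based partition function computes. It uses plain Object.hashCode() instead of Trino's type-aware hashing, so treat it as a sketch of the idea rather than BucketPartitionFunction's real logic:

import java.util.List;

public class HashPartitionFunctionSketch
{
    private final int partitionCount;

    public HashPartitionFunctionSketch(int partitionCount)
    {
        this.partitionCount = partitionCount;
    }

    // compute the partition ID for one row, from the values of its partitioning columns
    public int getPartition(List<Object> partitionColumnValues)
    {
        long hash = 0;
        for (Object value : partitionColumnValues) {
            hash = hash * 31 + (value == null ? 0 : value.hashCode());
        }
        // mask the sign bit so the modulo result is a valid, non-negative partition ID
        return (int) ((hash & Long.MAX_VALUE) % partitionCount);
    }

    public static void main(String[] args)
    {
        HashPartitionFunctionSketch function = new HashPartitionFunctionSketch(8);
        System.out.println(function.getPartition(List.of("orderkey-42")));
    }
}
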
Constructing the SqlTaskExecution

    private SqlTaskExecution(
            TaskStateMachine taskStateMachine,
            TaskContext taskContext,
            OutputBuffer outputBuffer,
            LocalExecutionPlan localExecutionPlan,
            TaskExecutor taskExecutor,
            SplitMonitor splitMonitor,
            Executor notificationExecutor)
    {
        this.taskStateMachine = requireNonNull(taskStateMachine, "taskStateMachine is null");
        this.taskId = taskStateMachine.getTaskId();
        this.taskContext = requireNonNull(taskContext, "taskContext is null");
        this.outputBuffer = requireNonNull(outputBuffer, "outputBuffer is null");
        this.taskExecutor = requireNonNull(taskExecutor, "taskExecutor is null");
        this.notificationExecutor = requireNonNull(notificationExecutor, "notificationExecutor is null");
        this.splitMonitor = requireNonNull(splitMonitor, "splitMonitor is null");

        try (SetThreadName ignored = new SetThreadName("Task-%s", taskId)) {
            // index driver factories
            // collect the partitioned source node IDs from the execution plan
            Set<PlanNodeId> partitionedSources = ImmutableSet.copyOf(localExecutionPlan.getPartitionedSourceOrder());
            // holds all DriverSplitRunnerFactory instances whose lifecycle is split-scoped
            ImmutableMap.Builder<PlanNodeId, DriverSplitRunnerFactory> driverRunnerFactoriesWithSplitLifeCycle = ImmutableMap.builder();
            // holds all DriverSplitRunnerFactory instances whose lifecycle is task-scoped
            ImmutableList.Builder<DriverSplitRunnerFactory> driverRunnerFactoriesWithTaskLifeCycle = ImmutableList.builder();
            // holds all DriverSplitRunnerFactory instances whose lifecycle is driver-group-scoped
            ImmutableList.Builder<DriverSplitRunnerFactory> driverRunnerFactoriesWithDriverGroupLifeCycle = ImmutableList.builder();
            for (DriverFactory driverFactory : localExecutionPlan.getDriverFactories()) {
                // get the most upstream PlanNodeId of this driver
                Optional<PlanNodeId> sourceId = driverFactory.getSourceId();
                if (sourceId.isPresent() && partitionedSources.contains(sourceId.get())) {
                    // If this driver has input and it is a partitioned source node, the driver's lifecycle is
                    // bound to a split: once its bound split has been fully processed, the driver is done
                    driverRunnerFactoriesWithSplitLifeCycle.put(sourceId.get(), new DriverSplitRunnerFactory(driverFactory, true));
                }
                else {
                    // otherwise this driver is a downstream (non-source) driver instance
                    switch (driverFactory.getPipelineExecutionStrategy()) {
                        case GROUPED_EXECUTION:
                            // GROUPED lifespan: add a runner factory for this driver to the driver-group-scoped list
                            driverRunnerFactoriesWithDriverGroupLifeCycle.add(new DriverSplitRunnerFactory(driverFactory, false));
                            break;
                        case UNGROUPED_EXECUTION:
                            // UNGROUPED lifespan: the driver runs task-wide, so add its runner factory to the task-scoped list
                            driverRunnerFactoriesWithTaskLifeCycle.add(new DriverSplitRunnerFactory(driverFactory, false));
                            break;
                        default:
                            throw new UnsupportedOperationException();
                    }
                }
            }
            this.driverRunnerFactoriesWithSplitLifeCycle = driverRunnerFactoriesWithSplitLifeCycle.build();
            this.driverRunnerFactoriesWithDriverGroupLifeCycle = driverRunnerFactoriesWithDriverGroupLifeCycle.build();
            this.driverRunnerFactoriesWithTaskLifeCycle = driverRunnerFactoriesWithTaskLifeCycle.build();

            this.pendingSplitsByPlanNode = this.driverRunnerFactoriesWithSplitLifeCycle.keySet().stream()
                    .collect(toImmutableMap(identity(), ignore -> new PendingSplitsForPlanNode()));
            this.status = new Status(
                    taskContext,
                    localExecutionPlan.getDriverFactories().stream()
                            .collect(toImmutableMap(DriverFactory::getPipelineId, DriverFactory::getPipelineExecutionStrategy)));
            this.schedulingLifespanManager = new SchedulingLifespanManager(localExecutionPlan.getPartitionedSourceOrder(), localExecutionPlan.getStageExecutionDescriptor(), this.status);

            checkArgument(this.driverRunnerFactoriesWithSplitLifeCycle.keySet().equals(partitionedSources),
                    "Fragment is partitioned, but not all partitioned drivers were found");

            // Pre-register Lifespans for ungrouped partitioned drivers in case they end up get no splits.
            for (Entry<PlanNodeId, DriverSplitRunnerFactory> entry : this.driverRunnerFactoriesWithSplitLifeCycle.entrySet()) {
                PlanNodeId planNodeId = entry.getKey();
                DriverSplitRunnerFactory driverSplitRunnerFactory = entry.getValue();
                if (driverSplitRunnerFactory.getPipelineExecutionStrategy() == UNGROUPED_EXECUTION) {
                    this.schedulingLifespanManager.addLifespanIfAbsent(Lifespan.taskWide());
                    this.pendingSplitsByPlanNode.get(planNodeId).getLifespan(Lifespan.taskWide());
                }
            }

            // don't register the task if it is already completed (most likely failed during planning above)
            if (!taskStateMachine.getState().isDone()) {
                taskHandle = createTaskHandle(taskStateMachine, taskContext, outputBuffer, localExecutionPlan, taskExecutor);
            }
            else {
                taskHandle = null;
            }
            // add a listener that, once the OutputBuffer reaches the FINISHED state, checks whether this SqlTaskExecution has completed
            outputBuffer.addStateChangeListener(new CheckTaskCompletionOnBufferFinish(SqlTaskExecution.this));
        }
    }
SqlTaskExecution::addSources

The addSources() method enqueues new splits sent by the client (the coordinator) into the appropriate scheduling queues, by type, and then tries to schedule them.

public class SqlTaskExecution
{
    public void addSources(List<TaskSource> sources)
    {
        requireNonNull(sources, "sources is null");
        checkState(!Thread.holdsLock(this), "Cannot add sources while holding a lock on the %s", getClass().getSimpleName());

        try (SetThreadName ignored = new SetThreadName("Task-%s", taskId)) {
            // update our record of sources and schedule drivers for new partitioned splits
            // The returned updatedUnpartitionedSources map contains all unpartitioned sources that are not yet
            // finished; partitioned splits are scheduled via SqlTaskExecution::schedulePartitionedSource(..)
            Map<PlanNodeId, TaskSource> updatedUnpartitionedSources = updateSources(sources);

            // schedule all unpartitioned splits
            // tell existing drivers about the new splits; it is safe to update drivers
            // multiple times and out of order because sources contain full record of
            // the unpartitioned splits
            for (WeakReference<Driver> driverReference : drivers) {
                Driver driver = driverReference.get();
                // the driver can be GCed due to a failure or a limit
                if (driver == null) {
                    // remove the weak reference from the list to avoid a memory leak
                    // NOTE: this is a concurrent safe operation on a CopyOnWriteArrayList
                    drivers.remove(driverReference);
                    continue;
                }
                Optional<PlanNodeId> sourceId = driver.getSourceId();
                if (sourceId.isEmpty()) {
                    continue;
                }
                TaskSource sourceUpdate = updatedUnpartitionedSources.get(sourceId.get());
                if (sourceUpdate == null) {
                    continue;
                }
                driver.updateSource(sourceUpdate);
            }

            // we may have transitioned to no more splits, so check for completion
            checkTaskCompletion();
        }
    }
}
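
The drivers collection iterated above is a CopyOnWriteArrayList of WeakReferences: a driver torn down by a failure or a LIMIT becomes garbage-collectible, and pruning a cleared reference during iteration is safe because iteration works on a snapshot. A self-contained sketch of the same cleanup idiom:

import java.lang.ref.WeakReference;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class WeakDriverListSketch
{
    public static void main(String[] args)
    {
        List<WeakReference<Object>> drivers = new CopyOnWriteArrayList<>();
        drivers.add(new WeakReference<>(new Object()));         // immediately unreachable, may be collected
        drivers.add(new WeakReference<Object>("live driver"));  // strongly reachable, stays alive

        System.gc(); // best effort only: may or may not clear the unreachable referent

        for (WeakReference<Object> reference : drivers) {
            Object driver = reference.get();
            if (driver == null) {
                // prune cleared references; safe on a CopyOnWriteArrayList even mid-iteration
                drivers.remove(reference);
                continue;
            }
            System.out.println("updating " + driver);
        }
    }
}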

Scheduling & executing the SqlTask

SqlTaskExecution::schedulePartitionedSource

Each time the SqlTask receives new splits, it calls schedulePartitionedSource(TaskSource) to schedule them.

    private synchronized void schedulePartitionedSource(TaskSource sourceUpdate)
    {
        mergeIntoPendingSplits(sourceUpdate.getPlanNodeId(), sourceUpdate.getSplits(), sourceUpdate.getNoMoreSplitsForLifespan(), sourceUpdate.isNoMoreSplits());

        while (true) {
            // SchedulingLifespanManager tracks how far each Lifespan has been scheduled. Here is an example.
            // Let's say there are 4 source pipelines/nodes: A, B, C, and D, in scheduling order.
            // And we're processing 3 concurrent lifespans at a time. In this case, we could have
            //
            // * Lifespan 10:  A   B  [C]  D; i.e. Pipeline A and B has finished scheduling (but not necessarily finished running).
            // * Lifespan 20: [A]  B   C   D
            // * Lifespan 30:  A  [B]  C   D
            //
            // To recap, SchedulingLifespanManager records the next scheduling source node for each lifespan.
            // schedulingLifespanManager tracks two kinds of Lifespan:
            //   Task-wide: the split's scope corresponds to the split/task lifecycle. It is mutually exclusive
            //              with the grouped source pipelines: at any moment one task-wide pipeline and one
            //              group-wide pipeline can be scheduled and executed in parallel.
            //   Driver-group-wide: the split's scope corresponds to the driver group lifecycle. With multiple
            //              source pipelines, only one source pipeline may be scheduling/executing a given group
            //              (i.e. one partition) at a time; different groups can be scheduled in parallel.
            //
            // Get the lifespans that still need scheduling, and schedule the splits that belong to them.
            Iterator<SchedulingLifespan> activeLifespans = schedulingLifespanManager.getActiveLifespans();

            boolean madeProgress = false;

            while (activeLifespans.hasNext()) {
                SchedulingLifespan schedulingLifespan = activeLifespans.next();
                Lifespan lifespan = schedulingLifespan.getLifespan();

                // Continue using the example from above. Let's say the sourceUpdate adds some new splits for source node B.
                //
                // For lifespan 30, it could start new drivers and assign a pending split to each.
                // Pending splits could include both pre-existing pending splits, and the new ones from sourceUpdate.
                // If there is enough driver slots to deplete pending splits, one of the below would happen.
                // * If it is marked that all splits for node B in lifespan 30 has been received, SchedulingLifespanManager
                //   will be updated so that lifespan 30 now processes source node C. It will immediately start processing them.
                // * Otherwise, processing of lifespan 30 will be shelved for now.
                //
                // It is possible that the following loop would be a no-op for a particular lifespan.
                // It is also possible that a single lifespan can proceed through multiple source nodes in one run.
                //
                // When different drivers in the task has different pipelineExecutionStrategy, it adds additional complexity.
                // For example, when driver B is ungrouped and driver A, C, D is grouped, you could have something like this:
                //     TaskWide   :     [B]
                //     Lifespan 10:  A  [ ]  C   D
                //     Lifespan 20: [A]      C   D
                //     Lifespan 30:  A  [ ]  C   D
                // In this example, Lifespan 30 cannot start executing drivers in pipeline C because pipeline B
                // hasn't finished scheduling yet (albeit in a different lifespan).
                // Similarly, it wouldn't make sense for TaskWide to start executing drivers in pipeline B until at least
                // one lifespan has finished scheduling pipeline A.
                // This is why getSchedulingPlanNode returns an Optional.
                while (true) {
                    Optional<PlanNodeId> optionalSchedulingPlanNode = schedulingLifespan.getSchedulingPlanNode();
                    if (optionalSchedulingPlanNode.isEmpty()) {
                        break;
                    }
                    PlanNodeId schedulingPlanNode = optionalSchedulingPlanNode.get();
                    // The plan nodes stored in driverRunnerFactoriesWithSplitLifeCycle are source nodes, so
                    // their splits carry source data and must be repartitioned so that Trino can re-bucket the
                    // data, satisfying the PartitionId -> WorkerNode distribution strategy described earlier
                    DriverSplitRunnerFactory partitionedDriverRunnerFactory = driverRunnerFactoriesWithSplitLifeCycle.get(schedulingPlanNode);

                    PendingSplits pendingSplits = pendingSplitsByPlanNode.get(schedulingPlanNode).getLifespan(lifespan);

                    // Enqueue driver runners with driver group lifecycle for this driver life cycle, if not already enqueued.
                    if (!lifespan.isTaskWide() && !schedulingLifespan.getAndSetDriversForDriverGroupLifeCycleScheduled()) {
                        // Taken when the lifespan being scheduled is grouped and has not been scheduled yet:
                        // create DriverRunners for all of this SqlTask's pipelines.
                        // Here the SqlTask is the task instance for one partition of an intermediate
                        // stage/PlanFragment, so the splits belonging to this lifespan are already known and
                        // already partitioned; internally this calls enqueueDriverSplitRunner(true, runners)
                        // to start every DriverRunner right away
                        scheduleDriversForDriverGroupLifeCycle(lifespan);
                    }

                    // Enqueue driver runners with split lifecycle for this plan node and driver life cycle combination.
                    // If the lifespan is task-wide, these are leaf (TableScan) splits, so their runners cannot
                    // be started directly; they must be scheduled according to the current worker node's load
                    ImmutableList.Builder<DriverSplitRunner> runners = ImmutableList.builder();
                    for (ScheduledSplit scheduledSplit : pendingSplits.removeAllSplits()) {
                        // create a new driver for the split
                        runners.add(partitionedDriverRunnerFactory.createDriverRunner(scheduledSplit, lifespan));
                    }
                    enqueueDriverSplitRunner(false, runners.build());

                    // If all driver runners have been enqueued for this plan node and driver life cycle combination,
                    // move on to the next plan node.
                    if (pendingSplits.getState() != NO_MORE_SPLITS) {
                        break;
                    }
                    // no more splits will arrive at this point, so clean up within the current SqlTaskExecution instance
                    partitionedDriverRunnerFactory.noMoreDriverRunner(ImmutableList.of(lifespan));
                    pendingSplits.markAsCleanedUp();

                    schedulingLifespan.nextPlanNode();
                    madeProgress = true;
                    if (schedulingLifespan.isDone()) {
                        break;
                    }
                }
            }

            if (!madeProgress) {
                break;
            }
        }

        if (sourceUpdate.isNoMoreSplits()) {
            // notify the SchedulingLifespanManager that the plan node behind this TaskSource has finished its work
            schedulingLifespanManager.noMoreSplits(sourceUpdate.getPlanNodeId());
        }
    }

SqlTaskExecution::scheduleDriversForTaskLifeCycle

When a SqlTask receives its creation request, it creates a SqlTaskExecution as its execution entity, and once
the instance is constructed it calls scheduleDriversForTaskLifeCycle() to start scheduling.

    // scheduleDriversForTaskLifeCycle and scheduleDriversForDriverGroupLifeCycle are similar.
    // They are invoked under different circumstances, and schedules a disjoint set of drivers, as suggested by their names.
    // They also have a few differences, making it more convenient to keep the two methods separate.
    private void scheduleDriversForTaskLifeCycle()
    {
        // This method is called at the beginning of the task.
        // It schedules drivers for all the pipelines that have task life cycle.
        List<DriverSplitRunner> runners = new ArrayList<>();
        for (DriverSplitRunnerFactory driverRunnerFactory : driverRunnerFactoriesWithTaskLifeCycle) {
            for (int i = 0; i < driverRunnerFactory.getDriverInstances().orElse(1); i++) {
                runners.add(driverRunnerFactory.createDriverRunner(null, Lifespan.taskWide()));
            }
        }
        // driverRunnerFactoriesWithTaskLifeCycle holds the DriverSplitRunners of one partition of an intermediate stage/PlanFragment, so they can run immediately
        enqueueDriverSplitRunner(true, runners);
        for (DriverSplitRunnerFactory driverRunnerFactory : driverRunnerFactoriesWithTaskLifeCycle) {
            driverRunnerFactory.noMoreDriverRunner(ImmutableList.of(Lifespan.taskWide()));
            verify(driverRunnerFactory.isNoMoreDriverRunner());
        }
    }

SqlTaskExecution::scheduleDriversForDriverGroupLifeCycle

    private void scheduleDriversForDriverGroupLifeCycle(Lifespan lifespan)
    {
        // This method is called when a split that belongs to a previously unseen driver group is scheduled.
        // It schedules drivers for all the pipelines that have driver group life cycle.
        if (lifespan.isTaskWide()) {
            checkArgument(driverRunnerFactoriesWithDriverGroupLifeCycle.isEmpty(), "Instantiating pipeline of driver group lifecycle at task level is not allowed");
            return;
        }

        List<DriverSplitRunner> runners = new ArrayList<>();
        for (DriverSplitRunnerFactory driverSplitRunnerFactory : driverRunnerFactoriesWithDriverGroupLifeCycle) {
            for (int i = 0; i < driverSplitRunnerFactory.getDriverInstances().orElse(1); i++) {
                runners.add(driverSplitRunnerFactory.createDriverRunner(null, lifespan));
            }
        }
        // As in scheduleDriversForTaskLifeCycle, these DriverSplitRunners belong to an intermediate stage/PlanFragment and can run immediately
        enqueueDriverSplitRunner(true, runners);
        for (DriverSplitRunnerFactory driverRunnerFactory : driverRunnerFactoriesWithDriverGroupLifeCycle) {
            driverRunnerFactory.noMoreDriverRunner(ImmutableList.of(lifespan));
        }
    }

Executing the DriverSplitRunner

A DriverSplitRunner applies a chain of physical operators to a single split; it represents one complete pass of data processing, whose smallest unit of data is a Page.

A Pipeline is the description of a group of operators that can run on their own.

A Driver is an execution instance of a Pipeline, so one Pipeline can be executed by multiple Drivers in parallel.
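
A toy model of that Pipeline/Driver relationship (illustrative names only, not Trino's real Operator interfaces): one pipeline description, many driver instances, each pulling its own pages through the same operator chain:

import java.util.List;
import java.util.function.UnaryOperator;

public class PipelineModel
{
    // a "pipeline" is just the description: an ordered list of page transformations
    record Pipeline(List<UnaryOperator<String>> operators) {}

    // a "driver" is one executable instance of that description
    record Driver(Pipeline pipeline)
    {
        String process(String page)
        {
            String result = page;
            for (UnaryOperator<String> operator : pipeline.operators()) {
                result = operator.apply(result);
            }
            return result;
        }
    }

    public static void main(String[] args)
    {
        Pipeline pipeline = new Pipeline(List.of(page -> page + " -> scan", page -> page + " -> filter", page -> page + " -> output"));
        // the same pipeline can back many drivers running in parallel
        Driver driver1 = new Driver(pipeline);
        Driver driver2 = new Driver(pipeline);
        System.out.println(driver1.process("page-1"));
        System.out.println(driver2.process("page-2"));
    }
}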

Time slicing
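
A worker does not run a DriverSplitRunner to completion in one shot: Trino's TaskExecutor processes each split for a bounded time quantum (on the order of one second) and then re-queues it behind other pending work, so a long-running split cannot starve the rest. Below is a minimal sketch of such a quanta-based loop; it is illustrative only (the real TaskExecutor additionally maintains a multilevel priority queue and a pool of worker threads):

import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Queue;

public class TimeSliceLoopSketch
{
    interface SplitRunner
    {
        // run for roughly the given quantum; return true once the split is fully processed
        boolean processFor(Duration quantum);
    }

    static void run(Queue<SplitRunner> pending, Duration quantum)
    {
        while (!pending.isEmpty()) {
            SplitRunner split = pending.poll();
            if (!split.processFor(quantum)) {
                pending.add(split); // yield: unfinished splits go to the back of the queue
            }
        }
    }

    public static void main(String[] args)
    {
        Queue<SplitRunner> queue = new ArrayDeque<>();
        int[] quantaLeft = {3};
        queue.add(quantum -> --quantaLeft[0] <= 0); // finishes after three quanta
        run(queue, Duration.ofSeconds(1));
        System.out.println("all splits finished");
    }
}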
