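7. A Brief Look at Kafka's State Machines

  • This article is based on Kafka 2.5.
    The Kafka controller has two state machines: ReplicaStateMachine, which tracks replica state, and PartitionStateMachine, which tracks partition state. We look at each in turn below.

Entry point for starting the state machines: kafka.controller.KafkaController#onControllerFailover

Once a broker has won the controller election, it starts both state machines: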

private def onControllerFailover(): Unit = {
    // ... omitted
    replicaStateMachine.startup()
    partitionStateMachine.startup()
}

Replica state machine transition diagram

Figure 1

ReplicaStateMachine source analysis

The startup() code is as follows:

  def startup(): Unit = {
    info("Initializing replica state")
    //3.1 initialize the replica states
    initializeReplicaState()
    info("Triggering online replica state changes")
    val (onlineReplicas, offlineReplicas) = controllerContext.onlineAndOfflineReplicas
    //3.2 bring the online replicas up
    handleStateChanges(onlineReplicas.toSeq, OnlineReplica)
    info("Triggering offline replica state changes")
    //3.3 take the offline replicas out of service
    handleStateChanges(offlineReplicas.toSeq, OfflineReplica)
    debug(s"Started replica state machine with initial state -> ${controllerContext.replicaStates}")
  }

kafka.controller.ReplicaStateMachine#initializeReplicaState

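The logic here is simple: iterate over every replica of every partition of every topic and record it as OnlineReplica if it is currently online, or as ReplicaDeletionIneligible if it is not.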

  private def initializeReplicaState(): Unit = {
      controllerContext.allPartitions.foreach { partition =>
        val replicas = controllerContext.partitionReplicaAssignment(partition)
        replicas.foreach { replicaId =>
          val partitionAndReplica = PartitionAndReplica(partition, replicaId)
          // check that the broker is alive and that the map of replicas on offline log directories does not contain this partition
          if (controllerContext.isReplicaOnline(replicaId, partition)) {
            controllerContext.putReplicaState(partitionAndReplica, OnlineReplica)
          } else {
            // mark replicas on dead brokers as failed for topic deletion, if they belong to a topic to be deleted.
            // This is required during controller failover since during controller failover a broker can go down,
            // so the replicas on that broker should be moved to ReplicaDeletionIneligible to be on the safer side.
            controllerContext.putReplicaState(partitionAndReplica, ReplicaDeletionIneligible)
          }
        }
      }
    }
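
The classification above can be sketched without any controller machinery. This is a minimal illustration, not Kafka's code: `ReplicaInitSketch`, `Replica`, and `initialStates` are hypothetical names, and liveness is reduced to a set of live broker ids (the real `isReplicaOnline` also takes offline log directories into account).

```scala
object ReplicaInitSketch {
  sealed trait ReplicaState
  case object OnlineReplica extends ReplicaState
  case object ReplicaDeletionIneligible extends ReplicaState

  // Hypothetical stand-in for PartitionAndReplica.
  final case class Replica(topic: String, partition: Int, brokerId: Int)

  // Give every assigned replica an initial state based only on broker liveness.
  def initialStates(
      assignment: Map[(String, Int), Seq[Int]],
      liveBrokers: Set[Int]
  ): Map[Replica, ReplicaState] =
    assignment.flatMap { case ((topic, partition), brokers) =>
      brokers.map { brokerId =>
        val state: ReplicaState =
          if (liveBrokers.contains(brokerId)) OnlineReplica
          else ReplicaDeletionIneligible // dead broker: play it safe for topic deletion
        Replica(topic, partition, brokerId) -> state
      }
    }
}
```

Note that a replica on a dead broker starts out as ReplicaDeletionIneligible rather than OfflineReplica; the later handleStateChanges(offlineReplicas, OfflineReplica) call in startup() is what moves it to OfflineReplica.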

kafka.controller.ZkReplicaStateMachine#handleStateChanges

override def handleStateChanges(replicas: Seq[PartitionAndReplica], targetState: ReplicaState): Unit = {
    if (replicas.nonEmpty) {
      try {
        // start a new batch; this first verifies that no requests are left over from a previous batch
        controllerBrokerRequestBatch.newBatch()
        //3.2.1 apply the state change, one broker's replicas at a time (OnlineReplica in the startup path)
        replicas.groupBy(_.replica).foreach { case (replicaId, replicas) =>
          doHandleStateChanges(replicaId, replicas, targetState)
        }
        //3.2.2 send the batched requests to the brokers
        controllerBrokerRequestBatch.sendRequestsToBrokers(controllerContext.epoch)
      } catch {
        case e: ControllerMovedException =>
          error(s"Controller moved to another broker when moving some replicas to $targetState state", e)
          throw e
        case e: Throwable => error(s"Error while moving some replicas to $targetState state", e)
      }
    }
  }

Let's focus on doHandleStateChanges, the core method of the replica state machine.

The replica state machine has seven states in total; see Figure 2.
Figure 2
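
Since the figure is not reproduced here, the transition rules can be written down as a small lookup table. This is a sketch under assumptions: the set for OnlineReplica is the one quoted in the code comments of this article, while the other entries are recalled from the 2.5 sources and should be checked against `ReplicaState.validPreviousStates` before relying on them. `ReplicaTransitionSketch`, `isValid`, and `splitByValidity` are hypothetical names.

```scala
object ReplicaTransitionSketch {
  sealed trait ReplicaState
  case object NewReplica extends ReplicaState
  case object OnlineReplica extends ReplicaState
  case object OfflineReplica extends ReplicaState
  case object ReplicaDeletionStarted extends ReplicaState
  case object ReplicaDeletionSuccessful extends ReplicaState
  case object ReplicaDeletionIneligible extends ReplicaState
  case object NonExistentReplica extends ReplicaState

  // target state -> states it may be entered from (assumption, see lead-in)
  val validPreviousStates: Map[ReplicaState, Set[ReplicaState]] =
    Map[ReplicaState, Set[ReplicaState]](
      NewReplica -> Set[ReplicaState](NonExistentReplica),
      OnlineReplica -> Set[ReplicaState](NewReplica, OnlineReplica, OfflineReplica, ReplicaDeletionIneligible),
      OfflineReplica -> Set[ReplicaState](NewReplica, OnlineReplica, OfflineReplica, ReplicaDeletionIneligible),
      ReplicaDeletionStarted -> Set[ReplicaState](OfflineReplica),
      ReplicaDeletionSuccessful -> Set[ReplicaState](ReplicaDeletionStarted),
      ReplicaDeletionIneligible -> Set[ReplicaState](OfflineReplica, ReplicaDeletionStarted),
      NonExistentReplica -> Set[ReplicaState](ReplicaDeletionSuccessful)
    )

  def isValid(from: ReplicaState, to: ReplicaState): Boolean =
    validPreviousStates(to).contains(from)

  // Split replicas into (valid, invalid) for a target state; a replica with no
  // recorded state is treated as NonExistentReplica.
  def splitByValidity[R](
      replicas: Seq[R],
      current: Map[R, ReplicaState],
      target: ReplicaState
  ): (Seq[R], Seq[R]) =
    replicas.partition(r => isValid(current.getOrElse(r, NonExistentReplica), target))
}
```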

  private def doHandleStateChanges(replicaId: Int, replicas: Seq[PartitionAndReplica], targetState: ReplicaState): Unit = {
    // first check that the transition to the target state is legal, and filter out the replicas for which it is not
    replicas.foreach(replica => controllerContext.putReplicaStateIfNotExists(replica, NonExistentReplica))
    val (validReplicas, invalidReplicas) = controllerContext.checkValidReplicaStateChange(replicas, targetState)
    invalidReplicas.foreach(replica => logInvalidTransition(replica, targetState))

    targetState match {
      case NewReplica =>
        validReplicas.foreach { replica =>
          val partition = replica.topicPartition
          // look up the replica's current state in the controller context
          val currentState = controllerContext.replicaState(replica)
          // partitionLeadershipInfo is a Map[TopicPartition, LeaderIsrAndControllerEpoch]
          controllerContext.partitionLeadershipInfo.get(partition) match {
            case Some(leaderIsrAndControllerEpoch) =>
              // if the replica being handled is the partition leader, reject the transition: a StateChangeFailedException is created and logged (it is not thrown)
              if (leaderIsrAndControllerEpoch.leaderAndIsr.leader == replicaId) {
                val exception = new StateChangeFailedException(s"Replica $replicaId for partition $partition cannot be moved to NewReplica state as it is being requested to become leader")
                logFailedStateChange(replica, currentState, OfflineReplica, exception)
              } else {
                // queue a LeaderAndIsrRequest that tells the broker hosting the new replica the current leader, ISR, etc.; replicaId here is the id of that broker
                // an UpdateMetadataRequest is also queued for all live brokers
                controllerBrokerRequestBatch.addLeaderAndIsrRequestForBrokers(Seq(replicaId),
                  replica.topicPartition,
                  leaderIsrAndControllerEpoch,
                  controllerContext.partitionFullReplicaAssignment(replica.topicPartition),
                  isNew = true)
                logSuccessfulTransition(replicaId, partition, currentState, NewReplica)
                // update the replica state held in controllerContext
                controllerContext.putReplicaState(replica, NewReplica)
              }
            case None =>
              logSuccessfulTransition(replicaId, partition, currentState, NewReplica)
              controllerContext.putReplicaState(replica, NewReplica)
          }
        }
      case OnlineReplica =>
        validReplicas.foreach { replica =>
          val partition = replica.topicPartition
          val currentState = controllerContext.replicaState(replica)
          // recall that OnlineReplica can be entered from NewReplica, OnlineReplica, OfflineReplica, or ReplicaDeletionIneligible
          // NewReplica gets special handling here
          currentState match {
            case NewReplica =>
              // fetch the partition's replica assignment; if it does not contain this replica id, append it
              val assignment = controllerContext.partitionFullReplicaAssignment(partition)
              if (!assignment.replicas.contains(replicaId)) {
                error(s"Adding replica ($replicaId) that is not part of the assignment $assignment")
                val newAssignment = assignment.copy(replicas = assignment.replicas :+ replicaId)
                controllerContext.updatePartitionFullReplicaAssignment(partition, newAssignment)
              }
            case _ =>
              // for any other current state, queue a LeaderAndIsrRequest and UpdateMetadataRequest, provided leadership info exists
              controllerContext.partitionLeadershipInfo.get(partition) match {
                case Some(leaderIsrAndControllerEpoch) =>
                  controllerBrokerRequestBatch.addLeaderAndIsrRequestForBrokers(Seq(replicaId),
                    replica.topicPartition,
                    leaderIsrAndControllerEpoch,
                    controllerContext.partitionFullReplicaAssignment(partition), isNew = false)
                case None =>
              }
          }
          logSuccessfulTransition(replicaId, partition, currentState, OnlineReplica)
          controllerContext.putReplicaState(replica, OnlineReplica)
        }
      case OfflineReplica =>
        // queue a StopReplicaRequest for every valid replica (deletePartition = false, so the data is kept)
        validReplicas.foreach { replica =>
          controllerBrokerRequestBatch.addStopReplicaRequestForBrokers(Seq(replicaId), replica.topicPartition, deletePartition = false)
        }
        // split the replicas into those whose partition has leadership info and those whose partition does not
        val (replicasWithLeadershipInfo, replicasWithoutLeadershipInfo) = validReplicas.partition { replica =>
          controllerContext.partitionLeadershipInfo.contains(replica.topicPartition)
        }
        // remove the replica from the ISR; this mainly updates ZooKeeper and controllerContext
        val updatedLeaderIsrAndControllerEpochs = removeReplicasFromIsr(replicaId, replicasWithLeadershipInfo.map(_.topicPartition))
        updatedLeaderIsrAndControllerEpochs.foreach { case (partition, leaderIsrAndControllerEpoch) =>
          if (!controllerContext.isTopicQueuedUpForDeletion(partition.topic)) {
            val recipients = controllerContext.partitionReplicaAssignment(partition).filterNot(_ == replicaId)
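            // queue LeaderAndIsr and UpdateMetadata requests for the partition's other replicas: the ISR has just changed (and, if the offline replica was the leader, the leader has too), so they need the updated LeaderAndIsr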
            controllerBrokerRequestBatch.addLeaderAndIsrRequestForBrokers(recipients,
              partition,
              leaderIsrAndControllerEpoch,
              controllerContext.partitionFullReplicaAssignment(partition), isNew = false)
          }
          val replica = PartitionAndReplica(partition, replicaId)
          val currentState = controllerContext.replicaState(replica)
          logSuccessfulTransition(replicaId, partition, currentState, OfflineReplica)
          controllerContext.putReplicaState(replica, OfflineReplica)
        }
        // for replicas whose partition has no leadership info, just queue an UpdateMetadataRequest
        replicasWithoutLeadershipInfo.foreach { replica =>
          val currentState = controllerContext.replicaState(replica)
          logSuccessfulTransition(replicaId, replica.topicPartition, currentState, OfflineReplica)
          controllerBrokerRequestBatch.addUpdateMetadataRequestForBrokers(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set(replica.topicPartition))
          controllerContext.putReplicaState(replica, OfflineReplica)
        }
      case ReplicaDeletionStarted =>
        validReplicas.foreach { replica =>
          val currentState = controllerContext.replicaState(replica)
          logSuccessfulTransition(replicaId, replica.topicPartition, currentState, ReplicaDeletionStarted)
          controllerContext.putReplicaState(replica, ReplicaDeletionStarted)
          // additionally queue a StopReplicaRequest, this time with deletePartition = true
          controllerBrokerRequestBatch.addStopReplicaRequestForBrokers(Seq(replicaId), replica.topicPartition, deletePartition = true)
        }
      case ReplicaDeletionIneligible =>
        validReplicas.foreach { replica =>
          val currentState = controllerContext.replicaState(replica)
          logSuccessfulTransition(replicaId, replica.topicPartition, currentState, ReplicaDeletionIneligible)
          controllerContext.putReplicaState(replica, ReplicaDeletionIneligible)
        }
      case ReplicaDeletionSuccessful =>
        validReplicas.foreach { replica =>
          val currentState = controllerContext.replicaState(replica)
          logSuccessfulTransition(replicaId, replica.topicPartition, currentState, ReplicaDeletionSuccessful)
          controllerContext.putReplicaState(replica, ReplicaDeletionSuccessful)
        }
      case NonExistentReplica =>
        validReplicas.foreach { replica =>
          val currentState = controllerContext.replicaState(replica)
          val newAssignedReplicas = controllerContext
            .partitionFullReplicaAssignment(replica.topicPartition)
            .removeReplica(replica.replica)
          // again, only the assignment and state held in controllerContext are updated
          controllerContext.updatePartitionFullReplicaAssignment(replica.topicPartition, newAssignedReplicas)
          logSuccessfulTransition(replicaId, replica.topicPartition, currentState, NonExistentReplica)
          controllerContext.removeReplicaState(replica)
        }
    }
  }
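
The ISR adjustment performed in the OfflineReplica branch is worth seeing in isolation. The sketch below is a simplified model of what removeReplicasFromIsr does to a single partition, as I recall it from the 2.5 sources; it leaves out the ZooKeeper write and the zkVersion field. `IsrShrinkSketch` and `removeReplica` are hypothetical names, and the trimmed-down `LeaderAndIsr` here is not Kafka's class.

```scala
object IsrShrinkSketch {
  // Sentinel meaning "this partition currently has no leader".
  val NoLeader: Int = -1

  final case class LeaderAndIsr(leader: Int, leaderEpoch: Int, isr: List[Int])

  // Returns the adjusted state, or None when the replica is not in the ISR
  // and nothing needs to change.
  def removeReplica(state: LeaderAndIsr, replicaId: Int): Option[LeaderAndIsr] =
    if (!state.isr.contains(replicaId)) None
    else {
      // If the leader itself goes offline, the partition is left without a leader.
      val newLeader = if (state.leader == replicaId) NoLeader else state.leader
      // The last ISR member is kept so the ISR never becomes empty.
      val newIsr = if (state.isr.size == 1) state.isr else state.isr.filter(_ != replicaId)
      Some(LeaderAndIsr(newLeader, state.leaderEpoch + 1, newIsr))
    }
}
```

Under this model the replica state machine never picks a new leader itself; it only records that the leader is gone. Choosing a replacement is the job of the partition state machine described next.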

Partition state machine transition diagram

Figure 3

PartitionStateMachine source analysis

The startup() code is as follows:

  def startup(): Unit = {
    info("Initializing partition state")
    //5.1 initialize the partition states
    initializePartitionState()
    info("Triggering online partition state changes")
    //5.2 trigger the transition of partitions to the online state
    triggerOnlinePartitionStateChange()
    debug(s"Started partition state machine with initial state -> ${controllerContext.partitionStates}")
  }

kafka.controller.PartitionStateMachine#initializePartitionState

This simply assigns every partition an initial state:
  private def initializePartitionState(): Unit = {
    for (topicPartition <- controllerContext.allPartitions) {
      // the partition state is derived from the cache in controllerContext: no leadership info means NewPartition; leadership info with a live leader means OnlinePartition; otherwise OfflinePartition
      controllerContext.partitionLeadershipInfo.get(topicPartition) match {
        case Some(currentLeaderIsrAndEpoch) =>
          // else, check if the leader for partition is alive. If yes, it is in Online state, else it is in Offline state
          if (controllerContext.isReplicaOnline(currentLeaderIsrAndEpoch.leaderAndIsr.leader, topicPartition))
          // leader is alive
            controllerContext.putPartitionState(topicPartition, OnlinePartition)
          else
            controllerContext.putPartitionState(topicPartition, OfflinePartition)
        case None =>
          controllerContext.putPartitionState(topicPartition, NewPartition)
      }
    }
  }
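
The three-way decision above boils down to one pattern match. A minimal sketch, assuming liveness is just membership in a set of live broker ids; `PartitionInitSketch` and `initialState` are hypothetical names.

```scala
object PartitionInitSketch {
  sealed trait PartitionState
  case object NewPartition extends PartitionState
  case object OnlinePartition extends PartitionState
  case object OfflinePartition extends PartitionState

  // cachedLeader is None when the controller has no leadership info for the partition.
  def initialState(cachedLeader: Option[Int], liveBrokers: Set[Int]): PartitionState =
    cachedLeader match {
      case None                                 => NewPartition
      case Some(id) if liveBrokers.contains(id) => OnlinePartition
      case Some(_)                              => OfflinePartition
    }
}
```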

kafka.controller.PartitionStateMachine#triggerOnlinePartitionStateChange

    // ... some code omitted; triggerOnlinePartitionStateChange ends up calling handleStateChanges, shown here
    override def handleStateChanges(
        partitions: Seq[TopicPartition],
        targetState: PartitionState,
        partitionLeaderElectionStrategyOpt: Option[PartitionLeaderElectionStrategy]
        ): Map[TopicPartition, Either[Throwable, LeaderAndIsr]] = {
        if (partitions.nonEmpty) {
          try {
            controllerBrokerRequestBatch.newBatch()
            //5.2.1 handle partitions in NewPartition or OfflinePartition state and bring them online
            val result = doHandleStateChanges(
              partitions,
              targetState,
              partitionLeaderElectionStrategyOpt
            )
            //5.2.2 send the batched requests to the brokers
            controllerBrokerRequestBatch.sendRequestsToBrokers(controllerContext.epoch)
            result
          } catch {
            case e: ControllerMovedException =>
              error(s"Controller moved to another broker when moving some partitions to $targetState state", e)
              throw e
            case e: Throwable =>
              error(s"Error while moving some partitions to $targetState state", e)
              partitions.iterator.map(_ -> Left(e)).toMap
          }
        } else {
          Map.empty
        }
    }
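
The return type Map[TopicPartition, Either[Throwable, LeaderAndIsr]] lets a caller see, per partition, whether the change succeeded. As an illustration only, a caller could separate the two cases as below; `ElectionResultSketch`, `split`, and the reduced `LeaderAndIsr` are hypothetical and not part of Kafka.

```scala
object ElectionResultSketch {
  final case class LeaderAndIsr(leader: Int, isr: List[Int])

  // Separate per-partition successes (Right) from failures (Left).
  def split[K](
      results: Map[K, Either[Throwable, LeaderAndIsr]]
  ): (Map[K, LeaderAndIsr], Map[K, Throwable]) = {
    val succeeded = results.collect { case (k, Right(state)) => k -> state }
    val failed = results.collect { case (k, Left(error)) => k -> error }
    (succeeded, failed)
  }
}
```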

Let's focus on the doHandleStateChanges method.

The partition state machine has four states in total; see Figure 4.
Figure 4

private def doHandleStateChanges(
    partitions: Seq[TopicPartition],
    targetState: PartitionState,
    partitionLeaderElectionStrategyOpt: Option[PartitionLeaderElectionStrategy]
  ): Map[TopicPartition, Either[Throwable, LeaderAndIsr]] = {
    val stateChangeLog = stateChangeLogger.withControllerEpoch(controllerContext.epoch)
    // check that each partition's current state may transition to the target state
    partitions.foreach(partition => controllerContext.putPartitionStateIfNotExists(partition, NonExistentPartition))
    val (validPartitions, invalidPartitions) = controllerContext.checkValidPartitionStateChange(partitions, targetState)
    invalidPartitions.foreach(partition => logInvalidTransition(partition, targetState))

    targetState match {
      case NewPartition =>
        validPartitions.foreach { partition =>
          stateChangeLog.trace(s"Changed partition $partition state from ${partitionState(partition)} to $targetState with " +
            s"assigned replicas ${controllerContext.partitionReplicaAssignment(partition).mkString(",")}")
          // only record the state in memory
          controllerContext.putPartitionState(partition, NewPartition)
        }
        Map.empty
      case OnlinePartition =>
        val uninitializedPartitions = validPartitions.filter(partition => partitionState(partition) == NewPartition)
        val partitionsToElectLeader = validPartitions.filter(partition => partitionState(partition) == OfflinePartition || partitionState(partition) == OnlinePartition)
        if (uninitializedPartitions.nonEmpty) {
          // pick the first live replica as leader (the live replicas form the ISR), write this to ZooKeeper at /brokers/topics/<topic>/partitions/<partition>/state,
          // and update partitionLeadershipInfo in the controller's cache
          val successfulInitializations = initializeLeaderAndIsrForPartitions(uninitializedPartitions)
          successfulInitializations.foreach { partition =>
            stateChangeLog.trace(s"Changed partition $partition from ${partitionState(partition)} to $targetState with state " +
              s"${controllerContext.partitionLeadershipInfo(partition).leaderAndIsr}")
            // update the partition state
            controllerContext.putPartitionState(partition, OnlinePartition)
          }
        }
        // handle partitions moving from OfflinePartition or OnlinePartition to OnlinePartition; these need a leader election
        if (partitionsToElectLeader.nonEmpty) {
          val electionResults = electLeaderForPartitions(
            partitionsToElectLeader,
            partitionLeaderElectionStrategyOpt.getOrElse(
              throw new IllegalArgumentException("Election strategy is a required field when the target state is OnlinePartition")
            )
          )
          // once a leader has been elected, update partitionStates in the cache
          electionResults.foreach {
            case (partition, Right(leaderAndIsr)) =>
              stateChangeLog.trace(
                s"Changed partition $partition from ${partitionState(partition)} to $targetState with state $leaderAndIsr"
              )
              controllerContext.putPartitionState(partition, OnlinePartition)
            case (_, Left(_)) => // Ignore; no need to update partition state on election error
          }

          electionResults
        } else {
          Map.empty
        }
      case OfflinePartition =>
        validPartitions.foreach { partition =>
          stateChangeLog.trace(s"Changed partition $partition state from ${partitionState(partition)} to $targetState")
          // only the in-memory partition state changes
          controllerContext.putPartitionState(partition, OfflinePartition)
        }
        Map.empty
      case NonExistentPartition =>
        validPartitions.foreach { partition =>
          stateChangeLog.trace(s"Changed partition $partition state from ${partitionState(partition)} to $targetState")
          // only the in-memory partition state changes
          controllerContext.putPartitionState(partition, NonExistentPartition)
        }
        Map.empty
    }
  }

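Now let's look closely at electLeaderForPartitions, which is used in the OnlinePartition branch.

There are four leader election strategies, each annotated in the code below. The first, the offline partition strategy, is the one used during initialization. It is invoked with allowUnclean = false, but if the topic's configuration allows unclean leader election, a replica outside the ISR can still be chosen as leader.
    // ... some code omitted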
    val (partitionsWithoutLeaders, partitionsWithLeaders) = partitionLeaderElectionStrategy match {
        // Offline partition strategy: prefer the first live replica that is in the ISR; if allowUnclean is true or the topic config allows unclean election, another live replica may be chosen
      case OfflinePartitionLeaderElectionStrategy(allowUnclean) =>
        val partitionsWithUncleanLeaderElectionState = collectUncleanLeaderElectionState(
          validLeaderAndIsrs,
          allowUnclean
        )
        leaderForOffline(controllerContext, partitionsWithUncleanLeaderElectionState).partition(_.leaderAndIsr.isEmpty)
      case ReassignPartitionLeaderElectionStrategy =>
        // Partition reassignment strategy: pick the first live replica that is in the ISR
        leaderForReassign(controllerContext, validLeaderAndIsrs).partition(_.leaderAndIsr.isEmpty)
      case PreferredReplicaPartitionLeaderElectionStrategy =>
        // Preferred replica strategy: only the first replica in the AR can become leader, and only if it is live and in the ISR
        leaderForPreferredReplica(controllerContext, validLeaderAndIsrs).partition(_.leaderAndIsr.isEmpty)
      case ControlledShutdownPartitionLeaderElectionStrategy =>
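        // Controlled shutdown strategy (a broker is shutting down gracefully; this is not about the controller failing): pick the first live ISR replica that is not on a broker being shut down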
        leaderForControlledShutdown(controllerContext, validLeaderAndIsrs).partition(_.leaderAndIsr.isEmpty)
    }
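
To make the four strategies concrete, here is a simplified sketch of the selection rules as pure functions. It is modeled on my reading of PartitionLeaderElectionAlgorithms in the 2.5 sources and is not a copy of it; `LeaderElectionSketch` and its method names are hypothetical. One detail the comments above gloss over: candidates are scanned in assignment (AR) order, not in ISR order.

```scala
object LeaderElectionSketch {
  // Offline partition: first assigned replica that is live and in the ISR;
  // with unclean election enabled, fall back to any live assigned replica.
  def offline(assignment: Seq[Int], isr: Seq[Int], live: Set[Int], uncleanEnabled: Boolean): Option[Int] =
    assignment.find(id => live.contains(id) && isr.contains(id)).orElse {
      if (uncleanEnabled) assignment.find(live.contains) else None
    }

  // Reassignment: same rule, but scanning the target replica list.
  def reassign(targetReplicas: Seq[Int], isr: Seq[Int], live: Set[Int]): Option[Int] =
    targetReplicas.find(id => live.contains(id) && isr.contains(id))

  // Preferred replica: only the head of the assignment qualifies.
  def preferred(assignment: Seq[Int], isr: Seq[Int], live: Set[Int]): Option[Int] =
    assignment.headOption.filter(id => live.contains(id) && isr.contains(id))

  // Controlled shutdown: additionally skip brokers that are shutting down.
  def controlledShutdown(assignment: Seq[Int], isr: Seq[Int], live: Set[Int], shuttingDown: Set[Int]): Option[Int] =
    assignment.find(id => live.contains(id) && isr.contains(id) && !shuttingDown.contains(id))
}
```

For example, with assignment Seq(1, 2, 3), ISR Seq(2) and live brokers Set(1, 3), `offline` finds no clean candidate and returns None unless `uncleanEnabled` is true, in which case it falls back to broker 1.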

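Summary

Replica state machine

  • From the state diagram and the source we can roughly infer how replicas are created and deleted. One open question is why ReplicaDeletionIneligible can still transition to the online or offline state. Let's mark that as a TODO and come back to it when we run into it later.

Partition state machine

  • The partition state machine is comparatively simple. The only moderately complex part is leader election: most strategies pick a replica that is both live and in the ISR, and only the offline partition strategy consults the topic configuration to decide whether a live replica outside the ISR may become leader.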