Running ZooKeeper, a Distributed System Coordinator, on Kubernetes (k8s)

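Objectives

After this tutorial, you will know the following.

How to deploy a ZooKeeper ensemble using StatefulSet.

How to consistently configure the ensemble using ConfigMaps.

How to spread the deployment of ZooKeeper servers in the ensemble.

How to use PodDisruptionBudgets to ensure service availability during planned maintenance.

Creating a ZooKeeper Ensemble

The manifest below contains a Headless Service, a Service, a PodDisruptionBudget, and a StatefulSet.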

apiVersion: v1
kind: Service
metadata:
  name: zk-hs
  labels:
    app: zk
spec:
  ports:
  - port: 2888
    name: server
  - port: 3888
    name: leader-election
  clusterIP: None
  selector:
    app: zk
---
apiVersion: v1
kind: Service
metadata:
  name: zk-cs
  labels:
    app: zk
spec:
  ports:
  - port: 2181
    name: client
  selector:
    app: zk
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  selector:
    matchLabels:
      app: zk
  maxUnavailable: 1
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  selector:
    matchLabels:
      app: zk
  serviceName: zk-hs
  replicas: 3
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: zk
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                    - zk
              topologyKey: "kubernetes.io/hostname"
      containers:
      - name: kubernetes-zookeeper
        imagePullPolicy: Always
        image: "k8s.gcr.io/kubernetes-zookeeper:1.0-3.4.10"
        resources:
          requests:
            memory: "1Gi"
            cpu: "0.5"
        ports:
        - containerPort: 2181
          name: client
        - containerPort: 2888
          name: server
        - containerPort: 3888
          name: leader-election
        command:
        - sh
        - -c
        - "start-zookeeper \
          --servers=3 \
          --data_dir=/var/lib/zookeeper/data \
          --data_log_dir=/var/lib/zookeeper/data/log \
          --conf_dir=/opt/zookeeper/conf \
          --client_port=2181 \
          --election_port=3888 \
          --server_port=2888 \
          --tick_time=2000 \
          --init_limit=10 \
          --sync_limit=5 \
          --heap=512M \
          --max_client_cnxns=60 \
          --snap_retain_count=3 \
          --purge_interval=12 \
          --max_session_timeout=40000 \
          --min_session_timeout=4000 \
          --log_level=INFO"
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - "zookeeper-ready 2181"
          initialDelaySeconds: 10
          timeoutSeconds: 5
        livenessProbe:
          exec:
            command:
            - sh
            - -c
            - "zookeeper-ready 2181"
          initialDelaySeconds: 10
          timeoutSeconds: 5
        volumeMounts:
        - name: datadir
          mountPath: /var/lib/zookeeper
      securityContext:
        runAsUser: 1000
        fsGroup: 1000
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi

Open a terminal, and use the kubectl apply command to create the manifest.

kubectl apply -f https://k8s.io/examples/application/zookeeper/zookeeper.yaml

This creates the zk-hs Headless Service, the zk-cs Service, the zk-pdb PodDisruptionBudget, and the zk StatefulSet.

service/zk-hs created

service/zk-cs created

poddisruptionbudget.policy/zk-pdb created

statefulset.apps/zk created

Use kubectl get to watch the StatefulSet controller create the StatefulSet's Pods.

kubectl get pods -w -l app=zk

Once the zk-2 Pod is Running and Ready, use CTRL-C to terminate kubectl.

NAME READY STATUS RESTARTS AGE

zk-0 0/1 Pending 0 0s

zk-0 0/1 Pending 0 0s

zk-0 0/1 ContainerCreating 0 0s

zk-0 0/1 Running 0 19s

zk-0 1/1 Running 0 40s

zk-1 0/1 Pending 0 0s

zk-1 0/1 Pending 0 0s

zk-1 0/1 ContainerCreating 0 0s

zk-1 0/1 Running 0 18s

zk-1 1/1 Running 0 40s

zk-2 0/1 Pending 0 0s

zk-2 0/1 Pending 0 0s

zk-2 0/1 ContainerCreating 0 0s

zk-2 0/1 Running 0 19s

zk-2 1/1 Running 0 40s

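The StatefulSet controller creates three Pods, and each Pod has a container with a ZooKeeper server.

Facilitating Leader Election

Because there is no terminating algorithm for electing a leader in an anonymous network, Zab requires explicit membership configuration to perform leader election. Each server in the ensemble needs a unique identifier, all servers need to know the global set of identifiers, and each identifier needs to be associated with a network address.

Use kubectl exec to get the hostnames of the Pods in the zk StatefulSet.

for i in 0 1 2; do kubectl exec zk-$i -- hostname; done

The StatefulSet controller provides each Pod with a unique hostname based on its ordinal index. The hostnames take the form <statefulset name>-<ordinal index>. Because the replicas field of the zk StatefulSet is set to 3, the Set's controller creates three Pods with their hostnames set to zk-0, zk-1, and zk-2.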

zk-0

zk-1

zk-2

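The servers in a ZooKeeper ensemble use natural numbers as unique identifiers, and store each server's identifier in a file called myid in the server's data directory.

To examine the contents of the myid file for each server, use the following command.

for i in 0 1 2; do echo "myid zk-$i";kubectl exec zk-$i -- cat /var/lib/zookeeper/data/myid; done

Because the identifiers are natural numbers and the ordinal indices are non-negative integers, you can generate an identifier by adding 1 to the ordinal.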

myid zk-0

1

myid zk-1

2

myid zk-2

3
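
The ordinal-to-identifier rule described above can be sketched in shell. This is only an illustration: the function name myid_from_hostname is hypothetical and is not part of the kubernetes-zookeeper image, whose start-zookeeper script may derive the value differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive a ZooKeeper myid from a StatefulSet Pod
# hostname of the form <statefulset name>-<ordinal index>.
myid_from_hostname() {
  local host="$1"
  if [[ "$host" =~ ^(.+)-([0-9]+)$ ]]; then
    # Ordinals start at 0, myid values start at 1.
    # The 10# prefix forces base 10 so "08" is not read as octal.
    echo $(( 10#${BASH_REMATCH[2]} + 1 ))
  else
    echo "cannot parse ordinal from hostname: $host" >&2
    return 1
  fi
}

myid_from_hostname zk-0   # prints 1
myid_from_hostname zk-2   # prints 3
```

Inside a Pod, the same function could be called as myid_from_hostname "$(hostname)" and its output written to the myid file.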

To get the Fully Qualified Domain Name (FQDN) of each Pod in the zk StatefulSet, use the following command.

for i in 0 1 2; do kubectl exec zk-$i -- hostname -f; done

The zk-hs Service creates a domain for all of the Pods, zk-hs.default.svc.cluster.local.

zk-0.zk-hs.default.svc.cluster.local

zk-1.zk-hs.default.svc.cluster.local

zk-2.zk-hs.default.svc.cluster.local

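The A records in Kubernetes DNS resolve the FQDNs to the Pods' IP addresses. If Kubernetes reschedules the Pods, it updates the A records with the Pods' new IP addresses, but the A record names do not change.

ZooKeeper stores its application configuration in a file named zoo.cfg. Use kubectl exec to view the contents of the zoo.cfg file in the zk-0 Pod.

kubectl exec zk-0 -- cat /opt/zookeeper/conf/zoo.cfg

In the server.1, server.2, and server.3 properties at the bottom of the file, the 1, 2, and 3 correspond to the identifiers in the ZooKeeper servers' myid files. They are set to the FQDNs of the Pods in the zk StatefulSet.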

clientPort=2181

dataDir=/var/lib/zookeeper/data

dataLogDir=/var/lib/zookeeper/log

tickTime=2000

initLimit=10

syncLimit=2000

maxClientCnxns=60

minSessionTimeout= 4000

maxSessionTimeout= 40000

autopurge.snapRetainCount=3

autopurge.purgeInterval=0

server.1=zk-0.zk-hs.default.svc.cluster.local:2888:3888

server.2=zk-1.zk-hs.default.svc.cluster.local:2888:3888

server.3=zk-2.zk-hs.default.svc.cluster.local:2888:3888
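
The mapping from myid to Pod FQDN in the server properties can be reproduced with a short shell sketch. The function name print_server_lines is hypothetical, and the sketch assumes the default cluster domain cluster.local and the ports 2888 and 3888 used in this tutorial; it is not how the image's own scripts are known to generate the file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print zoo.cfg server.N lines for a StatefulSet.
# Arguments: StatefulSet name, Headless Service name, namespace, replicas.
# Assumes the cluster domain is cluster.local.
print_server_lines() {
  local name="$1" service="$2" namespace="$3" replicas="$4"
  local i
  for (( i = 0; i < replicas; i++ )); do
    # server.<myid>=<pod FQDN>:<server port>:<leader election port>
    echo "server.$(( i + 1 ))=${name}-${i}.${service}.${namespace}.svc.cluster.local:2888:3888"
  done
}

print_server_lines zk zk-hs default 3
```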

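Achieving Consensus

Consensus protocols require that the identifiers of each participant be unique. No two participants in the Zab protocol should claim the same unique identifier. This is necessary to allow the processes in the system to agree on which processes have committed which data. If two Pods are launched with the same ordinal, two ZooKeeper servers would both identify themselves as the same server.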

kubectl get pods -w -l app=zk

NAME READY STATUS RESTARTS AGE

zk-0 0/1 Pending 0 0s

zk-0 0/1 Pending 0 0s

zk-0 0/1 ContainerCreating 0 0s

zk-0 0/1 Running 0 19s

zk-0 1/1 Running 0 40s

zk-1 0/1 Pending 0 0s

zk-1 0/1 Pending 0 0s

zk-1 0/1 ContainerCreating 0 0s

zk-1 0/1 Running 0 18s

zk-1 1/1 Running 0 40s

zk-2 0/1 Pending 0 0s

zk-2 0/1 Pending 0 0s

zk-2 0/1 ContainerCreating 0 0s

zk-2 0/1 Running 0 19s

zk-2 1/1 Running 0 40s

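The A records for each Pod are entered when the Pod becomes Ready. Therefore, the FQDNs of the ZooKeeper servers resolve to a single endpoint, and that endpoint is the unique ZooKeeper server claiming the identity configured in its myid file.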

zk-0.zk-hs.default.svc.cluster.local

zk-1.zk-hs.default.svc.cluster.local

zk-2.zk-hs.default.svc.cluster.local

This ensures that the server properties in each ZooKeeper server's zoo.cfg file represent a correctly configured ensemble.

server.1=zk-0.zk-hs.default.svc.cluster.local:2888:3888

server.2=zk-1.zk-hs.default.svc.cluster.local:2888:3888

server.3=zk-2.zk-hs.default.svc.cluster.local:2888:3888

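When the servers use the Zab protocol to attempt to commit a value, they will either achieve consensus and commit the value (if leader election has succeeded and at least two of the Pods are Running and Ready), or they will fail to do so (if either of the conditions is not met). No state will arise where one server acknowledges a write on behalf of another.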
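
The "at least two of three" condition is an instance of a majority quorum. The sketch below, with the hypothetical function names quorum_size and has_quorum, shows the arithmetic under the assumption that the ensemble uses simple majority quorums.

```shell
#!/usr/bin/env bash
# Size of a majority quorum for an ensemble of N servers.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit status 0) when the ready servers form a majority.
# Arguments: ensemble size, number of Running and Ready servers.
has_quorum() {
  local ensemble="$1" ready="$2"
  (( ready >= ensemble / 2 + 1 ))
}

quorum_size 3                            # prints 2
has_quorum 3 2 && echo "can commit"
has_quorum 3 1 || echo "cannot commit"
```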

Sanity Testing the Ensemble

The most basic sanity test is to write data to one ZooKeeper server and to read the data from another.

The command below executes the zkCli.sh script to write world to the path /hello on the zk-0 Pod in the ensemble.

kubectl exec zk-0 -- zkCli.sh create /hello world

WATCHER::

WatchedEvent state:SyncConnected type:None path:null

Created /hello

Get the data from the zk-1 Pod.

kubectl exec zk-1 -- zkCli.sh get /hello

The data that you created on zk-0 is available on all the servers in the ensemble.

WATCHER::

WatchedEvent state:SyncConnected type:None path:null

world

cZxid = 0x100000002

ctime = Thu Dec 08 15:13:30 UTC 2016

mZxid = 0x100000002

mtime = Thu Dec 08 15:13:30 UTC 2016

pZxid = 0x100000002

cversion = 0

dataVersion = 0

aclVersion = 0

ephemeralOwner = 0x0

dataLength = 5

numChildren = 0

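Providing Durable Storage

As mentioned in the ZooKeeper Basics section, ZooKeeper commits all entries to a durable WAL, and periodically writes snapshots of its in-memory state to storage media. Using WALs to provide durability is a common technique for applications that use consensus protocols to achieve a replicated state machine.

Use kubectl delete to delete the zk StatefulSet.

kubectl delete statefulset zk

statefulset.apps "zk" deleted

Watch the termination of the Pods in the StatefulSet.

kubectl get pods -w -l app=zk

When zk-0 is fully terminated, use CTRL-C to terminate kubectl.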

zk-2 1/1 Terminating 0 9m

zk-0 1/1 Terminating 0 11m

zk-1 1/1 Terminating 0 10m

zk-2 0/1 Terminating 0 9m

zk-2 0/1 Terminating 0 9m

zk-2 0/1 Terminating 0 9m

zk-1 0/1 Terminating 0 10m

zk-1 0/1 Terminating 0 10m

zk-1 0/1 Terminating 0 10m

zk-0 0/1 Terminating 0 11m

zk-0 0/1 Terminating 0 11m

zk-0 0/1 Terminating 0 11m

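Reapply the manifest in zookeeper.yaml.

kubectl apply -f https://k8s.io/examples/application/zookeeper/zookeeper.yaml

This creates the zk StatefulSet object, but the other API objects in the manifest are not modified because they already exist.

Watch the StatefulSet controller recreate the StatefulSet's Pods.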

kubectl get pods -w -l app=zk

Once the zk-2 Pod is Running and Ready, use CTRL-C to terminate kubectl.

NAME READY STATUS RESTARTS AGE

zk-0 0/1 Pending 0 0s

zk-0 0/1 Pending 0 0s

zk-0 0/1 ContainerCreating 0 0s

zk-0 0/1 Running 0 19s

zk-0 1/1 Running 0 40s

zk-1 0/1 Pending 0 0s

zk-1 0/1 Pending 0 0s

zk-1 0/1 ContainerCreating 0 0s

zk-1 0/1 Running 0 18s

zk-1 1/1 Running 0 40s

zk-2 0/1 Pending 0 0s

zk-2 0/1 Pending 0 0s

zk-2 0/1 ContainerCreating 0 0s

zk-2 0/1 Running 0 19s

zk-2 1/1 Running 0 40s

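Use the command below to get the value you entered during the sanity test from the zk-2 Pod.

kubectl exec zk-2 -- zkCli.sh get /hello

Even though you terminated and recreated all of the Pods in the zk StatefulSet, the ensemble still serves the original value.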

WATCHER::

WatchedEvent state:SyncConnected type:None path:null

world

cZxid = 0x100000002

ctime = Thu Dec 08 15:13:30 UTC 2016

mZxid = 0x100000002

mtime = Thu Dec 08 15:13:30 UTC 2016

pZxid = 0x100000002

cversion = 0

dataVersion = 0

aclVersion = 0

ephemeralOwner = 0x0

dataLength = 5

numChildren = 0

The volumeClaimTemplates field of the zk StatefulSet's spec specifies a PersistentVolume provisioned for each Pod.

volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 20Gi

The StatefulSet controller generates a PersistentVolumeClaim for each Pod in the StatefulSet.

Use the following command to get the StatefulSet's PersistentVolumeClaims.

kubectl get pvc -l app=zk

When the StatefulSet recreated its Pods, it remounted the Pods' PersistentVolumes.

NAME STATUS VOLUME CAPACITY ACCESSMODES AGE

datadir-zk-0 Bound pvc-bed742cd-bcb1-11e6-994f-42010a800002 20Gi RWO 1h

datadir-zk-1 Bound pvc-bedd27d2-bcb1-11e6-994f-42010a800002 20Gi RWO 1h

datadir-zk-2 Bound pvc-bee0817e-bcb1-11e6-994f-42010a800002 20Gi RWO 1h
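
The claim names in the output follow the pattern <claim template name>-<statefulset name>-<ordinal index>. As a small sketch, the hypothetical function expected_pvc_names below lists the names you would expect to see for a given StatefulSet, which can be handy when cleaning up storage later.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the PersistentVolumeClaim names expected for a
# StatefulSet, following <claim template>-<statefulset>-<ordinal>.
# Arguments: claim template name, StatefulSet name, replicas.
expected_pvc_names() {
  local template="$1" statefulset="$2" replicas="$3"
  local i
  for (( i = 0; i < replicas; i++ )); do
    echo "${template}-${statefulset}-${i}"
  done
}

expected_pvc_names datadir zk 3
```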

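The volumeMounts section of the StatefulSet's container template mounts the PersistentVolumes in the ZooKeeper servers' data directories.

volumeMounts:
- name: datadir
  mountPath: /var/lib/zookeeper

When a Pod in the zk StatefulSet is (re)scheduled, it always has the same PersistentVolume mounted to the ZooKeeper server's data directory. Even when the Pods are rescheduled, all the writes made to the ZooKeeper servers' WALs, and all their snapshots, remain durable.

Ensuring Consistent Configuration

As noted in the Facilitating Leader Election and Achieving Consensus sections, the servers in a ZooKeeper ensemble require consistent configuration to elect a leader and form a quorum. They also require consistent configuration of the Zab protocol in order for the protocol to work correctly over a network. In our example we achieve consistent configuration by embedding the configuration directly into the manifest.

Get the zk StatefulSet.

kubectl get sts zk -o yaml

command:
- sh
- -c
- "start-zookeeper \
  --servers=3 \
  --data_dir=/var/lib/zookeeper/data \
  --data_log_dir=/var/lib/zookeeper/data/log \
  --conf_dir=/opt/zookeeper/conf \
  --client_port=2181 \
  --election_port=3888 \
  --server_port=2888 \
  --tick_time=2000 \
  --init_limit=10 \
  --sync_limit=5 \
  --heap=512M \
  --max_client_cnxns=60 \
  --snap_retain_count=3 \
  --purge_interval=12 \
  --max_session_timeout=40000 \
  --min_session_timeout=4000 \
  --log_level=INFO"
…

The command used to start the ZooKeeper servers passes the configuration as command line parameters. You can also use environment variables to pass configuration to the ensemble.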

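Configuring Logging

One of the files generated by the zkGenConfig.sh script controls ZooKeeper's logging. ZooKeeper uses Log4j, and by default it uses a time and size based rolling file appender for its logging configuration.

Use the command below to get the logging configuration from one of the Pods in the zk StatefulSet.

kubectl exec zk-0 -- cat /usr/etc/zookeeper/log4j.properties

The logging configuration below causes the ZooKeeper process to write all of its logs to the standard output file stream.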

zookeeper.root.logger=CONSOLE

zookeeper.console.threshold=INFO

log4j.rootLogger=${zookeeper.root.logger}

log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender

log4j.appender.CONSOLE.Threshold=${zookeeper.console.threshold}

log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout

log4j.appender.CONSOLE.layout.ConversionPattern=%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n

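This is the simplest possible way to safely log inside the container. Because the application writes its logs to standard out, Kubernetes handles log rotation for you. Kubernetes also implements a sane retention policy that ensures application logs written to standard out and standard error do not exhaust local storage media.

Use kubectl logs to retrieve the last 20 log lines from one of the Pods.

kubectl logs zk-0 --tail 20

You can view application logs written to standard out or standard error using kubectl logs and from the Kubernetes Dashboard.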

2016-12-06 19:34:16,236 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52740

2016-12-06 19:34:16,237 [myid:1] - INFO [Thread-1136:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52740 (no session established for client)

2016-12-06 19:34:26,155 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52749

2016-12-06 19:34:26,155 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52749

2016-12-06 19:34:26,156 [myid:1] - INFO [Thread-1137:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52749 (no session established for client)

2016-12-06 19:34:26,222 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52750

2016-12-06 19:34:26,222 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52750

2016-12-06 19:34:26,226 [myid:1] - INFO [Thread-1138:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52750 (no session established for client)

2016-12-06 19:34:36,151 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52760

2016-12-06 19:34:36,152 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52760

2016-12-06 19:34:36,152 [myid:1] - INFO [Thread-1139:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52760 (no session established for client)

2016-12-06 19:34:36,230 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52761

2016-12-06 19:34:36,231 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52761

2016-12-06 19:34:36,231 [myid:1] - INFO [Thread-1140:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52761 (no session established for client)

2016-12-06 19:34:46,149 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52767

2016-12-06 19:34:46,149 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52767

2016-12-06 19:34:46,149 [myid:1] - INFO [Thread-1141:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52767 (no session established for client)

2016-12-06 19:34:46,230 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52768

2016-12-06 19:34:46,230 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52768

2016-12-06 19:34:46,230 [myid:1] - INFO [Thread-1142:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52768 (no session established for client)

Kubernetes supports more powerful, but more complex, logging integrations with Stackdriver and Elasticsearch and Kibana. For cluster level log shipping and aggregation, consider deploying a sidecar container to rotate and ship your logs.

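Configuring a Non-Privileged User

The best practices for allowing an application to run as a privileged user inside a container are a matter of debate. If your organization requires that applications run as a non-privileged user, you can use a SecurityContext to control the user that the entry point runs as.

The zk StatefulSet's Pod template contains a SecurityContext.

securityContext:
  runAsUser: 1000
  fsGroup: 1000

In the Pods' containers, UID 1000 corresponds to the zookeeper user and GID 1000 corresponds to the zookeeper group.

Get the ZooKeeper process information from the zk-0 Pod.

kubectl exec zk-0 -- ps -elf

Because the runAsUser field of the securityContext object is set to 1000, the ZooKeeper process runs as the zookeeper user instead of running as root.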

F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD

4 S zookeep+ 1 0 0 80 0 - 1127 - 20:46 ? 00:00:00 sh -c zkGenConfig.sh && zkServer.sh start-foreground

0 S zookeep+ 27 1 0 80 0 - 1155556 - 20:46 ? 00:00:19 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Dzookeeper.log.dir=/var/log/zookeeper -Dzookeeper.root.logger=INFO,CONSOLE -cp /usr/bin/../build/classes:/usr/bin/../build/lib/*.jar:/usr/bin/../share/zookeeper/zookeeper-3.4.9.jar:/usr/bin/../share/zookeeper/slf4j-log4j12-1.6.1.jar:/usr/bin/../share/zookeeper/slf4j-api-1.6.1.jar:/usr/bin/../share/zookeeper/netty-3.10.5.Final.jar:/usr/bin/../share/zookeeper/log4j-1.2.16.jar:/usr/bin/../share/zookeeper/jline-0.9.94.jar:/usr/bin/../src/java/lib/*.jar:/usr/bin/../etc/zookeeper: -Xmx2G -Xms2G -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain /usr/bin/../etc/zookeeper/zoo.cfg

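By default, when a Pod's PersistentVolume is mounted to the ZooKeeper server's data directory, it is only accessible by the root user. This configuration prevents the ZooKeeper process from writing to its WAL and storing its snapshots.

Use the command below to get the file permissions of the ZooKeeper data directory on the zk-0 Pod.

kubectl exec -ti zk-0 -- ls -ld /var/lib/zookeeper/data

Because the fsGroup field of the securityContext object is set to 1000, the ownership of the Pods' PersistentVolumes is set to the zookeeper group, and the ZooKeeper process is able to read and write its data.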

drwxr-sr-x 3 zookeeper zookeeper 4096 Dec 5 20:45 /var/lib/zookeeper/data

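Managing the ZooKeeper Process

The ZooKeeper documentation mentions that "You will want to have a supervisory process that manages each of your ZooKeeper server processes (JVM)." Using a watchdog (supervisory process) to restart failed processes in a distributed system is a common pattern. When deploying an application in Kubernetes, rather than using an external utility as a supervisory process, you should use Kubernetes as the watchdog for your application.

Updating the Ensemble

The zk StatefulSet is configured to use the RollingUpdate update strategy.

You can use kubectl patch to update the number of cpus allocated to the servers.

kubectl patch sts zk --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/cpu", "value":"0.3"}]'

statefulset.apps/zk patched

Use kubectl rollout status to watch the status of the update.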

kubectl rollout status sts/zk

waiting for statefulset rolling update to complete 0 pods at revision zk-5db4499664...

Waiting for 1 pods to be ready...

Waiting for 1 pods to be ready...

waiting for statefulset rolling update to complete 1 pods at revision zk-5db4499664...

Waiting for 1 pods to be ready...

Waiting for 1 pods to be ready...

waiting for statefulset rolling update to complete 2 pods at revision zk-5db4499664...

Waiting for 1 pods to be ready...

Waiting for 1 pods to be ready...

statefulset rolling update complete 3 pods at revision zk-5db4499664...

This terminates the Pods, one at a time, in reverse ordinal order, and recreates them with the new configuration. This ensures that quorum is maintained during a rolling update.
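
As a small sketch of the reverse ordinal order, the hypothetical function rolling_update_order below prints the Pod names in the order a rolling update would be expected to replace them. It only illustrates the ordering; it does not talk to the cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print Pod names in reverse ordinal order, the order
# in which a RollingUpdate is described as replacing them.
# Arguments: StatefulSet name, replicas.
rolling_update_order() {
  local name="$1" replicas="$2"
  local i
  for (( i = replicas - 1; i >= 0; i-- )); do
    echo "${name}-${i}"
  done
}

rolling_update_order zk 3   # prints zk-2, zk-1, zk-0 on separate lines
```

Because only one Pod is replaced at a time, two of the three servers stay available at each step, which is what keeps the quorum intact.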

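Use the kubectl rollout history command to view a history of previous configurations.

kubectl rollout history sts/zk

statefulsets "zk"

REVISION

1

2

Use the kubectl rollout undo command to roll back the modification.

kubectl rollout undo sts/zk

statefulset.apps/zk rolled back

Handling Process Failure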

Restart Policies control how Kubernetes handles process failures for the entry point of the container in a Pod. For Pods in a StatefulSet, the only appropriate RestartPolicy is Always, and this is the default value. For stateful applications you should never override the default policy.

Use the following command to examine the process tree for the ZooKeeper server running in the zk-0 Pod.

kubectl exec zk-0 -- ps -ef

The command used as the container’s entry point has PID 1, and the ZooKeeper process, a child of the entry point, has PID 27.

UID PID PPID C STIME TTY TIME CMD

zookeep+ 1 0 0 15:03 ? 00:00:00 sh -c zkGenConfig.sh && zkServer.sh start-foreground

zookeep+ 27 1 0 15:03 ? 00:00:03 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Dzookeeper.log.dir=/var/log/zookeeper -Dzookeeper.root.logger=INFO,CONSOLE -cp /usr/bin/../build/classes:/usr/bin/../build/lib/*.jar:/usr/bin/../share/zookeeper/zookeeper-3.4.9.jar:/usr/bin/../share/zookeeper/slf4j-log4j12-1.6.1.jar:/usr/bin/../share/zookeeper/slf4j-api-1.6.1.jar:/usr/bin/../share/zookeeper/netty-3.10.5.Final.jar:/usr/bin/../share/zookeeper/log4j-1.2.16.jar:/usr/bin/../share/zookeeper/jline-0.9.94.jar:/usr/bin/../src/java/lib/*.jar:/usr/bin/../etc/zookeeper: -Xmx2G -Xms2G -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain /usr/bin/../etc/zookeeper/zoo.cfg

In another terminal watch the Pods in the zk StatefulSet with the following command.

kubectl get pod -w -l app=zk

In another terminal, terminate the ZooKeeper process in Pod zk-0 with the following command.

kubectl exec zk-0 -- pkill java

The termination of the ZooKeeper process caused its parent process to terminate. Because the RestartPolicy of the container is Always, it restarted the parent process.

NAME READY STATUS RESTARTS AGE

zk-0 1/1 Running 0 21m

zk-1 1/1 Running 0 20m

zk-2 1/1 Running 0 19m

NAME READY STATUS RESTARTS AGE

zk-0 0/1 Error 0 29m

zk-0 0/1 Running 1 29m

zk-0 1/1 Running 1 29m

If your application uses a script (such as zkServer.sh) to launch the process that implements the application’s business logic, the script must terminate with the child process. This ensures that Kubernetes will restart the application’s container when the process implementing the application’s business logic fails.
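
One way to satisfy this requirement is sketched below. The launch function is hypothetical and is not the image's actual entry point: it runs a setup command, then runs the server command in the foreground and returns the server's exit status, so that a failure of the server becomes a failure of the script.

```shell
#!/usr/bin/env bash
# Hypothetical launcher: run a setup command, then run the server command
# in the foreground and return the server's exit status.
# Arguments: setup command (one word), then the server command and its args.
launch() {
  local setup="$1"
  shift
  # Abort early if configuration generation fails.
  "$setup" || return 1
  local status=0
  "$@" || status=$?
  echo "server command exited with status ${status}" >&2
  return "${status}"
}

# Stand-ins for a config generator and a server process.
launch true sh -c 'exit 0'
```

A common alternative is to end the script with exec followed by the server command, so the server replaces the shell and no wrapper process remains to outlive it.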

Testing for Liveness

Configuring your application to restart failed processes is not enough to keep a distributed system healthy. There are scenarios where a system’s processes can be both alive and unresponsive, or otherwise unhealthy. You should use liveness probes to notify Kubernetes that your application’s processes are unhealthy and it should restart them.

The Pod template for the zk StatefulSet specifies a liveness probe.

livenessProbe:
  exec:
    command:
    - sh
    - -c
    - "zookeeper-ready 2181"
  initialDelaySeconds: 10
  timeoutSeconds: 5

The probe calls a bash script that uses the ZooKeeper ruok four letter word to test the server’s health.

OK=$(echo ruok | nc 127.0.0.1 "$1")
if [ "$OK" == "imok" ]; then
  exit 0
else
  exit 1
fi

In one terminal window, use the following command to watch the Pods in the zk StatefulSet.

kubectl get pod -w -l app=zk

In another window, use the following command to delete the zookeeper-ready script from the file system of Pod zk-0.

kubectl exec zk-0 -- rm /usr/bin/zookeeper-ready

When the liveness probe for the ZooKeeper process fails, Kubernetes will automatically restart the process for you, ensuring that unhealthy processes in the ensemble are restarted.

kubectl get pod -w -l app=zk

NAME READY STATUS RESTARTS AGE

zk-0 1/1 Running 0 1h

zk-1 1/1 Running 0 1h

zk-2 1/1 Running 0 1h

NAME READY STATUS RESTARTS AGE

zk-0 0/1 Running 0 1h

zk-0 0/1 Running 1 1h

zk-0 1/1 Running 1 1h

Testing for Readiness

Readiness is not the same as liveness. If a process is alive, it is scheduled and healthy. If a process is ready, it is able to process input. Liveness is a necessary, but not sufficient, condition for readiness. There are cases, particularly during initialization and termination, when a process can be alive but not ready.

If you specify a readiness probe, Kubernetes will ensure that your application’s processes will not receive network traffic until their readiness checks pass.

For a ZooKeeper server, liveness implies readiness. Therefore, the readiness probe from the zookeeper.yaml manifest is identical to the liveness probe.

readinessProbe:
  exec:
    command:
    - sh
    - -c
    - "zookeeper-ready 2181"
  initialDelaySeconds: 10
  timeoutSeconds: 5

Even though the liveness and readiness probes are identical, it is important to specify both. This ensures that only healthy servers in the ZooKeeper ensemble receive network traffic.

Tolerating Node Failure

ZooKeeper needs a quorum of servers to successfully commit mutations to data. For a three server ensemble, two servers must be healthy for writes to succeed. In quorum based systems, members are deployed across failure domains to ensure availability. To avoid an outage, due to the loss of an individual machine, best practices preclude co-locating multiple instances of the application on the same machine.

By default, Kubernetes may co-locate Pods in a StatefulSet on the same node. For the three server ensemble you created, if two servers are on the same node, and that node fails, the clients of your ZooKeeper service will experience an outage until at least one of the Pods can be rescheduled.

You should always provision additional capacity to allow the processes of critical systems to be rescheduled in the event of node failures. If you do so, then the outage will only last until the Kubernetes scheduler reschedules one of the ZooKeeper servers. However, if you want your service to tolerate node failures with no downtime, you should set podAntiAffinity.

Use the command below to get the nodes for Pods in the zk StatefulSet.

for i in 0 1 2; do kubectl get pod zk-$i --template {{.spec.nodeName}}; echo ""; done

All of the Pods in the zk StatefulSet are deployed on different nodes.

kubernetes-minion-group-cxpk

kubernetes-minion-group-a5aq

kubernetes-minion-group-2g2d

This is because the Pods in the zk StatefulSet have a PodAntiAffinity specified.

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: "app"
              operator: In
              values:
              - zk
        topologyKey: "kubernetes.io/hostname"

The requiredDuringSchedulingIgnoredDuringExecution field tells the Kubernetes Scheduler that it should never co-locate two Pods which have app label as zk in the domain defined by the topologyKey. The topologyKey kubernetes.io/hostname indicates that the domain is an individual node. Using different rules, labels, and selectors, you can extend this technique to spread your ensemble across physical, network, and power failure domains.
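
To check the effect of the rule from a script, the following sketch reads node names, one per line, and succeeds only if every name is distinct. The function name all_on_distinct_nodes is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical check: read node names from standard input, one per line,
# and succeed only if no node name appears more than once.
all_on_distinct_nodes() {
  local duplicates
  # uniq -d prints only lines that are repeated in the sorted input.
  duplicates="$(sort | uniq -d)"
  [[ -z "$duplicates" ]]
}

printf '%s\n' node-a node-b node-c | all_on_distinct_nodes && echo "spread across nodes"
```

The output of the node-listing loop shown earlier in this section can be piped into the function in place of the printf stand-in.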

Surviving Maintenance

In this section you will cordon and drain nodes. If you are using this tutorial on a shared cluster, be sure that this will not adversely affect other tenants.

The previous section showed you how to spread your Pods across nodes to survive unplanned node failures, but you also need to plan for temporary node failures that occur due to planned maintenance.

Use this command to get the nodes in your cluster.

kubectl get nodes

Use kubectl cordon to cordon all but four of the nodes in your cluster.

kubectl cordon <node-name>

Use this command to get the zk-pdb PodDisruptionBudget.

kubectl get pdb zk-pdb

The max-unavailable field indicates to Kubernetes that at most one Pod from zk StatefulSet can be unavailable at any time.

NAME MIN-AVAILABLE MAX-UNAVAILABLE ALLOWED-DISRUPTIONS AGE

zk-pdb N/A 1 1
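
The ALLOWED-DISRUPTIONS value can be reasoned about with simple arithmetic. The sketch below assumes the budget works out to the number of currently healthy Pods minus the number that must stay healthy (replicas minus maxUnavailable), floored at zero. The function name allowed_disruptions is hypothetical, and the real disruption controller handles more cases, such as percentage values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate allowed disruptions for a budget that uses
# maxUnavailable. Arguments: replicas, maxUnavailable, healthy Pods.
allowed_disruptions() {
  local replicas="$1" max_unavailable="$2" healthy="$3"
  # Number of Pods that must remain healthy.
  local desired=$(( replicas - max_unavailable ))
  local allowed=$(( healthy - desired ))
  if (( allowed < 0 )); then
    allowed=0
  fi
  echo "$allowed"
}

allowed_disruptions 3 1 3   # prints 1: one Pod may be evicted
allowed_disruptions 3 1 2   # prints 0: a Pod is already unavailable
```

With one Pod already Pending, the second call shows why a further eviction would be refused.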

In one terminal, use this command to watch the Pods in the zk StatefulSet.

kubectl get pods -w -l app=zk

In another terminal, use this command to get the nodes that the Pods are currently scheduled on.

for i in 0 1 2; do kubectl get pod zk-$i --template {{.spec.nodeName}}; echo ""; done

kubernetes-minion-group-pb41

kubernetes-minion-group-ixsl

kubernetes-minion-group-i4c4

Use kubectl drain to cordon and drain the node on which the zk-0 Pod is scheduled.

kubectl drain $(kubectl get pod zk-0 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-local-data

node "kubernetes-minion-group-pb41" cordoned

WARNING:Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-minion-group-pb41, kube-proxy-kubernetes-minion-group-pb41; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-o5elz

pod "zk-0" deleted

node "kubernetes-minion-group-pb41" drained

As there are four nodes in your cluster, kubectl drain succeeds and the zk-0 Pod is rescheduled to another node.

NAME READY STATUS RESTARTS AGE

zk-0 1/1 Running 2 1h

zk-1 1/1 Running 0 1h

zk-2 1/1 Running 0 1h

NAME READY STATUS RESTARTS AGE

zk-0 1/1 Terminating 2 2h

zk-0 0/1 Terminating 2 2h

zk-0 0/1 Terminating 2 2h

zk-0 0/1 Terminating 2 2h

zk-0 0/1 Pending 0 0s

zk-0 0/1 Pending 0 0s

zk-0 0/1 ContainerCreating 0 0s

zk-0 0/1 Running 0 51s

zk-0 1/1 Running 0 1m

Keep watching the StatefulSet’s Pods in the first terminal and drain the node on which zk-1 is scheduled.

kubectl drain $(kubectl get pod zk-1 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-local-data

node "kubernetes-minion-group-ixsl" cordoned

WARNING:Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-minion-group-ixsl, kube-proxy-kubernetes-minion-group-ixsl; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-voc74

pod "zk-1" deleted

node "kubernetes-minion-group-ixsl" drained

The zk-1 Pod cannot be scheduled because the zk StatefulSet contains a PodAntiAffinity rule preventing co-location of the Pods, and as only two nodes are schedulable, the Pod will remain in a Pending state.

kubectl get pods -w -l app=zk

NAME READY STATUS RESTARTS AGE

zk-0 1/1 Running 2 1h

zk-1 1/1 Running 0 1h

zk-2 1/1 Running 0 1h

NAME READY STATUS RESTARTS AGE

zk-0 1/1 Terminating 2 2h

zk-0 0/1 Terminating 2 2h

zk-0 0/1 Terminating 2 2h

zk-0 0/1 Terminating 2 2h

zk-0 0/1 Pending 0 0s

zk-0 0/1 Pending 0 0s

zk-0 0/1 ContainerCreating 0 0s

zk-0 0/1 Running 0 51s

zk-0 1/1 Running 0 1m

zk-1 1/1 Terminating 0 2h

zk-1 0/1 Terminating 0 2h

zk-1 0/1 Terminating 0 2h

zk-1 0/1 Terminating 0 2h

zk-1 0/1 Pending 0 0s

zk-1 0/1 Pending 0 0s

Continue to watch the Pods of the stateful set, and drain the node on which zk-2 is scheduled.

kubectl drain $(kubectl get pod zk-2 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-local-data

node "kubernetes-minion-group-i4c4" cordoned

WARNING:Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-minion-group-i4c4, kube-proxy-kubernetes-minion-group-i4c4; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-dyrog

WARNING:Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-dyrog; Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-minion-group-i4c4, kube-proxy-kubernetes-minion-group-i4c4

There are pending pods when an error occurred: Cannot evict pod as it would violate the pod's disruption budget.

pod/zk-2

Use CTRL-C to terminate kubectl.

You cannot drain the third node because evicting zk-2 would violate zk-pdb. However, the node will remain cordoned.

Use zkCli.sh to retrieve the value you entered during the sanity test from zk-0.

kubectl exec zk-0 zkCli.sh get /hello

The service is still available because its PodDisruptionBudget is respected.

WatchedEvent state:SyncConnected type:None path:null

world

cZxid = 0x200000002

ctime = Wed Dec 07 00:08:59 UTC 2016

mZxid = 0x200000002

mtime = Wed Dec 07 00:08:59 UTC 2016

pZxid = 0x200000002

cversion = 0

dataVersion = 0

aclVersion = 0

ephemeralOwner = 0x0

dataLength = 5

numChildren = 0

Use kubectl uncordon to uncordon the first node.

kubectl uncordon kubernetes-minion-group-pb41

node "kubernetes-minion-group-pb41" uncordoned

zk-1 is rescheduled on this node. Wait until zk-1 is Running and Ready.

kubectl get pods -w -l app=zk

NAME READY STATUS RESTARTS AGE

zk-0 1/1 Running 2 1h

zk-1 1/1 Running 0 1h

zk-2 1/1 Running 0 1h

NAME READY STATUS RESTARTS AGE

zk-0 1/1 Terminating 2 2h

zk-0 0/1 Terminating 2 2h

zk-0 0/1 Terminating 2 2h

zk-0 0/1 Terminating 2 2h

zk-0 0/1 Pending 0 0s

zk-0 0/1 Pending 0 0s

zk-0 0/1 ContainerCreating 0 0s

zk-0 0/1 Running 0 51s

zk-0 1/1 Running 0 1m

zk-1 1/1 Terminating 0 2h

zk-1 0/1 Terminating 0 2h

zk-1 0/1 Terminating 0 2h

zk-1 0/1 Terminating 0 2h

zk-1 0/1 Pending 0 0s

zk-1 0/1 Pending 0 0s

zk-1 0/1 Pending 0 12m

zk-1 0/1 ContainerCreating 0 12m

zk-1 0/1 Running 0 13m

zk-1 1/1 Running 0 13m

Attempt to drain the node on which zk-2 is scheduled.

kubectl drain $(kubectl get pod zk-2 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-local-data

The output:

node "kubernetes-minion-group-i4c4" already cordoned

WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-minion-group-i4c4, kube-proxy-kubernetes-minion-group-i4c4; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-dyrog

pod "heapster-v1.2.0-2604621511-wht1r" deleted

pod "zk-2" deleted

node "kubernetes-minion-group-i4c4" drained

This time kubectl drain succeeds.

Uncordon the second node to allow zk-2 to be rescheduled.

kubectl uncordon kubernetes-minion-group-ixsl

node "kubernetes-minion-group-ixsl" uncordoned

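You can use kubectl drain in conjunction with PodDisruptionBudgets to ensure that your services remain available during maintenance. If drain is used to cordon nodes and evict Pods before taking a node offline for maintenance, services that express a disruption budget will have that budget respected. You should always allocate additional capacity for critical services so that their Pods can be rescheduled immediately.

Cleaning Up

Use kubectl uncordon to uncordon all the nodes in your cluster.

You need to delete the persistent storage media for the PersistentVolumes used in this tutorial. Follow the necessary steps, based on your environment, storage configuration, and provisioning method, to ensure that all storage is reclaimed.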
