Analysis of a Typical Oracle RAC Node GRID Software Restart Caused by a Network Heartbeat Failure

Below is the analysis of a typical RAC node GRID software restart caused by a network heartbeat failure (with the 11gR2 Rebootless Restart feature, the default behavior is to restart GRID rather than reboot the host). It is the fault analysis report written for the customer at the time, shared here for reference.

I. Service Overview

A bank recently encountered **an abnormal GRID restart on node 2 of a system database cluster; the corresponding database instance was restarted as well.** After receiving the fault report, engineers responded promptly and compiled this document based on an in-depth analysis of the relevant logs and other information.


II. Cluster and Database Log Analysis

1. Node 1 alert log entries

Sun Jul 28 10:01:09 2019

Thread 1 advanced to log sequence 10933 (LGWR switch)

  Current log# 1 seq# 10933 mem# 0: +DATA/AAA/onlinelog/group_1.261.839773731

Sun Jul 28 10:01:10 2019

Archived Log entry 19664 added for thread 1 sequence 10932 ID 0x1c6bb01c dest 1:

Sun Jul 28 12:13:01 2019   ==>> At 2019/7/28 12:13:01 a database instance Reconfiguration occurred; node 2 left the cluster

Reconfiguration started (old inc 3, new inc 5)

List of instances:

 1 (myinst: 1)

 Global Resource Directory frozen

 * dead instance detected - domain 0 invalid = TRUE

 Communication channels reestablished

 Master broadcasted resource hash value bitmaps

 Non-local Process blocks cleaned out

Sun Jul 28 12:13:01 2019

 LMS 0: 2 GCS shadows cancelled, 0 closed, 0 Xw survived

Sun Jul 28 12:13:01 2019

 LMS 1: 1 GCS shadows cancelled, 1 closed, 0 Xw survived

 Set master node info

 Submitted all remote-enqueue requests

 Dwn-cvts replayed, VALBLKs dubious

 All grantable enqueues granted

 Post SMON to start 1st pass IR

Sun Jul 28 12:13:01 2019

Instance recovery: looking for dead threads

Beginning instance recovery of 1 threads

 Submitted all GCS remote-cache requests

 Post SMON to start 1st pass IR

 Fix write in gcs resources

Reconfiguration complete

Sun Jul 28 12:13:03 2019

Setting Resource Manager plan SCHEDULER[0x318F]:DEFAULT_MAINTENANCE_PLAN via scheduler window

Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter

 parallel recovery started with 23 processes

Started redo scan

Completed redo scan

 read 1824 KB redo, 125 data blocks need recovery

Started redo application at

 Thread 2: logseq 8735, block 135314

Recovery of Online Redo Log: Thread 2 Group 6 Seq 8735 Reading mem 0

  Mem# 0: +DATA/AAA/onlinelog/group_6.275.839774147

Sun Jul 28 12:13:05 2019

minact-scn: master found reconf/inst-rec before recscn scan old-inc#:5 new-inc#:5

Completed redo application of 0.85MB

Completed instance recovery at

 Thread 2: logseq 8735, block 138963, scn 14690080331

 114 data blocks read, 126 data blocks written, 1824 redo k-bytes read

Thread 2 advanced to log sequence 8736 (thread recovery)

Redo thread 2 internally disabled at seq 8736 (SMON)

Sun Jul 28 12:13:06 2019

Archived Log entry 19665 added for thread 2 sequence 8735 ID 0x1c6bb01c dest 1:

Sun Jul 28 12:13:06 2019

ARC0: Archiving disabled thread 2 sequence 8736

Archived Log entry 19666 added for thread 2 sequence 8736 ID 0x1c6bb01c dest 1:

minact-scn: master continuing after IR

minact-scn: Master considers inst:2 dead

Sun Jul 28 12:14:07 2019   ==>> At 2019/7/28 12:14:07 database instance node 2 rejoined the cluster

Reconfiguration started (old inc 5, new inc 7)

List of instances:

 1 2 (myinst: 1)

 Global Resource Directory frozen

 Communication channels reestablished

 Master broadcasted resource hash value bitmaps

 Non-local Process blocks cleaned out

Sun Jul 28 12:14:07 2019

Sun Jul 28 12:14:07 2019

 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived

 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived

 Set master node info

 Submitted all remote-enqueue requests

 Dwn-cvts replayed, VALBLKs dubious

 All grantable enqueues granted

Sun Jul 28 12:14:08 2019

minact-scn: Master returning as live inst:2 has inc# mismatch instinc:0 cur:7 errcnt:0

 Submitted all GCS remote-cache requests

 Fix write in gcs resources

Reconfiguration complete

Sun Jul 28 12:14:34 2019

Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter

Sun Jul 28 14:52:26 2019

Thread 1 advanced to log sequence 10934 (LGWR switch)

  Current log# 2 seq# 10934 mem# 0: +DATA/AAA/onlinelog/group_2.262.839773731

Sun Jul 28 14:52:27 2019

Archived Log entry 19667 added for thread 1 sequence 10933 ID 0x1c6bb01c dest 1:

2. Node 2 alert log entries

Sun Jul 28 09:29:47 2019

Archived Log entry 19663 added for thread 2 sequence 8734 ID 0x1c6bb01c dest 1:

Sun Jul 28 12:13:00 2019

NOTE: ASMB terminating

Errors in file /u01/app/oracle/diag/rdbms/AAA/AAA2/trace/AAA2_asmb_1489.trc:

ORA-15064: communication failure with ASM instance

ORA-03113: end-of-file on communication channel

Process ID:

Session ID: 386 Serial number: 3

Errors in file /u01/app/oracle/diag/rdbms/AAA/AAA2/trace/AAA2_asmb_1489.trc:

ORA-15064: communication failure with ASM instance

ORA-03113: end-of-file on communication channel

Process ID:

Session ID: 386 Serial number: 3

ASMB (ospid: 1489): terminating the instance due to error 15064

Instance terminated by ASMB, pid = 1489

3. Node 1 cluster log entries

2019-07-28 12:12:52.104

[cssd(17568)]CRS-1611:Network communication with node AAABC02 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 6.750 seconds

2019-07-28 12:12:56.118

[cssd(17568)]CRS-1610:Network communication with node AAABC02 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.730 seconds

2019-07-28 12:12:58.856

[cssd(17568)]CRS-1607:Node AAABC02 is being evicted in cluster incarnation 287856623; details at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/AAABC01/cssd/ocssd.log.  ====>>>> Cluster mechanism at work: node 2 is being evicted

2019-07-28 12:13:00.865

[cssd(17568)]CRS-1625:Node AAABC02, number 2, was manually shut down

2019-07-28 12:13:00.873

[cssd(17568)]CRS-1601:CSSD Reconfiguration complete. Active nodes are AAABC01 .

2019-07-28 12:13:00.887

[ctssd(17797)]CRS-2410:The Cluster Time Synchronization Service on host AAABC01 is in active mode.

2019-07-28 12:13:02.735

[crsd(18144)]CRS-5504:Node down event reported for node 'AAABC02'.

2019-07-28 12:13:18.997

[crsd(18144)]CRS-2773:Server 'AAABC02' has been removed from pool 'Generic'.

2019-07-28 12:13:19.002

[crsd(18144)]CRS-2773:Server 'AAABC02' has been removed from pool 'ora.AAA'.

2019-07-28 12:13:25.174

[cssd(17568)]CRS-1601:CSSD Reconfiguration complete. Active nodes are AAABC01 AAABC02 .

2019-07-28 12:13:29.757

[ctssd(17797)]CRS-2403:The Cluster Time Synchronization Service on host AAABC01 is in observer mode.

2019-07-28 12:13:53.878

[crsd(18144)]CRS-2772:Server 'AAABC02' has been assigned to pool 'Generic'.

2019-07-28 12:13:53.879

[crsd(18144)]CRS-2772:Server 'AAABC02' has been assigned to pool 'ora.AAA'.

4. Node 2 cluster log entries

2019-07-28 12:09:00.212

[ctssd(506)]CRS-2409:The clock on host AAABC02 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.

2019-07-28 12:12:51.760

[cssd(458)]CRS-1611:Network communication with node AAABC01 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 6.800 seconds

2019-07-28 12:12:55.767

[cssd(458)]CRS-1610:Network communication with node AAABC01 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 2.790 seconds

====================
=====================>>>>> Cluster log at the time of the network heartbeat failure
====================
2019-07-28 12:12:58.565

[cssd(458)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/app/11.2.0/grid/log/AAABC02/cssd/ocssd.log.

2019-07-28 12:12:58.565

[cssd(458)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/AAABC02/cssd/ocssd.log

2019-07-28 12:12:58.619

[cssd(458)]CRS-1652:Starting clean up of CRSD resources.

………………

2019-07-28 12:13:00.979

[cssd(458)]CRS-1660:The CSS daemon shutdown has completed

2019-07-28 12:13:01.203

[ohasd(5853)]CRS-2765:Resource 'ora.evmd' has failed on server 'AAABC02'.

2019-07-28 12:13:01.210

[ohasd(5853)]CRS-2765:Resource 'ora.ctssd' has failed on server 'AAABC02'.

2019-07-28 12:13:02.934

[crsd(20865)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in /u01/app/11.2.0/grid/log/AAABC02/crsd/crsd.log.

…………

2019-07-28 12:13:12.310

[cssd(20952)]CRS-1713:CSSD daemon is started in clustered mode

2019-07-28 12:13:13.137

…………

2019-07-28 12:13:25.184

[cssd(20952)]CRS-1601:CSSD Reconfiguration complete. Active nodes are AAABC01 AAABC02 .
…………

2019-07-28 12:13:49.654

[crsd(21206)]CRS-1201:CRSD started on node AAABC02.

III. Summary and Follow-up Recommendations

3.1 Problem Summary

The logs of cluster nodes 1 and 2 show that at 2019-07-28 12:12:51.760 the network heartbeat started to be reported as lost ("Network communication with node AAABC01 (1) missing for 75% of timeout interval"). The cluster network heartbeat timeout is 30 seconds and the heartbeat is sent once per second, so it can be calculated that the heartbeat loss began at about 2019-07-28 12:12:28. At 2019-07-28 12:12:56.118 the log reported that 90% of the heartbeat timeout had been missed.
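
As a cross-check, the 30-second threshold and the estimated start time can be verified with a minimal sketch like the one below (standard 11gR2 clusterware commands, run on a cluster node as the grid owner or root; the arithmetic in the comments only restates the timeline quoted above):

# Network heartbeat timeout used by CSS (misscount; 30 seconds is the 11gR2 default on Linux)
crsctl get css misscount

# Back-of-the-envelope check of the start time:
#   CRS-1607/CRS-1609 put the eviction at about 12:12:58.5;
#   12:12:58.5 - 30 s (misscount) ≈ 12:12:28, matching the estimate above.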

At 2019-07-28 12:12:58.856 the log shows "[cssd(17568)]CRS-1607:Node AAABC02 is being evicted in cluster". According to the 11gR2 GRID clusterware mechanism, once the network heartbeat failure persists up to the timeout threshold, the clusterware stack on node 2 has to be restarted.

Because of the 11gR2 Rebootless Restart feature, the clusterware first attempts a clean shutdown and restart of the GRID software on node 2, and only reboots the node 2 operating system if the GRID software cannot be shut down cleanly. The node 2 cluster log confirms that GRID was shut down cleanly and restarted, with no operating system reboot (this can be verified from the OS logs or the uptime command). The overall time spent was short: the complete shutdown and restart of the node 2 clusterware took about 50 seconds.
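
A minimal verification sketch for node 2, confirming that only the GRID stack restarted and the OS did not reboot (standard Linux and clusterware commands; the /var/log/messages path is an assumption for a typical Linux host):

# OS uptime should be far longer than the ~50-second GRID restart window
uptime
last reboot | head -5

# OS log should contain no shutdown/reboot entries around Jul 28 12:13 (path assumed)
grep -iE "shutdown|reboot" /var/log/messages | grep "Jul 28 12:1"

# Clusterware stack should be fully back online on node 2
crsctl check crs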

Relevant cluster log entries:
2019-07-28 12:12:58.619

[cssd(458)]CRS-1652:Starting clean up of CRSD resources.
…………

2019-07-28 12:13:49.654

[crsd(21206)]CRS-1201:CRSD started on node AAABC02.


3.2 Follow-up Recommendations

1. Check the host network, including the physical NICs, the switches and other related network devices, to identify the exact cause of the heartbeat network failure, as sketched below.
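
A sketch of checks that can support this investigation (the interface name eth1 and the peer private IP are placeholders for this environment; the ocssd.log path is taken from the logs above):

# Which network is registered as the cluster private interconnect
oifcfg getif

# Link state, negotiated speed and error counters of the interconnect NIC (eth1 is a placeholder)
ethtool eth1
ip -s link show eth1

# Reachability and latency to the other node's private IP over that NIC
ping -I eth1 -c 10 <node1-private-ip>

# CSS-level heartbeat warnings around the incident, if any were logged
grep -i "heartbeat" /u01/app/11.2.0/grid/log/AAABC02/cssd/ocssd.log | tail -50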

2. The cluster logs also show that a cluster restart caused by a network heartbeat problem occurred at 2018-01-27 15:57:32. The network issue identified at that time should be reviewed together with this incident and fixed; the earlier occurrences can be pulled from the clusterware alert log as sketched below.
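
A sketch for correlating past events (the alert log path follows the 11.2 convention $GRID_HOME/log/&lt;hostname&gt;/alert&lt;hostname&gt;.log and is an assumption for this environment):

# All heartbeat-loss and eviction messages recorded in node 1's clusterware alert log
grep -E "CRS-1610|CRS-1611|CRS-1607|CRS-1609" /u01/app/11.2.0/grid/log/AAABC01/alertAAABC01.log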
