Linux Operations Study Notes, Part 31: Nagios Monitoring in Practice

Chapter 42: Nagios Monitoring in Practice

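I. Introduction to Nagios

1. What needs to be monitored?

(1) Local resources

a. Load: uptime;

b. CPU: top, sar, CPU temperature;

c. Disk: df;

d. Memory: free;

e. I/O: iostat;

f. RAID;

g. Changes to the passwd file (fingerprint checks on all local files).

(2) Network services

Ports, URLs, ping packet loss, process counts, IDC network traffic.

(3) Other devices

Routers, switch port traffic, printers, Windows hosts, and so on.

(4) Business data

Failed user logins, site login counts, failed CAPTCHA attempts, concurrent traffic on a given API, e-commerce orders, payment transaction counts, and so on. Developers or architects may be the ones who expose this data, but adding it to monitoring is the operations team's job.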

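2. Overview and strengths of Nagios

Nagios is an open-source network and service monitoring tool that is both powerful and flexible. It can monitor the state of Windows, Linux, and Unix hosts, network devices such as switches and routers, and host ports and URL services. It sends alerts to administrators according to the severity of the fault (by email, WeChat, SMS, voice call, Fetion, or MSN), and sends a recovery message once the fault is resolved.

The Nagios server runs on Linux and Unix-like systems. It currently cannot run on Windows, although Windows hosts can be monitored.

Official website: https://www.nagios.org/

Official quick-install guide: https://support.nagios.com/kb/article.php?id=96#CentOS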

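3. Features of Nagios

(1) Monitors network services (SMTP, POP3, HTTP, NNTP, PING, etc.)

(2) Monitors host resources (processor load, disk usage, etc.)

(3) A simple plugin design lets users easily add their own service checks

(4) Parallel service checks

(5) Can describe the network hierarchy: "parent" host definitions express the relationships between hosts, which is used to distinguish hosts that are down from hosts that are unreachable

(6) Sends notifications to contacts when service or host problems occur and when they are resolved (by email, SMS, or a user-defined method)

(7) Supports event handlers, which run when a host or service event occurs and help pinpoint the problem

(8) Automatic log rotation

(9) Supports redundant monitoring of hosts

(10) An optional web interface for viewing current network status, notification and problem history, log files, etc.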

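4. Components of Nagios

A drawback of Nagios is that it provides only the core; most other functionality comes from plugins. A Nagios setup generally consists of the main program (Nagios), the plugin package (Nagios-plugins), and some optional add-ons (NRPE, NSClient++, NSCA, NDOUtils). Nagios itself is only a monitoring platform; the actual checks are performed by plugins (Nagios-plugins, or plugins you write yourself). The Nagios main program and Nagios-plugins are therefore both required on the Nagios server, and Nagios-plugins is usually installed on the monitored hosts as well. The add-ons are described below:

(1) NRPE: semi-passive mode (for Linux servers, mainly for monitoring local resources)

a. Where it runs

On the monitored host, whose operating system is Linux/Unix.

b. Purpose

Runs plugin scripts on the monitored remote Linux/Unix host and returns the data to the server, so that the resources of these hosts can be monitored. Mainly used for local resources.

c. Form

A daemon (agent) listening on port 5666.

d. How it works

Like a manager assigning a task and a subordinate reporting back once it is done.

(2) NSClient++: semi-passive mode (for Windows servers)

a. Where it runs

On monitored Windows hosts.

b. Purpose

The Windows counterpart of NRPE on Linux.

c. How it works

(3) NDOUtils: not recommended

a. Where it runs

On the Nagios server.

b. Purpose

Stores Nagios configuration information and the data produced by events in a database so that the data can be queried and processed. Storing it in a database offers no advantage over keeping it on disk, so it is not recommended.

c. How it works

(4) NSCA: fully passive monitoring

a. Where it runs

Installed on both the Nagios server and the clients.

b. Purpose

Lets the monitored remote Linux/Unix hosts push their check results to the Nagios server on their own initiative. It is needed for distributed monitoring clusters; with fewer than 300 servers it can be ignored.

c. How it works

5. Nagios monitoring architecture diagram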

II. Installing the Nagios server

1. Demo environment

Host	OS	Role	Remark
192.168.1.198	RedHat6.4_32	Nagios monitoring server	Server
192.168.1.218	CentOS6.5_32	LNMP_Web server	Monitored client
192.168.1.219	CentOS6.5_32	LNMP_Web server	Monitored client

2. Preparation before installation

(1) Configure the yum repository

echo "------- Step 1 : Config yum -------"

cd /etc/yum.repos.d/

cp /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.bak

wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/CentOS-6.repo

(2) Configure the character set

echo "------- Step 2 : Config CharSet -------"

echo 'export LC_ALL=C' >> /etc/profile

source /etc/profile

(3) Disable the firewall and SELinux

echo "------- Step 3 : Stop iptables and SELinux -------"

a. Disable the firewall

/etc/init.d/iptables stop

/etc/init.d/ip6tables stop

chkconfig iptables off

chkconfig ip6tables off

b. Disable SELinux

setenforce 0

vi /etc/selinux/config # set the following line in the file

SELINUX=disabled

c. Disable SELinux from a script

if [ -f /etc/selinux/config ]; then
    sed -i 's#SELINUX=enforcing#SELINUX=disabled#g' /etc/selinux/config
    setenforce 0
fi

(4) Configure a time synchronization job (monitoring requires accurate time)

echo "------- Step 4 : Config time sync -------"

/usr/sbin/ntpdate pool.ntp.org

echo "#time sync by my at `date +%F` " >> /var/spool/cron/root

echo '*/10 * * * * /usr/sbin/ntpdate pool.ntp.org >/dev/null 2>&1' >> /var/spool/cron/root

crontab -l
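Appending to the crontab file with echo adds a duplicate entry every time the preparation script is rerun. A minimal sketch of an idempotent variant follows; the helper name add_cron_line is hypothetical, and the demo writes to a scratch file rather than the real /var/spool/cron/root.

```shell
#!/bin/sh
# add_cron_line FILE LINE (hypothetical helper, not part of cron or Nagios):
# append LINE to FILE only if the exact line is not already there.
add_cron_line() {
    cron_file=$1
    cron_line=$2
    touch "$cron_file"
    # -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF -- "$cron_line" "$cron_file"; then
        printf '%s\n' "$cron_line" >> "$cron_file"
    fi
}

# Demo against a scratch file
demo_file=$(mktemp)
entry='*/10 * * * * /usr/sbin/ntpdate pool.ntp.org >/dev/null 2>&1'
add_cron_line "$demo_file" "$entry"
add_cron_line "$demo_file" "$entry"   # second call changes nothing
grep -c ntpdate "$demo_file"          # prints 1
rm -f "$demo_file"
```

On a real host the first argument would be /var/spool/cron/root, as in the step above.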

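(5) Install gcc and the LAMP environment (Nagios provides a web interface; running Nagios with httpd is the officially recommended setup)

echo "------- Step 5 : Install gcc and lamp env etc-------"

yum install gcc glibc glibc-common -y # build environment

yum install gd gd-devel -y # for drawing graphs

yum install httpd php php-gd -y # PHP environment

yum install mysql* -y # optional, but without it the Nagios installation will not produce the plugins for monitoring databases

yum install perl-devel -y # required when installing the Nagios plugins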

3. Installation

(1) Add the nagios user and groups

echo "------- Step 6 : add nagios user and group -------"

/usr/sbin/useradd -m nagios

#/usr/sbin/useradd apache # already created when httpd was installed

/usr/sbin/groupadd nagcmd

/usr/sbin/usermod -a -G nagcmd nagios

/usr/sbin/usermod -a -G nagcmd apache

(2) Unpack and install the Nagios package

echo "------- Step 7 : download and install nagios-------"

cd /tools

unzip oldboy_training_nagios_soft.zip

tar xzf nagios-3.5.1.tar.gz

cd nagios

./configure --with-command-group=nagcmd

make all

make install

make install-init # This installs the init script in /etc/rc.d/init.d

make install-config # This installs sample config files in /usr/local/nagios/etc

make install-commandmode # installs and configures permissions on the external command file

make install-webconf # generates the Nagios configuration file for Apache: /etc/httpd/conf.d/nagios.conf

cat /etc/httpd/conf.d/nagios.conf # show the contents of nagios.conf

# SAMPLE CONFIG SNIPPETS FOR APACHE WEB SERVER
# Last Modified: 11-26-2005
#
# This file contains examples of entries that need
# to be incorporated into your Apache web server
# configuration file. Customize the paths, etc. as
# needed to fit your system.

ScriptAlias /nagios/cgi-bin "/usr/local/nagios/sbin"

<Directory "/usr/local/nagios/sbin">
# SSLRequireSSL
   Options ExecCGI
   AllowOverride None
   Order allow,deny
   Allow from all
# Order deny,allow
# Deny from all
# Allow from 127.0.0.1
   AuthName "NagiosAccess"
   AuthType Basic
   AuthUserFile /usr/local/nagios/etc/htpasswd.users
   Require valid-user
</Directory>

Alias /nagios "/usr/local/nagios/share"

<Directory "/usr/local/nagios/share">
# SSLRequireSSL
   Options None
   AllowOverride None
   Order allow,deny
   Allow from all
# Order deny,allow
# Deny from all
# Allow from 127.0.0.1
   AuthName "NagiosAccess"
   AuthType Basic
   AuthUserFile /usr/local/nagios/etc/htpasswd.users
   Require valid-user
</Directory>

(3) Configure Apache web authentication (the user name and password for the web login: test/123456)

echo "------- Step 8 : config web auth -------"

# The file name must match the AuthUserFile value in /etc/httpd/conf.d/nagios.conf, otherwise login fails

htpasswd -cb /usr/local/nagios/etc/htpasswd.users test 123456

cd ..

(4) Install the Nagios plugins

echo "------- Step 9 : install nagios-plugins -------"

#yum install perl-devel -y # required by the Nagios plugins; confirm that it is installed

tar zxf nagios-plugins-1.4.16.tar.gz

cd nagios-plugins-1.4.16

./configure --with-nagios-user=nagios --with-nagios-group=nagios --enable-perl-modules

make && make install

ls /usr/local/nagios/libexec/ | wc -l # count the installed plugins

(5) Install NRPE (the server needs the check_nrpe plugin)

echo "------- Step 10 : install nrpe -------"

tar zxf nrpe-2.12.tar.gz

cd nrpe-2.12

./configure

make all

make install-plugin

make install-daemon

make install-daemon-config

cd ..

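(6) Start the services and check them

echo "------- Step 11 : startup service and check -------"

/etc/init.d/nagios start

/etc/init.d/httpd start

lsof -i tcp:80

ps -ef | grep nagios

(7) Verify the login in a browser

http://192.168.1.198

A login dialog appears. Entering user test and password 123456 was rejected as a wrong user name or password. The cause turned out to be the following.

The AuthUserFile value in /etc/httpd/conf.d/nagios.conf is:

AuthUserFile /usr/local/nagios/etc/htpasswd.users

But the password file had been created under a different name:

htpasswd -cb /usr/local/nagios/etc/htpasswd.user test 123456

After recreating the password file with the correct name, the login succeeded:

htpasswd -cb /usr/local/nagios/etc/htpasswd.users test 123456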

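III. Installing the Nagios client

1. Software required on the client

(1) Software that is not needed

a. No LAMP environment.

gd, gd-devel, mysql*, httpd, php, and php-gd do not need to be installed.

b. No Nagios server package

nagios-3.5.1.tar.gz does not need to be installed.

c. No gcc environment

gcc, glibc, and glibc-common do not need to be installed.

(2) Software that is needed

a. Client software:

nrpe-2.12

b. Plugins:

Class-Accessor-0.31.tar.gz

Config-Tiny-2.12.tar.gz

Math-Calc-Units-1.07.tar.gz

Nagios-Plugin-0.34.tar.gz

Params-Validate-0.91.tar.gz

Regexp-Common-2010010201.tar.gz

check_iostat

check_memory.pl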

2. Preparation before installation

(1) Configure the yum repository

echo "------- Step 1 : Config yum -------"

cd /etc/yum.repos.d/

cp /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.bak

wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/CentOS-6.repo

(2) Configure the character set

echo "------- Step 2 : Config CharSet -------"

echo 'export LC_ALL=C' >> /etc/profile

source /etc/profile

(3) Disable the firewall and SELinux

echo "------- Step 3 : Stop iptables and SELinux -------"

a. Disable the firewall

/etc/init.d/iptables stop

/etc/init.d/ip6tables stop

chkconfig iptables off

chkconfig ip6tables off

b. Disable SELinux

setenforce 0

vi /etc/selinux/config # set the following line in the file

SELINUX=disabled

c. Disable SELinux from a script

if [ -f /etc/selinux/config ]; then
    sed -i 's#SELINUX=enforcing#SELINUX=disabled#g' /etc/selinux/config
    setenforce 0
fi

(4) Configure a time synchronization job (monitoring requires accurate time)

echo "------- Step 4 : Config time sync -------"

/usr/sbin/ntpdate pool.ntp.org

echo "#time sync by my at `date +%F` " >> /var/spool/cron/root

echo '*/10 * * * * /usr/sbin/ntpdate pool.ntp.org >/dev/null 2>&1' >> /var/spool/cron/root

crontab -l

3. Installation

(1) Add the nagios user

echo "------- Step 5 : add nagios user and group -------"

/usr/sbin/useradd -m nagios -s /sbin/nologin

(2) Install the Nagios plugins

echo "------- Step 6 : install nagios-plugins -------"

#yum install perl-devel -y # required by the Nagios plugins; confirm that it is installed

scp /wddg/tools/[email protected]:/wddg/tools/

cd /wddg/tools/

unzip oldboy_training_nagios_soft.zip

tar zxf nagios-plugins-1.4.16.tar.gz

cd nagios-plugins-1.4.16

./configure --prefix=/application/nagios --enable-perl-modules --enable-redhat-pthread-workaround

make && make install

ls /application/nagios/libexec/ | wc -l # count the installed plugins

(3) Install NRPE

echo "------- Step 7 : install nrpe -------"

tar zxf nrpe-2.12.tar.gz

cd nrpe-2.12

./configure --prefix=/application/nagios # the prefix must match the one used for nagios-plugins

make all

make install-plugin

make install-daemon

make install-daemon-config

cd ..

(4) Install iostat (the plugin for monitoring disk I/O)

echo "------- Step 8 : install iostat -------"

cd /wddg/tools/

echo "------- Step 8.1 : install Params-Validate -------"

tar zxvf Params-Validate-0.91.tar.gz

cd Params-Validate-0.91

perl Makefile.PL

make

make install

cd -

echo "------- Step 8.2 : install Class-Accessor -------"

tar zxvf Class-Accessor-0.31.tar.gz

cd Class-Accessor-0.31

perl Makefile.PL

make

make install

cd -

echo "------- Step 8.3 : install Config-Tiny -------"

tar zxvf Config-Tiny-2.12.tar.gz

cd Config-Tiny-2.12

perl Makefile.PL

make

make install

cd -

echo "------- Step 8.4 : install Math-Calc-Units -------"

tar zxvf Math-Calc-Units-1.07.tar.gz

cd Math-Calc-Units-1.07

perl Makefile.PL

make

make install

cd -

echo "------- Step 8.5 : install Regexp-Common -------"

tar zxvf Regexp-Common-2010010201.tar.gz

cd Regexp-Common-2010010201

perl Makefile.PL

make

make install

cd -

echo "------- Step 8.6 : install Nagios-Plugin -------"

tar zxvf Nagios-Plugin-0.34.tar.gz

cd Nagios-Plugin-0.34

perl Makefile.PL

make

make install

cd -

echo "------- Step 8.7 : install sysstat -------"

#for monitor iostat

yum install sysstat -y

echo "------- Step 8.8 : copy script to nagios -------"

/bin/cp /wddg/tools/check_memory.pl /application/nagios/libexec/

/bin/cp /wddg/tools/check_iostat /application/nagios/libexec/

echo "------- Step 8.9 : chmod 755 script-------"

chmod 755 /application/nagios/libexec/check_memory.pl

chmod 755 /application/nagios/libexec/check_iostat

echo "------- Step 8.10 : dos2unix script -------"

dos2unix /application/nagios/libexec/check_memory.pl

dos2unix /application/nagios/libexec/check_iostat

(5) Edit the NRPE configuration file nrpe.cfg

cp /application/nagios/etc/nrpe.cfg /application/nagios/etc/nrpe.cfg.bak

vi /application/nagios/etc/nrpe.cfg

a. Specify the IP of the Nagios server

# Line 79 before the change:

allowed_hosts=127.0.0.1

# Line 79 after the change:

allowed_hosts=127.0.0.1,192.168.1.198

b. Delete lines 199 to 203

sed -i '199,203d' /application/nagios/etc/nrpe.cfg

The deleted lines are:

command[check_users]=/application/nagios/libexec/check_users -w 5 -c 10

command[check_load]=/application/nagios/libexec/check_load -w 15,10,5 -c 30,25,20

command[check_hda1]=/application/nagios/libexec/check_disk -w 20% -c 10% -p /dev/hda1

command[check_zombie_procs]=/application/nagios/libexec/check_procs -w 5 -c 10 -s Z

command[check_total_procs]=/application/nagios/libexec/check_procs -w 150 -c 200

c. Append the following lines to the end of the file

command[check_load]=/application/nagios/libexec/check_load -w 15,10,6 -c 30,25,20

command[check_mem]=/application/nagios/libexec/check_memory.pl -w 6% -c 3%

command[check_disk]=/application/nagios/libexec/check_disk -w 20% -c 8% -p /

command[check_swap]=/application/nagios/libexec/check_swap -w 20% -c 10%

command[check_iostat]=/application/nagios/libexec/check_iostat -w 6 -c 10

(6) Start NRPE

/application/nagios/bin/nrpe -c /application/nagios/etc/nrpe.cfg -d

netstat -lntup | grep 5666

echo "/application/nagios/bin/nrpe -c /application/nagios/etc/nrpe.cfg -d" >> /etc/rc.local

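IV. Configuring monitoring on the Nagios server

1. Directory layout of the Nagios server

(1) Directory layout

ll /usr/local/nagios/

bin: Nagios executables, including nagios, npc, nrpe, etc.;

etc: Nagios configuration files

include: Nagios include files

libexec: Nagios plugins

perl:

sbin: the Nagios CGI files, i.e. the files needed to run external commands

share: the Nagios web application, mainly the PHP programs behind the Nagios interface

var: Nagios logs and data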

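(2) The etc directory

cd /usr/local/nagios/etc

-rw-rw-r--. cgi.cfg

-rw-r--r--. htpasswd.users

-rw-rw-r--. nagios.cfg

-rw-r--r--. nrpe.cfg

drwxrwxr-x. objects

-rw-rw----. resource.cfg

nagios.cfg is the main Nagios configuration file. It references resource.cfg and the files under the objects directory, while cgi.cfg configures the CGI web interface. nrpe.cfg is the client configuration file; it only needs to be configured if the Nagios server is also to be treated as a client. htpasswd.users is the password file for the Nagios web login.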

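(3) The etc/objects directory

ll /usr/local/nagios/etc/objects

commands.cfg: defines the commands that can be run, such as check_tcp and check_local_disk, which the service definitions refer to;

contacts.cfg: defines contacts, i.e. who is notified when a service goes down;

localhost.cfg: defines the checks for the local machine; generated by default;

printer.cfg: defines printers; not enabled by default and of little use in production;

switch.cfg: defines the monitoring of routers and switches; not enabled by default;

templates.cfg: template definitions for hosts and services, similar to functions in shell;

timeperiods.cfg: defines monitoring time periods (alert periods), such as 24x7 and workhours;

windows.cfg: defines the monitoring of Windows hosts; not enabled by default.

services.cfg: custom file holding the definitions of the monitored services (with hundreds of hosts a services directory can be used instead; does not exist by default)

hosts.cfg: custom file holding the definitions of the monitored hosts (with hundreds of hosts a hosts directory can be used instead; does not exist by default)

(4) How Nagios includes files and directories

Nagios does not use an include directive to pull in other files. It uses cfg_file=<full path to file>. For example:

cfg_file=/usr/local/nagios/etc/objects/commands.cfg

To include a directory, use cfg_dir=<full path to directory>; every .cfg file in that directory is then included. For example:

cfg_dir=/usr/local/nagios/etc/servers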
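To see which object files a given nagios.cfg would pull in through cfg_file and cfg_dir, the entries can be resolved with a few lines of shell. This is only a rough sketch: list_nagios_cfg_files is a hypothetical helper, it ignores relative paths, and it assumes that cfg_dir also picks up .cfg files in subdirectories. The demo builds a throwaway directory instead of reading the real /usr/local/nagios/etc.

```shell
#!/bin/sh
# list_nagios_cfg_files MAIN_CFG (hypothetical helper):
# print the files referenced by cfg_file= and the *.cfg files found
# under each cfg_dir= entry. Commented-out entries are skipped.
list_nagios_cfg_files() {
    main_cfg=$1
    sed -n 's/^[[:space:]]*cfg_file=//p' "$main_cfg"
    sed -n 's/^[[:space:]]*cfg_dir=//p' "$main_cfg" |
    while IFS= read -r dir; do
        if [ -d "$dir" ]; then
            find "$dir" -type f -name '*.cfg' | sort
        fi
    done
}

# Demo with a scratch directory
demo=$(mktemp -d)
mkdir -p "$demo/services"
: > "$demo/services/myservices.cfg"
: > "$demo/services/notes.txt"        # ignored: not a .cfg file
cat > "$demo/nagios.cfg" <<EOF
cfg_file=$demo/objects/hosts.cfg
#cfg_file=$demo/objects/localhost.cfg
cfg_dir=$demo/services
EOF
list_nagios_cfg_files "$demo/nagios.cfg"
rm -rf "$demo"
```

The authoritative check is still the pre-flight check that Nagios itself performs (nagios -v with the main configuration file).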

2. Configure the main configuration file nagios.cfg

(1) Back up the Nagios etc directory

cd /usr/local/nagios

tar czvf etc.tar.gz etc/

(2) Edit nagios.cfg

cd /usr/local/nagios/etc

vi nagios.cfg

a. Add the following 2 lines after line 34

vi nagios.cfg +34

cfg_file=/usr/local/nagios/etc/objects/services.cfg

cfg_file=/usr/local/nagios/etc/objects/hosts.cfg

b. Comment out line 38

Before:

# Definitions for monitoring the local (Linux) host

cfg_file=/usr/local/nagios/etc/objects/localhost.cfg

After:

#cfg_file=/usr/local/nagios/etc/objects/localhost.cfg

The Nagios server is treated as a client as well and monitored through NRPE, so it does not need to be monitored as a special case.

c. Add 1 line at line 58 to specify the directory of services to monitor

cfg_dir=/usr/local/nagios/etc/services

(3) Create the files and directory named in the configuration file

a. Create the etc/services directory

mkdir services

chown -R nagios.nagios services/ # let the nagios user and group manage it

b. Create hosts.cfg (from the first 51 lines of localhost.cfg)

cd objects

head -51 localhost.cfg > hosts.cfg

chown nagios.nagios hosts.cfg

pwd

/usr/local/nagios/etc/objects

c. Create services.cfg

touch services.cfg

chown nagios.nagios services.cfg

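3. Nagios monitoring modes: definitions and how to choose

(1) Monitoring modes

a. Active monitoring

At each check interval the server itself sends a request to fetch data from the remote host, as in URL monitoring. Nothing needs to be installed on the client.

b. Semi-passive monitoring (NRPE)

For local resources such as load, memory, disk, swap, I/O, temperature, and fan speed, the server's check_nrpe plugin periodically connects to the NRPE service on the client, and the information gathered there is sent back to the Nagios server.

c. Fully passive monitoring (NSCA)

The client reports on its own initiative.

(2) Choosing a mode

a. Active monitoring

Use active mode for services that are exposed externally, such as web and database services, for example http, ssh, mysql, and rsync.

This has nothing to do with NRPE: the server obtains the information directly with its own local plugins.

b. Semi-passive monitoring

Use this mode for local resources and performance, such as load, memory, disk, swap, I/O, temperature, and fan speed. (Some system resources can also be monitored through SNMP.)

The main program uses the check_nrpe plugin to talk to the NRPE process on the client, which calls the client's local plugins to collect the data.

c. Note

Active and passive are relative terms and can be swapped: a service monitored actively can be changed to passive mode, and a service monitored passively can sometimes be changed to active mode.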

V. Host and service monitoring in practice

1. Configure the hosts to monitor on the server

(1) Edit the configuration file hosts.cfg

vi /usr/local/nagios/etc/objects/hosts.cfg

a. Modify the host

Line 25 (hosts.cfg was created with head -51 localhost.cfg > hosts.cfg)

Before:

define host{

use linux-server

host_name localhost

alias localhost

address 127.0.0.1

}

After:

define host{

use linux-server ; the template, defined in templates.cfg

host_name 01-client218 ; any name you like

alias 01-client218 ; optional, usually the same as the host name

address 192.168.1.218

}

define host{

use linux-server

host_name 02-client219

alias 02-client219

address 192.168.1.219

}

define host{

use linux-server

host_name nagiosServer198

alias nagiosServer198

address 192.168.1.198

}

b. Modify the host group

Line 39

Before:

define hostgroup{

hostgroup_name linux-servers

alias Linux Servers

members localhost

}

After:

define hostgroup{

hostgroup_name linux-servers

alias Linux Servers

members 01-client218,02-client219,nagiosServer198

}
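With more than a handful of hosts, typing each define host block by hand becomes error-prone. The sketch below generates the blocks from a list of name/address pairs. The helper names emit_host and emit_hosts are hypothetical; the linux-server template and the field layout are taken from the definitions above.

```shell
#!/bin/sh
# emit_host NAME ADDRESS (hypothetical helper): print one host definition.
emit_host() {
    printf 'define host{\n'
    printf '    use        linux-server\n'
    printf '    host_name  %s\n' "$1"
    printf '    alias      %s\n' "$1"
    printf '    address    %s\n' "$2"
    printf '}\n'
}

# emit_hosts: read "name address" pairs from stdin, one pair per line.
emit_hosts() {
    while read -r name addr; do
        [ -n "$name" ] || continue    # skip empty lines
        emit_host "$name" "$addr"
    done
}

emit_hosts <<'EOF'
01-client218 192.168.1.218
02-client219 192.168.1.219
nagiosServer198 192.168.1.198
EOF
```

The output can be redirected into hosts.cfg and then validated with /etc/init.d/nagios checkconfig.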

(2) Check the syntax

a. Command

/etc/init.d/nagios checkconfig

Running configuration check... CONFIG ERROR! Check your Nagios configuration.

The output only says that there is a configuration error; it does not say what the error is.

b. Edit line 183 of the init script so that it prints the details

vim /etc/init.d/nagios +183

Before:

$NagiosBin -v $NagiosCfgFile > /dev/null 2>&1;

After:

$NagiosBin -v $NagiosCfgFile;

c. Check again: the output now says that no services are configured

/etc/init.d/nagios checkconfig

Running configuration check...

...

Checking services...

Error: There are no services defined!

Checked 0 services.

Checking hosts...

Warning: Host '01-client218' has no services associated with it!

Warning: Host '02-client219' has no services associated with it!

Warning: Host 'nagiosServer198' has no services associated with it!

Checked 3 hosts.

...

Total Warnings: 3

Total Errors: 1

...

2. Configure the services to monitor on the server

(1) Edit the configuration file services.cfg

vi /usr/local/nagios/etc/objects/services.cfg

define service{

use generic-service

host_name 01-client218

service_description DiskPartition

check_command check_nrpe!check_disk ; check_disk is the command configured in nrpe.cfg

}

define service{

use generic-service

host_name 02-client219

service_description DiskPartition

check_command check_nrpe!check_disk

}

define service{

use generic-service

host_name nagiosServer198

service_description DiskPartition

check_command check_nrpe!check_disk

}

(2) Check the syntax

a. Command

/etc/init.d/nagios checkconfig

b. It fails, reporting that the check_nrpe command is not defined

/etc/init.d/nagios checkconfig

Running configuration check...

...

Checking services...

Error: Service check command 'check_nrpe' specified in service 'Disk Partition' for host '01-client218' not defined anywhere!

Error: Service check command 'check_nrpe' specified in service 'Disk Partition' for host '02-client219' not defined anywhere!

Error: Service check command 'check_nrpe' specified in service 'Disk Partition' for host 'nagiosServer198' not defined anywhere!

Checked 3 services.

...

Total Warnings: 0

Total Errors: 3

...

(3) Define the check_nrpe command (in commands.cfg)

a. Confirm that commands.cfg really does not define a check_nrpe command

b. Define the check_nrpe command (append it at the end of the file)

vi /usr/local/nagios/etc/objects/commands.cfg

define command{

command_name check_nrpe

command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$

}
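It helps to see how a service's check_command turns into the command that actually runs. Nagios splits check_nrpe!check_disk at the exclamation mark, looks up the command named check_nrpe, and substitutes the macros in its command_line. The sketch below imitates that substitution for $USER1$, $HOSTADDRESS$, and $ARG1$ only; expand_check_command is a hypothetical helper, not part of Nagios, and it assumes the values contain no "|" or "&" characters.

```shell
#!/bin/sh
# expand_check_command CHECK_COMMAND COMMAND_LINE HOSTADDRESS USER1
# (hypothetical helper): print COMMAND_LINE with three macros expanded.
expand_check_command() {
    check_command=$1    # e.g. check_nrpe!check_disk
    command_line=$2     # the command_line from commands.cfg
    host_address=$3     # the host's address field
    user1=$4            # value of $USER1$, normally the plugin directory
    # The first field after the command name becomes $ARG1$
    case $check_command in
        *!*) arg1=${check_command#*!}; arg1=${arg1%%!*} ;;
        *)   arg1= ;;
    esac
    printf '%s\n' "$command_line" | sed \
        -e 's|\$USER1\$|'"$user1"'|g' \
        -e 's|\$HOSTADDRESS\$|'"$host_address"'|g' \
        -e 's|\$ARG1\$|'"$arg1"'|g'
}

expand_check_command 'check_nrpe!check_disk' \
    '$USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$' \
    192.168.1.218 /usr/local/nagios/libexec
# prints: /usr/local/nagios/libexec/check_nrpe -H 192.168.1.218 -c check_disk
```

The printed line is the same command that can be run by hand on the server to test a check before adding it to the configuration.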

(4) Check the syntax again: correct

/etc/init.d/nagios checkconfig

...

Total Warnings: 0

Total Errors: 0

Things look okay - No serious problems were detected during the pre-flight check

OK.

(5) Check directly from the command line: OK

/usr/local/nagios/libexec/check_nrpe -H 192.168.158.218 -c check_disk

DISK OK - free space: / 9150 MB (54% inode=74%);|/=7755MB;14252;16390;0;17816
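Everything after the "|" in plugin output is performance data, written as label=value;warn;crit;min;max. The following sketch splits that part into named fields with awk. parse_perfdata is a hypothetical helper, and it assumes labels without spaces, which holds for the output shown above.

```shell
#!/bin/sh
# parse_perfdata "PLUGIN OUTPUT" (hypothetical helper):
# print one line per performance-data item found after the first '|'.
parse_perfdata() {
    printf '%s\n' "$1" | awk '{
        p = index($0, "|")
        if (p == 0) next                      # no performance data
        n = split(substr($0, p + 1), items, " ")
        for (i = 1; i <= n; i++) {
            eq = index(items[i], "=")
            if (eq == 0) continue
            label = substr(items[i], 1, eq - 1)
            split(substr(items[i], eq + 1), f, ";")
            printf "label=%s value=%s warn=%s crit=%s min=%s max=%s\n",
                label, f[1], f[2], f[3], f[4], f[5]
        }
    }'
}

parse_perfdata 'DISK OK - free space: / 9150 MB (54% inode=74%);|/=7755MB;14252;16390;0;17816'
# prints: label=/ value=7755MB warn=14252 crit=16390 min=0 max=17816
```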

(6) View Nagios monitoring in a browser: error

a. Open http://192.168.158.198/nagios/ in IE and choose Hosts in the left navigation bar. The following error appears:

It appears as though you do not have permission to view information for any of the hosts you requested...

If you believe this is an error, check the HTTP server authentication requirements for accessing this CGI and check the authorization options in your CGI configuration file.

b. Cause

The message points to insufficient CGI permissions. The file /usr/local/nagios/etc/cgi.cfg on the server expects the user nagiosadmin, but we log in to Nagios as test.

vi /usr/local/nagios/etc/cgi.cfg

grep nagiosadmin /usr/local/nagios/etc/cgi.cfg

authorized_for_system_information=nagiosadmin

authorized_for_configuration_information=nagiosadmin

authorized_for_system_commands=nagiosadmin

authorized_for_all_services=nagiosadmin

authorized_for_all_hosts=nagiosadmin

authorized_for_all_service_commands=nagiosadmin

authorized_for_all_host_commands=nagiosadmin

c. Fix

(i) Replace nagiosadmin with test in cgi.cfg.

sed -i s/nagiosadmin/test/g /usr/local/nagios/etc/cgi.cfg
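The authorized_for_* options accept a comma-separated list of users, so an alternative to replacing nagiosadmin is to append the new user and keep the default one. A minimal sketch follows; grant_cgi_user is a hypothetical helper that writes the modified file to standard output, and the demo uses a scratch file rather than the real cgi.cfg.

```shell
#!/bin/sh
# grant_cgi_user CGI_CFG USER (hypothetical helper):
# print CGI_CFG with USER appended to every authorized_for_* line
# that does not already list that user.
grant_cgi_user() {
    cgi_cfg=$1
    user=$2
    awk -v user="$user" '
    /^authorized_for_[a-z_]+=/ {
        eq = index($0, "=")
        value = substr($0, eq + 1)
        n = split(value, names, ",")
        found = 0
        for (i = 1; i <= n; i++) if (names[i] == user) found = 1
        if (!found) {
            if (value == "") $0 = $0 user
            else             $0 = $0 "," user
        }
    }
    { print }
    ' "$cgi_cfg"
}

# Demo with a scratch file
demo_cfg=$(mktemp)
printf '%s\n' 'authorized_for_all_hosts=nagiosadmin' 'show_context_help=0' > "$demo_cfg"
grant_cgi_user "$demo_cfg" test
# prints:
# authorized_for_all_hosts=nagiosadmin,test
# show_context_help=0
rm -f "$demo_cfg"
```

On a real server the output would be written to a temporary file, reviewed, and then moved over /usr/local/nagios/etc/cgi.cfg.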

(ii) Restart or reload Nagios

/etc/init.d/nagios reload

d. View Nagios monitoring in the browser again

(i) Open http://192.168.158.198/nagios/ in IE again and choose Hosts in the left navigation bar. The status of the monitored hosts is now displayed correctly.

01-client218	UP	07-02-2017	0d 0h 14m 31s	PING OK - Packet loss = 0%, RTA = 0.41 ms
02-client219	UP	07-02-2017	0d 0h 11m 11s	PING OK - Packet loss = 0%, RTA = 0.52 ms
nagiosServer198	UP	07-02-2017	0d 0h 7m 51s	PING OK - Packet loss = 0%, RTA = 0.05 ms

(ii) Choose Services in the left navigation bar: one service reports an error

01-client218	Disk	OK	07-02-2017	42738	DISK OK - free...
02-client219	Disk	OK	07-02-2017	42738	DISK OK - free...
nagiosServer198	Disk	CRITICAL	07-02-2017	42797	Connection refused by host

3. Troubleshooting the Services error on the server

(1) Check iptables and SELinux: neither had been disabled, so disable both

/etc/init.d/iptables stop

/etc/init.d/ip6tables stop

chkconfig iptables off

chkconfig ip6tables off

setenforce 0

vi /etc/selinux/config # set the following line in the file

SELINUX=disabled

(2) Check whether NRPE is running: it was not, so start it

# start NRPE on 198

/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d

netstat -lntup | grep nrpe

tcp 0 0 0.0.0.0:5666 0.0.0.0:* LISTEN 2473/nrpe

echo "/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d" >> /etc/rc.local

(3) Check from the command line: SSL handshake error

./check_nrpe -H 192.168.158.198 -c check_disk

CHECK_NRPE: Error - Could not complete SSL handshake

(4) Check SSL: it is installed

rpm -qa | grep openssl

openssl-1.0.1e-57.el6.i686

openssl-devel-1.0.1e-57.el6.i686

(5) Check nrpe.cfg: NRPE on the server had never been configured

Before:

allowed_hosts=127.0.0.1

command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10

command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20

command[check_hda1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/hda1

command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z

command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200

After:

allowed_hosts=127.0.0.1,192.168.158.198

#command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10

#command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20

#command[check_hda1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/hda1

#command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z

#command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200

command[check_load]=/application/nagios/libexec/check_load -w 15,10,6 -c 30,25,20

command[check_mem]=/application/nagios/libexec/check_memory.pl -w 6% -c 3%

command[check_disk]=/application/nagios/libexec/check_disk -w 20% -c 8% -p /

command[check_swap]=/application/nagios/libexec/check_swap -w 20% -c 10%

command[check_iostat]=/application/nagios/libexec/check_iostat -w 6 -c 10

(6) View the Nagios services page in the browser: error

nagiosServer198 UNKNOWN ... NRPE: Unable to read output # the NRPE daemon did not get any data from the plugin

(7) Check nrpe.cfg: the paths in the command[...] entries are wrong

Wrong paths (/application/nagios is the NRPE prefix used on the clients):

command[check_load]=/application/nagios/libexec/check_load -w 15,10,6 -c 30,25,20

command[check_mem]=/application/nagios/libexec/check_memory.pl -w 6% -c 3%

command[check_disk]=/application/nagios/libexec/check_disk -w 20% -c 8% -p /

command[check_swap]=/application/nagios/libexec/check_swap -w 20% -c 10%

command[check_iostat]=/application/nagios/libexec/check_iostat -w 6 -c 10

Correct paths (/usr/local/nagios is the NRPE prefix used on the server):

command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,6 -c 30,25,20

command[check_mem]=/usr/local/nagios/libexec/check_memory.pl -w 6% -c 3%

command[check_disk]=/usr/local/nagios/libexec/check_disk -w 20% -c 8% -p /

command[check_swap]=/usr/local/nagios/libexec/check_swap -w 20% -c 10%

command[check_iostat]=/usr/local/nagios/libexec/check_iostat -w 6 -c 10
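A wrong plugin path in nrpe.cfg only shows up as "NRPE: Unable to read output" on the server, so it can be worth checking the paths on the NRPE host before restarting the daemon. The sketch below lists every command[...] entry whose plugin is missing or not executable. find_missing_nrpe_plugins is a hypothetical helper; the demo builds a throwaway configuration instead of reading the real nrpe.cfg.

```shell
#!/bin/sh
# find_missing_nrpe_plugins NRPE_CFG (hypothetical helper):
# report command[...] entries whose plugin path is not an executable file.
find_missing_nrpe_plugins() {
    nrpe_cfg=$1
    # Turn "command[name]=/path/to/plugin args" into "name /path/to/plugin"
    sed -n 's/^command\[\([^]]*\)]=\([^[:space:]]*\).*/\1 \2/p' "$nrpe_cfg" |
    while read -r name plugin; do
        if [ ! -x "$plugin" ]; then
            printf '%s: %s is missing or not executable\n' "$name" "$plugin"
        fi
    done
}

# Demo with a scratch directory
demo=$(mktemp -d)
printf '#!/bin/sh\nexit 0\n' > "$demo/check_disk"
chmod 755 "$demo/check_disk"
cat > "$demo/nrpe.cfg" <<EOF
allowed_hosts=127.0.0.1
command[check_disk]=$demo/check_disk -w 20% -c 8% -p /
command[check_mem]=$demo/check_memory.pl -w 6% -c 3%
EOF
find_missing_nrpe_plugins "$demo/nrpe.cfg"   # reports only check_mem
rm -rf "$demo"
```

An empty result means every configured plugin path exists and is executable by the user running the check.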

(8) Copy check_memory.pl and check_iostat to the libexec directory

echo "------- Step 8.8 : copy script to nagios -------"

/bin/cp /wddg/tools/check_memory.pl /usr/local/nagios/libexec/

/bin/cp /wddg/tools/check_iostat /usr/local/nagios/libexec/

echo "------- Step 8.9 : chmod 755 script-------"

chmod 755 /usr/local/nagios/libexec/check_memory.pl

chmod 755 /usr/local/nagios/libexec/check_iostat

echo "------- Step 8.10 : dos2unix script -------"

dos2unix /usr/local/nagios/libexec/check_memory.pl

dos2unix /usr/local/nagios/libexec/check_iostat

(9) Restart NRPE on the server

ps -ef | grep nrpe

nagios 2314 1 ... /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d

root 2759 2439 0 16:39 pts/0 00:00:00 grep --color=auto nrpe

pkill nrpe

ps -ef | grep nrpe

root 2767 2439 0 16:39 pts/0 00:00:00 grep --color=auto nrpe

/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d

(10) Check directly from the command line: OK

cd /usr/local/nagios/libexec

./check_nrpe -H 192.168.158.198 -c check_disk

DISK OK - free space: / 30025 MB (88% inode=94%);|/=3975MB;28656;32954;0;35820

(11) View Nagios monitoring in the browser: OK

4. Add more monitored services for 218 on the server

(1) Edit services.cfg on the server and add the check_mem and check_iostat services

vi /usr/local/nagios/etc/objects/services.cfg

define service{

use generic-service

host_name 01-client218

service_description Mem

check_command check_nrpe!check_mem

}

define service{

use generic-service

host_name 01-client218

service_description IO

check_command check_nrpe!check_iostat

}

(2) Check the syntax: OK

/etc/init.d/nagios checkconfig

(3) Restart the related services (in practice /etc/init.d/nagios reload is enough)

pkill nrpe

/etc/init.d/httpd stop

/etc/init.d/nagios stop

/etc/init.d/nagios start

/etc/init.d/httpd start

/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d

(4) Check from the command line: OK

./check_nrpe -H 192.168.1.218 -c check_mem

CHECK_MEMORY OK - 1631M free | free=1710317568b;117609922.56:;58804961.28:

./check_nrpe -H 192.168.1.218 -c check_iostat

IOSTAT OK - user 0.02 nice 0.01 sys 0.17 iowait 0.24 idle 0.00 | iowait=0.24%;; idle=0.00%;; user=0.02%;; nice=0.01%;; sys=0.17%;;

(5) View Nagios monitoring in the browser: OK (it takes a moment; PENDING means the check has not run yet)

VI. Active monitoring: checks initiated by the Nagios server (such as URL and port checks)

1. The check_tcp plugin

(1) Change to the plugin directory

cd /usr/local/nagios/libexec

(2) Show the help

./check_tcp --help

(3) Test

./check_tcp -H 192.168.1.148 -p 80

TCP OK - 0.001 second response time on port 80|time=0.000865s;;;0.000000;10.000000

./check_tcp -H 192.168.1.148 -p 22

TCP OK - 0.001 second response time on port 22|time=0.000721s;;;0.000000;10.000000

2. The check_http plugin

(1) Show the help

./check_http --help

(2) Test

./check_http -I 192.168.1.148 -p 80

HTTP OK: HTTP/1.1 200 OK - 243 bytes in 0.027 second response time |time=0.027303s;;;0.000000 size=243B;;;0

3. HTTP monitoring in practice

(1) Define an HTTP check as a custom service and put it in the custom services directory

cd /usr/local/nagios/etc/services

vi myservices.cfg

define service{

use generic-service

host_name 01-client148

service_description nginsweb

check_command check_weburl!-I 192.168.1.148

max_check_attempts 3

normal_check_interval 2

retry_check_interval 1

check_period 24x7

notification_interval 30

notification_period 24x7

notification_options w,u,c,r

contact_groups admins

}

(2) Define the check_weburl command (check_http can also be used directly, without defining check_weburl)

vi /usr/local/nagios/etc/objects/commands.cfg

# 'check_weburl' command definition

define command{

command_name check_weburl

command_line $USER1$/check_http $ARG1$ -w 10 -c 30

}

(3) Check the syntax: OK

/etc/init.d/nagios checkconfig

(4) Reload Nagios

/etc/init.d/nagios reload

(5) View Nagios monitoring in the browser: OK (it takes a moment; PENDING means the check has not run yet)

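4. Domain name monitoring in practice (all steps are performed on the server)

(1) Copy the service configuration file (the configuration can be split across files, for example one file per service or one file per host)

cd /usr/local/nagios/etc/services

cp myservices.cfg 01-client148.cfg

(2) Edit the configuration file 01-client148.cfg

vi 01-client148.cfg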

define service{

use generic-service

host_name 01-client148

service_description nginsweb

check_command check_weburl!-H blog.abc.org

max_check_attempts 3

normal_check_interval 2

retry_check_interval 1

check_period 24x7

notification_interval 30

notification_period 24x7

notification_options w,u,c,r

contact_groups admins

}

define service{

use generic-service

host_name 02-client149

service_description nginsuri

check_command check_http!-H blog.nginx.org -u /static/

max_check_attempts 3

normal_check_interval 2

retry_check_interval 1

check_period 24x7

notification_interval 30

notification_period 24x7

notification_options w,u,c,r

contact_groups admins

}

(3) Edit the hosts file

vi /etc/hosts

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4

::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.1.148 blog.abc.org

(4) Check the web service on 01-client148

curl 192.168.58.148

www.nginx.org

curl blog.abc.org

www.nginx.org

(5) Test with check_http

cd /usr/local/nagios/libexec/

./check_http -H blog.abc.org

HTTP OK: HTTP/1.1 200 OK - 243 bytes in 0.002 second response time |time=0.001894s;;;0.000000 size=243B;;;0

(6) Test a URI (the -u option; quote the URI with "" if it is complex)

cd /usr/local/nagios/libexec/

./check_http -H blog.abc.org -u "/static/index.php?id=1"

HTTP OK: HTTP/1.1 200 OK - 243 bytes in 0.002 second response time |time=0.001894s;;;0.000000 size=243B;;;0

(7) View Nagios monitoring in the browser: OK (it takes a moment; PENDING means the check has not run yet)

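5. Port monitoring in practice (with check_tcp)

(1) Copy the service configuration file (the configuration can be split across files, for example one file per service or one file per host)

cd /usr/local/nagios/etc/services

cp myservices.cfg 02-client149port.cfg

(2) Edit the configuration file 02-client149port.cfg

vi 02-client149port.cfg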

define service{

use generic-service

host_name 02-client149

service_description port_22

check_command check_tcp!-H blog.nginx.org -p 22

max_check_attempts 3

normal_check_interval 2

retry_check_interval 1

check_period 24x7

notification_interval 30

notification_period 24x7

notification_options w,u,c,r

contact_groups admins

}

define service{

use generic-service

host_name 02-client149

service_description port_3306

check_command check_tcp!-H blog.nginx.org -p 3306

max_check_attempts 3

normal_check_interval 2

retry_check_interval 1

check_period 24x7

notification_interval 30

notification_period 24x7

notification_options w,u,c,r

contact_groups admins

}

(3) Edit the hosts file

vi /etc/hosts

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4

::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.1.149 blog.nginx.org

(4) Test with check_tcp

cd /usr/local/nagios/libexec/

./check_tcp -H blog.nginx.org -p 22

(5) View Nagios monitoring in the browser: OK (it takes a moment; PENDING means the check has not run yet)

6. Summary of active monitoring

(1) Get the check command working on the server's command line first

(2) Define the Nagios command in commands.cfg so that it calls the command-line plugin

(3) Define the service to monitor in the service configuration file, referring to the Nagios command defined in commands.cfg.
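Step (1) of the summary above also applies to plugins you write yourself. Nagios reads a plugin's result from its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and shows the first line of its output. The sketch below shows that convention for a single integer value where higher is worse; check_value is a hypothetical function, not one of the shipped plugins.

```shell
#!/bin/sh
# check_value VALUE WARN CRIT (hypothetical example plugin logic):
# print a status line with performance data and return the matching
# plugin exit code: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
check_value() {
    value=$1; warn=$2; crit=$3
    case $value in
        ''|*[!0-9]*)
            echo "UNKNOWN - value '$value' is not a non-negative integer"
            return 3 ;;
    esac
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - value=$value | value=$value;$warn;$crit"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - value=$value | value=$value;$warn;$crit"
        return 1
    fi
    echo "OK - value=$value | value=$value;$warn;$crit"
    return 0
}

check_value 3 5 10                          # OK - value=3 | value=3;5;10
check_value 12 5 10 || echo "exit code: $?" # CRITICAL line, then exit code: 2
```

In a standalone plugin script the function's return value would be passed on with exit, so that Nagios or NRPE sees it as the script's exit status.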

VII. Monitoring port 80 through NRPE (passive mode) in practice

1. Test the plugin command on the client

cd /application/nagios/libexec/

./check_tcp -H 192.168.58.148 -p 80

TCP OK - 0.000 second response time on port 80|time=0.000243s;;;0.000000;10.000000

2. Configure nrpe.cfg on the client

cd /application/nagios/etc/

vi nrpe.cfg

# append one line at the end

command[check_port_80]=/application/nagios/libexec/check_tcp -H 192.168.58.148 -p 80 -w 5 -c 10

3. Restart NRPE on the client

ps -ef | grep nrpe

nagios ... /application/nagios/bin/nrpe -c /application/nagios/etc/nrpe.cfg -d

pkill nrpe

/application/nagios/bin/nrpe -c /application/nagios/etc/nrpe.cfg -d

netstat -lntup | grep nrpe

tcp 0 0 0.0.0.0:5666 0.0.0.0:* LISTEN 22133/nrpe

4. Test the NRPE command from the server

/usr/local/nagios/libexec/check_nrpe -H 192.168.58.148 -c check_port_80

TCP OK - 0.000 second ...80|time=0.000169s;5.000000;10.000000;0.000000;10.000000

5. Edit the configuration file on the server

cd /usr/local/nagios/etc/services/

vi 01-client148.cfg

# add one service

define service{

use generic-service

host_name 01-client148

service_description port_80

check_command check_nrpe!check_port_80

max_check_attempts 3

normal_check_interval 2

retry_check_interval 1

check_period 24x7

notification_interval 30

notification_period 24x7

notification_options w,u,c,r

contact_groups admins

}

6. Check the syntax on the server

/etc/init.d/nagios checkconfig

7. Reload Nagios

/etc/init.d/nagios reload

8. View Nagios monitoring in the browser: OK (it takes a moment; PENDING means the check has not run yet)

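VIII. Displaying services in groups

1. Format

define servicegroup{

servicegroup_name group name

alias group alias

members host name,service description,host name,service description,...

}

2. Requirement

Every host/service pair listed in members must match an existing service, where the second item of each pair is that service's service_description. In this example the group is named after the service description, Mem. If the service's service_description is changed to Mem1 while members still refers to Mem, the syntax check fails:

Error: Could not find a service matching host name '01-client148' and description 'Mem' (config file '/usr/local/nagios/etc/services/servergroup.cfg', starting on line 1)

Error: Could not expand member services specified in servicegroup (config file '/usr/local/nagios/etc/services/servergroup.cfg', starting on line 1)

Error processing object config files!

3. Create the group file (the definition can in fact go into any .cfg configuration file)

vi /usr/local/nagios/etc/services/servergroup.cfg

define servicegroup{

servicegroup_name Mem

alias Mem

members 01-client148,Mem,nagiosServer161225,Mem

}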
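The members value is a flat list of host,service pairs, which is easy to get wrong by hand once a group covers many hosts. A minimal sketch that builds the list for one service description across several hosts follows; servicegroup_members is a hypothetical helper.

```shell
#!/bin/sh
# servicegroup_members SERVICE_DESCRIPTION HOST... (hypothetical helper):
# print "host1,svc,host2,svc,..." for use in a servicegroup members line.
servicegroup_members() {
    svc=$1
    shift
    members=
    for host in "$@"; do
        # add a comma separator only when the list is non-empty
        members="${members:+$members,}$host,$svc"
    done
    printf '%s\n' "$members"
}

servicegroup_members Mem 01-client148 nagiosServer161225
# prints: 01-client148,Mem,nagiosServer161225,Mem
```

Each generated pair still has to correspond to a service that is actually defined for that host, otherwise the syntax check reports the error described in the requirement above.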

4. Check the syntax on the server

/etc/init.d/nagios checkconfig

5. Reload Nagios

/etc/init.d/nagios reload

6. Open Service Groups in the navigation bar of the Nagios web interface: the group is displayed correctly

IX. Nagios monitoring parameters

1. Contact parameters (/usr/local/nagios/etc/objects/contacts.cfg)

Parameter	Value	Description
name	generic-contact	Contact (template) name
service_notification_period	24x7	Time period during which notifications about service problems are sent; "24x7" is defined in timeperiods.cfg
host_notification_period	24x7	Time period during which notifications about host problems are sent; "24x7" is defined in timeperiods.cfg
service_notification_options	w,u,c,r	Service states that trigger a notification: w = warning, u = unknown, c = critical, r = recovery
host_notification_options	d,u,r	Host states that trigger a notification: d = down, u = unreachable, r = recovery
service_notification_commands	notify-service-by-email	How service notifications are sent (email, SMS and so on; email here). notify-service-by-email is defined in commands.cfg
host_notification_commands	notify-host-by-email	How host notifications are sent (email, SMS and so on; email here). notify-host-by-email is defined in commands.cfg
register	0	Template only; the definition is not registered as a real contact

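2. Host parameters

Parameter	Value	Description
use	linux-server	Template used by the monitored host; see templates.cfg
host_name	01-client218	Name of the monitored host; freely chosen
alias	01-client218	Alias of the monitored host; freely chosen
address	192.168.1.218	IP address of the monitored host
check_command	check-host-alive	Command that checks whether the host is alive; defined in commands.cfg
max_check_attempts	3	Maximum number of check attempts after a failure
normal_check_interval	2	Interval between normal checks, in minutes by default
retry_check_interval	2	Interval between re-checks after a failure, in minutes by default
check_period	24x7	Time period during which checks run; defined in timeperiods.cfg
notification_interval	300	Interval between repeated notifications while the problem persists, in minutes
notification_period	24x7	Time period during which notifications are sent
notification_options	d,u,r	Host states that trigger a notification: d = down, u = unreachable, r = recovery
contact_groups	admins	Contact group that receives the alerts; defined in contacts.cfg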

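3. Service parameters

Parameter	Value	Description
use	generic-service	Template used by the service; see templates.cfg
host_name	01-client218	Name of the monitored host, as defined in hosts.cfg
service_description	Mem	Description of the service; choose a meaningful name
check_command	check_nrpe!check_mem	Command that checks the service
max_check_attempts	2	Maximum number of check attempts
normal_check_interval	2	Interval between normal checks, in minutes by default
retry_check_interval	2	Interval between re-checks after a failure, in minutes by default
check_period	24x7	Time period during which checks run; defined in timeperiods.cfg
notification_interval	300	Interval between repeated notifications while the problem persists, in minutes
notification_period	24x7	Time period during which notifications are sent
notification_options	w,u,c,r	Service states that trigger a notification: w = warning, u = unknown, c = critical, r = recovery
contact_groups	admins	Contact group that receives the alerts; defined in contacts.cfg
process_perf_data	1	Whether performance data is recorded; needed for PNP graphs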

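4. Time period parameters (/usr/local/nagios/etc/objects/timeperiods.cfg)

(1) Define a time period named 24x7, which covers every hour of every day

define timeperiod{

timeperiod_name 24x7 ; name of the time period; it must not contain spaces

alias 24 Hours A Day, 7 Days A Week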

sunday 00:00-24:00

monday 00:00-24:00

tuesday 00:00-24:00

wednesday 00:00-24:00

thursday 00:00-24:00

friday 00:00-24:00

saturday 00:00-24:00

}

(2) Define a time period named workhours, which covers normal working hours

define timeperiod{

timeperiod_name workhours

alias Normal Work Hours

monday 09:00-17:00

tuesday 09:00-17:00

wednesday 09:00-17:00

thursday 09:00-17:00

friday 09:00-17:00

}

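X. Nagios templates and contact configuration

1. Template location

/usr/local/nagios/etc/objects/templates.cfg

2. Purpose

templates.cfg holds reusable host, service and contact templates. A template plays a role similar to a function in a shell script: it is defined once and then referenced with use wherever it is needed.

3. The main template file (templates.cfg), annotated (the comments were added for these notes)

egrep -v "#|^$" /usr/local/nagios/etc/objects/templates.cfg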

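################################ Contact template ################################

define contact{

name generic-contact ; name of the generic contact template

service_notification_period 24x7 ; service notifications may be sent at any time

host_notification_period 24x7 ; host notifications may be sent at any time

service_notification_options w,u,c,r,f,s ; notify on warning, unknown, critical, recovery, flapping and scheduled downtime

host_notification_options d,u,r,f,s ; notify on down, unreachable, recovery, flapping and scheduled downtime

service_notification_commands notify-service-by-email ; send service notifications by email

host_notification_commands notify-host-by-email ; send host notifications by email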

}

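################################ Generic host template ################################

define host{

name generic-host ; name of the generic host template

notifications_enabled 1 ; host notifications enabled (1 = on, 0 = off)

event_handler_enabled 1 ; host event handler enabled (1 = on, 0 = off)

flap_detection_enabled 1 ; flap detection enabled

failure_prediction_enabled 1 ; failure prediction enabled

process_perf_data 1 ; process performance data

retain_status_information 1 ; retain status information across program restarts

retain_nonstatus_information 1 ; retain non-status information across program restarts

notification_period 24x7 ; host notifications may be sent at any time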

}

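################################ Linux host template ################################

define host{

name linux-server ; name of the Linux host template

use generic-host ; inherit the remaining values from generic-host

check_period 24x7 ; checks may run at any time

check_interval 5 ; check every 5 minutes

retry_interval 1 ; after a failure, re-check every minute

max_check_attempts 10 ; maximum number of check attempts after a failure

check_command check-host-alive ; command that checks whether the host is alive

notification_period workhours ; notify during working hours only

notification_interval 120 ; re-send the notification every 120 minutes

notification_options d,u,r ; notify on down, unreachable and recovery

contact_groups admins ; notify the admins contact group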

}

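################################ Windows host template ################################

define host{

name windows-server ; name of the Windows host template

use generic-host ; inherit the remaining values from generic-host

check_period 24x7 ; checks may run at any time

check_interval 5 ; check every 5 minutes

retry_interval 1 ; after a failure, re-check every minute

max_check_attempts 10 ; maximum number of check attempts after a failure

check_command check-host-alive ; command that checks whether the host is alive

notification_period 24x7 ; notifications may be sent at any time

notification_interval 30 ; re-send the notification every 30 minutes

notification_options d,r ; notify on down and recovery

contact_groups admins ; notify the admins contact group

hostgroups windows-servers ; Windows host group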

}

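################################ Generic printer template ################################

define host{

name generic-printer ; name of this host template

use generic-host ; inherit the remaining values from generic-host

check_period 24x7 ; checks may run at any time

check_interval 5 ; check every 5 minutes

retry_interval 1 ; after a failure, re-check every minute

max_check_attempts 10 ; maximum number of check attempts after a failure

check_command check-host-alive ; command that checks whether the host is alive

notification_period workhours ; notify during working hours only

notification_interval 30 ; re-send the notification every 30 minutes

notification_options d,r ; notify on down and recovery only

contact_groups admins ; notify the admins contact group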

}

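################################ Generic switch template ################################

define host{

name generic-switch ; name of this host template

use generic-host ; inherit the remaining values from generic-host

check_period 24x7 ; checks may run at any time

check_interval 5 ; check the switch every 5 minutes

retry_interval 1 ; after a failure, re-check every minute

max_check_attempts 10 ; maximum number of check attempts after a failure

check_command check-host-alive ; command that checks whether the switch is alive

notification_period 24x7 ; notifications may be sent at any time

notification_interval 30 ; re-send the notification every 30 minutes

notification_options d,r ; notify on down and recovery

contact_groups admins ; notify the admins contact group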

}

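################################ Generic service template ################################

define service{

name generic-service ; name of the generic service template

active_checks_enabled 1 ; active checks enabled

passive_checks_enabled 1 ; passive checks enabled

parallelize_check 1 ; run checks in parallel

obsess_over_service 1 ; used in distributed monitoring (1 = on, 0 = off)

check_freshness 0 ; do not check service freshness

notifications_enabled 1 ; service notifications enabled

event_handler_enabled 1 ; service event handler enabled

flap_detection_enabled 1 ; flap detection enabled

failure_prediction_enabled 1 ; failure prediction enabled

process_perf_data 1 ; process performance data

retain_status_information 1 ; retain status information across program restarts

retain_nonstatus_information 1 ; retain non-status information across program restarts

is_volatile 0 ; the service is not volatile

check_period 24x7 ; checks may run at any time

max_check_attempts 3 ; check up to 3 times before the state is considered final

normal_check_interval 10 ; check every 10 minutes under normal conditions

retry_check_interval 2 ; re-check every 2 minutes until the state is confirmed

contact_groups admins ; notify the admins contact group

notification_options w,u,c,r ; notify on warning, unknown, critical and recovery

notification_interval 60 ; re-send the notification every 60 minutes

notification_period 24x7 ; notifications may be sent at any time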

}

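################################ Local service template ################################

define service{

name local-service ; name of the local service template

use generic-service ; inherit from generic-service

max_check_attempts 4 ; check up to 4 times before the state is considered final

normal_check_interval 5 ; check every 5 minutes under normal conditions

retry_check_interval 1 ; re-check every minute until the state is confirmed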

}

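4. Templates in practice

(1) Export an existing template as a starting point (lines 153-177 are where the exported service template sits in this installation's file)

sed -n '153,177p' /usr/local/nagios/etc/objects/templates.cfg > /tmp/mytemplates.cfg

(2) Edit the custom template

vi /tmp/mytemplates.cfg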

define service{

name generic-myservice

active_checks_enabled 1

passive_checks_enabled 1

parallelize_check 1

obsess_over_service 1

check_freshness 0

notifications_enabled 1

event_handler_enabled 1

flap_detection_enabled 1

failure_prediction_enabled 1

process_perf_data 1

retain_status_information 1

retain_nonstatus_information 1

is_volatile 0

check_period 24x7

max_check_attempts 3

normal_check_interval 10

retry_check_interval 2

contact_groups admins

notification_options w,u,c,r

notification_interval 60

notification_period 24x7

}

(3) Append the custom template to the end of the main template file templates.cfg

vi /usr/local/nagios/etc/objects/templates.cfg

define service{

name generic-myservice

active_checks_enabled 1

passive_checks_enabled 1

parallelize_check 1

obsess_over_service 1

check_freshness 0

notifications_enabled 1

event_handler_enabled 1

flap_detection_enabled 1

failure_prediction_enabled 1

process_perf_data 1

retain_status_information 1

retain_nonstatus_information 1

is_volatile 0

check_period 24x7

max_check_attempts 3

normal_check_interval 10

retry_check_interval 2

contact_groups admins

notification_options w,u,c,r

notification_interval 60

notification_period 24x7

}

(4) Use the custom template in a service definition

cd /usr/local/nagios/etc/services

vi myservices.cfg

a. Before the change

define service{

use generic-service

host_name 01-client148

service_description nginsweb

check_command check_weburl!-I 192.168.1.148

max_check_attempts 3

normal_check_interval 2

retry_check_interval 1

check_period 24x7

notification_interval 30

notification_period 24x7

notification_options w,u,c,r

contact_groups admins

}

b. After the change

define service{

use generic-myservice

host_name 01-client148

service_description nginsweb

check_command check_weburl!-I 192.168.1.148

}

5. Contacts in practice

(1) Define new contacts and a contact group in the contacts file

vi /usr/local/nagios/etc/objects/contacts.cfg

# Operations staff (operation and maintenance staff)

define contact{

contact_name test01

use generic-contact

alias OMS

email test01@localhost

}

define contact{

contact_name test02

use generic-contact

alias OMS

email test02@localhost

}

# Operations contact group

define contactgroup{

contactgroup_name omgroup

alias Nagios OMG

members test01,test02

}

(2) How to use the group

a. Option 1: change the contact groups in the custom template

vi /usr/local/nagios/etc/objects/templates.cfg

define service{

name generic-myservice

active_checks_enabled 1

passive_checks_enabled 1

parallelize_check 1

obsess_over_service 1

check_freshness 0

notifications_enabled 1

event_handler_enabled 1

flap_detection_enabled 1

failure_prediction_enabled 1

process_perf_data 1

retain_status_information 1

retain_nonstatus_information 1

is_volatile 0

check_period 24x7

max_check_attempts 3

normal_check_interval 10

retry_check_interval 2

contact_groups admins,omgroup

notification_options w,u,c,r

notification_interval 60

notification_period 24x7

}

b. Option 2: add a contact_groups directive to each service

define service{

use generic-myservice

host_name 01-client148

service_description nginsweb

check_command check_weburl!-I 192.168.1.148

contact_groups omgroup

}

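XI. Developing custom plugins

1. Why

What needs monitoring keeps changing, and the stock plugins cover less and less of it over time. At that point you write your own plugins.

2. Requirements

Nagios plugins can be scripts or compiled programs (Java, C, C++, PHP, shell and so on). Nagios places no restriction on the language, provided the plugin does two things:

(1) Exit with a status code, which Nagios uses to decide the state of the monitored service (a common interview question)

a. Status codes

0: OK

1: WARNING

2: CRITICAL

3: UNKNOWN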

b. The status codes as defined by Nagios

head -7 utils.sh

#! /bin/sh

STATE_OK=0

STATE_WARNING=1

STATE_CRITICAL=2

STATE_UNKNOWN=3

STATE_DEPENDENT=4 # rarely used

c. Setting the exit status in different languages

(i) Java: System.exit(status)

(ii) PHP: exit(status)

(iii) Python: sys.exit(status)

(iv) C/C++: return status; from main(), or exit(status)

(v) bash: exit status

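(2) Print one line to standard output, which Nagios shows in the Status Information column of the web interface

a. Only the first line is needed

b. Print statements in different languages

(i) Java: System.out.println(msg)

(ii) PHP: echo msg

(iii) Python: print(msg)

(iv) C/C++: printf("%s\n", msg)

(v) bash: echo msg or printf '%s\n' msg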
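The two requirements above (an exit status and one line on standard output) can be sketched as a small shell function. The function name check_threshold and the thresholds 80 and 90 are hypothetical and chosen only for illustration; the state codes are the ones listed in utils.sh.

```shell
#!/bin/sh
# Hypothetical helper: compare a numeric value against warning/critical
# thresholds, print one status line, and return the Nagios state code.
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

check_threshold() {
    value=$1
    warn=$2
    crit=$3
    # Anything that is not a non-negative integer is reported as UNKNOWN.
    case "$value" in
        ''|*[!0-9]*)
            echo "UNKNOWN - value '$value' is not a number"
            return $STATE_UNKNOWN
            ;;
    esac
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - value is $value"
        return $STATE_CRITICAL
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - value is $value"
        return $STATE_WARNING
    fi
    echo "OK - value is $value"
    return $STATE_OK
}

# A real plugin script would end with something like:
#   check_threshold "$1" 80 90
#   exit $?
```

Keeping the logic in a function that returns the state, and calling exit only on the last line of the script, makes the plugin easy to try out by hand before it is wired into Nagios.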

3. Example 1: monitoring changes to the password file /etc/passwd

(1) On the server, generate a checksum (fingerprint) of /etc/passwd with md5sum

md5sum /etc/passwd > /etc/passwd.md5

(2) On the server, verify with md5sum whether /etc/passwd has changed

md5sum -c /etc/passwd.md5

/etc/passwd: OK

(3) On the server, write the check script

vi /usr/local/nagios/libexec/check_passwd

char=`md5sum -c /etc/passwd.md5 | grep "OK" | wc -l`

if [ $char -eq 1 ];then

echo "passwd is ok"

exit 0

else

echo "passwd is changed"

exit 2

fi

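(4) On the server, test the script

a. Run the script

sh check_passwd

passwd is ok

b. Add a user

useradd aaaa

c. Run the script again (md5sum prints a warning, which takes up the first line of output)

sh check_passwd

md5sum: WARNING: 1 of 1 computed checksum did NOT match

passwd is changed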

(5) On the server, change the script to suppress the warning

vi /usr/local/nagios/libexec/check_passwd

#!/bin/sh

char=`md5sum -c /etc/passwd.md5 2>/dev/null | grep "OK" | wc -l`

if [ $char -eq 1 ];then

echo "passwd is ok"

exit 0

else

echo "passwd is changed"

exit 2

fi

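(6) On the server, rebuild the checksum file

md5sum /etc/passwd > /etc/passwd.md5

(7) On the server, test the script again

a. Run the script

sh check_passwd

passwd is ok

b. Add a user

useradd bbbb

c. Run the script again (the md5sum warning is now suppressed, so the status line comes first)

sh check_passwd

passwd is changed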

(8) On the server, make the script executable

chmod +x /usr/local/nagios/libexec/check_passwd

ll /usr/local/nagios/libexec/check_passwd

-rwxr-xr-x 1 root root 182 Jul 8 16:21 /usr/local/nagios/libexec/check_passwd
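The script above decides by counting "OK" lines in the md5sum output. As an alternative sketch, the check can rely on the exit status of md5sum -c and report UNKNOWN when the checksum file is missing. The function name check_file_md5 is hypothetical, and this is a variant for illustration rather than the script used in these notes.

```shell
#!/bin/sh
# Hypothetical variant of check_passwd: takes the checksum file as its
# argument and uses the exit status of "md5sum -c" instead of grepping.
check_file_md5() {
    baseline=$1
    if [ ! -r "$baseline" ]; then
        echo "UNKNOWN - checksum file $baseline is not readable"
        return 3
    fi
    # md5sum -c exits non-zero when any listed file fails verification.
    if md5sum -c "$baseline" >/dev/null 2>&1; then
        echo "OK - checksums match"
        return 0
    fi
    echo "CRITICAL - checksum mismatch"
    return 2
}

# A plugin built on this sketch would end with something like:
#   check_file_md5 /etc/passwd.md5
#   exit $?
```

Returning UNKNOWN for a missing checksum file keeps a setup mistake from being reported as a changed password file.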

(9) On the server, configure nrpe.cfg (the server's Nagios lives under /usr/local/nagios)

vi /usr/local/nagios/etc/nrpe.cfg

# Add the following line at the end

command[check_passwd]=/usr/local/nagios/libexec/check_passwd

(10) On the server, restart nrpe

ps -ef | grep nrpe

nagios ... /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d

pkill nrpe

/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d

netstat -lntup | grep nrpe

tcp 0 0 0.0.0.0:5666 0.0.0.0:* LISTEN 22133/nrpe

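(11) On the server, test through nrpe (the first run reports a change because a user was added after the checksum was built; rebuilding the checksum clears it)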

./check_nrpe -H 192.168.1.198 -c check_passwd

passwd is changed

md5sum /etc/passwd > /etc/passwd.md5

./check_nrpe -H 192.168.1.198 -c check_passwd

passwd is ok

(12) On the client, set up the check_passwd script

md5sum /etc/passwd > /etc/passwd.md5

md5sum -c /etc/passwd.md5

vi /application/nagios/libexec/check_passwd

#!/bin/sh

char=`md5sum -c /etc/passwd.md5 2>/dev/null | grep "OK" | wc -l`

if [ $char -eq 1 ];then

echo "passwd is ok"

exit 0

else

echo "passwd is changed"

exit 2

fi

chmod +x /application/nagios/libexec/check_passwd

vi /application/nagios/etc/nrpe.cfg

# Add the following line at the end

command[check_passwd]=/application/nagios/libexec/check_passwd

(13) On the client, restart nrpe

ps -ef | grep nrpe

nagios ... /application/nagios/bin/nrpe -c /application/nagios/etc/nrpe.cfg -d

pkill nrpe

/application/nagios/bin/nrpe -c /application/nagios/etc/nrpe.cfg -d

netstat -lntup | grep nrpe

tcp 0 0 0.0.0.0:5666 0.0.0.0:* LISTEN 22133/nrpe

(14) On the server, edit the service configuration file

vi /usr/local/nagios/etc/objects/services.cfg

# Add a service

define service{

use generic-service

host_name 01-client218,02-client219,nagiosServer198

service_description check_passwd

check_command check_nrpe!check_passwd

}

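(15) Check the configuration syntax on the server

/etc/init.d/nagios checkconfig

(16) Reload Nagios

/etc/init.d/nagios reload

(17) View the service in the Nagios web interface: OK (the state shows PENDING until the first check has run, then changes to OK)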

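XII. Graphing and managing Nagios monitoring data

1. Install PNP for graphing (on the server)

(1) Check the environment

a. Check the packages that graphing depends on

rpm -q zlib zlib-devel freetype freetype-devel cairo pango gd gd-devel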

zlib-1.2.3-29.el6.i686

zlib-devel-1.2.3-29.el6.i686

freetype-2.3.11-14.el6_3.1.i686

freetype-devel-2.3.11-14.el6_3.1.i686

cairo-1.8.8-3.1.el6.i686

pango-1.28.1-7.el6_3.i686

gd-2.0.35-11.el6.i686

package gd-devel is not installed

b. Install gd-devel

rpm -ivh gd-devel-2.0.35-11.el6.i686.rpm

c. Check the dependencies again

rpm -q zlib zlib-devel freetype freetype-devel cairo pango gd gd-devel

zlib-1.2.3-29.el6.i686

zlib-devel-1.2.3-29.el6.i686

freetype-2.3.11-14.el6_3.1.i686

freetype-devel-2.3.11-14.el6_3.1.i686

cairo-1.8.8-3.1.el6.i686

pango-1.28.1-7.el6_3.i686

gd-2.0.35-11.el6.i686

gd-devel-2.0.35-11.el6.i686

(2) Install rrdtool, the tool that draws the graphs

a. Install the dependency libart_lgpl (rrdtool depends on it)

(i) Option 1: yum

yum install libart_lgpl libart_lgpl-devel -y

(ii) Option 2: build from source

cd /wddg/tools/

wget http://ftp.gnome.org/pub/gnome/sources/libart_lgpl/2.3/libart_lgpl-2.3.17.tar.gz

tar zxf libart_lgpl-2.3.17.tar.gz

cd libart_lgpl-2.3.17

./configure

make

make install

/bin/cp -r /usr/local/include/libart-2.0 /usr/include/

cd ..

b. Install rrdtool

# wget http://oss.oetiker.ch/rrdtool/pub/rrdtool-1.2.14.tar.gz

tar xf rrdtool-1.2.14.tar.gz

cd rrdtool-1.2.14

./configure --prefix=/usr/local/rrdtool --disable-python --disable-tcl

make

make install

cd ..

ll /usr/local/rrdtool/bin

-rwxr-xr-x 1 root root 45032 Jul 9 12:08 rrdcgi

-rwxr-xr-x 1 root root 4915 Jul 9 12:08 rrdtool

-rwxr-xr-x 1 root root 42633 Jul 9 12:08 rrdupdate

Note: the three files above in /usr/local/rrdtool/bin indicate a successful installation. Warnings printed by configure can be ignored.

(3) Install PNP, which presents the graphs (PNP collects the data, rrdtool draws the graphs, and PNP displays them)

tar zxf pnp-0.4.14.tar.gz

cd pnp-0.4.14

./configure --with-rrdtool=/usr/local/rrdtool/bin/rrdtool \

--with-perfdata-dir=/usr/local/nagios/share/perfdata/

make all

make install

make install-config

make install-init

ll /usr/local/nagios/libexec/ | grep process

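-rwxr-xr-x 1 nagios nagios 31827 Jul 9 12:21 process_perfdata.pl

Notes:

--with-rrdtool=/usr/local/rrdtool/bin/rrdtool: the command that actually draws the graphs

--with-perfdata-dir=/usr/local/nagios/share/perfdata/: the directory holding the data the graphs are drawn from

The presence of process_perfdata.pl in /usr/local/nagios/libexec/ indicates a successful installation

Warnings printed by configure can be ignored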
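The two "installation succeeded" checks in this section both come down to whether certain executables exist. A minimal sketch of that check as one function follows; the name missing_files is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report every given path that is not an executable
# file. Returns 0 when all paths are present, 1 otherwise.
missing_files() {
    rc=0
    for f in "$@"; do
        if [ ! -x "$f" ]; then
            echo "missing: $f"
            rc=1
        fi
    done
    return $rc
}

# Example with the paths used in this section:
# missing_files /usr/local/rrdtool/bin/rrdtool \
#               /usr/local/nagios/libexec/process_perfdata.pl
```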

2. Configure PNP (on the server)

(1) Edit nagios.cfg to turn on performance data processing

cd /usr/local/nagios/etc

cp nagios.cfg nagios.cfg.bak

vi nagios.cfg +835

Before (line 835):

process_performance_data=0

After (line 835):

# Performance data switch: 0 = do not process, 1 = process

process_performance_data=1

Before (lines 847 and 848):

#host_perfdata_command=process-host-perfdata

#service_perfdata_command=process-service-perfdata

After (lines 847 and 848):

# Process host performance data

host_perfdata_command=process-host-perfdata

# Process service performance data

service_perfdata_command=process-service-perfdata
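The three edits to nagios.cfg can also be made without opening an editor. The following is a sketch that assumes GNU sed (for the -i option) and assumes the three lines appear in the file exactly as shown in the "before" listings; the function name enable_perfdata is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: switch on performance data processing in the
# nagios.cfg file given as the first argument. Back up the file first.
enable_perfdata() {
    cfg=$1
    sed -i \
        -e 's/^process_performance_data=0/process_performance_data=1/' \
        -e 's/^#host_perfdata_command=process-host-perfdata/host_perfdata_command=process-host-perfdata/' \
        -e 's/^#service_perfdata_command=process-service-perfdata/service_perfdata_command=process-service-perfdata/' \
        "$cfg"
}

# Example:
# enable_perfdata /usr/local/nagios/etc/nagios.cfg
```

Matching on the line content rather than on line numbers 835, 847 and 848 keeps the sketch usable when the file layout differs slightly between versions.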

(2) Edit commands.cfg to change where the performance data goes

cd /usr/local/nagios/etc/objects

vi commands.cfg +227

Before:

# 'process-host-perfdata' command definition

define command{

command_name process-host-perfdata

command_line /usr/bin/printf "%b" "$LASTHOSTCHECK$\t$HOSTNAME$\t$HOSTSTATE$\t$HOSTATTEMPT$\t$HOSTSTATETYPE$\t$HOSTEXECUTIONTIME$\t$HOSTOUTPUT$\t$HOSTPERFDATA$\n" >> /usr/local/nagios/var/host-perfdata.out

}

# 'process-service-perfdata' command definition

define command{

command_name process-service-perfdata

command_line /usr/bin/printf "%b" "$LASTSERVICECHECK$\t$HOSTNAME$\t$SERVICEDESC$\t$SERVICESTATE$\t$SERVICEATTEMPT$\t$SERVICESTATETYPE$\t$SERVICEEXECUTIONTIME$\t$SERVICELATENCY$\t$SERVICEOUTPUT$\t$SERVICEPERFDATA$\n" >> /usr/local/nagios/var/service-perfdata.out

}

After:

# 'process-host-perfdata' command definition

define command{

command_name process-host-perfdata

command_line /usr/local/nagios/libexec/process_perfdata.pl

}

# 'process-service-perfdata' command definition

define command{

command_name process-service-perfdata

command_line /usr/local/nagios/libexec/process_perfdata.pl

}

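(3) Check the syntax

/etc/init.d/nagios checkconfig

(4) Reload Nagios

/etc/init.d/nagios reload

(5) Open the PNP page in a browser: OK (it takes a little while before data appears)

http://192.168.161.225/nagios/pnp/index.php

(6) Note

For graphs to appear, process_perf_data must also be set to 1 in templates.cfg or in your own host and service definitions; otherwise no data is collected.

vi templates.cfg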

define host{

name generic-host

notifications_enabled 1

event_handler_enabled 1

flap_detection_enabled 1

failure_prediction_enabled 1

process_perf_data 1

retain_status_information 1

retain_nonstatus_information 1

notification_period 24x7

}

define service{

name generic-service

active_checks_enabled 1

passive_checks_enabled 1

parallelize_check 1

obsess_over_service 1

check_freshness 0

notifications_enabled 1

event_handler_enabled 1

flap_detection_enabled 1

failure_prediction_enabled 1

process_perf_data 1

retain_status_information 1

retain_nonstatus_information 1

}

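3. Link the PNP graphs into the Nagios web interface (on the server)

(1) Overview

So far the graphs can only be reached at http://192.168.161.225/nagios/pnp/index.php, which lists every available graph in one place. The goal is a small graph icon next to each host or service in the Nagios interface; clicking it opens the trend graph for that host or service.

This is done by setting the action_url directive in templates.cfg or in your own host and service definitions.

By default the plugins shipped with Nagios produce graphs, but custom plugins do not, because a custom plugin does not return performance data to Nagios unless it is written to do so.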
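Performance data is whatever a plugin prints after a | character on its status line, in the form label=value;warn;crit. As a minimal sketch, a custom plugin could build its output line like this; the function name format_perfdata and the label changed are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: append one performance data item to a status line.
# Arguments: status text, label, value, warning threshold, critical threshold.
format_perfdata() {
    printf '%s | %s=%s;%s;%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example: a check_passwd-style plugin could report 0 (unchanged) or
# 1 (changed) as a value that PNP can graph:
# format_perfdata "passwd is ok" changed 0 1 1
```

Only the part before the | is shown as Status Information; the part after it is what process_perfdata.pl receives.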

(2) Host graph directive

action_url /nagios/pnp/index.php?host=$HOSTNAME$

(3) Service graph directive

action_url /nagios/pnp/index.php?host=$HOSTNAME$&srv=$SERVICEDESC$

(4) Configuration example

vi templates.cfg

define host{

name linux-server

use generic-host

check_period 24x7

check_interval 5

retry_interval 1

max_check_attempts 10

check_command check-host-alive

notification_period workhours

notification_interval 120

notification_options d,u,r

contact_groups admins

action_url /nagios/pnp/index.php?host=$HOSTNAME$

}

define service{

name generic-service

active_checks_enabled 1

passive_checks_enabled 1

parallelize_check 1

obsess_over_service 1

check_freshness 0

notifications_enabled 1

event_handler_enabled 1

flap_detection_enabled 1

failure_prediction_enabled 1

process_perf_data 1

retain_status_information 1

retain_nonstatus_information 1

is_volatile 0

check_period 24x7

max_check_attempts 3

normal_check_interval 10

retry_check_interval 2

contact_groups admins

notification_options w,u,c,r

notification_interval 60

notification_period 24x7

action_url /nagios/pnp/index.php?host=$HOSTNAME$&srv=$SERVICEDESC$

}

(5) Graph data

ll /usr/local/nagios/share/perfdata/01-client148/

-rw-r--r-- 1 nagios nagios 384736 Jul 9 13:38 Disk.rrd

-rw-r--r-- 1 nagios nagios 11319 Jul 9 13:38 Disk.xml

-rw-r--r-- 1 nagios nagios 1917824 Jul 9 13:43 IO.rrd

-rw-r--r-- 1 nagios nagios 13077 Jul 9 13:43 IO.xml

-rw-r--r-- 1 nagios nagios 384736 Jul 9 13:44 Mem.rrd

-rw-r--r-- 1 nagios nagios 11405 Jul 9 13:44 Mem.xml

-rw-r--r-- 1 nagios nagios 768008 Jul 9 13:44 dnsweb.rrd

-rw-r--r-- 1 nagios nagios 11846 Jul 9 13:44 dnsweb.xml

-rw-r--r-- 1 nagios nagios 768008 Jul 9 13:45 nginsweb.rrd

-rw-r--r-- 1 nagios nagios 11857 Jul 9 13:45 nginsweb.xml

-rw-r--r-- 1 nagios nagios 384736 Jul 9 13:43 port_80.rrd

-rw-r--r-- 1 nagios nagios 11394 Jul 9 13:43 port_80.xml

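(6) Screenshot

XIII. Nagios alerting methods and policy

1. Alerting methods

(1) Email: recommendation 2, mainly for important but non-urgent services

In production, use your own company's mail system for alert mail where possible. Other providers limit sending frequency and may reject the mail or treat it as spam, so alerts are delayed or never arrive.

(2) Fetion-to-SMS: not recommended

A Fetion client has to be installed on a Windows machine and the recipient's mobile number added as a friend, which the recipient must confirm, before SMS messages can be sent.

(3) Email-to-SMS

Mailboxes such as 139, 126 and 189 can notify the recipient by phone when mail arrives; this is a mail-reminder feature offered by the mailbox provider. The length of the alert text is limited.

(4) HTTP SMS gateway: recommendation 1 (paid)

Specialist companies provide SMS gateways that deliver messages straight to a phone. An alert is typically a single URL request that carries the message. SMS charges apply. The format looks like this:

http://s.ccme.cc/send.jsp?circle=username&pwd=password&mobile=$CONTACT&service=gg89-3aa06423clf83fd&msgid=23224&message=$TITLE[${alert_date}sa]

(5) SMS modem: acts as a mobile terminal

(6) Voice call: not widely implemented and sometimes paid. The person responsible is phoned directly when an alert fires.

(7) Instant messaging (MSN, QQ, WeChat and so on): recommendation 3

Community-written programs imitate the QQ and MSN protocols; run from the command line, they send a message straight to an MSN or QQ contact.

(8) Audible alarm: generally used in staffed machine rooms, where the sound alerts the person on duty.

(9) Email bound to WeChat: the mail arrives and WeChat raises the reminder

(10) Mobile mail client

(11) 24x7 on-call staff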

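2. Alerting in production

(1) Email

For problems that do not need urgent handling, email is usually chosen, for example memory or free disk space.

(2) Email plus SMS

For important and urgent services, email and SMS are sent together. Email keeps a detailed record of the fault; SMS gets attention quickly.

(3) HTTP SMS gateway

Simple, easy to use, stable, reliable and reasonably priced

(4) How to approach the decision

Spending a reasonable amount to run the service well is the normal way to think about this. Insisting on free tools can cost far more if an alert fails to go out. Lay out the trade-offs clearly and let management decide. A properly run company should choose a reliable alerting method for its services wherever possible.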

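3. Alert severity levels

(1) Alert (fault) classes

Class A: disk space, CPU, memory and similar alerts are ordinary alerts, handled by the operations team through routine procedures.

Class B: a service outage or an unreachable website is a serious alert and needs technical staff from the relevant departments to be brought in.

(2) On-call duties (two on-call phones)

Class A: in principle there is no time limit, but the alert must be handled promptly and before it affects service.

Class B: the whole operations team and the relevant technical staff must be informed by email within 10 minutes.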

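4. How alert configuration works

Most of the work is in /usr/local/nagios/etc/objects/contacts.cfg: add everyone who should receive alerts and put them into groups. Then set the contact_groups directive in the host, service or template: contact_groups groupname. For example:

contact_groups mobilegroup

# Mobile SMS user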

define contact{

contact_name mobile_test01

use generic-contact

alias OMS

email [email protected] ; the mail provider's reminder feature turns the mail into an SMS

}

# Email and MSN user

define contact{

contact_name test02

use generic-contact

alias OMS

email [email protected]

address1 [email protected] ; MSN address

}

# Email user

define contact{

contact_name test03

use generic-contact

alias OMS

email [email protected]

}

# Mobile contact group

define contactgroup{

contactgroup_name mobilegroup

alias mobile

members mobile_test01, mobile_test02

}

XIV. SMS gateway alerting in practice

1. Add the contact and contact group (contacts.cfg)

vi /usr/local/nagios/etc/objects/contacts.cfg

define contact{

contact_name test03-pager

use generic-contact

alias OMS

email [email protected]

pager 12345678911

}

2. Add the SMS notification commands (commands.cfg)

vi /usr/local/nagios/etc/objects/commands.cfg

# 'notify-host-by-pager' command definition

define command{

command_name notify-host-by-pager

command_line $USER1$/sms_send "Host $HOSTSTATE$ alert for $HOSTNAME$" $CONTACTPAGER$

}

# 'notify-service-by-pager' command definition

define command{

command_name notify-service-by-pager

command_line $USER1$/sms_send "$HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$" $CONTACTPAGER$

}

3. Change the contact's notification commands (either in the contact itself or in the contact template)

(1) Method 1: in the contact

vi /usr/local/nagios/etc/objects/contacts.cfg

define contact{

contact_name test03-pager

use generic-contact

alias OMS

email [email protected]

pager 12345678911

service_notification_commands notify-service-by-email,notify-service-by-pager

host_notification_commands notify-host-by-email,notify-host-by-pager

}

(2) Method 2: in the contact template (check which template the contact uses; here it is generic-contact)

vi /usr/local/nagios/etc/objects/templates.cfg

define contact{

name generic-contact

service_notification_period 24x7

host_notification_period 24x7

service_notification_options w,u,c,r,f,s

host_notification_options d,u,r,f,s

service_notification_commands notify-service-by-email,notify-service-by-pager

host_notification_commands notify-host-by-email,notify-host-by-pager

}

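4. Assign the contacts and contact groups to hosts or services (in hosts.cfg and services.cfg, or in the corresponding templates)

vi /usr/local/nagios/etc/objects/hosts.cfg

define host{

use linux-server

host_name 01-client218

alias 01-client218

contact_groups admins ; contact groups go in contact_groups

contacts test03-pager ; individual contacts go in contacts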

address 192.168.1.218

}

define service{

use generic-service

host_name 01-client148

service_description port_80

check_command check_nrpe!check_port_80

max_check_attempts 3

normal_check_interval 2

retry_check_interval 1

check_period 24x7

notification_interval 30

notification_period 24x7

notification_options w,u,c,r

contact_groups admins

contacts test03-pager

}

5. Write the SMS alert script

cd /usr/local/nagios/libexec

vi sms_send

#!/bin/sh

alert_date=$(date +%y-%m-%d" "%H:%M)

TITLE=$1 # message text, e.g. "Host $HOSTSTATE$ alert for $HOSTNAME$"

CONTACT=$2 # mobile number, passed in from $CONTACTPAGER$

# curl method (--data-urlencode keeps spaces in the message intact)

curl -d cdkey=3RTY-EMY-0980-MTUQ2 -d password=189162 -d phone=$CONTACT --data-urlencode "message=${TITLE}[${alert_date} myusersa]" http://a.b.c/sdkproxy/sendsms.action

# wget method

#wget --quiet "http://s.ccme.cc/qxt/send.jsp?circle=test01&pwd=123456&mobile=12345678901&service=f1fb0546-ebb6-0987-8f20-560524c1f88d&msgid=3956724&message=$TITLE[${alert_date}myusersa n]"

6. Make the SMS alert script executable

chmod +x /usr/local/nagios/libexec/sms_send

7. Test the script (message first, mobile number second, the same order that commands.cfg uses)

./sms_send "aaaaaa" 12345678901
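A mistyped mobile number is an easy way to lose an alert without noticing. As a sketch, the script could refuse to call the gateway when the number does not look right. The function name valid_mobile is hypothetical, and the rule of exactly 11 digits is an assumption based on the sample numbers in these notes.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the argument consists of exactly
# 11 digits (an assumed format, matching the sample pager numbers).
valid_mobile() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "${#1}" -eq 11 ]
}

# Possible use near the top of sms_send:
# valid_mobile "$CONTACT" || { echo "bad mobile number: $CONTACT" >&2; exit 3; }
```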

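XV. Troubleshooting Nagios

1. Common errors and their causes

(1) Connection refused by host

First check whether iptables and SELinux are turned off, then whether nrpe is running.

(2) Could not complete SSL handshake

First check whether openssl and openssl-devel are installed, then whether allowed_hosts in nrpe.cfg includes the Nagios server's IP. The format is allowed_hosts=127.0.0.1,Server_IP: multiple IPs are separated by commas, with no spaces.

(3) NRPE: Unable to read output

The nrpe daemon on the client did not get any output from the plugin. Work through the troubleshooting steps in part 2 of this section.

(4) NRPE: Command 'Command_name' not defined

The command being called is not defined. The name in command[Command_name] in nrpe.cfg must match the Command_name in check_command check_nrpe!Command_name in services.cfg.

(5) check_nrpe is not defined

a. Error message

Error: Service check command 'check_nrpe' specified in service 'DiskPartition' for host 'host_name' not defined anywhere!

b. Fix: append the following define command block to commands.cfg

vi /usr/local/nagios/etc/objects/commands.cfg

define command{

command_name check_nrpe

command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$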

}
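The error list above can be kept at hand as a small lookup function. This is only a sketch for convenience: the function name nrpe_hint is hypothetical, and the hints are short summaries of the causes listed above.

```shell
#!/bin/sh
# Hypothetical helper: map a Nagios/NRPE error message to the first
# things to check, following the list of common errors above.
nrpe_hint() {
    case "$1" in
        *"Connection refused"*)
            echo "check iptables and SELinux, then whether nrpe is running" ;;
        *"Could not complete SSL handshake"*)
            echo "check openssl and openssl-devel, then allowed_hosts in nrpe.cfg" ;;
        *"Unable to read output"*)
            echo "run the plugin by hand on the client and check its permissions" ;;
        *"check command"*"not defined anywhere"*)
            echo "define check_nrpe in commands.cfg" ;;
        *"not defined"*)
            echo "make command[...] in nrpe.cfg match check_nrpe!... in services.cfg" ;;
        *)
            echo "no hint for this message" ;;
    esac
}

# Example:
# nrpe_hint "$(/usr/local/nagios/libexec/check_nrpe -H 192.168.1.198 -c check_disk)"
```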

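2. Troubleshooting steps for NRPE checks (using the check_disk service as an example)

(1) Check Nagios itself and its configuration files on the server

(2) On the server, run check_nrpe against the client to check the SSL connection from server to client and whether NRPE is running on both sides

/usr/local/nagios/libexec/check_nrpe -H client_ip -c check_disk

(3) On the client, run check_nrpe against itself to check whether the client's NRPE can get data

/usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -c check_disk

(4) On the client, run the plugin command directly to check whether it returns data and whether it is executable

a. Look up the command behind check_disk in nrpe.cfg

command[check_disk]=/application/nagios/libexec/check_disk -w 20% -c 8% -p /

b. Run that command

/application/nagios/libexec/check_disk -w 20% -c 8% -p /

c. Check that the check_disk plugin is executable

ll /application/nagios/libexec/check_disk

-rwxr-xr-x 1 root root 418052 Jun 30 20:19 /application/nagios/libexec/check_disk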
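When these steps are repeated often, a tiny wrapper that labels each step as passed or failed saves some reading. This is a sketch only; the function name run_step is hypothetical, and the commands in the usage comments are the ones from the steps above.

```shell
#!/bin/sh
# Hypothetical helper: run one troubleshooting command, hide its output,
# and print PASS or FAIL with a description of the step.
run_step() {
    desc=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS: $desc"
        return 0
    fi
    echo "FAIL: $desc"
    return 1
}

# Example on the client, stopping at the first failure:
# run_step "plugin is executable" test -x /application/nagios/libexec/check_disk &&
# run_step "plugin runs" /application/nagios/libexec/check_disk -w 20% -c 8% -p /
```

Note that a plugin reporting WARNING or CRITICAL also exits non-zero, so a FAIL on the last kind of step means "look at the plugin output", not necessarily "the plugin is broken".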
