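a. Load: uptime;
b. CPU: top, sar, CPU temperature;
c. Disk: df;
d. Memory: free;
e. I/O: iostat;
f. RAID
g. Changes to the passwd file (fingerprinting of all local files).
Ports, URLs, ping packet loss, process counts, IDC network traffic
Routers, switch port traffic, printers, Windows hosts, etc.
Failed user logins, site logins per user, failed CAPTCHA attempts, traffic and concurrency of a given API endpoint, e-commerce orders, number of payment transactions, etc. Collecting these figures may be the work of developers or architects, but adding them to monitoring is an operations task.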
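Nagios is an open-source network and service monitoring tool that is both powerful and flexible. It can monitor the state of Windows, Linux, and Unix hosts, network devices such as switches and routers, and host ports and URL services. It sends alerts (email, WeChat, SMS, voice, Fetion, MSN) to administrators according to the severity level of the fault, and sends a recovery message once the fault clears.
The Nagios server runs on Linux and Unix-like systems. It currently cannot run on Windows, although Windows hosts can be monitored.
Official website: https://www.nagios.org/
Official quick install guide: https://support.nagios.com/kb/article.php?id=96#CentOS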
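(1) Monitors network services (SMTP, POP3, HTTP, NNTP, PING, etc.)
(2) Monitors host resources (processor load, disk usage, etc.)
(3) A simple plugin design lets users easily add their own service checks
(4) Parallel service checks
(5) Can describe the network hierarchy: "parent" host definitions express the relationships between hosts, which helps distinguish hosts that are down from hosts that are merely unreachable
(6) Sends notifications to contacts when a service or host problem occurs and when it is resolved (by email, SMS, or a user-defined method)
(7) Event handlers can be defined; they run when a host or service event occurs and help locate the problem
(8) Automatic log rotation
(9) Supports redundant monitoring of hosts
(10) An optional web interface shows the current network status, notification and problem history, log files, etc.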
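A drawback of Nagios is that it provides only the core; most other functionality is implemented through plugins. A Nagios deployment typically consists of the main program (Nagios), a plugin package (Nagios-plugins), and optional add-ons (NRPE, NSClient++, NSCA, NDOUtils). Nagios itself is only a monitoring platform; the actual checks are performed by plugins (Nagios-plugins, or ones you write yourself). The Nagios main program and Nagios-plugins must therefore both be installed on the server, and Nagios-plugins is usually installed on the monitored hosts as well. The add-ons are described below: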
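a. Location
NRPE runs on the monitored host, where the operating system is Linux/Unix.
b. Purpose
Runs plugin scripts on the monitored remote Linux/Unix host and returns the collected data to the server, so that the host's resources can be monitored. It is mainly used to monitor local resources.
c. Form
Daemon (agent) mode, listening on port 5666.
d. How it works
Much like a manager assigning a task and a subordinate reporting back once it is done.
a. Location
NSClient++ runs on monitored Windows hosts.
b. Purpose
The Windows counterpart of NRPE on Linux.
c. How it works
a. Location
NDOUtils runs on the Nagios server.
b. Purpose
Stores the Nagios configuration and the data produced by each event in a database so that they can be queried and processed. In the author's view this offers little over keeping the data on disk, so it is not recommended.
c. How it works
a. Location
NSCA is installed on both the Nagios server and the clients.
b. Purpose
Lets monitored remote Linux/Unix hosts actively push their check results to the Nagios server. It is needed for distributed monitoring clusters; with fewer than 300 servers it can be ignored.
c. How it works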
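Since the actual checks are done by plugins that you can write yourself, a minimal sketch of one may help. It assumes only the standard plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN); the function name check_threshold and its arguments are hypothetical, not part of Nagios-plugins.

```shell
#!/bin/bash
# check_threshold LABEL VALUE WARN CRIT  (hypothetical helper, integers only)
# Prints one status line with performance data and returns the plugin exit code.
check_threshold() {
    local label=$1 value=$2 warn=$3 crit=$4 n
    # Anything that is not a non-negative integer yields UNKNOWN (3).
    for n in "$value" "$warn" "$crit"; do
        case "$n" in
            ''|*[!0-9]*) echo "UNKNOWN - $label: non-numeric input"; return 3 ;;
        esac
    done
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - $label=$value | $label=$value;$warn;$crit"; return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - $label=$value | $label=$value;$warn;$crit"; return 1
    fi
    echo "OK - $label=$value | $label=$value;$warn;$crit"; return 0
}

# Example use inside a plugin script (the exit code is what Nagios reads):
#   check_threshold users 3 5 10; exit $?
```

The text after the "|" is performance data in the same label=value;warn;crit shape that the stock plugins print.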
Host | OS | Role | Remark
192.168.1.198 | RedHat6.4_32 | Nagios monitoring server | Server
192.168.1.218 | CentOS6.5_32 | LNMP web server | Monitored client
192.168.1.219 | CentOS6.5_32 | LNMP web server | Monitored client
echo "------- Step 1 : Config yum -------"
cd /etc/yum.repos.d/
cp /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.bak
wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/CentOS-6.repo
echo "------- Step 2 : Config CharSet -------"
echo 'export LC_ALL=C' >> /etc/profile
source /etc/profile
echo "------- Step 3 : Stop iptables and SELinux -------"
# a. Stop the firewall
/etc/init.d/iptables stop
/etc/init.d/ip6tables stop
chkconfig iptables off
chkconfig ip6tables off
# b. Disable SELinux
setenforce 0
vi /etc/selinux/config
SELINUX=disabled
# c. Disable SELinux from a script
if [ -f /etc/selinux/config ]; then
sed -i 's#SELINUX=enforcing#SELINUX=disabled#g' /etc/selinux/config
setenforce 0
fi
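The sed call above only rewrites SELINUX=enforcing, so a host currently set to permissive is left unchanged. A slightly more general sketch is shown here; the helper name disable_selinux_in is hypothetical, and it takes the file path as an argument so it can be tried on a copy first.

```shell
#!/bin/bash
# disable_selinux_in FILE  (hypothetical helper)
# Sets SELINUX=disabled whatever the current mode is; SELINUXTYPE= is untouched
# because the pattern requires "=" directly after "SELINUX".
disable_selinux_in() {
    local cfg=$1
    [ -f "$cfg" ] || { echo "skip: $cfg not found" >&2; return 1; }
    # -i.bak edits in place and keeps a backup as FILE.bak
    sed -i.bak 's#^SELINUX=.*#SELINUX=disabled#' "$cfg"
}

# Example: disable_selinux_in /etc/selinux/config && setenforce 0
```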
echo "------- Step 4 : Config time sync -------"
/usr/sbin/ntpdate pool.ntp.org
echo "#time sync by my at `date +%F` " >> /var/spool/cron/root
echo '*/10 * * * * /usr/sbin/ntpdate pool.ntp.org >/dev/null 2>&1' >> /var/spool/cron/root
crontab -l
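Appending with >> adds a duplicate cron entry every time the script is re-run. One way to avoid that, sketched under the assumption that an exact-line match is good enough, is to append only when the line is missing; add_line_once is a hypothetical helper name.

```shell
#!/bin/bash
# add_line_once FILE LINE  (hypothetical helper)
# Appends LINE to FILE only if FILE does not already contain it verbatim.
add_line_once() {
    local file=$1 line=$2
    touch "$file" || return 1
    # -x: whole-line match, -F: fixed string (no regex), -q: quiet
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example:
#   add_line_once /var/spool/cron/root \
#     '*/10 * * * * /usr/sbin/ntpdate pool.ntp.org >/dev/null 2>&1'
```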
echo "------- Step 5 : Install gcc and lamp env etc-------"
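yum install gcc glibc glibc-common -y # build environment
yum install gd gd-devel -y # used for drawing graphs
yum install httpd php php-gd -y # PHP environment
yum install mysql* -y # optional, but without it the Nagios install will not produce the database monitoring plugin
yum install perl-devel -y # needed when installing the Nagios plugins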
echo "------- Step 6 : add nagios user and group -------"
/usr/sbin/useradd -m nagios
#/usr/sbin/useradd apache # already created when httpd was installed
/usr/sbin/groupadd nagcmd
/usr/sbin/usermod -a -G nagcmd nagios
/usr/sbin/usermod -a -G nagcmd apache
echo "------- Step 7 : download and install nagios-------"
cd /tools
unzip oldboy_training_nagios_soft.zip
tar xzf nagios-3.5.1.tar.gz
cd nagios
./configure --with-command-group=nagcmd
make all
make install
make install-init # installs the init script in /etc/rc.d/init.d
make install-config # installs sample config files in /usr/local/nagios/etc
make install-commandmode # installs and configures permissions on the external command file
make install-webconf # generates the Apache config file for Nagios: /etc/httpd/conf.d/nagios.conf
cat /etc/httpd/conf.d/nagios.conf # show the contents of nagios.conf
# SAMPLE CONFIG SNIPPETS FOR APACHE WEB SERVER
# Last Modified: 11-26-2005
#
# This file contains examples of entries that need
# to be incorporated into your Apache web server
# configuration file. Customize the paths, etc. as
# needed to fit your system.
ScriptAlias /nagios/cgi-bin "/usr/local/nagios/sbin"
<Directory "/usr/local/nagios/sbin">
# SSLRequireSSL
Options ExecCGI
AllowOverride None
Order allow,deny
Allow from all
# Order deny,allow
# Deny from all
# Allow from 127.0.0.1
AuthName "Nagios Access"
AuthType Basic
AuthUserFile /usr/local/nagios/etc/htpasswd.users
Require valid-user
</Directory>
Alias /nagios "/usr/local/nagios/share"
<Directory "/usr/local/nagios/share">
# SSLRequireSSL
Options None
AllowOverride None
Order allow,deny
Allow from all
# Order deny,allow
# Deny from all
# Allow from 127.0.0.1
AuthName "Nagios Access"
AuthType Basic
AuthUserFile /usr/local/nagios/etc/htpasswd.users
Require valid-user
</Directory>
echo "------- Step 8 : config web auth -------"
# The path must be identical to the AuthUserFile value in /etc/httpd/conf.d/nagios.conf, otherwise login fails
htpasswd -cb /usr/local/nagios/etc/htpasswd.users test 123456
cd ..
echo "------- Step 9 : install nagios-plugins -------"
#yum install perl-devel -y # needed when installing the Nagios plugins; confirm it is installed
tar zxf nagios-plugins-1.4.16.tar.gz
cd nagios-plugins-1.4.16
./configure --with-nagios-user=nagios --with-nagios-group=nagios --enable-perl-modules
make && make install
ls /usr/local/nagios/libexec/ | wc -l # count the installed plugins
61
echo "------- Step 10 : install nrpe -------"
tar zxf nrpe-2.12.tar.gz
cd nrpe-2.12
./configure
make all
make install-plugin
make install-daemon
make install-daemon-config
cd ..
echo "------- Step 11 : startup service and check -------"
/etc/init.d/nagios start
/etc/init.d/httpd start
lsof -i tcp:80
ps -ef | grep nagios
http://192.168.1.198
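A login prompt appears. Entering user test and password 123456 is rejected as an incorrect username or password. The cause turned out to be as follows:
The AuthUserFile value in /etc/httpd/conf.d/nagios.conf is:
AuthUserFile /usr/local/nagios/etc/htpasswd.users
but the password file had been created with a different name:
htpasswd -cb /usr/local/nagios/etc/htpasswd.user test 123456
After recreating the password file with the correct name, login succeeds: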
htpasswd -cb /usr/local/nagios/etc/htpasswd.users test 123456
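A mismatch like this can be caught before opening the browser by reading the path straight out of the Apache config. The following is a rough sketch; auth_user_files is a hypothetical helper, and it assumes each AuthUserFile directive sits on its own line.

```shell
#!/bin/bash
# auth_user_files CONF  (hypothetical helper)
# Prints each distinct AuthUserFile path in CONF, ignoring commented-out lines.
auth_user_files() {
    awk '$1 == "AuthUserFile" { gsub(/"/, "", $2); print $2 }' "$1" | sort -u
}

# Example: report password files that the config expects but that do not exist.
#   for f in $(auth_user_files /etc/httpd/conf.d/nagios.conf); do
#       [ -f "$f" ] || echo "missing password file: $f"
#   done
```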
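(1) Software that does not need to be installed
a. No LAMP environment is needed.
gd, gd-devel, mysql*, httpd, php, and php-gd are not required
b. No Nagios server package is needed
nagios-3.5.1.tar.gz is not required
c. No gcc environment is needed
gcc, glibc, and glibc-common are not required
(2) Software that must be installed
a. Client software:
nrpe-2.12
b. Plugins: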
Class-Accessor-0.31.tar.gz
Config-Tiny-2.12.tar.gz
Math-Calc-Units-1.07.tar.gz
Nagios-Plugin-0.34.tar.gz
Params-Validate-0.91.tar.gz
Regexp-Common-2010010201.tar.gz
check_iostat
check_memory.pl
echo "------- Step 1 : Config yum -------"
cd /etc/yum.repos.d/
cp /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.bak
wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/CentOS-6.repo
echo "------- Step 2 : Config CharSet -------"
echo 'export LC_ALL=C' >> /etc/profile
source /etc/profile
echo "------- Step 3 : Stop iptables and SELinux -------"
# a. Stop the firewall
/etc/init.d/iptables stop
/etc/init.d/ip6tables stop
chkconfig iptables off
chkconfig ip6tables off
# b. Disable SELinux
setenforce 0
vi /etc/selinux/config
SELINUX=disabled
# c. Disable SELinux from a script
if [ -f /etc/selinux/config ]; then
sed -i 's#SELINUX=enforcing#SELINUX=disabled#g' /etc/selinux/config
setenforce 0
fi
echo "------- Step 4 : Config time sync -------"
/usr/sbin/ntpdate pool.ntp.org
echo "#time sync by my at `date +%F` " >> /var/spool/cron/root
echo '*/10 * * * * /usr/sbin/ntpdate pool.ntp.org >/dev/null 2>&1' >> /var/spool/cron/root
crontab -l
echo "------- Step 5 : add nagios user and group -------"
/usr/sbin/useradd -m nagios -s /sbin/nologin
echo "------- Step 6 : install nagios-plugins -------"
#yum install perl-devel -y # needed when installing the Nagios plugins; confirm it is installed
scp /wddg/tools/[email protected]:/wddg/tools/
cd /wddg/tools/
unzip oldboy_training_nagios_soft.zip
tar zxf nagios-plugins-1.4.16.tar.gz
cd nagios-plugins-1.4.16
./configure --prefix=/application/nagios --enable-perl-modules --enable-redhat-pthread-workaround
make && make install
ls /application/nagios/libexec/ | wc -l # count the installed plugins
64
echo "------- Step 7 : install nrpe -------"
tar zxf nrpe-2.12.tar.gz
cd nrpe-2.12
./configure --prefix=/application/nagios # must be the same directory as nagios-plugins
make all
make install-plugin
make install-daemon
make install-daemon-config
cd ..
echo "------- Step 8 : install iostat -------"
cd /wddg/tools/
echo "------- Step 8.1 : install Params-Validate -------"
tar zxvf Params-Validate-0.91.tar.gz
cd Params-Validate-0.91
perl Makefile.PL
make
make install
cd -
echo "------- Step 8.2 : install Class-Accessor -------"
tar zxvf Class-Accessor-0.31.tar.gz
cd Class-Accessor-0.31
perl Makefile.PL
make
make install
cd -
echo "------- Step 8.3 : install Config-Tiny -------"
tar zxvf Config-Tiny-2.12.tar.gz
cd Config-Tiny-2.12
perl Makefile.PL
make
make install
cd -
echo "------- Step 8.4 : install Math-Calc-Units -------"
tar zxvf Math-Calc-Units-1.07.tar.gz
cd Math-Calc-Units-1.07
perl Makefile.PL
make
make install
cd -
echo "------- Step 8.5 : install Regexp-Common -------"
tar zxvf Regexp-Common-2010010201.tar.gz
cd Regexp-Common-2010010201
perl Makefile.PL
make
make install
cd -
echo "------- Step 8.6 : install Nagios-Plugin -------"
tar zxvf Nagios-Plugin-0.34.tar.gz
cd Nagios-Plugin-0.34
perl Makefile.PL
make
make install
cd -
echo "------- Step 8.7 : install sysstat -------"
#for monitor iostat
yum install sysstat -y
echo "------- Step 8.8 : copy script to nagios -------"
/bin/cp /wddg/tools/check_memory.pl /application/nagios/libexec/
/bin/cp /wddg/tools/check_iostat /application/nagios/libexec/
echo "------- Step 8.9 : chmod 755 script-------"
chmod 755 /application/nagios/libexec/check_memory.pl
chmod 755 /application/nagios/libexec/check_iostat
echo "------- Step 8.10 : dos2unix script -------"
dos2unix /application/nagios/libexec/check_memory.pl
dos2unix /application/nagios/libexec/check_iostat
cp /application/nagios/etc/nrpe.cfg /application/nagios/etc/nrpe.cfg.bak
vi /application/nagios/etc/nrpe.cfg
a. Specify the Nagios server IP
# Line 79, before:
allowed_hosts=127.0.0.1
# Line 79, after:
allowed_hosts=127.0.0.1,192.168.1.198
b. Delete lines 199 to 203
sed -i '199,203d' /application/nagios/etc/nrpe.cfg
The deleted lines are:
command[check_users]=/application/nagios/libexec/check_users -w 5 -c 10
command[check_load]=/application/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_hda1]=/application/nagios/libexec/check_disk -w 20% -c 10% -p /dev/hda1
command[check_zombie_procs]=/application/nagios/libexec/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/application/nagios/libexec/check_procs -w 150 -c 200
c. Append the following to the end of the file
command[check_load]=/application/nagios/libexec/check_load -w 15,10,6 -c 30,25,20
command[check_mem]=/application/nagios/libexec/check_memory.pl -w 6% -c 3%
command[check_disk]=/application/nagios/libexec/check_disk -w 20% -c 8% -p /
command[check_swap]=/application/nagios/libexec/check_swap -w 20% -c 10%
command[check_iostat]=/application/nagios/libexec/check_iostat -w 6 -c 10
/application/nagios/bin/nrpe -c /application/nagios/etc/nrpe.cfg -d
netstat -lntup | grep 5666
echo "/application/nagios/bin/nrpe -c /application/nagios/etc/nrpe.cfg -d" >> /etc/rc.local
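Because the server can only call commands that are defined in nrpe.cfg, it is handy to list the active command names and spot accidental duplicates (for example check_load defined both in the sample lines and in the appended block). The sketch below assumes the command[name]=... syntax shown above; nrpe_commands is a hypothetical helper name.

```shell
#!/bin/bash
# nrpe_commands CFG  (hypothetical helper)
# Prints the name of every active command[...] entry; commented lines are skipped
# because the pattern is anchored at the start of the line.
nrpe_commands() {
    sed -n 's/^command\[\([^]]*\)]=.*/\1/p' "$1"
}

# Example:
#   nrpe_commands /application/nagios/etc/nrpe.cfg                   # all names
#   nrpe_commands /application/nagios/etc/nrpe.cfg | sort | uniq -d  # duplicates
```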
ll /usr/local/nagios/
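bin: Nagios executables, including nagios, npc, nrpe, etc.;
etc: Nagios configuration files
include: Nagios include files
libexec: Nagios plugins
perl:
sbin: Nagios CGI files, i.e. the files needed to execute external commands
share: the Nagios web application, mainly the PHP pages of the Nagios interface
var: Nagios logs and data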
cd /usr/local/nagios/etc
ll
-rw-rw-r--. cgi.cfg
-rw-r--r--. htpasswd.users
-rw-rw-r--. nagios.cfg
-rw-r--r--. nrpe.cfg
drwxrwxr-x. objects
-rw-rw----. resource.cfg
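Here nagios.cfg is the main Nagios configuration file; it pulls in cgi.cfg, resource.cfg, and all files under the objects directory. nrpe.cfg is the client-side configuration file; it only needs to be configured if the Nagios server is itself to be monitored as a client. htpasswd.users is the password file for the Nagios web interface.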
ll /usr/local/nagios/etc/objects
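commands.cfg: defines the commands to run, such as check_tcp and check_local_disk, which are referenced by the service definitions later on;
contacts.cfg: defines contacts, i.e. who is notified when a service goes down;
localhost.cfg: defines the checks for the local machine; generated by default;
printer.cfg: defines printers; disabled by default and of little use in production;
switch.cfg: defines monitoring of routers and switches; disabled by default;
templates.cfg: template definitions for hosts and services, similar to functions in shell;
timeperiods.cfg: defines monitoring time periods (alert windows), such as 24x7 and workhours;
windows.cfg: defines monitoring of Windows hosts; disabled by default.
services.cfg: user-created file holding the definitions of the monitored services (with hundreds of hosts a services directory can be used instead; does not exist by default)
hosts.cfg: user-created file holding the definitions of the monitored hosts (with hundreds of hosts a hosts directory can be used instead; does not exist by default)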
Nagios does not pull in other files with include; it uses cfg_file=<full path to file>. For example:
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
Nagios pulls in a directory with cfg_dir=<full path to directory>; every .cfg file in that directory is loaded. For example:
cfg_dir=/usr/local/nagios/etc/servers
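A cfg_file or cfg_dir entry that points at a path which does not exist makes the configuration check fail, so a quick scan of nagios.cfg can save a round trip. This is only a sketch; missing_includes is a hypothetical helper and it assumes one key=path entry per line.

```shell
#!/bin/bash
# missing_includes NAGIOS_CFG  (hypothetical helper)
# Prints every cfg_file= / cfg_dir= target that does not exist on disk.
missing_includes() {
    local key path
    # IFS='=' splits each "key=path" line at the first equals sign.
    while IFS='=' read -r key path; do
        case "$key" in
            cfg_file) [ -f "$path" ] || echo "missing file: $path" ;;
            cfg_dir)  [ -d "$path" ] || echo "missing dir: $path" ;;
        esac
    done < "$1"
}

# Example: missing_includes /usr/local/nagios/etc/nagios.cfg
```

Commented-out entries such as #cfg_file=... are ignored because their key no longer matches.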
cd /usr/local/nagios
tar czvf etc.tar.gz etc/
cd /usr/local/nagios/etc
vi nagios.cfg
a. Add the following 2 lines after line 34
vi nagios.cfg +34
cfg_file=/usr/local/nagios/etc/objects/services.cfg
cfg_file=/usr/local/nagios/etc/objects/hosts.cfg
b. Comment out line 38
Before:
# Definitions for monitoring the local (Linux) host
cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
After:
#cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
The Nagios server is treated as a client too and monitored through nrpe, so it does not need to be monitored as a special host.
c. Add 1 line at line 58 to specify the service directory to monitor
cfg_dir=/usr/local/nagios/etc/services
a. Create the etc/services directory
mkdir services
chown -R nagios.nagios services/ # owned by the nagios user and group
b. Create hosts.cfg (from the first 51 lines of localhost.cfg)
cd objects
head -51 localhost.cfg > hosts.cfg
chown nagios.nagios hosts.cfg
pwd
/usr/local/nagios/etc/objects
c. Create services.cfg
touch services.cfg
chown nagios.nagios services.cfg
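a. Active monitoring
On each check interval the Nagios server itself sends a request to fetch data from the remote host, as in URL monitoring. No plugin needs to be installed on the client.
b. Semi-passive monitoring (nrpe)
For local resources such as load, memory, disk, swap, I/O, temperature, and fan speed, the server's nrpe plugin periodically connects to the nrpe service on the client, which sends the collected data back to the Nagios server.
c. Fully passive monitoring (nsca)
The client reports on its own initiative.
a. Active monitoring
Generally used for services that are offered externally, such as web and database services, for example http, ssh, mysql, and rsync.
It does not involve nrpe; the server's local plugins fetch the information directly.
b. Semi-passive monitoring
Generally used for local resources and performance, such as load, memory, disk, swap, I/O, temperature, and fan speed. (Some system resources can also be monitored through SNMP.)
The main program uses the check_nrpe plugin to talk to the nrpe process on the client, which calls the client's local plugins to obtain the data.
c. Note
Active and passive modes are relative and interchangeable: a service monitored actively can be switched to passive mode, and a passively monitored service can sometimes be switched to active mode.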
vi /usr/local/nagios/etc/objects/hosts.cfg
a. Edit the hosts
Line 25 (hosts.cfg was produced with head -51 localhost.cfg > hosts.cfg)
Before:
define host{
use linux-server
host_name localhost
alias localhost
address 127.0.0.1
}
After:
define host{
use linux-server ; the template, defined in templates.cfg
host_name 01-client218 ; any name you like
alias 01-client218 ; optional, usually the same as the host name
address 192.168.1.218
}
define host{
use linux-server
host_name 02-client219
alias 02-client219
address 192.168.1.219
}
define host{
use linux-server
host_name nagiosServer198
alias nagiosServer198
address 192.168.1.198
}
b. Edit the host group
Line 39
Before:
define hostgroup{
hostgroup_name linux-servers
alias Linux Servers
members localhost
}
After:
define hostgroup{
hostgroup_name linux-servers
alias Linux Servers
members 01-client218,02-client219,nagiosServer198
}
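With more than a handful of machines, typing each define host block by hand gets error-prone. A small generator along these lines is one option; gen_hosts is a hypothetical helper, and it assumes every host uses the linux-server template with the alias equal to the host name, as in the definitions above.

```shell
#!/bin/bash
# gen_hosts  (hypothetical helper)
# Reads "host_name address" pairs on stdin and prints one define host block each.
gen_hosts() {
    local name addr
    while read -r name addr; do
        # Skip blank or incomplete lines.
        [ -n "$name" ] && [ -n "$addr" ] || continue
        printf 'define host{\n'
        printf '    use        linux-server\n'
        printf '    host_name  %s\n' "$name"
        printf '    alias      %s\n' "$name"
        printf '    address    %s\n' "$addr"
        printf '}\n'
    done
}

# Example:
#   printf '01-client218 192.168.1.218\n02-client219 192.168.1.219\n' | gen_hosts
```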
a. Command
/etc/init.d/nagios checkconfig
Running configuration check... CONFIG ERROR! Check your Nagios configuration.
The output only says that the configuration is invalid; it does not say what the error is.
b. Edit line 183 of the init script so that the check prints full details
vim /etc/init.d/nagios +183
Before:
$NagiosBin -v $NagiosCfgFile > /dev/null 2>&1;
After:
$NagiosBin -v $NagiosCfgFile;
c. Check again; it now reports that no services are configured
/etc/init.d/nagios checkconfig
Running configuration check...
...
Checking services...
Error: There are no services defined!
Checked 0 services.
Checking hosts...
Warning: Host '01-client218' has no services associated with it!
Warning: Host '02-client219' has no services associated with it!
Warning: Host 'nagiosServer198' has no services associated with it!
Checked 3 hosts.
...
Total Warnings: 3
Total Errors: 1
...
vi /usr/local/nagios/etc/objects/services.cfg
define service{
use generic-service
host_name 01-client218
service_description Disk Partition
check_command check_nrpe!check_disk ; check_disk is defined in nrpe.cfg
}
define service{
use generic-service
host_name 02-client219
service_description Disk Partition
check_command check_nrpe!check_disk
}
define service{
use generic-service
host_name nagiosServer198
service_description Disk Partition
check_command check_nrpe!check_disk
}
a. Command
/etc/init.d/nagios checkconfig
b. It fails, reporting that the check_nrpe command is not defined
/etc/init.d/nagios checkconfig
Running configuration check...
...
Checking services...
Error: Service check command 'check_nrpe' specified in service 'Disk Partition' for host '01-client218' not defined anywhere!
Error: Service check command 'check_nrpe' specified in service 'Disk Partition' for host '02-client219' not defined anywhere!
Error: Service check command 'check_nrpe' specified in service 'Disk Partition' for host 'nagiosServer198' not defined anywhere!
Checked 3 services.
...
Total Warnings: 0
Total Errors: 3
...
a. Checking commands.cfg confirms that the check_nrpe command is indeed not defined
b. Define the check_nrpe command (append at the end of the file)
vi /usr/local/nagios/etc/objects/commands.cfg
define command{
command_name check_nrpe
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
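It may help to see how a service's check_command such as check_nrpe!check_disk turns into the command line that is finally executed: the text before the first "!" selects the command, each following "!"-separated field becomes $ARG1$, $ARG2$, and so on, $HOSTADDRESS$ comes from the host's address, and $USER1$ is the plugin directory. The sketch below only imitates that substitution for illustration; expand_check is a hypothetical function, not something Nagios ships.

```shell
#!/bin/bash
# expand_check CHECK_COMMAND COMMAND_LINE HOSTADDRESS USER1  (hypothetical helper)
# Imitates macro substitution for $ARGn$, $HOSTADDRESS$ and $USER1$ only.
expand_check() {
    local check_command=$1 line=$2 host=$3 libexec=$4
    local i=0 part
    local IFS='!'
    # Splitting on "!" yields the command name first, then ARG1, ARG2, ...
    for part in $check_command; do
        [ "$i" -gt 0 ] && line=${line//"\$ARG$i\$"/$part}
        i=$((i + 1))
    done
    line=${line//'$HOSTADDRESS$'/$host}
    line=${line//'$USER1$'/$libexec}
    printf '%s\n' "$line"
}

# Example:
#   expand_check 'check_nrpe!check_disk' \
#     '$USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$' \
#     192.168.1.218 /usr/local/nagios/libexec
```

The result is the same kind of command that is run by hand elsewhere in these notes to test a check from the server.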
/etc/init.d/nagios checkconfig
...
Total Warnings: 0
Total Errors: 0
Things look okay - No serious problems were detected during the pre-flight check
OK.
/usr/local/nagios/libexec/check_nrpe -H 192.168.158.218 -c check_disk
DISK OK - free space: / 9150 MB (54% inode=74%);|/=7755MB;14252;16390;0;17816
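The part of the plugin output after the "|" is performance data in the form label=value[unit];warn;crit;min;max. When testing checks by hand it can be useful to pull a single value out of it. The following is a rough sketch; perf_value is a hypothetical helper and it assumes labels contain no spaces.

```shell
#!/bin/bash
# perf_value OUTPUT LABEL  (hypothetical helper)
# Prints the numeric value (unit stripped) that the performance data gives for LABEL.
perf_value() {
    local perf=${1#*'|'}                # keep everything after the first "|"
    [ "$perf" != "$1" ] || return 1     # no "|" means no performance data
    local item
    for item in $perf; do
        case "$item" in
            "$2"=*)
                item=${item#*=}         # drop "label="
                item=${item%%;*}        # drop ";warn;crit;min;max"
                printf '%s\n' "${item%%[!0-9.]*}"   # strip a trailing unit such as MB
                return 0 ;;
        esac
    done
    return 1
}

# Example:
#   perf_value "$(/usr/local/nagios/libexec/check_nrpe -H 192.168.158.218 -c check_disk)" /
```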
a. Open http://192.168.158.198/nagios/ in IE and select Hosts in the left navigation bar. The following error appears:
It appears as though you do not have permission to view information for any of the hosts you requested...
If you believe this is an error, check the HTTP server authentication requirements for accessing this CGI and check the authorization options in your CGI configuration file.
b. Cause
The error indicates insufficient CGI permissions. The server's /usr/local/nagios/etc/cgi.cfg requires the user nagiosadmin, but we logged in to Nagios as test.
vi /usr/local/nagios/etc/cgi.cfg
grep nagiosadmin /usr/local/nagios/etc/cgi.cfg
authorized_for_system_information=nagiosadmin
authorized_for_configuration_information=nagiosadmin
authorized_for_system_commands=nagiosadmin
authorized_for_all_services=nagiosadmin
authorized_for_all_hosts=nagiosadmin
authorized_for_all_service_commands=nagiosadmin
authorized_for_all_host_commands=nagiosadmin
c. Fix
(i) Replace nagiosadmin with test in cgi.cfg.
sed -i s/nagiosadmin/test/g /usr/local/nagios/etc/cgi.cfg
(ii) Restart or reload Nagios
/etc/init.d/nagios reload
d. View Nagios in the browser again
(i) Open http://192.168.158.198/nagios/ in IE again and select Hosts in the left navigation bar; the status of the monitored hosts is now displayed correctly
01-client218 | UP | 07-02-2017 | 0d 0h 14m 31s | PING OK - Packet loss = 0%, RTA = 0.41 ms
02-client219 | UP | 07-02-2017 | 0d 0h 11m 11s | PING OK - Packet loss = 0%, RTA = 0.52 ms
nagiosServer198 | UP | 07-02-2017 | 0d 0h 7m 51s | PING OK - Packet loss = 0%, RTA = 0.05 ms
(ii) Select Services in the left navigation bar; one service reports an error
01-client218 | Disk | OK | 07-02-2017 | 42738 | DISK OK - free...
02-client219 | Disk | OK | 07-02-2017 | 42738 | DISK OK - free...
nagiosServer198 | Disk | CRITICAL | 07-02-2017 | 42797 | Connection refused by host
/etc/init.d/iptables stop
/etc/init.d/ip6tables stop
chkconfig iptables off
chkconfig ip6tables off
setenforce 0
vi /etc/selinux/config
SELINUX=disabled
# Start nrpe on 198
/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
netstat -lntup | grep nrpe
tcp 0 0 0.0.0.0:5666 0.0.0.0:* LISTEN 2473/nrpe
echo "/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d" >> /etc/rc.local
./check_nrpe -H 192.168.158.198 -c check_disk
CHECK_NRPE: Error - Could not complete SSL handshake
rpm -qa | grep openssl
openssl-1.0.1e-57.el6.i686
openssl-devel-1.0.1e-57.el6.i686
Before:
allowed_hosts=127.0.0.1
command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_hda1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/hda1
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200
After:
allowed_hosts=127.0.0.1,192.168.158.198
#command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
#command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
#command[check_hda1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/hda1
#command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
#command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200
command[check_load]=/application/nagios/libexec/check_load -w 15,10,6 -c 30,25,20
command[check_mem]=/application/nagios/libexec/check_memory.pl -w 6% -c 3%
command[check_disk]=/application/nagios/libexec/check_disk -w 20% -c 8% -p /
command[check_swap]=/application/nagios/libexec/check_swap -w 20% -c 10%
command[check_iostat]=/application/nagios/libexec/check_iostat -w 6 -c 10
nagiosServer198 UNKNOWN ... NRPE: Unable to read output # the nrpe daemon returned no data
Wrong path (/application/nagios is the nrpe path configured on the clients):
command[check_load]=/application/nagios/libexec/check_load -w 15,10,6 -c 30,25,20
command[check_mem]=/application/nagios/libexec/check_memory.pl -w 6% -c 3%
command[check_disk]=/application/nagios/libexec/check_disk -w 20% -c 8% -p /
command[check_swap]=/application/nagios/libexec/check_swap -w 20% -c 10%
command[check_iostat]=/application/nagios/libexec/check_iostat -w 6 -c 10
Correct path (/usr/local/nagios is the nrpe path configured on the server):
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,6 -c 30,25,20
command[check_mem]=/usr/local/nagios/libexec/check_memory.pl -w 6% -c 3%
command[check_disk]=/usr/local/nagios/libexec/check_disk -w 20% -c 8% -p /
command[check_swap]=/usr/local/nagios/libexec/check_swap -w 20% -c 10%
command[check_iostat]=/usr/local/nagios/libexec/check_iostat -w 6 -c 10
echo "------- Step 8.8 : copy script to nagios -------"
/bin/cp /wddg/tools/check_memory.pl /usr/local/nagios/libexec/
/bin/cp /wddg/tools/check_iostat /usr/local/nagios/libexec/
echo "------- Step 8.9 : chmod 755 script-------"
chmod 755 /usr/local/nagios/libexec/check_memory.pl
chmod 755 /usr/local/nagios/libexec/check_iostat
echo "------- Step 8.10 : dos2unix script -------"
dos2unix /usr/local/nagios/libexec/check_memory.pl
dos2unix /usr/local/nagios/libexec/check_iostat
ps -ef | grep nrpe
nagios 2314 1 .../usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
root 2759 2439 0 16:39 pts/0 00:00:00 grep --color=auto nrpe
pkill nrpe
pkill nrpe
pkill nrpe
ps -ef | grep nrpe
root 2767 2439 0 16:39 pts/0 00:00:00 grep --color=auto nrpe
/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
cd /usr/local/nagios/libexec
./check_nrpe -H 192.168.158.198 -c check_disk
DISK OK - free space: / 30025 MB (88% inode=94%);|/=3975MB;28656;32954;0;35820
vi /usr/local/nagios/etc/objects/services.cfg
define service{
use generic-service
host_name 01-client218
service_description Mem
check_command check_nrpe!check_mem
}
define service{
use generic-service
host_name 01-client218
service_description IO
check_command check_nrpe!check_iostat
}
/etc/init.d/nagios checkconfig
pkill nrpe
/etc/init.d/httpd stop
/etc/init.d/nagios stop
/etc/init.d/nagios start
/etc/init.d/httpd start
/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
./check_nrpe -H 192.168.1.218 -c check_mem
CHECK_MEMORY OK - 1631M free | free=1710317568b;117609922.56:;58804961.28:
./check_nrpe -H 192.168.1.218 -c check_iostat
IOSTAT OK - user 0.02 nice 0.01 sys 0.17 iowait 0.24 idle 0.00 | iowait=0.24%;; idle=0.00%;; user=0.02%;;nice=0.01%;; sys=0.17%;;
cd /usr/local/nagios/libexec
./check_tcp --help
./check_tcp -H 192.168.1.148 -p 80
TCP OK - 0.001 second response time on port 80|time=0.000865s;;;0.000000;10.000000
./check_tcp -H 192.168.1.148 -p 22
TCP OK - 0.001 second response time on port 22|time=0.000721s;;;0.000000;10.000000
./check_http --help
./check_http -I 192.168.1.148 -p 80
HTTP OK: HTTP/1.1 200 OK - 243 bytes in 0.027 second response time |time=0.027303s;;;0.000000 size=243B;;;0
cd /usr/local/nagios/etc/services
vi myservices.cfg
define service{
use generic-service
host_name 01-client148
service_description nginsweb
check_command check_weburl!-I 192.168.1.148
max_check_attempts 3
normal_check_interval 2
retry_check_interval 1
check_period 24x7
notification_interval 30
notification_period 24x7
notification_options w,u,c,r
contact_groups admins
}
vi /usr/local/nagios/etc/objects/commands.cfg
# 'check_weburl' command definition
define command{
command_name check_weburl
command_line $USER1$/check_http $ARG1$ -w 10 -c 30
}
/etc/init.d/nagios checkconfig
/etc/init.d/nagios reload
cd /usr/local/nagios/etc/services
cp myservices.cfg 01-client148.cfg
vi 01-client148.cfg
define service{
use generic-service
host_name 01-client148
service_description nginsweb
check_command check_weburl!-H blog.abc.org
max_check_attempts 3
normal_check_interval 2
retry_check_interval 1
check_period 24x7
notification_interval 30
notification_period 24x7
notification_options w,u,c,r
contact_groups admins
}
define service{
use generic-service
host_name 02-client149
service_description nginsuri
check_command check_http!-H blog.nginx.org -u /static/
max_check_attempts 3
normal_check_interval 2
retry_check_interval 1
check_period 24x7
notification_interval 30
notification_period 24x7
notification_options w,u,c,r
contact_groups admins
}
vi /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.148 blog.abc.org
curl 192.168.58.148
www.nginx.org
curl blog.abc.org
www.nginx.org
cd /usr/local/nagios/libexec/
./check_http -H blog.abc.org
HTTP OK: HTTP/1.1 200 OK - 243 bytes in 0.002 second response time|time=0.001894s;;;0.000000 size=243B;;;0
cd /usr/local/nagios/libexec/
./check_http -H blog.abc.org -u "/static/index.php?id=1"
HTTP OK: HTTP/1.1 200 OK - 243 bytes in 0.002 second response time|time=0.001894s;;;0.000000 size=243B;;;0
cd /usr/local/nagios/etc/services
cp myservices.cfg 02-client149port.cfg
vi 02-client149port.cfg
define service{
use generic-service
host_name 02-client149
service_description port_22
check_command check_tcp!-H blog.nginx.org -p 22
max_check_attempts 3
normal_check_interval 2
retry_check_interval 1
check_period 24x7
notification_interval 30
notification_period 24x7
notification_options w,u,c,r
contact_groups admins
}
define service{
use generic-service
host_name 02-client149
service_description port_3306
check_command check_tcp!-H blog.nginx.org -p 3306
max_check_attempts 3
normal_check_interval 2
retry_check_interval 1
check_period 24x7
notification_interval 30
notification_period 24x7
notification_options w,u,c,r
contact_groups admins
}
vi /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.149 blog.nginx.org
cd /usr/local/nagios/libexec/
./check_http -H blog.abc.org -p 22
HTTP OK: HTTP/1.1 200 OK - 243 bytes in 0.002 second response time|time=0.001894s;;;0.000000 size=243B;;;0
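(1) Get the check working on the server's command line first
(2) Define the Nagios command in commands.cfg so that it calls the command-line plugin
(3) Define the service to monitor in the service configuration file, referencing the Nagios command defined in commands.cfg.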
cd /application/nagios/libexec/
./check_tcp -H 192.168.58.148 -p 80
TCP OK - 0.000 second response time on port80|time=0.000243s;;;0.000000;10.000000
cd /application/nagios/etc/
vi nrpe.cfg
# Add one line at the end
command[check_port_80]=/application/nagios/libexec/check_tcp -H 192.168.58.148 -p 80 -w 5 -c 10
ps -ef | grep nrpe
nagios ... /application/nagios/bin/nrpe -c /application/nagios/etc/nrpe.cfg -d
pkill nrpe
pkill nrpe
/application/nagios/bin/nrpe -c /application/nagios/etc/nrpe.cfg -d
netstat -lntup | grep nrpe
tcp 0 0 0.0.0.0:5666 0.0.0.0:* LISTEN 22133/nrpe
/usr/local/nagios/libexec/check_nrpe -H 192.168.58.148 -c check_port_80
TCP OK - 0.000 second ...80|time=0.000169s;5.000000;10.000000;0.000000;10.000000
cd /usr/local/nagios/etc/services/
vi 01-client148.cfg
# Add a service
define service{
use generic-service
host_name 01-client148
service_description port_80
check_command check_nrpe!check_port_80
max_check_attempts 3
normal_check_interval 2
retry_check_interval 1
check_period 24x7
notification_interval 30
notification_period 24x7
notification_options w,u,c,r
contact_groups admins
}
/etc/init.d/nagios checkconfig
/etc/init.d/nagios reload
define servicegroup{
servicegroup_name group name
alias group alias
members host_name,service_description,host_name,service_description,...
}
The service named in each members pair must match a service_description that is actually defined for that host. For example, if a service's service_description is changed to Mem1 while the group's members still refer to Mem, the syntax check reports:
Error: Could not find a service matching host name '01-client148' and description 'Mem' (config file '/usr/local/nagios/etc/services/servergroup.cfg', starting on line 1)
Error: Could not expand member services specified in servicegroup (config file '/usr/local/nagios/etc/services/servergroup.cfg', starting on line 1)
Error processing object config files!
vi /usr/local/nagios/etc/services/servergroup.cfg
define servicegroup{
servicegroup_name Mem
alias Mem
members 01-client148,Mem,nagiosServer161225,Mem
}
/etc/init.d/nagios checkconfig
/etc/init.d/nagios reload
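Since members is a flat list of host,service pairs, it is easy to drop or misorder one entry when the group grows. A small sketch for building the value from a list of hosts that all carry the same service is shown below; sg_members is a hypothetical helper name.

```shell
#!/bin/bash
# sg_members SERVICE HOST...  (hypothetical helper)
# Prints "host,service" pairs joined by commas, ready to paste after "members".
sg_members() {
    local svc=$1 out='' h
    shift
    for h in "$@"; do
        # ${out:+$out,} adds a separating comma only when out is non-empty.
        out="${out:+$out,}$h,$svc"
    done
    printf '%s\n' "$out"
}

# Example: sg_members Mem 01-client148 nagiosServer161225
```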
Parameter | Value | Description
name | generic-contact | Contact name
service_notification_period | 24x7 | Time period in which notifications are sent when a service has a problem; "24x7" is defined in timeperiods.cfg
host_notification_period | 24x7 | Time period in which notifications are sent when a host has a problem; "24x7" is defined in timeperiods.cfg
service_notification_options | w,u,c,r | The states in which a notification may be sent: w = warning, u = unknown, c = critical, r = recovery. In other words, the contact is notified when a service enters the warning, unknown, or critical state and when it recovers.
host_notification_options | d,u,r | The host states in which a notification is sent: d = down, u = unreachable, r = recovery.
service_notification_commands | notify-service-by-email | How notifications are sent when a service fails; this can be email or SMS, and here it is email. "notify-service-by-email" is defined in commands.cfg.
host_notification_commands | notify-host-by-email | How notifications are sent when a host fails; this can be email or SMS, and here it is email. "notify-host-by-email" is defined in commands.cfg.
register | 0 |
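Parameter | Value | Description
use | linux-server | Template used by the monitored host; see templates.cfg
host_name | 01-client218 | Name of the monitored host; can be chosen freely
alias | 01-client218 | Alias of the monitored host; can be chosen freely
address | 192.168.1.218 | IP address of the monitored host
check_command | check-host-alive | Command that checks whether the host is alive; from commands.cfg
max_check_attempts | 3 | Maximum number of check attempts after a failure
normal_check_interval | 2 | Normal check interval; the default unit is minutes
retry_check_interval | 2 | Retry interval after a failure; the default unit is minutes
check_period | 24x7 | Check period; from timeperiods.cfg
notification_interval | 300 | Interval between two consecutive notifications after a failure, in minutes
notification_period | 24x7 | Time range in which notifications are sent on failure
notification_options | d,u,r | Host states in which a notification may be sent: d = down, u = unreachable, r = recovery
contact_groups | admins | Contact group to alert; defined in contacts.cfg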
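Parameter | Value | Description
use | generic-service | Template used by the service; see templates.cfg
host_name | 01-client218 | Name of the monitored host; from hosts.cfg
service_description | Mem | Description of the service; pick a meaningful name
check_command | check_nrpe!check_mem | Command that checks the service
max_check_attempts | 2 | Maximum number of check attempts
normal_check_interval | 2 | Normal check interval; the default unit is minutes
retry_check_interval | 2 | Retry interval after a failure; the default unit is minutes
check_period | 24x7 | Check period; from timeperiods.cfg
notification_interval | 300 | Interval between two consecutive notifications after a failure, in minutes
notification_period | 24x7 | Time range in which notifications are sent on failure
notification_options | w,u,c,r | Service states in which a notification may be sent: w = warning, u = unknown, c = critical, r = recovery
contact_groups | admins | Contact group to alert; defined in contacts.cfg
process_perf_data | 1 | Records performance data, used for PNP graphing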
define timeperiod{
timeperiod_name 24x7 ; name of the time period; it must not contain spaces
alias 24 Hours A Day, 7 Days A Week
sunday 00:00-24:00
monday 00:00-24:00
tuesday 00:00-24:00
wednesday 00:00-24:00
thursday 00:00-24:00
friday 00:00-24:00
saturday 00:00-24:00
}
define timeperiod{
timeperiod_name workhours
alias Normal Work Hours
monday 09:00-17:00
tuesday 09:00-17:00
wednesday 09:00-17:00
thursday 09:00-17:00
friday 09:00-17:00
}
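/usr/local/nagios/etc/objects/templates.cfg
The template configuration file for hosts and services, similar to functions in shell
egrep -v "#|^$" /usr/local/nagios/etc/objects/templates.cfg
################################ Contact template ################################
define contact{
name generic-contact ; name of the generic contact template
service_notification_period 24x7 ; service notification period (24 hours x 7 days)
host_notification_period 24x7 ; host notification period
service_notification_options w,u,c,r,f,s ; notify on service warning, unknown, critical, recovery, flapping, scheduled downtime
host_notification_options d,u,r,f,s ; notify on host down, unreachable, recovery, flapping, scheduled downtime
service_notification_commands notify-service-by-email ; send service notifications by email
host_notification_commands notify-host-by-email ; send host notifications by email
register 0 ; template only, not a real contact
}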
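################################ Generic host template ################################
define host{
name generic-host ; name of the generic host template
notifications_enabled 1 ; host notifications enabled (1 = on, 0 = off)
event_handler_enabled 1 ; host event handler enabled (same values)
flap_detection_enabled 1 ; flap detection is enabled
failure_prediction_enabled 1 ; failure prediction is enabled
process_perf_data 1 ; process performance data
retain_status_information 1 ; retain status information across program restarts
retain_nonstatus_information 1 ; retain non-status information across program restarts
notification_period 24x7 ; host notifications may be sent at any time (24x7)
register 0 ; template only, not a real host
}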
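################################ Linux host template ################################
define host{
name linux-server ; name of the Linux host template
use generic-host ; inherits the remaining values from the generic host template
check_period 24x7 ; check period: 24 hours x 7 days
check_interval 5 ; check every 5 minutes
retry_interval 1 ; after a failure, retry after 1 minute
max_check_attempts 10 ; maximum check attempts after a failure
check_command check-host-alive ; command that checks whether the host is alive
notification_period workhours ; notify during working hours
notification_interval 120 ; after a failure, resend notifications every 120 minutes
notification_options d,u,r ; notify on host down, unreachable, recovery
contact_groups admins ; notifications go to the admins group
register 0 ; template only, not a real host
}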
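################################ Windows host template ################################
define host{
name windows-server ; name of the Windows host template
use generic-host ; inherits the remaining values from the generic host template
check_period 24x7 ; check period: 24 hours x 7 days
check_interval 5 ; check every 5 minutes
retry_interval 1 ; after a failure, retry after 1 minute
max_check_attempts 10 ; maximum check attempts after a failure
check_command check-host-alive ; command that checks whether the host is alive
notification_period 24x7 ; notifications may be sent at any time
notification_interval 30 ; resend notifications every 30 minutes
notification_options d,r ; notify when the host is down or recovers
contact_groups admins ; notifications go to the admins group
hostgroups windows-servers ; Windows host group
register 0 ; template only, not a real host
}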
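################################ Generic printer template ################################
define host{
name generic-printer ; name of this host template
use generic-host ; inherit values from the generic host template
check_period 24x7 ; check at any time (24x7)
check_interval 5 ; check every 5 minutes
retry_interval 1 ; after a failure, retry after 1 minute
max_check_attempts 10 ; maximum number of attempts after a failure
check_command check-host-alive ; check whether the host is alive
notification_period workhours ; notify during work hours
notification_interval 30 ; while the problem persists, re-notify every 30 minutes
notification_options d,r ; notify only on down and recovery
contact_groups admins ; notifications go to the admins group
register 0 ; template only, not registered as a real host
}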
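################################ Generic switch template ################################
define host{
name generic-switch ; name of this host template
use generic-host ; inherit values from the generic host template
check_period 24x7 ; check at any time (24x7)
check_interval 5 ; check the switch every 5 minutes
retry_interval 1 ; retry after 1 minute
max_check_attempts 10 ; maximum number of attempts after a failure
check_command check-host-alive ; check whether the switch is alive
notification_period 24x7 ; notify at any time (24x7)
notification_interval 30 ; interval between repeated alerts, in minutes
notification_options d,r ; notify on down and recovery
contact_groups admins ; notifications go to the admins group
register 0 ; template only, not registered as a real host
}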
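################################ Generic service template ################################
define service{
name generic-service ; name of the generic service template
active_checks_enabled 1 ; active service checks are enabled
passive_checks_enabled 1 ; passive service checks are enabled
parallelize_check 1 ; run checks in parallel
obsess_over_service 1 ; used for distributed monitoring (1 = enabled, 0 = disabled)
check_freshness 0 ; do not check service freshness
notifications_enabled 1 ; service notifications are enabled
event_handler_enabled 1 ; service event handler is enabled
flap_detection_enabled 1 ; flap detection is enabled
failure_prediction_enabled 1 ; failure prediction is enabled
process_perf_data 1 ; process performance data
retain_status_information 1 ; retain status information across program restarts
retain_nonstatus_information 1 ; retain non-status information across program restarts
is_volatile 0 ; the service is not volatile
check_period 24x7 ; check at any time (24x7)
max_check_attempts 3 ; check up to 3 times to confirm the final state
normal_check_interval 10 ; under normal conditions, check every 10 minutes
retry_check_interval 2 ; recheck every 2 minutes until the final state is determined
contact_groups admins ; notifications go to the admins group
notification_options w,u,c,r ; notify on warning, unknown, critical, and recovery
notification_interval 60 ; re-notify after 60 minutes
notification_period 24x7 ; notify at any time (24x7)
register 0 ; template only, not registered as a real service
}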
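################################ Local service template ################################
define service{
name local-service ; name of the local service template
use generic-service ; inherit from generic-service
max_check_attempts 4 ; check up to 4 times to confirm the final state
normal_check_interval 5 ; under normal conditions, check the service every 5 minutes
retry_check_interval 1 ; recheck every minute until the final state is determined
register 0 ; template only, not registered as a real service
}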
sed -n '153,177p' /usr/local/nagios/etc/objects/templates.cfg > /tmp/mytemplates.cfg
vi /tmp/mytemplates.cfg
define service{
name generic-myservice
active_checks_enabled 1
passive_checks_enabled 1
parallelize_check 1
obsess_over_service 1
check_freshness 0
notifications_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
failure_prediction_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 10
retry_check_interval 2
contact_groups admins
notification_options w,u,c,r
notification_interval 60
notification_period 24x7
register 0
}
vi /usr/local/nagios/etc/objects/templates.cfg
define service{
name generic-myservice
active_checks_enabled 1
passive_checks_enabled 1
parallelize_check 1
obsess_over_service 1
check_freshness 0
notifications_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
failure_prediction_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 10
retry_check_interval 2
contact_groups admins
notification_options w,u,c,r
notification_interval 60
notification_period 24x7
register 0
}
cd /usr/local/nagios/etc/services
vi myservices.cfg
a. Before the change
define service{
use generic-service
host_name 01-client148
service_description nginsweb
check_command check_weburl!-I 192.168.1.148
max_check_attempts 3
normal_check_interval 2
retry_check_interval 1
check_period 24x7
notification_interval 30
notification_period 24x7
notification_options w,u,c,r
contact_groups admins
}
b. After the change
define service{
use generic-myservice
host_name 01-client148
service_description nginsweb
check_command check_weburl!-I 192.168.1.148
}
vi /usr/local/nagios/etc/objects/contacts.cfg
# Define the operations and maintenance staff
define contact{
contact_name test01
use generic-contact
alias OMS
email test01@localhost
}
define contact{
contact_name test02
use generic-contact
alias OMS
email test02@localhost
}
# Define the operations group
define contactgroup{
contactgroup_name omgroup
alias Nagios OMG
members test01,test02
}
a. Option 1: change the contact group in the custom template
vi /usr/local/nagios/etc/objects/templates.cfg
define service{
name generic-myservice
active_checks_enabled 1
passive_checks_enabled 1
parallelize_check 1
obsess_over_service 1
check_freshness 0
notifications_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
failure_prediction_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 10
retry_check_interval 2
contact_groups admins,omgroup
notification_options w,u,c,r
notification_interval 60
notification_period 24x7
register 0
}
b. Option 2: add the contact_groups parameter to each service
define service{
use generic-myservice
host_name 01-client148
service_description nginsweb
check_command check_weburl!-I 192.168.1.148
contact_groups omgroup
}
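What needs monitoring keeps changing, and so do the plugins. The default plugins may increasingly fall short, at which point you need to write your own.
Nagios plugins can be scripts or compiled programs (Java, C, C++, PHP, shell, and so on). Nagios does not restrict the implementation language. A custom plugin only has to meet two conditions, that is, return two things: an exit status code and a line of output.
a. Status code
0: OK
1: WARNING
2: CRITICAL
3: UNKNOWN
b. View the status codes defined in Nagios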
head -7 utils.sh
#! /bin/sh
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
STATE_DEPENDENT=4 # rarely used
c. Returning the status code in different languages
(i) Java: System.exit(int status)
(ii) PHP: exit(status)
(iii) Python: sys.exit(status)
(iv) C/C++: return status from main()
(v) bash: exit status
a. Only the first line of output is needed
b. Printing the output in different languages
(i) Java: System.out.println(String msg)
(ii) PHP: echo msg
(iii) Python: print(msg)
(iv) C/C++: printf("%s", msg)
(v) bash: echo/printf msg
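The two conditions above (an exit status and one line of output) can be sketched as a small shell function. This is only an illustration: check_threshold and its arguments are hypothetical names, not a plugin shipped with Nagios, and it assumes plain integer values where higher is worse.

```shell
#!/bin/sh
# Sketch of the plugin contract: one line on stdout, status in the exit code.
# check_threshold is a hypothetical helper, not a stock Nagios plugin.

STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# Usage: check_threshold VALUE WARN CRIT   (non-negative integers)
check_threshold() {
    value=${1:-}; warn=${2:-}; crit=${3:-}
    # Anything that is not a plain integer is reported as UNKNOWN.
    for n in "$value" "$warn" "$crit"; do
        case "$n" in
            ''|*[!0-9]*)
                echo "UNKNOWN - usage: check_threshold VALUE WARN CRIT"
                return $STATE_UNKNOWN ;;
        esac
    done
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - value is $value"
        return $STATE_CRITICAL
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - value is $value"
        return $STATE_WARNING
    fi
    echo "OK - value is $value"
    return $STATE_OK
}

# In a real plugin file the last line would pass the status on to Nagios:
#   check_threshold "$@"; exit $?
```

Keeping the logic in a function and exiting only on the last line makes the script easy to try by hand before it is wired into nrpe.cfg.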
md5sum /etc/passwd > /etc/passwd.md5
md5sum -c /etc/passwd.md5
/etc/passwd: OK
vi /usr/local/nagios/libexec/check_passwd
char=`md5sum -c /etc/passwd.md5 | grep "OK" | wc -l`
if [ $char -eq 1 ];then
echo "passwd is ok"
exit 0
else
echo "passwd is changed"
exit 2
fi
a. Test the script
sh check_passwd
passwd is ok
b. Add a user
useradd aaaa
c. Test the script again (an extra warning is printed and takes up the first line)
sh check_passwd
md5sum: WARNING: 1 of 1 computed checksum did NOT match
passwd is changed
vi /usr/local/nagios/libexec/check_passwd
#!/bin/sh
# Discard md5sum's warning on stderr so the status message is the first line of output.
char=`md5sum -c /etc/passwd.md5 2>/dev/null | grep "OK" | wc -l`
if [ "$char" -eq 1 ];then
echo "passwd is ok"
exit 0
else
echo "passwd is changed"
exit 2
fi
md5sum /etc/passwd > /etc/passwd.md5
a. Test the script
sh check_passwd
passwd is ok
b. Add a user
useradd bbbb
c. Test the script again (the md5sum warning no longer appears)
sh check_passwd
passwd is changed
chmod +x /usr/local/nagios/libexec/check_passwd
ll /usr/local/nagios/libexec/check_passwd
-rwxr-xr-x 1 root root 182 Jul 8 16:21 /usr/local/nagios/libexec/check_passwd
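One gap in the script above is that a missing /etc/passwd.md5 baseline is reported as "passwd is changed". A possible variant, sketched here under the assumption that GNU md5sum (with its --status option) is available, returns UNKNOWN (3) in that case. The function name check_file_md5 is hypothetical.

```shell
#!/bin/sh
# Sketch: the same md5sum comparison, but a missing baseline yields UNKNOWN (3)
# instead of CRITICAL. check_file_md5 is a hypothetical name.

# Usage: check_file_md5 BASELINE_FILE   (a file produced by: md5sum FILE > BASELINE_FILE)
check_file_md5() {
    baseline=$1
    if [ ! -r "$baseline" ]; then
        echo "UNKNOWN - baseline $baseline is missing or unreadable"
        return 3
    fi
    # --status prints nothing; the exit code alone carries the result.
    if md5sum --status -c "$baseline" 2>/dev/null; then
        echo "OK - checksum matches"
        return 0
    fi
    echo "CRITICAL - checksum differs"
    return 2
}

# In a plugin file: check_file_md5 /etc/passwd.md5; exit $?
```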
vi /usr/local/nagios/etc/nrpe.cfg
# Append the following line at the end
command[check_passwd]=/usr/local/nagios/libexec/check_passwd
ps -ef | grep nrpe
nagios ... /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
pkill nrpe
pkill nrpe
/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
netstat -lntup | grep nrpe
tcp 0 0 0.0.0.0:5666 0.0.0.0:* LISTEN 22133/nrpe
./check_nrpe -H 192.168.1.198 -c check_passwd
passwd is changed
md5sum /etc/passwd > /etc/passwd.md5
./check_nrpe -H 192.168.1.198 -c check_passwd
passwd is ok
md5sum /etc/passwd > /etc/passwd.md5
md5sum -c /etc/passwd.md5
vi /application/nagios/libexec/check_passwd
#!/bin/sh
# Discard md5sum's warning on stderr so the status message is the first line of output.
char=`md5sum -c /etc/passwd.md5 2>/dev/null | grep "OK" | wc -l`
if [ "$char" -eq 1 ];then
echo "passwd is ok"
exit 0
else
echo "passwd is changed"
exit 2
fi
chmod +x /application/nagios/libexec/check_passwd
vi /application/nagios/etc/nrpe.cfg
# Append the following line at the end
command[check_passwd]=/application/nagios/libexec/check_passwd
ps -ef | grep nrpe
nagios ... /application/nagios/bin/nrpe -c /application/nagios/etc/nrpe.cfg -d
pkill nrpe
pkill nrpe
/application/nagios/bin/nrpe -c /application/nagios/etc/nrpe.cfg -d
netstat -lntup | grep nrpe
tcp 0 0 0.0.0.0:5666 0.0.0.0:* LISTEN 22133/nrpe
vi /usr/local/nagios/etc/objects/services.cfg
# Add a service
define service{
use generic-service
host_name 01-client218,02-client219,nagiosServer198
service_description check_passwd
check_command check_nrpe!check_passwd
}
/etc/init.d/nagios checkconfig
/etc/init.d/nagios reload
a. Check the packages required for graphing
rpm -q zlib zlib-devel freetype freetype-devel cairo pango gd gd-devel
zlib-1.2.3-29.el6.i686
zlib-devel-1.2.3-29.el6.i686
freetype-2.3.11-14.el6_3.1.i686
freetype-devel-2.3.11-14.el6_3.1.i686
cairo-1.8.8-3.1.el6.i686
pango-1.28.1-7.el6_3.i686
gd-2.0.35-11.el6.i686
package gd-devel is not installed
b. Install gd-devel
rpm -ivh gd-devel-2.0.35-11.el6.i686.rpm
c. Check the dependencies again
rpm -q zlib zlib-devel freetype freetype-devel cairo pango gd gd-devel
zlib-1.2.3-29.el6.i686
zlib-devel-1.2.3-29.el6.i686
freetype-2.3.11-14.el6_3.1.i686
freetype-devel-2.3.11-14.el6_3.1.i686
cairo-1.8.8-3.1.el6.i686
pango-1.28.1-7.el6_3.i686
gd-2.0.35-11.el6.i686
gd-devel-2.0.35-11.el6.i686
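With a longer package list it is easy to overlook a "not installed" line in the rpm -q output. The sketch below filters that output down to the missing package names; missing_packages is a hypothetical helper that only parses text on stdin, so it makes no assumptions beyond the message format shown above.

```shell
#!/bin/sh
# Sketch: list the packages that `rpm -q` reports as not installed.
# missing_packages is a hypothetical helper; it only parses text on stdin.

missing_packages() {
    # rpm -q prints "package NAME is not installed" for each absent package.
    sed -n 's/^package \(.*\) is not installed$/\1/p'
}

# Typical use on an RPM-based host (left as a comment so the sketch runs anywhere):
#   rpm -q zlib zlib-devel freetype freetype-devel cairo pango gd gd-devel | missing_packages
```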
a. Install the libart_lgpl dependency (rrdtool depends on libart_lgpl)
(i) Option 1: yum
yum install libart_lgpl libart_lgpl-devel -y
(ii) Option 2: build from source
cd /wddg/tools/
wget http://ftp.gnome.org/pub/gnome/sources/libart_lgpl/2.3/libart_lgpl-2.3.17.tar.gz
tar zxf libart_lgpl-2.3.17.tar.gz
cd libart_lgpl-2.3.17
./configure
make
make install
/bin/cp -r /usr/local/include/libart-2.0 /usr/include/
cd ..
b. Install rrdtool
# wget http://oss.oetiker.ch/rrdtool/pub/rrdtool-1.2.14.tar.gz
tar xf rrdtool-1.2.14.tar.gz
cd rrdtool-1.2.14
./configure --prefix=/usr/local/rrdtool --disable-python --disable-tcl
make
make install
cd ..
ll /usr/local/rrdtool/bin
-rwxr-xr-x 1 root root 45032 Jul 9 12:08 rrdcgi
-rwxr-xr-x 1 root root 4915 Jul 9 12:08 rrdtool
-rwxr-xr-x 1 root root 42633 Jul 9 12:08 rrdupdate
Note: the three files above appearing in /usr/local/rrdtool/bin indicate a successful install. Warnings printed by configure can be ignored.
tar zxf pnp-0.4.14.tar.gz
cd pnp-0.4.14
./configure --with-rrdtool=/usr/local/rrdtool/bin/rrdtool \
--with-perfdata-dir=/usr/local/nagios/share/perfdata/
make all
make install
make install-config
make install-init
ll /usr/local/nagios/libexec/ | grep process
-rwxr-xr-x 1 nagios nagios 31827 Jul 9 12:21 process_perfdata.pl
Notes:
--with-rrdtool=/usr/local/rrdtool/bin/rrdtool: the command that actually draws the graphs
--with-perfdata-dir=/usr/local/nagios/share/perfdata/: the directory holding the data used for graphing
The presence of process_perfdata.pl in /usr/local/nagios/libexec/ indicates a successful install
Warnings printed by configure can be ignored
cd /usr/local/nagios/etc
cp nagios.cfg nagios.cfg.bak
vi nagios.cfg +835
Before (line 835):
process_performance_data=0
After (line 835):
# Switch for saving performance data: 0 = do not save, 1 = save
process_performance_data=1
Before (lines 847 and 848):
#host_perfdata_command=process-host-perfdata
#service_perfdata_command=process-service-perfdata
After (lines 847 and 848):
# Save host performance data
host_perfdata_command=process-host-perfdata
# Save service performance data
service_perfdata_command=process-service-perfdata
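Line numbers such as 835 and 847 may differ between Nagios versions, so the same three changes can also be applied by pattern. The following is a sketch only: enable_perfdata is a hypothetical helper, and it assumes the three settings appear in nagios.cfg exactly as shown in the "before" lines above.

```shell
#!/bin/sh
# Sketch: apply the three nagios.cfg changes by pattern instead of by line number.
# enable_perfdata is a hypothetical helper; it keeps a copy of the original file
# under the suffix .pre-perfdata (an arbitrary name chosen for this example).

enable_perfdata() {
    cfg=$1
    cp "$cfg" "$cfg.pre-perfdata" || return 1
    sed -e 's/^process_performance_data=0/process_performance_data=1/' \
        -e 's/^#\(host_perfdata_command=process-host-perfdata\)/\1/' \
        -e 's/^#\(service_perfdata_command=process-service-perfdata\)/\1/' \
        "$cfg.pre-perfdata" > "$cfg"
}

# Typical use:
#   enable_perfdata /usr/local/nagios/etc/nagios.cfg
```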
cd /usr/local/nagios/etc/objects
vi commands.cfg +227
Before:
# 'process-host-perfdata' command definition
define command{
command_name process-host-perfdata
command_line /usr/bin/printf "%b" "$LASTHOSTCHECK$\t$HOSTNAME$\t$HOSTSTATE$\t$HOSTATTEMPT$\t$HOSTSTATETYPE$\t$HOSTEXECUTIONTIME$\t$HOSTOUTPUT$\t$HOSTPERFDATA$\n" >> /usr/local/nagios/var/host-perfdata.out
}
# 'process-service-perfdata' command definition
define command{
command_name process-service-perfdata
command_line /usr/bin/printf "%b" "$LASTSERVICECHECK$\t$HOSTNAME$\t$SERVICEDESC$\t$SERVICESTATE$\t$SERVICEATTEMPT$\t$SERVICESTATETYPE$\t$SERVICEEXECUTIONTIME$\t$SERVICELATENCY$\t$SERVICEOUTPUT$\t$SERVICEPERFDATA$\n" >> /usr/local/nagios/var/service-perfdata.out
}
After:
# 'process-host-perfdata' command definition
define command{
command_name process-host-perfdata
command_line /usr/local/nagios/libexec/process_perfdata.pl
}
# 'process-service-perfdata' command definition
define command{
command_name process-service-perfdata
command_line /usr/local/nagios/libexec/process_perfdata.pl
}
/etc/init.d/nagios checkconfig
/etc/init.d/nagios reload
http://192.168.161.225/nagios/pnp/index.php
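For graphs to appear on the page, process_perf_data must also be set to 1 in templates.cfg or in your own host and service definitions. Otherwise no data is collected.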
vi templates.cfg
define host{
name generic-host
notifications_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
failure_prediction_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
notification_period 24x7
register 0
}
define service{
name generic-service
active_checks_enabled 1
passive_checks_enabled 1
parallelize_check 1
obsess_over_service 1
check_freshness 0
notifications_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
failure_prediction_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
}
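At this point graphs can only be viewed at http://192.168.161.225/nagios/pnp/index.php, which lists every available graph. It is more convenient to have a small graph icon next to each host or service in the Nagios monitoring interface, so that clicking the icon opens the trend graphs for that host or service.
To do this, set the action_url parameter in templates.cfg or in your own host and service definitions.
By default the plugins shipped with Nagios produce graphs, but custom plugins do not, because they return no performance data to Nagios.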
action_url /nagios/pnp/index.php?host=$HOSTNAME$
action_url /nagios/pnp/index.php?host=$HOSTNAME$&srv=$SERVICEDESC$
vi templates.cfg
define host{
name linux-server
use generic-host
check_period 24x7
check_interval 5
retry_interval 1
max_check_attempts 10
check_command check-host-alive
notification_period workhours
notification_interval 120
notification_options d,u,r
contact_groups admins
register 0
action_url /nagios/pnp/index.php?host=$HOSTNAME$
}
define service{
name generic-service
active_checks_enabled 1
passive_checks_enabled 1
parallelize_check 1
obsess_over_service 1
check_freshness 0
notifications_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
failure_prediction_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 10
retry_check_interval 2
contact_groups admins
notification_options w,u,c,r
notification_interval 60
notification_period 24x7
register 0
action_url /nagios/pnp/index.php?host=$HOSTNAME$&srv=$SERVICEDESC$
}
ll /usr/local/nagios/share/perfdata/01-client148/
-rw-r--r-- 1 nagios nagios 384736 Jul 9 13:38 Disk.rrd
-rw-r--r-- 1 nagios nagios 11319 Jul 9 13:38 Disk.xml
-rw-r--r-- 1 nagios nagios 1917824 Jul 9 13:43 IO.rrd
-rw-r--r-- 1 nagios nagios 13077 Jul 9 13:43 IO.xml
-rw-r--r-- 1 nagios nagios 384736 Jul 9 13:44 Mem.rrd
-rw-r--r-- 1 nagios nagios 11405 Jul 9 13:44 Mem.xml
-rw-r--r-- 1 nagios nagios 768008 Jul 9 13:44 dnsweb.rrd
-rw-r--r-- 1 nagios nagios 11846 Jul 9 13:44 dnsweb.xml
-rw-r--r-- 1 nagios nagios 768008 Jul 9 13:45 nginsweb.rrd
-rw-r--r-- 1 nagios nagios 11857 Jul 9 13:45 nginsweb.xml
-rw-r--r-- 1 nagios nagios 384736 Jul 9 13:43 port_80.rrd
-rw-r--r-- 1 nagios nagios 11394 Jul 9 13:43 port_80.xml
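Custom plugins get no .rrd files like those listed above unless they emit performance data. In the Nagios plugin output format, performance data follows a "|" on the output line as label=value pairs. The sketch below shows the idea with a hypothetical count_accounts function that reports the number of lines in a passwd-style file; the label name "accounts" is likewise an assumption made for this example.

```shell
#!/bin/sh
# Sketch: a custom check whose output line carries performance data after "|".
# count_accounts and the "accounts" label are hypothetical names.

# Usage: count_accounts FILE   (for example /etc/passwd)
count_accounts() {
    file=$1
    if [ ! -r "$file" ]; then
        echo "UNKNOWN - cannot read $file"
        return 3
    fi
    # One line per account; tr strips the padding that some wc versions add.
    n=$(wc -l < "$file" | tr -d ' ')
    # Text before "|" is the status message; text after it is performance data.
    echo "OK - $n accounts | accounts=$n"
    return 0
}

# In a plugin file: count_accounts /etc/passwd; exit $?
```

Once such a plugin runs as a service with process_perf_data set to 1, the value after the "|" is what gets handed to process_perfdata.pl.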
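In production, use your own company's mailbox for alert mail wherever possible. Other mail providers limit how often mail can be sent to them and may reject alerts or treat them as spam, so alerts can be delayed or never arrive.
This requires a Fetion client installed on a Win32 machine, with the recipient's mobile number added as a friend and confirmed by the recipient. Only then can SMS messages be sent.
This applies to mailboxes such as 139, 126, and 189. When mail arrives, the mailbox provider notifies the recipient by phone through its mail-reminder feature. The length of the alert text is limited.
Specialized companies provide SMS gateways that deliver messages straight to a phone. The alert is usually a request to a URL that carries the message. SMS fees apply. The format is as follows: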
http://s.ccme.cc/send.jsp?circle=username&pwd=password&mobile=$CONTACT&service=gg89-3aa06423clf83fd&msgid=23224&message=$TITLE[${alert_date}sa]
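Messages can also be sent over QQ or MSN. Community developers have written programs that run from the command line and use the MSN and QQ protocols to send messages directly to MSN and QQ friends.
For issues that do not need urgent handling, email alerts are the usual choice, for example remaining memory or free disk space.
For important and urgent services, send email and SMS together. The email records the details of the fault, and the SMS is the timely reminder.
Simple, easy to use, stable, reliable, and reasonably priced.
Spending a reasonable amount to run the service well is the normal way to approach the job. Insisting on free tools can cost more if an alert fails to go out. Explain the trade-offs clearly and let management decide. A properly run company should choose a reliable alerting method for its services wherever possible.
Class A: disk space, CPU, and memory alerts are ordinary alerts, handled within the operations team through the normal process.
Class B: a service outage or a website that will not open is a severe alert, and operations must coordinate with the relevant engineering staff to diagnose and resolve it.
Class A alerts have no fixed deadline in principle, but must be handled promptly enough that service is not affected.
Class B alerts require an email to all operations colleagues and the relevant engineers within 10 minutes.
The main task is to edit /usr/local/nagios/etc/objects/contacts.cfg: add everyone who should receive alerts and organize them into groups. Then set the contact_groups parameter in the host, service, or template definition: contact_groups groupname. For example: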
contact_groups mobilegroup
# Mobile SMS user
define contact{
contact_name mobile_test01
use generic-contact
alias OMS
email [email protected] ; SMS alerts are delivered through the mailbox's mail-reminder feature
}
# Email and MSN user
define contact{
contact_name test02
use generic-contact
alias OMS
email [email protected]
address1 [email protected] ; address used for MSN messages
}
# Email user
define contact{
contact_name test03
use generic-contact
alias OMS
email [email protected]
}
# Define the mobile group
define contactgroup{
contactgroup_name mobilegroup
alias mobile
members mobile_test01, mobile_test02
}
vi /usr/local/nagios/etc/objects/contacts.cfg
define contact{
contact_name test03-pager
use generic-contact
alias OMS
email [email protected]
pager 12345678911
}
vi /usr/local/nagios/etc/objects/commands.cfg
# 'notify-host-by-pager' command definition
define command{
command_name notify-host-by-pager
command_line $USER1$/sms_send "Host $HOSTSTATE$ alert for $HOSTNAME$" $CONTACTPAGER$
}
# 'notify-service-by-pager' command definition
define command{
command_name notify-service-by-pager
command_line $USER1$/sms_send "$HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$" $CONTACTPAGER$
}
vi /usr/local/nagios/etc/objects/contacts.cfg
define contact{
contact_name test03-pager
use generic-contact
alias OMS
email [email protected]
pager 12345678911
service_notification_commands notify-service-by-email,notify-service-by-pager
host_notification_commands notify-host-by-email,notify-host-by-pager
}
vi /usr/local/nagios/etc/objects/templates.cfg
define contact{
name generic-contact
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r,f,s
host_notification_options d,u,r,f,s
service_notification_commands notify-service-by-email,notify-service-by-pager
host_notification_commands notify-host-by-email,notify-host-by-pager
register 0
}
vi /usr/local/nagios/etc/objects/hosts.cfg
define host{
use linux-server
host_name 01-client218
alias 01-client218
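contact_groups admins ; contact groups that receive notifications
contacts test03-pager ; individual contacts are listed with the contacts directive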
address 192.168.1.218
}
define service{
use generic-service
host_name 01-client148
service_description port_80
check_command check_nrpe!check_port_80
max_check_attempts 3
normal_check_interval 2
retry_check_interval 1
check_period 24x7
notification_interval 30
notification_period 24x7
notification_options w,u,c,r
contact_groups admins
contacts test03-pager
}
cd /usr/local/nagios/libexec
vi sms_send
#!/bin/sh
alert_date=$(date +%y-%m-%d" "%H:%M)
TITLE=$1 # format: "Host $HOSTSTATE$ alert for $HOSTNAME$"
CONTACT=$2 # recipient's phone number, passed in as $CONTACTPAGER$
# curl method
curl -d cdkey=3RTY-EMY-0980-MTUQ2 -d password=189162 -d phone="$CONTACT" -d message="$TITLE[${alert_date} myusersa]" http://a.b.c/sdkproxy/sendsms.action
#wget --quiet "http://s.ccme.cc/qxt/send.jsp?circle=test01&pwd=123456&mobile=12345678901&service=f1fb0546-ebb6-0987-8f20-560524c1f88d&msgid=3956724&message=$TITLE[${alert_date}myusersa n]"
chmod +x /usr/local/nagios/libexec/sms_send
./sms_send "aaaaaa" 123445678901
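First check whether iptables and SELinux are disabled, then check whether nrpe is running.
First check whether openssl and openssl-devel are installed, then check whether allowed_hosts in nrpe.cfg includes the server's IP: allowed_hosts=127.0.0.1,Server_IP. Separate multiple IPs with commas and do not use spaces.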
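The allowed_hosts rule (comma-separated, no spaces, server IP present) is easy to get wrong by hand. A rough way to verify it is sketched below; check_allowed_hosts is a hypothetical helper, and it assumes allowed_hosts lists individual IP addresses as in the example above.

```shell
#!/bin/sh
# Sketch: verify that allowed_hosts in nrpe.cfg lists the server IP and has no spaces.
# check_allowed_hosts is a hypothetical helper.

# Usage: check_allowed_hosts NRPE_CFG SERVER_IP
check_allowed_hosts() {
    cfg=$1; server_ip=$2
    line=$(grep '^allowed_hosts=' "$cfg") || {
        echo "allowed_hosts is not set"
        return 1
    }
    case "$line" in
        *" "*) echo "allowed_hosts contains a space"; return 1 ;;
    esac
    # Wrap the list in commas so that only whole addresses match.
    case ",${line#allowed_hosts=}," in
        *",$server_ip,"*) echo "allowed_hosts includes $server_ip"; return 0 ;;
    esac
    echo "allowed_hosts does not include $server_ip"
    return 1
}

# Typical use on a monitored client:
#   check_allowed_hosts /application/nagios/etc/nrpe.cfg 192.168.1.198
```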
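This means nrpe on the client returned no data. Work through the troubleshooting steps to find the cause.
This means the command being called is not defined. The command[Command_name] entry in nrpe.cfg must match the check_nrpe!Command_name value of check_command in services.cfg.
a. Error message
Error: Service check command 'check_nrpe' specified in service 'DiskPartition' for host 'host_name' not defined anywhere!
b. Fix: append the following define command block to the end of commands.cfg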
vi /usr/local/nagios/etc/objects/commands.cfg
define command{
command_name check_nrpe
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
/usr/local/nagios/libexec/check_nrpe -H client_ip -c check_disk
/usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -c check_disk
a. Look up the command that check_disk maps to in nrpe.cfg
command[check_disk]=/application/nagios/libexec/check_disk -w 20% -c 8% -p /
b. Run the command that check_disk maps to
/application/nagios/libexec/check_disk -w 20% -c 8% -p /
c. Check that the check_disk plugin is executable
ll /application/nagios/libexec/check_disk
-rwxr-xr-x 1 root root 418052 Jun 30 20:19 /application/nagios/libexec/check_disk