Torque PBS Troubleshooting

1.1

qsub: submit error (Bad UID for job execution MSG=ruserok failed validating jobtest/jobtest from mn01)

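The likely cause: when many jobs are submitted concurrently, Torque by default submits some of them as a proxy user from other submit nodes or from compute nodes. Allowing proxy users and compute-node submission in the server settings improves how concurrent submissions are handled.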

On the Torque server, as root:

  # Set the submit hosts
  qmgr -c 'set server submit_hosts = mn01'
  # Allow compute nodes to submit jobs
  qmgr -c 'set server allow_node_submit = True'
  # Allow proxy users
  qmgr -c 'set server allow_proxy_user = True'
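When there is more than one submit host, the same settings can be generated in a loop. The following is a minimal sketch, not a definitive procedure: the function name `gen_submit_cmds` and the second host `mn02` are hypothetical, and it relies on qmgr's `+=` operator to append to a list attribute. It only prints the commands, so they can be reviewed before being run on the server.

```shell
#!/bin/sh
# Print the qmgr commands that register each host argument as a submit host,
# then enable compute-node submission and proxy users.
# Nothing is executed here; pipe the output to sh on the server to apply it.
gen_submit_cmds() {
    first=1
    for host in "$@"; do
        if [ "$first" -eq 1 ]; then
            # '=' replaces the current list with the first host
            echo "qmgr -c 'set server submit_hosts = $host'"
            first=0
        else
            # '+=' appends each further host to the list
            echo "qmgr -c 'set server submit_hosts += $host'"
        fi
    done
    echo "qmgr -c 'set server allow_node_submit = True'"
    echo "qmgr -c 'set server allow_proxy_user = True'"
}
```

Usage: `gen_submit_cmds mn01 mn02` prints four commands; append `| sh` once the output looks right.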

1.2

qsub: submit error (Bad UID for job execution MSG=User lsh does not exist in server password file)

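On the Torque server, as root:

  qmgr -c 'set server allow_node_submit = True'
  qmgr -c 'set server submit_hosts = mn01'

In addition, create a user on the server node with the same name as the user on the client, and make sure the two hosts can log in to each other over SSH without a password.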
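To confirm that the account really matches on both hosts, the passwd entries can be compared. This is a small sketch under one assumption that goes beyond the note above: that the UID and GID should also be identical on both sides, not only the user name. The helper names `uid_gid_of` and `same_account` are hypothetical.

```shell
#!/bin/sh
# Print "uid:gid" taken from a passwd(5)-format line given as $1.
uid_gid_of() {
    printf '%s\n' "$1" | awk -F: '{ print $3 ":" $4 }'
}

# Succeed when two passwd lines carry the same UID and GID.
same_account() {
    [ "$(uid_gid_of "$1")" = "$(uid_gid_of "$2")" ]
}
```

Usage, run on the client: `same_account "$(getent passwd lsh)" "$(ssh mn01 getent passwd lsh)" || echo "lsh differs between hosts"`.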

1.3

LOG_ERROR::Unable to get connection to socket (15096) in tcp_connect_sockaddr, Failed when trying to get privileged port - socket_get_tcp_priv() failed

  systemctl stop pbs_mom pbs_server trqauthd pbs_sched
  #systemctl status -l pbs_mom pbs_server trqauthd pbs_sched
  systemctl restart pbs_mom && systemctl status pbs_mom
  systemctl restart pbs_server && systemctl status pbs_server
  systemctl restart trqauthd && systemctl status trqauthd
  #systemctl restart pbs_sched && systemctl status pbs_sched
  systemctl restart maui.d && systemctl status maui.d
  systemctl status -l pbs_mom pbs_server trqauthd maui.d

tail -f /var/spool/torque/job_logs/20180511
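The job log file is named after the date (YYYYMMDD), so the path changes every day. A small helper can build it; this is only a sketch, and both the function name `job_log_path` and the `TORQUE_SPOOL` variable are hypothetical, with the default taken from the path used above.

```shell
#!/bin/sh
# Base directory of the Torque installation (default matches the path above).
TORQUE_SPOOL=${TORQUE_SPOOL:-/var/spool/torque}

# Print the job log path for the given day (YYYYMMDD), or for today if omitted.
job_log_path() {
    day=${1:-$(date +%Y%m%d)}
    echo "$TORQUE_SPOOL/job_logs/$day"
}
```

Usage: `tail -f "$(job_log_path)"` follows today's log, and `job_log_path 20180511` gives the path for a specific day.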

1.4


  May 09 16:29:21 c01n01 PBS_Server[18995]: LOG_WARNING::Bad UID for job execution (15025) in log_commit_error, send_job commit failed, rc=15025 (Bad UID for job execution: start failed on unknown node)
  May 09 16:29:21 c01n01 PBS_Server[18995]: LOG_CRITICAL::Bad UID for job execution (15025) in log_commit_error, child failed in commit request for job 12.c01n01
  May 09 16:29:21 c01n01 PBS_Server[18995]: LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 12.c01n01
  May 09 16:29:21 c01n01 PBS_Server[18995]: LOG_ERROR::Request invalid for state of job (15018) in 12.c01n01, obit received for job 12.c01n01 from host c01n05 with bad state (state: QUEUED)
  May 09 16:29:21 c01n01 pbs_server[18995]: Assertion failed, bad pointer in link: file "req_select.c", line 401

On the Torque server, as root:

tracejob

  05/09/2018 16:31:27 A queue=batch
  05/09/2018 16:31:28.912 S unable to run job, MOM rejected/timeout
  05/09/2018 16:31:28.913 S unable to run job, send to MOM '172.18.1.5' failed


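The job's user (jobtest) was apparently missing on the execution node c01n05. Create the user and group on that node, then set up passwordless SSH from it to mn01:

  ssh [email protected]
  groupadd -g 1004 jobtest
  useradd -u 1004 -g 1004 jobtest
  passwd jobtest
  su - jobtest
  ssh-******
  [jobtest@c01n05 ~]$ ssh-copy-id mn01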

1.5 Server cannot find the MOM node


  May 09 13:58:46 server02.localdomain systemd[1]: Started TORQUE pbs_mom daemon.
  May 09 13:58:46 server02.localdomain systemd[1]: Starting TORQUE pbs_mom daemon...
  May 09 13:58:47 server02.localdomain pbs_mom[12621]: LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
  May 09 13:58:47 server02.localdomain pbs_mom[12621]: LOG_ERROR::send_update_to_a_server, Status not successfully updated for 1 MOM status update intervals
  May 09 13:58:51 server02.localdomain pbs_mom[12621]: LOG_ERROR::send_update_to_a_server, Status update successfully sent after 1 MOM status update intervals
  May 09 13:58:46 server02.localdomain systemd[1]: Started TORQUE pbs_server daemon.
  May 09 13:58:46 server02.localdomain systemd[1]: Starting TORQUE pbs_server daemon...
  May 09 13:58:51 server02.localdomain PBS_Server[12638]: LOG_ERROR::svr_is_request, bad attempt to connect from 192.168.0.82:519 (address not trusted - check entry in server_priv/nodes)

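Solution: as the log message suggests, check that the MOM node has a correct entry in the server's nodes file:

vim /var/spool/torque/server_priv/nodes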
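Before opening the file in an editor, a quick check can tell whether a node is already listed. This sketch assumes the usual nodes-file layout of one node per line with the host name in the first field (for example `server02 np=8`, where the `np` value is made up for illustration). The function name `node_listed` is hypothetical.

```shell
#!/bin/sh
# Succeed when host $2 appears as the first field of a line in nodes file $1.
# Comment lines never match because their first field starts with '#'.
node_listed() {
    awk -v host="$2" '$1 == host { found = 1 } END { exit !found }' "$1"
}
```

Usage on the server: `node_listed /var/spool/torque/server_priv/nodes server02 || echo "server02 is missing from the nodes file"`. After editing the file, pbs_server typically has to be restarted for the change to take effect.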

1.6 Environment conflict: set the PATH environment variable

export PATH=/usr/local/sbin:/usr/local/bin/:$PATH

1.7

pbs_server[18995]: Assertion failed, bad pointer in link: file "req_select.c", line 401

1.8

PBS_Server[25497]: LOG_ERROR::Client connection not found. trqauthd unable to authorize user. Possible transient failure. Please try again (15135) in req_authenuser, trqauthd fail 49436

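Restart the trqauthd service on the client, and try restarting the network as well (on hosts with multiple NICs, the problem may be related to NIC priority or startup order).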

systemctl restart trqauthd && systemctl status trqauthd

1.9


  Jun 05 09:11:57 c01n01 PBS_Server[25414]: LOG_ERROR::Permission denied (13) in chk_file_sec, Security violation with "/var/spool/torque/spool/" - /var/spool/torque/spool/ cannot be accessed
  Jun 05 09:11:57 c01n01 PBS_Server[25414]: LOG_ERROR::PBS_Server, pbsd_init failed

http://www.clusterresources.com/pipermail/torqueusers/2010-November/011730.html


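  # Caution: this recursively resets permissions on everything under /var
  chmod -Rf 755 /var
  # World-writable plus sticky bit (mode 1777) on spool/ and undelivered/
  chmod -Rf 777 /var/spool/torque/spool/
  chmod +t /var/spool/torque/spool/
  chmod -Rf 777 /var/spool/torque/undelivered/
  chmod +t /var/spool/torque/undelivered/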
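The two directories end up with mode 1777 (777 plus the sticky bit). A short check can confirm this before restarting pbs_server. This is a sketch that assumes GNU `stat` (the `-c %a` option), as found on typical Linux systems; the function name `has_mode` is hypothetical.

```shell
#!/bin/sh
# Succeed when path $1 has exactly the octal mode $2 (for example 1777).
# Assumes GNU stat, which prints the octal mode with -c %a.
has_mode() {
    [ "$(stat -c %a "$1")" = "$2" ]
}
```

Usage: `for d in spool undelivered; do has_mode "/var/spool/torque/$d" 1777 || echo "$d has the wrong mode"; done`.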

1.10 pbs_mom error


  [root@c03n08 ~]# systemctl status -l pbs_mom trqauthd
  ● pbs_mom.service - TORQUE pbs_mom daemon
     Loaded: loaded (/usr/lib/systemd/system/pbs_mom.service; enabled; vendor preset: disabled)
     Active: failed (Result: core-dump) since Wed 2018-08-08 13:34:32 CST; 2s ago
    Process: 19159 ExecStop=/bin/bash -c for i in {1..5}; do kill -0 $MAINPID &>/dev/null || exit 0; /usr/local/sbin/momctl -s && exit; sleep 1; done (code=exited, status=0/SUCCESS)
    Process: 19155 ExecStart=/usr/local/sbin/pbs_mom -F -d $PBS_HOME $PBS_ARGS (code=dumped, signal=SEGV)
   Main PID: 19155 (code=dumped, signal=SEGV)
  Aug 08 13:34:30 c03n08 systemd[1]: Started TORQUE pbs_mom daemon.
  Aug 08 13:34:30 c03n08 systemd[1]: Starting TORQUE pbs_mom daemon...
  Aug 08 13:34:31 c03n08 pbs_mom[19155]: LOG_ERROR::No such file or directory (2) in task_recov, open of task file
  Aug 08 13:34:32 c03n08 pbs_mom[19155]: LOG_ERROR::init_abort_jobs, job 404.c01n01 no longer has valid password entry - deleting
  Aug 08 13:34:32 c03n08 systemd[1]: pbs_mom.service: main process exited, code=dumped, status=11/SEGV
  Aug 08 13:34:32 c03n08 systemd[1]: Unit pbs_mom.service entered failed state.
  Aug 08 13:34:32 c03n08 systemd[1]: pbs_mom.service failed.

References:
http://www.clusterresources.com/pipermail/torqueusers/2012-November/015266.html
pbs - Torque : pbs_Server No such file or directory (2) in recov_attr, read2 - Stack Overflow


Move the stale job file named in the log out of mom_priv/jobs (keeping a backup), then restart pbs_mom:

  mkdir -p ~/backups/venv/var/spool/torque/mom_priv/jobs/
  mv /var/spool/torque/mom_priv/jobs/404.c01n01.JB ~/backups/venv/var/spool/torque/mom_priv/jobs/
  systemctl restart pbs_mom
  systemctl status -l pbs_mom
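If this happens on several nodes or for several jobs, the move can be wrapped in a function. The sketch below mirrors the commands above and only touches the `.JB` file of the given job; the function name `quarantine_job` and its argument order are hypothetical.

```shell
#!/bin/sh
# Move the .JB file of job $1 from jobs directory $2 into backup directory $3.
# Fails without changing anything if the job file does not exist.
quarantine_job() {
    src="$2/$1.JB"
    if [ ! -f "$src" ]; then
        echo "no job file: $src" >&2
        return 1
    fi
    mkdir -p "$3" && mv "$src" "$3/"
}
```

Usage: `quarantine_job 404.c01n01 /var/spool/torque/mom_priv/jobs ~/backups/venv/var/spool/torque/mom_priv/jobs && systemctl restart pbs_mom`.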

1.11 ghost_queue error when submitting a job

qsub: submit error (This queue had errors during its recovery. Please correct any settings that were lost on restart and then unset the ghost_queue setting via qmgr. Once this is unset, then the queue will be able to accept new jobs again.)

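Listing the queue on the Torque server node shows ghost_queue = True. Torque's protection mechanism set this because the nodes in the queue ran into problems earlier (a reboot, for example). Once the nodes in the queue have recovered, set ghost_queue to False. References: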
Queue Attribute Reference
Automatic Queue and Job Recovery


  qmgr -c 'list queue beijing'
  qmgr -c 'set queue beijing ghost_queue = 0'
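To check a queue without reading the whole listing, the output of `list queue` can be filtered. This is a sketch that assumes the listing prints attributes one per line as `name = value`, matching the `ghost_queue = True` seen above; the function name `is_ghost_queue` is hypothetical.

```shell
#!/bin/sh
# Read `qmgr -c 'list queue NAME'` output on stdin and succeed
# when it contains the line "ghost_queue = True".
is_ghost_queue() {
    awk '$1 == "ghost_queue" && $2 == "=" && $3 == "True" { found = 1 }
         END { exit !found }'
}
```

Usage: `qmgr -c 'list queue beijing' | is_ghost_queue && echo "beijing is still a ghost queue"`.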

1.12


  ● pbs_server.service - TORQUE pbs_server daemon
     Loaded: loaded (/usr/lib/systemd/system/pbs_server.service; enabled; vendor preset: disabled)
     Active: failed (Result: signal) since Sat 2018-08-25 04:38:09 CST; 2 days ago
    Process: 13194 ExecStart=/usr/local/sbin/pbs_server -F -d $PBS_HOME $PBS_ARGS (code=killed, signal=ABRT)
   Main PID: 13194 (code=killed, signal=ABRT)
  Aug 25 04:38:08 c01n01 pbs_server[13194]: 7f8dda17c000-7f8dda17d000 r--p 00021000 fd:00 33699 /usr/lib64/ld-2.17.so
  Aug 25 04:38:08 c01n01 pbs_server[13194]: 7f8dda17d000-7f8dda17e000 rw-p 00022000 fd:00 33699 /usr/lib64/ld-2.17.so
  Aug 25 04:38:08 c01n01 pbs_server[13194]: 7f8dda17e000-7f8dda17f000 rw-p 00000000 00:00 0
  Aug 25 04:38:08 c01n01 pbs_server[13194]: 7fff9d866000-7fff9d977000 rw-p 00000000 00:00 0 [stack]
  Aug 25 04:38:08 c01n01 pbs_server[13194]: 7fff9d9aa000-7fff9d9ac000 r-xp 00000000 00:00 0 [vdso]
  Aug 25 04:38:08 c01n01 pbs_server[13194]: ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
  Aug 25 04:38:08 c01n01 pbs_server[13194]: pbs_server is up (version - 6.1.2, port - 15001)
  Aug 25 04:38:09 c01n01 systemd[1]: pbs_server.service: main process exited, code=killed, status=6/ABRT
  Aug 25 04:38:09 c01n01 systemd[1]: Unit pbs_server.service entered failed state.
  Aug 25 04:38:09 c01n01 systemd[1]: pbs_server.service failed.

1.13 Common Reasons Why Jobs Won't Start

Common Reasons Why Jobs Won't Start

1.13.1 could not locate requested resources xxx (node_spec failed) job allocation request exceeds currently available cluster nodes

[torquedev] strange behavior in TORQUE 5.1.1
[torqueusers] pbsnodes still show node state=free with all np assigned
