LSF tips: what do application exit codes mean?

Application exit codes in LSF

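| Exit code | Description |
| --- | --- |
| 0 | The application finished normally without errors. |
| 1 ~ 125 | Application-defined exit code; consult the application's manual for its meaning. Some applications return a nonzero exit code even when they finish normally. |
| 126 | The user does not have permission to execute the command. |
| 127 | The command to execute was not found. |
| > 128 | The job was terminated by a signal. The signal number is the exit code minus 128; look up its meaning on the operating system in question. For example, for exit code 130, 130 - 128 = 2, and on Linux signal 2 is SIGINT, the interrupt signal. |
| 255 | The job exited with -1. Only the low 8 bits of the exit status are reported, so -1 shows up as 255. |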
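The table can be read as a small lookup routine. The following is a minimal sketch, not part of LSF: the function name `classify_exit_code` and the category strings are made up for illustration, and the value 128, which the table does not cover, is reported as unclassified.

```python
def classify_exit_code(code: int) -> str:
    """Map an exit code (0-255) to the category given in the table above."""
    if not 0 <= code <= 255:
        raise ValueError("exit codes are reported in the range 0-255")
    if code == 0:
        return "finished normally"
    if code <= 125:
        return "application-defined, see the application manual"
    if code == 126:
        return "no permission to execute the command"
    if code == 127:
        return "command not found"
    if code == 128:
        return "unclassified"  # assumption: the table says nothing about 128
    if code == 255:
        return "exited with -1"
    # Remaining values are 129..254: signal number = exit code - 128.
    return f"terminated by signal {code - 128}"


if __name__ == "__main__":
    for sample in (0, 1, 126, 127, 130, 255):
        print(sample, "->", classify_exit_code(sample))
```

The 255 case is checked before the signal case because the table lists it separately, even though 255 is also greater than 128.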

Example 1: exit code 255

Write a C program that exits with -1. Its source, as shown by cat /tmp/calibre.c:


#include <stdio.h>

int main(void) {
    printf("Hello world.\n");
    return -1;  /* the parent process sees this as exit code 255 */
}

After compiling, run it from the command line; the exit code is 255.

[lsfadmin@master tmp]$ gcc calibre.c -o calibre

[lsfadmin@master tmp]$ chmod +x calibre

[lsfadmin@master tmp]$ ./calibre

Hello world.

[lsfadmin@master tmp]$ echo $?

255
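The same truncation can be reproduced without a C compiler. This is a minimal sketch assuming a POSIX system such as Linux; the helper name `status_seen_by_parent` is hypothetical. It starts a child Python interpreter that exits with a given value and reports the status the parent receives.

```python
import subprocess
import sys


def status_seen_by_parent(exit_value: int) -> int:
    """Run a child process that exits with exit_value and return its reported status."""
    child = subprocess.run(
        [sys.executable, "-c", f"import sys; sys.exit({exit_value})"]
    )
    return child.returncode


if __name__ == "__main__":
    # On POSIX only the low 8 bits of the exit status reach the parent.
    print("child exited with -1, parent sees:", status_seen_by_parent(-1))
    print("-1 masked to 8 bits:", -1 & 0xFF)
```

Masking with 0xFF gives the same answer as the operating system does for any value passed to exit.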

Submit the same command to LSF:

[lsfadmin@master ~]$ bsub -I calibre

Job <1349> is submitted to default queue .

<>

<>

Hello world.

[lsfadmin@master ~]$ bjobs -UF 1349

Job <1349>, User , Project , Status , Queue , Interactive mode, Command , Share group charged

Sat May 21 21:48:27: Submitted from host , CWD <$HOME>;

Sat May 21 21:48:27: Started 1 Task(s) on Host(s) , Allocated 1 Slot(s) on Host(s) ;

Sat May 21 21:48:32: Exited with exit code 255. The CPU time used is 0.0 seconds.

Sat May 21 21:48:32: Completed .
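When checking many jobs, the exit code can be pulled out of the bjobs text instead of being read by eye. The sketch below assumes only the line format visible in the transcript above ("Exited with exit code N."); the function name `exit_code_from_bjobs` is hypothetical, and the function returns None when no such line is present.

```python
import re
from typing import Optional

# Pattern taken from the bjobs -UF output shown above.
EXIT_LINE = re.compile(r"Exited with exit code (\d+)\.")


def exit_code_from_bjobs(text: str) -> Optional[int]:
    """Return the exit code found in bjobs output, or None if there is none."""
    match = EXIT_LINE.search(text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    sample = (
        "Sat May 21 21:48:32: Exited with exit code 255. "
        "The CPU time used is 0.0 seconds."
    )
    print(exit_code_from_bjobs(sample))
```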

Example 2: exit code 127, command not found

Submit a nonexistent command to LSF:

[lsfadmin@master configdir]$ bsub -I pt_shell 
Job <1346> is submitted to default queue . 
<> 
<> 
/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/myjs.sh: line 17: pt_shell: command not found 
[lsfadmin@master configdir]$ bjobs -UF 1346 
Job <1346>, User , Project , Status , Queue , Interactive mode, Command , Share group charged  
Sat May 21 21:38:06: Submitted from host , CWD ; 
Sat May 21 21:38:06: Started 1 Task(s) on Host(s) , Allocated 1 Slot(s) on Host(s) ; 
Sat May 21 21:38:11: Exited with exit code 127. The CPU time used is 0.0 seconds. 
Sat May 21 21:38:11: Completed .
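The value 127 comes from the shell that tries to start the command, so it can be reproduced outside LSF. A minimal sketch, assuming a POSIX `sh` is on the PATH; the helper name `shell_status` and the command name used below are invented for the demonstration.

```python
import subprocess


def shell_status(command: str) -> int:
    """Run command through sh and return the exit status the shell reports."""
    result = subprocess.run(
        ["sh", "-c", command],
        stderr=subprocess.DEVNULL,  # hide the "not found" message
    )
    return result.returncode


if __name__ == "__main__":
    # Assumption: no program with this name exists on the PATH.
    print(shell_status("no_such_command_for_demo_12345"))
```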

Example 3: exit code 126, permission denied

Create a program as user lsfadmin and set its permissions so that only the owner can access it.

[lsfadmin@master /]$ ls -l /tmp/gen

-rwx------ 1 lsfadmin lsfadmin 34 Jul 4 18:06 /tmp/gen

[lsfadmin@openlava-master /]$ bsub -Ip -m master /tmp/gen

Job <208> is submitted to default queue .

<>

<>

Hello World!

Switch to user shugb and submit the same command to LSF.

[shugb@master ~]$ bsub -Ip -m master /tmp/gen

Job <209> is submitted to default queue .

<>

<>

/home/shugb/.lsbatch/1656929425.209: line 8: /tmp/gen: Permission denied

[shugb@master ~]$ bjobs -UF 209

Job <209>, User <shugb>, Project , Status , Queue , Interactive pseudo-terminal mode, Command , Share group charged shugb>

Mon Jul 4 18:10:25: Submitted from host , CWD <$HOME>, Specified Hosts ;

Mon Jul 4 18:10:25: Started 1 Task(s) on Host(s) , Allocated 1 Slot(s) on Host(s) ;

Mon Jul 4 18:10:31: Exited with exit code 126. The CPU time used is 0.0 seconds.

Mon Jul 4 18:10:31: Completed .
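Exit code 126 can also be reproduced locally. The sketch below does not switch users as the example above does; instead it assumes that a file with no execute bit at all triggers the same "Permission denied" failure when a POSIX `sh` tries to run it. The function name and the script contents are hypothetical.

```python
import os
import shlex
import subprocess
import tempfile


def status_of_non_executable_script() -> int:
    """Create a script without execute permission, try to run it, return the status."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "gen")
        with open(path, "w") as script:
            script.write("#!/bin/sh\necho 'Hello World!'\n")
        os.chmod(path, 0o600)  # owner may read and write; nobody may execute
        result = subprocess.run(
            ["sh", "-c", shlex.quote(path)],
            stderr=subprocess.DEVNULL,  # hide the "Permission denied" message
        )
        return result.returncode


if __name__ == "__main__":
    print(status_of_non_executable_script())
```

The file exists, so the shell does not report 127; it finds the command but is refused when executing it.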

Example 4: exit code 130, program interrupted

Submit a job to LSF:

[shugb@master ~]$ bsub -m cmp1 sleep 1000

Job <210> is submitted to default queue .

On the execution host, interrupt the program:

[root@cmp1 log]# ps -elf|grep sleep

0 S shuguan+ 97555 97553 0 80 0 - 27015 hrtime 18:17 ? 00:00:00 sleep 1000

0 S root 97572 1399 0 80 0 - 27014 hrtime 18:17 ? 00:00:00 sleep 60

0 S root 97578 94237 0 80 0 - 28204 pipe_w 18:17 pts/0 00:00:00 grep --color=auto sleep

[root@openlava-cmp1 log]# kill -2 97555

[root@openlava-cmp1 log]#

Check the job's exit code:

[shugb@master ~]$ bjobs -UF 210

Job <210>, User <shugb>, Project , Status , Queue , Command , Share group charged shugb>

Mon Jul 4 18:17:19: Submitted from host , CWD <$HOME>, Specified Hosts ;

Mon Jul 4 18:17:19: Started 1 Task(s) on Host(s) , Allocated 1 Slot(s) on Host(s) , Execution Home shugb>, Execution CWD shugb>;

Mon Jul 4 18:17:56: Exited with exit code 130. The CPU time used is 0.1 seconds.

Mon Jul 4 18:17:56: Completed .
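The arithmetic 130 - 128 = 2 can be turned into a signal name with Python's standard `signal` module. This is a minimal sketch for the platform it runs on; both function names are hypothetical. The second helper covers a related convention: Python's `subprocess` reports a signal-terminated child as a negative return code, which maps to the shell-style value 128 + signal number.

```python
import signal


def signal_name_from_exit_code(code: int) -> str:
    """Translate an exit code above 128 into the name of the signal it encodes."""
    if code <= 128:
        raise ValueError("exit codes up to 128 do not encode a signal")
    # signal.Signals raises ValueError if the number is not a signal on this platform.
    return signal.Signals(code - 128).name


def shell_style_status(returncode: int) -> int:
    """Convert a subprocess return code (-N for signal N) to the 128 + N convention."""
    return 128 - returncode if returncode < 0 else returncode


if __name__ == "__main__":
    print(signal_name_from_exit_code(130))
    print(shell_style_status(-2))
```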

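Example 5: job terminated by a user or administrator through an LSF command

When a user or administrator terminates a job with an LSF command, the job information shows, in addition to the exit code, a termination reason such as TERM_OWNER or TERM_ADMIN.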

[shugb@master ~]$ bsub -m openlava-cmp1 sleep 1000

Job <211> is submitted to default queue .

[shugb@master ~]$ bkill 211

Job <211> is being terminated

[shugb@master ~]$ bjobs -UF 211

Job <211>, User <shugb>, Project , Status , Queue , Command , Share group charged shugb>

Mon Jul 4 18:24:12: Submitted from host , CWD <$HOME>, Specified Hosts ;

Mon Jul 4 18:24:13: Started 1 Task(s) on Host(s) , Allocated 1 Slot(s) on Host(s) , Execution Home shugb>, Execution CWD shugb>;

Mon Jul 4 18:24:22: Exited with exit code 130. The CPU time used is 0.0 seconds.

Mon Jul 4 18:24:22: Completed ; TERM_OWNER: job killed by owner.

Submit a job as user shugb, have the administrator lsfadmin terminate it with the LSF command bkill, then view the job information.

[shugb@master ~]$ bsub -m cmp1 sleep 1000

Job <212> is submitted to default queue .

[shugb@master ~]$ bjobs -UF 212

Job <212>, User <shugb>, Project , Status , Queue , Command , Share group charged shugb>

Mon Jul 4 18:27:42: Submitted from host , CWD <$HOME>, Specified Hosts ;

Mon Jul 4 18:27:43: Started 1 Task(s) on Host(s) , Allocated 1 Slot(s) on Host(s) , Execution Home shugb>, Execution CWD shugb>;

Mon Jul 4 18:28:30: Exited with exit code 130. The CPU time used is 0.1 seconds.

Mon Jul 4 18:28:30: Completed ; TERM_ADMIN: job killed by root or an administrator.
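Because exit code 130 alone cannot distinguish a stray SIGINT from a bkill, the TERM_ reason is the useful field to extract. The sketch below assumes only the line format seen in the two transcripts above ("TERM_XXX: description."); the function name `termination_reason` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "TERM_OWNER: job killed by owner." at the end of a line.
TERM_REASON = re.compile(r"(TERM_[A-Z_]+):\s*(.+?)\.?\s*$", re.MULTILINE)


def termination_reason(text: str) -> Optional[Tuple[str, str]]:
    """Return (reason code, description) from bjobs output, or None if absent."""
    match = TERM_REASON.search(text)
    return (match.group(1), match.group(2)) if match else None


if __name__ == "__main__":
    sample = "Mon Jul 4 18:24:22: Completed ; TERM_OWNER: job killed by owner."
    print(termination_reason(sample))
```

A job that exits on its own, as in Example 4, has no TERM_ entry in the transcript, so the function returns None for it.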
