Bug #1312
openTimeout script sometimes fails with "No child processes" message
0%
Description
Sometimes timeout script displays message like
ldv-timeout: WARNING: Wait(2819) failed: No child processes
and dies with code 0. This is a fatal error, the source of which is unknown to me. Quoting Eugene from bug #996, here's a sample:
aspectator: TRACE: ********* 3d aspectator stage stage beginning *********. aspectator: TRACE: LDV_STAGE=3 LDV_ASPECT_FILE=/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a/kernel-rules/./files/model0043a-blast.aspect LDV_DIR_ORIG=1 /home/joker/work/14_driver/test_ldv_tools/rule-instrumentor/aspectator/gcc-4.6/bin/compiler-core "-I/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/rule-instrumentor/43_1a/config" "-I/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/rule-instrumentor/43_1a/config//generated" "-I/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/rule-instrumentor/43_1a/config/include//generated" "-Wp,-MD,drivers/usb/misc/.emi62.o.d" "-nostdinc" "-isystem" "/usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.5.2/include" "-I/home/joker/work/14_driver/43_1a/inst/current/envs/linux-2.6.37/linux-2.6.37/arch/x86/include" "-Iinclude" "-include" "include/generated/autoconf.h" "-D__KERNEL__" "-Wall" "-Wundef" "-Wstrict-prototypes" "-Wno-trigraphs" "-fno-strict-aliasing" "-fno-common" "-Werror-implicit-function-declaration" "-Wno-format-security" "-fno-delete-null-pointer-checks" "-m64" "-mtune=generic" "-mno-red-zone" "-mcmodel=kernel" "-funit-at-a-time" "-maccumulate-outgoing-args" "-fstack-protector" "-DCONFIG_AS_CFI=1" "-DCONFIG_AS_CFI_SIGNAL_FRAME=1" "-DCONFIG_AS_FXSAVEQ=1" "-pipe" "-Wno-sign-compare" "-fno-asynchronous-unwind-tables" "-mno-sse" "-mno-mmx" "-mno-sse2" "-mno-3dnow" "-fno-omit-frame-pointer" "-fno-optimize-sibling-calls" "-g" "-pg" "-Wdeclaration-after-statement" "-Wno-pointer-sign" "-fno-strict-overflow" "-DDEBUG" "-fprofile-arcs" "-ftest-coverage" "-DMODULE" "-DKBUILD_STR(s)=#s" "-DKBUILD_BASENAME=KBUILD_STR(emi62)" "-DKBUILD_MODNAME=KBUILD_STR(emi62)" "-c" "-I/home/joker/work/14_driver/43_1a/inst/current/envs/linux-2.6.37/linux-2.6.37/drivers/usb/misc" "-DLDV_MAIN0_sequence_infinite_withcheck_stateful" "-Wno-attributes" "-g" "-U__GNUC__" "-D__GNUC__=4" "-U__GNUC_MINOR__" "-D__GNUC_MINOR__=2" "-U__GNUC_PATCHLEVEL__" "-D__GNUC_PATCHLEVEL__=1" "-include" "/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a/kernel-rules/./files/config-tracers.h" "-I/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a/kernel-rules/files/" -fsyntax-only -x cpp-output /home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a/drivers/usb/misc/emi62.c.p.ma. ldv-timeout: WARNING: Wait(2819) failed: No child processes Something wrong with aspectator: it doesn't produce file '/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a/drivers/usb/misc/emi62.c.c' at /home/joker/work/14_driver/test_ldv_tools/ldv-core/../bin/rule-instrumentor.pl line 1530. rule-instrumentor: DEBUG: The command execution status is '0'. rule-instrumentor: TRACE: Redirect STDERR to its default place'. rule-instrumentor: TRACE: Read failure description. rule-instrumentor: TRACE: The command execution full description is 'Advice was already applied. rule-instrumentor: TRACE: Create cpp definition "spin_lock_nested(lock,subclass)=ldv_spin_lock()". rule-instrumentor: TRACE: Advice was already applied. rule-instrumentor: TRACE: Create cpp definition "spin_lock_nest_lock(lock,nest_lock)=ldv_spin_lock()". rule-instrumentor: TRACE: Advice was already applied. rule-instrumentor: TRACE: Create cpp definition "spin_lock_irqsave(lock,flags)=ldv_spin_lock()". rule-instrumentor: TRACE: Advice was already applied. rule-instrumentor: TRACE: Create cpp definition "spin_lock_irqsave_nested(lock,flags,subclass)=ldv_spin_lock()". rule-instrumentor: TRACE: Advice was already applied. rule-instrumentor: TRACE: Create cpp definition "spin_trylock_irqsave(lock,flags)=ldv_spin_trylock()". rule-instrumentor: TRACE: Advice was already applied. rule-instrumentor: TRACE: Create cpp definition "atomic_dec_and_lock(atomic,lock)=ldv_spin_trylock()". rule-instrumentor: TRACE: /home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a/drivers/usb/misc/emi62.c.p: In function 'ldv_main0_sequence_infinite_withcheck_stateful': rule-instrumentor: TRACE: /home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a/drivers/usb/misc/emi62.c.p:453:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement] rule-instrumentor: TRACE: /home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a/drivers/usb/misc/emi62.c.p: At top level: rule-instrumentor: TRACE: /home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a/drivers/usb/misc/emi62.c.p:565:1: fatal error: character "�" wasn't put to stream rule-instrumentor: TRACE: compilation terminated. rule-instrumentor: TRACE: Aspectator fails on the 3d stage rule-instrumentor: TRACE: '. rule-instrumentor: TRACE: Save the information obtained to cache (even if it's a failure). rule-instrumentor: TRACE: Go to the initial directory. dscv: WARNING: Exception occured: INTEGRATION ERROR. Cmdline: /home/joker/work/14_driver/test_ldv_tools/ldv-core/../bin/rule-instrumentor.pl --basedir=/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a --rule-model=43_1a --cmdfile=/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/cmd_after_deg.xml --cmdfile-out=/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/cmdfiles/cmd43_1a.xml at /home/joker/work/14_driver/test_ldv_tools/ldv-core/../bin/dscv line 628. dscv: WARNING: Fatal error. Stopping services before reporting... dscv: INFO: Shutting down watcher watcher: DEBUG: Called watcher: /home/joker/work/14_driver/test_ldv_tools/ldv-core/../watcher/ldv-watcher fail dscv ldv 1 dscv 2044 watcher: TRACE: Checking if exists /home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/watcher/instance_pool... watcher: TRACE: Checking if exists /home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/watcher/key_pool...
Such behavior confuses our infrastructure, but it also confuses me, as I don't known for sure why it fails. We should fix this issue in one of our critical scripts.
Updated by Evgeny Novikov over 11 years ago
- Status changed from New to Open
- Assignee changed from Pavel Shved to Vitaly Mordan
New Resource Manager shouldn't have such issues.
Updated by Pavel Shved over 11 years ago
FYI, it was a race condition between waitpid and alarm. If wait4 syscall reaps the child process, but alarm is delivered before we cancel the timer with `ualarm 0`, we lose the return code, and think that waitpid was interrupted, and that child is still there. On the next iteration waitpid surprises us with the news: there's no child anymore!
Adding `ualarm 0` mitigated this, but there will always be a narrow window between the actual wait4 syscall and the actual setitimer ("alarm") syscall if you use Perl to wrap them. The proper solution is to handle SIGCHLD asynchronously. OR use cgroups / ResourceManager, whatever it is :-)
Updated by Evgeny Novikov over 11 years ago
Thank you for explanation!
Resource Manager is a new implementation of timeout script. Resource Manager is written in C and it is based on control groups. Indeed, there is the same trouble in it, since control groups don't offer to set up any time limits (it just can set up a memory limit). So, we set up a timer (that kills a child if a time limit is exceeded) and simultaneously wait for finishing this child. This timer is disabled just when wait returns. I don't know exactly whether it is possible in C that wait can be interrupted by a signal when it is processing a child exit status, but we shouldn't rely on this anyway and thus should track wait return code.
Updated by Pavel Shved over 11 years ago
If wait is interrupted by a signal, it will return EINTR rather than ECHILD. In any case, the SIGALRM can arrive between waitpid and setitimer even if you don't wrap their calls in any libraries (be it Perl or glibc)--like, between their two lines of code :-). You should handle SIGCHLD and read the status code from there with waitpid(...WNOHANG...). Work within the assumption that the SIGALRM does not mean that the child is still alive.
AFAIK, Control groups can set a limit on total CPU consumption, which is, like, CPU time. Not sure about the wall time.