Project

General

Profile

Actions

Bug #1312

open

Timeout script sometimes fails with "No child processes" message

Added by Pavel Shved almost 13 years ago. Updated over 10 years ago.

Status:
Open
Priority:
Normal
Assignee:
Category:
Infrastructure
Start date:
06/02/2011
Due date:
% Done:

0%

Estimated time:
Detected in build:
svn
Platform:
Published in build:

Description

Sometimes timeout script displays message like

ldv-timeout: WARNING: Wait(2819) failed: No child processes          

and dies with code 0. This is a fatal error, the source of which is unknown to me. Quoting Eugene from bug #996, here's a sample:

aspectator: TRACE: ********* 3d aspectator stage stage beginning *********.                                                   
aspectator: TRACE: LDV_STAGE=3 LDV_ASPECT_FILE=/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a/kernel-rules/./files/model0043a-blast.aspect LDV_DIR_ORIG=1 /home/joker/work/14_driver/test_ldv_tools/rule-instrumentor/aspectator/gcc-4.6/bin/compiler-core "-I/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/rule-instrumentor/43_1a/config" "-I/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/rule-instrumentor/43_1a/config//generated" "-I/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/rule-instrumentor/43_1a/config/include//generated" "-Wp,-MD,drivers/usb/misc/.emi62.o.d" "-nostdinc" "-isystem" "/usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.5.2/include" "-I/home/joker/work/14_driver/43_1a/inst/current/envs/linux-2.6.37/linux-2.6.37/arch/x86/include" "-Iinclude" "-include" "include/generated/autoconf.h" "-D__KERNEL__" "-Wall" "-Wundef" "-Wstrict-prototypes" "-Wno-trigraphs" "-fno-strict-aliasing" "-fno-common" "-Werror-implicit-function-declaration" "-Wno-format-security" "-fno-delete-null-pointer-checks" "-m64" "-mtune=generic" "-mno-red-zone" "-mcmodel=kernel" "-funit-at-a-time" "-maccumulate-outgoing-args" "-fstack-protector" "-DCONFIG_AS_CFI=1" "-DCONFIG_AS_CFI_SIGNAL_FRAME=1" "-DCONFIG_AS_FXSAVEQ=1" "-pipe" "-Wno-sign-compare" "-fno-asynchronous-unwind-tables" "-mno-sse" "-mno-mmx" "-mno-sse2" "-mno-3dnow" "-fno-omit-frame-pointer" "-fno-optimize-sibling-calls" "-g" "-pg" "-Wdeclaration-after-statement" "-Wno-pointer-sign" "-fno-strict-overflow" "-DDEBUG" "-fprofile-arcs" "-ftest-coverage" "-DMODULE" "-DKBUILD_STR(s)=#s" "-DKBUILD_BASENAME=KBUILD_STR(emi62)" "-DKBUILD_MODNAME=KBUILD_STR(emi62)" "-c" "-I/home/joker/work/14_driver/43_1a/inst/current/envs/linux-2.6.37/linux-2.6.37/drivers/usb/misc" "-DLDV_MAIN0_sequence_infinite_withcheck_stateful" "-Wno-attributes" "-g" "-U__GNUC__" "-D__GNUC__=4" "-U__GNUC_MINOR__" "-D__GNUC_MINOR__=2" "-U__GNUC_PATCHLEVEL__" "-D__GNUC_PATCHLEVEL__=1" "-include" "/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a/kernel-rules/./files/config-tracers.h" "-I/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a/kernel-rules/files/" -fsyntax-only -x cpp-output /home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a/drivers/usb/misc/emi62.c.p.ma.                               
ldv-timeout: WARNING: Wait(2819) failed: No child processes                                                                   
Something wrong with aspectator: it doesn't produce file '/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a/drivers/usb/misc/emi62.c.c' at /home/joker/work/14_driver/test_ldv_tools/ldv-core/../bin/rule-instrumentor.pl line 1530.                                                 
rule-instrumentor: DEBUG: The command execution status is '0'.                                                                
rule-instrumentor: TRACE: Redirect STDERR to its default place'.                                                              
rule-instrumentor: TRACE: Read failure description.
rule-instrumentor: TRACE: The command execution full description is 'Advice was already applied.
rule-instrumentor: TRACE:  Create cpp definition "spin_lock_nested(lock,subclass)=ldv_spin_lock()".
rule-instrumentor: TRACE:  Advice was already applied.
rule-instrumentor: TRACE:  Create cpp definition "spin_lock_nest_lock(lock,nest_lock)=ldv_spin_lock()".
rule-instrumentor: TRACE:  Advice was already applied.
rule-instrumentor: TRACE:  Create cpp definition "spin_lock_irqsave(lock,flags)=ldv_spin_lock()".
rule-instrumentor: TRACE:  Advice was already applied.
rule-instrumentor: TRACE:  Create cpp definition "spin_lock_irqsave_nested(lock,flags,subclass)=ldv_spin_lock()".
rule-instrumentor: TRACE:  Advice was already applied.
rule-instrumentor: TRACE:  Create cpp definition "spin_trylock_irqsave(lock,flags)=ldv_spin_trylock()".
rule-instrumentor: TRACE:  Advice was already applied.
rule-instrumentor: TRACE:  Create cpp definition "atomic_dec_and_lock(atomic,lock)=ldv_spin_trylock()".
rule-instrumentor: TRACE:  /home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a/drivers/usb/misc/emi62.c.p: In function 'ldv_main0_sequence_infinite_withcheck_stateful':
rule-instrumentor: TRACE:  /home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a/drivers/usb/misc/emi62.c.p:453:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
rule-instrumentor: TRACE:  /home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a/drivers/usb/misc/emi62.c.p: At top level:
rule-instrumentor: TRACE:  /home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a/drivers/usb/misc/emi62.c.p:565:1: fatal error: character "�" wasn't put to stream
rule-instrumentor: TRACE:  compilation terminated.
rule-instrumentor: TRACE:  Aspectator fails on the 3d stage
rule-instrumentor: TRACE: '.
rule-instrumentor: TRACE: Save the information obtained to cache (even if it's a failure).
rule-instrumentor: TRACE: Go to the initial directory.
dscv: WARNING: Exception occured: INTEGRATION ERROR.  Cmdline:  /home/joker/work/14_driver/test_ldv_tools/ldv-core/../bin/rule-instrumentor.pl --basedir=/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/ri/43_1a --rule-model=43_1a --cmdfile=/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/cmd_after_deg.xml --cmdfile-out=/home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/csd_deg_dscv/2044/dscv_tempdir/dscv/cmdfiles/cmd43_1a.xml at /home/joker/work/14_driver/test_ldv_tools/ldv-core/../bin/dscv line 628.
dscv: WARNING: Fatal error.  Stopping services before reporting...
dscv: INFO: Shutting down watcher
watcher: DEBUG: Called watcher: /home/joker/work/14_driver/test_ldv_tools/ldv-core/../watcher/ldv-watcher fail dscv ldv 1 dscv 2044
watcher: TRACE: Checking if exists /home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/watcher/instance_pool...
watcher: TRACE: Checking if exists /home/joker/work/14_driver/43_1a/work/current--X--drivers/--X--defaultlinux-2.6.37--X--43_1a/linux-2.6.37/watcher/key_pool...

Such behavior confuses our infrastructure, but it also confuses me, as I don't known for sure why it fails. We should fix this issue in one of our critical scripts.

Actions #1

Updated by Evgeny Novikov over 10 years ago

  • Status changed from New to Open
  • Assignee changed from Pavel Shved to Vitaly Mordan

New Resource Manager shouldn't have such issues.

Actions #2

Updated by Pavel Shved over 10 years ago

FYI, it was a race condition between waitpid and alarm. If wait4 syscall reaps the child process, but alarm is delivered before we cancel the timer with `ualarm 0`, we lose the return code, and think that waitpid was interrupted, and that child is still there. On the next iteration waitpid surprises us with the news: there's no child anymore!

Adding `ualarm 0` mitigated this, but there will always be a narrow window between the actual wait4 syscall and the actual setitimer ("alarm") syscall if you use Perl to wrap them. The proper solution is to handle SIGCHLD asynchronously. OR use cgroups / ResourceManager, whatever it is :-)

Actions #3

Updated by Evgeny Novikov over 10 years ago

Thank you for explanation!

Resource Manager is a new implementation of timeout script. Resource Manager is written in C and it is based on control groups. Indeed, there is the same trouble in it, since control groups don't offer to set up any time limits (it just can set up a memory limit). So, we set up a timer (that kills a child if a time limit is exceeded) and simultaneously wait for finishing this child. This timer is disabled just when wait returns. I don't know exactly whether it is possible in C that wait can be interrupted by a signal when it is processing a child exit status, but we shouldn't rely on this anyway and thus should track wait return code.

Actions #4

Updated by Pavel Shved over 10 years ago

If wait is interrupted by a signal, it will return EINTR rather than ECHILD. In any case, the SIGALRM can arrive between waitpid and setitimer even if you don't wrap their calls in any libraries (be it Perl or glibc)--like, between their two lines of code :-). You should handle SIGCHLD and read the status code from there with waitpid(...WNOHANG...). Work within the assumption that the SIGALRM does not mean that the child is still alive.

AFAIK, Control groups can set a limit on total CPU consumption, which is, like, CPU time. Not sure about the wall time.

Actions

Also available in: Atom PDF