Bug #8460
closedKlever stops solving tasks
0%
Description
Klever successfully solves one task and starts preparing the next one. It creates task directory, puts client.json and nothing more. Scheduler considers there is a task processing. There are no errors found in the logs, but, having cil file it doesnt not start CPAchecker during an hour. Interesting fact: it happens, when I set options 'keep working directory' to true. May be, it does not matter.
A part of logs:
2017-09-27 12:24:01,057 (__init__.py:222) root DEBUG> Add new PENDING task 295 2017-09-27 12:24:01,057 (__init__.py:247) root DEBUG> Prepare new task 295 before launching 2017-09-27 12:24:01,058 (native.py:324) root INFO> Going to prepare execution of the task 295 2017-09-27 12:24:01,058 (resource_scheduler.py:343) root DEBUG> Check the resource limits of pending task 'drivers/hid/hid.ko/linux:fs:char_dev' 2017-09-27 12:24:01,058 (native.py:562) root DEBUG> Create working directory tasks/295 2017-09-27 12:24:01,064 (__init__.py:378) root DEBUG> Update information about available verification tools 2017-09-27 12:24:01,064 (__init__.py:388) root DEBUG> Update information about connected nodes 2017-09-27 12:24:01,064 (resource_scheduler.py:074) root DEBUG> Try to receive information about resources from controller 2017-09-27 12:24:01,067 (connectionpool.py:208) requests.packages.urllib3.connectionpool INFO> Starting new HTTP connection (1): localhost 2017-09-27 12:24:01,068 (connectionpool.py:388) requests.packages.urllib3.connectionpool DEBUG> "GET /v1/catalog/nodes HTTP/1.1" 200 202 2017-09-27 12:24:01,069 (adapters.py:078) consulate.adapters DEBUG> GET http://localhost:8500/v1/kv/states/wilson 2017-09-27 12:24:01,071 (connectionpool.py:208) requests.packages.urllib3.connectionpool INFO> Starting new HTTP connection (1): localhost 2017-09-27 12:24:01,073 (connectionpool.py:388) requests.packages.urllib3.connectionpool DEBUG> "GET /v1/kv/states/wilson HTTP/1.1" 200 701 2017-09-27 12:24:01,073 (adapters.py:078) consulate.adapters DEBUG> GET http://localhost:8500/v1/kv/states/wilson 2017-09-27 12:24:01,076 (connectionpool.py:388) requests.packages.urllib3.connectionpool DEBUG> "GET /v1/kv/states/wilson HTTP/1.1" 200 701 2017-09-27 12:24:01,076 (resource_scheduler.py:534) root INFO> Going to check that deadlock are impossible 2017-09-27 12:24:01,077 (resource_scheduler.py:206) root DEBUG> Do not submit information about the workload to Bridge as nothing has been changed 2017-09-27 12:24:01,077 (__init__.py:396) root INFO> Start scheduling new tasks 2017-09-27 12:24:01,077 (__init__.py:406) root INFO> Going to start 1 new tasks and 0 jobs 2017-09-27 12:24:01,078 (__init__.py:718) root INFO> Try to start task 295 2017-09-27 12:24:01,078 (native.py:209) root DEBUG> Start solution of task '295' 2017-09-27 12:24:01,078 (native.py:534) root DEBUG> Future task '295': Going to start a new process which will start native scheduler client 2017-09-27 12:24:01,078 (__init__.py:441) root DEBUG> Flush submitted tasks and jobs if necessary 2017-09-27 12:24:01,079 (__init__.py:446) root DEBUG> Scheduler iteration has finished 2017-09-27 12:24:01,080 (native.py:536) root DEBUG> Future task '295': get pid of the started process. 2017-09-27 12:24:01,080 (native.py:538) root DEBUG> Future task '295': the pid is 2908. 2017-09-27 12:24:01,079 (__init__.py:633) root INFO> Increase data exchange period from 5s to 10s 2017-09-27 12:24:06,084 (__init__.py:128) root INFO> Start scheduling iteration with statuses exchange with the server 2017-09-27 12:24:06,085 (__init__.py:145) root INFO> Scheduler has 0 pending, 1 processing, 0 finished and 0 error tasks 2017-09-27 12:24:06,085 (__init__.py:160) root INFO> Scheduler has 0 pending, 1 processing, 0 finished and 0 error jobs 2017-09-27 12:24:06,085 (__init__.py:163) root DEBUG> Add task 0 error descriptions 2017-09-27 12:24:06,085 (__init__.py:170) root DEBUG> Add job 0 error descriptions 2017-09-27 12:24:06,085 (__init__.py:623) root INFO> Skip the next data exchange iteration with Bridge 2017-09-27 12:24:06,086 (__init__.py:378) root DEBUG> Update information about available verification tools 2017-09-27 12:24:06,086 (__init__.py:388) root DEBUG> Update information about connected nodes 2017-09-27 12:24:06,087 (resource_scheduler.py:074) root DEBUG> Try to receive information about resources from controller 2017-09-27 12:24:06,090 (connectionpool.py:208) requests.packages.urllib3.connectionpool INFO> Starting new HTTP connection (1): localhost 2017-09-27 12:24:06,092 (connectionpool.py:388) requests.packages.urllib3.connectionpool DEBUG> "GET /v1/catalog/nodes HTTP/1.1" 200 202 2017-09-27 12:24:06,093 (adapters.py:078) consulate.adapters DEBUG> GET http://localhost:8500/v1/kv/states/wilson 2017-09-27 12:24:06,096 (connectionpool.py:208) requests.packages.urllib3.connectionpool INFO> Starting new HTTP connection (1): localhost 2017-09-27 12:24:06,097 (connectionpool.py:388) requests.packages.urllib3.connectionpool DEBUG> "GET /v1/kv/states/wilson HTTP/1.1" 200 701 2017-09-27 12:24:06,098 (adapters.py:078) consulate.adapters DEBUG> GET http://localhost:8500/v1/kv/states/wilson 2017-09-27 12:24:06,102 (connectionpool.py:388) requests.packages.urllib3.connectionpool DEBUG> "GET /v1/kv/states/wilson HTTP/1.1" 200 701 2017-09-27 12:24:06,102 (resource_scheduler.py:534) root INFO> Going to check that deadlock are impossible 2017-09-27 12:24:06,103 (resource_scheduler.py:201) root INFO> Submit information about the workload to Bridge 2017-09-27 12:24:06,104 (bridge.py:076) root DEBUG> Send "POST" request to "http://localhost:8998/service/update_nodes/" 2017-09-27 12:24:06,107 (connectionpool.py:243) requests.packages.urllib3.connectionpool INFO> Resetting dropped connection: localhost 2017-09-27 12:24:06,208 (connectionpool.py:388) requests.packages.urllib3.connectionpool DEBUG> "POST /service/update_nodes/ HTTP/1.1" 200 None 2017-09-27 12:24:06,209 (__init__.py:396) root INFO> Start scheduling new tasks 2017-09-27 12:24:06,209 (__init__.py:406) root INFO> Going to start 0 new tasks and 0 jobs 2017-09-27 12:24:06,209 (__init__.py:441) root DEBUG> Flush submitted tasks and jobs if necessary 2017-09-27 12:24:06,210 (__init__.py:446) root DEBUG> Scheduler iteration has finished 2017-09-27 12:24:11,213 (__init__.py:128) root INFO> Start scheduling iteration with statuses exchange with the server 2017-09-27 12:24:11,213 (__init__.py:145) root INFO> Scheduler has 0 pending, 1 processing, 0 finished and 0 error tasks 2017-09-27 12:24:11,213 (__init__.py:160) root INFO> Scheduler has 0 pending, 1 processing, 0 finished and 0 error jobs 2017-09-27 12:24:11,213 (__init__.py:163) root DEBUG> Add task 0 error descriptions 2017-09-27 12:24:11,213 (__init__.py:170) root DEBUG> Add job 0 error descriptions
Updated by Evgeny Novikov over 7 years ago
- Category set to Scheduling
- Priority changed from Normal to Immediate
- Target version set to 0.2
This can be a crucial or/and a regression bug.
Updated by Pavel Andrianov over 7 years ago
With 'keep working directory' disabled Klever fails at the third task with 'Corrupted'. I do not know why, because there are no logs without the option. Without any changes in configs I relaunched it. Then it solved 11 tasks and again stopped responding. Seems, the bug appears independently from the option status.
Updated by Evgeny Novikov over 7 years ago
- Priority changed from Immediate to Urgent
- Target version changed from 0.2 to 1.0
It looks like such the strange behavior is a consequence when Klever Core is terminated because of out of memory. So, this can be really treated as a crucial bug.
Updated by Ilja Zakharov over 7 years ago
- Status changed from New to Resolved
Implemented in 8460-tasks-stopping fix. However, I cannot guarantee that it is enough since it is a non-deterministic bug.
Updated by Evgeny Novikov over 7 years ago
- Target version changed from 1.0 to 0.2
Let's include this bug fix into Klever 0.2 and see whether it helps.
Updated by Evgeny Novikov over 7 years ago
- Status changed from Resolved to Closed
I merged it to master in ea39165e.