Project

General

Profile

Bug #8460

Klever stops solving tasks

Added by Pavel Andrianov 3 months ago. Updated about 2 months ago.

Status:
Closed
Priority:
Urgent
Assignee:
Category:
Scheduling
Target version:
Start date:
09/27/2017
Due date:
% Done:

0%

Estimated time:
Detected in build:
svn
Platform:
Published in build:

Description

Klever successfully solves one task and starts preparing the next one. It creates task directory, puts client.json and nothing more. Scheduler considers there is a task processing. There are no errors found in the logs, but, having cil file it doesnt not start CPAchecker during an hour. Interesting fact: it happens, when I set options 'keep working directory' to true. May be, it does not matter.

A part of logs:

2017-09-27 12:24:01,057 (__init__.py:222) root DEBUG> Add new PENDING task 295
2017-09-27 12:24:01,057 (__init__.py:247) root DEBUG> Prepare new task 295 before launching
2017-09-27 12:24:01,058 (native.py:324) root  INFO> Going to prepare execution of the task 295
2017-09-27 12:24:01,058 (resource_scheduler.py:343) root DEBUG> Check the resource limits of pending task 'drivers/hid/hid.ko/linux:fs:char_dev'
2017-09-27 12:24:01,058 (native.py:562) root DEBUG> Create working directory tasks/295
2017-09-27 12:24:01,064 (__init__.py:378) root DEBUG> Update information about available verification tools
2017-09-27 12:24:01,064 (__init__.py:388) root DEBUG> Update information about connected nodes
2017-09-27 12:24:01,064 (resource_scheduler.py:074) root DEBUG> Try to receive information about resources from controller
2017-09-27 12:24:01,067 (connectionpool.py:208) requests.packages.urllib3.connectionpool  INFO> Starting new HTTP connection (1): localhost
2017-09-27 12:24:01,068 (connectionpool.py:388) requests.packages.urllib3.connectionpool DEBUG> "GET /v1/catalog/nodes HTTP/1.1" 200 202
2017-09-27 12:24:01,069 (adapters.py:078) consulate.adapters DEBUG> GET http://localhost:8500/v1/kv/states/wilson
2017-09-27 12:24:01,071 (connectionpool.py:208) requests.packages.urllib3.connectionpool  INFO> Starting new HTTP connection (1): localhost
2017-09-27 12:24:01,073 (connectionpool.py:388) requests.packages.urllib3.connectionpool DEBUG> "GET /v1/kv/states/wilson HTTP/1.1" 200 701
2017-09-27 12:24:01,073 (adapters.py:078) consulate.adapters DEBUG> GET http://localhost:8500/v1/kv/states/wilson
2017-09-27 12:24:01,076 (connectionpool.py:388) requests.packages.urllib3.connectionpool DEBUG> "GET /v1/kv/states/wilson HTTP/1.1" 200 701
2017-09-27 12:24:01,076 (resource_scheduler.py:534) root  INFO> Going to check that deadlock are impossible
2017-09-27 12:24:01,077 (resource_scheduler.py:206) root DEBUG> Do not submit information about the workload to Bridge as nothing has been changed
2017-09-27 12:24:01,077 (__init__.py:396) root  INFO> Start scheduling new tasks
2017-09-27 12:24:01,077 (__init__.py:406) root  INFO> Going to start 1 new tasks and 0 jobs
2017-09-27 12:24:01,078 (__init__.py:718) root  INFO> Try to start task 295
2017-09-27 12:24:01,078 (native.py:209) root DEBUG> Start solution of task '295'
2017-09-27 12:24:01,078 (native.py:534) root DEBUG> Future task '295': Going to start a new process which will start native scheduler client
2017-09-27 12:24:01,078 (__init__.py:441) root DEBUG> Flush submitted tasks and jobs if necessary
2017-09-27 12:24:01,079 (__init__.py:446) root DEBUG> Scheduler iteration has finished
2017-09-27 12:24:01,080 (native.py:536) root DEBUG> Future task '295': get pid of the started process.
2017-09-27 12:24:01,080 (native.py:538) root DEBUG> Future task '295': the pid is 2908.
2017-09-27 12:24:01,079 (__init__.py:633) root  INFO> Increase data exchange period from 5s to 10s
2017-09-27 12:24:06,084 (__init__.py:128) root  INFO> Start scheduling iteration with statuses exchange with the server
2017-09-27 12:24:06,085 (__init__.py:145) root  INFO> Scheduler has 0 pending, 1 processing, 0 finished and 0 error tasks
2017-09-27 12:24:06,085 (__init__.py:160) root  INFO> Scheduler has 0 pending, 1 processing, 0 finished and 0 error jobs
2017-09-27 12:24:06,085 (__init__.py:163) root DEBUG> Add task 0 error descriptions
2017-09-27 12:24:06,085 (__init__.py:170) root DEBUG> Add job 0 error descriptions
2017-09-27 12:24:06,085 (__init__.py:623) root  INFO> Skip the next data exchange iteration with Bridge
2017-09-27 12:24:06,086 (__init__.py:378) root DEBUG> Update information about available verification tools
2017-09-27 12:24:06,086 (__init__.py:388) root DEBUG> Update information about connected nodes
2017-09-27 12:24:06,087 (resource_scheduler.py:074) root DEBUG> Try to receive information about resources from controller
2017-09-27 12:24:06,090 (connectionpool.py:208) requests.packages.urllib3.connectionpool  INFO> Starting new HTTP connection (1): localhost
2017-09-27 12:24:06,092 (connectionpool.py:388) requests.packages.urllib3.connectionpool DEBUG> "GET /v1/catalog/nodes HTTP/1.1" 200 202
2017-09-27 12:24:06,093 (adapters.py:078) consulate.adapters DEBUG> GET http://localhost:8500/v1/kv/states/wilson
2017-09-27 12:24:06,096 (connectionpool.py:208) requests.packages.urllib3.connectionpool  INFO> Starting new HTTP connection (1): localhost
2017-09-27 12:24:06,097 (connectionpool.py:388) requests.packages.urllib3.connectionpool DEBUG> "GET /v1/kv/states/wilson HTTP/1.1" 200 701
2017-09-27 12:24:06,098 (adapters.py:078) consulate.adapters DEBUG> GET http://localhost:8500/v1/kv/states/wilson
2017-09-27 12:24:06,102 (connectionpool.py:388) requests.packages.urllib3.connectionpool DEBUG> "GET /v1/kv/states/wilson HTTP/1.1" 200 701
2017-09-27 12:24:06,102 (resource_scheduler.py:534) root  INFO> Going to check that deadlock are impossible
2017-09-27 12:24:06,103 (resource_scheduler.py:201) root  INFO> Submit information about the workload to Bridge
2017-09-27 12:24:06,104 (bridge.py:076) root DEBUG> Send "POST" request to "http://localhost:8998/service/update_nodes/" 
2017-09-27 12:24:06,107 (connectionpool.py:243) requests.packages.urllib3.connectionpool  INFO> Resetting dropped connection: localhost
2017-09-27 12:24:06,208 (connectionpool.py:388) requests.packages.urllib3.connectionpool DEBUG> "POST /service/update_nodes/ HTTP/1.1" 200 None
2017-09-27 12:24:06,209 (__init__.py:396) root  INFO> Start scheduling new tasks
2017-09-27 12:24:06,209 (__init__.py:406) root  INFO> Going to start 0 new tasks and 0 jobs
2017-09-27 12:24:06,209 (__init__.py:441) root DEBUG> Flush submitted tasks and jobs if necessary
2017-09-27 12:24:06,210 (__init__.py:446) root DEBUG> Scheduler iteration has finished
2017-09-27 12:24:11,213 (__init__.py:128) root  INFO> Start scheduling iteration with statuses exchange with the server
2017-09-27 12:24:11,213 (__init__.py:145) root  INFO> Scheduler has 0 pending, 1 processing, 0 finished and 0 error tasks
2017-09-27 12:24:11,213 (__init__.py:160) root  INFO> Scheduler has 0 pending, 1 processing, 0 finished and 0 error jobs
2017-09-27 12:24:11,213 (__init__.py:163) root DEBUG> Add task 0 error descriptions
2017-09-27 12:24:11,213 (__init__.py:170) root DEBUG> Add job 0 error descriptions


Related issues

Related to Klever - Feature #8466: Improve error messages when Native Scheduler terminates jobs because of reached computational resource limits Closed2017-09-29

Related to Klever - Feature #8467: Reduce the default number of task generator workersClosed2017-09-29

History

#1 Updated by Evgeny Novikov 3 months ago

  • Category set to Scheduling
  • Priority changed from Normal to Immediate
  • Target version set to 0.2

This can be a crucial or/and a regression bug.

#2 Updated by Pavel Andrianov 2 months ago

With 'keep working directory' disabled Klever fails at the third task with 'Corrupted'. I do not know why, because there are no logs without the option. Without any changes in configs I relaunched it. Then it solved 11 tasks and again stopped responding. Seems, the bug appears independently from the option status.

#3 Updated by Evgeny Novikov 2 months ago

  • Priority changed from Immediate to Urgent
  • Target version changed from 0.2 to 0.3

It looks like such the strange behavior is a consequence when Klever Core is terminated because of out of memory. So, this can be really treated as a crucial bug.

#4 Updated by Ilja Zakharov 2 months ago

  • Status changed from New to Resolved

Implemented in 8460-tasks-stopping fix. However, I cannot guarantee that it is enough since it is a non-deterministic bug.

#5 Updated by Evgeny Novikov about 2 months ago

  • Target version changed from 0.3 to 0.2

Let's include this bug fix into Klever 0.2 and see whether it helps.

#6 Updated by Evgeny Novikov about 2 months ago

  • Status changed from Resolved to Closed

I merged it to master in commit:ea39165e.

Also available in: Atom PDF