Feature #8286

closed

Clarify what resource limits are exceeded

Added by Evgeny Novikov over 7 years ago. Updated over 3 years ago.

Status: Closed
Priority: High
Assignee:
Category: Scheduling
Target version:
Start date: 07/11/2017
Due date:
% Done: 0%
Estimated time:
Published in build:

Description

At the moment it is unclear which resource limits are too high, since Native Scheduler does not say which ones are exceeded:

Given resource limits are two high, we do not have such amount of resources

Besides, it may also be unclear when there is not enough disk space for solving verification jobs. The error reported in this case is "Execution of job 0dec5cb3-4d64-4cf8-bc44-37e25ebc2d1f terminated with an exception: Exited with exit code: 1".


Related issues 1 (0 open, 1 closed)

Has duplicate: Klever - Feature #10610: Clarify resource limits error (Rejected, 12/03/2020)

Actions #1

Updated by Evgeny Novikov about 7 years ago

  • Priority changed from Urgent to High

This issue does not have such a high priority.

Actions #2

Updated by Evgeny Novikov almost 4 years ago

Actions #4

Updated by Evgeny Novikov almost 4 years ago

  • Target version set to 3.1

Let's do this in Klever 3.1.

Actions #5

Updated by Evgeny Novikov almost 4 years ago

  • Description updated

Actions #6

Updated by Evgeny Novikov over 3 years ago

  • Target version changed from 3.1 to 3.2

We need to release Klever 3.1 faster due to an incompatibility with Clade 3.3+ and a new OpenStack cloud.

Actions #7

Updated by Evgeny Novikov over 3 years ago

Pavel found that the same issue also exists when there is not enough disk space to solve verification tasks. In this case there is an RP Unknown with something like this:

Raise exception:
Traceback (most recent call last):
  File "/home/novikov/work/klever/klever/core/components.py", line 395, in run
    self.main()
  File "/home/novikov/work/klever/klever/core/components.py", line 304, in callbacks_caller
    ret = attr(*args, **kwargs)
  File "/home/novikov/work/klever/klever/core/vrp/__init__.py", line 315, in fetcher
    raise RuntimeError('Failed to decide verification task: {0}'.format(self.task_error))
RuntimeError: Failed to decide verification task: Task failed 4214: SchedulerException('Execution of task 4214 terminated with an exception: Exited with exit code: 1')

The clarification can be found only in the scheduler log:
2021-03-26 12:42:27,975 SchedulerClient  INFO> Going to solve a verification task with identifier 4214
2021-03-26 12:42:27,975 SchedulerClient  INFO> Create session for user "service" at Klever Bridge "localhost:8998" 
Reached disk memory limit of 10000B, killing process 13801
root  INFO> Submit information about the workload to Bridge
13802: Cancelling process 13801
13802: Cancellation of 13801 is successfull, exiting
2021-03-26 12:42:29,102 SchedulerClient WARNING> Traceback (most recent call last):
  File "/home/novikov/work/klever/klever/scheduler/client/__init__.py", line 105, in run_benchexec
    exit_code = solve(logger, conf, mode, srv)
  File "/home/novikov/work/klever/klever/scheduler/client/__init__.py", line 136, in solve
    return solve_task(logger, conf, srv)
  File "/home/novikov/work/klever/klever/scheduler/client/__init__.py", line 175, in solve_task
    exit_code = run(logger, args, conf, logger=logger)
  File "/home/novikov/work/klever/klever/scheduler/client/__init__.py", line 358, in run
    ec = execute(args, logger=logger, disk_limitation=dl, disk_checking_period=dcp)
  File "/home/novikov/work/klever/klever/scheduler/utils/__init__.py", line 390, in execute
    raise RuntimeError("Disk space limitation of {}B is exceeded".format(disk_limitation))
RuntimeError: Disk space limitation of 10000B is exceeded
2021-03-26 12:42:29,103 SchedulerClient  INFO> Exiting with exit code 1
root WARNING> Cannot obtain key 'solutions/Klever/4214' from key-value storage: KeyError('Key not found (solutions/Klever/4214)')
root  INFO> Going to check execution of the task 4214
root  INFO> Future processor of task 4214 returned 1
root WARNING> Exited with exit code: 1
root WARNING> Task failed 4214: SchedulerException('Execution of task 4214 terminated with an exception: Exited with exit code: 1')
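
One way to surface the real cause is to carry the underlying reason along with the exit code when the failure message is built. The following is only a minimal sketch under that assumption: `describe_task_failure` and its `reason` parameter are hypothetical names for illustration and are not taken from the Klever code base.

```python
def describe_task_failure(task_id, exit_code, reason=None):
    """Build a failure message that keeps the underlying reason, if known.

    Hypothetical helper: the name and signature are illustrative only.
    """
    message = (
        f"Execution of task {task_id} terminated with an exception: "
        f"Exited with exit code: {exit_code}"
    )
    if reason:
        # Append the specific cause so that it is not lost in the scheduler log.
        message += f" ({reason})"
    return message


print(describe_task_failure(4214, 1, "Disk space limitation of 10000B is exceeded"))
```

With the reason attached, the message that reaches the user would already mention the disk space limitation instead of only the exit code.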

Actions #8

Updated by Ilja Zakharov over 3 years ago

  • Status changed from New to Resolved

Implemented in detailed-scheduler-error.

Actions #9

Updated by Evgeny Novikov over 3 years ago

  • Status changed from Resolved to Open

It's great that users will at last quickly understand what is wrong with their resource limits. I tried to add more details to the provided error messages and unexpectedly found a bug. You first check that the resource limits for a job do not exceed all available computational resources (that is, you ignore the demands of verification tasks), and then you check that the resource limits for tasks do not exceed the remaining computational resources.

For instance, Klever can use 32.49 GB of RAM. When I specify 33 GB for Klever Core, I get the following error message: "Given resource limits for job and tasks are too high: you can use 32.49GB of memory or less in total while current demand is 33GB". When I specify 32 GB for Klever Core, I get the following error message: "Given resource limits for job and tasks are too high: you can use 0.49GB of memory or less in total while current demand is 5GB". In the second case 0.49GB is what remains after subtracting the 32GB of the job, and 5GB is the demand of tasks alone. Both the logic and the current error messages should be fixed.

My suggestion is to show the available computational resources as well as the computational resources for jobs and tasks, both separately and together, e.g. "Given resource limits for job and tasks are too high: you can use 32.49GB of memory or less in total while current demand is 32GB for the job and 5GB for tasks". Of course, you should take this into account during calculations as well.
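
The suggested check can be sketched as follows. This is a minimal illustration, assuming that the job and its tasks must fit into the available memory together. The function name `check_memory_limits` and its parameters are hypothetical and do not come from the scheduler code.

```python
def check_memory_limits(available_gb, job_gb, tasks_gb):
    """Return an error message if the job and tasks do not fit together, else None.

    Hypothetical helper: assumes the combined demand is compared against
    all available memory in a single step.
    """
    if job_gb + tasks_gb > available_gb:
        return (
            "Given resource limits for job and tasks are too high: "
            f"you can use {available_gb:g}GB of memory or less in total "
            f"while current demand is {job_gb:g}GB for the job "
            f"and {tasks_gb:g}GB for tasks"
        )
    return None


print(check_memory_limits(32.49, 32, 5))
```

Comparing the combined demand in one step avoids reporting the leftover 0.49GB as the total that can be used, and the message names both parts of the demand.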

Actions #10

Updated by Ilja Zakharov over 3 years ago

  • Status changed from Open to Resolved

Fixed in detailed-scheduler-error.

Actions #11

Updated by Evgeny Novikov over 3 years ago

  • Status changed from Resolved to Closed

Both the logic and the error messages are correct after the fix, so I merged the branch into master in 3598a0b14.
