Feature #8286
Clarify what resource limits are exceeded (Closed)
Description
At the moment it is unclear which resource limits are too high, since Native Scheduler does not specify this:
Given resource limits are two high, we do not have such amount of resources
Besides, it may also be unclear when there is not enough disk space to solve verification jobs (the corresponding error in this case is "Execution of job 0dec5cb3-4d64-4cf8-bc44-37e25ebc2d1f terminated with an exception: Exited with exit code: 1").
Updated by Evgeny Novikov over 7 years ago
- Priority changed from Urgent to High
This issue does not have such a high priority.
Updated by Evgeny Novikov about 4 years ago
- Has duplicate Feature #10610: Clarify resource limits error added
Updated by Evgeny Novikov about 4 years ago
Also, see https://forge.ispras.ru/issues/10610#note-1.
Updated by Evgeny Novikov almost 4 years ago
- Target version changed from 3.1 to 3.2
We need to release Klever 3.1 sooner due to an incompatibility with Clade 3.3+ and a new OpenStack cloud.
Updated by Evgeny Novikov over 3 years ago
Pavel revealed that the same issue also exists when there is not enough disk space to solve verification tasks. In this case there is an RP Unknown with something like the following:
Raise exception: Traceback (most recent call last):
  File "/home/novikov/work/klever/klever/core/components.py", line 395, in run
    self.main()
  File "/home/novikov/work/klever/klever/core/components.py", line 304, in callbacks_caller
    ret = attr(*args, **kwargs)
  File "/home/novikov/work/klever/klever/core/vrp/__init__.py", line 315, in fetcher
    raise RuntimeError('Failed to decide verification task: {0}'.format(self.task_error))
RuntimeError: Failed to decide verification task: Task failed 4214: SchedulerException('Execution of task 4214 terminated with an exception: Exited with exit code: 1')
The actual reason can be found only in the scheduler log:
2021-03-26 12:42:27,975 SchedulerClient INFO> Going to solve a verification task with identifier 4214
2021-03-26 12:42:27,975 SchedulerClient INFO> Create session for user "service" at Klever Bridge "localhost:8998"
Reached disk memory limit of 10000B, killing process 13801
root INFO> Submit information about the workload to Bridge
13802: Cancelling process 13801
13802: Cancellation of 13801 is successfull, exiting
2021-03-26 12:42:29,102 SchedulerClient WARNING> Traceback (most recent call last):
  File "/home/novikov/work/klever/klever/scheduler/client/__init__.py", line 105, in run_benchexec
    exit_code = solve(logger, conf, mode, srv)
  File "/home/novikov/work/klever/klever/scheduler/client/__init__.py", line 136, in solve
    return solve_task(logger, conf, srv)
  File "/home/novikov/work/klever/klever/scheduler/client/__init__.py", line 175, in solve_task
    exit_code = run(logger, args, conf, logger=logger)
  File "/home/novikov/work/klever/klever/scheduler/client/__init__.py", line 358, in run
    ec = execute(args, logger=logger, disk_limitation=dl, disk_checking_period=dcp)
  File "/home/novikov/work/klever/klever/scheduler/utils/__init__.py", line 390, in execute
    raise RuntimeError("Disk space limitation of {}B is exceeded".format(disk_limitation))
RuntimeError: Disk space limitation of 10000B is exceeded
2021-03-26 12:42:29,103 SchedulerClient INFO> Exiting with exit code 1
root WARNING> Cannot obtain key 'solutions/Klever/4214' from key-value storage: KeyError('Key not found (solutions/Klever/4214)')
root INFO> Going to check execution of the task 4214
root INFO> Future processor of task 4214 returned 1
root WARNING> Exited with exit code: 1
root WARNING> Task failed 4214: SchedulerException('Execution of task 4214 terminated with an exception: Exited with exit code: 1')
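The log suggests that the scheduler client polls disk usage while a task runs and kills the process once the limit is exceeded, but only the bare exit code reaches the RP Unknown. Below is a minimal sketch of that idea, assuming usage is measured as the total size of files in the task's working directory. `directory_size` and `run_with_disk_limit` are hypothetical names, not Klever's `execute`. Unlike the message in the log, the raised error also reports the measured usage, which is the kind of detail that would have to be propagated to the user.

```python
import os
import subprocess
import time


def directory_size(path):
    """Return the total size in bytes of regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                # Do not follow symlinks: they may point outside the directory.
                if not os.path.islink(file_path):
                    total += os.path.getsize(file_path)
            except OSError:
                # The file may disappear while the directory is being scanned.
                pass
    return total


def run_with_disk_limit(args, work_dir, disk_limitation, disk_checking_period=1.0):
    """Run args in work_dir and kill the process if work_dir exceeds disk_limitation bytes.

    Usage is sampled every disk_checking_period seconds, so growth between
    two samples is noticed only at the next sample.
    """
    proc = subprocess.Popen(args, cwd=work_dir)
    while proc.poll() is None:
        used = directory_size(work_dir)
        if used > disk_limitation:
            proc.kill()
            proc.wait()
            # Name the exceeded resource, the limit and the measured usage so that
            # the reason can be shown instead of a bare "Exited with exit code: 1".
            raise RuntimeError(
                "Disk space limitation of {}B is exceeded: {}B are used in {}".format(
                    disk_limitation, used, work_dir))
        time.sleep(disk_checking_period)
    return proc.returncode
```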
Updated by Ilja Zakharov over 3 years ago
- Status changed from New to Resolved
Implemented in detailed-scheduler-error.
Updated by Evgeny Novikov over 3 years ago
- Status changed from Resolved to Open
It's great that users will at last quickly understand what is wrong with their resource limits. I tried to add more details to the provided error messages and unexpectedly revealed a bug. You first check that the resource limits for a job do not exceed all available computational resources (that is, you ignore the demands of verification tasks), and then you check that the resource limits for tasks do not exceed the remaining computational resources.
For instance, Klever can use 32.49 GB of RAM. When I specify 33 GB for Klever Core, I get the following error message: "Given resource limits for job and tasks are too high: you can use 32.49GB of memory or less in total while current demand is 33GB". When I specify 32 GB for Klever Core, I get the following error message: "Given resource limits for job and tasks are too high: you can use 0.49GB of memory or less in total while current demand is 5GB". In the first case the reported demand ignores the 5GB requested for tasks, and in the second case 0.49GB is what remains after the job's 32GB, not the total. Both the logic and the current error messages should be fixed.
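The two-stage check described above can be reproduced with a small sketch. This is a hypothetical reconstruction from the description, not the code from detailed-scheduler-error: `check_limits_buggy` and its parameters are illustrative names, and only memory (in GB) is modelled. With the numbers from the example it yields exactly the two messages quoted above.

```python
def check_limits_buggy(available, job_demand, tasks_demand):
    """Hypothetical reconstruction of the flawed two-stage check (memory in GB)."""
    # Stage 1: the job alone is compared against everything; tasks are ignored.
    if job_demand > available:
        raise ValueError(
            "Given resource limits for job and tasks are too high: you can use "
            f"{available:g}GB of memory or less in total while current demand is {job_demand:g}GB")
    # Stage 2: tasks are compared against the remainder, yet the message calls it "in total".
    remaining = available - job_demand
    if tasks_demand > remaining:
        raise ValueError(
            "Given resource limits for job and tasks are too high: you can use "
            f"{remaining:g}GB of memory or less in total while current demand is {tasks_demand:g}GB")
```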
My suggestion is to show the available computational resources as well as the resources demanded by the job and by tasks, both separately and together, e.g. "Given resource limits for job and tasks are too high: you can use 32.49GB of memory or less in total while current demand is 32GB for the job and 5GB for tasks". Of course, you should take this into account in the calculations as well.
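One way to implement this suggestion is to compare the combined demand of the job and tasks against the available resources in a single check and report all three numbers. A sketch assuming memory limits in GB; `check_limits` and its `resource`/`unit` parameters are hypothetical names, not the actual fix merged to master.

```python
def check_limits(available, job_demand, tasks_demand, resource="memory", unit="GB"):
    """Raise if the job and tasks together demand more than is available."""
    total = job_demand + tasks_demand
    if total > available:
        # Report the total capacity and both demands so the user sees what to lower.
        raise ValueError(
            "Given resource limits for job and tasks are too high: you can use "
            f"{available:g}{unit} of {resource} or less in total while current demand is "
            f"{job_demand:g}{unit} for the job and {tasks_demand:g}{unit} for tasks")
```

Checking the sum once avoids reporting a remainder as if it were the total, and listing both demands tells the user whether to lower the job limit or the task limits.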
Updated by Ilja Zakharov over 3 years ago
- Status changed from Open to Resolved
Fixed in detailed-scheduler-error.
Updated by Evgeny Novikov over 3 years ago
- Status changed from Resolved to Closed
I find that both the logic and the error messages are correct after the fix, so I merged the branch into master in 3598a0b14.