Bug #7707

Several instances of NativeScheduler concurrently decide jobs

Added by Evgeny Novikov over 7 years ago. Updated over 7 years ago.

Status:
New
Priority:
High
Assignee:
Category:
Scheduling
Target version:
-
Start date:
11/10/2016
Due date:
% Done:

0%

Estimated time:
Detected in build:
svn
Platform:
Published in build:

Description

I observed strange issues: Core couldn't find the locking file "is solving", Core couldn't decide a job since it didn't have an appropriate status, and everything suddenly and unexpectedly crashed. It turned out that the reason was that two instances of NativeScheduler using the same Bridge were operating simultaneously on the same computer.

Actions #1

Updated by Ilja Zakharov over 7 years ago

What are you suggesting? This is known and predictable behaviour, and yes, it leads to various bugs. I can add a check that prevents a second scheduler from running in the same working directory by means of a lock file, but that does not fix the issue completely. I would strongly avoid checking running processes in the system: Controller already does that, but it is not designed for killing processes.
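
Such a check could look roughly as follows. This is only a sketch, assuming the scheduler takes an exclusive flock on a file in its working directory at start-up; the file name and function name are hypothetical and not the actual Klever code, and the check obviously only catches a second instance that uses the same working directory.

    import errno
    import fcntl
    import os
    import sys

    def acquire_scheduler_lock(work_dir):
        # Take an exclusive, non-blocking lock on a file in the scheduler
        # working directory and keep the returned file object alive for the
        # whole scheduler lifetime (the lock is released when the process exits).
        lock_file = open(os.path.join(work_dir, 'native-scheduler.lock'), 'w')
        try:
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as err:
            if err.errno in (errno.EACCES, errno.EAGAIN):
                sys.exit('Another NativeScheduler already runs in "{0}"'.format(work_dir))
            raise
        return lock_file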

I would suggest the following solution: Controller can detect that this issue is happening in the system and stop reporting the status of the Native scheduler as HEALTHY. In this case the user will not be able to start any solution until they fix the problem.
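
Roughly the idea is the following (just a sketch, the function and its argument are made up; only the status names come from the existing behaviour):

    def native_scheduler_status(conflicting_instances_detected):
        # Never report HEALTHY while several NativeScheduler instances
        # operate on the same Bridge; Bridge will then refuse to start
        # new solutions with this scheduler.
        return 'AILING' if conflicting_instances_detected else 'HEALTHY'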

Actions #2

Updated by Evgeny Novikov over 7 years ago

Ilja Zakharov wrote:

What are you suggesting? This is known and predictable behaviour, and yes, it leads to various bugs. I can add a check that prevents a second scheduler from running in the same working directory by means of a lock file, but that does not fix the issue completely. I would strongly avoid checking running processes in the system: Controller already does that, but it is not designed for killing processes.

I agree that an incomplete solution shouldn't be adopted in this case. Also, I didn't ask to fix the issue automatically, e.g. by killing some processes. For instance, in my case the first instance of NativeScheduler was still alive because I had killed its parent PyCharm but hadn't killed NativeScheduler after that.

I would suggest the following solution: Controller can detect that this issue is happening in the system and stop reporting the status of the Native scheduler as HEALTHY. In this case the user will not be able to start any solution until they fix the problem.

This would be a more predictable behaviour, but how will the user find out the reasons for a scheduler's bad status? I don't remember ever encountering the AILING status of a scheduler, so I would have no idea what to do in that case. This is almost the same as now, when I had to debug Bridge to see that two requests for deciding the same job came in.

Actions #3

Updated by Ilja Zakharov over 7 years ago

This would be a more predictable behaviour, but how will the user find out the reasons for a scheduler's bad status? I don't remember ever encountering the AILING status of a scheduler, so I would have no idea what to do in that case. This is almost the same as now, when I had to debug Bridge to see that two requests for deciding the same job came in.

I propose to update the format of messages from Controller to Bridge and to add error messages to them somehow. The messages should be shown together with the scheduler statuses on the same page.
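
For example, the status report could carry an optional list of errors next to the status itself (the field names below are only a sketch, not the current Controller to Bridge format):

    import json

    status_report = {
        'native scheduler': {
            'status': 'AILING',
            'errors': [
                'Several NativeScheduler instances operate on the same Bridge',
            ],
        },
    }
    # Bridge could render the "errors" list right next to the scheduler
    # status on the schedulers page.
    print(json.dumps(status_report, indent=4))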

Actions #4

Updated by Evgeny Novikov over 7 years ago

I agree with this, since otherwise one will be redirected to some logs. But #6542 is still a more important issue since it happens much more often.
