Bug #7707

Several instances of NativeScheduler concurrently decide jobs

Added by Evgeny Novikov over 7 years ago. Updated over 7 years ago.

Status:
New
Priority:
High
Assignee:
Category:
Scheduling
Target version:
-
Start date:
11/10/2016
Due date:
% Done:

0%

Estimated time:
Detected in build:
svn
Platform:
Published in build:

Description

I observed strange issues: Core couldn't find the locking file "is solving", Core couldn't decide a job since it didn't have an appropriate status, and everything suddenly and unexpectedly crashed. It turned out that the reason was that two instances of NativeScheduler using the same Bridge were operating simultaneously on the same computer.

Actions #1

Updated by Ilja Zakharov over 7 years ago

What are you suggesting? This is known and predictable behaviour, and yes, it leads to various bugs. I can add a check that prevents a second scheduler from running in the same working directory by means of a lock file, but that does not fix the issue completely. I would strongly avoid checking running processes in the system: Controller already does that, but it is not designed for killing processes.
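
Such a check could look roughly as follows. This is only a sketch, assuming the scheduler takes an exclusive flock on a file in its working directory at start-up; the file name and function name are hypothetical and not the actual Klever code, and the check obviously only catches a second instance that uses the same working directory.

    import errno
    import fcntl
    import os
    import sys

    def acquire_scheduler_lock(work_dir):
        # Take an exclusive, non-blocking lock on a file in the scheduler
        # working directory and keep the returned file object alive for the
        # whole scheduler lifetime (the lock is released when the process exits).
        lock_file = open(os.path.join(work_dir, 'native-scheduler.lock'), 'w')
        try:
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as err:
            if err.errno in (errno.EACCES, errno.EAGAIN):
                sys.exit('Another NativeScheduler already runs in "{0}"'.format(work_dir))
            raise
        return lock_file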

I would suggest the following solution: Controller can detect that this issue is happening in the system and stop reporting the status of the Native scheduler as HEALTHY. In this case the user will not be able to start any solution until they fix the problem.
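
Roughly the idea is the following (just a sketch, the function and its argument are made up; only the status names come from the existing behaviour):

    def native_scheduler_status(conflicting_instances_detected):
        # Never report HEALTHY while several NativeScheduler instances
        # operate on the same Bridge; Bridge will then refuse to start
        # new solutions with this scheduler.
        return 'AILING' if conflicting_instances_detected else 'HEALTHY'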

Actions #2

Updated by Evgeny Novikov over 7 years ago

Ilja Zakharov wrote:

What are you suggesting? This is known and predictable behaviour, and yes, it leads to various bugs. I can add a check that prevents a second scheduler from running in the same working directory by means of a lock file, but that does not fix the issue completely. I would strongly avoid checking running processes in the system: Controller already does that, but it is not designed for killing processes.

I agree that an incomplete solution shouldn't be adopted in this case. Also, I didn't ask to fix the issue automatically, e.g. by killing some processes. For instance, in my case the first instance of NativeScheduler was still alive because I had killed its parent PyCharm but hadn't killed NativeScheduler after that.

I would suggest the following solution: Controller can detect that this issue is happening in the system and stop reporting the status of the Native scheduler as HEALTHY. In this case the user will not be able to start any solution until they fix the problem.

This would be a more predictable behaviour, but how will the user find out the reasons for a scheduler's bad status? I don't remember ever encountering the AILING status of a scheduler, so I would have no idea what to do in that case. This is almost the same as now, when I had to debug Bridge to see that two requests for deciding the same job came in.

Actions #3

Updated by Ilja Zakharov over 7 years ago

This would be a more predictable behaviour, but how will the user find out the reasons for a scheduler's bad status? I don't remember ever encountering the AILING status of a scheduler, so I would have no idea what to do in that case. This is almost the same as now, when I had to debug Bridge to see that two requests for deciding the same job came in.

I propose to update the format of messages from Controller to Bridge and to add error messages to them somehow. The messages should be shown together with the scheduler statuses on the same page.
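
For example, the status report could carry an optional list of errors next to the status itself (the field names below are only a sketch, not the current Controller to Bridge format):

    import json

    status_report = {
        'native scheduler': {
            'status': 'AILING',
            'errors': [
                'Several NativeScheduler instances operate on the same Bridge',
            ],
        },
    }
    # Bridge could render the "errors" list right next to the scheduler
    # status on the schedulers page.
    print(json.dumps(status_report, indent=4))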

Actions #4

Updated by Evgeny Novikov over 7 years ago

I agree with this, since otherwise one will be redirected to some logs. But #6542 is still a more important issue since it happens much more often.
