Bug #7707
open
Several instances of NativeScheduler concurrently decide jobs
0%
Description
I observed strange issues: Core couldn't find the locking file "is solving", Core couldn't decide a job because the job didn't have an appropriate status, and everything suddenly and unexpectedly crashed. It turned out that the reason was two instances of NativeScheduler using the same Bridge and running simultaneously on the same computer.
Updated by Ilja Zakharov about 8 years ago
What are you suggesting? This is known and predictable behaviour, and yes, it leads to various bugs. I can add checks that prevent a second scheduler from running in the same working directory by means of a lock file, but that does not fix the issue completely. I would strongly avoid checking running processes in the system, since the controller already does that, but it is not designed for killing processes.
I would suggest the following solution. The Controller can detect this issue happening in the system, and it should not report the status of the Native scheduler as HEALTHY. In this case the user will not be able to start any solution until they fix the problem.
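The lock-file check mentioned above could look roughly like the sketch below. It is only an illustration under assumptions: the function name, the exception class and the lock file name `scheduler.lock` are hypothetical and not taken from the scheduler's code, and `fcntl.flock` is available on Unix-like systems only.

```python
import fcntl
import os


class AlreadyRunningError(RuntimeError):
    """Another process already holds the lock for this working directory."""


def acquire_scheduler_lock(work_dir, lock_name="scheduler.lock"):
    """Take an exclusive, non-blocking lock on a file inside work_dir.

    Returns the open file object. The caller must keep it open for as long
    as the scheduler runs. The kernel releases the lock when the process
    exits, even after a crash, so a leftover file does not block a restart.
    """
    os.makedirs(work_dir, exist_ok=True)
    lock_file = open(os.path.join(work_dir, lock_name), "a+")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        raise AlreadyRunningError(
            "another scheduler is already running in " + work_dir
        )
    # Record the owner's PID to make manual diagnosis easier.
    lock_file.truncate(0)
    lock_file.write(str(os.getpid()))
    lock_file.flush()
    return lock_file
```

As noted in the comment, such a check only covers schedulers that share a working directory. Two schedulers started in different directories but connected to the same Bridge would still not be detected.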
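A minimal sketch of the proposed status rule follows. It assumes the Controller already has the command lines of running processes at hand; the function name and the `native-scheduler` marker string are hypothetical, and only the HEALTHY and AILING statuses mentioned in this thread are used.

```python
def native_scheduler_status(process_cmdlines, marker="native-scheduler"):
    """Return (status, reason) for the Native scheduler.

    process_cmdlines is a list of command-line strings of running processes.
    Exactly one matching process is reported as HEALTHY; zero or several
    matching processes are reported as AILING with a human-readable reason.
    """
    instances = [cmd for cmd in process_cmdlines if marker in cmd]
    if len(instances) == 1:
        return "HEALTHY", None
    if not instances:
        return "AILING", "no NativeScheduler instance is running"
    return (
        "AILING",
        "%d NativeScheduler instances are running concurrently" % len(instances),
    )
```

The function only reports the problem and leaves cleanup to the user, in line with the point that the controller is not designed for killing processes.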
Updated by Evgeny Novikov about 8 years ago
Ilja Zakharov wrote:
> What are you suggesting? This is known and predictable behaviour, and yes, it leads to various bugs. I can add checks that prevent a second scheduler from running in the same working directory by means of a lock file, but that does not fix the issue completely. I would strongly avoid checking running processes in the system, since the controller already does that, but it is not designed for killing processes.
I agree that incomplete solutions shouldn't be implemented in this case. Also, I didn't ask for the issue to be fixed automatically, e.g. by killing some processes. For instance, in my case the first instance of NativeScheduler was still alive because I had killed its parent PyCharm but hadn't killed NativeScheduler after that.
> I would suggest the following solution. The Controller can detect this issue happening in the system, and it should not report the status of the Native scheduler as HEALTHY. In this case the user will not be able to start any solution until they fix the problem.
This would be more predictable behaviour, but how will the user find the reasons for the schedulers' bad statuses? I don't remember ever encountering the AILING status for a scheduler, so I have no idea what to do in this case. This is almost the same as now, when I had to debug Bridge to see that 2 requests for deciding the same job arrive.
Updated by Ilja Zakharov about 8 years ago
> This would be more predictable behaviour, but how will the user find the reasons for the schedulers' bad statuses? I don't remember ever encountering the AILING status for a scheduler, so I have no idea what to do in this case. This is almost the same as now, when I had to debug Bridge to see that 2 requests for deciding the same job arrive.
I propose updating the format of messages from Controller to Bridge so that they carry error messages in some form. These messages should be shown together with the scheduler statuses on the same page.
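One possible shape for such a message is sketched below. The actual Controller-to-Bridge format is not described in this thread, so the field names `schedulers`, `name`, `status` and `errors` are purely hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SchedulerStatus:
    name: str
    status: str
    # Human-readable explanations to show next to the status.
    errors: List[str] = field(default_factory=list)


def build_status_message(statuses):
    """Serialize scheduler statuses, including error messages, to JSON."""
    return json.dumps(
        {"schedulers": [asdict(s) for s in statuses]}, sort_keys=True
    )


message = build_status_message([
    SchedulerStatus(
        name="Klever",
        status="AILING",
        errors=["2 NativeScheduler instances are running concurrently"],
    ),
])
```

Keeping the errors next to the status in a single message would let the status page show both without sending the user to the logs.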
Updated by Evgeny Novikov about 8 years ago
I agree with this, since otherwise one would be redirected to some logs. But #6542 is still a more important issue, since it happens much more often.