Bug #7854
closedRemove busy waiting from VTG strategies
0%
Description
Any VTG strategy (including the basic strategy) can be presented in the following way:
1. prepare verification task;
2. wait for verifier to solve task;
3. process result of solving verification task.
Currently step 2 is implemented as busy waiting:
while True:
task_status = session.get_task_status(task_id)
# processes status, break if task has been solved
time.sleep(1)
Function get_task_status
, which is called each second, sends POST request to Bridge. Note, that usually there are several threads, which use this busy waiting simultaneously (for example, in "classic" full launches there are 16 such threads). Even if corresponding tasks have "PENDING" status (were not started to solving), VTG strategy is in busy waiting.
This all leads to excessive waste of resources in VTG strategy component (such as SBT).
Here are some experiment results, which demonstrate VTG strategy component CPU time.
1. All Linux kernel modules, MAV.
Current implementation - 33 000 seconds
Increasing waiting time up to 100 (less busy waiting) - 23 000 seconds (~40% less resources).
Completely removing busy waiting will further reduce wasting of resources.
2. Master, basic strategy, any module, which runs into timeout (for example, drivers/ata/libata.ko
).
Increasing waiting time up to timeout (no busy waiting at all) - 11 seconds (2 times less).
Current implementation - 20 seconds.
Reducing waiting time up to 0.001 (more busy waiting) - 139 seconds.
Removing busy waiting may require some more functionality from Bridge.
Ideally this issue should be implemented in accordance with #7272 and #7800.
Note, that this problem comes from old VTG strategy ABKM, which was erroneously taken as a basis for all current VTG strategies.
Updated by Ilja Zakharov about 8 years ago
First, I strongly propose to separate this functionality from any strategy and separate it into a library that can be used in strategies. Than it can be seamlessly improved.
Updated by Evgeny Novikov about 8 years ago
Definitely this looks like a redundant load both on server (Bridge) and client (Core/VTG/Strategy), but this allows to process verification results almost immediately after they are produced by task workers, thus this reduces total wall time on average.
Simple increase of timeouts up to some constant value can lead to increase of total wall time especially if this constant will be large. This will be extremely notable when we have many simple tasks which are prepared and decided quickly (e.g. tests) while the number of waiting Strategy processes is not so huge (keep in mind that each such process also needs some resources, so we shouldn't increase their number infinitely).
So I suggest to use some advanced scheme of timeouts increase. For instance, each next iteration we can wait for a minimum of a period already elapsed and, say, 30 seconds. Then the worst delay will be 30 seconds but for simple tasks it will be just 15 seconds on average (actually much less since the most of simple tasks are decided in several seconds - and the worst delay will be the same).
Of course this should be done in a separate library as suggested by Ilja.
Updated by Vitaly Mordan about 8 years ago
Evgeny Novikov wrote:
So I suggest to use some advanced scheme of timeouts increase. For instance, each next iteration we can wait for a minimum of a period already elapsed and, say, 30 seconds. Then the worst delay will be 30 seconds but for simple tasks it will be just 15 seconds on average (actually much less since the most of simple tasks are decided in several seconds - and the worst delay will be the same).
This will not resolve the problem (busy waiting), just improves some specific cases (and may lead to degradation in general case).
Ideally busy waiting should be completely removed, which may require extension of Bridge (for example, Bridge sends report to the strategy once task has been solved).
Of course this should be done in a separate library as suggested by Ilja.
This functionality (receiving result of solving verification task) should be unified for all strategies (whether in a separate library as it was done with processing of error traces or in common strategy).
Updated by Evgeny Novikov about 8 years ago
Vitaly Mordan wrote:
Evgeny Novikov wrote:
So I suggest to use some advanced scheme of timeouts increase. For instance, each next iteration we can wait for a minimum of a period already elapsed and, say, 30 seconds. Then the worst delay will be 30 seconds but for simple tasks it will be just 15 seconds on average (actually much less since the most of simple tasks are decided in several seconds - and the worst delay will be the same).
This will not resolve the problem (busy waiting), just improves some specific cases (and may lead to degradation in general case).
Ideally busy waiting should be completely removed, which may require extension of Bridge (for example, Bridge sends report to the strategy once task has been solved).
My solution will make it almost invisible (see my explanations for details) and can be very easily implemented (in 30 minutes, or in 1 hour if move the maximum delay period to job decision configuraiton).
Your solution is very hard to be implemented since each Strategy process should start a separate server, Bridge should know about all of them (what server waits for what task) and trigger corresponding requests. Indeed this can result in even more resource consumption. Possible issues and absence of developers time further prevent this direction.
Updated by Alexey Khoroshilov about 8 years ago
May be task queues should be managed using something like RethinkDB or CouchDB?
Or maybe PostgreSQL notifications would be enough?
Updated by Vitaly Mordan about 8 years ago
Evgeny Novikov wrote:
My solution will make it almost invisible (see my explanations for details) and can be very easily implemented
It is not clear, how to choose worst delay and what are simple tasks. Suggested 30 or 15 seconds are too big for tests, but are also too little for big launches. It would not be easy to suggest a good automatic schema for such timeouts.
The easiest way (but not the best one) is to change this timeout manually (for example, in job description). We can also calculate some predefined values, for example:Your solution is very hard to be implemented
- ~10 seconds for tests or other simple jobs (checking one module). Since scheduler requires about 2-3 seconds to start solving job (which, probably, also overheads), and the easiest task requires a few seconds to be solved, timeout less than 5-6 seconds does not make any sense.
- ~100 seconds for big launches (may be bigger for strategies, which check several rules at once). In this case most of the tasks (~90%) will be solved without busy waiting. Note, that overall wall time will be slightly reduced in comparison with 1 second timeout.
Updated by Evgeny Novikov about 8 years ago
Alexey Khoroshilov wrote:
May be task queues should be managed using something like RethinkDB or CouchDB?
Or maybe PostgreSQL notifications would be enough?
This would be great if we will have ~3 devoper-weeks for accurate investigation, development, deployment in various environments, testing and documenting. Perhaps one day when we will have large extremely loaded clusters we will have to do this. Now we have quite small scales and a little developer efforts.
Updated by Evgeny Novikov about 8 years ago
Vitaly Mordan wrote:
Evgeny Novikov wrote:
My solution will make it almost invisible (see my explanations for details) and can be very easily implemented
It is not clear, how to choose worst delay and what are simple tasks. Suggested 30 or 15 seconds are too big for tests, but are also too little for big launches. It would not be easy to suggest a good automatic schema for such timeouts.
The worst delay at best should be specified when a job decision is started (30 more minutes for implementation). Or just like now it can be hard coded.
You didn't understand my notes about adjusted delays for tests. The most of tests are decided in a couple of seconds, thus the maximum delay will be just several seconds. Also I don't understand why it is bad to execute queries each 30 seconds for large tasks. For timeouts the worst delay will be just 30 / 15 / 60 = 1 / 30 ~ 3% of their decision time while there will be done just approximately 60 requests rather than ~900 like now.
The easiest way (but not the best one) is to change this timeout manually (for example, in job description). We can also calculate some predefined values, for example:Your solution is very hard to be implemented
- ~10 seconds for tests or other simple jobs (checking one module). Since scheduler requires about 2-3 seconds to start solving job (which, probably, also overheads), and the easiest task requires a few seconds to be solved, timeout less than 5-6 seconds does not make any sense.
- ~100 seconds for big launches (may be bigger for strategies, which check several rules at once). In this case most of the tasks (~90%) will be solved without busy waiting. Note, that overall wall time will be slightly reduced in comparison with 1 second timeout.
It is boring to specify this parameter manually each time. Any way it is unclear how to do this properly assuming its constant value since there are many other parameters and characteristics that affect the outcome. For instance, I am not sure that total wall time will slightly reduced if abstract verification tasks will be generated in parallel, not all tasks of a job will require really huge time (usually this is the case) and we have much resources used more efficiently than now, e.g. by leveraging speculative scheduling.
Updated by Vitaly Mordan about 8 years ago
Evgeny Novikov wrote:
The worst delay at best should be specified when a job decision is started (30 more minutes for implementation). Or just like now it can be hard coded.
It is a bad idea to hard code this parameter, because even in case of basic strategy it should be different. More strategies may require different values of that timeout. Current value (1 second) is bad for all cases.
It is boring to specify this parameter manually each time.
Maybe, it should be specified in scheduler configuration (we know, what kind of jobs are solved there). Or maybe, it should be set to default value, which can be overridden somewhere.
For instance, I am not sure that total wall time will slightly reduced if abstract verification tasks will be generated in parallel
Wall time may be reduced in big launch, because some resources (in example, about 10 000 CPU seconds) will be spent on solving tasks instead.
Updated by Evgeny Novikov about 8 years ago
Vitaly Mordan wrote:
Evgeny Novikov wrote:
The worst delay at best should be specified when a job decision is started (30 more minutes for implementation). Or just like now it can be hard coded.
It is a bad idea to hard code this parameter, because even in case of basic strategy it should be different. More strategies may require different values of that timeout. Current value (1 second) is bad for all cases.
Okay, one will spend 30 minutes more.
It is boring to specify this parameter manually each time.
Maybe, it should be specified in scheduler configuration (we know, what kind of jobs are solved there). Or maybe, it should be set to default value, which can be overridden somewhere.
No, schedulers are common and don't know tasks specifics. This is a job setting, in particular just a job decision setting since it doesn't affect verification results (just times) in ideal.
For instance, I am not sure that total wall time will slightly reduced if abstract verification tasks will be generated in parallel
Wall time may be reduced in big launch, because some resources (in example, about 10 000 CPU seconds) will be spent on solving tasks instead.
This is just one more attribute that can affect the whole system unexpectedly.
Updated by Evgeny Novikov almost 8 years ago
- Status changed from New to Rejected
Reject it as a duplicate of #6906 that was created much earlier.