Bug #7891

closed

Bridge does not properly check for unfinished reports

Added by Evgeny Novikov over 7 years ago. Updated over 7 years ago.

Status: Closed
Priority: Immediate
Category: Bridge
Target version: -
Start date: 01/24/2017
Due date: 01/25/2017
% Done: 100%
Estimated time:
Detected in build: svn
Platform:
Published in build:

Description

Currently a job decision gets status Failed if there are unfinished reports, so one can download the corresponding job archives. When one tries to upload such an archive, the error log contains the following:

[24.Jan.2017 08:36:50] Uploading report failed: type object argument after * must be a sequence, not NoneType
Traceback (most recent call last):
  File "/var/www/bridge/jobs/Download.py", line 485, in __create_job_from_tar
    UploadReports(job, reports_data, report_files)
  File "/var/www/bridge/jobs/Download.py", line 531, in __init__
    self.__upload_all()
  File "/var/www/bridge/jobs/Download.py", line 553, in __upload_all
    report_id = curr_func(data)
  File "/var/www/bridge/jobs/Download.py", line 578, in __create_report_component
    'finish_date': datetime(*data['finish_date'], tzinfo=pytz.timezone('UTC')),
TypeError: type object argument after * must be a sequence, not NoneType
Stack (most recent call last):
  File "/usr/local/bin/gunicorn", line 11, in <module>
    sys.exit(run())
  File "/usr/local/lib/python3.4/dist-packages/gunicorn/app/wsgiapp.py", line 74, in run
    WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
  File "/usr/local/lib/python3.4/dist-packages/gunicorn/app/base.py", line 192, in run
    super(Application, self).run()
  File "/usr/local/lib/python3.4/dist-packages/gunicorn/app/base.py", line 72, in run
    Arbiter(self).run()
  File "/usr/local/lib/python3.4/dist-packages/gunicorn/arbiter.py", line 189, in run
    self.manage_workers()
  File "/usr/local/lib/python3.4/dist-packages/gunicorn/arbiter.py", line 524, in manage_workers
    self.spawn_workers()
  File "/usr/local/lib/python3.4/dist-packages/gunicorn/arbiter.py", line 590, in spawn_workers
    self.spawn_worker()
  File "/usr/local/lib/python3.4/dist-packages/gunicorn/arbiter.py", line 557, in spawn_worker
    worker.init_process()
  File "/usr/local/lib/python3.4/dist-packages/gunicorn/workers/base.py", line 132, in init_process
    self.run()
  File "/usr/local/lib/python3.4/dist-packages/gunicorn/workers/sync.py", line 124, in run
    self.run_for_one(timeout)
  File "/usr/local/lib/python3.4/dist-packages/gunicorn/workers/sync.py", line 68, in run_for_one
    self.accept(listener)
  File "/usr/local/lib/python3.4/dist-packages/gunicorn/workers/sync.py", line 30, in accept
    self.handle(listener, client, addr)
  File "/usr/local/lib/python3.4/dist-packages/gunicorn/workers/sync.py", line 135, in handle
    self.handle_request(listener, req, client, addr)
  File "/usr/local/lib/python3.4/dist-packages/gunicorn/workers/sync.py", line 176, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python3.4/dist-packages/django/core/handlers/wsgi.py", line 177, in __call__
    response = self.get_response(request)
  File "/usr/local/lib/python3.4/dist-packages/django/core/handlers/base.py", line 147, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.4/dist-packages/django/contrib/auth/decorators.py", line 23, in _wrapped_view
    return view_func(request, *args, **kwargs)
  File "/var/www/bridge/bridge/utils.py", line 118, in wait
    res = f(*args, **kwargs)
  File "/var/www/bridge/bridge/utils.py", line 58, in wrapper
    res = f(*args, **kwargs)
  File "/var/www/bridge/jobs/views.py", line 635, in upload_job
    zipdata = UploadJob(parent, request.user, job_dir.name)
  File "/var/www/bridge/jobs/Download.py", line 305, in __init__
    self.err_message = self.__create_job_from_tar()
  File "/var/www/bridge/jobs/Download.py", line 487, in __create_job_from_tar
    logger.exception("Uploading report failed: %s" % e, stack_info=True)

I noticed this issue with just one job archive. The job wasn't decided completely: its decision was terminated due to fatal errors that happened before I increased timeouts.
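
For illustration, the failing line in the traceback unpacks data['finish_date'] with *, which raises TypeError when an unfinished report stores None there. The following is a minimal sketch of a guard, not the actual Bridge fix: the helper name parse_finish_date is hypothetical, and datetime.timezone.utc from the standard library stands in for pytz.timezone('UTC') only to keep the example self-contained.

```python
from datetime import datetime, timezone


def parse_finish_date(data):
    """Return an aware datetime, or None if the report was never finished.

    Hypothetical helper: `data` mimics the report dictionary from the
    traceback, where 'finish_date' is a sequence of datetime fields
    or None for an unfinished report.
    """
    parts = data.get('finish_date')
    if parts is None:
        # Unfinished report: there is nothing to unpack.
        return None
    return datetime(*parts, tzinfo=timezone.utc)


# Unpacking None directly raises the same exception class as in the log
# (the exact message text depends on the Python version).
try:
    datetime(*None)
except TypeError as exc:
    print("TypeError:", exc)

print(parse_finish_date({'finish_date': [2017, 1, 24, 8, 36, 50]}))
print(parse_finish_date({'finish_date': None}))
```

Whether an unfinished report should be stored with an empty finish date or rejected outright is a design decision that the discussion below addresses through job statuses.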


Files

linux_alloc - part.zip (14.8 MB), Evgeny Novikov, 01/24/2017 11:44 AM
Job-0bf7ec7f02-0.zip (15.9 KB), Evgeny Novikov, 01/25/2017 10:51 AM
Actions #1

Updated by Vladimir Gratinskiy over 7 years ago

The reason is that the job has unfinished reports. When the finish report for Core is uploaded, there is a check: if there are unfinished reports, the job becomes "Corrupted". Without this report the job can't become "Failed" or "Finished" (and only such jobs can be downloaded). So you shouldn't do this with unfinished jobs.

Actions #2

Updated by Evgeny Novikov over 7 years ago

  • Subject changed from Bridge can't upload reports sometimes to Bridge does not properly check for unfinished reports
  • Description updated (diff)
Actions #3

Updated by Vladimir Gratinskiy over 7 years ago

  • Due date set to 01/25/2017
  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Fixed in fix_7891. The problem was that there was no finish report from Core when the scheduler set the job status to Failed.

Actions #4

Updated by Evgeny Novikov over 7 years ago

  • Status changed from Resolved to Open

I didn't notice any changes. I can still download archives of jobs with unfinished reports. Uploading them still results in exceptions like the one in the issue description.

BTW, why is it bad to download and upload jobs with unfinished reports? Such jobs are absolutely normal, for instance when Core and its components are terminated after running out of memory or because of internal Bridge/Scheduler errors. They can hold useful data obtained over a long period of time, and even if this data is incomplete it is still valuable (e.g. one can start analyzing unsafes until complete results are obtained and the corresponding jobs are replaced).

Actions #5

Updated by Evgeny Novikov over 7 years ago

  • Tracker changed from Bug to Feature
  • Subject changed from Bridge does not properly check for unfinished reports to Allow to download and upload archives of jobs with unfinished reports
  • Description updated (diff)
Actions #6

Updated by Evgeny Novikov over 7 years ago

I attached a much simpler archive with a job that has unfinished reports because Core exceeded an artificially small memory limit.

Actions #7

Updated by Vladimir Gratinskiy over 7 years ago

Evgeny Novikov wrote:

I didn't notice any changes. I can still download archives of jobs with unfinished reports. Uploading them still results in exceptions like the one in the issue description.

But you can't make new Failed jobs with unfinished reports anymore.

Actions #8

Updated by Evgeny Novikov over 7 years ago

Vladimir Gratinskiy wrote:

But you can't make new Failed jobs with unfinished reports anymore.

I recently attached an example of such a job ;)

Actions #9

Updated by Vladimir Gratinskiy over 7 years ago

Evgeny Novikov wrote:

I recently attached an example of such a job ;)

I mean you can no longer decide a job so that its status is Failed while it has unfinished reports. Old jobs are not affected by this fix.

Actions #10

Updated by Evgeny Novikov over 7 years ago

Vladimir Gratinskiy wrote:

I mean you can no longer decide a job so that its status is Failed while it has unfinished reports. Old jobs are not affected by this fix.

I decided that job after checking out your branch. So I did get this job archive, although that shouldn't have been possible.

Actions #11

Updated by Vladimir Gratinskiy over 7 years ago

  • Status changed from Open to Resolved

Sorry, I forgot about one place that lacked the check for unfinished reports :) But anyway, jobs can now have unfinished reports, and users can download and upload their archives without exceptions.

Actions #12

Updated by Evgeny Novikov over 7 years ago

  • Tracker changed from Feature to Bug
  • Subject changed from Allow to download and upload archives of jobs with unfinished reports to Bridge does not properly check for unfinished reports
  • Status changed from Resolved to Open
  • Detected in build set to svn

I have some comments.

First, you shouldn't use the variable FORMAT to distinguish the way job archives are represented. According to the specification, that variable should be used to distinguish job representations. So after you update its value there has to be a migration, and all new jobs will have the new format and won't be solved by old versions of Core. You can introduce a new variable for this purpose.

Second, although I didn't test this, the commit message says that solved jobs can have unfinished reports. This sounds very strange. It would be better to set status Corrupted for any job (whether it was solved successfully or unsuccessfully for any reason) that has unfinished reports, which is what I originally requested and you almost implemented. But then it would be very good if one could download/upload archives of jobs with status Corrupted. Is that possible? I will open another issue for this.

BTW, this issue is very close to #7758, which describes the same problems but for jobs with status Canceled. There are some complicated things that are hard to implement for transferring canceled/corrupted job archives, e.g. proper cache recalculation. But if this is done at least for some particular examples, we will be able to fix some extra cases. Without full support for transferring such job archives, as exists for solved jobs, there is less point in being able to download/upload them, since one will not have many useful means for analysis.

Actions #13

Updated by Vladimir Gratinskiy over 7 years ago

Fixed after merge to branch "feature_7902". Now if:
1) scheduler says the job is failed it will be FAILED;
2) scheduler says the job is solved:
- if the job has unfinished reports then the job is CORRUPTED;
- elif the job is full-weight and there are unknown reports for Core component then the job is FAILED;
- else the job is SOLVED.

There is no other way for a job to become SOLVED.
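
The rules listed in this comment can be sketched as a small decision function. This is only an illustration of the stated rules, not the code from feature_7902: the function name, the parameter names and the string values for the scheduler's verdict are all assumptions.

```python
def resolve_job_status(scheduler_verdict, has_unfinished_reports,
                       is_full_weight, core_has_unknown_report):
    """Map the scheduler's verdict and report state to a final job status.

    Hypothetical sketch: `scheduler_verdict` is assumed to be either
    'failed' or 'solved'; the other arguments are plain booleans.
    """
    if scheduler_verdict == 'failed':
        # Rule 1: a job failed by the scheduler is FAILED,
        # regardless of the state of its reports.
        return 'FAILED'
    if scheduler_verdict != 'solved':
        raise ValueError('unexpected verdict: %r' % (scheduler_verdict,))
    # Rule 2: the scheduler says the job is solved.
    if has_unfinished_reports:
        return 'CORRUPTED'
    if is_full_weight and core_has_unknown_report:
        return 'FAILED'
    return 'SOLVED'


print(resolve_job_status('failed', True, True, False))
print(resolve_job_status('solved', True, True, False))
print(resolve_job_status('solved', False, True, True))
print(resolve_job_status('solved', False, False, True))
```

Note that under these rules a job failed by the scheduler stays FAILED even when it has unfinished reports.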

Actions #14

Updated by Evgeny Novikov over 7 years ago

Vladimir Gratinskiy wrote:

Fixed after merge to branch "feature_7902". Now if:
1) scheduler says the job is failed it will be FAILED;
2) scheduler says the job is solved:
- if the job has unfinished reports then the job is CORRUPTED;
- elif the job is full-weight and there are unknown reports for Core component then the job is FAILED;
- else the job is SOLVED.

There is no other way for a job to become SOLVED.

But if a job has status FAILED because the scheduler failed it, yet it has unfinished reports, it can be downloaded while its upload will fail, won't it?

Actions #15

Updated by Vladimir Gratinskiy over 7 years ago

Evgeny Novikov wrote:

But if a job has status FAILED because the scheduler failed it, yet it has unfinished reports, it can be downloaded while its upload will fail, won't it?

Exactly.

Actions #16

Updated by Evgeny Novikov over 7 years ago

Vladimir Gratinskiy wrote:

Exactly.

So I don't understand why Bridge doesn't mark such jobs as corrupted. The only reason I see is to show the scheduler's error message, which is associated with status FAILED. But it would be better to attach that message to status CORRUPTED, in addition to CORRUPTED's own error message.

Actions #17

Updated by Vladimir Gratinskiy over 7 years ago

Evgeny Novikov wrote:

So I don't understand why Bridge doesn't mark such jobs as corrupted. The only reason I see is to show the scheduler's error message, which is associated with status FAILED. But it would be better to attach that message to status CORRUPTED, in addition to CORRUPTED's own error message.

Oh, sorry. Uploading will not fail. I will enable downloading/uploading for all jobs (except pending and processing ones). I haven't found any problems with uploading such jobs so far, but if there are any we will find them when we try to use it. The branch feature_7902 is ready for testing.

Actions #18

Updated by Evgeny Novikov over 7 years ago

Vladimir Gratinskiy wrote:

Oh, sorry. Uploading will not fail. I will enable downloading/uploading for all jobs (except pending and processing ones). I haven't found any problems with uploading such jobs so far, but if there are any we will find them when we try to use it. The branch feature_7902 is ready for testing.

But this looks like an inconsistency.

Status Corrupted is intended for jobs whose verification results are malformed, e.g. when Core or its components send unexpected reports or the report tree ends up broken.

Status Canceled is very close to Corrupted, but in this case the user stops the job decision and the verification results are most likely malformed.

Status Failed is intended for jobs whose verification results are well-formed, but something bad happened that Core and its components could detect and report.

Status Solved is intended for jobs with good verification results.

Unfortunately there is no place where the scheduler can report errors detected on top of a job decision, e.g. information about fatal errors and running out of time or memory. So these errors are shown as popups for status Failed even though the verification results of the corresponding jobs are malformed.

First, I would like to understand whether this is a correct and complete description.

Actions #19

Updated by Vladimir Gratinskiy over 7 years ago

Evgeny Novikov wrote:

Unfortunately there is no place where the scheduler can report errors detected on top of a job decision, e.g. information about fatal errors and running out of time or memory. So these errors are shown as popups for status Failed even though the verification results of the corresponding jobs are malformed.

Popups contain an error either from the scheduler or from Bridge (any exception description). For example, if a report has a wrong format, uploading it raises an exception that is caught and saved to the SolvingProgress.error field.

Status Solved is for good verification results (uploaded without any problems) with a full set of reports (all reports are finished).
Status Failed is for good verification results where the scheduler said that the job is Failed. This status is also for jobs with good verification results that have an unknown report for Core.
Status Corrupted is for all other cases. For example, the scheduler was disconnected, or uploading a report failed, or the scheduler thinks everything is OK and the status should be Solved but Bridge found unfinished reports.
Status Canceled differs from Corrupted only in that it shows that the user, not the service, corrupted the results.

Actions #20

Updated by Evgeny Novikov over 7 years ago

So I suggest introducing one more status for scheduler-related errors, covering both errors that a scheduler reports explicitly (e.g. that a job decision was terminated due to a timeout) and other errors such as disconnection. Let's call this status Terminated, since that describes most such errors.

Then there will be no ambiguity between statuses Failed and Corrupted.

Actions #21

Updated by Vladimir Gratinskiy over 7 years ago

Evgeny Novikov wrote:

So I suggest introducing one more status for scheduler-related errors, covering both errors that a scheduler reports explicitly (e.g. that a job decision was terminated due to a timeout) and other errors such as disconnection. Let's call this status Terminated, since that describes most such errors.

Then there will be no ambiguity between statuses Failed and Corrupted.

When should this status be set? When the scheduler incorrectly sets the job status to Solved while there are unfinished reports or Core has an unknown report (for full-weight jobs)?

Actions #22

Updated by Evgeny Novikov over 7 years ago

Vladimir Gratinskiy wrote:

When should this status be set? When the scheduler incorrectly sets the job status to Solved while there are unfinished reports or Core has an unknown report (for full-weight jobs)?

When schedulers report errors themselves (you described this as "Status Failed is for good verification results but scheduler said that the job is Failed."; actually the results can be malformed, e.g. there can be unfinished reports), or when they are disconnected, or when there are other issues related to schedulers.

Actions #23

Updated by Vladimir Gratinskiy over 7 years ago

Fixed. Now status Failed is only for jobs with an unknown report for Core.

Actions #24

Updated by Vladimir Gratinskiy over 7 years ago

Jobs with unfinished reports will get status Corrupted if the scheduler didn't report an error.
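
The status semantics reached at the end of this discussion might look roughly like the sketch below. It rests on several assumptions that the thread does not confirm: that the proposed Terminated status was adopted for scheduler-reported errors and disconnections, that unfinished reports are checked before the unknown Core report (the order given in comment #13), and that the full-weight condition can be left out for brevity. All names are hypothetical.

```python
from enum import Enum


class JobStatus(Enum):
    SOLVED = 'Solved'
    FAILED = 'Failed'
    CORRUPTED = 'Corrupted'
    TERMINATED = 'Terminated'  # assumed: the status proposed in comment #20


def resolve_status_revised(scheduler_error, has_unfinished_reports,
                           core_has_unknown_report):
    """Hypothetical mapping of the revised rules to a status."""
    if scheduler_error:
        # Assumed: any scheduler-related error (explicit report,
        # disconnection) takes precedence over the report state.
        return JobStatus.TERMINATED
    if has_unfinished_reports:
        # No scheduler error, but the set of reports is incomplete.
        return JobStatus.CORRUPTED
    if core_has_unknown_report:
        # Failed is reserved for jobs with an unknown report for Core.
        return JobStatus.FAILED
    return JobStatus.SOLVED


print(resolve_status_revised(True, True, False))
print(resolve_status_revised(False, True, False))
print(resolve_status_revised(False, False, True))
print(resolve_status_revised(False, False, False))
```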

Actions #25

Updated by Evgeny Novikov over 7 years ago

  • Status changed from Open to Closed

I merged the branch to master in 61cc1ca together with implementation of #7902.
