Project

General

Profile

Actions

Bug #974

closed

ldv-core skips drivers under heavy load

Added by Pavel Shved about 13 years ago. Updated over 12 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Infrastructure
Start date:
03/24/2011
Due date:
% Done:

0%

Estimated time:
Detected in build:
svn
Platform:
Published in build:
1b22552

Description

On a heavy-loaded system, ldv-core sometimes doesn't yield commands for drivers (i.e. does not launch the relevant DSCVs). During the experiment with a cluster, it skipped 9 out of 2000 drivers.

When it happens, ldv-core issues a trace message

ldv-core: Test for next command...

which means that WSDL service returned NIL instead of the command.

Should be investigated properly.

Actions #1

Updated by Pavel Shved about 13 years ago

Here's an exception trace for CSD service. There are exactly 9 traces in csd_err.log — as many as there are missing drivers.

javax.xml.ws.WebServiceException: javax.xml.stream.XMLStreamException: java.io.IOException: Broken pipe
  at com.sun.xml.internal.ws.encoding.StreamSOAPCodec.encode(StreamSOAPCodec.java:101)
  at com.sun.xml.internal.ws.encoding.SOAPBindingCodec.encode(SOAPBindingCodec.java:249)
  at com.sun.xml.internal.ws.transport.http.HttpAdapter.encodePacket(HttpAdapter.java:328)
  at com.sun.xml.internal.ws.transport.http.HttpAdapter.access$100(HttpAdapter.java:82)
  at com.sun.xml.internal.ws.transport.http.HttpAdapter$HttpToolkit.handle(HttpAdapter.java:470)
  at com.sun.xml.internal.ws.transport.http.HttpAdapter.handle(HttpAdapter.java:233)
  at com.sun.xml.internal.ws.transport.http.server.WSHttpHandler.handleExchange(WSHttpHandler.java:95)
  at com.sun.xml.internal.ws.transport.http.server.WSHttpHandler.handle(WSHttpHandler.java:80)
  at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:83)
  at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)
  at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:86)
  at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:589)
  at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:83)
  at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:561)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:636)
Caused by: javax.xml.stream.XMLStreamException: java.io.IOException: Broken pipe
  at com.sun.xml.internal.stream.writers.XMLStreamWriterImpl.flush(XMLStreamWriterImpl.java:395)
  at com.sun.xml.internal.ws.streaming.XMLStreamWriterUtil.getOutputStream(XMLStreamWriterUtil.java:86)
  at com.sun.xml.internal.ws.message.jaxb.JAXBMessage.writePayloadTo(JAXBMessage.java:308)
  at com.sun.xml.internal.ws.message.AbstractMessageImpl.writeTo(AbstractMessageImpl.java:131)
  at com.sun.xml.internal.ws.encoding.StreamSOAPCodec.encode(StreamSOAPCodec.java:98)
  ... 16 more
Caused by: java.io.IOException: Broken pipe
  at sun.nio.ch.FileDispatcher.write0(Native Method)
  at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
  at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:122)
  at sun.nio.ch.IOUtil.write(IOUtil.java:93)
  at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:352)
  at sun.net.httpserver.Request$WriteStream.write(Request.java:410)
  at sun.net.httpserver.ChunkedOutputStream.writeChunk(ChunkedOutputStream.java:122)
  at sun.net.httpserver.ChunkedOutputStream.flush(ChunkedOutputStream.java:156)
  at sun.net.httpserver.PlaceholderOutputStream.flush(ExchangeImpl.java:416)
  at java.io.FilterOutputStream.flush(FilterOutputStream.java:140)
  at com.sun.xml.internal.stream.writers.UTF8OutputStreamWriter.flush(UTF8OutputStreamWriter.java:138)
  at com.sun.xml.internal.stream.writers.XMLStreamWriterImpl.flush(XMLStreamWriterImpl.java:393)
  ... 20 more
Actions #2

Updated by Pavel Shved about 13 years ago

Prior to such failures, five-minute stalls were observed, like this:

 16130| 2011-03-23 22:53:26 Node::ldv.101: TRACE: ldv-core: Calling watcher...
 16130| 2011-03-23 22:53:26 Node::ldv.101:  INFO: watcher: Called watcher (output not checked): /mnt/cluster/toolset/ldv-core/../watcher/ldv-watcher queue dscv /mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/task_descriptor /mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/dscv_tempdir ldv 101 dscv 4545
 16130| 2011-03-23 22:53:26 Node::ldv.101: DEBUG: watcher: External key is ldv.101
 16130| 2011-03-23 22:53:26 Node::ldv.101: DEBUG: watcher: ldv.101: queue dscv /mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/task_descriptor /mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/dscv_tempdir ldv 101 dscv 4545
 16130| 2011-03-23 22:53:26 Node::ldv.101:  INFO: watcher: Queueing dscv task with /mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/task_descriptor and wd /mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/dscv_tempdir
 16130| 2011-03-23 22:53:26 Node::ldv.101:  WARN: watcher: Packing files []
 16130| 2011-03-23 22:53:26 Node::ldv.101: DEBUG: watcher: call send_files: name: "ldv.101.dscv.4545-from-parent.pax", fi: []
 16130| 2011-03-23 22:53:26 Node::ldv.101: TRACE: watcher: New sender: {:user=>"ldv", :pass=>"12345", :initrb=>"/mnt/cluster/toolset/ldv-core/../watcher/../cluster/tester.rb", :log=><Logging::Logger:0x156af9209f1c name="23549180577120">, :root=>"/mnt/cluster/toolset/ldv-core/../watcher/../cluster", :namespace_root=>"/mnt/cluster/work", :log_level=>:warn, :host=>"localhost", :format=>:json, :filesrv=>"pavel@shved:/mnt/cluster/files/", :vhost=>"/test2", :ping_time=>1, :spawn_key=>nil}
 16130| 2011-03-23 22:53:27 Node::ldv.101: DEBUG: watcher: sender init opts {:pass=>"12345", :user=>"ldv", :namespace_root=>"/mnt/cluster/work", :format=>:json, :host=>"localhost", :filesrv=>"pavel@shved:/mnt/cluster/files/", :vhost=>"/test2", :spawn_key=>nil}
 16130| 2011-03-23 22:53:27 Node::ldv.101: TRACE: watcher: Sending payload {:type=>"dscv", :env=>[], :workdir=>"/mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/dscv_tempdir", :key=>"ldv.101.dscv.4545", :args=>"<input>\n  <report>/mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/dscv_tempdir/report_after_dscv.xml</report>\n  <workdir>/mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/dscv_tempdir</workdir>\n  <ruledb>/mnt/cluster/toolset/ldv-core/../kernel-rules/model-db.xml</ruledb>\n  <properties>32_7</properties>\n  <cmdfile>/mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/cmd_after_deg.xml</cmdfile>\n</input>\n"}
 16130| 2011-03-23 22:53:27 Node::ldv.101: TRACE: watcher: Reply gotten.  Sender dies in a second.
 16130| 2011-03-23 22:53:28 Node::ldv.101: ERROR: [Wed, 23 Mar 2011 22:53:26 +0300] DEBUG: [setup] setting log level to INFO
 16130| 2011-03-23 22:53:28 Node::ldv.101: ERROR: [Wed, 23 Mar 2011 22:53:26 +0300] WARN: init.rb /mnt/cluster/toolset/ldv-core/../watcher/../cluster/tester.rb does not exist or is not reachable
 16130| 2011-03-23 22:58:22 Node::ldv.101:  INFO: watcher: Watcher returns 0, waitpid: 0
 16130| 2011-03-23 22:58:22 Node::ldv.101: TRACE: ldv-core: Set task status to "queued" 
Actions #3

Updated by Pavel Shved about 13 years ago

  • Priority changed from High to Normal

This problem was mitigated by fixing implementation of watcher's "queue" command in a cluster environment. The problem is still there, though it causes less damage now.

Actions #4

Updated by Pavel Shved almost 13 years ago

  • Status changed from Open to Resolved
  • Published in build set to 1b22552

Watcher's queue slowness did contribute to this, but wasn't the root of the problem.

The thing is that SOAP::Lite calls time out by default after 3 minutes of inactivity. ldv-core calls getCommand method on CSD's WSDL service, and this method may block for a long time (while the kernel was compiling). Meanwhile, the client silently drops the connection, and continues polling the server with the similar command. Server, on the other hand, being inside a handler of that method, consumed the next command when it appeared, and SOAP backend then died because it was trying to write into a closed pipe (it seems, that local SOAP connections are implemented as pipes, or it's just Java's representation; see the "Broken pipe" above). And the command consumed is never reclaimed, since there is no single opportunity to handle such an exception from anywhere in the user's code (note the absence of our functions in the stacktrace)

The correct solution would be to make ldv-core the server, and CSD — the client, CSD sending methods as soon as they arrive from BCE. However, it would be too long to fix, so now getCommand officially returns NULL if there is no command available in 10 seconds; ldv-core also handles the actual error messages from SOAP just like it should: die on 10th real error.

See 1b22552

Actions #5

Updated by Pavel Shved almost 13 years ago

I've just tested drivers/net, and it occurs that no drivers were skipped (before the fix about 15 of them were lost). So, I guess, the fix now works.

Actions #6

Updated by Evgeny Novikov almost 13 years ago

Should be closed?

Actions #7

Updated by Pavel Shved almost 13 years ago

Eugene Novikov wrote:

Should be closed?

I will perform more cluster runs, and will close it if it works.

Actions #8

Updated by Evgeny Novikov almost 13 years ago

At last should be closed?

Actions #9

Updated by Pavel Shved over 12 years ago

  • Status changed from Resolved to Closed

Haven't seen any more problems like this with our cluster.

Actions

Also available in: Atom PDF