Bug #974
closedldv-core skips drivers under heavy load
0%
Description
On a heavy-loaded system, ldv-core sometimes doesn't yield commands for drivers (i.e. does not launch the relevant DSCVs). During the experiment with a cluster, it skipped 9 out of 2000 drivers.
When it happens, ldv-core issues a trace message
ldv-core: Test for next command...
which means that WSDL service returned NIL instead of the command.
Should be investigated properly.
Updated by Pavel Shved over 13 years ago
Here's an exception trace for CSD service. There are exactly 9 traces in csd_err.log — as many as there are missing drivers.
javax.xml.ws.WebServiceException: javax.xml.stream.XMLStreamException: java.io.IOException: Broken pipe at com.sun.xml.internal.ws.encoding.StreamSOAPCodec.encode(StreamSOAPCodec.java:101) at com.sun.xml.internal.ws.encoding.SOAPBindingCodec.encode(SOAPBindingCodec.java:249) at com.sun.xml.internal.ws.transport.http.HttpAdapter.encodePacket(HttpAdapter.java:328) at com.sun.xml.internal.ws.transport.http.HttpAdapter.access$100(HttpAdapter.java:82) at com.sun.xml.internal.ws.transport.http.HttpAdapter$HttpToolkit.handle(HttpAdapter.java:470) at com.sun.xml.internal.ws.transport.http.HttpAdapter.handle(HttpAdapter.java:233) at com.sun.xml.internal.ws.transport.http.server.WSHttpHandler.handleExchange(WSHttpHandler.java:95) at com.sun.xml.internal.ws.transport.http.server.WSHttpHandler.handle(WSHttpHandler.java:80) at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:83) at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83) at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:86) at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:589) at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:83) at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:561) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) Caused by: javax.xml.stream.XMLStreamException: java.io.IOException: Broken pipe at com.sun.xml.internal.stream.writers.XMLStreamWriterImpl.flush(XMLStreamWriterImpl.java:395) at com.sun.xml.internal.ws.streaming.XMLStreamWriterUtil.getOutputStream(XMLStreamWriterUtil.java:86) at com.sun.xml.internal.ws.message.jaxb.JAXBMessage.writePayloadTo(JAXBMessage.java:308) at com.sun.xml.internal.ws.message.AbstractMessageImpl.writeTo(AbstractMessageImpl.java:131) at com.sun.xml.internal.ws.encoding.StreamSOAPCodec.encode(StreamSOAPCodec.java:98) ... 16 more Caused by: java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcher.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:122) at sun.nio.ch.IOUtil.write(IOUtil.java:93) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:352) at sun.net.httpserver.Request$WriteStream.write(Request.java:410) at sun.net.httpserver.ChunkedOutputStream.writeChunk(ChunkedOutputStream.java:122) at sun.net.httpserver.ChunkedOutputStream.flush(ChunkedOutputStream.java:156) at sun.net.httpserver.PlaceholderOutputStream.flush(ExchangeImpl.java:416) at java.io.FilterOutputStream.flush(FilterOutputStream.java:140) at com.sun.xml.internal.stream.writers.UTF8OutputStreamWriter.flush(UTF8OutputStreamWriter.java:138) at com.sun.xml.internal.stream.writers.XMLStreamWriterImpl.flush(XMLStreamWriterImpl.java:393) ... 20 more
Updated by Pavel Shved over 13 years ago
Prior to such failures, five-minute stalls were observed, like this:
16130| 2011-03-23 22:53:26 Node::ldv.101: TRACE: ldv-core: Calling watcher... 16130| 2011-03-23 22:53:26 Node::ldv.101: INFO: watcher: Called watcher (output not checked): /mnt/cluster/toolset/ldv-core/../watcher/ldv-watcher queue dscv /mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/task_descriptor /mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/dscv_tempdir ldv 101 dscv 4545 16130| 2011-03-23 22:53:26 Node::ldv.101: DEBUG: watcher: External key is ldv.101 16130| 2011-03-23 22:53:26 Node::ldv.101: DEBUG: watcher: ldv.101: queue dscv /mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/task_descriptor /mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/dscv_tempdir ldv 101 dscv 4545 16130| 2011-03-23 22:53:26 Node::ldv.101: INFO: watcher: Queueing dscv task with /mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/task_descriptor and wd /mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/dscv_tempdir 16130| 2011-03-23 22:53:26 Node::ldv.101: WARN: watcher: Packing files [] 16130| 2011-03-23 22:53:26 Node::ldv.101: DEBUG: watcher: call send_files: name: "ldv.101.dscv.4545-from-parent.pax", fi: [] 16130| 2011-03-23 22:53:26 Node::ldv.101: TRACE: watcher: New sender: {:user=>"ldv", :pass=>"12345", :initrb=>"/mnt/cluster/toolset/ldv-core/../watcher/../cluster/tester.rb", :log=><Logging::Logger:0x156af9209f1c name="23549180577120">, :root=>"/mnt/cluster/toolset/ldv-core/../watcher/../cluster", :namespace_root=>"/mnt/cluster/work", :log_level=>:warn, :host=>"localhost", :format=>:json, :filesrv=>"pavel@shved:/mnt/cluster/files/", :vhost=>"/test2", :ping_time=>1, :spawn_key=>nil} 16130| 2011-03-23 22:53:27 Node::ldv.101: DEBUG: watcher: sender init opts {:pass=>"12345", :user=>"ldv", :namespace_root=>"/mnt/cluster/work", :format=>:json, :host=>"localhost", :filesrv=>"pavel@shved:/mnt/cluster/files/", :vhost=>"/test2", :spawn_key=>nil} 16130| 2011-03-23 22:53:27 Node::ldv.101: TRACE: watcher: Sending payload {:type=>"dscv", :env=>[], :workdir=>"/mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/dscv_tempdir", :key=>"ldv.101.dscv.4545", :args=>"<input>\n <report>/mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/dscv_tempdir/report_after_dscv.xml</report>\n <workdir>/mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/dscv_tempdir</workdir>\n <ruledb>/mnt/cluster/toolset/ldv-core/../kernel-rules/model-db.xml</ruledb>\n <properties>32_7</properties>\n <cmdfile>/mnt/cluster/work/work/current--X--drivers/--X--defaultlinux-2.6.31.6--X--32_7/linux-2.6.31.6/csd_deg_dscv/4545/cmd_after_deg.xml</cmdfile>\n</input>\n"} 16130| 2011-03-23 22:53:27 Node::ldv.101: TRACE: watcher: Reply gotten. Sender dies in a second. 16130| 2011-03-23 22:53:28 Node::ldv.101: ERROR: [Wed, 23 Mar 2011 22:53:26 +0300] DEBUG: [setup] setting log level to INFO 16130| 2011-03-23 22:53:28 Node::ldv.101: ERROR: [Wed, 23 Mar 2011 22:53:26 +0300] WARN: init.rb /mnt/cluster/toolset/ldv-core/../watcher/../cluster/tester.rb does not exist or is not reachable 16130| 2011-03-23 22:58:22 Node::ldv.101: INFO: watcher: Watcher returns 0, waitpid: 0 16130| 2011-03-23 22:58:22 Node::ldv.101: TRACE: ldv-core: Set task status to "queued"
Updated by Pavel Shved over 13 years ago
- Priority changed from High to Normal
This problem was mitigated by fixing implementation of watcher's "queue
" command in a cluster environment. The problem is still there, though it causes less damage now.
Updated by Pavel Shved over 13 years ago
- Status changed from Open to Resolved
- Published in build set to 1b22552
Watcher's queue slowness did contribute to this, but wasn't the root of the problem.
The thing is that SOAP::Lite calls time out by default after 3 minutes of inactivity. ldv-core calls getCommand
method on CSD's WSDL service, and this method may block for a long time (while the kernel was compiling). Meanwhile, the client silently drops the connection, and continues polling the server with the similar command. Server, on the other hand, being inside a handler of that method, consumed the next command when it appeared, and SOAP backend then died because it was trying to write into a closed pipe (it seems, that local SOAP connections are implemented as pipes, or it's just Java's representation; see the "Broken pipe" above). And the command consumed is never reclaimed, since there is no single opportunity to handle such an exception from anywhere in the user's code (note the absence of our functions in the stacktrace)
The correct solution would be to make ldv-core the server, and CSD — the client, CSD sending methods as soon as they arrive from BCE. However, it would be too long to fix, so now getCommand officially returns NULL if there is no command available in 10 seconds; ldv-core also handles the actual error messages from SOAP just like it should: die on 10th real error.
See 1b22552
Updated by Pavel Shved over 13 years ago
I've just tested drivers/net
, and it occurs that no drivers were skipped (before the fix about 15 of them were lost). So, I guess, the fix now works.
Updated by Pavel Shved over 13 years ago
Eugene Novikov wrote:
Should be closed?
I will perform more cluster runs, and will close it if it works.
Updated by Pavel Shved over 13 years ago
- Status changed from Resolved to Closed
Haven't seen any more problems like this with our cluster.