[CRIU] criu fails to dump JVM processes due to a half-closed socketpair()
Florian Gross
Florian.S.Gross at web.de
Tue Jul 2 06:19:59 EDT 2013
Hey there,
I'm trying to use criu to roll back Java applications to a state right after their main() method has run.
Since I'm working with graphical applications, I went with the wiki recommendations and set up a process structure like this (I later plan to switch to Xvfb):
vnc_server.sh(5197)
Xvnc(5198)
eclipse(5200)
java(5211)
Using criu dump to checkpoint a Java process fails when dumping files:
root@freya:~# ./criu-0.6/criu dump --file-locks --tcp-established -D eclipse-dump -v4 -t 5197
[…]
(00.255592) ========================================
(00.255630) Dumping task (pid: 5211)
(00.255667) ========================================
[…]
(00.590492) Error (sk-unix.c:211): Dangling in-flight connection 19151
(00.590529) ----------------------------------------
(00.590564) Error (cr-dump.c:1468): Dump files (pid: 5211) failed with -1
We can use lsof to map that connection to an fd; strace and gdb then let us find the syscall and the code location that set up the socket. In this case, it is set up in Java_sun_nio_ch_FileDispatcher_init(), which looks like this [1]:
    static int preCloseFD = -1;     /* File descriptor to which we dup other fd's
                                       before closing them for real */

    JNIEXPORT void JNICALL
    Java_sun_nio_ch_FileDispatcher_init(JNIEnv *env, jclass cl)
    {
        int sp[2];
        if (socketpair(PF_UNIX, SOCK_STREAM, 0, sp) < 0) {
            JNU_ThrowIOExceptionWithLastError(env, "socketpair failed");
            return;
        }
        preCloseFD = sp[0];
        close(sp[1]);
    }
What happens here is that a new pair of unix sockets is created. One of the endpoints is then immediately closed. The other one is kept around in a static variable. See [2] for the rationale behind this.
Unfortunately, criu fails to dump half-closed socket pairs. To confirm this, I wrote this minimal C program, which criu fails to dump with the same symptoms [3]:
    #include <sys/socket.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char** argv) {
        int sp[2];

        if (socketpair(PF_UNIX, SOCK_STREAM, 0, sp) < 0) {
            perror("Failed to get socketpair");
            return 1;
        }

        /* Close one endpoint; sp[0] stays open with no peer. */
        close(sp[1]);

        /* Stay alive so the process can be dumped. */
        for (;;) { sleep(1000); }
    }
So here's my question: Would it be possible to detect this case specially and just create a new half-closed socketpair on restore? Or is there a better solution? This would help in supporting checkpointing of the Java VM, which would be very awesome. (I'm seeing the same behavior for both OpenJDK 6 and 7.)
Thanks a lot,
Florian Gross
(For completeness' sake: This is using the criu-0.6 tools release from the web page together with a self-built kernel with all the appropriate config flags based on 768f616.)
[1] Code is available as part of the OpenJDK, or at http://svn.netlabs.org/repos/java/tags/rc/openjdk/jdk/src/solaris/native/sun/nio/ch/FileDispatcher.c
[2] http://mail.openjdk.java.net/pipermail/core-libs-dev/2008-January/000219.html
[3] Also available from https://gist.github.com/flgr/5909055/raw/