Troubleshooting “Error reading data from FastCGI server”

Recently, I have observed the following error message in Apache’s error.log:

[Thu Jun 27 19:27:50 2013] [warn] [client xxx.xxx.xxx.xxx] (104)Connection reset by peer: mod_fcgid: error reading data from FastCGI server
[Thu Jun 27 19:27:50 2013] [error] [client xxx.xxx.xxx.xxx] Premature end of script headers: xxx.php

My setup consists of Apache 2.2.16 (with MPM worker), mod_fcgid 2.3.6 and PHP 5.3.3. The error message appeared only when the web server was serving requests at a rate of approximately 20-30 requests per second. Below, I will present the troubleshooting steps that I used in order to discover what was the problem. To cut a long story short (tl;dr) you can read the last paragraph titled Special PHP considerations here: http://httpd.apache.org/mod_fcgid/mod/mod_fcgid.html#examples.

In order to further investigate the problem, I’ve written and executed the following script (which was adapted from here: http://www.commandlinefu.com/commands/view/8097/easily-strace-all-your-apache-processes):

ps -ef | egrep 'apache|php5' > processes
cat processes | awk '{print"-p " $2}' | xargs sudo strace -f -o strace.out -t &

This script traces system calls for every Apache or PHP process in the system and also follows any possible fork of the those processes. I left the script running, while I was monitoring the error.log. When the error message appeared, I terminated the script and began to investigate the trace output produced:

1) First of all, I searched for the error message, since I suspected I could find the respective write system call:

$ grep 'Jun 27 19:27:50' strace.out
12403 19:27:50 write(15, "[Thu Jun 27 19:27:50 2013] [warn"..., 169) = 169
12403 19:27:50 write(15, "[Thu Jun 27 19:27:50 2013] [erro"..., 131) = 131

2) Indeed, process 12403 writes to file descriptor 15 (which is probably the error.log) the error message. The next thing I wanted to do was to check what this process was doing before emitting the error message:

$ grep -C1000 'Jun 27 19:27:50' strace.out | grep ^12403
12403 19:27:50 ) = 1 ([{fd=43, revents=POLLIN}])
12403 19:27:50 read(43, "GET /info.php?id=109&state=live "..., 8000) = 675
12403 19:27:50 stat64("/var/www/penny/htdocs/info.php", {st_mode=S_IFREG|0664, st_size=311, ...}) = 0
# lookup and read .htaccess
12403 19:27:50 stat64("/var/www/penny/htdocs/info.php", {st_mode=S_IFREG|0664, st_size=311, ...}) = 0
12403 19:27:50 semop(12517376, {{0, -1, SEM_UNDO}}, 1) = 0
12403 19:27:50 semop(12517376, {{0, 1, SEM_UNDO}}, 1) = 0
12403 19:27:50 socket(PF_FILE, SOCK_STREAM, 0) = 49
12403 19:27:50 connect(49, {sa_family=AF_FILE, path="/var/lib/apache2/fcgid/sock/515.470"}, 110) = 0
12403 19:27:50 setsockopt(49, SOL_TCP, TCP_NODELAY, [1], 4) = -1 EOPNOTSUPP (Operation not supported)
12403 19:27:50 fcntl64(49, F_GETFL) = 0x2 (flags O_RDWR)
12403 19:27:50 fcntl64(49, F_SETFL, O_RDWR|O_NONBLOCK) = 0
12403 19:27:50 writev(49, [{"\1\1\1\10", 8}, {"\1", 8}, {"\1\4\1\4\317", 8}, {"\n\nKOHANA_ENVproduction\t\16HTTP_HOS"..., 1231}, {"\1\4\1", 8}, {"\1\5\1", 8}, {NULL, 0}], 7) = 1271
12403 19:27:50 read(49, 0xb9978ff8, 8192) = -1 EAGAIN (Resource temporarily unavailable)
12403 19:27:50 poll([{fd=49, events=POLLIN}], 1, 40000
12403 19:27:50 ) = 1 ([{fd=49, revents=POLLIN|POLLERR|POLLHUP}])
12403 19:27:50 read(49, 0xb9978ff8, 8192) = -1 ECONNRESET (Connection reset by peer)
12403 19:27:50 write(15, "[Thu Jun 27 19:27:50 2013] [warn"..., 169) = 169
12403 19:27:50 write(15, "[Thu Jun 27 19:27:50 2013] [erro"..., 131) = 131
12403 19:27:50 close(49) = 0
12403 19:27:50 semop(12517376, {{0, -1, SEM_UNDO}}, 1) = 0
12403 19:27:50 semop(12517376, {{0, 1, SEM_UNDO}}, 1) = 0
12403 19:27:50 writev(43, [{"HTTP/1.1 500 Internal Server Err"..., 239}, {"\37\213\10\3", 10}, {"uR\315\216\3230\20\276\373)\206=\201\264\33w+\2018X\221`[D\245]\250\264\341\200V\34"..., 380}, {"\25m\241\24j\2", 8}], 4) = 637
12403 19:27:50 write(28, "46.12.92.42 - - [27/Jun/2013:19:"..., 238) = 238

3) So, this process is probably Apache. It reads from fd 43 the HTTP request, it performs some setup actions and then tries to communicate with the FastCGI server (PHP) using the FastCGI interface over a Unix domain socket. It connects successfully, writes the request, reads (but no data is available; this is normal since the FastCGI server hasn’t yet got a chance to process the request) and then polls for data. The process is then resumed from polling and the return value of the poll indicates an error condition which is identified as ECONNRESET (Connection reset by peer) when the process tries to read from the socket. Finally, the process writes to access.log and error.log. So, the next thing was to check what happened to the FastCGI server which is listening on this specific Unix domain socket:

$ grep '/var/lib/apache2/fcgid/sock/515.470' strace.out | grep -v connect
515 19:20:28 unlink("/var/lib/apache2/fcgid/sock/515.470") = -1 ENOENT (No such file or directory)
515 19:20:28 bind(35, {sa_family=AF_FILE, path="/var/lib/apache2/fcgid/sock/515.470"}, 110) = 0
515 19:20:28 chmod("/var/lib/apache2/fcgid/sock/515.470", 0700) = 0
515 19:27:51 unlink("/var/lib/apache2/fcgid/sock/515.470") = 0

4) So, process 515 binds to this Unix domain socket using fd 35. Let’s see what happens next:

$ grep -A100 'bind.*/var/lib/apache2/fcgid/sock/515.470' strace.out | grep ^515
515 19:20:28 bind(35, {sa_family=AF_FILE, path="/var/lib/apache2/fcgid/sock/515.470"}, 110) = 0
515 19:20:28 umask(022) = 077
515 19:20:28 chmod("/var/lib/apache2/fcgid/sock/515.470", 0700) = 0
515 19:20:28 listen(35, 5) = 0
515 19:20:28 geteuid32() = 33
515 19:20:28 fcntl64(35, F_GETFD) = 0
515 19:20:28 fcntl64(35, F_SETFD, FD_CLOEXEC) = 0
# irrelevant
515 19:20:28 dup(35) = 40
515 19:20:28 fcntl64(40, F_GETFD) = 0
515 19:20:28 fcntl64(40, F_SETFD, 0) = 0
515 19:20:28 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb74c7768) = 15768

5) The listening socket is handed over to a newly created child process with PID 15768. Finally, the last thing was to check how this process and the initial process 12403 interact:

$ grep -C100 'Jun 27 19:27:50' strace.out | egrep '^12403|^15768'
12403 19:27:50 connect(49, {sa_family=AF_FILE, path="/var/lib/apache2/fcgid/sock/515.470"}, 110) = 0
12403 19:27:50 setsockopt(49, SOL_TCP, TCP_NODELAY, [1], 4) = -1 EOPNOTSUPP (Operation not supported)
12403 19:27:50 fcntl64(49, F_GETFL) = 0x2 (flags O_RDWR)
12403 19:27:50 fcntl64(49, F_SETFL, O_RDWR|O_NONBLOCK) = 0
12403 19:27:50 gettimeofday({1372354070, 840571}, NULL) = 0
12403 19:27:50 writev(49, [{"\1\1\1\10", 8}, {"\1", 8}, {"\1\4\1\4\317", 8}, {"\n\nKOHANA_ENVproduction\t\16HTTP_HOS"..., 1231}, {"\1\4\1", 8}, {"\1\5\1", 8}, {NULL, 0}], 7) = 1271
12403 19:27:50 read(49, 0xb9978ff8, 8192) = -1 EAGAIN (Resource temporarily unavailable)
12403 19:27:50 poll([{fd=49, events=POLLIN}], 1, 40000
# 15768 does munmap's
15768 19:27:50 brk(0xa08d000) = 0xa08d000
15768 19:27:50 exit_group(0) = ?
12403 19:27:50 ) = 1 ([{fd=49, revents=POLLIN|POLLERR|POLLHUP}])
12403 19:27:50 read(49, 0xb9978ff8, 8192) = -1 ECONNRESET (Connection reset by peer)

6) We observe here that process 12403 connects to the Unix domain socket successfully, writes to it and polls it for incoming data. Moments later, process 15768, which is responsible for handling the request, terminates. This causes the interruption of the poll and the ECONNRESET error to be raised by the read system call that follows.

So, the problem arises because the PHP FastCGI process terminates without handling the pending request. As stated here (http://httpd.apache.org/mod_fcgid/mod/mod_fcgid.html#examples) PHP FastCGI processes terminate after PHP_FCGI_MAX_REQUESTS (which defaults to 500) requests. This is probably done in order to deal with possible memory leaks. There is clearly a race condition here and it occurs when a request arrives in the time slot after the PHP FastCGI process handles the last request and before it exits. During this time window, it still accepts connections but it won’t serve them. One possible solution would be to set the FcgidMaxRequestsPerProcess directive in Apache’s configuration to 500, thus instructing FastCGI to terminate the process at the same time as PHP does.

4 Responses to “Troubleshooting “Error reading data from FastCGI server””

  1. Premature end of script headers Says:

    […] Нагуглил дебаг и решение проблемы в одном блоге. […]

  2. ladadadada Says:

    I just wanted to commend you on an excellent post. Far too many of these sorts of things just say “I got this error message and made some random change and now it works… lol”

    Yours shows the entire debugging process and even annotated strace dumps which enables people to figure out whether they have the same problem as you or not and why their is different, if it is.

    It didn’t solve my problem but it’s still an excellent post.

  3. Sergey Danyushin Says:

    An excellent post indeed. Solved my issue and provided great insight into the process. Thanks!

  4. Jay Says:

    Great job, very helpful post!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

%d bloggers like this: