Friday, February 3, 2012

Apache and CGI

If you've been reading this blog, you should know by now that Apache uses modules to extend its functionality. But what if you needed it to interact with an external program or script, what then? CGI (Common Gateway Interface) is a protocol built just for that. It defines a way in which Apache can interact with an external "CGI" script.

When you run a program or script in the terminal, the keyboard is stdin and what you type is sent as input to the program. Output from the program (from printf statements and the like) is sent to stdout, which is usually the terminal window. CGI programs operate in a similar manner, except that the server, not a keyboard, sends the input (POST arguments and so on) through stdin and receives the program's output through stdout. The program needs no modification since it's using standard interfaces. It should run just like a regular invocation. Clever.

CGI thus acts as the glue between the server and server-side scripts. Sure, you can use PHP, Java Servlets, ASP, SSI (Server Side Includes) or any other server-side technology for this, and you probably should, but CGI is one such option too and it's always good to know what your options are. You can use any scripting language that has a terminal/command-line based interpreter for making CGI scripts. You can, in fact, use any programming language (compiled or otherwise) for this purpose. As long as the program/script outputs something through stdout in the right format, it should work. It's that simple.

Hello, World
The server needs to know how to interpret a particular script. Maybe the code you wrote is in Perl, or maybe Python. You specify the interpreter to use in the first line of the script. Again, this can be anything, even a C program you wrote. Specify the path in full. Let's try something in Perl.
#!/usr/bin/perl
You also need to specify the type of response data from the server. It could be HTML (text/html), an image (image/jpeg, image/png, etc.) and so on. When Apache serves a regular file to a user, it fills in Content-type and other headers for you. Not here though. Let's output HTML. You can add parameters such as charset to the Content-type header after a ; and put any further headers on their own lines. End the headers with \r\n\r\n
print "Content-type: text/html; charset=iso-8859-1\r\n\r\n";
print "Hello, World.";
Now let's build a little something in C. First print out the Content-type and other headers.
printf("Content-type: text/html; charset=iso-8859-1\n\n");
POST arguments are sent through stdin, so read them from there. The environment variable CONTENT_LENGTH holds the number of bytes written to stdin. On my Windows system (not tested on other platforms), this value is, however, always 2 bytes greater than the length of the string I read. One of them could be a \n sent by the server to terminate the string in stdin. What about the other one? I don't know. Googling about it didn't help either.
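char buff[1000] = "";
scanf("%999s",buff);  /* bounded read: at most 999 characters plus the terminator */
const char *len = getenv("CONTENT_LENGTH");  /* NULL when no body was sent */
printf("POST: %s (%s bytes)",buff,len?len:"0");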
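Since CONTENT_LENGTH tells you how many bytes to expect, another way to read the body is to ask for exactly that many bytes rather than scanning for a whitespace-terminated string. The CGI specification (RFC 3875) also says a script should not try to read more than CONTENT_LENGTH bytes, and that the server need not signal end-of-file. Below is a minimal sketch of that idea; read_body is a hypothetical helper name of mine, not part of any library, and it does not try to explain the 2-byte difference described above.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Read a request body of the length announced in CONTENT_LENGTH.
 * content_length is the raw environment string and may be NULL.
 * Returns the number of bytes stored in buf; buf is always NUL-terminated.
 */
size_t read_body(FILE *in, const char *content_length, char *buf, size_t buf_size)
{
    if (buf_size == 0)
        return 0;
    buf[0] = '\0';
    if (content_length == NULL)
        return 0;                      /* no body, e.g. a GET request */

    char *end;
    long announced = strtol(content_length, &end, 10);
    if (end == content_length || announced <= 0)
        return 0;                      /* missing or malformed length */

    size_t want = (size_t)announced;
    if (want > buf_size - 1)
        want = buf_size - 1;           /* truncate rather than overflow */

    size_t got = fread(buf, 1, want, in);
    buf[got] = '\0';
    return got;
}
```

In the program above, the call would look like read_body(stdin, getenv("CONTENT_LENGTH"), buff, sizeof buff).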
QUERY_STRING is an environment variable that contains the GET arguments sent to the server.
printf("GET: %s",getenv("QUERY_STRING"));
Now let's also see what else is being sent to our program by the system. In my system, exactly 2 arguments are being sent - the first one, being the program name and the second one, the script's name.
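/* argv ends with a NULL pointer; keep i++ out of the printf call so the order of evaluation is well defined */
for(int i=0; argv[i]!=NULL; i++)
  printf("%d: %s",i,argv[i]);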
Let's also print 2 forms for submitting arguments. Note: Arguments are submitted as URL-encoded strings. That means you are not going to get an associative array of arguments, just one string with keys and values separated by = and each key-value pair separated by &. Decode it appropriately.
printf(
"<form method='post'>"
"<input type='submit'/>"
"<input type='text' value='postval' name='post1'/>"
"</form>"
"<form method='get'>"
"<input type='submit'/>"
"<input type='text' value='getval' name='get1'/>"
"</form>"
);
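Decoding the URL-encoded string mentioned above could look something like the sketch below. It assumes the usual form encoding, where + stands for a space and %XX stands for the byte with hex value XX. Both url_decode and get_param are hypothetical helper names, and get_param compares keys in their raw (still encoded) form, which is fine for plain names like post1 and get1.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Convert one hex digit to its value; the caller guarantees isxdigit(c). */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return tolower((unsigned char)c) - 'a' + 10;
}

/*
 * Decode the first len bytes of src into dst: '+' becomes a space and
 * %XX becomes the byte XX. Output is truncated to fit dst_size and is
 * always NUL-terminated.
 */
void url_decode(const char *src, size_t len, char *dst, size_t dst_size)
{
    size_t o = 0;
    if (dst_size == 0)
        return;
    for (size_t i = 0; i < len && o + 1 < dst_size; i++) {
        if (src[i] == '+') {
            dst[o++] = ' ';
        } else if (src[i] == '%' && i + 2 < len &&
                   isxdigit((unsigned char)src[i + 1]) &&
                   isxdigit((unsigned char)src[i + 2])) {
            dst[o++] = (char)(hex_value(src[i + 1]) * 16 + hex_value(src[i + 2]));
            i += 2;                    /* skip the two hex digits */
        } else {
            dst[o++] = src[i];         /* ordinary character, copy as-is */
        }
    }
    dst[o] = '\0';
}

/*
 * Look up key in a string like "a=1&b=two" and write its decoded value
 * to dst. Returns 1 if the key was found, 0 otherwise.
 */
int get_param(const char *query, const char *key, char *dst, size_t dst_size)
{
    size_t key_len = strlen(key);
    const char *p = query;
    while (p != NULL && *p != '\0') {
        const char *amp = strchr(p, '&');
        size_t pair_len = amp ? (size_t)(amp - p) : strlen(p);
        if (pair_len > key_len && strncmp(p, key, key_len) == 0 &&
            p[key_len] == '=') {
            url_decode(p + key_len + 1, pair_len - key_len - 1, dst, dst_size);
            return 1;
        }
        p = amp ? amp + 1 : NULL;      /* move on to the next pair */
    }
    return 0;
}
```

The same function works for both kinds of input: pass it getenv("QUERY_STRING") for GET arguments or the buffer read from stdin for POST arguments.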
The complete program should look like this.
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
  printf("Content-type: text/html; charset=iso-8859-1\n\n");
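  char buff[1000] = "";
  scanf("%999s",buff);  /* bounded read: at most 999 characters plus the terminator */
  const char *len = getenv("CONTENT_LENGTH");  /* NULL when no body was sent */
  printf("POST: %s (%s bytes)",buff,len?len:"0");
  printf("GET: %s",getenv("QUERY_STRING"));
  /* argv ends with a NULL pointer; keep i++ out of the printf call */
  for(int i=0; argv[i]!=NULL; i++)
    printf("%d: %s",i,argv[i]);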
  printf(
  "<h1>Hello World</h1>"
  "<form method='post'>"
  "<input type='submit'/>"
  "<input type='text' value='postval' name='post1'/>"
  "</form>"
  "<form method='get'>"
  "<input type='submit'/>"
  "<input type='text' value='getval' name='get1'/>"
  "</form>"
  );
  return 0;
}

Running the code
Save any compiled programs or scripts in Apache's cgi-bin folder (in XAMPP, it is /xampp/cgi-bin/). Apache uses this folder for CGI scripts, and the files under it have executable permissions. For security reasons, Apache uses a single folder for this purpose, so scripts aren't spread across the server like HTML files. When a file is requested from this folder, Apache attempts to run it as a program/script rather than just send back the contents of that file. You can either use a ScriptAlias directive to specify the cgi-bin folder for the server,
ScriptAlias /cgi-bin/ /usr/local/apache/cgi-bin/
or explicitly give a folder CGI execution permissions using the Options directive.
<Directory /usr/local/apache/htdocs/scripts>
  Options +ExecCGI
</Directory>
If your server has a configuration like
AddHandler cgi-script cgi
save all your scripts and executables with the .cgi extension. Remember to chmod all scripts to give them executable permissions. Also note (if you haven't realized by now) that all this config data goes into the httpd.conf file. You also need mod_cgi to run CGI applications. It comes with Apache version 2.0 and above, so if you have one of those, you're covered.

ApacheBench reveals that a simple Hello World tops out at around 70-75 req/sec on my system. That is definitely better than our mod_python based Apache module but still poor compared to something like PHP, probably because a new process is created for every single request. Now if only there was a way to avoid all this unnecessary CPU and I/O work.

Fast(er)CGI?
The above problem with CGI is exactly what led to the development of FastCGI. It is a high-performance extension to CGI that provides persistence to your applications. Persistence means that the application is always in memory, so there is no more initialization or other related overhead for every request. One thing to note is that existing CGI applications don't work with FastCGI right out of the box and need to be ported. Let us convert a simplified version of our Hello World example to FastCGI.
#include "fcgi_stdio.h"
#include <stdlib.h>
int main(int argc, char *argv[])
{
  int req = 0;
  while(FCGI_Accept()>=0)
  {
    printf("Content-type: text/html; charset=iso-8859-1\n\n"
    "<h1>Hello World</h1>"
    "Request No: %d",++req);
  }
}
Here FCGI_Accept() is a blocking function. When there's a new request, the code in the while loop is executed. Otherwise, it blocks and the program just sits in memory till a new request arrives at the server. Thus, any initialization is done only once. To port code, you need to split it so that request-specific code goes inside the while loop, while anything that needs to be executed just once goes before the loop. Also note that the header file we included, fcgi_stdio.h, isn't the regular stdio.h we normally include. It is the devkit's replacement, which routes printf and friends to the FastCGI connection. Finally, to run FastCGI on Apache, you need a module like mod_fastcgi or mod_fcgid.
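To make the split described above a little more concrete, here is a rough sketch of how the two halves could be separated into functions. Everything in it (struct app_state, app_init, app_render) is a hypothetical name of my own, and it uses only plain C so it can be tried without the FastCGI devkit. In a real port, app_init would run once before the loop and app_render once per accepted request.

```c
#include <stdio.h>

/* Hypothetical application state that outlives individual requests. */
struct app_state {
    const char *title;       /* set once at startup */
    int requests_served;     /* updated on every request */
};

/* One-time setup: in a FastCGI port this runs before the accept loop. */
void app_init(struct app_state *app)
{
    app->title = "Hello World";
    app->requests_served = 0;
}

/*
 * Per-request work: in a FastCGI port this runs inside the accept loop.
 * Writes the complete response (headers plus body) into out and returns
 * the value reported by snprintf.
 */
int app_render(struct app_state *app, char *out, size_t out_size)
{
    app->requests_served++;
    return snprintf(out, out_size,
                    "Content-type: text/html; charset=iso-8859-1\n\n"
                    "<h1>%s</h1>"
                    "Request No: %d",
                    app->title, app->requests_served);
}

/*
 * Assumed shape of the FastCGI main(), mirroring the loop shown earlier:
 *
 *     struct app_state app;
 *     char page[512];
 *     app_init(&app);
 *     while (FCGI_Accept() >= 0) {
 *         app_render(&app, page, sizeof page);
 *         printf("%s", page);
 *     }
 */
```

Rendering into a buffer first is just one way to keep the request handler independent of which stdio header is in use. Printing directly inside the loop, as the example above does, works just as well.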

I don't have actual benchies with me to show you how fast FastCGI really is, yet. But I'll be sure to post them here once I do. In the meantime, why don't you try FastCGI out for yourself? The devkit is available for C, C++, Perl and Java.