Friday, February 3, 2012

Apache and CGI

If you've been reading this blog, you should know by now that Apache uses modules to extend its functionality. But what if you needed it to interact with an external program or script, what then? CGI (Common Gateway Interface) is a protocol built just for that. It defines a way in which Apache can interact with an external "CGI" script.

When you run a program or script in a terminal, the keyboard is stdin, and what you type is sent to the program as input. Output from the program (from printf statements and the like) goes to stdout, which is usually the terminal window. CGI programs operate the same way, except that the server, not a keyboard, sends input (such as POST arguments) to the program through stdin, and it receives the program's output through stdout. The program needs no modification since it's using standard interfaces. It should run just like a regular invocation. Clever.
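To make the stdin/stdout handoff concrete, here is a minimal sketch in Python that plays the role of the server. The names run_cgi and CHILD_SOURCE are made up for this illustration, and the child is just a tiny inline Python program standing in for a real CGI script. This is not how Apache is implemented, only the same idea in miniature.

```python
import os
import subprocess
import sys

# A stand-in "CGI program" (hypothetical): it reads stdin and writes stdout,
# exactly as it would when run from a terminal.
CHILD_SOURCE = (
    "import os, sys\n"
    "body = sys.stdin.buffer.read()\n"
    "out = sys.stdout.buffer\n"
    "out.write(b'Content-type: text/plain\\r\\n\\r\\n')\n"
    "out.write(b'method=' + os.environ['REQUEST_METHOD'].encode() + b' body=' + body)\n"
)


def run_cgi(body: bytes, method: str = "POST") -> bytes:
    """Play the server: pass the request body on stdin and collect stdout."""
    env = dict(os.environ)
    env["REQUEST_METHOD"] = method
    env["CONTENT_LENGTH"] = str(len(body))
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        input=body,
        stdout=subprocess.PIPE,
        env=env,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(run_cgi(b"post1=postval").decode())
```

The child never knows whether a keyboard or a server is on the other end of stdin, which is the whole point.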

CGI thus acts as glue between the server and server-side scripts. Sure, you can use PHP, Java Servlets, ASP, SSI (Server Side Includes) or any other server-side technology for this, and you probably should, but CGI is an option too, and it's always good to know what your options are. You can use any scripting language that has a command-line interpreter for writing CGI scripts. You can, in fact, use any programming language (compiled or otherwise) for this purpose. As long as the program/script writes output to stdout in the right format, it should work. It's that simple.

Hello, World
The server needs to know how to run a particular script. Maybe the code you wrote is in Perl, or maybe Python. You specify the interpreter to use in the first line. Again, this can be anything, even a C program you wrote. Specify the path in full. Let's try something in Perl.
#!/usr/bin/perl
You also need to specify the type of the response data. It could be HTML (text/html), an image (image/jpeg, image/png, etc.) and so on. When Apache serves a regular file to a user, it fills in Content-type and other headers for you. Not here though. Let's output HTML. You can add parameters such as charset after the content type by separating them with a semicolon. End the headers with \r\n\r\n.
print "Content-type: text/html; charset=iso-8859-1\r\n\r\n";
print "Hello, World.";
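The same response can be put together in Python. This is only a sketch; cgi_response is a name I made up, and the defaults simply mirror the Perl example above.

```python
def cgi_response(body: str,
                 content_type: str = "text/html",
                 charset: str = "iso-8859-1") -> str:
    """Build a CGI response: header line, blank line, then the body."""
    header = f"Content-type: {content_type}; charset={charset}"
    # The blank line (\r\n\r\n) tells the server where the headers end.
    return header + "\r\n\r\n" + body


if __name__ == "__main__":
    print(repr(cgi_response("Hello, World.")))
```

Forget the blank line and the server has no way to tell headers from body, so it is worth building the response in one place.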
Now let's build a little something in C. First print out the Content-type and other headers.
printf("Content-type: text/html; charset=iso-8859-1\n\n");
POST arguments are sent through stdin. Read them. The environment variable CONTENT_LENGTH is set to the number of bytes written to stdin. On my Windows system (not tested on other platforms), however, this value is always greater than the actual string length by 2 bytes. One of them could be a \n sent by the server to terminate the string in stdin. What about the other one? I don't know. Googling it didn't help either.
char buff[1000] = "";
scanf("%999s",buff);  /* width limit keeps scanf from overflowing buff */
const char *len = getenv("CONTENT_LENGTH");
printf("POST: %s (%s bytes)",buff,len ? len : "0");
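Another way to read the POST data is to take exactly CONTENT_LENGTH bytes from stdin instead of scanning for a string. Here is a rough Python sketch of that idea. The helper name read_post_body is hypothetical, and it takes the environment and the input stream as parameters so it can be tried without a server.

```python
import io
from typing import BinaryIO, Mapping


def read_post_body(environ: Mapping[str, str], stream: BinaryIO) -> bytes:
    """Read exactly CONTENT_LENGTH bytes of request body from the stream."""
    raw_length = environ.get("CONTENT_LENGTH") or "0"
    try:
        length = int(raw_length)
    except ValueError:
        length = 0  # treat a malformed value as "no body"
    if length <= 0:
        return b""
    return stream.read(length)


if __name__ == "__main__":
    fake_stdin = io.BytesIO(b"post1=postval")
    print(read_post_body({"CONTENT_LENGTH": "13"}, fake_stdin))
```

In a real script you would pass os.environ and sys.stdin.buffer. Reading by length also means any extra bytes after the body are simply ignored.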
QUERY_STRING is an environment variable that contains the GET arguments sent to the server.
printf("GET: %s",getenv("QUERY_STRING"));
Now let's also see what else is being sent to our program by the system. In my system, exactly 2 arguments are being sent - the first one, being the program name and the second one, the script's name.
int i;
for(i=0; argv[i]!=NULL; i++)
  printf("%d: %s",i,argv[i]);
Let's also print 2 forms for submitting arguments. Note: Arguments are submitted as URL encoded strings. That means you are not going to get an associative array of arguments, just one string with keys and values separated by = and each key-value pair separated by &. Decode it appropriately.
printf(
"<form method='post'>"
"<input type='submit'/>"
"<input type='text' value='postval' name='post1'/>"
"</form>"
"<form method='get'>"
"<input type='submit'/>"
"<input type='text' value='getval' name='get1'/>"
"</form>"
);
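As for decoding those URL encoded strings: in C you would have to split on & and = yourself, but to show what the decoded result looks like, here is a small sketch using Python's standard urllib.parse module. The function name decode_arguments is my own.

```python
from urllib.parse import parse_qs


def decode_arguments(encoded: str) -> dict:
    """Turn 'key=value&key2=value2' into a dict of key -> list of values."""
    # parse_qs also undoes the URL encoding: '+' becomes a space and
    # %XX escapes become the characters they stand for.
    return parse_qs(encoded)


if __name__ == "__main__":
    print(decode_arguments("post1=postval"))
    print(decode_arguments("get1=hello+world%21&tag=a&tag=b"))
```

Each value is a list because the same key can appear more than once in the string.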
The complete program should look like this.
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
  printf("Content-type: text/html; charset=iso-8859-1\n\n");
  char buff[1000] = "";
  scanf("%999s",buff);  /* width limit keeps scanf from overflowing buff */
  const char *len = getenv("CONTENT_LENGTH");
  printf("POST: %s (%s bytes)",buff,len ? len : "0");
  printf("GET: %s",getenv("QUERY_STRING"));
  int i;
  for(i=0; argv[i]!=NULL; i++)
    printf("%d: %s",i,argv[i]);
  printf(
  "<h1>Hello World</h1>"
  "<form method='post'>"
  "<input type='submit'/>"
  "<input type='text' value='postval' name='post1'/>"
  "</form>"
  "<form method='get'>"
  "<input type='submit'/>"
  "<input type='text' value='getval' name='get1'/>"
  "</form>"
  );
  return 0;
}

Running the code
Save any compiled programs or scripts in Apache's cgi-bin folder (In XAMPP, it is /xampp/cgi-bin/). Apache uses this folder for CGI scripts. The files under this folder have executable permissions. For security reasons, Apache uses a single folder for this purpose and scripts aren't just spread across the server like HTML files. When a file is requested from this folder, Apache attempts to run it like a program/script rather than just display the contents of that file. You can either use a ScriptAlias directive to specify the cgi-bin folder for the server,
ScriptAlias /cgi-bin/ /usr/local/apache/cgi-bin/
or explicitly give a folder CGI execution permissions using the Options directive.
<Directory /usr/local/apache/htdocs/scripts>
  Options +ExecCGI
</Directory>
If your server has a configuration like
AddHandler cgi-script cgi
save all your scripts and executables with the .cgi extension. Remember to chmod all scripts to give them executable permissions. Also note (if you haven't realized by now) that all this config data goes into the httpd.conf file. You also need mod_cgi to run CGI applications. It comes with Apache version 2.0 and above, so if you have one of those, you're covered.

ApacheBench reveals that a simple Hello World tops out at around 70-75 req/sec on my system, which is definitely better than our mod_python based Apache module but still poor compared to something like PHP, probably because a new process is created for every single request. Now if only there was a way to avoid all this unnecessary CPU and I/O work.
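If you want a feel for the process-per-request cost on your own machine, here is a rough Python sketch that times the same trivial "Hello, World." work done in-process and done by spawning a new process each time. The function names are made up, this is not a substitute for a real ApacheBench run, and the numbers will vary from system to system.

```python
import subprocess
import sys
import time


def handle() -> str:
    """The 'work' done for each request."""
    return "Hello, World."


def time_in_process(n: int) -> float:
    """Seconds taken to handle n requests inside one running process."""
    start = time.perf_counter()
    for _ in range(n):
        handle()
    return time.perf_counter() - start


def time_new_process(n: int) -> float:
    """Seconds taken when every request starts a fresh process, CGI style."""
    cmd = [sys.executable, "-c", "print('Hello, World.')"]
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 5
    print("in-process: ", time_in_process(n))
    print("new process:", time_new_process(n))
```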

Fast(er)CGI?
The above problem with CGI is exactly what led to the development of FastCGI. It is a high-performance extension to CGI that provides persistence to your applications. Persistence means that the application stays in memory, so there's no more initialization or other related overhead for every request. One thing to note is that existing CGI applications don't work with FastCGI right out of the box and need to be ported. Let us convert a simplified version of our Hello World example to FastCGI.
#include "fcgi_stdio.h"
#include <stdlib.h>
int main(int argc, char *argv[])
{
  int req = 0;
  while(FCGI_Accept()>=0)
  {
    printf("Content-type: text/html; charset=iso-8859-1\n\n"
    "<h1>Hello World</h1>"
    "Request No: %d",++req);
  }
}
Here FCGI_Accept() is a blocking function. When there's a new request, the code in the while loop is executed. Otherwise, it blocks and the program just sits in memory until a new request arrives at the server. Thus, any initialization is done only once. To port code, you need to split it so that request-specific code is inside the while loop, while code that needs to run just once goes before the loop. Also note that the header file we included, fcgi_stdio.h, isn't the regular stdio.h we normally include. Finally, to run FastCGI on Apache, you need a module like mod_fastcgi or mod_fcgid.

I don't have actual benchies with me to show you how fast FastCGI really is, yet. But I'll be sure to post them here once I do. In the meantime, why don't you try FastCGI out for yourself? The devkit is available for C, C++, Perl and Java.

Saturday, January 28, 2012

PHP Compiler Part Three: Building An Apache Module

Apache exposes an API for programmers to extend its functionality. Even PHP is run as a module. Without this module, Apache would serve .php files in the same way it does .jpg's and .html's - There'd be no parsing of the script and you'd get just the source code.

The usual way to build an Apache module is to use C. Let me be honest, I never did like C. I started programming with Visual Basic 6. If you've used it, you'll know how much simpler it is than a language like C/C++. But the world loves C and so do the people at Apache. Their examples and sample source code are all in C. One look at their Hello World example and I was already looking for a simpler way to get things done.

Enter mod_python/mod_perl
These are modules for Apache too but they implement a Python/Perl interpreter within Apache. So building a module for Apache is as simple as writing a Python/Perl program. I ended up using mod_python because I know Python, but I hear that Perl is fast. Really fast.

Get mod_python from here. You will have to compile the Linux version first and copy it to your modules folder (/apache/modules/), whereas the Windows version is available as an installer. Note that mod_python for Windows needs Python 2.5. After installing/copying, add the following to your httpd.conf Apache config file along with the other LoadModule lines already present.
LoadModule python_module modules/mod_python.so
You will probably have to restart your server after this. Next, add one of the following configs to either your .htaccess file or your httpd.conf file. (Quick lesson: .htaccess is for per-directory config while httpd.conf sets the server-wide config, although httpd.conf can make directory-level changes too. Also, .htaccess files are read at run-time, when Apache serves a request from that directory, whereas httpd.conf settings are loaded at startup. As a result, you will have to restart the server when you make changes in httpd.conf.)

The Python Handler
You should use the publisher handler if you plan on using multiple .py files as separate modules.
PythonHandler mod_python.publisher
The publisher handler locates the module specified in the URL. It also allows access to functions and variables within a module through the URL. E.g., accessing the URL http://localhost/test.py/func1?var1=value would locate the module test.py, execute the function func1 in it and set the variable var1 equal to value. index() is the default function that will be called if nothing is specified after test.py in the above URL.
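To picture how a publisher-style URL breaks down, here is a small standalone sketch. It is not mod_python's actual code, just my illustration of the mapping described above, and split_publisher_url is a made-up name.

```python
from urllib.parse import parse_qs, urlsplit


def split_publisher_url(url: str):
    """Split a URL into (module file, function name, arguments)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    module_file = segments[0] if segments else ""
    # Nothing after the module name means the default function, index().
    function = segments[1] if len(segments) > 1 else "index"
    args = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return module_file, function, args


if __name__ == "__main__":
    print(split_publisher_url("http://localhost/test.py/func1?var1=value"))
    print(split_publisher_url("http://localhost/test.py"))
```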

Another option is to have all your code in a single .py file (eg, test.py as below - extension not specified).
PythonHandler test
Here, a request to any file ending with .py will be handled by test.py which has a handler(req) function which receives a req request object. Apache internals may be accessed through this object to get details about headers, method, connection, filename, etc.

The Code
This is the code I used for this project.
from mod_python import apache, util
import os

def handler(req):
 req.content_type = 'text/plain'
 file = os.path.splitext(req.parsed_uri[apache.URI_PATH][1:])[0]+".exe"
 file = os.path.split(req.filename)[0]+"/"+file
 out = os.popen4('"'+file+'"')[1].read()
 form = util.FieldStorage(req, keep_blank_values=1)
 for i in form:
  out = out + "\n" + i + ":" + form[i]
 req.write(out)
 return apache.OK
When a module is called with the request filename set as something.py, what it does is execute something.exe, read its output from stdout (lines 6-8) and write it back to the user. Lines 9-11 are just to show you how you access GET arguments.
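The path juggling in the handler is the least obvious part, so here it is on its own as a standalone sketch that runs without Apache. The function name exe_for_request and the C:/xampp/htdocs path are assumptions for illustration, and I use posixpath so that forward slashes behave the same on every platform.

```python
import posixpath


def exe_for_request(uri_path: str, filename: str) -> str:
    """Map a request for something.py to something.exe in the same folder."""
    # '/out.py' -> 'out'
    stem = posixpath.splitext(uri_path[1:])[0]
    # 'C:/xampp/htdocs/out.py' -> 'C:/xampp/htdocs'
    directory = posixpath.split(filename)[0]
    return directory + "/" + stem + ".exe"


if __name__ == "__main__":
    print(exe_for_request("/out.py", "C:/xampp/htdocs/out.py"))
```

The handler then runs that executable and sends whatever it printed back to the browser.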

And The Results
I used Apache Bench to benchmark this setup. If you are using xampp, you should find it under /Program Files/xampp/apache/bin/ab.exe (It's probably available even with regular Apache distributions. I don't really know). I compiled out.php (shown below) and saved it as out.exe in the server document root.
<?php
echo 'Hello World', "\n";
?>
These are the ApacheBench results (with a few unnecessary details stripped out).
>ab -n 1000 -c 50 http://localhost/out.php
Document Path:          /out.php
Document Length:        12 bytes

Concurrency Level:      50
Time taken for tests:   1.635 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      336000 bytes
HTML transferred:       12000 bytes
Requests per second:    611.59 [#/sec] (mean)
Time per request:       81.755 [ms] (mean)
Time per request:       1.635 [ms] (mean, across all concurrent requests)
Transfer rate:          200.68 [Kbytes/sec] received


>ab -n 1000 -c 50 http://localhost/out.py
Document Path:          /out.py
Document Length:        12 bytes

Concurrency Level:      50
Time taken for tests:   39.737 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      292000 bytes
HTML transferred:       12000 bytes
Requests per second:    25.17 [#/sec] (mean)
Time per request:       1986.864 [ms] (mean)
Time per request:       39.737 [ms] (mean, across all concurrent requests)
Transfer rate:          7.18 [Kbytes/sec] received
Notice the two Requests per second figures: 611.59 for out.php against 25.17 for out.py. Congratulations if you expected this the second I mentioned mod_python in this post! The reason performance is piss poor is that:
a) Python is an interpreted language
b) PHP and its Apache module have been built for exactly this kind of work and are thus more optimized.
c) Extra steps in the request handling phase when using mod_python, like loading the interpreter every time, etc.
d) I'm a lazy idiot who shouldn't have taken the easy way.
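If you want to check how the ApacheBench figures above hang together, the arithmetic is simple enough to redo by hand. Here is a small sketch using the numbers from the two runs. The function names are mine, and tiny differences from ab's output come from ab rounding the time it prints.

```python
def requests_per_second(total_requests: int, seconds: float) -> float:
    """Throughput: completed requests divided by total test time."""
    return total_requests / seconds


def mean_time_per_request_ms(concurrency: int, total_requests: int,
                             seconds: float) -> float:
    """Mean time one request waits, given `concurrency` requests in flight."""
    return concurrency * seconds / total_requests * 1000


if __name__ == "__main__":
    php = requests_per_second(1000, 1.635)
    py = requests_per_second(1000, 39.737)
    print("PHP req/sec:       ", round(php, 2))
    print("mod_python req/sec:", round(py, 2))
    print("PHP is faster by a factor of", round(php / py, 1))
```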

Bye bye mod_python. Looks like I'll have to brush up my C coding skills after all. Head over to Part 4 to see how you build an Apache module the right way. Also here's something else you could try if you absolutely insist that you're not going to code a module.