9. Stocking a CGI programming tool box


9.1 Validating CGI form/URL input

Installing programs on the Internet where anyone can provide input to them creates a special challenge for program designers, who need to ensure the security and integrity of these applications and the validity of the data they receive.

Security

Security requires that these programs don't have loopholes which can be exploited by crackers seeking to break into a system to steal information, deny service to legitimate users or cause damage. Professional programmers tend to use the word "hacking" to describe legitimate and exploratory programming to investigate systems and solve problems. The term "cracking" is used for criminal theft of data, system hijacking and electronic vandalism.

To design secure programs which can receive input from literally anyone with an Internet connection, it is generally considered better to accept only data which is known to be safe and legitimate than to try to anticipate and exclude every kind of insecure input.

For example it is much simpler and likely to be more secure to accept valid email address entries which comply with current Internet conventions, than to allow for all of the various outdated email address encodings for messages historically gatewayed over the Internet, for receipt from and delivery to other networks, e.g. Bitnet, UUCP, JANET coloured book, OSI X500. The disadvantage is that occasionally you might exclude a legitimate address which uses one of these old-fashioned address forms.

A technique often used by crackers involves sending exceptionally long responses to programs to try to get these to overrun the memory allocated for the input, to make the program jump to and execute instructions inserted by the cracker. These buffer overruns are less likely to be a problem with scripting languages such as Python where all memory needed for input is dynamically allocated, than with 'C' where memory is more likely to be allocated in fixed size blocks at compile time. A programming technique which defends against buffer overruns is to reject any input which is over a maximum size considered appropriate for the field input in question.

Integrity

Some HTML forms will have a set of required fields, and may have some fields which are not always required. For example a contact details form might usefully ask for a fax number, but this information would not be required unless users of this form require fax machines. It might be useful for the required fields on the form to be marked as such e.g. with an asterisk (*), and for the user to be returned to the partially completed form if not all required fields are completed. For an example of a required field, someone who wants to join an email list would not be able to do so without an email address.
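
The required-field check described above can be sketched as a small helper. This is only an illustration: the name missing_fields() and the idea of holding the submitted form as a plain dictionary of strings are assumptions made for this example, not part of cgiutils.py. It returns a list rather than printing, so it runs unchanged under Python 2 or 3.

```python
def missing_fields(form, required):
    """Return the names in `required` that are absent or blank in `form`.

    `form` is assumed to be a plain dictionary mapping field names to the
    string values submitted by the user (a hypothetical layout).
    """
    missing = []
    for name in required:
        value = form.get(name, "")
        if not value.strip():      # blank or whitespace-only counts as missing
            missing.append(name)
    return missing


# hypothetical contact form: the fax number is optional
required = ["name", "email"]
submitted = {"name": "Ann", "email": "   ", "fax": ""}
still_needed = missing_fields(submitted, required)   # ['email']
```

If the returned list is not empty, the program can send the partially completed form back to the user and name the fields which still need completing.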

Validity

In general CGI programmers must take special care not to execute program instructions input by application users. Care is also needed where users input values which will be used to make names of files on the server, as these could overwrite or read things our CGI users should not have access to. Frequent use is made of Regular Expressions to ensure that inputs are secure and valid. These can accept what we know is valid and reject what we know is not valid.
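
As an illustration of the file name risk, the sketch below builds a path from user input only if the input consists of letters, digits and underscores. The function name safe_path(), the ".txt" suffix and the 30 character limit are assumptions chosen for this example.

```python
import os
import re

def safe_path(base_dir, name, max_leng=30):
    """Return a path inside base_dir for name, or None if name is unsafe.

    Only letters, digits and underscores are accepted, so the name cannot
    contain '/', '..' or other characters with a special meaning.
    """
    if len(name) > max_leng:
        return None
    # \Z anchors at the very end of the string, so a trailing newline fails
    if not re.match(r"[A-Za-z0-9_]+\Z", name):
        return None
    return os.path.join(base_dir, name + ".txt")
```

A call such as safe_path("/var/data", "../etc/passwd") returns None, so the program can reject the request instead of opening a file outside its own data directory.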

9.2 Validation functions

Below are 2 functions which can be used for validating various inputs and mail addresses. The first: is_valid() is the most generally applicable. This accepts the following parameters:

def is_valid(data,max_leng=30,allowed=r"^\w+$",prohibited=None):
    """data (string) is valid if <= max_leng,
         AND if an allowed RE is specified:  must match allowed.
         AND must not match prohibited RE
         Defaults: max_leng: 30, allowed: string of 1 or more \w word chars"""
    import re
    if len(data) > max_leng:
        return 0 # not valid
    if allowed:
        match_allowed=re.search(allowed,data)
        if not match_allowed:
            return 0 # not valid
    if prohibited: # function can also be used to exclude prohibited RE
        match_prohibited=re.search(prohibited,data)
        if match_prohibited:
            return 0 # not valid
    return 1 # passed all tests so should be valid

The second validation function: is_email() is more specialised. It accepts an email address as its parameter and passes this as the data parameter of is_valid() . is_email also calls is_valid with values for max_leng of 100, and 2 regular expressions, the first of which all valid email addresses are expected to match and the second being a regular expression which would only match if the email address were invalid.

def is_email(address):
    """ Checks whether a supplied email address is valid or not.
      Allowed RE: any word characters , dot (.), hyphen (-) and at (@) .
      Prohibited RE: any dot or at in the wrong place,
      or no at or more than 1 at's or zero dot's."""

    return is_valid(address,max_leng=100,
                    allowed=r"^[\w\.\-@]+$",
                    prohibited=r"^[\.@]|[\.@][\.@]|@.*@|[\.@]$|^[^.]+$|^[^@]+$")
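
To see how these two functions behave, the sketch below repeats their logic in condensed form (so that the example runs on its own) and applies is_email() to a few sample addresses. The addresses are made up for illustration.

```python
import re

def is_valid(data, max_leng=30, allowed=r"^\w+$", prohibited=None):
    # same tests as is_valid() above, condensed so this example runs alone
    if len(data) > max_leng:
        return 0
    if allowed and not re.search(allowed, data):
        return 0
    if prohibited and re.search(prohibited, data):
        return 0
    return 1

def is_email(address):
    return is_valid(address, max_leng=100,
                    allowed=r"^[\w\.\-@]+$",
                    prohibited=r"^[\.@]|[\.@][\.@]|@.*@|[\.@]$|^[^.]+$|^[^@]+$")

samples = [
    "fred@example.com",        # 1: passes every test
    "fred.example.com",        # 0: no @
    "fred@@example.com",       # 0: two @ characters together
    "fred@example",            # 0: no dot
    ".fred@example.com",       # 0: starts with a dot
    "fred smith@example.com",  # 0: space is not an allowed character
    "fred@example.com\n",      # 1: $ also matches before a final newline
]
results = [is_email(address) for address in samples]
```

The last sample shows a limitation: in Python regular expressions $ matches just before a newline at the end of the string as well as at the very end. If that matters for your application, one option is to anchor the allowed pattern with \Z instead of $.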

9.3 Packaging some useful CGI and mail functions

CGI programming is potentially complex, in that it involves mixing 2 languages, Python and HTML. However, to make this complexity manageable, we will be doing the same kinds of thing in much the same way in various different programs, and trying to keep Python and HTML in different files where possible. This means we will need to make use of Python modules and functions, and design these in a reusable manner. We made a start last week with functions to send a mail message and to start an HTML output file. Let's put these functions into a module, cgiutils.py, and see if we can extend them to make them more reusable. We can also put the 2 data validation functions, is_valid() and is_email(), into this module.

A reusable html_header() function

If we are going to be using the html_header() function in many places, which aspects of it do we need to customise? The contents of the <title> tag are used again as an <H1> heading. Let's also parameterise the background colour attribute of the <body> tag, and give the title and bgcolor parameters suitable default values in case we forget to specify these when calling this function:

def html_header(title="CGI generated HTML",bgcolor='"#FFFFFF"'):
    # prints HTTP/HTML header to browser
    print "Content-type: text/html\n"
    print "<html><Head><Title>",title,"</Title></Head>"
    print "<Body bgcolor=",bgcolor,"><H1>",title,"</H1>"

Similar techniques can be used with other HTML tag attributes if you are likely to make much use of them, e.g. LINK, VLINK etc.

A reusable send_mail() function

Our send_mail() function is already parameterised so it can be called with any message and from and to addresses. To make it more self contained, the function will also import the smtplib module. It is less likely to be useful to specify defaults for the parameters unless your applications are always using the same fromaddress or toaddress or both of these.

def send_mail(fromaddress,toaddress,message):
	# sends a mail message
	import smtplib
	server = smtplib.SMTP('localhost') # create server object
	server.sendmail(fromaddress, toaddress, message) # send message
	server.quit() # close mail server connection
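
The message parameter is passed to the mail server as it stands, so it normally has to contain its own header lines followed by a blank line and then the body text. The sketch below shows one way such a message might be assembled. The helper name build_message() is an assumption for this example; it also refuses header values containing line breaks, since these would let user input add header lines of its own.

```python
def build_message(fromaddress, toaddress, subject, body):
    """Return message text: minimal headers, a blank line, then the body.

    Raises ValueError if a header value contains a line break.
    """
    for value in (fromaddress, toaddress, subject):
        if "\n" in value or "\r" in value:
            raise ValueError("line break in header value")
    headers = [
        "From: " + fromaddress,
        "To: " + toaddress,
        "Subject: " + subject,
    ]
    # header lines are separated by CRLF; an empty line ends the headers
    return "\r\n".join(headers) + "\r\n\r\n" + body


message = build_message("webmaster@example.com", "fred@example.com",
                        "Test", "Hello from the CGI program.")
```

The resulting string could then be passed as the third parameter of send_mail().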

9.4 Ending the HTML

The more interesting our CGI programming becomes, the more situations will occur where we will need to end an HTML file output by a CGI program. The common situations where this will occur are:

a. When we wish to assure the user that data submitted has been received and that a transaction has been completed or is being completed behind the scenes.

b. When an error has occurred and the user will need to resubmit their input.

c. When, as a result of data submitted earlier by the user, or of the user initiating an interaction, the CGI program sends an HTML form to the user's browser requesting further information. This also has to end the HTML output to the user because, once a form has been sent for them to fill in, the ball is in their court.

In cases a. and b. it helps the user to continue navigating and interacting with our website if the user is given a link to the home page. The following function handles cases a. and b :

def html_end(error=""):
    # ends an html with error message or assurance.
    # Change homepage reference to your own homepage URL
    homepage='"http://copsewood.net/"'
    if error: # only happens if this parameter is filled in
        print "<p> ",error,"</p>" # print details of the error
        print "<p>An error has occurred. Please press the &lt;back&gt; "
        print "key on your browser and correct/complete your input</p>"
    else:
        print "<p>Thanks for your input which has been received and is "
        print "being processed.</p>"

    # useful to give a way out at the end of a dialog or session.
    print "<p>You can return to our "
    print "<a href=%s>home page</a> if you want.</p>" % homepage
    print "</body></html>" # end html
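
If the error text passed to html_end() includes anything the user typed, it is safer to replace the characters which have a special meaning in HTML before printing it. The helper below is a minimal sketch of this. The name escape_html() is an assumption for this example and is not part of cgiutils.py.

```python
def escape_html(text):
    """Replace the characters that have a special meaning in HTML."""
    text = text.replace("&", "&amp;")   # must be done first
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace('"', "&quot;")
    return text


shown = escape_html('<script>alert("x")</script>')
# shown is now: &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;
```

A call such as html_end(escape_html(user_text)) would then display the user's text literally rather than letting the browser interpret it as markup.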

9.5 Maintaining session state

Some useful CGI interactions can be provided for members of the general public over the Internet without needing positively to identify who the users are. But this limits data access to information you are willing to make publicly available. Where there is a need for your web site to collect information of an acceptable quality from multiple users, either:

a. all inputs e.g. comments in response to a published article, have to be moderated and edited by having people with the required knowledge available on a round the clock rota, or provided through delayed site updates, or:

b. the web application must restrict access to data based on knowledge of who the users are, e.g. through a login/password protocol.

Option a. is either time consuming and introduces delays or is very expensive. Option b. allows for many interesting web applications to be fully automated. A user-friendly design will involve the user submitting information in stages, e.g. starting with an authentication form and then other forms will be sent and processed in sequence. Expecting the user to provide all the information needed for the interaction in a single stage would require too complex a form and would be inflexible for the user.

Breaking down the interaction into stages presents a difficulty because HTTP is a stateless protocol. This means that, without additional methods to maintain information connected with the state of an interaction, a simple web-server is expected to give the same response every time to the same input.

To solve requirement b, however, methods of maintaining session state, i.e. knowledge by the server of the state or condition of its user sessions are required. A session is a set of interactions between the web server and a user with a beginning and an end, and which may involve a number of web pages being generated by one or more CGI programs and viewed on the user's browser, with forms being completed and submitted.

9.6 Use of cookies in maintaining state

A cookie is a block of information sent by the web-server to the web browser, together with a request that the client accept it for local storage. This information is indexed under the domain name of the server (e.g. copsewood.net ) so that when the user goes back to this domain the data block can be retrieved and resent to the server. This approach can be used for lightweight authentication (e.g. login) requirements.

Cookie information is sent from the CGI program to the client using extra HTTP headers. (Remember to put one newline between each HTTP header, and 2 newlines between the HTTP headers and the HTML.) Cookies returned by a client are made available by the web-server to the CGI program using the HTTP_COOKIE environment variable. For further details and a working CGI example see:
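
A minimal sketch of both directions is given here, using the SimpleCookie class from the standard library (module Cookie in Python 2, http.cookies in Python 3). The function names cookie_header() and read_cookie(), and the cookie name "session", are assumptions made for this example.

```python
try:
    from http.cookies import SimpleCookie   # Python 3
except ImportError:
    from Cookie import SimpleCookie         # Python 2

def cookie_header(name, value):
    """Return a Set-Cookie header line for one name=value pair."""
    cookie = SimpleCookie()
    cookie[name] = value
    return cookie.output()

def read_cookie(environ, name):
    """Return the named cookie's value from HTTP_COOKIE, or None."""
    cookie = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    if name in cookie:
        return cookie[name].value
    return None


header = cookie_header("session", "7516")
# a dictionary stands in for os.environ in this demonstration
value = read_cookie({"HTTP_COOKIE": "session=7516; theme=dark"}, "session")
```

In a CGI program the header line would be output before the blank line which ends the HTTP headers, and os.environ would be passed to read_cookie() on the next request.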

http://www.python.org/doc/current/lib/module-Cookie.html and also http://www.algonet.se/~ug/html+pycgi/scripts.html.

9.7 Use of data stored on the server in maintaining state

It is possible for a CGI program to create, remove, read, write and update files on the web server in the same ways any Python program can. Such files can be dedicated to particular sessions, e.g. so the server can store the status of a particular board game being played with a particular client, and retrieve this information when another move is played.

The web application designer will need to bear in mind the requirements and risks of concurrent file access, as it may be possible for more than one client to request update of the same file at the same time. If this occurs it is possible for a file to become corrupted, though for an experimental or lightly-used site this risk is small. This requirement can be handled more simply if different sessions with multiple clients don't need to share any data. In this situation each session could be allocated a different file. Concurrent file and database storage and locking issues and strategies will be examined during week 11. Until then we will be using simpler, but more vulnerable file storage techniques.
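
The one-file-per-session approach might look something like the sketch below. The function names, the "session_" file name prefix and the digits-only session key are assumptions for this example. No locking is attempted, so it is only suitable where each session has a file of its own.

```python
import os
import re
import tempfile

def session_file(directory, key):
    """Return the state file path for a session key made only of digits."""
    if not re.match(r"[0-9]{1,20}\Z", key):
        raise ValueError("bad session key")
    return os.path.join(directory, "session_" + key + ".txt")

def save_state(directory, key, state):
    """Write the state string to this session's own file."""
    out = open(session_file(directory, key), "w")
    out.write(state)
    out.close()

def load_state(directory, key):
    """Return the stored state string, or None for an unknown session."""
    path = session_file(directory, key)
    if not os.path.exists(path):
        return None
    inp = open(path, "r")
    state = inp.read()
    inp.close()
    return state


# demonstration using a temporary directory instead of a real data directory
demo_dir = tempfile.mkdtemp()
save_state(demo_dir, "7516", "X.O......")   # e.g. a noughts and crosses board
restored = load_state(demo_dir, "7516")
```

Checking the key before using it in a file name means that a session key altered by the user cannot be used to reach files outside the chosen directory.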

9.8 Use of hidden form fields to authenticate users and track session

In an authenticated application, the server will do nothing for a user other than to enable him or her to log in (and help someone find out why they might want to) until this stage has been completed. The server then sends another form to the user. When input from this form is submitted, the server needs to know which user session the form is connected to.

This is achieved by sending a copy of the userid and an access code (e.g. a PIN) to the user embedded in an HTML form using hidden fields. These hidden fields don't appear in the user's web browser window, but they become visible if the user chooses to view the HTML source of the page. These field values are sent back to the server with the other name=value data pairs when the form is submitted. Normally an ID and a password are sent and returned within each part of the session to prevent someone from impersonating another user.

This requires the ability to send a form as part of a dynamic (i.e. CGI program generated) HTML page sent to the browser. Most of this form will stay the same. The ID and PIN will be changed to identify the session. In an on-line game application the user doesn't need to be authenticated with a userid and password or PIN. The game server must still maintain client sessions, so that more than one person can play the game against the server at the same time and the server can track the state of each client's game. This can be done using a file for each session, and the form sent back to the client for each move will include a session key. This key is typically created as a random number, and can use a hidden form field, URL query string code or cookie which is sent back and forth between client and server on each move, to identify and maintain the session. The server will identify the file storing the state of a client's game between moves using this session key.

One solution to this requirement is for the fixed HTML part of the form to be read from an external file. Variable parts of the form (hidden field values) can be passed to the function as a list parameter. Keeping the HTML, most of which will be unchanged, as a separate file has the advantage of keeping HTML and Python source codes apart which makes both easier to maintain. By putting %d, %f and %s Python print formatting escape codes within the fixed HTML text stored in the file, the variable fields can be interpolated after the file is read and before the modified form is output to the browser. We will convert this parameter list to a tuple before interpolating it, so that we can use our standard Python string interpolation syntax which expects a tuple.

def send_form(html_file,vars=[]):
    # sends an html form (containing end html tags) to browser
    # Name of file containing standard HTML is in html_file parameter.
    # This file may contain %d, %f and %s and other Python string escapes.
    inp = open(html_file,'r')
    form=inp.read() # read HTML form from html_file
    inp.close() # finished with the file once its contents are read
    tvars=tuple(vars) # convert vars form variables list to tuple
    print form % tvars  # send session coded form to browser
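
One detail to watch with this kind of interpolation is that any literal percent sign in the HTML file (e.g. width="100%") has to be written as %% , and the number of values supplied has to match the number of escape codes. The sketch below shows the interpolation step on its own. The helper name fill_form() and the sample HTML are assumptions for this example.

```python
def fill_form(template, values=()):
    """Interpolate values into template text and return the result."""
    return template % tuple(values)


template = ('<table width="100%%"><tr><td>'
            '<INPUT TYPE="hidden" NAME="id" VALUE="%s">'
            '<INPUT TYPE="hidden" NAME="pin" VALUE="%d">'
            '</td></tr></table>')
filled = fill_form(template, ["fred", 7516])

# supplying too few values raises TypeError
try:
    fill_form(template, ["fred"])
    too_few_detected = False
except TypeError:
    too_few_detected = True
```

After interpolation the %% appears in the output as a single % character.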

9.9 Creating a session key or PIN

In some applications it is preferable to allow the user to select their own password. This has the benefit that the user might be able to choose a password they can remember, and the disadvantage that they may choose something others can easily guess.

The other solution is for the server to choose a password or PIN randomly. The same approach can be used to generate a session key, which the user normally wouldn't need to be aware of. The user would be sent the PIN and would be expected to remember it. Given that the user is likely to forget the PIN, it is useful for your software to store the user's email address, so that the same PIN can be resent to the address registered for the user whenever they request it.

In a low-security application the PIN can be kept short enough to easily remember (e.g. 4 digits) and sent as plain-text over the Internet. In a high-security application you would need to use encrypted connections (e.g. using the HTTPS or HTTP Secure protocol) and longer passwords or PINs chosen from a larger range. For these course notes I will be using simpler, low-security approaches. In practice these probably give a similar level of privacy as when your snail-mail is sent using ordinary envelopes, which are easy for certain people to open or scan by various means. However, social constraints generally prevent messengers or people sharing access to the same front door from doing this.

Despite these weaknesses and the losses they incur, banks still seem to find it cheaper to send credit cards and PIN numbers by snail mail than to increase the management costs incurred in using more secure methods for distributing these keys to your money, e.g. by making you turn up at a staffed bank branch, passport in hand, for you to collect your credit cards and PINs in person. You might, however, have to do this if your address contains what the banks consider to be a risky postcode.

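The following code generates a random session key or PIN between a specified minimum and maximum. Note that random.randrange(1000,9999) returns values from 1000 to 9998, because the upper limit itself is excluded; use random.randrange(1000,10000) if 9999 should also be possible: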

>>> import random
>>> random.randrange(1000,9999)
7516
>>> random.randrange(1000,9999)
9804
>>> random.randrange(1000,9999)
7565

This is good enough for a 4 digit PIN transmitted using plain-text over the Internet. However, the random module shouldn't be used for strong cryptographic purposes: its generator is deterministic, and the Wichmann-Hill generator used by older Python versions repeats its sequence of outputs every 6,953,607,871,644 cycles.
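
Where something less predictable is wanted, one option is the random.SystemRandom class, which draws on the operating system's own source of randomness. The sketch below assumes this is available on your platform. The function name make_pin() is an assumption for this example.

```python
import random

def make_pin(low=1000, high=9999):
    """Return a random integer N with low <= N <= high."""
    generator = random.SystemRandom()   # uses os.urandom() underneath
    # randrange() excludes its upper limit, hence high + 1
    return generator.randrange(low, high + 1)


pin = make_pin()
```

Even so, a 4 digit PIN has only 9000 possible values, so the strength of the generator matters less than the short length of the PIN itself.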

9.10 Testing cgiutils.py module code interactively

Before we can place our cgiutils module into the Python reuse world we will have to test and debug it. A Python convention is to place some unit testing code at the end of a module which will be run if the module is run as a main program, but is not run if the module is imported by another program. This can be arranged by checking at run/import time if we are in the module named __main__ or some other module. If the __name__ built-in variable has the value "__main__" then the program is being run directly and not imported. The test code should be capable of exercising the facilities of every function within the module.

Testing and debugging CGI programs is difficult, because the environment in which they run doesn't always give us as much information as we can get for programs with a more direct user interface. However, given that they produce HTML onto the standard output, we can run CGI programs interactively, and cut and paste the HTML output into a text editor, save this as an HTML file and check validity of this output using a web browser. We should probably do this to check they work as much as possible before we install them into the CGI environment where faults are less readily visible.

def tester():
  import string
  while 1: # loop around test menu until user quits
    print "enter	to test"
    print "  1	data validation"
    print "  2	Send mail message"
    print "  3	Normal HTML"
    print "  4	Error HTML"
    print "  5	Form HTML"
    print "  6     Quit"
    test=int(raw_input("option: "))
    if test == 1:
        data=raw_input("enter data to be validated as a string")
        max=int(raw_input("enter max data length as an int"))
        allowed=raw_input("enter allowed regular expression")
        prohibited=raw_input("enter prohibited RE or None")
        if prohibited == "None": prohibited=None
        if is_valid(data,max,allowed,prohibited):
            print "data is valid"
        else:
            print "invalid data detected"
    elif test == 2:
        message=raw_input("input send_mail() test message: ")
        toaddress=raw_input("enter to address for test message: ")
        fromaddress=raw_input("enter from address for test message: ")
        if is_email(fromaddress) and is_email(toaddress):
            send_mail(fromaddress,toaddress,message)
            print "message sent"
        else:
            print "one or both addresses were invalid"
    elif test == 3:
        html_header(title="Testing cgiutils normal HTML",bgcolor='"#FFFF88"')
        html_end()
    elif test == 4:
        error=raw_input("input error message")
        html_header(title="Testing cgiutils error HTML",bgcolor='"#FF88FF"')
        html_end(error)
    elif test == 5:
        # the form facility is more complex to setup,
        # this requires a custom HTML file and variable data for a
        # secure form driven application. Play around with
        # http://copsewood.net/letsplay to get the general idea of how
        # PIN and ID fields stay associated with a user session,
        # and to register your own ID and PIN which can be used here.
        html_header(title="Testing cgiutils form HTML",bgcolor='"#88FFFF"')
        scratch=raw_input("enter name for scratch file: ")
        id=raw_input("enter value for letsplay ID field: ")
        import random
        pin=int(raw_input("enter (int) letsplay PIN or 0 for a random one: "))
        if not pin: pin=random.randrange(1000,9999)
        form="""
	<form action="http://copsewood.net/cgi-bin/wantsform.pl"
	method="post">
	<INPUT TYPE="hidden" NAME="id" VALUE="%s" >
	<INPUT TYPE="hidden" NAME="pin" VALUE="%d" >
	<p><b>View Wants</b> :
	<INPUT TYPE=Checkbox NAME="view"><p>
	<INPUT TYPE="submit"></form></body></html>"""
        sfile=open(scratch,"w") # write HTML form to scratch file
        sfile.write(form)
        sfile.close()
        send_form(scratch,vars=[id,pin]) # test send_form function
    elif test == 6:
        break
    else:
        print "invalid option. Try again\n"
    raw_input("press enter to continue testing")
if __name__ == '__main__': # true when run, false when imported
    tester()

9.11 Further developing the cgiutils module

What other functions can usefully be added to our cgiutils module?

The cgiutils.py module will be extended in subsequent weeks with other utilities. These will include utilities to generate random user PINs and session identifiers, and to create HTML tables from simple databases. You are encouraged to identify functions which are useful to your own CGI programming work and to develop your own modules in order to customise the way you develop your own interactive web sites.

You may prefer to download the standard cgiutils.py module from the website when it contains updates which you find useful and put functions which you like to use in addition to this in your own importable module. This avoids the need for you to duplicate the work of patching cgiutils.py as bugs are found and fixes are made available. At a certain stage in your learning process you are likely to need to seek out publicly available and more advanced Python CGI and HTML generation tools. My own cgiutils.py module will be kept simple enough for student use, so I won't be getting it to do everything.