Purpose of the Assignment
The general purpose of this assignment is to continue to explore network programming and more advanced concepts by building a simplified web cache, leveraging the web client and server constructed for Assignment #2. This assignment is designed to give you further experience in:
- writing networked applications
- the socket API in Python
- writing software supporting an Internet protocol
- techniques for managing network performance
You are required to implement in Python a stripped down and simplified web cache, leveraging the web server and web downloader client you implemented as part of Assignment #2. A sample implementation of Assignment #2 will be provided in the near future (after all submissions are in) that you can use as a basis for development in this assignment, if you would rather do that than use your own previous work.
Web caches can be complex, but we will be making a pretty straightforward one. As a refresher, you might want to quickly review the lesson materials covering web caches as part of our look at the Application Layer. In a nutshell, your cache will sit between your web client and web server. With the appropriate command line options, you will route requests from your web client to your web server through your cache, using it as a proxy. In the process, the cache will save copies of all files that pass through it for a period of time. Subsequent requests will use the cached version instead of going all the way to your web server to retrieve the files. Conditional GET requests and other mechanisms will be used to help make sure the cache doesn’t get out of sync with the files stored with your web server.
Here are some specific requirements and other important notes:
- Your web cache will be built using elements from the Assignment #2 client and server programs. As noted in the lessons, a web cache acts as a server to web clients and a client to web servers. So, it will require aspects of both Python programs.
- The web cache will require an expiration time configured in it. You can configure this via a command line parameter or have it defined as a constant in your cache program. Ordinarily, this expiration time would be big enough to avoid the cache being invalidated frequently, but that could make testing a pain. So, you might want this to be set to a minute or two for testing purposes. Making it easy to update will certainly help.
- To use your web cache, the web client now needs to support a -proxy command line option to specify the host and port number of cache (e.g. -proxy localhost:8000). This is in addition to previous command line parameters to specify the file/URL to get from your web server. When -proxy is specified, your client will make the same GET request that it would have made otherwise, but it instead connects to the cache at the host and port specified by -proxy and submits the request there rather than directly contacting your web server. (You will need to make sure your Host: header line in your client’s request includes the port number of the server as your cache will need it, if you did not do that in your Assignment #2. See this page for details on headers.) If -proxy is not provided, the client will behave as usual and just contact the web server directly and things will work as they did in Assignment #2.
- When your web client makes a request to the web cache, the cache first checks to see if it has a copy of the file requested stored locally. If it does not, it makes its own connection to the server and forwards the original GET request, downloads the file, and stores it locally. The cache then forwards the requested file back to the client. To organize files and avoid naming conflicts between servers, the cache will need to create a folder for each different server it connects to (with folder named after the server host name and port number). The cache will also need to create folders to store and organize files from a given server. For example, if your client requests http://localhost:8080/foo/bar.html, your cache will need to create a folder for the server (like localhost_8080) and a folder foo under this. In the end, your cache will have in its directory localhost_8080/foo/bar.html. If you had a second copy of your web server running on port 8888, its cached files would get stored under localhost_8888 instead.
- Merging the examples from the previous two bullets together, lets say your web cache and server are both on the same computer as your web client. You could then execute: python client.py -proxy localhost:8000 http://localhost:8080/foo/bar.html In this case, your client will connect to your cache running on localhost at port 8000. The client will make a GET request to the cache looking for the file /foo/bar.html with a header line specifying Host: localhost:8080. The cache, receiving this, will know that it ultimately has to go to a server on localhost, listening at port 8080 to retrieve the file in question as that was what was indicated in the Host: header line.
- If your web cache has a copy of a requested file stored locally, the cache retrieves the timestamp on the file for when it was written. The cache then checks its expiration time to see if the file has expired. If so, the cache treats things as if the file doesn’t exist, and downloads a new copy to overwrite the existing file. (The cache can also delete the file and then grab a fresh copy, if you’d prefer.). If the file has not expired, the cache issues a conditional GET to your web server using the timestamp to see if the server’s version has been modified since then. If it the server sends along a new version back to the cache, the cache updates its copy of the file accordingly. The cache then responds to client with the latest version of the file in response to its original request. (So, while the cache is doing its work, the client is left waiting for a response.)
- Your web server also needs to be modified to support an If-modified-since header from requestors to implement the conditional GET, as discussed in the lessons. If present, the server checks the timestamp of the requested file and either sends back the file if it is newer, or sends back a 304 Not Modified result otherwise. (This can be handled like other errors in Assignment #2, with the message hard coded in the server, or with a message stored in a corresponding 304.html file.)
- If the requested file has been deleted from your web server, your web cache will receive a 404 message in response to its conditional GET request to the server, in which case the cache deletes its cached file and returns the 404 message to the client.
- If you need additional hosts for multiple servers, you can use compute.gaul.csd.uwo.ca and/or cs3357.gaul.csd.uwo.ca as necessary.
You are to provide all of the Python code for this assignment yourself, except for code used from the Assignment #2 implementation provided to you. You are not allowed to use Python functions to execute other programs for you, nor are you allowed to use any libraries that provide HTTP request handling for you. (If there is a particular library you would like to use/import, you must check first.) All server code files must begin with server, all client files must begin with client, and all cache files must begin with cache. All of these files must be submitted with your assignment.
As an important note, marks will be allocated for code style. This includes appropriate use of comments and indentation for readability, plus good naming conventions for variables, constants, and functions. Your code should also be well structured (i.e. not all in the main function).
Please remember to test your program as well, since marks will obviously be given for correctness! You should transfer and cache several different HTML documents, as well as images or other binary files. You can then use diff to compare the original files and the downloaded files to ensure the correct operation of your client, server, and cache. To manipulate timestamps on your files for testing caching and modification times, you might find the touch command useful at the command line.