I wanted to monitor a number of web sites that either did not provide RSS feeds or whose RSS feeds were misbehaving. So I ended up writing a Python utility that does exactly that: given a list of URLs and some rules, it retrieves the pages, stores them and compares each one with its previously stored version.
The utility is composed of several components:
- The WebBrowser component is responsible for fetching data from the configured URLs. It supports HTTP cookies, it can submit HTML forms in order to log in to web sites, and it sends the user agent string of Mozilla Firefox so as not to trigger weird behaviour in some web sites.
- The Storage component is responsible for storing the retrieved web pages in an SQLite database. Each web page is compressed using zlib and stored along with its URL and its retrieval timestamp. This component can also read the two most recent versions of a web page from the database for comparison.
- The HTMLReport component handles the generation of the HTML report that summarises the changes found in the configured URLs. It uses Mako Templates for Python in order to easily generate the desired HTML output.
- The ConfParser component handles reading the configuration file, which is written in YAML, through PyYAML.
- The BaseDiffEngine component is an abstract component which is the basis for creating components that compare specific web sites. Comparing HTML documents is accomplished through the lxml library. Each URL can be set in the configuration file to use a specific difference engine for comparison.
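To make the Storage description above more concrete, here is a minimal sketch of how such a component could look using only the standard library. It is not the actual sitemon code: the class layout, the table name `pages`, the column names and the method names `save` and `latest_two` are all assumptions made for illustration.

```python
import sqlite3
import time
import zlib


class Storage:
    """Hypothetical stand-in for the Storage component (not sitemon's real code)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        # Assumed schema: one row per retrieved version of a page.
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " url TEXT NOT NULL,"
            " retrieved_at REAL NOT NULL,"
            " body BLOB NOT NULL)"
        )

    def save(self, url, html, retrieved_at=None):
        """Compress the page with zlib and store it with its URL and timestamp."""
        if retrieved_at is None:
            retrieved_at = time.time()
        blob = zlib.compress(html.encode("utf-8"))
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT INTO pages (url, retrieved_at, body) VALUES (?, ?, ?)",
                (url, retrieved_at, blob),
            )

    def latest_two(self, url):
        """Return up to two of the most recent versions of a page, newest first."""
        rows = self.db.execute(
            "SELECT body FROM pages WHERE url = ?"
            " ORDER BY retrieved_at DESC, id DESC LIMIT 2",
            (url,),
        ).fetchall()
        return [zlib.decompress(body).decode("utf-8") for (body,) in rows]
```

With something like this, a difference engine would receive the two strings returned by `latest_two(url)` and only needs to decide how to compare them.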
Along with the sitemon utility come some sample difference engines:
- The Comparison engine compares web sites line-by-line as if someone used the diff command line utility on them.
- The DiffInvision, DiffVBulletin and DiffPHPBB engines compare pages from the respective forum software (Invision's IP.Board, vBulletin and phpBB). More precisely, these engines compare the topic summary pages that these forums provide, so that you can see if a new post has been made.
You can download the sitemon utility, along with the difference engines described above, from here: sitemon.
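As a rough idea of how an abstract base engine and a line-by-line engine could fit together, here is a small sketch. It assumes a single `compare(old, new)` method and uses the standard `difflib` module to imitate the diff utility; the real engines may have a different interface and, as noted above, rely on lxml for HTML comparison.

```python
import difflib
from abc import ABC, abstractmethod


class BaseDiffEngine(ABC):
    """Hypothetical base class; the method name and signature are assumptions."""

    @abstractmethod
    def compare(self, old, new):
        """Return a list of change lines, or an empty list if nothing changed."""


class Comparison(BaseDiffEngine):
    """Line-by-line comparison, similar in spirit to running diff -u."""

    def compare(self, old, new):
        diff = difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",  # input lines carry no newline, so add none to headers
        )
        return list(diff)
```

A forum-specific engine would subclass `BaseDiffEngine` in the same way, but extract only the topic rows from each page before comparing them.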
Below you can see a sample configuration file for comparing the Dropbox home page after logging into it.
    authentication:
      https://www.dropbox.com :
        method: post
        url: https://www.dropbox.com/login
        params:
          login_email: HIDDEN
          login_password: HIDDEN
          remember_me: on
          t: ''
    sites:
      - url: https://www.dropbox.com/home
        diff_engine: Comparison
        validations:
          - xpath: //a[contains(@href,"/logout")]
            should_exist: Yes
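To illustrate how the authentication entry could drive the login step, here is a sketch that turns such an entry into a form submission using only the standard library. The helper names `make_opener` and `build_login_request`, the exact user agent string and the hand-written `auth` dictionary are assumptions for this example, not sitemon's actual code.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Example Firefox user agent string; the one sitemon sends may differ.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

# The authentication entry from the sample configuration, written by hand as a dict.
# Note: PyYAML follows YAML 1.1 and would load the unquoted `on` as the boolean
# True, so this sketch spells the form value as a string instead.
auth = {
    "method": "post",
    "url": "https://www.dropbox.com/login",
    "params": {
        "login_email": "HIDDEN",
        "login_password": "HIDDEN",
        "remember_me": "on",
        "t": "",
    },
}


def make_opener():
    """Build an opener that keeps cookies between requests and poses as Firefox."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", USER_AGENT)]
    return opener


def build_login_request(auth):
    """Turn an authentication entry into a form submission request."""
    encoded = urllib.parse.urlencode(auth["params"])
    if auth["method"].lower() == "post":
        return urllib.request.Request(
            auth["url"], data=encoded.encode("ascii"), method="POST"
        )
    return urllib.request.Request(auth["url"] + "?" + encoded, method="GET")
```

Calling `make_opener().open(build_login_request(auth))` would send the form; because the same opener holds the cookie jar, a later request for the monitored page made through it would carry the session cookies.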