Web site monitor

I wanted to monitor a number of web sites that either did not provide RSS feeds or whose RSS feeds were misbehaving. So I wrote a Python utility that does exactly that: given a list of URLs and some rules, it retrieves the pages, stores them and compares them.

The utility is composed of several components:

  • The WebBrowser component is responsible for retrieving data from the configured URLs. It supports HTTP cookies, can submit HTML forms in order to log in to web sites, and uses the user agent string of Mozilla Firefox so as not to trigger odd behaviour in some web sites.
  • The Storage component is responsible for storing the retrieved web pages in an SQLite database. Each web page is compressed using zlib and stored along with its URL and its retrieval timestamp. This component can also read the two most recent versions of a web page from the database for comparison.
  • The HTMLReport component handles the generation of the HTML report that summarises the changes found in the configured URLs. It uses Mako Templates for Python in order to easily generate the desired HTML output.
  • The ConfParser component handles reading the configuration file, which is written in YAML, through PyYAML.
  • The BaseDiffEngine component is an abstract component which is the basis for creating components that compare specific web sites. Comparing HTML documents is accomplished through the lxml library. Each URL can be set in the configuration file to use a specific difference engine for comparison.
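To make the Storage description more concrete, here is a minimal sketch of how zlib-compressed pages could be kept in SQLite and how the two most recent versions of a URL could be read back. It is not the actual sitemon code: the table layout, the method names `save` and `latest_two`, and the in-memory default path are all assumptions made for illustration.

```python
import sqlite3
import time
import zlib


class Storage:
    """Hypothetical page store: zlib-compressed bodies in an SQLite table."""

    def __init__(self, path=":memory:"):
        # The schema below is an assumption, not sitemon's real schema.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " url TEXT NOT NULL,"
            " retrieved_at REAL NOT NULL,"
            " body BLOB NOT NULL)"
        )

    def save(self, url, html, retrieved_at=None):
        """Compress the page and store it with its URL and timestamp."""
        if retrieved_at is None:
            retrieved_at = time.time()
        blob = zlib.compress(html.encode("utf-8"))
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO pages (url, retrieved_at, body) VALUES (?, ?, ?)",
                (url, retrieved_at, blob),
            )

    def latest_two(self, url):
        """Return up to two most recent versions of a URL, newest first."""
        rows = self.conn.execute(
            "SELECT body FROM pages WHERE url = ?"
            " ORDER BY retrieved_at DESC, id DESC LIMIT 2",
            (url,),
        ).fetchall()
        return [zlib.decompress(body).decode("utf-8") for (body,) in rows]
```

With a store like this, a difference engine only needs to call `latest_two(url)` and compare the two strings it gets back; a URL that has been retrieved fewer than two times simply yields a shorter list.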

Along with the sitemon utility come some sample difference engines:

  • The Comparison engine compares web sites line-by-line as if someone used the diff command line utility on them.
  • The DiffInvision, DiffVBulletin and DiffPHPBB engines compare the respective forums (Invision’s IP.Board, vBulletin and phpBB). More precisely, these engines compare the topic summary pages that these forums provide (so that you can see if a new post has been made).
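As a rough illustration of how an abstract base engine and a line-by-line engine might fit together, the sketch below uses the standard library's `difflib` to produce diff-style output. The class names come from the description above, but the `compare` method, its signature and its return type are assumptions, and the real engines may work quite differently (for example, by parsing the HTML with lxml first).

```python
import difflib
from abc import ABC, abstractmethod


class BaseDiffEngine(ABC):
    """Assumed interface: every engine compares two versions of a page."""

    @abstractmethod
    def compare(self, old, new):
        """Return a list of change lines; an empty list means no change."""


class Comparison(BaseDiffEngine):
    """Line-by-line comparison, similar in spirit to the diff utility."""

    def compare(self, old, new):
        diff = difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",  # input lines carry no newlines, so add none
        )
        return list(diff)
```

A forum-specific engine would subclass `BaseDiffEngine` in the same way and override `compare` so that only the relevant parts of the page, such as the topic summary rows, are taken into account.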

You can download the sitemon utility, along with the difference engines described above, from here: sitemon.

Below you can see a sample configuration file for comparing the Dropbox home page after logging into it.

  https://www.dropbox.com :
    method: post
    url: https://www.dropbox.com/login
      login_email: HIDDEN
      login_password: HIDDEN
      remember_me: on
      t: ''
  - url: https://www.dropbox.com/home
    diff_engine: Comparison
      - xpath: //a[contains(@href,"/logout")]
        should_exist: Yes
