Thursday, January 15, 2009

Creating WAR Web Tools

Web tools are what I call sites like Waralytics and WAR Watcher. They take information from one place and use it to provide a new service. This post aims to serve as an introduction to how to create such a tool. It will require some programming knowledge, but nothing difficult. If you don't know how to program, now is a great time to start!

Purpose: First off, what do we want to do? This example will show a list of Warhammer Online servers. This may not be particularly useful by itself, but it serves as a great starting point.

Where is the data? Next, we need to find a place to get the server list. It just so happens this information is readily available on the Mythic Realm War page. So the answer to our question is: http://realmwar.warhammeronline.com/realmwar/Index.war

Yep, it is just the web page. More often than not, this will be the primary source for web tools. Other times it is an actual web service or an XML feed.

Get the data. Now we need to do three basic steps: read the web page, pull out the information we want, and then display it.

If you haven't already, bring up the webpage in your browser of choice. When it loads, view its source (Right Click, View Page Source). You should see a lot of HTML. That is what we will have to go through to pull out the server information.

What we are doing is commonly called parsing or screen scraping. It is not the most robust way of doing things, but we have little choice. To do this, I use the Python programming language. Most any scripting language (e.g. Perl, Ruby, VBScript) can do the same. They are easy to learn (compared to C) and have a number of tutorials available.

To make my life even easier, I use a Python module called BeautifulSoup. This module goes through all of that HTML and parses it for us, so we can easily navigate our way through. Now, on to the code!

First, we fetch the web page. This puts all the raw HTML in the html variable.


import urllib2

url = 'http://realmwar.warhammeronline.com/realmwar/Index.war'

response = urllib2.urlopen(url)
html = response.read()
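This post uses Python 2, where the module is called urllib2; on Python 3 the same function lives in urllib.request. A minimal sketch of the equivalent fetch, using a data: URL as a stand-in page so it runs even without the Realm War site (swap in the real URL for live data):

```python
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen

# Stand-in page so this snippet runs offline; replace with the
# Realm War URL above to fetch the real thing.
url = 'data:text/html,<html><body>server list</body></html>'

response = urlopen(url)
html = response.read().decode('utf-8')  # urlopen returns bytes in Python 3
print(html)
```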


Next, we give it to BeautifulSoup to parse and do all the hard work for us.

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)


We still need to figure out where our data is in that web page. From looking at it, we can see that the server name is contained within a div tag with a class name of PairSelect-Name. That should do nicely, so let's pull it out.

for x in soup.findAll('div'):
    if x.has_key('class') and x['class'] == 'PairSelect-Name':
        print x.a.string


In the above code, we go through each div tag, checking for ones that have a class attribute with the name we want. For each match, we pull out its anchor tag (a) and use its string version (x.a.string), which gives us what the anchor tag contains: the server name.
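If you want to try the extraction step without installing BeautifulSoup (or without fetching the live page), Python's built-in HTMLParser can do the same job on a sample of the markup. The sample HTML and server names below are illustrative, based on the structure described above:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

# Illustrative sample of the Realm War markup described above.
SAMPLE = '''<div class="PairSelect-Name"><a href="#">Badlands</a></div>
<div class="PairSelect-Name"><a href="#">Dark Crag</a></div>'''

class ServerNameParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_name_div = False  # inside a PairSelect-Name div?
        self.in_anchor = False    # inside the anchor within it?
        self.names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div' and attrs.get('class') == 'PairSelect-Name':
            self.in_name_div = True
        elif tag == 'a' and self.in_name_div:
            self.in_anchor = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_name_div = False
        elif tag == 'a':
            self.in_anchor = False

    def handle_data(self, data):
        if self.in_anchor:
            self.names.append(data)

parser = ServerNameParser()
parser.feed(SAMPLE)
print(parser.names)
```

It is more verbose than the BeautifulSoup version, which is exactly why I reach for BeautifulSoup when I want rapid development.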

That is it! It only took about 9 lines to pull down the server names.

In the next installment, I will show how you can make this information available to everyone else.

5 comments:

I did that about a month ago using PHP for my [EU] "Warhammer Online signature image generator". It uses data from the official [EU] database to create self-updating signature images that can be used in forums etc.

Feel free to have a look at
http://totmacher.de/WAR/Signature/index.php

I have always liked the signature generators. They are also a great example of a web tool.

I would love to see an article on combining graphics and data from WAR to make those sigs.

Cool stuff. I used urllib2 and lxml to get the data for http://www.warheap.com/players/search

Never heard of Beautiful Soup before now. Looks pretty nice though.

I did check out lxml first. It is an effective but more complex module. I was going for rapid development, so BeautifulSoup fit the bill. One thing about it, it does eat up some CPU, as it parses the whole document.

Beautiful Soup eats up a lot of CPU? lxml parses the whole document also and then stores it into a tree. I'm not sure if these solutions or regex would be better as far as efficiency goes, but I find it a lot easier to do tree traversals than regex.
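For comparison, the regex route mentioned above looks something like this against a sample of the Realm War markup (the sample and server names are illustrative, not taken from the live page):

```python
import re

# Illustrative sample of the markup the post parses.
html = '''<div class="PairSelect-Name"><a href="#">Badlands</a></div>
<div class="PairSelect-Name"><a href="#">Dark Crag</a></div>'''

# Quick and dirty; this breaks as soon as Mythic changes the markup,
# which is why tree traversal tends to be the safer choice.
names = re.findall(r'<div class="PairSelect-Name"><a[^>]*>([^<]+)</a>', html)
print(names)
```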

Post a Comment