
Donated By Intel: Milestone 5

Goal:

To add autocompletion functionality to the current Genealogy program. To do this, we had to gather information from genealogy databases on the internet, supporting a total of four databases.

Purpose:

To gain some experience with web page mining and with the HtmlParser class built into Squeak. The main goal was to make our project more functional by allowing the user to fill in some pieces of information automatically.

Design:

We followed our Milestone 3 design pretty closely for this milestone. We had an abstract Site object, and four Website objects that implemented the parsing functions specified by the abstract class, each in its own site-specific way. We then had a Completer object, which holds references to the various website objects and directs them to gather data when appropriate. The individual website objects gather the information and put it into a generic format; the Completer then pulls the information from all four websites in that generic format and provides a single listing of possible information to use in the completion.

Here is the chunk of our UML concerning the classes created for M5:
[External image: UML class diagram of the M5 classes]
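
As a rough sketch of that hierarchy, written in the usual Class >> selector shorthand (the class and selector names here are reconstructed from the description above, not copied verbatim from our source):

"Abstract superclass: every database site answers results in the shared generic format."
Object subclass: #Site
	instanceVariableNames: ''
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Genealogy-Completion'.

"Concrete subclasses, one per database, override this with their own parsing."
Site >> resultsForFirstName: firstName lastName: lastName
	^ self subclassResponsibility

"The Completer (holding the site objects in its sites variable) fans the query out and flattens the answers."
Completer >> resultsForFirstName: firstName lastName: lastName
	| results |
	results := OrderedCollection new.
	sites do: [:site |
		results addAll: (site resultsForFirstName: firstName lastName: lastName)].
	^ results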

Implementation:

We added a new button to our Genealogy Control Panel. When the user presses this button, a search is done for the currently selected Person by first and last name. If a birth or death date is present for the person, the results from the internet are filtered by those dates as well. The user is then given a list of possible data to use.
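
That date filter is simple in spirit; here is a minimal sketch, assuming each match and each Person answer birthYear and deathYear (nil when unknown; the accessor names are illustrative, not necessarily our actual code):

"Keep only matches whose dates don't contradict what we already know about the person."
Completer >> filter: matches for: aPerson
	^ matches select: [:match |
		(aPerson birthYear isNil
			or: [match birthYear isNil or: [match birthYear = aPerson birthYear]])
		and: [aPerson deathYear isNil
			or: [match deathYear isNil or: [match deathYear = aPerson deathYear]]]]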

After playing with TGen for a short period of time, I found the HtmlParser that was already part of Squeak and decided to use that instead. It let me simply connect to the web pages, and it returns all of the page's markup as one big nested OrderedCollection. My functions then had to walk those OrderedCollections to extract the proper data for each website. I found that the easiest way to get at the data was to look at what the PHP URL looked like for each page and then plug in the desired information. So, to do a search in the Social Security Death Index database, I started with code like this:

| url mimeDoc htmlTree |
"lastName and firstName come from the currently selected Person."
url := 'http://ssdi.rootsweb.com/cgi-bin/ssdi.cgi?lastname=', lastName, '&firstname='.
url := url, firstName, '&nt=exact'.

"Fetch the page and hand its text to the parser."
mimeDoc := url asUrl retrieveContents.
htmlTree := HtmlParser parse: mimeDoc content.

htmlTree then holds the big, nasty OrderedCollection representation of the web page. So, to make the HtmlParser work, you just need to feed a string (mimeDoc content) to the parse class method.
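
The extraction methods themselves just recurse through that nesting. They are site-specific and fragile, but the general shape is something like this sketch, which assumes (as described above) nested OrderedCollections with strings at the leaves; textLeavesOf:into: is an illustrative name:

"Recursively collect every string leaf from the parser's nested collections.
Strings are tested first because in Smalltalk a String is itself a Collection."
textLeavesOf: node into: aCollection
	(node isKindOf: String)
		ifTrue: [aCollection add: node]
		ifFalse: [(node isKindOf: Collection)
			ifTrue: [node do: [:child | self textLeavesOf: child into: aCollection]]].
	^ aCollection

A site-specific method can then scan the collected strings for the names and dates it cares about.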

Squeak presented some problems when working with the internet. I needed a way to keep Squeak from trying to connect to dead websites; if the internet connection was lost after doing a single search, Squeak didn't realize it and blindly tried to connect to nothing; and if the connection died in the middle of a search, the entire image no longer functioned, forcing me to file my code into fresh images. All of these problems were eventually figured out, and the resulting system was solid and usable.
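
One way to guard against failures like these is to wrap every fetch in an exception handler, so that a dead site answers nil instead of hanging the image. A sketch (safeContentsOf: is an illustrative helper name, and it catches the broad Error class because the exact exception signaled depends on the network layer):

"Answer the page contents, or nil if the connection fails."
safeContentsOf: urlString
	^ [urlString asUrl retrieveContents]
		on: Error
		do: [:ex | nil]

Callers can then test for nil and simply skip that database rather than blocking on it.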

We searched for missing information at the following websites:
California Death Index
Social Security Death Index
Cemetery Records
Huge World Connect DB

Usage:

First, create a new map by running 'GenealogyMap new open.' in the workspace, and create a person:

[External image: screenshot of creating a new person on the Genealogy Map]


Then, after selecting the databases to search in, you choose what you want to complete:

[External image: screenshot of selecting databases and choosing what to complete]


After a short pause, the program presents the user with a list of matches. If the user selects any of the possible data matches, that data is added to the person on the Genealogy Map.

[External image: screenshot of the list of possible matches]

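Under the hood, applying a selected match just copies its fields onto the Person; a minimal sketch with illustrative accessor names:

"Copy whichever fields the chosen match supplies onto the selected person."
applyMatch: match to: aPerson
	match birthDate isNil ifFalse: [aPerson birthDate: match birthDate].
	match deathDate isNil ifFalse: [aPerson deathDate: match deathDate]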

Conclusion:

Other than a couple of problems Squeak had with the internet, this milestone was not too terrible. The functions that dig through the OrderedCollection are difficult to follow, and they become useless as soon as a web page changes, but this is the norm in web page mining. Creating a universal parser that could overcome changes to the actual web sites would be extremely difficult, if not impossible. Our program now allows users to autocomplete birth or death information for people, checking four good databases for possible data.

For more information about what this milestone involved, take a look at the Fall2002 M5 page.

Here is the source for just this milestone, if you would rather exclude the later milestone code in the final .st on our main case page.

M5.st
