Current Links: Cases Final Project Summer 2007

Milestone 3 - Analysis

This milestone was, for our group, by far the most difficult in the entire project. For this milestone, we had to actually go and parse the websites and save all the articles in a text file. None of our teammates had ever done parsing like this before, but it was a great learning experience.

Parsing the websites was made fairly easy by Squeak's built-in HTML parser. We chose the HTML parser rather than T-Gen, or building our own, because we thought it was the easiest option and would save us a lot of time. The Reporter class went to the websites and retrieved the HTML, which was then passed to the Editor class. The Editor parsed through the HTML, extracting the headline and article text and storing them in an instance of Article. The Editor produced a list of Articles.
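The project itself was written in Squeak Smalltalk, but the Reporter/Editor/Article pipeline can be sketched in Python for illustration. The class names mirror the text; the choice of tags to extract (`h1` for the headline, `p` for the body) is an assumption standing in for the site-specific patterns discussed below:

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Article:
    headline: str
    text: str


class Editor(HTMLParser):
    """Parses raw HTML, pulling the headline out of <h1> and the
    article body out of <p> tags (a simplified stand-in for the
    real site-specific patterns)."""

    def __init__(self):
        super().__init__()
        self.headline = ""
        self.paragraphs = []
        self._current = None  # tag we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "h1":
            self.headline += data
        elif self._current == "p":
            self.paragraphs.append(data)


def edit(html):
    """Turn one page's HTML (as fetched by a Reporter) into an Article."""
    editor = Editor()
    editor.feed(html)
    return Article(editor.headline, " ".join(editor.paragraphs))


article = edit("<h1>Big News</h1><p>First.</p><p>Second.</p>")
# article.headline == "Big News"; article.text == "First. Second."
```

In the real project the Reporter would fetch the HTML over the network and call the Editor once per page, accumulating the resulting Articles in a list.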

The parsing methods worked recursively. The HTML parser generates a tree, basically an ordered collection of ordered collections. To extract the information, you have to visit each and every node of the tree. By opening an explorer on the object returned by Squeak's HTML parser, one can see exactly how Squeak sets up this parse tree.
Every HTML tag is a node in the tree, and everything between the start and end tags is a child of that node. Each tag has its attributes and contents; the contents may themselves be another list of tags.
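That "ordered collection of ordered collections" shape, and the visit-every-node traversal, can be sketched as follows (in Python rather than Smalltalk; the `(tag, attributes, children)` node layout is an assumption, not Squeak's actual parse-node class):

```python
# A parse-tree node modeled as (tag, attributes, children); a text
# leaf is a plain str. Children may themselves be tag nodes, mirroring
# the nested ordered collections described above.
def visit(node, out):
    """Recursively visit every node, collecting text leaves in order."""
    if isinstance(node, str):
        out.append(node)
        return
    tag, attrs, children = node
    for child in children:
        visit(child, out)


tree = ("html", {}, [
    ("body", {}, [
        ("h1", {}, ["Headline"]),
        ("p", {"class": "story"}, ["Some ", ("b", {}, ["bold"]), " text."]),
    ]),
])

texts = []
visit(tree, texts)
# texts == ["Headline", "Some ", "bold", " text."]
```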

All these tags represent some kind of HTML entity, and each entity has its own distinct functionality. The way we parsed the HTML was by looking for patterns: after exactly which pattern do we start grabbing text for the article, and after which pattern do we stop. The tree was traversed recursively, and each node was checked against a pattern and either added to the article text or not.
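The start-pattern / stop-pattern idea can be illustrated with a minimal sketch (again in Python; the marker tag names `story-start` and `story-end` are hypothetical stand-ins for whatever patterns each site actually used):

```python
def extract_between(nodes, start_tag, stop_tag):
    """Walk a flat sequence of (tag, text) nodes, collecting text only
    after the start pattern is seen and stopping at the stop pattern --
    the "start grabbing / stop grabbing" rule described above."""
    collecting = False
    grabbed = []
    for tag, text in nodes:
        if tag == start_tag:
            collecting = True
            continue
        if tag == stop_tag:
            break
        if collecting and text:
            grabbed.append(text)
    return grabbed


nodes = [("div", ""), ("h2", "Headline"), ("story-start", ""),
         ("p", "Para one."), ("p", "Para two."),
         ("story-end", ""), ("p", "An ad, ignored.")]
body = extract_between(nodes, "story-start", "story-end")
# body == ["Para one.", "Para two."]
```

Making the start and stop patterns as general as possible is what kept the extractor from breaking every time a site's layout shifted slightly.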

Those were the logical steps by which the program worked.

We scrambled and finished our parsing. This was very rough: each site had its own set of patterns, and we had to account for the websites changing daily. By making the patterns general, we got our parsing working very close to perfect. Because of our lack of experience with parsing, and in particular with Squeak's HTML parser, we felt this was the toughest of all the milestones.
