How to attack the HTML monster

Introduction



There are a few core concepts you need to understand before you take on HTML parsing in Squeak. First, you need a solid grasp of HTML and how the markup language actually works. The main reason is that if you use the HTML parser that Squeak provides, you'll have to search the OrderedCollections of HTML tags it returns and actually understand what they mean. If you don't want to use the internal parser, you can write your own parser using TGEN or create a recursive descent parser entirely on your own. Either of these options also requires some knowledge of how HTML tags work and how CGI stuff is integrated into HTML nowadays. The focus of this study will be using the internal parser that comes with Squeak.

Tools of the trade



If you're new to Squeak then you'll want to read this section to get up to speed on how I'm going to reference all of the topics in this case study. First, familiarize yourself with the way the Squeak interface works and realize that your Workspace, the Transcript window and the object "explore" commands are the key to really understanding the flow of data in Squeak. Squeak also has an AMAZINGLY powerful debugger that will pop up, tell you which message send went wrong, and let you directly see the contents of the objects in your code. This lets you check whether your OrderedCollection is actually full or is nil for some reason. More often than not, the message that Squeak throws isn't the message you're interested in. You're going to be interested in a spot a few steps down the runtime stack, where the bad data was sent. Most of the code that I'm going to show you is Workspace code that you can directly execute to see the results, so be sure to have a Workspace and Transcript open and visible at all times.
The picture below shows a workspace, transcript, explore window and a debug window open and running. Be sure to look at how things are arranged in the debug window and explore window.
External Image
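
To get a feel for these tools before any real parsing, here is a minimal Workspace sketch. It does not use the HTML parser at all. The tag names are made-up sample data standing in for what a parser might return, and the variable name tags is just an illustration. Paste it into a Workspace, select it, and choose "do it".

```smalltalk
"Workspace sketch: a hand-built collection standing in for parser output."
| tags |
tags := OrderedCollection new.
tags add: 'html'; add: 'head'; add: 'body'.

"Print to the Transcript window so you can follow the data."
Transcript show: 'Number of tags: ', tags size printString; cr.
tags do: [:each | Transcript show: each; cr].

"Open an explorer window on the collection to browse its contents."
tags explore.
```

If you want to see the debugger as well, try evaluating an expression that sends nil a message it does not understand, such as nil add: 'p'. The notifier that pops up lets you open the debugger and walk down the stack to where the nil came from.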

Continue on to...
Getting the contents of a web-page
Go back to...
HTML parsing in squeak