
Getting the contents of a web-page

How to rip the web-page HTML


Now to the fun part. Squeak is really great for getting the contents of a web-page. The call is simple and almost always returns something usable. Here is an example of the actual code to get the contents of cnn.com.
External Image
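The screenshot itself is not reproduced here. As a rough stand-in, here is a minimal sketch of what such a call might look like, assuming the HTTPSocket class that shipped with Squeak images of this era. The variable names are illustrative, and the exact return type may differ between Squeak versions, so treat this as a sketch rather than the original code.

```smalltalk
"Workspace sketch: select all of it and choose 'do it'."
| response htmlData |
"Assumption: httpGet: answers a stream on success and an error String on failure."
response := HTTPSocket httpGet: 'http://www.cnn.com'.
"Normalize both cases to a plain String of page source (or the error text)."
htmlData := response isString
	ifTrue: [response]
	ifFalse: [response contents].
"Open an explorer on the result."
htmlData explore.
```

Checking isString first is a cautious choice: if the request fails, you explore the error message instead of getting a doesNotUnderstand walkback.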
After you run this code you should see two important things.
1. Your transcript will show you the status of the HTML retrieval operation. For example, you should see something like...
data was slow
data was slow
data was late
and so on. This lets you know that something is actually happening.
2. After this code completes, an explorer window pops up showing you the contents of the htmlData object. (Note: Whenever you use explore in your code the execution terminates at that point, so don't expect anything else to continue until you remove your explore.) This is the basic code that gives you back the source of a page, much like choosing View Source in a browser.

Parsing the HTML into something useful


OK, so now we have this huge string of HTML "stuff" and you are wondering what to do with it to make any sense of it. Well, Squeak can help you do just that in only one line of code! The red line of code is the line needed to parse the HTML. Once done, you'll get a new explorer that should look something like what I have below.
External Image
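Since the screenshot is not reproduced here, the following is a hedged sketch of the parsing step. It assumes the HtmlParser class found in Squeak images of this period and that its parse: message expects a stream; the variable names are made up for illustration.

```smalltalk
"Workspace sketch: fetch the page, then parse it."
| response htmlData parsed |
response := HTTPSocket httpGet: 'http://www.cnn.com'.
htmlData := response isString
	ifTrue: [response]
	ifFalse: [response contents].
"The one-line parsing step. Assumption: parse: wants a stream, so wrap the String."
parsed := HtmlParser parse: (ReadStream on: htmlData).
"Open an explorer on the parsed document."
parsed explore.
```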

What you should do now is look through all of the data in the explorer window. You will see a TON of OrderedCollections. That is how Squeak's built-in parser breaks everything into usable chunks. This is a great way to handle the data because of all the powerful operations that Squeak lets you perform on OrderedCollections. We'll get to those in a little bit.
Basically, all web pages should have a head and a body, and this split lets you select whatever you need. For example, if you just need the story title, that information is stored in the HTML head. (Note: for those of you without HTML knowledge, the head holds things like the page title that usually shows in the title bar at the top of your browser.) The rest of the data in the HTML page resides in the body, and that is what we are going to focus on in the rest of this case study.
At this point you should take some time to familiarize yourself with the messages that OrderedCollections understand. Doing this now will pay off throughout all of the time that you spend working in Squeak. In order to grab just the body of the HTML, we would do something like this.
External Image
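The screenshot is not reproduced here, so this is only a sketch of the idea. It assumes that the parsed document answers an OrderedCollection holding the head and then the body when sent contents. That message name is an assumption about the parser's classes, so check it in your own image with the explorer. A small literal page stands in for the downloaded HTML so the snippet does not need a network connection.

```smalltalk
"Workspace sketch: pull the body out of a parsed page."
| parsed body |
"A tiny hypothetical page used in place of the downloaded source."
parsed := HtmlParser parse: (ReadStream on:
	'<html><head><title>Story title</title></head><body><p>Story text</p></body></html>').
"Assumption: contents answers an OrderedCollection of head, then body."
body := parsed contents last.
"Open an explorer on just the body."
body explore.
```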
In much the same way, we could get the head of the HTML page by exchanging "last" with "first" in the above code. That works because the parser returns an OrderedCollection of two OrderedCollections (namely head and body). So as you can see, there are a lot of OrderedCollections that are going to have to be handled.
We finally have everything we need to get the information we care about out of the web-page. Now we just have to figure out how to get EXACTLY what we need...
Continue on to...
How to sift through OrderedCollection Hell
Go back to...
HTML parsing in squeak
