
How to sift through OrderedCollection Hell

Tip 1: Look for something a little different


After staring at parsed HTML for hours, I've learned to search for anything in the HTML that might be unique. I know that sounds really obvious, and it is, but there are a few catches. The data you're handed back can be ordered in so many ways that I'm not sure how helpful this will be in every case. However, I'll go over tables, which turned up quite often in the code I had to write.
-Example: Tables- Tables often contain the data that you need to extract. So what you want to do is find something that makes the table holding your data unique. Below is an example.

Let's assume that the data I need to get from the webpage is in a table. I would perform a simple select on my OrderedCollection to extract all of the tables from the HTML. Here is the workspace code:

External Image
As usual, the added code is in red. Basically, we're looking through the whole body to find and select ONLY the tables. The object that is returned is an OrderedCollection of tables, as shown here.
External Image
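As a rough illustration of that selection step, here is a minimal sketch written in Python rather than Squeak. The plain dictionaries are hypothetical stand-ins for the parsed entities, and select_tables is a made-up helper name; in Squeak the same filtering would be a select: on the OrderedCollection.

```python
# Hypothetical stand-in for the parsed body: each entity is a plain dict
# with a tag name, an attribute dictionary, and a list of children.
body = [
    {"tag": "p", "attrs": {}, "children": ["Intro text"]},
    {"tag": "table", "attrs": {"width": "100%"}, "children": []},
    {"tag": "div", "attrs": {}, "children": []},
    {"tag": "table", "attrs": {"width": "270"}, "children": []},
]


def select_tables(entities):
    """Return only the table entities, keeping document order."""
    return [entity for entity in entities if entity["tag"] == "table"]


tables = select_tables(body)
print(len(tables), "tables found")
```

The result is still a collection of whole table entities, so nothing about their contents has been lost yet.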
Now we have tons of tables, but how do we find the information we need? At this point you need to remember what I said earlier about looking for something unique. If I told you that the data we needed was contained in a table with a "width" of 270, would that help you at all? It should. That's the obvious part of searching for something unique: if you know your data is going to be in a table, find something unique about that table and exploit it. Now all we have to do is a trivial select of that one object in the collection.
External Image
This code returns the only OrderedCollection that we need to worry about for the rest of the method. So we now have an OrderedCollection, but we have NO idea how many elements it has or how many subelements each of its elements has. All we know is that it has the data we need.
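The "find the one table with width 270" step can be sketched the same way. This is Python, not the original workspace code, and find_table_by_width plus the sample data are assumptions made up for illustration. It returns None when nothing matches, much like detect:ifNone: would let you do in Squeak.

```python
def find_table_by_width(tables, width):
    """Return the first table whose width attribute matches, else None."""
    wanted = str(width)  # parsed attribute values are assumed to be strings
    for table in tables:
        if table["attrs"].get("width") == wanted:
            return table
    return None


# Hypothetical tables, each a dict with attributes and children.
tables = [
    {"tag": "table", "attrs": {"width": "100%"}, "children": ["navigation"]},
    {"tag": "table", "attrs": {"width": "270"}, "children": ["the data"]},
    {"tag": "table", "attrs": {}, "children": ["footer"]},
]

target = find_table_by_width(tables, 270)
print(target["children"])
```

Using .get for the attribute lookup matters here: the third table has no width at all, and it should simply be skipped rather than raise an error.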

Tip 2: Recursion is your friend... honest!


Ahh... a perfect use for the beauty of recursive methods. All that nonsense about not knowing how deep to go to extract the text is meaningless in the face of a recursive text extraction technique. Fortunately, Squeak already has that built in for us... kinda. The secret is in a method that OrderedCollections understand called "allSubentitiesDo".
An example of an "allSubentitiesDo" call.
External Image
This will return an OrderedCollection of Text objects. All that is required then is to iteratively concatenate the Text objects into one long string.
In my opinion, this is perhaps one of the most powerful lines of code that I wrote while working on the HTML parser. I used it literally everywhere; its utility was amazing. The key to its usefulness is that it generally lets you print something readable and pretty close to what you need. If you know that your data is in one of 3 tables and you can't narrow it down, extract all of the text and then try to work with it as text. You can also use it when you know exactly what you need from an OrderedCollection, since it extracts the text no matter what form it is in (e.g., hypertext or links).
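To show why recursion makes the nesting depth irrelevant, here is a minimal Python sketch of the idea. It is not Squeak's built-in method: collect_text and the nested dictionaries are hypothetical stand-ins, with plain strings playing the role of the Text objects.

```python
def collect_text(entity, pieces=None):
    """Recursively gather every text leaf under entity, in document order."""
    if pieces is None:
        pieces = []
    if isinstance(entity, str):
        pieces.append(entity)  # a text leaf: keep it
    else:
        for child in entity["children"]:
            collect_text(child, pieces)  # descend, however deep this goes
    return pieces


# Hypothetical table: a link in one cell, nested font/bold text in another.
table = {"tag": "table", "attrs": {"width": "270"}, "children": [
    {"tag": "tr", "attrs": {}, "children": [
        {"tag": "td", "attrs": {}, "children": [
            {"tag": "a", "attrs": {"href": "story.html"},
             "children": ["Top story"]},
        ]},
        {"tag": "td", "attrs": {}, "children": [
            {"tag": "font", "attrs": {}, "children": [
                "updated ",
                {"tag": "b", "attrs": {}, "children": ["today"]},
            ]},
        ]},
    ]},
]}

pieces = collect_text(table)
# Concatenate the collected pieces into one long string.
text = " ".join(piece.strip() for piece in pieces)
print(text)
```

The caller never has to know that the link text sits three levels down and the bold text four; the function treats every level the same way.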

Tip 3: Learn Dictionaries and how they relate to HTML in Squeak


This is the final thing in this quick tutorial that I would say is important. Many of the lower-level data structures that Squeak creates and puts in the OrderedCollections seem to be dictionaries. Dictionaries are an incredibly powerful resource and allow complicated things to be done quickly. So learn what an HtmlAnchor is and what an HtmlFontEntity is. Take some time to create mental associations between the HTML you already know and what Squeak hands to you. This will save you serious time when it comes down to the wire and you have 5 hours to code a reporter for cnn.com.
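To make that mapping between HTML and dictionaries a bit more concrete, here is a small Python sketch. It is an analogy only, not how Squeak's HtmlAnchor is actually implemented: the entity layout, the link_targets helper, and the example.com URLs are all made up for illustration.

```python
def link_targets(entities):
    """Return (href, label) pairs for every anchor that carries an href."""
    links = []
    for entity in entities:
        attrs = entity["attrs"]  # the tag's attributes, keyed by name
        if entity["tag"] == "a" and "href" in attrs:
            label = "".join(c for c in entity["children"] if isinstance(c, str))
            links.append((attrs["href"], label))
    return links


# Hypothetical entities for:
#   <a href="http://example.com/world">World</a>
#   <font size="2">small print</font>
#   <a name="top">anchor without a link</a>
entities = [
    {"tag": "a", "attrs": {"href": "http://example.com/world"},
     "children": ["World"]},
    {"tag": "font", "attrs": {"size": "2"}, "children": ["small print"]},
    {"tag": "a", "attrs": {"name": "top"}, "children": ["anchor without a link"]},
]

links = link_targets(entities)
print(links)
```

Once you think of each tag as "a name plus a dictionary of attributes", questions like "which anchors actually link somewhere?" turn into simple key lookups.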

Conclusion


I really hope that someone out there found this stuff useful. A lot of blood, sweat and tears went into this project, and I think that the HTML parser was one of its harder aspects. It is really neat what you can do with Squeak and its built-in functionality. Below I've linked to some SampleReporter code that I filed out from our project. This code will not run, but it might compile... not really sure. However, it IS full of useful comments and contains ALL of the examples that you found above. You can probably pick and choose pieces to make some workspace code out of. Anyway, always remember above all else... FEAR 10,000 NEEDLES!



Get the source to the Sample Reporter here

Go back to...
HTML parsing in squeak

