
Los Pimps - Parsing Tips


Here I give you some simple tips on parsing an HTML file for the information you need, and on parsing a simple GEDCOM file.

HTML parsing:
Do not use the Squeak HTML parser!
It will give you an OrderedCollection from hell.

The simplest approach is to just grab the entire HTML page and extract the information you need directly from the source. If you read
our Los Pimps - Milestone 5 page then you should know how to get the page as a string. We will continue from here.

"go to start of table"
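tableBegin := page findString: (some string) startingAt: 1. "Squeak strings are indexed from 1, not 0"
tableEnd := page findString: (some string) startingAt: tableBegin + 8.
tableEnd := tableEnd + 8. "move past the end marker"

"Chop out the table"
page := page copyFrom: tableBegin to: tableEnd.
"page is now shorter, so find the end marker again relative to the new string"
tableEnd := page findString: (some string) startingAt: 8.
index := 1.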

Find the start of a cell and grab the data in it.

"first and last name"
start := page findString: '<td>' startingAt: index.
stop := page findString: '</td>' startingAt: start.
temp := page copyFrom: start + 4 to: stop - 1.
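
To grab every cell instead of just the first one, the same two searches can be wrapped in a loop. This is only a workspace sketch: the sample page string and the cells collection are made up for illustration, and it assumes plain <td> tags with no attributes (a real page may use something like <td align=...>, which this will not match).

```smalltalk
"Sketch: collect the text of every <td>...</td> cell in a string."
| page cells index start stop |
"Sample data for illustration only, not from the real page"
page := '<table><tr><td>John Smith</td><td>1901</td></tr></table>'.
cells := OrderedCollection new.
index := 1.
"findString:startingAt: answers 0 when nothing is found, which ends the loop"
[start := page findString: '<td>' startingAt: index.
 start > 0 and: [(stop := page findString: '</td>' startingAt: start) > 0]]
    whileTrue: [
        "start + 4 skips the 4 characters of '<td>'"
        cells add: (page copyFrom: start + 4 to: stop - 1).
        "continue searching after the 5 characters of '</td>'"
        index := stop + 5].
cells
```

Moving index past each closing tag is what keeps the loop from finding the same cell again.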

GEDCOM file parsing:

To start off, you need to open a file stream to the file you want to parse.

filestream := FileStream fileNamed: fileName.

Parsing generally involves two steps: tokenizing, then parsing the tokens. Tokenizing means going character by character and converting the input into known tokens by matching; a simple string compare should suffice. Parsing theory is in the lecture slides.
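
As a rough illustration of the tokenizing step, here is a workspace sketch that splits one GEDCOM line into its pieces. It assumes the usual line layout of a level number, an optional cross-reference id such as @I1@, a tag, and an optional value, separated by single spaces. The sample line and the variable names are made up for this example; this is not the scanner we actually wrote.

```smalltalk
"Sketch: split one GEDCOM line into level, optional xref id, tag and value."
| line stream level token xref tag value |
line := '1 NAME John /Smith/'.   "sample line for illustration"
stream := ReadStream on: line.
"asInteger answers nil if the first field is not a number"
level := (stream upTo: Character space) asInteger.
token := stream upTo: Character space.
"A token starting with @ is taken to be a cross-reference id; the tag follows it"
(token notEmpty and: [token first = $@])
    ifTrue: [xref := token. tag := stream upTo: Character space]
    ifFalse: [tag := token].
"Whatever is left on the line is the value (may be empty)"
value := stream upToEnd.
```

Once each line is reduced to level, tag and value, matching a tag is a simple string compare such as tag = 'NAME'.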

We did things slightly differently...
We combined the scanner and parser into a ghetto scanner-parser that crawled character by character without using peek!
Weird thing is, it worked; it just wasn't very flexible. Unknown tags placed in the right places would cause parse errors (which we caught).

Our parsing algorithm looked something like this:

[file atEnd] whileFalse: [
    "read level: the first character of the line should be a digit"
    char := file next.
    char isDigit
        ifFalse: [
            file close.
            ^ self error: 'Line does not begin with level.']
        ifTrue: [self parseLine: file start: char]].

In hindsight we should have done it the standard way. We won't go much into that because it's covered in the lecture slides. So our GEDCOM parsing tip is what not to do.
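
For comparison, here is a small workspace sketch of a line loop that does use peek: it checks the first character of each line without consuming it, then reads the whole line in one go. The in-memory sample data and the names records and badLines are made up for illustration, and it assumes lines end with a carriage return (the Squeak convention; a real GEDCOM file may use LF or CRLF instead).

```smalltalk
"Sketch: read a stream line by line, using peek to check for a level digit."
| cr file records badLines |
cr := String with: Character cr.
"Sample data standing in for a real file stream"
file := ReadStream on: '0 HEAD', cr, '1 SOUR demo', cr, '0 TRLR'.
records := OrderedCollection new.
badLines := OrderedCollection new.
[file atEnd] whileFalse: [
    "peek looks at the next character without advancing the stream"
    file peek isDigit
        ifTrue: [records add: (file upTo: Character cr)]
        ifFalse: [badLines add: (file upTo: Character cr)]].
records
```

Because peek does not advance the stream, the level digit is still part of the line when it is read, so it does not have to be passed along separately.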
