Los Pimps - Parsing Tips
Here I give you some simple tips on parsing an HTML file for the information you need, and on parsing a simple GEDCOM file.
Do not use the Squeak HTML parser!!!
It will give you an OrderedCollection from hell.
The simplest approach is to just grab the entire HTML page and extract the information you need directly from the source. If you read
our Los Pimps - Milestone 5 page then you should know how to get the page as a string. We will continue from here.
- 90% of the time, the info you are looking for will be part of some table. Look at the source and find a UNIQUE piece of text that appears somewhere before the table begins, and another that appears after the table ends. Then search for them, similar to the following.
"go to start of table"
tableBegin := page findString: (some string) startingAt: 1.
tableEnd := page findString: (some string) startingAt: (tableBegin + 8).
- Using these indices, we copy or truncate the large HTML page into a smaller string that has just the table.
"Chop out the table"
page := page copyFrom: tableBegin to: tableEnd.
tableEnd := page findString: (some string) startingAt: 8.
Find the start of a cell and grab the data in it.
- Next, look to see what CELLS your data is contained in. Table cells are usually delimited by 'td' tags for columns and 'tr' tags for rows.
"first and last name"
start := page findString: '<td>' startingAt: index.
stop := page findString: '</td>' startingAt: start.
temp := page copyFrom: (start + 4) to: (stop - 1).
- Now that you have the data copied to a string, feel free to tokenize it and do whatever you want with it. That wasn't too hard, now was it?
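To see the steps above in one piece, here is a minimal workspace sketch. The page string and the marker texts ('Roster', 'End of roster') are made up for illustration; on a real page you would pick your own unique strings, and cells with attributes inside the 'td' tag would need a different offset than 4.

```smalltalk
"Workspace sketch. The page contents and marker texts are made up."
| page tableBegin tableEnd start stop temp tokens |
page := '<html><h2>Roster</h2><table><tr><td>John Smith</td><td>42</td></tr></table><p>End of roster</p></html>'.

"Find unique text before and after the table."
tableBegin := page findString: 'Roster' startingAt: 1.
tableEnd := page findString: 'End of roster' startingAt: tableBegin.

"findString:startingAt: answers 0 when nothing matches, so check first."
(tableBegin = 0 or: [tableEnd = 0])
    ifTrue: [^ self error: 'Marker text not found'].

"Chop out the table."
page := page copyFrom: tableBegin to: tableEnd.

"Grab the contents of the first cell; '<td>' is 4 characters long."
start := page findString: '<td>' startingAt: 1.
stop := page findString: '</td>' startingAt: start.
temp := page copyFrom: start + 4 to: stop - 1.

"Split the cell text on spaces."
tokens := temp findTokens: ' '.
```

With this sample page, temp ends up holding 'John Smith' and tokens holds the two names. To walk through more cells, repeat the last search with startingAt: set to the previous stop.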
GEDCOM file parsing:
To start off, you need to open a file stream to the file you want to parse.
filestream := FileStream fileNamed: fileName.
Parsing usually involves two steps: tokenizing, and then parsing the tokens. Tokenizing means going character by character and converting what you read into known tokens by matching. A simple string compare should suffice. Parsing theory is in the lecture slides.
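As a rough illustration of the two steps on a single GEDCOM line, here is a minimal workspace sketch. It is not what we actually wrote; the sample line and the variable names are made up, and it assumes the simple "level tag value" shape. Lines that carry a cross-reference id between the level and the tag (such as '0 @I1@ INDI') would need an extra check.

```smalltalk
"Workspace sketch: split one GEDCOM line into level, tag, and value.
 The sample line and the variable names are made up for illustration."
| line tokens level tag value |
line := '1 NAME John /Smith/'.

"Step 1: tokenize on spaces."
tokens := line findTokens: ' '.

"Step 2: interpret the tokens. The first is the level, the second the tag."
level := tokens first asNumber.
tag := tokens at: 2.

"Whatever is left over is the value; join it back together with spaces."
value := ''.
3 to: tokens size do: [:i |
    value isEmpty ifFalse: [value := value , ' '].
    value := value , (tokens at: i)].
```

Once each line is broken up like this, the parsing step only has to compare tag against the tags it knows about (a simple string compare) and use level to decide which record the line belongs to.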
We did things slightly differently....
We combined the scanner and parser into a ghetto scanner-parser that crawled character by character without using peek!
Weird thing is it worked... it just wasn't very flexible. Unknown tags placed in the right place would cause parse errors (which we caught).
Our parsing algorithm looked something like this:
[file atEnd] whileFalse: [
    "char is the first character of the new line"
    char := file next.
    char isDigit
        ifFalse: [
            "throwError: 'Line does not begin with level.'"
            file close.
            ^ nil]
        ifTrue: [self parseLine: file start: char]]
In hindsight we should have done it the standard way. We won't go much into that because it's in the lecture slides. So our GEDCOM parsing tip is what not to do.