
Sp00 Final Exam Review: Redesigning the Personal Newspaper

See Final Exam Review - Sp2000


a) The built-in HtmlParser is nice, but needs to be updated for some things (DIV tags are
put within other tags and interpreted as such). Even if HtmlParser were used, the HUGE
problem would still remain: parsing the sites and getting the proper content out, which
would mean writing lots of code just to walk through the parse itself.

If memory serves, T-Gen lets you create objects as you parse and use the parsed
information as needed. Also, if memory serves, you can create rules for when to make an
object and when not to. If those rules could somehow carry the content-parsing logic,
then you could parse and build your content/article simultaneously. Also, if
things in the site moved around a bit, the T-Gen parse would probably be more lenient than
a "go down 4, then on that 4th element, find..." walk of what HtmlParser returns.

So, I think an optimal solution would be to use T-Gen and define your own set of rules
for what to grab and what not to grab. Those rules are the content/article parsing, and you
would have everything done. You could also have these T-Gen-generated objects create
and update your Article object on the fly, along the lines of the sketch below.
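Something like this, maybe (just a sketch: Article and its accessors title: and body: are assumptions about the project's existing design, and the method name is made up):

" in ESPNReporter - called as the site-specific rules recognize each piece "
articleFromTitle: headlineText body: storyText
    "Build the Article on the fly from the pieces the parse hands back.
     Article, title: and body: are assumptions about the existing design."
    | article |
    article := Article new.
    article title: headlineText.
    article body: storyText.
    ^article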

right...? :)

b) News source specification would most likely still live in some configuration file. A more
sophisticated format could be chosen beyond just listing sites and a yes/no attribute.
You could use a Dictionary, with some agreed-upon key that hashes to a true/false
value for each site (i.e. 'ESPNTV'->true, 'ESPNSports'->false, etc.). Dictionary
also has a built-in way to write itself out to disk, so no extra code would be needed for
saving its state. I don't see a reason (yet...) to take on the extra overhead of creating
some script-like language that would then need an interpreter, etc.
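For example (a rough sketch; the file name and site keys are only placeholders):

| sources |
sources := Dictionary new.
sources at: 'ESPNTV' put: true.
sources at: 'ESPNSports' put: false.

"keep only the enabled sites when building the paper"
sources keys select: [:name | sources at: name].

"one way to save its state with no extra code: write the storeString
 to a file, and rebuild later with Compiler evaluate: on the file's contents"
(FileStream newFileNamed: 'newssources.cfg')
    nextPutAll: sources storeString;
    close.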

c) Factory: You can have some GenericReporter class that provides a means of returning the
parsing rules T-Gen would use to parse a given site. Its subclasses (ESPNReporter,
BBCReporter, etc.) would actually have the rules spelled out. You could have:

" in GenericReporter "
parseRules
^self siteParseRules.

" in ESPNReporter - subclass, site-specific Reporter "
siteParseRules
" answers this site's parsing logic for gathering pertinent
article information for Article creation
"
^"logic for parsing"


I'm going on the assumption that you can change the grammar and such for T-Gen. If that isn't
possible (I can't open the ppt slides...), then you could go back to using HtmlParser
and have siteParseRules return a Block that takes as its parameter the huge
collection HtmlParser generates, roughly like this:
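(The tests inside the block are made up; the real ones would depend on what HtmlParser actually hands back.)

" in ESPNReporter - fallback version when HtmlParser is used instead of T-Gen "
siteParseRules
    "Answer a block that takes the collection HtmlParser generates and
     digs out the piece we care about; the test here is only illustrative."
    ^[:htmlEntities |
        htmlEntities
            detect: [:each | each isString and: [each includesSubstring: 'headline']]
            ifNone: [nil]]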

so....how's that?


Good, but not the only answers. Mark Guzdial



a) OK, we thought that HtmlParser was too convoluted for commercial development. It takes too long to figure out the appropriate way to walk the ordered collection returned for each specific web site. T-Gen seems too complex, but that may just be because I've never used it before. We think the best way to do this would be to make our own "scanner", like lex: you could search for specific strings in the HTML document. It would take more time to build initially, but any new parsers would take less time to write.
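A rough sketch of that hand-rolled scanner idea (the method name and markers are made up; it just pulls out whatever sits between two known strings):

textBetween: startMarker and: endMarker in: html
    "Answer the text between the two marker strings, or nil if either is missing."
    | start stop |
    start := html indexOfSubCollection: startMarker.
    start = 0 ifTrue: [^nil].
    start := start + startMarker size.
    stop := html indexOfSubCollection: endMarker startingAt: start.
    stop = 0 ifTrue: [^nil].
    ^html copyFrom: start to: stop - 1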

b) Unless we can come up with a "magic parser" that can parse any site, we would want to have a configuration file. Is this what the question is asking? It seems a little vague.

c) We would want to use a Reporter Factory, ideally. It would manufacture reporters as needed, and call the methods to parse the sites.
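Something like this, perhaps (a sketch only; the class and selector names are made up):

" in ReporterFactory class "
reporterFor: aSiteName
    "Manufacture the right Reporter for a site; fall back to a generic one."
    aSiteName = 'ESPN' ifTrue: [^ESPNReporter new].
    aSiteName = 'BBC' ifTrue: [^BBCReporter new].
    ^GenericReporter new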

d) More later.

Susi Rathmann, jeremy, Matt Flagg



[a] - I would use the built-in HtmlParser, despite its shortcomings, since it would take a really long time to create a parser by hand, even when using T-Gen. Then, to parse the htmlCollection that HtmlParser returned, I would recursively descend through the collection, retrieving all the necessary data. This is expensive, since parsing is effectively done twice.
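The descent might look roughly like this (a sketch; it just walks whatever nested collections come back and gathers the plain-text pieces, so the real tests would be tuned per site):

collectTextFrom: anEntity into: aStream
    "Recursively descend the nested collection HtmlParser returns,
     accumulating any plain text found along the way."
    anEntity isString
        ifTrue: [^aStream nextPutAll: anEntity; nextPut: Character space].
    (anEntity isKindOf: Collection)
        ifTrue: [anEntity do: [:each | self collectTextFrom: each into: aStream]]

It would be kicked off with a WriteStream on a new String and the collection HtmlParser returned, then the stream's contents would hold the gathered text.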

[b] - I would handle the specs for new news sources via configuration files, such as "newssources.txt". This way the data could easily be sent between multiple machines, which would eliminate the overhead of the AJC keeping configuration files for each user on the server. Each user would keep their file on their home computer and the interface would send it back and forth in the form of a "cookie". The only drawback to this design is that a user with news sources stored at home would have to re-enter that information on another computer.

As for the format of the config file, I would use the same one that I designed for my group newspaper project.

section[url{topic}url2{topic2}url...{topic...}]section2[url...{topic...}]

with -Primary- as an escape sequence for the primary source.

ex:

National or World News[-Primary-{Primary}]Sports[http://espn.go.com{Sports Headlines}]Weather[http://www.accessatlanta.com/{Weather}]

[c] - I would use the factory approach like Stephen, but in parseRules I would have a generic parser handle most of the data, since in my experience almost all of the information is repeated. Then I would have the site rules subclassed in a pattern resembling a tree. For instance, CNN would have rules shared by all of the pages it contains (more specific than the generic parser's, but still general across CNN), with, say, CNN US having specialized rules for that particular page.
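The tree of site rules might be declared like this (class names and the category are only illustrative):

GenericReporter subclass: #CNNReporter
    instanceVariableNames: ''
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Newspaper-Reporters'.

CNNReporter subclass: #CNNUSReporter
    instanceVariableNames: ''
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Newspaper-Reporters'.

" CNNUSReporter would then override siteParseRules with the page-specific rules,
  inheriting everything else from CNNReporter and GenericReporter "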

Harper Maddox



a. & b.) Maybe I'm taking the scenario too seriously, but if budget constraints allowed it, I would completely remove site-specific parsing logic from the end-user application. Instead, our Newspaper application should connect to a special server maintained by Cox Communications and download compressed articles from it. While this is more expensive, it allows complete control of what end-users see (evil, but very business-savvy). This could be extended so that the list of available news sources is refreshed whenever the user connects (no more fuss about how to create a system that handles new/changing article sources). Finally, this lets us parse any web site with whatever language/tool does it best; I'm not too familiar with Perl, but I'm betting that a whiz-bang Perl programmer could put together the right scripts quickly and cheaply.

Stephen Bennett
