Questions on Linear News Display Milestone

See Spring 2000 Project Milestones – ask questions here:

My question has to do with the actual parsing of the HTML file. For example, let's say we go to Slashdot. Once we have returned the body of the HTML back into Squeak, it is represented as an ordered collection. What would you recommend that we do with it then? I see that it is further divided into an ordered collection, and if you inspect it you can see the contents and the attribs. So let's say that on Slashdot I want to grab all the "Read More..." hrefs. How would I find them in the collections and return them? I am really kinda lost. Any help at all would be very appreciated.
Brian Smith
Why is the HTML represented as an ordered collection? Did you write the parser? If so, why are you just representing an ordered collection? Are you using HtmlParser? Then it's not an ordered collection – it's an HtmlDocument. (Did you attend lecture this last week? I think I addressed all of the above...) Mark Guzdial

If an instance of class B isn't an attribute of class A, is it ok for class A to have a containment relationship with class B? In general, I'm having a hard time deciding when "uses" relationships should be refined into "contains" relationships. In order for a coffee maker to know which coffee to make, should a coffee maker contain coffee, or be told which coffee to use by passing the coffee as a method argument (myCoffeeMaker make: cupOfFolgers)? My question comes from the relationship between a NewspaperSectionEditor and the NewspaperSection he/she edits...

This sounds like a dependency relationship – a "temporary Has-A" as I say in the book. Mark Guzdial
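The coffee-maker case from the question can be sketched in Squeak to show the difference between the two relationships. This is only an illustration: CoffeeMaker, waterLevel, and make: are hypothetical names, not part of the project. The method bodies would be entered in a browser under the CoffeeMaker class.

```smalltalk
"Class definition (evaluate in a workspace). All names are hypothetical."
Object subclass: #CoffeeMaker
	instanceVariableNames: 'waterLevel'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Examples-Coffee'

"Method CoffeeMaker>>initialize. Containment (Has-A): waterLevel is an
 instance variable, so the maker holds it for its whole lifetime."
initialize
	waterLevel := 10

"Method CoffeeMaker>>make: Dependency (temporary Has-A): aCoffee is known
 only for the duration of this message send and is never stored."
make: aCoffee
	waterLevel := waterLevel - 1.
	^'A cup of ', aCoffee
```

In a workspace, `CoffeeMaker new initialize make: 'Folgers'` answers 'A cup of Folgers'. The maker never keeps a reference to the coffee, which is what makes the relationship a dependency rather than containment.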

Can I reuse built-in parser classes rather than creating mine with TGen?

Of course! I said so in class! You have several options for the parser: Build your own with recursive descent, build your own with T-Gen, use HtmlParser. All are acceptable. Mark Guzdial

I've got more of a UI type question. In Morphic, is there a way to restrict the region in which a HandMorph can "carry" a grabbed morph? I've looked and looked, and the only thing close might be to use a WorldWindow, which is quite ugly. I assume modifications to basic morph components (HandMorph, etc...) are out of the scope of these projects and/or disallowed for this class, correct? Matthew Wolenetz

In general, that's considered very bad UI, to limit where the cursor can go. You're essentially creating an invisible "mode" – very yucky. You can limit the targets that can accept a grabbed morph, but you can't limit where the Hand goes. Mark Guzdial

I understand why this is very bad UI. I was using the wrong terminology/asking the wrong question: "grabbing/dropping/carrying". Let me rephrase. If scrollbar sliders weren't limited to the scrollbar, we'd be allowed to move the slider outside of the scrollbar. The limitation on the sliders does exist in current UI toolkits including Morphic, and I was attempting to determine if there is a generic method for implementing limitations of this kind in Morphic. (Or are scrollbar sliders the only special case?) Matthew Wolenetz

Ohhhhh...I get it. What's going on in scrollbars is not a case of grabbing, dropping, or carrying. Rather, the scrollbar catches mousedown and then polls the mouse and updates itself based on the mouse position. Since it's the scrollbar moving itself, it can limit its position. There are some mechanisms for doing this generally (often called pin and groove mechanisms), but I don't know if any are built into Squeak. You're welcome to build one! Mark Guzdial
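A minimal sketch of the idea described above: a morph that moves itself in response to the mouse and clamps its own position to its owner's bounds. GrooveKnobMorph is a hypothetical Morph subclass, and the sketch assumes the knob has an owner that is at least as large as the knob. It is not how Squeak's scrollbar is actually implemented.

```smalltalk
"Methods for a hypothetical Morph subclass named GrooveKnobMorph."

handlesMouseDown: evt
	"Claim the mouse-down so the Hand does not pick this morph up and carry it."
	^true

mouseDown: evt
	"Nothing to do on the press itself; dragging is handled in mouseMove:."

mouseMove: evt
	"Follow the cursor, but clamp the new position so the knob
	 stays inside its owner (the groove)."
	| groove lowest highest |
	groove := owner bounds.
	lowest := groove topLeft.
	highest := groove bottomRight - self extent.
	self position: ((evt cursorPoint max: lowest) min: highest)
```

Because the morph positions itself, the Hand is free to go anywhere; only the knob is constrained, so no invisible "mode" is created.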

Since Slashdot usually provides a summary and a link to an article on another site, should our newspaper contain just the summary (since we have no knowledge of the external site's organization)? Also, I assume we may consider all of Slashdot's articles to be technology oriented? Nick Michalko

You're the publisher – it's certainly your prerogative what to include and what not to include. Yes, I think it's safe to say that Slashdot is just technology-oriented. Mark Guzdial

As brought up in the 1:30 class, the time limit of 5 minutes for newspaper generation is probably too short. Just grabbing the contents of a single web page can take almost an entire minute in the worst case. I have found the amount of time it takes my group's code to visit 14 websites to be anywhere from 2.5 to 5 minutes; if the user decides to get information from all 20+ websites, I don't think it is reasonable to assume that the newspaper can be generated in under 5 minutes in the average case. Note that processing the web information can be done quite quickly (message sends are cheap); it's the actual process of getting the web information in the first place that is slow. Perhaps there should be a sliding time limit for newspaper generation that is proportional to the number of sites that have to be visited . . .


How does 2 minutes per site visited sound? Mark Guzdial

That sounds very reasonable.
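One way to check a run against the sliding limit agreed above is to compute the budget from the number of sites and time the whole generation. This is a workspace sketch only; siteCount and the placeholder comment inside the timed block stand in for a group's own code.

```smalltalk
| siteCount budgetMs elapsedMs |
siteCount := 14.
"2 minutes per site, expressed in milliseconds"
budgetMs := siteCount * 2 * 60 * 1000.
elapsedMs := Time millisecondsToRun: [
	"fetch and process all the sites here" ].
elapsedMs > budgetMs
	ifTrue: [Transcript show: 'Over the time budget'; cr]
	ifFalse: [Transcript show: 'Within the time budget'; cr]
```

For 14 sites the budget works out to 28 minutes (1,680,000 milliseconds).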


If delays are encountered in HTTPSocket handling a single request such that the 2 minute time limit per site is exceeded, is there a way to abort a request that's simpler than embedding the request in a separate process and using "terminate" messages appropriately? Also, we are finding large numbers of parseable articles at certain sites. Is limiting the number of articles retrieved to a subset of those found allowed? (I hope so :) ). Matthew Wolenetz

Question to anyone... OK, I am wondering if I am missing something, because the reporter aspect of the project seems very complicated. For example, you go to Slashdot and you have to grab all the relevant URLs for the articles (I just noticed that they are all HtmlParagraphs with the [Read More...] in them), then you have to go to each href you have and extract the text from the page... I am having problems because it seems that each site will need its own specialized reporter for each type of page that it needs to visit. The other problem I am having on Slashdot is that the needed text is sometimes an HtmlItalicsEntity and sometimes just an HtmlTextEntity, and it seems to be very difficult to get just the needed text from the page. I think I could hard-code all this stuff and get it working, but is there something in the HtmlParsing tools that I am missing? Any help at all would be much appreciated. Thx
Brian Smith

I think that your understanding of the problem is correct. You need to deal with all these cases. Mark Guzdial
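For walking the parsed page, a recursive method on a reporter class is one possible shape. The sketch below rests on assumptions that have not been checked against HtmlParser: that each entity answers its children to #contents, and that its attributes are answered by #attribs as a dictionary-like object keyed by lowercase attribute name (the inspector fields mentioned in the earlier question suggest the instance variables exist, but the accessor names are guesses). The respondsTo: guards mean the method quietly collects nothing if those guesses are wrong.

```smalltalk
"Method for a hypothetical reporter class. The accessors #contents and
 #attribs are assumptions about the parser's entities, not verified API."
collectHrefsFrom: anEntity into: aCollection
	| href |
	(anEntity respondsTo: #attribs) ifTrue: [
		href := anEntity attribs at: 'href' ifAbsent: [nil].
		href isNil ifFalse: [aCollection add: href]].
	(anEntity respondsTo: #contents) ifTrue: [
		anEntity contents do: [:child |
			self collectHrefsFrom: child into: aCollection]].
	^aCollection
```

It would be called as `self collectHrefsFrom: theDocument into: OrderedCollection new`. A method is used rather than a recursive block because a block in this version of Squeak cannot safely call itself. Each site-specific reporter would still need its own filtering on top of this, as noted above.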

I have one of my reporters getting the ESPN page. When I ask the HTML parser to parse it, I get an error saying number 8827 is out of bounds. When I try to load it with Scamper, I get the same error when it tries to parse it. Is there a limit on how big a page the parser can parse, and if so, how can we work around it to get the page? Right now I can't do anything with the ESPN page.

Ansley Post

I'll bet that this is the bug discussed in second lecture yesterday – it seems that Scamper can't handle UNICODE (Characters defined by two bytes via an Ampersand expression). You have two choices:
Mark Guzdial

I think I got the fix. File in the code below. I'm no longer getting errors on ESPN. Mark Guzdial
'From Squeak2.7 of 5 January 2000 [latest update: #1762] on 1 March 2000 at 4:41:00 pm'!

!String class methodsFor: 'internet' stamp: 'mjg 3/1/2000 16:40'!
valueOfHtmlEntity: specialEntity
	specialEntity = 'quot' ifTrue: [ ^$" ].
	specialEntity = 'lt' ifTrue: [ ^$< ].
	specialEntity = 'amp' ifTrue: [ ^$& ].
	specialEntity = 'gt' ifTrue: [ ^ $>].
	specialEntity = 'nbsp' ifTrue: [ ^ Character nbsp ].

	(specialEntity beginsWith: '#') ifTrue: [
		^Character value: ((specialEntity copyFrom: 2 to: specialEntity size) asNumber min: 255)].

	^nil ! !

ALSO More on Scamper's Character Support

Did this actually fix anyone's problems? I am still getting errors on CNN's webpage when I use this.

Chris Spears

We (Squeak Times) have discovered a bug in the Windows Squeak virtual machine. We needed a dynamically generated date string for some of our parsing. It works fine in Linux, and it worked fine in Windows prior to today. The following line (run on 3-1-2000) returns different results depending on the VM:
Date today printFormat: #( 3 2 1 $/ 1 1 2 )
Linux: '2000/03/01'
Windows: '2000/02/29'
Y2K leap-year noncompliance?
Due to this bug, can we (as a team) specify to the TA testing our project that he needs to test it on the Linux VM?
Matthew Wolenetz

No. Mark Guzdial

I am sorry, but I don't remember when the hardcopy of the design is due. I remember Mark saying in class that we can slide it under the door until 4:30 pm on Friday, but I'm not sure. Can someone confirm? Thanks

Irfan Ahmed
Yes, it's due at my office, CCB 254, before 4:30 pm Friday. If I'm not there, slide it under the door. (This is also on the Fall 2002 Announcements page.) Mark Guzdial

Our group is able to output parsed HTML to a text file just fine. However, the text is only readable in Microsoft Word. If we use Wordpad or Notepad, the spacing is a little off (the carriage returns aren't read correctly). For grading purposes, is it ok if our text is readable in Word and not in Notepad/Wordpad?

Peter Tsai

How does it read in the FileList? If it's readable in Squeak, you're golden. Mark Guzdial

Is there any policy on including bylines, dates, copyrights, etc.? I would assume that it's our "editorial choice."


It is, but that's my mistake. I should have required you to specify the website at least for each article. I may yet add that to the final assignment. Without it, there is the potential for copyright violation. Mark Guzdial

I was testing the article retrieving code (without worrying about time limits). After a few minutes, it produced this pink box error: "send data timeout; data not sent"
This is not because of any code that my group has written, but is in HTTPSocket sendData:
	[bytesSent < bytesToSend] whileTrue: [
		(self waitForSendDoneUntil: (Socket deadlineSecs: 20))
			ifFalse: [self error: 'send data timeout; data not sent'].

It is hard coded to wait at most 20 seconds. If it times out, everything stops, and the paper cannot be produced. Will we be counted off if this happens?

Jared Ivey

Yes. What's happening is that the server is going away during the connection. Use [DoSomething] ifError: [Error recovery code] to trap and deal with the error. Mark Guzdial
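A minimal workspace sketch of the ifError: pattern suggested above. The handler block may take the error message and the receiver as its two arguments, and its value becomes the value of the whole expression. The call to HTTPSocket httpGet: and the URL are placeholders for whatever fetching code a group already uses.

```smalltalk
| page |
page := [HTTPSocket httpGet: 'http://slashdot.org/']
	ifError: [:message :receiver |
		"Log the failure and answer nil instead of stopping the whole run."
		Transcript show: 'Fetch failed: ', message; cr.
		nil].
page isNil
	ifTrue: [Transcript show: 'Skipping this site'; cr]
```

With the error trapped this way, a server that goes away mid-connection costs one site rather than the whole newspaper.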

Couple of questions about BBC:

1) News and World Service link to the exact same articles. Can we combine them into one category?

2) History and Home and Garden sections do not contain articles, but rather links to websites about these topics. How are we supposed to handle that?

Those darn Europeans can't do anything right. :-)


Sure, you can combine for that site. That's fine to skip History, Home, and Garden for BBC. Mark Guzdial

How can I get Squeak to correctly recognize a literal string with apostrophes/single-quotes in it? For example, when Squeak evaluates 'This project's annoying me.', it interprets that as a string and an unmatched string quote. Also, when I execute 'Date today' I get yesterday's date. Anyone else getting this problem? Thanks.
Nick Michalko

Double the single quotes: 'This isn''t a problem' Mark Guzdial
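Applied to the string from the question, the doubled quote can be tried in a workspace like this:

```smalltalk
Transcript show: 'This project''s annoying me.'; cr.

"The doubled quote is stored as a single character, so this answers 4."
'It''s' size
```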

Squeak has all kinds of problems when trying to deal with files mounted over NFS/Samba. If I change the file's directory to something like C:\TEMP it works perfectly.

Can we ask our TA to grade this project in Windows, so we can put them in C:\temp or somewhere similar?


I don't have any official say, but I would say the answer is no. It must be platform independent (same reason my group can't specify Linux). Squeak should have no problem writing to the default directory (i.e., the one in which the image file resides) since it writes to the changes/image files.

Also, there is nothing that guarantees the existence of a c:\temp directory (or something similar), nor anything saying you have write access to it. The only one I see you can assume exists and has write access is the default directory.

Jared Ivey

Actually, most of the TAs (I think) know about these problems and are careful with their own Squeak usage. You can't ask the TAs to work on a particular platform, but you can ask the TAs to be aware of the limitations of their own platform so that Squeak has network and file accessibility. A similar example: You don't have to deal with the Domain Name Server (DNS – NetNameResolver in Squeak) possibly going away or not being present. Yes, that's possible, but that's not something we expect you to deal with – it's a platform limitation issue. Webservers going away IS a real issue. Mark Guzdial

