How to Mine Web Pages

Milestones 6 and 7 from Spring 2001 involve mining WWW sites to find MP3 files. The Spring 2002 project may involve mining WWW sites as well. This page sketches an approach to this part of the project.

  1. If you are distracted by thinking about how stupid this assignment seems, please read Mining WWW Pages is Stupid.
  2. Start by stepping through the web sites in a regular web browser. See which web pages you will want your program to use. It may help to map, on a scrap of paper, the sequence of pages that are traversed to submit a search and then find the relevant MP3 files.
  3. Open a workspace in Squeak, and try to duplicate the path followed using Squeak's HTTP code. For example, try inspecting the result of:
    'http://search.yahoo.com/bin/search' asUrl retrieveContentsArgs: {'p' -> {'smalltalk'}}
  4. Start writing some code to search the HTML results from each step and find the requests to make in the next step. This will probably involve some kind of parsing. Don't worry about making the analysis perfect; heuristics are inherent to web mining.
  5. Finally, bundle up your bits of code into nicely factored methods and classes that can be used by the main part of your system.
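
As an illustration of step 4, here is a minimal workspace sketch that pulls link targets out of an HTML string. It is a heuristic, not a real parser: it assumes links are written as lowercase href attributes with double-quoted values, and the sample HTML and the variable names are made up for the example.

```smalltalk
| html marker links pos start stop |
"Hypothetical sample standing in for the contents of a retrieved page."
html := '<a href="http://example.com/one.m3u">one</a> <a href="/two.html">two</a>'.
marker := 'href="'.
links := OrderedCollection new.
pos := 1.
"findString:startingAt: answers 0 when the marker is not found."
[(start := html findString: marker startingAt: pos) > 0] whileTrue: [
    "Skip past the marker to the first character of the link target."
    start := start + marker size.
    "The target ends at the next double quote, or at the end of the text."
    stop := html indexOf: $" startingAt: start ifAbsent: [html size + 1].
    links add: (html copyFrom: start to: stop - 1).
    pos := stop min: html size].
links
```

Inspecting the result shows the collected link strings. Relative links such as '/two.html' would still need to be resolved against the URL of the page they came from before they can be requested.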


Cookies

mp3.com, and probably other sites, requires that your client use cookies to download an MP3 file. Base Squeak does not support cookies, but here is a simple patch to add the capability: cookies.cs. File this in, and your Squeak image will start storing and transmitting cookies when it makes HTTP requests.

Note that mp3.com only requests an email address, etc., if you have not already entered that information and obtained the relevant cookie. Thus, a Squeak image can be primed once by downloading an MP3 file through Scamper. The image will then have the necessary cookies to submit, and mp3.com won't ask for the information again.


M3U files

mp3.com, and probably other sites, concludes a search by providing a file with the extension "m3u". The usual content-type of the file is "audio/x-mpegurl". Such files contain a list of URLs, one per line, pointing to the actual MP3 files; thus, to "play" an M3U file, open the file, retrieve the MP3 file named on each line, and then play the individual MP3 files.

Note that M3U files often use Unix line-ending conventions (LF), whereas Squeak strings use CR. To cope with this, you can use the method #withSqueakLineEndings.
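
A minimal workspace sketch of turning M3U text into a list of URLs might look like the following. The sample text and variable names are hypothetical, and skipping lines that begin with # is an assumption that goes beyond what this page says (some M3U files use such lines for comments or extra information).

```smalltalk
| lf m3uText urls |
lf := String with: Character lf.
"Hypothetical sample standing in for the contents of a downloaded M3U file."
m3uText := 'http://example.com/a.mp3', lf, '# a comment line', lf, 'http://example.com/b.mp3', lf.
"Convert LF line endings to the CR that Squeak uses, then split into lines."
urls := (m3uText withSqueakLineEndings findTokens: (String with: Character cr))
    collect: [:line | line withBlanksTrimmed].
"Drop blank lines and lines starting with #."
urls := urls reject: [:line | line isEmpty or: [line first = $#]].
urls
```

Each remaining string can then be turned into a URL with asUrl and retrieved in the same way as the search request shown earlier on this page.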


Trying it Yourself

wwwhack.zip (screenshot) contains an image and changes file that walks you through the above things in detail, and lets you try it yourself.

