Mining WWW Pages is Stupid

Mining WWW pages means heuristically parsing HTML to extract structured data. For example, when Squeak looks up the definition of a word, it makes a request to a web site and then extracts the definition from the HTML that is returned. In other words, Squeak "mines" the WWW to find the definition.
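Here is a rough sketch of what such a miner might look like, written in Python rather than Squeak. It is not how Squeak actually does it. The page layout, the "definition" class name, and the names SAMPLE_PAGE, DefinitionMiner, and mine_definition are all made up for illustration, and a canned string stands in for the HTTP request.

```python
from html.parser import HTMLParser

# Hypothetical page, standing in for what a dictionary site might return.
SAMPLE_PAGE = """
<html><body>
<h1>squeak</h1>
<div class="nav">Home | Search</div>
<div class="definition">A short, high-pitched sound or cry.</div>
</body></html>
"""


class DefinitionMiner(HTMLParser):
    """Collects the text of the first <div class="definition"> element.

    The class name "definition" is an assumption about the site's layout;
    that guess is exactly the heuristic part of mining.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside the target div
        self.done = False   # True once the first target div has closed
        self.parts = []     # text fragments found inside the target div

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth > 0:
            # Track nested divs so the matching close tag is found.
            if tag == "div":
                self.depth += 1
        elif tag == "div" and ("class", "definition") in attrs:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def mine_definition(html):
    """Return the mined definition, or None if the expected layout is missing."""
    miner = DefinitionMiner()
    miner.feed(html)
    miner.close()
    text = " ".join("".join(miner.parts).split())  # collapse whitespace
    return text or None


print(mine_definition(SAMPLE_PAGE))
print(mine_definition("<html><body><p>We redesigned!</p></body></html>"))
```

The first call finds the definition. The second call gets a page without the expected div and comes back with None, which is about the best a miner can do when the site's layout no longer matches its guess.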

For many reasons, this WWW mining stuff is really stupid. It is rarely certain that the parsing is correct for all possible data. The web site may change its layout at any time, completely invalidating the mining software. Mining software is tied to particular sites, so analyzing a new site requires writing new software. The software itself is complex to design, complex to implement, and complex to maintain. And... mining is just stupid! Why can't the site just provide a nice database interface? Stupid stupid stupid.

And yet, WWW mining can be very useful. Sometimes, web site owners just don't have the inclination or the time to provide a nice database interface. When that's the case, you must either use heuristic approaches or abandon the data on that site.

Re: Why can't the site just provide a nice database interface? Stupid stupid stupid.

That's what XML is going to change... -Anonymous

The technology is already present to provide nice services over the internet, when it is desired. CDDB works fine without XML, for example. It simply takes extra work, and not all sites are going to volunteer that work for you. Not all sites will even let you volunteer the work, because of the overhead of incorporating outside code into their site. -Lex Spoon

