
RE: text processing as *the* problem

Hi Kevin.

I too have similar text processing needs. FYI, for the type of processing you suggested:

> The sort of problem I often need to solve is to extract the links and
> accompanying text from a web page, but only from a certain part of the
> page.  I would like to be able to easily program some processing rules,
> such as "ignore tables that contain forms" or "only collect links that
> begin with '/news/'".

you may want to look at wrapper (screen-scraping) technologies. I've used W4F successfully at my last company to extract content from web pages in fairly large volume (probably close to 1000 web pages). The syntax is not especially intuitive, but for crawling DOM trees and yanking out content based on the node you're at, W4F is/was probably one of the better languages (and associated tool sets). W4F code is compiled into Java, so if you're a Java fan like me, it can work out nicely.

http://db.cis.upenn.edu/W4F/
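
For flavor, here is a rough sketch of the kind of extraction Kevin describes, written against the plain org.w3c.dom API rather than W4F's own declarative rules (the Java that W4F generates looks different); it assumes the page has already been cleaned up into well-formed XHTML:

import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class NewsLinkExtractor {

    // Collect the href and accompanying text of <a> elements whose href
    // starts with "/news/", skipping any <table> that contains a <form>.
    public static List<String> extract(Document doc) {
        List<String> links = new ArrayList<String>();
        walk(doc.getDocumentElement(), links);
        return links;
    }

    private static void walk(Element e, List<String> links) {
        String tag = e.getTagName().toLowerCase();

        // Rule: "ignore tables that contain forms"
        if (tag.equals("table") && e.getElementsByTagName("form").getLength() > 0) {
            return;
        }

        // Rule: "only collect links that begin with '/news/'"
        if (tag.equals("a")) {
            String href = e.getAttribute("href");
            if (href.startsWith("/news/")) {
                links.add(href + "  " + e.getTextContent().trim());
            }
        }

        // Recurse into child elements
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                walk((Element) c, links);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(args[0]);   // expects well-formed XHTML
        for (String link : extract(doc)) {
            System.out.println(link);
        }
    }
}

Each of Kevin's two example rules maps to a single guard in the tree walk; W4F's extraction rules let you state that sort of thing declaratively, and the toolkit generates the Java for you.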

Also, check out WebL, originally from a research group at Compaq. (It was WebL; I guess it has since been recoded in Java and now has another name?)

http://www.research.compaq.com/SRC/WebL/index.html

Another is DEL, but I haven't looked at it in detail yet. Considering it's XML-based, well... most likely it will be very unintuitive to program.

http://www.w3.org/TR/data-extraction

Regards...

-Clint

-----Original Message-----
From: owner-ll1-discuss@ai.mit.edu
[mailto:owner-ll1-discuss@ai.mit.edu]On Behalf Of KELLEHER,KEVIN
(Non-HP-Roseville,ex1)
Sent: Wednesday, November 28, 2001 2:42 PM
To: 'll1-discuss@ai.mit.edu'
Subject: text processing as *the* problem



Congratulations on LL1.  I'm glad that language developers can get together
and share ideas.

As a language user, I am looking for a language that I can fall in love
with, and have been following the appearance of new languages for several
years.  However, there is a problem space that seems neglected, and that is
text processing.

I am well acquainted with regular expressions and the sort of work that
can be done with Perl, for example, but it does not have the sort of
*feel* that I am looking for.

My objection to the regular-expression approach is that it is a low-level
approach.  It is not far removed from number crunching as a computing
activity.  String-processing seems like an afterthought in language design.

Are there any languages, even big languages, that were *built* with
text processing in mind?  Are there approaches that are not limited
to an implementation of regular-expression matching?

The sort of problem I often need to solve is to extract the links and
accompanying text from a web page, but only from a certain part of the
page.  I would like to be able to easily program some processing rules,
such as "ignore tables that contain forms" or "only collect links that
begin with '/news/'".

I've also encountered problems in parsing XML that have required some
"heavy lifting" in terms of string comparisons that I have had to implement
in C.  All the while something inside cries out that it shouldn't be so hard
to do.

Kevin Kelleher