[Seriously OT] Regex to convert HTML to text


Sep 10, 2002 | 05:46 PM

AusS2000

Thread Starter

Moderator

Joined: Oct 2000

Posts: 30,809

Likes: 15

From: Sydney


I realise this is seriously OT for an S2000 forum, and a topic that only applies to a very small percentage of the populace, but here goes.

I need to take a string of HTML and strip out the tags, i.e. anything between a '<' and the subsequent '>'.

It doesn't really need to do much more than that as it's only for the first 160 characters of a string.

The software I'm using has a REGEX metatag that, with the correct expression, should be able to achieve this.

Any help from regex aficionados would be greatly appreciated.

Sep 10, 2002 | 09:07 PM

PeaceLove&S2K

Joined: Jul 2002

Posts: 19,257

Likes: 19

From: San Diego, CA

does
:%s/<[^>]*>//g

in vi work?

what kind of regex implementation are you using?
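
For anyone reading along outside vi, here is a minimal Python sketch of the same substitution, applied to the original question. The names `TAG`, `strip_tags` and `limit` are made up for illustration, and I am assuming the 160-character cut is wanted after the tags are removed rather than before.

```python
import re

# Same pattern as the vi command: a '<', any run of non-'>' characters, then '>'.
TAG = re.compile(r"<[^>]*>")

def strip_tags(html, limit=160):
    """Remove anything between '<' and the next '>', then keep the first `limit` characters."""
    return TAG.sub("", html)[:limit]

sample = "<p>Hello <b>S2000</b> owners</p>"
print(strip_tags(sample))  # prints: Hello S2000 owners
```

Cutting after stripping avoids chopping a tag in half, which would leave a dangling '<a hre' that the pattern can no longer match. Like the vi version, this will trip over a '>' inside an attribute value, and it keeps the contents of script and style blocks.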

Sep 11, 2002 | 12:12 AM

The Unabageler

Former Moderator

Joined: Oct 2000

Posts: 20,448

Likes: 0

From: internet

that regexp is correct for most cases. this one is much more thorough: it handles html entities as well as some other wacky cases.

Sep 11, 2002 | 12:13 AM

The Unabageler

Former Moderator

Joined: Oct 2000

Posts: 20,448

Likes: 0

From: internet

it's in perl but u should be able to translate it to whatever language you want.
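
As a rough idea of what "tags plus entities" can look like in another language, here is a small Python sketch. It is not a translation of the Perl expression mentioned above, just an illustration using the standard library's `html.unescape`; the function name `html_to_text` is hypothetical.

```python
import html
import re

TAG = re.compile(r"<[^>]*>")

def html_to_text(markup):
    """Strip tags, decode entities such as &amp; and &lt;, and collapse whitespace."""
    without_tags = TAG.sub("", markup)
    # Decode entities only after the tags are gone, so a decoded '<'
    # is never mistaken for the start of a tag.
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())

print(html_to_text("<p>Tom &amp; Jerry &lt;3  cheese</p>"))  # prints: Tom & Jerry <3 cheese
```

The order matters: decoding first would turn `&lt;3 ... &gt;` style text into something the tag pattern could swallow.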

Sep 11, 2002 | 01:06 AM

AusS2000

Thread Starter

Moderator

Joined: Oct 2000

Posts: 30,809

Likes: 15

From: Sydney

Wow!! Thanks guys.

I use Tango. @REGEX is a metatag in Tango files.

It has the format:

Syntax

<@REGEX EXPR=expression STR=text TYPE=type>

Description

Provides an interface to POSIX regular expression matching routines from inside Tango. This gives you powerful tools to match text patterns if they are needed.

<@REGEX> accepts as attributes the regular expression (EXPR), the text to match the pattern against (STR), and the type of the regular expression (TYPE), basic or extended. If the attributes contain spaces, they must be quoted--single or double, as appropriate. <@REGEX> returns its results in the form of an array and should be assigned to a variable via <@ASSIGN>.

Upon a successful match, <@REGEX> returns an array with three columns and n+1 rows, where n is the number of parenthesized subexpressions in the pattern. The first column contains the matching text, the second column contains the start index of the matching portion, and the third column gives the length of the matching portion. The start and length are compatible with the <@SUBSTRING> tag.

Rows i from 1 to n give the ith matching parenthesized subexpression, and row n+1 gives the entire matching portion of the text. (If there are no parenthesized subexpressions, the whole match is returned in the first row.)

The table gives a sample array returned from <@REGEX> for the following call:

<@REGEX EXPR="([[:alpha:]]+),[[:space:]]+([A-Z]{2})[[:space:]]+([A-Z][0-9][A-Z] [0-9][A-Z][0-9])" STR="in Mississauga, ON L5N 6J5." TYPE=E>

Matching text            Start  Length
Mississauga              4      11
ON                       17     2
L5N 6J5                  20     7
Mississauga, ON L5N 6J5  4      23

If any attributes are missing, <@REGEX> returns a string naming the problem attributes, i.e. a textual message indicating the missing items. Upon an error condition, <@REGEX> returns a single character: "C" for a pattern compile failure or "M" for a match failure. You can easily test for success by using <@VARINFO NAME=variable ATTRIBUTE=TYPE>.
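
To make the array layout and the error codes described above concrete, here is a Python sketch that mimics them. It is only an emulation under assumptions, not how Tango implements the tag: `regex_rows` is a made-up name, and because Python's `re` module does not support POSIX classes such as `[[:alpha:]]`, the sample pattern is rewritten with `[A-Za-z]` and `\s`.

```python
import re

def regex_rows(pattern, text):
    """Mimic the described result: one row per parenthesized subexpression,
    then a final row for the whole match. Each row is
    (matching text, 1-based start index, length)."""
    try:
        compiled = re.compile(pattern)
    except re.error:
        return "C"  # pattern compile failure
    match = compiled.search(text)
    if match is None:
        return "M"  # match failure
    rows = []
    # Groups 1..n first, then group 0 (the entire match) as row n+1.
    for i in list(range(1, compiled.groups + 1)) + [0]:
        # Assumption: a group that took no part in the match becomes an empty row.
        matched = match.group(i) or ""
        rows.append((matched, match.start(i) + 1, match.end(i) - match.start(i)))
    return rows

# Hand-translated from the POSIX sample pattern above.
PATTERN = r"([A-Za-z]+),\s+([A-Z]{2})\s+([A-Z][0-9][A-Z] [0-9][A-Z][0-9])"

for row in regex_rows(PATTERN, "in Mississauga, ON L5N 6J5."):
    print(row)
```

In this sketch a list return means success and a one-character string means failure, which plays the same role as checking the variable's type with <@VARINFO>.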