[Seriously OT] Regex to convert HTML to text

Old Sep 10, 2002 | 05:46 PM
  #1  
AusS2000's Avatar
Thread Starter
Moderator
20 Year Member
 
Joined: Oct 2000
Posts: 30,809
Likes: 15
From: Sydney
Default [Seriously OT] Regex to convert HTML to text

I realise this is seriously OT for an S2000 forum, and the question only applies to a very small percentage of the populace, but here goes.

I need to take a string of HTML and strip out the tags, i.e. anything between a '<' and the subsequent '>'.

It doesn't really need to do much more than that as it's only for the first 160 characters of a string.

The software I'm using has a REGEX metatag that, with the correct expression, should be able to achieve this.

Any help from REGEX aficionados would be greatly appreciated.
Reply
Old Sep 10, 2002 | 09:07 PM
  #2  
PeaceLove&S2K's Avatar
20 Year Member
 
Joined: Jul 2002
Posts: 19,257
Likes: 19
From: San Diego, CA
Default

does
:%s/<[^>]*>//g

in vi work?

what kind of regex implementation are you using?
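
For anyone not working in vi, here is a minimal sketch of the same pattern in Python. Python is used only for illustration; the function name strip_tags and its limit parameter are assumptions, with the 160-character default taken from the original question.

```python
import re

# Same pattern as the vi command: '<', any run of non-'>' characters, then '>'.
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(html: str, limit: int = 160) -> str:
    """Remove anything between '<' and the next '>', then keep the first `limit` characters."""
    return TAG_RE.sub("", html)[:limit]

print(strip_tags("<p>Hello, <b>S2000</b> owners!</p>"))  # prints: Hello, S2000 owners!
```

This treats the first '>' after a '<' as the end of the tag, so a literal '>' inside an attribute value or a comment would cut the tag short.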
Reply
Old Sep 11, 2002 | 12:12 AM
  #3  
The Unabageler's Avatar
Former Moderator
 
Joined: Oct 2000
Posts: 20,448
Likes: 0
From: internet
Default

that regexp is correct for most cases. this one is much more thorough, it does html entities as well and handles some other whacky cases.
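
The more thorough version is not reproduced in this thread, so the following is not that code. It is only a rough Python sketch of the entity step, assuming the standard library's html.unescape is an acceptable stand-in; the name html_to_text is made up for illustration.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]*>")

def html_to_text(markup: str) -> str:
    # Drop tags first, then decode entities, so a decoded '&lt;' is never mistaken for a tag.
    without_tags = TAG_RE.sub("", markup)
    return html.unescape(without_tags)

print(html_to_text("<p>Fish &amp; chips &lt; $10</p>"))  # prints: Fish & chips < $10
```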
Reply
Old Sep 11, 2002 | 12:13 AM
  #4  
The Unabageler's Avatar
Former Moderator
 
Joined: Oct 2000
Posts: 20,448
Likes: 0
From: internet
Default

it's in perl but u should be able to translate it to whatever language you want.
Reply
Old Sep 11, 2002 | 01:06 AM
  #5  
AusS2000's Avatar
Thread Starter
Moderator
20 Year Member
 
Joined: Oct 2000
Posts: 30,809
Likes: 15
From: Sydney
Default

Wow!! Thanks guys.

I use Tango. @REGEX is a metatag in Tango files.

It has the format:

Syntax

<@REGEX EXPR=expression STR=text TYPE=type>



Description

Provides an interface to POSIX regular expression matching routines from inside Tango. This gives you powerful tools to match text patterns if they are needed.

<@REGEX> accepts as attributes the regular expression (EXPR), the text to match the pattern against (STR), and the type of the regular expression (TYPE), basic or extended. If the attributes contain spaces, they must be quoted, with single or double quotes as appropriate. <@REGEX> returns its results in the form of an array and should be assigned to a variable via <@ASSIGN>.

Upon a successful match, <@REGEX> returns an array with three columns and n+1 rows, where n is the number of parenthesized subexpressions in the pattern. The first column contains the matching text, the second column contains the start index of the matching portion, and the third column gives the length of the matching portion. The start and length are compatible with the <@SUBSTRING> tag.

Rows i from 1 to n give the ith matching parenthesized subexpression, and row n+1 gives the entire matching portion of the text. (If there are no parenthesized subexpressions, the whole match is returned in the first row.)

The table gives a sample array returned from <@REGEX>.

<@REGEX EXPR="([[:alpha:]]+),[[:space:]]+([A-Z]{2})[[:space:]]+([A-Z][0-9][A-Z] [0-9][A-Z][0-9])" STR="in Mississauga, ON L5N 6J5." TYPE=E>.

Matching text            Start  Length
Mississauga              4      11
ON                       17     2
L5N 6J5                  20     7
Mississauga, ON L5N 6J5  4      23
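
To make the row layout concrete, here is a small Python sketch that rebuilds the same array from the sample string. It is an emulation, not Tango: regex_rows is a hypothetical helper, and because Python's re module has no POSIX character classes, [[:alpha:]] and [[:space:]] are spelled out as [A-Za-z] and \s.

```python
import re

def regex_rows(pattern: str, text: str):
    """Return (matched text, 1-based start, length) rows: subexpressions first, whole match last."""
    m = re.search(pattern, text)
    if m is None:
        return None
    # Groups 1..n are the parenthesized subexpressions; group 0 is the whole match.
    groups = list(range(1, m.re.groups + 1)) + [0]
    return [(m.group(g), m.start(g) + 1, m.end(g) - m.start(g)) for g in groups]

# POSIX classes from the sample expression, rewritten for Python's re module.
PATTERN = r"([A-Za-z]+),\s+([A-Z]{2})\s+([A-Z][0-9][A-Z] [0-9][A-Z][0-9])"

for row in regex_rows(PATTERN, "in Mississauga, ON L5N 6J5."):
    print(row)
# ('Mississauga', 4, 11)
# ('ON', 17, 2)
# ('L5N 6J5', 20, 7)
# ('Mississauga, ON L5N 6J5', 4, 23)
```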

If any attributes are missing, <@REGEX> returns a string naming the problem attributes. Upon an error condition, <@REGEX> returns a single character: "C" for a pattern compile failure and "M" for a match failure. You can easily test for success by using <@VARINFO NAME=variable ATTRIBUTE=TYPE>.
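
The "C" and "M" convention can be mimicked in Python to show why checking the type of the result is enough. This is only a sketch: regex_tag is a hypothetical stand-in for <@REGEX>, it returns just the whole-match row, and it uses Python's regex dialect rather than POSIX.

```python
import re

def regex_tag(expr: str, text: str):
    """Return "C" on a compile failure, "M" on no match, else a list holding the whole-match row."""
    try:
        compiled = re.compile(expr)
    except re.error:
        return "C"
    m = compiled.search(text)
    if m is None:
        return "M"
    return [(m.group(0), m.start() + 1, len(m.group(0)))]

# A valid pattern, a pattern with an unterminated bracket expression, and a pattern that cannot match.
for expr in ("<[^>]*>", "<[^>*>", "<table>"):
    result = regex_tag(expr, "<b>bold</b>")
    print(expr, "->", "array" if isinstance(result, list) else result)
# <[^>]*> -> array
# <[^>*> -> C
# <table> -> M
```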
Reply
All times are GMT -8.