[Seriously OT] Regex to convert HTML to text
I realise this is seriously OT for an S2000 forum, and a question that applies to only a very small percentage of the populace, but here goes.
I need to take a string of HTML and strip out the tags, i.e. anything in between '<' and the subsequent '>'.
It doesn't really need to do much more than that, as it's only for the first 160 characters of a string.
The software I'm using has a REGEX metatag that, with the correct expression, should be able to achieve this.
Any help from REGEX aficionados would be greatly appreciated.
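For anyone wanting to try the idea outside Tango, here is a minimal Python sketch of the approach described above: delete everything from a '<' up to the next '>', then keep the first 160 characters. The names `TAG_PATTERN` and `strip_tags` are made up for illustration, and applying the 160-character limit after stripping (rather than before) is an assumption.

```python
import re

# A '<', then any run of characters that are not '>', then the closing '>'.
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_tags(html, limit=160):
    """Remove anything between '<' and the next '>', then truncate.

    Hypothetical helper: the limit is applied to the stripped text,
    which is an assumption about what the original poster needs.
    """
    return TAG_PATTERN.sub("", html)[:limit]


if __name__ == "__main__":
    print(strip_tags("<p>Hello <b>world</b></p>"))
```

This is deliberately naive: it leaves entities such as `&amp;` untouched and would be confused by a '>' inside an attribute value, which is probably acceptable for a short preview string.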
That regexp is correct for most cases. This one is much more thorough: it handles HTML entities as well as some other wacky cases.
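The regexes being compared here are not reproduced in the thread, so the following is not either of them. It is only a rough Python sketch of what "more thorough" might involve: dropping comments and script/style blocks, stripping the remaining tags, and decoding entities. The function name `html_to_text` is hypothetical.

```python
import html
import re


def html_to_text(markup):
    """Rough HTML-to-text conversion (illustrative sketch only)."""
    flags = re.IGNORECASE | re.DOTALL
    # Drop comments first, since they may contain '<' or '>' themselves.
    text = re.sub(r"<!--.*?-->", "", markup, flags=flags)
    # Drop script and style elements together with their contents.
    text = re.sub(r"<(script|style)\b.*?</\1\s*>", "", text, flags=flags)
    # Strip whatever tags remain.
    text = re.sub(r"<[^>]*>", "", text)
    # Decode entities only after stripping, so that a decoded '&lt;'
    # is not mistaken for the start of a tag.
    text = html.unescape(text)
    # Collapse runs of whitespace left behind by the removed markup.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(html_to_text("<p>Fish &amp; chips</p><script>var a = 1 > 0;</script>"))
```

Regexes are still not a real HTML parser, so malformed markup can slip through; for a 160-character preview that trade-off may well be fine.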
Wow!! Thanks guys.
I use Tango. @REGEX is a metatag in Tango files.
It has the format:
Syntax
<@REGEX EXPR=expression STR=text TYPE=type>
Description
Provides an interface to POSIX regular expression matching routines from inside Tango. This gives you powerful tools to match text patterns if they are needed.
<@REGEX> accepts as attributes the regular expression (EXPR), the text to match the pattern against (STR), and the type of the regular expression (TYPE), basic or extended. If the attributes contain spaces, they must be quoted, with single or double quotes as appropriate. <@REGEX> returns its results in the form of an array and should be assigned to a variable via <@ASSIGN>.
Upon a successful match, <@REGEX> returns an array with three columns and n+1 rows, where n is the number of parenthesized subexpressions in the pattern. The first column contains the matching text, the second column contains the start index of the matching portion, and the third column gives the length of the matching portion. The start and length are compatible with the <@SUBSTRING> tag.
Rows i from 1 to n give the ith matching parenthesized subexpression, and row n+1 gives the entire matching portion of the text. (If there are no parenthesized subexpressions, the whole match is returned in the first row.)
The table gives a sample array returned from this <@REGEX> call:
<@REGEX EXPR="([[:alpha:]]+),[[:space:]]+([A-Z]{2})[[:space:]]+([A-Z][0-9][A-Z] [0-9][A-Z][0-9])" STR="in Mississauga, ON L5N 6J5." TYPE=E>
Matching text            Start  Length
Mississauga              4      11
ON                       17     2
L5N 6J5                  20     7
Mississauga, ON L5N 6J5  4      23
Upon an error condition, <@REGEX> returns a single character: "C" for a pattern compile failure or "M" for a match failure. If any attributes are missing, it instead returns a textual message indicating the missing items. You can easily test for success by using <@VARINFO NAME=variable ATTRIBUTE=TYPE>.
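To make the result layout concrete, here is a small Python emulation of what the documentation above describes: one row per parenthesized subexpression, a final row for the whole match, 1-based start indexes, and "C" or "M" on failure. This is a sketch, not Tango's implementation. The function `regex_rows` is hypothetical, the POSIX classes `[[:alpha:]]` and `[[:space:]]` are swapped for `[A-Za-z]` and `\s` because Python's `re` module does not support them, and the handling of a subexpression that takes no part in the match is an assumption.

```python
import re


def regex_rows(pattern, text):
    """Return (matching text, 1-based start, length) rows.

    Rows 1..n hold the parenthesized subexpressions and row n+1 holds
    the whole match. Returns "C" if the pattern does not compile and
    "M" if nothing matches, mirroring the documented error codes.
    """
    try:
        compiled = re.compile(pattern)
    except re.error:
        return "C"
    match = compiled.search(text)
    if match is None:
        return "M"
    rows = []
    # Groups 1..n first, then group 0 (the entire match) last.
    for i in list(range(1, compiled.groups + 1)) + [0]:
        if match.group(i) is None:
            # Assumption: a group that did not participate is reported
            # as an empty row.
            rows.append(("", 0, 0))
        else:
            rows.append((match.group(i), match.start(i) + 1,
                         match.end(i) - match.start(i)))
    return rows


if __name__ == "__main__":
    pattern = r"([A-Za-z]+),\s+([A-Z]{2})\s+([A-Z][0-9][A-Z] [0-9][A-Z][0-9])"
    for row in regex_rows(pattern, "in Mississauga, ON L5N 6J5."):
        print(row)
```

Run against the sample string, this yields the same four rows as the table, which is a handy sanity check on the 1-based start index and the "whole match comes last" ordering.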