Nov 30th, 2004 10:30
Chris Burkhardt, Joe Bloggs, Magnus Lyckå, Matthew Schinckel, Paul Allopenna,
If you want to (quickly) strip all HTML tags from a string of data, try using the re module:

    import re

    with open(filename, 'r') as f:
        data = f.read()

    # Remove comments first, or a '>' inside a comment will be
    # interpreted as the end of a tag.
    # re.DOTALL lets '.' match newlines, so comments and tags that
    # span several lines are removed too.
    text = re.sub('<!--.*?-->', '', data, flags=re.DOTALL)
    text = re.sub('<.*?>', '', text, flags=re.DOTALL)

This will also strip any javascript, but only if the page has been made 'properly', that is, the javascript is within HTML comments. Otherwise only the <script> tags themselves are removed and the script source stays in the text. If you want to know how it works, read the 're' chapter in the library reference, as it discusses the usefulness of 'non-greedy' regular expressions.
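To see why the non-greedy '.*?' and the comments-first order matter, here is a small sketch run on a made-up string. The helper name strip_tags and the sample text are assumptions for illustration only, and the re.DOTALL flag is an addition so that '.' also matches newlines:

```python
import re

# Made-up sample: two nested tags plus a comment containing a '>'.
sample = '<p>Hello <b>world</b></p><!-- a > b -->!'

# Greedy: '.*' runs from the first '<' to the LAST '>', eating the text too.
greedy = re.sub('<.*>', '', sample)

# Non-greedy but without removing comments first: the '>' inside the
# comment is taken as the end of a tag, so part of the comment is left behind.
tags_only = re.sub('<.*?>', '', sample)

def strip_tags(data):
    """Hypothetical helper: drop comments first, then every remaining tag."""
    text = re.sub('<!--.*?-->', '', data, flags=re.DOTALL)
    return re.sub('<.*?>', '', text, flags=re.DOTALL)

print(repr(greedy))              # '!'
print(repr(tags_only))           # 'Hello world b -->!'
print(repr(strip_tags(sample)))  # 'Hello world!'
```

This is still only a quick approach: a '<' or '>' inside an attribute value or inside script source can confuse it, so a real HTML parser is the safer choice for untrusted or messy pages.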