After seeing it mentioned on the newsgroup, I just started trying out
Amit Patel's proxy server (http://theory.stanford.edu/~amitp/proxy.html),
which does a good job of blocking all sorts of annoying web page behaviours.

It occurred to me that it would be possible to use this proxy server
idea with the Spambayes classifier to come up with an all-Python web
filter (suitable for use as a parental control or company internet monitor).
Essentially, the proxy could be taught to classify pages by signature and
block or restrict access to questionable pages, rather than relying on the
traditionally buggy (in the sense of reporting false hits) keyword-based
approaches.
Does anyone have any experience in this area to say whether this
approach is workable?
Paul
jerf Feb 1 2003 20:08
On Sat, 01 Feb 2003 22:47:06 +0000, Paul Paterson wrote:
> Does anyone have any experience in this area to say whether this
> approach is workable?
Perfectly workable, though it would probably require some tweaks to the
tokenizer to work as well as possible.
It would not take long to set up at least a prototype of this.
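One likely tokenizer tweak is stripping HTML markup before tokenizing, so the classifier scores the visible page text rather than tag soup. A minimal sketch of that idea (Python 3 standard library; the names here are illustrative, not part of Spambayes or proxy3):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of a page, ignoring tags and the
    bodies of <script> and <style> elements."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.chunks.append(data)

def page_text(html):
    """Return the words of the visible page text."""
    p = TextExtractor()
    p.feed(html)
    return " ".join(p.chunks).split()

print(page_text("<html><script>x=1</script><body>Hello <b>web</b></body></html>"))
# ['Hello', 'web']
```

The word list can then be fed to the classifier in place of the raw response body.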
Skip Montanaro Feb 1 2003 21:26
Paul> It occurred to me that it would be possible to use this proxy
Paul> server idea with the Spambayes classifier to come up with an
Paul> all-Python web filter (suitable for use as a parental control or
Paul> company internet monitor).
Ignoring training, parameterization via the config file, and the notion that
classifying web pages will be slightly different from classifying email,
maybe this will get you started:
Whenever I tried to connect to a web server I got this traceback:
Traceback (most recent call last):
  File "proxy3.py", line 540, in proxy
    connections = connections + ready.process()
  File "/Users/skip/src/proxy/proxy3_web.py", line 137, in process
    stream = self.create_stream(length)
  File "/Users/skip/src/proxy/proxy3_web.py", line 167, in create_stream
    self.serverheaders)
  File "/Users/skip/src/proxy/proxy3_filter.py", line 38, in get_filter
    serverheaders=serverheaders)
TypeError: __init__() got an unexpected keyword argument 'clientheaders'

which didn't look related to this module, but apparently was, because when I
started the proxy as

    python proxy3.py stdio

everything worked fine.
I've never used Patel's proxy, but it looks like it should be a relative
no-brainer to integrate with Spambayes. It's just that my brain is
apparently now disengaged, it being Saturday evening. I'll let someone else
fiddle with this.
Skip
Paul Paterson Feb 2 2003 1:24
jerf@compy.attbi.com wrote:
> On Sat, 01 Feb 2003 22:47:06 +0000, Paul Paterson wrote:
> >Does anyone have any experience in this area to say whether this
> >approach is workable?
> Perfectly workable, though it would probably require some tweaks to the
> tokenizer to work as well as possible.
> It would not take long to set up at least a prototype of this.
The prototype turned out to be shorter than my original post:

#
# mod_spambayesfilter.py - used by proxy3
#
from proxy3_filter import BufferSomeFilter
from spambayes import tokenizer, classifier

class SpamBayesFilter(BufferSomeFilter):
    BUFFER_LEN = 128
    LOWER_BOUND = 0.5
    tok = tokenizer.Tokenizer()
    checker = classifier.Classifier()

    def filter(self, s):
        if self.checker.chi2_spamprob(self.tok.tokenize(s)) > self.LOWER_BOUND:
            return "Not authorized"
        else:
            return s
Am I right in thinking that the Spambayes tokenizer will just revert to
splitting up words if it doesn't think it is looking at an email?
Perhaps this might be sufficient for web page filtering, since web pages
probably won't be using the same kinds of subterfuge that spammers resort to.
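For comparison, the plain word-splitting fallback being described would amount to something like this (a rough stand-in, not the actual Spambayes tokenizer):

```python
import re

def simple_tokens(text, max_len=12):
    """Rough word-splitting fallback: lowercase runs of word-ish
    characters, skipping very short and very long tokens."""
    for word in re.findall(r"[A-Za-z0-9'$-]+", text):
        if 2 < len(word) <= max_len:
            yield word.lower()

print(list(simple_tokens("Sports scores tonight: Rangers 3, Flyers 2!")))
# ['sports', 'scores', 'tonight', 'rangers', 'flyers']
```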
Skip Montanaro Feb 2 2003 13:17
>> Perfectly workable, though it would probably require some tweaks to
>> the tokenizer to work as well as possible.
...
Paul> The prototype turned out to be shorter than my original post,
...
This doesn't quite work right. (Nor does the similar version I posted
earlier.) The .filter() method gets passed chunks of an HTML response, not
the entire thing. The SpamBayesFilter class should subclass
BufferAllFilter. Here's a tweaked version of mine which does a better job:
import os
from proxy3_filter import *
import proxy3_options
from spambayes import hammie, Options, mboxutils

dbf = os.path.expanduser(Options.options.hammiefilter_persistent_storage_file)

class SpambayesFilter(BufferAllFilter):
    hammie = hammie.open(dbf, 1, 'r')
Sorry for the too quick post. In rearranging things I lost the spam return.
Just to be sure it was actually filtering something, I searched for "sex" at
Google. It let that page in, allowed the safersex and SEX.ETC pages
through, but blocked HBO's Sex and the City and janesguide. Note that this
is using my current hammie.db file, which has only been trained on my ham
and spam email collections. I don't expect it to necessarily do a very good
job with web pages given no training.
Skip
import os
from proxy3_filter import *
import proxy3_options
from spambayes import hammie, Options, mboxutils

dbf = os.path.expanduser(Options.options.hammiefilter_persistent_storage_file)

class SpambayesFilter(BufferAllFilter):
    hammie = hammie.open(dbf, 1, 'r')

    def filter(self, s):
        if self.reply.split()[1] == '200':
            prob = self.hammie.score("%s\r\n%s" % (self.serverheaders, s))
            print "| prob: %.5f" % prob
            if prob >= Options.options.spam_cutoff:
                print self.serverheaders
                print "text:", s[0:40], "...", s[-40:]
                return "not authorized"
        return s

from proxy3_util import *
register_filter('*/*', 'text/html', SpambayesFilter)
"Skip Montanaro" wrote in message news:mailman.1044210485.12265.python-list@python.org...
> Sorry for the too quick post. In rearranging things I lost the spam return.
> Just to be sure it was actually filtering something, I searched for "sex" at
> Google. It let that page in, allowed the safersex and SEX.ETC pages
> through, but blocked HBO's Sex and the City and janesguide. Note that this
> is using my current hammie.db file, which has only been trained on my ham
> and spam email collections. I don't expect it to necessarily do a very good
> job with web pages given no training.
> Skip
>
> import os
> from proxy3_filter import *
> import proxy3_options
> from spambayes import hammie, Options, mboxutils
>
> dbf = os.path.expanduser(Options.options.hammiefilter_persistent_storage_file)
>
> class SpambayesFilter(BufferAllFilter):
>     hammie = hammie.open(dbf, 1, 'r')
>
>     def filter(self, s):
>         if self.reply.split()[1] == '200':
>             prob = self.hammie.score("%s\r\n%s" % (self.serverheaders, s))
>             print "| prob: %.5f" % prob
>             if prob >= Options.options.spam_cutoff:
>                 print self.serverheaders
>                 print "text:", s[0:40], "...", s[-40:]
>                 return "not authorized"
>         return s
>
> from proxy3_util import *
> register_filter('*/*', 'text/html', SpambayesFilter)
This looks great - I'm giving this a go now.
I think that, as you say, the key now is to train on a corpus of web pages
rather than spam/ham. I notice that Spambayes has a proxy server which can
be used for easy training. I'll take a look at this and see if it can be
used to train on web pages too.
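Whichever front end does the training, the bookkeeping underneath is just per-class token counts plus a score. A toy stand-in showing the train-then-score shape (this is not the Spambayes implementation, only an illustration of the idea):

```python
import math
from collections import Counter

class ToyPageClassifier:
    """Tiny Naive-Bayes-style scorer: train() accumulates per-class
    token counts, score() returns a 0..1 'badness' probability."""
    def __init__(self):
        self.good = Counter()
        self.bad = Counter()

    def train(self, tokens, is_bad):
        (self.bad if is_bad else self.good).update(tokens)

    def score(self, tokens):
        g_total = sum(self.good.values()) + 1
        b_total = sum(self.bad.values()) + 1
        logodds = 0.0
        for t in tokens:
            p_bad = (self.bad[t] + 1) / b_total    # add-one smoothing
            p_good = (self.good[t] + 1) / g_total
            logodds += math.log(p_bad / p_good)
        return 1.0 / (1.0 + math.exp(-logodds))    # squash to 0..1

clf = ToyPageClassifier()
clf.train("stock market economy report".split(), is_bad=False)
clf.train("football score playoff highlights".split(), is_bad=True)
print(clf.score("playoff score tonight".split()))  # > 0.5, sports-like
print(clf.score("economy report today".split()))   # < 0.5, news-like
```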
Skip Montanaro Feb 2 2003 19:34
Paul> I think that, as you say, the key now is to train on a corpus of
Paul> web pages rather than spam/ham. I notice that Spambayes has a
Paul> proxy server which can be used for easy training. I'll take a look
Paul> at this and see if it can be used to train on web pages too.
Yes, you can use pop3proxy. You might be able to fudge the proxytee.py
script with something like:
httpget some-url | proxytee.py
I doubt there's anything in proxytee which is email-specific. See what
happens.
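If no httpget utility is to hand, a few lines of Python can stand in for it (Python 3 shown here; a 2003-era version would use urllib.urlopen instead):

```python
import urllib.request

def httpget(url):
    """Fetch a URL and return the response body, roughly what an
    'httpget' command-line utility would emit on stdout."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()

# A script wrapping this could then be piped into the trainer:
#   python httpget.py some-url | python proxytee.py
print(httpget("data:,hello"))
```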
Skip
Paul Paterson Feb 3 2003 1:35
"Skip Montanaro" wrote in message news:mailman.1044238025.4542.python-list@python.org...
> Paul> I think that, as you say, the key now is to train on a corpus of
> Paul> web pages rather than spam/ham. I notice that Spambayes has a
> Paul> proxy server which can be used for easy training. I'll take a look
> Paul> at this and see if it can be used to train on web pages too.
> Yes, you can use pop3proxy. You might be able to fudge the proxytee.py
> script with something like:
> httpget some-url | proxytee.py
> I doubt there's anything in proxytee which is email-specific. See what
> happens.
As a quick hack I put the teaching code inside the proxy filter and then
surfed for a bit to give it some examples of good pages (news and such) and
"bad" pages (sports!). It was very quickly able to spot the sports pages,
even on new sites, and it was able to pick out sport sections from the news
sections on an individual site.
I'll try to build a more rigorous test with a larger corpus - it looks
promising so far.
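A rigorous test usually means a held-out split: train on part of each corpus, score the remainder, and count misclassifications. A sketch of such a harness (scorer_factory is anything with train/score methods; all names here are illustrative, not from Spambayes):

```python
import random

def evaluate(pages, labels, scorer_factory, train_frac=0.8, cutoff=0.5, seed=0):
    """Shuffle, split, train on the first chunk, and report accuracy
    on the held-out remainder.  pages is a list of token lists and
    labels[i] is True when page i should be blocked."""
    rng = random.Random(seed)
    order = list(range(len(pages)))
    rng.shuffle(order)
    split = int(len(order) * train_frac)
    clf = scorer_factory()
    for i in order[:split]:
        clf.train(pages[i], labels[i])
    correct = sum(
        (clf.score(pages[i]) >= cutoff) == labels[i]
        for i in order[split:]
    )
    return correct / max(1, len(order) - split)
```

Running it with several seeds gives a feel for how sensitive the result is to which pages land in the training set.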