dBforums Archive for: comp.lang.python

Spambayes + HTTP proxy server

Paul Paterson
Feb 1 2003 17:47 
After seeing it mentioned on the newsgroup, I just started trying out
Amit Patel's proxy server (http://theory.stanford.edu/~amitp/proxy.html)
which does a good job of blocking all sorts of annoying web page behaviours.

It occurred to me that it would be possible to utilize this proxy server
idea with the Spambayes classifier to come up with an all-Python web
filter (suitable for use as a parental control or company internet monitor).

Essentially, the proxy could be taught to classify pages by signature and
block or restrict access to questionable pages, rather than relying on the
traditional keyword-based approaches, which are buggy in the sense of
reporting false hits.

Does anyone have any experience in this area to say whether this
approach is workable?

Paul

jerf
Feb 1 2003 20:08 
On Sat, 01 Feb 2003 22:47:06 +0000, Paul Paterson wrote:
 > Does anyone have any experience in this area to say whether this
 > approach is workable?

Perfectly workable, though it would probably require some tweaks to the
tokenizer to work as well as possible.

It would not take long to set up at least a prototype of this.

Skip Montanaro
Feb 1 2003 21:26 
Paul> It occurred to me that it would be possible to utilize this proxy
Paul> server idea with the Spambayes classifier to come up with an
Paul> all-Python web filter (suitable for use as a parental control or
Paul> company internet monitor)

Ignoring training, parameterization via the config file, and the notion that
classifying web pages will be slightly different from classifying email,
maybe this will get you started:

from proxy3_filter import *
import proxy3_options

from spambayes import hammie, Options, mboxutils

DB = "/Users/skip/hammie.db"

HTML_ERROR = '''
<html>
<head>
<title>Forbidden</title>
</head>
<body>
Forbidden to connect: prob = %(prob)s
</body>
</html>
'''

class SpambayesFilter(BufferSomeFilter):
    def __init__(self, *args):
        BufferSomeFilter.__init__(self, *args)
        self.hammie = hammie.open(DB, 1, 'r')

    def filter(self, s):
        prob, clues = self.hammie.score(s)
        print "prob:", prob
        if prob >= Options.options.spam_cutoff:
            return HTML_ERROR % locals()
        return s

from proxy3_util import *

register_filter('*', 'text/html', SpambayesFilter)

I called it mod_spambayes.py and ran the proxy as

python proxy3.py stdio spambayes

Whenever I tried to connect to a web server I got this traceback:

Traceback (most recent call last):
  File "proxy3.py", line 540, in proxy
    connections = connections + ready.process()
  File "/Users/skip/src/proxy/proxy3_web.py", line 137, in process
    stream = self.create_stream(length)
  File "/Users/skip/src/proxy/proxy3_web.py", line 167, in create_stream
    self.serverheaders)
  File "/Users/skip/src/proxy/proxy3_filter.py", line 38, in get_filter
    serverheaders=serverheaders)
TypeError: __init__() got an unexpected keyword argument 'clientheaders'

which didn't look related to this module, but apparently was, because when I
started the proxy as

python proxy3.py stdio

everything worked fine.

I've never used Patel's proxy, but it looks like it should be a relative
no-brainer to integrate with Spambayes. It's just that my brain is
apparently now disengaged, it being Saturday evening. I'll let someone else
fiddle with this.

Skip

Paul Paterson
Feb 2 2003 1:24 
jerf@compy.attbi.com wrote:

 > On Sat, 01 Feb 2003 22:47:06 +0000, Paul Paterson wrote:
 >> Does anyone have any experience in this area to say whether this
 >> approach is workable?
 >
 > Perfectly workable, though it would probably require some tweaks to the
 > tokenizer to work as well as possible.
 >
 > It would not take long to set up at least a prototype of this.
The prototype turned out to be shorter than my original post:

#
# mod_spambayesfilter.py - used by proxy3
#
from proxy3_filter import *
from proxy3_util import *

from spambayes import tokenizer, classifier

class SpamBayesFilter(BufferSomeFilter):
    BUFFER_LEN = 128
    LOWER_BOUND = 0.5

    tok = tokenizer.Tokenizer()
    checker = classifier.Classifier()

    def filter(self, s):
        # tokenize the buffered text and block it if the classifier
        # scores it above the cutoff
        if self.checker.chi2_spamprob(self.tok.tokenize(s)) > self.LOWER_BOUND:
            return "Not authorized"
        else:
            return s

register_filter('*/*', 'text/html', SpamBayesFilter)


Am I right in thinking that the spambayes tokenizer will just revert to
splitting up words if it doesn't think it is looking at an email?
Perhaps this might be sufficient for web page filtering, since web pages
probably won't be using the same kinds of subterfuge that spammers resort to.
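
As a quick way to check, here is a minimal sketch (not from the original
thread) of feeding the tokenizer a plain HTML string, as the prototype above
does, and looking at the tokens it yields; the sample page text is made up:

from spambayes import tokenizer

# Made-up HTML fragment standing in for a real page.
page = "<html><body><h1>Huge savings!</h1>Click here to buy now.</body></html>"

tok = tokenizer.Tokenizer()

# With no mail headers present, most of what comes back should be
# plain word-level tokens drawn from the page body.
for token in tok.tokenize(page):
    print token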

Skip Montanaro
Feb 2 2003 13:17 
 >> Perfectly workable, though it would probably require some tweaks to
 >> the tokenizer to work as well as possible.
...
Paul> The prototype turned out to be shorter than my original post,
...

This doesn't quite work right. (Nor does the similar version I posted
earlier.) The .filter() method gets passed chunks of an HTML response, not
the entire thing. The SpamBayesFilter class should subclass
BufferAllFilter. Here's a tweaked version of mine which does a better job:

import os

from proxy3_filter import *
import proxy3_options

from spambayes import hammie, Options, mboxutils
dbf = os.path.expanduser(Options.options.hammiefilter_persistent_storage_file)

class SpambayesFilter(BufferAllFilter):
    hammie = hammie.open(dbf, 1, 'r')

    def filter(self, s):
        if self.reply.split()[1] == '200':
            prob = self.hammie.score("%s\r\n%s" % (self.serverheaders, s))
            print "| prob: %.5f" % prob
            if prob >= Options.options.spam_cutoff:
                print self.serverheaders
                print "text:", s[0:40], "...", s[-40:]
        return s

from proxy3_util import *

register_filter('*/*', 'text/html', SpambayesFilter)

Skip

Skip Montanaro
Feb 2 2003 13:27 
Sorry for the too-quick post. In rearranging things I lost the spam return.
Just to be sure it was actually filtering something, I searched for "sex" at
Google. It let that page in, allowed the safersex and SEX.ETC pages
through, but blocked HBO's Sex and the City and janesguide. Note that this
is using my current hammie.db file, which has only been trained on my ham
and spam email collections. I don't expect it to necessarily do a very good
job with web pages given no training.

Skip

import os

from proxy3_filter import *
import proxy3_options

from spambayes import hammie, Options, mboxutils
dbf = os.path.expanduser(Options.options.hammiefilter_persistent_storage_file)

class SpambayesFilter(BufferAllFilter):
    hammie = hammie.open(dbf, 1, 'r')

    def filter(self, s):
        if self.reply.split()[1] == '200':
            prob = self.hammie.score("%s\r\n%s" % (self.serverheaders, s))
            print "| prob: %.5f" % prob
            if prob >= Options.options.spam_cutoff:
                print self.serverheaders
                print "text:", s[0:40], "...", s[-40:]
                return "not authorized"
        return s

from proxy3_util import *

register_filter('*/*', 'text/html', SpambayesFilter)

Paul Paterson
Feb 2 2003 16:23 
"Skip Montanaro" wrote in message
news:mailman.1044210485.12265.python-list@python.org...
 > Sorry for the too-quick post. In rearranging things I lost the spam
 > return. Just to be sure it was actually filtering something, I searched
 > for "sex" at Google. It let that page in, allowed the safersex and
 > SEX.ETC pages through, but blocked HBO's Sex and the City and janesguide.
 > Note that this is using my current hammie.db file, which has only been
 > trained on my ham and spam email collections. I don't expect it to
 > necessarily do a very good job with web pages given no training.
 >
 > [code snipped]

This looks great - I'm giving this a go now.

I think that, as you say, the key now is to train on a corpus of web pages
rather than spam/ham. I notice that Spambayes has a proxy server which can
be used for easy training. I'll take a look at this and see if it can be
used to train on web pages too.
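
To make that concrete, here is a rough sketch (not from the thread) of
training the classifier directly on hand-labelled pages instead of going
through a proxy; allowed_pages, blocked_pages and new_page are made-up names
for a corpus you would collect yourself:

from spambayes import tokenizer, classifier

tok = tokenizer.Tokenizer()
bayes = classifier.Classifier()

# Made-up corpora: raw HTML strings labelled by hand.
allowed_pages = []   # pages that should get through ("ham")
blocked_pages = []   # pages that should be blocked ("spam")

for page in allowed_pages:
    bayes.learn(tok.tokenize(page), False)   # train as ham
for page in blocked_pages:
    bayes.learn(tok.tokenize(page), True)    # train as spam

# Score an unseen page; chi2_spamprob returns a probability in [0, 1].
new_page = "<html>...</html>"
prob = bayes.chi2_spamprob(tok.tokenize(new_page))
print "prob: %.5f" % prob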

Skip Montanaro
Feb 2 2003 19:34 
Paul> I think that, as you say, the key now is to train on a corpus of
Paul> web pages rather than spam/ham. I notice that Spambayes has a
Paul> proxy server which can be used for easy training. I'll take a look
Paul> at this and see if it can be used to train on web pages too.

Yes, you can use pop3proxy. You might be able to fudge the proxytee.py
script with something like:

httpget some-url | proxytee.py

I doubt there's anything in proxytee which is email-specific. See what
happens.

Skip

Paul Paterson
Feb 3 2003 1:35 
"Skip Montanaro" wrote in message
news:mailman.1044238025.4542.python-list@python.org...
 > Paul> I think that, as you say, the key now is to train on a corpus of
 > Paul> web pages rather than spam/ham. I notice that Spambayes has a
 > Paul> proxy server which can be used for easy training. I'll take a
 > Paul> look at this and see if it can be used to train on web pages too.
 >
 > Yes, you can use pop3proxy. You might be able to fudge the proxytee.py
 > script with something like:
 >
 >     httpget some-url | proxytee.py
 >
 > I doubt there's anything in proxytee which is email-specific. See what
 > happens.

As a quick hack I put the teaching code inside the proxy filter and then
surfed for a bit to give it some examples of good pages (news and such) and
"bad" pages (sports!). It was very quickly able to spot the sports pages,
even on new sites, and it was able to pick out sport sections from the news
sections on an individual site.
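
Roughly, the "teaching code inside the proxy filter" hack could look
something like the sketch below, building on Skip's BufferAllFilter version;
the TrainingFilter class, the TRAIN_AS switch, and the 0.9 cutoff are made up
for illustration:

from proxy3_filter import *
from proxy3_util import *

from spambayes import tokenizer, classifier

tok = tokenizer.Tokenizer()
bayes = classifier.Classifier()

# Made-up switch: set to False while surfing known-good sites, True while
# surfing known-bad ones, or None to stop training and just score.
TRAIN_AS = None

class TrainingFilter(BufferAllFilter):

    def filter(self, s):
        tokens = list(tok.tokenize(s))
        if TRAIN_AS is not None:
            bayes.learn(tokens, TRAIN_AS)    # teach the classifier as you surf
        prob = bayes.chi2_spamprob(tokens)
        print "| prob: %.5f" % prob
        if prob > 0.9:
            return "Not authorized"
        return s

register_filter('*/*', 'text/html', TrainingFilter)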

I'll try to build a more rigorous test with a larger corpus - it looks
promising so far.

