After seeing it mentioned on the newsgroup, I just started trying out
Amit Patel's proxy server (http://theory.stanford.edu/~amitp/proxy.html),
which does a good job of blocking all sorts of annoying web page behaviours.

It occurred to me that it would be possible to use this proxy server
idea with the Spambayes classifier to come up with an all-Python web
filter (suitable for use as a parental control or company internet monitor).
Essentially, the proxy could be taught to classify pages by signature and
block or restrict access to questionable pages, rather than relying on the
traditionally buggy (in the sense of reporting false hits) keyword-based
approaches.
Does anyone have any experience in this area to say whether this
approach is workable?
Paul
jerf Feb 1 2003 20:08
On Sat, 01 Feb 2003 22:47:06 +0000, Paul Paterson wrote:
> Does anyone have any experience in this area to say whether this
> approach is workable?
Perfectly workable, though it would probably require some tweaks to the
tokenizer to work as well as possible.
It would not take long to set up at least a prototype of this.
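One likely tokenizer tweak is stripping HTML markup before tokenizing, so the classifier scores the visible page text rather than tag soup. A minimal sketch of that idea (Python 3 standard library; the names here are illustrative, not part of Spambayes or proxy3):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of a page, ignoring tags and the
    bodies of <script> and <style> elements."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.chunks.append(data)

def page_text(html):
    """Return the words of the visible page text."""
    p = TextExtractor()
    p.feed(html)
    return " ".join(p.chunks).split()

print(page_text("<html><script>x=1</script><body>Hello <b>web</b></body></html>"))
# ['Hello', 'web']
```

The word list can then be fed to the classifier in place of the raw response body.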
Skip Montanaro Feb 1 2003 21:26
Paul> It occurred to me that it would be possible to use this proxy
Paul> server idea with the Spambayes classifier to come up with an
Paul> all-Python web filter (suitable for use as a parental control or
Paul> company internet monitor).
Ignoring training, parameterization via the config file, and the notion that
classifying web pages will be slightly different from classifying email,
maybe this will get you started:
Whenever I tried to connect to a web server I got this traceback:
Traceback (most recent call last):
  File "proxy3.py", line 540, in proxy
    connections = connections + ready.process()
  File "/Users/skip/src/proxy/proxy3_web.py", line 137, in process
    stream = self.create_stream(length)
  File "/Users/skip/src/proxy/proxy3_web.py", line 167, in create_stream
    self.serverheaders)
  File "/Users/skip/src/proxy/proxy3_filter.py", line 38, in get_filter
    serverheaders=serverheaders)
TypeError: __init__() got an unexpected keyword argument 'clientheaders'

which didn't look related to this module, but apparently was, because when I
started the proxy as

    python proxy3.py stdio

everything worked fine.
I've never used Patel's proxy, but it looks like it should be a relative
no-brainer to integrate with Spambayes. It's just that my brain is
apparently now disengaged, it being Saturday evening. I'll let someone else
fiddle with this.
Skip
Paul Paterson Feb 2 2003 1:24
jerf@compy.attbi.com wrote:
> On Sat, 01 Feb 2003 22:47:06 +0000, Paul Paterson wrote:
> >Does anyone have any experience in this area to say whether this
> >approach is workable?
> Perfectly workable, though it would probably require some tweaks to the
> tokenizer to work as well as possible.
> It would not take long to set up at least a prototype of this.
The prototype turned out to be shorter than my original post:

#
# mod_spambayesfilter.py - used by proxy3
#
from proxy3_filter import BufferSomeFilter
from spambayes import tokenizer, classifier

class SpamBayesFilter(BufferSomeFilter):
    BUFFER_LEN = 128
    LOWER_BOUND = 0.5
    tok = tokenizer.Tokenizer()
    checker = classifier.Classifier()

    def filter(self, s):
        if self.checker.chi2_spamprob(self.tok.tokenize(s)) > self.LOWER_BOUND:
            return "Not authorized"
        else:
            return s
Am I right in thinking that the Spambayes tokenizer will just revert to
splitting up words if it doesn't think it is looking at an email?
Perhaps this might be sufficient for web page filtering, since web pages
probably won't be using the same kinds of subterfuge that spammers resort to.
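For comparison, the plain word-splitting fallback being described would amount to something like this (a rough stand-in, not the actual Spambayes tokenizer):

```python
import re

def simple_tokens(text, max_len=12):
    """Rough word-splitting fallback: lowercase runs of word-ish
    characters, skipping very short and very long tokens."""
    for word in re.findall(r"[A-Za-z0-9'$-]+", text):
        if 2 < len(word) <= max_len:
            yield word.lower()

print(list(simple_tokens("Sports scores tonight: Rangers 3, Flyers 2!")))
# ['sports', 'scores', 'tonight', 'rangers', 'flyers']
```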
Skip Montanaro Feb 2 2003 13:17
>> Perfectly workable, though it would probably require some tweaks to
>> the tokenizer to work as well as possible.
...
Paul> The prototype turned out to be shorter than my original post,
...
This doesn't quite work right. (Nor does the similar version I posted
earlier.) The .filter() method gets passed chunks of an HTML response, not
the entire thing. The SpamBayesFilter class should subclass
BufferAllFilter. Here's a tweaked version of mine which does a better job:
import os
from proxy3_filter import *
import proxy3_options
from spambayes import hammie, Options, mboxutils

dbf = os.path.expanduser(Options.options.hammiefilter_persistent_storage_file)

class SpambayesFilter(BufferAllFilter):
    hammie = hammie.open(dbf, 1, 'r')
Sorry for the too quick post. In rearranging things I lost the spam return.
Just to be sure it was actually filtering something, I searched for "sex" at
Google. It let that page in, allowed the safersex and SEX.ETC pages
through, but blocked HBO's Sex and the City and janesguide. Note that this
is using my current hammie.db file, which has only been trained on my ham
and spam email collections. I don't expect it to necessarily do a very good
job with web pages given no training.
Skip
import os
from proxy3_filter import *
import proxy3_options
from spambayes import hammie, Options, mboxutils

dbf = os.path.expanduser(Options.options.hammiefilter_persistent_storage_file)

class SpambayesFilter(BufferAllFilter):
    hammie = hammie.open(dbf, 1, 'r')

    def filter(self, s):
        if self.reply.split()[1] == '200':
            prob = self.hammie.score("%s\r\n%s" % (self.serverheaders, s))
            print "| prob: %.5f" % prob
            if prob >= Options.options.spam_cutoff:
                print self.serverheaders
                print "text:", s[0:40], "...", s[-40:]
                return "not authorized"
        return s

from proxy3_util import *
register_filter('*/*', 'text/html', SpambayesFilter)
"Skip Montanaro" wrote in message news:mailman.1044210485.12265.python-list@python.org...
> Sorry for the too quick post. In rearranging things I lost the spam return.
> Just to be sure it was actually filtering something, I searched for "sex" at
> Google. It let that page in, allowed the safersex and SEX.ETC pages
> through, but blocked HBO's Sex and the City and janesguide. Note that this
> is using my current hammie.db file, which has only been trained on my ham
> and spam email collections. I don't expect it to necessarily do a very good
> job with web pages given no training.
> Skip
>
> import os
> from proxy3_filter import *
> import proxy3_options
> from spambayes import hammie, Options, mboxutils
>
> dbf = os.path.expanduser(Options.options.hammiefilter_persistent_storage_file)
>
> class SpambayesFilter(BufferAllFilter):
>     hammie = hammie.open(dbf, 1, 'r')
>
>     def filter(self, s):
>         if self.reply.split()[1] == '200':
>             prob = self.hammie.score("%s\r\n%s" % (self.serverheaders, s))
>             print "| prob: %.5f" % prob
>             if prob >= Options.options.spam_cutoff:
>                 print self.serverheaders
>                 print "text:", s[0:40], "...", s[-40:]
>                 return "not authorized"
>         return s
>
> from proxy3_util import *
> register_filter('*/*', 'text/html', SpambayesFilter)
This looks great - I'm giving this a go now.
I think that, as you say, the key now is to train on a corpus of web pages
rather than spam/ham. I notice that Spambayes has a proxy server which can
be used for easy training. I'll take a look at this and see if it can be
used to train on web pages too.
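Whichever front end does the training, the bookkeeping underneath is just per-class token counts plus a score. A toy stand-in showing the train-then-score shape (this is not the Spambayes implementation, only an illustration of the idea):

```python
import math
from collections import Counter

class ToyPageClassifier:
    """Tiny Naive-Bayes-style scorer: train() accumulates per-class
    token counts, score() returns a 0..1 'badness' probability."""
    def __init__(self):
        self.good = Counter()
        self.bad = Counter()

    def train(self, tokens, is_bad):
        (self.bad if is_bad else self.good).update(tokens)

    def score(self, tokens):
        g_total = sum(self.good.values()) + 1
        b_total = sum(self.bad.values()) + 1
        logodds = 0.0
        for t in tokens:
            p_bad = (self.bad[t] + 1) / b_total    # add-one smoothing
            p_good = (self.good[t] + 1) / g_total
            logodds += math.log(p_bad / p_good)
        return 1.0 / (1.0 + math.exp(-logodds))    # squash to 0..1

clf = ToyPageClassifier()
clf.train("stock market economy report".split(), is_bad=False)
clf.train("football score playoff highlights".split(), is_bad=True)
print(clf.score("playoff score tonight".split()))  # > 0.5, sports-like
print(clf.score("economy report today".split()))   # < 0.5, news-like
```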
Skip Montanaro Feb 2 2003 19:34
Paul> I think that, as you say, the key now is to train on a corpus of
Paul> web pages rather than spam/ham. I notice that Spambayes has a
Paul> proxy server which can be used for easy training. I'll take a look
Paul> at this and see if it can be used to train on web pages too.
Yes, you can use pop3proxy. You might be able to fudge the proxytee.py
script with something like:
httpget some-url | proxytee.py
I doubt there's anything in proxytee which is email-specific. See what
happens.
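If no httpget utility is to hand, a few lines of Python can stand in for it (Python 3 shown here; a 2003-era version would use urllib.urlopen instead):

```python
import urllib.request

def httpget(url):
    """Fetch a URL and return the response body, roughly what an
    'httpget' command-line utility would emit on stdout."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()

# A script wrapping this could then be piped into the trainer:
#   python httpget.py some-url | python proxytee.py
print(httpget("data:,hello"))
```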
Skip
Paul Paterson Feb 3 2003 1:35
"Skip Montanaro" wrote in message news:mailman.1044238025.4542.python-list@python.org...
> Paul> I think that, as you say, the key now is to train on a corpus of
> Paul> web pages rather than spam/ham. I notice that Spambayes has a
> Paul> proxy server which can be used for easy training. I'll take a look
> Paul> at this and see if it can be used to train on web pages too.
> Yes, you can use pop3proxy. You might be able to fudge the proxytee.py
> script with something like:
> httpget some-url | proxytee.py
> I doubt there's anything in proxytee which is email-specific. See what
> happens.
As a quick hack I put the teaching code inside the proxy filter and then
surfed for a bit to give it some examples of good pages (news and such) and
"bad" pages (sports!). It was very quickly able to spot the sports pages,
even on new sites, and it was able to pick out sport sections from the news
sections on an individual site.
I'll try to build a more rigorous test with a larger corpus - it looks
promising so far.
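A rigorous test usually means a held-out split: train on part of each corpus, score the remainder, and count misclassifications. A sketch of such a harness (scorer_factory is anything with train/score methods; all names here are illustrative, not from Spambayes):

```python
import random

def evaluate(pages, labels, scorer_factory, train_frac=0.8, cutoff=0.5, seed=0):
    """Shuffle, split, train on the first chunk, and report accuracy
    on the held-out remainder.  pages is a list of token lists and
    labels[i] is True when page i should be blocked."""
    rng = random.Random(seed)
    order = list(range(len(pages)))
    rng.shuffle(order)
    split = int(len(order) * train_frac)
    clf = scorer_factory()
    for i in order[:split]:
        clf.train(pages[i], labels[i])
    correct = sum(
        (clf.score(pages[i]) >= cutoff) == labels[i]
        for i in order[split:]
    )
    return correct / max(1, len(order) - split)
```

Running it with several seeds gives a feel for how sensitive the result is to which pages land in the training set.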