Regex Bleg

Can someone give me a regular expression for my blacklist that would disallow any four-consonant (lower case) string? I've been getting a lot of spam lately like this one:

Name: Ryan

None of these seem to be real domains, so I don't know what the point is, but they all seem to have at least four consonants in a row. I figure that there are few real words like this, at least in English, so it would keep out the riff raff without impeding genuine commenters.

[Update on Friday night]

OK, as a commenter has pointed out, this would preclude some actual English words (like "strength"). So let's go for five consonants. My goal is to err on the side of letting good posts through.

[Update on Saturday night]

It doesn't catch them all, but I did come up with a good trap for them:

q[^ua\ \.\,]

Anything with a "q" in it followed by anything other than a "u" or "a" (or a space, period or comma, so we can write "Iraq") is blocked. A lot of these things have "q"s inserted in them.

[Another update, a few minutes later, after testing]

I'm getting a lot of false positives.

TrackBack URL for this entry:
http://www.transterrestrial.com/mt-diagnostics.cgi/5246 Listed below are links to weblogs that reference this post from Transterrestrial Musings.
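A quick way to see how the two patterns from the post behave is to run them against a few strings. What follows is only a sketch in Python, not the blacklist's own syntax. The helper names (`consonant_run`, `is_blocked`) are made up for illustration, and "y" is treated as a vowel, as suggested in the comments.

```python
import re

# Lower-case consonants only. Treating "y" as a vowel is an assumption
# taken from the comment thread, not a rule of the blacklist itself.
CONSONANTS = "bcdfghjklmnpqrstvwxz"


def consonant_run(min_len):
    """Compile a pattern matching min_len or more consecutive consonants."""
    return re.compile("[" + CONSONANTS + "]{" + str(min_len) + ",}")


FOUR = consonant_run(4)  # the original request
FIVE = consonant_run(5)  # the Friday-night revision

# The "q" trap from the post. Inside a character class the space, period
# and comma need no backslash, so the escapes are dropped here.
Q_TRAP = re.compile(r"q[^ua .,]")


def is_blocked(text, pattern):
    """Return True if the pattern occurs anywhere in the text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    samples = ["strength", "strengths", "www.flrgpb.com", "Iraq.", "Iraqi"]
    for sample in samples:
        print(sample, is_blocked(sample, FOUR),
              is_blocked(sample, FIVE), is_blocked(sample, Q_TRAP))
```

With these definitions "strength" trips the four-consonant pattern but not the five-consonant one, while "strengths" trips both. The "q" trap lets "Iraq." through but blocks "Iraqi", since "i" is not in the excluded set; that kind of word may be one source of the false positives mentioned in the last update.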
Comments
/[^AEIOUaeiou]{1,4}\.html

Note that this doesn't worry about symbols being part of your 'four consonant' string either. I can't think of anything I'd name that way: hh$hh.html, rc#d.html. Ick.
Posted by Al at March 31, 2006 01:32 PM

Try this:

.*[bcdfghjklmnpqrstvwxyz]{4}.*

This will match the entirety of any string that contains four consecutive consonants. However, that would also match strings such as "http" and "html." What you might want instead is a regexp that looks for any string that contains four consonants between two slashes:

.*/[bcdfghjklmnpqrstvwxyz]{4}/.*

This may still generate false positives (e.g., "http://foo.com/html/goodpage.htm"), but fewer of them. You could also try matching any string that has a slash, four consonants, and a .html ending:

.*/[bcdfghjklmnpqrstvwxyz]{4}\.html.*

However, this won't catch the first URL in the example comment you posted, since that URL contains a "u". (You can also take Al's suggestion and swap the "bcd..." string with "^AEIOUaeiou" to find any character that's not a vowel, as opposed to any letter that's a lowercase consonant.)
Posted by Zach Heaton at March 31, 2006 01:43 PM

Thanks, except I don't need to exclude upper case vowels, just lower. These things pretty much invariably come in as lower case domain strings. Also, why the ".html"? I'm going after the domain, not the page. All I really need to exclude is the string of four consonants, regardless of where it appears in the comment or ping. Also, I need at least four, not between one and four (that would exclude most of the words in the English language). Shouldn't it be

/[^aeiouy]\{4,\}

?
Posted by Rand Simberg at March 31, 2006 01:49 PM

Sorry, that first reply was to Al. Again, I'm not trying to look for the whole domain. Zach's solution seems too complicated. I'm just looking for a string of at least four consonants (in which "y" counts as a vowel).
Posted by Rand Simberg at March 31, 2006 01:53 PM

Hopefully nobody has the word "strength" or "html" or "amcgltd" in their URL. Rather than trying to block all possible offending URLs, why not add a word verification? It works for all those bl0gsp0t blogs.
Posted by Ed Minchau at March 31, 2006 02:01 PM

My regexp is rusty-to-nonexistent, but I'm beginning to think:

*/[bcdfgjklmnpqrstvwxyz][bcdfghjklmnpqrsvwxyz][bcdfghjklnpqrstvwxyz][bcdfghjkmnpqrstvwxyz]/.*

might be useful.
Posted by Phil Fraering at March 31, 2006 03:29 PM

I had a leading '/' and a trailing '\.html' so that it would trigger on the four consonants in: domain.com/bcdf.html. So it won't trigger on words that happen to have four consonants in random places, only if they're in the 'name of the web page', which they are in the example you presented. (Except for the u.)

As you noticed, [^aeiou] triggers on any consonant. If you only want to scan the piece between http:// and the next slash, part of Phil's line looks best:

.*http://[^/]*[bcdfgjklmnpqrstvwxyz]{4}\.com.*

This means: 'match anything up to the first "http://", any number of non-slash characters, precisely 4 letters from this list [bcdfgjklmnpqrstvwxyz] followed immediately by ".com" and then optionally more characters.' All of the criteria need to be met for the match, so four consonants that aren't in a domain name that's in a URL won't match. (And add '.ru' as a second domain name just on general principles.)

BTW: in both Zach's and mine, we _want_ the '{' treated as a special character. The {4} means 'stuff immediately to the left must happen exactly four times'; [^aeiou]{1,4} would mean 'match anything with _only_ non-vowels as the only characters between the slash and the .html, as long as there are 1 to 4 of them'.

I use www.regular-expressionsDOTINFO/reference.html as my regexp reference. I hope this is useful.
(Your spam filter wouldn't allow dot info :D)
Posted by Al at March 31, 2006 05:19 PM

Actually, most of these are pretty bad, because you'll hit the non-domain part of the URL like "http" or "html" or even "p://" or "rg/d" (as in ".org/default.htm"). What you want is to first isolate the part that is the domain element of the URL and then to search for consonants. For example:

/^[^\/]+\/\/[^\/]*?[bcdfgjklmnpqrstvwx z]{4}[^\/]*?\//i

Basically this looks for a string which follows the form of non-slash characters at the start (e.g. "http:"), then doubled slashes, then any number of non-slash characters bordering 4 consonants (e.g. www.flrgpb.com), then a single slash. I'm not sure how efficient this is but I think it will work better than the others.

Also, there are probably easier to follow (read: more maintainable) ways to do this a little more programmatically: matching the domain segment and extracting it, then clipping off the TLD(s) and performing a match on the "core" domain name.
Posted by Robin Goodfellow at March 31, 2006 05:29 PM

"strengths"

stre[ngths]
Posted by Jim C. at April 1, 2006 12:03 AM

Why not ban "url=" or "[/url]" instead? If you're running MT 3.2, look at this. It has a reference to my file of filters, which you might find interesting.
Posted by Annoying Old Guy at April 3, 2006 03:07 PM
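Robin Goodfellow's more programmatic route (pull out the domain, clip the TLD, then test the "core" name) might look roughly like the Python sketch below. It is an illustration under stated assumptions, not what the blacklist plugin does: the names `core_labels` and `suspicious_urls` are hypothetical, the URL finder is deliberately crude, only the last dot-separated label is treated as the TLD, and "y" counts as a vowel.

```python
import re
from urllib.parse import urlsplit

# Four or more lower-case consonants in a row ("y" treated as a vowel).
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxz]{4,}")

# Crude URL finder: scheme, then everything up to whitespace, a quote,
# an angle bracket or a square bracket. An assumption for this sketch.
URL = re.compile(r"https?://[^\s\"'<>\[\]]+")


def core_labels(url):
    """Return the host name's labels with the last one (the TLD) removed."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Simplification: "co.uk"-style suffixes leave a harmless "co" behind.
    return labels[:-1] if len(labels) > 1 else labels


def suspicious_urls(comment):
    """List URLs in the comment whose domain has a long consonant run."""
    hits = []
    for url in URL.findall(comment):
        if any(CONSONANT_RUN.search(label) for label in core_labels(url)):
            hits.append(url)
    return hits


if __name__ == "__main__":
    print(suspicious_urls("see http://www.flrgpb.com/index.html"))
    print(suspicious_urls("my strength: http://foo.com/html/goodpage.htm"))
```

Because only the host name is tested, "http", "html" and ordinary words in the comment body are never examined. A legitimate domain such as Ed Minchau's "amcgltd" example would still be flagged, so this narrows the false positives rather than removing them.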