WikiSpam

What can we do about spam-bot attacks on wiki? c2-wiki: spam-bot

What have email users and blog users been doing to reject spam? Maybe we could adapt those ideas to wiki.

Wiki spam is a wiki-wide problem. It will only be solved wiki-wide.

Why?

From a so-called Search Engine Optimizer (SEO):

The major search engines like Google and Yahoo are putting an increased emphasis on incoming links when determining your web site ranking. Our Link Building Services will increase the number of incoming links to your web site each month which will affect your Google Page Rank Score and your positioning on the engines that use incoming links as an indicator of how important your web site is. $99.95 per month

They use dead, unmaintained wikis for this: Their links get removed from active wikis (at some price in developer time and community patience), but on the dead wikis their links stay. And increase their target’s pagerank.

Manual Methods

Spam-bots could spam every page on the entire wiki before a human can do much. (This doesn’t mean that spam rejection must be completely automatic; perhaps it’s sufficient to slow down spam-bots.)

Some wikis may offer WikiFeatures:RollBack? to revert all edits made after a certain point in time.
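
As a rough sketch of what such a rollback could look like (the revision store shown is an assumption, not any particular engine’s data model):

    import time

    def rollback_after(pages, cutoff):
        """Revert every page to its newest revision saved before `cutoff`.

        `pages` maps a page name to a list of (timestamp, text) revisions,
        oldest first -- a stand-in for whatever the wiki engine keeps."""
        for name, revisions in pages.items():
            kept = [rev for rev in revisions if rev[0] < cutoff]
            if kept and len(kept) < len(revisions):
                pages[name] = kept   # drop everything saved after the cutoff
        return pages

    # Example: undo everything from the last two hours of a spam-bot run.
    # rollback_after(pages, cutoff=time.time() - 2 * 3600)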

Automatic Methods

How to avoid the perception of censorship, especially when the automatic methods make a mistake?

How to measure the success of these methods?

Related Pages

More on the Banned Content Network

More on spam:

Other:

References

The wiki on wiki-spam is chongqed-wiki: front page.

See other discussions on:

Measuring Success

In August 2013 I posted some numbers on Emacs Wiki. There, our defenses chiefly consist of three lists of regular expressions that are applied when somebody tries to save a page: BannedHosts matches hostnames and IP numbers, BannedContent matches URLs, BannedRegexps matches page content (and is therefore the most severe).

This is the report for 2013 only. Which rules are effective? Apparently banning some Internet Service Providers is the most effective. This assumes that practically none of the rejected edits were “ham” (“not spam”). In all the years I’ve been running Emacs Wiki, I have received one email from a user in a banned region asking me to review the regular expressions.
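
For illustration, the check on save amounts to something like this sketch; the example patterns come from the table below, but the function itself is an assumption, not the actual Oddmuse code:

    import re

    # Hypothetical contents; on Emacs Wiki these live on the BannedHosts,
    # BannedContent and BannedRegexps pages, one regular expression per line.
    BANNED_HOSTS = [r"broadband\.kyivstar\.net", r"\.amazonaws\.com"]
    BANNED_CONTENT = [r"\bcheap", r"(viagra|cialis)"]
    BANNED_REGEXPS = [r"<a\s+href=[\"']?http", r"\[url="]

    def check_edit(hostname, urls_in_edit, new_text):
        """Return the (list, rule) that rejects this edit, or None to allow it."""
        for rule in BANNED_HOSTS:
            if re.search(rule, hostname, re.IGNORECASE):
                return ("BannedHosts", rule)
        for rule in BANNED_CONTENT:
            if any(re.search(rule, url, re.IGNORECASE) for url in urls_in_edit):
                return ("BannedContent", rule)
        for rule in BANNED_REGEXPS:
            if re.search(rule, new_text, re.IGNORECASE):
                return ("BannedRegexps", rule)
        return None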

Total spam rejected: 76745
Total days surveyed: 214
 (days without a single spam are skipped)
Average spam rejected per day: 358.62

 Matches  Rule
   22377  BannedHosts: broadband\.kyivstar.net
   19836  BannedHosts: ^SOL-FTTB\.[.0-9]+\.sovam\.net\.ua
    9990  BannedRegexps: <a\s+href=["']?http
    6879  BannedHosts: ^static\..*\.clients\.your-server\.de
    1872  BannedContent: \bcheap
    1403  BannedRegexps: (?s)\s+https?:\S+.*\s+https?:\S+.*\s+https?:\S+.*
    1284  BannedHosts: \.dynamic\.163data\.com\.cn
    1250  BannedRegexps: \s(https?:\S+\s+https?:\S+)
    1132  BannedRegexps: \[url=
    1068  BannedHosts: \.amazonaws\.com
     939  BannedHosts: ^178-137-.*-lvv\.broadband\.kyivstar\.net
     770  BannedContent: jp[a-zA-Z]*\.com
     712  BannedHosts: \.SteepHost\.Net
     672  BannedContent: (loans|payday)
     634  BannedRegexps: \s+(https?:\S+)[ .\r\n]*$
     532  BannedHosts: ^112\.111\.1(6[0-9]|7[0-5])\.
     516  BannedHosts: ^unassigned\.psychz.net
     380  BannedContent: michaelkors
     345  BannedContent: (viagra|cialis)
     284  BannedContent: japan[a-zA-Z]*\.com
     260  BannedHosts: ^193\.105\.210\.30
     244  BannedContent: louboutin
     238  BannedRegexps: \s+https?:\S+[ .\r\n]*$
     230  BannedHosts: ^199\.15\.234\.80
     196  BannedContent: louisvuitton
     186  BannedContent: (xanax|tramadol|\bsoma\b)
     184  BannedHosts: ^unknown\.xeex\.net
     164  BannedHosts: ^46\.161\.41\.225
     116  BannedContent: marcjacobs[a-zA-Z]*\.com
     106  BannedHosts: ^198\.2\.208\.205
     104  BannedHosts: ^91\.237\.249\.
      94  BannedHosts: ^5\.39\.219\.26
      94  BannedContent: vietnam
      84  BannedHosts: ^ns4004874\.ip-198-27-65\.net
      70  BannedHosts: ^31\.184\.238\.163
      70  BannedHosts: ^142\.0\.35\.130
      68  BannedHosts: \.broadband\.kyivstar.net
      68  BannedContent: kamagra
      64  BannedHosts: ^46\.119\.119\.145
      64  BannedHosts: ^46\.161\.41\.223
      64  BannedHosts: ^46\.119\.118\.174
      58  BannedContent: (viagra|cialis|valium)
      54  BannedHosts: \.broad\.fz\.fj\.dynamic\.163data\.com\.cn
      52  BannedHosts: ^46\.119\.116\.228
      52  BannedHosts: ^37\.59\.207\.2
      50  BannedHosts: ^192\.74\.229\.1
      40  BannedContent: japan(ese)?\.com
      36  BannedHosts: ^208\.177\.76\.5\.ptr\.us\.xo\.net
      32  BannedHosts: ^SOL-FTTB\.0\.122\.118\.46\.sovam\.net\.ua
      32  BannedContent: erolove

Older Discussion

Minor point: it said “InternetBonding” next to user accounts and passwords.

InternetBonding means that you are bonded across the Internet, either by having lots of user accounts and transactions online, or even just by being well known, recognized, having a recognizable “voice.” Whereas anyone could get a user account, and an associated password, with no history of activity on the Internet at all.

InternetBonding means, “Bonded by virtue of sustained and consistent presence on the Internet.”

A user account and password are not enough to be “Internet Bonded.”

At least, that’s what I meant by the term.


Sunir showed me a UseModWiki spam-bot attack today (2004-02-12). We’ll have to start thinking about wiki-spam…

17:17 <kensanata> whatever you choose, you'd still have to clean up 50 pages.
17:20 <kensanata> Sunir: the only thing that has worked for me in terms of
    mail spam protection is a naive bayesian statistics filter.
17:20 <kensanata> maybe we need something like that, too.
17:20 <kensanata> which sucks, because then we'll have a dictionary of about
    250-500k to read on every page edit.
17:21 <kensanata> and you'd have to keep the spam-edits somewhere, too.
17:21 <kensanata> hm....
17:21 <kensanata> perhaps we could use the rclog for that...
17:21 <kensanata> we'd just need a flag to tell us "this is spam"
17:22 <kensanata> i will think of something.
17:24 <kensanata> one problem to solve later would be preventing bots from
    poisoning the pool by doing false spam reverts.

Put a limit on the # of transactions (say, 20) per particular IP address?

On the 16th submit, say, “You are being warned: 4 left today.” On the 20th submit, say, “No more.” On the 21st, automatically roll back all submissions. Or, hold them in limbo as StagedCommits?.
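
A minimal sketch of such a throttle, keyed on the day so the counters reset automatically; the numbers are the ones suggested above:

    import datetime
    from collections import defaultdict

    DAILY_LIMIT = 20   # edits allowed per IP per day (the "20" above)
    WARN_AT = 16       # start warning on the 16th submit

    edits = defaultdict(int)   # (date, ip) -> number of submits seen so far

    def handle_submit(ip):
        """Return a message for this submit; None means it is accepted silently."""
        # trusted IPs or public terminals could get a higher limit here
        key = (datetime.date.today(), ip)
        edits[key] += 1
        count = edits[key]
        if count > DAILY_LIMIT:
            # the 21st submit and later: reject, and roll the day's edits back
            return "Limit exceeded; today's edits from this address will be rolled back."
        if count == DAILY_LIMIT:
            return "No more."
        if count >= WARN_AT:
            return f"You are being warned: {DAILY_LIMIT - count} left today."
        return None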

..?

Of course, some IP addresses host multiple users…

If a user is trusted, or an IP is special, you can remove or raise the limit for that IP addr.

Some people use public terminals. Embarrassing if all of one terminal user’s good posts were rolled back because someone else came along later and tried to post…

WebLogs are dealing with spam right now, and they say that a naive IP ban is not sufficient, since many spammers just switch to using rotating IPs. We could do worse than look to the ‘blog world to see how they are coping.

As for the public terminals- we’d treat those specially, give them a higher post allowance. (Also, StagedCommits?, so we can sort and find the gems.)

Even if they rotate IPs, we should be fine, right? Just play whack-a-mole a little bit, and retain the results of past whacks. (What problems have the bloggers found with this solution?)

Going to a membership-only system for wiki basically shuts the wiki off.

Unless we go to some sort of standard InternetBonding system. Which (I think) is pretty reasonable, but will require a lot of cooperation.

But the important question here is: What problems have the bloggers had with IP-address whack-a-mole?

Any attacker worth his salt won’t use the same IP twice, so it doesn’t work.

I didn’t know that- do they really use a different IP address for each post to a wiki? (This is a serious question; I’m not being sarcastic.)

How do they get all those IP addresses?

Dunno. Probably just something like Wiki:AnonymizerDotCom.

IP-based banning is also problematic for people behind NATs – all requests from the crowd behind the NAT will seem to come from the same IP, so you might be banning whole networks.

For email, the only thing that has worked reliably (i.e. 95% of all spam caught, no false positives that I know of) is based on statistics. Which is why I think we’ll have to do the same thing… unfortunate as it is.

Statistics require raw data, so we need to train the system, teaching it what we consider to be “ham” and what we consider to be “spam”… For email this is easy, if you just keep all your mail in “ham” and “spam” folders. Whenever you want to re-train the system, just feed it some mail from either category. But we can’t do that for wikis as they are right now because we don’t keep rejected contributions. All we currently keep is the log (RecentChanges). If we could tell the system which contributions were spam, then we could use this information to attempt a guess. It is not much, unless we start to keep a small extract of each contribution as part of the log…

We could also decide that we only want to train the system on “recent” spam. This is what I do for mail. I have several thousand ham and spam mails; I only use the mails I got in the last 90 days for training. We could also extend the keep period for KeptPages to 90 days, and then use the information from the log for the last 90 days to retrieve kept versions, determine what exactly got added to each page when we qualified it as spam or ham, and train the system on that…
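
A toy sketch of the statistics involved, assuming the training text is whatever got added by each contribution in the last 90 days, labelled spam or ham from the log:

    import math
    import re
    from collections import Counter

    class EditFilter:
        """Tiny word-frequency ('naive Bayes') classifier for wiki contributions."""

        def __init__(self):
            self.words = {"spam": Counter(), "ham": Counter()}
            self.total = {"spam": 0, "ham": 0}

        def train(self, label, added_text):
            tokens = re.findall(r"\w+", added_text.lower())
            self.words[label].update(tokens)
            self.total[label] += len(tokens)

        def spam_score(self, added_text):
            """Positive scores lean spam, negative lean ham."""
            score = 0.0
            for token in re.findall(r"\w+", added_text.lower()):
                p_spam = (self.words["spam"][token] + 1) / (self.total["spam"] + 2)
                p_ham = (self.words["ham"][token] + 1) / (self.total["ham"] + 2)
                score += math.log(p_spam / p_ham)
            return score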

Sounds like a lot of work, and I don’t have much data to experiment with. I guess I’d need a wiki with several spam-bot attacks in a significant time period (e.g. 90 days), its log file, and its kept pages for the same period.

Grrr. >-{ >-{ >-{

In the meantime, I implemented content banning, which can be used for LinkBan?.

Tell us more about the spam-bot attack you saw.
DavidCary

MeatBall is currently locked due to another spam attack (must be related to those contests).

So I’ll write what I was going to write there, here.

A lightweight “captcha” that would deny edit access to WikiGateway and probably other similarly simple scripts would be to randomize the HTML form name of the “save” button (not the name that the user sees). Right now in WikiGateway, I just hardcoded the button names for each WikiEngine. I expect others would do the same.
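
One way to randomize the field name without keeping extra server state might be to derive it from a secret, roughly like this (the secret and the naming scheme are assumptions, not anything WikiGateway or any engine actually does):

    import hashlib
    import hmac

    SECRET = b"per-site secret, rotated now and then"   # assumption: kept on the server

    def save_button_name(page_name, revision):
        """Derive an unpredictable form-field name for one page's save button."""
        digest = hmac.new(SECRET, f"{page_name}:{revision}".encode(), hashlib.sha1)
        return "save_" + digest.hexdigest()[:8]

    def is_save_request(form_fields, page_name, revision):
        """A script that hardcodes 'save', as WikiGateway currently does, never matches."""
        return save_button_name(page_name, revision) in form_fields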

Another solution that I like much better would be to do something like the above to make automated editing hard, but only for non-logged-in users. Logged-in users would get an interface that allows automated editing (right now, WikiGateway can log in to MoinMoin; I don’t see any difficulty in extending this to UseMod). This way you get to have your cake and eat it too: block automated attacks, but allow real users to use whatever tools they want.

Perhaps make a (server-enforced) rule that new accounts only achieve automated editing capabilities if the accountname corresponds to a real user homepage on which the word “HasAutomatedEditingPrivilages?” has stood for at least a day. Only one login account per accountname. This way, the user community can hold off on granting auto-edit privilages until they are convinced that the new user is for real.

Alternately, in order to prevent an automated agent from simply creating a new account, you could require the user to pass a captcha to create a new account.

The current WikiSpam attackers obviously aren’t that motivated, so a simple captcha would probably work. Like, “what is 5 + 4?” (with the numbers changed each time and the operation changed between addition, subtraction, and multiplication). Or, better, just simple pictures of words or numbers with “type this in” (“real” captchas use distorted pictures that state-of-the-art machine vision techniques can’t process, but as I said, our attackers aren’t motivated, so any picture would probably do). Of course, all this does is require a human to log in and create an account once per attack, so I like the HasAutomatedEditingPrivilages? idea better.
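
A throwaway sketch of such an arithmetic captcha; the expected answer would have to be kept server-side (e.g. in the session), not in the form:

    import operator
    import random

    OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

    def make_question():
        """Return ('what is 5 + 4?', 9) with random numbers and operation."""
        a, b = random.randint(1, 9), random.randint(1, 9)
        symbol = random.choice(list(OPS))
        return f"what is {a} {symbol} {b}?", OPS[symbol](a, b)

    def check_answer(expected, submitted):
        try:
            return int(submitted.strip()) == expected
        except (AttributeError, ValueError):
            return False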

[new] sigi
Those are good ideas. But before that you could also do the following: after every edit, first show the page without saving it (a preview), with four new buttons: save, save without recent changes, correct, and revert. If you really want to save the edited version, you have to confirm it explicitly. That too could raise the security level, especially in combination with bayle’s ideas. And the operation would become standardized.

I thought of telling this directly to the WikiTing btw.
What about rejecting all search-engine indexing of wikis? That would eliminate the spammers’ motive to go for wikis. I know that on a static site you can add a robots.txt to the root folder where index.html sits, and google then ignores it. It would be a pity not to see us Google-indexed anymore, sure, but it wouldn’t really harm us. Maybe we can even build a new, better google: WikiGoogle.

pir wrote on IRC (where I immediately spammed my idea): “It would hack off our head to spite our face. Most people find wikis through google.” Most do, right; not all, though. At first I didn’t know what “to spite” means; I had read it as “to annoy (to offend) our face”, which I didn’t understand.
“…, to save our face”: that, yes, but to spite it? My feeling is that this is the only solution, because it is the only clean one (the wiki-way). It also means detaching from structures with a predefined hierarchy (I mean google and all the commercial gangsters who hopelessly have their fingers in it*); they don’t deserve wiki anyway. We’ll get along fine without google.

Sigi, a “save without history” button would mean that somebody could remove you from the wiki. Everything of yours would be gone once and for all. Wiki’s strength lies in its archive; wiki is by definition “history”. History must not be allowed to be deleted or altered; doing so fogs up the mirror that wiki is.

Pressed the correction button: “history” is wrong; it should say “recent changes”.
With the detachment from the conventional structures you are on the right track. Unfortunately most people here are not yet aware of that at all. Murray perhaps comes closest; I don’t know for sure.

It’s a wiki and remains one, even with the photo ;-) Sigi corrected his button proposals above. (For the German “man”, English mostly uses “you”, only rarely “one”.)

Addendum to my 23/5 post here.

  • This was stupid, very stupid. Nobody is a predefined gangster. We should talk openly to everyone who wants to. We should, for example, talk to the google-folks about our being threatened by spammers and what could be done about it. We have help to offer; we’re intelligent. Maybe we can help them with something too? Please excuse my stupid overreaction.

I certainly don’t think we should opt out of Google. The wiki community is insular enough as it is. Let’s meet the problem head on, rather than hide from it.

Anyhow, if wikis are no longer viable on the mainstream internet, then they aren’t nearly as useful as we all think. I mean, you can forget about WikiAdvanceWiki if even the people telling you to use wiki feel that it’s so vulnerable that they don’t themselves list on Google! But I think that’s not the case, and that we can protect ourselves through minor, SoftSecurity-esque modifications to the infrastructure such as I propose.

I agree. I’ve also noticed a few wikis shut down over the last month due to WikiSpam. I think CW or Meatball could serve a great community purpose by becoming a clearing house for ideas about dealing with WikiSpam. Some possibilities:

  • Get WikiEngine developers in on the action early, and give them some suggestions for building SurgeProtectors?.
  • Give WikiAdmins? some support, including possibly a VolunteerFireDepartment? to do SoftSecurity manually.
  • Just let everyone know that we can get through this OK.

I agree with Evan. We see the developers of most wiki software pretty much daily. We know what to implement. It just seems like a matter of putting the tech into place.

We have SurgeProtectors?, we have LinkBan?, we have these tools. No problem.

I’ve been using LinkBan? on my TaoRiver wiki. I used to see a lot of spam. Now I see a new spam address once. After I put that in the list, I do not see that spam anywhere else on the set of wiki.

We should be able to share our ban lists among a group. I don’t think we want to have one list for the entire world, because then that one place will probably be the target of concentrated attack. But groups of people who want to set up and share ban lists will be good.

The technology to point your wiki at a ban list should be built in. Ban lists can be updated, say, once per day, but I personally feel an event system would be best. That way, as soon as a spammer starts to work, the other wikis are ready ahead of the spammer.
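
A once-per-day pull could be as small as the following cron-driven sketch; the URL, file name, and plain-text format (one regular expression per line) are assumptions:

    import urllib.request

    SHARED_LIST_URL = "https://example.org/wiki/raw/BannedContent"   # hypothetical
    LOCAL_LIST = "banned_content.txt"                                # hypothetical

    def update_banlist():
        """Merge the shared ban list into the local one; run daily from cron."""
        with urllib.request.urlopen(SHARED_LIST_URL) as response:
            remote = set()
            for line in response.read().decode().splitlines():
                line = line.strip()
                if line and not line.startswith("#"):
                    remote.add(line)
        try:
            with open(LOCAL_LIST) as f:
                local = {line.strip() for line in f if line.strip()}
        except FileNotFoundError:
            local = set()
        with open(LOCAL_LIST, "w") as f:
            f.write("\n".join(sorted(local | remote)) + "\n")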

If the spammer gets a full day to work with- yuck. We’ll still be getting lots of spam.

Now we just need an IntComm:EventSystem?… …which I happen to have right on hand! IntComm:DingDing! Whoah! Who would have thought! What a coincidence! {;D}=

For my part, I’d be happy to just use Oddmuse:wikiget and Oddmuse:wikiput from a cron job to share my banlist. I’m not sure how to motivate users on Emacs Wiki to maintain a master link ban list on some other site, however.

I don’t get it; I thought that Emacs wiki was yours? Don’t you administer it?

Or- do you mean- You don’t know how to motivate other people to use your banlists?

I don’t think you need to, really. It’s the spammers that do the motivating. {:)}=

  1. You build into OddMuse the ability to plug into a ban list and an accompanying update event notice.
  2. Someone gets hit by a spammer.
  3. “Help! Help! I’m being spammed! Wiki sucks!” …they say.
  4. Helpful person says, “Never fear, you have an easy way out, and it’s already built into OddMuse. Just participate in our local ban list!”
  5. “Really? It’s that easy?”
  6. “It’s that easy.”
  7. “Yay!”
  8. (person receives link to friendly neighbors’ ban list.)
  9. (happy time passes)
  10. “Wait, I got hit by a spam link that wasn’t in the list?!”
  11. “Please deposit that link into your local spam list queue.”
  12. “Ah, right. Gee whiz, it feels good to help out!”

I’m trying to share responsibility with other people. These other people might only care about Emacs Wiki. Emacs Wiki has its own local banlist, which we already maintain. I’d have to tell them: “Instead of adding the stuff here, you should now add it there…” It might work, if I add appropriate links to the top of the page listing the banned links, of course.

I still need to implement the “expiration” of banned links…

You surely know about the WikiBlackList. Just for safety, you know?

I heard about it, and now I read it. :) But it still doesn’t solve the problem:

  1. Do we want a central place to collect spammers?
  2. How do we motivate people to maintain a central collection instead of a local collection? (Maybe we should just do an experiment: Emacs Wiki maintains a local list, now we need to find a wiki that uses the central list…)

I’d use lots of “central” places to collect spammers. That is, lots of communities that share collected spam links. The redundancy is for security and to reduce the burden of mistakes.

And now for something completely different: A Spam wiki! Strangely, I haven’t seen anything about WikiSpam.

Sunir said something that feels right to me under SpammersDontWantToSpam? on MeatBall:WikiSpam

Hm; I don’t understand PageRank technology.

It seems to me that it’s: “If you don’t get indexed, then there’s no problem.”

Sort of like, “If you don’t have anything valuable, who’s going to try to steal it?”

But if I’ve understood right, that’s not at all interesting to me. I want people who are interested in the things we are talking about to be able to find us.

By turning google’s eyes away, you’re basically saying, “We don’t take part in the world.” I mean, nobody’s going to find you, unless it’s by personal reference.

But, maybe I’ve mis-understood what Sunir means.

Me, in my depressive way, said: “OK, robots.txt, no index.” Sunir instead says, “let’s all talk to google together”, which carries the immanent possibility of a common solution, whereas “robots.txt, no index” is sheer surrender and therefore the worse proposal.

We are talking about one DifficultParticipant? who f*s up wiki, like one DP f*s up MetaBaby these days.
This is serious. We don’t want the final restored CW.tar to go into our rl-bookshelf and never be looked at there again, do we? Larp.de had a new spam attack by spammers cleverly changing linknames or something. Just read it on IRC. Didn’t understand as usual. Ciao.

I do like DJeep’s idea on http://openwiki.com/ow.asp?WikiSpam of gradually making wiki life harder for a contributor when enough others declare that it should be done. It’s an interesting approach, some kind of active and non-anonymous voting. It is pretty ridiculous too, and I like it for that. Security by drawing borders is pretty ridiculous (especially on a wiki).

Please imagine it instead of telling me why it won’t work (and if you really have to do that, please do it second), as I have a question. Every wiki has a SpamHereOnly page. Changes to it are minor by default and cannot be set to major. Spammers spam on it; communities do their stuff on the rest of the pages. Spammers leave the rest of the wiki alone, and wikis leave the SpamHereOnly pages alone (that is, they do not delete spam from the SpamHereOnly pages, as long as there is sufficient disk space and no system slowdown because of it). Simply peaceful coexistence.
My question is: thinking of this on a massive scale, how would the net change, and what would the consequences look like for google and other commercial enterprises? As for possible answers: vision stuff is just as welcome.

btw I forgot to mention: this worked in two cases of spam on CaFoscari:SpamHereOnly. The first spammer had actually created the page from the sandbox, where a discussion (now moved to CaFoscari:WikiSpam) had started regarding the usefulness of a SpamHere? page (now renamed to SpamHereOnly).

LionKimbro, who hosts CaFoscari?, has stated that he very much does not want to become a service provider for spam links. He shall not. This is an experiment. I’d like to see the reactions to this concept, to get better insight into the psychology of spammers. We’ll know more in a while.

PhilJones: Over on ThoughtStorms I’ve been changing the name of the text-input field. I seem to get spam once every week or two. What I can’t tell is whether this is from smart bots or humans. But I wonder if some kind of obfuscated “variable geometry” HTML could keep bot activity down. Of course, this will screw legitimate bots too. And it won’t keep the human spammers out. But I can’t think of any legitimate bot-input applications for a human-readable wiki. And I suppose human spammers have a lower (controllable?) rate of posting spam.

Phil, there’s lots of applications for bots which write. See the “Potential applications” section at the bottom of InterWikiSoftware:WikiGatewayMotivation for a list. Some highlights:

  • Client-side wiki editing software (perhaps containing advanced features such as graphical refactoring assistance)
  • WikiWindow (for example, a MoinMoin frontend/gateway for MeatballWiki, even though MeatBall is nominally on UseMod)
  • WikiSync (like cvs; when you’re offline, you could still modify wiki pages, and then commit the changes later on in batch)
  • Copying and moving pages between sites

This is why I would like to see a system where registered users who pass some sort of criteria get “bot access”. For example, the criteria could be “the community sees fit to put the keyword “BotAccessAllowed?” on the user’s homepage, and no one has tried to remove that keyword for 24 hours”. It makes sense to deny bot access to anonymous users, but not to everyone.

:Over on ThoughtStorms I’ve been changing the name of the text-input field. I seem to get spam once every week or two. What I can’t tell is whether this is from smart-bots or humans.

My guess is that some of it (namely, the bit that you still get) is from humans. I would be surprised if anyone would bother to keep up with your changes every week, or if they would bother to write a “smart” algorithm that doesn’t hardcode the text-input field name (when there are so many easier targets around that don’t require that sort of effort).

On the other hand, I wouldn’t be surprised if somewhere in China, human labor is cheap enough to make it worthwhile to employ a few people to spam wikis and message boards. I’ve heard a rumor that a few years ago, a phone company needed to type some phone books into their computer systems, and it was cheaper for them to hire people in China to read and type in the phone books than it was to develop an OCR process over here.

Pretty interesting as a new phenomenon is www.pörnöpedia.de (without the döts). Already banned here, obviously, as I couldn’t save the link. I shouldn’t either. OK. It will be discussed in time, I guess.

MarioSalzer: My recent attempt to get rid of some spam is to add a more extroverted notice if Chinese characters / HTML entities are detected on a page. See [[2]] for an example. Manual URL banning and blocking seemed not to last very long; there is a recognition problem with the spammers from China, and maybe this time it gets their attention. And after all, if they had to rewrite their contribution separately for me, I’d no longer be a worthwhile target.

Thinking about ways, implementing, testing and changing solutions is OK. On the multilingual experiment there is no Chinese translation yet, and there could never be one with an anti-spam technique like the one mentioned above. I can’t tell about a recognition problem; you might be right that the relevant Chinese spammers are reading your message for the first time. You might just as well not be. For me it feels like “hacking a wiki engine back to the stone age”: exclude everybody shorter than 1.55 m. But generally it’s good to give even that a try. ;)

MarioSalzer: Sure, that procedure is impolite as hell. But I think it makes some sense on English-only sites; it’s a better defense than cleaning up garbaged pages after the fact. This hack gives my favourite Chinese spammer an immediate response to his attempts. No guarantee of success, but I like it.

I spent a long time today trying to upgrade some UseMod wikis which I have running at SourceForge to either MoinMoin or OddMuse. Neither one worked, because SourceForge’s web servers have old versions of Perl and Python (5.005 and 1.5something). So, I decided to bite the bullet and start writing the WikiGateway-enabled remote spam cleaning program. This program will enforce things like BannedContent on a target wiki, even if the target wiki’s software doesn’t support it. This will allow me (and you) to deal with spam on older wikis which can’t or won’t upgrade their software for whatever reason.

Perhaps someone has already come up with this, but here are my thoughts anyway:

I’ve been reading a bit about google indexing [3], found that a way to exclude a page is to place a META tag <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">, and thought about a little quick-baked strategy:

  • include this tag by default
  • remove this tag if the requested page is the latest version and the last modification was more than, say, 12 hours ago.

Google will still be able to index pages, on the assumption that a page will reach a stable form at some point and that people will stop editing it like maniacs; and if the spam is removed in less than 12 hours, it won’t be indexed by google. Of course this supposes that spammers get the opportunity to understand that the wiki works like this and that their spam is useless… but even if it does not reduce spam, it can perhaps provide the satisfaction that the wiki does not provide any benefit for them. Another thing is to prevent google from indexing previous revisions and diff pages (it may already be the case) in a robots.txt file; otherwise the spam is still profitable even once it has been removed.
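
The strategy boils down to a small check when rendering the page header; a sketch, assuming the engine knows whether it is serving the latest revision and when it was last modified:

    import time

    NOINDEX = '<meta name="robots" content="noindex, nofollow">'
    QUIET_PERIOD = 12 * 3600   # the 12 hours suggested above

    def robots_meta(is_latest_revision, last_modified):
        """Return the meta tag to put in the page header, or '' once the page
        is the current revision and has been untouched for 12 hours."""
        stable = is_latest_revision and (time.time() - last_modified) >= QUIET_PERIOD
        return "" if stable else NOINDEX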

For wikis implementing KeptPages, indexing past revisions is no problem, because they will all expire eventually. As long as a significant part of all wikis does not implement an anti-Google strategy, ignorant spammers will continue to spam us. I think most spammers are poorly paid people in China who don’t have the time and energy to investigate whether their spamming will work on a particular wiki; it is quick to just spam the wiki anyway. This is why I think this strategy will not work.

Google, MSN, Yahoo!, and a bunch of blogging software writers have announced a scheme where links will be overlooked by search engines if they have the attribute rel=“nofollow”.

The idea is that software will automatically add rel=“nofollow” attributes to any links in user-contributed content.

I think this will generate the “critical mass” needed for spammers to be forced to learn to avoid sites running software known to do this; once MOST sites have a mechanism like this, spammers will probably restrict themselves to the sites that do not.

Therefore, I think now is the time for WikiEngines (by which I mean, OddMuse) to implement NOFOLLOW attributes (both in page headers and also added to each link; wouldn’t want the spammers to miss it).
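
For illustration only, a crude post-processing sketch of the per-link part; a real engine would add the attribute while rendering each link rather than patching the HTML afterwards:

    import re

    def add_nofollow(html):
        """Add rel="nofollow" to every <a ...> tag in user-contributed HTML that
        does not already carry a rel attribute (crude regex post-processing)."""
        def patch(match):
            tag = match.group(0)
            if re.search(r"\brel\s*=", tag, re.IGNORECASE):
                return tag
            return tag[:-1] + ' rel="nofollow">'
        return re.sub(r"<a\b[^>]*>", patch, html, flags=re.IGNORECASE)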

(hmm, i thought i posted that already, i guess i forgot)

I thought I’d note that it seems to me that OddMuse’s networked BannedContent list is very effective. In fact, I’d go so far as to say that it seems to solve the spam problem, for now at least.

I believe MoinMoin now has a similar feature called BadContent?.

If you run a wiki and you haven’t done so already, you really should upgrade to OddMuse or the new MoinMoin.

The reasons I don’t like NoFollow:

  1. Wiki spam won’t stop, because spam is already removed from actively maintained wikis and still we get spam. Why? Because spammers will continue to spam as long as they find the occasional unmaintained wiki. Wiki spam will only stop after a very, very large majority of all existing wikis, maintained or not, supports NoFollow. Basically this is not a good strategy. We need to find a solution that protects unused wikis from spam.
  2. Even if we “whitelist” some sites using InterLinks, we lose a lot – unpopular sites will not rise in pagerank thanks to wikis, because it will be too much of a bother to create a LocalName or an entry on the InterMap for fringe sites.
  3. This works much better for blogs, where articles themselves are “cleared” and only comments are “from the unwashed masses.” On a wiki this is much harder to get right. We could try keeping track of which links were added by registered users and which links added by anonymous users were “approved” by registered users because they survived a page edit by a registered user.

Also note that Oddmuse serves meta headers to control spiders in order to save CPU resources. They affect following. I believe that following and ranking are two distinct issues, however. The Google announcement doesn’t mention following (even though the attribute value used is “nofollow”) – it only mentions ranking. Maybe they implement “not ranking” by “not following” – but they don’t really say so. Oddmuse tells spiders not to follow links. If that also prevents ranking, it was an unintended consequence. :/

I asked Google about that. The Google engineer who replied said that “For the foreseeable future, [rel=“nofollow”] should keep us from [visiting the] link. This is a fine use of the nofollow tag.”

I’ve always had meta nofollow/noindex on my edit/history/diff pages, but I’m now using rel=“nofollow” to keep robots from even visiting those pages in the first place.


Banning all of China is legitimate. It’s a business relationship, not humanism. Their country is not holding up to their responsibilities as part of the internetwork, and so we cease to do business with them until they clean up their act. What should China do? Well, police itself. They have politically decided not to cooperate with the rest of the world’s police. If they are uncooperative, then there is no reason we should cooperate with them. cf. IteratedPrisonersDilemma. – SunirShah

Banning Canada, you included, is legitimate. How about that?

(After the spam-attack of end of march 06)
That was refreshing, wasn’t it? And who knows if it’s already over?

No, let’s be honest. This will happen. We are human, spammers are human. Different ideas, but the same OS: the brain, HumanIntelligence?. No AI. So what we maybe need is to get better organized (apart from great technical solutions like the BannedContent). We need to be faster. Despamming a page takes a minute or so. It takes a few seconds to realize: this contribution is spam! I’d like to be able to immediately mark it, make it disappear into a folder “to be despammed”, and make it disappear from the recent changes. “5 pages to be despammed”, it says at the top. Recent changes clean of spam. Dunno.

2006-06-15
Introduce a “to be despammed” and a “needs maintenance” mode for pages; maintenance is for pages that need translation and such. Pages in “to be despammed” mode are taken out of the recent changes; you see them on “recent changes with spam”. A red “! there are pages to be despammed” shows up on recent changes if any page is tagged “to be despammed”. Usually there won’t be any, and in case of an attack you just tag the spammed pages and soon have the recent changes working again.

I admit the recent changes are not precious enough at the moment to do that. We have to work on it.

[new] I use special extensions on my wiki, http://metin2wiki.ru. It is not spam!

I use page-headers like this now:

editable title
Subtitle of the page that gets included in the beginning of the summary, first two lines, next for comments

Apparently somebody is now using my username and copying rollback summaries in order to sneak spam past our eyes. Sorry about that!


CategoryConflict





The same page elsewhere:
MeatBall:WikiSpam, Wiki:WikiSpam