May 11, 2019

Analyzing the spam protection of Wikinews


Anticipating the behavior of a normal Wikipedia admin is an easy task. In general, admins will delete anything that doesn't contain academic references from Google Scholar and anything that isn't the author's own work. If these two simple rules are fulfilled, the user can write as much as he likes. He can add a single paragraph to an existing article or write a new article of his own. If an admin notices that an edit was made, he first looks at the reference section at the end. If that section looks normal, he takes a short look at the full text, and the edit has passed the incoming control.
If a user has submitted an article and the reference section is empty, or it contains blog postings but no PDF papers from Arxiv, there is a high danger that the admin will revert the edit. Some users have a problem with this behavior, but it is the best-practice method for protecting Wikipedia against vandalism. If the author is aware of the rules and is motivated to act within them, every one of his edits will pass the incoming filter. The bottleneck is finding enough volunteers who are willing to do this work.
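To make this acceptance rule more concrete, here is a minimal sketch in Python of how such an incoming filter could look. The list of scholarly domain hints and the example URLs are my own illustrative assumptions, not actual Wikipedia policy.

# Sketch of the reference check described above. The domain hints are
# illustrative guesses, not an official Wikipedia rule.
SCHOLARLY_HINTS = ("arxiv.org", "scholar.google", "doi.org", ".pdf")

def passes_reference_check(reference_urls):
    # Accept the edit if at least one reference looks like an academic source.
    return any(hint in url.lower()
               for url in reference_urls
               for hint in SCHOLARLY_HINTS)

print(passes_reference_check(["https://example.blogspot.com/post"]))   # False -> revert
print(passes_reference_check(["https://arxiv.org/abs/1901.00001"]))    # True  -> keep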
A very different kind of website is the Wikinews project. Right now I'm trying to figure out which guidelines are in force there. What I can say is that the rules are different from normal Wikipedia, and it's not obvious which ones apply. We have to separate clearly defined rules from weak, vaguely defined ones. One clearly defined rule on Wikinews is the three-revert rule: if a user presses the revert button too often in quick succession, this is treated as vandalism. Another clear rule is the requirement for polite talk.
But which additional rules apply at Wikinews? How does an admin decide between spam and a valid contribution? At least for the sources I have found a quasi-official rule: sources that come from Google News are considered valid. If two large mainstream newspapers have reported about an event, and these newspapers are cited in the article, the admin will tolerate the edit.
It is unclear whether this rule alone can explain why an article gets rejected or accepted. Perhaps there are additional rules which I don't know about. The good news is that Wikinews gives the beginner a lot of information, because many examples are available in the system. In the left navigation bar there is a link called "Random article", which makes it possible to browse through all the articles that are considered valid. What all these articles have in common is:
- at the end of the article some tags are defined: date, geographical location, subject
- in the sources section around three references are given; most of them originate from Google News, which means that what is not in Google News doesn't fit Wikinews
- the article itself is short; no table of contents is needed, because it consists of roughly three paragraphs
- most articles have a single picture which illustrates the story
To summarize the knowledge so far: the typical Wikinews article looks like an improved Facebook post. It consists of two sources from Google News, three text paragraphs, a single image, and category tags at the end.
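The observed pattern can be written down as a simple checklist. The following sketch encodes it as a Python function; the field names and thresholds are inferred from my observations above, not from any official Wikinews guideline.

# Checklist derived from browsing random Wikinews articles; the exact
# thresholds and field names are assumptions for illustration.
def looks_like_typical_wikinews_article(article):
    # article: dict with keys 'sources', 'paragraphs', 'image_count', 'tags'
    return (
        2 <= len(article["sources"]) <= 4                # a few Google News sources
        and len(article["paragraphs"]) <= 4              # short, no table of contents
        and article["image_count"] <= 1                  # at most one illustration
        and {"date", "location", "topic"} <= set(article["tags"])  # category tags
    )

example = {
    "sources": ["https://news.example.com/a", "https://news.example.org/b"],
    "paragraphs": ["First paragraph.", "Second paragraph.", "Third paragraph."],
    "image_count": 1,
    "tags": {"date", "location", "topic"},
}
print(looks_like_typical_wikinews_article(example))  # True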
Let us go back to the behavior of the admins. What is similar to Wikipedia is the blocking technology itself. If an admin has detected a spam user, he can block the user name and also the IP range; blocking an IP range helps to prevent distributed attacks. This principle seems very similar to what Wikipedia uses. If a user is blocked, he can't do anything, and he can't bypass the block. That means that, technically, the admin is superior: if he has decided that a certain user is a spam bot, he is able to prevent that user from posting anything.
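For completeness, the blocking step itself maps onto the standard MediaWiki action API (action=block). The following sketch assumes an already authenticated requests session with sysop rights and a valid CSRF token; the login and token handling are omitted, and the username is a placeholder.

import requests

API = "https://en.wikinews.org/w/api.php"

def block_user(session, username, csrf_token, reason="Spambot", expiry="infinite"):
    # Issue a block via action=block; works for a username or an IP range.
    params = {
        "action": "block",
        "user": username,        # e.g. "ExampleSpammer" or "203.0.113.0/24"
        "expiry": expiry,
        "reason": reason,
        "nocreate": "1",         # also prevent account creation
        "autoblock": "1",        # block IP addresses used by the account
        "token": csrf_token,
        "format": "json",
    }
    return session.post(API, data=params).json()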
The open question is under which conditions an admin decides that a user is producing spam and vandalism. Somebody may argue that this is easy to determine. In the case of normal Wikipedia I would say the same: there it is possible to judge in under a minute whether an edit makes sense or not, even if the edit was made in a complicated topic. In most cases the judgment can be made by reading a single sentence and taking a short look at the reference section.
In the case of Wikinews, the decision between spam and valid content is more complicated. Roughly speaking, the bulk of the existing Wikinews articles look similar to what a spam bot would generate: they all have a headline, three paragraphs and two sources from Google News, plus the normal tags. And the text in the paragraphs is way too short to judge whether it was auto-generated, written by an amateur or by an expert. The hypothesis is that in Wikinews it is very hard or even impossible to detect spam bots.
Let me give an example. Suppose a random user has created a new article. It contains no source from Google News, and the text consists of a lorem ipsum paragraph. In this case it is easy to judge that the text is spam. It can be deleted and nobody cares.
In the second case the spam bot was improved slightly. It has copied and pasted two sources from Google News, and the paragraph was copied from somewhere on the internet. Deciding whether this post is spam is more difficult. The admin can copy the text into a search engine to check whether the same paragraph is available elsewhere. If he finds it, this is a clear violation of the copyright policy and the article gets deleted.
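The search engine check can be approximated programmatically with a simple near-duplicate test. The sketch below compares a submitted paragraph against already known texts using word shingles and Jaccard similarity; the threshold is an assumption for illustration.

def shingles(text, n=5):
    # Split the text into overlapping word 5-grams.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def looks_copied(candidate, known_texts, threshold=0.5):
    # Flag the paragraph if it overlaps strongly with any already indexed text.
    cand = shingles(candidate)
    return any(jaccard(cand, shingles(known)) >= threshold for known in known_texts)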
And now comes case three. This time the bot has inserted two sources from Google News, has added a paragraph which reads as if it came from a human but contains two spelling mistakes, and has set the correct tags. Is this post spam or not?
Here is the problem: the described case 3 can be generated at mass scale, either with a fully automatic bot, or with a semi-automatic bot in which the paragraph is written by a human on the fly. What will happen if such a bot posts this kind of text frequently to Wikinews? Will it be treated as spam or as a valid contribution to the project?
The good news is that flooding normal Wikipedia with mass-produced articles is not possible, because each topic is different and even articles about places need a certain quality. The problem in most cases is that a Wikipedia article needs references at the end, and deciding which reference is the right one is a task that can only be done by humans. That means it takes at least one hour to produce a valid Wikipedia edit.
In the case of Wikinews the situation is different: a mass-produced article would pass the incoming filter. The reason is that the articles which are already in the system all look the same. In contrast to Wikipedia, the aim is not to produce content but to insert a URL and add some context. If the URL comes from Google News, the article has already passed 50% of the incoming filter. And if some text is inserted into the article as well, the admin has no way to stop the submission.
Wikinews has an official deletion request archive: https://en.wikinews.org/wiki/Wikinews:Deletion_requests In contrast to the normal Wikipedia project, the page is nearly empty. For the complete year 2018 only 8 cases are listed on the deletion request page. The other deletions were made by speedy deletion. In contrast to a deletion request, a speedy deletion is faster and doesn't produce a discussion about the reason why. Instead the admin puts a simple template into the article and presses the red button. That's it.
All the speedy deletions are listed in the deletion log: https://en.wikinews.org/wiki/Special:Log/delete There the number of deleted pages is higher, around 15 each day. Compared to the normal Wikipedia website, this amount is still very small.
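The deletion log can also be read programmatically through the standard MediaWiki action API (list=logevents with letype=delete), which makes it easy to verify such numbers. A minimal sketch:

import requests

API = "https://en.wikinews.org/w/api.php"

def recent_deletions(limit=50):
    # Fetch the most recent entries from the deletion log.
    params = {
        "action": "query",
        "list": "logevents",
        "letype": "delete",
        "lelimit": limit,
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    return [(e["timestamp"], e["title"], e.get("comment", ""))
            for e in data["query"]["logevents"]]

for timestamp, title, comment in recent_deletions(10):
    print(timestamp, title, comment)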
In a discussion I found the statement that the total amount of vandalism at Wikinews is small, and that deleting is the exception. The reason for this low number of deletions is simple: the Wikinews project is very small. The total number of files in the English version is 4669. In smaller languages like Italian the number of edits is very, very small; on a single day around 20 edits are made. According to the edit counter, almost nobody is trying to do anything at Wikinews. That means that if a spam bot produces articles at mass scale, it will be noticed for sure. The edit count is so low that a user who makes even a single edit can't hide behind anybody.
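How quiet the project really is can be checked with the same API by counting recent edits per calendar day (list=recentchanges). The result is only a rough measurement and will vary from day to day.

from collections import Counter
import requests

API = "https://en.wikinews.org/w/api.php"

def edits_per_day(limit=500):
    # Count the most recent changes grouped by calendar day.
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "timestamp",
        "rclimit": limit,
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    return Counter(change["timestamp"][:10]            # YYYY-MM-DD
                   for change in data["query"]["recentchanges"])

for day, count in sorted(edits_per_day().items()):
    print(day, count)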
Spam detection in social networks
In theory it is possible to use an existing spam filter, or to write a new one, for detecting spam messages. The precondition is that at least a human can tell the difference. Detecting spam inside Wikipedia and in online forums is an easy task; it can be done automatically or semi-automatically. The reason is that both kinds of websites are dedicated to content creation.
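As an illustration of what "an existing spam filter" means in practice, here is a minimal Naive Bayes text classifier built with scikit-learn. The training data is a hand-made toy example; a real filter would be trained on labelled revisions.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training set, purely for illustration.
train_texts = [
    "Buy cheap pills now, click this link",         # spam
    "Lorem ipsum dolor sit amet",                   # spam
    "The city council approved the new budget",     # ham
    "Researchers published a study on robotics",    # ham
]
train_labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["Click here for cheap pills"]))       # likely 'spam'
print(model.predict(["The parliament passed a new law"]))  # likely 'ham'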
In contrast, meta-websites and social networks have a different self-understanding. Google+, for example, is not a web forum but a news aggregator, and the same is true for Wikinews or Planet GNOME. A famous misconception in the literature is that it's possible to detect spam bots in social networks. But what will happen in the following case: a spam bot posts a single URL to a social network group each day, nothing more and nothing less. Is this behavior typical for a spam bot, for a bored user, for a marketing expert, or for a valid user? This question can't be answered. Creating a Wikipedia article which contains only a URL is equivalent to a spam post, but doing the same on Facebook is treated as normal behavior.
The problem is that meta-websites and social networks contain by default a low amount of content but a high number of URLs. This makes it easy to write a spam bot and practically impossible to write spam detection algorithms.
A possible countermeasure is to restrict the number of users. If only a small number of users is allowed to post to a social network, it is easier to detect whether one of them is a bot. In the case of the Planet GNOME website the restriction is even stricter: no users at all are allowed to log into the system. The admin of the website is the only person who puts the RSS feed together.
The combination of a social network and openness to everybody is a risky situation. In contrast, a normal forum or an image hoster which is open to everyone can be managed easily, even under a high traffic load.