March 24, 2007

"A study of the WebmasterWorld states that 75% of all blogs on Google's blogspot are spam." Isn't that exaggerated? I was more surprised by the spam percentages on the top doorway domains:

  • blogspot.com 77%
  • netscape.com 74%
  • hometown.aol.com 84%
  • hometown.aol.de 91%
  • oas.org 78%
  • xoomer.alice.it 77%
  • home.aol.com 95%
  • freewebs.com 52%
  • blogstudio.com 99%
  • maxpages.com 81%
  • usaid.gov 85%
  • blogsharing.com 93%
  • sitegr.com 100%
  • torospace.com 95%
  • blog.hix.com 100%

That's 85% for usaid.gov?! Welcome to the world! Anyway, I don't agree at all with the technique used to detect splogs:

The researchers scanned 1000 most searched queries: 'phentermine' on blogspot.com and the query 'ringtone' on hometown.aol.com[...]

Who wastes time blogging about phentermine or ringtones besides spammers?! Results like these are to be expected with such queries; if anything, I think the percentages should be even higher.

November 25, 2006

I was working on Antisplog last night and found some new, very interesting issues. The most important one concerns splogs that use a very minimal page: 3 or 4 links to other splogs run by the same owner, an AdSense banner, a title, and that's all. In short, if you or anyone else happens to visit such a splog, the most relevant content on the page is the AdSense banner, followed by the links to other splogs. That's how sploggers build their networks.

Antisplog does not "currently" consider such pages to be splogs, for several reasons:

  1. They are not blog-like

  2. The pages have no content at all

  3. The links are very poor
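
To make the page shape described above concrete, here is a minimal Python sketch of how such a "link stub" could be flagged. This is not Antisplog's algorithm: the PageStats class, the looks_like_link_stub function, and the thresholds (at most 5 links, at most 30 visible words) are assumptions made up for illustration, and spotting AdSense by looking for "googlesyndication" or "google_ad_client" is only a rough guess at how a banner shows up in the markup.

```python
from html.parser import HTMLParser

HIDDEN_TAGS = ("script", "style", "title")


class PageStats(HTMLParser):
    """Count links, visible words outside links, and AdSense markers."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.words = 0
        self.has_ads = False
        self._hidden = 0   # depth inside script/style/title
        self._in_link = 0  # depth inside <a href=...>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links += 1
            self._in_link += 1
        elif tag in HIDDEN_TAGS:
            self._hidden += 1
            # Assumption: an AdSense banner is loaded from googlesyndication.
            if tag == "script" and "googlesyndication" in (attrs.get("src") or ""):
                self.has_ads = True

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in HIDDEN_TAGS and self._hidden:
            self._hidden -= 1

    def handle_data(self, data):
        if self._hidden:
            # Inline ad snippets usually set google_ad_client.
            if "google_ad_client" in data:
                self.has_ads = True
        elif not self._in_link:
            self.words += len(data.split())


def looks_like_link_stub(html, max_links=5, max_words=30):
    """True for pages that are little more than an ad banner and a few links."""
    stats = PageStats()
    stats.feed(html)
    stats.close()
    return (stats.has_ads
            and 1 <= stats.links <= max_links
            and stats.words <= max_words)
```

A page with a title, an AdSense script, and three outgoing links would be flagged, while a page with a few paragraphs of real text would not. On its own this says nothing about whether the linked pages are splogs too; that would need the network of links to be followed.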

Continue reading "The "Splogs Network"" »

November 19, 2006

Since last year I have had several issues while working on the Antisplog.net project, mainly due to a lack of hardware and bandwidth resources. The database currently indexes about 1,697,355 blogs, and according to the last study I made two months ago, the more pages the system indexes, the more accurate the results are. This delayed the project's progress until we moved to a new server. It will also help make the algorithm difficult to break, along with a new intelligent system that tries to learn from the information currently in the database and optimizes the algorithm accordingly.

The website has just finished moving to the new server, and I have tested that everything is working correctly. It might take about 24 hours for the DNS to update before you can see results. The current website was broken by the server admin, and I don't plan to fix it since it has moved elsewhere. Our current to-do list:

1- Test the new algorithm
2- Implement an XML-RPC API for single- and multi-blog requests
3- Display a page listing the splogs currently detected, in real time, with the ability to search our database.
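
For item 2, here is a rough Python sketch of what single- and multi-blog requests over XML-RPC could look like, using the standard library's xmlrpc.server module. The API does not exist yet, so everything here is hypothetical: the method names antisplog.checkBlog and antisplog.checkBlogs, the shape of the response, the host and port, and the KNOWN_SPLOGS set that stands in for the real database.

```python
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical stand-in for the splog database.
KNOWN_SPLOGS = {"http://spam.example.com/"}


def check_blog(url):
    """Single-blog request: return the verdict for one URL."""
    return {"url": url, "splog": url in KNOWN_SPLOGS}


def check_blogs(urls):
    """Multi-blog request: return one verdict per URL, in the same order."""
    return [check_blog(url) for url in urls]


def make_server(host="localhost", port=8000):
    """Build (but do not start) an XML-RPC server exposing both methods."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(check_blog, "antisplog.checkBlog")
    server.register_function(check_blogs, "antisplog.checkBlogs")
    return server
```

Calling make_server().serve_forever() would start it. A client could then use xmlrpc.client.ServerProxy("http://localhost:8000/") and call proxy.antisplog.checkBlogs([...]) to check several blogs in one round trip.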

Let me know if you have any suggestions.

September 28, 2005

After running the Pingoat service, Kailash started a new website to fight splogs: SplogSpot.com. I had contacted Kailash before the announcement of SplogSpot but didn't want to talk about it before it was officially announced.

SplogSpot is based on splogs detected by the Pingoat service and provides access to its database through a simple API or a search form. I don't know exactly how many splogs are indexed there, but it looks like around 5,000, of which 37% are from blogspot alone!

Continue reading "SplogSpot.com yet another Anti-Splogs service" »

September 23, 2005

Today I paused the AntiSplog spider at 2,056,958 blogs. At first I was going to study its behaviour on 1 million, then I kept it running for the second million since it didn't take long. But I needed to stop it for technical reasons, to let me finish studying more cases and run the filter on the indexed blogs.

Continue reading "Antisplog index feeded with 2 milion blogs" »

September 19, 2005

As I have already announced, the Antisplog spider is now running and indexing blogs. I have added a counter to report the spider's progress; it's updated every 10 to 30 minutes.

Indexed blogs aren't filtered yet. As I have said, this is the first phase, which consists of collecting blogs and studying more cases. The database could reach a million entries in about three days, which goes some way to explaining how difficult it is to run the AntiSplog detection system on it. If 10% of the results are incorrect, that represents 100,000 blogs!!
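
The arithmetic behind that last figure, as a tiny Python sketch. The 10% error rate is the hypothetical figure from the paragraph above, not a measured one, and expected_misclassified is just an illustrative helper name.

```python
def expected_misclassified(indexed_blogs, error_rate):
    """Number of blogs expected to be wrongly classified at a given error rate."""
    return round(indexed_blogs * error_rate)


# The same error rate hurts more as the index grows.
for indexed in (10_000, 100_000, 1_000_000):
    print(indexed, expected_misclassified(indexed, 0.10))
```

The point is simply that a fixed error rate turns into a much larger absolute number of wrong verdicts as the index grows, which is why the filter needs more study before it runs on the full database.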

Continue reading "Blogs spider launched" »
