AIRWeb 2007 Papers Released

The Third International Workshop on Adversarial Information Retrieval on the Web will be held next month, and there’s an impressive scope and depth to the topics that will be discussed based upon the papers to be presented during the conference, and the members of the committees who have worked to put this workshop together.

This year, AIRWeb 2007 is part of the 16th International World Wide Web Conference in Banff, Alberta, Canada, on May 8th, 2007. The Organizing Committee members include Carlos Castillo of Yahoo Research, Kumar Chellapilla of Microsoft Live Labs, and Brian D. Davison of Lehigh University. The Program Committee members include an impressive group of researchers and search analysts, listed on their call for papers page. The following definition of Adversarial Information Retrieval is also found on that page…

Adversarial Information Retrieval addresses tasks such as gathering, indexing, filtering, retrieving and ranking information from collections wherein a subset has been manipulated maliciously. On the Web, the predominant form of such manipulation is “search engine spamming” or spamdexing, i.e., malicious attempts to influence the outcome of ranking algorithms, aimed at getting an undeserved high ranking for some items in the collection. There is an economic incentive to rank higher in search engines, considering that a good ranking on them is strongly correlated with more traffic, which often translates to more revenue.

The workshop pages don’t provide easy access to the papers that have been accepted for the workshop, so I’ve compiled a list with links, authors, and abstracts for the papers.

Full Papers and Abstracts

A Taxonomy of JavaScript Redirection Spam (pdf)
Kumar Chellapilla and Alexey Maykov

Redirection spam presents a web page with false content to a crawler for indexing, but automatically redirects the browser to a different web page. Redirection is usually immediate (on page load) but may also be triggered by a timer or a harmless user event such as a mouse move. JavaScript redirection is the most notorious of redirection techniques and is hard to detect, as many of the prevalent crawlers are script-agnostic. In this paper, we study common JavaScript redirection spam techniques on the web. Our findings indicate that obfuscation techniques are very prevalent among JavaScript redirection spam pages. These obfuscation techniques limit the effectiveness of static analysis and static feature based systems. Based on our findings, we recommend a robust countermeasure using a lightweight JavaScript parser and engine.
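
The abstract gives a good sense of why script-agnostic crawlers miss these pages. As a rough illustration (my own, not the authors’ method), here is a minimal Python sketch of the kind of static heuristic the paper argues is ultimately insufficient: it scores a page by how many obfuscation markers and redirect “sinks” appear in its source, something obfuscated spam can easily evade.

```python
import re

# Hypothetical static heuristic for flagging likely JavaScript redirection spam.
# Obfuscation markers and redirect "sinks" often seen in such pages; as the paper
# argues, a robust system would need a lightweight JS parser/engine instead.
OBFUSCATION_MARKERS = [
    r"\beval\s*\(",
    r"\bunescape\s*\(",
    r"String\.fromCharCode",
    r"document\.write\s*\(",
]
REDIRECT_SINKS = [
    r"(window\.)?location(\.href)?\s*=",
    r"location\.replace\s*\(",
    r"http-equiv=[\"']refresh",
]

def redirect_spam_score(page_source: str) -> float:
    """Return a crude score in [0, 1]: fraction of marker patterns present."""
    patterns = OBFUSCATION_MARKERS + REDIRECT_SINKS
    hits = sum(1 for p in patterns if re.search(p, page_source, re.IGNORECASE))
    return hits / len(patterns)

if __name__ == "__main__":
    sample = '<script>eval(unescape("%77%69%6e..."));window.location="http://example.com";</script>'
    # Hits from both marker groups suggest the page deserves dynamic analysis.
    print(redirect_spam_score(sample))
```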

New Metrics for Reputation Management in P2P Networks (pdf)
Debora Donato, Mario Paniccia, Maddalena Selis, Carlos Castillo, Giovanni Cortese and Stefano Leonardi

In this work we study the effectiveness of mechanisms for decentralized reputation management in P2P networks. We start from EigenTrust, an algorithm designed for reputation management in file sharing applications over P2P networks. EigenTrust has proved very effective against three different natural attacks from malicious coalitions, while it performs poorly on a particular attack organized by two different kinds of malicious peers. We propose various metrics of reputation based on ideas recently introduced for detecting and demoting Web spam. We combine these metrics with the original EigenTrust approach. Our mechanisms are more effective than EigenTrust alone for detecting malicious peers and reducing the number of inauthentic downloads, not only for all the cases previously addressed but also for more sophisticated attacks.
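
For readers unfamiliar with the baseline the authors build on, here is a minimal Python sketch of the basic EigenTrust computation (the paper’s new Web-spam-inspired metrics are not shown; the trust matrix and parameters below are made-up examples):

```python
import numpy as np

# Minimal EigenTrust-style power iteration: global trust is the stationary vector
# of the row-normalized local trust matrix, mixed with a pre-trusted distribution.
def eigentrust(local_trust: np.ndarray, pretrusted: np.ndarray,
               alpha: float = 0.15, iters: int = 50) -> np.ndarray:
    # Row-normalize local trust so each peer's outgoing trust sums to 1;
    # peers with no opinions fall back to the pre-trusted distribution.
    row_sums = local_trust.sum(axis=1, keepdims=True)
    C = np.divide(local_trust, row_sums,
                  out=np.tile(pretrusted, (len(local_trust), 1)),
                  where=row_sums > 0)
    t = pretrusted.copy()
    for _ in range(iters):
        t = (1 - alpha) * C.T @ t + alpha * pretrusted
    return t

if __name__ == "__main__":
    # local_trust[i][j]: how much peer i trusts peer j, from its own transactions.
    local = np.array([[0, 3, 1], [2, 0, 0], [5, 1, 0]], dtype=float)
    p = np.array([0.5, 0.5, 0.0])   # peers 0 and 1 are pre-trusted
    print(eigentrust(local, p))
```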

Using Spam Farm to Boost PageRank (pdf)
Ye Du, Yaoyun Shi and Xin Zhao

Web spamming has emerged to take economic advantage of high search rankings, threatening the accuracy and fairness of those rankings. Understanding spamming techniques is essential for evaluating the strengths and weaknesses of a ranking algorithm, and for fighting against web spamming. In this paper, we identify the optimal spam farm structure under some realistic assumptions in the single-target spam farm model. Our result extends the optimal spam farm claimed by Gyöngyi and Garcia-Molina by dropping the assumption that leakage is constant. We also characterize the optimal spam farms under additional constraints, which the spammer may deploy to disguise the spam farm by deviating from the unconstrained optimal structure.
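
As a toy illustration of why single-target spam farms work (a simplified stand-in, not the paper’s model or its optimality analysis), the following Python sketch runs PageRank on a tiny graph in which several boosting pages all point at one target page and the target links back:

```python
# Toy PageRank power iteration over an adjacency-list graph (dict: node -> outlinks).
def pagerank(adj: dict, d: float = 0.85, iters: int = 100) -> dict:
    nodes = list(adj)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - d) / n for v in nodes}
        for u, outs in adj.items():
            if outs:
                share = d * pr[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:  # dangling node: spread its rank uniformly
                for v in nodes:
                    new[v] += d * pr[u] / n
        pr = new
    return pr

if __name__ == "__main__":
    k = 5  # number of boosting pages in the farm
    farm = {"target": [f"b{i}" for i in range(k)]}
    farm.update({f"b{i}": ["target"] for i in range(k)})
    farm["outside"] = []   # an unrelated page for comparison
    scores = pagerank(farm)
    # The target accumulates far more PageRank than any other page in the toy graph.
    print(sorted(scores.items(), key=lambda kv: -kv[1]))
```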

Combating Spam in Tagging Systems (pdf)
Georgia Koutrika, Frans Effendi, Zoltán Gyöngyi, Paul Heymann and Hector García-Molina

Tagging systems allow users to interactively annotate a pool of shared resources using descriptive tags. As tagging systems are gaining in popularity, they become more susceptible to tag spam: misleading tags that are generated in order to increase the visibility of some resources or simply to confuse users. We introduce a framework for modeling tagging systems and user tagging behavior. We also describe a method for ranking documents matching a tag based on taggers’ reliability. Using our framework, we study the behavior of existing approaches under malicious attacks and the impact of a moderator and our ranking method.
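
To make the reliability idea concrete, here is a hedged Python sketch (my own simplification, not the paper’s model): a tagger’s reliability is estimated from how often other users agree with their tag assignments, and documents matching a query tag are ranked by the summed reliability of the users who applied it.

```python
from collections import defaultdict

def tagger_reliability(postings):
    """postings: list of (user, doc, tag) triples.
    Reliability = fraction of a user's (doc, tag) postings that at least one
    other user also made (a simple agreement-based proxy)."""
    support = defaultdict(set)          # (doc, tag) -> users who applied it
    by_user = defaultdict(list)
    for user, doc, tag in postings:
        support[(doc, tag)].add(user)
        by_user[user].append((doc, tag))
    return {user: sum(1 for p in pairs if len(support[p]) > 1) / len(pairs)
            for user, pairs in by_user.items()}

def rank_docs(postings, query_tag):
    rel = tagger_reliability(postings)
    scores = defaultdict(float)
    for user, doc, tag in postings:
        if tag == query_tag:
            scores[doc] += rel[user]    # weight each tag vote by its tagger
    return sorted(scores.items(), key=lambda kv: -kv[1])

if __name__ == "__main__":
    postings = [("alice", "d1", "python"), ("bob", "d1", "python"),
                ("spammer", "d2", "python"), ("spammer", "d2", "cheap-pills")]
    print(rank_docs(postings, "python"))   # d1 outranks the spammer-only d2
```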

Splog Detection Using Self-similarity Analysis on Blog Temporal Dynamics (pdf)
Yu-Ru Lin, Hari Sundaram, Yun Chi, Junichi Tatemura and Belle Tseng

This paper focuses on spam blog (splog) detection. Blogs are highly popular, new media social communication mechanisms. The presence of splogs degrades blog search results as well as wastes network resources. In our approach we exploit unique blog temporal dynamics to detect splogs.

There are three key ideas in our splog detection framework. We first represent the blog temporal dynamics using self-similarity matrices defined on the histogram intersection similarity measure of the time, content, and link attributes of posts. Second, we show via a novel visualization that the blog temporal characteristics reveal attribute correlation, depending on the type of blog (normal blog or splog). Third, we propose the use of temporal structural properties computed from self-similarity matrices across different attributes. In a splog detector, these novel features are combined with content-based features. We extract a content-based feature vector from different parts of the blog – URLs, post content, etc. The dimensionality of the feature vector is reduced by Fisher linear discriminant analysis. We have tested an SVM-based splog detector using the proposed features on real-world datasets, with excellent results (90% accuracy).
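
Here is a small Python sketch of the self-similarity idea as I read it from the abstract (the histograms and data below are invented for illustration): each post is summarized as a histogram, and entry (i, j) of the self-similarity matrix is the histogram intersection between posts i and j, which tends to be uniformly high for machine-generated splogs.

```python
import numpy as np

def self_similarity_matrix(post_histograms: np.ndarray) -> np.ndarray:
    """Entry (i, j) is the histogram intersection of posts i and j, in [0, 1]."""
    hists = post_histograms / post_histograms.sum(axis=1, keepdims=True)
    n = len(hists)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            S[i, j] = np.minimum(hists[i], hists[j]).sum()
    return S

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Six posts with varied term histograms vs. six near-identical, templated posts.
    normal_blog = rng.integers(0, 10, size=(6, 20)).astype(float) + 1
    splog = np.tile(rng.integers(0, 10, size=(1, 20)).astype(float) + 1, (6, 1))
    # The splog's matrix is saturated: every post looks like every other post.
    print(self_similarity_matrix(normal_blog).mean(),
          self_similarity_matrix(splog).mean())
```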

Measuring Similarity to Detect Qualified Links (pdf)
Xiaoguang Qi, Lan Nie and Brian Davison

The early success of link-based ranking algorithms was predicated on the assumption that links imply merit of the target pages. However, today many links exist for purposes other than to confer authority. Such links bring noise into link analysis and harm the quality of retrieval. In order to provide high quality search results, it is important to detect them and reduce their influence. In this paper, a method is proposed to detect such links by considering multiple similarity measures over the source pages and target pages. With the help of a classifier, these noisy links are detected and dropped. After that, link analysis algorithms are performed on the reduced link graph. The usefulness of a number of features is also tested. Experiments across 53 query-specific datasets show our approach almost doubles the performance of Kleinberg’s HITS and boosts Bharat and Henzinger’s imp algorithm by close to 9% in terms of precision. It also outperforms a previous approach focusing on link farm detection.
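
As a rough sketch of the general idea (the paper combines multiple similarity measures in a trained classifier; this simplification uses a single Jaccard threshold on page terms), links between topically unrelated pages are dropped before link analysis runs:

```python
import re

def terms(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def qualified_links(links, pages, min_sim: float = 0.1):
    """links: iterable of (src, dst) URLs; pages: dict url -> page text.
    Keep only links whose endpoints share enough vocabulary; the survivors
    feed HITS/PageRank on the reduced link graph."""
    return [(src, dst) for src, dst in links
            if jaccard(terms(pages[src]), terms(pages[dst])) >= min_sim]

if __name__ == "__main__":
    pages = {"a": "python search ranking algorithms",
             "b": "link analysis and ranking for web search",
             "c": "buy cheap watches replica handbags"}
    links = [("a", "b"), ("a", "c")]
    print(qualified_links(links, pages))   # the off-topic link a -> c is dropped
```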

Improving Web Spam Classification using Rank-time Features (pdf)
Krysta Svore, Qiang Wu, Chris Burges and Aaswath Raman

In this paper, we study the classification of web spam. Web spam refers to pages that use techniques to mislead search engines into assigning them higher rank, thus increasing their site traffic. Our contributions are twofold. First, we find that the method of dataset construction is crucial for accurate spam classification, and we note that this problem occurs generally in learning problems and can be hard to detect. In particular, we find that ensuring no overlapping domains between test and training sets is necessary to accurately test a web spam classifier. In our case, classification performance can differ by as much as 40% in precision when using non-domain-separated data. Second, we show that rank-time features can improve the performance of a web spam classifier. Our paper is the first to investigate the use of rank-time features, and in particular query-dependent rank-time features, for web spam detection. We show that the use of rank-time and query-dependent features can lead to an increase in accuracy over a classifier trained using page-based content only.
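
The dataset-construction point is worth illustrating. Below is a short, hypothetical Python helper (not from the paper) that splits labeled pages by host so that no domain contributes pages to both the training and the test sets:

```python
import random
from urllib.parse import urlparse

def domain_separated_split(examples, test_fraction: float = 0.3, seed: int = 0):
    """examples: list of (url, label). Returns (train, test) with no domain overlap."""
    by_domain = {}
    for url, label in examples:
        by_domain.setdefault(urlparse(url).netloc, []).append((url, label))
    domains = sorted(by_domain)
    random.Random(seed).shuffle(domains)
    cutoff = int(len(domains) * (1 - test_fraction))
    train = [ex for d in domains[:cutoff] for ex in by_domain[d]]
    test = [ex for d in domains[cutoff:] for ex in by_domain[d]]
    return train, test

if __name__ == "__main__":
    data = [("http://spamsite.com/a", 1), ("http://spamsite.com/b", 1),
            ("http://news.example.org/x", 0), ("http://blog.example.net/y", 0)]
    train, test = domain_separated_split(data)
    print(train, test, sep="\n")   # spamsite.com pages never straddle the split
```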

Extracting Link Spam using Biased Random Walks from Spam Seed Sets (pdf)
Baoning Wu and Kumar Chellapilla

Link spam deliberately manipulates hyperlinks between web pages in order to unduly boost the search engine ranking of one or more target pages. Link based ranking algorithms such as PageRank, HITS, and other derivatives are especially vulnerable to link spam. Link farms and link exchanges are two common instances of link spam that produce spam communities – i.e., clusters in the web graph. In this paper, we present a directed approach to extracting link spam communities when given one or more members of the community. In contrast to previous completely automated approaches to finding link spam, our method is specifically designed to be used interactively. Our approach starts with a small spam seed set provided by the user and simulates a random walk on the web graph. The random walk is biased to explore the local neighborhood around the seed set through the use of decay probabilities. Truncation is used to retain only the most frequently visited nodes. After termination, the nodes are sorted in decreasing order of their final probabilities and presented to the user. Experiments using manually labeled link spam data sets and random walks from a single seed domain show that the approach achieves over 95.12% precision in extracting large link farms and 80.46% precision in extracting link exchange centroids.
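
Here is a minimal Python sketch of a seeded, decaying random walk in the spirit of the abstract (the graph format, decay value, and truncation size are my assumptions, not the paper’s settings):

```python
import random
from collections import Counter

def biased_walk(graph, seeds, steps=100_000, decay=0.15, top_k=20, seed=0):
    """graph: dict node -> list of neighbors; seeds: spam seed set from the user."""
    rng = random.Random(seed)
    visits = Counter()
    current = rng.choice(seeds)
    for _ in range(steps):
        visits[current] += 1
        # With probability `decay` (or at a dead end), jump back to the seed set,
        # keeping the walk in the local neighborhood around the seeds.
        if rng.random() < decay or not graph.get(current):
            current = rng.choice(seeds)
        else:
            current = rng.choice(graph[current])
    # Truncation: present only the most frequently visited nodes to the analyst.
    return visits.most_common(top_k)

if __name__ == "__main__":
    graph = {"seed": ["farm1", "farm2"], "farm1": ["seed", "farm2"],
             "farm2": ["seed", "farm1", "good"], "good": ["news"], "news": ["good"]}
    print(biased_walk(graph, seeds=["seed"]))   # farm nodes dominate the top ranks
```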

Computing Trusted Authority Scores in Peer-to-Peer Web Search Networks (pdf)
Josiane Xavier Parreira, Debora Donato, Carlos Castillo and Gerhard Weikum

Peer-to-peer (P2P) networks have received great attention for sharing and searching information in large user communities. The open and anonymous nature of P2P networks is one of its main strengths, but it also opens doors to manipulation of the information and of the quality ratings.

In our previous work (J. X. Parreira, D. Donato, S. Michel and G. Weikum in VLDB 2006) we presented the JXP algorithm for distributed computation of PageRank scores for information units (Web pages, sites, peers, social groups, etc.) within a link- or endorsement-based graph structure. The algorithm builds on local authority computations and bilateral peer meetings with exchanges of small data structures that are relevant for gradually learning about global properties and eventually converging towards global authority rankings.

In the current paper we address the important issue of cheating peers that attempt to distort the global authority values, by providing manipulated data during the peer meetings. Our approach to this problem enhances JXP with statistical techniques for detecting suspicious behavior. Our method, coined TrustJXP, is again completely decentralized, and we demonstrate its viability and robustness in experiments with real Web data.

Transductive Link Spam Detection (pdf)
Dengyong Zhou, Christopher Burges and Tao Tao

Web spam can significantly deteriorate the quality of search engines. Early web spamming techniques mainly manipulate page content. Since linkage information is widely used in web search, link-based spamming has also developed. So far, many techniques have been proposed to detect link spam. Those approaches are basically built on link-based web ranking methods.

In contrast, we cast the link spam detection problem into a machine learning problem of classification on directed graphs. We develop discrete analysis on directed graphs, and construct a discrete analogue of classical regularization theory via discrete analysis. A classification algorithm for directed graphs is then derived from the discrete regularization. We have applied the approach to real-world link spam detection problems, and encouraging results have been obtained.
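
The paper’s discrete regularization on directed graphs is beyond a short snippet, but the transductive flavor can be illustrated with a simpler label-propagation sketch in Python (not the authors’ algorithm; the graph is symmetrized and the data invented): a few hosts labeled spam or normal spread their labels to unlabeled neighbors over the link graph.

```python
import numpy as np

def propagate_labels(adj: np.ndarray, labels: dict, alpha=0.8, iters=50):
    """adj: directed adjacency matrix; labels: node index -> +1 (spam) / -1 (normal)."""
    W = adj + adj.T                          # crude symmetrization of the link graph
    d = W.sum(axis=1)
    d[d == 0] = 1.0
    S = W / np.sqrt(np.outer(d, d))          # normalized affinity matrix
    y = np.zeros(len(adj))
    for i, v in labels.items():
        y[i] = v
    f = y.copy()
    for _ in range(iters):
        f = alpha * S @ f + (1 - alpha) * y   # diffuse labels, anchored to the seeds
    return f                                  # sign(f) is the predicted class

if __name__ == "__main__":
    # Nodes 0-2 form a link farm, 3-5 are normal pages; only nodes 0 and 4 are labeled.
    adj = np.zeros((6, 6))
    for u, v in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]:
        adj[u, v] = 1
    print(np.round(propagate_labels(adj, {0: 1.0, 4: -1.0}), 2))
```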

Short Papers

Web Spam Detection via Commercial Intent Analysis
András A. Benczúr, István Bíró, Károly Csalogány and Tamás Sárlós

We propose a number of features for Web spam filtering based on the occurrence of keywords that are either of high advertisement value or highly spammed. Our features include popular words from search engine query logs as well as high cost or volume words according to Google AdWords. We also demonstrate the spam filtering power of the Online Commercial Intention (OCI) value assigned to a URL in a Microsoft adCenter Labs demonstration and the Yahoo! Mindset classification of Web pages as either commercial or non-commercial, as well as metrics based on the occurrence of Google ads on the page. We run our tests on the WEBSPAM-UK2006 dataset recently compiled by Castillo et al. as a standard means of measuring the performance of Web spam detection algorithms. Our features improve the classification accuracy of the publicly available WEBSPAM-UK2006 features by 3%.
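
A hedged sketch of what keyword-occurrence features of this kind might look like in Python (the term list is a placeholder of my own, not the AdWords or query-log data the authors use):

```python
import re

# Placeholder list standing in for high-advertisement-value / heavily spammed terms.
HIGH_VALUE_TERMS = {"insurance", "mortgage", "casino", "viagra", "loans", "hosting"}

def commercial_term_features(page_text: str) -> dict:
    """Simple occurrence-based features a spam classifier could consume."""
    words = re.findall(r"[a-z]+", page_text.lower())
    hits = [w for w in words if w in HIGH_VALUE_TERMS]
    return {
        "high_value_count": len(hits),
        "high_value_ratio": len(hits) / max(len(words), 1),
        "distinct_high_value": len(set(hits)),
    }

if __name__ == "__main__":
    print(commercial_term_features("cheap insurance and mortgage loans casino casino"))
```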

Improving Web Spam Classifiers Using Link Structure
Qingqing Gan and Torsten Suel

Web spam has been recognized as one of the top challenges in the search engine industry [14]. A lot of recent work has addressed the problem of detecting or demoting web spam, including both content spam [16, 12] and link spam [22, 13]. However, any time an anti-spam technique is developed, spammers will design new spamming techniques to confuse search engine ranking methods and spam detection mechanisms. Machine learning-based classification methods can quickly adapt to newly developed spam techniques. We describe a two-stage approach to improve the performance of common classifiers. We first implement a classifier to catch a large portion of spam in our data. Then we design several heuristics to decide if a node should be relabeled based on the pre-classified result and knowledge about the neighborhood. Our experimental results show visible improvements with respect to precision and recall.
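
The second-stage relabeling heuristic might look roughly like the following Python sketch (the data structures and threshold are assumptions of mine, not the paper’s): a node is flipped to spam when enough of its linked neighbors were labeled spam by the first-stage classifier.

```python
from collections import defaultdict

def relabel_by_neighborhood(labels: dict, links, threshold: float = 0.7) -> dict:
    """labels: node -> 0/1 from the first-stage classifier; links: (src, dst) pairs."""
    neighbors = defaultdict(set)
    for u, v in links:
        neighbors[u].add(v)
        neighbors[v].add(u)
    relabeled = dict(labels)
    for node, nbrs in neighbors.items():
        spam_ratio = sum(labels.get(n, 0) for n in nbrs) / len(nbrs)
        if spam_ratio >= threshold:
            relabeled[node] = 1        # surrounded by spam -> relabel as spam
    return relabeled

if __name__ == "__main__":
    labels = {"a": 0, "b": 1, "c": 1, "d": 0}
    links = [("a", "b"), ("a", "c")]
    print(relabel_by_neighborhood(labels, links))   # "a" gets flipped to spam
```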

A Large-Scale Study of Link Spam Detection by Graph Algorithms
Hiroo Saito, Masashi Toyoda, Masaru Kitsuregawa and Kazuyuki Aihara

Link spam refers to attempts to promote the ranking of spammers’ web sites by deceiving link-based ranking algorithms in search engines. Spammers often create densely connected link structures of sites, so-called “link farms”. In this paper, we study the overall structure and distribution of link farms in a large-scale graph of the Japanese Web with 5.8 million sites and 283 million links. To examine the spam structure, we apply three graph algorithms to the web graph. First, the web graph is decomposed into strongly connected components (SCCs). Besides the largest SCC (the core) in the center of the web, we have observed that most of the large components consist of link farms. Next, to extract spam sites in the core, we enumerate maximal cliques as seeds of link farms. Finally, we expand these link farms into a reliable spam seed set by a minimum cut technique that separates links among spam and non-spam sites. We found about 0.6 million spam sites in SCCs around the core, and extracted an additional 8 thousand and 49 thousand sites as spam with high precision in the core by the maximal clique enumeration and by the minimum cut technique, respectively.
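
The first two graph steps are easy to illustrate on a toy graph using networkx (a sketch only; the paper works on a 5.8-million-site web graph and adds a minimum cut expansion step not shown here):

```python
import networkx as nx

# Toy directed site graph: a densely linked "core" plus a small two-site farm.
G = nx.DiGraph()
G.add_edges_from([
    ("core1", "core2"), ("core2", "core3"), ("core3", "core1"),   # core cycle
    ("core2", "core1"), ("core3", "core2"),                       # densify the core
    ("core1", "farmA"), ("farmA", "farmB"), ("farmB", "farmA"),   # small farm SCC
])

# Step 1: strongly connected component decomposition.
sccs = sorted(nx.strongly_connected_components(G), key=len, reverse=True)
print("SCC sizes:", [len(c) for c in sccs])

# Step 2: enumerate maximal cliques inside the largest SCC (the core) as
# candidate link-farm seeds, on the undirected version of that subgraph.
core = G.subgraph(sccs[0]).to_undirected()
print("maximal cliques in the core:", list(nx.find_cliques(core)))
```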




About the author

Bill Slawski
Contributor
Bill Slawski was the Director of Search Marketing for Go Fish Digital and the editor of SEO by the Sea. He started doing SEO and web promotion in the mid-90s, and was a legal and technical administrator in the highest level trial court in Delaware.
