WWW2008: Search Research Paper Roundup



A variety of interesting research papers on search have come out of WWW2008, the 17th International World Wide Web Conference. Some I’ve blogged already. Below is a rundown on those and some other papers that may be of interest. For the attention-challenged, I’ve also included my now patented "Twitter" summary for some of the interesting or more accessible papers, to tell you the highlights.

PageRank for Image Search
Google

Abstract: In this paper, we cast the image-ranking problem into the task
of identifying "authority" nodes on an inferred visual similarity graph and
propose an algorithm to analyze the visual link structure that can be created
among a group of images. Through an iterative procedure based on the PageRank
computation, a numerical weight is assigned to each image; this measures its
relative importance to the other images being considered. The incorporation of
visual signals in this process differs from the majority of large-scale
commercial search engines in use today. Commercial search engines often solely
rely on the text clues of the pages in which images are embedded to rank images,
and often entirely ignore the content of the images themselves as a ranking
signal. To quantify the performance of our approach in a real-world system, we
conducted a series of experiments based on the task of retrieving images for
2000 of the most popular product queries. Our experimental results show
significant improvement, in terms of user satisfaction and relevancy, in
comparison to the most recent Google Image Search results.

Danny’s Twitter Summary: Google finds way to rank images better by virtual links of similarities.

See also the Search Engine Land story:
Google Paper: Better Image Search Through VisualRank / Image Rank
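
The core computation is easy to sketch: it’s essentially PageRank run over image-to-image similarity scores instead of hyperlinks. Here’s a minimal Python sketch, with made-up similarity numbers standing in for the visual-feature matching the paper actually performs:

```python
import numpy as np

# Hypothetical 4-image example: sim[i][j] is a visual-similarity score
# between images i and j (the paper infers these from image features;
# these numbers are invented just to show the computation).
sim = np.array([
    [0.0, 0.8, 0.1, 0.0],
    [0.8, 0.0, 0.5, 0.1],
    [0.1, 0.5, 0.0, 0.9],
    [0.0, 0.1, 0.9, 0.0],
])

def visual_rank(sim, damping=0.85, iters=50):
    """PageRank-style power iteration over a similarity graph."""
    n = sim.shape[0]
    # Column-normalize so each image distributes its score in
    # proportion to how similar each neighbor is.
    col_sums = sim.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    transition = sim / col_sums
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = (1 - damping) / n + damping * transition @ rank
    return rank

print(visual_rank(sim))  # the highest-scoring image is the visual "authority"
```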

Spatial Variation in Search
Engine Queries

Yahoo & Cornell University

Abstract: Local aspects of Web search — associating Web content and
queries with geography — is a topic of growing interest. However, the
underlying question of how spatial variation is manifested in search queries is
still not well understood. Here we develop a probabilistic framework for
quantifying such spatial variation; on complete Yahoo! query logs, we find that
our model is able to localize large classes of queries to within a few miles of
their natural centers based only on the distribution of activity for the query.
Our model provides not only an estimate of a query’s geographic center, but also
a measure of its spatial dispersion, indicating whether it has highly local
interest or broader regional or national appeal. We also show how variations on
our model can track geographically shifting topics over time, annotate a map
with each location’s "distinctive queries," and delineate the "spheres of
influence" for competing queries in the same general domain.

Danny’s Twitter Summary: Yahoo shows how any query can have a geographic center.

See also the Search Engine Land story:
Yahoo Paper: Finding The Local "Center" Of Search Queries
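
The flavor of the model is easy to demonstrate: given per-location activity counts for a query, estimate a center and a spread. The toy sketch below just takes a count-weighted centroid and the weighted mean distance from it, where the paper fits a full probabilistic model; the observations are invented:

```python
import math

# Hypothetical (latitude, longitude, query_count) observations for one
# query, aggregated from logs by the searcher's location.
observations = [
    (40.71, -74.00, 500),   # New York area searches
    (40.73, -73.99, 300),
    (34.05, -118.24, 20),   # a few far-away searches
]

def center_and_dispersion(obs):
    """Count-weighted geographic center plus a crude spread measure.

    A stand-in for the paper's probabilistic model: low dispersion
    suggests a locally focused query, high dispersion a regional or
    national one.
    """
    total = sum(c for _, _, c in obs)
    lat = sum(la * c for la, _, c in obs) / total
    lon = sum(lo * c for _, lo, c in obs) / total
    spread = sum(math.hypot(la - lat, lo - lon) * c
                 for la, lo, c in obs) / total
    return (lat, lon), spread

print(center_and_dispersion(observations))
```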

Mining the Search Trails of
Surfing Crowds: Identifying Relevant Websites From User Activity

Microsoft Research

Abstract: The paper proposes identifying relevant information sources
from the history of combined searching and browsing behavior of many Web users.
While it has been previously shown that user interactions with search engines
can be employed to improve document ranking, browsing behavior that occurs
beyond search result pages has been largely overlooked in prior work. The paper
demonstrates that users’ post-search browsing activity strongly reflects
implicit endorsement of visited pages, which allows estimating topical relevance
of Web resources by mining large-scale datasets of search trails. We present
heuristic and probabilistic algorithms that rely on such datasets for suggesting
authoritative websites for search queries. Experimental evaluation shows that
exploiting complete post-search browsing trails outperforms alternatives in
isolation (e.g., clickthrough logs), and yields accuracy improvements when
employed as a feature in learning to rank for Web search.

Danny’s Twitter Summary: Microsoft studies using surfing patterns after a search to improve ranking.

See also the Search Engine Land story:
Microsoft Paper: Improving Search Results By Mining Web Surfing Activity
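
The heuristic side of this is easy to picture: treat every site visited on a post-search trail as an implicit vote for that site on that query. The trails below are invented (the paper mines large-scale logs), but the counting captures the basic idea:

```python
from collections import defaultdict

# Hypothetical search trails: (query, pages visited after the search).
trails = [
    ("digital cameras", ["engine.example/results", "dpreview.com",
                         "dpreview.com/reviews"]),
    ("digital cameras", ["dpreview.com", "amazon.com"]),
    ("digital cameras", ["cnet.com"]),
]

def site_of(url):
    return url.split("/")[0]

def heuristic_endorsement(trails):
    """Score sites by how often they appear on trails for a query;
    each visit counts as an implicit endorsement."""
    scores = defaultdict(lambda: defaultdict(int))
    for query, pages in trails:
        for page in pages:
            scores[query][site_of(page)] += 1
    return scores

for query, sites in heuristic_endorsement(trails).items():
    print(query, sorted(sites.items(), key=lambda kv: -kv[1]))
```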

Genealogical Trees on the Web:
A Search Engine User Perspective

Yahoo Research & Federal University of Minas Gerais

Abstract: This paper presents an extensive study about the evolution of
textual content on the Web, which shows how some new pages are created from
scratch while others are created using already existing content. We show that a
significant fraction of the Web is a byproduct of the latter case. We introduce
the concept of Web genealogical tree, in which every page in a Web snapshot is
classified into a component. We study in detail these components, characterizing
the copies and identifying the relation between a source of content and a search
engine, by comparing page relevance measures, documents returned by real queries
performed in the past, and click-through data. We observe that sources of copies
are more frequently returned by queries and more clicked than other documents.

Danny’s Twitter Summary: Yahoo paper on how 1/4 of new web docs have content from existing ones. Insight into scrapers & duplicate content? But based on Spanish docs.

Query-Sets:
Using Implicit Feedback and Query Patterns to Organize Web Documents

Yahoo Research & Universitat Pompeu Fabra

Abstract: In this paper we present a new document representation model
based on implicit user feedback obtained from search engine queries. The main
objective of this model is to achieve better results in non-supervised tasks,
such as clustering and labeling, through the incorporation of usage data
obtained from search engine queries. This type of model allows us to discover
the motivations of users when visiting a certain document. The terms used in
queries can provide a better choice of features, from the user’s point of view,
for summarizing the Web pages that were clicked from these queries. In this work
we extend and formalize as "query model" an existing but not very well known
idea of "query view" for document representation. Furthermore, we create a novel
model based on "frequent query patterns" called the "query-set model". Our
evaluation shows that both "query-based" models outperform the vector-space
model when used for clustering and labeling documents in a website. In our
experiments, the query-set model reduces by more than 90% the number of features
needed to represent a set of documents and improves by over 90% the quality of
the results. We believe that this can be explained because our model chooses
better features and provides more accurate labels according to the user’s
expectations.

Danny’s Twitter Summary: Yahoo defining documents by queries they might satisfy rather than as containing individual words.
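
A toy illustration of the underlying "query view" idea, representing a document by the queries that led to clicks on it rather than by its own words (the paper’s query-set model goes further, mining frequent patterns of query terms); the click log is invented:

```python
from collections import defaultdict

# Hypothetical click log: (query, clicked_document) pairs.
click_log = [
    ("cheap flights paris", "doc1"),
    ("paris flight deals", "doc1"),
    ("cheap flights paris", "doc2"),
    ("eiffel tower hours", "doc3"),
]

def query_view(log):
    """Represent each document by the set of queries that clicked it.
    These features are far fewer than a full text vocabulary, which is
    where the reported feature reduction comes from."""
    docs = defaultdict(set)
    for query, doc in log:
        docs[doc].add(query)
    return dict(docs)

print(query_view(click_log))
# doc1 is described by {'cheap flights paris', 'paris flight deals'}
```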

Using the Wisdom of the Crowds
for Keyword Generation

Microsoft Research

Abstract: In the sponsored search model, search engines are paid by
businesses that are interested in displaying ads for their site alongside the
search results. Businesses bid for keywords, and their ad is displayed when the
keyword is queried to the search engine. An important problem in this process is
"keyword generation": given a business that is interested in launching a
campaign, suggest keywords that are related to that campaign. We address this
problem by making use of the query logs of the search engine. We identify
queries related to a campaign by exploiting the associations between queries and
URLs as they are captured by the user’s clicks. These queries form good keyword
suggestions since they capture the “wisdom of the crowd” as to what is related
to a site. We formulate the problem as a semi-supervised learning problem, and
propose algorithms within the Markov Random Field model. We perform experiments
with real query logs, and we demonstrate that our algorithms scale to large
query logs and produce meaningful results.

Danny’s Twitter Summary: Microsoft paper looks at how keyword suggestions for advertisers can be generated by monitoring click logs.
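
The basic signal can be sketched as a walk over the bipartite query-URL click graph, starting from the advertiser’s site. The snippet below is a plain graph expansion, not the paper’s Markov Random Field formulation, and the click log is invented:

```python
from collections import defaultdict

# Hypothetical click log as (query, clicked_url) pairs.
clicks = [
    ("running shoes", "acme-shoes.example"),
    ("running shoes", "shoefinder.example"),
    ("marathon trainers", "acme-shoes.example"),
    ("trail sneakers", "shoefinder.example"),
]

def suggest_keywords(clicks, seed_url, hops=1):
    """Suggest bid keywords for a site: queries that clicked the seed
    URL, then queries sharing URLs with those queries, and so on."""
    url_to_queries = defaultdict(set)
    query_to_urls = defaultdict(set)
    for q, u in clicks:
        url_to_queries[u].add(q)
        query_to_urls[q].add(u)

    keywords = set(url_to_queries[seed_url])
    frontier = set(keywords)
    for _ in range(hops):
        urls = {u for q in frontier for u in query_to_urls[q]}
        frontier = {q for u in urls for q in url_to_queries[u]} - keywords
        keywords |= frontier
    return keywords

print(suggest_keywords(clicks, "acme-shoes.example"))
# one hop also pulls in 'trail sneakers' via the shared second URL
```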


Performance of Compressed
Inverted List Caching in Search Engines

Microsoft & Polytechnic University

Abstract:
Due to the rapid growth in the size of the web, web search engines are facing
enormous performance challenges. The larger engines in particular have to be
able to process tens of thousands of queries per second on tens of billions of
documents, making query throughput a critical issue. To satisfy this heavy
workload, search engines use a variety of performance optimizations including
index compression, caching, and early termination. We focus on two techniques,
inverted index compression and index caching, which play a crucial role in web
search engines as well as other high-performance information retrieval systems.
We perform a comparison and evaluation of several inverted list compression
algorithms, including new variants of existing algorithms that have not been
studied before. We then evaluate different inverted list caching policies on
large query traces, and finally study the possible performance benefits of
combining compression and caching. The overall goal of this paper is to provide
an updated discussion and evaluation of these two techniques, and to show how to
select the best set of approaches and settings depending on parameters such as
disk speed and main memory cache size.

Danny’s Twitter Summary: Microsoft paper with nice background on how search engines quickly find and cache results.
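
As a taste of what inverted list compression involves, here is variable-byte coding of docID gaps, one of the classic schemes in the family the paper benchmarks (its evaluation covers several algorithms plus caching policies on top):

```python
def vbyte_encode(doc_ids):
    """Variable-byte encode a sorted postings list as d-gaps: each gap
    is split into 7-bit chunks, and the high bit marks the last byte."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        chunks = []
        while True:
            chunks.append(gap & 0x7F)
            gap >>= 7
            if not gap:
                break
        for chunk in reversed(chunks[1:]):
            out.append(chunk)            # continuation bytes
        out.append(chunks[0] | 0x80)     # terminator bit on last byte
    return bytes(out)

def vbyte_decode(data):
    doc_ids, current, prev = [], 0, 0
    for byte in data:
        if byte & 0x80:
            prev += (current << 7) | (byte & 0x7F)
            doc_ids.append(prev)
            current = 0
        else:
            current = (current << 7) | byte
    return doc_ids

postings = [3, 7, 150, 152, 10000]
encoded = vbyte_encode(postings)
assert vbyte_decode(encoded) == postings
print(len(encoded), "bytes instead of", 4 * len(postings))  # 7 vs 20
```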


Improving Relevance Judgment of
Web Search Results with Image Excerpts

Microsoft Research Asia

Abstract: Current web search engines return result pages containing mostly text summary
even though the matched web pages may contain informative pictures. A text
excerpt (i.e. snippet) is generated by selecting keywords around the matched
query terms for each returned page to provide context for user’s relevance
judgment. However, in many scenarios, we found that the pictures in web pages,
if selected properly, could be added into search result pages and provide richer
contextual description, because a picture is worth a thousand words. We call these new summaries image excerpts. Through a well designed user study, we demonstrate that image excerpts can help users make much quicker relevance judgments of search
results for a wide range of query types. To implement this idea, we propose a
practicable approach to automatically generate image excerpts in the result
pages by considering the dominance of each picture in each web page and the
relevance of the picture to the query. We also outline an efficient way to
incorporate image excerpts in web search engines. Web search engines can adopt
our approach by slightly modifying their index and inserting a few low cost
operations in their workflow. Our experiments on a large web dataset indicate
the performance of the proposed approach is very promising.

Danny’s Twitter Summary: Microsoft paper on finding a page’s dominant image to use next to search listings to improve relevancy of results.

Tag-Based Social Interest
Discovery

Yahoo

Abstract: The success and popularity of social network systems, such
as del.icio.us, Facebook, MySpace, and YouTube, have generated many interesting
and challenging problems to the research community. Among others, discovering
social interests shared by groups of users is very important because it helps to
connect people with common interests and encourages people to contribute and
share more contents. The main challenge to solving this problem comes from the difficulty of detecting and representing the interests of the users. The existing approaches are all based on the online connections of users and so are unable to identify the common interests of users who have no online connections.
In this paper, we propose a novel social interest discovery approach based on
user-generated tags. Our approach is motivated by the key observation that in a
social network, human users tend to use descriptive tags to annotate the
contents that they are interested in. Our analysis on a large amount of
real-world traces reveals that in general, user-generated tags are consistent
with the web content they are attached to, while more concise and closer to the
understanding and judgments of human users about the content. Thus, patterns of
frequent co-occurrences of user tags can be used to characterize and capture
topics of user interests. We have developed an Internet Social Interest
Discovery system, ISID, to discover the common user interests and cluster users
and their saved URLs by different interest topics. Our evaluation shows that
ISID can effectively cluster similar documents by interest topics and discover
user communities with common interests no matter if they have any online
connections.

Danny’s Twitter Summary: Yahoo paper on how human tags can be a more relevant way to determine the main topic of a page than keyword analysis.
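
The key signal, frequently co-occurring tags, takes only a few lines to demonstrate. The real ISID system mines larger frequent patterns over huge traces; the bookmarks below are invented:

```python
from collections import Counter
from itertools import combinations

# Hypothetical bookmarks: each entry is the tag set one user attached
# to one URL, del.icio.us-style.
posts = [
    {"python", "tutorial", "programming"},
    {"python", "programming", "web"},
    {"photography", "camera"},
    {"python", "tutorial"},
]

def frequent_tag_pairs(posts, min_support=2):
    """Count co-occurring tag pairs and keep the frequent ones, the
    basic signal used to characterize shared interest topics."""
    counts = Counter()
    for tags in posts:
        counts.update(combinations(sorted(tags), 2))
    return {pair: c for pair, c in counts.items() if c >= min_support}

print(frequent_tag_pairs(posts))
# {('programming', 'python'): 2, ('python', 'tutorial'): 2}
```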

Learning to Rank Relational
Objects and Its Application to Web Search

Microsoft Research Asia, Tsinghua University, Peking University

Abstract: Learning to rank is a new statistical learning technology on
creating a ranking model for sorting objects. The technology has been
successfully applied to web search, and is becoming one of the key machineries
for building search engines. Existing approaches to learning to rank, however,
did not consider the cases in which relationships exist between the objects to be ranked, despite the fact that such situations are very common in practice. For example, in web search, given a query, certain relationships usually exist among the retrieved documents, e.g., URL hierarchy,
similarity, etc., and sometimes it is necessary to utilize the information in
ranking of the documents. This paper addresses the issue and formulates it as a
novel learning problem, referred to as ‘learning to rank relational objects’.
In the new learning task, the ranking model is defined as a function of not only
the contents (features) of objects but also the relations between objects. The
paper further focuses on one setting of the learning problem in which the way of
using relation information is predetermined. It formalizes the learning task as
an optimization problem in the setting. The paper then proposes a new method to
perform the optimization task, particularly an implementation based on SVM.
Experimental results show that the proposed method outperforms the baseline
methods for two ranking tasks (Pseudo Relevance Feedback and Topic Distillation)
in web search, indicating that the proposed method can indeed make effective use
of relation information and content information in ranking.

Danny’s Twitter Summary: Microsoft paper on "relationship" ranking, such as parent-child documents, topic similarities, and related relevancy.


Modeling Anchor Text and
Classifying Queries to Enhance Web Document Retrieval

University of Tsukuba

Abstract: Several types of queries are widely used on the World Wide Web and the
expected retrieval method can vary depending on the query type. We propose a
method for classifying queries into informational and navigational types.
Because terms in navigational queries often appear in anchor text for links to
other pages, we analyze the distribution of query terms in anchor texts on the
Web for query classification purposes. While content-based retrieval is
effective for informational queries, anchor-based retrieval is effective for
navigational queries. Our retrieval system combines the results obtained with
the content-based and anchor-based retrieval methods, in which the weight for
each retrieval result is determined automatically depending on the result of the
query classification. We also propose a method for improving anchor-based
retrieval. Our retrieval method, which computes the probability that a document
is retrieved in response to the given query, identifies synonyms of query terms
in the anchor texts on the Web and uses these synonyms for smoothing purposes in
the probability estimation. We use the NTCIR test collections and show the
effectiveness of individual methods and the entire Web retrieval system
experimentally.
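
The classification signal can be caricatured very simply: if a query’s terms show up often in anchor text, the query is probably navigational. The paper models the full distribution of query terms in anchor text rather than the single ratio below, and the counts and threshold here are invented:

```python
from collections import Counter

# Hypothetical frequency counts over all anchor-text terms in a crawl.
anchor_counts = Counter({
    "yahoo": 90000, "ebay": 70000, "homepage": 50000,
    "symptoms": 40, "treatment": 55, "diabetes": 300,
})
total_anchors = sum(anchor_counts.values())

def looks_navigational(query, threshold=1e-3):
    """Flag a query as navigational when its terms are common in
    anchor text; the threshold is an illustrative guess."""
    terms = query.lower().split()
    freq = sum(anchor_counts[t] for t in terms) / (total_anchors * len(terms))
    return freq > threshold

for q in ["yahoo", "diabetes symptoms treatment"]:
    print(q, "->", "navigational" if looks_navigational(q) else "informational")
```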

Unsupervised Query
Segmentation using Generative Language Models and Wikipedia

Yahoo & University of Illinois at Urbana-Champaign

Abstract: In this paper, we propose a novel unsupervised approach to
query segmentation, an important task in Web search. We use a generative query
model to recover a query’s underlying concepts that compose its original
segmented form. The model’s parameters are estimated using an
expectation-maximization (EM) algorithm, optimizing the minimum description
length objective function on a partial corpus that is specific to the query. To
augment this unsupervised learning, we incorporate evidence from Wikipedia.
Experiments show that our approach dramatically improves performance over the
traditional approach that is based on mutual information, and produces
comparable results with a supervised method. In particular, the basic generative
language model contributes a 7.4% improvement over the mutual information based
method (measured by segment F1 on the Intersection test set). EM optimization
further improves the performance by 14.3%. Additional knowledge from Wikipedia
provides another improvement of 24.3%, adding up to a total of 46% improvement
(from 0.530 to 0.774).
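
Once you have per-segment probabilities, finding the best segmentation is a textbook dynamic program, the same search the paper’s generative model performs (its probabilities come from EM over a query-specific corpus plus Wikipedia evidence, not the hand-made table used here):

```python
# Hypothetical log-probabilities for candidate segments; the boosted
# multi-word entries play the role of Wikipedia-title evidence.
segment_logp = {
    ("new",): -6.0, ("york",): -7.0, ("times",): -6.5,
    ("new", "york"): -4.0,
    ("new", "york", "times"): -3.5,
    ("york", "times"): -9.0,
}

def segment(words):
    """Pick the split of the query maximizing the summed segment
    log-probability, via dynamic programming over split points."""
    best = {0: (0.0, [])}
    for end in range(1, len(words) + 1):
        candidates = []
        for start in range(end):
            seg = tuple(words[start:end])
            logp = segment_logp.get(seg, -20.0)  # unknown-segment penalty
            score, segs = best[start]
            candidates.append((score + logp, segs + [seg]))
        best[end] = max(candidates)
    return best[len(words)][1]

print(segment(["new", "york", "times"]))  # [('new', 'york', 'times')]
```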


Knowledge Sharing and Yahoo
Answers: Everyone Knows Something

University Of Michigan

Abstract: Yahoo Answers (YA) is a large and diverse question-answer
forum, acting not only as a medium for sharing technical knowledge, but as a
place where one can seek advice, gather opinions, and satisfy one’s curiosity
about a countless number of things. In this paper, we seek to understand YA’s
knowledge sharing and activity. We analyze the forum categories and cluster them
according to content characteristics and patterns of interaction among the
users. While interactions in some categories resemble expertise sharing forums,
others incorporate discussion, everyday advice, and support. With such a
diversity of categories in which one can participate, we find that some users
focus narrowly on specific topics, while others participate across categories.
This not only allows us to map related categories, but to characterize the
entropy of the users’ interests. We find that lower entropy correlates with
receiving higher answer ratings, but only for categories where factual expertise
is primarily sought after. We combine both user attributes and answer
characteristics to predict, within a given category, whether a particular answer
will be chosen as the best answer by the asker.


A Graph-Theoretic Approach to
Webpage Segmentation

Yahoo Research

Abstract: We consider the problem of segmenting a webpage into visually and semantically
cohesive pieces. Our approach is based on formulating an appropriate
optimization problem on weighted graphs, where the weights capture if two nodes
in the DOM tree should be placed together or apart in the segmentation; we
present a learning framework to learn these weights from manually labeled data
in a principled manner. Our work is a significant departure from previous
heuristic and rule-based solutions to the segmentation problem. The results of
our empirical analysis bring out interesting aspects of our framework, including
variants of the optimization problem and the role of learning.

Ranking Refinement
and Its Application to Information Retrieval

Microsoft Research Asia & Michigan State University

Abstract: We consider the problem of ranking refinement, i.e., to
improve the accuracy of an existing ranking function with a small set of labeled
instances. We are, particularly, interested in learning a better ranking
function using two complementary sources of information, ranking information
given by the existing ranking function (i.e., the base ranker) and that obtained
from users’ feedbacks. This problem is very important in information retrieval
where feedbacks are gradually collected. The key challenge in combining the two
sources of information arises from the fact that the ranking information
presented by the base ranker tends to be imperfect and the ranking information
obtained from users’ feedbacks tends to be noisy. We present a novel boosting
algorithm for ranking refinement that can effectively leverage the use of the
two sources of information. Our empirical study shows that the proposed
algorithm is effective for ranking refinement, and furthermore it significantly
outperforms the baseline algorithms that incorporate the outputs from the base
ranker as an additional feature.

Contextual Advertising by
Combining Relevance with Click Feedback

Yahoo Research

Abstract: Contextual advertising supports much of the Web’s ecosystem
today. User experience and revenue (shared by the site publisher and the ad
network) depend on the relevance of the displayed ads to the page content. As
with other document retrieval systems, relevance is provided by scoring the
match between individual ads (documents) and the content of the page where the
ads are shown (query). In this paper we show how this match can be improved
significantly by augmenting the ad-page scoring function with extra parameters
from a logistic regression model on the words in the pages and ads. A key
property of the proposed model is that it can be mapped to standard cosine
similarity matching and is suitable for efficient and scalable implementation
over inverted indexes. The model parameter values are learnt from logs
containing ad impressions and clicks, with shrinkage estimators being used to
combat sparsity. To scale our computations to train on an extremely large
training corpus consisting of several gigabytes of data, we parallelize our
fitting algorithm in a Hadoop framework. Experimental evaluation is provided
showing improved click prediction over a holdout set of impression and click
events from a large scale real-world ad placement engine. Our best model
achieves a 25% lift in precision relative to a traditional information retrieval
model which is based on cosine similarity, for recalling 10% of the clicks in
our test data.
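
The paper’s key trick, folding click-trained word weights into a cosine-style match so scoring still runs on inverted indexes, can be caricatured like this. The weights are invented; the real model is a logistic regression fit on impression and click logs with shrinkage:

```python
import math

# Hypothetical per-word weights learned from ad click logs; unknown
# words fall back to a neutral 1.0, leaving plain cosine behavior.
word_weight = {"camera": 1.2, "lens": 0.9, "free": 0.3}

def click_score(page_words, ad_words):
    """Cosine-like match between a page and an ad, with each shared
    word's contribution rescaled by its click-trained weight."""
    shared = set(page_words) & set(ad_words)
    dot = sum(word_weight.get(w, 1.0) for w in shared)
    norm = math.sqrt(len(set(page_words)) * len(set(ad_words)))
    return dot / norm if norm else 0.0

page = ["camera", "lens", "review"]
print(click_score(page, ["camera", "lens", "sale"]))   # boosted match
print(click_score(page, ["free", "review", "sale"]))   # weaker match
```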

Learning to Classify Short and
Sparse Text & Web with Hidden Topics from Large-Scale Data Collections

Tohoku University & Japan Advanced Institute of Science & Technology

Abstract: This paper presents a general framework for building
classifiers that deal with short and sparse text & Web segments by making the
most of hidden topics discovered from large-scale data collections. The main
motivation of this work is that many classification tasks working with short
segments of text & Web, such as search snippets, forum & chat messages, blog &
news feeds, product reviews, and book & movie summaries, fail to achieve high
accuracy due to the data sparseness. We, therefore, come up with an idea of
gaining external knowledge to make the data more related as well as expand the
coverage of classifiers to handle future data better. The underlying idea of the
framework is that for each classification task, we collect a large-scale
external data collection called “universal dataset”, and then build a
classifier on both a (small) set of labeled training data and a rich set of
hidden topics discovered from that data collection. The framework is general
enough to be applied to different data domains and genres ranging from Web
search results to medical text. We did a careful evaluation on several hundred
megabytes of Wikipedia (30M words) and MEDLINE (18M words) with two tasks: “Web
search domain disambiguation” and “disease categorization for medical text”,
and achieved significant quality enhancement.
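
The framework’s central move, padding a sparse snippet with hidden-topic features learned from a big external corpus, can be sketched with a fake two-topic word table; a real system would fit a topic model on something like Wikipedia or MEDLINE:

```python
# Invented "hidden topics": in practice these come from a topic model
# trained on the large universal dataset.
topic_words = {
    0: {"film", "movie", "actor", "director"},
    1: {"gene", "protein", "disease", "cell"},
}

def topic_features(snippet):
    """Append hidden-topic indicators to the raw bag of words, so two
    snippets sharing no words can still match through a topic."""
    words = set(snippet.lower().split())
    feats = set(words)
    for topic, vocab in topic_words.items():
        if words & vocab:
            feats.add(f"TOPIC_{topic}")
    return feats

a = topic_features("new actor joins film")
b = topic_features("the next movie from that director")
print(a & b)  # {'TOPIC_0'}: overlap despite disjoint word sets
```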

Generating Diverse and
Representative Image Search Results for Landmarks

Yahoo & Columbia University

Abstract: Can we leverage the community-contributed collections of rich
media on the web to automatically generate representative and diverse views of
the world’s landmarks? We use a combination of context- and content-based tools
to generate representative sets of images for location-driven features and
landmarks, a common search task. To do that, we use location and other
metadata, as well as tags associated with images, and the images’ visual
features. We present an approach to extracting tags that represent landmarks. We
show how to use unsupervised methods to extract representative views and images
for each landmark. This approach can potentially scale to provide better search
and representation for landmarks, worldwide. We evaluate the system in the
context of image search using a real-life dataset of 110,000 images from the San
Francisco area.

Flickr Tag Recommendation
based on Collective Knowledge

Yahoo Research

Abstract: Online photo services such as Flickr and Zooomr allow users to
share their photos with family, friends, and the online community at large. An
important facet of these services is that users manually annotate their photos
using so called tags, which describe the contents of the photo or provide
additional contextual and semantical information. In this paper we investigate
how we can assist users in the tagging phase. The contribution of our research
is twofold. We analyse a representative snapshot of Flickr and present the
results by means of a tag characterisation focussing on how users tag photos
and what information is contained in the tagging. Based on this analysis, we
present and evaluate tag recommendation strategies to support the user in the
photo annotation task by recommending a set of tags that can be added to the
photo. The results of the empirical evaluation show that we can effectively
recommend relevant tags for a variety of photos with different levels of
exhaustiveness of original tagging.
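
The co-occurrence-driven strategy at the heart of the recommender is simple to sketch; the counts below are invented, and the paper’s better variants add promotion and aggregation rules on top of this raw voting:

```python
from collections import Counter

# Hypothetical tag co-occurrence counts mined from a Flickr snapshot.
cooccur = {
    "sunset": Counter({"beach": 120, "sky": 200, "clouds": 90}),
    "beach": Counter({"sea": 150, "sand": 130, "sunset": 120}),
}

def recommend(user_tags, top_n=3):
    """Recommend extra tags for a photo by aggregating co-occurrence
    votes from the tags the user already typed."""
    votes = Counter()
    for tag in user_tags:
        votes.update(cooccur.get(tag, Counter()))
    for tag in user_tags:        # never suggest what is already there
        votes.pop(tag, None)
    return [t for t, _ in votes.most_common(top_n)]

print(recommend(["sunset", "beach"]))  # ['sky', 'sea', 'sand']
```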

Deciphering Mobile Search
Patterns: A Study of Yahoo! Mobile Search Queries

Yahoo

Abstract: In this paper we study the characteristics of search queries
submitted from mobile devices using various Yahoo! mobile oneSearch applications
during a two-month period in the second half of 2007, and report the query patterns derived from 20 million English sample queries submitted by users in the US, Canada, Europe, and Asia. We examine the query distribution and topical
categories the queries belong to in order to find new trends. We compare and
contrast the search patterns between US vs international queries, and between
queries from various search interfaces (XHTML/WAP, java widgets, and SMS). We
also compare our results with previous studies wherever possible, either to
confirm previous findings, or to find interesting differences in the query
distribution and pattern.

NOTE: Looks interesting, but there’s no link from the overview page to the
actual research document, at the moment.

IRLbot: Scaling to 6 Billion
Pages and Beyond

Texas A&M University

Abstract: This paper shares our experience in designing a web crawler
that can download billions of pages using a single-server implementation and
models its performance. We show that with the quadratically increasing
complexity of verifying URL uniqueness, BFS crawl order, and fixed per-host
rate-limiting, current crawling algorithms cannot effectively cope with the
sheer volume of URLs generated in large crawls, highly-branching spam,
legitimate multi-million-page blog sites, and infinite loops created by
server-side scripts. We offer a set of techniques for dealing with these issues
and test their performance in an implementation we call IRLbot. In our recent
experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the web graph with 41 billion unique nodes.
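
To give a feel for the URL-uniqueness bottleneck, the sketch below keeps only the batching shape of the solution: hash incoming URLs, accumulate a batch, and check the sorted batch in one sequential pass instead of probing a structure once per URL. The real crawler merges such batches against sorted data on disk; this in-memory stand-in is heavily simplified:

```python
import hashlib

seen = set()   # stand-in for the crawler's on-disk store of seen hashes

def url_hash(url):
    return hashlib.sha1(url.encode()).digest()[:8]

def new_urls(batch):
    """Batch 'have we seen this URL?' test: sort the batch by hash so a
    real implementation could merge it against a sorted disk file in
    one pass, then keep only first occurrences."""
    fresh = []
    for h, url in sorted((url_hash(u), u) for u in batch):
        if h not in seen:
            seen.add(h)
            fresh.append(url)
    return fresh

print(new_urls(["http://a.example/", "http://b.example/",
                "http://a.example/"]))  # duplicate filtered out
```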

Recrawl Scheduling Based on
Information Longevity

Yahoo Research & Carnegie Mellon University

Abstract: It is crucial for a web crawler to distinguish between
ephemeral and persistent content. Ephemeral content (e.g., quote of the day) is
usually not worth crawling, because by the time it reaches the index it is no
longer representative of the web page from which it was acquired. On the other
hand, content that persists across multiple page updates (e.g., recent blog
postings) may be worth acquiring, because it matches the page’s true content for
a sustained period of time. In this paper we characterize the longevity of
information found on the web, via both empirical measurements and a generative
model that coincides with these measurements. We then develop new recrawl
scheduling policies that take longevity into account. As we show via experiments
over real web data, our policies obtain better freshness at lower cost, compared
with previous approaches.
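
The intuition behind longevity-aware scheduling fits in a few lines: prioritize recrawls by how much durable new content a fetch is expected to capture, not by raw change frequency. The numbers are invented, and the paper’s actual policies come out of its generative model of content lifetimes:

```python
# Hypothetical pages: how often content changes (changes/day) and what
# fraction of changed content is long-lived rather than ephemeral.
pages = {
    "news-frontpage": {"change_rate": 24.0, "longevity": 0.01},
    "blog": {"change_rate": 0.5, "longevity": 0.9},
    "static-docs": {"change_rate": 0.01, "longevity": 1.0},
}

def recrawl_priority(page):
    """Expected durable information gained per day for this page: a
    fast-churning page full of ephemera can rank below a slower page
    whose changes persist."""
    stats = pages[page]
    return stats["change_rate"] * stats["longevity"]

for page in sorted(pages, key=recrawl_priority, reverse=True):
    print(page, round(recrawl_priority(page), 3))
# blog (0.45) outranks news-frontpage (0.24) despite changing less often
```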

iRobot: An Intelligent Crawler
for Web Forums

Microsoft Research Asia

Abstract: We study in this paper the Web forum crawling problem, which is
a very fundamental step in many Web applications, such as search engines and Web data mining. As a typical form of user-created content (UCC), the Web forum has become
important resource on the Web due to its rich information contributed by
millions of Internet users every day. However, Web forum crawling is not a
trivial problem due to the in-depth link structures, the large amount of
duplicate pages, as well as many invalid pages caused by login failure issues.
In this paper, we propose and build a prototype of an intelligent forum crawler,
iRobot, which has intelligence to understand the content and the structure of a
forum site, and then decide how to choose traversal paths among different kinds
of pages. To do this, we first randomly sample (download) a few pages from the
target forum site, and introduce the page content layout as the characteristics
to group those pre-sampled pages and re-construct the forum’s sitemap. After
that, we select an optimal crawling path which only traverses informative pages
and skips invalid and duplicate ones. The extensive experimental results on
several forums show the performance of our system in the following aspects: 1)
Effectiveness – Compared to a generic crawler, iRobot significantly decreases
the duplicate and invalid pages; 2) Efficiency – With a small cost of
pre-sampling a few pages for learning the necessary knowledge, iRobot saves
substantial network bandwidth and storage as it only fetches informative pages
from a forum site; and 3) Long threads that are divided into multiple pages can
be re-concatenated and archived as a whole thread, which is of great help for
further indexing and data mining.

Analyzing Search Engine
Advertising: Firm Behavior and Cross-Selling in Electronic Markets

New York University

Abstract: The phenomenon of sponsored search advertising is gaining
ground as the largest source of revenues for search engines. Firms across
different industries are beginning to adopt this as the primary form of
online advertising. This process works on an auction mechanism in which
advertisers bid for different keywords, and final rank for a given keyword is
allocated by the search engine. But how different are firms’ actual bids from
their optimal bids? Moreover, what are other ways in which firms can potentially
benefit from sponsored search advertising? Based on the model and estimates from
prior work [10], we conduct a number of policy simulations in order to
investigate to what extent an advertiser can benefit from bidding optimally for
its keywords. Further, we build a Hierarchical Bayesian modeling framework to
explore the potential for cross-selling or spillover effects from a given
keyword advertisement across multiple product categories, and estimate the model
using Markov Chain Monte Carlo (MCMC) methods. Our analysis suggests that
advertisers are not bidding optimally with respect to maximizing profits. We
conduct a detailed analysis with product level variables to explore the extent
of cross-selling opportunities across different categories from a given keyword
advertisement. We find that there exists significant potential for cross-selling
through search keyword advertisements in that consumers often end up buying
products from other categories in addition to the product they were searching
for. Latency (the time it takes for a consumer to place a purchase order after
clicking on the advertisement) and the presence of a brand name in the keyword
are associated with consumer spending on product categories that are different
from the one they were originally searching for on the Internet.

Online Learning from Click
Data for Sponsored Search

Yahoo Research

Abstract: Sponsored search is one of the enabling technologies for
today’s Web search engines. It corresponds to matching and showing ads related
to the user query on the search engine results page. Users are likely to click
on topically related ads and the advertisers pay only when a user clicks on
their ad. Hence, it is important to be able to predict if an ad is likely to be
clicked, and maximize the number of clicks. We investigate the sponsored search
problem from a machine learning perspective with respect to three main
sub-problems: how to use click data for training and evaluation, which learning
framework is more suitable for the task, and which features are useful for
existing models. We perform a large scale evaluation based on data from a
commercial Web search engine. Results show that it is possible to learn and
evaluate directly and exclusively on click data encoding pairwise preferences
following simple and conservative assumptions. We find that online multilayer
perceptron learning, based on a small set of features representing content
similarity of different kinds, significantly outperforms an information
retrieval baseline and other learning models, providing a suitable framework for
the sponsored search task.

Automatic Online News Issue
Construction in Web Environment

Tsinghua University

Abstract: In many cases, rather than a keyword search, people intend to
see what is going on through the Internet. Then the integrated comprehensive
information on news topics is necessary, which we called news issues, including
the background, history, current progress, different opinions and discussions,
etc. Traditionally, news issues are manually generated by website editors. It is
time-consuming, hard work, and hence real-time updating is difficult to
perform. In this paper, a three-step automatic online algorithm for news issue
construction is proposed. The first step is a topic detection process, in which
newly appearing stories are clustered into new topic candidates. The second step
is a topic tracking process, where those candidates are compared with previous
topics, either merged into old ones or generating a new one. In the final step,
news issues are constructed by the combination of related topics and updated by
the insertion of new topics. An automatic online news issue construction process
under practical Web circumstances is simulated to perform news issue
construction experiments. F-measure of the best results is either above (topic
detection) or close to (topic detection and tracking) 90%. Four news issue
construction results are successfully generated in different time granularities:
one meets the needs like "what’s new", and the other three will answer questions
like "what’s hot" or "what’s going on". Through the proposed algorithm, news
issues can be effectively and automatically constructed with real-time update,
and much human effort will be freed from tedious manual work.

Finding the Right Facts in the
Crowd: Factoid Question Answering over Social Media

Georgia Institute Of Technology & Emory University

Abstract: Community Question Answering has emerged as a popular and
effective paradigm for a wide range of information needs. For example, to find
out an obscure piece of trivia, it is now possible and even very effective to
post a question on a popular community QA site such as Yahoo! Answers, and to
rely on other users to provide answers, often within minutes. The importance of
such community QA sites is magnified as they create archives of millions of
questions and hundreds of millions of answers, many of which are invaluable for
the information needs of other searchers. However, to make this immense body of
knowledge accessible, effective answer retrieval is required. In particular, as
any user can contribute an answer to a question, the majority of the content
reflects personal, often unsubstantiated opinions. A ranking that combines both
relevance and quality is required to make such archives usable for factual
information retrieval. This task is challenging, as the structure and the
contents of community QA archives differ significantly from the web setting. To
address this problem we present a general ranking framework for factual
information retrieval from social media. Results of a large scale evaluation
demonstrate that our method is highly effective at retrieving well-formed,
factual answers to questions, as evaluated on a standard factoid QA benchmark.
We also show that our learning framework can be tuned with the minimum of manual
labeling. Finally, we provide result analysis to gain deeper understanding of
which features are significant for social media search and retrieval. Our system
can be used as a crucial building block for combining results from a variety of
social media content with general web search results, and to better integrate
social media content for effective information access.

Personalized Interactive
Faceted Search

University Of California, Santa Cruz & McGill University

Abstract: Faceted search is becoming a popular method to allow users to
interactively search and navigate complex information spaces. A faceted search
system presents users with key-value metadata that is used for query refinement.
While popular in e-commerce and digital libraries, not much research has been
conducted on which metadata to present to a user in order to improve the search
experience. Nor are there repeatable benchmarks for evaluating a faceted search
engine. This paper proposes the use of collaborative filtering and
personalization to customize the search interface to each user’s behavior. This
paper also proposes a utility based framework to evaluate the faceted interface.
In order to demonstrate these ideas and better understand personalized faceted
search, several faceted search algorithms are proposed and evaluated using the
novel evaluation methodology.




