Search engines such as Google and AltaVista have only addressed text and printed documents. Term Feedback for Information Retrieval with Language Models, by Bin Tan, Atulya Velivelli, Hui Fang, and ChengXiang Zhai. Assessing Wikipedia-based cross-language retrieval models. Using language models for information retrieval has been studied extensively in recent years [1, 3, 7, 8, 10]. Finally, we conclude our paper and mention some future directions.
Retrieval based on a probabilistic language model (LM) rests on the intuition that users have a reasonable idea of the terms that are likely to occur in documents of interest. Information retrieval models have been studied for decades, leading to a huge body of literature on the topic. The task of ad hoc information retrieval (IR) consists in finding documents in a corpus that are relevant to an information need specified by a user's query. A unigram language model is a probability distribution over the words in a language; generating text consists of pulling words out of that distribution. Interestingly, it is similar to the vector space model, except that we use language models, rather than ordinary term vectors, to represent a document or a query. In Language Modeling for Information Retrieval, 2003, vol.
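As a rough illustration of the unigram language model described above, the following minimal sketch estimates word probabilities from a token list by maximum likelihood and multiplies them to score a short word sequence. The function names and the toy document are assumptions made for this example only; they do not come from any of the cited papers.

```python
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram model: P(w) = count(w) / total tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def sequence_probability(model, tokens):
    """Probability of generating tokens when each word is drawn independently."""
    p = 1.0
    for w in tokens:
        p *= model.get(w, 0.0)  # a word never seen in the text gets probability zero
    return p


# Toy document (hypothetical): 8 tokens, "language" occurs twice.
doc = "language models for information retrieval use language statistics".split()
model = unigram_model(doc)
```

Note that any sequence containing an unseen word receives probability zero under this unsmoothed estimate, which is the problem that smoothing is meant to address.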
A number of linear, feature-based models have recently been proposed by the information retrieval community. The term language model refers to a probabilistic model. The KL-divergence retrieval model was introduced in [6] as a special case of the more general risk minimization retrieval framework. A common suggestion to users for coming up with good queries is to think of words that would likely appear in a relevant document, and to use those words as the query. In this framework, queries and documents are modeled using statistical language models, and user preferences are modeled through loss functions. A common approach is to generate a maximum-likelihood model for the entire collection and linearly interpolate the collection model with a maximum-likelihood model for each document to smooth the model. Compared with traditional models such as the vector space model, these new models have a more sound statistical foundation and can leverage. In this paper, we explore and discuss the theoretical issues of this framework, including a novel look at the parameter space. Semantics-Based Language Models for Information Retrieval and Text Mining, a thesis submitted to the faculty of Drexel University by Xiaohua Zhou in partial fulfillment of the requirements for the degree of Doctor of Philosophy, November 2008.
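To make the KL-divergence retrieval model mentioned above more concrete, here is a minimal sketch that ranks documents by the negative KL divergence between a query language model and a document language model. The function name, the dictionary representation, and the toy probabilities are assumptions for illustration; the document models are assumed to be already smoothed so that every query term has nonzero probability.

```python
import math


def neg_kl_divergence(query_model, doc_model):
    """Return -KL(query || doc); a larger value means the two models are closer."""
    score = 0.0
    for w, pq in query_model.items():
        if pq == 0.0:
            continue  # terms with zero query probability contribute nothing
        pd = doc_model.get(w, 0.0)
        if pd == 0.0:
            return float("-inf")  # unsmoothed document model: divergence is infinite
        score -= pq * math.log(pq / pd)
    return score


# Hypothetical query and document models over a tiny vocabulary.
query_model = {"language": 0.5, "retrieval": 0.5}
doc_a = {"language": 0.4, "retrieval": 0.4, "models": 0.2}
doc_b = {"language": 0.1, "retrieval": 0.1, "models": 0.8}

score_a = neg_kl_divergence(query_model, doc_a)
score_b = neg_kl_divergence(query_model, doc_b)
```

In this toy setting doc_a puts more mass on the query terms than doc_b, so it receives the higher score.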
An information-based cross-language information retrieval. A Study of Smoothing Methods for Language Models Applied to Information Retrieval, by ChengXiang Zhai and John Lafferty, Carnegie Mellon University. We integrate the linkage of a query as a hidden variable, which expresses the term dependencies within the query as an acyclic, planar, undirected graph. In a retrieval model, which is an abstraction of the IR process, there are two fundamental aspects. Most of the research work performed in the information retrieval domain is based on the construction of retrieval models. Model-based feedback in the language modeling approach. References: Ponte and Croft, 1998, A Language Modeling Approach to Information Retrieval; Zhai and Lafferty, 2001, A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval.
Collection statistics are integral parts of the language model. However, reported evaluations of the language modeling approach for ad hoc search tasks use different query sets and collections. The approach extends the basic unigram language modeling approach by relaxing the independence assumption. Models of information retrieval systems are characterized by three main 1. A Proximity Language Model for Information Retrieval, by Jinglei Zhao, iZENEsoft, Inc. Document language models, query models, and risk minimization for information retrieval.
Statistical language models for information retrieval. This empirical success and the overall potential of the approach have also triggered the Lemur project. They will choose query terms that distinguish these documents from others in the collection. Information retrieval is the name of the process or method whereby a prospective user of information is able to convert his need for information into an actual list of citations to documents in storage containing information useful to him. Our approach to modeling is nonparametric and integrates document indexing and document retrieval into. Language models for information retrieval and web search. They called this the language modeling approach due to the use of language models in scoring. Second, we want to give the reader a quick overview of the major textual retrieval methods, because the InfoCrystal can help to visualize the. As a new family of probabilistic retrieval models, language models for IR share the.
Using language models for information retrieval. Challenges in information retrieval and language modeling. The language modeling approach to IR directly models that idea.
A Language Modeling Approach to Information Retrieval, Jay M. This paper presents a new dependence language modeling approach to information retrieval. Those areas are retrieval models, cross-lingual retrieval, web search, user modeling, filtering, topic detection and tracking, classification, summarization, question answering, metasearch, distributed retrieval, multimedia retrieval, information extraction, as well as testbed requirements for future work. Over the years, many models have been proposed to create systems which are accurate. Language modeling approaches to information retrieval. Relevance-based language models are very much related to naive Bayes classi. First, we want to set the stage for the problems in information retrieval that we try to address in this thesis. With no formal definition, but an approximate model of relevance, most retrieval. Why language models and inverse document frequency for. Language models were first successfully applied to information retrieval by Ponte and Croft (1998).
The first statistical language modeler was Claude Shannon. An information retrieval models taxonomy based on an. Dependence language model for information retrieval. The use of categorization information in language models. In modern-day terminology, an information retrieval system is a software program that stores and manages. Natural language processing and information retrieval. The emphasis is on the retrieval of information as opposed to the retrieval of data. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in. It states that terms are statistically independent from each other. Linear feature-based models for information retrieval.
Language modeling is a formal probabilistic retrieval framework with roots in speech recognition and natural language processing. Although each model is presented differently, they all share a common underlying framework. The independence assumption is one of the assumptions widely adopted in probabilistic retrieval theory. For advanced models, however, the book only provides a high-level discussion, thus readers will still. Two such models, referred to as log-logistic model in short. In exploring the application of his newly founded theory of information to human language, Shannon considered language as a statistical source, and measured how well simple n-gram models predicted or, equivalently, compressed natural text. Mutual information gain, entropy, weighting measures, statistical language models, tf. Retrieval models can describe the computational process e.
In information retrieval contexts, unigram language models are often smoothed to avoid instances where P(term) = 0. The paper first introduces the basic information retrieval process, then lists three types of information retrieval models according to two dimensions and their relationships, and lastly. In the past ten years, a new generation of retrieval models, often referred to as statistical language models, has been successfully applied to solve many different information retrieval problems. Relating the new language models of information retrieval to the traditional retrieval models.
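One common way to avoid zero term probabilities is to linearly interpolate each document's maximum-likelihood model with a maximum-likelihood model of the whole collection. The sketch below ranks two toy documents by smoothed query log-likelihood under that scheme. The function names, the toy documents, and the interpolation weight lam=0.5 are assumptions chosen for illustration, not values taken from the cited studies.

```python
import math
from collections import Counter


def mle(tokens):
    """Maximum-likelihood unigram estimate from a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smoothed_log_likelihood(query, doc_tokens, collection_model, lam=0.5):
    """log P(query | doc) with P(w) = (1 - lam) * P_doc(w) + lam * P_coll(w)."""
    doc_model = mle(doc_tokens)
    score = 0.0
    for w in query:
        p = (1 - lam) * doc_model.get(w, 0.0) + lam * collection_model.get(w, 0.0)
        if p == 0.0:
            return float("-inf")  # term unseen in both document and collection
        score += math.log(p)
    return score


# Hypothetical two-document collection.
docs = {
    "d1": "language models for information retrieval".split(),
    "d2": "vector space models for document ranking".split(),
}
collection_model = mle([w for tokens in docs.values() for w in tokens])
query = "language retrieval".split()

ranking = sorted(
    docs,
    key=lambda d: smoothed_log_likelihood(query, docs[d], collection_model),
    reverse=True,
)
```

With smoothing, d2 still receives a finite score even though it contains neither query term; setting lam=0.0 removes the collection model and the score for d2 drops to negative infinity.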