Content-Based Parallel Document Search

Overview

Search engines have become the primary tools for accessing information on the Web. However, they require users to express their search intent as keyword queries, which can leave a gap between the intent and the query. Content-based information retrieval (CBIR) reduces the burden of formulating queries: users search with an example document that matches their search intent, and retrieval is based on an understanding of the documents’ content and components. Traditional CBIR technologies are based on similarity relationships, so a search returns documents whose content is similar to that of the given document. However, such content-similar documents do not always satisfy the user’s search intent. We propose a CBIR method that searches for documents that record a similar situational context but pertain to different events, referred to as “parallel documents”; for example, a news article describing the casualties of the Oregon school shooting and an article describing the casualties of the Virginia Tech shooting. To accomplish this, we analyze the entities and actions in the documents to predict the coordinate relationship between them. We conduct experiments using part of the New York Times Annotated Corpus to verify the effectiveness of the proposed method and to demonstrate the importance of the coordinate relationship when searching for parallel documents.
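As a rough illustration of the idea of comparing entities and actions, the following Python sketch ranks candidate documents so that those sharing actions with the query but naming different entities come first. It is a toy under stated assumptions, not the proposed method: the entity and action sets are hand-made inputs rather than the output of any extraction step, the names `jaccard` and `parallel_score` are hypothetical, and plain Jaccard overlap stands in for the prediction of the coordinate relationship, which is not specified here.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def parallel_score(query, candidate):
    """Toy score (an assumption, not the authors' model):
    high when actions overlap but the named entities differ."""
    action_sim = jaccard(query["actions"], candidate["actions"])
    entity_sim = jaccard(query["entities"], candidate["entities"])
    return action_sim * (1.0 - entity_sim)


# Hand-made toy inputs; a real system would extract these from article text.
query = {"entities": {"Oregon"}, "actions": {"shoot", "kill", "injure"}}
candidates = {
    "same_event": {"entities": {"Oregon"},
                   "actions": {"shoot", "kill", "injure"}},
    "parallel": {"entities": {"Virginia Tech"},
                 "actions": {"shoot", "kill", "wound"}},
    "unrelated": {"entities": {"Virginia Tech"},
                  "actions": {"enroll", "graduate"}},
}

# Rank candidates from most to least "parallel" under the toy score.
ranked = sorted(candidates.items(),
                key=lambda item: parallel_score(query, item[1]),
                reverse=True)

for name, doc in ranked:
    print(name, parallel_score(query, doc))
```

With these toy inputs, the candidate about a different event with similar actions ranks first, while the report of the same event and the unrelated article both score zero. A similarity-only ranking would instead place the same-event report at the top.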

Example Industrial Deployments and Application Areas

We believe that content-similar documents do not always satisfy the user’s search intent. Suppose we are surveying school shootings and encounter a news article describing the casualties of the Oregon school shooting. In this case, news articles from other news agencies that report the same event may be redundant because they may not provide new or additional information. On the other hand, news articles describing the casualties of other school shootings, such as an article about the Virginia Tech shooting, would provide more useful information about the topic. Since such documents record different events with similar situational contexts, we refer to them as “parallel documents.” Our objective is to search for parallel documents for a given document in order to provide more information about the topic (in our example, “school shooting”) from a certain perspective. Currently, our trial focuses on news articles. However, we also plan to extend the study to other genres, such as academic papers, advertisements, and restaurant introductions. All these trials aim at a better understanding of a topic from a user-indicated perspective.

Researchers

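Name        Department                        Laboratory         Position/Year
趙 夢       Department of Social Informatics  Tanaka Laboratory  Researcher
大島 裕明   Department of Social Informatics  Tanaka Laboratory  Program-Specific Associate Professor
田中 克己   Department of Social Informatics  Tanaka Laboratory  Professor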
