CIKM2011, so we only consider the method, not the results.
We address the problem of finding multiple groups of words or phrases that explain the underlying query facets, which we refer to as query dimensions. We assume that the important aspects of a query are usually presented and repeated in the query’s top retrieved documents in the style of lists, and that query dimensions can be mined by aggregating these significant lists.
We propose aggregating frequent lists within the top search results to mine query dimensions, and implement a system called QDMiner.
QDMiner discovers query dimensions by aggregating frequent lists within the top results.
list
formats by websites
Query dimensions are mined by the following four steps:
For each document $d$, we extract a set of lists from the HTML content of $d$ based on three different types of patterns:
Free text patterns:
pattern: item{, item}*(and|or) {other} item
Example 1: We shop for gorgeous watches from Seiko, Bulova, Lucien Piccard, Citizen, Cartier or Invicta
We further use the pattern {^item (: |-) .+$}+ to extract lists from some semi-structured paragraphs
Example 2: … are highly important for following reasons: Consistency - every fact table is filtered consistently res… Integration - queries are able to drill different processes … Reduced development time to market - the common dimensions are available without recreating the wheel over again.
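The two free text patterns can be sketched with regular expressions. This is a minimal sketch, not the QDMiner implementation: the notes do not define what counts as an "item", so the code assumes an item is one to three capitalized words, and it assumes the semi-structured pattern is applied to consecutive lines. The function names are hypothetical.

```python
import re

# Assumption: an "item" is one to three capitalized words (not defined in the notes).
ITEM = r"[A-Z][\w&'-]*(?: [A-Z][\w&'-]*){0,2}"

# item{, item}* (and|or) {other} item
FREE_TEXT_RE = re.compile(rf"({ITEM}(?:, {ITEM})*),? (?:and|or) (?:other )?({ITEM})")

# One line of the semi-structured pattern: ^item (:|-) .+$
SEMI_LINE_RE = re.compile(r"^([^:\-\n]+?) ?(?::|-) .+$")


def extract_free_text_lists(sentence):
    """Return every comma/and/or enumeration found in a sentence."""
    lists = []
    for match in FREE_TEXT_RE.finditer(sentence):
        items = match.group(1).split(", ") + [match.group(2)]
        lists.append(items)
    return lists


def extract_semi_structured_lists(paragraph):
    """Return item lists from runs of two or more consecutive 'item - text' lines."""
    lists, run = [], []
    for line in paragraph.splitlines() + [""]:  # empty sentinel flushes the last run
        match = SEMI_LINE_RE.match(line.strip())
        if match:
            run.append(match.group(1).strip())
            continue
        if len(run) >= 2:
            lists.append(run)
        run = []
    return lists
```

Applied to Example 1, `extract_free_text_lists` yields the six watch brands as one list. The capitalization assumption is deliberately strict; lowercase items would need a different item definition.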
HTML tag patterns:
Lists are created from the text of list-style HTML tags such as OPTION and LI.
Repeat region patterns:
First detect repeat regions in webpages based on vision-based DOM trees
Then extract all leaf HTML nodes within each block, and group them by their tag names (name, rating, etc.) and display styles.
Last, for each group, extract all text from its nodes as a list
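Of the three pattern types, the HTML tag patterns are the simplest to sketch. The example below uses Python's standard `html.parser` and collects the text of OPTION and LI elements. It is a sketch under assumptions: treating SELECT, UL and OL as the enclosing containers is my assumption (the notes only name OPTION and LI), and `ListExtractor` and `extract_tag_lists` are hypothetical names.

```python
from html.parser import HTMLParser

CONTAINERS = {"select", "ul", "ol"}  # assumption: containers that delimit one list
ITEM_TAGS = {"option", "li"}         # item tags named in the notes


class ListExtractor(HTMLParser):
    """Collect one list of item texts per container element."""

    def __init__(self):
        super().__init__()
        self.lists = []        # finished lists, in closing order
        self._stack = []       # one open list per (possibly nested) container
        self._in_item = False  # True while inside an OPTION or LI

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self._stack.append([])
            self._in_item = False  # text before the first nested item is ignored
        elif tag in ITEM_TAGS and self._stack:
            self._stack[-1].append("")
            self._in_item = True

    def handle_endtag(self, tag):
        if tag in ITEM_TAGS:
            self._in_item = False
        elif tag in CONTAINERS and self._stack:
            items = [text.strip() for text in self._stack.pop() if text.strip()]
            if items:
                self.lists.append(items)

    def handle_data(self, data):
        if self._in_item and self._stack and self._stack[-1]:
            self._stack[-1][-1] += data


def extract_tag_lists(html):
    """Return all lists found in list-style tags of an HTML string."""
    parser = ListExtractor()
    parser.feed(html)
    parser.close()
    return parser.lists
```

Nested lists are handled only roughly here: text of an outer item that follows a nested container is dropped, which is acceptable for a sketch but not for production use.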
Note: we do post-processing for each extracted list
This type of list is useless for finding dimensions, so we should penalize it.
We propose to aggregate all lists of a query, and evaluate the importance of each unique list l by the following components:
Document matching weight: $S_{\mathrm{DOC}}=\sum_{d \in R}\left(s_d^m * s_d^r\right)$
Average inverse document frequency (IDF) of items, $S_{\mathrm{IDF}}$:
A list made up of items that are common in the reference corpus (ClueWeb09) is not informative to the query.
The importance of a list $l$: $S_l = S_{\mathrm{DOC}} * S_{\mathrm{IDF}}$
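The list weighting can be sketched as follows. The notes do not define the two per-document factors or the IDF formula, so the code makes labeled assumptions: $s_d^m$ is taken as the fraction of the list's items that occur in document $d$, $s_d^r$ as $1/\sqrt{\mathrm{rank}_d}$, and IDF as the common smoothed form $\log\frac{N - N_e + 0.5}{N_e + 0.5}$. All function and parameter names are hypothetical.

```python
import math


def doc_matching_weight(items, docs):
    """S_DOC: sum over ranked documents of s_d^m * s_d^r.

    docs is a list of (rank, text) pairs with rank starting at 1.
    """
    if not items:
        return 0.0
    total = 0.0
    for rank, text in docs:
        lowered = text.lower()
        # Assumption: s_d^m is the share of list items contained in the document.
        s_m = sum(item.lower() in lowered for item in items) / len(items)
        # Assumption: s_d^r decays with the square root of the document rank.
        s_r = 1.0 / math.sqrt(rank)
        total += s_m * s_r
    return total


def average_idf(items, doc_freq, n_docs):
    """S_IDF: mean IDF of the items, given corpus document frequencies."""
    if not items:
        return 0.0
    idfs = []
    for item in items:
        n_e = doc_freq.get(item, 0)
        # Assumption: smoothed IDF; the notes only say "average IDF of items".
        idfs.append(math.log((n_docs - n_e + 0.5) / (n_e + 0.5)))
    return sum(idfs) / len(idfs)


def list_importance(items, docs, doc_freq, n_docs):
    """S_l = S_DOC * S_IDF."""
    return doc_matching_weight(items, docs) * average_idf(items, doc_freq, n_docs)
```

With these assumptions, a list whose items appear in many highly ranked results but rarely in the reference corpus gets the highest score, which matches the intent described above.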
Two lists can be grouped together if they share enough items
Use a modified QT clustering algorithm to group similar lists. The original QT algorithm assumes that all data is equally important.
We modify the original QT algorithm to first group highly weighted lists. We refer to the resulting algorithm as WQT (Quality Threshold with Weighted data points).
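A rough sketch of the weighted clustering idea: seed each cluster with the most highly weighted remaining list, then greedily add the nearest list while the cluster stays within a diameter threshold. Several details are assumptions rather than taken from the notes: the distance is one minus the item overlap normalized by the shorter list, the diameter uses complete linkage, and `max_diameter` and `min_weight` are hypothetical parameters with arbitrary defaults. Lists are assumed to be non-empty.

```python
def list_distance(a, b):
    """1 - overlap ratio (assumption: normalized by the shorter list)."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / min(len(a), len(b))


def distance_to_cluster(cluster, candidate):
    """Complete linkage: largest distance from the candidate to any member."""
    return max(list_distance(candidate, items) for items, _ in cluster)


def wqt_cluster(weighted_lists, max_diameter=0.6, min_weight=0.0):
    """Group (items, weight) pairs, seeding clusters with the heaviest lists first."""
    remaining = sorted(weighted_lists, key=lambda pair: pair[1], reverse=True)
    clusters = []
    while remaining:
        cluster = [remaining.pop(0)]  # most important unclustered list is the seed
        while remaining:
            best = min(remaining, key=lambda pair: distance_to_cluster(cluster, pair[0]))
            if distance_to_cluster(cluster, best[0]) > max_diameter:
                break  # adding any further list would exceed the diameter
            cluster.append(best)
            remaining.remove(best)
        # Assumption: clusters with too little total weight are discarded.
        if sum(weight for _, weight in cluster) >= min_weight:
            clusters.append(cluster)
    return clusters
```

Because every member already lies within `max_diameter` of all others, checking only the candidate's largest distance to the current members is enough to bound the cluster diameter.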
Don’t use individual weighted lists as query dimensions
A good dimension should appear frequently in the top results. A dimension $c$ is more important if:
In a dimension, the importance of an item depends on how many lists contain the item and its ranks in the lists.
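One way to make the item ranking concrete is to let each list that contains an item contribute a score that shrinks with the item's position in that list. The $1/\sqrt{\text{rank}}$ contribution used below is an assumption chosen for illustration, not a formula given in these notes, and `rank_items` is a hypothetical name.

```python
import math
from collections import defaultdict


def rank_items(lists):
    """Score items by how many lists contain them and how early they appear."""
    scores = defaultdict(float)
    for items in lists:
        for position, item in enumerate(items, start=1):
            # Assumption: an item at position p in a list contributes 1/sqrt(p).
            scores[item] += 1.0 / math.sqrt(position)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
```

An item that appears first in two lists scores 2.0, while an item that appears once in third position scores about 0.58, so both frequency and rank affect the order.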
We only output qualified items by default in QDMiner.