Authenticating the Query Results of Text Search Engines

H. Pang & K. Mouratidis
VLDB 2008

Motication

  • If a search engine is compromised:
    • Incomplete results: omitting some legitimate documents;
    • Altered ranking: deviating form the correct similarity ranking;
    • Spurious results: includes fake documents

Contribution:

A framework for inverted index authentication that can generate an integrity proof for any search result.

Simularity score computation

image.png
  • Here w(d,t) and w(Q,t) are normalized frequency of term t in document d and query Q, respectively.
  • Execution time proportional to the document size n

Inverted index

Frequency-ordered inverted index.


image.png

PSCAN algorithm

image.png

Modified PSCAN algorithm (TRA)

  • Fagin's algorithm (extended)
    • Alogirhtm repeatedly reads off the next entry from the inverted list entry Li, with the largest term score.
    • The similarity score of document di is immediately computed (by Random Acess) fro mthe document MHT
    • Entry is entered into the result list R ordered in non-incresing order
    • Threshold thres is computed which is an upper bound on the similarity score for any non-encountered document down the inverted list.
    • So when thres <= R.sr then the top r matching documents have been found.

TRA Algorithm

  • To prove that R satisfies the correctness criterion
    • For each result document d, the VO includes the query term frequencies in d so that the user can compute the similarity score.
    • For each non-result document d' that occurs upo to the cut-off threshold the VO includes the query term frequencies in d' so that the user can verify that the score is lower than those of the result documents.
    • The inverted list entries that correspond to the cut-off threshold.

Term MHT

image.png

Document MHT

image.png

Drawbacks

  • Searh engine has to retrieve entire inverted lists in order to regenerate the complimentary digest of the term-MHTs.
  • Leaves of the term-MHTs and document-MHTs are smaller than the upper level digest.

Authentication with Chain MHTs

image.png
  • Digest of each block is included in the digest computation of the block immediately ahead of it
  • Can be used to verify any j leading blocks of the inverted list by supplying the digest of the j+1 block

你可能感兴趣的:(Authenticating the Query Results of Text Search Engines)