IR-chapter5: Index compression

  • the advantages
    less disk space
    better use of the cache, and thus shorter response time
    faster transfer from disk to memory
  • In this chapter, we define a posting as a docID in a postings list.

statistical properties of terms in information retrieval

[Figure: The effects of preprocessing for Reuters-RCV1]
  • lossy compression is acceptable when the "lost" information is unlikely to be used by the search system
  • The compression techniques discussed in the remainder of this chapter are lossless, that is, all information is preserved

estimating the number of terms:

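  • OED: more than 600,000 words, but it omits most names of people, locations, products, etc.
  • Heaps' law
    M = k * power(T,b), where M is the number of distinct terms (the vocabulary size) and T is the number of tokens in the collection
    for Reuters-RCV1 with T>100,000, b=0.49 and k=44 give an excellent fit.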
    It suggests...
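
A minimal sketch of the Heaps' law estimate above, using the k=44 and b=0.49 from these notes. The function name `heaps_vocabulary_size` and the sample token counts are only illustrative, and the fitted parameters should not be expected to carry over to other collections.

```python
def heaps_vocabulary_size(num_tokens: int, k: float = 44.0, b: float = 0.49) -> float:
    """Estimate the number of distinct terms M = k * T^b (Heaps' law)."""
    return k * num_tokens ** b


# Illustrative token counts (assumed values, not measurements).
for tokens in (100_000, 1_000_000, 10_000_000):
    estimate = heaps_vocabulary_size(tokens)
    print(f"T = {tokens:>10,}  ->  M is roughly {estimate:,.0f}")
```

The estimate keeps growing with T, only more slowly than T itself, since b is below 1.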

modeling the distribution of terms

  • Zipf's law
    the collection frequency of the ith most common term: cf_i = c * power(i,k), with k = -1
    log cf_i = log c + k * log i
    it suggests...
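
A small sketch that compares observed collection frequencies with the Zipf prediction cf_i = c * power(i,k) for k = -1. The helper names and the toy token list are assumptions for illustration; c is simply taken to be the frequency of the most common term.

```python
from collections import Counter


def zipf_prediction(c: float, rank: int, k: float = -1.0) -> float:
    """Predicted collection frequency of the term at the given rank."""
    return c * rank ** k


def rank_frequency(tokens):
    """Return (rank, observed collection frequency) pairs, most common first."""
    counts = Counter(tokens)
    return [(rank, cf) for rank, (_, cf) in enumerate(counts.most_common(), start=1)]


# Toy data (assumed), far too small for a real fit.
tokens = "a a a a b b c".split()
observed = rank_frequency(tokens)
c = observed[0][1]  # frequency of the most common term
for rank, cf in observed:
    print(rank, cf, zipf_prediction(c, rank))
```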

dictionary compression

  • why is it necessary?
    A smaller dictionary fits (largely) in main memory, which reduces the number of disk seeks and shortens the response time.

dictionary as a string

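  • the simplest approach: store the dictionary as an array of fixed-width entries.
    Too wasteful: most terms are much shorter than the fixed width.
    With 20 bytes reserved per term, it cannot store terms longer than 20 characters.
  • a better approach: store the dictionary terms as one long string, with a pointer per term marking where the term starts.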
[Figure: dictionary as a string]
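
The dictionary-as-a-string idea can be sketched as follows. This is a simplified illustration, not the layout from the figure: pointers are plain Python integers rather than fixed-size byte offsets, and the function names are made up for this example.

```python
def build_string_dictionary(terms):
    """Concatenate the sorted terms; offsets[i] is where term i starts."""
    offsets, pieces, position = [], [], 0
    for term in sorted(terms):
        offsets.append(position)
        pieces.append(term)
        position += len(term)
    return "".join(pieces), offsets


def term_at(string, offsets, i):
    """A term ends where the next one starts (or at the end of the string)."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(string)
    return string[offsets[i]:end]


def lookup(string, offsets, term):
    """Binary search over the term pointers; return the term index or -1."""
    lo, hi = 0, len(offsets) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        candidate = term_at(string, offsets, mid)
        if candidate == term:
            return mid
        if candidate < term:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


string, offsets = build_string_dictionary(["syzygy", "systile", "syzygial", "syzygetic"])
print(string, offsets, lookup(string, offsets, "syzygial"))
```

No term length is stored: the next pointer doubles as the end marker of the current term.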

blocked storage

  • group the terms in the string into blocks of size k, keep a term pointer only for the first term of each block, and store the length of each term as an extra byte in front of it in the string.
[Figure: blocked storage]
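  • trade-off between compression and the speed of term lookup: a larger k saves more space, but lookup gets slower because the terms inside a block must be scanned linearly.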
[Figure: term lookup time]
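
A rough sketch of blocked storage and its lookup cost, assuming UTF-8 terms of at most 255 bytes and a default block size of k=4. All function names are hypothetical. Lookup does a binary search over the block pointers and then a linear scan inside one block, which is where the speed trade-off comes from.

```python
def build_blocked_dictionary(terms, k=4):
    """Store each term as <length byte><term bytes>; keep one pointer per block of k terms."""
    data = bytearray()
    block_pointers = []
    for i, term in enumerate(sorted(terms)):
        if i % k == 0:
            block_pointers.append(len(data))
        encoded = term.encode("utf-8")
        if len(encoded) > 255:
            raise ValueError("term does not fit a one-byte length field")
        data.append(len(encoded))
        data += encoded
    return bytes(data), block_pointers


def first_term(data, pointer):
    """Read the term that starts a block."""
    length = data[pointer]
    return data[pointer + 1:pointer + 1 + length].decode("utf-8")


def read_block(data, block_pointers, block_index):
    """Decode every term of one block by following the length bytes."""
    start = block_pointers[block_index]
    is_last = block_index + 1 == len(block_pointers)
    end = len(data) if is_last else block_pointers[block_index + 1]
    terms, pos = [], start
    while pos < end:
        length = data[pos]
        terms.append(data[pos + 1:pos + 1 + length].decode("utf-8"))
        pos += 1 + length
    return terms


def contains(data, block_pointers, term):
    """Binary search for the last block whose first term <= term, then scan that block."""
    lo, hi, block = 0, len(block_pointers) - 1, -1
    while lo <= hi:
        mid = (lo + hi) // 2
        if first_term(data, block_pointers[mid]) <= term:
            block, lo = mid, mid + 1
        else:
            hi = mid - 1
    return block >= 0 and term in read_block(data, block_pointers, block)


words = ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite", "szczecin", "szomo"]
data, pointers = build_blocked_dictionary(words, k=4)
print(pointers, contains(data, pointers, "szczecin"))
```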
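  • front coding: consecutive terms in the sorted dictionary usually share a prefix, so the shared prefix is stored only once.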
[Figure: front coding]
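
A simplified front-coding sketch. Note the assumption: each term is stored as the length of the prefix it shares with the previous term plus the remaining suffix, which is a common variant and not necessarily the exact layout in the figure. Function names are illustrative.

```python
def shared_prefix_length(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def front_encode_block(terms):
    """terms must be sorted. Returns (first_term, [(shared_length, suffix), ...])."""
    first, entries, previous = terms[0], [], terms[0]
    for term in terms[1:]:
        shared = shared_prefix_length(previous, term)
        entries.append((shared, term[shared:]))
        previous = term
    return first, entries


def front_decode_block(first, entries):
    """Rebuild the terms by reusing the prefix of the previously decoded term."""
    terms = [first]
    for shared, suffix in entries:
        terms.append(terms[-1][:shared] + suffix)
    return terms


block = ["automata", "automate", "automatic", "automation"]
first, entries = front_encode_block(block)
print(first, entries)
```

Decoding is sequential within a block, so front coding is usually combined with blocking to keep lookups bounded.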
  • (minimal perfect) hash function:
    unsuitable for a dynamic environment, since every new term creates a collision and requires a new hash function.
[Figure: dictionary compression with different data structures]

posting file compression

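  • variable-length encoding
    store the gaps between consecutive docIDs instead of the docIDs themselves; gaps are small for frequent terms but almost as large as docIDs for rare terms, so small gaps should get short codes
  • two methods for encoding the gaps
    bytewise and bitwise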


[Figure: variable encoding]
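
The docID-to-gap conversion can be sketched like this. The first entry is kept as the docID itself (a gap from 0), and the sample docIDs are assumed values for illustration.

```python
from itertools import accumulate


def to_gaps(doc_ids):
    """Turn a sorted postings list of docIDs into gaps between neighbours."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Recover the docIDs as running sums of the gaps."""
    return list(accumulate(gaps))


postings = [824, 829, 215406]  # assumed sample postings list
print(to_gaps(postings))
```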

Variable byte encoding

[Figure: VB encoding]
[Figure: VB encoding pseudocode]
  • larger encoding units (16- or 32-bit words instead of bytes): less effective compression, less bit manipulation.
  • smaller encoding units (4-bit nibbles): more effective compression, more bit manipulation.
  • trade-off between compression ratio and decompression time.
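
A runnable sketch of variable byte encoding with 8-bit units. The convention assumed here is the usual textbook one, since the original figure is not available: the low 7 bits of each byte carry payload, and the high bit is set to 1 only on the last byte of each gap.

```python
def vb_encode_number(n):
    """Encode one non-negative integer as a list of byte values."""
    if n < 0:
        raise ValueError("VB encoding expects non-negative integers")
    chunk = [n % 128]
    n //= 128
    while n > 0:
        chunk.insert(0, n % 128)
        n //= 128
    chunk[-1] += 128  # set the continuation bit on the last byte
    return chunk


def vb_encode(numbers):
    """Encode a sequence of gaps into one byte string."""
    out = bytearray()
    for n in numbers:
        out.extend(vb_encode_number(n))
    return bytes(out)


def vb_decode(data):
    """Decode a VB byte string back into the list of gaps."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte  # more bytes of this gap follow
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers


encoded = vb_encode([824, 5, 214577])
print([format(b, "08b") for b in encoded])
print(vb_decode(encoded))
```

A small gap such as 5 costs a single byte, while 214577 needs three.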

γ Codes

[Figure: γ codes]
  • universal
[Figure: E(L) is the expected length of a code L, H(P) is the entropy]
  • prefix free, parameter free
  • how much compression does it achieve?
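
A sketch of γ encoding on bit strings, assuming the standard definition: the offset is the binary form of the gap without its leading 1, and the length part is the offset length in unary (that many 1s followed by a 0). Strings of "0" and "1" are used for readability only; a real index would pack bits. The decoder assumes well-formed input.

```python
def gamma_encode(n):
    """Return the gamma code of a positive integer as a string of bits."""
    if n < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    offset = bin(n)[3:]  # binary digits after the leading 1, e.g. 13 -> '101'
    length = "1" * len(offset) + "0"  # unary code of the offset length
    return length + offset


def gamma_decode(bits):
    """Decode a concatenation of gamma codes; no separators are needed (prefix free)."""
    numbers, pos = [], 0
    while pos < len(bits):
        length = 0
        while bits[pos] == "1":  # read the unary length
            length += 1
            pos += 1
        pos += 1  # skip the terminating 0
        numbers.append(int("1" + bits[pos:pos + length], 2))
        pos += length
    return numbers


stream = "".join(gamma_encode(g) for g in [13, 1, 9])
print(stream, gamma_decode(stream))
```

Under this definition a gap G takes 2 * floor(log2 G) + 1 bits, so the code for 13 is 7 bits long.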
