Introduction to Information Retrieval

These are my notes from the book "Introduction to Information Retrieval".

  1. Intro to IR
  2. Term Vocabulary
  3. Tolerant Retrieval
  4. Index Construction
  5. Index Compression
  6. Scoring
  7. Compute Scoring
  8. Evaluation of Information Retrieval
  9. Relevance feedback and query expansion
  10. XML Retrieval
  11. Probabilistic Information Retrieval
  12. Language Models in IR
  13. Naive Bayes
  14. Vector Space Classification
  15. SVM
  16. Flat Clustering
  17. Hierarchical Clustering
  18. Latent Semantic Indexing
  19. Web
  20. Lucene
  21. Solr

# Term Vocabulary and Posting Lists

The major steps in inverted index construction:

  1. Collect the documents to be indexed.
  2. Tokenize the text.
  3. Do linguistic preprocessing of tokens.
  4. Index the documents that each term occurs in.

Tokenization is the process of chopping character streams into tokens, while linguistic preprocessing then deals with building equivalence classes of tokens which are the set of terms that are indexed.

2.1 Document delineation and character sequence decoding

  • The first step of processing is to convert the document's byte sequence into a linear sequence of characters. We need to determine the correct encoding; this can be regarded as a machine learning classification problem.

Indexing Granularity

More generally, for very long documents, the issue of indexing granularity arises. For a collection of books, it would usually be a bad idea to index an entire book as a document. A search for Chinese toys might bring up a book that mentions China in the first chapter and toys in the last chapter, but this does not make it relevant to the query.

Instead, we may well wish to index each chapter or paragraph as a mini-document. Matches are then more likely to be relevant, and since the documents are smaller it will be much easier for the user to find the relevant passages in the document.

But why stop there?

We could treat individual sentences as mini-documents. It becomes clear that there is a precision/recall tradeoff here. If the units get too small, we are likely to miss important passages because terms were distributed over several mini-documents, while if units are too large we tend to get spurious matches and the relevant information is hard for the user to find. The problems with large document units can be alleviated by use of explicit or implicit proximity search.

2.2 Determining the vocabulary of terms

Tokenization

A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.

For O’Neill, which of the following is the desired tokenization?

  • neill
  • oneill
  • o’neill
  • o’ neill
  • o neill ?

Whether queries are Boolean or free text, you always want to do the exact same tokenization of document and query words, generally by processing queries with the same tokenizer. This guarantees that a sequence of characters in a text will always match the same sequence typed in a query.

  • One possible solution is to omit from indexing tokens such as monetary amounts, numbers, and URLs.
    • Loss of information.
  • WORD SEGMENTATION: Perform word segmentation as prior linguistic processing. Methods of word segmentation vary from having a large vocabulary and taking the longest vocabulary match with some heuristics for unknown words to the use of machine learning sequence models, such as hidden Markov models or conditional random fields, trained over hand-segmented words.
  • Above methods make mistakes sometimes, and so you are never guaranteed a consistent unique tokenization. The other approach is to abandon word-based indexing and to do all indexing via just short subsequences of characters (character k-grams), regardless of whether particular sequences cross word boundaries or not.

Stop Words

Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words.

Using a stop list significantly reduces the number of postings that a system has to store. And a lot of the time not indexing stop words does little harm: keyword searches with terms like the and by don’t seem very useful. However, this is not true for phrase searches. The phrase query “President of the United States”, which contains two stop words, is more precise than President AND “United States”. The meaning of flights to London is likely to be lost if the word to is stopped out.

The general trend in IR systems over time has been from standard use of quite large stop lists (200–300 terms) to very small stop lists (7–12 terms) to no stop list whatsoever. Web search engines generally do not use stop lists.

Normalization

Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens.

These term relationships can be achieved in two ways.

  • The usual way is to index unnormalized tokens and to maintain a query expansion list of multiple vocabulary entries to consider for a certain query term. A query term is then effectively a disjunction of several postings lists.
  • The alternative is to perform the expansion during index construction. When the document contains automobile, we index it under car as well.

The first method adds a query expansion dictionary and requires more processing at query time, while the second method requires more space for storing postings.

Capitalization/case-folding: A common strategy is to do case-folding by reducing all letters to lower case.

TrueCasing: For English, an alternative to making every token lowercase is to make only some tokens lowercase. The simplest heuristic is to convert to lowercase words at the beginning of a sentence and all words occurring in a title that is all uppercase or in which most or all words are capitalized. These words are usually ordinary words that have been capitalized. Mid-sentence capitalized words are left as capitalized (which is usually correct). This mostly avoids case-folding in cases where distinctions should be kept apart. The same task can be done more accurately by a machine learning sequence model that uses more features to decide when to case-fold. This is known as truecasing.

Stemming and lemmatization

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

  • am, are, is ⇒ be
  • car, cars, car’s, cars’ ⇒ car

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma.

PORTER STEMMER

Porter’s algorithm consists of 5 phases of word reductions, applied sequentially. Within each phase there are various conventions to select rules, such as selecting the rule from each rule group that applies to the longest suffix. In the first phase, this convention is used with the following rule group (rule → example):

  • SSES → SS (caresses → caress)
  • IES → I (ponies → poni)
  • SS → SS (caress → caress)
  • S → (drop) (cats → cat)

Many of the later rules use a concept of the measure of a word, which loosely checks the number of syllables to see whether a word is long enough that it is reasonable to regard the matching portion of a rule as a suffix rather than as part of the stem of a word. For example, the rule: (m > 1) EMENT → would map replacement to replac, but not cement to c.
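A minimal sketch of just this first-phase rule group, assuming the longest-matching-suffix convention described above; the full Porter algorithm has five phases and measure-based conditions (complete implementations ship with libraries such as NLTK):

```python
def porter_phase1(word: str) -> str:
    """Apply the phase-1 plural rules, taking the rule whose suffix is the
    longest match (SSES before IES before SS before S)."""
    if word.endswith("sses"):
        return word[:-4] + "ss"   # caresses -> caress
    if word.endswith("ies"):
        return word[:-3] + "i"    # ponies -> poni
    if word.endswith("ss"):
        return word               # caress -> caress
    if word.endswith("s"):
        return word[:-1]          # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_phase1(w))
```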

2.3 Faster postings list intersection via skip pointers

We walk through the two postings lists simultaneously, in time linear in the total number of postings entries. If the list lengths are m and n, the intersection takes O(m + n) operations.

Can we do better than this? One way to do this is to use a skip list by augmenting postings lists with skip pointers (at indexing time).

Where do we place skips? There is a tradeoff. More skips mean shorter skip spans, so we are more likely to be able to skip; but they also mean more comparisons against skip pointers and more space for storing them. Fewer skips mean fewer pointer comparisons, but longer skip spans, so there are fewer opportunities to skip. A simple heuristic: for a postings list of length P, use √P evenly-spaced skip pointers.
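A sketch of intersection with skip pointers, assuming each postings list is a sorted Python list of docIDs and skips are placed every √P entries (the index arithmetic below simulates the pointers rather than storing them):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted docID lists; a skip pointer is taken when the
    landing value still does not overshoot the other list's current docID."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # follow a skip pointer out of p1 if it does not overshoot p2[j]
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 16, 19, 23, 28, 43],
                           [1, 2, 3, 5, 8, 41, 51, 60, 71]))   # -> [2, 8]
```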

2.4.1 Phrase Queries

We would like to be able to pose a query such as Stanford University by treating it as a phrase so that a sentence in a document like The inventor Stanford Ovshinsky never went to university. is not a match.

  • Biword indexes: One approach to handling phrases is to consider every pair of consecutive terms in a document as a phrase. In this model, we treat each of these biwords as a vocabulary term. Being able to process two-word phrase queries is immediate. Longer phrases can be processed by breaking them down.
  • Phrase Index: The concept of a biword index can be extended to longer sequences of words, and if the index includes variable length word sequences, it is generally referred to as a phrase index.

While there is always a chance of false positive matches, the chance of a false positive match on indexed phrases of length 3 or more becomes very small indeed. But on the other hand, storing longer phrases has the potential to greatly expand the vocabulary size. Maintaining exhaustive phrase indexes for phrases of length greater than two is a daunting prospect, and even use of an exhaustive biword dictionary greatly expands the size of the vocabulary.

2.4.2 Positional Index

For the reasons given, a biword index is not the standard solution. Rather, a positional index is most commonly employed. Here, for each term in the vocabulary, we store postings of the form docID: ⟨position1, position2, . . .⟩.

To process a phrase query, you still need to access the inverted index entries for each distinct term. As before, you would start with the least frequent term and then work to further restrict the list of possible candidates. In the merge operation, the same general technique is used as before, but rather than simply checking that both terms are in a document, you also need to check that their positions of appearance in the document are compatible with the phrase query being evaluated. This requires working out offsets between the words.

Indeed, moving to a positional index also changes the asymptotic complexity of a postings intersection operation, because the number of items to check is now bounded not by the number of documents but by the total number of tokens in the document collection T. That is, the complexity of a Boolean query is Θ(T) rather than Θ(N). However, most applications have little choice but to accept this, since most users now expect to have the functionality of phrase and proximity searches.
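A sketch of a two-word phrase match over a positional index; the nested-dictionary layout (term → docID → sorted position list) is an illustrative in-memory stand-in for the postings just described:

```python
def phrase_match(pos_index, term1, term2):
    """Return docIDs in which term2 appears exactly one position after
    term1, i.e. documents matching the phrase 'term1 term2'."""
    docs1 = pos_index.get(term1, {})
    docs2 = pos_index.get(term2, {})
    result = []
    for doc_id in docs1.keys() & docs2.keys():        # documents containing both terms
        positions2 = set(docs2[doc_id])
        if any(p + 1 in positions2 for p in docs1[doc_id]):
            result.append(doc_id)
    return sorted(result)

pos_index = {
    "stanford":   {1: [3, 17], 2: [5]},
    "university": {1: [18, 40], 2: [1]},
}
print(phrase_match(pos_index, "stanford", "university"))   # -> [1]
```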

Combination Scheme

The strategies of biword indexes and positional indexes can be fruitfully combined. If users commonly query on particular phrases, such as Michael Jackson, it is quite inefficient to keep merging positional postings lists. A combination strategy uses a phrase index, or just a biword index, for certain queries and uses a positional index for other phrase queries. Good queries to include in the phrase index are ones known to be common based on recent querying behavior. But this is not the only criterion: the most expensive phrase queries to evaluate are ones where the individual words are common but the desired phrase is comparatively rare. Adding Britney Spears as a phrase index entry may only give a speedup factor to that query of about 3, since most documents that mention either word are valid results, whereas adding The Who as a phrase index entry may speed up that query by a factor of 1000. Hence, having the latter is more desirable, even if it is a relatively less common query.

# Dictionaries and tolerant retrieval

Agenda

  • We develop techniques that are robust to typographical errors in the query, as well as alternative spellings.
  • We develop data structures that help the search for terms in the vocabulary in an inverted index.
  • We study the idea of a wildcard query.
  • We detail a number of techniques for correcting spelling errors in queries, one term at a time as well as for an entire string of query terms.
  • We study a method for seeking vocabulary terms that are phonetically close to the query term.

3.1 Search structures for dictionaries

Given an inverted index and a query, our first task is to determine whether each query term exists in the vocabulary and if so, identify the pointer to the corresponding postings. This vocabulary lookup operation uses a classical data structure called the dictionary and has two broad classes of solutions: hashing, and search trees. The choice of solution (hashing, or search trees) is governed by a number of questions:

  1. How many keys are we likely to have?
  2. Is the number likely to remain static, or change a lot – and in the case of changes, are we likely to only have new keys inserted, or to also have some keys in the dictionary be deleted?
  3. What are the relative frequencies with which various keys will be accessed?

Issues with hashing:

  • There is no easy way to find minor variants of a query term (such as the accented and non-accented versions of a word like resume), since these could be hashed to very different integers.
  • In particular, we cannot seek (for instance) all terms beginning with the prefix automat.
  • Finally, in a setting (such as the Web) where the size of the vocabulary keeps growing, a hash function designed for current needs may not suffice in a few years’ time.
  • Search trees overcome many of these issues.

B-Trees

  • A B-tree may be viewed as “collapsing” multiple levels of the binary tree into one; this is especially advantageous when some of the dictionary is disk-resident, in which case this collapsing serves the function of pre-fetching imminent binary tests.
  • Search trees demand that the characters used in the document collection have a prescribed ordering.

3.2 WildCard queries

Wildcard queries are used in situations such as when the user is uncertain of the spelling of a query term (e.g., Sydney vs. Sidney).

First, consider leading wildcard queries, or queries of the form *mon. Consider a reverse B-tree on the dictionary – one in which each root-to-leaf path of the B-tree corresponds to a term in the dictionary written backwards.

In fact, using a regular B-tree together with a reverse B-tree, we can handle an even more general case: wildcard queries in which there is a single * symbol, such as se*mon. To do this, we use the regular B-tree to enumerate the set W of dictionary terms beginning with the prefix se, then the reverse B-tree to enumerate the set R of terms ending with the suffix mon. Next, we take the intersection W ∩ R of these two sets, to arrive at the set of terms that begin with the prefix se and end with the suffix mon. Finally, we use the standard inverted index to retrieve all documents containing any terms in this intersection.

Techniques to handle wildcard queries

  • Permuterm indexes
  • k-gram indexes

k-gram indexes

Example of using k-gram to identify vocabulary terms that are close matches to the query.

Vocabulary Terms: We have a small vocabulary containing the following terms: "board," "border," "boardroom," "bored," "barn," "bat."

Query: Our query is "bord."

K-gram Index: We're using a 2-gram (bigram) index, which means we divide each term in the vocabulary and the query into 2-grams.

K-grams for Vocabulary Terms:

  • "board" is divided into ["bo", "oa", "ar", "rd"].
  • "border" is divided into ["bo", "or", "rd", "de", "er"].
  • "boardroom" is divided into ["bo", "oa", "ar", "rd", "ro", "oo", "om"].
  • "bored" is divided into ["bo", "or", "re", "ed"].
  • "barn" is divided into ["ba", "ar", "rn"].
  • "bat" is divided into ["ba", "at"].

K-grams for the Query: "bord" is divided into ["bo", "or", "rd"].

Jaccard Coefficient Computation: We calculate the Jaccard coefficient between each vocabulary term and the query. The Jaccard coefficient is the size of the intersection of the two k-gram sets divided by the size of their union. If the coefficient meets or exceeds a preset threshold (here, 0.4), we consider the term a potential match to the query.

Here are the Jaccard coefficients for each term with respect to the query "bord" (rounded to two decimal places):

  • J("board", "bord") = |{bo, rd}| / |{bo, oa, ar, rd, or}| = 2/5 = 0.40
  • J("border", "bord") = 3/5 = 0.60
  • J("boardroom", "bord") = 2/9 ≈ 0.22
  • J("bored", "bord") = 2/5 = 0.40
  • J("barn", "bord") = 0/6 = 0.00
  • J("bat", "bord") = 0/5 = 0.00

Results: With the threshold of 0.4, "board", "border", and "bored" are kept as candidate matches for "bord"; "boardroom", "barn", and "bat" fall below the threshold and are discarded.

In this example, the k-gram index and Jaccard coefficient narrowed the vocabulary down to a small set of candidate corrections for the query "bord" based on substantial 2-gram overlap; the surviving candidates would then typically be ranked by edit distance (Section 3.3). This technique is particularly useful for applications like spell checking and autocorrection in search engines and text processing systems.
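A small sketch that reproduces the computation above, using plain character bigrams without boundary markers:

```python
def kgrams(term: str, k: int = 2) -> set:
    """Set of character k-grams of a term (no boundary markers)."""
    return {term[i:i + k] for i in range(len(term) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """|intersection| / |union| of two k-gram sets."""
    return len(a & b) / len(a | b)

query = kgrams("bord")
for term in ["board", "border", "boardroom", "bored", "barn", "bat"]:
    score = jaccard(kgrams(term), query)
    flag = "candidate" if score >= 0.4 else ""
    print(f'J("{term}", "bord") = {score:.2f} {flag}')
```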

3.3 Spelling correction

Two steps to solving this problem: the first based on edit distance and the second based on k-gram overlap.

Implementing spelling correction

  • Of various alternative correct spellings for a mis-spelled query, choose the “nearest” one.
  • When two correctly spelled queries are tied (or nearly tied), select the one that is more common.

Forms of spelling correction

  • Isolated-term correction
  • Context-sensitive correction

Edit distance

Given two character strings s1 and s2, the edit distance between them is the minimum number of edit operations required to transform s1 into s2.

The notion of edit distance can be generalized to allow different weights for different kinds of edit operations; for instance, a higher weight may be placed on replacing the character s by the character p than on replacing it by the character a (the latter being closer to s on the keyboard). Setting weights in this way, depending on the likelihood of letters substituting for each other, is very effective in practice (phonetic similarity is a separate issue, treated in Section 3.4).
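A sketch of the standard dynamic program for (unweighted) Levenshtein edit distance; a weighted variant would replace the 1s with per-operation or per-character costs:

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn s1 into s2 (all operations cost 1)."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                 # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j                 # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

print(edit_distance("form", "from"))   # 2
print(edit_distance("bord", "board"))  # 1
```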

Context sensitive spelling correction

Isolated-term correction would fail to correct typographical errors such as flew form Heathrow, where all three query terms are correctly spelled.

The simplest way to do this is to enumerate corrections of each of the three query terms even though each query term is correctly spelled, then try substitutions of each correction in the phrase. For the example flew form Heathrow, we enumerate such phrases as fled form Heathrow and flew fore Heathrow. For each such substitute phrase, the search engine runs the query and determines the number of matching results.

3.4 Phonetic correction

The main idea here is to generate, for each term, a “phonetic hash” so that similar-sounding terms hash to the same value.

Algorithms for such phonetic hashing are commonly collectively known as soundex algorithms.

  1. Turn every term to be indexed into a 4-character reduced form. Build an inverted index from these reduced forms to the original terms; call this the soundex index.
  2. Do the same with query terms.
  3. When the query calls for a soundex match, search this soundex index.
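A sketch of one common soundex convention (retain the first letter, map remaining consonants to digits, drop vowels/h/w/y, collapse repeated digits, pad to four characters); the book's variant may differ in minor details:

```python
SOUNDEX_MAP = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
               **dict.fromkeys("dt", "3"), "l": "4",
               **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(term: str) -> str:
    """4-character phonetic hash: first letter plus up to three digits."""
    term = term.lower()
    first = term[0].upper()
    digits = [SOUNDEX_MAP.get(ch, "0") for ch in term]   # vowels/h/w/y -> "0"
    # collapse runs of the same digit, drop the first letter's own code and zeros
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    code = "".join(d for d in collapsed[1:] if d != "0")
    return (first + code + "000")[:4]

print(soundex("Herman"))   # H655
print(soundex("Hermann"))  # H655 -- similar-sounding terms hash to the same value
```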

Step by Step Spell correction

  1. Break the query into individual words.
  2. Generate k-grams for each word in the query.
  3. For each k-gram in the query, retrieve similar k-grams from the k-gram index.
  4. Collect a set of candidate words based on the retrieved k-grams.
  5. Calculate the edit distance (e.g., Levenshtein distance) between each candidate word and the query word with the spelling mistake.
  6. Rank the candidate words based on their edit distances or other similarity metrics.

# Index Construction

Agenda

  1. Review of the basics of computer hardware that are relevant for indexing.
  2. We then introduce blocked sort-based indexing.
  3. Describes single-pass in-memory indexing, an algorithm that has even better scaling properties because it does not hold the vocabulary in memory. For very large collections like the web, indexing has to be distributed over computer clusters with hundreds or thousands of machines.
  4. Collections with frequent changes require dynamic indexing.
  5. Dynamic indexing: how changes in the collection are immediately reflected in the index. We also cover complicating issues that can arise in indexing, such as security and indexes for ranked retrieval.

4.1 Hardware Basics

  • Access to data in memory is much faster than access to data on disk. It takes a few clock cycles (perhaps 5 × 10^−9 seconds) to access a byte in memory, but much longer to transfer it from disk (about 2 × 10^−8 seconds). Consequently, we want to keep as much data as possible in memory, especially those data that we need to access frequently. We call the technique of keeping frequently used disk data in main memory caching.
  • When doing a disk read or write, it takes a while for the disk head to move to the part of the disk where the data are located. This time is called the seek time and it averages 5 ms for typical disks. No data are being transferred during the seek. To maximize data transfer rates, chunks of data that will be read together should therefore be stored contiguously on disk.
  • Operating systems generally read and write entire blocks. Thus, reading a single byte from disk can take as much time as reading the entire block. Block sizes of 8, 16, 32, and 64 kilobytes (KB) are common. We call the part of main memory where a block being read or written is stored a buffer.
  • Data transfers from disk to memory are handled by the system bus, not by the processor. This means that the processor is available to process data during disk I/O. We can exploit this fact to speed up data transfers by storing compressed data on disk. Assuming an efficient decompression algorithm, the total time of reading and then decompressing compressed data is usually less than reading uncompressed data.

4.2 Blocked sort-based indexing

We first make a pass through the collection assembling all term–docID pairs. We then sort the pairs with the term as the dominant key and docID as the secondary key. Finally, we organize the docIDs for each term into a postings list and compute statistics like term and document frequency. To make index construction more efficient, we represent terms as termIDs. For small collections, all this can be done in memory.

With main memory insufficient, we need to use an external sorting algorithm, that is, one that uses disk. For acceptable speed, the central requirement of such an algorithm is that it minimize the number of random disk seeks during sorting – sequential disk reads are far faster than seeks.

One solution is the blocked sort-based indexing algorithm or BSBI:

  1. segments the collection into parts of equal size
  2. sorts the termID–docID pairs of each part in memory
  3. stores intermediate sorted results on disk, and
  4. merges all intermediate results into the final index.
  • The algorithm parses documents into termID–docID pairs and accumulates the pairs in memory until a block of a fixed size is full. We choose the block size to fit comfortably into memory to permit a fast in-memory sort. The block is then inverted and written to disk. Inversion involves two steps. First, we sort the termID–docID pairs. Next, we collect all termID–docID pairs with the same termID into a postings list, where a posting is simply a docID. The result, an inverted index for the block we have just read, is then written to disk.
  • In the final step, the algorithm simultaneously merges all the blocks (ten in the book's running example) into one large merged index. To do the merging, we open all block files simultaneously, and maintain small read buffers for the blocks we are reading and a write buffer for the final merged index we are writing. In each iteration, we select the lowest termID that has not been processed yet using a priority queue or a similar data structure. All postings lists for this termID are read and merged, and the merged list is written back to disk. A sketch of this merge step follows.
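The sketch assumes each block has already been inverted into a run of (termID, postings) pairs sorted by termID, and that blocks were built from consecutive docID ranges, so concatenating postings keeps them sorted; heapq.merge plays the role of the priority queue:

```python
import heapq

def merge_blocks(blocks):
    """n-way merge of sorted (termID, postings) runs into one index.
    Each element of 'blocks' stands in for an on-disk block file read
    through a small buffer; heapq.merge pops the lowest termID next."""
    merged = {}
    for term_id, postings in heapq.merge(*blocks, key=lambda pair: pair[0]):
        # blocks cover consecutive docID ranges, so concatenation stays sorted
        merged.setdefault(term_id, []).extend(postings)
    return merged

block1 = [(1, [2, 7]), (3, [1])]                 # built from docIDs 1-9
block2 = [(1, [12]), (2, [14]), (3, [15, 18])]   # built from docIDs 10-19
print(merge_blocks([block1, block2]))
# {1: [2, 7, 12], 2: [14], 3: [1, 15, 18]}
```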


4.3 Single-pass in-memory indexing

Blocked sort-based indexing has excellent scaling properties, but it needs a data structure for mapping terms to termIDs. For very large collections, this data structure does not fit into memory.

A difference between BSBI and SPIMI is that SPIMI adds a posting directly to its postings list. Instead of first collecting all termID–docID pairs and then sorting them (as we did in BSBI), each postings list is dynamic (i.e., its size is adjusted as it grows) and it is immediately available to collect postings.

When memory has been exhausted, we write the index of the block (which consists of the dictionary and the postings lists) to disk. We have to sort the terms before doing this because we want to write postings lists in lexicographic order to facilitate the final merging step. If each block’s postings lists were written in unsorted order, merging blocks could not be accomplished by a simple linear scan through each block.

Both the postings and the dictionary terms can be stored compactly on disk if we employ compression. Compression increases the efficiency of the algorithm further because we can process even larger blocks, and because the individual blocks require less space on disk.
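A sketch of the SPIMI idea under simplified assumptions: the token stream is a list of (term, docID) pairs, "memory is full" is approximated by a posting count, and writing a block to disk is replaced by appending it to a list:

```python
def spimi_invert(token_stream, max_postings_per_block=1_000_000):
    """Single-pass in-memory indexing: build a dictionary of dynamic
    postings lists and flush a sorted block whenever the (simulated)
    memory budget is exhausted."""
    blocks, dictionary, n_postings = [], {}, 0
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])     # new term -> new postings list
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)                    # add the posting directly
            n_postings += 1
        if n_postings >= max_postings_per_block:
            blocks.append(sorted(dictionary.items()))  # sort terms, then "write to disk"
            dictionary, n_postings = {}, 0
    if dictionary:
        blocks.append(sorted(dictionary.items()))
    return blocks   # sorted blocks, ready for the final merge

stream = [("new", 1), ("home", 1), ("sales", 1), ("home", 2), ("top", 2)]
print(spimi_invert(stream, max_postings_per_block=3))
```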

4.4 Distributed indexing

Collections are often so large that we cannot perform index construction efficiently on a single machine. Web search engines, therefore, use distributed indexing algorithms for index construction. In this section, we describe distributed indexing for a term-partitioned index. Most large search engines prefer a document-partitioned index. The distributed index construction method we describe in this section is an application of MapReduce. One requirement for robust distributed indexing is, therefore, that we divide the work up into chunks that we can easily assign and – in case of failure – reassign. A master node directs the process of assigning and reassigning tasks to individual worker nodes. The map and reduce phases of MapReduce split up the computing job into chunks that standard machines can process in a short time.

First, the input data, in our case a collection of web pages, are split into n splits where the size of the split is chosen to ensure that the work can be distributed evenly (chunks should not be too large) and efficiently (the total number of chunks we need to manage should not be too large); 16 or 64MB are good sizes in distributed indexing. Splits are not preassigned to machines, but are instead assigned by the master node on an ongoing basis: As a machine finishes processing one split, it is assigned the next one. If a machine dies or becomes a laggard due to hardware problems, the split it is working on is simply reassigned to another machine.

In general, MapReduce breaks a large computing problem into smaller parts by recasting it in terms of manipulation of key-value pairs. For indexing, a key-value pair has the form (termID, docID). In distributed indexing, the mapping from terms to termIDs is also distributed and therefore more complex than in single-machine indexing. A simple solution is to maintain a (perhaps precomputed) mapping for frequent terms that is copied to all nodes and to use terms directly (instead of termIDs) for infrequent terms. We do not address this problem here and assume that all nodes share a consistent term → termID mapping.

The map phase of MapReduce consists of mapping splits of the input data to key-value pairs. This is the same parsing task we also encountered in BSBI and SPIMI, and we therefore call the machines that execute the map phase parsers. Each parser writes its output to local intermediate files, the segment files.

For the reduce phase, we want all values for a given key to be stored close together, so that they can be read and processed quickly. This is achieved by partitioning the keys into j term partitions and having the parsers write key-value pairs for each term partition into a separate segment file.


Collecting all values (here: docIDs) for a given key (here: termID) into one list is the task of the inverters in the reduce phase.

Example

Map Phase: This is the first phase in MapReduce. In the context of indexing, it involves parsing and processing documents, breaking them down into structured data (usually term-document pairs or tokens), and emitting these pairs as key-value entries. Each document is processed independently on different nodes of the cluster. The Map function is responsible for this step.

Example: Let's say we're indexing a collection of web pages. In the Map phase, each web page is processed separately. The Map function extracts terms from the web page and generates key-value pairs, where the key is the term, and the value is the document ID.

Shuffle and Sort: After the Map phase, data is shuffled and sorted. This process groups all key-value pairs with the same key together. This ensures that in the next phase, the Reduce phase, all pairs with the same key can be processed together. This step is crucial for efficient indexing across multiple documents.

Reduce Phase: In this phase, each unique key (term) is processed by a Reduce function. The Reduce function aggregates the values associated with each key, often creating a postings list for each term. These postings lists contain information about which documents contain the term and their respective document IDs.

Example: Let's consider the term "MapReduce." In the Reduce phase, the postings list for this term is generated by aggregating the document IDs where "MapReduce" is found.

Output: The output of the Reduce phase is the final inverted index. This index contains terms as keys and their associated postings lists as values.

Example: Imagine we're indexing a collection of news articles. In the Map phase, each article is processed individually. The Map function extracts and emits term-document pairs like this:

  • Term: "technology," Document: Article1
  • Term: "big data," Document: Article2
  • Term: "MapReduce," Document: Article3
  • ...

In the Reduce phase, each unique term is processed, and the associated documents are aggregated into postings lists. For example:

  • Term: "technology," Postings List: [Article1, Article4, Article7, ...]
  • Term: "big data," Postings List: [Article2, Article5, Article8, ...]
  • Term: "MapReduce," Postings List: [Article3, Article6, Article9, ...]
  • ...
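Putting the phases together, a toy single-process sketch of the map, shuffle/sort, and reduce steps described above (a real deployment distributes splits, segment files, and inverters across machines):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Parser: emit (term, docID) key-value pairs for one split."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(term, doc_ids):
    """Inverter: collect all docIDs for a term into one sorted postings list."""
    return term, sorted(set(doc_ids))

docs = {1: "big data and MapReduce", 2: "MapReduce for indexing", 3: "big indexes"}

# map
pairs = []
for doc_id, text in docs.items():
    pairs.extend(map_phase(doc_id, text))

# shuffle and sort: group values by key
grouped = defaultdict(list)
for term, doc_id in pairs:
    grouped[term].append(doc_id)

# reduce: the output is the inverted index
inverted_index = dict(reduce_phase(t, ds) for t, ds in grouped.items())
print(inverted_index["mapreduce"])   # [1, 2]
print(inverted_index["big"])         # [1, 3]
```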

4.5 Dynamic indexing

Thus far, we have assumed that the document collection is static. This is fine for collections that change infrequently or never (e.g., the Bible or Shakespeare). But most collections are modified frequently with documents being added, deleted, and updated. This means that new terms need to be added to the dictionary, and postings lists need to be updated for existing terms.

If there is a requirement that new documents be included quickly, one solution is to maintain two indexes: a large main index and a small auxiliary index that stores new documents. The auxiliary index is kept in memory. Searches are run across both indexes and results merged. Deletions are stored in an invalidation bit vector. We can then filter out deleted documents before returning the search result. Documents are updated by deleting and reinserting them.

Each time the auxiliary index becomes too large, we merge it into the main index. The cost of this merging operation depends on how we store the index in the file system. If we store each postings list as a separate file, then the merge simply consists of extending each postings list of the main index by the corresponding postings list of the auxiliary index. In this scheme, the reason for keeping the auxiliary index is to reduce the number of disk seeks required over time. Updating each document separately requires up to M_ave disk seeks, where M_ave is the average size of the vocabulary of documents in the collection. With an auxiliary index, we only put additional load on the disk when we merge auxiliary and main indexes.

Unfortunately, the one-file-per-postings-list scheme is infeasible because most file systems cannot efficiently handle very large numbers of files. The simplest alternative is to store the index as one large file, that is, as a concatenation of all postings lists. In reality, we often choose a compromise between the two extremes.

Having multiple indexes complicates the maintenance of collection-wide statistics. For example, it affects the spelling correction algorithm that selects the corrected alternative with the most hits. With multiple indexes and an invalidation bit vector, the correct number of hits for a term is no longer a simple lookup. In fact, all aspects of an IR system – index maintenance, query processing, distribution, and so on – are more complex under logarithmic merging, a scheme that maintains a series of indexes of geometrically increasing sizes and merges them as they fill up.

Other types of indexes

In ranked retrieval, postings are often ordered according to weight or impact, with the highest-weighted postings occurring first. With this organization, scanning of long postings lists during query processing can usually be terminated early when weights have become so small that any further documents can be predicted to be of low similarity to the query.

User authorization is often mediated through access control lists or ACLs. ACLs can be dealt with in an information retrieval system by representing each document as the set of users that can access it and then inverting the resulting user–document matrix. The inverted ACL index has, for each user, a “postings list” of documents they can access – the user’s access list. Search results are then intersected with this list. However, such an index is difficult to maintain when access permissions change – we discussed these difficulties in the context of incremental indexing for regular postings lists.

# Index compression

We employ a number of compression techniques for dictionary and inverted index that are essential for efficient IR systems. One benefit of compression is immediately clear. We need less disk space.

As we will see, compression ratios of 1:4 are easy to achieve, potentially cutting the cost of storing the index by 75%.

There are two more subtle benefits of compression. The first is increased use of caching. Search systems use some parts of the dictionary and the index much more than others. For example, if we cache the postings list of a frequently used query term t, then the computations necessary for responding to the one-term query t can be entirely done in memory. With compression, we can fit a lot more information into main memory. Instead of having to expend a disk seek when processing a query with t, we instead access its postings list in memory and decompress it. There are simple and efficient decompression methods, so that the penalty of having to decompress the postings list is small. As a result, we are able to decrease the response time of the IR system substantially. Because memory is a more expensive resource than disk space, increased speed owing to caching – rather than decreased space requirements – is often the prime motivator for compression.

The second more subtle advantage of compression is faster transfer of data from disk to memory. Efficient decompression algorithms run so fast on modern hardware that the total time of transferring a compressed chunk of data from disk and then decompressing it is usually less than transferring the same chunk of data in uncompressed form. So, in most cases, the retrieval system runs faster on compressed postings lists than on uncompressed postings lists.

Agenda

  1. Start with a statistical characterization of the distribution of the entities we want to compress – terms and postings in large collections.
  2. We then look at compression of the dictionary, using the dictionary as-a-string method and blocked storage.
  3. Describe two techniques for compressing the postings file, variable byte encoding and γ encoding.

5.1 Statistical properties of terms in information retrieval

  • The number of terms is the main factor in determining the size of the dictionary.
  • Stemming and case folding reduce the number of (distinct) terms by 17% each and the number of nonpositional postings by 4% and 3%, respectively.
  • The treatment of the most frequent words is also important. The rule of 30 states that the 30 most common words account for 30% of the tokens in written text (31% in the table).
  • Eliminating the 150 most common words from indexing cuts 25% to 30% of the nonpositional postings. But, although a stop list of 150 words reduces the number of postings by a quarter or more, this size reduction does not carry over to the size of the compressed index.
  • The postings lists of frequent words require only a few bits per posting after compression.

Compression

Case folding, stemming, and stop word elimination are forms of lossy compression. Similarly, the vector space model and dimensionality reduction techniques like latent semantic indexing create compact representations from which we cannot fully restore the original collection. Lossy compression makes sense when the “lost” information is unlikely ever to be used by the search system. There are retrieval scenarios where lossy methods can be used for compression without any reduction in effectiveness.

  • Estimating the number of terms => Heaps’ law
    • M = kT^b, where T is the number of tokens in the collection. Typical values for the parameters k and b are 30 ≤ k ≤ 100 and b ≈ 0.5.
  • Modeling the distribution of terms => Zipf’s law
    • A commonly used model of the distribution of terms in a collection is Zipf’s law. It states that, if t1 is the most common term in the collection, t2 is the next most common, and so on, then the collection frequency cf_i of the ith most common term is proportional to 1/i. (A small sketch of both laws follows this list.)
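A small sketch of both laws; the parameter values k = 44 and b = 0.49 are roughly those the book fits for the Reuters-RCV1 collection and are meant only as illustrations:

```python
def heaps_vocab_size(T: int, k: float = 44, b: float = 0.49) -> float:
    """Heaps' law: estimated vocabulary size M = k * T^b for T tokens."""
    return k * T ** b

def zipf_cf(cf1: float, i: int) -> float:
    """Zipf's law: collection frequency of the i-th most common term
    is proportional to 1/i, i.e. cf_i ≈ cf_1 / i."""
    return cf1 / i

print(f"{heaps_vocab_size(1_000_000):,.0f} distinct terms estimated for 1M tokens")
print(f"{zipf_cf(1_000_000, 10):,.0f} occurrences predicted for the 10th most common term")
```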

5.2 Dictionary compression

The dictionary is small compared with the postings file, as suggested by Table 5.1 in the book. So why compress it if it is responsible for only a small percentage of the overall space requirements of the IR system?

One of the primary factors in determining the response time of an IR system is the number of disk seeks necessary to process a query. If parts of the dictionary are on disk, then many more disk seeks are necessary in query evaluation. Thus, the main goal of compressing the dictionary is to fit it in main memory, or at least a large portion of it, to support high query throughput.

5.2.1 Dictionary as a string

The simplest data structure for the dictionary is to sort the vocabulary lexicographically and store it in an array of fixed-width entries. We allocate 20 bytes for the term itself (because few terms have more than twenty characters in English), 4 bytes for its document frequency, and 4 bytes for the pointer to its postings list.

Using fixed-width entries for terms is clearly wasteful. The average length of a term in English is about eight characters, so on average we are wasting twelve characters in the fixed-width scheme. Also, we have no way of storing terms with more than twenty characters like hydrochlorofluorocarbons and supercalifragilisticexpialidocious. We can overcome these shortcomings by storing the dictionary terms as one long string of characters. The pointer to the next term is also used to demarcate the end of the current term.

5.2.2 Blocked storage

By increasing the block size k, we get better compression. However, there is a tradeoff between compression and the speed of term lookup. One source of redundancy in the dictionary we have not exploited yet is the fact that consecutive entries in an alphabetically sorted list share common prefixes. This observation leads to front coding.


Even with the best compression scheme, it may not be feasible to store the entire dictionary in main memory for very large text collections and for hardware with limited memory. If we have to partition the dictionary onto pages that are stored on disk, then we can index the first term of each page using a B-tree. For processing most queries, the search system has to go to disk anyway to fetch the postings. One additional seek for retrieving the term’s dictionary page from disk is a significant, but tolerable increase in the time it takes to process a query.

5.3 Postings file compression

Consider the postings list of a term such as computer. We will find a document containing computer, then we skip a few documents that do not contain it, then there is again a document with the term, and so on. The key idea is that the gaps between postings are short, requiring a lot less space to store than a full docID (about 20 bits for a collection of 800,000 documents, as in the book's example). In fact, gaps for the most frequent terms such as the and for are mostly equal to 1. But the gaps for a rare term that occurs only once or twice in a collection have the same order of magnitude as the docIDs and still need about 20 bits. For an economical representation of this distribution of gaps, we need a variable encoding method that uses fewer bits for short gaps. To encode small numbers in less space than large numbers, we look at two types of methods: bytewise compression and bitwise compression. As the names suggest, these methods attempt to encode gaps with the minimum number of bytes and bits, respectively.
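A sketch of gap (delta) encoding of a postings list; the variable byte and γ codes listed next would then be applied to these gaps rather than to raw docIDs:

```python
def to_gaps(doc_ids):
    """Store the first docID, then each difference to the previous docID."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Rebuild absolute docIDs by a running sum over the gaps."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

postings = [283042, 283043, 283044, 283045]   # a frequent term: gaps of 1
gaps = to_gaps(postings)
print(gaps)                          # [283042, 1, 1, 1]
print(from_gaps(gaps) == postings)   # True
```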

  1. Variable byte codes (skipped for now, read book)
  2. Gamma codes (skipped for now, read book)

# Scoring, term weighting and the vector space model

It is essential for a search engine to rank-order the documents matching a query. To do this, the search engine computes, for each matching document, a score with respect to the query at hand.

Agenda

  1. We introduce parametric and zone indexes, which serve two purposes. First, they allow us to index and retrieve documents by metadata such as the language in which a document is written. Second, they give us a simple means for scoring (and thereby ranking) documents in response to a query.
  2. Develop the idea of weighting the importance of a term in a document, based on the statistics of occurrence of the term.
  3. Show that by viewing each document as a vector of such weights, we can compute a score between a query and each document. This view is known as vector space scoring.
  4. Develop several variants of term-weighting for the vector space model.

6.1 Parametric and zone indexes

Digital documents generally encode, in machine-recognizable form, certain metadata associated with each document. By metadata, we mean specific forms of data about a document, such as its author(s), title and date of publication. This metadata would generally include fields such as the date of creation and the format of the document, as well the author and possibly the title of the document.

Consider queries of the form “find documents authored by William Shakespeare in 1601, containing the phrase alas poor Yorick”. Query processing then consists as usual of postings intersections, except that we may merge postings from the standard inverted index as well as parametric indexes. There is one parametric index for each field (say, date of creation); it allows us to select only the documents matching a date specified in the query.

The search engine may support querying ranges on such ordered values; to this end, a structure like a B-tree may be used for the field’s dictionary.

Zones: Zones are similar to fields, except the contents of a zone can be arbitrary free text. Whereas a field may take on a relatively small set of values, a zone can be thought of as an arbitrary, unbounded amount of text. For instance, document titles and abstracts are generally treated as zones.

We may build a separate inverted index for each zone of a document, to support queries such as “find documents with merchant in the title and william in the author list and the phrase gentle rain in the body”. Whereas the dictionary for a parametric index comes from a fixed vocabulary (the set of languages, or the set of dates), the dictionary for a zone index must structure whatever vocabulary stems from the text of that zone.

Weighted zone scoring

Many documents, especially in web search and digital libraries, are structured and contain different types of content, such as titles, headings, body text, and metadata. These different content types can have varying degrees of importance for retrieval. For example, a user might be more interested in finding a query term in the title of a web page than in the body text.

Given a Boolean query q and a document d, weighted zone scoring assigns to the pair (q, d) a score in the interval [0, 1], by computing a linear combination of zone scores, where each zone of the document contributes a Boolean value. More specifically, consider a set of documents each of which has ℓ zones.

Let g1, . . . , gℓ ∈ [0, 1] such that g1 + g2 + · · · + gℓ = 1.

For 1 ≤ i ≤ ℓ, let si be the Boolean score denoting a match (or absence thereof) between q and the ith zone. For instance, the Boolean score from a zone could be 1 if all the query term(s) occur in that zone, and zero otherwise.

Weighted zone scoring is sometimes referred to also as ranked Boolean retrieval. A simple approach would be to compute the score for each document in turn, adding in all the contributions from the various zones. We can compute weighted zone scores directly from inverted indexes.
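A sketch of weighted zone scoring with an AND-style Boolean match per zone; the zone layout and weights are illustrative, with the g_i summing to 1:

```python
def weighted_zone_score(query_terms, doc_zones, weights):
    """Sum of g_i * s_i, where s_i = 1 if every query term occurs in zone i.
    'doc_zones' maps zone name -> zone text; 'weights' maps zone name -> g_i."""
    score = 0.0
    for zone, g in weights.items():
        tokens = set(doc_zones.get(zone, "").lower().split())
        s = 1 if all(t in tokens for t in query_terms) else 0
        score += g * s
    return score

doc = {"title": "The merchant of Venice", "body": "a gentle rain from heaven"}
weights = {"title": 0.7, "body": 0.3}          # g_title + g_body = 1
print(weighted_zone_score(["merchant"], doc, weights))          # 0.7
print(weighted_zone_score(["gentle", "rain"], doc, weights))    # 0.3
```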

Learning weights

How do we determine the weights gi for weighted zone scoring? These weights could be specified by an expert (or, in principle, the user); but increasingly, these weights are “learned” using training examples that have been judged editorially. This latter methodology falls under a general class of approaches to scoring and ranking in information retrieval, known as machine-learned relevance.

For weighted zone scoring, the process may be viewed as learning a linear function of the Boolean match scores contributed by the various zones. The expensive component of this methodology is the labor-intensive assembly of user-generated relevance judgments from which to learn the weights, especially in a collection that changes frequently (such as the Web). We now detail a simple example that illustrates how we can reduce the problem of learning the weights gi to a simple optimization problem. We now consider a simple case of weighted zone scoring, where each document has a title zone and a body zone. Given a query q and a document d, we use the given Boolean match function to compute Boolean variables sT(d, q) and sB(d, q), depending on whether the title (respectively, body) zone of d matches query q. For instance, the algorithm in Figure 6.4 uses an AND of the query terms for this Boolean function. We will compute a score between 0 and 1 for each (document, query) pair using sT(d, q) and sB(d, q) by using a constant g ∈ [0, 1], as follows:

score(d, q) = g · sT(d, q) + (1 − g) · sB(d, q)    (6.2)

The problem of learning the constant g from the given training examples then reduces to picking the value of g that minimizes the total error.
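The book shows that this error has a simple closed-form minimizer; purely as an illustration, the sketch below instead searches a grid of g values for the one with the smallest total squared error over the training judgments:

```python
def best_g(training_examples):
    """training_examples: list of (sT, sB, relevance) triples, where sT and sB
    are Boolean title/body match scores and relevance is an editorial 0/1
    judgment. Returns the g in [0, 1] (on a 0.01 grid) minimizing the total
    squared error of score = g*sT + (1 - g)*sB."""
    def total_error(g):
        return sum((g * sT + (1 - g) * sB - rel) ** 2
                   for sT, sB, rel in training_examples)
    return min((i / 100 for i in range(101)), key=total_error)

examples = [(1, 1, 1), (1, 0, 1), (1, 0, 1), (0, 1, 1), (0, 1, 0), (0, 0, 0)]
print(best_g(examples))   # 0.75 for this toy data
```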

6.2 Term frequency and weighting

A document or zone that mentions a query term more often has more to do with that query and therefore should receive a higher score.

Towards this end, we assign to each term in a document a weight for that term, that depends on the number of occurrences of the term in the document. We would like to compute a score between a query term t and a document d, based on the weight of t in d.

In this view of a document, known in the literature as the bag of words model, the exact ordering of the terms in a document is ignored but the number of occurrences of each term is material.

Inverse document frequency

Raw term frequency as above suffers from a critical problem: all terms are considered equally important when it comes to assessing relevance to a query. In fact, certain terms have little or no discriminating power in determining relevance. For instance, a collection of documents on the auto industry is likely to have the term auto in almost every document.

To this end, we introduce a mechanism for attenuating the effect of terms that occur too often in the collection to be meaningful for relevance determination. An immediate idea is to scale down the term weights of terms with high collection frequency, defined to be the total number of occurrences of a term in the collection. The idea would be to reduce the tf weight of a term by a factor that grows with its collection frequency.

The document frequency df_t is the number of documents that contain t, and the inverse document frequency is defined as idf_t = log(N/df_t), where N is the number of documents in the collection. The idf of a rare term is high, whereas the idf of a frequent term is likely to be low.

Document Vector: We may view each document as a vector with one component corresponding to each term in the dictionary, together with a weight for each component.
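A sketch of building such weighted document vectors with tf-idf weights, using idf_t = log10(N/df_t); a real system would compute these from the inverted index rather than from in-memory token lists:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: {docID: list of tokens}. Returns {docID: {term: tf-idf weight}}
    with raw term frequency as tf and idf_t = log10(N / df_t)."""
    N = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))                      # document frequency per term
    idf = {t: math.log10(N / df_t) for t, df_t in df.items()}
    return {
        doc_id: {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for doc_id, tokens in docs.items()
    }

docs = {1: ["auto", "insurance", "auto"], 2: ["auto", "best", "car"], 3: ["car", "insurance"]}
print(tf_idf_vectors(docs)[1])   # 'auto' and 'insurance' get low idf (each in 2 of 3 docs)
```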

6.3 The vector space model for scoring

The representation of a set of documents as vectors in a common vector space is known as the vector space model and is fundamental to a host of information retrieval operations, ranging from scoring documents on a query to document classification and document clustering.

Given a document d (potentially one of the di in the collection), consider searching for the documents in the collection most similar to d. Such a search is useful in a system where a user may identify a document and seek others like it – a feature available in the results lists of search engines as a “more like this” feature. We reduce the problem of finding the document(s) most similar to d to that of finding the di with the highest dot products (similarity values) ~v(d) · ~v(di).

Viewing a collection of N documents as a collection of vectors leads to a natural view of a collection as a term-document matrix: this is an M × N matrix whose rows represent the M terms (dimensions) of the N columns, each of which corresponds to a document.

To summarize, by viewing a query as a “bag of words”, we are able to treat it as a very short document. As a consequence, we can use the cosine similarity between the query vector and a document vector as a measure of the score of the document for that query.

Computing the cosine similarities between the query vector and each document vector in the collection, sorting the resulting scores and selecting the top K documents can be expensive — a single similarity computation can entail a dot product in tens of thousands of dimensions, demanding tens of thousands of arithmetic operations. We study how to use an inverted index for this purpose, followed by a series of heuristics for improving on this.

Computing vector scores

In a typical setting we have a collection of documents each represented by a vector, a free text query represented by a vector, and a positive integer K. We seek the K documents of the collection with the highest vector space scores on the given query.
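A sketch of term-at-a-time scoring in that setting, keeping one score accumulator per document and using a heap for the top K; query term weights are taken to be 1 and document vector lengths are assumed precomputed:

```python
import heapq
from collections import defaultdict

def cosine_top_k(query_terms, index, doc_lengths, k=10):
    """Term-at-a-time scoring. 'index' maps term -> list of (docID, tf-idf
    weight) postings; 'doc_lengths' maps docID -> Euclidean length of its
    vector. Taking every query term weight as 1 preserves the ranking."""
    scores = defaultdict(float)                  # one accumulator per candidate doc
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            scores[doc_id] += weight
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]    # length-normalize
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

index = {"gossip": [(1, 2.0), (2, 1.0)], "jealous": [(2, 2.0), (3, 1.0)]}
doc_lengths = {1: 2.0, 2: 2.5, 3: 2.0}
print(cosine_top_k(["gossip", "jealous"], index, doc_lengths, k=2))
# [(2, 1.2), (1, 1.0)]
```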

6.4 Variant tf-idf functions

Sublinear tf scaling

It seems unlikely that twenty occurrences of a term in a document truly carry twenty times the significance of a single occurrence. Accordingly, there has been considerable research into variants of term frequency that go beyond counting the number of occurrences of a term. A common modification is to use instead the logarithm of the term frequency, which assigns a weight given by

wf_{t,d} = 1 + log tf_{t,d} if tf_{t,d} > 0, and 0 otherwise.

Maximum tf normalization

The basic idea is to avoid a large swing in ntf_{t,d} from modest changes in tf_{t,d} (say from 1 to 2). The main idea of maximum tf normalization is to mitigate the following anomaly: we observe higher term frequencies in longer documents, merely because longer documents tend to repeat the same words over and over again. To appreciate this, consider the following extreme example: suppose we were to take a document d and create a new document d′ by simply appending a copy of d to itself. While d′ should be no more relevant to any query than d is, tf-idf scoring would assign it twice as high a score as d. Replacing tf-idf_{t,d} by ntf-idf_{t,d} (tf-idf computed with maximum-tf-normalized term frequency) eliminates the anomaly in this example. Maximum tf normalization does suffer from the following issues:

  1. The method is unstable in the following sense: a change in the stop word list can dramatically alter term weightings (and therefore ranking). Thus, it is hard to tune.
  2. A document may contain an outlier term with an unusually large number of occurrences of that term, not representative of the content of that document.
  3. More generally, a document in which the most frequent term appears roughly as often as many other terms should be treated differently from one with a more skewed distribution.

Pivoted normalized document length

The concept of "pivoted document length normalization" is a technique used to address the impact of document length on relevance ranking in information retrieval. Let's break down this concept using the provided text and an example:

Background:

In information retrieval, documents are represented as vectors in a high-dimensional space, with each dimension corresponding to a unique term (word). When calculating relevance between a query and a document using cosine similarity, document vectors are normalized to unit length by dividing by the Euclidean length of the vector. This normalization allows for consistent ranking based on the angle between the query vector and document vector.

The Issue:

Longer documents tend to have higher term frequencies (tf) because they contain more terms. They may also contain a greater variety of terms. This can lead to a situation where longer documents have higher cosine similarity scores with a query, which may not always be accurate because length alone should not dominate relevance.

Pivoted Document Length Normalization:

The idea behind pivoted document length normalization is to adjust the way document vectors are normalized based on document length. The term "pivot length," denoted as ℓp, is introduced. This length separates shorter documents from longer documents in terms of their treatment during normalization.

Normalization Factor:

For documents with a length less than ℓp, the normalization factor remains similar to the conventional cosine normalization, making them less affected by their length. For longer documents (length greater than ℓp), a modified normalization factor is used to reduce the impact of document length on relevance ranking.

Linear Factor:

In its simplest form, the normalization factor for longer documents is a linear function of the Euclidean length of the document vector, |~V(d)|. The formula for the normalization factor, a · |~V(d)| + (1 − a) · piv, ensures that for documents with length less than ℓp the factor remains close to 1, and for documents with length greater than ℓp it is less than 1.

Pivoted Normalization Curve:

This normalization factor is responsible for "rotating" the cosine similarity curve (y = x) in the plot counter-clockwise about the pivot point ℓp. This adjustment aims to bring the relevance ranking curve closer to the true relevance vs. document length curve.

# Computing scores in a complete search system

Agenda

  1. Heuristics for speeding up this computation; many of these heuristics achieve their speed at the risk of not finding quite the top K documents matching the query.
  2. Outline a complete search engine, including indexes and structures to support not only cosine scoring but also more general ranking factors such as query term proximity.
  3. Discuss how the vector space model for free text queries interacts with common query operators.

7.1 Efficient scoring and ranking

For the purpose of ranking the documents matching this query, we are really interested in the relative (rather than absolute) scores of the documents in the collection. To this end, it suffices to compute the cosine similarity from each document unit vector ~v(d) to ~V (q) (in which all non-zero components of the query vector are set to 1), rather than to the unit vector ~v(q).

We walk through the postings in the inverted index for the terms in q, accumulating the total score for each document – very much as in processing a Boolean query, except we assign a positive score to each document that appears in any of the postings being traversed. We maintain an idf value for each dictionary term and a tf value for each postings entry. This scheme computes a score for every document in the postings of any of the query terms; the total number of such documents may be considerably smaller than N.

Given these scores, the final step before presenting results to a user is to pick out the K highest-scoring documents. While one could sort the complete set of scores, a better approach is to use a heap to retrieve only the top K documents in order.
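
A small Python sketch of term-at-a-time accumulation followed by heap-based top-K selection; the postings and idf structures here are simplified stand-ins, not a real index format:

```python
import heapq
from collections import defaultdict

def score_and_rank(query_terms, postings, idf, k):
    """Term-at-a-time scoring followed by top-k selection with a heap (a sketch).

    postings: dict term -> list of (doc_id, tf_weight); idf: dict term -> idf value.
    Only documents appearing in some query term's postings ever receive a score.
    """
    scores = defaultdict(float)
    for t in query_terms:
        for doc_id, tf_weight in postings.get(t, []):
            scores[doc_id] += idf.get(t, 0.0) * tf_weight
    # A size-k heap retrieves the top k in O(N log k) rather than sorting all scores.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```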

Inexact top K document retrieval

We now consider schemes by which we produce K documents that are likely to be among the K highest scoring documents for a query.

The heuristics have the following two-step scheme:

  1. Find a set A of documents that are contenders, where K < |A| ≪ N. A does not necessarily contain the K top-scoring documents for the query, but is likely to have many documents with scores near those of the top K.
  2. Return the K top-scoring documents in A.

7.1.1 Index elimination

For a multi-term query q, it is clear we only consider documents containing at least one of the query terms.

  1. We only consider documents containing terms whose idf exceeds a preset threshold. Thus, in the postings traversal, we only traverse the postings for terms with high idf. This has a fairly significant benefit: the postings lists of low-idf terms are generally long. One way of viewing this heuristic: low-idf terms are treated as stop words and do not contribute to scoring. For instance, on the query catcher in the rye, we only traverse the postings for catcher and rye.
  2. We only consider documents that contain many (and as a special case, all) of the query terms. This can be accomplished during the postings traversal; we only compute scores for documents containing all (or many) of the query terms.

7.1.2 Champion lists

The idea of champion lists (sometimes also called fancy lists or top docs) is to precompute, for each term t in the dictionary, the set of the r documents with the highest weights for t; the value of r is chosen in advance. For tfidf weighting, these would be the r documents with the highest tf values for term t. We call this set of r documents the champion list for term t.
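
A sketch of the precomputation step in Python, assuming postings entries already carry a tf-based weight; the data representation is illustrative:

```python
import heapq

def build_champion_lists(postings, r):
    """Precompute, for each term, the r docs with the highest tf weight (a sketch).

    postings: dict term -> list of (doc_id, tf_weight).
    At query time only the union of the query terms' champion lists is scored.
    """
    champions = {}
    for term, plist in postings.items():
        best = heapq.nlargest(r, plist, key=lambda entry: entry[1])
        champions[term] = [doc_id for doc_id, _ in best]
    return champions
```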

7.1.3 Static quality scores and ordering

In many search engines, we have available a measure of quality g(d) for each document d that is query-independent and thus static. This quality measure may be viewed as a number between zero and one. For instance, in the context of news stories on the web, g(d) may be derived from the number of favorable reviews of the story by web surfers. The net score for a document d is some combination of g(d) together with the query-dependent score induced.

  1. First, consider ordering the documents in the postings list for each term by decreasing value of g(d). Since a common ordering is maintained across all postings lists, the postings intersection algorithm still works. The first idea is a direct extension of champion lists: for a well-chosen value r, we maintain for each term t a global champion list of the r documents with the highest values for g(d) + tf-idft,d. The champion list itself is, like all the postings lists considered so far, sorted by a common order (either by document IDs or by static quality).
  2. We maintain for each term t two postings lists consisting of disjoint sets of documents, each sorted by g(d) values. The first list, which we call high, contains the m documents with the highest tf values for t. The second list, which we call low, contains all other documents containing t. When processing a query, we first scan only the high lists of the query terms, computing net scores for any document on the high lists of all (or more than a certain number of) query terms. If we obtain scores for K documents in the process, we terminate. If not, we continue the scanning into the low lists, scoring documents in these postings lists.

7.1.4 Impact ordering

The idea is to order the documents d in the postings list of term t by decreasing order of tft,d. Thus, the ordering of documents will vary from one postings list to another, and we cannot compute scores by a concurrent traversal of the postings lists of all query terms. Given postings lists ordered by decreasing order of tft,d, two ideas have been found to significantly lower the number of documents for which we accumulate scores:

  1. When traversing the postings list for a query term t, we stop after considering a prefix of the postings list – either after a fixed number of documents r have been seen, or after the value of tft,d has dropped below a threshold.
  2. When accumulating scores in the outer loop of Figure 6.14, we consider the query terms in decreasing order of idf, so that the query terms likely to contribute the most to the final scores are considered first. This latter idea too can be adaptive at the time of processing a query: as we get to query terms with lower idf, we can determine whether to proceed based on the changes in document scores from processing the previous query term.

7.1.5 Cluster pruning

In cluster pruning we have a preprocessing step during which we cluster the document vectors. Then at query time, we consider only documents in a small number of clusters as candidates for which we compute cosine scores. Specifically, the preprocessing step is as follows:

  1. Pick √N documents at random from the collection. Call these leaders.
  2. For each document that is not a leader, we compute its nearest leader.

Query processing proceeds as follows:

  1. Given a query q, find the leader L that is closest to q. This entails computing cosine similarities from q to each of the √N leaders.
  2. The candidate set A consists of L together with its followers. We compute the cosine scores for all documents in this candidate set. The use of randomly chosen leaders for clustering is fast and likely to reflect the distribution of the document vectors in the vector space: a region of the vector space that is dense in documents is likely to produce multiple leaders and thus a finer partition into sub-regions.

Variations of cluster pruning introduce additional parameters b1 and b2, both of which are positive integers. In the pre-processing step we attach each follower to its b1 closest leaders, rather than a single closest leader. At query time we consider the b2 leaders closest to the query q. Clearly, the basic scheme above corresponds to the case b1 = b2 = 1. Further, increasing b1 or b2 increases the likelihood of finding K documents that are more likely to be in the set of true top-scoring K documents, at the expense of more computation.
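
A rough Python sketch of both phases for the basic b1 = b2 = 1 case, assuming sparse dict document vectors; helper names are mine:

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_prune_preprocess(docs):
    """Pick sqrt(N) random leaders and attach every other doc to its nearest leader.
    docs: dict doc_id -> sparse term-weight vector."""
    leaders = random.sample(list(docs), int(math.sqrt(len(docs))))
    followers = {l: [] for l in leaders}
    for d, vec in docs.items():
        if d in followers:
            continue
        nearest = max(leaders, key=lambda l: cosine(vec, docs[l]))
        followers[nearest].append(d)
    return followers

def cluster_prune_query(query_vec, docs, followers, k):
    """At query time, score only the closest leader's cluster (b2 = 1)."""
    leader = max(followers, key=lambda l: cosine(query_vec, docs[l]))
    candidates = [leader] + followers[leader]
    scored = [(d, cosine(query_vec, docs[d])) for d in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```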

7.2 Components of an information retrieval system

Tiered indexes

With the inexact top-K heuristics above, we may occasionally find ourselves with a set A of contenders that has fewer than K documents. Tiered indexes address this: if the top tier of the index yields too few results, query processing falls back to lower tiers.

In this example we set a tf threshold of 20 for tier 1 and 10 for tier 2, meaning that the tier 1 index only has postings entries with tf values exceeding 20, while the tier 2 index only has postings entries with tf values exceeding 10. In this example we have chosen to order the postings entries within a tier by document ID.

Query-term proximity

Consider a query with two or more query terms, t1, t2, . . . , tk. Let ω be the width of the smallest window in a document d that contains all the query terms, measured in the number of words in the window.

For instance, if the document were to simply consist of the sentence The quality of mercy is not strained, the smallest window for the query strained mercy would be 4. Intuitively, the smaller that ω is, the better that d matches the query. In cases where the document does not contain all of the query terms, we can set ω to be some enormous number. We could also consider variants in which only words that are not stop words are considered in computing ω. Such proximity-weighted scoring functions are a departure from pure cosine similarity and closer to the “soft conjunctive” semantics that Google and other web search engines evidently use.
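
A sketch of computing ω with a sliding window over the query-term positions; the function name and the use of float("inf") as the "enormous number" are my choices:

```python
def smallest_window(doc_tokens, query_terms):
    """Width (in words) of the smallest window of doc_tokens containing every query term.

    Returns infinity if some query term is missing from the document, matching the
    convention in the text above.
    """
    query_terms = set(query_terms)
    positions = [(i, tok) for i, tok in enumerate(doc_tokens) if tok in query_terms]
    if {tok for _, tok in positions} != query_terms:
        return float("inf")
    best = float("inf")
    counts, have, left = {}, 0, 0
    for pos_r, tok_r in positions:
        counts[tok_r] = counts.get(tok_r, 0) + 1
        if counts[tok_r] == 1:
            have += 1
        while have == len(query_terms):   # window currently covers all query terms
            pos_l, tok_l = positions[left]
            best = min(best, pos_r - pos_l + 1)
            counts[tok_l] -= 1
            if counts[tok_l] == 0:
                have -= 1
            left += 1
    return best
```

On the example above, `smallest_window("the quality of mercy is not strained".split(), {"strained", "mercy"})` returns 4.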

Designing parsing and scoring functions

For a user query such as rising interest rates, the query parser may issue a stream of queries:

  1. Run the user-generated query string as a phrase query. Rank them by vector space scoring using as query the vector consisting of the 3 terms rising interest rates.
  2. If fewer than ten documents contain the phrase rising interest rates, run the two 2-term phrase queries rising interest and interest rates; rank these using vector space scoring, as well.
  3. If we still have fewer than ten results, run the vector space query consisting of the three individual query terms.

Each of these steps (if invoked) may yield a list of scored documents, for each of which we compute a score. This score must combine contributions from vector space scoring, static quality, proximity weighting and potentially other factors – particularly since a document may appear in the lists from multiple steps. This demands an aggregate scoring function that accumulates evidence of a document’s relevance from multiple sources.

Putting it all together

We now have all the components necessary for a basic search system that supports free text queries as well as Boolean, zone and field queries.

In this figure, documents stream in from the left for parsing and linguistic processing (language and format detection, tokenization and stemming). The resulting stream of tokens feeds into two modules. First, we retain a copy of each parsed document in a document cache. This will enable us to generate results snippets: snippets of text accompanying each document in the results list for a query. This snippet tries to give a succinct explanation to the user of why the document matches the query.

Screenshot 2023-10-14 at 1 32 26 AM

A second copy of the tokens is fed to a bank of indexers that create a bank of indexes including zone and field indexes that store the metadata for each document, (tiered) positional indexes, indexes for spelling correction and other tolerant retrieval, and structures for accelerating inexact top-K retrieval. A free text user query (top center) is sent down to the indexes both directly and through a module for generating spelling-correction candidates. The latter may optionally be invoked only when the original query fails to retrieve enough results. Retrieved documents (dark arrow) are passed to a scoring module that computes scores based on machine-learned ranking (MLR), for scoring and ranking documents. Finally, these ranked documents are rendered as a results page.

7.3 Vector space scoring and query operator interaction

In building a search engine, we may opt to support multiple query operators for an end user. In doing so we need to understand what components of the index can be shared for executing various query operators, as well as how to handle user queries that mix various query operators.

Vector space scoring supports so-called free text retrieval, in which a query is specified as a set of words without any query operators connecting them. It allows documents matching the query to be scored and thus ranked, unlike the Boolean, wildcard and phrase queries studied earlier. Classically, the interpretation of such free text queries was that at least one of the query terms be present in any retrieved document. However more recently, web search engines such as Google have popularized the notion that a set of terms typed into their query boxes (thus on the face of it, a free text query) carries the semantics of a conjunctive query that only retrieves documents containing all or most query terms.

Wildcard queries

Wildcard and vector space queries require different indexes, except at the basic level that both can be implemented using postings and a dictionary (e.g., a dictionary of trigrams for wildcard queries). The vector space query is then executed as usual, with matching documents being scored and ranked.

Phrase queries

The representation of documents as vectors is fundamentally lossy: the relative order of terms in a document is lost in the encoding of a document as a vector. Even if we were to try and somehow treat every biword as a term (and thus an axis in the vector space), the weights on different axes are not independent: for instance the phrase German shepherd gets encoded in the axis german shepherd, but immediately has a non-zero weight on the axes german and shepherd. Further, notions such as idf would have to be extended to such biwords. Thus an index built for vector space retrieval cannot, in general, be used for phrase queries. Moreover, there is no way of demanding a vector space score for a phrase query — we only know the relative weights of each term in a document.

On the query german shepherd, we could use vector space retrieval to identify documents heavy in these two terms, with no way of prescribing that they occur consecutively. Phrase retrieval, on the other hand, tells us of the existence of the phrase german shepherd in a document, without any indication of the relative frequency or weight of this phrase. While these two retrieval paradigms (phrase and vector space) consequently have different implementations in terms of indexes and retrieval algorithms, they can in some cases be combined usefully, as in the three-step example of query parsing.

# Evaluation in information retrieval

How do we know which of these techniques are effective in which applications? Should we use stop lists? Should we stem? Should we use inverse document frequency weighting?

Agenda

  1. A discussion of measuring the effectiveness of IR systems and the test collections that are most often used for this purpose.
  2. Discuss formal evaluation methodology that has been developed for evaluating unranked retrieval results. This includes explaining the kinds of evaluation measures that are standardly used for document retrieval and related tasks like text classification and why they are appropriate. We then extend these notions and develop further measures for evaluating ranked retrieval results.

8.1 Information retrieval system evaluation

To measure ad hoc information retrieval effectiveness in the standard way, we need a test collection consisting of three things:

  1. A document collection
  2. A test suite of information needs, expressible as queries
  3. A set of relevance judgments, standardly a binary assessment of either relevant or nonrelevant for each query-document pair.

Relevance is assessed relative to an information need, not a query. For example, an information need might be: Information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine. This might be translated into a query such as:

wine AND red AND white AND heart AND attack AND effective.

8.2 Evaluation of unranked retrieval sets

Precision (P) is the fraction of retrieved documents that are relevant: Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant|retrieved)

Recall (R) is the fraction of relevant documents that are retrieved: Recall = #(relevant items retrieved) / #(relevant items) = P(retrieved|relevant)

Why is accuracy not used?

There is a good reason why accuracy is not an appropriate measure for information retrieval problems. In almost all circumstances, the data is extremely skewed: normally over 99.9% of the documents are in the nonrelevant category. A system tuned to maximize accuracy can appear to perform well by simply deeming all documents nonrelevant to all queries. Even if the system is quite good, trying to label some documents as relevant will almost always lead to a high rate of false positives. However, labeling all documents as nonrelevant is completely unsatisfying to an information retrieval system user.

The advantage of having the two numbers for precision and recall is that one is more important than the other in many circumstances. Typical web surfers would like every result on the first page to be relevant (high precision) but have not the slightest interest in knowing let alone looking at every document that is relevant. In contrast, various professional searchers such as paralegals and intelligence analysts are very concerned with trying to get as high recall as possible, and will tolerate fairly low precision results in order to get it. Individuals searching their hard disks are also often interested in high recall searches.

Recall is a non-decreasing function of the number of documents retrieved. On the other hand, in a good system, precision usually decreases as the number of documents retrieved is increased.

A single measure that trades off precision versus recall is the F measure, which is the weighted harmonic mean of precision and recall

Why do we use a harmonic mean rather than the simpler average?

Recall that we can always get 100% recall by just returning all documents, and therefore we can always get a 50% arithmetic mean by the same process. This strongly suggests that the arithmetic mean is an unsuitable measure to use. In contrast, if we assume that 1 document in 10,000 is relevant to the query, the harmonic mean score of this strategy is 0.02%. The harmonic mean is always less than or equal to the arithmetic mean and the geometric mean. When the values of two numbers differ greatly, the harmonic mean is closer to their minimum than to their arithmetic mean.
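
A small Python sketch of these set-based measures, including the general F_β form; β = 1 gives the balanced harmonic mean F1 = 2PR / (P + R):

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Set-based precision, recall, and F_beta for one query.

    retrieved, relevant: sets of doc ids.  beta = 1 gives the balanced F1;
    beta > 1 weights recall more heavily than precision.
    """
    tp = len(retrieved & relevant)
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    if p == 0 and r == 0:
        return p, r, 0.0
    b2 = beta * beta
    f = (1 + b2) * p * r / (b2 * p + r)
    return p, r, f
```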

8.4 Evaluation of ranked retrieval results

In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents. For each such set, precision and recall values can be plotted to give a precision-recall curve. Precision-recall curves have a distinctive saw-tooth shape: if the (k + 1)th document retrieved is nonrelevant then recall is the same as for the top k documents, but precision has dropped. If it is relevant, then both precision and recall increase, and the curve jags up and to the right.

The interpolated precision pinterp at a certain recall level r is defined as the highest precision found for any recall level r′ ≥ r

The 11-point interpolated average precision. For each information need, the interpolated precision is measured at the 11 recall levels of 0.0, 0.1, 0.2, . . . , 1.0.

Mean Average Precision (MAP) provides a single-figure measure of quality across recall levels. For a single information need, Average Precision is the average of the precision values obtained for the set of top k documents existing after each relevant document is retrieved, and this value is then averaged over information needs. That is, if the set of relevant documents for an information need qj ∈ Q is {d1, . . . , dmj} and Rjk is the set of ranked retrieval results from the top result down to document dk, then MAP(Q) = (1/|Q|) Σj (1/mj) Σ(k=1..mj) Precision(Rjk).

For a single information need, the average precision approximates the area under the uninterpolated precision-recall curve, and so the MAP is roughly the average area under the precision-recall curve for a set of queries.
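
A sketch of Average Precision and MAP in Python, following the convention that relevant documents never retrieved contribute a precision of zero:

```python
def average_precision(ranked, relevant):
    """Average precision for one information need.

    ranked: list of doc ids in ranked order; relevant: set of relevant doc ids.
    Precision is taken at the rank of each relevant document retrieved.
    """
    hits, precisions = 0, []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over a set of queries; runs is a list of (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```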

The above measures factor in precision at all recall levels. For many prominent applications, particularly web search, this may not be germane to users. What matters is rather how many good results there are on the first page or the first three pages. This leads to measuring precision at fixed low levels of retrieved results, such as 10 or 30 documents. This is referred to as Precision at k, for example “Precision at 10”.

An alternative, which alleviates this problem, is R-precision. It requires having a set of known relevant documents Rel, from which we calculate the precision of the top |Rel| documents returned.

A perfect system could score 1 on this metric for each query, whereas even a perfect system could only achieve a precision at 20 of 0.4 if there were only 8 documents in the collection relevant to an information need.

If there are |Rel| relevant documents for a query, we examine the top |Rel| results of a system, and find that r are relevant, then by definition, not only is the precision (and hence R-precision) r/|Rel|, but the recall of this result set is also r/|Rel|. Thus, R-precision turns out to be identical to the break-even point.

An ROC curve plots the true positive rate or sensitivity against the false positive rate or (1 − specificity). Here, sensitivity is just another term for recall. The false positive rate is given by fp/(fp + tn). Figure 8.4 shows the ROC curve corresponding to the precision-recall curve in Figure 8.2. An ROC curve always goes from the bottom left to the top right of the graph. For a good system, the graph climbs steeply on the left side.

A final approach that has seen increasing adoption, especially when employed with machine learning approaches to ranking, is measures of cumulative gain, and in particular normalized discounted cumulative gain (NDCG). NDCG is designed for situations of non-binary notions of relevance (cf. Section 8.5.1). Like precision at k, it is evaluated over some number k of top search results. For a set of queries Q, let R(j, d) be the relevance score assessors gave to document d for query j.
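
A hedged sketch of one common NDCG variant, using gain (2^rel − 1) with a log2(1 + rank) discount and normalizing by the DCG of the ideally reordered results; other variants differ in the gain and discount functions:

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded relevance scores."""
    return sum((2 ** g - 1) / math.log2(1 + i)
               for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal (sorted) ranking.

    gains: graded relevance scores of the results in ranked order (e.g. 0-3).
    For a sketch we sort only the retrieved gains to form the ideal ranking.
    """
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```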

8.5 Assessing relevance

Using random combinations of query terms as an information need is generally not a good idea because typically they will not resemble the actual distribution of information needs.

For large modern collections, it is usual for relevance to be assessed only for a subset of the documents for each query. The most standard approach is pooling, where relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems.

Kappa statistic

In the social sciences, a common measure for agreement between judges is the kappa statistic. It is designed for categorical judgments and corrects a simple agreement rate for the rate of chance agreement.

kappa = (P(A) − P(E)) / (1− P(E))

where P(A) is the proportion of the times the judges agreed, and P(E) is the proportion of the times they would be expected to agree by chance.

Screenshot 2023-10-16 at 6 08 45 PM

The kappa value will be 1 if two judges always agree, 0 if they agree only at the rate given by chance, and negative if they are worse than random.
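
A small sketch of computing kappa for two judges making binary relevance calls, estimating P(E) from the judges' pooled marginals (one common convention):

```python
def kappa(judgments_a, judgments_b):
    """Kappa statistic for two judges making binary relevant/nonrelevant calls.

    judgments_a, judgments_b: equal-length lists of 0/1 labels for the same items.
    """
    n = len(judgments_a)
    p_agree = sum(a == b for a, b in zip(judgments_a, judgments_b)) / n
    # pooled probability of a "relevant" call across both judges
    p_rel = (sum(judgments_a) + sum(judgments_b)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0
```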

The choice of judge (i.e., whose relevance assessments are taken as the gold standard) can make a considerable absolute difference to reported scores, but has in general been found to have little impact on the relative effectiveness ranking of either different systems or variants of a single system which are being compared for effectiveness.

8.5.1 Critiques and justifications of the concept of relevance

Relevance of a document to an information need is treated as an absolute, objective decision. But judgments of relevance are subjective, varying across people, as we discussed above.

MARGINAL RELEVANCE

One clear problem with the relevance-based assessment that we have presented is the distinction between relevance and marginal relevance: whether a document still has distinctive usefulness after the user has looked at certain other documents.

The most extreme case of this is documents that are duplicates – a phenomenon that is actually very common on the World Wide Web – but it can also easily occur when several documents provide a similar précis of an event. In such circumstances, marginal relevance is clearly a better measure of utility to the user. Maximizing marginal relevance requires returning documents that exhibit diversity and novelty. One way to approach measuring this is by using distinct facts or entities as evaluation units.

8.6 A broader perspective: System quality and user utility

8.6.1 System issues

  1. How fast does it index, that is, how many documents per hour does it index for a certain distribution over document lengths?
  2. How fast does it search, that is, what is its latency as a function of index size?
  3. How expressive is its query language? How fast is it on complex queries?
  4. How large is its document collection, in terms of the number of documents or the collection having information distributed across a broad range of topics?

Measuring the rate of return of users is thus an effective metric, which would of course be more effective if you could also measure how much these users used other search engines.

8.6.3 Refining a deployed system

If an IR system has been built and is being used by a large number of users, the system’s builders can evaluate possible changes by deploying variant versions of the system and recording measures that are indicative of user satisfaction with one variant vs. others as they are being used. This method is frequently used by web search engines.

The most common version of this is A/B testing, a term borrowed from the advertising industry.

If we wish to investigate a change to the ranking algorithm, we redirect a random sample of users to a variant system and evaluate measures such as the frequency with which people click on the top result, or any result on the first page. (This particular analysis method is referred to as clickthrough log analysis or clickstream mining.)

8.7 Results snippets

The two basic kinds of summaries are static, which are always the same regardless of the query, and dynamic (or query-dependent), which are customized according to the user’s information need as deduced from a query.

Dynamic summaries are generated in real time, often adapting to user preferences, queries, or contexts. These summaries are tailored to the specific needs of the user at the moment and can be adjusted based on the user's interactions or feedback. Dynamic summaries are frequently used in personalized search results, chatbot interactions, or adaptive content curation systems.

Example of a dynamic summary: Consider a news application that dynamically generates personalized summaries of recent news articles based on the user's interests, browsing history, and reading habits. The summary content may change based on the topics the user frequently engages with or the type of articles they prefer to read.

Text Summarization

There has been extensive work within natural language processing (NLP) on better ways to do text summarization. Most such work still aims only to choose sentences from the original document to present and concentrates on how to select good sentences. The models typically combine positional factors, favoring the first and last paragraphs of documents and the first and last sentences of paragraphs, with content factors, emphasizing sentences with key terms, which have low document frequency in the collection as a whole, but high frequency and good distribution across the particular document being returned.

Dynamic summaries display one or more “windows” on the document, aiming to present the pieces that have the most utility to the user in evaluating the document with respect to their information need.

# Relevance feedback and query expansion

In most collections, the same concept may be referred to using different words. This issue, known as synonymy, has an impact on the recall of most information retrieval systems. For example, you would want a search for aircraft to match plane (but only for references to an airplane, not a woodworking plane), and for a search on thermodynamics to match references to heat in appropriate discussions.

The methods for tackling this problem split into two major classes: global methods and local methods.

Global methods are techniques for expanding or reformulating query terms independent of the query and results returned from it, so that changes in the query wording will cause the new query to match other semantically similar terms. Global methods include:

  • Query expansion/ reformulation with a thesaurus or WordNet.
  • Query expansion via automatic thesaurus generation.
  • Techniques like spelling correction.

Local methods adjust a query relative to the documents that initially appear to match the query. The basic methods here are:

  • Relevance feedback.
  • Pseudo relevance feedback, also known as Blind relevance feedback.
  • (Global) indirect relevance feedback.

9.1 Relevance feedback and pseudo relevance feedback

The idea of relevance feedback (RF) is to involve the user in the retrieval process so as to improve the final result set.

  • The user issues a (short, simple) query.
  • The system returns an initial set of retrieval results.
  • The user marks some returned documents as relevant or nonrelevant.
  • The system computes a better representation of the information need based on the user feedback.
  • The system displays a revised set of retrieval results.

Relevance feedback can go through one or more iterations of this sort. The process exploits the idea that it may be difficult to formulate a good query when you don’t know the collection well, but it is easy to judge particular documents.

Image search provides a good example of relevance feedback. Not only is it easy to see the results at work, but this is a domain where a user can easily have difficulty formulating what they want in words, but can easily indicate relevant or nonrelevant images.

  • After the user enters an initial query for bike.
  • The initial results (in this case, images) are returned.
  • The user has selected some of them as relevant. These will be used to refine the query, while other displayed results have no effect on the reformulation.
  • The new top-ranked results are then computed after this round of relevance feedback.

The Rocchio algorithm for relevance feedback

The underlying theory: We want to find a query vector, denoted as ~q, that maximizes similarity with relevant documents while minimizing similarity with nonrelevant documents. If Cr is the set of relevant documents and Cnr is the set of nonrelevant documents, then we wish to find ~qopt

~qopt = argmax ~q [sim(~q, Cr) − sim(~q, Cnr)]

where sim is cosine similarity.

Solving this under cosine similarity gives ~q_opt = (1/|Cr|) Σ(dj ∈ Cr) ~dj − (1/|Cnr|) Σ(dj ∈ Cnr) ~dj. That is, the optimal query is the vector difference between the centroids of the relevant and nonrelevant documents. However, this observation is not terribly useful, precisely because the full set of relevant documents is not known: it is what we want to find.

In a real IR query context, we have a user query and partial knowledge of known relevant and nonrelevant documents. The algorithm proposes using the modified query ~qm:

Screenshot 2023-10-17 at 3 58 07 PM

where q0 is the original query vector, Dr and Dnr are the set of known relevant and nonrelevant documents respectively, and α, β, and γ are weights attached to each term.

Screenshot 2023-10-17 at 3 58 49 PM

Starting from q0, the new query moves you some distance toward the centroid of the relevant documents and some distance away from the centroid of the nonrelevant documents. This new query can be used for retrieval in the standard vector space model.
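
A minimal Rocchio sketch in Python, assuming sparse dict vectors; the defaults α = 1.0, β = 0.75, γ = 0.15 are commonly cited reasonable values rather than fixed constants, and negative weights are clipped to zero, as is usual in practice:

```python
from collections import defaultdict

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio query modification (a sketch).

    q0: sparse query vector {term: weight}; relevant / nonrelevant: lists of
    sparse document vectors judged by the user.
    """
    qm = defaultdict(float)
    for t, w in q0.items():
        qm[t] += alpha * w
    for d in relevant:                       # move toward relevant centroid
        for t, w in d.items():
            qm[t] += beta * w / len(relevant)
    for d in nonrelevant:                    # move away from nonrelevant centroid
        for t, w in d.items():
            qm[t] -= gamma * w / len(nonrelevant)
    return {t: w for t, w in qm.items() if w > 0}
```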

Relevance feedback can improve both recall and precision. But, in practice, it has been shown to be most useful for increasing recall in situations where recall is important. Another alternative is to use only the marked nonrelevant document which received the highest ranking from the IR system as negative feedback.

Probabilistic relevance feedback

Rather than reweighting the query in a vector space, if a user has told us some relevant and nonrelevant documents, then we can proceed to build a classifier. One way of doing this is with a Naive Bayes probabilistic model. If R is a Boolean indicator variable expressing the relevance of a document, then we can estimate P(xt = 1|R), the probability of a term t appearing in a document, depending on whether it is relevant or not, as:

ˆP(xt = 1|R = 1) = |VRt| / |VR|
ˆP(xt = 1|R = 0) = (dft − |VRt|) / (N − |VR|)

where N is the total number of documents, dft is the number that contain t, VR is the set of known relevant documents, and VRt is the subset of this set containing t.

When does relevance feedback work?

Relevance feedback cannot handle:

  • Misspellings, Cross-language information retrieval, Mismatch of searcher’s vocabulary versus collection vocabulary.
  • relevance feedback approach requires relevant documents to be similar to each other. That is, they should cluster.
    • This approach does not work as well if the relevant documents are a multimodal class, that is, they consist of several clusters of documents within the vector space. This can happen with: Subsets of the documents using different vocabulary, such as Burma vs. Myanmar.

Relevance feedback is not necessarily popular with users. Users are often reluctant to provide explicit feedback, or in general do not wish to prolong the search interaction. Furthermore, it is often harder to understand why a particular document was retrieved after relevance feedback is applied.

Relevance feedback on the web

Some web search engines offer a similar/related pages feature: the user indicates a document in the results set as exemplary from the standpoint of meeting his information need and requests more documents like it. This can be viewed as a particular simple form of relevance feedback.

Evaluation of relevance feedback strategies

The obvious first strategy is to start with an initial query q0 and to compute a precision-recall graph. Following one round of feedback from the user, we compute the modified query qm and again compute a precision-recall graph. Here, in both rounds we assess performance over all documents in the collection, which makes comparisons straightforward. If we do this, we find spectacular gains from relevance feedback: gains on the order of 50% in mean average precision. But unfortunately it is cheating.

The gains are partly due to the fact that known relevant documents (judged by the user) are now ranked higher. The measured gains could give a false impression of the system's performance, as the improved metrics might be due to the known relevant documents being artificially placed higher rather than a genuine enhancement in the system's ability to retrieve relevant information for new, unseen queries. Fairness demands that we should only evaluate with respect to documents not seen by the user.

A second idea is to use documents in the residual collection (the set of documents minus those assessed relevant) for the second round of evaluation. This seems like a more realistic evaluation. Unfortunately, the measured performance can then often be lower than for the original query. This is particularly the case if there are few relevant documents, and so a fair proportion of them have been judged by the user in the first round.

A third method is to have two collections, one which is used for the initial query and relevance judgments, and the second that is then used for comparative evaluation. The performance of both q0 and qm can be validly compared on the second collection.

Pseudo relevance feedback

Pseudo relevance feedback, also known as blind relevance feedback, provides a method for automatic local analysis. It automates the manual part of relevance feedback, so that the user gets improved retrieval performance without an extended interaction. The method is to do normal retrieval to find an initial set of most relevant documents, to then assume that the top k ranked documents are relevant, and finally to do relevance feedback as before under this assumption.

Pseudo relevance feedback works well on average, but it can fail for some queries through query drift. For example, if the query is about copper mines and the top several documents are all about mines in Chile, then the expanded query may drift in the direction of documents on Chile.

Indirect relevance feedback

We can also use indirect sources of evidence rather than explicit feedback on relevance as the basis for relevance feedback. This is often called implicit (relevance) feedback. Implicit feedback is less reliable than explicit feedback, but is more useful than pseudo relevance feedback, which contains no evidence of user judgments.

On the web, DirectHit introduced the idea of ranking more highly documents that users chose to look at more often. In other words, clicks on links were assumed to indicate that the page was likely relevant to the query. This approach makes various assumptions, such as that the document summaries displayed in results lists (on whose basis users choose which documents to click on) are indicative of the relevance of these documents.

Other uses of relevance feedback include:

  • Following a changing information need (e.g., names of car models of interest change over time)
  • Maintaining an information filter (e.g., for a news feed).
  • Active learning (deciding which examples it is most useful to know the class of to reduce annotation costs).

9.2 Global methods for query reformulation

We briefly discuss three global methods for expanding a query: by simply aiding the user in doing so, by using a manual thesaurus, and through building a thesaurus automatically.

Vocabulary tools for query reformulation

One idea is to give the user feedback on how the system interpreted the query: information about words that were omitted from the query because they were on stop lists, what words were stemmed to, the number of hits on each term or phrase, and whether words were dynamically turned into phrases. The IR system might also suggest search terms by means of a thesaurus or a controlled vocabulary. A user can also be allowed to browse lists of the terms that are in the inverted index, and thus find good terms that appear in the collection.

Query expansion

In relevance feedback, users give additional input on documents (by marking documents in the results set as relevant or not), and this input is used to reweight the terms in the query for documents. In query expansion on the other hand, users give additional input on query words or phrases, possibly suggesting additional query terms.

Some search engines (especially on the web) suggest related queries in response to a query; the users then opt to use one of these alternative query suggestions.

The central question in this form of query expansion is how to generate alternative or expanded queries for the user. The most common form of query expansion is global analysis, using some form of thesaurus. For each term t in a query, the query can be automatically expanded with synonyms and related words of t from the thesaurus. Use of a thesaurus can be combined with ideas of term weighting: for instance, one might weight added terms less than original query terms. Methods for building a thesaurus for query expansion include:

  • Use of a controlled vocabulary that is maintained by human editors.
    • For example, in Figure 9.7, neoplasms was added to a search for cancer. This Medline query expansion also contrasts with the Yahoo! example. The Yahoo! interface is a case of interactive query expansion, whereas PubMed does automatic query expansion.

Screenshot 2023-10-17 at 4 12 47 PM

  • A manual thesaurus. Here, human editors have built up sets of synonymous names for concepts, without designating a canonical term. The UMLS metathesaurus is one example of a thesaurus.
  • An automatically derived thesaurus. Here, word co-occurrence statistics over a collection of documents in a domain are used to automatically induce a thesaurus.
  • Query reformulations based on query log mining. Here, we exploit the manual query reformulations of other users to make suggestions to a new user. This requires a huge query volume.

Thesaurus-based query expansion has the advantage of not requiring any user input. Use of query expansion generally increases recall and is widely used in many science and engineering fields.

Automatic thesaurus generation

As an alternative to the cost of a manual thesaurus, we could attempt to generate a thesaurus automatically by analyzing a collection of documents. There are two main approaches. One is simply to exploit word co-occurrence. We say that words co-occurring in a document or paragraph are likely to be in some sense similar or related in meaning, and simply count text statistics to find the most similar words. The other approach is to use a shallow grammatical analysis of the text and to exploit grammatical relations or grammatical dependencies. For example, we say that entities that are grown, cooked, eaten, and digested, are more likely to be food items. Simply using word co-occurrence is more robust (it cannot be misled by parser errors), but using grammatical relations is more accurate.

The simplest way to compute a co-occurrence thesaurus is based on term-term similarities. We begin with a term-document matrix A, where each cell At,d is a weighted count wt,d for term t and document d, with weighting so A has length-normalized rows. If we then calculate C = AA^T, then Cu,v is a similarity score between terms u and v, with a larger number being better.
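
A sketch of this computation with NumPy, assuming A has already been built with length-normalized rows; the neighbour-listing helper and its parameters are illustrative:

```python
import numpy as np

def cooccurrence_thesaurus(A, terms, top_n=5):
    """Term-term similarities from a length-normalized term-document matrix (a sketch).

    A: array of shape (num_terms, num_docs) with unit-length rows.
    Returns, for each term, the top_n most similar other terms according to C = A A^T.
    """
    C = A @ A.T
    np.fill_diagonal(C, -np.inf)          # ignore self-similarity
    neighbours = {}
    for i, t in enumerate(terms):
        best = np.argsort(C[i])[::-1][:top_n]
        neighbours[t] = [terms[j] for j in best]
    return neighbours
```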

Term ambiguity easily introduces irrelevant statistically correlated terms. For example, a query for Apple computer may expand to Apple red fruit computer. In general these thesauri suffer from both false positives and false negatives. Moreover, since the terms in the automatic thesaurus are highly correlated in documents anyway (and often the collection used to derive the thesaurus is the same as the one being indexed), this form of query expansion may not retrieve many additional documents.

However, query expansion may also significantly decrease precision, particularly when the query contains ambiguous terms. For example, if the user searches for interest rate, expanding the query to interest rate fascinate evaluate is unlikely to be useful.

# XML Retrieval

We delve into the intricacies of information retrieval within XML documents, highlighting the challenges associated with structured retrieval. The chapter underscores the importance of adapting traditional retrieval techniques to address the unique complexities introduced by the hierarchical structure and rich metadata of XML data.

Structured retrieval within XML documents necessitates the consideration of various factors. One such factor is the query's scope, which often extends beyond entire documents to encompass specific XML elements. The chapter introduces the concept of the "Structured Document Retrieval Principle," emphasizing the importance of retrieving the most specific part of a document that fulfills the query criteria. However, implementing this principle is not without difficulties, as determining the appropriate level of granularity for query results can be ambiguous, especially when considering context and relevance.

Moreover, the chapter elucidates the importance of selecting the right indexing units, delineating various strategies for defining the appropriate units for structured retrieval. These strategies include grouping nodes into non-overlapping pseudodocuments, using larger elements as indexing units, and adopting a bottom-up approach for identifying relevant units. However, each approach presents its own set of challenges, such as coherence concerns and limitations in predicting relevance.

The chapter also underscores the challenges posed by nested elements and redundancy in XML retrieval. It discusses strategies to mitigate these issues, such as discarding small or non-relevant elements, or implementing postprocessing methods to collapse nested elements and emphasize query terms. Highlighting the significance of context, the chapter emphasizes the relevance of providing contextual information alongside query results to aid users in interpreting the retrieved content.

Furthermore, the chapter accentuates the significance of schema knowledge in alleviating redundancy and refining search results. However, it acknowledges the limitations users face in understanding complex schemas and formulating structured queries effectively. The text underlines the need for improved user interfaces and retrieval systems that simplify the process for users with limited schema knowledge.

In summary, here we shed light on the intricate challenges associated with structured retrieval within XML documents, emphasizing the need for nuanced strategies that balance granularity, relevance, and redundancy. It underscores the significance of context, coherent indexing units, and efficient postprocessing methods to enhance the effectiveness of XML retrieval systems and improve user experience.

# Probabilistic information retrieval

Agenda

  1. Here, we introduce the probabilistic approach to IR, which provides a different formal basis for a retrieval model and results in different techniques for setting term weights.
  2. Introduce basic probability theory and the Probability Ranking Principle.
  3. Then concentrate on the Binary Independence Model.
  4. Introduce related but extended methods which use term counts, including the empirically successful BM25 weighting scheme, and Bayesian Network models for IR.

11.1 Review of basic probability theory

Chain rule:

For two events A and B, the joint event of both events occurring is described by the joint probability P(A, B). The conditional probability P(A|B) expresses the probability of event A given that event B occurred. The fundamental relationship between joint and conditional probabilities is given by the chain rule: P(A, B) = P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)

Partition Rule

If an event B can be divided into an exhaustive set of disjoint subcases, then the probability of B is the sum of the probabilities of the subcases.

P(B) = P(A, B) + P(Ā, B), where Ā denotes the complement (non-occurrence) of A.

Bayes rule

Bayes’ Rule for inverting conditional probabilities: P(A|B) = P(B|A)P(A) / P(B)

This equation can also be thought of as a way of updating probabilities. We start off with an initial estimate of how likely the event A is when we do not have any other information; this is the prior probability P(A). Bayes’ rule lets us derive a posterior probability P(A|B) after having seen the evidence B, based on the likelihood of B occurring in the two cases that A does or does not hold.

Prior Probability: Prior probability refers to the initial belief about the probability of an event before any new information is taken into account. It represents the probability of an event based on existing knowledge or historical data. For example, suppose we are considering the probability of drawing a red ball from a bag of colored balls. If the bag contains 10 red balls and 20 blue balls, the prior probability of drawing a red ball is 10/30 or 1/3.

Posterior Probability: Posterior probability, on the other hand, is the updated probability of an event after considering new evidence or data. It is calculated using Bayes' theorem, which combines the prior probability with the likelihood of the evidence occurring given the hypothesis. The posterior probability is the revised probability that incorporates the new information.

Example: Let's consider a medical scenario where a patient has symptoms that could indicate a particular disease. The prior probability is the initial probability that the patient has the disease based on historical data. Suppose, based on previous statistics, the prior probability of a patient having the disease is 0.1 (or 10%).

Now, if the patient undergoes diagnostic tests, the results of which provide additional information, the posterior probability can be updated accordingly. Suppose the test result suggests the presence of certain biomarkers associated with the disease with a probability of 0.8 (or 80%). By applying Bayes' theorem, the posterior probability can be calculated as follows:

Posterior Probability = (Prior Probability * Likelihood) / Evidence 
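
To make the arithmetic concrete (with made-up numbers for the healthy patients): suppose the test is positive for 80% of patients who have the disease and for 10% of patients who do not. Then the evidence term is P(positive) = 0.1 × 0.8 + 0.9 × 0.1 = 0.17, and the posterior is P(disease | positive) = (0.1 × 0.8) / 0.17 ≈ 0.47. A positive test raises the probability from 10% to roughly 47%, but does not make the disease certain, because the disease is rare to begin with.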

11.2 The Probability Ranking Principle

For a query q and a document d in the collection, let Rd,q be an indicator random variable that says whether d is relevant with respect to a given query q.

Using a probabilistic model, the obvious order in which to present documents to the user is to rank documents by their estimated probability of relevance with respect to the information need: P(R = 1|d, q). This is the basis of the Probability Ranking Principle (PRP).

You lose a point for either returning a nonrelevant document or failing to return a relevant document (such a binary situation where you are evaluated on your accuracy is called 1/0 loss). The goal is to return the best possible results as the top k documents, for any value of k the user chooses to examine. The PRP then says to simply rank all documents in decreasing order of P(R = 1|d, q). If a set of retrieval results is to be returned, rather than an ordering, the Bayes Optimal Decision Rule, the decision which minimizes the risk of loss, is to simply return documents that are more likely relevant than nonrelevant:

d is relevant iff P(R = 1|d, q) > P(R = 0|d, q)         (11.6)

The PRP with retrieval costs

Let C1 be the cost of not retrieving a relevant document and C0 the cost of retrieval of a nonrelevant document. Then the Probability Ranking Principle says that if for a specific document d and for all documents d′ not yet retrieved

C0 · P(R = 0|d) − C1 · P(R = 1|d) ≤ C0 · P(R = 0|d′) − C1 · P(R = 1|d′)         (11.7) 

then d is the next document to be retrieved.

11.3 The Binary Independence Model

It introduces some simple assumptions, which make estimating the probability function P(R|d, q) practical. Here, “binary” is equivalent to Boolean: documents and queries are both represented as binary term incidence vectors. That is, a document d is represented by the vector ~x = (x1, . . . , xM) where xt = 1 if term t is present in document d and xt = 0 if t is not present in d. With this representation, many possible documents have the same vector representation. “Independence” means that terms are modeled as occurring in documents independently. The model recognizes no association between terms. If a document contains "Narendra", then that says nothing about the presence of the word "Modi".

Under the BIM, we model the probability P(R|d, q) that a document is relevant via the probability in terms of term incidence vectors P(R|~x,~q). Then, using Bayes rule, we have:

Screenshot 2023-10-19 at 12 40 46 PM

Here, P(~x|R = 1,~q) is the probability that if a relevant document is retrieved, then that document’s representation is ~x.

P(R = 1|~q) and P(R = 0|~q) indicate the prior probability of retrieving a relevant or nonrelevant document respectively for a query ~q. If we knew the percentage of relevant documents in the collection, we could use it to estimate these priors.

Deriving a ranking function for query terms

Given a query q, we wish to order returned documents by descending P(R = 1|d, q). Under the BIM, this is modeled as ordering by P(R = 1|~x,~q). Rather than estimating this probability directly, because we are interested only in the ranking of documents, we work with some other quantities which are easier to compute and which give the same ordering of documents. In particular, we can rank documents by their odds of relevance.

After some algebraic manipulation and applying the independence assumptions, we arrive at the following ranking quantity (see the screenshot below):

Screenshot 2023-10-19 at 2 18 31 PM

Here, pt = P(xt = 1|R = 1,~q) is the probability of a term appearing in a document relevant to the query, and ut = P(xt = 1|R = 0,~q) is the probability of a term appearing in a nonrelevant document.

The ct terms are log odds ratios for the terms in the query. We have the odds of the term appearing if the document is relevant (pt/(1 − pt)) and the odds of the term appearing if the document is nonrelevant (ut/(1−ut)). The odds ratio is the ratio of two such odds, and then we finally take the log of that quantity. The value will be 0 if a term has equal odds of appearing in relevant and nonrelevant documents, and positive if it is more likely to appear in relevant documents.

Screenshot 2023-10-21 at 4 47 07 PM

  • Refer notes for derivation
  • Refer notes for more on BM25

# Language Models in IR

The language modeling approach to IR directly models that idea: a document is a good match to a query if the document model is likely to generate the query, which will in turn happen if the document contains the query words often.

Instead of overtly modeling the probability P(R = 1|q, d) of relevance of a document d to a query q, as in the traditional probabilistic approach to IR (Chapter 11), the basic language modeling approach instead builds a probabilistic language model Md from each document d, and ranks documents based on the probability of the model generating the query: P(q|Md)

Agenda

  1. Introduce the concept of language models.
  2. Describe the basic and most commonly used language modeling approach to IR, the Query Likelihood Model.
  3. Comparisons between the language modeling approach and other approaches to IR.
  4. Describe various extensions to the language modeling approach.

12.1 Language models

Finite automata and language models

What do we mean by a document model generating a query? A traditional generative model of a language, of the kind familiar from formal language theory, can be used either to recognize or to generate strings. The full set of strings that can be generated is called the language of the automaton.

Screenshot 2023-10-22 at 3 22 33 PM

If instead each node has a probability distribution over generating different terms, we have a language model. The notion of a language model is inherently probabilistic. A language model is a function that puts a probability measure over strings drawn from some vocabulary. That is, for a language model M over an alphabet S:

Screenshot 2023-10-22 at 3 23 49 PM

12.1.2 Types of language models

How do we build probabilities over sequences of terms? We can always use the chain rule to decompose the probability of a sequence of events into the probability of each successive event conditioned on earlier events:

P(t1t2t3t4) = P(t1)P(t2|t1)P(t3|t1t2)P(t4|t1t2t3)         (12.4)

The simplest form of language model simply throws away all conditioning context, and estimates each term independently. Such a model is called a unigram language model:

Puni(t1t2t3t4) = P(t1)P(t2)P(t3)P(t4)

There are many more complex kinds of language models, such as bigram language models, which condition on the previous term,

Pbi(t1t2t3t4) = P(t1)P(t2|t1)P(t3|t2)P(t4|t3)

Since IR does not directly depend on the structure of sentences to the extent that other tasks like speech recognition do, unigram models are often sufficient to judge the topic of a text.
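
A minimal unigram model sketch in Python (maximum-likelihood estimates, no smoothing yet; smoothing is discussed below):

```python
from collections import Counter

def unigram_lm(tokens):
    """Maximum-likelihood unigram model: P(t) = count(t) / total tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def unigram_prob(model, query_tokens):
    """Probability of a token sequence under a unigram model (0 if any term is unseen)."""
    p = 1.0
    for t in query_tokens:
        p *= model.get(t, 0.0)
    return p
```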

The fundamental problem in designing language models is that we do not know what exactly we should use as the model Md. However, we do generally have a sample of text that is representative of that model. This problem makes a lot of sense in the original, primary uses of language models.

12.2 The query likelihood model

The original and basic method for using language models in IR is the query likelihood model. In it, we construct from each document d in the collection a language model Md. Our goal is to rank documents by P(d|q), where the probability of a document is interpreted as the likelihood that it is relevant to the query. Using Bayes rule we have:

P(d|q) = P(q|d)P(d)/P(q)

P(q) is the same for all documents, and so can be ignored, we return results ranked by simply P(q|d), the probability of the query q under the language model derived from d. The Language Modeling approach thus attempts to model the query generation process: Documents are ranked by the probability that a query would be observed as a random sample fromthe respective document model.

Query generation process

For retrieval based on a language model (henceforth LM), we treat the generation of queries as a random process. The approach is to:

  1. Infer a LM for each document.
  2. Estimate P(q|Mdi), the probability of generating the query according to each of these document models.
  3. Rank the documents according to these probabilities.

The intuition of the basic model is that the user has a prototype document in mind, and generates a query based on words that appear in this document. Often, users have a reasonable idea of terms that are likely to occur in documents of interest and they will choose query terms that distinguish these documents from others in the collection.

Screenshot 2023-10-22 at 3 29 24 PM

where Md is the language model of document d, tft,d is the (raw) term frequency of term t in document d, and Ld is the number of tokens in document d. That is, we just count up how often each word occurred, and divide through by the total number of words in the document d.

The classic problem with using language models is one of estimation: terms appear very sparsely in documents. In particular, some words will not have appeared in the document at all, but are possible words for the information need, which the user may have used in the query. If we estimate ˆP(t|Md) = 0 for a term missing from a document d, then we get a strict conjunctive semantics: documents will only give a query non-zero probability if all of the query terms appear in the document.

But the role of smoothing in this model is not only to avoid zero probabilities. The smoothing of terms actually implements major parts of the term weighting component. It is not just that an unsmoothed model has conjunctive semantics; an unsmoothed model works badly because it lacks parts of the term weighting component.

The general approach is that a non-occurring term should be possible in a query, but its probability should be somewhat close to but no more likely than would be expected by chance from the whole collection. That is, if tft,d = 0 then

ˆP(t|Md) ≤ cft/T

where cft is the raw count of the term in the collection, and T is the raw size (number of tokens) of the entire collection. A simple idea that works well in practice is to use a mixture between a document-specific multinomial distribution and a multinomial distribution estimated from the entire collection:

ˆP(t|d) = λˆPmle(t|Md) + (1 − λ)ˆPmle(t|Mc)

where 0 < λ < 1 and Mc is a language model built from the entire document collection. This mixes the probability from the document with the general collection frequency of the word. Such a model is referred to as a linear interpolation language model. Correctly setting λ is important to the good performance of this model.
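
A hedged sketch of ranking with this linear interpolation (Jelinek-Mercer) smoothing; the tiny corpus and the choice λ = 0.5 are only illustrative:

```python
from collections import Counter

def rank_query_likelihood(query, docs, lam=0.5):
    """Rank documents by P(q|d) = prod_t [ lam*P_mle(t|Md) + (1-lam)*P_mle(t|Mc) ]."""
    # Collection model Mc estimated from all documents pooled together.
    coll = Counter(t for d in docs for t in d)
    T = sum(coll.values())
    scores = []
    for d in docs:
        tf, Ld = Counter(d), len(d)
        p = 1.0
        for t in query:
            p *= lam * (tf[t] / Ld) + (1 - lam) * (coll[t] / T)
        scores.append(p)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

docs = [
    "xyzzy reports a profit but revenue is down".split(),
    "quorus narrows quarter loss but revenue decreases further".split(),
]
print(rank_query_likelihood("revenue down".split(), docs, lam=0.5))  # [0, 1]
```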

12.3 Language modeling versus other approaches in IR

The language modeling approach to IR provides a different approach to scoring matches between queries and documents, and the hope is that the probabilistic language modeling foundation improves the weights that are used, and hence the performance of the model. The major issue is estimation of the document model, such as choices of how to smooth it effectively. The model has achieved very good retrieval results. Compared to other probabilistic approaches, the main difference initially appears to be that the LM approach does away with explicitly modeling relevance (whereas this is the central variable evaluated in the BIM approach). The LM approach assumes that documents and expressions of information needs are objects of the same type.

On the other hand, like all IR models, you can also raise objections to the model. The assumption of equivalence between document and information need representation is unrealistic. Current LM approaches use very simple models of language, usually unigram models. Without an explicit notion of relevance, relevance feedback is difficult to integrate into the model, as are user preferences.

12.4 Extended language modeling approaches

Rather than looking at the probability of a document language model Md generating the query, you can look at the probability of a query language model Mq generating the document. The main reason that doing things in this direction and creating a document likelihood model is less appealing is that there is much less text available to estimate a language model based on the query text, and so the model will be worse estimated.

On the other hand, it is easy to see how to incorporate relevance feedback into such a model: you can expand the query with terms taken from relevant documents in the usual way and hence update the language model Mq.

Rather than directly generating in either direction, we can make a language model from both the document and query, and then ask how different these two language models are from each other. One way to model the risk of returning a document d as relevant to a query q is to use the Kullback-Leibler divergence between their respective language models.
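
A minimal sketch of the KL divergence between a query model and a document model; both models are assumed to be smoothed so that P(t|Md) > 0 wherever P(t|Mq) > 0 and the divergence stays finite:

```python
import math

def kl_divergence(p_q, p_d):
    """KL(Mq || Md) = sum_t P(t|Mq) * log(P(t|Mq) / P(t|Md))."""
    return sum(pq * math.log(pq / p_d[t]) for t, pq in p_q.items() if pq > 0)

Mq = {"car": 0.5, "insurance": 0.5}
Md = {"car": 0.3, "insurance": 0.1, "auto": 0.6}
print(kl_divergence(Mq, Md))  # smaller divergence => document ranked higher
```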

Basic LMs do not address issues of alternate expression, that is, synonymy, or any deviation in use of language between queries and documents. A translation model lets you generate query words not in a document by translation to alternate terms with similar meaning.

# Flat Clustering

Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other. Hierarchical clustering creates a hierarchy of clusters.

Hard clustering computes a hard assignment – each document is a member of exactly one cluster. The assignment of soft clustering algorithms is soft – a document’s assignment is a distribution over all clusters. In a soft assignment, a document has fractional membership in several clusters. Latent semantic indexing, a form of dimensionality reduction, is a soft clustering algorithm.

The EM algorithm is a generalization of K-means and can be applied to a large variety of document representations and distributions.

16.1 Clustering in information retrieval

Cluster hypothesis: Documents in the same cluster behave similarly with respect to relevance to information needs.

The hypothesis states that if there is a document from a cluster that is relevant to a search request, then it is likely that other documents from the same cluster are also relevant. This is because clustering puts together documents that share many terms. The cluster hypothesis essentially is the contiguity hypothesis.

The first application mentioned is search result clustering where by search results we mean the documents that were returned in response to a query. The default presentation of search results in information retrieval is a simple list. Users scan the list from top to bottom until they have found the information they are looking for. Instead, search result clustering clusters the search results, so that similar documents appear together. It is often easier to scan a few coherent groups than many individual documents.

The second application Scatter-Gather clusters the whole collection to get groups of documents that the user can select or gather. The selected groups are merged and the resulting set is again clustered.

As an alternative to the user-mediated iterative clustering in Scatter-Gather, we can also compute a static hierarchical clustering of a collection that is not influenced by user interactions.

Clustering is well suited for access to a collection of news stories since news reading is not really search, but rather a process of selecting a subset of stories about recent events.

The fourth application of clustering exploits the cluster hypothesis directly for improving search results, based on a clustering of the entire collection. We use a standard inverted index to identify an initial set of documents that match the query, but we then add other documents from the same clusters even if they have low similarity to the query. For example, if the query is car and several car documents are taken from a cluster of automobile documents, then we can add documents from this cluster that use terms other than car (automobile, vehicle, etc.). This can increase recall since a group of documents with high mutual similarity is often relevant as a whole.

Clustering can also speed up search. Search in the vector space model amounts to finding the nearest neighbors to the query. The inverted index supports fast nearest-neighbor search for the standard IR setting.

The cluster hypothesis offers an alternative: Find the clusters that are closest to the query and only consider documents from these clusters. Within this much smaller set, we can compute similarities exhaustively and rank documents in the usual way.

16.2 Problem statement

We can define the goal in hard flat clustering as follows. Given (i) a set of documents D = {d1, . . . , dN}, (ii) a desired number of clusters K, and (iii) an objective function that evaluates the quality of a clustering, we want to compute an assignment γ : D → {1, . . . , K} that minimizes (or, in other cases, maximizes) the objective function.

A difficult issue in clustering is determining the number of clusters or cardinality of a clustering, which we denote by K.
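
K-means is the standard instance of this setup: the assignment γ is chosen to (locally) minimize the sum of squared distances from each document vector to its cluster centroid. A minimal numpy sketch (illustrative, not the book's code):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Hard flat clustering: return a cluster index per document (a local optimum
    of the residual sum of squares to the K centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest centroid for each document vector.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        gamma = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its documents.
        for k in range(K):
            if np.any(gamma == k):
                centroids[k] = X[gamma == k].mean(axis=0)
    return gamma

X = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]])
print(kmeans(X, K=2))
```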

16.3 Evaluation of clustering

This section introduces four external criteria of clustering quality. Purity is a simple and transparent evaluation measure. Normalized mutual information can be information-theoretically interpreted. The Rand index penalizes both false positive and false negative decisions during clustering. The F measure in addition supports differential weighting of these two types of errors. To compute purity, each cluster is assigned to the class which is most frequent in the cluster, and then the accuracy of this assignment is measured by counting the number of correctly assigned documents and dividing by N.
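
A minimal sketch of the purity computation (illustrative data; clusters and gold classes are given as parallel lists):

```python
from collections import Counter

def purity(clusters, labels):
    """Assign each cluster its majority class; purity = correctly assigned docs / N."""
    N = len(labels)
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(y)
    correct = sum(Counter(ys).most_common(1)[0][1] for ys in by_cluster.values())
    return correct / N

clusters = [1, 1, 1, 2, 2, 2]
labels = ["x", "x", "o", "o", "o", "x"]
print(purity(clusters, labels))  # majority classes x (2) and o (2) -> 4/6
```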

# Hierarchical clustering

Hierarchical clustering algorithms are either top-down or bottom-up. Bottom up algorithms treat each document as a singleton cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all documents. Bottom up hierarchical clustering is therefore called hierarchical agglomerative clustering or HAC. Top-down clustering requires a method for splitting a cluster. It proceeds by splitting clusters recursively until individual documents are reached. See Section 17.6. HAC is more frequently used in IR than top-down clustering and is the main subject of this chapter.

Single-link and complete-link clustering

In single-link clustering or single-linkage clustering, the similarity of two clusters is the similarity of their most similar members (see Figure 17.3 (a)). This single-link merge criterion is local. We pay attention solely to the area where the two clusters come closest to each other. Other, more distant parts of the cluster and the clusters’ overall structure are not taken into account.

In complete-link clustering or complete-linkage clustering, the similarity of two clusters is the similarity of their most dissimilar members (see Figure 17.3 (b)). This is equivalent to choosing the cluster pair whose merge has the smallest diameter. This complete-link merge criterion is non-local; the entire structure of the clustering can influence merge decisions. This results in a preference for compact clusters with small diameters over long, straggly clusters, but also causes sensitivity to outliers.

A measurement based on one pair cannot fully reflect the distribution of documents in a cluster. It is therefore not surprising that both algorithms often produce undesirable clusters. Single-link clustering can produce straggling clusters as shown in Figure 17.6. Since the merge criterion is strictly local, a chain of points can be extended for long distances without regard to the overall shape of the emerging cluster. This effect is called chaining.

However, complete-link clustering suffers from a different problem. It pays too much attention to outliers, points that do not fit well into the global structure of the cluster.
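
A minimal sketch contrasting the two merge criteria using SciPy's agglomerative clustering (SciPy is an assumption here, not something the book uses; the data is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# A chain of collinear points plus a separate tight pair.
X = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [10, 0], [10.5, 0]])

for method in ("single", "complete"):
    Z = linkage(X, method=method)                     # bottom-up (agglomerative) merges
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
    print(method, labels)
# Single-link tends to chain nearby points into long, straggly clusters;
# complete-link prefers compact clusters with small diameters.
```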

17.3 Group-average agglomerative clustering

Group-average agglomerative clustering or GAAC (see Figure 17.3, (d)) evaluates cluster quality based on all similarities between documents, thus avoiding the pitfalls of the single-link and complete-link criteria, which equate cluster similarity with the similarity of a single pair of documents.

17.4 Centroid clustering

In centroid clustering, the similarity of two clusters is defined as the similarity of their centroids.

Centroid similarity is equivalent to average similarity of all pairs of documents from different clusters. Thus, the difference between GAAC and centroid clustering is that GAAC considers all pairs of documents in computing average pairwise similarity (Figure 17.3 (d)) whereas centroid clustering excludes pairs from the same cluster.

In contrast to the other three HAC algorithms, centroid clustering is not monotonic. So-called inversions can occur: Similarity can increase during clustering.

Despite its non-monotonicity, centroid clustering is often used because its similarity measure – the similarity of two centroids – is conceptually simpler than the average of all pairwise similarities in GAAC.

In some applications the purpose of clustering is not to create a complete hierarchy or exhaustive partition of the entire document set. For instance, first story detection or novelty detection is the task of detecting the first occurrence of an event in a stream of news stories. One approach to this task is to find a tight cluster within the documents that were sent across the wire in a short period of time and are dissimilar from all previous documents. For example, the documents sent over the wire in the minutes after the World Trade Center attack on September 11, 2001 form such a cluster.

17.6 Divisive clustering

Top-down clustering is conceptually more complex than bottom-up clustering since we need a second, flat clustering algorithm as a “subroutine”. It has the advantage of being more efficient if we do not generate a complete hierarchy all the way down to individual document leaves.

Bottom-up methods make clustering decisions based on local patterns without initially taking into account the global distribution. These early decisions cannot be undone. Top-down clustering benefits from complete information about the global distribution when making top-level partitioning decisions.

# Latent Semantic Indexing

Agenda

  1. Construct a low-rank approximation to the term-document matrix.
  2. Examine the application of such low-rank approximations to indexing and retrieving documents.

The low-rank approximation to C yields a new representation for each document in the collection. We will cast queries into this low-rank representation as well, enabling us to compute query document similarity scores in this low-rank representation. This process is known as latent semantic indexing.

The vector space representation suffers, however, from its inability to cope with two classic problems arising in natural languages: synonymy and polysemy.

Synonymy refers to a case where two different words (say car and automobile) have the same meaning. The vector space representation fails to capture the relationship between synonymous terms such as car and automobile, according each a separate dimension in the vector space. Consequently, the computed similarity ~q · ~d between a query ~q (say, car) and a document ~d containing both car and automobile underestimates the true similarity that a user would perceive.

Polysemy on the other hand refers to the case where a term such as charge has multiple meanings, so that the computed similarity ~q · ~d overestimates the similarity that a user would perceive. Could we use the co-occurrences of terms (whether, for instance, charge occurs in a document containing steed versus in a document containing electron) to capture the latent semantic associations of terms and alleviate these problems?

In LSI, we use the SVD to construct a low-rank approximation Ck to the term-document matrix, for a value of k that is far smaller than the original rank of C. In the experimental work cited later in this section, k is generally chosen to be in the low hundreds. We thus map each row/column (respectively corresponding to a term/document) to a k-dimensional space; this space is defined by the k principal eigenvectors (corresponding to the largest eigenvalues) of CC^T and C^T C. Note that the matrix Ck is itself still an M × N matrix, irrespective of k.

Next, we use the new k-dimensional LSI representation as we did the original representation – to compute similarities between vectors. A query vector ~q is mapped into its representation in the LSI space by the transformation:

~qk = Σk^(-1) Uk^T ~q    (18.21)

Note especially that Equation (18.21) does not in any way depend on ~q being a query; it is simply a vector in the space of terms. This means that if we have an LSI representation of a collection of documents, a new document not in the collection can be “folded in” to this representation using Equation (18.21). This allows us to incrementally add documents to an LSI representation. Of course, such incremental addition fails to capture the co-occurrences of the newly added documents (and even ignores any new terms they contain). As such, the quality of the LSI representation will degrade as more documents are added and will eventually require a recomputation of the LSI representation.
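
A minimal numpy sketch of LSI on a tiny term-document matrix, including the query mapping of Equation (18.21); the matrix, query, and k are only illustrative:

```python
import numpy as np

# Term-document matrix C (M terms x N documents), e.g. raw term frequencies.
C = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2                                  # keep only the k largest singular values
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

Ck = Uk @ Sk @ Vtk                     # rank-k approximation, still M x N
docs_k = Sk @ Vtk                      # documents in the k-dimensional LSI space

# Fold a query (a vector over terms) into the LSI space: q_k = Sk^{-1} Uk^T q.
q = np.array([1, 1, 0, 0], dtype=float)
q_k = np.linalg.inv(Sk) @ Uk.T @ q

# Cosine similarity between the query and each document in the LSI space.
sims = (docs_k.T @ q_k) / (np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k))
print(sims)
```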

Problems

  1. The computational cost of the SVD is significant; at the time of this writing, we know of no successful experiment with over one million documents. This has been the biggest obstacle to the widespread adoption of LSI.
  2. LSI shares two basic drawbacks of vector space retrieval: there is no good way of expressing negations (find documents that contain german but not shepherd), and no way of enforcing Boolean conditions.

LSI can be viewed as soft clustering by interpreting each dimension of the reduced space as a cluster and the value that a document has on that dimension as its fractional membership in that cluster.

# Web

The web graph

We can view the static Web consisting of static HTML pages together with the hyperlinks between them as a directed graph in which each web page is a node and each hyperlink a directed edge.

There is ample evidence that these links are not randomly distributed; for one thing, the distribution of the number of links into a web page does not follow the Poisson distribution one would expect if every web page were to pick the destinations of its links uniformly at random. Rather, this distribution is widely reported to be a power law, in which the total number of web pages with in-degree i is proportional to 1/i^α.

User query needs

There appear to be three broad categories into which common web search queries can be grouped: (i) informational, (ii) navigational and (iii) transactional.

  • Informational queries seek general information on a broad topic, such as leukemia or Provence.
  • Navigational queries seek the website or home page of a single entity that the user has in mind, say Lufthansa airlines.
  • A transactional query is one that is a prelude to the user performing a transaction on the Web – such as purchasing a product, downloading a file or making a reservation. In such cases, the search engine should return results listing services that provide form interfaces for such transactions.

Near-duplicates and shingling

Search engines try to avoid indexing multiple copies of the same content, to keep down storage and processing overheads. The simplest approach to detecting duplicates is to compute, for each web page, a fingerprint that is a succinct (say 64-bit) digest of the characters on that page. Then, whenever the fingerprints of two web pages are equal, we test whether the pages themselves are equal and if so declare one of them to be a duplicate copy of the other.

We now describe a solution to the problem of detecting near-duplicate web pages. The answer lies in a technique known as shingling.
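
A minimal sketch of the core idea, assuming word k-grams ("shingles") compared with the Jaccard coefficient; the full technique additionally hashes shingles and compares compact min-hash sketches rather than full shingle sets:

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word k-grams) of a page."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard coefficient of two shingle sets; values near 1 indicate near-duplicates."""
    return len(a & b) / len(a | b)

page1 = "a rose is a rose is a rose"
page2 = "a rose is a rose is a flower"
print(jaccard(shingles(page1), shingles(page2)))
```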

Link Analysis

Link analysis for web search has intellectual antecedents in the field of citation analysis, aspects of which overlap with an area known as bibliometrics. These disciplines seek to quantify the influence of scholarly articles by analyzing the pattern of citations amongst them.

For instance, one may contrive to set up multiple web pages pointing to a target web page, with the intent of artificially boosting the latter’s tally of in-links. This phenomenon is referred to as link spam.

Link analysis builds on two intuitions:

  1. The anchor text pointing to page B is a good description of page B.
  2. The hyperlink from A to B represents an endorsement of page B, by the creator of page A.

Anchor text and the web graph

For example, at the time of the writing of this book the home page of the IBM corporation (http://www.ibm.com) did not contain the term computer anywhere in its HTML code, despite the fact that IBM is widely viewed as the world’s largest computer maker.

Thus, there is often a gap between the terms in a web page, and how web users would describe that web page. Consequently, web searchers need not use the terms in a page to query for it.

The insight behind anchor text is that a page’s own terms can be supplemented by the anchor text of links pointing to it, thereby tapping the power of the community of web page authors.

The use of anchor text has some interesting side-effects. Searching for big blue on most web search engines returns the home page of the IBM corporation as the top hit; this is consistent with the popular nickname that many people use to refer to IBM. On the other hand, there have been (and continue to be) many instances where derogatory anchor text such as evil empire leads to somewhat unexpected results on querying for these terms on web search engines. This phenomenon has been exploited in orchestrated campaigns against specific sites. Such orchestrated anchor text may be a form of spamming, since a website can create misleading anchor text pointing to itself, to boost its ranking on selected query terms. Detecting and combating such systematic abuse of anchor text is another form of spam detection that web search engines perform.

The window of text surrounding anchor text (sometimes referred to as extended anchor text) is often usable in the same manner as anchor text itself; consider for instance the fragment of web text "there is good discussion of vedic scripture here".

PageRank

Our first technique for link analysis assigns to every node in the web graph a numerical score between 0 and 1, known as its PageRank. The PageRank of a node will depend on the link structure of the web graph. Given a query, a web search engine computes a composite score for each web page that combines hundreds of features such as cosine similarity (Section 6.3) and term proximity (Section 7.2.2), together with the PageRank score.

Consider a random surfer who begins at a web page (a node of the web graph) and executes a random walk on the Web as follows. At each time step, the surfer proceeds from his current page A to a randomly chosen web page that A hyperlinks to. Figure 21.1 shows the surfer at a node A, out of which there are three hyperlinks to nodes B, C and D; the surfer proceeds at the next time step to one of these three nodes, with equal probabilities 1/3.

As the surfer proceeds in this random walk from node to node, he visits some nodes more often than others; intuitively, these are nodes with many links coming in from other frequently visited nodes. The idea behind PageRank is that pages visited more often in this walk are more important.

What if the current location of the surfer, the node A, has no out-links? To address this we introduce an additional operation for our random surfer: the teleport operation. In the teleport operation the surfer jumps from a node to any other node in the web graph.

The destination of a teleport operation is modeled as being chosen uniformly at random from all web pages. In other words, if N is the total number of nodes in the web graph, the teleport operation takes the surfer to each node with probability 1/N. The surfer would also teleport to his present position with probability 1/N.

In assigning a PageRank score to each node of the web graph, we use the teleport operation in two ways: (1) When at a node with no out-links, the surfer invokes the teleport operation. (2) At any node that has outgoing links, the surfer invokes the teleport operation with probability 0 < α < 1 and the standard random walk (follow an out-link chosen uniformly at random as in Figure 21.1) with probability 1 − α, where α is a fixed parameter chosen in advance. Typically, α might be 0.1.

In Section 21.2.1, we will use the theory of Markov chains to argue that when the surfer follows this combined process (random walk plus teleport) he visits each node v of the web graph a fixed fraction of the time π(v) that depends on (1) the structure of the web graph and (2) the value of α. We call this value π(v) the PageRank of v.
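
A minimal sketch of PageRank as power iteration over the random-walk-plus-teleport transition matrix; the tiny graph is illustrative and α is the teleport probability defined above:

```python
import numpy as np

def pagerank(adj, alpha=0.1, iters=100):
    """adj[i][j] = 1 if page i links to page j. Returns the PageRank vector pi."""
    A = np.array(adj, dtype=float)
    N = len(A)
    P = np.empty((N, N))
    for i in range(N):
        out = A[i].sum()
        if out == 0:
            P[i] = 1.0 / N                            # dangling node: always teleport
        else:
            # follow a random out-link with prob 1 - alpha, teleport with prob alpha
            P[i] = (1 - alpha) * A[i] / out + alpha / N
    pi = np.full(N, 1.0 / N)                          # start from the uniform distribution
    for _ in range(iters):
        pi = pi @ P                                   # one step of the Markov chain
    return pi

# Tiny web graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
adj = [[0, 1, 1],
       [0, 0, 1],
       [1, 0, 0]]
print(pagerank(adj, alpha=0.1))
```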

Markov chains

A Markov chain is a discrete-time stochastic process: a process that occurs in a series of time-steps in each of which a random choice is made. A Markov chain consists of N states. Each web page will correspond to a state in the Markov chain we will formulate.

A Markov chain is characterized by an N × N transition probability matrix P each of whose entries is in the interval [0, 1]; the entries in each row of P add up to 1. The Markov chain can be in one of the N states at any given timestep; then, the entry Pij tells us the probability that the state at the next timestep is j, conditioned on the current state being i. Each entry Pij is known as a transition probability and depends only on the current state i; this is known as the Markov property.

In a Markov chain, the probability distribution over next states depends only on the current state, and not on how the chain arrived at the current state.

We can view a random surfer on the web graph as a Markov chain, with one state for each web page, and each transition probability representing the probability of moving from one web page to another. The teleport operation contributes to these transition probabilities.

Hubs and Authorities

We now develop a scheme in which, given a query, every web page is assigned two scores. One is called its hub score and the other its authority score. For any query, we compute two ranked lists of results rather than one. The ranking of one list is induced by the hub scores and that of the other by the authority scores.

This approach stems from a particular insight into the creation of web pages, that there are two primary kinds of web pages useful as results for broad-topic searches. By a broad topic search we mean an informational query such as "I wish to learn about leukemia". There are authoritative sources of information on the topic; in this case, the National Cancer Institute’s page on leukemia would be such a page. We will call such pages authorities; in the computation we are about to describe, they are the pages that will emerge with high authority scores.

On the other hand, there are many pages on the Web that are hand-compiled lists of links to authoritative web pages on a specific topic. These hub pages are not in themselves authoritative sources of topic-specific information, but rather compilations that someone with an interest in the topic has spent time putting together. The approach we will take, then, is to use these hub pages to discover the authority pages. In the computation we now develop, these hub pages are the pages that will emerge with high hub scores.

A good hub page is one that points to many good authorities; a good authority page is one that is pointed to by many good hub pages. We thus appear to have a circular definition of hubs and authorities; we will turn this into an iterative computation.

Choosing the subset of the Web

In assembling a subset of web pages around a topic such as leukemia, we must cope with the fact that good authority pages may not contain the specific query term leukemia. This is especially true, as we noted in Section 21.1.1, when an authority page uses its web presence to project a certain marketing image. For instance, many pages on the IBM website are authoritative sources of information on computer hardware, even though these pages may not contain the term computer or hardware. However, a hub compiling computer hardware resources is likely to use these terms and also link to the relevant pages on the IBM website.

Building on these observations, the following procedure has been suggested for compiling the subset of the Web for which to compute hub and authority scores:

  1. Given a query (say leukemia), use a text index to get all pages containing leukemia. Call this the root set of pages.
  2. Build the base set of pages, to include the root set as well as any page that either links to a page in the root set, or is linked to by a page in the root set.

We then use the base set for computing hub and authority scores. The base set is constructed in this manner for three reasons:

  1. A good authority page may not contain the query text (such as computer hardware).
  2. If the text query manages to capture a good hub page vh in the root set, then the inclusion of all pages linked to by any page in the root set will capture all the good authorities linked to by vh in the base set.
  3. Conversely, if the text query manages to capture a good authority page va in the root set, then the inclusion of pages which point to va will bring other good hubs into the base set. In other words, the “expansion” of the root set into the base set enriches the common pool of good hubs and authorities.

Any algorithm for computing eigenvectors may be used for computing the hub/authority score vector. In fact, we need not compute the exact values of these scores; it suffices to know the relative values of the scores so that we may identify the top hubs and authorities. To this end, it is possible that a small number of iterations of the power iteration method yields the relative ordering of the top hubs and authorities.
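
A minimal sketch of the iterative hub/authority computation on the base set's adjacency matrix (illustrative graph; the normalization only keeps scores bounded, since just the relative ordering matters):

```python
import numpy as np

def hits(adj, iters=50):
    """adj[i][j] = 1 if page i links to page j (both in the base set).
    Returns (hub_scores, authority_scores)."""
    A = np.array(adj, dtype=float)
    N = len(A)
    h = np.ones(N)
    a = np.ones(N)
    for _ in range(iters):
        a = A.T @ h                  # a good authority is pointed to by good hubs
        h = A @ a                    # a good hub points to good authorities
        a /= np.linalg.norm(a)       # rescale; only the relative ordering matters
        h /= np.linalg.norm(h)
    return h, a

adj = [[0, 1, 1],
       [0, 0, 1],
       [0, 0, 0]]
hubs, auths = hits(adj)
print(hubs, auths)
```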

# Lucene

References

  1. https://www.baeldung.com/lucene
  2. https://lucenetutorial.com/lucene-in-5-minutes.html
  3. https://github.com/eugenp/tutorials/blob/master/lucene/README.md

Lucene is a high-performance, full-text search library written in Java. It is used for enabling powerful search capabilities in various applications, including search engines, databases, and content management systems. Here's an overview of what Lucene does step by step:

  1. Document Indexing: Lucene begins its process by indexing a collection of documents. During this step, the text from the documents is analyzed and tokenized into individual terms. These terms are then stored in a data structure known as an inverted index.
  2. Inverted Index Creation: The inverted index is a data structure that maps terms to the documents that contain them. It allows for fast full-text searches, as each term is associated with a list of documents where the term appears.
  3. Text Searching: Once the index is created, Lucene enables users to perform various types of searches, including keyword searches, phrase searches, wildcard searches, and more. Users can construct queries using a flexible query language to retrieve relevant documents based on specific criteria.
  4. Scoring Relevance: Lucene scores the relevance of documents to a query based on factors such as term frequency, inverse document frequency, and field length normalization. This scoring mechanism helps rank the retrieved documents in order of their relevance to the search query.
  5. Ranking and Retrieval: Lucene provides ranked search results based on the relevance scores. The top-ranked documents that match the search query are returned, providing users with the most pertinent information based on their search terms.
  6. Boosting and Filtering: Lucene supports boosting, allowing users to assign higher importance to specific terms or documents during the search process, influencing the final ranking of search results. Additionally, Lucene provides filtering capabilities to narrow down search results based on specific criteria, refining the results even further.
  7. Sorting and Pagination: Lucene supports sorting and pagination, enabling users to define custom sorting rules and retrieve search results in a specific order. This feature is crucial for presenting search results in a user-friendly manner, especially when dealing with large datasets.
  8. Data Storage and Retrieval Efficiency: Lucene is designed to handle large-scale datasets efficiently. It uses data structures and algorithms optimized for fast information retrieval operations, making it capable of handling vast amounts of text-based data with high performance and reliability.
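
None of the following is Lucene's actual API; as a language-agnostic illustration of the indexing-and-search flow in steps 1–5 above, a toy inverted index might look like this:

```python
from collections import defaultdict

class ToyIndex:
    """Toy illustration of the indexing/search flow; not Lucene's API."""
    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of doc ids (inverted index)
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in text.lower().split():  # trivial "analysis": lowercase + whitespace split
            self.postings[term].add(doc_id)

    def search(self, query):
        """AND semantics: return ids of documents containing every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = self.postings[terms[0]].copy()
        for t in terms[1:]:
            result &= self.postings[t]
        return result

idx = ToyIndex()
idx.add(1, "Lucene is a full-text search library")
idx.add(2, "Solr is built on top of Lucene")
print(idx.search("lucene search"))   # {1}
```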

Lucene consists of several key components that work together to provide robust and efficient search capabilities. These components include:

  1. Document: The basic unit of information indexed and searched in Lucene. A document is a collection of fields, each containing textual or numerical data.
  2. Field: Represents a named and typed piece of data within a document. Fields can store various types of information, such as text, numbers, dates, and binary data.
  3. Analyzer: Responsible for processing the text of a document into tokens for indexing and searching. Analyzers include tokenization, filtering, and stemming, ensuring that text is indexed in a consistent and searchable format.
  4. IndexWriter: Manages the writing of documents to the index. It accepts documents and writes them to the index, creating or updating the inverted index for efficient searching.
  5. IndexReader: Provides read-only access to the index, allowing for search operations across indexed documents. It supports functions such as retrieving documents, term frequency calculations, and more.
  6. Query: Represents the search request made to the Lucene index. It is constructed using a query parser and can include various types of search criteria, such as keywords, phrases, wildcards, and more.
  7. IndexSearcher: Executes searches against the index based on the provided query. It retrieves relevant documents based on the search query and ranks them based on their relevance scores.
  8. Scorer: Calculates the relevance scores for documents based on the terms matched in the search query. It plays a crucial role in ranking the search results for retrieval.
  9. Similarity: Defines how the relevance scores are calculated for documents. Lucene provides several built-in similarity models, such as the TF-IDF (Term Frequency-Inverse Document Frequency) model, that influence the scoring of search results.
  10. Directory: Represents the storage abstraction for the Lucene index. It manages the persistence of the index on disk or in memory. Lucene provides different directory implementations, such as RAMDirectory, FSDirectory, and NIOFSDirectory, for different storage use cases.

# Solr

Solr’s basic unit of information is a document, which is a set of data that describes something. Documents are composed of fields, which are more specific pieces of information. Shoe size could be a field. First name and last name could be fields.

Obviously, the definition of fields is flexible (you could define a shoe size field as a text field rather than a floating point number, for example), but if you define your fields correctly, Solr will be able to interpret them correctly and your users will get better results when they perform a query.

When you add a document, Solr takes the information in the document’s fields and adds that information to an index. When you perform a query, Solr can quickly consult the index and return the matching documents.

Field Analysis

Field analysis tells Solr what to do with incoming data when building an index. Consider, for example, a biography field in a person document. Every word of the biography must be indexed so that you can quickly find people whose lives have had anything to do with ketchup, or dragonflies, or cryptography. However, a biography will likely contain lots of words you don’t care about and don’t want clogging up your index—words like "the", "a", "to", and so forth. Furthermore, suppose the biography contains the word "Ketchup", capitalized at the beginning of a sentence. If a user makes a query for "ketchup", you want Solr to tell you about the person even though the biography contains the capitalized word. The solution to both these problems is field analysis. For the biography field, you can tell Solr how to break apart the biography into words. You can tell Solr that you want to make all the words lower case, and you can tell Solr to remove accent marks.

Solr’s Schema File

Solr stores details about the field types and fields it is expected to understand in a schema file. This file is named either managed-schema.xml or schema.xml. The difference is determined by how you plan to manage Solr’s schema in your installation: either programmatically or by hand-editing.

An important fact about the schema is that it is a Solr concept. The actual data in your index is stored in Lucene, and Lucene does not have the concept of a schema. This means that changes to the schema file will not have any impact on data already stored in the index. In fact, changing the schema without reindexing your data can cause serious problems with the index, to the point where the only solution is to reindex your data entirely.

Solr Indexing

By adding content to an index, we make it searchable by Solr.

Here are the three most common ways of loading data into a Solr index:

  • Indexing with Solr Cell and Apache Tika, built on Apache Tika for ingesting binary files or structured files such as Office, Word, PDF, and other proprietary formats.
  • Uploading XML files by sending HTTP requests to the Solr server from any environment where such requests can be generated.
  • Writing a custom Java application to ingest data through Solr’s Java Client API (which is described in more detail in Client APIs). Using the Java API may be the best choice if you’re working with an application, such as a Content Management System (CMS), that offers a Java API.

Regardless of the method used to ingest data, there is a common basic data structure for data being fed into a Solr index: a document containing multiple fields, each with a name and containing content, which may be empty. One of the fields is usually designated as a unique ID field (analogous to a primary key in a database), although the use of a unique ID field is not strictly required by Solr.

If the field name is defined in the Schema that is associated with the index, then the analysis steps associated with that field will be applied to its content when the content is tokenized. Fields that are not explicitly defined in the Schema will either be ignored or mapped to a dynamic field definition, if one matching the field name exists.

Searching in Solr

When a user runs a search in Solr, the search query is processed by a request handler. A request handler is a Solr plug-in that defines the logic to be used when Solr processes a request.

Solr supports a variety of request handlers. Some are designed for processing search queries, while others manage tasks such as index replication.

To process a search query, a request handler calls a query parser, which interprets the terms and parameters of a query. Different query parsers support different syntax. Solr’s default query parser is known as the Standard Query Parser, or more commonly the "lucene" query parser.

Search parameters may also specify a filter query. As part of a search response, a filter query runs a query against the entire index and caches the results. Because Solr allocates a separate cache for filter queries, the strategic use of filter queries can improve search performance.

Despite their similar names, query filters are not related to analysis filters. Filter queries perform queries at search time against data already in the index, while analysis filters, such as tokenizers, parse content for indexing, following specified rules.

A search query can request that certain terms be highlighted in the search response; that is, the selected terms will be displayed in colored boxes so that they "jump out" on the screen of search results. Highlighting can make it easier to find relevant passages in long documents returned in a search. Solr supports multi-term highlighting. Solr includes a rich set of search parameters for controlling how terms are highlighted.

Search responses can also be configured to include snippets (document excerpts) featuring highlighted text.

To help users zero in on the content they’re looking for, Solr supports two special ways of grouping search results to aid further exploration: faceting and clustering.

Faceting

Faceting is the arrangement of search results into categories (which are based on indexed terms). Within each category, Solr reports on the number of hits for each relevant term, which is called a facet constraint. Faceting makes it easy for users to explore search results on sites such as movie sites and product review sites, where there are many categories and many items within a category.

Faceting makes use of fields defined when the search application was indexed. For a digital camera catalog, for example, these fields could include categories of information that are useful for describing the cameras: manufacturer, resolution, and zoom range.

Clustering

Clustering groups search results by similarities discovered when a search is executed, rather than when content is indexed. The results of clustering often lack the neat hierarchical organization found in faceted search results, but clustering can be useful nonetheless. It can reveal unexpected commonalities among search results, and it can help users rule out content that isn’t pertinent to what they’re really searching for.

Tutorials

To launch Solr, run: bin/solr start -e cloud

Because we are starting in SolrCloud mode, and did not define any details about an external ZooKeeper cluster, Solr launches its own ZooKeeper and connects both nodes to it.

A collection must have a configset, which at a minimum includes the two main configuration files for Solr: the schema file (named either managed-schema.xml or schema.xml), and solrconfig.xml.

Solr includes the bin/post tool in order to facilitate indexing various types of documents easily.

The documents we got back include all the fields for each document that were indexed. This is, again, default behavior. If you want to restrict the fields in the response, you can use the fl parameter, which takes a comma-separated list of field names. This is one of the available fields on the query form in the Admin UI.

Field Searches

All Solr queries look for documents using some field. Often you want to query across multiple fields at the same time, and this is what we’ve done so far with the "foundation" query.

Sometimes, though, you want to limit your query to a single field. This can make your queries more efficient and the results more relevant for users. Let’s say we want to find all the "electronics" products in the index. In the Query screen, enter "electronics".

This search finds all documents that contain the term "electronics" anywhere in the indexed fields. However, we can see from the above that there is a cat field (for "category"). If we limit our search to only documents with the category "electronics", the results will be more precise for our users.

Phrase Search

To search for a multi-term phrase, enclose it in double quotes: q="multiple terms here". For example, search for "CAS latency" by entering that phrase in quotes to the q box in the Admin UI.

Combining Searches

By default, when you search for multiple terms and/or phrases in a single query, Solr will only require that one of them is present in order for a document to match. Documents containing more terms will be sorted higher in the results list. You can require that a term or phrase is present by prefixing it with a + (plus); conversely, to disallow the presence of a term or phrase, prefix it with a - (minus). To find documents that contain both terms "electronics" and "music", enter +electronics +music in the q box .


Exercise 2: Index Films Data

Create a New Collection

We’re going to use a feature in Solr called "field guessing", where Solr attempts to guess what type of data is in a field while it’s indexing it. It also automatically creates new fields in the schema for new fields that appear in incoming documents. This mode is called "Schemaless".

What is a "schema" and why do I need one?

Solr’s schema is a single file (in XML) that stores the details about the fields and field types Solr is expected to understand. The schema defines not only the field or field type names, but also any modifications that should happen to a field before it is indexed.

For example, if you want to ensure that a user who enters "abc" and a user who enters "ABC" can both find a document containing the term "ABC", you will want to normalize (lower-case, in this case) "ABC" when it is indexed, and normalize the user query to be sure of a match.

These rules are defined in your schema. Earlier in the tutorial we mentioned copy fields, which are fields made up of data that originated from other fields. You can also define dynamic fields, which use wildcards (such as *_t or *_s) to dynamically create fields of a specific field type. These types of rules are also defined in the schema.

Preparing Schemaless for the Films Data

There are two parallel things happening with the schema that comes with the _default configset. First, we are using a "managed schema", which is configured to only be modified by Solr’s Schema API. That means we should not hand-edit it so there isn’t confusion about which edits come from which source. Solr’s Schema API allows us to make changes to fields, field types, and other types of schema rules.

Second, we are using "field guessing", which is configured in the solrconfig.xml file (and includes most of Solr’s various configuration settings). Field guessing is designed to allow us to start using Solr without having to define all the fields we think will be in our documents before trying to index them. This is why we call it "schemaless", because you can start quickly and let Solr create fields for you as it encounters them in documents.

Sounds great! Well, not really, there are limitations. It’s a bit brute force, and if it guesses wrong, you can’t change much about a field after data has been indexed without having to reindex. If we only have a few thousand documents that might not be bad, but if you have millions and millions of documents, or, worse, don’t have access to the original data anymore, this can be a real problem.

It is possible to mix schemaless features with a defined schema. Using the Schema API, you can define a few fields that you know you want to control, and let Solr guess others that are less important or which you are confident (through testing) will be guessed to your satisfaction.

The films data we are going to index has a small number of fields for each movie: an ID, director name(s), film name, release date, and genre(s).

If you look at one of the files in example/films, you’ll see the first film is named .45, released in 2006. As the first document in the dataset, Solr is going to guess the field type based on the data in the record. If we go ahead and index this data, that first film name is going to indicate to Solr that the field type is a "float" numeric field, and will create a "name" field with a type FloatPointField. All data after this record will be expected to be a float.

Well, that’s not going to work. We have titles like A Mighty Wind and Chicken Run, which are strings - decidedly not numeric and not floats. If we let Solr guess the "name" field is a float, what will happen is later titles will cause an error and indexing will fail. That’s not going to get us very far.

curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' http://localhost:8983/solr/films/schema

This command uses the Schema API to explicitly define a field named "name" that has the field type "text_general" (a text field). It will not be permitted to have multiple values.

In the first exercise when we queried the documents we had indexed, we didn’t have to specify a field to search because the configuration we used was set up to copy fields into a text field, and that field was the default when no other field was defined in the query.

The configuration we’re using now doesn’t have that rule. We would need to define a field to search for every query. We can, however, set up a "catchall field" by defining a copy field that will take all data from all fields and index it into a field named text.

Faceting

Field Facets

In addition to providing search results, a Solr query can return the number of documents that contain each unique value in the whole result set.

To see facet counts from all documents (`q=*:*`): turn on faceting (facet=true), and specify the field to facet on via the facet.field parameter. If you only want facets, and no document contents, specify rows=0. The curl command below will return facet counts for the genre_str field:
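
`curl "http://localhost:8983/solr/films/select?q=*:*&rows=0&facet=true&facet.field=genre_str"`

(The URL assumes the films collection created in the previous exercise; adjust the collection name for your setup.)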

Range Facets

For numerics or dates, it’s often desirable to partition the facet counts into ranges rather than discrete values. A prime example of numeric range faceting, using the example techproducts data from our previous exercise, is price. The films data includes the release date for films, and we could use that to create date range facets, which are another common use for range facets.

Pivot Facets

Another faceting type is pivot facets, also known as "decision trees", allowing two or more fields to be nested for all the various possible combinations. Using the films data, pivot facets can be used to see how many of the films in the "Drama" category (the genre_str field) are directed by each director.

References

  1. https://solr.apache.org/guide/solr/latest/getting-started/solr-tutorial.html