ASS-1

docx

School

Illinois Institute Of Technology *

*We aren’t endorsed by this school

Course

429

Subject

Computer Science

Date

Feb 20, 2024

Type

docx

Pages

11

Uploaded by BarristerBraveryFlamingo36

Report
Information Retrieval Assignment-1 A20545920 Satya Jaidev forecast Exercise 1.1 Draw the inverted index that would be built for the following document collection. Doc 1 new home sales top forecasts Doc 2 home sales rise in july Doc 3 increase in home sales in july Doc 4 july new home sales rise home 1 2 3 4 forecast s 1 in 2 increas e 3 july 2 3 new 1 rise 1 sales 1 2 3 4 top 1 Exercise 1.2 Consider these documents: Doc 1 breakthrough drug for schizophrenia Doc 2 new schizophrenia drug Doc 3 new approach for treatment of schizophrenia
Information Retrieval Assignment-1 A20545920 Satya Jaidev Doc 4 new hopes for schizophrenia patients a. Draw the term-document incidence matrix for this document collection. Term/ Document Doc1 Doc2 Doc3 Doc4 approach 0 0 1 0 breakthrough 1 0 0 0 drug 1 1 0 0 for 1 0 1 1 hopes 0 0 0 1 new 0 1 1 1 of 0 0 1 0 patients 0 0 0 1 schizophrenia 1 1 1 1 treatment 0 0 1 0 b. Draw the inverted index representation for this collection.
Information Retrieval Assignment-1 A20545920 Satya Jaidev Exercise 1.3 For the document collection shown in Exercise 1.2, what are the returned results for these queries: a. schizophrenia AND drug Doc1, Doc 2 b. for AND NOT(drug OR approach) Doc 4 Exercise 1.7 Recommend a query processing order for d. (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes) given the following postings list sizes: Term Postings size eyes 213312 kaleidoscope 87009 marmalade 107913 skies 271658 tangerine 46653 trees 316812 N(tangerine) + N(trees) = 363465 N(marmalade) + N(skies) = 379571 N(kaleidoscope) + N(eyes) = 300321 Order is (kaleidoscope OR eyes) AND (tangerine OR trees) AND (marmalade OR skies) Exercise 1.8 If the query is: friends AND romans AND (NOT countrymen)
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Information Retrieval Assignment-1 A20545920 Satya Jaidev how could we use the frequency of countrymen in evaluating the best query evaluation order? In particular, propose a way of handling negation in determining the order of query processing. For each of the n terms, get its postings, Process in the order of increasing frequency, start with smallest set and then keep cutting further.If countrymen is more frequent then it can be used to remove documents by where it does not exist. Count for word X in (documents where word X occurs) For Word X the count for !X in ((number of total documents)-(documents where word X occurs)). Exercise 2.1 Are the following statements true or false? a. In a Boolean retrieval system, stemming never lowers precision. False b. In a Boolean retrieval system, stemming never lowers recall. True c. Stemming increases the size of the vocabulary. False d. Stemming should be invoked at indexing time but not while processing a query. False Exercise 2.6 We have a two-word query. For one term the postings list consists of the following 16 entries: [4,6,10,12,14,16,18,20,22,32,47,81,120,122,157,180] and for the other it is the one entry postings list: [47]. Work out how many
Information Retrieval Assignment-1 A20545920 Satya Jaidev comparisons would be done to intersect the two postings lists with the following two strategies. Briefly justify your answers: a. Using standard postings lists Applying MERGE on the standard postings list, comparisons will be made unless either of the postings list end i.e. till we reach 47 in the upper postings list, after which the lower list ends and no more processing needs to be done. Number of comparisons = 11 b. Using postings lists stored with skip pointers, with a skip length of P, as suggested in Section 2.3. Using skip pointers of length 4 for the longer list and of length 1 for the shorter list, the following comparisons will be made: 1. 4 & 47 2. 14 & 47 3. 22 & 47 4. 120 & 47 5. 81 & 47 6. 47 & 47 Number of comparisons = 6 Exercise 2.7 Consider a postings intersection between this postings list, with skip pointers: 3 5 9 15 24 39 60 68 75 81 84 89 92 96 97 100 115 and the following intermediate result postings list (which hence has no skip pointers): 3 5 89 95 97 99 100 101 Trace through the postings intersection algorithm in Figure 2.10 (page 37). a. How often is a skip pointer followed (i.e., p1 is advanced to skip(p2))? The skip pointer is followed once. (from 24 to 75). b. How many postings comparisons will be made by this algorithm while intersecting the two lists? 19 comparisons are made. (Let (x,y) denote a posting compari- son The comparisons are :(3,3), (5,5), (9,89),(15,89),(24,89),(75,89),(75,89),
Information Retrieval Assignment-1 A20545920 Satya Jaidev (92,89),(81,89),(84,89), (89,89), (92,95),(115,95), (96,95), (96,97), (97,9), (100,99), (100 100), (115,101)) c. How many postings comparisons would be made if the postings lists are intersected without the use of skip pointers? 19 Exercise 2.9 Shown below is a portion of a positional index in the format: term: doc1: <position1, position2, . . .>; doc2: <position1, position2, . . .>; etc. Which document(s) if any meet each of the following queries, where each expression within quotes is a phrase query? a. “fools rush in” All three documents (2, 4, and 7). b. “fools rush in” AND “angels fear to tread” Document 4. Exercise 2.10 Consider the following fragment of a positional index with the format: word: document: <position, position, . . .>; document: >position, . . .> . . . The /k operator, word1 /k word2 finds occurrences of word1 within k words of word2 (on either side), where k is a positive integer argument. Thus k = 1 demands that word1 be adjacent to word2. a. Describe the set of documents that satisfy the query Gates /2 Microsoft.
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Information Retrieval Assignment-1 A20545920 Satya Jaidev document(1,3) b. Describe each set of values for k for which the query Gates /k Microsoft returns a different set of documents as the answer. {k=1} gives {3}; {2,3,4} gives {1,3}; {x: x ≥5) gives {1,2,3}. 2.1 Problem Load the 20newsgroups sample dataset into Python from the scikit-learn library. Using the initial list of document data (Hint: Make sure to set subset=’all’ and shuffle=False in order to retrieve the full dataset without randomized reordering), develop a function to tokenize each document into a list of constituent words (terms). Limit text processing to removal of punctuation and special characters, splitting the text using whitespace as a delimiter. from sklearn.datasets import fetch_20newsgroups import string newsgroups = fetch_20newsgroups(subset='all', shuffle=False) def tokenize_document(document): translator = str.maketrans('', '', string.punctuation) document = document.translate(translator) tokens = document.split() return tokens tokenized_documents = [tokenize_document(document) for document in newsgroups.data] print("Tokens of the first document:") print(tokenized_documents[0])
Information Retrieval Assignment-1 A20545920 Satya Jaidev 2.2 Problem Given the 20newsgroups sample dataset above that has split each document into a list of tokens, develop a function to create an inverted index of the documents/tokens. The resulting index should be a dict with words (terms) as keys and a list of documents (postings) as values (Hint: You can use the list index/array position of the original documents as identifier values) in increasing order value. def create_inverted_index(tokenized_documents): inverted_index = {} for doc_id, tokens in enumerate(tokenized_documents): for term in tokens: if term not in inverted_index: inverted_index[term] = [doc_id] else: inverted_index[term].append(doc_id) for term in inverted_index: inverted_index[term] = sorted(inverted_index[term]) return inverted_index
Information Retrieval Assignment-1 A20545920 Satya Jaidev inverted_index = create_inverted_index(tokenized_documents) print("Inverted Index for 'news':") print(inverted_index.get('news', [])) 2.3 Problem For the 20newsgroups sample dataset index, create two helper functions: a) a function which takes takes the intersect of two list values from the index and returns a list of unique document identifiers (use the O(m + n) algorithm in Figure 1.6 of IIR); and b) a function which performs a sort of a given list of words and a dict index, returning the words as a sorted list based on the number of postings for each word in the index. def intersect(list1, list2): i = 0 j = 0 intersection = [] while i < len(list1) and j < len(list2): if list1[i] == list2[j]: intersection.append(list1[i])
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Information Retrieval Assignment-1 A20545920 Satya Jaidev i += 1 j += 1 elif list1[i] < list2[j]: i += 1 else: j += 1 return intersection def sort_words_by_postings(words, index): sorted_words = sorted(words, key=lambda word: len(index[word])) return sorted_words word1 = 'author' word2 = 'subject' intersection_result = intersect(inverted_index.get(word1, []), inverted_index.get(word2, [])) print(f"Intersection of {word1} and {word2}: {intersection_result}") words_to_sort = ['subject', 'from', 'atheist', 'idiot'] sorted_words = sort_words_by_postings(words_to_sort, inverted_index) print(f"Sorted words based on postings: {sorted_words}") 2.4 Problem For the 20newsgroups sample dataset, finally develop a function which performs a search given a query string consisting of whitespace separated words and optional boolean AND terms. The search function should perform the following steps: a) tokenize the query (Problem 1), b) use the inverted index (Problem 2) to sort the query words based on the length of the postings in the index (Problem 3), and c) intersect the postings for each word (in sorted order) from the index (Problem 3) to produce the final query results as a list of document identifiers. Verify your search function using a sample query to ensure proper functionality. def search(query, inverted_index, tokenized_documents):
Information Retrieval Assignment-1 A20545920 Satya Jaidev query_tokens = [term.lower() for term in query.split()] sorted_query_words = sort_words_by_postings(query_tokens, inverted_index) result = inverted_index.get(sorted_query_words[0], []) if sorted_query_words else [] for term in sorted_query_words[1:]: term_postings = inverted_index.get(term, []) result = intersect(result, term_postings) return result query = "article on terrorist attacks" search_result = search(query, inverted_index, tokenized_documents) print(f"Search result for query '{query}': {search_result}")