Code_indexer (docx)
School: University of the People
Course: 3305, Computer Science
Date: Nov 24, 2024
Pages: 7
Uploaded by benvan77
CS3308 Programming Assignment Unit 2

The Python code below implements a program that analyzes a directory of text documents. Its primary task is to build an inverted index, associating each term in the document collection with the list of documents that contain that term. The resulting inverted index is stored in two files: index.dat and documents.dat. The code proceeds as follows:

1. Import the necessary modules: sys, os, re, and time.
2. Establish and initialize global counters: 'tokens' (total number of tokens), 'documents' (number of documents), 'terms' (number of unique terms), 'termindex' (index for each unique term), and 'docindex' (index for each document).
3. Create and initialize two lists: 'alltokens', which will hold every token in the corpus, and 'alldocs', which will hold the names of all documents.
4. Record the current time for tracking purposes.
5. Specify the directory containing the documents and generate a list of all files in that directory.
6. For each file in the directory: increment the document count; open the file and read it into a string with line breaks removed; split the string into tokens, appending each token to 'alltokens' and incrementing the token count for each one.
7. Sort 'alldocs' and write it to documents.dat, with each document name followed by its index.
8. Sort 'alltokens', extract the unique terms into a list 'g', and record the count of unique terms.
9. Write the unique terms in 'g' to index.dat, with each term followed by its index.
10. Print the start and end times of processing, along with the counts of documents, tokens, and terms in the corpus.

import sys
import os
import re
import sqlite3
import time

# The database is a simple dictionary
database = {}

# Regular expressions for: extracting words, extracting an ID from a path, checking for a hex value
chars = re.compile(r'\W+')
pattid = re.compile(r'(\d{3})/(\d{3})/(\d{3})')

# Global variables for counting
tokens = 0
documents = 0
terms = 0

# Class definition for Term
class Term():
    termid = 0
    termfreq = 0
    docs = 0
    docids = {}

# Split on any non-word chars
def splitchars(line):
    return chars.split(line)

# Process the tokens of the source text
def parsetoken(line):
    global documents
    global tokens
    global terms
    line = line.replace('\t', ' ')
    line = line.strip()
    l = splitchars(line)
    for elmt in l:
        elmt = elmt.replace('\n', '')
        lowerElmt = elmt.lower().strip()
        tokens += 1
        if not (lowerElmt in database.keys()):
            terms += 1
            database[lowerElmt] = Term()
            database[lowerElmt].termid = terms
            database[lowerElmt].docids = dict()
            database[lowerElmt].docs = 0
        if not (documents in database[lowerElmt].docids.keys()):
            database[lowerElmt].docs += 1
            database[lowerElmt].docids[documents] = 0
        database[lowerElmt].docids[documents] += 1
        database[lowerElmt].termfreq += 1
    return l
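The indexing steps described above reduce to one core idea: map each lowercased term to the set of documents it occurs in, along with a per-document frequency. As a minimal, self-contained sketch (using a hypothetical two-document toy corpus, not the assignment's CACM collection):

import re

WORD = re.compile(r'\W+')

def build_index(docs):
    """Map each lowercased term to {doc_id: frequency of the term in that doc}."""
    index = {}
    for doc_id, text in enumerate(docs, start=1):
        for token in WORD.split(text):
            term = token.lower().strip()
            if not term:
                continue  # skip empty strings produced by the split
            postings = index.setdefault(term, {})
            postings[doc_id] = postings.get(doc_id, 0) + 1
    return index

corpus = ["The cat sat", "the dog and the cat"]
index = build_index(corpus)
print(index["the"])  # {1: 1, 2: 2} -- "the" appears once in doc 1, twice in doc 2
print(index["dog"])  # {2: 1}

This is the same structure the assignment code builds in its 'database' dictionary, with a Term object playing the role of the inner postings dict.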
# Open and read the file line by line, parsing for tokens and processing
def process(filename):
    try:
        file = open(filename, 'r')
    except IOError:
        print("Error in file %s" % filename)
        return False
    else:
        for l in file.readlines():
            parsetoken(l)
        file.close()

# Recursive function to walk through directories
def walkdir(cur, dirname):
    global documents
    all_files = [f for f in os.listdir(dirname)
                 if os.path.isdir(os.path.join(dirname, f)) or os.path.isfile(os.path.join(dirname, f))]
    for f in all_files:
        if os.path.isdir(os.path.join(dirname, f)):
            walkdir(cur, os.path.join(dirname, f))
        else:
            documents += 1
            cur.execute("insert into DocumentDictionary values (?, ?)",
                        (os.path.join(dirname, f), documents))
            process(os.path.join(dirname, f))
    return True
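As an aside, the recursive walkdir above can also be expressed with the standard library's os.walk, which yields each directory with its subdirectories and files and handles the recursion itself. A runnable sketch over a temporary directory (the file names here are made up for the demonstration):

import os
import tempfile

def list_files(root):
    """Collect the full path of every file under root, directory by directory."""
    paths = []
    for dirname, _subdirs, files in os.walk(root):
        for f in sorted(files):
            paths.append(os.path.join(dirname, f))
    return paths

with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "a.txt"), "w").close()
    os.mkdir(os.path.join(root, "sub"))
    open(os.path.join(root, "sub", "b.txt"), "w").close()
    found = list_files(root)
    print([os.path.relpath(p, root) for p in found])

Each path yielded this way could be fed to process() exactly as walkdir does.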
# Function to get a cursor for SQLite
def get_cursor():
    conn = sqlite3.connect("indexer.db")
    return conn.cursor()

# Function to select and print all records from the DocumentDictionary table
def select_all_records_by_author(cursor):
    sql = "SELECT * FROM DocumentDictionary"
    cursor.execute(sql)
    print(cursor.fetchall())  # or use fetchone()
    print("\nHere is a listing of the rows in the table DocumentDictionary\n")
    for row in cursor.execute("SELECT rowid, * FROM DocumentDictionary"):
        print(row)

# Main function
if __name__ == '__main__':
    t2 = time.localtime()
    print("Start Time: %.2d:%.2d" % (t2.tm_hour, t2.tm_min))

    # Modify the folder variable to point to the directory where the CACM corpus is located
    folder = r"C:\Users\benva\Desktop\CS3308\cacm"  # PATH TO BE EDITED

    # Create SQLite database
    con = sqlite3.connect("indexer_part2.db")
    con.isolation_level = None
    cur = con.cursor()

    # Create tables and indexes
    cur.execute("drop table if exists DocumentDictionary")
    cur.execute("drop index if exists idxDocumentDictionary")
    cur.execute("create table if not exists DocumentDictionary (DocumentName text, DocId int)")
    cur.execute("create index if not exists idxDocumentDictionary on DocumentDictionary (DocId)")
    cur.execute("drop table if exists TermDictionary")
    cur.execute("drop index if exists idxTermDictionary")
    cur.execute("create table if not exists TermDictionary (Term text, TermId int)")
    cur.execute("create index if not exists idxTermDictionary on TermDictionary (TermId)")
    cur.execute("drop table if exists Posting")
    cur.execute("drop index if exists idxPosting1")
    cur.execute("drop index if exists idxPosting2")
    cur.execute("create table if not exists Posting (TermId int, DocId int, tfidf real, docfreq int, termfreq int)")
    cur.execute("create index if not exists idxPosting1 on Posting (TermId)")
    cur.execute("create index if not exists idxPosting2 on Posting (DocId)")

    # Index the corpus
    walkdir(cur, folder)

    t2 = time.localtime()
    print("Indexing Complete, write to disk: %.2d:%.2d" % (t2.tm_hour, t2.tm_min))

    # Print the content of the TermDictionary table
    print("The content of TermDictionary table is as follows:")
    cur.execute("select * from TermDictionary")
    print(cur.fetchall())

    # Commit changes to the database and close the connection
    con.commit()
    con.close()

    # Print processing statistics
    print("Documents %i" % documents)
    print("Terms %i" % terms)
    print("Tokens %i" % tokens)
    t2 = time.localtime()
    print("End Time: %.2d:%.2d" % (t2.tm_hour, t2.tm_min))
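Once indexing has populated the DocumentDictionary table, a search program would read it back to turn posting-list document IDs into file names. A self-contained sketch of that lookup (using an in-memory database and made-up file names in place of indexer_part2.db and the real CACM paths):

import sqlite3

# In-memory database stands in for indexer_part2.db so the sketch is runnable.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table DocumentDictionary (DocumentName text, DocId int)")
cur.execute("insert into DocumentDictionary values (?, ?)", ("cacm/CACM-0001.html", 1))
cur.execute("insert into DocumentDictionary values (?, ?)", ("cacm/CACM-0002.html", 2))
con.commit()

# Look up the file name behind a document id, as the retrieval step would.
cur.execute("select DocumentName from DocumentDictionary where DocId = ?", (2,))
row = cur.fetchone()
print(row[0])  # cacm/CACM-0002.html
con.close()

The parameter-substitution style (? placeholders) mirrors the insert statement used in walkdir above.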