Code_indexer (docx)
School: University of the People
Course: 3305, Computer Science
Date: Nov 24, 2024
Pages: 7
Uploaded by benvan77
CS3308 Programming Assignment Unit 2

The Python code below implements a program that analyzes a directory of text documents. Its primary task is to build an inverted index, associating each term in the document collection with the list of documents that contain that term. The resulting inverted index is stored in two files: index.dat and documents.dat. The code proceeds as follows:

1. Import the necessary modules: sys, os, re, and time.
2. Establish and initialize global counters: 'tokens' (total number of tokens), 'documents' (number of documents), 'terms' (number of unique terms), 'termindex' (index for each unique term), and 'docindex' (index for each document).
3. Create and initialize two lists: 'alltokens', which will hold every token in the corpus, and 'alldocs', which will hold the names of all documents.
4. Record the current time for tracking purposes.
5. Specify the directory containing the documents and generate a list of all files in that directory.
6. For each file in the directory: increment the document count; open the file and read it into a string with line breaks removed; split the string into tokens, appending each token to 'alltokens' and incrementing the token count for each one.
7. Sort 'alldocs' and write it to documents.dat, with each document name followed by its index.
8. Sort 'alltokens', extract the unique terms into a list 'g', and record the count of unique terms.
9. Write the unique terms in 'g' to index.dat, with each term followed by its index.
10. Print the start and end times of processing, along with the counts of documents, tokens, and terms in the corpus.

import sys
import os
import re
import sqlite3
import time

# The database is a simple dictionary
database = {}

# Regular expressions for: extracting words, extracting an ID from a path, checking for a hex value
chars = re.compile(r'\W+')
pattid = re.compile(r'(\d{3})/(\d{3})/(\d{3})')

# Global variables for counting
tokens = 0
documents = 0
terms = 0

# Class definition for Term
class Term():
    termid = 0
    termfreq = 0
    docs = 0
    docids = {}

# Split on any non-word chars
def splitchars(line):
    return chars.split(line)

# Process the tokens of the source text
def parsetoken(line):
    global documents
    global tokens
    global terms
    line = line.replace('\t', ' ')
    line = line.strip()
    l = splitchars(line)
    for elmt in l:
        elmt = elmt.replace('\n', '')
        lowerElmt = elmt.lower().strip()
        tokens += 1
        if not (lowerElmt in database.keys()):
            terms += 1
            database[lowerElmt] = Term()
            database[lowerElmt].termid = terms
            database[lowerElmt].docids = dict()
            database[lowerElmt].docs = 0
        if not (documents in database[lowerElmt].docids.keys()):
            database[lowerElmt].docs += 1
            database[lowerElmt].docids[documents] = 0
        database[lowerElmt].docids[documents] += 1
        database[lowerElmt].termfreq += 1
    return l
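The indexing steps described above reduce to one core idea: map each lowercased term to the set of documents it occurs in, along with a per-document frequency. As a minimal, self-contained sketch (using a hypothetical two-document toy corpus, not the assignment's CACM collection):

import re

WORD = re.compile(r'\W+')

def build_index(docs):
    """Map each lowercased term to {doc_id: frequency of the term in that doc}."""
    index = {}
    for doc_id, text in enumerate(docs, start=1):
        for token in WORD.split(text):
            term = token.lower().strip()
            if not term:
                continue  # skip empty strings produced by the split
            postings = index.setdefault(term, {})
            postings[doc_id] = postings.get(doc_id, 0) + 1
    return index

corpus = ["The cat sat", "the dog and the cat"]
index = build_index(corpus)
print(index["the"])  # {1: 1, 2: 2} -- "the" appears once in doc 1, twice in doc 2
print(index["dog"])  # {2: 1}

This is the same structure the assignment code builds in its 'database' dictionary, with a Term object playing the role of the inner postings dict.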
# Open and read the file line by line, parsing for tokens and processing
def process(filename):
    try:
        file = open(filename, 'r')
    except IOError:
        print("Error in file %s" % filename)
        return False
    else:
        for l in file.readlines():
            parsetoken(l)
        file.close()

# Recursive function to walk through directories
def walkdir(cur, dirname):
    global documents
    all_files = [f for f in os.listdir(dirname)
                 if os.path.isdir(os.path.join(dirname, f)) or os.path.isfile(os.path.join(dirname, f))]
    for f in all_files:
        if os.path.isdir(os.path.join(dirname, f)):
            walkdir(cur, os.path.join(dirname, f))
        else:
            documents += 1
            cur.execute("insert into DocumentDictionary values (?, ?)",
                        (os.path.join(dirname, f), documents))
            process(os.path.join(dirname, f))
    return True
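As an aside, the recursive walkdir above can also be expressed with the standard library's os.walk, which yields each directory with its subdirectories and files and handles the recursion itself. A runnable sketch over a temporary directory (the file names here are made up for the demonstration):

import os
import tempfile

def list_files(root):
    """Collect the full path of every file under root, directory by directory."""
    paths = []
    for dirname, _subdirs, files in os.walk(root):
        for f in sorted(files):
            paths.append(os.path.join(dirname, f))
    return paths

with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "a.txt"), "w").close()
    os.mkdir(os.path.join(root, "sub"))
    open(os.path.join(root, "sub", "b.txt"), "w").close()
    found = list_files(root)
    print([os.path.relpath(p, root) for p in found])

Each path yielded this way could be fed to process() exactly as walkdir does.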
# Function to get a cursor for SQLite
def get_cursor():
    conn = sqlite3.connect("indexer.db")
    return conn.cursor()

# Function to select and print all records from the DocumentDictionary table
def select_all_records_by_author(cursor):
    sql = "SELECT * FROM DocumentDictionary"
    cursor.execute(sql)
    print(cursor.fetchall())  # or use fetchone()
    print("\nHere is a listing of the rows in the table DocumentDictionary\n")
    for row in cursor.execute("SELECT rowid, * FROM DocumentDictionary"):
        print(row)

# Main function
if __name__ == '__main__':
    t2 = time.localtime()
    print("Start Time: %.2d:%.2d" % (t2.tm_hour, t2.tm_min))

    # Modify the folder variable to point to the directory where the CACM corpus is located
    folder = r"C:\Users\benva\Desktop\CS3308\cacm"  # PATH TO BE EDITED

    # Create SQLite database
    con = sqlite3.connect("indexer_part2.db")
    con.isolation_level = None
    cur = con.cursor()

    # Create tables and indexes
    cur.execute("drop table if exists DocumentDictionary")
    cur.execute("drop index if exists idxDocumentDictionary")
    cur.execute("create table if not exists DocumentDictionary (DocumentName text, DocId int)")
    cur.execute("create index if not exists idxDocumentDictionary on DocumentDictionary (DocId)")
    cur.execute("drop table if exists TermDictionary")
    cur.execute("drop index if exists idxTermDictionary")
    cur.execute("create table if not exists TermDictionary (Term text, TermId int)")
    cur.execute("create index if not exists idxTermDictionary on TermDictionary (TermId)")
    cur.execute("drop table if exists Posting")
    cur.execute("drop index if exists idxPosting1")
    cur.execute("drop index if exists idxPosting2")
    cur.execute("create table if not exists Posting (TermId int, DocId int, tfidf real, docfreq int, termfreq int)")
    cur.execute("create index if not exists idxPosting1 on Posting (TermId)")
    cur.execute("create index if not exists idxPosting2 on Posting (DocId)")

    # Index the corpus
    walkdir(cur, folder)

    t2 = time.localtime()
    print("Indexing Complete, write to disk: %.2d:%.2d" % (t2.tm_hour, t2.tm_min))

    # Print the content of the TermDictionary table
    print("The content of TermDictionary table is as follows:")
    cur.execute("select * from TermDictionary")
    print(cur.fetchall())

    # Commit changes to the database and close the connection
    con.commit()
    con.close()

    # Print processing statistics
    print("Documents %i" % documents)
    print("Terms %i" % terms)
    print("Tokens %i" % tokens)
    t2 = time.localtime()
    print("End Time: %.2d:%.2d" % (t2.tm_hour, t2.tm_min))
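Once indexing has populated the DocumentDictionary table, a search program would read it back to turn posting-list document IDs into file names. A self-contained sketch of that lookup (using an in-memory database and made-up file names in place of indexer_part2.db and the real CACM paths):

import sqlite3

# In-memory database stands in for indexer_part2.db so the sketch is runnable.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table DocumentDictionary (DocumentName text, DocId int)")
cur.execute("insert into DocumentDictionary values (?, ?)", ("cacm/CACM-0001.html", 1))
cur.execute("insert into DocumentDictionary values (?, ?)", ("cacm/CACM-0002.html", 2))
con.commit()

# Look up the file name behind a document id, as the retrieval step would.
cur.execute("select DocumentName from DocumentDictionary where DocId = ?", (2,))
row = cur.fetchone()
print(row[0])  # cacm/CACM-0002.html
con.close()

The parameter-substitution style (? placeholders) mirrors the insert statement used in walkdir above.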