Fall-2023-Midterm.ipynb - Colaboratory

10/23/23, 9:32 PM Fall-2023-Midterm.ipynb - Colaboratory https://colab.research.google.com/drive/181MUXb14gIk-63VRe1ayfyYvurN25a79?authuser=1#scrollTo=KrA7psDcusVo&printMode=true 1/5

Comp 122 - Fall 2023 Midterm: web crawler

1.1 Lists - Functions - Strings - Loops

This project aims to make some of the magic of a search engine a bit more understandable. The main goal is to learn more about computer science. Computer science is about solving problems, like building a search engine, by breaking them into smaller pieces and then precisely and mechanically describing a sequence of steps that solves each piece. Those steps can then be executed by a computer.

For the search engine, the three main pieces are:

- Finding data by crawling web pages
- Building an index to respond quickly to search queries
- Ranking pages to get the best result for a given query

In this assignment, we will not get into everything that you need to build a search engine as powerful as Google, but we will cover the main ideas and learn a lot about computer science along the way.

What you need to know to complete this lab

- Lists are used to store multiple items in a single variable. https://docs.python.org/3/tutorial/datastructures.html
  Lists are one of four built-in data types in Python used to store collections of data; the other three are Tuple, Set, and Dictionary, each with different qualities and usage.

- A function is a block of code that runs only when it is called. You can pass data, known as parameters, into a function. A function can return data as a result.
- Strings in Python are surrounded by either single or double quotation marks.
  Strings are sequences. In Python 3, a string is an immutable sequence of Unicode characters (code points), not an array of bytes. Python does not have a separate character data type; a single character is simply a string of length 1. Square brackets [ ] can be used to access elements of the string. https://docs.python.org/3/tutorial/introduction.html#strings

- Python For Loops
  A for loop is used for iterating over an iterable such as a list, a tuple, a dictionary, a set, or a string. This is less like the for keyword in other programming languages, and works more like an iterator method as found in other object-oriented programming languages. With the for loop we can execute a set of statements once for each item in a list. https://docs.python.org/3/tutorial/controlflow.html?highlight=loop

Extracting / printing Links on 360adwords.com

requests module

    import requests

    # copied from hw 4
    def get_html(url):
        try:
            response = requests.get(url)
            response.raise_for_status()  # This will raise an exception for HTTP errors
            return response.text
        except requests.RequestException as e:
            print(f"Error fetching the URL: {e}")
            return None

    url = 'https://360adwords.com/'

1) Write a function num_of_urls. This function must return the number of URLs on the page.

    def num_of_urls():
        text = get_html(url)
        count = 0
        # Count every occurrence of '<a'. Note that this also matches other
        # tags whose names start with "a", such as <abbr> or <article>.
        start_index = text.find('<a')
        while start_index != -1:
            count += 1
            start_index = text.find('<a', start_index + 1)
        return count

    num_of_urls()

    3

2) Write a function print_all_url. This function must keep going until there are no more URLs to print. Think about looping forever (set the while loop condition) until there are no more links (i.e. else:).

    def print_all_url():
        text = get_html(url)
        start_index = 0
        while text.find('<a href', start_index) != -1:
            # Skip past '<a href="' (9 characters) to the first character of the URL
            start_index = text.find('<a href', start_index) + 9
            # Assumes href is the tag's last attribute, so the URL ends at '">'
            end_index = text.find('">', start_index)
            print(text[start_index:end_index])

    print_all_url()

    http://360adwords.com/crawl.html
    http://360adwords.com/walk.html
    http://360adwords.com/fly.html

3) Write a function get_all_url. This function must store all the URLs from the page in a list. Your function should return a list of all the URLs.

    def get_all_url():
        text = get_html(url)
        start_index = 0
        url_list = []
        while text.find('<a href', start_index) != -1:
            # Skip past '<a href="' (9 characters) to the first character of the URL
            start_index = text.find('<a href', start_index) + 9
            # Assumes href is the tag's last attribute, so the URL ends at '">'
            end_index = text.find('">', start_index)
            url_list.append(text[start_index:end_index])
        return url_list

    get_all_url()

    [' http://360adwords.com/crawl.html ', ' http://360adwords.com/walk.html ', ' http://360adwords.com/fly.html ']

4) Write a Python function that determines the number of unique words in the page.

    def get_all_unique_words():
        text = get_html(url)
        start_index = text.find('<body>')
        word_set = set()
        # Visible text sits between the '>' that closes one tag
        # and the '<' that opens the next one
        while text.find('>', start_index) != -1:
            start_index = text.find('>', start_index) + 1
            end_index = text.find('<', start_index)
            sentence = text[start_index:end_index].strip()
            if len(sentence) != 0:
                word_set.update(sentence.split(' '))
        return len(word_set)

    get_all_unique_words()

    23

Extra credit:
Compose a Python script to determine the number of pages within a given URL, the count of URLs on each page, and the count of distinct words on each page.
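
One possible way to approach the extra-credit task is sketched below. It is only an illustration under several assumptions, not the expected solution: it swaps the str.find scanning used above for the standard library's html.parser, it treats "pages within a given URL" as pages on the same host that are reachable by following links, and the names PageStats, page_stats, crawl, fetch and max_pages are all hypothetical. Unlike get_all_unique_words, it counts words from the whole document (including the title) and skips text inside script and style tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageStats(HTMLParser):
    """Collect href values and visible words from one HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = set()
        self._skip = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href.strip())

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Only text outside <script>/<style> counts as words
        if not self._skip:
            self.words.update(data.split())


def page_stats(html):
    """Return (list of hrefs, set of distinct words) for one page."""
    parser = PageStats()
    parser.feed(html)
    parser.close()
    return parser.links, parser.words


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl restricted to the host of start_url.

    fetch(url) must return the page's HTML as a string, or None on failure.
    max_pages is an assumed safety cap, not part of the assignment.
    Returns {url: (number_of_links, number_of_distinct_words)}.
    """
    host = urlparse(start_url).netloc
    to_visit = [start_url]
    seen = {start_url}
    report = {}
    while to_visit and len(report) < max_pages:
        page_url = to_visit.pop(0)
        html = fetch(page_url)
        if html is None:
            continue
        links, words = page_stats(html)
        report[page_url] = (len(links), len(words))
        for link in links:
            absolute = urljoin(page_url, link)  # resolve relative links
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                to_visit.append(absolute)
    return report
```

With the get_html function and url variable defined earlier, calling crawl(url, get_html) would return one entry per page visited, so len() of the result gives the number of pages. Passing the fetch function as a parameter keeps the crawl logic testable without network access.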