Fall-2023-Midterm.ipynb - Colaboratory

School

University of California, Berkeley *

*We aren’t endorsed by this school

Course

MISC

Subject

Computer Science

Date

Dec 6, 2023

Uploaded by DeaconBat3708

10/23/23, 9:32 PM Fall-2023-Midterm.ipynb - Colaboratory https://colab.research.google.com/drive/181MUXb14gIk-63VRe1ayfyYvurN25a79?authuser=1#scrollTo=KrA7psDcusVo&printMode=true 1/5

Comp 122 - Fall 2023 Midterm: web crawler

1.1 Lists - Functions - Strings - Loops

This project aims to turn some of the magic of the search engine into something a bit more understandable. The biggest goal is to learn more about computer science. Computer science is about solving problems, like building a search engine, by breaking them into smaller pieces and then precisely and mechanically describing a sequence of steps that you can use to solve each part. And those steps can be executed by a computer.

For the search engine, the three main pieces are:

- Finding data by crawling web pages
- Building an index to respond quickly to search queries
- Ranking pages to get the best result for a given query

In this assignment, we will not get into everything that you need to build a search engine as powerful as Google, but we will cover the main ideas and learn a lot about computer science along the way.

What you need to know to complete this lab

- Lists are used to store multiple items in a single variable. https://docs.python.org/3/tutorial/datastructures.html
  Lists are one of 4 built-in data types in Python used to store collections of data; the other 3 are Tuple, Set, and Dictionary, all with different qualities and usage.

- A function is a block of code which only runs when it is called. You can pass data, known as parameters, into a function. A function can return data as a result.
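As a small illustration of the two ideas above, here is a minimal sketch of a list and a function working together. The names `count_items` and `pages` are made up for this example and are not part of the assignment.

```python
# Illustrative only: count_items and pages are hypothetical names,
# not something the assignment defines.

def count_items(items):
    """A function: takes a parameter (a list) and returns a result (its length)."""
    return len(items)


pages = ["crawl.html", "walk.html"]  # a list stores several items in one variable
pages.append("fly.html")             # lists can grow after they are created
first_page = pages[0]                # items are accessed by position, starting at 0
total = count_items(pages)           # the function only runs when it is called

print(first_page, total)
```

The function body does nothing until the call on the `total` line, which is the point of "only runs when it is called".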

- Strings in Python are surrounded by either single quotation marks or double quotation marks.
  Strings are Arrays
  Like arrays in many other popular programming languages, strings in Python are indexable sequences: each element is a Unicode character (code point), not a raw byte. However, Python does not have a character data type; a single character is simply a string with a length of 1. Square brackets [ ] can be used to access elements of the string. https://docs.python.org/3/tutorial/introduction.html#strings

- Python For Loops
  A for loop is used for iterating over a sequence or other iterable (a list, a tuple, a dictionary, a set, or a string). This is less like the for keyword in other programming languages, and works more like an iterator method as found in other object-oriented programming languages. With the for loop we can execute a set of statements once for each item in a list. https://docs.python.org/3/tutorial/controlflow.html?highlight=loop

Extracting / printing Links on 360adwords.com

requests module

    import requests

    # copied from hw 4
    def get_html(url):
        try:
            response = requests.get(url)
            response.raise_for_status()  # This will raise an exception for HTTP errors
            return response.text
        except requests.RequestException as e:
            print(f"Error fetching the URL: {e}")
            return None

    url = 'https://360adwords.com/'

1) Write a function num_of_urls. This function must return the number of URLs on the page.

    def num_of_urls():
        text = get_html(url)
        count = 0
        start_index = 0
        # Each pass jumps to the next '<a' in the page; find() returns -1 when none is left.
        # Note: '<a' also matches tags such as <abbr>, so this assumes the page has none.
        while True:
            start_index = text.find('<a', start_index + 1)
            if start_index == -1:
                break
            count += 1
        return count

    num_of_urls()

3

2) Write a function print_all_url. This function must keep going until there are no more URLs to print. Think about looping forever (set the while loop condition) until there are no more links (i.e. else:).

    def print_all_url():
        text = get_html(url)
        start_index = 0
        while(text.find('<a href', start_index) != -1):
            # Skip the 9 characters of '<a href="' to land on the first character of the URL.
            start_index = text.find('<a href', start_index) + 9
            # The URL ends at the closing '">'; this assumes href is the tag's last attribute.
            end_index = text.find('">', start_index)
            print(text[start_index: end_index])

    print_all_url()

http://360adwords.com/crawl.html
http://360adwords.com/walk.html
http://360adwords.com/fly.html

3) Write a function get_all_url. This function must store all the URLs from the page in a list. Your function should return a list of all the URLs.

    def get_all_url():
        text = get_html(url)
        start_index = 0
        url_list = []  # collects every URL found on the page


        while(text.find('<a href', start_index) != -1):
            # Same scan as print_all_url, but each URL is appended to the list instead of printed.
            start_index = text.find('<a href', start_index) + 9
            end_index = text.find('">', start_index)
            url_list.append(text[start_index: end_index])
        return url_list

    get_all_url()

[' http://360adwords.com/crawl.html ', ' http://360adwords.com/walk.html ', ' http://360adwords.com/fly.html ']

4) Write a Python function that determines the number of unique words in the page.

    def get_all_unique_words():
        text = get_html(url)
        start_index = text.find('<body>')
        word_set = set()  # a set keeps only one copy of each word
        while(text.find('>', start_index) != -1):
            # The text between the end of one tag ('>') and the start of the next ('<')
            # is treated as visible page content.
            start_index = text.find('>', start_index) + 1
            end_index = text.find('<', start_index)
            sentence = text[start_index: end_index].strip()
            if len(sentence) != 0:
                # Splits on single spaces only; words separated by newlines or tabs stay joined.
                word_set.update(sentence.split(' '))
        return len(word_set)

    get_all_unique_words()

23

Extra credit:

Compose a Python script to determine the number of pages within a given URL, the count of URLs on each page, and the count of distinct words on each page.
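One possible way to approach the extra credit is sketched below. It is not the notebook's solution: it swaps the `find`-based scanning used in questions 1 to 4 for the standard library's `html.parser`, and the names `PageParser`, `crawl`, `fetch`, `max_pages`, and `FAKE_SITE` are all assumptions made for this illustration. It also assumes "pages within a given URL" means pages on the same host that can be reached by following links from the start page. The fetcher is passed in as a parameter so the sketch can run against a tiny in-memory site instead of the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageParser(HTMLParser):
    """Hypothetical helper: collects <a href> values and words in visible text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = set()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href.strip())

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.words.update(data.split())  # split() handles spaces, tabs, newlines


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl of same-host pages reachable from start_url.

    fetch(url) is assumed to return the page HTML as a string, or None on failure.
    Returns {page_url: (number_of_links, number_of_distinct_words)}.
    """
    host = urlparse(start_url).netloc
    to_visit = [start_url]
    seen = {start_url}
    report = {}
    while to_visit and len(report) < max_pages:
        page_url = to_visit.pop(0)
        html = fetch(page_url)
        if html is None:
            continue  # skip pages that could not be fetched
        parser = PageParser()
        parser.feed(html)
        parser.close()
        report[page_url] = (len(parser.links), len(parser.words))
        for href in parser.links:
            absolute = urljoin(page_url, href)  # resolve relative links
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                to_visit.append(absolute)
    return report


# Made-up three-page site so the sketch runs without network access.
FAKE_SITE = {
    "http://example.test/": '<html><body><p>hello web</p>'
                            '<a href="a.html">A</a> <a href="b.html">B</a></body></html>',
    "http://example.test/a.html": '<html><body>hello again <a href="/">home</a></body></html>',
    "http://example.test/b.html": '<html><body>no links here</body></html>',
}

report = crawl("http://example.test/", FAKE_SITE.get)
for page_url, (n_links, n_words) in report.items():
    print(page_url, "links:", n_links, "distinct words:", n_words)
print("pages:", len(report))
```

To try it on the assignment's site, the notebook's own `get_html` could be passed as the fetcher, for example `crawl(url, get_html)`. Word counts from this sketch may differ from question 4, since it splits on any whitespace and counts link text as words.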
