Crawler is a pilot project designed to crawl BD sites only. First, the Initiate crawler module provides the current status of the URL queue to the Schedule policy module and receives a policy map and a crawling threshold number. Until a threshold-reached flag is received from the Fetch URL module, Crawler repeatedly receives a link from the Fetch URL module. It then retrieves a new page by providing the page link to the Fetch site page module. This page is sent to the Extract URL module, which returns a set of newly extracted links to Crawler. To do that, the Extract URL module first generates raw links using the Parse raw links module and sends them to the Filter valid links module, which returns only the URLs that Crawler should crawl in the future. After receiving the filtered links, the Extract URL module formats them using a library module called Format link and finally sends these filtered, formatted links to Crawler. Based on the policy map, Crawler either adds the links to the Update URL queue module, which is an off-page connector, or dispatches them to the Reschedule policy module, which is an on-page connector.
Design a structure chart based on the above information.
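As an aid to drawing the chart, the call hierarchy described above can be sketched in code, where each function stands for one module and each call for one edge of the structure chart. This is only an illustrative sketch under stated assumptions: the function names, the in-memory `pages` dictionary standing in for real page fetches, the fixed threshold of 2, the `"action"` key of the policy map, and the rule that a "BD site" is a host ending in `.bd` are all hypothetical and not given in the problem statement.

```python
import re
from collections import deque
from urllib.parse import urlparse, urldefrag

def schedule_policy(queue_status):
    # Assumption: returns a policy map and a crawling threshold number.
    return {"action": "update_queue"}, 2

def fetch_url(queue, crawled_count, threshold):
    # Returns (link, threshold_reached). Assumption: an empty queue
    # is also reported as "threshold reached" so the loop can stop.
    if crawled_count >= threshold or not queue:
        return None, True
    return queue.popleft(), False

def fetch_site_page(link, pages):
    # Assumption: pages are looked up in a dict instead of the network.
    return pages.get(link, "")

def parse_raw_links(page):
    return re.findall(r'href="([^"]+)"', page)

def filter_valid_links(raw_links):
    # Assumption: only hosts ending in ".bd" should be crawled later.
    return [l for l in raw_links
            if (urlparse(l).hostname or "").endswith(".bd")]

def format_link(link):
    # Library module: drop the fragment and any trailing slash.
    return urldefrag(link).url.rstrip("/")

def extract_url(page):
    raw = parse_raw_links(page)
    valid = filter_valid_links(raw)
    return [format_link(l) for l in valid]

def update_url_queue(queue, seen, links):
    # Off-page connector in the chart; here it appends unseen links.
    for l in links:
        if l not in seen:
            seen.add(l)
            queue.append(l)

def reschedule_policy(policy_map, links):
    # On-page connector in the chart; left as a stub in this sketch.
    return policy_map

def crawler(seed_links, pages):
    queue, seen = deque(seed_links), set(seed_links)
    # "Initiate crawler": pass queue status, receive policy map and threshold.
    policy_map, threshold = schedule_policy(len(queue))
    crawled = []
    while True:
        link, reached = fetch_url(queue, len(crawled), threshold)
        if reached:
            break
        page = fetch_site_page(link, pages)
        crawled.append(link)
        links = extract_url(page)
        if policy_map["action"] == "update_queue":
            update_url_queue(queue, seen, links)
        else:
            policy_map = reschedule_policy(policy_map, links)
    return crawled, list(queue)

if __name__ == "__main__":
    demo_pages = {
        "http://a.com.bd": '<a href="http://b.com.bd/#top">b</a>',
    }
    print(crawler(["http://a.com.bd"], demo_pages))
```

Read top-down, `crawler` is the root of the chart; `schedule_policy`, `fetch_url`, `fetch_site_page`, `extract_url`, `update_url_queue` and `reschedule_policy` are its direct subordinates; and `parse_raw_links`, `filter_valid_links` and `format_link` sit under `extract_url`. The `while` loop corresponds to the repetition symbol on the Crawler box, and the `if`/`else` on the policy map corresponds to the decision (diamond) symbol.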