What is a Web Crawler?
A web crawler is an automated script that browses the web in a methodical, automated manner. It is also called a web spider or a bot.
Why Web Crawlers?
A web crawler visits websites and fetches their pages and other information to create entries for a search index. Web crawling is especially used to discover publicly available web pages.
Googlebot is the most well-known web crawler. Googlebot reads your web page line by line and follows the links on those pages. It visits each link it finds and fetches the pages into Google's servers for indexing.
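To make the idea concrete, here is a rough sketch in Python of what a crawler does on a single page: fetch the HTML, collect the links, and resolve them into full URLs to visit next. The URL https://example.com/ is only a placeholder, and this is not how Googlebot itself is implemented.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    # Collects the href value of every <a> tag found in the page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

url = "https://example.com/"  # placeholder start page
html = urlopen(url).read().decode("utf-8", errors="ignore")

collector = LinkCollector()
collector.feed(html)

# Resolve relative links against the page URL, as a crawler would
# before adding them to its queue of pages to visit next.
for link in collector.links:
    print(urljoin(url, link))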
Crawling the Web:
The crawler decides which pages of a website to crawl based on past crawls and on the sitemap provided by the store/site owner. Store owners have several ways to control how crawlers handle their sites. For Googlebot, crawling can be managed with the help of Google Search Console (formerly Google Webmaster Tools) and the robots.txt file.
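As a rough illustration, the snippet below shows how a crawler might seed its list of pages from a sitemap. The sitemap URL is a placeholder and a standard sitemap.xml format is assumed; real crawlers usually discover sitemaps from robots.txt or from submissions in Google Search Console.

import xml.etree.ElementTree as ET
from urllib.request import urlopen

# Placeholder sitemap location for this example.
sitemap_url = "https://example.com/sitemap.xml"
xml_data = urlopen(sitemap_url).read()

# Standard sitemap files use this XML namespace; each <loc> element
# holds one URL that the site owner wants crawlers to know about.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(xml_data)
urls_to_crawl = [loc.text for loc in root.findall(".//sm:loc", ns)]

for page_url in urls_to_crawl:
    print(page_url)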
What is a Robots.txt File?
The store/site owner can restrict crawling through the robots.txt file. robots.txt is just a simple text file that lives in the web root folder. Before crawling a site, well-behaved web robots read the robots.txt file and follow its rules when visiting pages and links. With the help of the robots.txt file, we can control which files and directories the crawler is allowed to visit, so it can keep certain paths from being crawled. This also helps avoid indexing duplicate pages, which is useful for SEO. We can configure a crawl delay in the robots.txt file, which stops the crawler from requesting pages too frequently and reduces the bandwidth used on the web server. In the robots.txt file we can also allow or disallow specific bots, which helps keep bad/spam crawlers away from the site, although compliance is voluntary and malicious crawlers may simply ignore the file. A short example follows.
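Here is a small sketch of how a well-behaved bot could interpret a robots.txt file, using Python's built-in urllib.robotparser module. The rules in this example are made up for illustration only: they block all bots from the /checkout/ and /cart/ folders, set a crawl delay of 10 seconds, and block a bot named BadBot entirely.

import urllib.robotparser

# Example robots.txt content, invented for illustration.
robots_txt = """\
User-agent: *
Disallow: /checkout/
Disallow: /cart/
Crawl-delay: 10

User-agent: BadBot
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# Any bot may fetch product pages, but the checkout pages are off limits.
print(parser.can_fetch("*", "https://example.com/product/123"))   # True
print(parser.can_fetch("*", "https://example.com/checkout/pay"))  # False

# Polite bots wait this many seconds between requests to the site.
print(parser.crawl_delay("*"))  # 10

# BadBot is disallowed everywhere, so nothing on the site is crawlable for it.
print(parser.can_fetch("BadBot", "https://example.com/"))  # False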
In the next post, we will see how to generate the robots.txt file and how to use it to restrict specific crawlers.