Website Crawling

Search engines use software known as “web crawlers” (also called spiders) to find publicly available webpages; Googlebot is perhaps the best-known example. Crawlers examine the structure of each webpage, follow the links it contains, and return with the latest data about the pages they visit.
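
To make that idea concrete, here is a minimal sketch of the crawl loop in Python: start from a list of known pages, fetch each one, and queue any new links discovered on it. It is purely illustrative (real crawlers add politeness rules, robots.txt checks, and scheduling), and the function and class names are our own.

    # Minimal illustrative crawl loop: fetch pages, collect links, queue new URLs.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        """Collects the href value of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_urls, max_pages=50):
        """Breadth-first crawl: visit the seed URLs, then any links found on them."""
        queue = deque(seed_urls)
        seen = set(seed_urls)
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
            except (OSError, ValueError):
                continue  # dead or unreachable link: real crawlers record these and move on
            fetched += 1
            parser = LinkCollector()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return seen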

Most crawlers start with a list of web addresses from prior crawls and then check that list against the sitemaps provided by website owners. The crawler software is designed to identify new sites, changes to existing sites, and dead links.
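
For reference, a sitemap is typically an XML file at the root of the site (for example, https://www.example.com/sitemap.xml) listing the URLs the owner wants crawled. The URLs and dates below are made-up placeholders.

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.com/</loc>
        <lastmod>2024-01-15</lastmod>
      </url>
      <url>
        <loc>https://www.example.com/about</loc>
        <lastmod>2024-01-10</lastmod>
      </url>
    </urlset>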

Reasons for Website Crawling

You can think of the web as a giant library with billions of books (websites), but no centralized filing system. Search engines identify as many pages as possible during the crawl process and then create an index so individual webpages can easily be located.

A web index includes information about words and their location on webpages. When you type in words for a web search, the search engine's algorithms look up your search terms in the index to find the matching (most relevant) pages.
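
As a rough sketch of what “words and their location” means, the toy Python index below maps each word to the pages and positions where it appears. The structure is illustrative only and is not how any real search engine stores its index.

    # Toy inverted index: maps each word to the pages and word positions where it occurs.
    # Purely illustrative; production indexes add stemming, ranking data, compression, etc.
    from collections import defaultdict

    def build_index(pages):
        """pages: dict of {url: page text}. Returns {word: [(url, position), ...]}."""
        index = defaultdict(list)
        for url, text in pages.items():
            for position, word in enumerate(text.lower().split()):
                index[word].append((url, position))
        return index

    def search(index, query):
        """Return the URLs that contain every word in the query."""
        results = None
        for word in query.lower().split():
            urls = {url for url, _ in index.get(word, [])}
            results = urls if results is None else results & urls
        return results or set()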

The above description is, however, a gross simplification. The search process is really much more complex. When you search for “calico cats”, for example, you probably want pictures, videos or other related information about calico cats, not just a page with the words appearing multiple times.

Search engine indexing systems pay attention to several different aspects of a webpage, including when it was published, its meta tags, the words on the page, whether it contains pictures or videos, the “ranking” of the page, and a great deal more. All of these factors are weighed by the search engine's algorithm in deciding which pages “match” your search.
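
To picture what “weighing factors” means, the toy function below combines a few page signals into a single score. The signal names and weights are invented purely for illustration and bear no relation to Google's actual ranking algorithm.

    # Invented example: combine a few page signals into one relevance score.
    # The signals and weights are made up purely to illustrate weighting, nothing more.
    def score_page(term_matches, freshness, page_rank, has_media):
        weights = {"terms": 0.5, "freshness": 0.2, "rank": 0.2, "media": 0.1}
        return (weights["terms"] * term_matches
                + weights["freshness"] * freshness
                + weights["rank"] * page_rank
                + weights["media"] * (1.0 if has_media else 0.0))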

Website Owner Site Crawl Options

Website owners can choose to give site crawlers unfettered access to the website or set up specific restrictions for crawling, indexing, or serving. No restrictions means maximum exposure without any extra work; however, pages such as internal design or administrative pages shouldn't be seen by the general public. Site owners have many choices about how crawlers index their sites, through Webmaster Tools and a file called robots.txt.

Using the robots.txt file, website owners can choose not to be crawled at all, or they can provide more specific instructions about how their pages are to be crawled (such as adjusting the crawl rate). For example, you might choose to reduce the crawl rate to save bandwidth if your website is much busier than usual.
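
For illustration, a minimal robots.txt placed at the root of the site might block crawling of internal areas and ask crawlers to slow down. The paths below are placeholders, and note that support for the Crawl-delay directive varies: some crawlers (such as Bing's) honor it, while Googlebot does not and its crawl rate is managed through Google's webmaster tools instead.

    # Example robots.txt with placeholder paths, served at https://www.example.com/robots.txt
    User-agent: *
    Disallow: /admin/
    Disallow: /internal/

    # Honored by some crawlers (e.g. Bingbot) but ignored by Googlebot
    Crawl-delay: 10

    Sitemap: https://www.example.com/sitemap.xml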

Site owners/webmasters can also choose how content is indexed on each individual page. For example, they can choose to have their pages appear with or without a snippet (a summary of the page).
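
For example, adding the following robots meta tag to a page's <head> asks search engines not to show a text snippet for that page.

    <!-- Placed in the page's <head>; "nosnippet" asks search engines not to show a snippet -->
    <meta name="robots" content="nosnippet">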

Many search engines offer custom search functions within a specific website – in effect, a mini-search engine for your own website – and these within-site search engines can be customized. With Google's custom search engine, for example, you can choose to earn some income from the ads related to the searches, or pay as little as $100 a year to remove the advertising and Google branding so the search function appears as just another feature of your site. Read more about advertising in our paid section.

Crawl Optimization

Crawl optimization is about making your website quick and easy for Googlebot to crawl. That's because a larger “crawl budget” (the time, or number of pages, Google spends crawling a site) leads to higher page ranks and better SEO, and how recently Googlebot has crawled a page does affect its search ranking. It boils down to this: you want Google to be able to crawl as many of your website's pages as possible in as little time as possible.

Strongpages has years of experience with crawl optimization and can help you make the best decisions for your website. We provide a detailed initial report on the internal structure of your website and how to use that information to clean it up, resolving duplicate content, server redirects, and any number of other crawl errors. We then continuously monitor your website through regular crawls to make sure it stays clean and optimized as your business and website grow.