Guide to crawling, the basis of Google and SEO

It all starts here: without crawling, there would be no search engines as we know them, and therefore no ranking and, clearly, no SEO. In short, the functioning of the Web (and our work to gain online visibility) is based on crawling and on the visits made by bots. This alone should make us realize the importance of having at least a basic understanding of the topic, which is what we try to provide with this guide to crawling for SEO.

What is crawling for search engines

Essentially, crawling is the discovery process during which search engines send a team of robots, called crawlers or spiders, into the Web to find new and updated content, which will then be added to the various search engine indexes.

The type of content is broad and can vary (a Web page, an image, a video, a PDF, and so on), but regardless of the format, content is discovered through links, whether they are on already known pages or in sitemaps that a site provides directly.

In technical terms, crawling identifies the entire process of accessing a website and retrieving its data by means of a computer program. The work is done by bots (usually known as crawlers or spiders because, like spiders, they follow the threads of links that make up the Web) that automatically discover or update web pages on behalf of the search engine.

As we said, this step is essential for every single Web site: if our content is not crawled, we have no chance of gaining real visibility on search engines, starting with Google.

Crawling: what it is and how it works for Google

For Google specifically, crawling is the search engine’s way of figuring out what pages exist on the Web: there is no central registry of all Web pages, so Google must constantly search for new and updated pages in order to add them to its list of known pages.

Spiders and crawling - from Moz

The crawling process begins with a list of URLs from previous crawls and sitemaps provided by site owners. Google uses web crawlers, and specifically Googlebot (the name of the program that performs the retrieval, running on a huge number of computers that crawl billions of pages on the web), to visit these addresses, read the information they contain, and follow the links on those pages.

The crawlers revisit the pages already in the list to see if they have changed and also crawl the newly detected pages. During this process, crawlers have to make important decisions, such as prioritizing when and what to crawl, while making sure that the website can handle the server requests made by Google.

More specifically, in the crawling phase, Googlebot retrieves some publicly accessible Web pages, then follows the links there to find new URLs; by jumping along this path of links, the crawler is able to find new content and add it to the Index, which we know is a huge database of discovered URLs. From this database (but here we are already at the later stages of Search), URLs are retrieved when a user searches for information to which their content provides a relevant answer.

Crawling is also called “URL discovery,” indicating precisely how Google discovers new information to add to its catalog. Usually, Google finds a new Web site by following links from one Web site to another, as mentioned: just as we users do when we explore content on the Web, crawlers go from page to page and store information about what they find on those pages and other publicly accessible content, which ends up in the Google Search index.

Some pages are known because Google has already visited them, other pages are discovered when Googlebot follows a link to them (e.g., a hub page, such as a category page, links to a new blog post), and still others are discovered when we send Google a sitemap for crawling.

Either way, when Googlebot finds the URL of a page it may visit or “crawl” the page to discover its contents. It is important to understand, in fact, that Googlebot does not crawl all the pages it has detected, partly because some pages may not be authorized for crawling by the site owner, while others may not be accessible without being logged into the site.

During the crawl, Google renders the page and executes any JavaScript code it detects using a recent version of Chrome, similar to what a common browser does when displaying the page we visit. Rendering is important, the official guide tells us, because websites often rely on JavaScript to display content on the page, and without rendering, Google may not see this content.

Crawling for Google: frequency, speed and budget

Googlebot uses an algorithmic process to determine which sites to crawl, how often to do so, and how many pages to retrieve from each site. Google’s crawlers are also programmed to try not to crawl the site too quickly to avoid overloading it. This mechanism is based on site responses (e.g., HTTP 500 errors mean “slow down”) and settings in Search Console.

Successfully crawled pages are processed and forwarded to Google indexing to prepare the content for publication in the search results; the search engine’s systems view the page content as the browser would and take note of key signals, from keywords to website updates, storing all this information in the search index.

Because the Web and other content are constantly changing, Google’s crawling processes run constantly to keep up: they learn how often content that has already been examined changes and crawl it again as necessary, and they also discover new content as new links to those pages or information appear.

As the reference guide also makes clear, Google never accepts payment for crawling a site more frequently, true to its promise to provide the same tools to all websites to ensure the best possible results for users.

In addition, Google is very careful not to overload a site’s servers, so the frequency of crawls depends on three factors:

  • Crawl rate: maximum number of simultaneous connections a crawler can use to crawl a site.
  • Crawl demand: how much content is desired by Google.
  • Crawl budget: number of URLs that Google can and wants to crawl.

There are also three common problems that can prevent Googlebot from accessing and crawling a site:

  • Problems with the server running the site
  • Network problems
  • Rules in the robots.txt file that prevent page access by Googlebot

As we will see in more detail, the set of tools in Search Console is there for content authors “to help us crawl their content better,” the official documentation suggests, alongside established standards such as sitemaps or the robots.txt file, which site owners can use to specify how often Googlebot should visit their content or whether it should not be included in the search index.

The importance of crawling for Google and for sites

To better understand the weight this activity has for Google, and thus for SEO, we can think of the analogy proposed by Lizzi Harvey: crawling is “like reading all the books in a library.” Before search engines can serve up any search results, they must get as much information from the web as possible, and so they use the crawler, a program that travels from site to site and acts like a browser.

The crawl covers the HTML and any content referenced in the HTML, such as images, videos, or JavaScript. Crawlers also extract links from HTML documents so that they can visit the linked URLs as well, again with the goal of finding new pages on the Web.

Technically speaking, crawlers do not actively click on links or buttons, but send URLs to a queue to be crawled at a later time. When a new URL is accessed, no cookies, service workers, or local storage (such as IndexedDB) are available.
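
The queue-based discovery described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how Googlebot is actually implemented: the `LinkExtractor` class, the `discover` function and the in-memory `PAGES` site on example.com are all hypothetical names invented for the example, and the "fetch" step is simulated with a dictionary lookup instead of real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def discover(seed_urls, fetch):
    """Visit URLs in queue order; `fetch(url)` returns HTML or None."""
    queue = deque(seed_urls)   # URLs waiting to be crawled
    seen = set(seed_urls)      # URLs already queued, to avoid repeats
    crawled = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:       # document missing: nothing to read
            continue
        crawled.append(url)
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                # Links are not "clicked": they are queued for later.
                seen.add(link)
                queue.append(link)
    return crawled


# Hypothetical four-page site used only for this illustration.
PAGES = {
    "https://example.com/": '<a href="/blog/">Blog</a> <a href="/about">About</a>',
    "https://example.com/blog/": '<a href="post-1">Post</a> <a href="/">Home</a>',
    "https://example.com/about": "<p>No links here</p>",
    "https://example.com/blog/post-1": '<a href="/about">About</a>',
}

crawled = discover(["https://example.com/"], PAGES.get)
print(crawled)
```

Starting from the home page alone, the sketch reaches the blog post only because the category page links to it, which is the same hub-page mechanism mentioned earlier.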

The crawlers attempt to retrieve each URL to determine the status of the document. If a book in the library is missing or damaged, it cannot be read; in the same way, if a document returns an error status code, the bots cannot use any of its contents, although they may retry the URL at a later time. This ensures that only publicly accessible documents enter the index. Likewise, if the crawlers discover a 301 or 302 redirect status code, for example, they follow the redirect to the new URL and continue there: when they get a positive response, and therefore have found a user-accessible document, they check whether they are allowed to crawl it and then download the content.
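
As a rough illustration of the decision flow just described, the following Python sketch walks a URL through redirects until it reaches either a positive response or an error. It is a simplification under stated assumptions: the `resolve` function, the `RESPONSES` table, the outcome labels and the `max_hops` limit are hypothetical, and the set of redirect codes handled is just the two mentioned in the text.

```python
def resolve(url, responses, max_hops=5):
    """Follow redirects in `responses`, a map of url -> (status, location).

    Returns the final URL and a label for what the crawler does next.
    """
    for _ in range(max_hops):
        # Unknown URLs are treated as 404 in this toy model.
        status, location = responses.get(url, (404, None))
        if status in (301, 302) and location:
            url = location          # continue at the redirect target
            continue
        if 200 <= status < 300:
            # Positive response: check crawl permission, then download.
            return url, "check_rules_then_download"
        # Error status: content unusable now, may be retried later.
        return url, "retry_later"
    return url, "too_many_redirects"


# Hypothetical responses used only for this illustration.
RESPONSES = {
    "https://example.com/old": (301, "https://example.com/new"),
    "https://example.com/new": (200, None),
    "https://example.com/broken": (500, None),
}

print(resolve("https://example.com/old", RESPONSES))
print(resolve("https://example.com/broken", RESPONSES))
```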

Returning to the previous definitions, crawl rate represents the maximum number of simultaneous connections a crawler can use to crawl a site. Crawl demand, on the other hand, depends on “how much content is desired by Google” and is “influenced by URLs that have not been crawled by Google before, and Google’s estimate of how often content changes on known URLs.”

Google calculates a site’s crawl rate periodically, based on the site’s responsiveness or, in other words, the share of crawling traffic it can actually handle: if the site is fast and consistent in responding to crawlers, the rate goes up if there is demand for indexing; if, on the other hand, the site slows down or responds with server errors, the rate goes down and Google crawls less.
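
To make the idea of a rate that rises and falls with site health more concrete, here is a toy Python model. Google does not publish its actual algorithm, so everything here is an assumption made for illustration: the `adjust_crawl_rate` function, the halving and plus-one steps, the limits of 1 and 10 connections and the 2-second "slow" threshold are all invented, and crawl demand is ignored entirely.

```python
def adjust_crawl_rate(current, status_code, response_seconds,
                      minimum=1, maximum=10, slow_threshold=2.0):
    """Return the next number of simultaneous connections (toy model)."""
    site_is_struggling = (
        status_code >= 500                  # server errors
        or status_code == 429               # "too many requests"
        or response_seconds > slow_threshold
    )
    if site_is_struggling:
        return max(minimum, current // 2)   # back off quickly
    return min(maximum, current + 1)        # speed up gradually


# Hypothetical sequence of (status code, response time in seconds).
observations = [(200, 0.3), (200, 0.4), (503, 0.1), (200, 2.5), (200, 0.2)]

rate = 4
history = []
for status, seconds in observations:
    rate = adjust_crawl_rate(rate, status, seconds)
    history.append(rate)

print(history)
```

In this run the rate climbs while the site answers quickly, drops after the 503 and the slow response, and starts to recover once the site is healthy again.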

When Googlebot can crawl a site efficiently, it enables a site to quickly get new content indexed in search results and helps Google discover changes to existing content.

How to manage Google crawling on a site

Talking about crawling also means addressing a topic that has become increasingly popular in recent years and that often troubles SEOs and those who work on sites: the crawl budget, which we have already defined as the number of URLs that Googlebot can and wants to crawl on a site. In other words, it is the combination of crawl rate and crawl demand.

To guide us through the analysis of how Google’s crawling mechanism works, we can refer to an episode of the Google Search Console Training series presented, as on previous occasions, by Search Advocate Daniel Waisberg. He gives a quick but comprehensive overview of how Google crawls pages, and then dwells on Search Console’s Crawl Stats report, which allows us to check Googlebot’s ability to crawl a given site and provides data on crawl requests, average response time, and more.

As a disclaimer, the Googler explains that such topics are most relevant for those working on a large website, while those with a project with a few thousand pages need not worry too much about them (although, he says, “it never hurts to learn something new, and who knows, your site might become the next big thing”).

How to reduce Googlebot crawl speed the right way

In the rare cases when Google’s crawlers overload servers, you can set a limit on crawl speed using settings in Search Console or other on-site interventions.

As a recent official Google page makes clear, to reduce Googlebot crawl speed we can essentially:

  • Use Search Console to temporarily reduce the crawl speed.
  • Return an HTTP status code 500, 503 or 429 to Googlebot when it crawls too fast.

A code like 4xx identifies client errors: servers return a signal indicating that the client request was wrong in some sense and for some reason; in most cases, errors in this category are rather benign, Google says, such as “not found,” “forbidden,” “I am a teapot” (one of Google’s most famous Easter Eggs), because they do not suggest that something wrong is happening with the server itself.

The only exception is 429, which stands for “too many requests”: this error is a clear signal to any well-behaved robot, including Googlebot, that it must slow down because it is overloading the server.

However, and again with the exception of code 429, no 4xx error is suitable for limiting Googlebot’s crawl rate, precisely because these codes do not suggest that there is a problem with the server: neither that it is overloaded, nor that it has encountered a critical error and is unable to respond to the request. They simply mean that the client request was bad or wrong in some way. There is no sensible way to associate, for example, a 404 error with server overload: an influx of 404s could simply result from a user accidentally linking to the wrong pages on the site, and so it cannot be a reason for Googlebot to slow down its crawling. The same is true for status codes 403, 410 and 418.

Then there is another aspect to consider: all 4xx HTTP status codes (again, except 429) will cause content to be removed from Google Search. Even worse, serving the robots.txt file with a 4xx HTTP status code makes it practically useless, because it will be treated as if it did not exist: none of its rules apply, so even the areas that were meant to be off-limits to crawlers become crawlable, with disadvantages for everyone.

Ultimately, then, Google strongly urges us not to use 404 and other 4xx client errors to reduce Googlebot’s crawling frequency, even though this seems to be a trending strategy among website owners and some content delivery networks (CDNs).
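
On the server side, the advice in this section boils down to a simple rule, sketched here in Python. This is a hedged illustration rather than a production setup: the `response_for_crawler` and `is_safe_slowdown_code` helpers and the notion of a fixed request `capacity` are hypothetical. `Retry-After` is a standard HTTP response header, but whether a given crawler honors it is an assumption.

```python
# Status codes the text lists as valid "slow down" signals.
SLOWDOWN_CODES = {500, 503, 429}


def response_for_crawler(active_requests, capacity, retry_after_seconds=120):
    """Return (status, headers) for an incoming crawler request."""
    if active_requests < capacity:
        return 200, {}
    # Overloaded: signal a temporary server-side problem, never a 404.
    return 503, {"Retry-After": str(retry_after_seconds)}


def is_safe_slowdown_code(status):
    """True only for codes that tell a crawler to back off."""
    return status in SLOWDOWN_CODES


print(response_for_crawler(active_requests=3, capacity=10))
print(response_for_crawler(active_requests=10, capacity=10))
print(is_safe_slowdown_code(404), is_safe_slowdown_code(429))
```

The point of the second helper is the asymmetry discussed above: a 404, 403 or 410 says nothing about server load, so it should never be used as a throttle.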

What the Google Crawl Stats report is and how it can be used

In this regard, it is far more effective to learn how to use the dedicated tool in Google Search Console, the Crawl Stats report, which lets us find out how often Google crawls the site and what the responses were, view statistics on Google’s crawling behavior, and thereby understand and optimize the crawling process.

The most recent version of this tool was released in late 2020 (as also announced in Google Search News in November 2020) and provides data that answer questions such as:

  • What is the general availability of the site?
  • What is the average page response time for a crawl request?
  • How many requests have been made by Google to the site in the last 90 days?

The Crawl Stats report adds to the old webmaster tools and is only available for properties at the root directory level: site owners can find it by opening Search Console and going to the “Settings” page.

When the report is opened, a summary page appears, which includes a graph of crawl trends, details of the host status, and a detailed breakdown of crawl requests.

The graph on crawl trends

In particular, the graph of crawl trends shows information on three metrics:

  • Total crawl requests for site URLs (successful or not). Requests for resources hosted outside the site are not counted, so if images are served from another domain (such as a CDN) they will not appear here.
  • Total download size from the site during crawling. Page resources shared by multiple pages and cached by Google are requested only the first time (when they are cached).
  • Average page response time for a crawl request to retrieve page content. This metric does not include retrieving page resources such as scripts, images, and other linked or embedded content, and does not take page rendering time into account.

When analyzing these data, Waisberg recommends looking for “higher peaks, drops and trends over time.” For example, if you notice a significant drop in total crawl requests, it is good to make sure that no one has added a new robots.txt file to the site. If the site responds slowly to Googlebot, it could be a sign that the server cannot handle all the requests; likewise, a constant increase in the average response time is another “indicator that the servers might not handle all the load,” although it may not immediately affect the crawl rate but rather the user experience.
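
For those who want to cross-check the report against their own server logs, the three metrics can be approximated as in the following Python sketch. It is only an approximation under stated assumptions: the record format (`host`, `path`, `bytes`, `response_ms`), the `crawl_summary` function and the sample data are hypothetical, and unlike the real report this version does not separate page requests from page resource requests when averaging response time.

```python
from statistics import mean

# Hypothetical crawler requests already extracted from access logs.
requests_log = [
    {"host": "www.example.com", "path": "/", "bytes": 15000, "response_ms": 180},
    {"host": "www.example.com", "path": "/blog/", "bytes": 22000, "response_ms": 240},
    {"host": "cdn.example.net", "path": "/img/logo.png", "bytes": 9000, "response_ms": 60},
    {"host": "www.example.com", "path": "/missing", "bytes": 500, "response_ms": 90},
]


def crawl_summary(records, site_host):
    """Approximate the three trend metrics for one host."""
    # Requests to other hosts (e.g. a CDN) are not counted for this site.
    own = [r for r in records if r["host"] == site_host]
    if not own:
        return {"total_requests": 0, "total_download_bytes": 0, "avg_response_ms": 0}
    return {
        "total_requests": len(own),   # successful or not
        "total_download_bytes": sum(r["bytes"] for r in own),
        "avg_response_ms": mean(r["response_ms"] for r in own),
    }


summary = crawl_summary(requests_log, "www.example.com")
print(summary)
```

Computing the same summary per day and comparing the values over time is one simple way to spot the peaks, drops and trends that Waisberg recommends watching.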

Host status analysis

Host status data allows you to check the general availability of a site in the last 90 days. The errors in this section indicate that Google cannot crawl the site for technical reasons.

Again, there are three categories that provide details of host status:

  • Robots.txt fetch: the percentage of errors while fetching the robots.txt file. Having a robots.txt file is not mandatory, but the request must return a 200 or a 404 response (a valid file, whether populated or empty, or a non-existent file); if Googlebot has a connection problem, such as a 503, it will stop crawling the site.
  • DNS resolution: indicates when the DNS server has not recognized the host name or has not responded during the crawl. In case of errors, it is suggested to contact the registrar to verify that the site is properly configured and that the server is connected to the Internet.
  • Server connectivity: shows when the server is not responding or has not provided the full response for the URL during a crawl. If you notice significant peaks or connectivity problems, it is suggested to talk to the provider to increase capacity or resolve availability problems.

A substantial error in any of the categories may result in reduced availability. The report shows one of three host status values. If Google has found at least one of these errors on the site in the last week, a red alert icon with an exclamation mark appears. If the error is older than a week but falls within the last 90 days, a white icon with a green check appears, indicating that there have been problems in the past (temporary or resolved in the meantime), which can be verified through server logs or with a developer. Finally, if there have been no substantial availability problems in the last 90 days, everything is in order and a green icon with a white check appears.
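
The three host status values follow a simple time-window rule, which can be written down as a small Python function. This is just a restatement of the description above: the `host_status` function and its returned labels are hypothetical, and treating exactly 7 days as "the last week" is an assumption about a boundary the text does not specify.

```python
def host_status(days_since_last_error):
    """Map the age of the most recent substantial error to an icon.

    Pass None when no substantial error has been recorded at all.
    """
    if days_since_last_error is None or days_since_last_error > 90:
        return "green icon with white check"        # no recent problems
    if days_since_last_error <= 7:
        return "red icon with exclamation mark"     # error in the last week
    return "white icon with green check"            # past, resolved problems


for days in (2, 30, 120, None):
    print(days, "->", host_status(days))
```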

Googlebot’s crawl requests

The crawl request cards show several breakdowns of the data that help us understand what Google’s crawlers found on the site. In this case, there are four breakdowns:

  • Crawl response: the responses received by Google while crawling the site, grouped by type, as a percentage of all crawl responses. Common response types are 200, 301, 404 or server errors.
  • Crawled file types: shows the file types returned by the request (the percentage value refers to the responses received for that type and not to the bytes retrieved); the most common are HTML, images, video or JavaScript.
  • Purpose of the crawl: shows the reason for crawling the site, such as discovery of a URL that is new to Google, or refresh for a re-crawl of a known page.
  • Type of Googlebot: indicates the type of user agent used to make the crawl request, such as smartphone, desktop, image and others.
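
Each of these breakdowns is a share of responses per group, which is easy to reproduce on our own data. The Python sketch below shows the calculation for response codes; the `breakdown` function and the sample list of status codes are hypothetical, and the same function could be applied to file types or user agents.

```python
from collections import Counter


def breakdown(values):
    """Return each distinct value's share of the total, in percent."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: round(100 * count / total, 1) for key, count in counts.items()}


# Hypothetical status codes returned to a crawler.
statuses = [200, 200, 200, 301, 404, 200, 500, 200]

shares = breakdown(statuses)
print(shares)
```

As in the report, the percentages refer to the number of responses in each group, not to the bytes downloaded.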

Tell search engines how to crawl your site

To recap, to understand and optimize Google crawling we can use Search Console’s Crawl Stats report, starting with the summary page graph to analyze crawl volume and trends, continuing with host status details to check overall site availability, and finally checking the breakdown of crawl requests to understand what Googlebot finds when it crawls the site.

These are the basics of using the Crawl Stats report to ensure that Googlebot can crawl the site efficiently for Search, to be followed up with the necessary crawl budget optimization and general interventions to ensure that our site can actually enter the Google Index and then begin the climb to visible positions.

With the understanding that the crawl budget (that is, it bears repeating, the number of URLs Google can and will crawl on websites each day) is a parameter “relevant for large websites, because Google needs to prioritize what to crawl first, how much to crawl, and how frequently to crawl again,” it is still useful to know how to guide the way search engine crawlers crawl our site.

The basics of crawling - from Moz

In that sense, as Moz’s work (from which we have drawn some of the images on the page) summarizes well, there are some optimizations we can implement to better direct Googlebot on how we want it to crawl our content published on the Web; telling the search engines ourselves how to crawl our pages can give us more and better control over what ends up in the Index.

Site interventions to optimize crawling

Before we get into the details of what needs to be done, however, let’s digress one last time. Usually, we focus on the work needed to ensure that Google can find our important pages, and that is certainly a good thing. However, we should not forget that there are probably pages that we do not want Googlebot to find, such as old URLs with thin content, duplicate URLs (such as sorting parameters and e-commerce filters), special promo code pages, staging or test pages, and so on.

This is also what crawling management is for, allowing us to steer crawlers away from certain pages and sections of the site. These are the most common and effective methods.

  • Robots.txt

We have mentioned it several times: robots.txt files are located in the root directory of Web sites and, via specific directives, suggest which parts of the site search engines should and should not crawl, as well as the speed at which they crawl the site (although Googlebot does not support the crawl-delay directive).
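
To see how such directives are read by a crawler, we can test a small robots.txt with the `urllib.robotparser` module from Python’s standard library. The file content, the paths and the example.com URLs below are hypothetical, and the module’s matching logic is a simplified stand-in: Googlebot’s own parser may resolve some edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt keeping crawlers out of two sections.
ROBOTS_TXT = """\
User-agent: *
Disallow: /staging/
Disallow: /promo-codes/

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

urls = [
    "https://www.example.com/blog/post-1",
    "https://www.example.com/staging/new-home",
    "https://www.example.com/promo-codes/summer",
]
allowed = {url: parser.can_fetch("Googlebot", url) for url in urls}

for url, ok in allowed.items():
    print("crawl" if ok else "skip ", url)
```

Running a check like this before publishing a new robots.txt is a cheap way to avoid accidentally blocking a section we actually want crawled.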

  • Sitemap

Sitemaps can also be useful: they are, as the name makes clear, a list of URLs on the site that crawlers can use to discover and index content. One of the easiest ways to make sure Google finds your highest-priority pages is to create a file that meets Google’s standards and submit it through Google Search Console. Although submitting a sitemap does not replace the need for good site navigation, it can certainly help crawlers follow a path to all important pages.
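
A basic XML sitemap can be generated with a few lines of Python, as in the sketch below. It assumes the standard sitemaps.org `urlset` format with `loc` and `lastmod` elements; the `build_sitemap` function, the URLs and the dates are hypothetical examples, and a real site would normally produce the list of pages from its CMS or database.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (url, last_modified_date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical important pages of the site.
sitemap_xml = build_sitemap([
    ("https://www.example.com/", "2024-01-15"),
    ("https://www.example.com/blog/post-1", "2024-02-03"),
])
print(sitemap_xml)
```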

Sometimes, navigation errors can prevent crawlers from seeing the entire site. This happens with mobile navigation that shows different results than desktop navigation, menu items that exist only in JavaScript and not in the HTML, navigation personalized or shown only to a specific type of visitor (which could look like cloaking to crawlers), failure to link to a primary page of the site in the navigation, text hidden within non-text content, content hidden behind login forms, and so on.

According to experts, it is essential for the website to have clear navigation and useful URL folder structures.

At the same time, a clean information architecture should be set up, following the practice of organizing and labeling content in a way that improves efficiency and findability for users, on the premise that the best information architecture is intuitive, that is, it allows users not to think much about scrolling through the site or finding something.

  • Optimizing the crawl budget

Finally, there are the technical interventions to optimize the crawl budget, which is the average number of URLs Googlebot crawls on the site before leaving; they serve to prevent Googlebot from wasting time crawling unimportant pages and risking ignoring important ones. The crawl budget matters most on very large sites with tens of thousands of URLs, but it is never a bad idea to prevent crawlers from accessing content that we are definitely not interested in. What we need to make sure of is not to block a crawler’s access to pages on which we have added other directives, such as canonical or noindex tags: if Googlebot is blocked from a page, it will not be able to see the instructions there.
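
The conflict described in the last sentence can be detected automatically by comparing the robots.txt rules with the directives placed on each page. The Python sketch below is a minimal illustration under stated assumptions: the `hidden_directives` function, the robots.txt rules and the page inventory are hypothetical, and in practice the list of pages with their directives would come from a site crawl or a CMS export.

```python
from urllib.robotparser import RobotFileParser


def hidden_directives(robots_lines, pages, user_agent="Googlebot"):
    """List URLs whose on-page directives a blocked crawler cannot see.

    `pages` maps each URL to the directives present on that page.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [
        url
        for url, directives in pages.items()
        if directives and not parser.can_fetch(user_agent, url)
    ]


# Hypothetical rules and page inventory.
robots_lines = ["User-agent: *", "Disallow: /old/"]
pages = {
    "https://www.example.com/old/thin-page": ["noindex"],
    "https://www.example.com/old/archive": [],
    "https://www.example.com/blog/post-1": ["canonical"],
}

conflicts = hidden_directives(robots_lines, pages)
print(conflicts)
```

Here only the thin page is reported: it carries a noindex tag, but because the whole `/old/` section is blocked, a crawler obeying the robots.txt would never fetch the page and read that instruction.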