What is Crawling and Indexing?

Crawling and indexing are what search engine “bots” do to discover and catalogue content on websites and blogs so it can be ranked and served for relevant keyphrases. Let’s rewind a bit and talk about the epic job that Google’s search engine has: to catalogue whatever exists on the entire internet!

According to Siteefy, “as of January 2021, a total of 1,167,715,133 [websites] exists, with 198,988,100 of them being active sites”. Just think of the mind-blowing amount of new data being added to these sites and the massive job that Google has of cataloging, ranking, and serving all these sites to its search engine users.

The rate of growth is staggering, and it increased even more during the pandemic as people set up online businesses and blogs. Whether you are starting a new blog or have an existing one, you need to make sure that Google is both crawling and indexing your pages and posts.

If they’re not getting indexed, you have a problem: your blog or niche site will not be served to people because it’s not even in Google’s “library” to begin with!

How Do Search Engines like Google Work?

Search engines have three main functions:

Crawling: the process of a search engine’s bot or crawler finding new pages and reading them to understand what they contain.

Indexing: once a page has been crawled, it can be stored and organized in the search engine’s library to be served to a user later.

Ranking: only after a page has been indexed can it be ranked; the search engine determines how the page ranks for particular keyphrases and serves it to users when they enter those keyphrases.

What is Crawling?

As described earlier, crawling is the first step toward your blog or site being stored in a search engine’s database and served to searchers. But in order to understand crawling, you must first understand what crawlers or bots are.

What Are Web Crawlers?

A web crawler is a bot that reads a web page and then follows the links on that page to crawl other pages as well.
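To make the "read a page, then follow its links" idea concrete, here is a minimal sketch in Python. It is only an illustration of the general technique, not how Googlebot actually works. The `fetch` argument and the tiny `FAKE_SITE` dictionary are made-up stand-ins so the example runs without touching the real internet.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl: read a page, queue its links, repeat.

    `fetch` is a hypothetical callable that returns the HTML for a URL,
    or None if the page cannot be loaded.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # turn "/about" into a full URL
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A tiny made-up site held in memory (an assumption for illustration only).
FAKE_SITE = {
    "https://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/blog": '<a href="/blog/post-1">Post 1</a>',
    "https://example.com/blog/post-1": "<p>No links here.</p>",
}

pages = crawl("https://example.com/", FAKE_SITE.get)
print(pages)
```

Notice that the post at `/blog/post-1` is only found because the blog page links to it. A page that nothing links to would never be reached this way, which is one reason internal links matter.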

Web crawlers go by many names: robots, bots, spiders, search engine bots, etc.

Think about web crawling bots like spiders; they’re crawling your site and the rest of the internet to collect information about your pages and understand what is on them. Google’s own bot is called Googlebot.

Googlebot finds your site when it comes across a link to it on another site it is crawling, or when you have submitted a sitemap via the Google Search Console tool.

Either way, crawling is a “never-ending” task. To have a higher chance of being crawled, make sure to update your content frequently. Crawlers prioritize sites that are popular, high quality, and updated frequently. For this reason, you should take great care of the content on your site and update it when needed. Google has to use a lot of resources to crawl your blog’s pages and posts, so it has to “prioritize” certain sites over others, which leads us to “crawl budget”.
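If you are curious what a sitemap actually looks like, it is just an XML file listing your URLs. Here is a rough sketch in Python that builds a very small one. The namespace comes from the sitemaps.org protocol; the function name `build_sitemap` and the example URLs are my own placeholders. On WordPress, an SEO plugin or WordPress itself normally generates this file for you, so treat this purely as an illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal XML sitemap (as a string) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Placeholder URLs; swap in your own pages and posts.
sitemap_xml = build_sitemap([
    "https://example.com/",
    "https://example.com/blog/post-1",
])
print(sitemap_xml)
```

Real sitemaps often include optional fields such as `lastmod`, but a list of `loc` entries is the core of it.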

So, Exactly What is Crawl Budget?

It’s exactly what it sounds like: Googlebot assigns your site a “budget” of visits within a given timeframe.

Crawl budget is the number of pages a bot or crawler will visit within a given timeframe. For that reason, you want to use your crawl budget wisely. For instance, I typically advise against indexing tag pages, as those are considered “low value” pages. If you are wasting crawling resources (basically, wasting Google’s resources) on low-quality pages on your blog or niche site, your crawl budget becomes smaller.

To avoid this, make sure to set up your blog properly and to publish high quality content.
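One common way to keep a page such as a tag archive out of the index is a robots meta tag with a `noindex` value in the page’s HTML. As a rough sketch, the Python below checks whether a page carries that tag. The class and function names are hypothetical, and the two HTML snippets are invented examples, not output from any particular plugin.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # content looks like "noindex, follow"
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def is_noindex(html):
    """Return True if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Invented examples: a tag archive page and a regular post.
tag_page = '<head><meta name="robots" content="noindex, follow"></head>'
blog_post = '<head><meta name="robots" content="index, follow"></head>'

print(is_noindex(tag_page))
print(is_noindex(blog_post))
```

You can view the source of your own tag pages and look for the same meta tag by eye; the script just automates that check.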

What is Indexing?

After the crawling stage, the indexing magic happens. Indexing is how search engines organize and store your data in their “library” so they can rank it and retrieve it for their users.

If your pages are not getting indexed, they will not show up in the search engines!

The easiest way to understand this is to think of your local library. The library first receives a new book and records details about it, like the author and the category it falls under, so it can be put into their “system”.

Then, the library actually assigns it an index number so patrons of the library can go to the physical location and retrieve it. They have a system in place. Same thing with search engines indexing your content.
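To picture the library’s “system” in code, here is a toy version of an index in Python: it maps each word to the pages that contain it, so a lookup does not have to reread every page. This is a deliberately simplified sketch of the general idea (often called an inverted index), not a description of Google’s actual index. The page paths and text are made up.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word.strip(".,!?")].add(url)
    return index


def lookup(index, query):
    """Return the URLs that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Made-up pages standing in for crawled content.
crawled_pages = {
    "/sourdough": "How to bake sourdough bread.",
    "/banana": "Easy banana bread recipe",
    "/coffee": "Brewing coffee at home",
}

index = build_index(crawled_pages)
print(lookup(index, "bread"))
print(lookup(index, "banana bread"))
```

A page that was never added to the index can never come back from a lookup, which is exactly why unindexed posts cannot show up in search results.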

It’s important to understand that indexing requires a lot of resources from Google and other search engines, as they have to “store” your content. Remember, millions of blogs and sites exist, with new ones started daily, and it is Google’s job to discover them, crawl them, index them, and rank them.

What Does the Term Ranking Mean?

After your content has been crawled and hopefully indexed, Google then determines how it will rank for keyphrases relevant to your content. So, in loose terms, “my blog post is ranking #3 for a keyphrase in Google” means that when you search that specific keyphrase using Google’s search engine, the blog post will be in position 3 from the top.

The Importance of Crawling and Indexing

If you don’t set up your blog properly, Google and other search engines won’t be able to crawl your site. If they aren’t able to crawl it, your site can’t appear in their index or be given any rankings.

You may as well not even have a site.

The first step is to go through this WordPress SEO Guide so you enable proper indexing settings for the search engines. Next, you will want to make sure that you set up Google Search Console for your blog so you can see whether any issues are keeping Googlebot from indexing your content, and so much more.

Even if you’re the master of SEO, none of it helps if you don’t have your blog properly set up to be crawled and indexed.

How to Check if Your Content is Getting Indexed

There are a couple of ways to check if your content is getting indexed. The first is to log into Google Search Console and click on “Pages” under Indexing. The graph will show you how many indexed versus non-indexed pages you have and why they are not indexed.

The two main reasons for non-indexed pages are “Discovered – currently not indexed” and “Crawled – currently not indexed”. If you click on them, they will list which pages were discovered or crawled but not indexed. Some pages may be ones you truly didn’t want indexed (like tag pages!) but there might be legitimate content you do want indexed. Google gives us a lot of insight into its process:

[Image: Google Search Console’s explanation of crawled and discovered but not indexed pages]

The second method is to type this into the Google search bar:

site:affiliatephoenix.com

Replace the “affiliatephoenix.com” with your own blog name or domain name and see what the search results show you. It will only list the indexed pages or posts.
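If you check several sites regularly, you can build the search link for the `site:` query instead of typing it each time. The short Python sketch below does that; `site_query_url` is a name I made up for this example, and it only builds the link for you to open in a browser. It does not fetch or count the results.

```python
from urllib.parse import urlencode


def site_query_url(domain):
    """Build a Google search URL for the query site:<domain>."""
    return "https://www.google.com/search?" + urlencode({"q": f"site:{domain}"})


# Replace the domain with your own blog's domain.
print(site_query_url("affiliatephoenix.com"))
# prints https://www.google.com/search?q=site%3Aaffiliatephoenix.com
```

The `%3A` in the link is just the URL-encoded colon from `site:`.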

If you see a difference between the number of indexed pages in Google Search Console and the number you find when you manually check the search engine, you are not alone; a lot of site owners see this. In the last 3-4 years, the process for crawling and indexing has changed a lot.

Now, Google has a mobile-first primary index as well as a supplementary index, and it is common to see a slight discrepancy when you check for indexing in the search engine versus GSC.

In Summary

It’s important to understand the overall process of how Google and other search engines discover and index your content and how it affects your blog, so that you can set up your blog properly. When you have these technical pieces synchronized, Google will discover, index, and rank your content in a beautiful rhythm!