Generating sitemaps hosted on S3

rails

seo

Feb, 2021

If you're trying to improve your website's SEO, generating a sitemap for it is a quick win.

A sitemap tells the search engines how your site is structured and which pages you want to be crawled and indexed. It's not that the search engines won't crawl your website if you don't have a sitemap but it will certainly help them do it more intelligently. In this post I'll share how I generated a sitemap for my blog, hosting it on S3.

Sitemap Generator

I don't want to create and edit the actual sitemap file manually, so I've delegated that to a gem called sitemap_generator. With this gem, all you need to do is edit a configuration file with the routes you want to include in your sitemap and set up a host that the search engines will access.

What's also great about this gem is that it includes a rake task to ping the search engines when your pages change. Hence, there's no need to manually add and manage the sitemap in Google Search Console or Bing Webmaster Tools. These two search engines are included by default, but you can also add others (e.g. Yandex, Baidu). If you're also aiming for DuckDuckGo and Ecosia, you should be fine with the defaults: Ecosia uses Bing's web crawler, and while you can't ping DuckDuckGo's crawler (DuckDuckBot), their help page mentions that their search results come from multiple partners, most commonly Bing (and none from Google).
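For reference, the tasks involved (assuming the gem's documented defaults) look like this:

```shell
# Regenerate the sitemap and ping the configured search engines
bundle exec rake sitemap:refresh

# Regenerate without pinging anyone (handy in staging or CI)
bundle exec rake sitemap:refresh:no_ping
```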

Why host a sitemap on S3

If you're hosting your website on a server that does not allow persisting files in the local filesystem (Heroku, for instance), it is a good idea to host your sitemap in a remote service like Amazon's Simple Storage Service (S3). You can then point the search engines and robots to those remote files, using a redirect route from your website.

Steps to generate the sitemap and upload it to S3:

The gem's documentation is very comprehensive, but here are the steps for this use case:

1. Add gem sitemap_generator to your Gemfile

# Gemfile
gem 'sitemap_generator'

2. Run rake sitemap:install

bundle exec rake sitemap:install

This will create a config/sitemap.rb file where you can configure the links that you want the sitemap to consider.

3. Add a default host

This is the hostname that is used to build the links added to the sitemap (and all links in a sitemap must belong to the same host). In my case that will be:

# config/sitemap.rb 
SitemapGenerator::Sitemap.default_host = "https://ananunesdasilva.com"

4. Add the routes that you want to include in your sitemap

The root path '/' and sitemap index file are added automatically. So other than those, I'd like to add the about page, the posts#index page, and all the posts#show pages:

# config/sitemap.rb
SitemapGenerator::Sitemap.create do 
  add '/about', changefreq: 'monthly', lastmod: 1.month.ago 
  add '/posts', changefreq: 'weekly', lastmod: Post.order('updated_at DESC').first.updated_at

  Post.find_each do |post|
    add post_path(post), lastmod: post.updated_at, changefreq: 'monthly'
  end
end

The gem's documentation advises that if you're adding large ActiveRecord collections (thousands of records) you should iterate through them in batches to avoid loading all records into memory at once. For now, my posts collection is not that big, but I'm using the find_each method anyway since it does no harm with small collections and hopefully, I'll have thousands of posts soon 😅.
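As a sketch, the batched version only differs in an optional batch_size argument (ActiveRecord's default is 1,000 records per query; 500 below is just an illustrative override):

```ruby
# config/sitemap.rb (inside the create block)
# find_each loads posts in batches instead of all at once,
# keeping memory usage flat however large the table gets.
Post.find_each(batch_size: 500) do |post|
  add post_path(post), lastmod: post.updated_at, changefreq: 'monthly'
end
```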

Notice that I'm only adding the paths since the URL will be built using the default host that was defined in the previous step.

You can pass options to the add method or stay with the defaults:

# Defaults: :priority => 0.5, :changefreq => 'weekly',
#           :lastmod => Time.now, :host => default_host

I'm giving a higher priority to each post entry (vs. the posts#index and about pages) since I want the crawlers to consider those as the most relevant links. Regarding changefreq, I'm signaling that both the about page and each post entry page will change monthly. I'm not expecting to change them frequently except for occasional improvements. On the other hand, the posts#index is going to change every time I add a new post entry (I'm hoping to add at least two posts per month, for now).
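The config in step 4 doesn't show the priority option, so here's a sketch of how those relative priorities could be expressed (the exact values are my own choice, not something the gem prescribes):

```ruby
# config/sitemap.rb
SitemapGenerator::Sitemap.create do
  # The mostly-static pages keep the default-ish priority...
  add '/about', priority: 0.5, changefreq: 'monthly', lastmod: 1.month.ago
  add '/posts', priority: 0.5, changefreq: 'weekly'

  # ...while each individual post gets a higher priority.
  Post.find_each do |post|
    add post_path(post), priority: 0.7, lastmod: post.updated_at, changefreq: 'monthly'
  end
end
```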

5. Add a public_path

Heroku does not persist files written to the local filesystem, but it grants access to a temporary directory that we will use to store the sitemap before sending it to the S3 bucket. On Heroku, this is tmp/ within your application directory.

# config/sitemap.rb
SitemapGenerator::Sitemap.public_path = 'tmp/sitemap' 

6. Create an S3 bucket on AWS

I'd advise you to create a separate bucket just for the sitemap(s). You'll need to open that bucket to the public so that the sitemap(s) can be read by the search engines.

Go to your bucket's permissions and allow public access.
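For reference, a minimal bucket policy granting public read access to the objects could look like this (your-sitemap-bucket is a placeholder for your bucket's name):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadForSitemaps",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::your-sitemap-bucket/*"
    }
  ]
}
```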

7. Add the bucket to config/sitemap.rb using the gem's AwsSdkAdapter

You must require 'aws-sdk-s3' in your sitemap config before using this adapter, or require another library that defines Aws::S3::Resource and Aws::Credentials.

# config/sitemap.rb
SitemapGenerator::Sitemap.adapter = SitemapGenerator::AwsSdkAdapter.new(
  Rails.application.credentials.dig(:aws, :prod, :sitemap_bucket),
  aws_access_key_id: Rails.application.credentials.dig(:aws, :access_key_id),
  aws_secret_access_key: Rails.application.credentials.dig(:aws, :secret_access_key),
  aws_region: Rails.application.credentials.dig(:aws, :region)
)

8. Add your remote host

# The remote host where your sitemaps will be hosted
bucket = Rails.application.credentials.dig(:aws, :prod, :sitemap_bucket)
region = Rails.application.credentials.dig(:aws, :region)
SitemapGenerator::Sitemap.sitemaps_host = "https://#{bucket}.s3.#{region}.amazonaws.com"
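Those credentials.dig calls assume an encrypted credentials file structured roughly like this (edit it with bin/rails credentials:edit; the nesting under aws/prod is just my own convention):

```yaml
# Decrypted view of config/credentials.yml.enc
aws:
  access_key_id: YOUR_ACCESS_KEY_ID
  secret_access_key: YOUR_SECRET_ACCESS_KEY
  region: us-east-1
  prod:
    sitemap_bucket: your-sitemap-bucket
```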

9. Add an internal route to download the sitemap

# config/routes.rb
get '/sitemap.xml.gz', to: redirect("https://#{Rails.application.credentials.dig(:aws, :prod, :sitemap_bucket)}.s3.#{Rails.application.credentials.dig(:aws, :region)}.amazonaws.com/sitemap.xml.gz")

Instead of using the direct URL to the sitemap hosted in the S3 bucket, you should use a URL on your own domain that redirects to the S3 URL:

https://ananunesdasilva.com/sitemap.xml.gz

10. Add the sitemap to robots.txt

You should add the URL of the sitemap index file to public/robots.txt to help search engines find your sitemaps. The URL should be the complete URL to the sitemap index. So, in my case:

# public/robots.txt
# See https://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
Sitemap: https://ananunesdasilva.com/sitemap.xml.gz

11. Run the rake task that generates the sitemap and pings the search engines

bundle exec rake sitemap:refresh

That's it! Check your bucket to make sure that the sitemap was uploaded. And don't forget to run this command every time there are content changes to your website.
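To avoid having to remember this step, you could schedule it; on Heroku, for instance, the Scheduler add-on can run the rake task daily (a sketch, assuming the Heroku CLI is installed):

```shell
# Run the task once, remotely, to verify everything works
heroku run rake sitemap:refresh

# Provision the Scheduler add-on and open its dashboard
# to add "rake sitemap:refresh" as a daily job
heroku addons:create scheduler:standard
heroku addons:open scheduler
```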

Final remarks

You can check on GitHub what my final sitemap configuration and robots file look like.

Though I mentioned earlier that you do not need to manually add your sitemap to Google Search Console, it might be a good idea if you want to be sure that everything is set up correctly. You can also use Ahrefs Webmaster Tools. Some features are paid, but there's a nice on-page SEO analyzer that will let you know if there are any issues with your sitemap. It flagged some warnings I wouldn't have known about otherwise; I quickly fixed them and improved my on-page SEO score (also provided by Ahrefs).