Bulk update: Active Record Vs. Native SQL

rails · activerecord · sql · performance

Jan, 2021


Comparing the performance of three different approaches to a dynamic bulk update run on ~150,000 records.

The following example is a replication of a bulk action problem solved in the Active Record way and the native SQL way. I'm going to go through three different solutions and compare their performance. For each solution, I'll share the code, run it on my machine (development environment), and record how long it takes to successfully update ~150,000 records.

Let's then consider the following db schema and models:

[Schema diagram: players, teams, and player_game_statistics tables]

class PlayerGameStatistic < ApplicationRecord
  belongs_to :player
  belongs_to :team
  belongs_to :game
end
class Player < ApplicationRecord
  belongs_to :team
  has_many :player_game_statistics

  enum status: %i[active inactive retired]
end
class Team < ApplicationRecord
  has_many :players
  has_many :player_game_statistics
  has_many :games

  enum status: %i[inactive active]
end

We have three tables: players, teams and player_game_statistics. I'm purposefully excluding the games table here (to which player_game_statistics also belongs), since it is not relevant for this exercise.

They are all long-existing tables with thousands of rows, but player_game_statistics is, by its nature, the one with by far the most rows.

Now, let's say that I have just added a new column player_name to player_game_statistics. For the existing records that have this column empty (it defaults to null), I want to populate it with data that is available in the name column of the players table.

To reduce the number of rows a bit, I'm actually only interested in updating the statistics of active and inactive players (excluding retired ones) and of active teams (these are values of the status attribute that both players and teams have).
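Since these are positional Rails enums, the integer values used in the query filters below come straight from array order; a quick plain-Ruby illustration of the mapping:

```ruby
# Positional enums assign each symbol the integer of its array index.
# These are the same arrays declared in the Player and Team models above.
player_statuses = %i[active inactive retired].each_with_index.to_h
team_statuses   = %i[inactive active].each_with_index.to_h

player_statuses # => {:active=>0, :inactive=>1, :retired=>2}
team_statuses   # => {:inactive=>0, :active=>1}
```

This is why the queries filter on `players.status IN (0, 1)` and `teams.status = 1`.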

Version 1: The Active Record way

A first approach could be to iterate over each player_game_statistic, get the associated player, and update the statistic's player_name column based on the player.name value.

This will result in the classic n + 1 query problem: we load the player_game_statistics to update, then issue n additional queries to fetch each player's name.
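A plain-Ruby toy version of that query counting, with hard-coded fake rows standing in for the database:

```ruby
# Toy illustration of the n + 1 pattern: the queries array records
# every pretend round-trip to the database.
queries = []

# Pretend query no. 1: load all the statistics at once.
queries << 'SELECT * FROM player_game_statistics'
stats = [{ player_id: 1 }, { player_id: 2 }, { player_id: 3 }]

# Pretend queries 2..n+1: one extra SELECT per statistic in the loop.
stats.each do |stat|
  queries << "SELECT * FROM players WHERE id = #{stat[:player_id]}"
end

queries.length # => 4, i.e. 1 + n with n = 3
```

With ~150,000 statistics, that second kind of query fires ~150,000 times.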

This is where :includes usually comes in. So let's add that.

class UpdatePlayerGameStatisticsV1
  def execute!
    player_game_statistics_to_update.each do |stat|
      stat.update!(player_name: stat.player.name)
    end
  end

  private

  def player_game_statistics_to_update
    PlayerGameStatistic
      .includes(:player)
      .joins(:team)
      .references(:players)
      .where('teams.status = 1')
      .where('players.status IN (0, 1)')
  end
end

:includes will either use :preload or :eager_load depending on whether or not you reference the loaded association in a query condition (such as a where clause). In our case, we want to do exactly that so that we can filter results by player status. This will result in a single query on player_game_statistics with a left outer join on players and an inner join on teams. :eager_load is slower than :preload, but we don't really have a choice here, and it's still far better than not using :includes at all. For more details on how :includes works, I recommend this great article by Julianna Roen on the Gusto blog.

Result: It took about 00:04:33 to update ~150,000 rows.

Version 2: Active Record with a memoized dictionary

A second approach I was interested in measuring is a cached dictionary: query all the relevant players once, pluck each player's id and name, and transform that result into a hash that is then memoized.

We still iterate over each statistic, but instead of loading the player through the statistic we query the players once; subsequent iterations just consult the dictionary. This approach costs one extra query compared to Version 1, but it should be a light one since it only plucks two columns.

class UpdatePlayerGameStatisticsV2
  def execute!
    player_game_statistics_to_update.each do |stat|
      stat.update!(player_name: player_name_idx[stat.player_id])
    end
  end

  private

  def player_game_statistics_to_update
    PlayerGameStatistic.where(player_id: player_name_idx.keys)
  end

  def player_name_idx
    @player_name_idx ||=
      Player
        .joins(:team)
        .where('teams.status = 1')
        .where('players.status IN (0, 1)')
        .pluck('players.id', 'players.name')
        .to_h
  end
end
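For reference, pluck returns each row as a two-element array and to_h turns those pairs into the id-to-name lookup used above; illustrated here with made-up literal rows instead of a real query:

```ruby
# The shape of Player...pluck('players.id', 'players.name').to_h,
# using invented rows in place of a database result.
rows = [[1, 'Alice'], [2, 'Bob'], [3, 'Carol']] # what pluck returns
player_name_idx = rows.to_h                     # {1=>"Alice", 2=>"Bob", 3=>"Carol"}

player_name_idx[2] # => "Bob", an O(1) hash lookup per statistic
```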

Result: It took about 00:04:58 to update ~150,000 rows, not a meaningful difference from Version 1.

Version 3: raw SQL

A final approach is to use raw SQL: a single Postgres UPDATE ... FROM statement that updates all the statistics in one query.

class UpdatePlayerGameStatisticsV3
  def execute!
    ActiveRecord::Base.connection.execute(Arel.sql(update_sql))
  end

  private

  def update_sql
    <<-SQL
        UPDATE player_game_statistics AS stats
          SET player_name = players.name,
              updated_at = LOCALTIMESTAMP
        FROM (#{players_sql}) AS players
        WHERE stats.player_id = players.id
    SQL
  end

  def players_sql
    <<-SQL
        SELECT players.id, players.name
        FROM players
        LEFT OUTER JOIN teams ON teams.id = players.team_id
        WHERE teams.status = 1 AND players.status IN (0,1)
    SQL
  end
end

Result: It took about 00:00:11 to update ~150,000 rows.

Benchmarking all 3 versions:

Looking at the real time again, v1 seems to do slightly better than v2, although the difference is probably not significant. But v3 is the clear winner here: roughly 26-28 times faster than the other two.

[#<Benchmark::Tms:0x00007fe48268deb8
  @cstime=0.0,
  @cutime=0.0,
  @label="v1",
  @real=309.4179189999704,
  @stime=15.808668,
  @total=241.84425000000002,
  @utime=226.035582>,
 #<Benchmark::Tms:0x00007fe47ee8eda0
  @cstime=0.0,
  @cutime=0.0,
  @label="v2",
  @real=341.4523780000163,
  @stime=14.207616000000002,
  @total=231.90784299999999,
  @utime=217.70022699999998>,
 #<Benchmark::Tms:0x00007fe48073ef08
  @cstime=0.0,
  @cutime=0.0,
  @label="v3",
  @real=12.004358999955002,
  @stime=0.001827999999996166,
  @total=0.003549999999968634,
  @utime=0.0017219999999724678>]
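The Tms array above came from a harness along these lines, a sketch using Ruby's stdlib Benchmark; no-op lambdas stand in here for the three service objects, which would each call execute!:

```ruby
require 'benchmark'

# Stand-ins for UpdatePlayerGameStatisticsV1/V2/V3.new.execute!;
# the real jobs hit the database, these just burn a little CPU.
jobs = {
  'v1' => -> { 10_000.times { |i| i * 2 } },
  'v2' => -> { 10_000.times { |i| i * 2 } },
  'v3' => -> { 100.times { |i| i * 2 } },
}

# Benchmark.bm prints a timing table and returns an Array of
# Benchmark::Tms, one per labelled report.
results = Benchmark.bm(4) do |x|
  jobs.each { |label, job| x.report(label) { job.call } }
end
```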

Conclusion:

I could probably have explored other Active Record solutions, and it would be really interesting to see how other developers solve these problems the Rails way. But in my experience so far, Active Record shows performance bottlenecks in complex, data-heavy bulk updates and inserts. That is where SQL comes in handy.

Though the memoized dictionary (v2) did not beat the plain Active Record approach (v1) here, that does not mean it won't improve performance in other cases; I often run into situations where it helps a lot.

Finally, I find it important to run this kind of analysis for each specific problem or query, since every case is different. Again, in my experience, SQL tends to win for large amounts of data and/or complex updates.
