Inserting Bulk Data into Postgres Using Ruby
June 01, 2015
I had a task where I needed to generate a lot of data in a PostgreSQL database and run some queries on the generated data. So this is going to be some kind of a crash course on Ruby and PostgreSQL (and some optimization).
Requirements
The database contained only two tables. The first one was:
- cup_matches(mid, round, year, num_ratings, rating)
- mid is the primary key for matches
- round is the cup round/level (e.g. 32nd, 16th, quarter_final, semi_final, final)
- year is the year of the cup.
- num_ratings is how many people viewed and rated that match
- rating is the average rating of all viewers
The second one:
- played_in(mid, name, year, position), a table that stores which player played in which match:
- mid is the match id (from cup_matches table)
- name is the player name
- mid+name is the primary key
- year is the year of the match
- position is the position number of the player in the match (e.g. 1 for goalkeeper, etc.)
And the actual task:
- Write a script that populates the database with 2,680 matches and 58,960 players
- Insert players whose names contain the word “pele” exactly 118 times
Setup
I created a Ruby script and used the pg gem to connect to the database:
require 'pg'
conn = PG::Connection.open(dbname: 'mydb',
                           host: 'localhost',
                           user: 'testuser')
Tables
Now I create the two tables:
# create the first table (cup_matches)
conn.exec("CREATE TABLE cup_matches (
mid serial PRIMARY KEY,
round varchar(20),
year integer,
num_ratings integer,
rating real);")
# create the second table (played_in)
conn.exec("create table played_in(
mid integer references cup_matches(mid),
name varchar(80),
year integer,
position integer,
PRIMARY KEY(mid,name));")
Generating Data
Here I generate data into the first table. This is pretty straightforward:
# generate data for cup_matches
round_array = ['32th', '16th', '8th', 'quarter_final',
               'semi_final', 'final']
for i in (1..2680) do
  round_i = rand(0..5)
  year_value = rand(2000..2015)
  num_ratings_value = rand(1..205)
  rating_value = rand(1.0..5.0)
  # insert this row
  conn.prepare("ins#{i}",
               "INSERT INTO cup_matches(round, year, num_ratings, rating)
                VALUES ($1, $2, $3, $4)")
  conn.exec_prepared("ins#{i}",
                     [round_array[round_i], year_value,
                      num_ratings_value, rating_value])
end
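As a side note (this is not something my script did), a common way to speed up a loop of many small inserts is to wrap them in a single transaction, so PostgreSQL commits once for the whole batch instead of once per row. A minimal, untested sketch of that idea with the pg gem might look like this:
# sketch: one commit for the whole batch instead of one per row
conn.transaction do |c|
  2680.times do
    c.exec_params("INSERT INTO cup_matches(round, year, num_ratings, rating)
                   VALUES ($1, $2, $3, $4)",
                  [round_array[rand(0..5)], rand(2000..2015),
                   rand(1..205), rand(1.0..5.0)])
  end
end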
Up until this point everything worked fine, but when I wanted to insert data into the second table, things started to get really slow. My original script contained this part:
# generate data for played_in
# array of random names
name_array = ['Shinichi', 'Nitrogen', 'Yan', 'Itachi', 'Kurosan',
              'Tamashii', 'Ben', 'Owen', 'Ror', 'Bo', 'Yama', 'Kaminari',
              'Kintama', 'Sasuke', 'Ali', 'Leila', 'Xou', 'Xubuntu',
              'Gnomy', 'Lolly', 'Gintoki', 'Kogami', 'Kurotora', 'Matz',
              'Yuki', 'Yumi', 'Kakashi', 'Naruto', 'K.O.', 'Gohan',
              'Kagura', 'Kuroon', 'Ran', 'Aurora', 'Conan', 'Kudo',
              'Lol', 'Yami', 'Sabasa', 'Namik', 'Goku', 'Tintin',
              'Gara', 'Aiomi']
# this array holds hashes of {name => mid} pairs that have already been inserted
current_data = []
for i in (1..58960) do
  # generate random data
  year_value = rand(2000..2015)
  position_value = rand(1..11)
  name_i = rand(0..43)
  mid_value = rand(1..2680)
  while true
    # create a hash for this (name, mid) pair
    hash = { name_array[name_i] => mid_value }
    # if this pair is already in the current_data array, try a new one
    if current_data.detect { |h| h == hash }
      name_i = rand(0..43)
      mid_value = rand(1..2680)
    else
      break
    end
  end
  conn.prepare("inse#{i}",
               "INSERT INTO played_in(mid, name, year, position)
                VALUES ($1, $2, $3, $4)")
  conn.exec_prepared("inse#{i}",
                     [mid_value, name_array[name_i],
                      year_value, position_value])
  # append the unique {name => mid} hash to the current_data array
  current_data[i] = { name_array[name_i] => mid_value }
end
The Issue
The second table's primary key (played_in_pk) is a pair of columns: the name column together with mid, which is also a foreign key into cup_matches. Thus, every row must have a unique (mid, name) pair.
My approach was to save all the used pairs in an array, generate a random pair, then loop until I found a pair that hadn't been used before. This took a lot of time once the current_data array got bigger, because every check had to traverse the whole array, especially since I used methods like Enumerable#detect and Enumerable#include?, which are both O(n). It took about 40 minutes to generate 21,440 rows, and then for some reason it got stuck there.
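Looking back, the in-memory approach could probably have been saved by keeping the used pairs in a Set instead of an Array, since Set membership checks are roughly constant time. Here is a rough, untested sketch of just the uniqueness check (reusing name_array from above):
require 'set'

# sketch: remember used (mid, name) pairs in a Set for O(1) membership checks
used_pairs = Set.new
while true
  mid_value = rand(1..2680)
  player_name = name_array[rand(0..43)]
  # keep generating random pairs until we hit one we haven't used yet
  break unless used_pairs.include?([mid_value, player_name])
end
used_pairs << [mid_value, player_name]
A Hash keyed by the [mid, name] pair would work just as well.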
Solution
In my second attempt, I used a PostgreSQL SELECT query to search the table instead of an in-memory array. So this was my second solution:
for i in (1..58960) do
  while true
    name_i = rand(0..43)
    year_value = rand(2000..2015)
    position_value = rand(1..11)
    mid_value = rand(1..2680)
    # SELECT query that searches for the pair (mid_value, name_array[name_i])
    result = conn.exec("SELECT * FROM played_in WHERE mid=#{mid_value}
                        AND name='#{name_array[name_i]}'")
    # loop until the result is empty,
    # i.e., (mid_value, name_array[name_i]) has not been used yet
    break unless result.any?
  end
  conn.prepare("inse#{i}",
               "INSERT INTO played_in(mid, name, year, position)
                VALUES ($1, $2, $3, $4)")
  conn.exec_prepared("inse#{i}",
                     [mid_value, name_array[name_i],
                      year_value, position_value])
end
This made the script much faster: it generated 58,960 rows in about 20 minutes, which was a great improvement, and the code is much simpler now.
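Looking back, the SELECT round-trip for every row could probably be avoided altogether by simply attempting the INSERT and letting the (mid, name) primary key reject duplicates. A rough, untested sketch of that idea (assuming a pg version that raises PG::UniqueViolation on unique-constraint errors):
# sketch: rely on the primary key instead of checking with a SELECT first
conn.prepare("ins_player",
             "INSERT INTO played_in(mid, name, year, position)
              VALUES ($1, $2, $3, $4)")
58960.times do
  begin
    conn.exec_prepared("ins_player",
                       [rand(1..2680), name_array[rand(0..43)],
                        rand(2000..2015), rand(1..11)])
  rescue PG::UniqueViolation
    retry  # duplicate (mid, name) pair: new random values are picked on retry
  end
end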
This is the last part, where I changed exactly 118 rows to have the name pele, as required in the task:
# this is where we make 118 rows have the name 'pele'
for i in (1..118) do
  while true
    mid_value = rand(1..2680)
    # make sure we don't already have a (mid_value, 'pele') pair in the database
    result = conn.exec("SELECT * FROM played_in WHERE mid='#{mid_value}'
                        AND name='pele'")
    # break if mid_value is valid to use
    break unless result.any?
  end
  # retrieve all rows that have mid_value
  result = conn.exec("SELECT * FROM played_in WHERE mid='#{mid_value}'")
  # result is a set of rows;
  # retrieve the name (column 1) from the first row (row 0)
  target_name = result.getvalue(0, 1)
  # update this row to have name='pele'
  conn.exec_params("UPDATE played_in SET name = $1 WHERE mid=#{mid_value}
                    AND name='#{target_name}'", ['pele'])
end
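As a quick sanity check, a count of the matching rows afterwards should print exactly 118:
# sanity check: this should print 118
result = conn.exec("SELECT count(*) FROM played_in WHERE name LIKE '%pele%'")
puts result.getvalue(0, 0)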
Here’s the full script on GitHub.
Other Solutions?
There might be other good solutions. One idea is to generate the data into a CSV file and then COPY it into the real tables.
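For example, a rough, untested sketch of that idea using the pg gem's copy_data helper (where rows stands for data generated in memory beforehand) might look like this:
# sketch: stream pre-generated rows into played_in with COPY
# `rows` is assumed to be an array of [mid, name, year, position] arrays
conn.copy_data("COPY played_in (mid, name, year, position) FROM STDIN WITH (FORMAT csv)") do
  rows.each do |mid, name, year, position|
    conn.put_copy_data("#{mid},#{name},#{year},#{position}\n")
  end
end
If you have any good solutions, please let me know.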