Inserting Bulk Data into Postgres Using Ruby


I had a task that required generating a lot of data in a PostgreSQL database and running some queries on the generated data. So this is going to be a small crash course on Ruby and PostgreSQL (and a bit of optimization).

Requirements

The database contained only two tables. The first one was:

  • cup_matches(mid, round, year, num_ratings, rating)
  • mid is the primary key for matches
  • round is the cup round/level (e.g. 32nd, 16th, quarter_final, semi_final, final)
  • year is the year of the cup.
  • num_ratings is how many people viewed and rated that match
  • rating is the average rating of all viewers

The second one was:

  • played_in(mid, name, year, position) is a table that stores which player played in which match:
  • mid is the match id (from cup_matches table)
  • name is the player name
  • mid+name is the primary key
  • year is the year of the match
  • position is the position number of the player in the match (e.g. 1 for goalkeeper, etc.)

And the actual task:

  • Write a script that populates the database with 2,680 matches and 58,960 players
  • Insert players whose names contain the word “pele” exactly 118 times

Setup

I created a Ruby script and used the pg gem to connect to the database:

  require 'pg'

  conn = PG::Connection.open(dbname: 'mydb',
                            host: 'localhost',
                            user: 'testuser')

Tables

Now I create the two tables:

  # create the first table (cup_matches)
  conn.exec("CREATE TABLE cup_matches (
              mid serial PRIMARY KEY,
              round varchar(20),
              year integer,
              num_ratings integer,
              rating real);")

  # create the second table (played_in)
  conn.exec("create table played_in(
            mid integer references cup_matches(mid),
            name varchar(80),
            year integer,
            position integer,
            PRIMARY KEY(mid,name));")

Generating Data

Here I generate data into the first table. This is pretty straightforward:

  # generate data for cup_matches

  round_array = ['32nd', '16th', '8th', 'quarter_final',
                  'semi_final', 'final']

  for i in (1..2680) do
    round_i = rand(0..5)
    year_value = rand(2000..2015)
    num_ratings_value = rand(1..205)
    rating_value = rand(1.0..5.0)

    # insert this row
    conn.prepare("ins#{i}",
                "INSERT INTO cup_matches(round, year, num_ratings, rating)
                  VALUES ($1, $2, $3, $4)")

    conn.exec_prepared("ins#{i}",
                          [round_array[round_i], year_value,
                          num_ratings_value, rating_value])
  end
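
A quick aside on the prepared statements: since the INSERT text never changes, it only needs to be prepared once, outside the loop, instead of registering a new statement name for every row. A minimal sketch of that variant (the statement name ins_cup_match is just an example of mine):

  # prepare the INSERT once and reuse it for every row
  conn.prepare("ins_cup_match",
              "INSERT INTO cup_matches(round, year, num_ratings, rating)
                VALUES ($1, $2, $3, $4)")

  2680.times do
    conn.exec_prepared("ins_cup_match",
                      [round_array[rand(0..5)], rand(2000..2015),
                        rand(1..205), rand(1.0..5.0)])
  end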

Up to this point everything worked fine, but when I started inserting data into the second table things got really slow. My original script contained this part:

  # generate data for played_in

  # array of random names
  name_array = ['Shinichi', 'Nitrogen', 'Yan', 'Itachi', 'Kurosan',
                'Tamashii', 'Ben', 'Owen', 'Ror', 'Bo', 'Yama', 'Kaminari',
                'Kintama', 'Sasuke', 'Ali', 'Leila', 'Xou', 'Xubuntu',
                'Gnomy', 'Lolly', 'Gintoki', 'Kogami', 'Kurotora', 'Matz',
                'Yuki', 'Yumi', 'Kakashi', 'Naruto', 'K.O.', 'Gohan',
                'Kagura', 'Kuroon', 'Ran', 'Aurora', 'Conan', 'Kudo',
                'Lol', 'Yami', 'Sabasa', 'Namik', 'Goku', 'Tintin',
                'Gara', 'Aiomi']

  # this array holds {name => mid} hashes that have already been inserted
  current_data = []

  for i in (1..58960) do
    # generate random data
    year_value = rand(2000..2015)
    position_value = rand(1..11)
    name_i = rand(0..43)
    mid_value = rand(1..2680)

    while true
      # create a {name => mid} hash for the candidate pair
      hash = { name_array[name_i] => mid_value }

      # check whether this pair is already in the current_data array
      if current_data.detect { |h| h == hash }
        name_i = rand(0..43)
        mid_value = rand(1..2680)
      else
        break
      end
    end

    conn.prepare("inse#{i}",
                  "INSERT INTO played_in(mid, name, year, position)
                  VALUES ($1, $2, $3, $4)")

    conn.exec_prepared("inse#{i}",
                      [mid_value, name_array[name_i],
                      year_value, position_value])

    # append the unique {name => mid} hash to the current_data array
    current_data << { name_array[name_i] => mid_value }
  end

The Issue

The second table's primary key is a pair of columns: name together with mid, which is also a foreign key to cup_matches. Thus, every row must have a unique (mid, name) pair.

My approach was to save all the used pairs in an array, generate a random pair, and loop until I found one that had not been used before. This got slower and slower as the current_data array grew, because every check has to traverse the whole array: methods like Enumerable#detect and Enumerable#include? are both O(n).

It took about 40 minutes to generate 21,440 rows, and for some reason it got stuck there.
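
In hindsight, the in-memory check itself was not the problem so much as the data structure: a Ruby Set of [mid, name] pairs gives constant-time membership tests. A minimal sketch of that idea (which I did not try at the time):

  require 'set'

  # set of [mid, name] pairs that have already been used
  used_pairs = Set.new

  for i in (1..58960) do
    name_i = rand(0..43)
    mid_value = rand(1..2680)

    # Set#add? returns nil if the pair is already in the set,
    # so keep re-rolling until we hit an unused pair (O(1) per check)
    until used_pairs.add?([mid_value, name_array[name_i]])
      name_i = rand(0..43)
      mid_value = rand(1..2680)
    end

    # ... insert (mid_value, name_array[name_i], ...) exactly as before ...
  end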

Solution

In my second attempt, I used a PostgreSQL SELECT query to check the table instead:

  for i in (1..58960) do
    while true
      name_i = rand(0..43)
      year_value = rand(2000..2015)
      position_value = rand(1..11)
      mid_value = rand(1..2680)

      # SELECT query that searches for the pair (mid_value, name_array[name_i])
      result = conn.exec( "SELECT * FROM played_in WHERE mid=#{mid_value}
                          AND name='#{name_array[name_i]}' " )

      # break once the result set is empty,
      # i.e. the pair (mid_value, name_array[name_i]) has not been used yet
      break unless result.any?
    end

    conn.prepare("inse#{i}", "INSERT INTO
                  played_in(mid, name, year, position)
                  VALUES ($1, $2, $3, $4)")

    conn.exec_prepared("inse#{i}",
                      [mid_value, name_array[name_i],
                      year_value, position_value])
  end

This made the script much faster: it generated all 58,960 rows in about 20 minutes, and the code is simpler as well.
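
On PostgreSQL 9.5 or newer there is an even shorter route: let the primary key itself reject duplicates with INSERT ... ON CONFLICT DO NOTHING, and count how many rows actually landed via cmd_tuples. A rough sketch (I have not benchmarked it against the version above):

  # push the duplicate detection into Postgres itself (PostgreSQL 9.5+)
  conn.prepare("ins_played_in",
              "INSERT INTO played_in(mid, name, year, position)
                VALUES ($1, $2, $3, $4)
                ON CONFLICT (mid, name) DO NOTHING")

  inserted = 0
  while inserted < 58960
    result = conn.exec_prepared("ins_played_in",
                                [rand(1..2680), name_array[rand(0..43)],
                                  rand(2000..2015), rand(1..11)])

    # cmd_tuples is 0 when the pair already existed and the insert was skipped
    inserted += result.cmd_tuples
  end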

The last part changes exactly 118 rows to have the name 'pele', as the task requires:

  # this is where we make 118 rows have the name 'pele'
  for i in (1..118) do

    while true
      mid_value = rand(1..2680)

      # retrieve all rows for this match
      result = conn.exec( "SELECT * FROM played_in WHERE mid=#{mid_value}" )

      # use this mid only if the match has at least one player
      # and none of them is already named 'pele'
      break if result.ntuples > 0 &&
               result.none? { |row| row['name'] == 'pele' }
    end

    # retrieve the name (column 1) from the first row of the result
    target_name = result.getvalue(0, 1)

    # update this one row to have name='pele'
    conn.exec_params("UPDATE played_in SET name = $1
                      WHERE mid = $2 AND name = $3",
                     ['pele', mid_value, target_name])
  end

Here’s the full script on GitHub.

Other Solutions?

There are certainly other good approaches. One is to generate the data into a CSV file and then COPY it into the real tables, as sketched below. If you have a better solution, please let me know.
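
For the CSV route, something along these lines should work with the pg gem's copy_data helper (a sketch I have not run; it generates the unique pairs up front and reuses conn and name_array from above):

  require 'csv'
  require 'set'

  # generate unique (mid, name) pairs up front, then write all rows to a CSV
  pairs = Set.new
  pairs << [rand(1..2680), name_array[rand(0..43)]] while pairs.size < 58960

  CSV.open('played_in.csv', 'w') do |csv|
    pairs.each { |mid, name| csv << [mid, name, rand(2000..2015), rand(1..11)] }
  end

  # bulk-load the whole file in one COPY instead of 58,960 single INSERTs
  File.open('played_in.csv', 'r') do |file|
    conn.copy_data("COPY played_in(mid, name, year, position) FROM STDIN CSV") do
      conn.put_copy_data(file.read)
    end
  end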

Resources

PostgreSQL tutorials

PG gem documentation

PGresult class documentation

Import CSV file into PostgreSQL table