Inserting Bulk Data into Postgres Using Ruby
June 01, 2015
I had a task where I needed to generate a lot of data in a PostgreSQL database and run some queries on the generated data. So this is going to be some kind of a crash course on Ruby and PostgreSQL (and some optimization).
Requirements
The database contained only two tables. The first one was:
- cup_matches(mid, round, year, num_ratings, rating)
- mid is the primary key for matches
- round is the cup round/level (e.g. 32nd, 16th, quarter_final, semi_final, final)
- year is the year of the cup.
- num_ratings is how many people viewed and rated that match
- rating is the average rating of all viewers
The second one:
- played_in(mid, name, year, position), a table that stores which player played in which match:
- mid is the match id (from cup_matches table)
- name is the player name
- mid+name is the primary key
- year is the year of the match
- position is the position number of the player in the match (e.g. 1 for goalkeeper, etc.)
And the actual task:
- Write a script that populates the database with 2,680 matches and 58,960 players
- Insert players whose names contain the word “pele” exactly 118 times
Setup
I created a Ruby script and used the pg gem to connect to the database:
require 'pg'
conn = PG::Connection.open(dbname: 'mydb',
                           host: 'localhost',
                           user: 'testuser')
Tables
Now I create the two tables:
# create the first table (cup_matches)
conn.exec("CREATE TABLE cup_matches (
mid serial PRIMARY KEY,
round varchar(20),
year integer,
num_ratings integer,
rating real);")
# create the second table (played_in)
conn.exec("create table played_in(
mid integer references cup_matches(mid),
name varchar(80),
year integer,
position integer,
PRIMARY KEY(mid,name));")
Generating Data
Here I generate data into the first table. This is pretty straightforward:
# generate data for cup_matches
round_array = ['32th', '16th', '8th', 'quarter_final',
               'semi_final', 'final']
for i in (1..2680) do
  round_i = rand(0..5)
  year_value = rand(2000..2015)
  num_ratings_value = rand(1..205)
  rating_value = rand(1.0..5.0)
  # insert this row
  conn.prepare("ins#{i}",
               "INSERT INTO cup_matches(round, year, num_ratings, rating)
                VALUES ($1, $2, $3, $4)")
  conn.exec_prepared("ins#{i}",
                     [round_array[round_i], year_value,
                      num_ratings_value, rating_value])
end
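As a side note (this is not something my script did), a common way to speed up a loop of many small inserts is to wrap them in a single transaction, so PostgreSQL commits once for the whole batch instead of once per row. A minimal, untested sketch of that idea with the pg gem might look like this:
# sketch: one commit for the whole batch instead of one per row
conn.transaction do |c|
  2680.times do
    c.exec_params("INSERT INTO cup_matches(round, year, num_ratings, rating)
                   VALUES ($1, $2, $3, $4)",
                  [round_array[rand(0..5)], rand(2000..2015),
                   rand(1..205), rand(1.0..5.0)])
  end
end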
Up until this point everything worked fine, but when I wanted to insert data into the second table, things started to get really slow. My original script contained this part:
# generate data for played_in
# array of random names
name_array = ['Shinichi', 'Nitrogen', 'Yan', 'Itachi', 'Kurosan',
              'Tamashii', 'Ben', 'Owen', 'Ror', 'Bo', 'Yama', 'Kaminari',
              'Kintama', 'Sasuke', 'Ali', 'Leila', 'Xou', 'Xubuntu',
              'Gnomy', 'Lolly', 'Gintoki', 'Kogami', 'Kurotora', 'Matz',
              'Yuki', 'Yumi', 'Kakashi', 'Naruto', 'K.O.', 'Gohan',
              'Kagura', 'Kuroon', 'Ran', 'Aurora', 'Conan', 'Kudo',
              'Lol', 'Yami', 'Sabasa', 'Namik', 'Goku', 'Tintin',
              'Gara', 'Aiomi']
# this array holds hashes of {name => mid} pairs that have already been inserted
current_data = []
for i in (1..58960) do
  # generate random data
  year_value = rand(2000..2015)
  position_value = rand(1..11)
  name_i = rand(0..43)
  mid_value = rand(1..2680)
  while true
    # create a hash for this (name, mid) pair
    hash = { name_array[name_i] => mid_value }
    # if this pair is already in the current_data array, try a new one
    if current_data.detect { |h| h == hash }
      name_i = rand(0..43)
      mid_value = rand(1..2680)
    else
      break
    end
  end
  conn.prepare("inse#{i}",
               "INSERT INTO played_in(mid, name, year, position)
                VALUES ($1, $2, $3, $4)")
  conn.exec_prepared("inse#{i}",
                     [mid_value, name_array[name_i],
                      year_value, position_value])
  # append the unique {name => mid} hash to the current_data array
  current_data[i] = { name_array[name_i] => mid_value }
end
The Issue
The second table's primary key (played_in_pk) is a pair of columns: the name column together with mid, which is also a foreign key into cup_matches. Thus, every row must have a unique (mid, name) pair.
My approach was to save all the used pairs in an array, generate a random pair, then loop until I found a pair that hadn't been used before. This took a lot of time once the current_data array got bigger, because every check had to traverse the whole array, especially since I used methods like Enumerable#detect and Enumerable#include?, which are both O(n). It took about 40 minutes to generate 21,440 rows, and then for some reason it got stuck there.
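Looking back, the in-memory approach could probably have been saved by keeping the used pairs in a Set instead of an Array, since Set membership checks are roughly constant time. Here is a rough, untested sketch of just the uniqueness check (reusing name_array from above):
require 'set'

# sketch: remember used (mid, name) pairs in a Set for O(1) membership checks
used_pairs = Set.new
while true
  mid_value = rand(1..2680)
  player_name = name_array[rand(0..43)]
  # keep generating random pairs until we hit one we haven't used yet
  break unless used_pairs.include?([mid_value, player_name])
end
used_pairs << [mid_value, player_name]
A Hash keyed by the [mid, name] pair would work just as well.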
Solution
In my second attempt, I used a PostgreSQL SELECT query to search the table instead of an in-memory array. So this was my second solution:
for i in (1..58960) do
  while true
    name_i = rand(0..43)
    year_value = rand(2000..2015)
    position_value = rand(1..11)
    mid_value = rand(1..2680)
    # SELECT query that searches for the pair (mid_value, name_array[name_i])
    result = conn.exec("SELECT * FROM played_in WHERE mid=#{mid_value}
                        AND name='#{name_array[name_i]}'")
    # loop until the result is empty,
    # i.e., (mid_value, name_array[name_i]) has not been used yet
    break unless result.any?
  end
  conn.prepare("inse#{i}",
               "INSERT INTO played_in(mid, name, year, position)
                VALUES ($1, $2, $3, $4)")
  conn.exec_prepared("inse#{i}",
                     [mid_value, name_array[name_i],
                      year_value, position_value])
end
This made the script much faster: it generated 58,960 rows in about 20 minutes, which was a great improvement, and the code is much simpler now.
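Looking back, the SELECT round-trip for every row could probably be avoided altogether by simply attempting the INSERT and letting the (mid, name) primary key reject duplicates. A rough, untested sketch of that idea (assuming a pg version that raises PG::UniqueViolation on unique-constraint errors):
# sketch: rely on the primary key instead of checking with a SELECT first
conn.prepare("ins_player",
             "INSERT INTO played_in(mid, name, year, position)
              VALUES ($1, $2, $3, $4)")
58960.times do
  begin
    conn.exec_prepared("ins_player",
                       [rand(1..2680), name_array[rand(0..43)],
                        rand(2000..2015), rand(1..11)])
  rescue PG::UniqueViolation
    retry  # duplicate (mid, name) pair: new random values are picked on retry
  end
end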
This is the last part, where I changed exactly 118 rows to have the name pele, as required in the task:
# this is where we make 118 rows have the name 'pele'
for i in (1..118) do
  while true
    mid_value = rand(1..2680)
    # make sure we don't already have a (mid_value, 'pele') pair in the database
    result = conn.exec("SELECT * FROM played_in WHERE mid='#{mid_value}'
                        AND name='pele'")
    # break if mid_value is valid to use
    break unless result.any?
  end
  # retrieve all rows that have mid_value
  result = conn.exec("SELECT * FROM played_in WHERE mid='#{mid_value}'")
  # result is a set of rows;
  # retrieve the name (column 1) from the first row (row 0)
  target_name = result.getvalue(0, 1)
  # update this row to have name='pele'
  conn.exec_params("UPDATE played_in SET name = $1 WHERE mid=#{mid_value}
                    AND name='#{target_name}'", ['pele'])
end
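As a quick sanity check, a count of the matching rows afterwards should print exactly 118:
# sanity check: this should print 118
result = conn.exec("SELECT count(*) FROM played_in WHERE name LIKE '%pele%'")
puts result.getvalue(0, 0)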
Here’s the full script on GitHub.
Other Solutions?
There might be other good solutions. One idea is to generate the data into a CSV file and then COPY it into the real tables.
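For example, a rough, untested sketch of that idea using the pg gem's copy_data helper (where rows stands for data generated in memory beforehand) might look like this:
# sketch: stream pre-generated rows into played_in with COPY
# `rows` is assumed to be an array of [mid, name, year, position] arrays
conn.copy_data("COPY played_in (mid, name, year, position) FROM STDIN WITH (FORMAT csv)") do
  rows.each do |mid, name, year, position|
    conn.put_copy_data("#{mid},#{name},#{year},#{position}\n")
  end
end
If you have any good solutions, please let me know.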