2016-04-20

Bulk insert (multi-row vs. batch) using JDBC and MyBatis

I have seen some developers using multi-row insert in MyBatis to perform bulk inserts.
If performance is your concern, though, you should consider using batch insert instead, especially when the number of rows to insert is large.

In this post, I will show you the advantage of batch insert over multi-row insert.
The project that I used for profiling is available on GitHub.

Multi-row insert

Multi-row insert is a form of INSERT statement that allows you to insert multiple rows in a single statement.
According to Wikipedia, it was added in the SQL-92 specification, and most popular RDBMSs support this feature in some way.

Here is an example of a basic SQL multi-row insert statement.

insert into person
(id, name, address)
values
(1, 'John', 'Kyoto'),
(2, 'Mike', 'Tokyo'),
(3, 'Paul', 'Sapporo');

Write multi-row insert in JDBC

In terms of the JDBC API, a multi-row insert is no different from a normal single-row insert statement.
It just has (number of columns * number of rows) placeholders in a single insert statement.

// Build SQL statement as a string
StringBuilder sql
  = new StringBuilder("insert into person")
    .append(" (id, name, address) values ");
for (int i = 0; i < persons.size(); i++) {
  if (i > 0)
    sql.append(",");
  sql.append("(?, ?, ?)"
}
// Prepare statement
PreparedStatement ps = connection.prepareStatement(sql.toString());
// Set parameters
int k = 0;
for (Person p : persons) {
  ps.setInt(++k, p.getId());
  ps.setString(++k, p.getName());
  ps.setString(++k, p.getAddress());
}
// Execute
ps.executeUpdate();

As it sends a single big statement, there are usually database-specific limitations that you should be aware of. For example…

  • MySQL : the packet size must be within the value of max_allowed_packet.
  • PostgreSQL : the number of placeholders must be lower than 32768. This is a limit of the wire protocol, which sends the parameter count as a 16-bit integer.

When the number of rows to insert is large, the resulting statement string gets bigger, and it sometimes causes performance issues even without MyBatis.
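A common workaround is to split the rows into fixed-size chunks and execute one multi-row insert per chunk. Here is a minimal sketch (not part of the profiled project); the chunk size of 1000 is an arbitrary example value, and insertChunk is a hypothetical helper wrapping the StringBuilder code above.

// Split the rows into chunks so that each multi-row statement
// stays below the database limits (placeholder count, packet size).
int chunkSize = 1000; // example value; tune for your database
for (int from = 0; from < persons.size(); from += chunkSize) {
  int to = Math.min(from + chunkSize, persons.size());
  // Build and execute one multi-row insert for this sub-list,
  // using the same StringBuilder approach shown above.
  insertChunk(connection, persons.subList(from, to)); // hypothetical helper
}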

Write multi-row insert in MyBatis

In MyBatis, you can write a multi-row insert by using the <foreach /> element.

<insert id="insertPersons">
  insert into person
  (id, name, address)
  values
  <foreach item="p" items="list" separator=",">
    (#{p.id}, #{p.name}, #{p.address})
  </foreach>
</insert>

The mapper method takes a list of Person objects as its parameter.

int insertPersons(List<Person> persons);

SqlSession sqlSession = sqlSessionFactory.openSession();
Mapper mapper = sqlSession.getMapper(Mapper.class);
try {
  mapper.insertPersons(persons);
  sqlSession.commit();
} finally {
  sqlSession.close();
}

Internally, it still generates the same single insert statement with many placeholders as the JDBC code above (e.g. insert into person (id, name, address) values (?, ?, ?), (?, ?, ?), (?, ?, ?) for a three-element list).

MyBatis has the ability to cache PreparedStatements, but this statement cannot be cached because it contains a <foreach /> element and the statement varies depending on the parameters.
As a result, MyBatis has to 1) evaluate the foreach part and 2) parse the statement string to build the parameter mapping [1] on every execution of this statement.
These steps are relatively costly when the statement string is big and contains many placeholders.

[1] Simply put, it is the mapping between the placeholders and the parameters.

Batch insert

Since version 2.0, the JDBC API has provided a ‘batch update’ facility that is specifically designed to perform bulk operations efficiently.

Write batch insert in JDBC

Here is Java code performing a batch insert using the JDBC API.

int batchSize = 100;
// Build SQL statement as a string
String sql = "insert into person"
  + " (id, name, address) values "
  + " (?, ?, ?);"
// Turn off auto-commit
connection.setAutoCommit(false);
// Prepare statement
PreparedStatement ps = connection.prepareStatement(sql);
for (int i = 0; i < persons.size();) {
  Person p = persons.get(i);
  ps.setInt(1, p.getId());
  ps.setString(2, p.getName());
  ps.setString(3, p.getAddress());
  ps.addBatch();
  i++;
  if (i % batchSize == 0 || i == persons.size()) {
    // Send the accumulated rows to the database
    ps.executeBatch();
  }
}
// Commit the transaction (auto-commit is off)
connection.commit();

When implementing a batch operation, there are two important points:

  • Turn off auto-commit.
  • Set appropriate batch size.

In the example above, the batch size is 100, which means that the program sends the data to the database every 100 rows.
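One driver-specific note: with MySQL Connector/J (used in the tests below), batched single-row inserts are only rewritten into an efficient multi-row form on the wire when rewriteBatchedStatements=true is set on the connection URL. A minimal sketch, with placeholder host, database, and credentials:

// rewriteBatchedStatements=true lets Connector/J rewrite a batch of
// single-row inserts into a multi-row statement on the wire.
// The URL and credentials below are placeholders.
String url = "jdbc:mysql://localhost:3306/test?rewriteBatchedStatements=true";
Connection connection
  = DriverManager.getConnection(url, "user", "password");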

Write batch insert in MyBatis

To perform batch updates in MyBatis, you need to specify ExecutorType.BATCH when opening the SqlSession.

int batchSize = 100;
SqlSession sqlSession
  = sqlSessionFactory.openSession(ExecutorType.BATCH);
Mapper mapper = sqlSession.getMapper(Mapper.class);
try {
  int size = persons.size();
  for (int i = 0; i < size;) {
    mapper.insert(persons.get(i));
    i++;
    if (i % batchSize == 0 || i == size) {
      sqlSession.flushStatements();
      sqlSession.clearCache();
    }
  }
  sqlSession.commit();
} finally {
  sqlSession.close();
}
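If you need the affected row counts, note that flushStatements() returns a List<BatchResult> (org.apache.ibatis.executor.BatchResult). A minimal sketch:

// Each BatchResult corresponds to one batched statement;
// getUpdateCounts() holds the per-row update counts.
List<BatchResult> results = sqlSession.flushStatements();
for (BatchResult result : results) {
  for (int count : result.getUpdateCounts()) {
    // count is typically 1 for a successful single-row insert
  }
}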

The mapper implementation is pretty straightforward.

int insert(Person person);

<insert id="insert">
  insert into person (id, name, address)
  values (#{id}, #{name}, #{address})
</insert>

Testing

Here is the project that I used to compare multi-row and batch insert.
Note that performance depends on various factors and the results below may not be applicable to your situation.

Environment

  • Mac OS X 10.11.4
  • MySQL 5.6.29 (InnoDB, Connector/J 5.1.38, rewriteBatchedStatements=true)
  • Eclipse 4.6.0M6

MySQL runs on the local environment.

Elapsed time : inserting 10k rows

Let’s see the execution time, first.
Each test method inserts 10k rows (10 columns per row) using a different method:

  • insert : executes the insert statement 10k times without batch.
  • insertBatch : executes insert 10k times using batch (batch size = 100).
  • insertMultiRow : executes multi-row insert.

Fig. 1 : Insert 10k rows using JDBC

Fig. 2 : Insert 10k rows using MyBatis

Elapsed time : inserting 50k rows

The same test, but with 50k rows to insert.

Fig. 3 : Insert 50k rows using JDBC

Fig. 4 : Insert 50k rows using MyBatis

Even though batch insert seems to scale better, multi-row does not look too bad.
How about the memory usage?

Memory usage

To compare the memory usage, I used the Eclipse launcher for VisualVM.
This time, I executed each method separately.

Fig. 5 : 10k rows / batch / JDBC (i.e. insertBatch in Fig. 1)

Fig. 6 : 10k rows / multi-row / JDBC (i.e. insertMultiRow in Fig. 1)

As the bytes allocated for Person objects are the same in both cases, we can estimate the difference by comparing their percentages.
In this particular case, multi-row insert uses twice as much memory as batch insert.

The difference gets bigger with 50k rows insertion.

Fig. 7 : 50k rows / batch / JDBC (i.e. insertBatch in Fig. 3)

Fig. 8 : 50k rows / multi-row / JDBC (i.e. insertMultiRow in Fig. 3)

Involving MyBatis does not change the result very much.

Fig. 9 : 50k rows / batch / MyBatis (i.e. insertBatch in Fig. 4)

Fig. 10 : 50k rows / multi-row / MyBatis (i.e. insertMultiRow in Fig. 4)

Conclusion

To implement bulk insert via JDBC (MyBatis also uses JDBC internally), I would recommend batch insert over multi-row insert because it has a smaller memory footprint and hence usually scales better.
Of course, you should not just take my word for it; run the tests yourself.

2 comments:

  1. Hi, although it has been a long time, I want to say that benchmarks can sometimes lie to you. Did you notice that your network I/O latency is zero? Batch insert does not optimize network I/O; when the network is degraded, it can take longer than a multi-row insert.

    1. It all depends, and if the foreach works better for you, that's great. But the problem caused by the single big SQL is really difficult (if not impossible) to overcome once it happens.