Lesson Learned #169: Bulk Insert using Python in Azure SQL


Today, I worked on a very interesting case where our customer wanted to insert millions of rows using Python. We reviewed two alternatives for importing the data as quickly as possible: using the BCP command-line utility and using the executemany method. Below, I would like to share my lessons learned.

 

Background:

 

  • We have 5 CSV files with 111,100,000 rows in total and around 22 columns (20 varchar(6) and 2 int columns). 
  • The Azure SQL Database is a Standard tier database (S9).
  • Virtual Machine:
    • All files are placed on the temporary drive (D:).
    • Accelerated networking is disabled.
    • 4 vCores and 16 GB of RAM.
  • The goal is to import the data as quickly as possible.  

 

Lessons Learned:

 

  • I developed two different Python scripts; both read every CSV file but use two different methods to import the data:
    • 1) Using the executemany method:
      • Since a single thread would take too much time, I configured the script to run 100 threads at the same time. 
      • The main thread reads the CSV file and, every 10,000 rows, hands that batch of data to a worker thread that runs executemany in the background. The main thread manages the status of the 100 threads (a minimal sketch of this approach follows the list).
        • Each worker thread creates a temporary table in TEMPDB.
        • Once the batch is loaded, it runs an INSERT INTO against the main table. 
        • The main table is a heap table without any non-clustered indexes.
      • The time spent importing all the data, measured at the database level:
        • 40 minutes of execution.
        • Database CPU usage: 70%-80%.
        • Log IO usage: 60%-70%.
        • About 3.4 million rows inserted per minute.
    • 2) Calling BCP.exe:
      • The main thread handles every CSV file, calling BCP.exe with a batch size of 10,000 (a sketch of this also follows the list). 
      • The time spent importing all the data, measured at the database level:
        • 20 minutes.
        • Database CPU usage: 5%-10%.
        • Log IO usage: 80%-90%.
        • About 7 million rows inserted per minute.
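
For reference, here is a minimal sketch of what the executemany approach could look like with pyodbc. It is not the original script: the connection string, file paths and the two-column layout are placeholders, and fast_executemany is an optional pyodbc flag that may not have been used in the test.

```python
# A rough sketch of the executemany approach, not the original script.
# Server, database, credentials, file paths and the 2-column layout are
# placeholders (the real files have 20 varchar(6) and 2 int columns).
import csv
import pyodbc
from concurrent.futures import ThreadPoolExecutor

CONN_STR = ("Driver={ODBC Driver 17 for SQL Server};"
            "Server=tcp:myserver.database.windows.net,1433;"
            "Database=mydb;Uid=myuser;Pwd=mypassword;Encrypt=yes;")

BATCH_SIZE = 10_000    # rows handed to each worker thread
MAX_WORKERS = 100      # concurrent insert threads, as in the test above


def insert_batch(rows):
    """Load one batch into a session-scoped temp table, then move it into
    the main heap table with a single INSERT ... SELECT."""
    conn = pyodbc.connect(CONN_STR)
    try:
        cursor = conn.cursor()
        cursor.fast_executemany = True  # optional pyodbc optimization
        cursor.execute("CREATE TABLE #staging (col1 VARCHAR(6), col2 INT)")
        cursor.executemany(
            "INSERT INTO #staging (col1, col2) VALUES (?, ?)", rows)
        cursor.execute(
            "INSERT INTO dbo.MainTable (col1, col2) "
            "SELECT col1, col2 FROM #staging")
        conn.commit()
    finally:
        conn.close()


def load_file(path):
    # The main thread reads the CSV and hands off every 10,000 rows; the
    # original script also tracked the status of the 100 threads, while
    # this sketch simply relies on the executor's internal queue.
    with open(path, newline="") as f, ThreadPoolExecutor(MAX_WORKERS) as pool:
        batch = []
        for row in csv.reader(f):
            batch.append((row[0], int(row[1])))
            if len(batch) == BATCH_SIZE:
                pool.submit(insert_batch, batch)
                batch = []
        if batch:
            pool.submit(insert_batch, batch)


if __name__ == "__main__":
    for csv_file in [r"D:\data\file1.csv", r"D:\data\file2.csv"]:
        load_file(csv_file)
```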

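And a similarly simplified sketch of driving BCP.exe from the main thread via subprocess; again, the server, database, table and file names are placeholders.

```python
# A rough sketch of calling bcp.exe from the main thread, not the original
# script. Server, database, credentials and file paths are placeholders;
# bcp.exe must be installed on the virtual machine.
import subprocess

SERVER = "tcp:myserver.database.windows.net,1433"
DATABASE = "mydb"
TABLE = "dbo.MainTable"
USER = "myuser"
PASSWORD = "mypassword"
FILES = [r"D:\data\file1.csv", r"D:\data\file2.csv"]

for path in FILES:
    cmd = [
        "bcp", TABLE, "in", path,
        "-S", SERVER, "-d", DATABASE,
        "-U", USER, "-P", PASSWORD,
        "-c",           # character mode
        "-t", ",",      # field terminator used by the CSV files
        "-b", "10000",  # commit every 10,000 rows, as in the test above
    ]
    # bcp streams the file itself; Python only orchestrates one file at a time.
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout)
    result.check_returncode()  # raise if bcp reported an error
```
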
 

It is clear that BCP performs better, even though executemany uses a prepared T-SQL command to improve performance. I will check whether Python offers the same option we have in .NET, SqlBulkCopy, where the performance is much better.

 

Enjoy!
