Lesson Learned #169: Bulk Insert using Python in Azure SQL

This post has been republished via RSS; it originally appeared at: Azure Database Support Blog articles.

Today, I worked in a very interesting case where our customer wants to insert millions of rows using Python. We reviewed two alternatives to import the data as soon as possible: Using BCP command line and using executemany command. Following I would like to share my lessons learned about it.

 

Background:

 

  • We have 5 CSV files with  111.100.000 and around 22 columns (20 varchar(6) and 2 int data type columns). 
  • The Azure SQL Database is Standard 9 
  • Virtual Machine:
    • All files are placed on temporary drive (D:).
    • The accelerated networking is disabled.
    • vCore 4 and 16 Gb of RAM
  • The goal is to import the data as soon as possible.  

 

Lessons Learned:

 

  • I developed two different python script, both will read every CSVs files but using two different methods to import the data:
    • 1) Using executemany method:
      • As I saw that a single thread will take too much time, I configure to run 100 threads at the same time. 
      • The main thread reads the CSV file and when it reached 10000 rows, it sends this array of data to a single thread that execute the executemany in background. I managed the status of 100 threads.
        • This thread creates a temporal table in TEMPDB
        • Once it finished runs an INSERT INTO to the main table. 
        • This main table is a head table without any non-clustered indexes.
      • The time spent to insert all the data was at database level:
        • 40 mins of execution.
        • CPU DB usage 70%-80%
        • LOG IO usage 60%-70%
        • Every minute 3,4 M rows are inserted.
    • 2) Calling BCP.exe 
      • Using the main thread to read every CSV. Batch size is 10000. 
      • The time spent to insert all the data was at database level:
        • 20 mins.
        • CPU DB usage 5%-10%
        • LOG IO usage 80%-90%
        • Every minute 7M rows are inserted

 

It is clear that BCP is better to even when executemany is using a prepared TSQL command to improve the performance. I will check if we have the save option as we have in .NET - SqlBulkCopy that the performance is much better.

 

Enjoy!

REMEMBER: these articles are REPUBLISHED. Your best bet to get a reply is to follow the link at the top of the post to the ORIGINAL post! BUT you're more than welcome to start discussions here:

This site uses Akismet to reduce spam. Learn how your comment data is processed.