
Writing data using Azure Synapse Dedicated SQL Pool Connector for Apache Spark


Summary 

 

A common data engineering task is to explore, transform, and load data into a data warehouse using Azure Synapse Apache Spark. The Azure Synapse Dedicated SQL Pool Connector for Apache Spark is the way to read and write large volumes of data efficiently between Apache Spark and a Dedicated SQL Pool in Synapse Analytics. The connector supports the Scala and Python languages on Synapse Notebooks to perform these operations.

 

The intention of this guide is not to explain all the connector features; if you require a deeper understanding of how this connector works, start here.

 

Scenario

 

Customers usually perform this kind of operation using Synapse Apache Spark to load data to a Dedicated SQL Pool within an Azure Synapse Workspace, but today I would like to reproduce a different scenario that I worked on in one of my support cases. Consider a scenario where you are trying to load data from Synapse Spark to a Dedicated SQL Pool (formerly SQL DW) using Synapse Pipelines, and additionally your Synapse Workspace is deployed with a Managed Virtual Network.

 

The intention of this guide is to help you with the configuration required to load data from Azure Synapse Apache Spark to a Dedicated SQL Pool (formerly SQL DW). If you prefer to take advantage of the new feature-rich capabilities now available via the Synapse workspace and Studio, and to load data directly from Azure Apache Spark to the Dedicated SQL Pool in the Azure Synapse Workspace, it is recommended that you enable Synapse workspace features on the existing dedicated SQL pool (formerly SQL DW).

 

Before we start, here are some initial considerations:

 

To write data to internal tables, the connector now uses the COPY statement instead of the CETAS/CTAS approach. It also requires storage permissions.

 

Environment

 

There are three deployments needed to make this configuration possible:

 

  1. Create an Azure Synapse Workspace with Managed Virtual Network enabled. Start here.
  2. Create an Apache Spark pool using Synapse Studio. Start here.
  3. Create a Dedicated SQL pool (formerly SQL DW). Start here.

 

Fig 1 - Resources

 

Implementation

 

Step 1 - Let's create a Synapse Notebook that will perform the read and write operations to be executed from the Synapse Pipeline.

 

Fig 2 - Creating an Azure Synapse Notebook

 

Here is the Spark script used within my sample notebook to generate data:

 

 

 

%%sql
CREATE DATABASE IF NOT EXISTS SampleDB

%%sql
USE SampleDB

%%sql
CREATE TABLE IF NOT EXISTS SampleDB.SampleTable (id int, name string, age int) USING PARQUET

%%sql
INSERT INTO SampleDB.SampleTable VALUES (1, 'Your Name', 18)

%%pyspark
df = spark.sql("SELECT * FROM `SampleDB`.`SampleTable`")
display(df)

 

 

 

Below is a sample that writes to an internal table using AAD authentication with the Python connector:

 

 

# Write using AAD Auth to internal table
# Add required imports
import com.microsoft.spark.sqlanalytics
from com.microsoft.spark.sqlanalytics.Constants import Constants

# Configure and submit the request to write to Synapse Dedicated SQL Pool
# Sample below is using AAD-based authentication approach; see further examples to leverage SQL Basic auth.
(df.write
    # If `Constants.SERVER` is not provided, the `<database_name>` from the three-part table name argument
    # to the `synapsesql` method is used to infer the Synapse Dedicated SQL endpoint.
    .option(Constants.SERVER, "<sql-server-name>.sql.azuresynapse.net")
    # Likewise, if `Constants.TEMP_FOLDER` is not provided, the connector will use the runtime staging
    # directory config (see the section on Configuration Options for details).
    .option(Constants.TEMP_FOLDER, "abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<some_base_path_for_temporary_staging_folders>")
    # Choose a save mode that fits your use case.
    # Options for save modes are "error" or "errorifexists" (default), "overwrite", "append", "ignore".
    # Refer to https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#save-modes
    .mode("overwrite")
    # Required parameter - three-part table name to which data will be written
    .synapsesql("sqlpool01.dbo.SampleTable"))
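The comments in the sample above mention SQL basic authentication as an alternative to AAD. For reference, here is a minimal sketch of that variant; it assumes the connector exposes Constants.USER, Constants.PASSWORD, and Constants.STAGING_STORAGE_ACCOUNT_KEY options for basic auth (verify these option names against the connector documentation for your runtime version), and every server, credential, and path value is a placeholder:

# Write using SQL basic auth to an internal table (sketch; option names are assumptions noted above)
import com.microsoft.spark.sqlanalytics
from com.microsoft.spark.sqlanalytics.Constants import Constants

(df.write
    # Dedicated SQL endpoint of the target workspace (placeholder)
    .option(Constants.SERVER, "<sql-server-name>.sql.azuresynapse.net")
    # SQL basic auth credentials used instead of the logged-in user's AAD token (placeholders)
    .option(Constants.USER, "<sql_user_name>")
    .option(Constants.PASSWORD, "<sql_user_password>")
    # With basic auth the staging storage is reached with an account key rather than AAD (assumed option)
    .option(Constants.STAGING_STORAGE_ACCOUNT_KEY, "<storage_account_key>")
    .option(Constants.TEMP_FOLDER, "abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<staging_folder>")
    .mode("overwrite")
    .synapsesql("sqlpool01.dbo.SampleTable"))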

 

 

 

Here is the sample Synapse Notebook with the code:

Fig 3 - Synapse Notebook with sample code

 

After creating the Synapse Notebook, click Publish all to save your modifications.

 

Fig 4 - Publish all

 

Step 2 - Granting Storage Permissions 

 

Go to the Storage account -> Access control (IAM) -> click +Add -> Add role assignment -> select the Storage Blob Data Contributor role -> Assign access to -> select your Synapse Workspace Managed Identity and also your logged-in account -> Review + assign.

 

Fig 5 - Storage account permissions

 

Step 3 - Granting SQL Permissions to the logged-in account

 

If you are trying to read from or write to the dedicated pool using a Synapse notebook and you are not the SQL Active Directory admin, the following SQL authorization is necessary on the Dedicated SQL Pool for the logged-in user account:

 

Read:

 

 

EXEC sp_addrolemember 'db_exporter', [<your_domain_user>@<your_domain_name>.com];
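For context, the db_exporter role is what the connector relies on when a notebook reads from the dedicated pool. A minimal read sketch with the Python connector, assuming AAD authentication and the placeholder server and table names used earlier, looks like this:

# Read from the Dedicated SQL Pool using AAD auth (placeholder server and table names)
import com.microsoft.spark.sqlanalytics
from com.microsoft.spark.sqlanalytics.Constants import Constants

df_read = (spark.read
    # Dedicated SQL endpoint of the target workspace
    .option(Constants.SERVER, "<sql-server-name>.sql.azuresynapse.net")
    # Three-part name of the table to read
    .synapsesql("sqlpool01.dbo.SampleTable"))

display(df_read)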

 

 

 

Write:

 

 

--Make sure your user has the permissions to CREATE tables in the [dbo] schema
GRANT CREATE TABLE TO [<your_domain_user>@<your_domain_name>.com];

--Make sure your user has ALTER permissions
GRANT ALTER ON SCHEMA::<target_database_schema_name> TO [<your_domain_user>@<your_domain_name>.com];

--Make sure your user has ADMINISTER DATABASE BULK OPERATIONS permissions
GRANT ADMINISTER DATABASE BULK OPERATIONS TO [<your_domain_user>@<your_domain_name>.com];

--Make sure your user has SELECT permissions on the target table
GRANT SELECT ON <your_table> TO [<your_domain_user>@<your_domain_name>.com];

--Make sure your user has INSERT permissions on the target table
GRANT INSERT ON <your_table> TO [<your_domain_user>@<your_domain_name>.com];

--Make sure your user has DELETE permissions on the target table
GRANT DELETE ON <your_table> TO [<your_domain_user>@<your_domain_name>.com];

 

 

 

Now, let's run the Synapse Notebook to test whether the logged-in user account has the right permissions to perform all actions.

 

When the notebook was executed, the error below was thrown:

Fig 6 - Error when notebook try to access SQL Server

 

This error is pretty clear: the Synapse Pipeline IP is not allowed to access the server.

We need to allow Azure services and resources to access the SQL Server.

 

Step 4 - Allowing Azure services and resources on SQL Server

 

To allow this configuration: go to your SQL Server -> Security -> Networking -> Exceptions -> Allow Azure services and resources to access this server -> Save.

 

Fig 7 - Allowing Azure services and resources to access SQL Server

 

Let's re-run the code that loads data into Dedicated SQL Pool (formerly SQL DW).

 

Fig 8 - Error related to Storage permissions

 

Now we can see a different error thrown:

ERROR: 'Not able to validate external location because The remote server returned an error: (403) Forbidden.

 

This generic error means that we can now reach the SQL Server, but the System assigned managed identity that will be used to perform this operation may not be enabled.

 

Step 5 - Allowing System assigned managed identity on SQL Server

 

To allow this configuration: go to your SQL Server -> Security -> Identity -> turn Status to On -> Save.

 

Fig 9 - Allowing System assigned managed identity from SQL Server

 

Now when we run the write cell again, it will work because the logged-in user account has the Storage Blob Data Contributor role needed to perform all actions.

 

Fig 10 - Write on Dedicated SQL Pool succeeded

 

Step 6 - Adding a Synapse Notebook activity within a Synapse Pipeline to perform the data load into the Dedicated SQL Pool (formerly SQL DW).

 

To allow this configuration: go to your Synapse Notebook -> on the upper right side, click Add to pipeline -> New pipeline.

 

Fig 11 - Adding Synapse Notebook to Pipeline

 

Next screen:

1 - On properties, choose a Name for your pipeline.

2 - Publish your pipeline.

 

Fig 12 - Naming and publishing.

 

Now let's execute our pipeline to ensure that all the configuration is in place.

 

Go to Integrate -> Select the pipeline that was created on the last step -> Click on +Add trigger -> Trigger now -> Ok.

 

Fig 13 - Triggering Pipeline

 

Go to Monitor -> Integration -> Pipeline runs.

 

As we can see in the screenshot below, we now have an error when executing from the Synapse Pipeline.

 

Fig 14 - Monitoring Pipeline

 

Click on Error to see the complete error:

: com.microsoft.spark.sqlanalytics.SQLAnalyticsConnectorException: Internal error - Attempt to run SQL Transaction failed with {ErrorNumber - 105215, ErrorMessage - Not able to validate external location because The remote server returned an error: (403) Forbidden.}

 

This error is saying that the System assigned managed identity doesn't have permission to perform this operation.

It means that the System assigned managed identity should be assigned the Storage Blob Data Contributor role on the Azure Storage account.

 

Usually, the name of the System assigned managed identity is the same as the SQL Server name, as shown in the screenshot below.

 

Fig 15 - Checking system assigned managed identity name

 

Go back to Step 2 and grant the Storage permissions to the System assigned managed identity.

 

Fig 16 - Checking system assigned managed identity Storage permissions

 

Let's trigger the pipeline again.

Now it works. 

 

Fig 17 - Pipeline succeeded.

 

Step 7 - Checking the data loaded into Dedicated SQL Pool (formerly SQL DW).

 

Fig 18 - Checking data on Dedicated SQL Pool table.
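Beyond checking the table in Synapse Studio, you can also validate the load from the notebook itself. Here is a minimal sketch that reads the target table back through the connector and filters for the sample row inserted earlier; it assumes the same AAD authentication and placeholder server and table names used above:

# Read the target table back through the connector and confirm the sample row landed
from pyspark.sql import functions as F
import com.microsoft.spark.sqlanalytics
from com.microsoft.spark.sqlanalytics.Constants import Constants

df_check = (spark.read
    .option(Constants.SERVER, "<sql-server-name>.sql.azuresynapse.net")
    .synapsesql("sqlpool01.dbo.SampleTable"))

# Expect one row matching the values written by the sample notebook (1, 'Your Name', 18)
print(df_check.filter(F.col("id") == 1).count())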

 

Solution Overview

 

The new Azure Synapse Dedicated SQL Pool Connector for Apache Spark is designed to efficiently transfer data between Spark pools and dedicated SQL Pools in Azure Synapse Analytics.

The solution allows Synapse notebooks to read data from and write data to a Dedicated SQL Pool.

The Azure Synapse Apache Spark to SQL connector works with dedicated SQL pools only.

The Synapse Workspace Managed Identity allows you to perform more secure operations on the storage and dedicated pool and no longer requires user details or storage keys to be shared.

The System assigned managed identity needs to be enabled on the logical server.

The Storage account security is streamlined, and we now grant RBAC permissions to the Managed Service Identity for the Logical Server and Synapse Workspace. 

To write data to internal tables, the connector now uses the COPY statement instead of the CETAS/CTAS approach.

 

Conclusion

 

When using the Azure Synapse Dedicated SQL Pool Connector for Apache Spark, users can read and write large volumes of data efficiently between Apache Spark and a Dedicated SQL Pool in Synapse Analytics. The connector supports the Scala and Python languages on Synapse Notebooks to perform these operations.

 

If you are unable to complete the steps in the article, please do reach out to us by logging a Support Request.

 

SIDNEY CIRQUEIRA

Synapse Analytics Support Engineer

 
