Getting Started with Azure Cosmos DB (A Deep Dive)



 

What is Azure Cosmos DB?

 

It's a fully managed, distributed NoSQL and relational database service.

 

Topics to be covered  

 

  1. NoSQL vs. Relational Databases 
  2. What is Azure Cosmos DB 
  3. Azure Cosmos DB Architecture 
    • Azure Cosmos DB Components 
    • Multi-model (APIs) 
    • Partitions 
    • Request Units 
  4. Azure Cosmos DB Access Methods 
  5. Scenario: Event Database 
  6. Creating Resources Using Azure Cosmos DB for NoSQL 

 

What is a NoSQL Database? 

 

NoSQL stands for "Not only SQL." It's a highly scalable storage mechanism for structured, semi-structured, and unstructured data.
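For example, two items in the same collection can take completely different shapes (an illustrative sketch, not data from the demo later in this post):

{ "id": "1", "name": "Alice", "email": "alice@example.com" }
{ "id": "2", "name": "Bob", "phones": ["555-0100"], "address": { "city": "Nairobi" } }

Neither item had to declare its fields up front, and new fields can appear at any time.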

 

 

 

[Image: NoSQL data example]

 

 

What is a Relational Database?  

 

A relational database is a way of storing and organizing data that emphasizes precision and interconnection: data lives in structured tables with predefined schemas.
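By contrast with the NoSQL sketch above, the shape of a relational table is fixed up front. A minimal sketch (the table and column names here are hypothetical):

CREATE TABLE Participants (
    ParticipantId INT PRIMARY KEY,    -- uniquely identifies each row
    Name          VARCHAR(100) NOT NULL,
    Score         INT
);

Every row must supply values that fit these columns and types; adding a new attribute means altering the table.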

 

Structural Difference

 

 

 

[Image: structural difference, part 1]

 

 

[Image: structural difference, part 2]

 

 

 

 

Relational vs. NoSQL 

 

 

 

[Image: Relational vs. NoSQL comparison]

 

 

What Is Azure Cosmos DB? 

 

Simply put, it's Microsoft's premier NoSQL database service.

Key Benefits 

 

  1. Fully managed service – focus on your app, and let Microsoft handle the rest. 
  2. No schema – NoSQL, no schema, no problem. 
  3. No index management – all data is automatically indexed. 
  4. Multi-model – it covers a variety of database models by providing several APIs to interact with: 
    1. Cosmos DB for NoSQL API – the default API, which supports querying items in a SQL style. It also supports ACID transactions, stored procedures, and triggers. 
    2. Table API – stores simple key-value data. It's geared towards users of Azure Table Storage who want a premium version of that API. 
    3. Apache Gremlin API – for working with graph databases. 
    4. Apache Cassandra API – a wide-column store, well known for distributing petabytes of data with high reliability and performance. 
    5. MongoDB API – a document database compatible with the MongoDB wire protocol. 
  5. Global distribution – Azure spans more than 60 regions and over 140 countries, and Azure Cosmos DB is available in all of them, which is not the case for every other Azure service. 
  6. Guaranteed performance and availability – Azure Cosmos DB provides a 99.99% Service Level Agreement (SLA) for throughput, consistency, availability, and latency. 
  7. Elastically scalable – you can achieve this in two ways: 
    1. Provisioned – you specify what your service will scale up to. 
    2. Autoscale – the service scales automatically according to the workload. 

Azure Cosmos DB Architecture 

 

What Are the Azure Cosmos DB Components? 

 

  1. Database Account – the top-level resource that determines the public name and API. 
  2. Database – a namespace for your containers; it's also where you manage users and permissions. 
  3. Container – a collection of items (similar to a table). Your API choice determines the form a container takes, i.e., a table, collection, or graph. 
  4. Item – the atomic unit of data within a container, i.e., a document, row, node, or edge (see the sketch after this list). 
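Putting the four components together (the account name here is hypothetical; the database, container, and partition key are the ones used in the demo later in this post):

Database Account  →  https://myaccount.documents.azure.com   (public name and API fixed here)
  Database        →  Events
    Container     →  Events2024   (partition key: /eventId)
      Item        →  { "eventId": "event_1", "eventName": "Coding Competition", ... }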

 

 

[Image: Azure Cosmos DB components hierarchy]

 

 

How is multi-model possible? 

 

The database engine of Azure Cosmos DB is capable of efficiently translating and projecting various data models onto the atom-record-sequence (ARS) based data model. 

By utilizing the ARS abstraction layer, Cosmos DB can offer several popular data models and translate them back to ARS, all under the same database engine, efficiently and at global scale. 

Available APIs 

 

 

 

 

[Image: available APIs]

 

 

Partitions 

These are the chunks in which your data is stored, and the fundamental units of scalability and distribution. 

  • Logical – your data is divided into logical partitions based on a partition key of your choice. 

 

[Image: logical partitioning by partition key]

 

 

  • Physical – the physical storage of your data, with one or more logical partitions mapped to each physical partition. Azure maps logical partitions to physical partitions for you. As you increase provisioned throughput, Azure automatically creates new physical partitions and remaps the logical ones as needed to satisfy those requests. 

 

[Image: logical partitions mapped to physical partitions]

 

 

Partitions: Tips to keep in mind 

 

  1. The partition key affects database performance and the ability to scale. 
  2. Avoid hot partitions (partitions that receive a disproportionate share of requests or data) by choosing keys with high cardinality whose values stay well distributed over time. In the phone example above, the serial number is unique, so it produces an evenly distributed partitioning; the model is a poor key because it would concentrate all items of the same model in one partition instead of spreading them evenly (see the sketch after this list). 
  3. Hot partitions result in rate limiting, inefficient use of the throughput you've provisioned, and potentially higher costs. 
  4. Microsoft transparently handles the physical distribution; your job is to choose a partition key that suits your application and data, along with the throughput and storage that go with it. 
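Here is a minimal sketch of that phone example (the items are hypothetical):

{ "id": "1", "serialNumber": "SN-83619", "model": "Phone X" }
{ "id": "2", "serialNumber": "SN-90417", "model": "Phone X" }
{ "id": "3", "serialNumber": "SN-11208", "model": "Phone Y" }

With /serialNumber as the partition key, each item lands in its own logical partition and load spreads evenly. With /model, every "Phone X" item piles into the same logical partition, a classic hot partition.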

Request Units  

 

Request Units (RUs) normalize database operation costs into a uniform currency for Azure Cosmos DB throughput. A query requires more RUs than other operations because it consumes more system resources. 
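As a rough worked example: Microsoft's documented baseline is that a point read of a 1 KB item costs 1 RU, and the demo query at the end of this post consumes 2.9 RUs. So a container provisioned at 400 RU/s could sustain about 400 such point reads per second, but only about 400 / 2.9 ≈ 138 of those queries per second.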

 

[Image: relative request unit costs of database operations]

 

 

Flavors for Provisioning RUs 

 

  1. Provisioned – in this case, you know what you want and just provision it. You get dependable billing because you know how many RUs you're going to be billed for. The main drawback is hitting rate limits if demand exceeds what you provisioned. 
  2. Autoscale – you set an upper bound, and the system scales the RUs up and down as necessary when your workload peaks. 
  3. Serverless – pay only for what you consume. This option frees you from picking scaling parameters (as with autoscale) or being locked into a fixed amount (as with provisioned). 

Planning Your Request Units 

 

  • Two granularities – you can provision throughput at the database level, the container level, or both. 
    1. Database level – the throughput you choose is shared among all containers under that database. 
    2. Container level – a specific throughput dedicated to a single container. 
  • Hourly billing – whichever method you use, you're billed for the highest RU/s reached within each hour. 
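A quick worked example with hypothetical numbers: if your autoscale maximum is 4,000 RU/s and traffic spikes to 3,200 RU/s for just one minute of an hour while idling far lower the rest of it, that hour is billed at 3,200 RU/s, because the highest RU/s reached in the hour sets the charge.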

Azure Cosmos DB Access Methods 

 

  1. Data Explorer – a graphical data utility built straight into the Azure portal 
  2. SDKs – use your favorite language to consume Azure Cosmos DB 
  3. REST API – manage data using HTTPS requests (see the sketch below) 
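To give a feel for the REST surface (a hedged sketch: the account name is hypothetical, the database and container names reuse the demo's, and the header values are elided), reading a single document is an HTTPS GET against the account endpoint with a signed authorization header:

GET https://myaccount.documents.azure.com/dbs/Events/colls/Events2024/docs/{item-id}
x-ms-date: <RFC 1123 timestamp>
x-ms-version: <REST API version>
x-ms-documentdb-partitionkey: ["event_1"]
authorization: <HMAC-SHA256 signature over the verb, resource, and date>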

Creating Resources in the Azure Portal 

 

 

 

[Screenshot: Azure portal]

 

 

 

 

Let's Create an Account 

 

  • Search for Azure Cosmos DB 

 

[Screenshot: searching for Azure Cosmos DB in the portal]

 

 

  • Create Azure Cosmos DB Account 

 

[Screenshot: Create Azure Cosmos DB account]

 

 

  • Choose the API according to your use case. I'll go with the NoSQL option for this demo. 

 

[Screenshot: choosing an API]

 

 

  • On the Create Azure Cosmos DB Account page: 
  • Choose your subscription. 
  • Choose or create a resource group. 
  • Create the account name (make it unique). 
  • Choose an availability zone if you want to improve your app's availability and resilience. 
  • Choose the location of your DB according to the available data centers. 
  • Capacity mode enables you to define the throughput; the Provisioned option also comes with a free tier. 

 

 

[Screenshot: Basics tab of the account creation page]

 

 

  • Selecting Geo-Redundancy makes your database available in the paired region, e.g., East US with West US, or South Africa North with South Africa West. For this demo, 'South Africa West' is not included in my subscription. 
  • The multi-region writes capability lets you take advantage of provisioned throughput for your databases and containers across the globe. 

 

 

[Screenshot: geo-redundancy and multi-region writes settings]

 

 

  • Under Networking, you can connect your Azure Cosmos DB account either publicly, via public IP addresses or service endpoints, or privately, using a private endpoint. Choose according to your use case. 
  • Connection Security Settings – I will go with TLS 1.2 

 

 

[Screenshot: networking settings]

 

 

  • The backup policy defines how your backups occur. 
    1. Periodic lets you define the backup interval (in minutes or hours), the backup retention (how long backups are kept, in hours or days), and the backup storage redundancy (geo, zone, or local). 
    2. Continuous (7 days) – provides a backup window of 7 days (168 hours); you can restore to any point in time within the window. This mode is available for free. 
    3. Continuous (30 days) – provides a backup window of 30 days (720 hours); you can restore to any point in time within the window. This mode has a cost impact. 

 

[Screenshot: backup policy settings]

 

 

  • Data Encryption – I will let Microsoft encrypt my account using service-managed keys. Feel free to use a customer-managed key if you have one. 

[Screenshot: data encryption settings]

 

 

  • I don't need to create a tag for now, so just review and create. 

 

Let's Create an Event Database Using the Scenario Below 

 

For our scenario, we need to store data from sports events (e.g., marathon, triathlon, cycling, etc.). Users should be able to select an event and view a leaderboard. The amount of data that will be stored is estimated at 100 GB. 

The schema of the data differs between events and is likely to change. This requires the database to be schema-agnostic, and therefore we decided to use Azure Cosmos DB as our database. 

 

Identify access patterns 

 

To design an efficient data model it is important to understand how the client application will interact with Azure Cosmos DB. The most important questions are: 

  • Is the access pattern more read-heavy or write-heavy? 
  • What are the main queries? 
  • What is the expected document size? 

If the access pattern is read-heavy you want to choose a partition key that appears frequently as a filter in your queries. Queries can be efficiently routed to only the relevant physical partitions by including the partition key in the filter predicate. 
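For instance, with /eventId as the partition key (as in the demo below), a filter like this lets Cosmos DB route the query to a single partition instead of fanning out to all of them:

SELECT * FROM c WHERE c.eventId = "event_1"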

When the access pattern is write-heavy you might want to choose item ID as the partition key. Item ID does a great job with evenly balancing partitioned throughput (RUs) and data storage since it’s a unique value. For more information, see Partitioning and horizontal scaling in Azure Cosmos DB | Microsoft Docs 

Finally, we need to understand the document size. 1 KB documents are very efficient in Azure Cosmos DB. To understand the impact of large documents on RU utilization, see the capacity calculator and change the item size to a larger value. As a starting point, use only one container and embed all values of an entity in a single JSON document; this provides the best read performance. However, if your document size is unpredictable and can grow to hundreds of kilobytes, you might want to split the data into different documents within the same container. For more information, see Modeling data in Azure Cosmos DB – Azure Cosmos DB | Microsoft Docs. 

 

Sample document structure 

 

 

 

 

 

{ "eventId": "unique_event_id", "eventName": "Marathon", "eventDate": "2024-05-20", "participants": [ { "participantId": "participant1", "name": "Alice", "score": 1200 }, { "participantId": "participant2", "name": "Bob", "score": 1100 } // ... more participants ] }

 

 

 

 

 

The eventId serves as the unique identifier for each event. 

Create Container 

 

  • Create a new container. 
  • Give it a unique database ID. 
  • Select autoscale for automatic throughput, or select manual, which can be useful for a single container with predictable throughput. The advantage of autoscale is that scaling causes no downtime. For more information, see How to choose between manual and autoscale on Azure Cosmos DB. 
  • The partition key is specified at the container level; in our case it is /eventId. 

 

 

 

[Screenshot: new container settings]

 

 

 

Add a document 

 

  • Click Data Explorer 
  • Click on Events 
  • Expand Events2024, then Items 
  • Click New Item 
  • Let's replace the default JSON object with our data 

 

[Screenshot: new item in Data Explorer]

 

 

 

Save a single document 

 

  • Add the document 
  • Save 

 

 

[Screenshot: document saved]

 

 

 

Save Many Documents 

 

Let's say you have your data saved in a JSON file like the one below. Follow these steps to insert that data. 

 

 

 

 

 

[ { "eventId": "event_1", "eventName": "Coding Competition", "eventDate": "2024-05-21", "participants": [ { "participantId": "p1", "name": "John", "score": 980 }, { "participantId": "p2", "name": "Jane", "score": 890 }, { "participantId": "p3", "name": "Mike", "score": 1020 } ] }, { "eventId": "event_2", "eventName": "CodeFest", "eventDate": "2024-06-15", "participants": [ { "participantId": "p4", "name": "Lily", "score": 950 }, { "participantId": "p5", "name": "Alex", "score": 1120 } ] }, { "eventId": "event_3", "eventName": "Hackathon Challenge", "eventDate": "2024-07-10", "participants": [ { "participantId": "p6", "name": "Sarah", "score": 1180 }, { "participantId": "p7", "name": "Kevin", "score": 1035 } ] }, { "eventId": "event_4", "eventName": "Byte Battle", "eventDate": "2024-08-05", "participants": [ { "participantId": "p8", "name": "Olivia", "score": 1005 }, { "participantId": "p9", "name": "Ethan", "score": 1150 } ] }, { "eventId": "event_5", "eventName": "Code Warriors Championship", "eventDate": "2024-09-20", "participants": [ { "participantId": "p10", "name": "Ava", "score": 1085 }, { "participantId": "p11", "name": "Noah", "score": 1070 } ] } ]

 

 

 

 

 

 

[Screenshot: Items view in Data Explorer]

 

 

  • Click Upload Item 
  • Locate the file you want to upload in the file explorer, then click Upload. 

 

 

[Screenshot: Upload Item dialog]

 

 

 

  • A successful upload will show you the number of records uploaded. 

Let’s Query our Database 

 

  • Click on New SQL Query 
  • Write your SQL query 
  • Run the query 
  • View the results – as you can see, our object has some system metadata appended to it (properties such as _rid, _etag, and _ts) 
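If you're following along, the simplest query, which is also the Data Explorer default, returns every document in the container:

SELECT * FROM c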

 

 

[Screenshot: query results showing system metadata]

 

 

 

 

More Queries 

 

  • Query 1: View Top Ranked Participants for a Selected Event: 

 

[Screenshot: Query 1]
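The screenshot carries the exact query; a sketch along these lines works against the sample documents above (sorting by score can then happen client-side, since the NoSQL API restricts ORDER BY over values produced by a JOIN):

SELECT p.name, p.score
FROM c
JOIN p IN c.participants
WHERE c.eventId = "event_1"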

 

 

  • Query 2: View All Events for a Selected Year a Person Has Participated In: 

 

[Screenshot: Query 2]
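A plausible version of this query, assuming a person and year taken from the sample data ("Kevin", 2024):

SELECT c.eventName, c.eventDate
FROM c
JOIN p IN c.participants
WHERE p.name = "Kevin" AND STARTSWITH(c.eventDate, "2024")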

 

 

  • Query 3: View All Registered Participants per Event: 

 

[Screenshot: Query 3]
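A sketch of this one, returning each event with its full participant list (swap in ARRAY_LENGTH(c.participants) if you only need a count per event):

SELECT c.eventName, c.participants
FROM c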

 

 

  • Query 4: View Total Score for a Single Participant per Event: 

 

[Screenshot: Query 4]
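A sketch of this one, pulling a single participant's score for each event they appear in (the participant ID is taken from the sample data):

SELECT c.eventName, p.name, p.score
FROM c
JOIN p IN c.participants
WHERE p.participantId = "p10"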

 

 

  • You can also check the cost of the operation – this query consumed 2.9 RUs. 

 

 

[Screenshot: query stats showing the request charge]

 

 
