Getting Started with Azure Cosmos DB (A Deep Dive)



 

What is Azure Cosmos DB?

 

It's a fully managed, distributed NoSQL and relational database service.

 

Topics to be covered  

 

  1. NoSQL vs. Relational Databases 
  2. What is Azure Cosmos DB 
  3. Azure Cosmos DB Architecture 
    • Azure Cosmos DB Components 
    • Multi-model (APIs) 
    • Partitions 
    • Request Units 
  4. Azure Cosmos DB Access Methods 
  5. Scenario: Event Database 
  6. Creating Resources Using Azure Cosmos DB for NoSQL 

 

What is a NoSQL Database? 

 

NoSQL stands for "Not only SQL." It's a highly scalable storage mechanism for structured, semi-structured, and unstructured data.
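For example, two items in the same collection can take completely different shapes (an illustrative sketch, not data from the demo later in this post):

{ "id": "1", "name": "Alice", "email": "alice@example.com" }
{ "id": "2", "name": "Bob", "phones": ["555-0100"], "address": { "city": "Nairobi" } }

Neither item had to declare its fields up front, and new fields can appear at any time.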

 

 

 

[Image: NoSQL data example]

 

 

What is a Relational Database?  

 

A relational database is a way of storing and organizing data that emphasizes precision and interconnection: data lives in structured tables with predefined schemas.
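By contrast with the NoSQL sketch above, the shape of a relational table is fixed up front. A minimal sketch (the table and column names here are hypothetical):

CREATE TABLE Participants (
    ParticipantId INT PRIMARY KEY,    -- uniquely identifies each row
    Name          VARCHAR(100) NOT NULL,
    Score         INT
);

Every row must supply values that fit these columns and types; adding a new attribute means altering the table.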

 

Structural Difference

 

 

 

[Image: structural difference, part 1]

 

 

[Image: structural difference, part 2]

 

 

 

 

Relational vs. NoSQL 

 

 

 

[Image: Relational vs. NoSQL comparison]

 

 

What Is Azure Cosmos DB? 

 

Simply put, it's Microsoft's premier NoSQL database service.

Key Benefits 

 

  1. Fully managed service – focus on your app, and let Microsoft handle the rest. 
  2. No schema – NoSQL, no schema, no problem. 
  3. No index management – all data is automatically indexed. 
  4. Multi-model – it covers a variety of database models by providing several APIs to interact with: 
    1. Cosmos DB for NoSQL API – the default API, which supports querying items in a SQL style. It also supports ACID transactions, stored procedures, and triggers. 
    2. Table API – stores simple key-value data. It's geared towards users of Azure Table Storage who want a premium version of that API. 
    3. Apache Gremlin API – for working with graph databases. 
    4. Apache Cassandra API – a wide-column store, well known for distributing petabytes of data with high reliability and performance. 
    5. MongoDB API – a document database compatible with the MongoDB wire protocol. 
  5. Global distribution – Azure spans more than 60 regions and over 140 countries, and Azure Cosmos DB is available in all of them, which is not the case for every other Azure service. 
  6. Guaranteed performance and availability – Azure Cosmos DB provides a 99.99% Service Level Agreement (SLA) for throughput, consistency, availability, and latency. 
  7. Elastically scalable – you can achieve this in two ways: 
    1. Provisioned – you specify what your service will scale up to. 
    2. Autoscale – the service scales automatically according to the workload. 

Azure Cosmos DB Architecture 

 

What Are the Azure Cosmos DB Components? 

 

  1. Database Account – the top-level resource that determines the public name and API. 
  2. Database – a namespace for your containers; it's also where you manage users and permissions. 
  3. Container – a collection of items (similar to a table). Your API choice determines the form a container takes, i.e., a table, collection, or graph. 
  4. Item – the atomic unit of data within a container, i.e., a document, row, node, or edge (see the sketch after this list). 
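Putting the four components together (the account name here is hypothetical; the database, container, and partition key are the ones used in the demo later in this post):

Database Account  →  https://myaccount.documents.azure.com   (public name and API fixed here)
  Database        →  Events
    Container     →  Events2024   (partition key: /eventId)
      Item        →  { "eventId": "event_1", "eventName": "Coding Competition", ... }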

 

 

[Image: Azure Cosmos DB components hierarchy]

 

 

How is multi-model possible? 

 

The database engine of Azure Cosmos DB is capable of efficiently translating and projecting various data models onto the atom-record-sequence (ARS) based data model. 

By utilizing the ARS abstraction layer, Cosmos DB can offer several popular data models and translate them back to ARS, all under the same database engine, efficiently and at global scale. 

Available APIs 

 

 

 

 

[Image: available APIs]

 

 

Partitions 

These are the chunks in which your data is stored, and the fundamental units of scalability and distribution. 

  • Logical – your data is divided into logical partitions based on a partition key of your choice. 

 

[Image: logical partitioning by partition key]

 

 

  • Physical – the physical storage of your data, with one or more logical partitions mapped to each physical partition. Azure maps logical partitions to physical partitions for you. As you increase provisioned throughput, Azure automatically creates new physical partitions and remaps the logical ones as needed to satisfy those requests. 

 

[Image: logical partitions mapped to physical partitions]

 

 

Partitions: Tips to keep in mind 

 

  1. The partition key affects database performance and the ability to scale. 
  2. Avoid hot partitions (partitions that receive a disproportionate share of requests or data) by choosing keys with high cardinality whose values stay well distributed over time. In the phone example above, the serial number is unique, so it produces an evenly distributed partitioning; the model is a poor key because it would concentrate all items of the same model in one partition instead of spreading them evenly (see the sketch after this list). 
  3. Hot partitions result in rate limiting, inefficient use of the throughput you've provisioned, and potentially higher costs. 
  4. Microsoft transparently handles the physical distribution; your job is to choose a partition key that suits your application and data, along with the throughput and storage that go with it. 
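Here is a minimal sketch of that phone example (the items are hypothetical):

{ "id": "1", "serialNumber": "SN-83619", "model": "Phone X" }
{ "id": "2", "serialNumber": "SN-90417", "model": "Phone X" }
{ "id": "3", "serialNumber": "SN-11208", "model": "Phone Y" }

With /serialNumber as the partition key, each item lands in its own logical partition and load spreads evenly. With /model, every "Phone X" item piles into the same logical partition, a classic hot partition.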

Request Units  

 

Request Units (RUs) normalize database operation costs into a uniform currency for Azure Cosmos DB throughput. A query requires more RUs than other operations because it consumes more system resources. 
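As a rough worked example: Microsoft's documented baseline is that a point read of a 1 KB item costs 1 RU, and the demo query at the end of this post consumes 2.9 RUs. So a container provisioned at 400 RU/s could sustain about 400 such point reads per second, but only about 400 / 2.9 ≈ 138 of those queries per second.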

 

[Image: relative request unit costs of database operations]

 

 

Flavors for Provisioning RUs 

 

  1. Provisioned – in this case, you know what you want and just provision it. You get dependable billing because you know how many RUs you're going to be billed for. The main drawback is hitting rate limits if demand exceeds what you provisioned. 
  2. Autoscale – you set an upper bound, and the system scales the RUs up and down as necessary when your workload peaks. 
  3. Serverless – pay only for what you consume. This option frees you from picking scaling parameters (as with autoscale) or being locked into a fixed amount (as with provisioned). 

Planning Your Request Units 

 

  • Two granularities – you can provision throughput at the database level, the container level, or both. 
    1. Database level – the throughput you choose is shared among all containers under that database. 
    2. Container level – a specific throughput dedicated to a single container. 
  • Hourly billing – whichever method you use, you're billed for the highest RU/s reached within each hour. 
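A quick worked example with hypothetical numbers: if your autoscale maximum is 4,000 RU/s and traffic spikes to 3,200 RU/s for just one minute of an hour while idling far lower the rest of it, that hour is billed at 3,200 RU/s, because the highest RU/s reached in the hour sets the charge.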

Azure Cosmos DB Access Methods 

 

  1. Data Explorer – a graphical data utility built straight into the Azure portal 
  2. SDKs – use your favorite language to consume Azure Cosmos DB 
  3. REST API – manage data using HTTPS requests (see the sketch below) 
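To give a feel for the REST surface (a hedged sketch: the account name is hypothetical, the database and container names reuse the demo's, and the header values are elided), reading a single document is an HTTPS GET against the account endpoint with a signed authorization header:

GET https://myaccount.documents.azure.com/dbs/Events/colls/Events2024/docs/{item-id}
x-ms-date: <RFC 1123 timestamp>
x-ms-version: <REST API version>
x-ms-documentdb-partitionkey: ["event_1"]
authorization: <HMAC-SHA256 signature over the verb, resource, and date>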

Creating Resources in the Azure Portal 

 

 

 

[Screenshot: Azure portal]

 

 

 

 

Let's Create an Account 

 

  • Search for Azure Cosmos DB 

 

[Screenshot: searching for Azure Cosmos DB in the portal]

 

 

  • Create Azure Cosmos DB Account 

 

[Screenshot: Create Azure Cosmos DB account]

 

 

  • Choose the API according to your use case. I'll go with the NoSQL option for this demo. 

 

[Screenshot: choosing an API]

 

 

  • On the Create Azure Cosmos DB Account page: 
  • Choose your subscription. 
  • Choose or create a resource group. 
  • Create the account name (make it unique). 
  • Choose an availability zone if you want to improve your app's availability and resilience. 
  • Choose the location of your DB according to the available data centers. 
  • Capacity mode enables you to define the throughput; the Provisioned option also comes with a free tier. 

 

 

[Screenshot: Basics tab of the account creation page]

 

 

  • Selecting Geo-Redundancy makes your database available in the paired region, e.g., East US with West US, or South Africa North with South Africa West. For this demo, 'South Africa West' is not included in my subscription. 
  • The multi-region writes capability lets you take advantage of provisioned throughput for your databases and containers across the globe. 

 

 

[Screenshot: geo-redundancy and multi-region writes settings]

 

 

  • Under Networking, you can connect your Azure Cosmos DB account either publicly, via public IP addresses or service endpoints, or privately, using a private endpoint. Choose according to your use case. 
  • Connection Security Settings – I will go with TLS 1.2 

 

 

[Screenshot: networking settings]

 

 

  • The backup policy defines how your backups occur. 
    1. Periodic lets you define the backup interval (in minutes or hours), the backup retention (how long backups are kept, in hours or days), and the backup storage redundancy (geo, zone, or local). 
    2. Continuous (7 days) – provides a backup window of 7 days (168 hours); you can restore to any point in time within the window. This mode is available for free. 
    3. Continuous (30 days) – provides a backup window of 30 days (720 hours); you can restore to any point in time within the window. This mode has a cost impact. 

 

[Screenshot: backup policy settings]

 

 

  • Data Encryption – I will let Microsoft encrypt my account using service-managed keys. Feel free to use a customer-managed key if you have one. 

[Screenshot: data encryption settings]

 

 

  • I don't need to create a tag for now, so just review and create. 

 

Let's Create an Event Database Using the Scenario Below 

 

For our scenario, we need to store data from sports events (e.g., marathon, triathlon, cycling, etc.). Users should be able to select an event and view a leaderboard. The amount of data that will be stored is estimated at 100 GB. 

The schema of the data differs between events and is likely to change. This requires the database to be schema-agnostic, and therefore we decided to use Azure Cosmos DB as our database. 

 

Identify access patterns 

 

To design an efficient data model it is important to understand how the client application will interact with Azure Cosmos DB. The most important questions are: 

  • Is the access pattern more read-heavy or write-heavy? 
  • What are the main queries? 
  • What is the expected document size? 

If the access pattern is read-heavy you want to choose a partition key that appears frequently as a filter in your queries. Queries can be efficiently routed to only the relevant physical partitions by including the partition key in the filter predicate. 
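For instance, with /eventId as the partition key (as in the demo below), a filter like this lets Cosmos DB route the query to a single partition instead of fanning out to all of them:

SELECT * FROM c WHERE c.eventId = "event_1"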

When the access pattern is write-heavy you might want to choose item ID as the partition key. Item ID does a great job with evenly balancing partitioned throughput (RUs) and data storage since it’s a unique value. For more information, see Partitioning and horizontal scaling in Azure Cosmos DB | Microsoft Docs 

Finally, we need to understand the document size. 1 KB documents are very efficient in Azure Cosmos DB. To understand the impact of large documents on RU utilization, see the capacity calculator and change the item size to a larger value. As a starting point, use only one container and embed all values of an entity in a single JSON document; this provides the best read performance. However, if your document size is unpredictable and can grow to hundreds of kilobytes, you might want to split the data into different documents within the same container. For more information, see Modeling data in Azure Cosmos DB – Azure Cosmos DB | Microsoft Docs. 

 

Sample document structure 

 

 

 

 

 

{ "eventId": "unique_event_id", "eventName": "Marathon", "eventDate": "2024-05-20", "participants": [ { "participantId": "participant1", "name": "Alice", "score": 1200 }, { "participantId": "participant2", "name": "Bob", "score": 1100 } // ... more participants ] }

 

 

 

 

 

The eventId serves as the unique identifier for each event. 

Create Container 

 

  • Create a new container. 
  • Give it a unique database ID. 
  • Select autoscale for automatic throughput, or select manual, which can be useful for a single container with predictable throughput. The advantage of autoscale is that scaling causes no downtime. For more information, see How to choose between manual and autoscale on Azure Cosmos DB. 
  • The partition key is specified at the container level; in our case it is /eventId. 

 

 

 

[Screenshot: new container settings]

 

 

 

Add a document 

 

  • Click Data Explorer 
  • Click on Events 
  • Expand Events2024, then Items 
  • Click New Item 
  • Let's replace the default JSON object with our data 

 

[Screenshot: new item in Data Explorer]

 

 

 

Save a single document 

 

  • Add the document 
  • Save 

 

 

[Screenshot: document saved]

 

 

 

Save Many Documents 

 

Let's say you have your data saved in a JSON file like the one below. Follow these steps to insert that data. 

 

 

 

 

 

[ { "eventId": "event_1", "eventName": "Coding Competition", "eventDate": "2024-05-21", "participants": [ { "participantId": "p1", "name": "John", "score": 980 }, { "participantId": "p2", "name": "Jane", "score": 890 }, { "participantId": "p3", "name": "Mike", "score": 1020 } ] }, { "eventId": "event_2", "eventName": "CodeFest", "eventDate": "2024-06-15", "participants": [ { "participantId": "p4", "name": "Lily", "score": 950 }, { "participantId": "p5", "name": "Alex", "score": 1120 } ] }, { "eventId": "event_3", "eventName": "Hackathon Challenge", "eventDate": "2024-07-10", "participants": [ { "participantId": "p6", "name": "Sarah", "score": 1180 }, { "participantId": "p7", "name": "Kevin", "score": 1035 } ] }, { "eventId": "event_4", "eventName": "Byte Battle", "eventDate": "2024-08-05", "participants": [ { "participantId": "p8", "name": "Olivia", "score": 1005 }, { "participantId": "p9", "name": "Ethan", "score": 1150 } ] }, { "eventId": "event_5", "eventName": "Code Warriors Championship", "eventDate": "2024-09-20", "participants": [ { "participantId": "p10", "name": "Ava", "score": 1085 }, { "participantId": "p11", "name": "Noah", "score": 1070 } ] } ]

 

 

 

 

 

 

[Screenshot: Items view in Data Explorer]

 

 

  • Click Upload Item 
  • Locate the file you want to upload in the file explorer, then click Upload. 

 

 

[Screenshot: Upload Item dialog]

 

 

 

  • A successful upload will show you the number of records uploaded. 

Let’s Query our Database 

 

  • Click on New SQL Query 
  • Write your SQL query 
  • Run the query 
  • View the results – as you can see, our object has some system metadata appended to it (properties such as _rid, _etag, and _ts) 
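If you're following along, the simplest query, which is also the Data Explorer default, returns every document in the container:

SELECT * FROM c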

 

 

[Screenshot: query results showing system metadata]

 

 

 

 

More Queries 

 

  • Query 1: View Top Ranked Participants for a Selected Event: 

 

[Screenshot: Query 1]
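The screenshot carries the exact query; a sketch along these lines works against the sample documents above (sorting by score can then happen client-side, since the NoSQL API restricts ORDER BY over values produced by a JOIN):

SELECT p.name, p.score
FROM c
JOIN p IN c.participants
WHERE c.eventId = "event_1"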

 

 

  • Query 2: View All Events for a Selected Year a Person Has Participated In: 

 

[Screenshot: Query 2]
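A plausible version of this query, assuming a person and year taken from the sample data ("Kevin", 2024):

SELECT c.eventName, c.eventDate
FROM c
JOIN p IN c.participants
WHERE p.name = "Kevin" AND STARTSWITH(c.eventDate, "2024")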

 

 

  • Query 3: View All Registered Participants per Event: 

 

[Screenshot: Query 3]
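A sketch of this one, returning each event with its full participant list (swap in ARRAY_LENGTH(c.participants) if you only need a count per event):

SELECT c.eventName, c.participants
FROM c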

 

 

  • Query 4: View Total Score for a Single Participant per Event: 

 

[Screenshot: Query 4]
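A sketch of this one, pulling a single participant's score for each event they appear in (the participant ID is taken from the sample data):

SELECT c.eventName, p.name, p.score
FROM c
JOIN p IN c.participants
WHERE p.participantId = "p10"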

 

 

  • You can also check the cost of the operation – this query consumed 2.9 RUs. 

 

 

[Screenshot: query stats showing the request charge]

 

 
