Decoding the Dynamics: Dapr vs. Service Meshes

This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs - .

Dapr and Service Meshes are increasingly common in cloud-native architectures. However, I noticed that there is still some confusion about their purpose, mostly because some of their features overlap. People sometimes wonder how to choose between Dapr and a Service Mesh, or even whether both should be enabled at the same time.

The purpose of this post is to highlight the differences, especially in the way they handle mTLS, as well as the impact on the application code itself. You can already find a summary of how Dapr and Service Meshes differ on the Dapr web site, but the explanations are not deep enough to really understand the differences. This blog post is an attempt to dive deeper and give you a real sense of what is going on behind the scenes. Let me first start with what Dapr and Service Meshes have in common.

Things that Dapr and Service Meshes have in common

Secure service-to-service communication with mTLS encryption
Service-to-service metric collection
Service-to-service distributed tracing
Resiliency through retries

Yes, this is the exact same list as the one documented on the Dapr web site! However, I will later focus on the mTLS bits because, although you might think that these are equivalent, overlapping features, the way Dapr and Service Meshes enforce mTLS is not the same. I'll show some concrete examples with Dapr and the Linkerd Service Mesh to illustrate the use cases.

On top of the above list, I'd add:

They both leverage the sidecar pattern. The Istio Service Mesh is exploring the Ambient Mesh, which is sidecar-free, but the sidecar approach is still mainstream today. Here again, the role of the sidecars and what happens during the injection is completely different between Dapr and Service Meshes.
They both allow you to define fine-grained authorization policies
They both help deal with distributed architectures

Before diving into the meat of it, let us see how they totally differ.

Differences between Dapr and Service Meshes

Applications are Mesh-agnostic, while they must explicitly be Dapr-aware to leverage the Dapr capabilities. Dapr infuses the application code. Being Dapr-aware does not mean that you must use a specific SDK. Any programming language that has an HTTP and/or gRPC client can benefit from the Dapr features. However, the application must comply with some Dapr prerequisites, as it must expose an API to initialize Dapr's app channel.
Meshes can deal with both layer-4 (TCP) and layer-7 traffic, while Dapr is focused on layer-7 only protocols such as HTTP, gRPC, AMQP, etc.
Meshes serve infrastructure purposes while Dapr serves application purposes
Meshes typically have smart load balancing algorithms
Meshes typically let you define dynamic routes across multiple versions of a given web site/API
Some meshes ship with extra OAuth validation features
Some meshes let you stress your applications through Chaos Engineering techniques, by injecting faults, artificial latency, etc.
Meshes typically incur a steep learning curve while Dapr is much smoother to learn. In fact, Dapr even eases the development of distributed architectures.
Dapr provides true service discovery; meshes do not
Dapr is designed from the ground up to deal with distributed and microservice architectures, while meshes can help with any architecture style, but prove to be a good ally for microservices.

Demo material

I will reuse one demo app that I developed 4 years ago (time flies), which is a Linkerd Calculator. The below figure illustrates it:

Some services talking together. MathFanBoy, a console app randomly talking to the arithmetic operations, while the percentage operation also calls multiplication and division. The goal of this app was to generate traffic and show how Linkerd helped us see in near real time what's going on. I also purposely introduced exceptions by performing divisions by zero...to also demo how Linkerd (or any other mesh) helps spot errors. Feel free to clone the repo and try it out on your end if you want to test what is later described in this post. I have now created the exact same app, using Dapr, which is made available here. Let us now dive into the technical details.

Diving into the technical differences

Invisible to the application code vs code awareness

As stated earlier, an application is agnostic to the fact that it is injected or not by a Service Mesh. If you look at the application code of the Linkerd Calculator, you won't find anything related to Linkerd. The magic happens at deployment time where we annotate our K8s deployment to make sure the application gets injected by the Mesh. On the other hand, the application code of the Dapr calculator is directly impacted in multiple ways:

- While I could use a mere .NET Console App for the Linkerd Calculator, I had to turn MathFanBoy into a web host to comply with the Dapr app initialization channel. However, because MathFanBoy generates activity by calling random operations, I could not just turn it into an API, so I had to run different tasks in parallel. Here are the most important bits:

class Program
    {
        static string[] endpoints = null;
        static string[] apis = new string[5] { "addition", "division", "multiplication", "substraction", "percentage" };
        static string[] operations = new string[5] { "addition/add", "division/divide", "multiplication/multiply", "substraction/substract", "percentage/percentage" };
        
        static async Task Main(string[] args)
        {
            var host = CreateHostBuilder(args).Build();

            var runHostTask = host.RunAsync();

            var loopTask = Task.Run(async () =>
            {
                while (true)
                {
                    var pos = new Random().Next(0, 5);
                    using var client = new DaprClientBuilder().Build();
                    var operation = new Operation { op1 = 10, op2 = 2 };
                    try
                    {
                        var response = await client.InvokeMethodAsync<object, object>(
                            apis[pos],        // The app ID of the target Dapr application
                            operations[pos],  // The method to invoke
                            operation);       // The request payload
                        
                        Console.WriteLine(response);
                    }
                    catch(Exception ex) { 
                        Console.WriteLine(ex.ToString());
                    }
                    
                    await Task.Delay(5000);
                }
            });

            await Task.WhenAll(runHostTask, loopTask);

        }

        public static IHostBuilder CreateHostBuilder(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .ConfigureWebHostDefaults(webBuilder =>
            {
                webBuilder.UseStartup<Startup>();
            });
    }

Lines 9 to 11 create and start the web host. Between lines 13 and 35, I generate random calls to the operations. Here we have another difference: the application uses the Dapr client's InvokeMethodAsync to perform the calls. As you might have noticed, the application does not need to know the URLs of these services. Dapr discovers where the services are located, thanks to its Service Discovery feature. The only things we need to provide are the App ID and the operation that we want to call. With the Linkerd calculator, I had to know the endpoints of the target services, so they were injected through environment variables during the deployment. The same principles apply to the percentage operation, which is a true API. I had to inject the Dapr client through Dependency Injection:

public void ConfigureServices(IServiceCollection services)
        {
            services.AddControllers().AddDapr();
        }

in order to get an instance through the controller's constructor:

public PercentageController(ILogger<PercentageController> logger, DaprClient dapr)
        {
            _logger = logger;
            _dapr = dapr;
        }

and use that instance to call the division and multiplication operations from within another controller operation, again using InvokeMethodAsync as in MathFanBoy. As you can see, the application code explicitly uses Dapr and must comply with some Dapr requirements. Dapr has many features other than Service Discovery, but I'll stick to that one, since the point is made: a Dapr-injected application must be Dapr-aware, while it is completely agnostic of a Service Mesh.

mTLS

Now things will get a bit more complicated. While both Service Meshes and Dapr implement mTLS as well as fine-grained authorization policies based on the client certificate presented by the caller to the callee, the level of protection of Dapr-injected services is not quite the same as the one from Mesh-injected services.

Roughly, you might think that you end up with something like this:

A very comparable way of working between Dapr and Linkerd. This is correct, but only to some extent. If we take the happy path, meaning every pod is injected by Linkerd or Dapr, we should end up in the above situation. However, in a K8s cluster, not every pod is injected by Dapr or Linkerd. The typical reason why you enable mTLS is to make sure injected services are protected from the outside world. By outside world, I mean anything that is neither Dapr-injected nor Mesh-injected. However, with Dapr, nothing prevents the following situation:

The blue path takes the Dapr route and is both encrypted and authenticated using mTLS. However, the green paths, from both a Dapr-injected pod and a non-Dapr pod, still go through in plain text and anonymously. How is that possible?

For the blue path, the application goes through the Dapr route, http://localhost:3500/, which is the port that the daprd sidecar listens on. In that case, the sidecar finds out the location of the target and talks to the target service's sidecar. However, because Dapr does not intercept network calls, nothing prevents you from taking a direct route, from both a Dapr-injected pod and a non-Dapr one (green paths). So, you might end up in a situation where you enforce a strict authorization policy as shown below:

apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: multiplication
  namespace: dapr-calculator
spec:
  accessControl:
    defaultAction: deny
    trustDomain: "public"
    policies:
    - appId: mathfanboy
      defaultAction: allow
      trustDomain: "public"
      namespace: "dapr-calculator"
    - appId: percentage
      defaultAction: allow
      trustDomain: "public"
      namespace: "dapr-calculator"

where you only allow MathFanBoy and Percentage to call the multiplication operation, and yet have other pods bypass the Dapr sidecar, which ultimately defeats the policy itself. Make no mistake, the reason why we define such policies is to enforce a certain behavior and I don't have peace of mind if I know that other routes are still possible.

So, in summary, Dapr's mTLS and policies are only effective if you take the Dapr route but nothing prevents you from taking another route.

Let us see how this works with Linkerd. As stated on their web site, Linkerd also does not enforce mTLS by default and has added this to their backlog. However, with Linkerd (same and even easier with Istio), we can make sure that only authorized services can talk to meshed ones. So, with Linkerd, we would not end up in the same situation:

The first thing to notice is that we simply use the service name to contact our target, because there is no Dapr route in this case, nor any service discovery feature. However, Linkerd leverages the Ambassador pattern, which intercepts all network calls coming into and going out of a pod. Therefore, when the application container of a Linkerd-injected pod tries to connect to another service, Linkerd's sidecar performs the call to the target, which lands on the other sidecar (provided the target is indeed a Linkerd-injected service, of course). In this case, there is no issue. Of course, as with Dapr, nothing prevents us from directly calling the pod IP of the target. Yet, from an injected pod, the Linkerd sidecar will intercept that call. From a non-injected pod, there is no outbound sidecar, but our target's sidecar will still handle inbound calls, so you can't bypass it. By default, because Linkerd does not enforce mTLS, it will let the call through, unless you define fine-grained authorizations as shown below:

apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  namespace: rest-calculator
  name: multiplication
spec:
  podSelector:
    matchLabels:
      app: multiplication
  port: 80
  proxyProtocol: HTTP/1

---
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  namespace: rest-calculator
  name: multiplication-from-mathfanboy
spec:
  server:
    name: multiplication
  client:
    meshTLS:
      identities:         
        - mathfanboy
        - percentage

In this case, only MathFanBoy and Percentage will be allowed to call the multiplication operation. In other words, Linkerd allows us to enforce mTLS, whatever route is taken. With Istio, it's even easier since you can simply enforce mTLS through the global mesh config. You do not even need to specify explicit authorization policies (although it is a best practice). Just to illustrate the above diagrams, here are some screenshots showing these routes in action:

I'm first calling the multiplication operation from the addition pod, while we told Dapr that only MathFanBoy and Percentage could call multiplication. As you can see, the Dapr policy kicks in and forbids the call as expected.

However, while this policy is defined, I can still call multiplication using a direct route (pod IP):

and the same applies to non-injected pods of course.

With the Linkerd policy in place, by contrast, there is no way to call multiplication other than from MathFanBoy and Percentage. For the sake of brevity, I won't show you the screenshots, but trust me, you will be blocked if you try.
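As a side note on the Istio remark above, one common way to express mesh-wide strict mTLS is a PeerAuthentication resource placed in the mesh's root namespace. The sketch below assumes that the root namespace is istio-system, which is the usual default but may differ in your installation. It is only an illustration and was not part of the calculator demo.

```yaml
# Mesh-wide policy: sidecars only accept mTLS traffic.
# Assumption: istio-system is the root namespace of the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
```

With such a policy, plain-text calls from non-injected pods should be rejected by the target's sidecar, even without explicit authorization policies.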

Let us now focus on the injection process which will clarify what is going on behind the scenes.

Injection process Dapr vs Service Mesh

Both Dapr and Service Meshes will inject application pods according to annotations. They both have controllers in charge of injecting their victims. However, when looking at the lifecycle of a Dapr-injected pod as well as a Linkerd-injected pod, we can see noticeable differences.
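To make the annotation-driven injection a bit more tangible, here is a minimal sketch of the pod template metadata for each flavor. Only the annotations matter here. The app ID, the app port (80) and the Configuration name are assumptions modeled on the multiplication service, not copied from the demo repositories.

```yaml
# Linkerd: a single annotation asks the proxy injector to mesh the pod.
template:
  metadata:
    annotations:
      linkerd.io/inject: enabled
---
# Dapr: the annotations also describe the application to its sidecar.
template:
  metadata:
    annotations:
      dapr.io/enabled: "true"
      dapr.io/app-id: "multiplication"   # assumed app ID, used by callers
      dapr.io/app-port: "80"             # assumed port the app listens on
      dapr.io/config: "multiplication"   # assumed name of the Configuration to load
```

Note how the Dapr annotations already carry application-level information (app ID, app port), whereas the Linkerd annotation says nothing about the application itself.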

When injecting Linkerd into an application, in plain Kubenet (not using the CNI plugin), we notice that Linkerd injects not only the sidecar but also an Init Container:

When looking more closely at the init container, we can see that it requires a few capabilities such as NET_ADMIN and NET_RAW. That is because the init container rewrites the iptables rules to make sure network traffic entering and leaving the pod is captured by Linkerd's sidecar. When using Linkerd together with a CNI, the same principle applies, but the rewriting is not done by the init container. No matter how you use Linkerd, all traffic is redirected to its sidecar. This means that the sidecar cannot be bypassed.
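For readers who cannot see the screenshot, the relevant part of an injected pod spec looks roughly like the fragment below. This is a trimmed sketch: the image, arguments and remaining fields are left out on purpose, and the exact content depends on your Linkerd version.

```yaml
# Fragment of a Linkerd-injected pod spec (non-CNI mode), trimmed.
initContainers:
- name: linkerd-init
  # image, args and other fields omitted
  securityContext:
    capabilities:
      add:
      - NET_ADMIN   # needed to modify the pod's iptables rules
      - NET_RAW
```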

When injecting Dapr, we see that there is no Init Container and only the daprd container (sidecar) is injected:

There is no rewrite of any iptables rule, meaning that the sidecar can be bypassed without any problem, thereby bypassing Dapr routes and Dapr policies. In other words, we can easily escape the Dapr world.

Wrapping up

As stated initially, I mostly focused on the impact of Dapr or a Service Mesh on the application itself and how the overall protection given by mTLS varies according to whether you use Dapr or a Service Mesh. I hope it is clear by now that Dapr is definitely an application framework that infuses the application code, while a Service Mesh is completely transparent to the application. Note that the latter is only true when using a decent Service Mesh. By decent, I mean something stable, performant and reliable. I was recently confronted with a Mesh that I will not name here; it was a true nightmare for the application and kept breaking it.

Although Dapr & Service Meshes seem to have overlapping features, they are not equally covering the workloads. With regards to the initial question about when to use Dapr or a Service Mesh, I would take the following elements into account:

- For distributed architectures that are also heavily event-driven, Dapr is a no-brainer because it brings many features to the table to interact with message and event brokers, as well as state stores. Yet, Service Meshes could still help measure performance, spot issues and load balance traffic by understanding protocols such as HTTP/2, gRPC, etc. Meshes would also help in the release process of the different services, splitting traffic across versions, etc.

- For heterogeneous workloads, with a mix of APIs, self-hosted databases, self-hosted message brokers (such as RabbitMQ), etc., I would go for Service Meshes.

- If the driver for choosing a solution is primarily security, I would go for a Service Mesh.

- If you need to satisfy all of the above, I would combine Dapr and a Service Mesh for microservices, while using Service Mesh only for the other types of workloads. However, when combining, you must consider the following aspects:

- Disable Dapr's mTLS and let the Service Mesh manage this, including fine-grained authorization policies. Beware that by doing so, you would lose some Dapr functionality, such as defining ACLs on the components.

- Evaluate the impact on the overall performance as you would have two sidecars instead of one. From that perspective, I would not mix Istio & Dapr together, unless Istio's performance dramatically improves over time.

- Evaluate the impact on the running costs because each sidecar will consume a certain amount of CPU and memory, which you will have to pay for.

- Assess whether your Mesh goes well with Dapr. While an application is agnostic to a mesh, Dapr is not, because Dapr also manipulates K8s objects such as K8s services, ports, etc. There might be conflicts between what the mesh is doing and what Dapr is doing. I have seen Dapr and Linkerd be used together without any issues, but I've also seen some Istio features being broken because of Dapr naming its ports dapr-http instead of http. I reported this problem to the Dapr team 2 years ago but they didn't change anything.
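Regarding the first consideration above (letting the mesh own mTLS), Dapr's mTLS is controlled on Kubernetes through its system-wide Configuration resource. Here is a minimal sketch, assuming the default configuration name daprsystem and a control plane installed in the dapr-system namespace. Both may differ in your cluster, and the Dapr control plane typically needs a restart for the change to take effect.

```yaml
# Turns off Dapr's own mTLS so that the Service Mesh handles it instead.
# Assumptions: default "daprsystem" configuration, "dapr-system" namespace.
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: daprsystem
  namespace: dapr-system
spec:
  mtls:
    enabled: false
```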