Developing secure coding in Azure Sphere

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

"How do you develop secure code?" I've been asked this a lot recently and it is time for a blog post as the public and various parts of Microsoft have gotten a glimpse of how much Azure Sphere goes through to hold to our security promises. This will not be a short blog post as I have a lot to cover.

I joined Microsoft in February 2019 with the goal of improving the security posture of Azure Sphere, and I also wanted to give back to the public community when possible. My talk at the Platform Security Summit covered the hardware, but it did not provide detailed answers on how to improve security for software development and testing. Azure Sphere's focus is IoT; however, we still give back to the open source community and want to see the software industry improve too. This blog post is meant to reveal the efforts we go through and hopefully help others understand what it means to sign up to saying you are secure.

First, you must realize that there is no magic bullet or single solution that keeps a system secure; there is a wide range of potential issues and problems that require solutions. You must also accept that all code must be questioned regardless of where it is from. So how does Azure Sphere handle this? We have a range of guidelines and rules that we refuse to bend on. These become our foundation when questions come up about allowing an exception or whether a new feature has the right design: we check whether it violates one of those rules. The 7 properties are a good starting point and provide a number of overarching requirements. They are further extended with internal requirements that we do not want to bend on, an example being that no unsigned code can ever be allowed to execute.

What is your foundation? What lines are you willing to draw in the sand that cannot be compromised, bent, or ignored? Our foundation of what we deem secure is always being improved and expanded, with the following being a few examples. These are meant to help drive thoughts on what works for your own software and are not meant to be a complete list. Having these lines along with the 7 properties allows for better security decisions as new features are added.

No unsigned code allowed on the platform
Will delay or stop feature development that risks the security of the platform
Willing to do an out-of-band release if a security compromise or CVE is impactful enough
NetworkD is the only daemon allowed to have CAP_NET_ADMIN and CAP_NET_RAW
Assume the application on a real-time core is compromised

Now that a base foundation is laid, we can build on top of it to better protect the platform. Although we require all code to be signed, that does not mean it is bug free; humans make mistakes and it is easy to overlook a bug. We can bring tools to the table to help remove both low-hanging fruit and easily abusable bugs. In our case we use clang-tidy and Coverity Static Analysis.

clang-tidy has over 400 checks that can be run as part of static code analysis, and we run with over 300 of them enabled. Although clang-tidy can detect a lot of simpler bugs, it does not do a good job of deeper introspection at a static level; this is where Coverity comes into play. Coverity picks up where clang-tidy leaves off and provides deeper introspection of data usage between functions, along with more stringent validation of potentially failing code paths. A good question at this point is "how often should such tools be run?" We want to eliminate coding flaws quickly, so both tools are run on every pull request (PR) to our internal repos, and both must pass before the PR is allowed to complete. This requirement does impact build times, but not as much as may be expected. Our base code running under Linux is approaching 250k lines of code in the repository across 2.4k files; running both tools adds roughly 8 minutes to the build, which is not a bad trade-off for the amount of validation being done.

clang-tidy and Coverity are not enough for us; it is still possible to mess up string parsing and manipulation, array handling, and integer math, and to introduce various other bugs. To help reduce this risk we heavily use C++ internally and rely on objects for string and array handling. Our normal world, secure world, Pluton, and even our ROM code are almost all C++, and we do what we can to limit assembly usage in the low-level parts of the system. C++ does introduce new risks though, object confusion through inheritance being one example. Part of our coding practice is that very few classes are allowed to inherit, and this is further validated during the PR process, which requires sign-off by someone besides the creator of the PR.

Larger projects require more contributors, which increases the risk of insecure or buggy code being checked in, as no single person is able to understand the whole code base. Allowing anyone to submit a PR and then approve it themselves runs the risk of introducing bugs, but having every lead engineer set up to sign off causes complications and bottlenecks when leads are only responsible for specific parts of the code base. Our PR pipeline has a number of automated scripts. One of them looks at a maintainers.txt file, which contains a directory or file path along with an Azure DevOps (ADO) group name on each line, allowing an ADO group to own a specific directory and anything under it. The script auto-adds the required groups as reviewers to the PR, forcing group sign-off based on which files were modified, and avoids complex git repository setups where a group is responsible per repository. With the growth of Azure Sphere's internal teams, the maintainers.txt file is re-evaluated every 6 months by the team leads to make sure that the ownership specified for each directory in our repositories still makes sense.
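
To make the ownership lookup concrete, here is a minimal Python sketch of how a script like this could map changed files to reviewer groups. It is not our actual pipeline script. The file layout ("path, whitespace, group name"), the "#" comment lines, the group names, and the function names are all assumptions made up for illustration, and the call into ADO that would actually add the reviewers is left out.

```python
from pathlib import PurePosixPath

# Hypothetical maintainers.txt content: "<path> <ADO group name>" per line.
# The layout, comment syntax and group names are illustrative assumptions.
MAINTAINERS_TXT = """\
# path               group
src/networkd/        Sphere-Networking
src/kernel/          Sphere-Kernel
src/kernel/ptrace.c  Sphere-Security
docs/                Sphere-Docs
"""


def parse_maintainers(text):
    """Return a list of (path, group) rules, skipping blank and comment lines."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first run of whitespace so group names may contain
        # spaces; paths containing spaces are not handled in this sketch.
        path, group = line.split(None, 1)
        rules.append((PurePosixPath(path), group.strip()))
    return rules


def required_reviewers(rules, changed_files):
    """Return every group owning a changed file or a directory above it."""
    groups = set()
    for changed in changed_files:
        changed_path = PurePosixPath(changed)
        for owned, group in rules:
            if changed_path == owned or owned in changed_path.parents:
                groups.add(group)
    return groups


rules = parse_maintainers(MAINTAINERS_TXT)
print(sorted(required_reviewers(rules, ["src/kernel/ptrace.c", "src/networkd/dhcp.cpp"])))
# prints ['Sphere-Kernel', 'Sphere-Networking', 'Sphere-Security']
```

In this sketch a file picks up every group that owns it or any directory above it, so a narrower rule adds a reviewer instead of replacing the broader owner. A real script could just as reasonably let the most specific rule win.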

What about new features, and making sure they don't introduce vulnerabilities into the system, from information disclosure to policies that may impact the system? New features have a Request For Comment (RFC) process in which a number of requirements must be signed off before coding begins. One of the requirements is a security-related section where potential risks are listed along with any tests and fuzzing efforts that will be done. Potential risks can be anything, including additional privileges that are required, file permission changes, manifest additions, or a new data parser. The security team evaluates the design itself along with the identified risks and the efforts to mitigate them before signing off on the design. Security sign-off is required to implement the new feature; however, the validations do not stop once the RFC is approved. Before the feature is exposed to customers it goes through a second check by the security team to make sure that what was designed is what was implemented for security-relevant areas, that the tests for both good and bad input were sufficient, and that fuzzing was done to a satisfactory level. At this point the security team signs off and the feature is made available to customers in a new version of the software.

I've mentioned fuzzing, but what does that actually mean and entail? At its core, fuzzing is simply any method of providing unexpected input to a piece of code and determining whether the code does anything that is not expected. How smart your fuzzer is and what you are watching determine what "not expected" means; it could be a range of things including crashing the software, information leakage, invalid data returns, excessive CPU usage, or causing the software to become unresponsive and hang. Fuzzing for Azure Sphere currently means looking for anything that causes a crash or a hang of the system. Our platform is very CPU constrained and does not lend itself to fuzzing on the actual hardware; however, with our C++ design we are able to recompile our internal applications for x86, allowing us to use AFL-based fuzzers on different components of the system, and we are in the process of adopting OneFuzz. At times our fuzzing efforts reveal crashes in open source components, which are promptly reported to the appropriate parties, sometimes with a patch included, allowing us to not only improve our security and stability but also benefit the open source ecosystem.
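
For readers who have never seen a fuzzer, the idea can be sketched in a few lines of Python. This is a deliberately naive random-mutation loop, not AFL's coverage-guided approach and not what we run internally. The toy parser, its deliberate bug, and all the names are invented for illustration. "Not expected" here means any exception other than a clean ValueError rejection; hang detection would need a timeout around the target and is left out.

```python
import random


def parse_record(data: bytes) -> int:
    """Toy parser standing in for a component under test (hypothetical).

    Format: one length byte followed by that many payload bytes.
    Deliberate bug: the length byte is trusted and never checked against
    the number of payload bytes actually present.
    """
    if not data:
        raise ValueError("empty input")   # expected, clean rejection
    length = data[0]
    payload = data[1:1 + length]
    return payload[length - 1]            # IndexError if payload is short


def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Return a copy of seed with a few random byte changes, inserts or deletes."""
    data = bytearray(seed)
    for _ in range(rng.randint(1, 4)):
        choice = rng.randrange(3)
        if choice == 0 and data:
            data[rng.randrange(len(data))] = rng.randrange(256)
        elif choice == 1:
            data.insert(rng.randint(0, len(data)), rng.randrange(256))
        elif data:
            del data[rng.randrange(len(data))]
    return bytes(data)


def fuzz(target, seed: bytes, iterations: int, rng: random.Random):
    """Run target on mutated inputs and collect the inputs that 'crash' it."""
    crashes = []
    for _ in range(iterations):
        candidate = mutate(seed, rng)
        try:
            target(candidate)
        except ValueError:
            pass                          # bad input rejected cleanly: fine
        except Exception as exc:          # anything else counts as a crash
            crashes.append((candidate, type(exc).__name__))
    return crashes


# A fixed RNG seed keeps the run reproducible, so any finding can be replayed.
found = fuzz(parse_record, b"\x03abc", 500, random.Random(1234))
print(f"{len(found)} crashing inputs found")
```

Each crashing input is kept so the failure can be reproduced and turned into a regression test. A coverage-guided fuzzer goes further by keeping the mutations that reach new code and building on them.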

Fuzzing each component and validating that it runs properly is not enough; we must also fuzz and test the communication paths between applications and between chips. Fuzzing at this level requires a complete system for testing end-to-end interactions, which is possible with QEMU and allows us to extend the types of fuzzing we can do. We not only have the ability to fuzz various parts of the system in end-to-end tests with AFL, but are also able to test the stability and repercussions of network-related areas, including dropped network packets and noisy network environments, by altering what QEMU does with the emulated network traffic.

Our ability to recompile the code to run on our desktop computers allows us to use other useful tools like Address Sanitizer (ASan) and Valgrind. We use the gtest framework for testing individual parts of the system, including stress testing, and it is during these tests that ASan and Valgrind are deployed. ASan allows for detection of improper memory usage in dynamic environments and of use-after-free bugs, while Valgrind adds detection of uninitialized memory usage, reading and writing memory after it is freed, and a range of other memory-related validations. It could be considered overkill to use this many tools on the code; however, every tool has both strengths and weaknesses and no single tool is capable of detecting everything. The more bugs we can eliminate, the harder it is to attack the core system as it evolves.

We use a lot of external software as part of our build system and a number of open source projects on the actual hardware. All of them are susceptible to CVEs, which have to be monitored and quickly handled. We use Yocto for our builds and it comes with a useful tool, cve-check, that queries the MITRE CVE database. Our daily builds are set up to run this tool each night and alert us to any new CVEs that impact not only our system but also the tools our build infrastructure relies on. We need to protect both our end product and our build system; a weak build system gives an attacker a different target, and manipulated builds can be far more damaging across the product.
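
The "alert us to any new CVEs" step boils down to comparing tonight's findings with last night's. The Python sketch below shows only that comparison and is an illustration, not our tooling. It assumes the scanner output has already been reduced to a mapping of package name to CVE IDs; the real cve-check report format is not modeled, and the package names and CVE IDs are placeholders.

```python
def newly_reported(previous, current):
    """Return {package: sorted CVE IDs} for IDs that were not in previous.

    Both arguments map a package name to an iterable of CVE ID strings.
    This shape is an assumption of the sketch; a real pipeline would first
    parse the scanner's report into it.
    """
    alerts = {}
    for package, ids in current.items():
        fresh = set(ids) - set(previous.get(package, ()))
        if fresh:
            alerts[package] = sorted(fresh)
    return alerts


# Placeholder data: the names and IDs below are not real packages or CVEs.
last_night = {
    "libexample": {"CVE-0000-0001"},
    "example-build-tool": set(),
}
tonight = {
    "libexample": {"CVE-0000-0001", "CVE-0000-0002"},   # runs on the device
    "example-build-tool": {"CVE-0000-0003"},            # build host only
}

for package, ids in newly_reported(last_night, tonight).items():
    print(f"NEW for {package}: {', '.join(ids)}")
```

Software that ships on the device and tools that only run on the build host go through the same comparison, since a finding in either one needs attention.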

We bring static code analysis, fuzzing, and extra validations to bear to help with the security of the system, and we still bring even more to the table. The security team I run is responsible for coordinating red teams to look at our software and hardware. We rely on both internal and external red teams, and we also interact with and help the public bug bounties for Azure Sphere. As a defender we have to defend against everything, while an attacker only needs a single flaw. When you are focused on the day-to-day things it is easy to overlook a detail, so bringing in teams from outside the immediate organization, with the latest tools and techniques of attack, provides a fresh set of eyes to help discover what was overlooked.

The security team for Azure Sphere looks a lot like a red team; we are constantly trying to figure out an answer to "If I control X what can I do now?" Once we have an answer we evaluate whether there is a better solution to harden the platform, while also determining what can be strengthened or improved while remaining invisible to customers. We want a secure platform for customers to develop on without additional complexities, hence the effort to be as invisible as possible for security-relevant validations. This has resulted in a range of internal changes, from having a common IPC server code path with simple validations before handing IPC data off to other parts of the platform, to Linux kernel modifications. My personal view is that teaching a development team to think about security is hard, while teaching a red team to create solutions and become developers is easier. This view is applied to the security team I run, resulting in a more red-team-centric group that is always looking for ways to break the system and then helping create solutions. This type of thought process is what drove the memory protection changes in the kernel and the ptrace work that was in progress when the bug bounty event identified it publicly, and it even drives which compilation flags are used.

Another area that people don't always recognize is that overly complex code can result in normally unused but still accessible code paths. Code that should be unused but is not removed not only wastes space, it is also a place for bugs to lurk and can be useful to an attacker; remember, they only need to find a single opening. Although our platform is small and the limited memory footprint already gives us a reason to limit our code size, we actively try to keep unused code from sitting around, even in the open source libraries we use. We don't need every feature in the Linux kernel to be turned on, so we turn off as much as possible to remove risks; less code compiled in means fewer areas for bugs to lurk. We apply this stripping mentality to everything we can, which is why we don't have a local shell, why we custom compile wolfSSL to be stripped down and smaller, and why we limit what is on the system. We want to remove targets of interest, which also allows us to focus on the targets that are left.

Creating and maintaining security for a platform is not easy, and claiming you are secure often draws a bullseye on your product, as hackers enjoy proving someone wrong about how secure they are. The effort required to not only design and create but also maintain a secure product is not easy, is not simple, and is not cheap. Putting security first and not compromising is a very hard choice, as it directly impacts how fast you can change and adapt for customers. This blog post only covered what we do internally to help write secure code and does not touch on a range of topics including the build verification testing (BVT) done in emulation and on physical hardware, code signing and validation, certificate validation and revocation for cloud or resource access, secure boot, or any of the physical hardware security designs that were mentioned during my talk.

I, along with Azure Sphere, want to see a more secure future. Security requires dedication and the willingness to not compromise: it must come first and not be an afterthought, it must be maintained, and it can't be delayed until later due to deadlines. This blog post is one of many ways of trying to give back to the public. Hopefully it has helped people understand the effort it takes to keep a system secure, and can encourage both ideas and conversations about security during product development outside of Azure Sphere.

Jewell Seay
Azure Sphere Operating System Platform Security Lead
If you do not push the limits then you do not know what limits truly exist
