This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.
"2020 is the start of a new decade – or does it start in 2021?" That was the debate on social media as we crossed over into the new year. There was also a lot of talk about remembering Y2K twenty years later, which inevitably led to speculation about how computers will behave in Y2038.
However, what's more interesting to me is just how little we discuss software bugs related to leap years, which occur much more frequently than any of these grand events. In an attempt to remedy this, I'd like to tell you a little about leap year bugs, how you can spot them in your code, and what we have been doing here at Microsoft to prevent them from impacting our products and services.
What are leap years, and why do we have them?
A leap year is a year that contains an extra day, which we observe on February 29th and call a leap day. Because of the extra day, there are 366 days in the year instead of the usual 365.
Many think leap years occur every four years, but the exact algorithm is slightly more complicated:
- Is the year evenly divisible by 4? Then it is a leap year. (Examples: 2012, 2016, 2020, 2024)
- … Unless it is also evenly divisible by 100. Those are not leap years. (Examples: 1800, 1900, 2100)
- … Except that years that are evenly divisible by 400 are leap years. (Examples: 1600, 2000, 2400)
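The three rules above translate directly into code. Here is a minimal Python sketch; the function name `is_leap_year` is only an illustrative choice, and Python's standard library already provides an equivalent in `calendar.isleap`.

```python
def is_leap_year(year: int) -> bool:
    """Return True if `year` is a leap year in the proleptic Gregorian calendar."""
    if year % 400 == 0:   # divisible by 400: leap year (1600, 2000, 2400)
        return True
    if year % 100 == 0:   # otherwise divisible by 100: common year (1800, 1900, 2100)
        return False
    return year % 4 == 0  # otherwise divisible by 4: leap year (2012, 2016, 2020)


for y in (1900, 2000, 2020, 2021, 2100):
    print(y, is_leap_year(y))
```

Checking the rules from most specific (400) to least specific (4) keeps each branch a single comparison.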
Leap years are an important part of our calendar system, as they keep seasons and astronomical events from drifting from one year to the next. This is because the mean time it takes for the Earth to actually go around the sun is slightly more than 365 days, but not quite 366 days. The leap year algorithm approximates this as 365.2425 days – though it's important to note that no given year has a fractional number of days in it. Rather, we decide when to add an extra whole day and when not to.
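The 365.2425 figure can be checked by counting leap days over one full 400-year cycle, since the rules repeat after that. A small Python sketch of the arithmetic, where the choice of 2000 through 2399 as the cycle is arbitrary:

```python
import calendar

# The Gregorian rules repeat every 400 years, so one cycle is enough.
leap_years = sum(1 for year in range(2000, 2400) if calendar.isleap(year))
total_days = 400 * 365 + leap_years
mean_year_length = total_days / 400

print(leap_years)        # 97
print(total_days)        # 146097
print(mean_year_length)  # 365.2425
```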
Note that this information refers to the proleptic Gregorian calendar, which is the primary calendar system used in business and computing today. Other calendar systems (such as the Buddhist calendar, Hebrew calendar, Hijri calendar, and others) have different rules for observing leap days or months.
What about leap seconds, time zones, daylight saving time, time on Mars, relativity, etc.?
While these are all fascinating topics, none are related to leap years. Even leap seconds, despite having the word "leap" in their name, are related to a completely different phenomenon, and thus I won't dive into them in this post.
OK then, what is a leap year bug?
A leap year bug is what happens in software when programs that work with dates do not correctly take leap years into account. They might simply misapply the leap year algorithm, or they might ignore the difference between leap years and common years when manipulating dates.
I can demonstrate this without any code at all. Let me ask you a simple question: What date will it be a year from today? Let's say that "today" is January 1st, 2020. Got the answer? Now think about how you figured that out. Likely you did something like this:
- 2020 plus 1 year is 2021
- Today is January 1st
- The result is January 1st, 2021
If you did that, congratulations! You have just created a leap year bug! Don't feel bad; even the very best software engineers sometimes do this.
Not getting it? Ok – let's try that exact same process again, but this time let's say that "today" is leap day – February 29th, 2020:
- 2020 plus 1 year is 2021
- Today is February 29th
- The result is February 29th, 2021
See the problem? The resulting date does not exist. 2021 is not a leap year; it is a common year, and thus February has only 28 days in that year.
As a human being, if you were looking for February 29th on the calendar and it wasn't there, likely you would just pick February 28th and move on to better things. Computers, however, only do exactly what we tell them to do, and in many programming languages invalid input is expected to return an error.
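The mental steps above can be written out field by field. This is a Python sketch only; `naive_add_year` and `is_valid_date` are hypothetical helper names, and the validity check relies on the standard library's `calendar.monthrange`.

```python
import calendar


def naive_add_year(year, month, day):
    # Mirrors the mental steps: bump the year, keep month and day unchanged.
    return (year + 1, month, day)


def is_valid_date(year, month, day):
    # calendar.monthrange returns (weekday of the 1st, number of days in the month).
    if not 1 <= month <= 12:
        return False
    return 1 <= day <= calendar.monthrange(year, month)[1]


print(is_valid_date(*naive_add_year(2020, 1, 1)))   # True
print(is_valid_date(*naive_add_year(2020, 2, 29)))  # False
```

The naive step never fails on its own; the trouble only shows up when something later tries to treat the three numbers as a real date.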
What kind of impact can leap year bugs have?
Leap year bugs typically fall into two impact categories:
- Category 1: Those that lead to error conditions, such as exceptions, error return codes, uninitialized variables, or endless loops
- Category 2: Those that lead to incorrect data, such as off-by-one problems in range queries or aggregation
It's generally Category 1 bugs, such as the one described above, that are the most concerning. These are the type that are responsible for cloud service outages, bricked personal media players, airport baggage handling mishaps, and catastrophic industrial equipment failures.
Category 2 bugs do not typically lead to outages, but that doesn't mean they aren't important. They occur when "365 days" is substituted in place of a year – akin to calling a month "30 days". Most of us know that not every month has 30 days. It may suffice as an approximation, but is no good when you need a precise answer. Likewise, adding 365 days is fine if you need "about a year", perhaps for an expiration date. However, it won't lead to correct results if you're calculating financial reports or figuring out how many years old someone is.
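To make the Category 2 case concrete, here is a Python sketch comparing "365 days" with a calendar year. The function names and sample dates are illustrative assumptions, and the calendar-based age rule shown is just one common convention.

```python
from datetime import date, timedelta

start = date(2020, 1, 1)
print(start + timedelta(days=365))  # 2020-12-31, one day short of a full year


def age_by_days(born: date, today: date) -> int:
    # Approximation: treats every year as exactly 365 days.
    return (today - born).days // 365


def age_by_calendar(born: date, today: date) -> int:
    # Counts completed calendar years; subtracts one if this year's birthday is still ahead.
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


born = date(2000, 3, 1)
today = date(2020, 2, 28)
print(age_by_days(born, today))      # 20
print(age_by_calendar(born, today))  # 19
```

No error is raised anywhere, which is exactly why this kind of bug tends to go unnoticed: the day-based version simply reports the person as a year older, two days too early.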
What is Microsoft doing to protect me from leap year bugs?
Leap year bugs can be incredibly challenging to find, especially in large projects. It's even harder if you need to sift through billions of lines of source code across many different divisions of a company like Microsoft. However daunting the task, we've been taking leap year readiness quite seriously in order to reduce the risk of impact to our customers on Feb 29th. Over the past 9 months or so, I've been working with a small team of engineers in Azure to prepare the company for leap year. We've been searching through source code looking for potential leap year bugs, through both manual and automated approaches. We started within Azure, building upon some of the efforts from 2016's leap year, then expanded our scope to include a much larger portion of Microsoft. Tens of thousands of source code repositories have now been scanned, and are continually being scanned as new code is written by product teams.
A large part of the challenge is positive identification. Many of the items we find may match patterns associated with leap year bugs, but turn out to be benign. For example, we might detect that a year of a date is being incremented but miss that the day is always the 1st of a month – in which case there is no problem. We also tend to find more issues in test code than in code that makes its way into a product or service. Tests can still be important, as many teams depend on passing tests in order to build and release updates. In order to address these challenges, we've enlisted the aid of thousands of Microsoft engineers on each of the product or service teams. They examine our findings, triage them, and take action where necessary. Through this tedious process, we have identified and repaired many leap year bugs well before they can become an issue for customers.
Azure Engineering has fully audited the Azure services and supporting services, libraries, and operating systems for leap year issues. While the code has been reviewed extensively, we also recognize that this is not a foolproof process and one approach is not enough. As we want to protect our customers in every way possible, we have prepared the company through multiple avenues including:
- Code scanning, testing, and manual review where necessary
- Documentation of the leap year risk, the code patterns involved, testing approaches, and past occurrences
- Training sessions and presentations, both to existing engineers and to new hires
- Participation in internal events to raise awareness among engineers and product managers
- Broad internal communications of our efforts through both email and physical posters hung around campus
- External communications, such as the blog post you're reading now
- An action plan for the leap day weekend, including heightened awareness, monitoring, and escalation channels
I'm a developer. What does a leap year bug look like in code?
Here is an example of .NET code containing a leap year bug, written in C#, that uses the DateTime structure. It is trying to add a year to today, but it is doing it in a way that doesn't account for February 29th.
There are variations of this of course. Perhaps the data type is a DateTimeOffset structure. Perhaps the date doesn't originate with Now or Today but comes from a stored value. Perhaps the number of years being added or subtracted is variable. In all variations, the problem is the same – when the result is a date that doesn't exist, an exception is thrown. (Specifically, an ArgumentOutOfRangeException.)
For .NET, the solution is simple. Just use the AddYears method:
The AddYears method is built in, and works by doing what I said a human might do. It extends the logic to ensure the resulting value is valid. If not, it adjusts to the last day of the month (February 28th).
Now let's take a look at some C++ code that exhibits the same problem, using the Windows SYSTEMTIME structure.
At this point, the st variable could contain an invalid date, such as February 29, 2021.
Unlike .NET's DateTime structure, the Windows SYSTEMTIME structure doesn't care if it represents a valid date or not. It is just a plain data type containing separate integers for year, month, day, and so on. It is not until the structure is used by a function that these fields are recomposed into a logical date, and it's up to that function to decide how to deal with invalid dates.
- Most functions that accept a SYSTEMTIME value will return an error result when passed an invalid date. A good example is SystemTimeToFileTime.
- It returns a boolean result code, which is TRUE as long as the input date was valid. (Because this is usually the case, developers sometimes don't bother to check this result!)
- When it returns FALSE, that indicates the conversion failed. The FILETIME structure that was intended to receive the output will be left as is. And there's the problem – perhaps that value is 0, representing January 1st, 1601. That certainly won't have the desired effect!
- By contrast, the SystemTimeToVariantTime function doesn't fail on dates like February 29, 2021. Instead, it is treated as if it were March 1st, 2021. Perhaps this is still a bug for your code, as you may have expected February 28th, but it is the Category 2 type described earlier.
So what is the solution for this C++ code? Well, unlike .NET, there is no built-in AddYears method to call. Instead, one has to ensure the structure has a valid date.
Consider the following:
That may look like a lot, but what it's doing makes sense: First it tests if the year is a leap year. If it's not, and it's February 29th, it moves it back to February 28th.
Alternatively, one might consider converting to a FILETIME, adding 365 days, then converting back to a SYSTEMTIME. If doing so, ask yourself if it's ok for the result to sometimes be off by a day.
What are common locations where I might find a leap year bug?
One area you might find a Category 1 leap year bug is when preparing valid from/to dates for certificates. Certificates must have valid dates on them, so passing invalid ones will likely fail certificate generation processes. One such API in Windows is CertCreateSelfSignCertificate, which creates a "self-signed" certificate, and is a very common case for a leap year bug. If you have code that generates certificates (via any mechanism), you should examine the logic used for determining their validity dates carefully.
Another area where leap year bugs are common is when dealing with anniversary dates, such as birth dates. Did you know that a person born on a leap day is called a "leapling"? I've certainly heard stories from leaplings describing how bothersome it is when some computer system won't accept their birth date, but the risk goes beyond that.
Imagine you have a user's date of birth, or date of employment, or date of first login to your website stored in a database. Maybe every year you want to send them a "happy birthday" or "happy anniversary" email, or perhaps an invoice. A common way to figure out when to do that might be to take the month and day from the user and apply it to the current year. If you do just that, then you have a leap year bug that will occur when it's not a leap year!
Let's see that in C# code:
One approach to remedy the problem would be to figure out the number of days in the month and adjust, like this:
Another approach, which is arguably simpler, is to just add the difference in years like this:
Really, a leap year bug might be anywhere that dates are being manipulated by some logic, not just these.
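The anniversary scenario above can also be sketched in Python. The helper name `anniversary_in_year` is hypothetical, and observing a February 29th anniversary on February 28th in common years is an assumption here (some applications prefer March 1st).

```python
import calendar
from datetime import date


def anniversary_in_year(original: date, year: int) -> date:
    # Clamp the day to the length of the month in the target year,
    # so February 29th becomes February 28th in a common year.
    last_day = calendar.monthrange(year, original.month)[1]
    return date(year, original.month, min(original.day, last_day))


birthday = date(2000, 2, 29)
print(anniversary_in_year(birthday, 2021))  # 2021-02-28
print(anniversary_in_year(birthday, 2024))  # 2024-02-29

# The buggy version, date(2021, birthday.month, birthday.day), raises ValueError.
```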
Are there other types of leap year bugs?
Here are a few to watch out for:
- Assuming February only has 28 days, without considering the year.
- Assuming that one can decide whether to add 365 days or 366 days by checking the starting year or ending year.
- Testing if a year is a leap year, thinking they always occur every four years. (year % 4 is only part of the formula)
- Declaring a fixed array of 365 elements – one for each day of the year. (Which will fail on Dec 31 in a leap year)
- Accepting all combinations of year, month, and day from user input without checking for validity
- Using a leap year test to branch your code significantly (especially if the leap year branch goes untested)
- Parsing a month+day string to a data type that requires a year component
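As one illustration from the list above, here is a Python sketch of the fixed 365-element array problem. The names `DAYS_IN_YEAR` and `record_event` are made up for this example.

```python
from datetime import date

DAYS_IN_YEAR = 365  # Bug: hard-wired, ignores leap years.


def record_event(counts, day):
    # tm_yday is the 1-based day of the year: 1..365, or 1..366 in a leap year.
    index = day.timetuple().tm_yday - 1
    counts[index] += 1


counts = [0] * DAYS_IN_YEAR
record_event(counts, date(2020, 12, 30))  # day 365 of 2020, fits in the last slot

try:
    record_event(counts, date(2020, 12, 31))  # day 366 of 2020, no slot for it
except IndexError:
    print("no slot for day 366")
```

One possible remedy is to size the list from the year itself, for example `366 if calendar.isleap(year) else 365`.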
What about other programming languages?
Leap year bugs can be introduced in any programming language. However, some languages, such as JavaScript, are not as prone to Category 1 impact bugs as others.
This code won't fail on leap day, but the Date object will advance to March 1st.
Other languages like Python are prone to Category 1 impact bugs. For example:
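A minimal sketch of such code follows. The helper `one_year_later` is a hypothetical name, and it takes the date as a parameter (rather than calling `date.today()` internally) only so that it can be tried with any date.

```python
from datetime import date


def one_year_later(today: date) -> date:
    # Bug: keeps the month and day and changes only the year.
    return today.replace(year=today.year + 1)


print(one_year_later(date(2020, 1, 1)))  # 2021-01-01, fine on most days

# Typical call site:
#     expires = one_year_later(date.today())
```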
That will raise a ValueError when run on a leap day.
With either of these two languages, consider that the best option may be to use an open source library for date manipulation.
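If a library is not an option, the adjustment can also be written by hand. The following Python sketch uses only the standard library; `add_years` is a hypothetical helper, and falling back to February 28th is a policy choice rather than the only reasonable one. (python-dateutil's `relativedelta` is one example of a library that handles this case.)

```python
from datetime import date


def add_years(d: date, years: int) -> date:
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # For in-range years, only February 29th can fail here,
        # so fall back to February 28th of the target year.
        return d.replace(year=d.year + years, day=28)


print(add_years(date(2020, 2, 29), 1))  # 2021-02-28
print(add_years(date(2020, 2, 29), 4))  # 2024-02-29
print(add_years(date(2020, 6, 15), 1))  # 2021-06-15
```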
Do leap year bugs only happen on February 29th?
No. Another date that can cause problems is December 31st, as it is the 366th day of a leap year. Some applications may be hard-wired for 365 days.
Additionally, not all date manipulation happens around the current date. It's reasonable that a leap day bug could be encountered any day of the year. It's just that a lot of code works with "today" as a basis, so leap year bugs are more likely to become visible on the leap day.
How can I test my code for leap year bugs?
One excellent way to ensure your code is free of leap year bugs is unit testing, using the "Virtual Clock" pattern (also known as "Mock the Clock"). The general idea is to treat the system clock as a service, rather than as a simple property or method call. You can then test this service in the same way that you might test any other service, such as a service that makes a network call. The advantage is that one can prove deterministically that their code is resilient to a variety of dates, before or after those dates actually come about.
Note that this pattern takes several different forms and can vary slightly per language. Also, it already exists in several open source libraries, such as Noda Time for .NET. It can also be implemented manually.
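A manual implementation can be quite small. The Python sketch below shows the shape of the pattern; every class name here (`Clock`, `SystemClock`, `FakeClock`, `SubscriptionService`) is an illustrative assumption rather than part of any particular library.

```python
from datetime import date, datetime, timezone
from typing import Protocol


class Clock(Protocol):
    """The clock treated as a service that components depend on."""

    def now(self) -> datetime: ...


class SystemClock:
    """Production implementation: reads the real system clock."""

    def now(self) -> datetime:
        return datetime.now(timezone.utc)


class FakeClock:
    """Test implementation: always returns the instant it was given."""

    def __init__(self, fixed: datetime) -> None:
        self._fixed = fixed

    def now(self) -> datetime:
        return self._fixed


class SubscriptionService:
    def __init__(self, clock: Clock) -> None:
        self._clock = clock

    def renewal_date(self) -> date:
        today = self._clock.now().date()
        try:
            return today.replace(year=today.year + 1)
        except ValueError:
            # Today is February 29th and next year is a common year.
            return today.replace(year=today.year + 1, day=28)


# In a unit test, pin the clock to leap day and check the behavior.
leap_day = datetime(2020, 2, 29, 12, 0, tzinfo=timezone.utc)
service = SubscriptionService(FakeClock(leap_day))
assert service.renewal_date() == date(2021, 2, 28)
```

In production the same service would be constructed with `SystemClock()`, so the code under test is identical in both cases.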
Here is an example implementation in C#
With these defined, you can now depend on the IClock interface in your application components:
At runtime, one might wire up through a Dependency Injection container, like so:
But in unit tests, we can use the FakeClock:
Why not just set the clock forward and see what happens?
One might think that the easiest thing to do would be to create an environment where we turn the clock forward to February 29th and see what fails. While this might work for small individual programs, it's usually not a viable option for distributed systems because time is so interwoven with every dependency. For example, does your application send telemetry to a logging service? If so, how will that service handle events with timestamps from the future? Perhaps it will discard them, which might make your system appear as if it is offline. As a more concrete example, consider that most web applications require SSL certificates that are signed by a trusted certificate authority. The validity start and end dates are timestamped. What will happen if your certificate is expired when your tests run? Lastly, consider that many authentication protocols, such as Kerberos and OpenID Connect, include timestamping and validation of timestamps as a security mechanism. Will your system behave correctly if authentication fails? Even if none of these things matter to you, are you sure that you have the ability to exercise all of your code in a way that would highlight any problems? What if the result is simply a Category 2 bug like the one described earlier – how would you catch that?
We have actually invested considerable research into this capability, and what we have learned is that time-forward testing is only a viable approach for testing a single machine running in isolation. We don't recommend it for modern cloud infrastructure and applications.
If you do decide to try it anyway, be sure to first disable all time synchronization services (NTP, w32tm, Hyper-V's VMIC Time Provider, etc.) – otherwise your clock might simply correct itself during your tests.
Leap year bugs aren't always disastrous, but they certainly can be, and leap day is just around the corner. Please take a few minutes now to go look at any code you might have in your application that manipulates dates. Think about what will happen when one of those dates is February 29th. If possible, go implement Virtual Clock in your tests.
Also, talk to your colleagues about leap day bugs. This isn't specialized or difficult knowledge, and leap year bugs certainly have happened many times before, but somehow we keep forgetting. Hopefully this time we can all work together to make leap day a fantastic non-event.