Google operates one of the largest cloud-based computing systems in the world, and takes great pride in its reliability, investing significant amounts of both money and human resources to make sure that all of its services are always-on and accurate. Here’s the story of how, to eliminate what many would consider an almost insignificant issue, Google conducted a “smear” campaign and created software that “lied” to its own servers—all to improve performance and eliminate potential errors that most of us wouldn’t even notice.
Inside Google’s Time Warp
As searchers, we want fresh results, which Google usually provides. But Google also offers many other services, such as Google Docs, Gmail, and so on, that rely on much more accurate time stamping. Like most other online services, Google uses the Network Time Protocol (NTP), which periodically checks a computer’s clock against a more accurate time source, such as a server synchronized to an atomic clock. NTP also accounts for variable factors, such as how long the NTP server takes to reply and the speed of the network between you and the server, when setting the time on your computer to within a second or better. So most of the time (so to speak) you can rely on Google to be spot-on when it comes to time-stamping everything you do.
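To make the "variable factors" concrete, here is a minimal sketch of the standard NTP offset and round-trip delay arithmetic, built from four timestamps. The function name and the sample numbers are illustrative assumptions, not anything taken from Google's systems.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Estimate clock offset and network delay from one NTP exchange.

    t0: client clock when the request was sent
    t1: server clock when the request arrived
    t2: server clock when the reply was sent
    t3: client clock when the reply arrived
    All values are in seconds.
    """
    # How far the client clock is behind (+) or ahead (-) of the server,
    # assuming the network is equally fast in both directions.
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    # Time spent on the network, excluding the server's processing time.
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


# Hypothetical exchange: the client is 0.5 s slow, each network leg
# takes 0.1 s, and the server spends 0.02 s preparing its reply.
offset, delay = ntp_offset_and_delay(100.0, 100.6, 100.62, 100.22)
```

In this made-up exchange the estimate comes out to an offset of about 0.5 seconds and a round-trip delay of about 0.2 seconds, which is why a slow server or a slow network does not by itself throw off the result.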
Problem: Leap years. Of even more concern: Leap seconds. As Christopher Pascoe, a Google Site Reliability Engineer, writes on the Google blog, “It turns out that being on a revolving imperfect sphere floating in space, being reshaped by earthquakes and volcanic eruptions, and being dragged around by gravitational forces makes your rotation somewhat irregular. These fluctuations in Earth’s rotational speed mean that even very accurate clocks, like the atomic clocks used by global timekeeping services, occasionally have to be adjusted slightly to bring them in line with ‘solar time.’”
For most of us, that extra second is irrelevant, if we notice it at all. But for Google, which may process thousands or even millions of events during that transitional second, it can lead to major problems.
According to Pascoe, “Our systems are engineered for data integrity, and some will refuse to work if their time is sufficiently ‘wrong.’ We saw some of our clustered systems stop accepting work on a small scale during the leap second in 2005, and while it didn’t affect the site or any of our data, we wanted to fix such issues once and for all.”
Google’s solution? What the company calls a “leap smear”: code that would effectively “lie” to its own servers over the course of the day on which a leap second took place. Pascoe again: “We modified our internal NTP servers to gradually add a couple of milliseconds to every update, varying over a time window before the moment when the leap second actually happens. This meant that when it became time to add an extra second at midnight, our clocks had already taken this into account, by skewing the time over the course of the day.”
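For the curious, the idea can be sketched in a few lines. The example below spreads one extra second across a window using a cosine-shaped ramp, so the adjustment starts gently, speeds up in the middle, and eases off at the end. The function name, the 24-hour window, and the hourly sampling are assumptions made for illustration; this is not Google's actual server code.

```python
import math

SMEAR_WINDOW = 24 * 60 * 60  # assumed window length: one day, in seconds


def smear_offset(t, window=SMEAR_WINDOW):
    """Return how much of the leap second (0.0 to 1.0 s) has been
    applied t seconds after the start of the smear window."""
    if t <= 0:
        return 0.0  # window has not started: no adjustment yet
    if t >= window:
        return 1.0  # window is over: the full second is accounted for
    # Cosine ramp: smooth at both ends, steepest at the midpoint.
    return (1.0 - math.cos(math.pi * t / window)) / 2.0


# Sample the adjustment once an hour across the assumed window.
hourly = [smear_offset(h * 3600) for h in range(25)]
```

Under these assumptions the largest hour-to-hour change is well under a tenth of a second, so no single update ever sees the clock jump by a full second.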
Lest you think this was a trivial patch, Google actually developed some serious math to solve the problem. It performed two “smears” (one going back in time, the other pushing into the future) and tested them on about 10,000 servers, comparing “standard atomic time,” its own servers, and a variety of public NTP clients.
The result? Google has figured out how to halt the ravages of time (at least in this case). For more of the science and math behind the fix, check out the official Google blog post.