Linked by David Adams on Thu 1st Mar 2012 22:53 UTC, submitted by judgen
The outage on Microsoft's Windows Azure cloud computing platform that caused the government's G-Cloud service to go offline was the result of a calculation error caused by the extra day in February due to the leap year. Writing on the Azure blog, the firm's corporate vice president for service and cloud, Bill Laing, said that while the firm had still to fully determine the cause of the issue, the extra date in the month appeared the most likely cause.
Doesn't seem so bad...
by zima on Thu 1st Mar 2012 23:26 UTC
zima
Member since:
2005-07-06

Only every 4 years, right? (even less on average - not in 2100 for example)
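For reference, the Gregorian rule behind that remark can be written out as a short sketch. The function name `is_leap_year` is only an illustrative choice and is not taken from any Azure code.

```python
def is_leap_year(year: int) -> bool:
    """Gregorian rule: every 4th year, except centuries not divisible by 400."""
    if year % 400 == 0:
        return True
    if year % 100 == 0:
        return False
    return year % 4 == 0


# Count leap years in one full 400-year Gregorian cycle.
leap_years_per_cycle = sum(is_leap_year(y) for y in range(2000, 2400))

for y in (2000, 2012, 2013, 2100):
    print(y, is_leap_year(y))
print("leap years per 400:", leap_years_per_cycle)
```

A full cycle contains 97 leap years in 400 years, which is why the average interval is slightly longer than four years.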

Edited 2012-03-01 23:28 UTC

Reply Score: 2

LOL
by WorknMan on Thu 1st Mar 2012 23:31 UTC
WorknMan
Member since:
2005-11-13

Reason #48341830923 that you shouldn't keep your data in the cloud.

Reply Score: 4

RE: LOL
by cdude on Fri 2nd Mar 2012 14:17 UTC in reply to "LOL"
cdude Member since:
2008-09-21

I found a typo in your post and fixed it for you:

Reason #48341830923 that you shouldn't keep your data in the *Microsoft* cloud.

There is NO OTHER cloud-system in the whole galaxy which has such kinds of problems again and again. They have problems without doubt but NOT such problems.

Let's face it. It may not be a problem to reboot your Windows desktop if there is a problem but that does NOT WORK for servers.

Anyone remember the London stock-exchange disaster last year (a whole day down thanks to Windows server technology)?

Reply Score: 1

RE[2]: LOL
by moondevil on Sun 4th Mar 2012 16:35 UTC in reply to "RE: LOL"
moondevil Member since:
2005-07-08

Anyone remember the London stock-exchange disaster last year (a whole day down thanks to Windows server technology)?


Yes, but knowing how many consulting companies work from the inside, I am sure that developers were to blame and not the technology.

Reply Score: 2

If true, is this a vindication of Y2K?
by bannor99 on Fri 2nd Mar 2012 01:42 UTC
bannor99
Member since:
2005-09-15

If the world's (arguably) premier software company, with all the lessons learned and experience gained during decades of development, could have had a disastrous outage caused by an extra day,
then all those who bitch that all the money and effort spent on the Y2K fixes were a waste and that we were hoodwinked by a bunch of grouchy, grimy COBOL programmers looking for a last big payout can just shut the FUCK up.

Reply Score: 3

Bill Shooter of Bul Member since:
2006-07-14

Depends on how you define Y2K.

If Y2K was about how a single flaw in a single system somewhere would kill all civilization. Then no, this is not that.

If Y2K was about how some developers fail to think a few years into the future with localized bad results, then yes. This is that.

Reply Score: 2

bannor99 Member since:
2005-09-15

Depends on how you define Y2K.

If Y2K was about how a single flaw in a single system somewhere would kill all civilization. Then no, this is not that.

If Y2K was about how some developers fail to think a few years into the future with localized bad results, then yes. This is that.


It went way beyond "some developers" although most of the blame can be laid at the feet of the decision-makers.
Bob Bemer started petitioning everyone from programmers to politicians starting in the early 60s about the problems with 2-digit dates - they didn't listen and he wasted several decades trying to convince them.

Reply Score: 3

Alfman Member since:
2011-01-28

Everyone assumed that "Y2K" trouble was only about the year 2000, but most computers actually had different date boundaries.

The year 2000 was only a problem for those who stored dates in ASCII/EBCDIC form. Mainframes seem to be unusual in their use of BCD and nine's complement within their VSAM files, which is why they were especially susceptible to the two-digit overflow.

Binary time representations such as those in *nix have different limits, but they're also approaching.

http://en.wikipedia.org/wiki/Year_2038_problem
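To make the linked limit concrete, here is a minimal sketch that assumes time is kept as a signed 32-bit count of seconds since 1970-01-01 UTC. The helper names `wrap_int32` and `to_utc` are made up for this illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit seconds counter


def wrap_int32(value: int) -> int:
    """Simulate two's-complement wraparound of a signed 32-bit integer."""
    return (value + 2**31) % 2**32 - 2**31


def to_utc(seconds: int) -> datetime:
    """Convert seconds since the Unix epoch to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


last_ok = to_utc(INT32_MAX)                     # last representable second
after_wrap = to_utc(wrap_int32(INT32_MAX + 1))  # one tick later, after wrapping

print(last_ok.isoformat())
print(after_wrap.isoformat())
```

The counter runs out early on 19 January 2038 and, if it wraps, lands back in December 1901. Systems that use a 64-bit counter do not hit this boundary.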

Reply Score: 4

zima Member since:
2005-07-06

Depends on how you define Y2K.

If Y2K was about how a single flaw in a single system somewhere would kill all civilization. Then no, this is not that.

But weren't ICBMs among the concerns (admittedly, one of the most silly ones) thrown around? ;)

Reply Score: 2

Wtf? Really?
by Soulbender on Fri 2nd Mar 2012 03:28 UTC
Soulbender
Member since:
2005-08-18

The reason is that they didn't think about leap years? In 2012, this is the error they made? It's not like it's some unexpected event we didn't see coming.
You know, I would have found this acceptable in someone's pet OSS project but not in a global service from MS that you probably pay an arm and a leg for.
If I was the guy who was responsible for this in "the government" I would have been having a serious talk with my account rep already and it would not have been easy for them to convince me to continue using their product.

Reply Score: 4

RE: Wtf? Really?
by Laurence on Fri 2nd Mar 2012 09:14 UTC in reply to "Wtf? Really?"
Laurence Member since:
2007-03-26

The reason is that they didn't think about leap years? In 2012, this is the error they made? It's not like it's some unexpected event we didn't see coming.
You know, I would have found this acceptable in someone's pet OSS project but not in a global service from MS that you probably pay an arm and a leg for.
If I was the guy who was responsible for this in "the government" I would have been having a serious talk with my account rep already and it would not have been easy for them to convince me to continue using their product.


Agreed, but sadly the reason the British government likes expensive and often vastly over-priced contracts with Microsoft, IBM and Oracle is simply that they take liability away from the government.

If MS fsck up and take a government service offline, then IT managers within the government just say "not our fault, it's one of our service providers". For the government, contracts like this are just another form of outsourcing and thus it would take something monumental and hugely publicly embarrassing before any government body would even consider switching providers - let alone bring the services back in house where they really belong.

This is just my experience from when I worked for the British government. Things might be different for the rest of the EU or western world (for their sake, I hope so).

Edited 2012-03-02 09:15 UTC

Reply Score: 3

RE[2]: Wtf? Really?
by lucas_maximus on Fri 2nd Mar 2012 11:22 UTC in reply to "RE: Wtf? Really?"
lucas_maximus Member since:
2009-08-18

Not only Governments but also quite a lot of organisations (I worked in a large charity for 15 months and this was rampant). The higher up you get the more you gotta watch your own backside.

Reply Score: 2

RE[2]: Wtf? Really?
by zima on Thu 8th Mar 2012 23:53 UTC in reply to "RE: Wtf? Really?"
zima Member since:
2005-07-06

sadly the reason the British government likes expensive and often vastly over-priced contracts with Microsoft, IBM and Oracle is simply that they take liability away from the government.
If MS fsck up and take a government service offline, then IT managers within the government just say "not our fault, it's one of our service providers"

Everybody likes to outsource responsibility. Certainly in some Central European places one can see a strong "nobody got fired for using Microsoft or Oracle" of sorts...

...and even when the projects, waaaaay down the line, largely prove to be practical failures - those initially pushing and implementing them moved on, several times already, each time adding another "success" to their CV - and the more expensive, the more lucrative such "successes" are, the better they look on the CV, it seems.

Reply Score: 2

RE: Wtf? Really?
by B. Janssen on Fri 2nd Mar 2012 09:58 UTC in reply to "Wtf? Really?"
B. Janssen Member since:
2006-10-11

The reason is that they didn't think about leap years? In 2012, this is the error they made? It's not like it's some unexpected event we didn't see coming.
You know, I would have found this acceptable in someone's pet OSS project but not in a global service from MS that you probably pay an arm and a leg for.

Agreed, that's just embarrassing. But...

If I was the guy who was responsible for this in "the government" I would have been having a serious talk with my account rep already and it would not have been easy for them to convince me to continue using their product.

...you would only complain and try to get some monetary recognition out of it, but you wouldn't quit using the service. And you know why. This is not just picking up your ball and going, it's picking up the goal posts, the fences, the benches, the lawn and the parking lot, too. I don't claim to know how large the gov's data is on Azure, but I'm sure it is somewhere in the region where you don't move on a whim.

And on top of that 1 day in 366 is probably well within agreed outage levels (I'd guess they have 99.9%, so they would be covered.)

Reply Score: 2

RE[2]: Wtf? Really?
by Soulbender on Fri 2nd Mar 2012 10:29 UTC in reply to "RE: Wtf? Really?"
Soulbender Member since:
2005-08-18

This is not just picking up your ball and going, it's picking up the goal posts, the fences, the benches, the lawn and the parking lot, too.


In the short run you're probably right but the contract will be renegotiated at some point and I would make damn sure there was a viable alternative at that point. Of course, I would probably not have bought into Azure in the first place so it's a bit moot.

And on top of that 1 day in 366 is probably well within agreed outage levels


Could be, but on the other hand, isn't the cloud all about NOT having these kinds of problems? You know, scalability, redundancy and all that jazz that the sales rep probably fed the gov't.

Reply Score: 2

RE[3]: Wtf? Really?
by Lennie on Fri 2nd Mar 2012 11:21 UTC in reply to "RE[2]: Wtf? Really?"
Lennie Member since:
2007-09-22

It just means someone else, who is dedicated to the task, is doing that kind of work. That doesn't mean you get fewer problems.

It might mean you get more problems, because doing things at a large scale isn't easier.

Reply Score: 2

RE[4]: Wtf? Really?
by Soulbender on Fri 2nd Mar 2012 13:15 UTC in reply to "RE[3]: Wtf? Really?"
Soulbender Member since:
2005-08-18

Right, but I'm sure the MS sales rep told them that if they used Azure they'd never have downtime and it would all be redundant and scalable and blah blah blah. If I had been told that and then the whole thing (and I mean the whole thing, not just a few of my VMs) went down because they forgot about leap years I'd be mighty pissed.

Reply Score: 3

RE[3]: Wtf? Really?
by B. Janssen on Fri 2nd Mar 2012 16:31 UTC in reply to "RE[2]: Wtf? Really?"
B. Janssen Member since:
2006-10-11

I think I wasn't clear. The FU is reprimandable, no doubt.

My line of thinking was that at some point in deployment you pass a point of no return where you are effectively locked into the cloud of someone else, because moving becomes very expensive, even more expensive than putting up with a FU.

I guess what I'm really trying to say is that cloud services lock in your data and you will suffer the consequences and like it. Beware of the cloud, seriously.

Reply Score: 2

RE[2]: Wtf? Really?
by cdude on Fri 2nd Mar 2012 14:28 UTC in reply to "RE: Wtf? Really?"
cdude Member since:
2008-09-21

"And on top of that 1 day in 366 is probably well within agreed outage levels (I'd guess they have 99.9%, so they would be covered.)"

Let me show you some magic:
100-1/366*100 => 99.73%
99.73>=99.9 => false
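The same arithmetic as a small sketch. It assumes a full 24-hour outage and a 99.9% yearly target, both of which are guesses from this thread and not published SLA terms; `availability_pct` is a made-up helper name.

```python
def availability_pct(downtime_days: float, period_days: float) -> float:
    """Percentage of the period during which the service was up."""
    return 100.0 * (1.0 - downtime_days / period_days)


SLA_TARGET = 99.9  # assumed target, taken from the guess quoted above

yearly = availability_pct(1, 366)            # one day down in a leap year
meets_sla = yearly >= SLA_TARGET
allowed_hours = (1 - SLA_TARGET / 100) * 366 * 24  # downtime budget per leap year

print(f"availability: {yearly:.2f}%")
print(f"meets {SLA_TARGET}%: {meets_sla}")
print(f"allowed downtime: {allowed_hours:.2f} hours")
```

Under those assumptions a 99.9% yearly target leaves a budget of under nine hours, so a full day of downtime would exceed it.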

Reply Score: 2

RE[3]: Wtf? Really?
by B. Janssen on Fri 2nd Mar 2012 16:23 UTC in reply to "RE[2]: Wtf? Really?"
B. Janssen Member since:
2006-10-11

By Jove! Please, civilized man, teach me your mathmagics!


I guess what I'm trying to say is: what were you thinking when you decided to snark instead of simply correcting my mistake?

Reply Score: 2

RE[4]: Wtf? Really?
by avgalen on Fri 2nd Mar 2012 21:02 UTC in reply to "RE[3]: Wtf? Really?"
avgalen Member since:
2010-09-23

Let me show you some marketing math:
100-(1/(365+365+365+366))*100 = 99.93
99.93 > 99.9, so no problem for the uptime guarantee ;)

(and even if the SLA were for 99.99 and this month will only be 96.55 that probably only means you will get a refund of 99.99-96.55=3.44% of what you pay per month)
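Those figures can be checked with a short sketch. It assumes one full day of downtime, and the refund rule is only the guess made in this comment, not an actual contract term; `availability_pct` is a hypothetical helper.

```python
def availability_pct(downtime_days: float, period_days: float) -> float:
    """Percentage of the period during which the service was up."""
    return 100.0 * (1.0 - downtime_days / period_days)


# One day down, measured over a four-year window containing one leap year.
four_years = 365 + 365 + 365 + 366
over_four_years = round(availability_pct(1, four_years), 2)

# The same day measured against February 2012 alone (29 days).
february = round(availability_pct(1, 29), 2)

# Shortfall against a hypothetical 99.99% monthly target.
shortfall = round(99.99 - february, 2)

print(over_four_years, february, shortfall)
```

The choice of measurement window does most of the work here: the same outage looks fine over four years and poor over a single month.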

Reply Score: 1

Comment by Luminair
by Luminair on Fri 2nd Mar 2012 04:45 UTC
Luminair
Member since:
2007-03-30

I trust Microsoft cloud services the same now, because I already didn't trust Microsoft cloud services.

Reply Score: 5

Leap Year ??
by Digihooman on Fri 2nd Mar 2012 07:37 UTC
Digihooman
Member since:
2010-05-01

Even a top flight software design company can be caught out by one of these random insertions of extra days by those damn wizards, soothsayers or stargazers. Who would have thought that someone would "chuck in" a spare day out of the blue like that?

Reply Score: 4

RE: Leap Year ??
by daedalus on Fri 2nd Mar 2012 08:22 UTC in reply to "Leap Year ??"
daedalus Member since:
2011-01-14

And at such short notice too!

Reply Score: 2

This reminds me of something..
by nej_simon on Fri 2nd Mar 2012 10:10 UTC
nej_simon
Member since:
2011-02-11

Remember when a lot of Zune players died the last leap year?

http://www.computerworld.com/s/article/9124638/Zune_chokes_on_leap_...

Microsoft says it will issue a bug fix for the device so that this problem won't occur again in 2012, the next leap year.

I guess they should have shared that knowledge with the Azure department.

Reply Score: 2

RE: This reminds me of something..
by phoenix on Fri 2nd Mar 2012 22:25 UTC in reply to "This reminds me of something.."
phoenix Member since:
2005-07-11

Microsoft doesn't work that way. Each department is a fiefdom unto itself, and must hoard its knowledge and bug fixes to give itself a leg up on its departmental enemies.

Now the Zune devs can sit back and laugh, gloat, and toast to the pain and suffering of their evil Azure dev-enemies.

:D

Reply Score: 3

I don't get it!
by AnythingButVista on Fri 2nd Mar 2012 14:20 UTC
AnythingButVista
Member since:
2008-08-27

We've been having leap years since long before computers were invented. We have one every four years. None of my Android devices had problems on February 29 or March 1st. Even Windows didn't have problems with the extra day. How can Microsoft's Azure division drop the ball so miserably with something so simple, for which there's plenty of sample source code showing how to handle it?!
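As one example of the kind of sample code meant here, this is a generic sketch of a common 29 February pitfall: shifting a date by a whole year. It is not a claim about what went wrong at Azure, since the article says the cause was not yet fully determined. `add_years` is a hypothetical helper, and mapping 29 Feb to 28 Feb is just one possible policy.

```python
from datetime import date


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 Feb to 28 Feb when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # For in-range years this only happens for 29 Feb when the
        # target year is not a leap year.
        return d.replace(year=d.year + years, day=28)


leap_day = date(2012, 2, 29)

# The naive approach fails because 29 Feb 2013 does not exist.
try:
    leap_day.replace(year=2013)
    naive_failed = False
except ValueError:
    naive_failed = True

print(naive_failed)
print(add_years(leap_day, 1))
print(add_years(leap_day, 4))
```

Whether to fall back to 28 February or 1 March depends on the application, which is why the policy should be an explicit decision and not an accident of the date library.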

Reply Score: 2

RE: I don't get it!
by ggiunta on Sat 3rd Mar 2012 10:45 UTC in reply to "I don't get it!"
ggiunta Member since:
2006-01-13

I'm running Windows 7 and I definitely had problems yesterday: the time on my PC was off by one hour, but the timezone was set correctly.
I checked the configuration, and it said that it had synced in the morning from time.microsoft.com. That service seemed not to be responding very well (maybe it's hosted on Azure now?). I'd put the blame on it rather than the OS itself, but still...

Reply Score: 1