May 31, 2008

Short Outage For Some On Demand Customers This Morning

At approximately 5:10 AM ET, one of the load-balanced web servers in one of our data centers began returning HTTP 500 errors on certain requests. Our load balancer did not catch this in time to remove the failing server from the pool, but our monitoring systems did. As a result, a small portion of the On Demand customers who were active at the time saw these errors.

Service was restored by 5:16 AM ET. I will examine the symptoms of this outage and find a way for our load balancer to detect this kind of failure.

April 26, 2008

Weekend Problems Processing Orders

UPDATE: Order processing is back online.

Our payment processor (Linkpoint/Yourpay) is having an outage. They outsource their tech support on weekends, and no one who can help us will be there until Monday, so any orders placed with us will not complete processing until they fix the problems with their gateway, probably sometime on Monday. This affects signups for Fog Creek Copilot, FogBugz On Demand, purchasing FogBugz for your server, etc. Basically, any time you enter a credit card on our system, or change an existing one, the order will be placed on hold until the gateway comes back up. I apologize, and I have written another letter to First Data to complain about the problem. It appears it may be time to switch our payment processor to PayPal or some other competitor.

April 09, 2008

Outage for Some On Demand Customers

UPDATE: Things are back online. It turns out that our colo had a bad patch cable in place that came loose and caused this outage. These things happen, so I'm going to look for reasonable ways to make that path redundant.

We have just detected an outage at our LA data center, which hosts half of our On Demand customers.

I am working with our colo support staff to troubleshoot, and will post updates.

April 02, 2008

Scheduled Maintenance for On Demand: Wednesday, April 9th, 2008, 03:01 - 07:00 EST

Our ISP will be performing an IOS upgrade on their core routers on Wednesday, April 9th, 2008 from 03:01 - 07:00 EST.  During this time, some FogBugz On Demand customers may experience brief periods of latency or connectivity loss with our services. 

March 26, 2008

Intermittent Blip Seen By Some FogBugz On Demand Customers

At approximately 4:25 AM, one of our load-balanced On Demand web servers went into an error state. Our monitoring system caught it, and it was corrected by 4:30 AM.

We will work to understand the cause of this error.

Knowing that these types of things will happen in the future, as they do in the world of web apps, we are working to implement a more robust load balancer that can detect this sort of problem and take the 'damaged' server out of the pool automatically.
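
To make the idea concrete, here is a minimal sketch of that kind of detection: count consecutive 5xx responses per server and pull a server from rotation once it crosses a threshold. This is only an illustration, not the load balancer we run or plan to run. The `Pool` class, the server names, and the threshold of 3 are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Hypothetical pool that ejects a backend after repeated 5xx responses."""
    servers: list
    max_failures: int = 3  # assumed threshold, chosen only for illustration
    failures: dict = field(default_factory=dict)
    ejected: set = field(default_factory=set)

    def record(self, server: str, status: int) -> None:
        """Record one HTTP status code observed from a backend."""
        if 500 <= status <= 599:
            self.failures[server] = self.failures.get(server, 0) + 1
            if self.failures[server] >= self.max_failures:
                # Take the 'damaged' server out of rotation.
                self.ejected.add(server)
        else:
            # Any non-5xx response resets the consecutive-failure count.
            self.failures[server] = 0

    def healthy(self) -> list:
        """Servers still in rotation, in their original order."""
        return [s for s in self.servers if s not in self.ejected]


pool = Pool(servers=["web1", "web2"])  # hypothetical server names
for status in (500, 500, 500):
    pool.record("web1", status)
pool.record("web2", 200)
print(pool.healthy())  # prints ['web2']
```

A real setup would also need a way to put a repaired server back into the pool, which this sketch leaves out.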

March 24, 2008

Weekend Problems Accepting Payments

Due to some intermittent service problems with our third-party payment processor, we were unable to accept credit card payments at various times this weekend. To further complicate matters, our own monitoring system was not notifying us of the problems as it should have been.

I have corrected the bug in the monitoring system, and we are in touch with our payment processor to further understand the problem.

March 19, 2008

Copilot and FogBugz On Demand Sign-Up Outage

Both the Copilot and FogBugz On Demand services were unable to accept credit card orders between approximately 3:30 PM and 4:30 PM EST. While our processing service was being monitored, it was not being watched at a deep enough level to detect this specific problem. I will be installing a more thorough monitor shortly.

March 18, 2008

Non-Service Impacting: Data Center Power Generator Test on March 19th, 2008 @ 07:00 AM EST

Just a note to everyone: our NYC colo will be performing a power generator test @ 7:00 AM EST on Wednesday, March 19th, 2008.  This is a routine test that ensures that the power system is functioning properly, and we are not expecting any trouble.

March 17, 2008

Payment Processor Response

My paper letter to the CEO of First Data resulted in two phone calls from them, and a very good phone call from the director of their online payment processor. He shared with me the reasons for the 3 outages we experienced (an upgrade to a new version of their Oracle database, running out of space on a disk that wasn't being monitored, and failures of redundant DNS servers). I told him that the reasons for the failures were important (when they are down, we do not make any money), but that the communication was even more important. Knowing what is going on lessens the frustration from a customer's point of view, which is exactly why we started this blog. Sharing our outages, the reasons for our mistakes, and how we're going to fix them in the future lets our customers know that we try to be perfect, but we'll admit our mistakes when we aren't.

Hopefully the message got through. Thanks for the phone call, First Data!

March 05, 2008

Payment Processor Outage

UPDATE: It's back up. I've sent a letter to First Data's CEO. We'll see if I get a response ;)

Our payment processor's domain name is not resolving, so we cannot reach it right now (and neither can other machines out on the internet). Their status message says the gateway is down.