January 2008

January 27, 2008

On Demand Maintenance Completed

The planned maintenance went off without a hitch and all accounts should now be accessible.  Please let us know if you have any problems, either by emailing us at customer-service@fogcreek.com or by calling us at 1-866-FOGCREEK.

January 26, 2008

FogBugz On Demand Maintenance Reminder

Just a reminder that we will be performing maintenance for FogBugz On Demand.  The work will begin in approximately 45 minutes.

January 24, 2008

Fog Creek's Payment Gateway is UP

UPDATE: In the time it took me to write this post, it appears they have fixed their problems and we should be 100% again.

Our payment gateway provider is currently down, which means we cannot process any credit card orders.  They have a web page where they post their status, but it says their gateway is still functioning perfectly :)

I spoke with a supervisor who said they were looking into it ASAP.

  • Copilot Day Pass users - contact us via phone at 866.364.2733 and we can give you a magic credit card to use
  • Other customers - I expect this to be resolved by 1PM EST.  If you need your software, a job posting, or any other help, please contact us via phone. We'll make sure you can get started using our products ASAP, and we'll take care of the payment later.

We're going to investigate cloning our payment process to use another provider during outages like this (and it should also give us some flexibility to save some money by switching to a different payment processor).
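As a rough illustration of the failover idea above, here is a minimal sketch. Everything in it is hypothetical: `GatewayUnavailable`, `charge_with_fallback`, and the shape of the `charge` callables are names made up for this example, not our actual payment code or any provider's API. It assumes each provider can be wrapped in a function that either returns a receipt or raises when the provider is unreachable.

```python
from typing import Callable, List, Tuple


class GatewayUnavailable(Exception):
    """Hypothetical error raised when a provider cannot be reached at all."""


# A charge function takes (amount_cents, card_token) and returns a receipt string.
ChargeFn = Callable[[int, str], str]


def charge_with_fallback(
    gateways: List[Tuple[str, ChargeFn]],
    amount_cents: int,
    card_token: str,
) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, receipt) from the first that is up."""
    errors = []
    for name, charge in gateways:
        try:
            return name, charge(amount_cents, card_token)
        except GatewayUnavailable as exc:
            # Only an outage triggers failover. Any other exception
            # (for example a declined card) propagates to the caller.
            errors.append(f"{name}: {exc}")
    raise GatewayUnavailable("all gateways failed: " + "; ".join(errors))
```

Failing over only on outages matters: a declined card should not be retried against a second provider.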

January 23, 2008

Partial Copilot Outage

On Wednesday, January 23, at approximately 4:00pm EST, Copilot users began having trouble connecting to the Fog Creek servers.

  1. Why? One Reflector stopped responding to user requests, which went unnoticed by Fog Creek staff for longer than it should have.
  2. Why? There was no automated notification; we only learned of the problem when customers told us.
  3. Why? A Reflector that is down cannot report bugs, so it had no way to notify us.
  4. Why? Because there is no external monitoring system for the Reflectors.
  5. Why? All prior bugs have been reported from within the Reflector by BugzScout, and nothing was set up to monitor the Reflectors externally.

So how are we going to fix this? Fortunately, the solution is pretty simple. We are going to create a monitor that will regularly log in to Copilot from both sides and ensure that data is flowing between the clients.  If this process fails for any reason, we will be notified of it and can take action before it affects users.
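A minimal sketch of what such a monitor could look like. This is not the real monitor: `check_round_trip`, `monitor_once`, and `make_loopback` are hypothetical names, and the `send` and `receive` callables stand in for whatever code would actually log in to Copilot from each side. The in-memory loopback exists only so the sketch can be run without a Reflector.

```python
import queue
from typing import Callable, Optional, Tuple


def check_round_trip(
    send: Callable[[bytes], None],
    receive: Callable[[float], Optional[bytes]],
    payload: bytes = b"monitor-ping",
    timeout: float = 5.0,
) -> bool:
    """Return True if a payload sent from one side arrives intact on the other."""
    try:
        send(payload)
        return receive(timeout) == payload
    except Exception:
        # Any failure (login error, refused connection, and so on) counts as "down".
        return False


def monitor_once(check: Callable[[], bool], alert: Callable[[str], None]) -> bool:
    """Run one check and call alert() if it fails. Intended to run on a schedule."""
    ok = check()
    if not ok:
        alert("Copilot round-trip check failed")
    return ok


def make_loopback() -> Tuple[Callable[[bytes], None], Callable[[float], Optional[bytes]]]:
    """In-memory stand-in for the two client sides, for trying the sketch locally."""
    q: "queue.Queue[bytes]" = queue.Queue()

    def send(data: bytes) -> None:
        q.put(data)

    def receive(timeout: float) -> Optional[bytes]:
        try:
            return q.get(timeout=timeout)
        except queue.Empty:
            return None  # Nothing arrived within the timeout.

    return send, receive
```

The important property is that the check runs from outside the Reflector, so a Reflector that has stopped responding shows up as a failed check instead of as silence.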

January 22, 2008

FogBugz On Demand Maintenance on January 27th, 2008, 00:01-05:00 EST

We will be taking down the second half of our FogBugz On Demand customers this weekend in order to migrate them to our shiny new database servers.  This will improve reliability and performance for a number of our customers, as well as improve our ability to handle larger loads.  An email was sent out to all customers who will be impacted, but to reiterate:

The outage will occur at approximately 00:01 EST on January 27th, 2008, and will end at 05:00 EST.  During this time, your FogBugz On Demand account will not be available for use.   

If you are interested in receiving updates on this and other outages, please subscribe to our RSS feed.

January 20, 2008

On Demand Maintenance Completed

We have completed all of our maintenance to the On Demand service (previously mentioned here).  We have now upgraded all of our customers in our New York Data Center to our brand new database servers.   

All of our tests are passing, but should you have any problems accessing your FogBugz On Demand account, please contact our customer service team at 866-FOGCREEK and we'll tackle it right away.

January 19, 2008

On Demand Maintenance Reminder

Just a friendly reminder that FogBugz On Demand maintenance will begin in 15 minutes and end at 05:00 EST.  Please see this earlier notice for more details. I will post an update upon completion.

January 18, 2008

FogBugz On Demand Maintenance on January 20th, 2008, 00:01-05:00 EST

We will be taking down half of our FogBugz On Demand customers this weekend in order to migrate them to our shiny new database servers.  This will improve reliability and performance for a number of our customers, as well as improve our ability to handle larger loads.  An email was sent out to all customers who will be impacted, but to reiterate:

The outage will occur at approximately 00:01 EST on January 20th, 2008, and will end at 05:00 EST.  During this time, your FogBugz On Demand account will not be available for use.   

If you are interested in receiving updates on this and other outages, please subscribe to our RSS feed.

January 10, 2008

Analysis of this morning's outage

Post mortem of this morning's outage:

  1. Why? – Our link to Peer1 NY went down
  2. Why? – Our switch appears to have put the port in a failed state
  3. Why? – After some discussion with the Peer1 NOC, we speculate that it was quite possibly caused by an Ethernet speed / duplex mismatch
  4. Why? – The switch interface was set to auto-negotiate instead of being manually configured
  5. Why? – Our network administrator was fully aware of problems like this, and has been for many years.  However, we do not have a written standard and verification process for production switch configurations.

Assuming that the third ‘why’ is correct, and it certainly is probable, then we have our root cause. Had we produced a written standard prior to deploying the switch and subsequently reviewed our work to match the standard, this outage would not have occurred. Or it would have occurred once, and the standard would have been updated as appropriate. Documentation is often thought of as an aid for when the sysadmin isn’t around or for other members of the operations team. It should be clear that it is much more than that.
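To make "reviewing our work to match the standard" concrete, here is a minimal sketch of an automated check. It is illustrative only: `UPLINK_STANDARD`, its speed and duplex values, and `find_violations` are assumptions invented for this example, not our actual standard, and the port settings are assumed to have already been read from the switch into a plain dictionary.

```python
from typing import Dict, List

# Hypothetical written standard for a production uplink port.
# The specific values are placeholders, not real settings.
UPLINK_STANDARD: Dict[str, object] = {
    "speed": "1000",
    "duplex": "full",
    "autonegotiate": False,
}


def find_violations(port_config: Dict[str, object], standard: Dict[str, object]) -> List[str]:
    """Return one message per setting that is missing or differs from the standard."""
    violations = []
    for key, expected in standard.items():
        actual = port_config.get(key)
        if actual != expected:
            violations.append(f"{key}: expected {expected!r}, found {actual!r}")
    return violations
```

A port left on auto-negotiate would be flagged before deployment, and when a new failure mode is found, the fix is a one-line addition to the standard.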

There is irony in the fact that our system administrator spent the early part of this week drafting a small set of policies and standards for our environment. He now has one more to add to the list.

Now, we could surely take the 5 Whys even further and discover that we would be better off with an HA router / switch configuration, etc.  While I see that as fair, the above examination exposes a fundamental flaw in our approach to maintaining this environment which needs to be remedied before adding complexity.

New York data center outage

There was a full outage in our New York data center this morning. Things began flapping around 3:30 AM and settled down after 10 minutes, so we saw no need to panic. At approximately 5:00 AM, it happened again. We contacted Peer1, and they felt it must be connectivity and started investigating. Things came up around 5:30 AM, and Peer1 did not find anything. At approximately 6:15 AM it happened again, but this time it was a full outage. Again, Peer1 could not detect anything wrong with the connection. Michael went down to the data center, verified that our router could not talk to the outside world, and then moved the Peer1 network connection from our switch directly to our router. This cleared everything up.

Reason suggests that there is either a configuration error on the switch as a whole or an issue with just that port. We see no reason to think that the problem is on Peer1’s end. We are still investigating.

Mitigating factors: This outage did not affect FogBugz customers using the Los Angeles data center. Because the outage occurred during the North American night, most North American customers would not have been affected.