More detailed information about the outage [24 February 2012]

Editorial team, 25 February 2012


The problem, as I wrote earlier, was an outage of the DNS servers that handle our domain PLAY.CZ. The domain ceased to exist for a few minutes (about 20) because of an error on CloudFlare's DNS servers. Smaller caching DNS servers readily picked up that state and cut our services off from users.
CloudFlare fixed the problem within those roughly 20 minutes and the domain came back to life. Unfortunately, some Czech DNS servers (and not only Czech ones) ignore the 5-minute TTL set for our domain and picked up the corrected records only after a delay. On top of that, because the domain briefly did not exist (that is, it had no DNS records), the caching DNS servers also lost the information about that 5-minute TTL 🙁
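To illustrate why a short record TTL does not help once a "does not exist" answer has been cached, here is a minimal sketch of a toy caching resolver in Python. It is only an illustration, not how any particular DNS server is implemented: the class `ToyDnsCache`, the `lookup` callback and the 1800-second negative TTL are assumptions made up for this example. In real DNS, the lifetime of a negative answer is derived from the zone's SOA record rather than from the TTL of the missing record.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Hypothetical upstream lookup: returns (address, record TTL in seconds),
# or (None, None) when the name does not exist.
Lookup = Callable[[str], Tuple[Optional[str], Optional[float]]]


@dataclass
class CacheEntry:
    address: Optional[str]  # None means a cached "does not exist" answer
    expires_at: float


class ToyDnsCache:
    """Toy caching resolver; an illustration, not a real DNS implementation."""

    def __init__(self, lookup: Lookup, negative_ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup
        self._negative_ttl = negative_ttl  # assumed lifetime of negative answers
        self._clock = clock
        self._entries: Dict[str, CacheEntry] = {}

    def resolve(self, name: str) -> Optional[str]:
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and entry.expires_at > now:
            # Still valid in the cache: the upstream server is not asked again,
            # even if it has been fixed in the meantime.
            return entry.address
        address, ttl = self._lookup(name)
        if address is None:
            # The record is gone, so its own TTL is unknown; fall back to the
            # cache's negative TTL.
            ttl = self._negative_ttl
        self._entries[name] = CacheEntry(address, now + ttl)
        return address
```

With a negative TTL of 1800 seconds, a cache that asked during the outage keeps answering "does not exist" for up to 30 minutes, regardless of the 5-minute TTL on the restored record.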
Fortunately, DNS stabilized fairly quickly: with a few exceptions, within 30 minutes. The DNS problems subsequently also hit our script that counts listeners. The server simply could not resolve the DNS names of some of our media servers and did not count the listeners connected to them. This problem fortunately resolved itself over lunch.
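One way a counting script could ride out such a DNS hiccup is to remember the last address that resolved successfully for each media server. The following Python sketch shows the idea. It is not our actual script: `CachingResolver`, `count_listeners` and the `fetch_count` callback are hypothetical names introduced for illustration only.

```python
import socket
from typing import Callable, Dict, Iterable, List, Optional, Tuple


class CachingResolver:
    """Resolves host names and remembers the last address that worked."""

    def __init__(self, resolve: Callable[[str], str] = socket.gethostbyname) -> None:
        self._resolve = resolve
        self._last_good: Dict[str, str] = {}

    def lookup(self, host: str) -> Optional[str]:
        try:
            address = self._resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            # Resolution failed: fall back to the last known address, if any.
            return self._last_good.get(host)
        self._last_good[host] = address
        return address


def count_listeners(hosts: Iterable[str],
                    resolver: CachingResolver,
                    fetch_count: Callable[[str], int]) -> Tuple[int, List[str]]:
    """Sum listeners over all hosts; report hosts that could not be resolved.

    fetch_count is a hypothetical callback that returns the number of
    listeners for the media server at the given IP address.
    """
    total = 0
    unresolved: List[str] = []
    for host in hosts:
        address = resolver.lookup(host)
        if address is None:
            unresolved.append(host)
            continue
        total += fetch_count(address)
    return total, unresolved
```

Returning the list of unresolved hosts alongside the total makes it visible when a figure is incomplete, instead of silently undercounting.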

Why do we use CloudFlare?
The problem would not have occurred if our services did not run through CloudFlare and did not use CloudFlare's DNS servers. Then again, any other DNS server could have failed just the same. CloudFlare has 14 data centers around the world, which makes a complete outage much less likely… well, except for yesterday's problem.
CloudFlare also works as a CDN proxy and takes load off our servers, which has improved their response times.
The biggest advantage, and the reason we use CloudFlare, is protection against DDoS attacks. CloudFlare fends off several larger and many small DDoS attacks every day, and it does so very well! It also blocks various automated bots that try to find bugs in our code and break into the database, or take the whole site down. A fairly large number of such attacks target our servers, and thanks to CloudFlare we do not have to worry about most of them. Over the past 7 days CloudFlare blocked 1,181,281 requests and identified a total of 9,337 unique attack sources. Over the whole time we have used CloudFlare, it has turned out that about 21% of all traffic to our site comes from robots harvesting email addresses or trying to break our scripts… We do not want to end up like the Intergram.cz site 😉

For those interested, here is the text from the CloudFlare blog:

Last night was not our finest hour. Around 07:30 GMT, we were finishing up a push of a new DNS infrastructure. The core of what this new update was built to do is make DNS updates even faster. Before it took about a minute for a change to your DNS settings to propagate to all our infrastructure, with the new DNS update it is almost instant. That is important to understand in order to understand what went wrong.

Making an update to the DNS requires changing underlying code deep in our system and taking servers offline while we do so. We scheduled the update for the quietest time on our network, which is around 07:00 GMT (around 11:00pm in San Francisco). The code had been running smoothly in our test environment and one data center for the last week so we were feeling pretty good. And, in fact, the push of the DNS update went smoothly and was ahead of schedule.

The Ugly

When the update was complete in 10 of our 14 data centers we got word of a minor issue that was affecting some data getting pushed from the master DNS database. In the process of diagnosing the minor issue, the master DNS database was deleted. The new DNS system did its job and rapidly propagated across the 10 datacenters where the update was live. The result was that if recursive DNS looked up a domain and hit one of those 10 datacenters, around 07:30 GMT they would receive an invalid result. That meant those sites went offline and it was entirely our fault.

The Bad

The DNS database is regularly backed up, but it took us about 5 minutes to recognize the issue, retrieve the backup, and push it to production. Our new DNS infrastructure pushed the update out to most of the datacenters immediately, but because it was such a large update it took a few minutes to rebuild. In most places, new DNS requests were correctly answered with less than a 10 minute window of bad results.

Unfortunately, DNS is a series of interconnected caches, many of which are not in our control. If you accessed a page during the issue, your ISP’s recursive DNS likely cached the result. Since most DNS providers don’t make it easy to flush their cache (compared with a recursive provider like OpenDNS, which does) it extended the outage for people who were already seeing an issue. Generally, within 30 minutes, recursive DNS had flushed and by 8:00 GMT sites were back online.

Two datacenters did not take all the corrected DNS file updates correctly. We are still investigating why, but our speculation is that because the update affected a large number of records the systems choked on the initial attempt at the updates. Requests that hit those data centers returned bad results for some sites until about 8:10 GMT. Some visitors in Europe and Asia would have seen a longer period of downtime on some sites as a result. Our system has multiple layers of redundancy, including at the datacenter level, so we removed the two data centers from rotation as soon as we recognized the issue and affected visitors once again saw correct DNS results.

Two last problems exacerbated things. First, as is normal operations for us, we were dealing with two mid-sized DDoS attacks directed at some of our customers at the time. Nothing abnormal about that, but having two fewer data centers in rotation made us less effective at stopping them and caused a small handful of 500 errors. The impact of those, however, was minimal (less than 0.001% of traffic for around a 12 minute period). Second, there were some DNS entries in our system for TLDs like co.nz that shouldn’t have been there. While it wasn’t a validated DNS zone record, the way that the DNS update was pushed caused a handful of records that fell under these TLDs to also see an extended outage. When we got reports of this we identified the issue and removed the problematic entries.

The Good

There’s not a ton of good in this incident itself. While the system status is green now, we will memorialize the incident on our system status page. I, along with the rest of the team, apologize for the problem and anyone who experienced it. We’ve built a system that is resilient to most attacks, but a mistake on our part can still cause a significant issue. This is the second significant period of downtime we’ve had network wide. The first was more than a year ago and also occurred due to an error we made ourselves. Any period of downtime is unacceptable to us and, again, we sincerely apologize.

Going forward, we’ve already added several layers of safeguards to prevent this, or a similar incident, from occurring. CloudFlare’s technical systems are designed to learn over time, that same ethos is in our team itself. While this incident was ugly, I was proud to see almost the entire engineering, ops, and support teams online into the wee hours helping customers sort out issues and building the safeguards to prevent an issue like this in the future.

What I was planning on writing a blog post about this morning is our new DNS infrastructure, so I will end with a bit more detail on that. As described above, one of the main benefits is that DNS updates are even faster than before. In the past, DNS files were replicated every minute or so. Now changes are pushed instantly to our entire network. While that wasn’t a great thing last night, in general we believe it is a big benefit to our publishers and makes us the fastest updating global authoritative DNS in the world.

The update to the DNS systems also includes hardening against some of the new breed of DNS-directed DDoS attacks we’ve begun to see. Going forward, this will help us provide even better protection against larger and larger attacks. Our goal is to stay ahead of the bad guys and ensure that everyone on CloudFlare has state-of-the-art protection against attacks.

I apologize again for those of you who experienced downtime as a result of our mistake. We will learn from it and continue to build redundancy and resiliency into CloudFlare in order to earn your trust.

link: http://blog.cloudflare.com/post-mortem-the-ugly-the-bad-the-good

