
yixunx translating

Love Your Bugs

In early October I gave a keynote at Python Brasil in Belo Horizonte. Here is an aspirational and lightly edited transcript of the talk. There is also a video available here.

I love bugs

I'm currently a senior engineer at Pilot.com, working on automating bookkeeping for startups. Before that, I worked for Dropbox on the desktop client team, and I'll have a few stories about my work there. Earlier, I was a facilitator at the Recurse Center, a writers' retreat for programmers in NYC. I studied astrophysics in college and worked in finance for a few years before becoming an engineer.

But none of that is really important to remember; the only thing you need to know about me is that I love bugs. I love bugs because they're entertaining. They're dramatic. The investigation of a great bug can be full of twists and turns. A great bug is like a good joke or a riddle: you're expecting one outcome, but the result veers off in another direction.

Over the course of this talk I'm going to tell you about some bugs that I have loved, explain why I love bugs so much, and then convince you that you should love bugs too.

Bug #1

Ok, straight into bug #1. This is a bug that I encountered while working at Dropbox. As you may know, Dropbox is a utility that syncs your files from one computer to the cloud and to your other computers.

        +--------------+     +---------------+
        |              |     |               |
        |  METASERVER  |     |  BLOCKSERVER  |
        |              |     |               |
        +-+--+---------+     +---------+-----+
          ^  |                         ^
          |  |                         |
          |  |     +----------+        |
          |  +---> |          |        |
          |        |  CLIENT  +--------+
          +--------+          |
                   +----------+

Here's a vastly simplified diagram of Dropbox's architecture. The desktop client runs on your local computer, listening for changes in the file system. When it notices a changed file, it reads the file and hashes its contents in 4MB blocks. These blocks are stored in the backend in a giant key-value store that we call blockserver. The key is the digest of the hashed contents, and the values are the contents themselves.

Of course, we want to avoid uploading the same block multiple times. You can imagine that if you're writing a document, you're probably mostly changing the end, and we don't want to upload the beginning over and over. So before uploading a block to the blockserver, the client talks to a different server that's responsible for managing metadata and permissions, among other things. The client asks the metaserver whether it needs the block or has seen it before. The metaserver responds with whether or not each block needs to be uploaded.

So the request and response look roughly like this: The client says, “I have a changed file made up of blocks with hashes 'abcd,deef,efgh'”. The server responds, “I have those first two, but upload the third.” Then the client sends the block up to the blockserver.

                +--------------+     +---------------+
                |              |     |               |
                |  METASERVER  |     |  BLOCKSERVER  |
                |              |     |               |
                +-+--+---------+     +---------+-----+
                  ^  |                         ^
                  |  | 'ok, ok, need'          |
'abcd,deef,efgh'  |  |     +----------+        | efgh: [contents]
                  |  +---> |          |        |
                  |        |  CLIENT  +--------+
                  +--------+          |
                           +----------+
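To make the exchange in the diagram concrete, here's a minimal Python sketch of that flow. The method names (`which_blocks_needed`, `put`) and the use of SHA-256 are assumptions made for illustration, not the real Dropbox protocol:

    import hashlib

    BLOCK_SIZE = 4 * 1024 * 1024  # 4MB blocks, as described above

    def block_digests(path):
        """Read a file in 4MB blocks and hash each block."""
        digests = []
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                digests.append(hashlib.sha256(block).hexdigest())
        return digests

    def sync_file(path, metaserver, blockserver):
        """Ask the metaserver which blocks are new, then upload only those."""
        digests = block_digests(path)
        needed = set(metaserver.which_blocks_needed(digests))  # e.g. {'efgh'}
        with open(path, "rb") as f:
            for digest in digests:
                block = f.read(BLOCK_SIZE)
                if digest in needed:
                    blockserver.put(digest, block)  # key: digest, value: contents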

That's the setup. So here's the bug.

                +--------------+
                |              |
                |  METASERVER  |
                |              |
                +-+--+---------+
                  ^  |
                  |  |   '???'
'abcdldeef,efgh'  |  |     +----------+
     ^            |  +---> |          |
     ^            |        |  CLIENT  +
                  +--------+          |
                           +----------+

Sometimes the client would make a weird request: each hash value should have been sixteen characters long, but instead it was thirty-three characters long, twice as many plus one. The server wouldn't know what to do with this and would throw an exception. We'd see this exception get reported, and we'd go look at the log files from the desktop client, and really weird stuff would be going on: the client's local database had gotten corrupted, or python would be throwing MemoryErrors, and none of it would make sense.

If you've never seen this problem before, it's totally mystifying. But once you've seen it once, you can recognize it every time thereafter. Here's a hint: the character we'd most often see in the middle of the 33-character string, where a comma should have been, was l. These are the other characters we'd see in that position:

l \x0c < $ ( . -

The ordinal value of an ASCII comma (,) is 44. The ordinal value of l is 108. In binary, here's how those two are represented:

>>> format(ord(','), '07b')
'0101100'
>>> format(ord('l'), '07b')
'1101100'

You'll notice that an l is exactly one bit away from a comma. And herein lies your problem: a bitflip. One bit of memory that the desktop client is using has gotten corrupted, and now the desktop client is sending a request to the server that is garbage.

And here are the other characters we'd frequently see instead of the comma when a different bit had been flipped:

,    : 0101100
l    : 1101100
\x0c : 0001100
<    : 0111100
$    : 0100100
(    : 0101000
.    : 0101110
-    : 0101101
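You can reproduce this table yourself by flipping each of the seven low bits of an ASCII comma. This is just a quick sketch, not anything from the Dropbox codebase:

    comma = ord(',')                      # 44, or 0101100 in binary
    for bit in range(7):
        flipped = comma ^ (1 << bit)      # flip exactly one bit
        print(f"{chr(flipped)!r:8} : {flipped:07b}")

Running it prints -, ., (, $, <, \x0c, and l: the same set of characters listed above.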

Bitflips are real!

I love this bug because it shows that bitflips are a real thing that can happen, not just a theoretical concern. In fact, there are some domains where they're more common than others. One such domain is requests from users with low-end or old hardware, which describes a lot of laptops running Dropbox. Another domain with lots of bitflips is outer space: there's no atmosphere in space to protect your memory from energetic particles and radiation, so bitflips are pretty common.

You probably really care about correctness in space: your code might be keeping astronauts alive on the ISS, for example, but even if it's not mission-critical, it's hard to do software updates in space. If you really need your application to defend against bitflips, there are a variety of hardware and software approaches you can take, and there's a very interesting talk by Katie Betchold about this.

Dropbox in this context doesn't really need to protect against bitflips. The machine that is corrupting memory is a user's machine, so we can detect the bitflip if it happens to fall in the comma, but if it's in a different character we don't necessarily know it, and if the bitflip is in the actual file data read off of disk, then we have no idea. There's a pretty limited set of places where we could address this, so instead we decided to basically silence the exception and move on. Often this kind of bug resolves itself after the client restarts.
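To illustrate why the flip is only detectable when it lands on the separator, here's a hedged sketch of the kind of server-side sanity check that would catch it. The exact hash format (sixteen lowercase alphanumeric characters per block, comma-separated) is invented here for illustration:

    import re

    # Hypothetical format: comma-separated list of 16-character lowercase digests.
    HASH_LIST_RE = re.compile(r'^[a-z0-9]{16}(,[a-z0-9]{16})*$')

    def looks_corrupted(hash_list):
        """A flipped separator breaks the expected format. A flip inside a digest
        may still look like a valid (but wrong) digest, so it isn't necessarily caught."""
        return HASH_LIST_RE.match(hash_list) is None

    looks_corrupted('abcdabcdabcdabcd,deefdeefdeefdeef')    # False: well-formed
    looks_corrupted('abcdabcdabcdabcdldeefdeefdeefdeef')    # True: comma flipped to 'l'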

Unlikely bugs aren't impossible

This is one of my favorite bugs for a couple of reasons. The first is that it's a reminder of the difference between unlikely and impossible. At sufficient scale, unlikely events start to happen at a noticeable rate.

Social bugs

My second favorite thing about this bug is that it's a tremendously social one. This bug can crop up anywhere that the desktop client talks to the server, which is a lot of different endpoints and components in the system. This meant that a lot of different engineers at Dropbox would see versions of the bug. The first time you see it, you can really scratch your head, but after that it's easy to diagnose, and the investigation is really quick: you look at the middle character and see if it's an l.

Cultural differences

One interesting side-effect of this bug was that it exposed a cultural difference between the server and client teams. Occasionally this bug would be spotted by a member of the server team and investigated from there. If one of your servers is flipping bits, that's probably not random chance: it's probably memory corruption, and you need to find the affected machine and get it out of the pool as fast as possible, or you risk corrupting a lot of user data. That's an incident, and you need to respond quickly. But if the user's machine is corrupting data, there's not a lot you can do.

Share your bugs

So if you're investigating a confusing bug, especially one in a big system, don't forget to talk to people about it. Maybe your colleagues have seen a bug shaped like this one before. If they have, you might save a lot of time. And if they haven't, don't forget to tell people about the solution once you've figured it out: write it up or tell the story in your team meeting. Then the next time your team hits something similar, you'll all be more prepared.

How bugs can help you learn

Recurse Center

Before I joined Dropbox, I worked for the Recurse Center. The idea behind RC is that it's a community of self-directed learners spending time together getting better as programmers. That is the full extent of the structure of RC: there's no curriculum, no assignments, and no deadlines. The only scoping is a shared goal of getting better as a programmer. We'd see people come to participate in the program who had gotten CS degrees but didn't feel like they had a solid handle on practical programming, or people who had been writing Java for ten years and wanted to learn Clojure or Haskell, and many other profiles as well.

My job there was as a facilitator, helping people make the most of the lack of structure and providing guidance based on what we'd learned from earlier participants. So my colleagues and I were very interested in the best techniques for learning for self-motivated adults.

Deliberate Practice

There's a lot of different research in this space, and one of the ideas I find most interesting is deliberate practice. Deliberate practice is an attempt to explain the difference in performance between experts and amateurs. The guiding principle here is that if you look just at innate characteristics, genetic or otherwise, they don't go very far towards explaining the difference in performance. So the researchers, originally Ericsson, Krampe, and Tesch-Römer, set out to discover what did explain the difference. And what they settled on was time spent in deliberate practice.

Deliberate practice is pretty narrow in their definition: it's not work for pay, and it's not playing for fun. You have to be operating on the edge of your ability, doing a project appropriate for your skill level (not so easy that you don't learn anything and not so hard that you don't make any progress). You also have to get immediate feedback on whether or not you've done the thing correctly.

This is really exciting, because it's a framework for how to build expertise. But the challenge is that as programmers this is really hard advice to apply. It's hard to know whether you're operating at the edge of your ability. Immediate corrective feedback is very rare: in some cases you're lucky to get feedback ever, and in other cases it may take months. You can get quick feedback on small things in the REPL and so on, but if you're making a design decision or picking a technology, you're not going to get feedback on those things for quite a long time.

But one category of programming where deliberate practice is a useful model is debugging. If you wrote code, then you had a mental model of how it worked when you wrote it. But your code has a bug, so your mental model isn't quite right. By definition you're on the boundary of your understanding, so, great! You're about to learn something new. And if you can reproduce the bug, that's a rare case where you can get immediate feedback on whether or not your fix is correct.

A bug like this might teach you something small about your program, or you might learn something larger about the system your code is running in. Now I've got a story for you about a bug like that.

Bug #2

This is also a bug that I encountered at Dropbox. At the time, I was investigating why some desktop clients weren't sending logs as consistently as we expected. I'd started digging into the client logging system and discovered a bunch of interesting bugs. I'll tell you only the subset of those bugs that is relevant to this story.

Again, here's a very simplified diagram of the system's architecture.

                                   +--------------+
                                   |              |
               +---+  +----------> |  LOG SERVER  |
               |log|  |            |              |
               +---+  |            +------+-------+
                      |                   |
                +-----+----+              |  200 ok
                |          |              |
                |  CLIENT  |  <-----------+
                |          |
                +-----+----+
                      ^
                      +--------+--------+--------+
                      |        ^        ^        |
                   +--+--+  +--+--+  +--+--+  +--+--+
                   | log |  | log |  | log |  | log |
                   |     |  |     |  |     |  |     |
                   |     |  |     |  |     |  |     |
                   +-----+  +-----+  +-----+  +-----+

The desktop client would generate logs. Those logs were compressed, encrypted, and written to disk. Then every so often the client would send them up to the server: it would read a log off of disk and send it to the log server. The server would decrypt it and store it, then respond with a 200.

If the client couldn't reach the log server, it wouldn't let the log directory grow unbounded. After a certain point it would start deleting logs to keep the directory under a maximum size.

The first two bugs were not a big deal on their own. The first one was that the desktop client sent logs up to the server starting with the oldest one instead of starting with the newest. This isn't really what you want. For example, the server would tell the client to send logs if the client reported an exception, so you probably care about the logs that just happened and not the oldest logs that happen to be on disk.

The second bug was similar to the first: if the log directory hit its maximum size, the client would delete the logs starting with the newest instead of starting with the oldest. Again, you lose log files either way, but you probably care less about the older ones.

The third bug had to do with the encryption. Sometimes, the server would be unable to decrypt a log file. (We generally didn't figure out why; maybe it was a bitflip.) We weren't handling this error correctly on the backend, so the server would reply with a 500. The client would behave reasonably in the face of a 500: it would assume that the server was down. So it would stop sending log files and not try to send up any of the others.

Returning a 500 on a corrupted log file is clearly not the right behavior. You could consider returning a 400, since it's a problem with the client request. But the client also can't fix the problem: if the log file can't be decrypted now, we'll never be able to decrypt it in the future. What you really want the client to do is just delete the log and move on. In fact, that's the default behavior when the client gets a 200 back from the server for a log file that was successfully stored. So we said, ok: if the log file can't be decrypted, just return a 200.
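Putting the three fixes together, the desired client behavior looks roughly like the following sketch. The directory cap, the `upload` callable, and the file layout are all assumptions made for illustration; this is not the actual Dropbox client code:

    import os

    MAX_LOG_DIR_BYTES = 100 * 1024 * 1024          # hypothetical cap on the log directory

    def logs_newest_first(log_dir):
        """All log files in the directory, newest first."""
        paths = [os.path.join(log_dir, name) for name in os.listdir(log_dir)]
        return sorted(paths, key=os.path.getmtime, reverse=True)

    def enforce_quota(log_dir):
        """Fix #2: when over the cap, delete the oldest logs, not the newest."""
        paths = logs_newest_first(log_dir)
        total = sum(os.path.getsize(p) for p in paths)
        while paths and total > MAX_LOG_DIR_BYTES:
            oldest = paths.pop()                    # last entry in newest-first order
            total -= os.path.getsize(oldest)
            os.remove(oldest)

    def send_logs(log_dir, upload):
        """Fix #1: send the newest logs first. Fix #3 lives on the server: an
        undecryptable log also gets a 200, so the client deletes it and moves on."""
        for path in logs_newest_first(log_dir):
            status = upload(path)                   # hypothetical callable returning an HTTP status
            if status == 200:
                os.remove(path)
            else:
                break                               # server looks unhealthy; retry on the next run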

All of these bugs were straightforward to fix. The first two bugs were on the client, so we'd fixed them on the alpha build, but they hadn't gone out to the majority of clients. The third bug we fixed on the server and deployed.

📈

Suddenly traffic to the log cluster spikes. The serving team reaches out to us to ask if we know what's going on. It takes me a minute to put all the pieces together.

Before these fixes, there were four things going on:

  1. Log files were sent up starting with the oldest

  2. Log files were deleted starting with the newest

  3. If the server couldn't decrypt a log file it would 500

  4. If the client got a 500 it would stop sending logs

A client with a corrupted log file would try to send it, the server would 500, the client would give up sending logs. On its next run, it would try to send the same file again, fail again, and give up again. Eventually the log directory would get full, at which point the client would start deleting its newest files, leaving the corrupted one on disk.

The upshot of these three bugs: if a client ever had a corrupted log file, we would never see logs from that client again.

The problem is that there were a lot more clients in this state than we thought. Any client with a single corrupted file had been dammed up from sending logs to the server. Now that dam was cleared, and all of them were sending up the rest of the contents of their log directories.

Our options

Ok, there's a huge flood of traffic coming from machines around the world. What can we do? (This is a fun thing about working at a company with Dropbox's scale, and particularly Dropbox's scale of desktop clients: you can trigger a self-DDOS very easily.)

The first option when you do a deploy and things start going sideways is to roll back. That's a totally reasonable choice, but in this case, it wouldn't have helped us. The state that we'd transformed wasn't the state on the server but the state on the client: we'd deleted those files. Rolling back the server would prevent additional clients from entering this state, but it wouldn't solve the problem.

What about increasing the size of the logging cluster? We did that, and started getting even more requests now that we'd increased our capacity. We increased it again, but you can't do that forever. Why not? This cluster isn't isolated. It's making requests into another cluster, in this case to handle exceptions. If you have a DDOS pointed at one cluster and you keep scaling that cluster, you're going to knock over its dependencies too, and then you have two problems.

Another option we considered was shedding load: you don't need every single log file, so can we just drop requests? One of the challenges here was that we didn't have an easy way to tell good traffic from bad. We couldn't quickly differentiate which log files were old and which were new.

The solution we hit on is one that's been used at Dropbox on a number of different occasions: we have a custom header, chillout, which every client in the world respects. If the client gets a response with this header, it doesn't make any requests for the provided number of seconds. Someone very wise added this to the Dropbox client very early on, and it's come in handy more than once over the years. The logging server didn't have the ability to set that header, but that's an easy problem to solve. So two of my colleagues, Isaac Goldberg and John Lai, implemented support for it. We set the logging cluster's chillout to two minutes initially and then managed it down as the deluge subsided over the next couple of days.
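Here's a rough sketch of what respecting such a header might look like on the client side. The header name comes from the talk, but the mechanics (how the back-off is stored and applied) are guesses:

    import time
    import requests

    def send_with_chillout(session, url, payload):
        """Make a request; if the server asks for a chillout, back off for that long."""
        response = session.post(url, data=payload)
        chillout_seconds = int(response.headers.get("chillout", 0))
        if chillout_seconds:
            # A real client would pause all outgoing requests for this long,
            # not just sleep in one code path.
            time.sleep(chillout_seconds)
        return response

The nice property is that the back-off is entirely server-controlled: the server decides how long every client should stay quiet, without shipping any client change.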

Know your system

The first lesson from this bug is to know your system. I had a good mental model of the interaction between the client and the server, but I wasn't thinking about what would happen when the server was interacting with all the clients at once. There was a level of complexity that I hadn't thought all the way through.

Know your tools

The second lesson is to know your tools. If things go sideways, what options do you have? Can you reverse your migration? How will you know if things are going sideways, and how can you discover more? All of those things are great to know before a crisis; if you don't, you'll learn them during a crisis and then never forget.

Feature flags & server-side gating

The third lesson is for you if you're writing a mobile or a desktop application: you need server-side feature gating and server-side flags. When you discover a problem and you don't have server-side controls, the resolution might take days or weeks as you push out a new release or submit a new version to the app store. That's a bad situation to be in. The Dropbox desktop client isn't going through an app store review process, but just pushing out a build to tens of millions of clients takes time. Compare that to hitting a problem in your feature and flipping a switch on the server: ten minutes later your problem is resolved.
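Here's a minimal sketch of what server-side gating can look like from the client's perspective. The endpoint URL and flag name are made up for illustration:

    import requests

    def fetch_feature_flags(url):
        """Ask the server which features this build should enable."""
        try:
            return requests.get(url, timeout=5).json()
        except (requests.RequestException, ValueError):
            return {}                                # can't reach the server: fall back to defaults

    flags = fetch_feature_flags("https://example.com/client-flags")   # hypothetical endpoint
    if flags.get("new_log_uploader", False):                          # hypothetical flag name
        print("running the new code path")
    else:
        print("running the old, known-good code path")

Turning `new_log_uploader` off on the server then takes effect the next time clients fetch their flags, with no new release required.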

This strategy is not without its costs. Having a bunch of feature flags in your code adds to the complexity dramatically. You get a combinatoric problem with your testing: what if feature A is enabled and feature B too, or just one, or neither, multiplied across N features? It's extremely difficult to get engineers to clean up their feature flags after the fact (and I was also guilty of this). Then for the desktop client there are multiple versions in the wild at the same time, so it gets pretty hard to reason about.

But the benefit: man, when you need it, you really need it.

How to love bugs

I've talked about some bugs that I love and I've talked about why to love bugs. Now I want to tell you how to love bugs. If you don't love bugs yet, I know of exactly one way to learn, and that's to have a growth mindset.

The psychologist Carol Dweck has done a ton of interesting research about how people think about intelligence. She's found that there are two different frameworks for thinking about intelligence. The first, which she calls the fixed mindset, holds that intelligence is a fixed trait, and people can't change how much of it they have. The other mindset is a growth mindset. Under a growth mindset, people believe that intelligence is malleable and can increase with effort.

Dweck found that a person's theory of intelligence, whether they hold a fixed or growth mindset, can significantly influence the way they select tasks to work on, the way they respond to challenges, their cognitive performance, and even their honesty.

[I also talked about a growth mindset in my Kiwi PyCon keynote, so here are just a few excerpts. You can read the full transcript here.]

Findings about honesty:

After this, they had the students write letters to pen pals about the study, saying “We did this study at school, and here's the score that I got.” They found that almost half of the students praised for intelligence lied about their scores, and almost no one who was praised for working hard was dishonest.

On effort:

Several studies found that people with a fixed mindset can be reluctant to really exert effort, because they believe it means they're not good at the thing they're working hard on. Dweck notes, “It would be hard to maintain confidence in your ability if every time a task requires effort, your intelligence is called into question.”

On responding to confusion:

They found that students with a growth mindset mastered the material about 70% of the time, regardless of whether there was a confusing passage in it. Among students with a fixed mindset, if they read the booklet without the confusing passage, again about 70% of them mastered the material. But the fixed-mindset students who encountered the confusing passage saw their mastery drop to 30%. Students with a fixed mindset were pretty bad at recovering from being confused.

These findings show that a growth mindset is critical while debugging. We have to recover from confusion, be candid about the limitations of our understanding, and at times really struggle on the way to finding solutions, all of which is easier and less painful with a growth mindset.

Love your bugs

I learned to love bugs by explicitly celebrating challenges while working at the Recurse Center. A participant would sit down next to me and say, “[sigh] I think I've got a weird Python bug,” and I'd say, “Awesome, I love weird Python bugs!” First of all, this is definitely true, but more importantly, it emphasized to the participant that finding something where they struggled was an accomplishment, and it was a good thing for them to have done that day.

As I mentioned, at the Recurse Center there are no deadlines and no assignments, so this attitude is pretty much free. I'd say, “You get to spend a day chasing down this weird bug in Flask, how exciting!” At Dropbox and later at Pilot, where we have a product to ship, deadlines, and users, I'm not always uniformly delighted about spending a day on a weird bug. So I'm sympathetic to the reality of a world where there are deadlines. However, if I have a bug to fix, I have to fix it, and being grumbly about the existence of the bug isn't going to help me fix it faster. I think that even in a world where deadlines loom, you can still apply this attitude.

If you love your bugs, you can have more fun while you're working on a tough problem. You can be less worried and more focused, and you can end up learning more from them. Finally, you can share a bug with your friends and colleagues, which helps you and your teammates.

Obrigada!

My thanks to folks who gave me feedback on this talk and otherwise contributed to my being there:

  • Sasha Laundy

  • Amy Hanlon

  • Julia Evans

  • Julian Cooper

  • Raphael Passini Diniz and the rest of the Python Brasil organizing team


via: http://akaptur.com/blog/2017/11/12/love-your-bugs/

Author: Allison Kaptur. Translator: 译者ID. Proofreader: 校对者ID.

This article was originally compiled by LCTT and is proudly presented by Linux中国 (Linux.cn).