2
0
mirror of https://github.com/LCTT/TranslateProject.git synced 2025-01-13 22:30:37 +08:00
TranslateProject/sources/talk/20190630 Data Still Dominates.md
darksun 9569c73a48 选题: 20190630 Data Still Dominates
sources/talk/20190630 Data Still Dominates.md
2019-07-09 09:57:50 +08:00

9.8 KiB
Raw Blame History

Data Still Dominates

Heres a quote from Linus Torvalds in 2006:

Im a huge proponent of designing your code around the data, rather than the other way around, and I think its one of the reasons git has been fairly successful… I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.

Which sounds a lot like Eric Raymonds “Rule of Representation” from 2003:

Fold knowledge into data, so program logic can be stupid and robust.

Which was just his summary of ideas like this one from Rob Pike in 1989:

Data dominates. If youve chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

Which cites Fred Brooks from 1975:

Representation is the Essence of Programming

Beyond craftmanship lies invention, and it is here that lean, spare, fast programs are born. Almost always these are the result of strategic breakthrough rather than tactical cleverness. Sometimes the strategic breakthrough will be a new algorithm, such as the Cooley-Tukey Fast Fourier Transform or the substitution of an n log n sort for an n2 set of comparisons.

Much more often, strategic breakthrough will come from redoing the representation of the data or tables. This is where the heart of your program lies. Show me your flowcharts and conceal your tables, and I shall be continued to be mystified. Show me your tables, and I wont usually need your flowcharts; theyll be obvious.

So, smart people have been saying this again and again for nearly half a century: focus on the data first. But sometimes it feels like the most famous piece of smart programming advice that everyone forgets.

Let me give some real examples.

The Highly Scalable System that Couldnt

This system was designed from the start to handle CPU-intensive loads with incredible scalability. Nothing was synchronous. Everything was done with callbacks, task queues and worker pools.

But there were two problems: The first was that the “CPU-intensive load” turned out not to be that CPU-intensive after all — a single task took a few milliseconds at worst. So most of the architecture was doing more harm than good. The second problem was that although it sounded like a highly scalable distributed system, it wasnt one — it only ran on one machine. Why? Because all communication between asynchronous components was done using files on the local filesystem, which was now the bottleneck for any scaling. The original design didnt say much about data at all, except to advocate local files in the name of “simplicity”. Most of the document was about all the extra architecture that was “obviously” needed to handle the “CPU-intensiveness” of the load.

The Service-Oriented Architecture that was Still Data-Oriented

This system followed a microservices design, made up of single-purpose apps with REST-style APIs. One component was a database that stored documents (basically responses to standard forms, and other electronic paperwork). Naturally it exposed an API for basic storage and retrieval, but pretty quickly there was a need for more complex search functionality. The designers felt that adding this search functionality to the existing document API would have gone against the principles of microservices design. They could talk about “search” as being a different kind of service from “get/put”, so their architecture shouldnt couple them together. Besides, the tool they were planning to use for search indexing was separate from the database itself, so creating a new service made sense for implementation, too.

In the end, a search API was created containing a search index that was essentially a duplicate of the data in the main database. This data was being updated dynamically, so any component that mutated document data through the main database API had to also update the search API. Its impossible to do this with REST APIs without race conditions, so the two sets of data kept going out of sync every now and then, anyway.

Despite what the architecture diagram promised, the two APIs were tightly coupled through their data dependencies. Later on it was recognised that the search index should be an implementation detail of a unified document service, and this made the system much more maintainable. “Do one thing” works at the data level, not the verb level.

The Fantastically Modular and Configurable Ball of Mud

This system was a kind of automated deployment pipeline. The original designers wanted to make a tool that was flexible enough to solve deployment problems across the company. It was written as a set of pluggable components, with a configuration file system that not only configured the components, but acted as a DSL for programming how the components fitted into the pipeline.

Fast forward a few years and its turned into “that program”. There was a long list of known bugs that no one was ever fixing. No one wanted to touch the code out of fear of breaking things. No one used any of the flexibility of the DSL. Everyone who used the program copy-pasted the same known-working configuration that everyone else used.

What had gone wrong? Although the original design document used words like “modular”, “decoupled”, “extensible” and “configurable” a lot, it never said anything about data. So, data dependencies between components ended up being handled in an ad-hoc way using a globally shared blob of JSON. Over time, components made more and more undocumented assumptions about what was in or not in the JSON blob. Sure, the DSL allowed rearranging components into any order, but most configurations didnt work.

Lessons

I chose these three examples because theyre easy to explain, not to pick on others. I once tried to build a website, and failed trying to instead build some cringe-worthy XML database that didnt even solve the data problems I had. Then theres the project that turned into a broken mockery of half the functionality of make, again because I didnt think about what I really needed. I wrote a post before based on a time I wrote a castle-in-the-sky OOP class hierarchy that should have been encoded in data instead.

Update:

Apparently many people still thought I wrote this to make fun of others. People whove actually worked with me will know Im much more interested in the things Im fixing than in blaming the people who did most of the work building them, but, okay, heres what I think of the engineers involved.

Honestly, the first example obviously happened because the designer was more interested in bringing a science project to work than in solving the problem at hand. Most of us have done that (mea culpa), but its really annoying to our colleagues wholl probably have to help maintain them when were bored of them. If this sounds like you, please dont get offended; please just stop. (Id still rather work on the single-node distributed system than anything built around my “XML database”.)

Theres nothing personal in the second example. Sometimes it feels like everyone is talking about how wonderful it is to split up services, but no one is talking about exactly when not to. People are learning the hard way all the time.

The third example was actually from some of the smartest people Ive ever had the chance to work with.

(End update.)

“Does this talk about the problems created by data?” turns out to be a pretty useful litmus test for good systems design. Its also pretty handy for detecting false expert advice. The hard, messy systems design problems are data problems, so false experts love to ignore them. Theyll show you a wonderfully beautiful architecture, but without talking about what kind of data its appropriate for, and (crucially) what kind of data it isnt.

For example, a false expert might tell you that you should use a pub/sub system because pub/sub systems are loosely coupled, and loosely coupled components are more maintainable. That sounds nice and results in pretty diagrams, but its backwards thinking. Pub/sub doesnt make your components loosely coupled; pub/sub is loosely coupled, which may or may not match your data needs.

On the flip side, a well-designed data-oriented architecture goes a long way. Functional programming, service meshes, RPCs, design patterns, event loops, whatever, all have their merits, but personally Ive seen tools like boring old databases be responsible for a lot more successfully shipped software.


via: https://theartofmachinery.com/2019/06/30/data_still_dominates.html

作者:Simon Arneaud 选题:lujun9972 译者:译者ID 校对:校对者ID

本文由 LCTT 原创编译,Linux中国 荣誉推出