diff --git a/sources/tech/20200629 Scaling a GraphQL Website.md b/sources/tech/20200629 Scaling a GraphQL Website.md deleted file mode 100644 index 5434cc1032..0000000000 --- a/sources/tech/20200629 Scaling a GraphQL Website.md +++ /dev/null @@ -1,336 +0,0 @@ -[#]: collector: (lujun9972) -[#]: translator: (MjSeven) -[#]: reviewer: ( ) -[#]: publisher: ( ) -[#]: url: ( ) -[#]: subject: (Scaling a GraphQL Website) -[#]: via: (https://theartofmachinery.com/2020/06/29/scaling_a_graphql_site.html) -[#]: author: (Simon Arneaud https://theartofmachinery.com) - -Scaling a GraphQL Website -====== - -For obvious reasons, I normally write abstractly about work I’ve done for other people, but I’ve been given permission to write about a website, [Vocal][1], that I did some SRE work on last year. I actually gave [a presentation at GraphQL Sydney back in February][2], but for various reasons it’s taken me this long to get it into a blog post. - -Vocal is a GraphQL-based website that got traction and hit scaling problems that I got called in to fix. Here’s what I did. Obviously, you’ll find this post useful if you’re scaling another GraphQL website, but most of it’s representative of what you have to deal with when a site gets enough traffic to cause technical problems. If website scalability is a key interest of yours, you might want to read [my recent post about scalability][3] first. - -### Vocal - -![][4] - -Vocal is a blogging platform publishing everything from diaries to movie reviews to opinion pieces to recipes to professional and amateur photography to beauty and lifestyle tips and poetry. Of course, there’s no shortage of proud pet owners with cute cat and dog pictures. - -![][5] - -One thing that’s a bit different about Vocal is that it lets everyday people get paid for producing works that viewers find interesting. Authors get a small amount of money per page view, and can also receive donations from other users. 
There are professionals using the platform to show off their work, but for most users it’s just a fun hobby that happens to make some extra pocket money as a bonus. - -Vocal is the product of [Jerrick Media][6], a New Jersey startup. Development started in 2015 in collaboration with [Thinkmill][7], a medium-sized Sydney software development consultancy that specialises in all things JavaScript, React and GraphQL. - -### Some spoilers for the rest of this post - -I was told that unfortunately I can’t give hard traffic numbers for legal reasons, but publicly available information can give an idea. Alexa ranks all websites it knows of by traffic level. Here’s a plot of Alexa rank I showed in my talk, showing growth from November 2019 up to getting ranked number 5,567 in the world by February. - -![Vocal global Alexa rank rising from #9,574 in November 2019 to #5,567 in February 2020.][8] - -It’s normal for the curve to slow down because it requires more and more traffic to win each position. Vocal is now at around #4,900. Obviously there’s a long way to go, but that’s not shabby at all for a startup. Most startups would gladly swap their Alexa rank with Vocal’s. - -Shortly after the site was upgraded, Jerrick Media ran a marketing campaign that doubled traffic. All we had to do on the technical side was watch numbers go up in the dashboards. In the past 9 months since launch, there have only been two platform issues needing staff intervention: [the once-in-five-years AWS RDS certificate rotation that landed in March][9], and an app rollout hitting a Terraform bug. I’ve been very happy with how little platform busywork is needed to keep Vocal running. 
- -Here’s an overview of the technical stuff I’ll talk about in this post: - - * Technical and historical background - * Database migration from MongoDB to Postgres - * Deployment infrastructure revamp - * Making the app compatible with scaling - * Making HTTP caching work - * Miscellaneous performance tweaks - - - -### Some background - -Thinkmill built a website using [Next.js][10] (a React-based web framework), talking to a GraphQL API provided by [Keystone][11] in front of MongoDB. Keystone is a GraphQL-based headless CMS library: you define a schema in JavaScript, hook it up to some data storage, and get an automatically generated GraphQL API for data access. It’s a free and open-source software project that’s commercially backed by Thinkmill. - -#### Vocal V2 - -Version 1 of Vocal got traction. It found a userbase that liked the product, and it grew, and eventually Jerrick Media asked Thinkmill to help develop a version 2, which was successfully launched in September last year. The Jerrick Media folk avoided the [second system effect][12] by generally basing changes on user feedback, so they were [mostly UI and feature changes that I won’t go into][13]. Instead, I’ll talk about the stuff I was brought in for: making the new site more robust and scalable. - -For the record, I’m thankful that I got to work with Jerrick Media and Thinkmill on Vocal, and that they let me present this story, but [I’m still an independent consultant][14]. I wasn’t paid or even asked to write this post, and this is still my own personal blog. - -### The database migration - -Thinkmill ran into several scalability problems using MongoDB for Vocal, and decided to upgrade Keystone to version 5 to take advantage of its new Postgres support. - -If you’ve been in tech long enough to remember the “NoSQL” marketing from the end of the 00s, that might surprise you. 
The message was that relational (SQL) databases like Postgres aren’t as scalable as “webscale” NoSQL databases like MongoDB. It’s technically true, but the scalability of NoSQL databases comes from compromises in the variety of queries that can be efficiently handled. Simple, non-relational databases (like document and key-value databases) have their places, but when used as a general-purpose backend for an app, the app often outgrows the querying limitations of the database before it outgrows the theoretical scaling limit a relational database would have. Most of Vocal’s DB queries worked just fine with MongoDB, but over time more and more queries needed hacks to work at all. - -In terms of technical requirements, Vocal is very similar to Wikipedia, one of the biggest sites in the world. Wikipedia runs on MySQL (or rather, its fork, MariaDB). Sure, some significant engineering is needed to make that work, but I don’t see relational databases being a serious threat to Vocal’s scaling in the foreseeable future. - -At one point I checked, the managed AWS RDS Postgres instances cost less than a fifth of the old MongoDB instances, yet CPU usage of the Postgres instances was still under 10%, despite serving more traffic than the old site. That’s mostly because of a few important queries that just never were efficient under the document database architecture. - -The migration could be a blog post of its own, but basically a Thinkmill dev built an [ETL pipeline][15] using [MoSQL][16] to do the heavy lifting. Thanks to Keystone being a FOSS project, I was also able to contribute some performance improvements to its GraphQL to SQL mapping. For that kind of stuff, I always recommend Markus Winand’s SQL blogs: [Use the Index Luke][17] and [Modern SQL][18]. His writing is friendly and accessible to non-experts, yet has most of the theory you need for writing fast and effective SQL. A good, DB-specific book on performance gives you the rest. 
- -### The platform - -#### The architecture - -V1 was a couple of Node.js apps running on a single virtual private server (VPS) behind Cloudflare as a CDN. I’m a fan of avoiding overengineering as a high priority, so that gets a thumbs up from me. However, by the time V2 development started, it was obvious that Vocal had outgrown that simple architecture. It didn’t give Thinkmillers many options when handling big traffic spikes, and it made updates hard to deploy safely and without downtime. - -Here’s the new architecture for V2: - -![Architecture of Vocal V2. Requests come through a CDN to a load balancer in AWS. The load balancer distributes traffic to two apps, "Platform" and "Website". "Platform" is a Keystone app storing data in Redis and Postgres.][19] - -Basically, the two Node.js apps have been replicated and put behind a load balancer. Yes, that’s it. In my SRE work, I often meet engineers who expect a scalable architecture to be more complicated than that, but I’ve worked on sites that are orders of magnitude bigger than Vocal but are still just replicated services behind load balancers, with DB backends. If you think about it, if the platform architecture needs to keep getting significantly more complicated as the site grows, it’s not really very scalable. Website scalability is mostly about fixing the many little implementation details that prevent scaling. - -Vocal’s architecture might need a few additions if traffic grows enough, but the main reason it would get more complicated is new features. For example, if (for some reason) Vocal needed to handle real-time geospatial data in future, that would be a very different technical beast from blog posts, so I’d expect architectural changes for it. Most of the complexity in big site architecture is because of feature complexity. - -If you don’t know how to make your architecture scalable, I always recommend keeping it as simple as you can. 
Fixing an architecture that’s too simple is easier and cheaper than fixing an architecture that’s too complex. Also, an unnecessarily complex architecture is more likely to have mistakes, and those mistakes will be harder to debug. - -By the way, Vocal happened to be split into two apps, but that’s not important. A common scaling mistake is to prematurely split an app into smaller services in the name of scalability, but split the app in the wrong place and cause more scalability problems overall. Vocal could have scaled okay as a monolithic app, but the split is also in a good place. - -#### The infrastructure - -Thinkmill has a few people who have experience working with AWS, but it’s primarily a dev shop and needed something more “hands off” than the old Vocal deployment. I ended up deploying the new Vocal on [AWS Fargate][20], which is a relatively new backend to Elastic Container Service (ECS). In the old days, many people wanted ECS to be a simple “run my Docker container as a managed service” product, and were disappointed that they still had to build and manage their own server cluster. With ECS Fargate, AWS manages the cluster. It supports running Docker containers with the basic nice things like replication, health checking, rolling updates, autoscaling and simple alerting. - -A good alternative would have been a managed Platform-as-a-Service (PaaS) like App Engine or Heroku. Thinkmill was already using them for simple projects, but often needed more flexibility with other projects. There are much bigger sites running on PaaSes, but Vocal is at a scale where a custom cloud deployment can make sense economically. - -Another obvious alternative would have been Kubernetes. Kubernetes has a lot more features than ECS Fargate, but it’s a lot more expensive — both in resource overhead, and the staffing needed for maintenance (such as regular node upgrades). As a rule, I don’t recommend Kubernetes to any place that doesn’t have dedicated DevOps staff. 
Fargate has the features Vocal needs, and has let Thinkmill and Jerrick Media focus on website improvements, not infrastructure busywork. - -Yet another option was “Serverless” function products like AWS Lambda or Google Cloud Functions. They’re great for handling services with very low or highly irregular traffic, but (as I’ll explain) ECS Fargate’s autoscaling is enough for Vocal’s backend. Another plus of these products is that they allow developers to deploy things in cloud environments without needing to learn a lot about cloud environments. The tradeoff is that the Serverless product becomes tightly coupled to the development process, and to the testing and debugging processes. Thinkmill already had enough AWS expertise in-house to manage a Fargate deployment, and any dev who knows how to make a Node.js Express Hello World app can work on Vocal without learning anything about either Serverless functions or Fargate. - -An obvious downside of ECS Fargate is vendor lock-in. However, avoiding vendor lock-in is a tradeoff like avoiding downtime. If you’re worried about migrating, it doesn’t make sense to spend more on platform independence than you would on a migration. The total amount of Fargate-specific code in Vocal is <500 lines of [Terraform][21]. The most important thing is that the Vocal app code itself is platform agnostic. It can run on normal developer machines, and then be packaged up into a Docker container that can run practically anywhere a Docker container can, including ECS Fargate. - -Another downside of Fargate is that it’s not trivial to set up. Like most things in AWS, it’s in a world of VPCs, subnets, IAM policies, etc. Fortunately, that kind of stuff is quite static (unlike a server cluster that requires maintenance). - -### Making a scaling-ready app - -There’s a bunch of stuff to get right if you want to run an app painlessly at scale. You’re doing well if you follow [the Twelve-Factor App design][22], so I won’t repeat it here. 
- -There’s no point building a “scalable” system if staff can’t operate it at scale — that’s like putting a jet engine on a unicycle. An important part of making Vocal scalable was setting up stuff like CI/CD and [infrastructure as code][23]. Similarly, some deployment ideas aren’t worth it because they make production too different from the development environment (see also [point #10 of the Twelve-Factor App][24]). Every difference between production and development slows app development and can be expected to lead to a bug eventually. - -### Caching - -Caching is a really big topic — I once gave [a presentation on just HTTP caching][25], and that still wasn’t enough. I’ll stick to the essentials for GraphQL here. - -First, an important warning: Whenever you have performance problems, you might wonder, “Can I make this faster by putting this value into a cache for future reuse?” **Microbenchmarks will practically _always_ tell you the answer is “yes”.** However, putting caches everywhere will tend to make your overall system **slower**, thanks to problems like cache coherency. Here’s my mental checklist for caching: - - 1. Ask if the performance problem needs to be solved with caching - 2. Really ask (non-caching performance wins tend to be more robust) - 3. Ask if the problem can be solved by improving existing caches - 4. If all else fails, maybe add a new cache - - - -One cache system you’ll always have is the HTTP caching system, so a corollary is that it’s a good idea to use HTTP caching effectively before trying to add extra caches. I’ll focus on that in this post. - -Another very common trap is using a hash map or something inside the app for caching. [It works great in local development but performs badly when scaled.][26] The best thing is to use an explicit caching library that supports pluggable backends like Redis or Memcached. - -#### The basics - -There are two types of caches in the HTTP spec: private and public. 
Private caches are caches that don’t share data with multiple users (in practice, the user’s browser cache). Public caches are all the rest. They include ones under your control (such as CDNs or servers like Varnish or Nginx) and ones that aren’t (proxies). Proxy caches are rarer in today’s HTTPS world, but some corporate networks have them. - -![][27] - -Caching lookup keys are normally based on URLs, so caching is less painful if you stick to a “same content, same URL; different content, different URL” rule. I.e., give each page a canonical URL, and avoid “clever” tricks returning varying content from one URL. Obviously, this has implications for GraphQL API endpoints (which I’ll discuss later). - -Your servers can take custom configuration, but the primary way to configure HTTP caching is through HTTP headers you set on web responses. The most important header is `cache-control`. The following says that all caches down the line may cache the page for up to 3600 seconds (one hour): - -``` -cache-control: max-age=3600, public -``` - -For user-specific pages (such as user settings pages), it’s important to use `private` instead of `public` to tell public caches not to store the response and serve it to other users. - -Another common header is `vary`. This tells caches that the response varies based on some things other than the URL. (Effectively it adds HTTP headers to the cache key, alongside the URL.) It’s a very blunt tool, which is why I recommend using a good URL structure instead if possible, but an important use case is telling browsers that the response depends on the login cookie, so that they update pages on login/logout. - -``` -vary: cookie -``` - -If a page can vary based on login status, you need `cache-control: private` (and `vary: cookie`) even on the public, logged-out version, to make sure responses don’t get mixed up. - -Other useful headers include `etag` and `last-modified`, but I won’t cover them here. 
You might still see some old headers like `expires` and `pragma: no-cache`. They were made obsolete by HTTP/1.1 back in 1997, so I only use them if I want to disable caching and I’m feeling paranoid. - -#### Clientside headers - -Less well known is that the HTTP spec allows `cache-control` headers to be used in client requests to reduce the cache time and get a fresher response. Unfortunately `max-age` greater than 0 doesn’t seem to be widely supported by browsers, but `no-cache` can be useful if you sometimes need a fresh response after an update. - -#### HTTP caching and GraphQL - -As above, the normal cache key is the URL. But GraphQL APIs often use just one endpoint (let’s call it `/api/`). If you want a GraphQL query to be cachable, you need the query and its parameters to appear in the URL, like `/api/?query={user{id}}&variables={"x":99}` (ignoring URL escaping). The trick is to configure your GraphQL client to use HTTP GET requests for queries (e.g., [set `useGETForQueries` for `apollo-link-http`][28]). - -Mutations mustn’t be cached, so they still need to use HTTP POST requests. With POST requests, caches only see `/api/` as the URL, but they will refuse to cache POST requests outright anyway. Remember: GET for non-mutating queries, POST for mutations. There’s a case where you might want to avoid GET for a query: if the query variables contain sensitive information. URLs have a habit of appearing in log files, browser history and chat channels, so sensitive information in URLs is usually a bad idea. Things like authentication should be done as non-cachable mutations, anyway, so this is a rare case, but one worth remembering. - -Unfortunately, there’s a problem: GraphQL queries tend to be much larger than REST API URLs. If you simply switch on GET-based queries, you’ll get some pretty big URLs, easily bigger than the ~2000 byte limit before some popular browsers and servers just won’t accept them. 
A solution is to send some kind of query ID, instead of sending the whole query. (I.e., something like `/api/?queryId=42&variables={"x":99}`.) Apollo GraphQL server supports two ways of doing this. - -One way is to [extract all the GraphQL queries from the code and build a lookup table that’s shared serverside and clientside][29]. One downside is that it makes the build process more complicated. Another downside is that it couples the client project to the server project, which goes against a selling point of GraphQL. Yet another downside is that version X of your code might recognise a different set of queries from version Y of your code. This is a problem because 1) your replicated app will serve multiple versions during an update rollout, or rollback, and 2) clients might use cached JavaScript, even as you upgrade or downgrade the server. - -Another way is what Apollo GraphQL calls [Automatic Persisted Queries (APQs)][30]. With APQs, the query ID is a hash of the query. The client optimistically makes a request to the server, referring to the query by hash. If the server doesn’t recognise the query, the client sends the full query in a POST request. The server stores that query by hash so that it can be recognised in future. - -![][31] - -#### HTTP caching and Keystone 5 - -As above, Vocal uses Keystone 5 for generating its GraphQL API, and Keystone 5 works with Apollo GraphQL server. How do we actually set the caching headers? - -Apollo supports cache hints on GraphQL schemas. The neat thing is that Apollo gathers all the hints for everything that’s touched by a query, and then it automatically calculates the appropriate overall cache header values. For example, take this query: - -``` -query userAvatarUrl { - authenticatedUser { - name - avatar_url - } -} -``` - -If `name` has a max age of one day, and the `avatar_url` has a max age of one hour, the overall cache max age would be the minimum, one hour. 
`authenticatedUser` depends on the login cookie, so it needs a `private` hint, which overrides the `public` on the other fields, so the resulting header would be `cache-control: max-age=3600, private`. - -I added [cache hint support to Keystone lists and fields][32]. Here’s a simple example of adding a cache hint to a field in the to-do list demo from the docs: - -``` -const keystone = new Keystone({ - name: 'Keystone To-Do List', - adapter: new MongooseAdapter(), -}); - -keystone.createList('Todo', { - schemaDoc: 'A list of things which need to be done', - fields: { - name: { - type: Text, - schemaDoc: 'This is the thing you need to do', - isRequired: true, - cacheHint: { - scope: 'PUBLIC', - maxAge: 3600, - }, - }, - }, -}); -``` - -#### One more problem: CORS - -Cross-Origin Resource Sharing (CORS) rules create a frustrating conflict with caching in an API-based website. - -Before getting stuck into the problem details, let me jump to the easiest solution: putting the main site and API onto one domain. If your site and API are served from one domain, you won’t have to worry about CORS rules (but you might want to consider [restricting cookies][33]). If your API is specifically for the website, this is the cleanest solution, and you can happily skip this section. - -In Vocal V1, the Website (Next.js) and Platform (Keystone GraphQL) apps were on different domains (`vocal.media` and `api.vocal.media`). To protect users from malicious websites, modern browsers don’t just let one website interact with another. So, before allowing `vocal.media` to make requests to `api.vocal.media`, the browser would make a “pre-flight” check to `api.vocal.media`. This is an HTTP request using the `OPTIONS` method that essentially asks if the cross-origin sharing of resources is okay. After getting the okay from the pre-flight check, the browser makes the normal request that was originally intended. - -The frustrating thing about pre-flight checks is that they are per-URL. 
The browser makes a new `OPTIONS` request for each URL, and the server response applies to that URL. [The server can’t say that `vocal.media` is a trusted origin for all `api.vocal.media` requests][34]. This wasn’t a serious problem when everything was a POST request to the one api endpoint, but after giving every query its own GET-able URL, every query got delayed by a pre-flight check. For extra frustration, the HTTP spec says `OPTIONS` requests can’t be cached, so you can find that all your GraphQL data is beautifully cached in a CDN right next to the user, but browsers still have to make pre-flight requests all the way to the origin server every time they use it. - -There are a few solutions (if you can’t just use a shared domain). - -If your API is simple enough, you might be able to exploit the [exceptions to the CORS rules][35]. - -Some cache servers can be configured to ignore the HTTP spec and cache `OPTIONS` requests anyway (e.g., Varnish-based caches and AWS CloudFront). This isn’t as efficient as avoiding the pre-flight requests completely, but it’s better than the default. - -Another (really hacky) option is [JSONP][36]. Beware: you can create security bugs if you don’t get this right. - -#### Making Vocal more cachable - -After making HTTP caching work at the low level, I needed to make the app take better advantage of it. - -A limitation of HTTP caching is that it’s all-or-nothing at the response level. Most of a response can be cachable, but if a single byte isn’t, all bets are off. As a blogging platform, most Vocal data is highly cachable, but in the old site almost no _pages_ were cachable at all because of a menu bar in the top right corner. For an anonymous user, the menu bar would show links inviting the user to log in or create an account. That bar would change to a user avatar and profile menu for signed-in users. Because the page varied based on user login status, it wasn’t possible to cache any of it in CDNs. 
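
The all-or-nothing behaviour described above can be sketched as a small function. This is only an illustration, not code from Vocal: the component objects and the `dependsOnLogin` and `maxAge` fields are hypothetical names invented for the example.

```javascript
// Hypothetical model: each rendered component declares whether it depends
// on the logged-in user and for how many seconds its content may be cached.
function pageCacheControl(components) {
  const personalised = components.some((c) => c.dependsOnLogin);
  const maxAge = Math.min(...components.map((c) => c.maxAge));
  // A single personalised component forces the whole response to be
  // private, so CDNs and other public caches may not store any of it.
  return `max-age=${maxAge}, ${personalised ? 'private' : 'public'}`;
}

// Illustrative components: the article body is generic, the menu is not.
const article = { name: 'article', dependsOnLogin: false, maxAge: 3600 };
const loginMenu = { name: 'loginMenu', dependsOnLogin: true, maxAge: 3600 };
```

With only `article` on the page the result is a `public` policy, but adding `loginMenu` turns the whole page `private`, which is the situation the old site was in.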
- -![A typical page from Vocal. Most of the page is highly cachable content, but in the old site none of it was actually cachable because of a little menu in the top right corner.][37] - -These pages are generated by Server-Side Rendering (SSR) of React components. The fix was to take all the React components that depended on the login cookie, and force them to be [lazily rendered clientside only][38]. Now the server returns completely generic pages with placeholders for personalised components like the login menu bar. When a page loads in the user’s browser, these placeholders are filled in clientside by making calls to the GraphQL API. The generic pages can be safely cached in CDNs. - -Not only does this trick improve cache hit ratios, it helps improve perceived page load time thanks to human psychology. Blank screens and even spinner animations make us impatient, but once the first content appears, it distracts us for several hundred milliseconds. If people click a Vocal post link from social media and the main content appears immediately from a CDN, very few will ever notice that some components aren’t fully interactive until a few hundred milliseconds later. - -By the way, another trick for getting the first content in front of the user faster is to [stream render the SSR response as it’s generated][39], instead of waiting for the whole page to be rendered before sending it. Unfortunately, [Next.js doesn’t support that yet][40]. - -The idea of splitting responses for improved cachability also applies to GraphQL. The ability to query multiple pieces of data with one request is normally an advantage of GraphQL, but if the different parts of the response have very different cachability, it can be better overall to split them. As a simple example, Vocal’s pagination component needs to know the number of pages plus the content for the current page. 
Originally the component fetched both in one query, but because the total number of pages is a constant across all pages, I made it a separate query so it can be cached. - -#### Benefits of caching - -The obvious benefit of caching is that it reduces the load on Vocal’s backend servers. That’s good, but it’s dangerous to rely on caching for capacity, though, because you still need a backup plan for when you inevitably drop the cache one day. - -The improved responsiveness is a better reason for caching. - -A couple of other benefits might be less obvious. Traffic spikes tend to be highly localised. If someone with a lot of social media followers shares a link to a page, Vocal will get a big surge of traffic, but mostly to that one page and its assets. That’s why caches are good at absorbing the worst traffic spikes, making the backend traffic patterns relatively smoother and easier for autoscaling to handle. - -Another benefit is graceful degradation. Even if the backends are in serious trouble for some reason, the most popular parts of the site will still be served from the CDN cache. - -### Other performance tweaks - -As I always say, the secret to scaling isn’t making things complicated. It’s making things no more complicated than needed, and then thoroughly fixing all the things that prevent scaling. Scaling Vocal involved a lot of little things that won’t fit in this post. - -Here’s one tip: for the difficult debugging problems in distributed systems, the hardest part is usually getting the right data to see what’s going on. I can think of plenty of times that I’ve got stuck and tried to just “wing it” by guessing instead of figuring out how to find the right data. Sometimes that works, but not for the hard problems. - -A related tip is that you can learn a lot by getting real-time data (even just log files under [`tail -F`][41]) on each component in a system, displaying it in various windows in one monitor, and just clicking around the site in another. 
I’m talking about things like, “Hey, why does toggling this one checkbox generate dozens of DB queries in the backend?” - -Here’s an example of one fix. Some pages were taking more than a couple of seconds to render, but only in the deployment environment, and only with SSR. The monitoring dashboards didn’t show any CPU usage spikes, and the apps weren’t using disk, so it suggested that maybe the app was waiting on network requests, probably to a backend. In a dev environment I could watch how the app worked using [the sysstat tools][42] to record CPU/RAM/disk usage, along with Postgres statement logging and the usual app logs. [Node.js supports probes for tracing HTTP requests][43] using something like [bpftrace][44], but boring reasons meant they didn’t work in the dev environment, so instead I found the probes in the source code and made a custom Node.js build with request logging. I used [tcpdump][45] to record network data. That let me find the problem: for every API request made by Website, a new network connection was being created to Platform. (If that hadn’t worked, I guess I would have added request tracing to the apps.) - -Network connections are fast on a local machine, but take non-negligible time on a real network. Setting up an encrypted connection (like in the production environment) takes even longer. If you’re making lots of requests to one server (like an API), it’s important to keep the connection open and reuse it. Browsers do that automatically, but Node.js doesn’t by default because it can’t know if you’re making more requests. That’s why the problem only appeared with SSR. Like many long debugging sessions, the fix was very simple: just configure SSR to [keep connections alive][46]. The rendering time of the slower pages dropped dramatically. 
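
As a rough sketch of what keeping connections alive can look like, the snippet below uses Node's built-in `http.Agent` and `https.Agent` with `keepAlive: true` rather than the `agentkeepalive` package linked above. The `agentFor` and `getWithKeepAlive` helpers and the `maxSockets` value are assumptions made for illustration, not Vocal's actual configuration, and more recent Node.js releases may already enable keep-alive on the default agent, so check your version.

```javascript
const http = require('http');
const https = require('https');

// Shared agents: keepAlive tells Node.js to keep idle sockets open and
// reuse them for later requests to the same host.
// maxSockets is an illustrative cap, not a value from the original post.
const agentOptions = { keepAlive: true, maxSockets: 50 };
const httpAgent = new http.Agent(agentOptions);
const httpsAgent = new https.Agent(agentOptions);

// Pick the shared agent that matches the URL's protocol.
function agentFor(url) {
  return new URL(url).protocol === 'https:' ? httpsAgent : httpAgent;
}

// Hypothetical helper for the SSR side: issue a GET through the shared
// agent so that repeated API calls can reuse one connection.
function getWithKeepAlive(url, onResponse) {
  const client = new URL(url).protocol === 'https:' ? https : http;
  return client.get(url, { agent: agentFor(url) }, onResponse);
}
```

The important design point is that the agents are created once and shared; creating a new agent per request would bring back the new-connection-per-request behaviour. How the agent gets passed to a GraphQL client depends on the HTTP library the SSR code uses.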
- -If you want to know more about this kind of stuff, I highly recommend reading [the High Performance Browser Networking book][47] (free to read online) and following up with [guides Brendan Gregg has published][48]. - -### What about your site? - -There’s actually a lot more stuff we could have done to improve Vocal, but we didn’t do it all. That’s a big difference between doing SRE work for a startup and doing it for a big company as a permanent employee. We had goals, a budget and a launch date, and now Vocal V2 has been running for 9 months with a healthy growth rate. - -Similarly, your site will have its own requirements, and is likely quite different from Vocal. However, I hope this post and its links give you at least some useful ideas to make something better for users. - --------------------------------------------------------------------------------- - -via: https://theartofmachinery.com/2020/06/29/scaling_a_graphql_site.html - -作者:[Simon Arneaud][a] -选题:[lujun9972][b] -译者:[译者ID](https://github.com/译者ID) -校对:[校对者ID](https://github.com/校对者ID) - -本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出 - -[a]: https://theartofmachinery.com -[b]: https://github.com/lujun9972 -[1]: https://vocal.media -[2]: https://www.meetup.com/en-AU/GraphQL-Sydney/events/267681845/ -[3]: https://theartofmachinery.com/2020/04/21/what_is_high_traffic.html -[4]: https://theartofmachinery.com/images/scaling_a_graphql_site/vocal1.png -[5]: https://theartofmachinery.com/images/scaling_a_graphql_site/vocal2.png -[6]: https://jerrick.media -[7]: https://www.thinkmill.com.au/ -[8]: https://theartofmachinery.com/images/scaling_a_graphql_site/alexa.png -[9]: https://aws.amazon.com/blogs/database/amazon-rds-customers-update-your-ssl-tls-certificates-by-february-5-2020/ -[10]: https://github.com/vercel/next.js -[11]: https://www.keystonejs.com/ -[12]: https://wiki.c2.com/?SecondSystemEffect -[13]: https://vocal.media/resources/vocal-2-0 -[14]: 
https://theartofmachinery.com/about.html -[15]: https://en.wikipedia.org/wiki/Extract,_transform,_load -[16]: https://github.com/stripe/mosql -[17]: https://use-the-index-luke.com/ -[18]: https://modern-sql.com/ -[19]: https://theartofmachinery.com/images/scaling_a_graphql_site/architecture.svg -[20]: https://aws.amazon.com/fargate/ -[21]: https://www.terraform.io/docs/providers/aws/r/ecs_task_definition.html -[22]: https://12factor.net/ -[23]: https://theartofmachinery.com/2019/02/16/talks.html -[24]: https://12factor.net/dev-prod-parity -[25]: https://www.meetup.com/en-AU/Port80-Sydney/events/lwcdjlyvjblb/ -[26]: https://theartofmachinery.com/2016/07/30/server_caching_architectures.html -[27]: https://theartofmachinery.com/images/scaling_a_graphql_site/http_caches.svg -[28]: https://www.apollographql.com/docs/link/links/http/#options -[29]: https://www.apollographql.com/blog/persisted-graphql-queries-with-apollo-client-119fd7e6bba5 -[30]: https://www.apollographql.com/blog/improve-graphql-performance-with-automatic-persisted-queries-c31d27b8e6ea -[31]: https://theartofmachinery.com/images/scaling_a_graphql_site/apq.png -[32]: https://www.keystonejs.com/api/create-list/#cachehint -[33]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Cookies#Define_where_cookies_are_sent -[34]: https://lists.w3.org/Archives/Public/public-webapps/2012AprJun/0236.html -[35]: https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS#Simple_requests -[36]: https://en.wikipedia.org/wiki/JSONP -[37]: https://theartofmachinery.com/images/scaling_a_graphql_site/cachablepage.png -[38]: https://nextjs.org/docs/advanced-features/dynamic-import#with-no-ssr -[39]: https://medium.com/the-thinkmill/progressive-rendering-the-key-to-faster-web-ebfbbece41a4 -[40]: https://github.com/vercel/next.js/issues/1209 -[41]: https://linux.die.net/man/1/tail -[42]: https://github.com/sysstat/sysstat/ -[43]: http://www.brendangregg.com/blog/2016-10-12/linux-bcc-nodejs-usdt.html -[44]: 
https://theartofmachinery.com/2019/04/26/bpftrace_d_gc.html -[45]: https://danielmiessler.com/study/tcpdump/ -[46]: https://www.npmjs.com/package/agentkeepalive -[47]: https://hpbn.co/ -[48]: http://www.brendangregg.com/ diff --git a/translated/tech/20200629 Scaling a GraphQL Website.md b/translated/tech/20200629 Scaling a GraphQL Website.md new file mode 100644 index 0000000000..661e72abba --- /dev/null +++ b/translated/tech/20200629 Scaling a GraphQL Website.md @@ -0,0 +1,332 @@ +[#]: collector: "lujun9972" +[#]: translator: "MjSeven" +[#]: reviewer: " " +[#]: publisher: " " +[#]: url: " " +[#]: subject: "Scaling a GraphQL Website" +[#]: via: "https://theartofmachinery.com/2020/06/29/scaling_a_graphql_site.html" +[#]: author: "Simon Arneaud https://theartofmachinery.com" + +扩展一个 GraphQL 网站 +====== + +出于显而易见的原因,我通常只会抽象地谈论我为他人所做的工作,但这次我获准谈论一个网站:[Vocal][1],我去年为它做了一些 SRE 工作。实际上早在 2 月份,我就在 [GraphQL Sydney 做过一次演讲][2],但由于种种原因,这篇博客推迟到现在才发表。 + +Vocal 是一个基于 GraphQL 的站点,它获得了关注,然后遇到了可扩展性问题,而我是来解决这个问题的。这篇文章会讲述我的做法。显然,如果你正在扩展一个 GraphQL 站点,你会发现这篇文章很有用,但其中大部分内容都是任何站点在流量大到引发技术问题时必须面对的典型问题。如果你对站点可扩展性有兴趣,你可能想先阅读[我最近发表的那篇关于可扩展性的文章][3]。 + +### Vocal + +![][4] + +Vocal 是一个博客内容平台,内容包括日记、电影评论、观点文章、食谱、专业或业余摄影、美容和生活小贴士以及诗歌,当然,还有可爱的猫和狗狗照片。 + +![][5] + +Vocal 的不同之处在于,它允许人们制作观众感兴趣的作品而获得报酬。作者的页面每次被浏览都可以获得少量资金,也可以获得其他用户的捐赠。有很多专业人士在这个平台上展示他们的作品,但对于大多数普通用户来说,他们只是把 Vocal 当作一个有趣的爱好,碰巧还能赚些零花钱作为奖励。 + +Vocal 是新泽西初创公司 [Jerrick Media][6] 的产品,更新:Jerrick Media 已经更名为 Creatd,在纳斯达克上市。2015 年,他们与 [Thinkmill][7] 合作一起开发,Thinkmill 是一家悉尼中型软件开发咨询公司,擅长 JavaScript、React 和 GraphQL。 + +### 剧透 + +不幸的是,有人告诉我,由于法律原因,我不能提供具体的流量数字,但公开的信息可以说明。Alexa 对所有网站按照流量进行排名。这是我演讲中展示的 Alexa 排名图,从 2019 年 11 月到今年 2 月,Vocal 流量增长到全球排名第 5567 位。 + +![去年 11 月到今年 2 月 Vocal 的全球排名从 9574 名增长到 5567 名][8] + +曲线增长变慢是正常的,因为它需要越来越多的流量来赢得每个位置。Vocal 现在排名 4900 名左右,显然还有很长的路要走,但对于一家初创公司来说,这一点也不寒酸。大多数初创公司很乐意与 Vocal 互换排名。 + +在网站升级后不久,Creatd 开展了一项营销活动,使流量翻了一番。在技术方面,我们要做的就是观察仪表盘上的上升的数字。自发布以来的 9 个月里,只有两个平台问题需要员工干预:[3 月份每五年一次的 AWS RDS 证书轮换][9],以及一次应用发布时遇到的 
Terraform 错误。作为一名 SRE,我很高兴看到 Vocal 不需要太多的平台工作来保持运行。更新:该系统也处理了 2020 年的美国大选,没有任何意外。 + +以下是本文技术内容的概述: + + * 技术和历史背景 + * 从 MongoDB 迁移到 Postgres + * 部署基础设施改造 + * 应用程序兼容扩展措施 + * HTTP 缓存 + * 其他一些性能调整 + +### 背景 + +Thinkmill 使用 [Next.js][10](一个基于 React 的 Web 框架)构建了一个网站,和 [Keystone][11] 提供的 GraphQL API 进行交互。Keystone 是一个基于 GraphQL 的无头 CMS 库:在 JavaScript 中定义一个模式,将它与一些数据存储挂钩,并获得一个自动生成的 GraphQL API 用于数据访问。这是一个免费的开源软件项目,由 Thinkmill 提供商业支持。 + +#### Vocal V2 + +Vocal 的第一版就受到了关注,它找到了一个喜欢它的用户群,并不断壮大,最终 Creatd 请求 Thinkmill 帮助开发 V2,并于去年 9 月成功推出。Creatd 员工通常基于用户反馈来避免[第二个系统效应][12],他们[主要是 UI 和功能更改,我就不赘述了][13]。相反,我将讨论下我的工作内容:使新站点更加健壮和可扩展。 + +声明:我很感谢能与 Creatd 以及 Thinkmill 在 Vocal 中的合作,并且他们允许我发表这个故事,但[我仍然是一名独立顾问][14],我没有报酬,甚至没有被要求写这篇文章,这仍然是我自己的个人博客。 + +### 迁移数据库 + +Thinkmill 在使用 MongoDB 时遇到了几个可扩展性问题,因此决定升级到 Keystone 5 以利用其新的 Postgres 支持。 + +如果你从事技术工作的时间足够长,那你可能还记得 00 年代末的 “NoSQL” 营销,这可能听起来很有趣。NoSQL 的一大卖点是,像 Postgres 这样的关系数据库(SQL)不像 MongoDB 这样“网络规模”的 NoSQL 数据库那样具有可扩展性。从技术上讲,这种说法是正确的,但 NoSQL 数据库的可扩展性来自在可高效处理的查询种类上所做的折衷。简单的非关系数据库(如文档和键值数据库)有其一席之地,但当它们用作应用的通用后端时,应用程序通常会在超出理论扩展限制之前超出数据库的查询限制。Vocal 的大多数数据库查询在 MongoDB 上运行良好,但随着时间推移,越来越多的查询需要特殊技巧才能工作。 + +在技术要求方面,Vocal 与维基百科非常相似。维基百科是世界上最大的网站之一,它在 MySQL(或者说它的分支 MariaDB)上运行。当然,这需要一些重要的工程来完成这项工作,但在可预见的未来,我认为关系数据库不会对 Vocal 的扩展构成严重威胁。 + +我做过一个比较,托管 AWS RDS Postgres 实例的成本不到旧 MongoDB 实例的五分之一,但 Postgres 实例的 CPU 使用率仍然低于 10%,尽管它提供的流量比旧站点多。这主要是因为一些重要的查询在文档数据库架构下一直效率很低。 + +迁移可以新写一篇博客文章来讲述,但基本上是 Thinkmill 开发人员使用 [MoSQL][16] 构建了一个 [ETL 管道][15] 来完成这项繁重的工作。Keystone 对 Postgres 的支持仍然比较基础,但它是一个 FOSS 项目,所以我能够解决在 SQL 生成性能方面遇到的问题。对于这类事情,我总是推荐 Markus Winand 的 SQL 博客:[Use The Index, Luke][17] 和 [Modern SQL][18]。他的文章很友好,甚至对那些暂时不太关注 SQL 的人来说也是容易理解的,但他拥有你大多数需要的理论知识。如果你仍然有问题,一本好的、专注于 DB 性能的书可以帮助你。 + +### 平台 + +#### 架构 + +V1 是几个 Node.js 应用,运行在 Cloudflare(作为 CDN)背后的单个虚拟专用服务器(VPS)上。避免过度设计作为优先事项,我是这个准则的粉丝。然而,在 V2 开始开发时,很明显,Vocal 已经超越了这个简单的架构。在处理巨大峰值流量时,它没有给 Thinkmill 开发人员提供很多选择,而且它很难在不停机情况下安全部署更新。 + +这是 V2 的新架构: + +![Vocal V2 的技术架构,请求从 CDN 进入,然后经过 AWS 的负载均衡。负载均衡将流量分配到两个应用程序 "Platform" 和 
"Website"。"Platform" 是一款 Keystone 应用程序,将数据存储在 Redis 和 Postgres 中。][19] + +基本上就是两个 Node.js 应用程序复制放在负载均衡器后面,非常简单。有些人认为可扩展架构要比这复杂得多,但是我曾经在一些比 Vocal 规模大得多的网站工作过,它们仍然只是在负载均衡器后面复制服务,带有 DB 后端。你仔细想想,如果平台架构需要随着站点的增长而变得越来越复杂,那么它就不是真正“可扩展的”。网站可扩展性主要是关于解决那些破坏可扩展的实现细节。 + +如果流量增长得足够多,Vocal 的架构可能需要一些补充,但它变得更加复杂的主要原因是新功能。例如,如果(出于某种原因)Vocal 将来需要处理实时地理空间数据,那将是一个与博客文章截然不同的技术,所以我希望对它进行架构上的更改。大型网站架构的复杂性主要来自于复杂的功能。 + +不知道未来的架构应该是什么样子很正常,所以我总是建议你尽可能从简单开始。修复一个简单架构要比复杂架构更容易,成本也更低。此外,不必要的复杂架构更有可能出现错误,而这些错误将更难调试。 + +顺便说一下,Vocal 分成了两个应用程序,但这并不重要。一个常见的扩展错误是,以可扩展的名义过早地将应用分割成更小的服务,但将应用分割在错误的位置,从而导致更多的可扩展性问题。作为一个独立的应用,Vocal 的规模还不错,但它的分割做的也很好。 + +#### 基础设施 + +Thinkmill 有一些人有 AWS 经验,但它主要是一个开发车间,需要一些比之前的 Vocal 部署更容易上手的东西。我最终在 [AWS Fargate][20] 上部署了新的 Vocal,这是一个相对较新的弹性容器服务(ECS)后端。在过去,许多人希望 ECS 作为一个“托管服务运行 Docker 容器”的简单产品,但人们仍然必须构建和管理自己的服务器集群,这让人感到失望。使用 ECS Fargate,AWS 会管理集群。它支持运行带有复制、健康检查、滚动更新、自动缩放和简单警报等基本功能的 Docker 容器。 + +一个好的替代方案是平台即服务(PaaS),比如 App Engine 或 Heroku。Thinkmill 已经在简单的项目中使用它们,但通常在其他项目中需要更大的灵活性。有很多大型网站运行在 PaaS 上,但 Vocal 的规模决定了使用自定义云部署是有经济意义的。 + +另一个明显的替代方案是 Kubernetes。Kubernetes 比 ECS Fargate 拥有更多的功能,但它的成本要高得多 -- 无论是资源开销还是维护所需的人员(例如定期节点升级)。一般来说,我不向任何没有专门 DevOps 员工的地方推荐 Kubernetes。Fargate 具有 Vocal 需要的功能,使得 Thinkmill 和 Creatd 专心于网站改进,而不是忙碌于搭建基础设施。 + +另一种选择是“无服务器”功能产品,例如 AWS Lambda 或 Google Cloud Functions。它们非常适合处理流量很低或很不规则的服务,但是 ECS Fargate 的自动伸缩功能足以满足 Vocal 的后端。这些产品的另一个优点是,它们允许开发人员在云环境中部署东西,但无需了解很多关于云环境的知识。权衡的结果是,无服务器产品与开发过程、测试以及调试过程紧密耦合。Thinkmill 内部已经有足够的 AWS 专业知识来管理 Fargate 的部署,任何知道如何制作 Node.js Express Hello World 应用程序的开发人员都可以在 Vocal 上工作,而无需了解无服务器功能或 Fargate 的知识。 + +ECS Fargate 的一个明显缺点是供应商锁定。但是,避免供应商锁定是一种权衡,就像避免停机一样。如果你担心迁移,那么在平台独立性上花费比迁移本身更多的成本是没有意义的。在 Vocal 中,依赖于 Fargate 的代码总量小于 500 行 [Terraform][21]。最重要的是 Vocal 应用程序代码本身与平台无关,它可以在普通开发人员的机器上运行,然后打包成一个 Docker 容器,几乎可以运行在任何可以运行 Docker 容器的地方,包括 ECS Fargate。 + +Fargate 的另一个缺点是设置复杂。与 AWS 中的大多数东西一样,它处于 VPC、子网、IAM 策略世界中。幸运的是,这类东西是静态的(不像服务器集群一样需要维护)。 + +### 制作一个可扩展的应用程序 + +如果你想轻松地运行一个大规模的应用程序,需要做一大堆正确的事情。如果你遵循[应用程序设计的十二个守则][22],事情会变得容易,所以我不会在这里重复。 + 
+如果员工无法大规模操作一个“可扩展”的系统,那么构建“可扩展”系统就毫无意义 -- 就像在独轮车上安装喷气发动机一样。使 Vocal 可扩展的一个重要部分是搭建 CI/CD 和[基础设施即代码][23]之类的东西。同样,有些部署想法也不值得,因为它们使生产与开发环境相差太大(参阅[应用程序设计守则第十点][24])。生产和开发之间的任何差异都会降低应用程序的开发速度,并且(在实践中)最终可能会导致错误。 + +### 缓存 + +缓存是一个很大的话题 -- 我曾经做过[一个关于 HTTP 缓存的演讲][25],相对比较基础。我将在这里谈论缓存对于 GraphQL 的重要性。 + +首先,一个重要的警告:每当遇到性能问题时,你可能会想:“我可以将这个值放入缓存中吗,以便再次使用时速度更快?”**微基准测试_总是_告诉你:是的。**然而,由于缓存一致性等问题,随处设置缓存往往会使整个系统**变慢**。以下是我对于缓存的检查清单: + + 1. 是否需要通过缓存解决性能问题 + 2. 再认真想想(不依赖缓存的性能优化往往更健壮) + 3. 是否可以通过改进现有的缓存来解决问题 + 4. 如果所有都失败了,那么可以考虑添加新的缓存 + +在 Web 系统中,你经常使用的一个缓存是 HTTP 缓存系统,因此,在添加额外缓存之前,试着使用 HTTP 缓存是一个好主意。我将在这篇文章中重点讨论这一点。 + +另一个常见的陷阱是使用哈希映射或应用程序内部其他东西进行缓存。[它在本地开发中效果很好,但在扩展时表现糟糕。][26]最好的办法是使用显式的缓存库,并支持 Redis 或 Memcached 这样的可插拔后端。 + +#### 基础知识 + +HTTP 规范中有两种类型缓存:私有和公共。私有缓存不会和多个用户共享数据 -- 实际上是用户的浏览器缓存。其余的就是公共缓存。它们包括受你控制的(例如 CDN、Varnish 或 Nginx 等服务器)和不受你控制的(代理)。代理缓存在当今的 HTTPS 世界中很少见,但一些公司网络会有。 + +![][27] + +缓存查找键通常基于 URL,因此如果你遵循“相同的内容,相同的 URL;不同的内容,不同的 URL” 规则,即,给每个页面一个规范的 URL,避免从一个 URL 返回不同的内容这样“聪明”的技巧,缓存就会更可靠。显然,这对 GraphQL API 端点有影响(我将在后面讨论)。 + +你的服务器可以采用自定义配置,但配置 HTTP 缓存的主要方法是在 Web 响应上设置 HTTP 头。最重要的头是 `cache-control`。下面这一行说明所有缓存都可以缓存页面长达 3600 秒(一小时): + +```http +cache-control: max-age=3600, public +``` + +对于有关用户的页面(例如用户设置页面),使用 `private` 而不是 `public` 来告诉公共缓存不要存储响应,防止其提供给其他用户是很重要的。 + +另一个常见的头是 `vary`,它告诉缓存响应基于 URL 之外的一些内容而变化。(实际上,它将 HTTP 头添加到缓存键中,和 URL 一起。)这是一个非常生硬的工具,这就是要尽可能使用良好 URL 结构的原因,但一个重要的示例是告诉浏览器响应取决于登录的 cookie,以便在登录或注销时更新页面。 + +```http +vary: cookie +``` + +如果页面根据登录状态而变化,即使在公开的未登录版本上,你也需要 `cache-control:private` (和 `vary:cookie`),确保响应不会混淆。 + +其他有用的头包括 `etag` 和 `last-modified`,但我不会在这里介绍它们。你可能仍然会看到一些诸如 `expires` 和 `pragma:cache` 这种老旧的 HTTP 头。它们早在 1997 年就被 HTTP/1.1 淘汰了,所以我只在想禁用缓存并且格外谨慎时才使用它们。 + +#### 客户端头 + +鲜为人知的是,HTTP 规范允许在客户端请求中使用 `cache-control` 头以减少缓存时间并获得最新响应。不幸的是,似乎大多数浏览器都不支持大于 0 的 `max-age` ,但如果你有时在更新后需要一个最新响应,`no-cache` 会很有用。 + +#### HTTP 缓存和 GraphQL + +如上,正常的缓存键是 URL。但是 GraphQL API 通常只使用一个端点(让我们称之为 `/api/`)。如果你希望 GraphQL 查询可以缓存,那么查询参数需要出现在 URL 路径中,例如 
`/api/?query={user{id}}&variables={"x":99}`(忽略 URL 转义)。诀窍是将 GraphQL 客户端配置为使用 HTTP GET 请求进行查询(例如,[将 `apollo-link-http` 设置为 `useGETForQueries`][28] )。 + +突变不能缓存,所以它们仍然需要使用 HTTP POST 请求。对于 POST 请求,缓存只会看到 `/api/` 作为 URL 路径,但缓存将拒绝缓存 POST 请求。请记住,GET 用于非突变查询(即幂等),POST 用于突变(即非幂等)。在一种情况下,你可能希望避免使用 GET 查询:如果查询变量包含敏感信息。URL 经常出现在日志文件、浏览器历史记录和聊天中,因此 URL 中包含敏感信息通常不是一个好主意。无论如何,像身份验证这种事情应该作为不可缓存的突变来完成,这是一个特殊的情况,值得记住。 + +不幸的是,有一个问题:GraphQL 查询往往比 REST API URL 大得多。如果你简单地打开基于 GET 的查询,你会得到一些非常长的 URL,很容易超过 2000 字节的限制,目前一些流行的浏览器和服务器还不会接受它们。一种解决方案是发送某种查询 ID,而不是发送整个查询,即类似于 `/api/?queryId=42&variables={"x":99}`。Apollo GraphQL 服务器支持两种实现方式。 + +一种方法是[从代码中提取所有 GraphQL 查询并构建一个服务器端和客户端共享的查找表][29]。缺点之一是这会使构建过程更加复杂,另一个缺点是它将客户端项目与服务器项目耦合,这与 GraphQL 的卖点背道而驰。还有一个缺点是 X 版本和 Y 版本的代码可能识别出不同的查询集合,这会成为一个问题,因为 1:复制的应用程序将在更新推出或回滚期间提供多个版本,2:客户端可能会使用缓存的 JavaScript,即使你升级或降级服务器。 + +另一种方式是 Apollo GraphQL 所宣称的 [自动持久查询(APQ)][30]。对于 APQ 而言,查询 ID 是查询的哈希值。客户端向服务器发出请求,通过哈希查询。如果服务器无法识别该查询,则客户端会在 POST 请求中发送完整的查询,服务器会保存此次查询的散列值,以便下次识别。 + +![][31] + +#### HTTP 缓存和 Keystone 5 + +如上所述,Vocal 使用 Keystone 5 生成 GraphQL API,Keystone 5 和 Apollo GraphQL 服务器配合一起工作。那么我们是如何设置缓存头的呢? 
+Apollo 支持 GraphQL 模式的缓存提示。巧妙的是,Apollo 会收集查询涉及的所有内容的所有提示,然后它会自动计算适当的缓存头值。例如,以这个查询为例: + +``` +query userAvatarUrl { + authenticatedUser { + name + avatar_url + } +} +``` + +如果 `name` 的最长期限为 1 天,而 `avatar_url` 的最长期限为 1 小时,则整体缓存最长期限将是最小值,即 1 小时。`authenticatedUser` 取决于登录 cookie,因此它需要一个 `private` 提示,它会覆盖其他字段的 `public`,因此生成的 HTTP 头将是 `cache-control:max-age=3600,private`。 + +我[为 Keystone 的列表和字段添加了缓存提示支持][32]。以下是一个简单例子,在文档的待办列表演示中给一个字段添加缓存提示: + +``` +const keystone = new Keystone({ + name: 'Keystone To-Do List', + adapter: new MongooseAdapter(), +}); + +keystone.createList('Todo', { + schemaDoc: 'A list of things which need to be done', + fields: { + name: { + type: Text, + schemaDoc: 'This is the thing you need to do', + isRequired: true, + cacheHint: { + scope: 'PUBLIC', + maxAge: 3600, + }, + }, + }, +}); +``` + +#### 另一个问题:CORS + +跨域资源共享(CORS)规则会与基于 API 网站中的缓存产生冲突。 + +在深入问题细节之前,让我们跳到最简单的解决方案:将主站点和 API 放在一个域中。如果你的站点和 API 位于同一个域,就不必担心 CORS 规则(但你可能需要考虑[限制 cookie][33])。如果你的 API 专门用于网站,这是最简单的解决方案,你可以愉快地跳过这一节。 + +在 Vocal V1 中,网站(Next.js)和平台(Keystone GraphQL)应用程序处于不同域(`vocal.media` 和 `api.vocal.media`)。为了保护用户免受恶意网站的侵害,现代浏览器不会让一个网站与另一个网站进行交互。因此,在允许 `vocal.media` 向 `api.vocal.media` 发出请求之前,浏览器会对 `api.vocal.media` 进行“预检”。这是一个使用 `OPTIONS` 方法的 HTTP 请求,主要是询问跨域资源共享是否允许。预检通过后,浏览器会发出最初的正常请求。 + +令人沮丧的是,预检是针对每个 URL 的。浏览器为每个 URL 发出一个新的 `OPTIONS` 请求,服务器每次都会响应。[服务器没法说 `vocal.media` 是所有 `api.vocal.media` 请求的可信来源][34] 。当所有内容都是对一个 API 端点的 POST 请求时,这个问题并不严重,但是在为每个查询提供 GET URL 后,每个查询都因预检而延迟。更令人沮丧的是,HTTP 规范说 `OPTIONS` 请求不能被缓存,所以你会发现你所有的 GraphQL 数据都被完美地缓存在用户身旁的 CDN 中,但浏览器仍然必须发出所有的预检请求。 + +如果你不能使用共享域,有几种解决方案。 + +如果你的 API 足够简单,你或许可以利用 [CORS 规则的例外][35]。 + +某些缓存服务器可以配置为忽略 HTTP 规范,任何情况都会缓存 `OPTIONS` 请求。例如,基于 Varnish 的缓存和 AWS CloudFront。这不如完全避免预检那么有效,但比默认的要好。 + +另一个选项是 [JSONP][36],很巧妙。当心:如果做错了,那么可能会创建安全漏洞。 + +#### Vocal 更好地利用缓存 + +HTTP 缓存在底层工作之后,我需要让应用程序更好地利用它。 + +HTTP 缓存的一个限制是它在响应级别上是全部或没有。大多数响应都可以缓存,但如果一个字节不能缓存,那整个页面就无法缓存。作为一个博客平台,大多数 Vocal 
数据都是可缓存的,但在旧网站上,由于右上角的菜单栏,几乎没有页面可以缓存。对于匿名用户,菜单栏将显示邀请用户登录或创建账号的链接。对于已登录用户,它会变成用户头像和用户个人资料菜单,因为页面会根据用户登录状态而变化,所以无法在 CDN 中缓存任何页面。 + +![A typical page from Vocal. Most of the page is highly cachable content, but in the old site none of it was actually cachable because of a little menu in the top right corner.][37] + +这些页面是由 React 组件的服务器端渲染(SSR)生成的。解决方法是将所有依赖于登录 cookie 的 React 组件强制设置为[仅在客户端延迟渲染][38],现在,服务器会返回完全通用的页面,其中包含用于个性化组件(如登录菜单栏)的占位符。当页面在浏览器中加载时,这些占位符将通过调用 GraphQL API 在客户端填充。通用页面可以安全地缓存到 CDN 中。 + +这一技巧不仅提高了缓存命中率,还帮助改善了人们感知的页面加载时间。空白屏幕和旋转动画让我们不耐烦,但一旦第一个内容出现,它会分散我们几百毫秒的注意力。如果人们在社交媒体上点击一个 Vocal 帖子的链接,主要内容就会立即从 CDN 中出现,很少有人会注意到,有些组件直到几百毫秒后才会完全出现。 + +顺便说一下,另一个让用户更快地看到第一个内容的技巧是[流渲染][39],而不是等待整个页面渲染完成后再发送。不幸的是,[Next.js 还不支持这个功能][40]。 + +拆分响应来提高可缓存性也适用于 GraphQL。通过一个请求查询多个数据片段的能力通常是 GraphQL 的一个优势,但如果响应的不同部分具有不同的缓存,那么最好将它们分开。举个简单的例子,Vocal 的分页组件需要知道总页数和当前页面的内容。最初,组件在一个查询中同时获取这两项数据,但由于页面的总数对所有页面来说是一个常量,所以我将其设置为一个单独的查询,以便缓存它。 + +#### 缓存的好处 + +缓存的明显好处是它减轻了 Vocal 后端服务器的负载。效果很好,但是依赖缓存来获得容量是危险的,因为当有一天你不可避免地放弃缓存时,你仍然需要一个备份计划。 + +提高页面响应速度而使用缓存是一个好理由。 + +其他一些好处可能不那么明显。峰值流量往往是高度本地化的。如果一个有很多社交媒体粉丝的人分享了一个页面的链接,那么 Vocal 的流量就会大幅上升,但主要是指向分享的那个页面及其资产。这就是为什么缓存擅长吸收最糟糕的流量峰值,它使后端流量模式相对更平滑,更容易自动伸缩处理。 + +另一个好处是优雅降级。即使后端由于某些原因出现了严重的问题,站点最受欢迎的部分仍然可以通过 CDN 缓存来提供服务。 + +### 其他的性能调整 + +正如我常说的,可扩展的秘诀不是让事情变得更复杂。它只是让事情变得不比需要的更复杂,然后彻底解决所有防止扩展的东西。扩展 Vocal 涉及到许多不适合这篇文章的小事情。 + +一个经验:对于分布式系统中难以调试的问题,最困难的部分通常是获取正确的数据,从而了解发生的原因。我能想到很多时候,我被困住了,只能靠猜测来“即兴发挥”,而不是找出如何找到正确的数据。有时这行得通,但对复杂的问题却不行。 + +一个相关技巧是,你可以通过获取系统中每个组件的实时数据(甚至只是 [`tail -F`][41] 日志),在不同的窗口中显示,然后在另一个窗口中单击网站来了解很多信息。比如:“为什么切换这个复选框会在后台产生这么多的 DB 查询?” + +这里有个例子。有些页面需要几秒钟以上的时间来呈现,但这个情况只会在部署环境中使用 SSR 时会出现。监控仪表盘没有显示任何 CPU 使用量峰值,应用程序也没有使用磁盘,所以这表明应用程序可能正在等待网络请求,可能是后端请求。在开发环境中,我可以使用 [sysstat 工具][42]来记录 CPU、RAM、磁盘使用情况,以及 Postgres 语句日志和正常的应用日志来观察应用程序是如何工作的。[Node.js 支持探测跟踪 HTTP 请求][43],比如使用 [bpftrace][44],但它们不能在开发环境中工作,所以我在源代码中找到了探测功能,并创建了一个带有请求日志的 Node.js 版本。我使用 [tcpdump][45] 记录网络数据,这让我找到了问题所在:对于网站发出的每一个 API 请求,都要创建一个新的网络连接到平台。如果这都不起作用,我想我会在应用程序中添加请求跟踪。 + 
+网络连接在本地机器上建立得很快,但在真实网络上耗时不可忽略。设置加密连接(比如在生产环境中)需要更长时间。如果你向一个服务器(比如一个 API)发出大量请求,保持连接打开并重用它是很重要的。浏览器会自动这么做,但 Node.js 默认不会,因为它不知道你是否还会发出更多请求,所以这个问题只出现在 SSR 上。与许多长时间的调试会话一样,修复非常简单:只需将 SSR 配置为 [保持连接存活][46],这样会使页面的呈现时间大幅下降。 + +如果你想了解更多这方面的知识,我强烈建议你阅读[高性能浏览器网络][47]这本书(免费在线阅读),并接着阅读 [Brendan Gregg 发布的指南][48]。 + +### 你的站点? + +实际上,我们还可以做很多事情来提升 Vocal 的速度,但我们没有全做。在初创公司和在大公司身为一个固定员工做 SRE 工作还是有很大区别的。我们的目标、预算和发布日期都很紧张,但最终我们的网站得到了很大改善,给了用户他们想要的东西。 + +同样的,你的站点有它自己的目标,并且可能与 Vocal 有很大的不同。然而,我希望这篇文章和它的链接至少能给你一些有用的想法,为用户创造更好的东西。 + +-------------------------------------------------------------------------------- + +via: https://theartofmachinery.com/2020/06/29/scaling_a_graphql_site.html + +作者:[Simon Arneaud][a] +选题:[lujun9972][b] +译者:[MjSeven](https://github.com/MjSeven) +校对:[校对者ID](https://github.com/校对者ID) + +本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出 + +[a]: https://theartofmachinery.com +[b]: https://github.com/lujun9972 +[1]: https://vocal.media +[2]: https://www.meetup.com/en-AU/GraphQL-Sydney/events/267681845/ +[3]: https://theartofmachinery.com/2020/04/21/what_is_high_traffic.html +[4]: https://theartofmachinery.com/images/scaling_a_graphql_site/vocal1.png +[5]: https://theartofmachinery.com/images/scaling_a_graphql_site/vocal2.png +[6]: https://jerrick.media +[7]: https://www.thinkmill.com.au/ +[8]: https://theartofmachinery.com/images/scaling_a_graphql_site/alexa.png +[9]: https://aws.amazon.com/blogs/database/amazon-rds-customers-update-your-ssl-tls-certificates-by-february-5-2020/ +[10]: https://github.com/vercel/next.js +[11]: https://www.keystonejs.com/ +[12]: https://wiki.c2.com/?SecondSystemEffect +[13]: https://vocal.media/resources/vocal-2-0 +[14]: https://theartofmachinery.com/about.html +[15]: https://en.wikipedia.org/wiki/Extract,_transform,_load +[16]: https://github.com/stripe/mosql +[17]: https://use-the-index-luke.com/ +[18]: https://modern-sql.com/ +[19]: 
https://theartofmachinery.com/images/scaling_a_graphql_site/architecture.svg +[20]: https://aws.amazon.com/fargate/ +[21]: https://www.terraform.io/docs/providers/aws/r/ecs_task_definition.html +[22]: https://12factor.net/ +[23]: https://theartofmachinery.com/2019/02/16/talks.html +[24]: https://12factor.net/dev-prod-parity +[25]: https://www.meetup.com/en-AU/Port80-Sydney/events/lwcdjlyvjblb/ +[26]: https://theartofmachinery.com/2016/07/30/server_caching_architectures.html +[27]: https://theartofmachinery.com/images/scaling_a_graphql_site/http_caches.svg +[28]: https://www.apollographql.com/docs/link/links/http/#options +[29]: https://www.apollographql.com/blog/persisted-graphql-queries-with-apollo-client-119fd7e6bba5 +[30]: https://www.apollographql.com/blog/improve-graphql-performance-with-automatic-persisted-queries-c31d27b8e6ea +[31]: https://theartofmachinery.com/images/scaling_a_graphql_site/apq.png +[32]: https://www.keystonejs.com/api/create-list/#cachehint +[33]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Cookies#Define_where_cookies_are_sent +[34]: https://lists.w3.org/Archives/Public/public-webapps/2012AprJun/0236.html +[35]: https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS#Simple_requests +[36]: https://en.wikipedia.org/wiki/JSONP +[37]: https://theartofmachinery.com/images/scaling_a_graphql_site/cachablepage.png +[38]: https://nextjs.org/docs/advanced-features/dynamic-import#with-no-ssr +[39]: https://medium.com/the-thinkmill/progressive-rendering-the-key-to-faster-web-ebfbbece41a4 +[40]: https://github.com/vercel/next.js/issues/1209 +[41]: https://linux.die.net/man/1/tail +[42]: https://github.com/sysstat/sysstat/ +[43]: http://www.brendangregg.com/blog/2016-10-12/linux-bcc-nodejs-usdt.html +[44]: https://theartofmachinery.com/2019/04/26/bpftrace_d_gc.html +[45]: https://danielmiessler.com/study/tcpdump/ +[46]: https://www.npmjs.com/package/agentkeepalive +[47]: https://hpbn.co/ +[48]: http://www.brendangregg.com/ \ No newline at end of 
file