Merge pull request #4397 from Cathon/master

[translated] 20160608 Simple Python Framework from Scratch.md
This commit is contained in:
VicYu 2016-09-12 10:57:21 +08:00 committed by GitHub
commit f4ca4d9f52
2 changed files with 466 additions and 462 deletions

Simple Python Framework from Scratch
===================================
Why would you want to build a web framework? I can think of a few reasons:
- Novel idea that will displace other frameworks.
- Get some mad street cred.
- Your problem domain is so unique that other frameworks don't fit.
- You're curious how web frameworks work because you want to become a better web developer.
I'll focus on the last point. This post aims to describe what I learned by writing a small server and framework by explaining the design and implementation process step by step, function by function. The complete code for this project can be found in this [repository][1].
I hope this encourages other people to try, because it was fun, taught me a lot about how web applications work, and was a lot easier than I thought!
### Scope
Frameworks handle things like the request-response cycle, authentication, database access, generating templates, and lots more. Web developers use frameworks because most web applications share a lot of functionality and it doesn't make sense to re-implement all of that for every project.
Bigger frameworks like Rails or Django operate at a high level of abstraction and are said to be "batteries-included". It would take thousands of man-hours to implement all of those features, so it's important to focus on a tiny subset for this project. Before setting down a single line of code I created a list of features and constraints.
Features:
- Must handle GET and POST HTTP requests. (You can get a brief overview of HTTP in [this wiki article][2].)
- Must be asynchronous (I'm loving the Python 3 asyncio module).
- Must include simple routing logic along with capturing parameters.
- Must provide a simple user-facing API like other cool microframeworks.
- Must handle authentication, because it's cool to learn that too (saved for Part 2).
Constraints:
- Will only handle a small subset of HTTP/1.1: no transfer-encoding, no http-auth, no content-encoding (gzip), no [persistent connections][3].
- No MIME-guessing for responses - users will have to set this manually.
- No WSGI - just simple TCP connection handling.
- No database support.
I decided a small use case would make the above more concrete. It would also demonstrate the framework's API:
```
from diy_framework import App, Router
from diy_framework.http_utils import Response


# GET simple route
async def home(r):
    rsp = Response()
    rsp.set_header('Content-Type', 'text/html')
    rsp.body = '<html><body><b>test</b></body></html>'
    return rsp


# GET route + params
async def welcome(r, name):
    return "Welcome {}".format(name)


# POST route + body param
async def parse_form(r):
    if r.method == 'GET':
        return 'form'
    else:
        name = r.body.get('name', '')[0]
        password = r.body.get('password', '')[0]
        return "{0}:{1}".format(name, password)


# application = router + http server
router = Router()
router.add_routes({
    r'/welcome/{name}': welcome,
    r'/': home,
    r'/login': parse_form,
})

app = App(router)
app.start_server()
```
The user is supposed to be able to define a few asynchronous functions that either return strings or Response objects, then pair those functions up with strings that represent routes, and finally start handling requests with a single function call (start_server).
Having created the design, I condensed it into the abstract things I needed to code up:
- Something that should accept TCP connections and schedule an asynchronous function to handle them.
- Something to parse raw text into some kind of abstracted container.
- Something to decide which function should be called for each request.
- Something that bundles all of the above together and presents a simple interface to a developer.
I started by writing tests that were essentially contracts describing each piece of functionality. After a few refactorings, the layout settled into a handful of composable pieces. Each piece is relatively decoupled from the other components, which is especially beneficial in this case because each piece can be studied on its own. The pieces are concrete reflections of the abstractions that I listed above:
- An HTTPServer object that holds onto a Router object and an http_parser module and uses them to initialize...
- HTTPConnection objects, each of which represents a single client HTTP connection and takes care of the request-response cycle: it parses incoming bytes into a Request object using the http_parser module, uses an instance of Router to find the correct function to call to generate a response, and finally sends the response back to the client.
- A pair of Request and Response objects that give a user a comfortable way to work with what are in essence specially formatted strings of bytes. The user shouldn't be aware of the correct message format or delimiters.
- A Router object that contains route:function pairs. It exposes a way to add these pairs and a way, given a URL path, to find a function.
- Finally, an App object that contains configuration and uses it to instantiate a HTTPServer instance.
Let's go over each of these pieces, starting from HTTPConnection.
### Modelling Asynchronous Connections
To satisfy the constraints, each HTTP request is a separate TCP connection. This makes request handling slower because of the relatively high cost of establishing multiple TCP connections (cost of DNS lookup, handshake, [Slow Start][4], etc.) but it's much easier to model. For this task, I chose the fairly high-level [asyncio-stream][5] module that rests on top of [asyncio's transports and protocols][6]. I recommend checking out the code for it in the stdlib because it's a joy to read!
An instance of HTTPConnection handles multiple tasks. First, it reads data incrementally from a TCP connection using an asyncio.StreamReader object and stores it in a buffer. After each read operation, it tries to parse whatever is in the buffer and build out a Request object. Once it receives the whole request, it generates a reply and sends it back to the client through an asyncio.StreamWriter object. It also handles two more tasks: timing out a connection and handling errors.
You can view the complete code for the class [here][7]. I'll introduce each part of the code separately, with the docstrings removed for brevity:
```
class HTTPConnection(object):
    def __init__(self, http_server, reader, writer):
        self.router = http_server.router
        self.http_parser = http_server.http_parser
        self.loop = http_server.loop

        self._reader = reader
        self._writer = writer
        self._buffer = bytearray()
        self._conn_timeout = None
        self.request = Request()
```
The __init__ method is boring: it just collects objects to work with later. It stores a router, an http_parser, and a loop object, which are used to generate responses, parse requests, and schedule things in the event loop.
Next, it stores the reader-writer pair, which together represent a TCP connection, and an empty [bytearray][8] that serves as a buffer for raw bytes. _conn_timeout stores an instance of [asyncio.Handle][9] that is used to manage the timeout logic. Finally, it also stores a single instance of Request.
The following code handles the core functionality of receiving and sending data:
```
async def handle_request(self):
    try:
        while not self.request.finished and not self._reader.at_eof():
            data = await self._reader.read(1024)
            if data:
                self._reset_conn_timeout()
                await self.process_data(data)
        if self.request.finished:
            await self.reply()
        elif self._reader.at_eof():
            raise BadRequestException()
    except (NotFoundException,
            BadRequestException) as e:
        self.error_reply(e.code, body=Response.reason_phrases[e.code])
    except Exception as e:
        self.error_reply(500, body=Response.reason_phrases[500])

    self.close_connection()
```
Everything here is wrapped in a try-except block so that any exceptions thrown while parsing a request or replying to one are caught, and an error response is sent back to the client.
The request is read inside a while loop until the parser sets self.request.finished = True or until the client closes the connection, signalled by the self._reader.at_eof() method returning True. The code tries to read data from the StreamReader on every iteration of the loop and incrementally builds out self.request through the call to self.process_data(data). The connection timeout timer is reset every time the loop reads any data.
There's an error in there - can you spot it? I'll get back to it shortly. I also have to note that this loop can potentially eat up all of the CPU because self._reader.read() returns b'' objects if there is nothing to read - meaning the loop will cycle millions of times, doing nothing. A possible solution here would be to wait a bit of time in a non-blocking fashion: await asyncio.sleep(0.1). I'll hold off on optimizing this until it's needed.
Remember the error that I mentioned at the start of the previous paragraph? The self._reset_conn_timeout() method is called only when data is read from the StreamReader. The way it's set up now means that the timeout isn't initiated until the first byte arrives. If a client opens a connection to the server and never sends any data, it never times out. This could be used to exhaust system resources and cause a denial of service. The fix is simply to call self._reset_conn_timeout() in the __init__ method.
When the request is received or the connection drops, the code hits an if-else block. This block asks: did the parser, having received all the data, finish parsing the request? Yes? Great - generate a reply and send it back! No? Uh-oh - something's wrong with the request - raise an exception! Finally, self.close_connection is called to perform cleanup.
Parsing the parts of a request is done inside the self.process_data method. It's a very short and simple method that's easy to test:
```
async def process_data(self, data):
    self._buffer.extend(data)

    self._buffer = self.http_parser.parse_into(
        self.request, self._buffer)
```
Each call accumulates data into self._buffer and then tries to parse whatever has gathered inside the buffer using self.http_parser. It's worth pointing out here that this code exhibits a pattern called [Dependency Injection][10]. If you remember the __init__ function, you know that I pass in an http_server object, which contains an http_parser object. In this case, the http_parser object is a module that is part of the diy_framework package, but it could potentially be anything else that has a parse_into function that accepts a Request object and a bytearray. This is useful for two reasons. First, it means that this code is easier to extend: someone could come along and want to use HTTPConnection with a different parser - no problem - just pass it in as an argument! Second, it makes testing much easier, because the http_parser is not hard-coded anywhere, so replacing it with a dummy or [mock][11] object is super easy.
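As a quick sketch of how that injection pays off in tests, here is a hypothetical stand-in parser (not part of diy_framework — all of the names below are made up for illustration). Anything exposing a matching parse_into function will do:

```python
# A hypothetical stub parser: anything with a parse_into(request, buffer)
# function can replace the real http_parser module in a test.
class StubParser(object):
    def parse_into(self, request, buffer):
        # Pretend the whole buffer is the body and the request is complete.
        request.body = bytes(buffer)
        request.finished = True
        return bytearray()  # the parser "consumed" everything


# A bare-bones stand-in for the framework's Request container.
class StubRequest(object):
    def __init__(self):
        self.body = None
        self.finished = False


request = StubRequest()
buffer = bytearray(b'GET / HTTP/1.1\r\n\r\n')
leftover = StubParser().parse_into(request, buffer)
```

A test can now drive the connection logic through the stub without any real parsing happening, which is exactly the decoupling the paragraph above describes.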
The next interesting piece is the reply method:
```
async def reply(self):
    request = self.request
    handler = self.router.get_handler(request.path)

    response = await handler.handle(request)

    if not isinstance(response, Response):
        response = Response(code=200, body=response)

    self._writer.write(response.to_bytes())
    await self._writer.drain()
```
Here, an instance of HTTPConnection uses the router object it got from the HTTPServer (another example of Dependency Injection) to obtain an object that will generate a response. A router can be anything that has a get_handler method that accepts a string and returns a callable or raises a NotFoundException. The callable object is then used to process the request and generate a response. Handlers are written by the users of the framework, as outlined in the use case above, and should return either strings or Response objects. Response objects give us a nice interface, so the simple if block ensures that whatever a handler returns, the code further along ends up with a uniform Response object.
Next, the StreamWriter instance assigned to self._writer is used to send a string of bytes back to the client. Before the function returns, it awaits on self._writer.drain(), which ensures that all the data has been sent to the client. This guarantees that a call to self._writer.close won't happen while there is still unsent data in the buffer.
There are two more interesting parts of the HTTPConnection class: a method that closes the connection and a group of methods that handle the timeout mechanism. First, closing a connection is accomplished by this little function:
```
def close_connection(self):
    self._cancel_conn_timeout()
    self._writer.close()
```
Any time a connection is to be closed, the code has to first cancel the timeout to clean it out of the event loop.
The timeout mechanism is a set of three related functions: a function that acts on the timeout by sending an error message to the client and closing the connection; a function that cancels the current timeout; and a function that schedules the timeout. The first two are simple and I add them for completeness, but I'll explain the third one — _reset_conn_timeout — in more detail.
```
def _conn_timeout_close(self):
    self.error_reply(500, 'timeout')
    self.close_connection()

def _cancel_conn_timeout(self):
    if self._conn_timeout:
        self._conn_timeout.cancel()

def _reset_conn_timeout(self, timeout=TIMEOUT):
    self._cancel_conn_timeout()
    self._conn_timeout = self.loop.call_later(
        timeout, self._conn_timeout_close)
```
Every time _reset_conn_timeout is called, it first cancels any previously set asyncio.Handle object assigned to self._conn_timeout. Then, using the BaseEventLoop.call_later function, it schedules the _conn_timeout_close function to run after timeout seconds. If you remember the contents of the handle_request function, you'll know that this method gets called every time any data is received. That cancels any existing timeout and re-schedules the _conn_timeout_close function timeout seconds in the future. As long as data keeps coming in, this cycle will keep resetting the timeout callback. If no data is received within timeout seconds, _conn_timeout_close finally gets called.
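The cancel-then-reschedule pattern is easy to see in isolation. Here is a toy, self-contained version of it (the class and names below are illustrative, not the framework's own; the timings are arbitrary):

```python
import asyncio


class TimeoutDemo:
    """Illustrative resettable timeout built on loop.call_later."""

    def __init__(self, loop, timeout=0.05):
        self.loop = loop
        self.timeout = timeout
        self._handle = None
        self.timed_out = False

    def _on_timeout(self):
        self.timed_out = True

    def reset(self):
        # Cancel any pending callback, then schedule a fresh one,
        # exactly like _reset_conn_timeout does.
        if self._handle:
            self._handle.cancel()
        self._handle = self.loop.call_later(self.timeout, self._on_timeout)


async def main():
    demo = TimeoutDemo(asyncio.get_running_loop())
    demo.reset()
    # Simulate data arriving regularly: each reset postpones the timeout.
    for _ in range(3):
        await asyncio.sleep(0.02)
        demo.reset()
    fired_while_active = demo.timed_out
    # Now go quiet for longer than the timeout: the callback finally fires.
    await asyncio.sleep(0.1)
    return fired_while_active, demo.timed_out


fired_while_active, timed_out = asyncio.run(main())
```

While "data" keeps arriving the callback never runs; once the resets stop, it fires after timeout seconds.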
### Creating the Connections
Something has to create HTTPConnection objects and it has to do it correctly. This task is delegated to the HTTPServer class, which is a very simple container that helps to store some configuration (the parser, router, and event loop instances) and then use that configuration to create instances of HTTPConnection:
```
class HTTPServer(object):
    def __init__(self, router, http_parser, loop):
        self.router = router
        self.http_parser = http_parser
        self.loop = loop

    async def handle_connection(self, reader, writer):
        connection = HTTPConnection(self, reader, writer)
        asyncio.ensure_future(connection.handle_request(), loop=self.loop)
```
Each instance of HTTPServer can listen on one port. It has an asynchronous handle_connection method that creates instances of HTTPConnection and schedules them for execution in the event loop. This method is passed to [asyncio.start_server][12] and serves as a callback: it's called every time a TCP connection is initiated, with a StreamReader and a StreamWriter as the arguments.
```
self._server = HTTPServer(self.router, self.http_parser, self.loop)
self._connection_handler = asyncio.start_server(
    self._server.handle_connection,
    host=self.host,
    port=self.port,
    reuse_address=True,
    reuse_port=True,
    loop=self.loop)
```
This forms the core of how the application works: asyncio.start_server accepts TCP connections and calls a method on a preconfigured HTTPServer object. This method handles all the logic for a single connection: reading, parsing, generating and sending a reply back to the client, and closing the connection. It focuses on IO logic and coordinates parsing and generating a reply.
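The shape of that wiring can be seen in a minimal, self-contained echo server. Note this sketch uses the modern asyncio API (asyncio.run, no loop argument), and none of the names below belong to diy_framework:

```python
import asyncio


async def handle_connection(reader, writer):
    # Called once per TCP connection, like HTTPServer.handle_connection.
    data = await reader.read()   # read until the client signals EOF
    writer.write(data)           # echo it straight back
    await writer.drain()
    writer.close()


async def main():
    # port=0 asks the OS for any free port; we look it up afterwards.
    server = await asyncio.start_server(
        handle_connection, host='127.0.0.1', port=0)
    port = server.sockets[0].getsockname()[1]

    # Act as a client against our own server.
    reader, writer = await asyncio.open_connection('127.0.0.1', port)
    writer.write(b'hello')
    writer.write_eof()           # EOF lets the server's read() return
    echoed = await reader.read()
    writer.close()

    server.close()
    await server.wait_closed()
    return echoed


echoed = asyncio.run(main())
```

Swap the echo callback for something that parses and routes, and you have the core of the framework.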
With the core IO stuff out of the way, let's move on to...
### Parsing Requests
The users of this tiny framework are spoiled and don't want to work with bytes. They want a higher level of abstraction - a more convenient way of working with requests. The tiny framework includes a simple HTTP parser that transforms bytes into Request objects.
These Request objects are containers that look like this:
```
class Request(object):
    def __init__(self):
        self.method = None
        self.path = None
        self.query_params = {}
        self.path_params = {}
        self.headers = {}
        self.body = None
        self.body_raw = None
        self.finished = False
```
It has everything that a developer needs in order to accept data coming from a client in an easily understandable package. Well, everything except cookies, which are crucial in order to do things like authentication. I'll leave that for part 2.
Each HTTP request contains certain required pieces - like the path or the method. It also contains certain optional pieces like the body, headers, or URL parameters. Furthermore, owing to the popularity of REST, the URL, minus the URL parameters, may also contain pieces of information, e.g. "/users/1/edit" contains a user's id.
Each part of a request has to be identified, parsed, and assigned to the correct part of a Request object. The fact that HTTP/1.1 is a text protocol simplifies things (HTTP/2 is a binary protocol - a whole 'nother level of fun).
The http_parser module is a group of functions because the parser does not need to keep track of state. Instead, the calling code has to manage a Request object and pass it into the parse_into function along with a bytearray containing the raw bytes of a request. To this end, the parser modifies both the request object and the bytearray buffer passed to it. The request object gets fuller and fuller while the bytearray buffer gets emptier and emptier.
The core functionality of the http_parser module is inside the parse_into function:
```
def parse_into(request, buffer):
    _buffer = buffer[:]
    if not request.method and can_parse_request_line(_buffer):
        (request.method, request.path,
         request.query_params) = parse_request_line(_buffer)
        remove_request_line(_buffer)

    if not request.headers and can_parse_headers(_buffer):
        request.headers = parse_headers(_buffer)
        if not has_body(request.headers):
            request.finished = True

        remove_intro(_buffer)

    if not request.finished and can_parse_body(request.headers, _buffer):
        request.body_raw, request.body = parse_body(request.headers, _buffer)
        clear_buffer(_buffer)
        request.finished = True
    return _buffer
```
As you can see in the code above, I divided the parsing process into three parts: parsing the request line (the line that goes GET /resource HTTP/1.1), parsing the headers, and parsing the body.
The request line contains the HTTP method and the URL. The URL in turn contains yet more information: the path, URL parameters, and developer-defined URL parameters. Parsing out the method and URL is easy - it's a matter of splitting the string appropriately. The urlparse.parse function is used to parse out the URL parameters, if any, from the URL. The developer-defined URL parameters are extracted using regular expressions.
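A sketch of that split, using the stdlib urllib.parse instead of the framework's own helpers (the function name here is illustrative):

```python
from urllib.parse import urlparse, parse_qs


def parse_request_line(line):
    # 'GET /welcome/bob?lang=en HTTP/1.1' -> method, path, query params.
    method, url, version = line.split(' ')
    parsed = urlparse(url)
    return method, parsed.path, parse_qs(parsed.query)


method, path, query_params = parse_request_line(
    'GET /welcome/bob?lang=en HTTP/1.1')
```

Splitting on spaces yields the three request-line fields, and urlparse separates the path from the query string.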
Next up are the HTTP headers. These are simply lines of text that are key-value pairs. The catch is that there may be multiple headers with the same name but different values. An important header to watch out for is Content-Length, which specifies the length of the body (not the whole request, just the body!) and is important in determining whether to parse the body at all.
Finally, the parser looks at the HTTP method and headers and decides whether to parse the request's body.
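A much-simplified header parser illustrates both points. This is a sketch, not the framework's parse_headers: real header parsing has more edge cases, but collecting each name's values into a list preserves repeated headers, and Content-Length falls straight out:

```python
def parse_headers(raw):
    # Collect each header's values in a list so repeated names
    # (e.g. Set-Cookie) are not lost.
    headers = {}
    for line in raw.split('\r\n'):
        if not line:
            continue
        name, _, value = line.partition(':')
        headers.setdefault(name.strip(), []).append(value.strip())
    return headers


def body_length(headers):
    # Content-Length counts only the body, not the whole request.
    return int(headers.get('Content-Length', ['0'])[0])


hdrs = parse_headers('Host: example.org\r\nContent-Length: 11\r\n')
```

With Content-Length in hand, the parser knows exactly how many body bytes to wait for.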
### Routing!
The router is a bridge between the framework and the user in the sense that the user creates a Router object and fills it with path/function pairs using the appropriate methods and then gives the Router object to the App. The App object in turn uses the get_handler function to get a hold of a callable that generates a response. In short, the router is responsible for two things - storing pairs of paths/functions and handing back a pair to whatever asks for one.
There are two methods in the Router class that allow an end-developer to add routes: add_routes and add_route. Since add_routes is a convenient wrapper around add_route, I'll skip describing it and focus on add_route:
```
def add_route(self, path, handler):
    compiled_route = self.__class__.build_route_regexp(path)
    if compiled_route not in self.routes:
        self.routes[compiled_route] = handler
    else:
        raise DuplicateRoute
```
This method first "compiles" a route — a string like '/cars/{id}' — into a compiled regexp object using the Router.build_route_regexp class method. These compiled regexp objects serve both to match a request's path and to extract developer defined URL parameters specified by that route. Next there's a check that raises an exception if the same route already exists, and finally the route/handler pair is added to a simple dictionary — self.routes.
Here's how the Router "compiles" routes:
```
@classmethod
def build_route_regexp(cls, regexp_str):
"""
Turns a string into a compiled regular expression. Parses '{}' into
named groups ie. '/path/{variable}' is turned into
'/path/(?P<variable>[a-zA-Z0-9_-]+)'.
:param regexp_str: a string representing a URL path.
:return: a compiled regular expression.
"""
def named_groups(matchobj):
return '(?P<{0}>[a-zA-Z0-9_-]+)'.format(matchobj.group(1))
re_str = re.sub(r'{([a-zA-Z0-9_-]+)}', named_groups, regexp_str)
re_str = ''.join(('^', re_str, '$',))
return re.compile(re_str)
```
The method uses regular expressions to substitute all occurrences of "{variable}" with named regexp groups: "(?P<variable>...)". Then it adds the ^ and $ anchors at the beginning and end of the resulting string, and finally compiles a regexp object out of it.
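Here is that compilation in action, reproduced as a standalone function on top of the stdlib re module so you can run it by itself:

```python
import re


def build_route_regexp(regexp_str):
    # '{variable}' becomes a named group matching one path segment.
    def named_groups(matchobj):
        return '(?P<{0}>[a-zA-Z0-9_-]+)'.format(matchobj.group(1))

    re_str = re.sub(r'{([a-zA-Z0-9_-]+)}', named_groups, regexp_str)
    return re.compile('^' + re_str + '$')


route = build_route_regexp(r'/cars/{id}')
params = route.match('/cars/42').groupdict()   # the captured URL parameters
no_match = route.match('/cars/42/edit')        # '$' rejects partial matches
```

The named group both matches the path segment and hands back the developer-defined parameter by name.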
Storing a route is just half the battle. Here's how to get one back:
```
def get_handler(self, path):
    logger.debug('Getting handler for: {0}'.format(path))
    for route, handler in self.routes.items():
        path_params = self.__class__.match_path(route, path)
        if path_params is not None:
            logger.debug('Got handler for: {0}'.format(path))
            wrapped_handler = HandlerWrapper(handler, path_params)
            return wrapped_handler

    raise NotFoundException()
```
Once the App has a Request object, it also has the path part of the URL (i.e. /users/15/edit). Having that, it needs a matching function to either generate the response or a 404 error. get_handler takes a path as an argument, loops over the routes, and calls the Router.match_path class method on each one to check whether any compiled regexp matches the request's path. If one does, it wraps the route's function in a HandlerWrapper. The path_params dictionary contains the path variables (i.e. the '15' from /users/15/edit) or is left empty if the route doesn't specify any variables. Finally, it returns the wrapped route's function to the App.
If the code iterates through all the routes and none of them matches the path, the function raises a NotFoundException.
The Router.match_path class method is simple:
```
@classmethod
def match_path(cls, route, path):
    match = route.match(path)
    try:
        return match.groupdict()
    except AttributeError:
        return None
```
It uses the regexp object's match method to check whether the route and path match, returning None when they don't.
Finally, we have the HandlerWrapper class. Its only job is to wrap an asynchronous function, store the path_params dictionary, and expose a uniform interface through the handle method:
```
class HandlerWrapper(object):
    def __init__(self, handler, path_params):
        self.handler = handler
        self.path_params = path_params
        self.request = None

    async def handle(self, request):
        return await self.handler(request, **self.path_params)
```
### Bringing it All Together
The last piece of the framework is the one that ties everything together — the App class.
The App class serves to gather all the configuration details. An App object uses its single method — start_server — to create an instance of HTTPServer using some of the configuration data, and then feeds it to the [asyncio.start_server][12] function. The asyncio.start_server function will call the HTTPServer object's handle_connection method for every incoming TCP connection:
```
def start_server(self):
    if not self._server:
        self.loop = asyncio.get_event_loop()
        self._server = HTTPServer(self.router, self.http_parser, self.loop)
        self._connection_handler = asyncio.start_server(
            self._server.handle_connection,
            host=self.host,
            port=self.port,
            reuse_address=True,
            reuse_port=True,
            loop=self.loop)

        logger.info('Starting server on {0}:{1}'.format(
            self.host, self.port))
        self.loop.run_until_complete(self._connection_handler)

        try:
            self.loop.run_forever()
        except KeyboardInterrupt:
            logger.info('Got signal, killing server')
        except DiyFrameworkException as e:
            logger.error('Critical framework failure:')
            logger.error(e.traceback)
        finally:
            self.loop.close()
    else:
        logger.info('Server already started - {0}'.format(self))
```
### Lessons learned
If you look at the repo, you'll notice that the whole thing is roughly 320 lines of code if you don't count the tests (it's ~540 if you do). It really surprised me that it's possible to fit so much functionality into so little code. Granted, this framework does not offer useful pieces like templates, authentication, or database access, but hey, it's something fun to work on :). This also gave me an idea of how other frameworks like Django or Tornado work at a general level, and it's already paying off in how quickly I'm able to debug things.
This is also the first project that I did in true TDD fashion, and it's just amazing how pleasant and productive the process was. Writing tests first forced me to think about design and architecture, not just about gluing bits of code together to "make it work". Don't get me wrong, there are many scenarios where the latter approach is preferred, but if you're placing a premium on low-maintenance code that you and others will be able to work on weeks or months in the future, then TDD is exactly what you need.
I explored things like [the Clean Architecture][13] and dependency injection, which is most evident in how the Router class is a higher-level abstraction (Entity?) that's close to the "core" whereas pieces like the http_parser or App are somewhere on the outer edges because they do either itty-bitty string or bytes work or interface with mid-level IO stuff. Whereas TDD forced me to think about each small part separately, this made me ask myself questions like: Does this combination of method calls compose into an understandable action? Do the class names accurately reflect the problem that I'm solving? Is it easy to distinguish different levels of abstraction in my code?
Go ahead, write a small framework, it's a ton of fun!
--------------------------------------------------------------------------------
via: http://mattscodecave.com/posts/simple-python-framework-from-scratch.html
Author: [Matt][a]
Translator: [译者ID](https://github.com/译者ID)
Proofreader: [校对者ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
[a]: http://mattscodecave.com/hire-me.html
[1]: https://github.com/sirMackk/diy_framework
[2]: https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
[3]: https://en.wikipedia.org/wiki/HTTP_persistent_connection
[4]: https://en.wikipedia.org/wiki/TCP_congestion-avoidance_algorithm#Slow_start
[5]: https://docs.python.org/3/library/asyncio-stream.html
[6]: https://docs.python.org/3/library/asyncio-protocol.html
[7]: https://github.com/sirMackk/diy_framework/blob/88968e6b30e59504251c0c7cd80abe88f51adb79/diy_framework/http_server.py#L46
[8]: https://docs.python.org/3/library/functions.html#bytearray
[9]: https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.Handle
[10]: https://en.wikipedia.org/wiki/Dependency_injection
[11]: https://docs.python.org/3/library/unittest.mock.html
[12]: https://docs.python.org/3/library/asyncio-stream.html#asyncio.start_server
[13]: https://blog.8thlight.com/uncle-bob/2012/08/13/the-clean-architecture.html

View File

@ -0,0 +1,466 @@
从零构建一个简单的 Python 框架
===================================
为什么你想要自己构建一个 web 框架呢?我想,原因有以下几点:
- 你有一个新奇的想法,觉得将会取代其他的框架
- 你想要获得一些名气
- 你的问题领域很独特,以至于现有的框架不太合适
- 你对 web 框架是如何工作的很感兴趣,因为你想要成为一位更好的 web 开发者。
接下来的笔墨将着重于最后一点。这篇文章旨在通过对设计和实现过程一步一步的阐述告诉读者,我在完成一个小型的服务器和框架之后学到了什么。你可以在这个[代码仓库][1]中找到这个项目的完整代码。
我希望这篇文章可以鼓励更多的人来尝试,因为这确实很有趣。它让我知道了 web 应用是如何工作的,而且这比我想的要容易的多!
### 范围
框架可以处理请求-响应周期身份认证数据库访问模板生成等。Web 开发者使用框架是因为,大多数的 web 应用拥有大量相同的功能,而对每个项目都重新实现同样的功能意义不大。
比较大的的框架如 Rails 和 Django 实现了高层次的抽象,开箱即用功能齐备。而实现所有的这些功能可能要花费数千小时,因此在这个项目上,我们重点完成其中的一小部分。在开始写代码前,我先列举一下所需的功能以及限制。
功能:
- 处理 HTTP 的 GET 和 POST 请求。你可以在[这篇 wiki][2] 中对 HTTP 有个大致的了解。
- 实现异步操作(我喜欢 Python 3 的 asyncio 模块)。
- 简单的路由逻辑以及参数捕获。
- 像其他微型框架一样,提供一个简单的用户级 API 。
- 支持身份认证,因为学会这个很酷啊(微笑)。
限制:
- 将只支持 HTTP 1.1 的一个小子集不支持传输编码http-auth内容编码gzip以及[持久化连接][3]。
- 不支持对响应的 MIME 判断 - 用户需要手动设置。
- 不支持 WSGI - 仅能处理简单的 TCP 连接。
- 不支持数据库。
我觉得一个小的用例可以让上述内容更加具体,也可以用来演示这个框架的 API
```
from diy_framework import App, Router
from diy_framework.http_utils import Response
# GET simple route
async def home(r):
rsp = Response()
rsp.set_header('Content-Type', 'text/html')
rsp.body = '<html><body><b>test</b></body></html>'
return rsp
# GET route + params
async def welcome(r, name):
return "Welcome {}".format(name)
# POST route + body param
async def parse_form(r):
if r.method == 'GET':
return 'form'
else:
name = r.body.get('name', '')[0]
password = r.body.get('password', '')[0]
return "{0}:{1}".format(name, password)
# application = router + http server
router = Router()
router.add_routes({
r'/welcome/{name}': welcome,
r'/': home,
r'/login': parse_form,})
app = App(router)
app.start_server()
```
'
用户需要定义一些能够返回字符串或响应对象的异步函数然后将这些函数与表示路由的字符串配对最后通过一个函数start_server调用开始处理请求。
完成设计之后,我将它抽象为几个我需要编码的部分:
- 接受 TCP 连接以及调度一个异步函数来处理这些连接的部分
- 将原始文本解析成某种抽象容器的部分
- 对于每个请求,用来决定调用哪个函数的部分
- 将上述部分集中到一起,并为开发者提供一个简单接口的部分
我先编写一些测试,这些测试被用来描述每个部分的功能。几次重构后,整个设计被分成若干部分,每个部分之间是相对解耦的。这样就非常好,因为每个部分可以被独立地研究学习。以下是我上文列出的抽象的具体体现:
- 一个 HTTPServer 对象,需要一个 Router 对象和一个 http_parser 模块,并使用它们来初始化。
- HTTPConnection 对象,每一个对象表示一个单独的客户端 HTTP 连接,并且处理其请求-响应周期:使用 http_parser 模块将收到的字节流解析为一个 Request 对象;使用一个 Router 实例寻找并调用正确的函数来生成一个响应; 最后将这个响应发送回客户端。
- 一对 Request 和 Response 对象为用户提供了一种友好的方式,来处理实质上是字节流的字符串。用户不需要知道正确的消息格式和分隔符。
- 一个 Router 对象包含路由:函数对。它提供一个添加配对的方法,还有一个根据 URL 路径查找相应函数的方法。
- 最后,一个 App 对象。它包含配置信息,并使用它们实例化一个 HTTPServer 实例。
让我们从 HTTPConnection 开始来讲解各个部分。
### 模拟异步连接
为了满足约束条件,每一个 HTTP 请求都是一个单独的 TCP 连接。这使得处理请求的速度变慢了,因为建立多个 TCP 连接需要相对高的花销DNS 查询TCP 三次握手,[慢启动][4]的花销等等),不过这样更加容易模拟。对于这一任务,我选择相对高级的 [asyncio-stream][5] 模块,它建立在 asyncio 的传输和协议的基础之上。我强烈推荐阅读标准库中的相应代码!
一个 HTTPConnection 的实例能够处理多个任务。首先,它使用 asyncio.StreamReader 对象以增量的方式从 TCP 连接中读取数据,并存储在缓存中。每一个读取操作完成后,它会尝试解析缓存中的数据,并生成一个 Request 对象。一旦收到了完整的请求,它就生成一个回复,并通过 asyncio.StreamWriter 对象发送回客户端。当然,它还有两个任务:超时连接以及错误处理。
你可以在[这里][7]浏览这个类的完整代码。我将分别介绍代码的每一部分。为了简单起见,我移除了代码文档。
```
class HTTPConnection(object):
def init(self, http_server, reader, writer):
self.router = http_server.router
self.http_parser = http_server.http_parser
self.loop = http_server.loop
self._reader = reader
self._writer = writer
self._buffer = bytearray()
self._conn_timeout = None
self.request = Request()
```
这个 init 方法仅仅收集了一些对象以供后面使用。它存储了一个 router 对象,一个 http_parser 对象以及 loop 对象,分别用来生成响应,解析请求以及在事件循环中调度任务。
然后,它存储了代表一个 TCP 连接的读写对,和一个充当缓冲区的空[字节数组][8]。_conn_timeout 存储了一个 [asyncio.Handle][9] 的实例,用来管理超时逻辑。最后,它还存储了一个 Request 对象的实例。
下面的代码是用来接受和发送数据的核心功能:
```
async def handle_request(self):
    try:
        while not self.request.finished and not self._reader.at_eof():
            data = await self._reader.read(1024)
            if data:
                self._reset_conn_timeout()
                await self.process_data(data)
        if self.request.finished:
            await self.reply()
        elif self._reader.at_eof():
            raise BadRequestException()
    except (NotFoundException,
            BadRequestException) as e:
        self.error_reply(e.code, body=Response.reason_phrases[e.code])
    except Exception as e:
        self.error_reply(500, body=Response.reason_phrases[500])

    self.close_connection()
```
Everything is wrapped in a try-except block so that any exception raised while parsing the request or generating the response is caught, and an error response is then sent back to the client.
The while loop keeps reading the request until either the parser sets self.request.finished to True, or the client closes the connection, which makes self._reader.at_eof() return True. On each iteration, the code tries to read data from the StreamReader and incrementally build up self.request by calling self.process_data(data). The connection timeout counter is reset every time data is read.
There's a bug in here. Did you spot it? We'll come back to it in a moment. Note also that this loop can eat up the CPU: if there is nothing to read, self._reader.read() returns an empty bytes object, so the loop spins without doing anything. A possible fix is to wait a short while in a non-blocking way, with await asyncio.sleep(0.1). We'll leave it unoptimized for now.
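A minimal sketch of that workaround, assuming only the at_eof/read interface of asyncio.StreamReader (the read_loop name and the process_data callback parameter are mine, for illustration):

```python
import asyncio

async def read_loop(reader, process_data):
    """Read incrementally, but back off briefly when no data arrives
    so an idle connection does not spin the CPU."""
    while not reader.at_eof():
        data = await reader.read(1024)
        if data:
            await process_data(data)
        else:
            # Nothing to read yet: yield control instead of busy-looping.
            await asyncio.sleep(0.1)
```

The sleep keeps the coroutine cooperative: while it waits, the event loop is free to service other connections.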
Remember the bug I mentioned? self._reset_conn_timeout() is only called when the StreamReader actually reads data, which means the timeout is not set up until the first byte arrives. A client that opens a connection to the server but never sends any data will therefore never time out. That could be exploited to exhaust resources or mount a denial-of-service attack. The fix is to call self._reset_conn_timeout() in the __init__ method.
When the request is fully received, or the connection is dropped, execution reaches the if-else block. It checks whether the parser finished its work after receiving all the data. If so, great: generate a reply and send it back to the client. If not, something was probably wrong with the request, so raise an exception! Finally, self.close_connection is called to do the cleanup.
Request parsing happens in the self.process_data method, which is very short and easy to test:
```
async def process_data(self, data):
    self._buffer.extend(data)

    self._buffer = self.http_parser.parse_into(
        self.request, self._buffer)
```
Each call accumulates data into self._buffer and then tries to parse whatever has been collected so far using self.http_parser. Something worth pointing out here: this code exhibits a pattern called [dependency injection][10]. If you remember the __init__ method, you know we passed in an http_server object that contains an http_parser. In this case, the http_parser object is a module in the diy_framework package, but it could be any class that exposes a parse_into function taking a Request object and a bytearray. This is useful for two reasons. First, the code becomes more extensible: if someone wants to use HTTPConnection with a different parser, no problem, just pass it in as an argument. Second, testing becomes much easier: because the http_parser is not hard-coded, replacing it with dummy data or a [mock][11] object is trivial.
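Dependency injection makes this method easy to exercise in isolation. Here is a rough sketch of swapping in a fake parser; FakeRequest, FakeParser, and the standalone process_data below are stand-ins I invented for illustration, not part of the framework:

```python
class FakeRequest:
    def __init__(self):
        self.finished = False
        self.body = None

class FakeParser:
    """Stand-in for the http_parser module: anything exposing
    parse_into(request, buffer) will do."""
    def parse_into(self, request, buffer):
        # Pretend a newline terminates a "request".
        if b'\n' in buffer:
            line, _, rest = bytes(buffer).partition(b'\n')
            request.body = line.decode()
            request.finished = True
            return bytearray(rest)
        return buffer

def process_data(request, parser, buffer, data):
    """Mimics HTTPConnection.process_data: accumulate, then parse."""
    buffer.extend(data)
    return parser.parse_into(request, buffer)
```

A test can now drive the accumulate-then-parse logic byte by byte without any real HTTP parsing involved.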
The next interesting piece is the reply method:
```
async def reply(self):
    request = self.request
    handler = self.router.get_handler(request.path)

    response = await handler.handle(request)
    if not isinstance(response, Response):
        response = Response(code=200, body=response)

    self._writer.write(response.to_bytes())
    await self._writer.drain()
```
Here, an HTTPConnection instance uses the HTTPServer's router object to get hold of something that will generate a response. A router can be any object with a get_handler method that takes a string as an argument and either returns a callable or raises a NotFoundException. That callable is then used to process the request and generate a response. Handlers are written by the framework's users and, as mentioned above, should return either a string or a Response object. Since Response objects provide a friendly interface, this simple if statement guarantees that whatever the handler returns, the code ends up with a uniform Response object.
Next, the StreamWriter instance assigned to self._writer is called to send the byte string back to the client. Before the function returns, it awaits at self._writer.drain() to ensure that all the data has been sent to the client; otherwise self._writer.close() could execute while there is still unsent data in the buffer.
There are two more interesting parts of the HTTPConnection class: a method for closing the connection, and a group of methods that handle the timeout mechanism. First, closing a connection is accomplished by this little function:
```
def close_connection(self):
    self._cancel_conn_timeout()
    self._writer.close()
```
Whenever a connection is about to be closed, this code first cancels the timeout and then closes the connection, taking it off the event loop.
The timeout mechanism consists of three related functions: one that sends an error message to the client and closes the connection on a timeout; one that cancels the current timeout; and one that schedules the timeout. The first two are simple, so I'll explain the third, _reset_conn_timeout(), in detail.
```
def _conn_timeout_close(self):
    self.error_reply(500, 'timeout')
    self.close_connection()

def _cancel_conn_timeout(self):
    if self._conn_timeout:
        self._conn_timeout.cancel()

def _reset_conn_timeout(self, timeout=TIMEOUT):
    self._cancel_conn_timeout()
    self._conn_timeout = self.loop.call_later(
        timeout, self._conn_timeout_close)
```
Every time _reset_conn_timeout is called, it first cancels any asyncio.Handle object previously assigned to self._conn_timeout. Then, using BaseEventLoop.call_later, it schedules _conn_timeout_close to run timeout seconds in the future. If you remember the contents of handle_request, you know this function gets called whenever data is received. That cancels the current timeout and reschedules _conn_timeout_close for another timeout seconds out. As long as data keeps arriving, this cycle keeps resetting the timeout callback. If no data arrives within the timeout period, _conn_timeout_close finally runs.
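The cancel-and-reschedule dance can be seen in isolation. Here is a sketch of the same pattern (IdleTimeout and all the names in it are my own invention, not framework code):

```python
import asyncio

class IdleTimeout:
    """Reset-on-activity timeout: mirrors the _reset_conn_timeout logic."""
    def __init__(self, loop, timeout, on_timeout):
        self.loop = loop
        self.timeout = timeout
        self.on_timeout = on_timeout
        self._handle = None

    def reset(self):
        # Cancel the pending callback and schedule a fresh one.
        if self._handle:
            self._handle.cancel()
        self._handle = self.loop.call_later(self.timeout, self.on_timeout)

async def demo():
    fired = []
    timer = IdleTimeout(asyncio.get_running_loop(), 0.2,
                        lambda: fired.append(True))
    timer.reset()
    for _ in range(2):           # "activity" keeps pushing the deadline back
        await asyncio.sleep(0.05)
        timer.reset()
    still_quiet = not fired      # the deadline kept moving, nothing fired yet
    await asyncio.sleep(0.5)     # now stay idle long enough for it to fire
    return still_quiet and bool(fired)
```

Running `asyncio.run(demo())` should return True: the callback only fires once the activity stops for longer than the timeout.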
### Creating the connections
We need something to create HTTPConnection objects and use them correctly, and that is the job of the HTTPServer class. It is a simple container that holds some configuration (the parser, the router, and the event loop instance) and uses it to create HTTPConnection instances:
```
class HTTPServer(object):
    def __init__(self, router, http_parser, loop):
        self.router = router
        self.http_parser = http_parser
        self.loop = loop

    async def handle_connection(self, reader, writer):
        connection = HTTPConnection(self, reader, writer)
        asyncio.ensure_future(connection.handle_request(), loop=self.loop)
```
Each instance of HTTPServer can listen on one port. It has an asynchronous handle_connection method that creates HTTPConnection instances and schedules them to run on the event loop. This method is passed to [asyncio.start_server][12] as a callback, which means it is called every time a TCP connection is initiated.
```
self._server = HTTPServer(self.router, self.http_parser, self.loop)
self._connection_handler = asyncio.start_server(
    self._server.handle_connection,
    host=self.host,
    port=self.port,
    reuse_address=True,
    reuse_port=True,
    loop=self.loop)
```
This is the core of how the whole application works: asyncio.start_server accepts TCP connections, then calls a method on a preconfigured HTTPServer object. That method handles all the logic for a single TCP connection: reading, parsing, generating a response and sending it back to the client, and closing the connection. Its focus is the IO logic plus parsing and generating responses.
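The callback-per-connection pattern is easy to see in a tiny self-contained sketch; this uses an uppercasing echo handler rather than the framework's HTTP logic, and port 0 asks the OS for any free port:

```python
import asyncio

async def handle_echo(reader, writer):
    # Runs once per TCP connection, like HTTPServer.handle_connection.
    data = await reader.read(100)
    writer.write(data.upper())
    await writer.drain()
    writer.close()

async def demo():
    server = await asyncio.start_server(handle_echo, host='127.0.0.1', port=0)
    port = server.sockets[0].getsockname()[1]

    # Act as a client against our own server.
    reader, writer = await asyncio.open_connection('127.0.0.1', port)
    writer.write(b'ping')
    await writer.drain()
    reply = await reader.read(100)

    writer.close()
    server.close()
    await server.wait_closed()
    return reply
```

Every new connection gets its own invocation of handle_echo, which is exactly how handle_connection is driven in the framework.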
With the core IO parts covered, let's move on.
### Parsing requests
The users of this micro-framework are spoiled and don't want to work with bytes. They want a higher level of abstraction: a simpler way of handling requests. The micro-framework therefore includes a simple HTTP parser that turns a stream of bytes into Request objects.
These Request objects are containers that look like this:
```
class Request(object):
    def __init__(self):
        self.method = None
        self.path = None
        self.query_params = {}
        self.path_params = {}
        self.headers = {}
        self.body = None
        self.body_raw = None
        self.finished = False
```
It holds everything needed to present the data received from a client in an easily understandable way. Well, everything except cookies, which are crucial for authentication; I'll leave those for part two.
Every HTTP request contains certain required pieces, such as the path and the method. It also contains optional pieces such as a body, headers, or URL parameters. With the rise of REST, the URL itself, apart from the URL parameters, has become informative. For example, "/user/1/edit" carries the user's id.
Each part of a request has to be identified, parsed, and assigned to the correct attribute on the Request object. Since HTTP/1.1 is a text protocol, this simplifies things a lot. (HTTP/2 is a binary protocol, which is a whole different kind of fun.)
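To make the text-protocol point concrete, here is a hand-written request and the kind of splitting a parser has to do (illustration code, not the framework's parser):

```python
# A complete request is plain text: a request line, header lines,
# a blank line, then the body.
raw = (b'POST /user/1/edit HTTP/1.1\r\n'
       b'Host: example.com\r\n'
       b'Content-Length: 13\r\n'
       b'\r\n'
       b'name=matthew1')

# Split the head from the body at the blank line.
head, _, body = raw.partition(b'\r\n\r\n')
request_line, *header_lines = head.decode().split('\r\n')
headers = dict(line.split(': ', 1) for line in header_lines)
```

The Content-Length header matches the body length, which is how a parser knows it has the whole request.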
The http_parser module is a group of functions, because the parser does not need to keep track of state. Instead, the calling code has to manage a Request object and pass it into the parse_into function along with a bytearray containing the raw bytes of a request. To this end, the parser modifies both the request object and the bytearray buffer passed to it: the request object gets fuller and fuller while the bytearray buffer gets emptier and emptier.
The core of the http_parser module is the parse_into function below:
```
def parse_into(request, buffer):
    _buffer = buffer[:]
    if not request.method and can_parse_request_line(_buffer):
        (request.method, request.path,
         request.query_params) = parse_request_line(_buffer)
        remove_request_line(_buffer)

    if not request.headers and can_parse_headers(_buffer):
        request.headers = parse_headers(_buffer)
        if not has_body(request.headers):
            request.finished = True

        remove_intro(_buffer)

    if not request.finished and can_parse_body(request.headers, _buffer):
        request.body_raw, request.body = parse_body(request.headers, _buffer)
        clear_buffer(_buffer)
        request.finished = True

    return _buffer
```
As you can see above, I split parsing into three parts: parsing the request line (which has the form GET /resource HTTP/1.1), parsing the headers, and parsing the body.
The request line contains the HTTP method and the URL. The URL in turn holds yet more information: the path, the URL parameters, and the developer-defined URL parameters. Parsing out the method and the URL is easy enough; it's a matter of splitting the string appropriately. The urlparse.parse function can be used for the URL parameters, while the developer-defined URL parameters are extracted with regular expressions.
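As a rough sketch of request-line parsing: in Python 3 the old urlparse functions live in urllib.parse, and the parse_request_line below is my own simplified stand-in for the framework's version:

```python
from urllib.parse import urlparse, parse_qs

def parse_request_line(line):
    """Split 'GET /cars?color=red HTTP/1.1' into method, path, params."""
    method, target, _version = line.split(' ')
    url = urlparse(target)
    return method, url.path, parse_qs(url.query)
```

Note that parse_qs collects values into lists, since a query parameter can legally repeat.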
Next come the HTTP headers. These are simple lines of text that form key-value pairs. The catch is that there may be several headers with the same name but different values. An important header to watch for is Content-Length, which describes the size of the request body in bytes (the body only). It matters when deciding whether to parse a body at all.
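One common way to cope with repeated header names is to collect the values into lists. A sketch (not necessarily how this framework stores them):

```python
from collections import defaultdict

def parse_headers(lines):
    """Parse 'Name: value' lines; repeated names collect multiple values."""
    headers = defaultdict(list)
    for line in lines:
        name, _, value = line.partition(':')
        headers[name.strip()].append(value.strip())
    return dict(headers)
```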
Finally, the parser uses the HTTP method and the headers to decide whether to parse a request body.
### Routing!
In a sense, the router is the bridge between the framework and the user. The user creates a Router object in the appropriate way, fills it with path/function pairs, and assigns it to the App object. The App object in turn uses get_handler to obtain the callbacks that generate responses. In short, the router is responsible for two things: storing path/function pairs, and returning the one that matches a request path.
There are two methods on the Router class that let a developer add routes: add_routes and add_route. Since add_routes is just a thin wrapper around add_route, we'll focus on add_route:
```
def add_route(self, path, handler):
    compiled_route = self.__class__.build_route_regexp(path)
    if compiled_route not in self.routes:
        self.routes[compiled_route] = handler
    else:
        raise DuplicateRoute
```
First, this function takes a route rule (a string such as '/cars/{id}') and uses the Router.build_route_regexp class method to "compile" it into a compiled regular-expression object. Those compiled regexps serve both to match request paths and to extract the developer-defined URL parameters. If the same route already exists, a DuplicateRoute exception is raised. Finally, the route/handler pair is added to a simple dictionary, self.routes.
Here is how the Router "compiles" routes:
```
@classmethod
def build_route_regexp(cls, regexp_str):
    """
    Turns a string into a compiled regular expression. Parses '{}' into
    named groups ie. '/path/{variable}' is turned into
    '/path/(?P<variable>[a-zA-Z0-9_-]+)'.

    :param regexp_str: a string representing a URL path.
    :return: a compiled regular expression.
    """
    def named_groups(matchobj):
        return '(?P<{0}>[a-zA-Z0-9_-]+)'.format(matchobj.group(1))

    re_str = re.sub(r'{([a-zA-Z0-9_-]+)}', named_groups, regexp_str)
    re_str = ''.join(('^', re_str, '$',))
    return re.compile(re_str)
```
The method uses regular expressions to substitute all occurrences of "{variable}" with named regexp groups: "(?P<variable>...)". Then it adds the ^ and $ regexp signifiers at the beginning and end of the resulting string, and finally compiles a regexp object out of it.
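Seeing the compiled result makes the behaviour concrete. Below is a standalone copy of the method as a plain function, plus what it produces:

```python
import re

def build_route_regexp(regexp_str):
    """Standalone version of Router.build_route_regexp."""
    def named_groups(matchobj):
        return '(?P<{0}>[a-zA-Z0-9_-]+)'.format(matchobj.group(1))

    re_str = re.sub(r'{([a-zA-Z0-9_-]+)}', named_groups, regexp_str)
    return re.compile(''.join(('^', re_str, '$')))

route = build_route_regexp('/users/{id}/edit')
```

The named group means a successful match hands back the path variables as a dictionary via groupdict().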
Storing the routes is only half the battle. Here is how to get the matching handler back out:
```
def get_handler(self, path):
    logger.debug('Getting handler for: {0}'.format(path))
    for route, handler in self.routes.items():
        path_params = self.__class__.match_path(route, path)
        if path_params is not None:
            logger.debug('Got handler for: {0}'.format(path))
            wrapped_handler = HandlerWrapper(handler, path_params)
            return wrapped_handler

    raise NotFoundException()
```
Once the App object has a Request object, it also has the path portion of the URL (e.g. /users/15/edit). With that, we need a matching function that will generate a response, or a 404 error. get_handler takes the path as an argument and loops over the routes, calling the Router.match_path class method on each one to check whether any compiled regexp matches the request path. If one does, we wrap the route's function in a HandlerWrapper. The path_params dictionary holds the path variables (such as the '15' in '/users/15/edit'); it is empty if the route declares no variables. Finally, the wrapped function is returned to the App object.
If the code iterates over all the routes without finding a match for the path, the function raises a NotFoundException.
The Router.match_path class method is pretty simple:
```
@classmethod
def match_path(cls, route, path):
    match = route.match(path)
    try:
        return match.groupdict()
    except AttributeError:
        return None
```
It uses the regexp object's match method to check whether the route matches the path, and returns None if it doesn't.
Finally, we have the HandlerWrapper class. Its only job is to wrap an asynchronous function, store the path_params dictionary, and expose a uniform interface through its handle method:
```
class HandlerWrapper(object):
    def __init__(self, handler, path_params):
        self.handler = handler
        self.path_params = path_params
        self.request = None

    async def handle(self, request):
        return await self.handler(request, **self.path_params)
```
### Putting it all together
The final piece of the framework is the App class, which ties everything together.
The App class centralizes all the configuration details. An App object uses that configuration, via its start_server method, to create an instance of HTTPServer, which it then passes to the asyncio.start_server function. asyncio.start_server calls the HTTPServer object's handle_connection method for every incoming TCP connection.
```
def start_server(self):
    if not self._server:
        self.loop = asyncio.get_event_loop()
        self._server = HTTPServer(self.router, self.http_parser, self.loop)
        self._connection_handler = asyncio.start_server(
            self._server.handle_connection,
            host=self.host,
            port=self.port,
            reuse_address=True,
            reuse_port=True,
            loop=self.loop)

        logger.info('Starting server on {0}:{1}'.format(
            self.host, self.port))
        self.loop.run_until_complete(self._connection_handler)

        try:
            self.loop.run_forever()
        except KeyboardInterrupt:
            logger.info('Got signal, killing server')
        except DiyFrameworkException as e:
            logger.error('Critical framework failure:')
            logger.error(e.traceback)
        finally:
            self.loop.close()
    else:
        logger.info('Server already started - {0}'.format(self))
```
### Wrapping up
If you look at the source, the whole thing comes to just over 320 lines of code (more than 540 counting the tests). So little code for so much functionality came as a pleasant surprise. Mind you, this framework doesn't offer templating, authentication, or database access (those are fun too!). Along the way I also learned how frameworks like Django and Tornado work, and I can now debug them much more quickly.
This was also the first project I completed in true TDD fashion, and the process was fun and rewarding. Writing the tests first forced me to think about design and architecture, rather than just throwing code together and hoping it runs. Don't get me wrong; there are plenty of situations where the latter approach is better. Still, if you want to be sure that barely maintained code will keep working weeks or even months later, test-driven development is exactly what you need.
I studied the [Clean Architecture][13] and the dependency-injection pattern, which show up clearly in how the Router class is a higher-level abstraction. The Router class sits close to the core, while things like http_parser and App live near the edges, since they only do trivial string-and-byte work or mid-level IO. TDD forced me to think about each small piece in isolation, which made me ask myself questions like: does this combination of method calls read well? Do the class names accurately reflect the problem I'm solving? Is it easy to tell the different layers of abstraction apart in my code?
Go ahead and write a little framework; it's really fun :)
--------------------------------------------------------------------------------
via: http://mattscodecave.com/posts/simple-python-framework-from-scratch.html
Author: [Matt][a]
Translator: [Cathon](https://github.com/Cathon)
Proofreader: [校对者ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
[a]: http://mattscodecave.com/hire-me.html
[1]: https://github.com/sirMackk/diy_framework
[2]:https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
[3]: https://en.wikipedia.org/wiki/HTTP_persistent_connection
[4]: https://en.wikipedia.org/wiki/TCP_congestion-avoidance_algorithm#Slow_start
[5]: https://docs.python.org/3/library/asyncio-stream.html
[6]: https://docs.python.org/3/library/asyncio-protocol.html
[7]: https://github.com/sirMackk/diy_framework/blob/88968e6b30e59504251c0c7cd80abe88f51adb79/diy_framework/http_server.py#L46
[8]: https://docs.python.org/3/library/functions.html#bytearray
[9]: https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.Handle
[10]: https://en.wikipedia.org/wiki/Dependency_injection
[11]: https://docs.python.org/3/library/unittest.mock.html
[12]: https://docs.python.org/3/library/asyncio-stream.html#asyncio.start_server
[13]: https://blog.8thlight.com/uncle-bob/2012/08/13/the-clean-architecture.html