This commit is contained in:
Xingyu Wang 2019-11-19 22:25:21 +08:00
parent 39aebf96ab
commit f49c159b09


@ -7,34 +7,33 @@
[#]: via: (https://opensource.com/article/19/10/protobuf-data-interchange)
[#]: author: (Marty Kalin https://opensource.com/users/mkalindepauledu)
How to use Protobuf for data interchange
How to use Protobuf for data interchange
======
Protobuf encoding increases efficiency when exchanging data between
applications written in different languages and running on different
platforms.
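> Protobuf encoding increases efficiency when exchanging data between applications written in different languages and running on different platforms.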
![metrics and data shown on a computer screen][1]
Protocol buffers ([Protobufs][2]), like XML and JSON, allow applications, which may be written in different languages and running on different platforms, to exchange data. For example, a sending application written in Go could encode a Go-specific sales order in Protobuf, which a receiver written in Java then could decode to get a Java-specific representation of the received order. Here is a sketch of the architecture over a network connection:
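Protocol buffers ([Protobufs][2]), like XML and JSON, let applications written in different languages and running on different platforms exchange data. For example, a sending application written in Go can encode a Go-specific sales order in Protobuf, and a receiver written in Java can then decode it to get a Java-specific representation of the received order. Here is a sketch of the architecture over a network connection: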
```
`Go sales order--->Pbuf-encode--->network--->Pbuf-decode--->Java sales order`
Go sales order--->Pbuf-encode--->network--->Pbuf-decode--->Java sales order
```
Protobuf encoding, in contrast to its XML and JSON counterparts, is binary rather than text, which can complicate debugging. However, as the code examples in this article confirm, the Protobuf encoding is significantly more efficient in size than either XML or JSON encoding.
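In contrast to XML and JSON, Protobuf encoding is binary rather than text, which can complicate debugging. However, as the code examples in this article confirm, Protobuf encoding is significantly more efficient in size than either XML or JSON encoding.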
Protobuf is efficient in another way. At the implementation level, Protobuf and other encoding systems serialize and deserialize structured data. Serialization transforms a language-specific data structure into a bytestream, and deserialization is the inverse operation that transforms a bytestream back into a language-specific data structure. Serialization and deserialization may become the bottleneck in data interchange because these operations are CPU-intensive. Efficient serialization and deserialization is another Protobuf design goal.
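Protobuf is efficient in another way as well. At the implementation level, Protobuf and other encoding systems serialize and deserialize structured data. Serialization converts a language-specific data structure into a bytestream, and deserialization is the inverse operation that converts a bytestream back into a language-specific data structure. Serialization and deserialization can become the bottleneck in data interchange because these operations are CPU-intensive. Efficient serialization and deserialization is another Protobuf design goal.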
Recent encoding technologies, such as Protobuf and FlatBuffers, derive from the [DCE/RPC][3] (Distributed Computing Environment/Remote Procedure Call) initiative of the early 1990s. Like DCE/RPC, Protobuf contributes to both the [IDL][4] (interface definition language) and the encoding layer in data interchange.
Recent encoding technologies, such as Protobuf and FlatBuffers, derive from the [DCE/RPC][3] (Distributed Computing Environment/Remote Procedure Call) initiative of the early 1990s. Like DCE/RPC, Protobuf contributes to both the [IDL][4] (interface definition language) and the encoding layer in data interchange.
This article will look at these two layers then provide code examples in Go and Java to flesh out Protobuf details and show that Protobuf is easy to use.
This article looks at these two layers, then provides code examples in Go and Java to flesh out the Protobuf details and show that Protobuf is easy to use.
### Protobuf as an IDL and encoding layer
### Protobuf as an IDL and encoding layer
DCE/RPC, like Protobuf, is designed to be language- and platform-neutral. The appropriate libraries and utilities allow any language and platform to play in the DCE/RPC arena. Furthermore, the DCE/RPC architecture is elegant. An IDL document is the contract between the remote procedure on the one side and callers on the other side. Protobuf, too, centers on an IDL document.
An IDL document is text and, in DCE/RPC, uses basic C syntax along with syntactic extensions for metadata (square brackets) and a few new keywords such as **interface**. Here is an example:
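Like Protobuf, DCE/RPC is designed to be language- and platform-neutral. The appropriate libraries and utilities allow any language and platform to take part in the DCE/RPC arena. Furthermore, the DCE/RPC architecture is elegant. An IDL document is the contract between the remote procedure on one side and its callers on the other. Protobuf, too, centers on an IDL document.
An IDL document is text and, in DCE/RPC, uses basic C syntax along with syntactic extensions for metadata (square brackets) and a few new keywords such as `interface`. Here is an example: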
```
[uuid (2d6ead46-05e3-11ca-7dd1-426909beabcd), version(1.0)]
@ -48,29 +47,28 @@ interface echo {
}
```
This IDL document declares a procedure named **echo**, which takes three arguments: the **[in]** arguments of type **handle_t** (implementation pointer) and **idl_char** (array of ASCII characters) are passed to the remote procedure, whereas the **[out]** argument (also a string) is passed back from the procedure. In this example, the **echo** procedure does not explicitly return a value (the **void** to the left of **echo**) but could do so. A return value, together with one or more **[out]** arguments, allows the remote procedure to return arbitrarily many values. The next section introduces a Protobuf IDL, which differs in syntax but likewise serves as a contract in data interchange.
The IDL document, in both DCE/RPC and Protobuf, is the input to utilities that create the infrastructure code for exchanging data:
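This IDL document declares a procedure named `echo`, which takes three arguments: the `[in]` arguments of type `handle_t` (implementation pointer) and `idl_char` (array of ASCII characters) are passed to the remote procedure, whereas the `[out]` argument (also a string) is passed back from the procedure. In this example, the `echo` procedure does not explicitly return a value (the `void` to the left of `echo`) but could do so. A return value, together with one or more `[out]` arguments, allows the remote procedure to return arbitrarily many values. The next section introduces a Protobuf IDL, which differs in syntax but likewise serves as a contract in data interchange.
The IDL document, in both DCE/RPC and Protobuf, is the input to utilities that create the infrastructure code for exchanging data: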
```
`IDL document--->DCE/RPC or Protobuf utilities--->support code for data interchange`
IDL document--->DCE/RPC or Protobuf utilities--->support code for data interchange
```
As relatively straightforward text, the IDL is likewise human-readable documentation about the specifics of the data interchange—in particular, the number of data items exchanged and the data type of each item.
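As relatively straightforward text, the IDL also serves as human-readable documentation about the specifics of the data interchange (in particular, the number of data items exchanged and the data type of each item).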
Protobuf can be used in a modern RPC system such as [gRPC][5]; but Protobuf on its own provides only the IDL layer and the encoding layer for messages passed from a sender to a receiver. Protobuf encoding, like the DCE/RPC original, is binary but more efficient.
Protobuf can be used in a modern RPC system such as [gRPC][5]; on its own, however, Protobuf provides only the IDL layer and the encoding layer for messages passed from a sender to a receiver. Like the original DCE/RPC, Protobuf encoding is binary, but it is more efficient.
At present, XML and JSON encodings still dominate in data interchange through technologies such as web services, which make use of in-place infrastructure such as web servers, transport protocols (e.g., TCP, HTTP), and standard libraries and utilities for processing XML and JSON documents. Moreover, database systems of various flavors can store XML and JSON documents, and even legacy relational systems readily generate XML encodings of query results. Every general-purpose programming language now has libraries that support XML and JSON. What, then, recommends a return to a _binary_ encoding system such as Protobuf?
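At present, XML and JSON encodings still dominate in data interchange through technologies such as web services, which rely on existing infrastructure such as web servers, transport protocols (e.g., TCP, HTTP), and standard libraries and utilities for processing XML and JSON documents. Moreover, database systems of various kinds can store XML and JSON documents, and even legacy relational systems readily generate XML encodings of query results. Every general-purpose programming language now has libraries that support XML and JSON. What, then, recommends a return to a **binary** encoding system such as Protobuf?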
Consider the negative decimal value **-128**. In the 2's complement binary representation, which dominates across systems and languages, this value can be stored in a single 8-bit byte: 10000000. The text encoding of this integer value in XML or JSON requires multiple bytes. For example, UTF-8 encoding requires four bytes for the string, literally **-128**, which is one byte per character (in hex, the values are 0x2d, 0x31, 0x32, and 0x38). XML and JSON also add markup characters, such as angle brackets and braces, to the mix. Details about Protobuf encoding are forthcoming, but the point of interest now is a general one: Text encodings tend to be significantly less compact than binary ones.
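Consider the negative decimal value `-128`. In the 2's complement binary representation, which dominates across systems and languages, this value can be stored in a single 8-bit byte: `10000000`. The text encoding of this integer value in XML or JSON requires multiple bytes. For example, UTF-8 encoding requires four bytes for the string `-128`, one byte per character (in hex, the values are `0x2d`, `0x31`, `0x32`, and `0x38`). XML and JSON also add markup characters, such as angle brackets and braces. Details about Protobuf encoding follow shortly, but the point of interest now is a general one: text encodings tend to be significantly less compact than binary ones.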
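The `-128` comparison can be checked directly. The following is a minimal Go sketch, not part of the original article; the helper name `binaryAndText` is made up for illustration. It returns the single 2's complement byte and the bytes of the decimal text form.

```go
package main

import (
	"fmt"
	"strconv"
)

// binaryAndText (a hypothetical helper) returns the one-byte 2's complement
// form of n and the bytes of its decimal text form.
func binaryAndText(n int8) (byte, []byte) {
	return byte(n), []byte(strconv.Itoa(int(n)))
}

func main() {
	bin, text := binaryAndText(-128)
	fmt.Printf("binary: %08b (1 byte)\n", bin)              // bit pattern of the single byte
	fmt.Printf("text:   % x (%d bytes)\n", text, len(text)) // hex value of each character
}
```

On a run, the first line shows `10000000` and the second shows `2d 31 32 38`, four bytes for the four characters, before any XML or JSON markup is added.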
### A code example in Go using Protobuf
My code examples focus on Protobuf rather than RPC. Here is an overview of the first example:
  * The IDL file named _dataitem.proto_ defines a Protobuf **message** with eight fields of different types: integer values with different ranges, floating-point values of a fixed size, and strings of two different lengths.
* The Protobuf compiler uses the IDL file to generate a Go-specific version (and, later, a Java-specific version) of the Protobuf **message** together with supporting functions.
  * The IDL file named _dataitem.proto_ defines a Protobuf `message` with eight fields of different types: integer values with different ranges, floating-point values of a fixed size, and strings of two different lengths.
* The Protobuf compiler uses the IDL file to generate a Go-specific version (and, later, a Java-specific version) of the Protobuf `message` together with supporting functions.
* A Go app populates the native Go data structure with randomly generated values and then serializes the result to a local file. For comparison, XML and JSON encodings also are serialized to local files.
* As a test, the Go application reconstructs an instance of its native data structure by deserializing the contents of the Protobuf file.
* As a language-neutrality test, the Java application also deserializes the contents of the Protobuf file to get an instance of a native data structure.
@ -101,9 +99,9 @@ message DataItem {
}
```
The IDL uses the current proto3 rather than the earlier proto2 syntax. The package name (in this case, **main**) is optional but customary; it is used to avoid name conflicts. The structured **message** contains eight fields, each of which has a Protobuf data type (e.g., **int64**, **string**), a name (e.g., **oddA**, **short**), and a numeric tag (aka key) after the equals sign **=**. The tags, which are 1 through 8 in this example, are unique integer identifiers that determine the order in which the fields are serialized.
The IDL uses the current proto3 rather than the earlier proto2 syntax. The package name (in this case, `main`) is optional but customary; it is used to avoid name conflicts. The structured `message` contains eight fields, each of which has a Protobuf data type (e.g., `int64`, `string`), a name (e.g., `oddA`, `short`), and a numeric tag (aka key) after the equals sign `=`. The tags, which are 1 through 8 in this example, are unique integer identifiers that determine the order in which the fields are serialized.
Protobuf messages can be nested to arbitrary levels, and one message can be the field type in the other. Here's an example that uses the **DataItem** message as a field type:
Protobuf messages can be nested to arbitrary levels, and one message can be the field type in the other. Here's an example that uses the `DataItem` message as a field type:
```
@ -112,7 +110,7 @@ message DataItems {
}
```
A single **DataItems** message consists of repeated (zero or more) **DataItem** messages.
A single `DataItems` message consists of repeated (zero or more) `DataItem` messages.
Protobuf also supports enumerated types for clarity:
@ -123,9 +121,9 @@ enum PartnershipStatus {
}
```
The **reserved** qualifier ensures that the numeric values used to implement the three symbolic names cannot be reused.
The `reserved` qualifier ensures that the numeric values used to implement the three symbolic names cannot be reused.
To generate a language-specific version of one or more declared Protobuf **message** structures, the IDL file containing these is passed to the _protoc_ compiler (available in the [Protobuf GitHub repository][7]). For the Go code, the supporting Protobuf library can be installed in the usual way (with **%** as the command-line prompt):
To generate a language-specific version of one or more declared Protobuf `message` structures, the IDL file containing these is passed to the _protoc_ compiler (available in the [Protobuf GitHub repository][7]). For the Go code, the supporting Protobuf library can be installed in the usual way (with `%` as the command-line prompt):
```
@ -139,7 +137,7 @@ The command to compile the Protobuf IDL file _dataitem.proto_ into Go source cod
`% protoc --go_out=. dataitem.proto`
```
The flag **--go_out** directs the compiler to generate Go source code; there are similar flags for other languages. The result, in this case, is a file named _dataitem.pb.go_, which is small enough that the essentials can be copied into a Go application. Here are the essentials from the generated code:
The flag `--go_out` directs the compiler to generate Go source code; there are similar flags for other languages. The result, in this case, is a file named _dataitem.pb.go_, which is small enough that the essentials can be copied into a Go application. Here are the essentials from the generated code:
```
@ -162,21 +160,21 @@ func (*DataItem) ProtoMessage()    {}
func init() {}
```
The compiler-generated code has a Go structure **DataItem**, which exports the Go fields—the names are now capitalized—that match the names declared in the Protobuf IDL. The structure fields have standard Go data types: **int32**, **int64**, **float32**, and **string**. At the end of each field line, as a string, is metadata that describes the Protobuf types, gives the numeric tags from the Protobuf IDL document, and provides information about JSON, which is discussed later.
The compiler-generated code has a Go structure `DataItem`, which exports the Go fields—the names are now capitalized—that match the names declared in the Protobuf IDL. The structure fields have standard Go data types: `int32`, `int64`, `float32`, and `string`. At the end of each field line, as a string, is metadata that describes the Protobuf types, gives the numeric tags from the Protobuf IDL document, and provides information about JSON, which is discussed later.
There are also functions; the most important is **proto.Marshal** for serializing an instance of the **DataItem** structure into Protobuf format. The helper functions include **Reset**, which clears a **DataItem** structure, and **String**, which produces a one-line string representation of a **DataItem**.
There are also functions; the most important is `proto.Marshal` for serializing an instance of the `DataItem` structure into Protobuf format. The helper functions include `Reset`, which clears a `DataItem` structure, and `String`, which produces a one-line string representation of a `DataItem`.
The metadata that describes Protobuf encoding deserves a closer look before analyzing the Go program in more detail.
### Protobuf encoding
A Protobuf message is structured as a collection of key/value pairs, with the numeric tag as the key and the corresponding field as the value. The field names, such as **oddA** and **small**, are for human readability, but the _protoc_ compiler does use the field names in generating language-specific counterparts. For example, the **oddA** and **small** names in the Protobuf IDL become the fields **OddA** and **Small**, respectively, in the Go structure.
A Protobuf message is structured as a collection of key/value pairs, with the numeric tag as the key and the corresponding field as the value. The field names, such as `oddA` and `small`, are for human readability, but the _protoc_ compiler does use the field names in generating language-specific counterparts. For example, the `oddA` and `small` names in the Protobuf IDL become the fields `OddA` and `Small`, respectively, in the Go structure.
The keys and their values both get encoded, but with an important difference: some numeric values have a fixed-size encoding of 32 or 64 bits, whereas others (including the **message** tags) are _varint_ encoded, so the number of bits depends on the integer's absolute value. For example, the tags 1 through 15 require 8 bits to encode in _varint_, whereas the tags 16 through 2047 require 16 bits, because each encoded key carries a 3-bit type code alongside the tag. The _varint_ encoding, similar in spirit (but not in detail) to UTF-8 encoding, favors small integer values over large ones. (For a detailed analysis, see the Protobuf [encoding guide][8].) The upshot is that a Protobuf **message** should have small integer values in fields, if possible, and as few keys as possible, but one key per field is unavoidable.
The keys and their values both get encoded, but with an important difference: some numeric values have a fixed-size encoding of 32 or 64 bits, whereas others (including the `message` tags) are _varint_ encoded, so the number of bits depends on the integer's absolute value. For example, the tags 1 through 15 require 8 bits to encode in _varint_, whereas the tags 16 through 2047 require 16 bits, because each encoded key carries a 3-bit type code alongside the tag. The _varint_ encoding, similar in spirit (but not in detail) to UTF-8 encoding, favors small integer values over large ones. (For a detailed analysis, see the Protobuf [encoding guide][8].) The upshot is that a Protobuf `message` should have small integer values in fields, if possible, and as few keys as possible, but one key per field is unavoidable.
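The size thresholds can be explored with Go's standard library, whose `binary.PutUvarint` writes the same base-128 varint format. This is a rough sketch rather than code from the article: the helpers `varintLen` and `keyLen` are illustrative names, and `keyLen` assumes the key layout given in the Protobuf encoding guide, where the tag is shifted left three bits and combined with a wire type.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// varintLen reports how many bytes the base-128 varint encoding of v occupies.
func varintLen(v uint64) int {
	buf := make([]byte, binary.MaxVarintLen64)
	return binary.PutUvarint(buf, v)
}

// keyLen reports the encoded size of a field key, assuming the key is the
// varint of (tag << 3) | wireType.
func keyLen(tag, wireType uint64) int {
	return varintLen(tag<<3 | wireType)
}

func main() {
	// Plain values: one byte holds seven payload bits.
	for _, v := range []uint64{1, 127, 128, 16383, 16384} {
		fmt.Printf("value %5d -> %d byte(s)\n", v, varintLen(v))
	}
	// Keys: three of the payload bits go to the wire type.
	for _, tag := range []uint64{1, 15, 16, 2047, 2048} {
		fmt.Printf("tag   %5d -> %d byte(s)\n", tag, keyLen(tag, 0))
	}
}
```

Under these assumptions a plain value stays at one byte up to 127, while a key stays at one byte only up to tag 15 and at two bytes up to tag 2047.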
Table 1 below gives the gist of Protobuf encoding:
**Table 1. Protobuf data types**
`Table 1. Protobuf data types`
Encoding | Sample types | Length
---|---|---
@ -184,9 +182,9 @@ varint | int32, uint32, int64 | Variable length
fixed | fixed32, float, double | Fixed 32-bit or 64-bit length
byte sequence | string, bytes | Sequence length
Integer types that are not explicitly **fixed** are _varint_ encoded; hence, in a _varint_ type such as **uint32** (**u** for unsigned), the number 32 describes the integer's range (in this case, 0 to 2^32 - 1) rather than its bit size, which differs depending on the value. For fixed types such as **fixed32** or **double**, by contrast, the Protobuf encoding requires 32 and 64 bits, respectively. Strings in Protobuf are byte sequences; hence, the size of the field encoding is the length of the byte sequence.
Integer types that are not explicitly `fixed` are _varint_ encoded; hence, in a _varint_ type such as `uint32` (`u` for unsigned), the number 32 describes the integer's range (in this case, 0 to 2^32 - 1) rather than its bit size, which differs depending on the value. For fixed types such as `fixed32` or `double`, by contrast, the Protobuf encoding requires 32 and 64 bits, respectively. Strings in Protobuf are byte sequences; hence, the size of the field encoding is the length of the byte sequence.
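The contrast between a fixed and a _varint_ encoding of the same 32-bit value can be illustrated with the standard library. The sketch below is an illustration only: `fixed32` and `varint` are hypothetical helper names, and only the value bytes are produced, without the field key that a real Protobuf message would also carry.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// fixed32 encodes v in exactly four little-endian bytes, whatever its value.
func fixed32(v uint32) []byte {
	buf := make([]byte, 4)
	binary.LittleEndian.PutUint32(buf, v)
	return buf
}

// varint encodes v as a base-128 varint: one to five bytes for a uint32.
func varint(v uint32) []byte {
	buf := make([]byte, binary.MaxVarintLen32)
	n := binary.PutUvarint(buf, uint64(v))
	return buf[:n]
}

func main() {
	for _, v := range []uint32{1, 300, 1 << 20, 1<<32 - 1} {
		fmt.Printf("%10d  fixed32: %d bytes  varint: %d byte(s)\n",
			v, len(fixed32(v)), len(varint(v)))
	}
}
```

The fixed form always costs four bytes, whereas the varint form costs one byte for 1 but five bytes for the largest 32-bit value, so a fixed type can be the smaller choice when values are routinely large.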
Another efficiency deserves mention. Recall the earlier example in which a **DataItems** message consists of repeated **DataItem** instances:
Another efficiency deserves mention. Recall the earlier example in which a `DataItems` message consists of repeated `DataItem` instances:
```
@ -195,13 +193,13 @@ message DataItems {
}
```
The **repeated** means that the **DataItem** instances are _packed_: the collection has a single tag, in this case, 1. A **DataItems** message with repeated **DataItem** instances is thus more efficient than a message with multiple but separate **DataItem** fields, each of which would require a tag of its own.
The `repeated` means that the `DataItem` instances are _packed_: the collection has a single tag, in this case, 1. A `DataItems` message with repeated `DataItem` instances is thus more efficient than a message with multiple but separate `DataItem` fields, each of which would require a tag of its own.
With this background in mind, let's return to the Go program.
### The dataItem program in detail
The _dataItem_ program creates a **DataItem** instance and populates the fields with randomly generated values of the appropriate types. Go has a **rand** package with functions for generating pseudo-random integer and floating-point values, and my **randString** function generates pseudo-random strings of specified lengths from a character set. The design goal is to have a **DataItem** instance with field values of different types and bit sizes. For example, the **OddA** and **EvenA** values are 64-bit non-negative integer values of odd and even parity, respectively; but the **OddB** and **EvenB** variants are 32 bits in size and hold small integer values between 0 and 2047. The random floating-point values are 32 bits in size, and the strings are 16 (**Short**) and 32 (**Long**) characters in length. Here is the code segment that populates the **DataItem** structure with random values:
The _dataItem_ program creates a `DataItem` instance and populates the fields with randomly generated values of the appropriate types. Go has a `rand` package with functions for generating pseudo-random integer and floating-point values, and my `randString` function generates pseudo-random strings of specified lengths from a character set. The design goal is to have a `DataItem` instance with field values of different types and bit sizes. For example, the `OddA` and `EvenA` values are 64-bit non-negative integer values of odd and even parity, respectively; but the `OddB` and `EvenB` variants are 32 bits in size and hold small integer values between 0 and 2047. The random floating-point values are 32 bits in size, and the strings are 16 (`Short`) and 32 (`Long`) characters in length. Here is the code segment that populates the `DataItem` structure with random values:
```
@ -234,7 +232,7 @@ dataItem := &DataItem {
}
```
Once created and populated with values, the **DataItem** instance is encoded in XML, JSON, and Protobuf, with each encoding written to a local file:
Once created and populated with values, the `DataItem` instance is encoded in XML, JSON, and Protobuf, with each encoding written to a local file:
```
@ -250,7 +248,7 @@ func encodeAndserialize(dataItem *DataItem) {
}
```
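The text-encoding half of this step can be tried with the standard library alone. The sketch below is not the article's program: `Item` is a hypothetical, cut-down stand-in for the generated `DataItem` structure, and the Protobuf encoding is left out because it needs the generated support code. It compares compact and indented output sizes for JSON and XML.

```go
package main

import (
	"encoding/json"
	"encoding/xml"
	"fmt"
)

// Item is a hypothetical stand-in for the generated DataItem structure.
type Item struct {
	OddA  int64
	EvenA int64
	Short string
}

// sizes returns the byte counts of the compact and indented JSON and XML
// encodings of it. Errors are ignored for brevity, as in the article.
func sizes(it Item) (jsonCompact, jsonIndented, xmlCompact, xmlIndented int) {
	jc, _ := json.Marshal(it)
	ji, _ := json.MarshalIndent(it, "", " ")
	xc, _ := xml.Marshal(it)
	xi, _ := xml.MarshalIndent(it, "", " ")
	return len(jc), len(ji), len(xc), len(xi)
}

func main() {
	jc, ji, xc, xi := sizes(Item{OddA: 1001, EvenA: 1000, Short: "sample"})
	fmt.Printf("json: %d bytes compact, %d indented\n", jc, ji)
	fmt.Printf("xml:  %d bytes compact, %d indented\n", xc, xi)
}
```

Indentation adds only blanks and newlines, so dropping it trims each text encoding slightly without changing the data.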
The three serializing functions use the term _marshal_, which is roughly synonymous with _serialize_. As the code indicates, each of the three **Marshal** functions returns an array of bytes, which then are written to a file. (Possible errors are ignored for simplicity.) On a sample run, the file sizes were:
The three serializing functions use the term _marshal_, which is roughly synonymous with _serialize_. As the code indicates, each of the three `Marshal` functions returns an array of bytes, which then are written to a file. (Possible errors are ignored for simplicity.) On a sample run, the file sizes were:
```
@ -261,7 +259,7 @@ dataitem.pbuf:  88 bytes
The Protobuf encoding is significantly smaller than the other two. The XML and JSON serializations could be reduced slightly in size by eliminating indentation characters, in this case, blanks and newlines.
Below is the _dataitem.json_ file produced by the **json.MarshalIndent** call, with added comments starting with **##**:
Below is the _dataitem.json_ file produced by the `json.MarshalIndent` call, with added comments starting with `##`:
```
@ -281,7 +279,7 @@ Although the serialized data goes into local files, the same approach would be u
### Testing serialization/deserialization
The Go program next runs an elementary test by deserializing the bytes, which were written earlier to the _dataitem.pbuf_ file, into a **DataItem** instance. Here is the code segment, with the error-checking parts removed:
The Go program next runs an elementary test by deserializing the bytes, which were written earlier to the _dataitem.pbuf_ file, into a `DataItem` instance. Here is the code segment, with the error-checking parts removed:
```
@ -291,7 +289,7 @@ testItem.Reset()                            // clear the DataItem
err = proto.Unmarshal(filebytes, testItem)  // deserialize into a DataItem instance
```
The **proto.Unmarshal** function for deserializing Protobuf is the inverse of the **proto.Marshal** function. The original **DataItem** and the deserialized clone are printed to confirm an exact match:
The `proto.Unmarshal` function for deserializing Protobuf is the inverse of the `proto.Marshal` function. The original `DataItem` and the deserialized clone are printed to confirm an exact match:
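The same marshal-then-unmarshal check can be sketched without the Protobuf library. The example below deliberately swaps in `encoding/json` from the standard library as a stand-in for `proto.Marshal` and `proto.Unmarshal`, so that it runs without generated code; `Item` and `roundTrip` are hypothetical names, not part of the article's program.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Item is a hypothetical stand-in for the generated DataItem structure.
type Item struct {
	OddA  int64
	Small float32
	Short string
}

// roundTrip serializes in, then deserializes the bytes into a fresh Item.
// JSON stands in for Protobuf here; the marshal/unmarshal shape is the same.
func roundTrip(in Item) (Item, error) {
	var out Item
	data, err := json.Marshal(in)
	if err != nil {
		return out, err
	}
	err = json.Unmarshal(data, &out)
	return out, err
}

func main() {
	original := Item{OddA: 1234567891, Small: 0.5, Short: "abc"}
	clone, err := roundTrip(original)
	if err != nil {
		fmt.Println("round trip failed:", err)
		return
	}
	fmt.Println("exact match:", clone == original)
}
```

Comparing the two structures in code, rather than by eye, scales better than printing once a message has more than a few fields.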
```
@ -308,7 +306,7 @@ boPb#T0O8Xd&Ps5EnSZqDg4Qztvo7IIs 9vH66AiGSQgCDxk&
### A Protobuf client in Java
The example in Java is to confirm Protobuf's language neutrality. The original IDL file could be used to generate the Java support code, which involves nested classes. To suppress warnings, however, a slight addition can be made. Here is the revision, which specifies a **DataMsg** as the name for the outer class, with the inner class automatically named **DataItem** after the Protobuf message:
The example in Java is to confirm Protobuf's language neutrality. The original IDL file could be used to generate the Java support code, which involves nested classes. To suppress warnings, however, a slight addition can be made. Here is the revision, which specifies a `DataMsg` as the name for the outer class, with the inner class automatically named `DataItem` after the Protobuf message:
```
@ -353,7 +351,7 @@ public class Main {
}
```
Production-grade testing would be far more thorough, of course, but even this preliminary test confirms the language-neutrality of Protobuf: the _dataitem.pbuf_ file results from the Go program's serialization of a Go **DataItem**, and the bytes in this file are deserialized to produce a **DataItem** instance in Java. The output from the Java test is the same as that from the Go test.
Production-grade testing would be far more thorough, of course, but even this preliminary test confirms the language-neutrality of Protobuf: the _dataitem.pbuf_ file results from the Go program's serialization of a Go `DataItem`, and the bytes in this file are deserialized to produce a `DataItem` instance in Java. The output from the Java test is the same as that from the Go test.
### Wrapping up with the numPairs program
@ -374,9 +372,9 @@ message NumPair {
}
```
A **NumPair** message consists of two **int32** values together with an integer tag for each field. A **NumPairs** message is a sequence of embedded **NumPair** messages.
A `NumPair` message consists of two `int32` values together with an integer tag for each field. A `NumPairs` message is a sequence of embedded `NumPair` messages.
The _numPairs_ program in Go (below) creates 2 million **NumPair** instances, with each appended to the **NumPairs** message. This message can be serialized and deserialized in the usual way.
The _numPairs_ program in Go (below) creates 2 million `NumPair` instances, with each appended to the `NumPairs` message. This message can be serialized and deserialized in the usual way.
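To make the on-the-wire shape of these messages concrete, here is a hand-rolled Go sketch of how one embedded `NumPair` might be laid out, following the wire format described in the Protobuf encoding guide. Several details are assumptions for illustration only: the field numbers (1 for the odd value, 2 for the even value, 1 for the repeated field) and all helper names. Real programs should rely on the generated code and `proto.Marshal`.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	wireVarint = 0 // wire type for varint scalars such as int32
	wireBytes  = 2 // wire type for length-delimited data such as embedded messages
)

// appendVarint appends the base-128 varint encoding of v to buf.
func appendVarint(buf []byte, v uint64) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], v)
	return append(buf, tmp[:n]...)
}

// encodeNumPair writes key/value pairs for both fields, assuming tags 1 and 2.
// A real proto3 serializer omits zero-valued fields; this sketch always writes both.
func encodeNumPair(odd, even int32) []byte {
	var buf []byte
	buf = appendVarint(buf, 1<<3|wireVarint) // key for the odd field
	buf = appendVarint(buf, uint64(odd))
	buf = appendVarint(buf, 2<<3|wireVarint) // key for the even field
	buf = appendVarint(buf, uint64(even))
	return buf
}

// embed appends pair to buf as one length-delimited element, assuming tag 1.
func embed(buf, pair []byte) []byte {
	buf = appendVarint(buf, 1<<3|wireBytes) // key for the embedded message
	buf = appendVarint(buf, uint64(len(pair)))
	return append(buf, pair...)
}

func main() {
	small := embed(nil, encodeNumPair(1001, 1000))
	large := embed(nil, encodeNumPair(2000000001, 2000000000))
	fmt.Printf("small pair: % x (%d bytes)\n", small, len(small))
	fmt.Printf("large pair: % x (%d bytes)\n", large, len(large))
}
```

Under these assumptions the small pair takes 8 bytes and the large pair takes 14, which shows how much the size of the field values drives the size of the encoded message.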
#### Example 2. The numPairs program
@ -460,11 +458,11 @@ func main() {
}
```
The randomly generated odd and even values in each **NumPair** range from zero to 2 billion and change. In terms of raw rather than encoded data, the integers generated in the Go program add up to 16MB: two integers per **NumPair** for a total of 4 million integers in all, and each value is four bytes in size.
The randomly generated odd and even values in each `NumPair` range from zero to 2 billion and change. In terms of raw rather than encoded data, the integers generated in the Go program add up to 16MB: two integers per `NumPair` for a total of 4 million integers in all, and each value is four bytes in size.
For comparison, the table below has entries for the XML, JSON, and Protobuf encodings of the 2 million **NumPair** instances in the sample **NumPairs** message. The raw data is included, as well. Because the _numPairs_ program generates random values, output differs across sample runs but is close to the sizes shown in the table.
For comparison, the table below has entries for the XML, JSON, and Protobuf encodings of the 2 million `NumPair` instances in the sample `NumPairs` message. The raw data is included, as well. Because the _numPairs_ program generates random values, output differs across sample runs but is close to the sizes shown in the table.
**Table 2. Encoding overhead for 16MB of integers**
`Table 2. Encoding overhead for 16MB of integers`
Encoding | File | Byte size | Pbuf/other ratio
---|---|---|---
@ -475,11 +473,11 @@ XML | pairs.xml | 126MB | 21%
As expected, Protobuf shines next to XML and JSON. The Protobuf encoding is about a quarter of the JSON one and about a fifth of the XML one. But the raw data make clear that Protobuf incurs the overhead of encoding: the serialized Protobuf message is 11MB larger than the raw data. Any encoding, including Protobuf, involves structuring the data, which unavoidably adds bytes.
Each of the serialized 2 million **NumPair** instances involves _four_ integer values: one apiece for the **Even** and **Odd** fields in the Go structure, and one tag for each field in the Protobuf encoding. As raw rather than encoded data, this would come to 16 bytes per instance, and there are 2 million instances in the sample **NumPairs** message. But the Protobuf tags, like the **int32** values in the **NumPair** fields, use _varint_ encoding and, therefore, vary in byte length; in particular, small integer values (which include the tags, in this case) require fewer than four bytes to encode.
Each of the serialized 2 million `NumPair` instances involves _four_ integer values: one apiece for the `Even` and `Odd` fields in the Go structure, and one tag for each field in the Protobuf encoding. As raw rather than encoded data, this would come to 16 bytes per instance, and there are 2 million instances in the sample `NumPairs` message. But the Protobuf tags, like the `int32` values in the `NumPair` fields, use _varint_ encoding and, therefore, vary in byte length; in particular, small integer values (which include the tags, in this case) require fewer than four bytes to encode.
If the _numPairs_ program is revised so that the two **NumPair** fields hold values less than 2048, which have encodings of either one or two bytes, then the Protobuf encoding drops from 27MB to 16MB—the very size of the raw data. The table below summarizes the new encoding sizes from a sample run.
If the _numPairs_ program is revised so that the two `NumPair` fields hold values less than 2048, which have encodings of either one or two bytes, then the Protobuf encoding drops from 27MB to 16MB—the very size of the raw data. The table below summarizes the new encoding sizes from a sample run.
**Table 3. Encoding with 16MB of integers < 2048**
`Table 3. Encoding with 16MB of integers < 2048`
Encoding | File | Byte size | Pbuf/other ratio
---|---|---|---