Database: A Guide to Evolvable Data Models and Encoding (Thrift, Protobuf, Avro)
Explore techniques for managing schema changes and maintaining forward and backward compatibility in evolving data models.
Schema changes are a critical aspect of application evolution. As applications grow and adapt to new requirements, their data models must also evolve. The ability to handle schema changes efficiently is crucial for maintaining application performance and user experience.
Schema
A schema is a blueprint or structure that defines how data is organized in a database. It outlines the tables, fields, and relationships within the database, essentially describing how data is stored and accessed. The schema specifies what data will be stored, the data types (e.g., integers, strings, dates), and how the data elements relate to each other.
Relational schema: defines tables, columns, and relationships using a structured, predefined format. Relational schemas are strict, meaning the data must conform to the defined structure.
Nonrelational schema: can store data in various formats, such as key-value pairs, documents, graphs, or wide-column stores, without requiring a fixed structure.
Encoding
Encoding is the process of converting data into a specific format so that it can be stored or transmitted efficiently. Think of it as a way of packaging data into a format that can be easily saved to disk, loaded into memory, or sent over a network.
Storage on Disk/In Memory: When data is saved to a database, it needs to be converted into a format that the database can efficiently store and retrieve. This is done through encoding.
Data Transmission Over the Internet: When data needs to be sent from one place to another, such as from a server to a client, it is serialized into an encoding format. Common formats include JSON (JavaScript Object Notation) and XML (eXtensible Markup Language).
Why Encoding Matters?
Data Representation: Encoding formats determine how data is serialized (converted into a byte stream) and deserialized (converted back into a usable format) when stored on disk or in memory. In databases that enforce schemas (e.g., relational databases), the schema defines the structure of the data. When data is written to the database, it is encoded according to this schema.
Reading/Writing Data Over the Internet: When data is transmitted over the internet (e.g., between a client and server or between different services), it needs to be encoded into a format that can be easily transmitted and understood by both parties. Common formats include JSON, XML, and Protocol Buffers. APIs often specify the encoding format to ensure consistent communication between services. For instance, a REST API might use JSON encoding for all its responses and requests. This ensures that any changes to the schema are managed within the constraints of the chosen encoding format.
Efficiency: Proper encoding helps save space and speed up data retrieval and transmission.
Schema Changes: When you update the schema of your database (e.g., adding a new field), encoding helps manage these changes so that old and new versions of the data can coexist without issues.
When schema changes occur, corresponding changes to application code are usually required to handle the new data structures. However, these code changes can't always be deployed instantaneously, especially in large applications. Rolling upgrades for server-side applications and delayed updates for client-side applications mean that old and new code, along with old and new data formats, may coexist for some time.
To maintain smooth operation during this period, it's essential to ensure both backward and forward compatibility. Backward compatibility allows newer code to read data written by older code, while forward compatibility enables older code to ignore or gracefully handle new additions made by newer code.
Types of encoding
Using language-specific serializing algorithms for encoding in-memory objects into byte sequences is generally discouraged for several reasons:
Language Dependency: Java uses java.io.Serializable, and Python uses pickle. As a result, data encoded in one language is typically very difficult to decode in another language.
Security Risks: To decode data back into the original object types, the process often needs to instantiate arbitrary classes. This requirement can introduce serious security risks. An attacker who can manipulate the byte sequence being decoded might instantiate malicious classes, potentially leading to remote code execution and other security breaches.
Versioning Challenges: These encoding libraries often treat versioning as an afterthought. They are designed for quick and easy data encoding but neglect the complexities of maintaining forward and backward compatibility.
Efficiency Issues: The performance of these built-in serialization methods can be suboptimal. For instance, Java’s built-in serialization is notorious for its poor performance and large encoded size.
Due to these issues, it's generally recommended to avoid using language-specific serializing algorithms for anything other than short-term or transient purposes. Instead, using standardized encoding formats like JSON, XML, Protocol Buffers, Thrift, or Avro is preferable.
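To make the language lock-in concrete, here is a minimal Python sketch (standard library only) showing that pickled bytes are tied to Python and to the class definition that produced them; the UserProfile class is purely illustrative:
import pickle

class UserProfile:
    def __init__(self, name, email, interests):
        self.name = name
        self.email = email
        self.interests = interests

user = UserProfile("John Doe", "john.doe@example.com", ["reading", "traveling", "coding"])

encoded = pickle.dumps(user)       # Python-specific byte stream that embeds the class's module path
restored = pickle.loads(encoded)   # unpickling instantiates classes, which is unsafe for untrusted input
print(restored.name)               # "John Doe"
Nothing in these bytes is readable by, say, a Java service, which is exactly why cross-language formats are preferred for anything long-lived.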
JSON (JavaScript Object Notation)
{
"name": "John Doe",
"email": "john.doe@example.com",
"interests": ["reading", "traveling", "coding"]
}
JSON is easy to read and write, making it ideal for configuration files, data interchange, and debugging. It is supported by most programming languages, allowing universal data interchange; its schema-less format enables flexible and dynamic data structures; and built-in libraries in most languages make parsing and generating JSON straightforward. For comparison with the binary formats discussed later, here is the same record encoded with MessagePack, a compact binary encoding for JSON-like data:
/* in hexadecimal */
83 A4 6E 61 6D 65 A8 4A 6F 68 6E 20 44 6F 65 A5 65 6D 61 69 6C B4 6A 6F 68 6E 2E 64 6F 65 40 65 78 61 6D 70 6C 65 2E 63 6F 6D A9 69 6E 74 65 72 65 73 74 73 93 A7 72 65 61 64 69 6E 67 A9 74 72 61 76 65 6C 69 6E 67 A6 63 6F 64 69 6E 67
/* Let's decode the given binary output piece by piece. */
83 - start (with 3 key-value pairs)
A4 - string of length 4
6E 61 6D 65 - name (key)
A8 - string of length 8
4A 6F 68 6E 20 44 6F 65 - John Doe (value)
A5 - string of length 5
65 6D 61 69 6C - email (key)
B4 - string of length 20
6A 6F 68 6E 2E 64 6F 65 40 65 78 61 6D 70 6C 65 2E 63 6F 6D - john.doe@example.com (value)
A9 - string of length 9
69 6E 74 65 72 65 73 74 73 - interests (key)
93 - array with 3 elements
A7 - string of length 7
72 65 61 64 69 6E 67 - reading (value 1)
A9 - string of length 9
74 72 61 76 65 6C 69 6E 67 - traveling (value 2)
A6 - string of length 6
63 6F 64 69 6E 67 - coding (value 3)
# Total bytes = 78 bytes
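The dump above can be reproduced with a MessagePack library; this is a minimal sketch assuming the third-party msgpack package is installed:
import msgpack

record = {
    "name": "John Doe",
    "email": "john.doe@example.com",
    "interests": ["reading", "traveling", "coding"],
}

encoded = msgpack.packb(record)      # binary MessagePack encoding of the JSON-like record
print(len(encoded), encoded.hex())   # byte count and hex dump comparable to the one above
decoded = msgpack.unpackb(encoded)   # back to a Python dict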
Limitations:
Verbose: JSON can be inefficient in terms of storage and bandwidth due to its verbose nature.
Limited Data Types: Supports basic data types but lacks more complex structures like dates and binary data.
No Built-In Schema Validation: Requires additional tools for schema validation and ensuring data consistency.
Performance: Parsing and generating JSON can be slower compared to binary formats, especially for large datasets.
XML (Extensible Markup Language)
<user>
<name>John Doe</name>
<email>john.doe@example.com</email>
<interests>
<interest>reading</interest>
<interest>traveling</interest>
<interest>coding</interest>
</interests>
</user>
XML is easy to understand and debug, designed to be both human and machine-readable. Allows representation of complex data relationships and hierarchies. Schemas like DTD and XSD enable rigorous validation of document structure and content.
Limitations:
Verbose: XML is more verbose than JSON, leading to larger file sizes and increased data transmission costs.
Complex Parsing: Parsing XML is more resource-intensive compared to JSON and binary formats.
Limited Data Types: Primarily supports text content, requiring additional parsing for complex data types.
No Native Data Structures: Does not natively support arrays or objects as intuitively as JSON.
Binary Encoding Formats: Why Use Them?
While JSON and XML are widely used for their readability and flexibility, binary encoding formats like Protocol Buffers (Protobuf), Thrift, and Avro offer significant advantages in terms of efficiency, performance, and data consistency. Here’s why binary encoding formats are often preferred in certain scenarios:
Efficiency: Binary formats are much more compact compared to text-based formats like JSON and XML. This reduces storage requirements and minimizes data transmission costs.
Performance: Binary formats enable faster parsing and serialization due to their compact and well-defined structure.
Strong Typing and Schema Enforcement: Binary formats require a predefined schema, ensuring strong typing and data consistency. This schema-based approach prevents errors that can arise from missing or malformed data.
Backward and Forward Compatibility: Binary formats handle backward and forward compatibility effectively, allowing systems to evolve without breaking existing data.
Use Cases in Modern Databases
Bigtable: Uses Protobuf for efficient data storage and retrieval.
Cassandra: Historically exposed its client API over Thrift (newer versions use the CQL native protocol instead).
Apache Hadoop and Kafka: Avro is widely used for data serialization in the Hadoop ecosystem and for message serialization in Kafka.
Apache Thrift Encoding
Apache Thrift is a binary serialization format developed by Facebook for efficient data interchange between different programming languages. Thrift requires an Interface Definition Language (IDL) file to define the structure of the data.
struct UserProfile {
1: required string name,
2: required string email,
3: optional list<string> interests
}
Thrift Binary Protocol
Apache Thrift serializes data according to the Thrift IDL definition and the Thrift Binary Protocol. Here's the binary protocol output for the above example:
/* in hexadecimal */
0b 00 01 00 00 00 08 4a 6f 68 6e 20 44 6f 65 0b 00 02 00 00 00 14 6a 6f 68 6e 2e 64 6f 65 40 65 78 61 6d 70 6c 65 2e 63 6f 6d 0f 00 03 0b 00 00 00 03 00 00 00 07 72 65 61 64 69 6e 67 00 00 00 09 74 72 61 76 65 6c 69 6e 67 00 00 00 06 63 6f 64 69 6e 67 00
/* Let's decode the given binary output piece by piece. */
0b - Type: string (11)
00 01 - Field ID: 1
00 00 00 08 - Length: 8
John Doe - 4a 6f 68 6e 20 44 6f 65 (Value)
0b - Type: string (11)
00 02 - Field ID: 2
00 00 00 14 - Length: 20 (0x14)
john.doe@example.com - 6a 6f 68 6e 2e 64 6f 65 40 65 78 61 6d 70 6c 65 2e 63 6f 6d (Value)
0f - Type: list (15)
00 03 - Field ID: 3
0b - Element Type: string (11)
00 00 00 03 - Number of Elements: 3
00 00 00 07 - Length: 7
reading - 72 65 61 64 69 6e 67 (Value)
00 00 00 09 - Length: 9
traveling - 74 72 61 76 65 6c 69 6e 67 (Value)
00 00 00 06 - Length: 6
coding - 63 6f 64 69 6e 67 (Value)
00 - End of struct
# Total bytes = 85 bytes
To deserialize the Thrift-encoded binary data back into a structured object, two things are needed:
Thrift IDL Definition: The Thrift IDL (Interface Definition Language) file that describes the structure of the data, including the data types and field IDs.
Encoded Data: The data that was encoded using the Thrift binary protocol.
The above encoding consumes more bytes than the MessagePack-encoded JSON (85 bytes versus 78), because Thrift's binary protocol spends 2 bytes on each field ID and 4 bytes on each length prefix. However, Thrift does not store the field names at all, only numeric field IDs, so as the dataset grows its encoding tends to become relatively more compact than formats that repeat the keys for every record.
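As a sketch of how this encoding is produced and consumed in practice, the Python below assumes the IDL above has been compiled with thrift --gen py and that the generated module is named user_profile (the package name depends on the .thrift file and namespace, so treat it as illustrative):
from thrift.protocol import TBinaryProtocol
from thrift.transport import TTransport

from user_profile.ttypes import UserProfile   # generated code; module name is an assumption

user = UserProfile(name="John Doe",
                   email="john.doe@example.com",
                   interests=["reading", "traveling", "coding"])

# Serialize with the binary protocol into an in-memory buffer
out_buf = TTransport.TMemoryBuffer()
user.write(TBinaryProtocol.TBinaryProtocol(out_buf))
encoded = out_buf.getvalue()

# Deserialize: the generated class (derived from the IDL and its field IDs) drives decoding
in_buf = TTransport.TMemoryBuffer(encoded)
decoded = UserProfile()
decoded.read(TBinaryProtocol.TBinaryProtocol(in_buf))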
Thrift Compact Protocol
The Thrift Compact Protocol is an encoding scheme designed to reduce the size of serialized data for efficient network transmission and storage. It achieves this by using variable-length encoding and other space-saving techniques to represent data more compactly compared to the standard Thrift Binary Protocol.
Compact Protocol: Combines the field type and field ID into a single byte or a few bytes, reducing the overhead.
Variable-Length Encoding: The Compact Protocol uses variable-length encoding for integers, meaning it uses fewer bytes to represent smaller numbers. For example, the number 5 can be represented in a single byte instead of a fixed 4-byte integer.
Efficient Type Encoding: The Compact Protocol uses special encodings for common types like booleans, which can be represented in just one bit instead of a full byte.
Length-Prefix Strings and Containers: Strings and containers (like lists) are prefixed with their length, but the length itself is encoded using variable-length encoding, which saves space when dealing with short strings or small lists.
Here's the compact protocol output for the above example:
/* in hexadecimal */
18 08 4a 6f 68 6e 20 44 6f 65 18 14 6a 6f 68 6e 2e 64 6f 65 40 65 78 61 6d 70 6c 65 2e 63 6f 6d 19 38 07 72 65 61 64 69 6e 67 09 74 72 61 76 65 6c 69 6e 67 06 63 6f 64 69 6e 67 00
/* Let's decode the given binary output piece by piece. */
18 - Field ID (1) and type (string, 8) combined
08 - length of string
John Doe - 4a 6f 68 6e 20 44 6f 65 (Value)
18 - Field ID (2 = 1+prevID) and type (string, 8) combined
14 - length of string: 20 (0x14)
john.doe@example.com - 6a 6f 68 6e 2e 64 6f 65 40 65 78 61 6d 70 6c 65 2e 63 6f 6d (Value)
19 - Field ID (3 = 1+prevID) and type (list, 9) combined
38 - count (3) and type (string, 8) combined
07 - length of string
reading - 72 65 61 64 69 6e 67 (Value)
09 - length of string
traveling - 74 72 61 76 65 6c 69 6e 67 (Value)
06 - length of string
coding - 63 6f 64 69 6e 67 (Value)
00 - End of struct
# Total bytes = 60 bytes
These optimizations make the Compact Protocol particularly useful in scenarios where bandwidth or storage is limited, such as mobile applications or networked services with high data throughput requirements.
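In the Python bindings, switching from the binary protocol to the compact protocol is just a matter of swapping the protocol class; a sketch reusing the user object and TTransport import from the binary-protocol example above:
from thrift.protocol import TCompactProtocol

buf = TTransport.TMemoryBuffer()
user.write(TCompactProtocol.TCompactProtocol(buf))   # same struct, compact wire format
encoded_compact = buf.getvalue()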
Backward Compatibility in Thrift
Thrift achieves this through optional fields and default values:
Optional Fields: Fields can be marked as optional. When new fields are added to a Thrift struct, they are typically marked as optional. This way, older clients and servers can still process the data without knowing about the new fields.
Default Values: Providing default values for new fields helps maintain compatibility. If an older version of a service encounters an unknown field, it can ignore it and use the default value.
Forward Compatibility in Thrift
Forward compatibility ensures that older versions of services can read data written by newer versions. Thrift supports this by allowing fields to be added or removed in a way that doesn’t break older clients or servers.
Unknown Fields: Thrift’s binary protocol includes field identifiers, allowing older versions to skip unknown fields.
struct User {
1: required string name;
2: optional i32 age = 0;
3: optional string email; // new field
}
Protobuf (Protocol Buffers)
Protocol Buffers (Protobuf) is a language-neutral, platform-neutral, extensible mechanism for serializing structured data. It was developed by Google and is widely used for data interchange in microservices and other applications.
Similar to Thrift, you define the structure of your data in a .proto file. Protobuf generates code in multiple languages based on the schema definition, and you use the generated code to serialize and deserialize data to/from a compact binary format. Here's how you would define the same structure in a Protobuf .proto file for the previous example data:
syntax = "proto3";
message UserProfile {
string name = 1;
string email = 2;
repeated string interests = 3;
}
Protocol Buffers use a combination of field numbers and wire types to encode data. The key points are:
Field Number: Identifies the field within the message.
Wire Type: Indicates the format of the data for that field (e.g., length-delimited, 32-bit, etc.).
The field number and wire type are packed into a single byte (or more) using the following format:
The lower three bits of the byte represent the wire type.
The remaining bits represent the field number.
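A quick worked example of this packing in Python, reproducing the tag byte of the email field (field number 2, wire type 2 for length-delimited data):
field_number = 2
wire_type = 2                              # length-delimited: strings, bytes, nested messages
tag = (field_number << 3) | wire_type      # lower three bits = wire type, upper bits = field number
print(hex(tag))                            # 0x12, the byte that precedes the email field below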
The ProtoBuf output for the above example:
/* in hexadecimal */
0a 08 4a 6f 68 6e 20 44 6f 65 12 14 6a 6f 68 6e 2e 64 6f 65 40 65 78 61 6d 70 6c 65 2e 63 6f 6d 1a 07 72 65 61 64 69 6e 67 1a 09 74 72 61 76 65 6c 69 6e 67 1a 06 63 6f 64 69 6e 67
/* Let's break it down */
0a - (00001010) in binary
- last three bits (010) represent the wire type (2)
- remaining bits (00001) represent the field number (1)
08 - length of string (8)
John Doe - 4a 6f 68 6e 20 44 6f 65 (Value)
12 - (00010010) in binary
- last three bits (010) represent the wire type (2)
- remaining bits (0010) represent the field number (2)
14 - length of string: 20 (0x14)
john.doe@example.com - 6a 6f 68 6e 2e 64 6f 65 40 65 78 61 6d 70 6c 65 2e 63 6f 6d (Value)
1a - (00011010) in binary
- last three bits (010) represent the wire type (2)
- remaining bits (0011) represent the field number (3)
07 - length of string (7)
reading - 72 65 61 64 69 6e 67 (Value)
1a - Field number 3, wire type 2
09 - length of string
traveling - 74 72 61 76 65 6c 69 6e 67 (Value)
1a - Field number 3, wire type 2
06 - length of string
coding - 63 6f 64 69 6e 67 (Value)
# Total bytes = 60 bytes
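A minimal sketch of producing this encoding with the Python classes that protoc generates from the .proto file above; the module name user_profile_pb2 follows the file name and is an assumption:
from user_profile_pb2 import UserProfile   # generated by: protoc --python_out=. user_profile.proto

user = UserProfile(name="John Doe",
                   email="john.doe@example.com",
                   interests=["reading", "traveling", "coding"])

encoded = user.SerializeToString()   # compact binary wire format: field numbers, not names
decoded = UserProfile()
decoded.ParseFromString(encoded)     # unknown field numbers are skipped, which is what enables evolution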
Backward Compatibility in Protobuf
Backward compatibility in Protobuf is achieved by using field numbers and allowing unknown fields to be ignored:
Field Numbers: Each field in a Protobuf message has a unique number. Adding a new field with a new number doesn’t affect older versions.
Unknown Fields: When an older version of a service encounters unknown fields, it simply ignores them.
Forward Compatibility in Protobuf
Forward compatibility in Protobuf allows older services to understand data written by newer versions:
Optional Fields: New fields are marked as optional. Older versions ignore unknown fields without any issues.
Reserved Keywords: When fields are removed, their numbers can be reserved to prevent future conflicts.
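For example, if the email field were ever dropped from the message above, its field number (and optionally its name) could be reserved so that future edits cannot reuse it; a sketch of the evolved .proto:
message UserProfile {
reserved 2;            // former field number of email; must never be reused
reserved "email";
string name = 1;
repeated string interests = 3;
}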
Avro
Apache Avro is a data serialization system that provides rich data structures, a compact and fast binary data format, and a container file format for storing persistent data. It was designed for use with Hadoop but can be used independently of it. Avro uses JSON to define schemas and a binary format to encode data. It is schema-based, meaning that data is always serialized along with the schema, making it self-descriptive. First, we create an Avro schema (.avsc) for our data:
{
"type": "record",
"name": "UserProfile",
"fields": [
{"name": "name", "type": "string"},
{"name": "email", "type": "string"},
{"name": "interests", "type": {"type": "array", "items": "string"}}
]
}
Use the Avro library to serialize the data according to the schema. Avro uses zigzag encoding for integers to efficiently encode negative numbers. Zigzag encoding maps signed integers to unsigned integers in such a way that small magnitude signed integers (both positive and negative) have small magnitude unsigned integer representations. This is beneficial because Avro uses variable-length encoding for integers, so smaller numbers use fewer bytes.
Zigzag Encoding
In zigzag encoding, the least significant bit (LSB) of the encoded value is used to store the sign of the integer, and the rest of the bits store the absolute value of the integer, shifted to the left. The encoding works as follows:
Positive numbers map to even encoded values (they keep their order, just doubled).
Negative numbers map to odd encoded values (the least significant bit is always set for negative inputs).
For a 32-bit integer n, the encoded value is (n << 1) ^ (n >> 31), where:
n << 1 shifts the bits of n to the left by 1 position, making room for the sign bit.
n >> 31 is an arithmetic shift that copies the sign bit of n across the entire 32-bit integer (all zeros for a non-negative n, all ones for a negative n).
Example:
+5 : 0000 0101 (original binary)
: 0000 1010 (shifted left by 1)
: 0a (encoded value, decimal = 10)
-5 : 1111 1011 (original binary in 2's complement)
: 1111 0110 (shifted left by 1)
: 0000 1001 (XORed with the replicated sign bit 1111 1111, decimal = 9)
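The same mapping in code; a minimal sketch for 32-bit values (Python's arbitrary-precision integers use arithmetic shifts, so negative inputs behave as in the worked example):
def zigzag_encode(n: int) -> int:
    # (n << 1) ^ (n >> 31): small magnitudes, positive or negative, become small unsigned values
    return (n << 1) ^ (n >> 31)

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

print(zigzag_encode(5))    # 10 (0x0a)
print(zigzag_encode(-5))   # 9  (0x09)
print(zigzag_decode(9))    # -5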
Avro encoded output for the above example is:
/* in hexadecimal */
10 4a 6f 68 6e 20 44 6f 65 28 6a 6f 68 6e 2e 64 6f 65 40 65 78 61 6d 70 6c 65 2e 63 6f 6d 06 0e 72 65 61 64 69 6e 67 12 74 72 61 76 65 6c 69 6e 67 0c 63 6f 64 69 6e 67 00
/* Let's break it down */
10 - 0001 0000 (LSB = 0)
- 1 000 (next 4 bits from LSB = 8 in decimal)
- Length of string
John Doe - 4a 6f 68 6e 20 44 6f 65 (Value)
28 - 0010 1000 (LSB = 0)
- 10 100 (next 5 bits from LSB = 20 in decimal)
- Length of string
john.doe@example.com - 6a 6f 68 6e 2e 64 6f 65 40 65 78 61 6d 70 6c 65 2e 63 6f 6d (Value)
06 - 0000 0110 (LSB = 0)
- 11 (next 2 bits from LSB = 3 in decimal)
- Length of list
0e - 0000 1110 (LSB = 0)
- 111 (next 3 bits from LSB = 7 in decimal)
- Length of string
reading - 72 65 61 64 69 6e 67 (Value)
12 - 0001 0010 (LSB = 0)
- 1001 (next 4 bits from LSB = 9 in decimal)
- Length of string
traveling - 74 72 61 76 65 6c 69 6e 67 (Value)
0c - 0000 1100 (LSB = 0)
- 110 (next 3 bits from LSB = 6 in decimal)
- Length of string
coding - 63 6f 64 69 6e 67 (Value)
00 - a block count of 0, marking the end of the interests array
# Total bytes = 57 bytes
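A minimal sketch of producing this kind of bare (schema-less) Avro encoding in Python, assuming the third-party fastavro package; other Avro libraries expose similar writer/reader APIs:
import io
from fastavro import schemaless_reader, schemaless_writer

schema = {
    "type": "record",
    "name": "UserProfile",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string"},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
}

record = {"name": "John Doe",
          "email": "john.doe@example.com",
          "interests": ["reading", "traveling", "coding"]}

buf = io.BytesIO()
schemaless_writer(buf, schema, record)    # binary body only: no field names, no container framing
encoded = buf.getvalue()

buf.seek(0)
decoded = schemaless_reader(buf, schema)  # the reader must be given the writer's schema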
Forward and backward compatibility
In Apache Avro, the concept of forward and backward compatibility is achieved through the use of writer and reader schemas. This compatibility allows systems to evolve over time without breaking existing data. Here’s a detailed explanation of how Avro ensures forward and backward compatibility:
Writer Schema: This is the schema used when data is written to an Avro file or message. It defines the structure and data types of the records being serialized.
Reader Schema: This is the schema used when data is read from an Avro file or message. It defines the structure and data types that the reader expects.
Backward Compatibility: This means that new data written with a new schema (writer schema) can be read by an old schema (reader schema).
Forward Compatibility: This means that old data written with an old schema (writer schema) can be read by a new schema (reader schema).
Compatibility Rules
Avro does not care about the order of fields in the schema; it matches fields by their names. In Avro, there are specific rules that determine whether a schema change is backward or forward compatible. Here’s a detailed explanation:
Adding Fields:
Backward Compatibility: Adding a new field with a default value is backward compatible. The old reader schema can read the new data and use the default value for the new field.
Forward Compatibility: Adding a new field without a default value is not forward compatible because the old data won't have this field. However, if the new field has a default value, it is forward compatible.
Removing Fields:
Backward Compatibility: Removing a field is not backward compatible because the old reader expects the field to be present.
Forward Compatibility: Removing a field is forward compatible. The new reader schema can read old data and ignore the missing field.
Changing Field Types:
Backward Compatibility: Changing the type of a field can break backward compatibility unless the new type can be safely converted to the old type (e.g., changing from int to long).
Forward Compatibility: Changing the type of a field can also break forward compatibility unless the old type can be safely converted to the new type (e.g., changing from int to long). Avro supports union types, allowing fields to accept multiple types, which aids schema evolution.
Renaming Fields:
Backward Compatibility: Renaming a field is not backward compatible because the old reader expects the original field name.
Forward Compatibility: Renaming a field is not forward compatible for the same reason; the new reader won't find the old field name.
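Continuing the fastavro sketch above, these rules can be exercised by reading old data with a newer reader schema; the added age field and its default are illustrative:
# Reader schema adds a defaulted field, so data written with the old schema stays readable
reader_schema = {
    "type": "record",
    "name": "UserProfile",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string"},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
        {"name": "age", "type": "int", "default": 0},
    ],
}

buf.seek(0)
evolved = schemaless_reader(buf, schema, reader_schema)   # writer schema + reader schema
print(evolved["age"])                                     # 0, supplied by the default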
Maintaining backward and forward compatibility is essential in distributed systems to ensure smooth communication and service evolution. Apache Thrift, Protocol Buffers, and Apache Avro provide robust mechanisms to handle compatibility:
Thrift uses optional fields, default values, and field identifiers.
Protobuf relies on field numbers, optional fields, and reserved keywords.
Avro leverages schema evolution, default values, and union types.
Understanding these mechanisms allows developers to design resilient systems capable of evolving without disrupting existing services. Each framework has its strengths and use cases, and the choice of which to use depends on the specific requirements of your project.
Dataflow Through Services
In our journey through various encoding methods and their applications, we have explored how different data formats and protocols enable efficient communication between systems. Now, let's delve deeper into how these encoding methods play a crucial role in dataflow through services, particularly focusing on REST (Representational State Transfer) and RPC (Remote Procedure Call). Understanding these concepts is vital as they form the backbone of modern web services and microservices architecture, influencing how systems are designed, deployed, and maintained.
Clients and Servers: The Foundation of Network Communication
The client-server model is the most common arrangement for network communication. In this model, servers expose APIs over the network, and clients make requests to these APIs. This interaction is fundamental to how the web operates, with web browsers acting as clients making HTTP requests to web servers to fetch resources such as HTML, CSS, and JavaScript.
But web browsers are not the only clients. Native apps on mobile devices and desktop computers also make network requests, as do client-side JavaScript applications using techniques like Ajax. In these cases, the server's response is often in a format like JSON, which is convenient for further processing by the client-side application.
Service-Oriented Architecture (SOA) and Microservices
To manage the complexity of large applications, the service-oriented architecture (SOA) approach decomposes applications into smaller, function-specific services. This approach has evolved into what we now call microservices architecture. Microservices promote independently deployable and evolvable services, allowing teams to release new versions without extensive coordination.
In microservices architecture, services provide an application-specific API that controls what clients can do. This encapsulation ensures that services impose fine-grained restrictions, enhancing security and maintainability.
REST and Web Services
When HTTP is used as the protocol for communication, it is termed a web service. Web services are prevalent in various contexts, such as:
Client Applications: Native apps or JavaScript web apps making HTTP requests to services over the internet.
Inter-service Communication: Services within the same organization communicating over HTTP, often within the same datacenter.
Cross-Organizational Communication: Services exchanging data over the internet, such as public APIs for payment processing or OAuth.
REST is a design philosophy built on HTTP principles, emphasizing simple data formats and the use of URLs to identify resources. It leverages HTTP features for cache control, authentication, and content type negotiation, making it a popular choice for microservices and cross-organizational service integration.
SOAP: The Alternative to REST
SOAP (Simple Object Access Protocol) is an XML-based protocol for making network API requests. Unlike REST, SOAP aims to be independent of HTTP, using a complex set of standards known as WS-*. SOAP's API is described using the Web Services Description Language (WSDL), which enables code generation for accessing remote services.
Despite its structured approach, SOAP has fallen out of favor in many organizations due to its complexity and interoperability issues. RESTful APIs, with their simpler, more flexible design, have become the predominant style for public APIs.
Remote Procedure Calls (RPC)
Remote Procedure Calls (RPC) are a powerful mechanism for enabling communication between different services or components over a network. RPC allows a program to cause a procedure (subroutine) to execute in another address space (commonly on another physical machine). This is done in a way that is intended to make the remote procedure call look and behave like a local call within the same program. The concept of RPC has been around since the 1970s and remains a cornerstone of distributed computing.
Key Characteristics:
Location Transparency: RPC abstracts the complexities of network communication, making a remote request appear similar to a local function call.
Protocol Independence: While RPC frameworks can operate over various transport protocols, the most common ones include TCP/IP and HTTP.
Language Independence: Many RPC frameworks are designed to work across different programming languages. This is achieved by defining the procedure calls in an interface definition language (IDL), which is then used to generate the necessary code for different programming languages.
Encapsulation and Modularity: RPC promotes modularity and encapsulation by allowing functionalities to be distributed across various services or servers. Each service can expose specific methods that other services or clients can call remotely.
Challenges with RPC:
Network Reliability: Unlike local function calls, network requests can fail due to various reasons, such as network partitions, timeouts, or server failures. Handling these failures requires additional mechanisms like retries and timeouts.
Latency and Performance: Network latency can be highly variable, and network calls are generally much slower than local function calls. This variability can affect the performance and responsiveness of applications relying on RPC.
Idempotence: Retrying failed requests can lead to duplicate actions if not handled correctly. Idempotence ensures that performing the same operation multiple times results in the same outcome, which is crucial for reliable RPC implementations.
Data Encoding and Serialization: Parameters need to be encoded into a format suitable for network transmission. This encoding must be consistent across different languages and platforms, which can be complex for large or complex data structures.
Compatibility and Evolution: As services evolve, maintaining compatibility between different versions of clients and servers becomes crucial. This often involves ensuring backward and forward compatibility in data encoding and handling multiple versions of the service API.
Modern RPC Frameworks
Modern RPC frameworks have evolved to address some of these challenges:
gRPC: Built on Protocol Buffers, gRPC supports streaming, asynchronous calls, and robust service definitions. It is widely used in microservices architectures.
Thrift: Developed by Facebook, Thrift supports multiple programming languages and offers a flexible IDL for defining services.
Avro: Part of the Apache Hadoop ecosystem, Avro provides efficient serialization and RPC mechanisms, particularly suited for big data applications.
These frameworks provide advanced features like service discovery, load balancing, and built-in support for retries and idempotence, making them suitable for building scalable and resilient distributed systems.
Understanding the intricacies of dataflow through services, particularly REST and RPC, is crucial for designing scalable and maintainable systems. By leveraging the right encoding techniques and ensuring compatibility, developers can build robust services that support independent evolution and seamless communication across different platforms and organizational boundaries. As we continue to explore encoding methods and their applications, the principles of REST and RPC will remain foundational to modern web services and microservices architecture.
Message-Passing Dataflow
Previously, we discussed how encoded data flows from one process to another using REST and RPC, where one process sends a request over the network and expects a response quickly. In this final section, we'll explore asynchronous message-passing systems. These systems blend characteristics of both RPC and databases, using an intermediary called a message broker to temporarily store and forward messages. This method provides several benefits, including improved system reliability and logical decoupling of processes.
Asynchronous Communication
In asynchronous message-passing systems, a client sends a message to another process without waiting for an immediate response. This communication pattern is one-way, meaning the sender does not expect a reply. If a response is needed, it is typically sent on a separate channel. This approach contrasts with RPC, where a response is expected promptly.
Role of Message Brokers
A message broker (also known as a message queue or message-oriented middleware) acts as an intermediary between the sender and recipient. Instead of sending the message directly, the client sends it to the broker, which then delivers it to the appropriate recipient. This intermediary role provides several advantages:
Buffering: The broker can store messages temporarily if the recipient is unavailable or overloaded, thus enhancing system reliability.
Automatic Redelivery: If a process crashes, the broker can automatically redeliver messages, preventing data loss.
Decoupling: The sender does not need to know the IP address or port number of the recipient, simplifying deployment in dynamic environments like the cloud.
Broadcasting: One message can be sent to multiple recipients, supporting various communication patterns.
Logical Decoupling: The sender publishes messages without concern for who consumes them, promoting loose coupling between system components.
Historically, message brokers were dominated by commercial enterprise software from companies like TIBCO, IBM WebSphere, and webMethods. Today, open-source implementations such as RabbitMQ, ActiveMQ, HornetQ, NATS, and Apache Kafka are widely used. These brokers vary in delivery semantics and configuration, but generally, they work as follows:
A process sends a message to a named queue or topic.
The broker ensures the message is delivered to one or more consumers or subscribers.
Multiple producers and consumers can interact with the same topic, facilitating scalable and flexible dataflow.
Data Encoding in Message-Passing Systems
Data encoding in message-passing systems refers to the process of converting data into a format that can be efficiently transmitted over a network and interpreted correctly by the receiving process. This involves serializing the data into a byte stream and potentially compressing it to optimize network usage.
Importance of Data Encoding
Interoperability: Encoding ensures that data can be understood across different systems, programming languages, and platforms. Producers and consumers might be written in different languages, and encoding standardizes the data exchange format.
Efficiency: Efficient encoding minimizes the size of the data being transmitted, reducing network latency and bandwidth usage.
Compatibility: Proper encoding supports backward and forward compatibility, allowing different versions of producers and consumers to communicate without issues.
Common Data Encoding Formats: JSON, XML, Protobuf, Avro, MessagePack
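As a small illustration of encoding at the producer side of a broker, the sketch below publishes a MessagePack-encoded record to a RabbitMQ queue using the pika client; the queue name and the locally running broker are assumptions:
import msgpack
import pika

record = {"name": "John Doe", "email": "john.doe@example.com"}
body = msgpack.packb(record)                       # encode before handing bytes to the broker

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="user_updates")        # hypothetical queue name
channel.basic_publish(exchange="", routing_key="user_updates", body=body)
connection.close()
A consumer written in any language can decode the same bytes, provided producer and consumer agree on the encoding format (and, for schema-driven formats, on compatible schemas).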
Real-World Message-Passing Systems
RabbitMQ
Description: RabbitMQ is an open-source message broker that implements the Advanced Message Queuing Protocol (AMQP). It supports a variety of messaging patterns, including pub/sub, request/reply, and point-to-point.
Use Cases: Used in financial services for processing transactions, in e-commerce platforms for managing order processing and inventory updates, and in IoT systems for collecting and distributing sensor data.
Key Features:
Supports multiple messaging protocols (AMQP, MQTT, STOMP)
Flexible routing capabilities
High availability and clustering support
Management UI for monitoring and managing message queues
Apache Kafka
Description: Apache Kafka is a distributed streaming platform and message broker developed by LinkedIn and open-sourced through the Apache Software Foundation. It is designed for high-throughput, low-latency, and scalable event streaming.
Use Cases: Widely used in big data pipelines, real-time analytics, log aggregation, and stream processing in companies like LinkedIn, Uber, and Netflix.
Key Features:
High throughput and low latency
Horizontal scalability
Durable message storage
Integration with stream processing frameworks like Apache Spark and Apache Flink
Understanding message-passing dataflow is essential for designing scalable and resilient distributed systems, ensuring that data flows smoothly and efficiently across processes.
Conclusion
We explored several data encoding formats and their compatibility properties. Programming language–specific encodings are limited to a single language and often lack compatibility. Textual formats like JSON, XML, and CSV offer flexibility but require careful handling to ensure compatibility. Binary schema–driven formats like Thrift, Protocol Buffers, and Avro provide efficient encoding with well-defined compatibility semantics.
Different modes of dataflow were also discussed. For RPC and REST APIs, the client encodes a request, the server decodes the request and encodes a response, and the client finally decodes the response. In asynchronous message passing, nodes communicate by sending each other messages that are encoded by the sender and decoded by the recipient.
With careful planning, backward and forward compatibility and rolling upgrades are achievable. May your application's evolution be swift and your deployments frequent.
Copyright Notice
© 2024 trnquiltrips. All rights reserved.
This content is freely available for academic and research purposes only. Redistribution, modification, and use in any form for commercial purposes is strictly prohibited without explicit written permission from the copyright holder.
For inquiries regarding permissions, please contact trnquiltrips@gmail.com