Tuesday 4 August 2015

Protocol Buffers & Lists

Introduction
This post uses the following example .proto and looks at serializing large repeated data sets in C#.

.proto
 syntax = "proto2";  
   
 package pcdf;  
   
 message Boiler  
 {  
   required string BoilerId = 1;  
   required int32 TableNumber = 2;  
   required string TableName = 3;  
   required string BoilerData = 4;  
   required string BrandName = 5;  
   required string Qualifier = 6;  
   required string ModelName = 7;  
 }  
   
 message AllBoilers  
 {  
   repeated Boiler Boiler = 1;  
 }  

Serialization Code
 var listBuilder = new AllBoilers.Builder();  
   
 for (var i = 0; i < 7000; i++)  
 {  
   var builder = new Boiler.Builder  
   {  
     BoilerId = "ID",  
     TableNumber = 2,  
     TableName = "Name",  
     BoilerData = "Data",  
     BrandName = "Brand",  
     ModelName = "Model",  
     Qualifier = "Qualifier"  
   };  
   
   listBuilder.AddBoiler(builder);  
 }  
   
 var list = listBuilder.Build();  
 var bytes = list.ToByteArray();  
   
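 // The using block disposes the stream, which flushes the data and releases the file.
 using (var fileStream = new FileStream(OUTPUT_FILE, FileMode.Create))
 {
   fileStream.Write(bytes, 0, bytes.Length);
 }
The actual tests used real data rather than the placeholder values shown here.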

Deserialization Code
 var boilers = new AllBoilers.Builder();

 // Dispose the stream once the whole list has been read.
 using (var fileStream = new FileStream(OUTPUT_FILE, FileMode.Open))
 {
   boilers.MergeFrom(fileStream);
 }


Comparison
For a list of over 7,000 items, Json.NET was only about 10 ms slower than Protocol Buffers on my machine, which is not what I expected from what is supposed to be a very fast binary serializer. On further reading, it appears Protocol Buffers are not designed to handle large messages, but this can be worked around by serializing the items individually.
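
The post does not show how the timings were taken. One way such a comparison could be measured is with System.Diagnostics.Stopwatch. This is only a sketch: the Timing class, the Measure helper and the placeholder workload are hypothetical, and the serializer call under test would go where the lambda is.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs the action once to warm up (JIT compilation), then returns the
    // average elapsed milliseconds over the given number of timed runs.
    public static double Measure(Action action, int runs)
    {
        action();
        var stopwatch = Stopwatch.StartNew();
        for (var i = 0; i < runs; i++)
        {
            action();
        }
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds / runs;
    }

    public static void Main()
    {
        // Placeholder workload; swap in the serialization code being compared.
        var ms = Measure(() => { var s = string.Join(",", new string[7000]); }, 10);
        Console.WriteLine("Average: " + ms + " ms");
    }
}
```

Averaging over several runs after a warm-up call reduces the effect of one-off costs, which matter when the differences are only a few milliseconds.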

Fix
To fix this we serialize each item individually but write them all to the same stream, prefixing each item with its length.

Fixed Serialize
 // The using block disposes the stream, which flushes the data and releases the file.
 using (var fileStream = new FileStream(OUTPUT_FILE, FileMode.Create))
 {
   for (var i = 0; i < 7000; i++)
   {
     var builder = new Boiler.Builder
     {
       BoilerId = "ID",
       TableNumber = 2,
       TableName = "Name",
       BoilerData = "Data",
       BrandName = "Brand",
       ModelName = "Model",
       Qualifier = "Qualifier"
     };

     var bytes = builder.Build().ToByteArray();
     // 4-byte length prefix, in the byte order of the current machine.
     var intBytes = BitConverter.GetBytes(bytes.Length);

     fileStream.Write(intBytes, 0, intBytes.Length);
     fileStream.Write(bytes, 0, bytes.Length);
   }
 }

Fixed Deserialize
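 var list = new List<Boiler>();

 using (var fileStream = new FileStream(OUTPUT_FILE, FileMode.Open))
 {
   int size;
   // GetSize reads the 4-byte length prefix; it returns -1 at the end of the stream.
   while ((size = GetSize(fileStream)) != -1)
   {
     var bytes = new byte[size];

     // Stream.Read may return fewer bytes than requested, so loop until the item is complete.
     var offset = 0;
     while (offset < size)
     {
       var read = fileStream.Read(bytes, offset, size - offset);
       if (read == 0) throw new EndOfStreamException();
       offset += read;
     }

     var boiler = new Boiler.Builder();
     boiler.MergeFrom(bytes);

     list.Add(boiler.Build());
   }
 }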
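
The loop above relies on a GetSize helper that the post does not show. A minimal sketch, assuming it reads the 4-byte prefix written with BitConverter.GetBytes and returns -1 once the stream is exhausted. The LengthPrefix class and the WriteItem helper are hypothetical, added only so the example can round-trip through a MemoryStream without the generated Boiler classes.

```csharp
using System;
using System.IO;
using System.Text;

public static class LengthPrefix
{
    // Reads the 4-byte length prefix; returns -1 when the stream has no more items.
    public static int GetSize(Stream stream)
    {
        var prefix = new byte[sizeof(int)];
        var offset = 0;
        while (offset < prefix.Length)
        {
            var read = stream.Read(prefix, offset, prefix.Length - offset);
            if (read == 0)
            {
                if (offset == 0) return -1; // clean end of stream
                throw new EndOfStreamException("Truncated length prefix.");
            }
            offset += read;
        }
        // Same byte order that BitConverter.GetBytes used when writing.
        return BitConverter.ToInt32(prefix, 0);
    }

    // Hypothetical counterpart for the demo: length prefix first, then the payload.
    public static void WriteItem(Stream stream, byte[] payload)
    {
        var prefix = BitConverter.GetBytes(payload.Length);
        stream.Write(prefix, 0, prefix.Length);
        stream.Write(payload, 0, payload.Length);
    }

    public static void Main()
    {
        var stream = new MemoryStream();
        WriteItem(stream, Encoding.UTF8.GetBytes("first"));
        WriteItem(stream, Encoding.UTF8.GetBytes("second item"));
        stream.Position = 0;

        int size;
        while ((size = GetSize(stream)) != -1)
        {
            var bytes = new byte[size];
            stream.Read(bytes, 0, size); // a MemoryStream returns every available byte at once
            Console.WriteLine(size + ": " + Encoding.UTF8.GetString(bytes));
        }
    }
}
```

Because the prefix uses the machine's byte order, a file written this way is only portable between machines with the same endianness.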

The updated code produced a significant difference between Json.NET and Protocol Buffers (160 ms -> 25 ms).

References
https://developers.google.com/protocol-buffers/
https://developers.google.com/protocol-buffers/docs/techniques#large-data
https://github.com/google/protobuf

Protocol Buffers in C#

Installation
Protocol Buffers can be installed via NuGet:
 PM> Install-Package Google.ProtocolBuffers  
Alternatively, search for Protocol Buffers in the package manager. There is also a Protocol Buffers lite package that is smaller and does not use reflection (better for some lighter platforms such as Mono). Note: an option must be set in the .proto file to compile it for use with the lite package.

Tools
As well as installing the Protocol Buffers library, this also downloads the tools required to compile .proto files and generate C# files from them. The tools are located in the folder where the NuGet package was installed.

.proto
A .proto file is required to generate the code. This example .proto defines a message called TestMessage with three fields. The package works like a namespace and is also used to generate the namespace of the C# classes.
 syntax = "proto2";  
   
 package test_proto;  
   
 message TestMessage  
 {  
   required int32 Id = 1;  
   required string Name = 2;  
   optional string Details = 3;  
 }  
For more information on defining a .proto file see the protocol buffer v2 spec.

Protoc
To compile a .proto file, locate protoc.exe in the tools directory and run it with the following parameters.
 protoc -oitem test.proto  
where item is the name of the output file and test.proto is the name of the .proto file. This outputs a compiled (binary descriptor) version of the .proto file.

ProtoGen
The next step is to generate the .cs files containing the code used to create and serialize the data.
 .\ProtoGen.exe item  
This should generate a .cs file containing everything you need to read and write the protocol buffer messages.

Serialization
 var builder = new TestMessage.Builder();  
   
 builder.Id = 2;  
 builder.Name = "Test";  
 builder.Details = "Some Details";  
   
 var message = builder.Build();  
   
 var bytes = message.ToByteArray();  
You must create a builder because messages are immutable (each message generates an internal Builder class). A message can easily be turned back into a builder with the ToBuilder() method.

Deserialization
 var builder = new TestMessage.Builder();  
 builder.MergeFrom(bytes);  
 var message = builder.Build();  
You can also read the data directly from the builder if you do not need the built message.

Other Methods
message.ToJson() & message.ToXml()
These can be used to visualize the message in other formats.

message.WriteTo(Stream stream)
This is used to write the bytes to a stream.

References
https://developers.google.com/protocol-buffers/
https://developers.google.com/protocol-buffers/docs/reference/proto2-spec
https://github.com/google/protobuf

Protocol Buffers

Protocol Buffers are a language-neutral, platform-neutral extensible mechanism for serializing structured data (official site).

In short, Protocol Buffers are Google's in-house binary serialization format, designed to serialize small messages quickly and into a small amount of data.

Where to use them
Protocol Buffers are great when you need fast serialization and a small payload (where efficiency and speed are important). They are less convenient when the data format changes often: every change means updating the .proto schema and regenerating the code, and incompatible changes (such as altering a required field or reusing a field number) mean dropping backwards compatibility or maintaining multiple schemas for different versions.

Pros
  • Fast
  • Small Payload
  • Many cross-platform implementations (C++, Java, C#, Python, JS)
  • Generators are provided to generate data access classes
Cons
  • Schema Enforced (Need to define .proto file)
  • More complicated API than some other serializers

Proto Format
Protocol Buffer messages are first defined in a .proto file using the proto language. This can then be used to generate data access objects for your language of choice.
 message Boiler  
 {  
   required string BoilerId = 1;  
   required int32 TableNumber = 2;  
   required string TableName = 3;  
   required string BoilerData = 4;  
   required string BrandName = 5;  
   required string Qualifier = 6;  
   required string ModelName = 7;  
 }  
Each field is declared with a rule (required, optional or repeated), followed by a type (int32, int64, string, etc.), the field name, then = and a field number that must be unique within the message. For more information read the v2 spec.
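
The number after the = matters because, in the binary output, each value is preceded by a key built from it rather than by the field name. A small sketch of that documented encoding, where the key is the varint of (field number << 3) | wire type. The WireKey class and its method names are hypothetical, and this is a hand-rolled illustration rather than the library's API.

```csharp
using System;
using System.Collections.Generic;

public static class WireKey
{
    public const int Varint = 0;          // int32, int64, bool, enum
    public const int LengthDelimited = 2; // string, bytes, nested messages

    // Base-128 varint: 7 bits per byte, least significant group first,
    // with the high bit set on every byte except the last.
    public static byte[] EncodeVarint(ulong value)
    {
        var bytes = new List<byte>();
        while (value >= 0x80)
        {
            bytes.Add((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        bytes.Add((byte)value);
        return bytes.ToArray();
    }

    // The key combines the field number with the wire type of the value.
    public static byte[] EncodeKey(int fieldNumber, int wireType)
    {
        return EncodeVarint((ulong)((fieldNumber << 3) | wireType));
    }

    public static void Main()
    {
        // required string BoilerId = 1;
        Console.WriteLine(BitConverter.ToString(EncodeKey(1, LengthDelimited))); // 0A
        // required int32 TableNumber = 2;
        Console.WriteLine(BitConverter.ToString(EncodeKey(2, Varint)));          // 10
        // The value 300 needs two varint bytes.
        Console.WriteLine(BitConverter.ToString(EncodeVarint(300)));             // AC-02
    }
}
```

Field numbers 1 to 15 fit in a single key byte, which is one reason small numbers are usually given to the most frequently used fields.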

Proto3
The current version is proto2, but proto3 is currently in alpha and under development (source code on GitHub). The update will include a new version of the .proto syntax.

Binary vs Text Serialization
 {  
  "Id": "21",  
  "Name": "Test User",  
  "address": {  
   "streetAddress": "Test Street",  
   "city": "Birmingham"  
  }  
 }  
Some formats serialize to text (JSON, XML, etc.). This can be really useful if you need to quickly view and edit the documents without special tooling. Binary formats require a tool that can read the document before you can begin to read and manipulate the data, but they are normally smaller and more efficient.
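
To make the size difference concrete, the sketch below encodes the same four values once as compact JSON text and once with .NET's BinaryWriter. BinaryWriter stands in here for a generic binary format (it is not the Protocol Buffers wire format), and the SizeComparison class and its method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class SizeComparison
{
    // The JSON document from above, with the whitespace removed.
    public static byte[] ToJsonBytes()
    {
        const string json =
            "{\"Id\":\"21\",\"Name\":\"Test User\",\"address\":" +
            "{\"streetAddress\":\"Test Street\",\"city\":\"Birmingham\"}}";
        return Encoding.UTF8.GetBytes(json);
    }

    // Field names are not stored: the reader must already know the order and types.
    public static byte[] ToBinaryBytes()
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream, Encoding.UTF8))
        {
            writer.Write("21");          // each string: length prefix + UTF-8 bytes
            writer.Write("Test User");
            writer.Write("Test Street");
            writer.Write("Birmingham");
            writer.Flush();
            return stream.ToArray();
        }
    }

    public static void Main()
    {
        Console.WriteLine("JSON:   " + ToJsonBytes().Length + " bytes");
        Console.WriteLine("Binary: " + ToBinaryBytes().Length + " bytes");
    }
}
```

The binary form is smaller mainly because it omits the field names and punctuation, which is also why it cannot be read without knowing the layout.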

References
https://developers.google.com/protocol-buffers/
https://github.com/google/protobuf
https://developers.google.com/protocol-buffers/docs/reference/proto2-spec