Sort order preserving serialization
by ananthakumaran

FoundationDB is a distributed key-value store. From the database perspective, both keys and values are just bytes. Keys are sorted lexicographically and different chunks are assigned to different servers. Because of this locality property, the range read operation is efficient as most of the time the client would be able to fetch the data from a single server.

As the keys are just bytes, the application must serialize the keys and more importantly if we need to use range operation, then the serialization should preserve the order of the keys. Fortunately, all the clients implement the tuple layer which has this property. This post attempts to explain how various data types are encoded in a way that preserves their natural sort order.

I will represent bytes using their hex format for the rest of the blog post. Each data type is tagged with an identifier which uniquely identifies the type.

NUll

Null is encoded with the tag 00

Byte String

Tag 01 is used for byte string. 00 is used as the terminator and 00 inside the byte string are escaped by replacing them with 00FF. For example the byte string ABCDEF would be encoded as 01ABCDEF00.

It’s clear that order would be preserved for byte strings without 00 because the same constant values are prefixed and suffixed for all the byte strings. Let’s consider all the possible cases of byte strings with 00.

AB00DD
01AB00FFDD00
AB01DD
01AB01DD00

AB00BC
01AB00FFBC00
AB00DD
01AB00FFDD00

AB00DD
01AB00FFDD00
AB01
01AB0100

AB00
01AB00FF00
AB01DD
01AB01DD00

AB
01AB00
AB00
01AB00FF00

Because the terminator 00 has the lowest byte value and if the compared byte string has any value other than 00 in the same position then there is no need to consider the rest of the bytes. The last case shown is a bit tricky and there is a reason for choosing FF as the escape value and it will become more apparent when we consider Tuple types.

Integer

Integers upto 8 bytes are supported. Integers get encoded into variable size which depends on minium bytes required to represent the integer. Tags 0C to 13 are used for negative numbers and tag 14 for zero and tags 15 to 1C are used for positive number.

For positive integers, big-endian byte order is used which naturally preserves the sort order. For negative integers, the sign is removed and bit complement of big-endian byte order is used bit-complement(big-endian(abs(n))). The bit complement operation reverses the sort order. If two integers get encoded into different sized bytes, then the order would be preserved by tag values. The tag value increases as the integer value increases.

-98344948949494949
0CFEA29BCA3C69535A
-303040404040
0FB9716265B7
-20404
12B04B
-42
13D5
42
152A
20404
164FB4
303040404040
19468E9D9A48
98344948949494949
1C015D6435C396ACA5

Float

Tags 20 (32 bits) and 21 (64 bits) are used for floating point numbers.

def transform(<<sign::big-integer-size(8), rest::binary>> = full) do
  if (sign &&& 0x80) != 0x00 do
    :binary.bin_to_list(full)
    |> Enum.map(fn e -> 0xFF ^^^ e end)
    |> IO.iodata_to_binary()
  else
    <<0x80 ^^^ sign>> <> rest
  end
end

The value is first encoded in IEEE 754 binary format. For positive numbers just the sign bit should be flipped and for negative numbers all the bits should be flipped. This should provide total order as per the IEEE specification.

Tuple

Because of the way other values are encoded, a tuple can be encoded just by concatenating the encoded values. Note that the sort order is preserved only between values of same types. Specially comparison between types like Float and Integer won’t work.

(AB, 42)
01AB00152A
(AB00, 42)
01AB00FF00152A

This example illustrates why FF is chosen as the escape character. As long as we don’t use FF as the tag for any type, the byte string value 00 and the terminator value 00 would not conflict and break the sort order. Other types don’t introduce these kinds of behaviors because they are either of fixed size or in case of Integer, the tag will determine the order of the values if they are of different size.

Nested Tuple

Tuple type can’t be nested as there is no difference between (1, (2, 3)) and (1, 2, (3)). Both values would get encoded as the same as the flattened tuple (1, 2, 3). Nested tuple, as the name implies supports arbitrary nesting. [ is used to represent nested tuple.

Tag 05 and terminator 00 are used for nested tuple. Encoded values are concatenated and any Null values are escaped as 00FF.

[1, [2, 3]]
05150105150215030000
[1, 2, [3]]
05150115020515030000

There are other data types like unicode string, arbitrary precision number, UUID, boolean etc, but all of them follow similar principles discussed so far. As such, they are not considered here.

  |   High contention allocator ⇒

About ActiveSphere

We help businesses build and scale complex web-apps, visualizations and high-throughput messaging systems. Check out some of our other open source projects or our work portfolio.