Introduction
A binary search cube is a linear data structure that contains a sorted two dimensional dynamic array of nodes which each point to a sorted array of key/value pairs. This creates a three dimensional structure with subsequently three axes. The first two axes function as a lookup table and do not contain any keys. As the axes are of roughly similar size the structure resembles a cube.
The cube can shrink and grow as elements are added. In an ideal situation the axes are of identical size, but it doesn't significantly degrade performance if they are not perfectly matched.
To find a key a binary search is performed on the x axis to find a matching y axis, a second binary search is performed on the y axis to find a matching z axis and a final binary search is performed to find a matching key on the z axis. Each binary search is performed in O(log (cbrt n)) time and the three searches combined are the equivalent of an O(log n) operation.
The term cbrt stands for cube root.
Insertion
Inserting or removing a node is an O(cbrt n) operation. If a node falls between two Z axes the search algorithm can be optimized to pick the lower order axis so the node is appended to the end, which is a faster operation. As a consequence a node will never be inserted on the 0 index with the exception of the floor axis.
The special case of a floor insertion can be checked at the start of the binary search, subsequent binary searches on the X, Y and Z axes can be optimized to deference the 0 index check.
Finding an index
Finding an index is an O(cbrt n) operation. In order to be able to find an index the volume of the X axes and each Y axis must be remembered. The average search time can be split in half by starting the index search at the end of the X axis if the index is greater than half the volume. With this optimization finding the first or last index is an O(1) operation.
The index search is optimal when the X and Y axes are of identical size. Finding the index on the Z axis is an O(1) operation.
Finding a key
The most efficient search approach appears to be a binary search using deferred detection of equality. Notably the detection of equality can be skipped for the X and Y axis.
Balancing a binary search cube
Balancing a binary search cube is fairly simple. When adding a node to the Z axis and it exceeds twice the size of the X axis the Z axis is split. If this results in the underlying Y axis exceeding twice the size of the X axis the Y axis is split. These are O(cbrt n) operations.
It can be detected whether the cube is imbalanced by comparing the volume to the optimal volume. Shrinking an Y axis is a worst case O((cbrt n)2) operation as each Z axis should be checked as well. This operation is required infrequently.
Using a binary search cube as a priority queue
A binary search cube can be optimized to check the last index first, allowing O(1) push and pop operations.
Alternatively the last index of each axis can be checked first, allowing early termination of the search. This allows O(m) push and pull operations, where m is the number of dimensions of the cube. This also allows O(log (cbrt n)) insertions in the middle of the cube if the volume is relatively high and the number of different priorities is relatively low.
Memory usage
A binary search cube has half the memory overhead of a binary search tree. A cube with 1,000,000 elements will have 100 X nodes and 10,000 Y nodes. A binary search tree of the same size requires 2,000,000 leaf node pointers, with some implementations requiring an additional 1,000,000 root node pointers.
CPU usage
While inserting and removing elements is an O(cbrt n) operation only half the nodes on an axis need to be moved on average. As moving memory within an array is extremely fast, insert operations on a binary search cube should outperform a balanced binary tree until the cube grows extremely large. A cube with 1 million nodes will on average require 50 nodes to be moved for an insertion, while a balanced binary tree has to move 20 nodes using more complex operations.
Binary Search Hyper Cube
The toll of insert and resize operations on large cubes can be reduced by using a hyper cube. The maximum length of an axis is set at 16 and whenever the maximum volume is exhausted an extra dimension is added. The hyper cube starts out as a 1 dimensional array, turning into a 2 dimensional binary search square when the 17th element is added. When the 257th element is added the binary search square is split into a binary search cube with 3 dimensions, though this split can happen at the 129th element depending on the distribution. When the 65537th element is added the cube turns into a 4 dimensional tesseract. And so on.
Using static axis sizes will eliminate the need to resize Z axes as the cube grows and the average insertion (moving 8 elements) will become faster than O(log n) once a size of 256 elements is reached.
The average Z axis will contain 8 elements so the binary search can be substituted by a linear search. It also becomes a possibility to use an optimal search for the 15 possible axis lengths.
Looking up an index is an O(log n) operation.
The main drawback is that a 4 dimensional hyper cube needs to access the index of 3 axes for each key it looks up from the root level. In order to increase the access speed each axis can remember the floor key, which is all that is needed to perform a binary search using deferred detection of equality on the lower axes.
Cubesort
The average sorting speed is O(log n) for random data, and O(1) for ordered data. Cubesort is stable.
The binary searches on the X and Y axes can be optimized to check the last element first. In this case random data has a sorting speed of O(log n), and O(log (cbrt n)) for sorted clustered data. This behavior is most pronounced when the range of sorted clustered elements is larger than (cbrt n)2.
A binary search cube can be much more rapidly converted to a 1 dimensional array than any other data structure. This gives an additional speed advantage when sorting arrays.
External Cubesort
The Z axis can be stored externally. If a binary search penteract (5 dimensions) is used with a volume of 10 billion nodes the V, W, X, and Y axes on a 32 bit system have a memory requirement of 800 MB. The disk space requirement for the Z axis will be 200 GB, existing of 1 billion files, each containing 2 KB of data (assuming there's a 4 byte key, and a 16 byte value). Only one file would need to be accessed for each key lookup.
If the filename of each file exists of the floor key it's possible to rebuild the cube using scandir() without having to read any files.
One Z axis can be kept in memory for each Y axis to function as a cache which would increase the memory usage to 3.2 GB.
Fixed axis size
The lower axes can be given a fixed maximum size. The Z axes can be given a size of 32, the Y axes a size of 128, the X axes a size of 256, with the W axis of variable size. This type of search tesseract (4 dimensions) doesn't require resizing as it grows, has reasonable performance for both smaller and larger sizes, and allows fast insertions.
Resizing the search tesseract as it shrinks can be handled by keeping the average Z axis length above 4.
Fixed key size
If the key size is fixed (typically when the key is an integer) it's possible to implement a cube as a trie. A binary search is used to find a key fragment, with the exception that a key fragment can be found in O(1) time when it falls within the section of the axis that is in order.
When using this approach resizing is never necessary. Subsequently ideal distribution isn't guaranteed.
Advantages
- Low memory overhead.
- Iterating a search cube is fast and easy.
- O(1) insertion of sorted data and O(log (root n)) insertion of semi-sorted data.
- Innate support for O(root n) index searches.
- Balancing is fast and infrequent.
Disadvantages
- More complex binary search compared to binary trees.
Gregorius van den Hoven, June 2014