Wednesday, 22 April 2015

Basic Search Algorithms


In computer science, binary search trees (BSTs), sometimes called ordered or sorted binary trees, are a class of data structures used to implement lookup tables and dynamic sets. They store data items, known as keys, and allow fast insertion and deletion of such keys, as well as checking whether a key is present in a tree.
Binary search trees keep their keys in sorted order, so that lookup and other operations can use the principle of binary search: when looking for a key in a tree (or a place to insert a new key), they traverse the tree from root to leaf, making comparisons to keys stored in the nodes of the tree and deciding, based on the comparison, to continue searching in the left or right subtrees. On average, this means that each comparison allows the operations to skip over half of the tree, so that each lookup/insertion/deletion takes time proportional to the logarithm of the number of items stored in the tree. This is much better than the linear time required to find items by key in an unsorted array, but slower than the corresponding operations on hash tables.
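To make that root-to-leaf descent concrete, here is a minimal Java sketch (the Node class and all names are my own illustration, not from any particular library):

// A bare-bones BST lookup: compare at each node and go left or right.
class Node {
    int key;
    Node left, right;
    Node(int key) { this.key = key; }
}

class BST {
    Node root;

    boolean contains(int key) {
        Node current = root;
        while (current != null) {
            if (key < current.key) {
                current = current.left;   // key, if present, is in the left subtree
            } else if (key > current.key) {
                current = current.right;  // key, if present, is in the right subtree
            } else {
                return true;              // found it
            }
        }
        return false;                     // reached a null link: key is absent
    }
}

On a reasonably balanced tree this loop runs about \log_2 n times; on a degenerate tree it can run n times, which is what the next paragraphs are about.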

The simplest algorithms for BST item insertion may yield a tree with height n in rather common situations. For example, when the items are inserted in sorted key order, the tree degenerates into a linked list with n nodes. The difference in performance between the two situations may be enormous: for n = 1,000,000, for example, the minimum height is  \lfloor \log_2(1,000,000) \rfloor = 19 .
If the data items are known ahead of time, the height can be kept small, in the average sense, by adding values in a random order, resulting in a random binary search tree. However, there are many situations (such as online algorithms) where this randomization is not viable.
Self-balancing binary trees solve this problem by performing transformations on the tree (such as tree rotations) at key times, in order to keep the height proportional to log2(n). Although a certain overhead is involved, it may be justified in the long run by ensuring fast execution of later operations.
Maintaining the height always at its minimum value \lfloor \log_2(n) \rfloor is not always viable; it can be proven that any insertion algorithm which did so would have an excessive overhead. Therefore, most self-balanced BST algorithms keep the height within a constant factor of this lower bound.
In the asymptotic ("Big-O") sense, a self-balancing BST structure containing n items allows the lookup, insertion, and removal of an item in O(log n) worst-case time, and ordered enumeration of all items in O(n) time. For some implementations these are per-operation time bounds, while for others they are amortized bounds over a sequence of operations. These times are asymptotically optimal among all data structures that manipulate the key only through comparisons.
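In Java, for example, java.util.TreeMap and java.util.TreeSet are backed by a red-black tree, so they already provide these O(log n) per-operation guarantees; a quick sketch of typical usage:

import java.util.TreeMap;

public class TreeMapDemo {
    public static void main(String[] args) {
        TreeMap<Integer, String> map = new TreeMap<>(); // red-black tree under the hood
        map.put(42, "answer");                          // O(log n) insertion
        System.out.println(map.containsKey(42));        // O(log n) lookup -> true
        map.remove(42);                                 // O(log n) removal
        // Iterating over keySet() enumerates the keys in sorted order in O(n).
    }
}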

A self-balancing (or height-balanced) binary search tree is any node-based binary search tree that automatically keeps its height (maximal number of levels below the root) small in the face of arbitrary item insertions and deletions.[1]
These structures provide efficient implementations for mutable ordered lists, and can be used for other abstract data structures such as associative arrays, priority queues, and sets.

Red Black BSTs are an example.

Red Black BST:
Excellent explanation (for Java) by Cay Horstmann:
  https://weblogs.java.net/blog/cayhorstmann/archive/2011/05/12/easy-red-black-trees

A red-black tree is a binary search tree with the following additional
properties:
  • Every node is colored red or black.
  • The root is black.
  • A red node cannot have a red child (the “no double reds” rule).
  • All paths from the root to a null have the same number of black nodes. I call this the “equal exit cost” rule.

Of course, the nodes aren't actually colored. Each node simply has a flag to
indicate whether it is considered red or black. (The choice of these colors is
traditional; one could have equally well used some other attributes. Perhaps,
in an alternate universe, students learn about chocolate-vanilla trees.)
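In code, that flag is nothing more than a boolean on each node; a minimal sketch (the class and field names are mine):

class RBNode {
    int key;
    boolean red;        // true = red, false = black (or chocolate/vanilla, if you prefer)
    RBNode left, right;

    RBNode(int key) {
        this.key = key;
        this.red = true; // newly inserted nodes start out red (see the insertion rule below)
    }
}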
Instead of the colors, I find it more useful to consider each node to be a
toll booth. As you travel from the root to one of the null references (an exit), you have to pay $1 at each black toll booth, but the red
toll booths are free. The “equal exit cost” rule says that the cost of the
trip is the same, no matter which exit you choose.
The nulls turn out to be a bit troublesome, and many authors
replace them by dummy black leaves. Then all nodes either have two children or
are one of those dummy leaves. I didn't want dummy leaves—they are bound to
be confusing for beginning students. 

I looked at several textbooks. The most concise descriptions reduce
red-black trees to 2-4 trees, which is nice in an algorithms course, but it
wasn't an option for me. Others use dummy leaves. The remaining ones present a
mess of fiddly special cases. Working through them made me lose the will to
live.
I was about to give up when a Google search pointed me to Lyn Turbak's
lecture notes, and I gasped when I saw this:

The figure is attributed to Chris Okasaki, Purely Functional Data
Structures, Cambridge University Press, 1998. I've got to get a copy.
This is brilliant. You insert a node in the usual way and color it red. If
its parent is also red, reorganize as per the image. The red node goes one step
closer to the root. Repeat if necessary. If the red node arrives at the root,
color it black.
Why does it work? The transformation doesn't change the cost of traveling
through the nodes—it's $1 before and after. And changing the root from red to
black—if it comes to that—just adds $1 to the cost of every trip.
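Here is an imperative Java sketch of that reorganization step, covering the four symmetric red-red configurations from the figure. The names (balance, build, isRed) and the mutable-node style are my own reconstruction, not Horstmann's or Okasaki's actual code; a real insert would call balance on each grandparent while walking back up toward the root and re-link the returned subtree into its parent.

class RedBlackInsertFixup {

    static class Node {
        int key;
        boolean red;          // true = red, false = black
        Node left, right;
    }

    static boolean isRed(Node n) { return n != null && n.red; }

    // Reassemble three keys x < y < z and their four subtrees a,b,c,d into the
    // shape from the figure: a red y on top with black children x and z.
    // Every path through the subtree still pays exactly $1, as before.
    static Node build(Node x, Node y, Node z, Node a, Node b, Node c, Node d) {
        x.left = a; x.right = b; x.red = false;
        z.left = c; z.right = d; z.red = false;
        y.left = x; y.right = z; y.red = true;
        return y;
    }

    // g is a black grandparent whose child and grandchild may both be red.
    // Returns the new root of this subtree; the caller re-links it above.
    static Node balance(Node g) {
        if (g.red) return g;
        Node l = g.left, r = g.right;
        if (isRed(l) && isRed(l.left)) {                 // red child left, red grandchild left
            Node p = l, x = l.left;
            return build(x, p, g, x.left, x.right, p.right, r);
        }
        if (isRed(l) && isRed(l.right)) {                // red child left, red grandchild right
            Node p = l, m = l.right;
            return build(p, m, g, p.left, m.left, m.right, r);
        }
        if (isRed(r) && isRed(r.left)) {                 // red child right, red grandchild left
            Node p = r, m = r.left;
            return build(g, m, p, l, m.left, m.right, p.right);
        }
        if (isRed(r) && isRed(r.right)) {                // red child right, red grandchild right
            Node p = r, x = r.right;
            return build(g, p, x, l, p.left, x.left, x.right);
        }
        return g;                                        // no double red here: nothing to do
    }
}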
I kept googling for a similar simplification of removal, and I found this excellent article by Matt Might. Unfortunately, it uses dummy nodes. But it wasn't that hard to
reformulate his approach without them. Here goes.
As with insertion, first use the standard removal algorithm for binary
search trees. Note that in that algorithm, the removed node has at most one
child. We never remove a node with two children; instead, we fill it with the
value of another node with at most one child, then we remove the other
one.
Two cases are easy. First, if the node to be removed is red, there is no
problem with the removal—the resulting tree is still a red-black tree.
Next, assume that the node to be removed has a child. Because of the
“equal exit cost” rule, the child must be red. Simply remove the parent and
color the child black.



B-tree

In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree is a generalization of a binary search tree in that a node can have more than two children (Comer 1979, p. 123). Unlike self-balancing binary search trees, the B-tree is optimized for systems that read and write large blocks of data. It is commonly used in databases and filesystems. 

In B-trees, internal (non-leaf) nodes can have a variable number of child nodes within some pre-defined range. When data is inserted or removed from a node, its number of child nodes changes. In order to maintain the pre-defined range, internal nodes may be joined or split. Because a range of child nodes is permitted, B-trees do not need re-balancing as frequently as other self-balancing search trees, but may waste some space, since nodes are not entirely full. The lower and upper bounds on the number of child nodes are typically fixed for a particular implementation. For example, in a 2-3 B-tree (often simply referred to as a 2-3 tree), each internal node may have only 2 or 3 child nodes.



Each internal node of a B-tree will contain a number of keys. The keys act as separation values which divide its subtrees. For example, if an internal node has 3 child nodes (or subtrees) then it must have 2 keys: a1 and a2. All values in the leftmost subtree will be less than a1, all values in the middle subtree will be between a1 and a2, and all values in the rightmost subtree will be greater than a2.
Usually, the number of keys is chosen to vary between d and 2d, where d is the minimum number of keys, and d+1 is the minimum degree or branching factor of the tree. In practice, the keys take up the most space in a node. The factor of 2 will guarantee that nodes can be split or combined. If an internal node has 2d keys, then adding a key to that node can be accomplished by splitting the 2d key node into two d key nodes and adding the key to the parent node. Each split node has the required minimum number of keys. Similarly, if an internal node and its neighbor each have d keys, then a key may be deleted from the internal node by combining with its neighbor. Deleting the key would make the internal node have d-1 keys; joining the neighbor would add d keys plus one more key brought down from the neighbor's parent. The result is an entirely full node of 2d keys.
The number of branches (or child nodes) from a node will be one more than the number of keys stored in the node. In a 2-3 B-tree, the internal nodes will store either one key (with two child nodes) or two keys (with three child nodes). A B-tree is sometimes described with the parameters (d+1) — (2d+1) or simply with the highest branching order, (2d+1).
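A rough Java sketch of the node layout this implies (illustrative only, not a complete B-tree; the class, constant, and field names are my own):

class BTreeNode {
    static final int D = 1;                 // minimum keys per non-root node; D = 1 gives the
                                            // 2-3 tree from the text: 1-2 keys, 2-3 children

    int numKeys;                            // kept between D and 2*D (except in the root)
    int[] keys = new int[2 * D];            // keys[0..numKeys-1], in sorted order
    BTreeNode[] children = new BTreeNode[2 * D + 1]; // numKeys + 1 children when internal
    boolean leaf;                           // leaves have no children

    // Decide where 'key' lives relative to this node's separator keys.
    int findChildIndex(int key) {
        int i = 0;
        while (i < numKeys && key > keys[i]) {
            i++;
        }
        // If i < numKeys and key == keys[i], the key is in this node;
        // otherwise the search continues in children[i].
        return i;
    }
}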
A B-tree is kept balanced by requiring that all leaf nodes be at the same depth. This depth will increase slowly as elements are added to the tree, but an increase in the overall depth is infrequent, and results in all leaf nodes being one more node farther away from the root.
B-trees have substantial advantages over alternative implementations when the time to access the data of a node greatly exceeds the time spent processing that data, because then the cost of accessing the node may be amortized over multiple operations within the node. This usually occurs when the node data are in secondary storage such as disk drives. By maximizing the number of keys within each internal node, the height of the tree decreases and the number of expensive node accesses is reduced. In addition, rebalancing of the tree occurs less often. The maximum number of child nodes depends on the information that must be stored for each child node and the size of a full disk block or an analogous size in secondary storage. While 2-3 B-trees are easier to explain, practical B-trees using secondary storage need a large number of child nodes to improve performance. 

This section describes a problem faced by database designers, outlines a series of increasingly effective solutions to the problem, and ends by describing how the B-tree solves the problem completely.
Time to search a sorted file
Usually, sorting and searching algorithms have been characterized by the number of comparison operations that must be performed using order notation. A binary search of a sorted table with N records, for example, can be done in roughly \lceil \log_2 N \rceil comparisons. If the table had 1,000,000 records, then a specific record could be located with at most 20 comparisons: \lceil \log_2 (1,000,000) \rceil = 20.
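In Java this is exactly what java.util.Arrays.binarySearch does on a sorted array:

import java.util.Arrays;

public class BinarySearchDemo {
    public static void main(String[] args) {
        int[] sorted = {2, 5, 8, 12, 16, 23, 38, 56, 72, 91};
        // About log2(N) comparisons; at most 4 for these 10 records.
        System.out.println(Arrays.binarySearch(sorted, 23)); // 5: index of the key
        System.out.println(Arrays.binarySearch(sorted, 24)); // -7: absent; -(insertion point) - 1
    }
}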

Large databases have historically been kept on disk drives. The time to read a record on a disk drive far exceeds the time needed to compare keys once the record is available. The time to read a record from a disk drive involves a seek time and a rotational delay. The seek time may be 0 to 20 or more milliseconds, and the rotational delay averages about half the rotation period. For a 7200 RPM drive, the rotation period is 8.33 milliseconds. For a drive such as the Seagate ST3500320NS, the track-to-track seek time is 0.8 milliseconds and the average reading seek time is 8.5 milliseconds.[4] For simplicity, assume reading from disk takes about 10 milliseconds.
Naively, then, the time to locate one record out of a million would take 20 disk reads times 10 milliseconds per disk read, which is 0.2 seconds.
The time won't be that bad because individual records are grouped together in a disk block. A disk block might be 16 kilobytes. If each record is 160 bytes, then 100 records could be stored in each block. The disk read time above was actually for an entire block. Once the disk head is in position, one or more disk blocks can be read with little delay. With 100 records per block, the last 6 or so comparisons don't need to do any disk reads—the comparisons are all within the last disk block read.
To speed the search further, the first 13 to 14 comparisons (which each required a disk access) must be sped up.

An index speeds the search
A significant improvement can be made with an index. In the example above, initial disk reads narrowed the search range by a factor of two. That can be improved substantially by creating an auxiliary index that contains the first record in each disk block (sometimes called a sparse index). This auxiliary index would be 1% of the size of the original database, but it can be searched more quickly. Finding an entry in the auxiliary index would tell us which block to search in the main database; after searching the auxiliary index, we would have to search only that one block of the main database—at a cost of one more disk read. The index would hold 10,000 entries, so it would take at most 14 comparisons. Like the main database, the last 6 or so comparisons in the aux index would be on the same disk block. The index could be searched in about 8 disk reads, and the desired record could be accessed in 9 disk reads.
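A toy sketch of such a sparse-index lookup (the method and array names are mine; keys smaller than the first block's first key are not handled):

import java.util.Arrays;

public class SparseIndexDemo {
    // The index holds the first key stored in each disk block; a binary search
    // on it picks the single block that could contain 'key'.
    static int findBlock(long[] firstKeyOfBlock, long key) {
        int pos = Arrays.binarySearch(firstKeyOfBlock, key);
        return (pos >= 0) ? pos : -pos - 2;  // block whose first key is <= key
    }

    public static void main(String[] args) {
        long[] index = {0, 100, 200, 300};   // 4 blocks, first key of each
        System.out.println(findBlock(index, 150)); // 1: record 150 lives in block 1
    }
}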
The trick of creating an auxiliary index can be repeated to make an auxiliary index to the auxiliary index. That would make an aux-aux index that would need only 100 entries and would fit in one disk block.
Instead of reading 14 disk blocks to find the desired record, we only need to read 3 blocks. Reading and searching the first (and only) block of the aux-aux index identifies the relevant block in aux-index. Reading and searching that aux-index block identifies the relevant block in the main database. Instead of 150 milliseconds, we need only 30 milliseconds to get the record.

The auxiliary indices have turned the search problem from a binary search requiring roughly \log_2 N disk reads to one requiring only \log_b N disk reads where b is the blocking factor (the number of entries per block: b = 100 entries per block; \log_b 1,000,000 = 3 reads).

In practice, if the main database is being frequently searched, the aux-aux index and much of the aux index may reside in a disk cache, so they would not incur a disk read.
Insertions and deletions cause trouble
If the database does not change, then compiling the index is simple to do, and the index need never be changed. If there are changes, then managing the database and its index becomes more complicated.
Deleting records from a database doesn't cause much trouble. The index can stay the same, and the record can just be marked as deleted. The database stays in sorted order. If there is a large number of deletions, then the searching and storage become less efficient.
Insertions can be very slow in a sorted sequential file because room for the inserted record must be made. Inserting a record before the first record in the file requires shifting all of the records down one. Such an operation is just too expensive to be practical. A trick is to leave some space lying around to be used for insertions. Instead of densely storing all the records in a block, the block can have some free space to allow for subsequent insertions. Those records would be marked as if they were "deleted" records.
Both insertions and deletions are fast as long as space is available on a block. If an insertion won't fit on the block, then some free space on some nearby block must be found and the auxiliary indices adjusted. The hope is that enough space is nearby such that a lot of blocks do not need to be reorganized. Alternatively, some out-of-sequence disk blocks may be used.

The B-tree uses all those ideas
The B-tree uses all of the ideas described above. In particular, a B-tree:
  • keeps keys in sorted order for sequential traversing
  • uses a hierarchical index to minimize the number of disk reads
  • uses partially full blocks to speed insertions and deletions
  • keeps the index balanced with an elegant recursive algorithm
In addition, a B-tree minimizes waste by making sure the interior nodes are at least half full. A B-tree can handle an arbitrary number of insertions and deletions. 


Hash Table

In computing, a hash table (hash map) is a data structure used to implement an associative array, a structure that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.
Ideally, the hash function will assign each key to a unique bucket, but it is possible that two keys will generate an identical hash causing both keys to point to the same bucket. Instead, most hash table designs assume that hash collisions—different keys that are assigned by the hash function to the same bucket—will occur and must be accommodated in some way.
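A toy illustration of the index computation and of why a collision forces an extra equality check (this is not how java.util.HashMap is laid out internally; the names are mine):

public class BucketDemo {
    public static void main(String[] args) {
        int capacity = 16;                               // number of buckets
        String key = "apple";
        // Fold the 32-bit hash into a bucket index; mask off the sign bit first.
        int bucket = (key.hashCode() & 0x7fffffff) % capacity;
        System.out.println("\"apple\" goes into bucket " + bucket);
        // A different key can land in the same bucket, so before returning a
        // value the table must still confirm the stored key with equals().
    }
}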
In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs, at (amortized[2]) constant average cost per operation.[3][4] In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure. For this reason, they are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets. 

In the Java programming language, every class implicitly or explicitly provides a hashCode() method, which digests the data stored in an instance of the class into a single hash value (a 32-bit signed integer). This hash is used by other code when storing or manipulating the instance – the values are intended to be evenly distributed for varied inputs for use in clustering. This property is important to the performance of hash tables and other data structures that store objects in groups ("buckets") based on their computed hash values.
Technically, in Java, hashCode() is by default a native method, meaning it has the modifier 'native' and is implemented directly in native code in the JVM.
All classes inherit a basic hash scheme from the fundamental base class java.lang.Object, but many override this to provide a hash function that better handles their specific data. Classes which provide their own implementation must override the object method public int hashCode(). 
The general contract for overridden implementations of this method is that they behave in a way consistent with the same object's equals() method: that a given object must consistently report the same hash value (unless it is changed so that the new version is no longer considered "equal" to the old), and that two objects which equals() says are equal must report the same hash value. There's no requirement that hash values be consistent between different Java implementations, or even between different execution runs of the same program, and while two unequal objects having different hashes is very desirable, this is not mandatory (that is, the hash function implemented doesn't need to be a perfect hash).[1] For example, the class Employee might implement its hash function by composing the hashes of its members:

public class Employee {
    int employeeId;
    String name;
    Department dept;

    // other methods would be in here

    @Override
    public int hashCode() {
        int hash = 1;
        hash = hash * 17 + employeeId;
        hash = hash * 31 + name.hashCode();
        hash = hash * 13 + (dept == null ? 0 : dept.hashCode());
        return hash;
    }
}

The 3 things you should know about hashCode()


1. Whenever you implement equals, you MUST also implement hashCode
If you fail to do so, you will end up with broken objects. Why? An object’s hashCode method must take the same fields into account as its equals method. By overriding the equals method, you’re declaring some objects as equal to other objects, but the original hashCode method treats all objects as different. So you will have equal objects with different hash codes. For example, calling contains() on a HashSet (or containsKey() on a HashMap) will return false for an equal object, even though the object has been added.
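A small demonstration of that failure mode (the Point class is my own example):

import java.util.HashSet;
import java.util.Set;

class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }
    // hashCode() deliberately NOT overridden: Object's identity hash is used.
}

public class BrokenHashDemo {
    public static void main(String[] args) {
        Set<Point> set = new HashSet<>();
        set.add(new Point(1, 2));
        // Equal by equals(), but it hashes to a different bucket, so: false.
        System.out.println(set.contains(new Point(1, 2)));
    }
}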
How to write a good hashCode function is beyond the scope of this article; it is perfectly explained in Joshua Bloch’s popular book Effective Java, which should not be missing from a Java developer’s bookshelf.
To be on the safe side, let the Eclipse IDE generate the equals and hashCode functions as a pair: Source > Generate hashCode() and equals()….
To protect yourself, you can also configure Eclipse to detect violations of this rule and display errors for classes that implement equals but not hashCode. Unfortunately, this option is set to “Ignore” by default: Preferences > Java > Compiler > Errors/Warnings, then use the quick filter to search for “hashcode”.
Update: As laurent points out, EqualsVerifier is a great tool to verify the contract of hashCode and equals. You should consider using it in your unit tests.

HashCode collisions

Whenever two different objects have the same hash code, we call this a collision. A collision is nothing critical, it just means that there is more than one object in a single bucket, so a HashMap lookup has to look again to find the right object. A lot of collisions will degrade the performance of a system, but they won’t lead to incorrect results.
But if you mistake the hash code for a unique handle to an object, e.g. use it as a key in a Map, then you will sometimes get the wrong object. Because even though collisions are rare, they are inevitable. For example, the Strings "Aa" and "BB" produce the same hashCode: 2112. Therefore:
2. Never misuse hashCode as a key
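You can verify that collision in a couple of lines:

public class CollisionDemo {
    public static void main(String[] args) {
        System.out.println("Aa".hashCode());   // 2112
        System.out.println("BB".hashCode());   // 2112
        System.out.println("Aa".equals("BB")); // false: two different keys, one hash
    }
}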
You may object that, unlike the printer’s type case, in Java there are 4,294,967,296 compartments (2^32 possible int values). With 4 billion slots, collisions seem to be extremely unlikely, right?
Turns out that it’s not so unlikely. Here’s the surprising math of collisions: Please imagine 23 random people in a room. How would you estimate the odds of finding two fellows with the same birthday among them? Pretty low, because there are 365 days in a year? In fact, the odds are about 50%! And with 50 people it’s a safe bet. This phenomenon is called the Birthday paradox. Transferred to hash codes, this means that with 77,163 different objects, you have a 50/50 chance for a collision – given that you have an ideal hashCode function that evenly distributes objects over all available buckets.
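For the curious, that 77,163 comes from the standard birthday-bound approximation (my restatement, not from the original article): with m = 2^32 equally likely hash values, a 50% collision probability is reached at roughly k \approx \sqrt{2 m \ln 2} = \sqrt{2 \cdot 2^{32} \cdot \ln 2} \approx 77,163 objects.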

Example:

The Enron email dataset contains 520,924 emails. Computing the String hash codes of the email contents, I found 50 pairs (and even 2 triples) of different emails with the same hash code. For half a million strings, this is a pretty good result. But the message here is: if you have many data items, collisions will occur. If you were using the hashCode as a key here, you would not immediately notice your mistake. But a few people would get the wrong mail.

HashCodes can change

Finally, there’s one important detail in the hashCode contract that can be quite surprising: hashCode does not guarantee the same result in different executions. Let’s have a look at the JavaDoc:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
This is uncommon; in fact, some classes in the class library even specify the exact formula they use to calculate hash codes (e.g. String). For these classes, the hash code will always be the same. But while most of the hashCode implementations provide stable values, you must not rely on it. As this article points out, there are Java libraries that actually return different hashCode values in different processes, and this tends to confuse people. Google’s Protocol Buffers is an example.
Therefore, you should not use the hash code in distributed applications. A remote object may have a different hash code than a local one, even if the two are equal.
3. Do not use hashCode in distributed applications
Moreover, you should be aware that the implementation of a hashCode function may change from one version to another. Therefore your code should not depend on any particular hash code values. For example, you should not use the hash code to persist state. Next time you run the application, the hash codes of the “same” objects may be different.
The best advice is probably: don’t use hashCode at all, except when you create hash-based algorithms.
