UnShuffle Sort
Encyclopedia

Introduction

The UnShuffle Sort is a distribution or merge sort
Merge sort
Merge sort is an O comparison-based sorting algorithm. Most implementations produce a stable sort, meaning that the implementation preserves the input order of equal elements in the sorted output. It is a divide and conquer algorithm...

 which was developed by Art S. Kagel. UnShuffle is most efficient when sorting data which exhibits low entropy
Information entropy
In information theory, entropy is a measure of the uncertainty associated with a random variable. In this context, the term usually refers to the Shannon entropy, which quantifies the expected value of the information contained in a message, usually in units such as bits...

, in effect where the data is well ordered or contains sub-sequences of ordered items. It was first mentioned in an article in Computer Language Magazine (Vol. 3 No. 11, November 1985). The current implementation is the result of several years of additional experimentation. The sort involves two phases. During the first distribution phase items are distributed to a variable number of ordered lists using a structure that minimizes the number of comparisons required. Once all items have been distributed the sorted lists are merged to the output list. UnShuffle is one of a very few sorts that can be applied directly to linked lists.

The algorithm utilizes a data structure
Data structure
In computer science, a data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently.Different kinds of data structures are suited to different kinds of applications, and some are highly specialized to specific tasks...

 dubbed a 'pile
Pile (data structure)
A pile is an abstract data structure for storing data in a loosely ordered way. There are two different usages of the term; one refers to an ordered deque, the other to an improved heap.-Ordered deque:...

' which is an ordered deque
Deque
In computer science, a double-ended queue is an abstract data structure that implements a queue for which elements can only be added to or removed from the front or back...

 which permits elements to be added to the head of the list if the new item is valued less than or equal to the current head or to the tail of the list if the new item is greater than or equal to the current tail. It is not permitted to insert elements between existing elements. This is normally implemented as a doubly linked list of lists with pointers to the head and tail of the doubly linked list containing the data items on the pile and to the next and previous pile. To keep things simple some normal optimizations are not included in the description of the algorithm but are presented at the end. Details of removing items from the input stream/list and appending items to the output stream/list are omitted as implementation specific.

Best case behavior for UnShuffle degenerates into a sort check for a fully ordered or ordered but reversed data set
Data set
A data set is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the data set in question. Its values for each of the variables, such as height and weight of an object or values of random numbers. Each...

 with only one pile created using only N-1 comparisons (with the recommended optimization) and no merge necessary. There is no worst case behavior, it can be shown that no data set will perform worse than the Order
Big O notation
In mathematics, big O notation is used to describe the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions. It is a member of a larger family of notations that is called Landau notation, Bachmann-Landau notation, or...

 of the sort which is O((K/4)*N) for Phase I + O (N*(log K)) for Phase II using an Ideal Merge for Phase II where K is the number of piles created during the distribution phase (Phase I) and is proportional to the level of entropy in the data.

Because it sorts linked lists rather than data arrays only pointers are moved and because of the pile structure there are no expensive swaps performed so sort time is not dependent on the size of the data but only on the length and complexity of the key.

Phase I - Distribution Phase

  1. Take the first input item and create pile#1 with item#1 as head and tail.
  2. If no more items goto Phase II.
  3. Take next input item.
  4. Select the last pile for comparison.
  5. Compare to top of the pile.
  6. If equal place on top of pile, goto 2.
  7. If less and this pile is not the first pile then select the previous pile for comparison, go to 5.
  8. If less and this pile is the first pile, place item on top of the pile, goto 2.
  9. If greater and this is not the last pile place on top of the next pile (which is the one previously compared), goto 2, else this Item is greater than top of last pile. Continue to 10.
  10. Compare to bottom of the current pile.
  11. If equal place on bottom of pile, goto 2.
  12. If greater and this is not the first pile select the previous pile, goto 10.
  13. If less and this is the last pile create a new pile with the item and append to the list of piles, goto 2.
  14. Place on bottom of the next pile (which is the one previously compared), goto 2.

Phase II - Merge Phase

Use an Ideal Merge to merge the piles created in Phase I. Output to the output linked list or data stream. Ideal Merge is a merge algorithm
Algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...

 Art S. Kagel believes he also invented which can be shown to be the best possible merge of sorted sinks (queues). Algorithm follows.

Ideal Merge Algorithm

  1. Arrange the input queue headers into an array or linked list if not this way already.
  2. Sort the list of input queue based on the relative value of the top element of each queue (since the list of queues is typically small almost any sort will do).
  3. Output the top element of the first queue.
  4. If first queue is empty then drop it. If no other queues then exit. else Second queue is now first, goto 3. else continue to 5.
  5. If first queue is the only remaining queue, goto 3.
  6. Compare the new top of the first queue to the top of the second queue. If less than or equal goto 3, else continue.
  7. Binary search the second and following queue tops to determine where the top of the current first queue belongs in the list of queues and insert first queue there in the list. The second queue is now first goto 3.

Optimizations to Phase I

Keep track of whether the last element was placed on the top or bottom of some pile and begin comparison for the next element on the same side (i.e. if the last element was placed on the bottom of a pile begin comparisons with the bottom of the last pile rather than the top and switch to comparing to the top if the item is greater than the bottom of the last pile). For the general case and for highly ordered data also this is the single most important optimization to the basic algorithm saving, on average, N/2 comparisons and maximum of N-1 comparisons for reversed input.

Keep track of which pile the last element was appended or prepended to and begin comparison there. Intuitively this should be a good optimization and for certain data sets it may be, but testing has shown that the added complexity of the modified algorithm overshadows any benefit in the general case.

Optimizations to Phase II

The general Ideal Merge algorithm
Algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...

 can be enhanced by knowledge of the nature of the Distribution Phase:

When used as the merge for Unshuffle sorting, the creation of the list of queues and the initial sort of the piles is not needed since the distribution creates the piles ordered by their top element. In this case the merge begins with step #3.

Since runs of elements may be prepended to each pile before any other pile is prepended or appended to there is benefit to comparing the next item on the first pile to the top of the second pile, after having output the top element from the pile, to determine if the binary reordering search is needed yet, if not that element can be immediately output and this step repeated. (I've included this optimization in my description of the merge above.)

Optimizations for specific input

For data arriving in a stream, like a file sort or piped input, create an iterative version of the algorithm with an add_element function and an output_element function. The add_element would sort the single element into the piles structure. The first call to the output function will block further add_element calls from modifying the data structures and begin the merge by outputting the first element on pile 1 and performing the pile reorder operation if needed. Each subsequent call will return the top of the first pile and reorder as needed.

Arrays can also be treated as a stream and subjected to the iterative sort rather than build a linked list from the array. Because of the function call overhead the benefit of this suggestion may be marginal at best but it suggests that a non-iterative version that treats the array as a stream may do even better. Hmm, an array sort version that substitutes an internal add-loop for the external loop calling an add-to-stream function should be fast enough to make direct array sorting feasible.

UnShuffle is not naturally a stable sort but one can force stability by appending the input record number to the end of the sort key.

Duplicate record elimination can be performed during both Phase I and Phase II or delayed until Phase II. The elimination accomplished during Phase I is partial only so the elimination in Phase II is still needed but processing time, and IO cost (if sort-work files are needed for larger data sets), is saved by doing both.

Because the second phase is a merge of sorted sinks, it is possible to apply the distribution phase in several independent parallel operations through separate processes or threads and then to apply the merge to the complete set of piles.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK