Secondary Sort
The repartitionAndSortWithinPartitions transformation
repartitions the dataset according to a partitioner and, within each
resulting partition, sorts records by their keys. This transformation
pushes sorting down into the shuffle machinery, where large amounts of
data can be spilled efficiently and sorting can be combined with other
operations.
For example, Apache Hive on Spark uses this transformation inside its join
implementation. It also acts as a vital building
block in the secondary sort pattern, in which you group records by
key and then, when iterating over the values that correspond to a key,
have them appear in a particular order. This scenario occurs in algorithms
that need to group events by user and then analyze each user's events in
the order in which they occurred.
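
The secondary sort pattern can be illustrated with a small plain-Python model. This is a sketch of the idea only, not Spark code: the function names, the toy partitioner, and the sample events are all hypothetical. It assumes a composite key of (user, timestamp), partitions on the user alone so that all of a user's events land in the same partition, and sorts each partition by the full key so that each user's events come out in time order.

```python
from itertools import groupby


def repartition_and_sort_within_partitions(records, num_partitions, partition_func):
    """Plain-Python model of the transformation (not the Spark API):
    route each (key, value) record to a partition, then sort each
    partition by key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_func(key) % num_partitions].append((key, value))
    return [sorted(part, key=lambda kv: kv[0]) for part in partitions]


def events_by_user(partition):
    """Walk one sorted partition and yield (user, [(timestamp, event), ...]).
    Because the partition is sorted by (user, timestamp), each user's
    records are contiguous and already in time order."""
    for user, group in groupby(partition, key=lambda kv: kv[0][0]):
        yield user, [(key[1], value) for key, value in group]


def user_partitioner(key):
    """Hypothetical deterministic partitioner: looks only at the user
    part of the composite key, so one user never spans two partitions."""
    return sum(ord(ch) for ch in key[0])


# Hypothetical sample data: key = (user, timestamp), value = event name.
events = [
    (("bob", 30), "logout"),
    (("alice", 20), "click"),
    (("bob", 10), "login"),
    (("alice", 5), "login"),
    (("alice", 40), "logout"),
]

partitions = repartition_and_sort_within_partitions(events, 2, user_partitioner)
grouped = {user: evs for part in partitions for user, evs in events_by_user(part)}

for user in sorted(grouped):
    print(user, grouped[user])
```

In PySpark, the corresponding call would be repartitionAndSortWithinPartitions on an RDD of key-value pairs, with a partitionFunc that inspects only the user part of the key. The model above leaves out the shuffle and spilling behavior that make the real transformation efficient on large data.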