(Too) many moons ago, in a land far, far away (a Burroughs B-263 mainframe with 12K of memory, 1 tape drive, and 1 hard disk), we had a function called RSD: Read Sorted Disk. When the need arose to sort a file, and the length of the (combined) sort fields PLUS 6 (2 bytes each for the cylinder# (cc), head# (hh), and record# (rr) address of a record on disk) was less than half the length of the records being sorted (true a majority of the time), we didn't swap entire records on disk (memory sorting in 12K wasn't on the menu). Instead, we read the file and wrote "key" records containing just the sort field(s) plus the original record's address (cchhrr). Then we sorted that "key" file on the sort field(s). After sorting, the RSD function would read the "key" records in sequence and use the cchhrr to retrieve and return the "full" records in sorted order.
Perhaps a similar implementation, utilizing memory for each small "phase" of the polyphase sort/merge, would speed up the process? If real string swapping is occurring (and I suspect it is), then sorting "shorter" elements in memory would speed up every phase. Admittedly, there IS an extra pass to read the original file and build/write the "key" file, and there IS an extra pass to read the sorted "key" file and retrieve/write the sorted "original" file. That's why the RSD process was only used if the length of the sort fields (plus 6) was less than half the original record length.
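For what it's worth, here is a minimal sketch of that key-file idea in a modern setting. It assumes fixed-length records in a flat file; the file name, record layout, and the `tag_sort` helper are all hypothetical, and a 64-bit file offset plays the role of the 6-byte cchhrr address:

```python
def tag_sort(path, record_len, key_start, key_len):
    """Sort fixed-length records by a key field without moving full records.

    Builds short (key, offset) "key records", sorts only those, then
    retrieves the full records in key order -- the RSD approach.
    """
    keys = []  # (key bytes, record offset); the offset stands in for cchhrr
    with open(path, "rb") as f:
        offset = 0
        while True:
            record = f.read(record_len)
            if len(record) < record_len:
                break
            keys.append((record[key_start:key_start + key_len], offset))
            offset += record_len

    keys.sort()  # only the short key records are compared and swapped

    # Re-read the full records in sorted order via their saved addresses.
    with open(path, "rb") as f:
        for _, off in keys:
            f.seek(off)
            yield f.read(record_len)


if __name__ == "__main__":
    RECORD_LEN, KEY_START, KEY_LEN = 80, 0, 10  # hypothetical record layout
    ADDR_LEN = 8  # a 64-bit offset instead of the 6-byte cchhrr

    # Mirror the original rule of thumb: only worth the two extra passes
    # when key + address is less than half the record length.
    if KEY_LEN + ADDR_LEN < RECORD_LEN // 2:
        for rec in tag_sort("data.bin", RECORD_LEN, KEY_START, KEY_LEN):
            print(rec)
```

The same tradeoff applies as it did then: the key build and the final retrieval each cost a pass over the data, so the win comes entirely from every sort/merge phase in between handling short key records instead of full ones.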