(Too) many moons ago, in a land far, far away (a Burroughs B-263 mainframe with 12K of memory, 1 tape drive, and 1 hard disk), we had a function called RSD: Read Sorted Disk. When the need arose to sort a file, and the length of the (combined) sort fields PLUS 6 (2 bytes each for the cylinder# (cc), head# (hh), and record# (rr) address of a record on disk) was less than half the length of the records being sorted (true a majority of the time), we didn't swap entire records on disk (memory sorting in 12K wasn't on the menu). Instead, we read the file and wrote "key" records containing just the sort field(s) plus the original record's address (cchhrr). Then we sorted that "key" file on the sort field(s). After sorting, the RSD function would read the "key" records in sequence and use the cchhrr to retrieve and return the "full" records in sorted order.
Perhaps a similar implementation, utilizing memory for each small "phase" of the polyphase sort/merge, would speed up the process? If real string swapping is occurring (and I suspect it is), then sorting "shorter" elements in memory would speed up every phase. Admittedly, there IS an extra pass to read the original file and build/write the "key" file, and there IS an extra pass to read the sorted "key" file and retrieve/write the sorted "original" file. That's why the RSD process was only used if the length of the sort fields (plus 6) was less than half the original record length.
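For what it's worth, here is a minimal sketch of that key-file idea in a modern setting. It assumes fixed-length records in a flat file; the file name, record layout, and the `tag_sort` helper are all hypothetical, and a 64-bit file offset plays the role of the 6-byte cchhrr address:

```python
def tag_sort(path, record_len, key_start, key_len):
    """Sort fixed-length records by a key field without moving full records.

    Builds short (key, offset) "key records", sorts only those, then
    retrieves the full records in key order -- the RSD approach.
    """
    keys = []  # (key bytes, record offset); the offset stands in for cchhrr
    with open(path, "rb") as f:
        offset = 0
        while True:
            record = f.read(record_len)
            if len(record) < record_len:
                break
            keys.append((record[key_start:key_start + key_len], offset))
            offset += record_len

    keys.sort()  # only the short key records are compared and swapped

    # Re-read the full records in sorted order via their saved addresses.
    with open(path, "rb") as f:
        for _, off in keys:
            f.seek(off)
            yield f.read(record_len)


if __name__ == "__main__":
    RECORD_LEN, KEY_START, KEY_LEN = 80, 0, 10  # hypothetical record layout
    ADDR_LEN = 8  # a 64-bit offset instead of the 6-byte cchhrr

    # Mirror the original rule of thumb: only worth the two extra passes
    # when key + address is less than half the record length.
    if KEY_LEN + ADDR_LEN < RECORD_LEN // 2:
        for rec in tag_sort("data.bin", RECORD_LEN, KEY_START, KEY_LEN):
            print(rec)
```

The same tradeoff applies as it did then: the key build and the final retrieval each cost a pass over the data, so the win comes entirely from every sort/merge phase in between handling short key records instead of full ones.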