Author Topic: A Question on Sorting (Read 31391 times)

rickclark58 · « **Reply #15 on:** September 19, 2021, 11:32:22 am »

Just to add a bit more to the discussion in relation to sequential files. I am not saying this applies to your problem, but this problem has been around since the beginning. Back in the day, most data was stored as sequential files and as the number of records (an individual line in a sequential file) grew so did the time required to access them. Sequential files are linear so have a big O of O(n). The process was to load a record into memory, search it for the needed data, process it, and then load the next record into memory, and so on. The best case is to find the searched element on the first lookup and the worst case is to find the element on the last lookup, or not at all. The average would be in the middle. This doesn't sound bad unless you have millions of records to search through or very large sequential files. Since these files usually reside on disk, you are having to go through the IO process of the disk (or tape back in the day) so the processing time became quite significant. What was needed was a good way to look up the data in a manner that skipped most of the record on disk and could pull the needed information directly from the record on disk. They came up with indexes.

Indexes are sorted lists of the data in the record with the lookup item as a key and the location of that item on the disk. Since the list is sorted, you can do a binary search on the list. A binary search is O(log n) which is significantly faster than O(n). The difference is "In contrast to O(N) which takes an additional step for each data element, O(log N) means that the algorithm takes an additional step each time the data doubles." -Introduction to Big O Notation.

Most indexes are actually stored as tree structures and you just have to walk down the tree to the needed item if it exists. Trees are actually an implementation of a binary search in reverse so when you traverse a tree you are doing a binary search on the data. Indexes significantly improved data access and even today, every database creates indexes on the data for fast retrieval. If you have a lot of sequential files to go through, building indexes will significantly increase access time. However, there are many free open source databases that you can use via ODBC or dedicated libraries so there is no sense in reinventing the wheel nowadays. Using a local database is the way to go now if you need to process large data sets.

Dimster · « **Reply #16 on:** September 19, 2021, 08:25:31 pm »

So Tempodi I kind of get where you are heading with the Yin-Yang, Higher than/Lower than and a low interest loan is an example where Low beats High. Reframing my concept to something like Best/Worst or Acceptable/Not Acceptable would be more to the point of agnostic computerize than a humanistic bias towards High being positive and Low negative.

Rick, not quite sure I follow the INDEXES ... are you saying the data values in the Sequential File are pre ordered and an Index is affixed to each sequential record? I guess I could sort each sequential file every time I write a data value to it.

Or is the point that the data values, in the sequential file, remain as originally entered but the Indices somehow identify the hierarchical order?

One of the things I am regretting with the way I have recorded all this damn data, is to keep an accompanying Stats File for each of the Sequential Files. I know now, what I didn't realize back then when I first started the program, that things like a Maximum Value or a Moving Average (etc, etc,etc) could have been recorded at the same time as data entry. As it is now, I need to open multiple files , search data, calculate and present/display results everytime I need to know whats Up with my program. I'm thinking that Stat File maybe akin to what you are describing with INDEXES ???

rickclark58 · « **Reply #17 on:** September 19, 2021, 09:00:57 pm »

The indexes contain key values like an account number. When you save the record, the account number is inserted into the index tree in the proper order. So if the account number is 100 it would be inserted into the tree right after 99. Normally, the records are contained in a block, say 100 records per block. The location of the record in the block is recorded with the account number. So, when you need to load account 100 again, the program looks through the index and finds 100, and then retrieves the location of the record within the block. It would then seek to that location and load the record.

Typically records have a header and the start of the header is actually what is stored in the index. The header is a fixed length, so the program knows how much data to read. After reading the header, the file pointer would be sitting at the beginning of the record. The program would read the record header to find out how long the record is and then load the record based on the length in the header. That way you can have variable length records which will save space and makes it faster to retrieve data.

This is a very simplified explanation and this isn't something that you would code yourself. You could use SQL Lite for instance, and simply query the database for account number 100. Using a database, you can get 1 record or 100 with a simple query, which makes it much easier to manipulate this kind of data. You can do something like "Select * from customers where account_number = 100". If the record is there, it will return it to the program where you can access the data within the record. The star means to grab all the fields in the record. Customers would be a table that contains the data. You can also do things like join multiple tables and return the information as if it were one table. It is quite handy when you get the hang of it and very powerful.

It was quite straightforward to load a database with preexisting data too as there are built-in commands to do such a thing. It is actually quite common in many businesses. When I worked for Ford Motor Finance, we would get a daily dump from the mainframe that we would load into the Oracle database every day. Once you set it up, it is an automatic process. I am not suggesting you use a database for your data, as I don't really know what you are doing with it. Managing datasets has always been a problem, which is why almost everyone uses a database if they are working with a lot of data. It is just easier.

SMcNeill · « **Reply #18 on:** September 19, 2021, 09:40:47 pm »

Indexes are key for maintaining large data sets. Let me give you an overly simplified example:

dog, cat, apple, frog, elephant, + 1,000,000 more items…. (Store this in an array of DIM Words(1,000,005) AS STRING)

Now, let’s say the above is your data set, and you need it sorted alphabetically. Instead of moving the data itself, you simply create an index and use it.

3, 2, 1, 5, 4, + 1,000,000 more items…. (Store these values in an array of DIM Index(1,000,005) AS LONG)

Now, which is the first word in the list alphabetically? It’s not Words(1) — that’s “dog”.

But what about Words(Index(1))?

Index(1) holds a value of 3, making that Words(3) — “apple”. The first element of the index points to the first word alphabetically in the list!

And Words(Index(2)) is “cat”, and Words(Index(3)) is “dog”, and so on!

Your data never moves. It stays in its original order at all times. You’re just sorting an Index to the data, rather than rearranging or reordering the data itself — and that can be an extremely powerful database tool!

Think of a list of data with first name, last name, address, state, zip code, phone number….

Now, you could sort and move and reorder that data EVERY time you needed it listed in a certain order…. Your boss likes last names first. Telemarketing likes sorted by phone numbers. The mail men want it sorted by address…. An advertiser wants it sorted by zip code….

Now, since you’re maintaining a Fortune 500 database with 397,216,312 customers, are you going to want to take the time to reshuffle ALL that data every time you need it sorted by a different criteria??

NOT IF YOU WANT TO KEEP YOUR JOB!!

For companies like that, maintaining the database — without altering it and possibly corrupting it — is the Holy Grail! You *don’t* erase a 300 million customer database. You *don’t* overwrite it. You *worship* it as your God and *ONLY* add new customers to the end of it! The customer database can ONLY grow — NOTHING else!!

So then how do you sort it for all those different people and use cases?

By indexes!

DIM FirstName(1,000,000,000) AS LONG — Index sorted by first name.
DIM LastName(1,000,000,000) AS LONG — Index sorted by last name.

You want to know who the first customer is, sorted by first name? Thats CustomerInfo(FirstName(1)).

Want to know who the first customer is, sorted by last name? CustomerInfo(LastName(1)).

You reference the data by its position in the index, not by its position in the database itself.

Dimster · « **Reply #19 on:** September 20, 2021, 08:05:15 am »

Huuuuum... In terms of creating an Index file for one existing Sequential File of say 4000 data values I would either have to figure out how to affix a record number to each data value or calculate the position of each data value within the Sequential file ??????

SMcNeill · « **Reply #20 on:** September 20, 2021, 12:00:28 pm »

Quote from: Dimster on September 20, 2021, 08:05:15 am

Huuuuum... In terms of creating an Index file for one existing Sequential File of say 4000 data values I would either have to figure out how to affix a record number to each data value or calculate the position of each data value within the Sequential file ??????

Here's a quick little example of an indexed file at work:

Code: QB64: [Select]

Dim word(10) As String
Dim index(10) As Long
$Let SKIPENTRY = TRUE
 
 
$If SKIPENTRY Then
    For i = 1 To 10
        For j = 1 To 5 'create a totally random 5 letter word
            word(i) = word(i) + Chr$(Rnd * 26 + 65) 'these "words" are made up of 5 random letters
        Next
        index(i) = i 'point the initial index to the words as they currently exist
    Next
$Else
    For i = 1 To 10
    Print "Give me 10 words to store for you (#"; i; "of 10) =>";
    Input word(i) 'Get any 10 words which you might like from the user
    index(i) = i 'point the initial index to the words as they currently exist
    Next
$End If
 
 
'sort by index
For i = 1 To 10
    For j = 1 To 10
        If word(index(i)) < word(index(j)) Then Swap index(i), index(j)
    Next
Next
 
Print , "Original", "Sorted"
For i = 1 To 10
    Print i, word(i), word(index(i))
Next
 

[ You are not allowed to view this attachment ] [ You are not allowed to view this attachment ]

Now, as you can see from the above, I'm leaving my data in its original layout and order. We're not touching it at all, and we'll always be able to reproduce it faithfully without worry in the future. I've created a single index here (called Index oddly enough -- who would've thunk it?!), and instead of sorting by the array elements itself, I'm sorting the array by its indexed elements.

To keep it simple and easy to understand, I've implemented the most basic of sorting routines here -- a lousy, unoptimized bubble sort, but I'm only using 10 elements in my demo array so that shouldn't be any issue for performance.

If you want to manually enter 10 words yourself, rather than just let the program toss up a bunch of gibberish for our data, simply change that $LET statement to be FALSE and then run the program. If you want to test with a larger dataset, just make the arrays large enough to hold your data and change the loops end point ot match. 400 elements or 4,000,000,000 elements -- the process to work with them would all more or less be the same. (The main difference would be the need for a better sorting algorithm with a larger data set, but the concept behind indexing the data would remain the same.)

At this point, you could now save the data to disk (my_data.txt for example) and save the index to disk (my_data.ndx for example), and if you ever needed to search for something alphabetically, you'd just do a quick binary search based on word(index(position))).

It takes a little while for people to get used to the idea of referencing their data by array(index(position)) rather than simply by array(position), but once they've gotten used to the concept, they'll always tell you how powerful a tool it can be when the situation calls for it.

Small datasets usually don't need an index. You just sort them on the fly, by whatever criteria you need, and use them as is. Massive data sets (such as a large company's customer database, or an index to all the books in a library) are almost always upkept with multiple indexes which stay sorted all the time.

Indexing is definitely a nice skill/tool to learn to work with, just in case you ever find yourself in a situation where working with your data becomes too time consuming otherwise. ;)

Dimster · « **Reply #21 on:** September 21, 2021, 03:49:12 pm »

Thanks Steve - definitely going to sort out Indexing. I appreciate the info.

TempodiBasic · « **Reply #22 on:** September 22, 2021, 05:01:16 am »

Hi
Thanks to Steve for his code demo, I hope he doesn't mind if I have enlarged (modded) his demo to stress the advantages of working with indexes. So I just added more indexes to show how to manage the thing.

Code: QB64: [Select]

Const Max = 15
Dim word(Max) As String
Dim As Long index(Max), indexZA(Max), indexLENmin(Max)
$Let SKIPENTRY = TRUE
 
 
$If SKIPENTRY Then
    For i = 1 To Max
        For j = 1 To (1 + Rnd * 9) 'create a totally random 1-10 letter word
            word(i) = word(i) + Chr$(Rnd * 26 + 65) 'these "words" are made up of 5 random letters
        Next
        index(i) = i 'point the initial index to the words as they currently exist
    Next
$Else
    FOR i = 1 TO max
    PRINT "Give me 10 words to store for you (#"; i; "of ";max;"10) =>";
    INPUT word(i) 'Get any ";max;" words which you might like from the user
    index(i) = i 'point the initial index to the words as they currently exist
    NEXT
$End If
 
'copy the intial indexes
For a = 1 To Max
    indexZA(a) = index(a)
    indexLENmin(a) = index(a)
Next
 
'sort by index  A-Z
For i = 1 To Max
    For j = 1 To Max
        If word(index(i)) < word(index(j)) Then Swap index(i), index(j)
    Next
Next
 
'sort by index  Z-A
For i = 1 To Max
    For j = 1 To Max
        If word(indexZA(i)) > word(indexZA(j)) Then Swap indexZA(i), indexZA(j)
    Next
Next
 
'sort by index  Lenght of word   from short to longer
For i = 1 To Max
    For j = 1 To Max
        If Len(word(indexLENmin(i))) > Len(word(indexLENmin(j))) Then Swap indexLENmin(i), indexLENmin(j)
    Next
Next
 
Print , "Original", "Sorted A-Z", "Sorted Z-A", "Sorted by Lenght"
For i = 1 To Max
    Print i, word(i), word(index(i)), word(indexZA(i)), word(indexLENmin(i))
Next
 
 

and here the results
[ You are not allowed to view this attachment ]

SMcNeill · « **Reply #23 on:** September 22, 2021, 05:23:22 am »

I don’t mind the modding at all, Tempodi. 😘

Only thing I want to point out is the fact that you don’t need a separate index to go from A-Z to Z-A sorting. Just read your index in reverse….

Instead of:

PRINT i, word(i), word(index(i)), word(indexZA(i)), word(indexLENmin(i))

Use:

PRINT i, word(i), word(index(i)), word(index(Max - i + 1)), word(indexLENmin(i))

If your index is storing the order from A-Z, reading that index backwards will give you the same list from Z-A.

indexZA() is good to use just as an example, but in most real world use cases, you’d just use the original index itself.

index(i) is the A to Z order from 1 to Max

index(Max - i + 1) is simply the inverse of that order, so it’s Z to A.

TempodiBasic · « **Reply #24 on:** September 24, 2021, 07:56:36 pm »

Hi Steve
thanks for tips
it is important to have in mind to use less resources to do the task so
I have corrected my homework mod with this:

Code: QB64: [Select]

Const Max = 15
Dim word(Max) As String
Dim As Long index(Max), indexLENmin(Max)
$Let SKIPENTRY = TRUE
 
 
$If SKIPENTRY Then
    For i = 1 To Max
        For j = 1 To (1 + Rnd * 9) 'create a totally random 1-10 letter word
            word(i) = word(i) + Chr$(Rnd * 26 + 65) 'these "words" are made up of 5 random letters
        Next
        index(i) = i 'point the initial index to the words as they currently exist
    Next
$Else
    FOR i = 1 TO max
    PRINT "Give me 10 words to store for you (#"; i; "of ";max;"10) =>";
    INPUT word(i) 'Get any ";max;" words which you might like from the user
    index(i) = i 'point the initial index to the words as they currently exist
    NEXT
$End If
 
'copy the intial index
For a = 1 To Max
    indexLENmin(a) = index(a)
Next
 
'sort by index  A-Z
For i = 1 To Max
    For j = 1 To Max
        If word(index(i)) < word(index(j)) Then Swap index(i), index(j)
    Next
Next
 
'sort by index  Z-A
' no needed because it is the reverse of Z-A
 
'sort by index  Lenght of word   from the shortest to longestest
For i = 1 To Max
    For j = 1 To Max
        If Len(word(indexLENmin(i))) > Len(word(indexLENmin(j))) Then Swap indexLENmin(i), indexLENmin(j)
    Next
Next
 
' sort by index length of word from the longest to te shortest
' no needed because it is the reverse of   from the shortest to the longest
 
Print "    "; "Original"; "   "; "Sorted A-Z"; "   "; "Sorted Z-A"; "   "; "by Shortest"; "   "; "by Longest"
For i = 1 To Max
    Locate , 1:
    Print i;
    Locate , 5: Print word(i);
    Locate , 20: Print word(index(i));
    Locate , 31: Print word(index(Max - i + 1));
    Locate , 42: Print word(indexLENmin(i));
    Locate , 58: Print word(indexLENmin(Max - i + 1))
    Sleep 1
Next
 

[ You are not allowed to view this attachment ]

News:

Author Topic: A Question on Sorting (Read 31391 times)

rickclark58

Re: A Question on Sorting

Dimster

Re: A Question on Sorting

rickclark58

Re: A Question on Sorting

SMcNeill

Re: A Question on Sorting

Dimster

Re: A Question on Sorting

SMcNeill

Re: A Question on Sorting

Dimster

Re: A Question on Sorting

TempodiBasic

Re: A Question on Sorting

SMcNeill

Re: A Question on Sorting

TempodiBasic

Re: A Question on Sorting