Steve - would you be able to run that edit by me one more time? If I had a field carrying data "AppleBananaCarrot" and wanted to change this data to "AppleBeetCarrot" - assuming I'm typing the complete new data for that field, why would the "na" re-appear in the field's data?
Certainly. :)
Now, our original data looked like AppleBananaCarrot.
Our original index would look like
Record 1: Start 1, Length 5
Record 2: Start 6, Length 6
Record 3: Start 12, Length 6
With this information, we can pick any record which we want, and retrieve the variable length data that it contains. For the 2nd record, we start at byte number 6, and retrieve 6 bytes of data: "Banana". Without these two pieces of information, how do we know how long our data is? How do we know this isn't one piece of information on an AppleBananaCarrot smoothie? Our index tracks both the start, and the length of our data fields for us.
Now, if we end up editing and replacing "Banana" with "Beet", we want to make the changes to the database as simple as possible. Since "Beet" is less than, or equal to, the length of "Banana", we can just swap its data directly in place of the old, and then update our index.
Our Edited Data now looks like: Apple
BeetnaCarrot
And our edited Index now looks like:
Record 1: Start 1, Length 5
Record 2: Start 6,
Length 4Record 3: Start 12, Length 6
Record 4+: All remains the same
With the
BOLD ITALICS above, you can see that we've made the minimal amount of changes possible to preserve our data, and keep all the information referenced properly.
Now, you
*could* do as you suggest, and pack the database as you go along, but that's actually going to be a lot of processing to do
Apple
BeetCarrot <-- Let's say you want to make this your data base, so that "na" isn't in there any longer.
You'd have to recalculate the index completely, just to make that one single change:
Record 1: Start 1, Length 5
Record 2: Start 6,
Length 4Record 3:
Start 10, Length 6
Record 4+:
All need their Start references updated, Length will remain the same
Notice with the italics here, how much you're changing, and reindexing? Now, for a database of 3 records, there's no worry over it. You'll be finished in microseconds. But what if we suddenly are working with a database with 1,000,000,000 records?? Do you *really* want to rebuild that index and pack that database with each and every change to the dataset? Or do you think it might be better to just leave some junk in it (like that lost "na" from banana), and then schedule a purge/packing of the data for 2AM on Sunday, when the company is closed and nobody is using the database?
It's not that the "na" is re-appearing in that field's data. It's simply that you can put, at most, 6-bytes of information in the same place where "banana" was stored in the dataset. Since you're only putting 4-bytes there, there's no reason to mess with the other 2 at all, at this point in time. (You're changing "Bana" for "Beet".) Your new index won't reference those other 2-bytes (the "na"), and when you rebuild/pack the database in the future, you'll get rid of that unreferenced data, and your database
*will* end up looking like you suggest "AppleBeetCarrot". You're simply ignoring them completely, for the moment, and just making the minimal changes to your index and data, to store your new information, without having to alter more than a single record at the time. Packing and cleaning the database can be done at a later time, to save a little disk space on your machines. ;)
The whole point is to make your edits, and store your data, while only affecting a single index/record at a time. If you were using a RANDOM access file, you certainly don't want to have to redo the whole file just because you changed a single record, and it's the same here with a variable-length data field. You don't want to have to move and change the whole dang database, just to update a single record -- you only want to make changes which impact on that one record, and that record only.
As long as your edit is less than, or equal to the old data, you can just update the data and the length index, and be done with it.
If it's longer than the old data, you then simply tack it to the end of your existing data, and point the start and length index to that new position, and be done with it.
Both options keep you only making changes to a single record/index, but both will also leave stray, unreferenced data behind. Going this route, you'll want to manually clean up the database at some later point in time, which won't affect user productivity any.
(And all you do then is basically just read your records, and write them to a new database/index, in order -- then delete or archive the old set, and rename the new set to take its place in your program. Personally, I prefer to archive the old datasets, just in case the drive ever gets corrupted, or something unforeseen happens like your intern deletes the whole thing on you, while trying to show off...)