Author Topic: Generalized study of randomness (Read 9304 times)

STxAxTIC · « **on:** January 25, 2020, 10:17:31 pm »

Alright, so I think I've more-or-less maxed out this idea for now.

In a very tight nutshell, putting it all in place,

(1) I prototyped an idea about analyzing pseudo-randomness publicly here: https://www.qb64.org/forum/index.php?topic=2095.0

(2) ... And that led to what I figured was a "complete" study, https://www.qb64.org/forum/index.php?topic=2104.0, which is all contained in this download: http://barnes.x10host.com/ordo%20ab%20chao/Ordo%20Ab%20Chao.pdf

And now the general case has been worked out for arbitrary everything - this'll work on binary, numbers, strings, hex, whatever. The method doesn't care at all. What I will do next is write it up formally and add it to the above document, but in the meantime I want to pose an example, which I'll solve later, as a fun question.

I will attach to this post a file of at least 100,000 pseudo-random characters. I'll state up front that the file contains a few identical copies of an ordered string. The copies are incredibly sparse, and the string is made of lots of weird characters and hard to pick out by eye, but it's certainly in the file.

What may interest some of you, before I write up my solution, is to try to find the ordered string on your own. Write whatever program you want, use whatever tool you want. For a quick example, if I put three copies of the name "ron77" in a random string, it might look like "kjFDdFhDSaGfFGron77kHj6l547d68h79ron778fk0900j90ron77ldshlkdhkdjfhdsf", and your program would have to find "ron77" in the string without knowing what to look for. Question is, can your code handle it for the file below?

(Will post mine in a few days probably.)

bplus · « **Reply #1 on:** January 26, 2020, 09:33:37 am »

So what? This is like Where's Waldo, only we don't know what Waldo looks like?

We will only know it's him if we find more than one of him?

piece of cake ;-))

STxAxTIC · « **Reply #2 on:** January 26, 2020, 09:50:16 am »

Bplus that's a wonderful way to put it!

Turns out it *is* a piece of cake - but I'm positive there are many ways to find Waldo, so to speak, with or without fancy notation. What I avoid doing is looping over the series a trillion times. It's all in how you prepare the data.

Bonus: Don't use INSTR

bplus · « **Reply #3 on:** January 26, 2020, 10:15:50 am »

Quote

Bonus: Don't use INSTR

Big Bonus PLUS! don't type with your index figures.

STxAxTIC · « **Reply #4 on:** January 26, 2020, 10:28:55 am »

Quote

Big Bonus PLUS! don't type with your index figures.

(For a guy like me I need to avoid running my life with the middle finger, not the index.)

(Assuming you meant fingers, not figures.)

bplus · « **Reply #5 on:** January 26, 2020, 11:07:07 am »

Quote from: STxAxTIC on January 26, 2020, 10:28:55 am

(For a guy like me I need to avoid running my life with the middle finger, not the index.)

(Assuming you meant fingers, not figures.)

Freud is looking up from where he is and smiling.

STxAxTIC · « **Reply #6 on:** January 27, 2020, 11:02:17 am »

Welp, the downloads leveled off. I imagine maybe one, maybe two people had a real look. Any solutions yet? By tonight I'll spill the beans. Hint: it's not a lot of code (in the right language).

bplus · « **Reply #7 on:** January 27, 2020, 11:21:41 am »

I was just settling in to check out the txt file: number of chars, char range... Without INSTR I need to come up with a different strategy.

STxAxTIC · « **Reply #8 on:** January 27, 2020, 11:26:05 am »

O man, didn't wanna discourage though. Whip up any scheme you want. I can imagine a few INSTR-based methods. The brute force factor is big with this, but it's certainly a way.

As for the character range - and this shouldn't matter much - it's basically every character you can hit on the keyboard.

bplus · « **Reply #9 on:** January 27, 2020, 11:49:22 am »

Man! STxAxTIC it's like you are all work and no play! :)

STxAxTIC · « **Reply #10 on:** January 27, 2020, 12:19:27 pm »

Hahahahahahahahaha bplusssss!!!!

I knew you could do it! Do share the code! Or you wanna wait til after? I understand. But you surely found it!

bplus · « **Reply #11 on:** January 27, 2020, 12:42:49 pm »

Actually, I did find the telling part on my very first test with a string length of 15, but that was not the longest repeated string, was it?

Nope, it is 41, and the first character is a little tricky! So much so that I wonder if you know you have this: after inserting a 40-length string in 14 random places, two of the copies happen to be preceded by the same character. What are the odds?

The 40-length string first appears at 2770, BUT the first 41-length string appears at 8451 and the second at 10072. Like the birthdays in a room of 30, you are likely to find a matching preceding character when a 40-length string is placed 14 times among random characters. (Note: I am continuously modifying code to get this data.)
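
The birthday comparison can be checked with a quick calculation. What follows is only a rough sketch in Python (not QB64), and it assumes each copy is preceded by an independent, uniformly random character drawn from the 95 printable ASCII codes 32 to 126. The function name `prob_shared_prefix` is made up for this illustration.

```python
def prob_shared_prefix(copies: int, alphabet: int) -> float:
    """Chance that at least two of `copies` insertions are preceded by the same character.

    Assumes every preceding character is independent and uniform over `alphabet` symbols.
    """
    p_all_distinct = 1.0
    for k in range(copies):
        # The (k+1)-th copy must avoid the k preceding characters already used.
        p_all_distinct *= (alphabet - k) / alphabet
    return 1.0 - p_all_distinct
```

Calling `prob_shared_prefix(14, 95)` gives a bit over 60% under these assumptions, so a 41-character repeat turning up by accident is more likely than not.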

OK done with study:

Code: QB64: [Select]

_TITLE "Find string x in random txt" 'b+ 2020-01-27
file$ = "randoms.txt"
OPEN file$ FOR BINARY AS #1
f$ = SPACE$(LOF(1))
GET #1, , f$
CLOSE #1
fl = LEN(f$)
PRINT "File String Length"; fl; "... press any to continue"
SLEEP
FOR ml = 43 TO 40 STEP -1 ' <<< start at 41 for max length string match, start at 40 to see string STxAxTIC likely inserted 14 times
    PRINT "Testing string length:"; ml
    FOR i = 1 TO fl
        match$ = MID$(f$, i, ml)
        matchi = i
        p = INSTR(i + ml, f$, match$)
        cnt = 0
        WHILE p
            cnt = cnt + 1
            IF cnt = 1 THEN foundF = foundF + 1: PRINT "A match for: "; match$
            PRINT "Match Length, match start, position, cnt:"; ml; ","; matchi; ","; p; ","; cnt
            p = INSTR(p + ml, f$, match$)
        WEND
        IF cnt THEN EXIT FOR
    NEXT
    IF foundF = 2 THEN EXIT FOR
NEXT
PRINT "done"
 
 

Updated code more in fashion of how I wanted to present report.
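
For readers without QB64, the search above can be sketched in Python. This is an approximation of the idea, not a port: `first_repeat` is a hypothetical name, `str.find` stands in for INSTR, and positions are 0-based here where the QB64 output is 1-based.

```python
def first_repeat(text: str, length: int):
    """Return (start, later_positions) for the first substring of `length`
    that occurs again further on, or None if no substring repeats."""
    for i in range(len(text) - length + 1):
        needle = text[i:i + length]
        hits = []
        p = text.find(needle, i + length)      # like INSTR(i + ml, f$, match$)
        while p != -1:
            hits.append(p)
            p = text.find(needle, p + length)  # keep looking after this match
        if hits:
            return i, hits
    return None
```

Trying lengths from large to small, as the `FOR ml = 43 TO 40 STEP -1` loop does, and stopping at the first length that returns something other than `None` gives the longest repeat within the range tried.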

STxAxTIC · « **Reply #12 on:** January 27, 2020, 09:23:39 pm »

Alright, so here is, I swear this time, the last installment of this randomness thing. I'm super happy with the whole thing start to finish, especially how bplus took the time to solve the problem. The solution below differs from bplus's by quite a bit. This one places heavy emphasis on data preparation so the algorithm can be dumb-simple. So here we go, in steps...

Note: See this PDF for a better explanation of everything all in one place: http://barnes.x10host.com/ordo%20ab%20chao/Ordo%20Ab%20Chao.pdf

1) Create the pseudo-random sequence

The data file attached at the top of this post was generated by this code:

Code: QB64: [Select]

n = 100000
asc1 = 32
asc2 = 126
ascd = asc2 - asc1
x1$ = "@ll w0rk_&_n0~pl@y m@ke$.j@ck^@^Dull b0y"
OPEN "seq_abc_0.txt" FOR OUTPUT AS #1
FOR i = 1 TO n
    IF (RND > 10 / n) THEN
        p = INT(RND * (ascd + 1)) + asc1
        PRINT #1, CHR$(p)
    ELSE
        FOR j = 1 TO LEN(x1$)
            PRINT #1, MID$(x1$, j, 1)
        NEXT
    END IF
NEXT
CLOSE #1

You can see there is a very rare chance that the string @ll w0rk_&_n0~pl@y m@ke$.j@ck^@^Dull b0y is inserted into what is otherwise a sea of letters, numbers, brackets, punctuation symbols - by all rights a random mess.
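
To put a number on "very rare": each of the n loop iterations inserts the string with probability 10 / n, so the expected number of copies is n * (10 / n) = 10 whatever n is, with a spread of roughly sqrt(10), about 3. The sketch below mirrors the generator in Python so the count can be checked. It is an illustration under assumptions: Python's `random` module is not QB64's RND, so it will not reproduce the attached file, it builds one flat string instead of printing one character per line, and the names `make_sequence` and `PLANTED` are invented here.

```python
import random

PLANTED = "@ll w0rk_&_n0~pl@y m@ke$.j@ck^@^Dull b0y"

def make_sequence(n: int = 100_000, seed: int = 0, planted: str = PLANTED):
    """Return (sequence, number_of_insertions), following the QB64 loop above."""
    rng = random.Random(seed)
    parts = []
    inserted = 0
    for _ in range(n):
        if rng.random() > 10 / n:
            # Usual case: one random printable ASCII character (codes 32..126).
            parts.append(chr(rng.randint(32, 126)))
        else:
            # Rare case: drop in the whole planted string.
            parts.append(planted)
            inserted += 1
    return "".join(parts), inserted
```

Running `make_sequence` with a few different seeds should give insertion counts scattered around 10, which is in line with the 14 copies that ended up in the posted file.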

2) Prepare data files

While I leave the formal explanation for this step in the PDF, the following code creates "regrouped" copies of the original sequence.

Code: QB64: [Select]

FOR q = 12 TO 24 STEP 4
    PRINT q
    FOR f = 0 TO q - 1
        'PRINT f
        OPEN "seq_abc_0.txt" FOR INPUT AS #1
        OPEN "seq_abc_" + LTRIM$(RTRIM$(STR$(q))) + "_" + LTRIM$(RTRIM$(STR$(f))) + ".txt" FOR OUTPUT AS #2
        y$ = ""
        qcount = 0
        ff = f
        q$ = ""
        DO WHILE NOT EOF(1)
            LINE INPUT #1, x$
            IF (ff > 0) THEN
                q$ = q$ + x$
                ff = ff - 1
            ELSE
                y$ = y$ + x$
                qcount = qcount + 1
                IF (qcount = q) THEN
                    PRINT #2, y$
                    y$ = ""
                    qcount = 0
                END IF
            END IF
        LOOP
        z$ = y$ + q$
        IF (LEN(z$) > 0) THEN
            IF (LEN(z$) < q) THEN
                PRINT #2, z$
            ELSE
                y$ = LEFT$(z$, q)
                z$ = RIGHT$(z$, LEN(z$) - LEN(y$))
                PRINT #2, y$
                PRINT #2, z$
            END IF
        END IF
        CLOSE #2
        CLOSE #1
    NEXT
NEXT
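
As a reading aid, the regrouping for one chunk size q and one offset f can be sketched in Python as below. The name `regroup` is hypothetical and the input is taken as a flat string rather than a one-character-per-line file. The behaviour is meant to match the QB64 loop: skip the first f characters, cut the rest into chunks of q, then chunk whatever is left over at the tail together with the skipped head.

```python
def regroup(seq: str, q: int, f: int) -> list:
    """Split `seq` into chunks of length q, starting f characters in.

    The incomplete tail and the skipped head are joined and chunked as well,
    so joining the result gives `seq` rotated left by f characters.
    """
    head, body = seq[:f], seq[f:]
    full = len(body) // q * q                      # characters covered by whole chunks
    chunks = [body[i:i + q] for i in range(0, full, q)]
    leftover = body[full:] + head                  # tail of the file, then the skipped head
    chunks += [leftover[i:i + q] for i in range(0, len(leftover), q)]
    return chunks
```

Each window of length q in the sequence lines up with a chunk boundary for exactly one offset f, which is why all offsets 0 to q - 1 are written out. One small difference: when the leftover is exactly q characters long, the QB64 version also prints an empty line, which this sketch omits.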

3) Processing

While QB64 is a worthy tool for anything, the Unix toolset is better adapted to string manipulation in files. Pasted below are four shell scripts. When I run ./main.sh, the entire analysis takes place and I'm handed a file called result.txt.

main.sh

Code: QB64: [Select]

#!/bin/bash
 
for k in 12 16 20 24; do
    ./combine.sh $k
done
 
for filename in data/*.dat; do
    ./core.sh $filename
done
 
./analyze.sh

combine.sh

Code: QB64: [Select]

#!/bin/bash
 
touch data/data.tmp
 
j=$1
for filename in data/seq_abc_"$j"_*.txt; do
    cat data/data.tmp $filename > data/tmpf && cat data/tmpf > data/data.tmp
done
rm data/tmpf
 
newfile=data/seq_abc_"$j"_all.dat
mv data/data.tmp $newfile

core.sh

Code: QB64: [Select]

#!/bin/bash
 
cp "$1" data_t.tmp
 
# Sort and count unique entries.
while read -r 
do
    echo "$REPLY"
done < data_t.tmp > tmpf 
sort tmpf | uniq -c | sort -nr > data_t.tmp
 
# Move the first column to the end.
awk '{first = $1; $1=""; print $0, "\t" first}' data_t.tmp > tmpf && cat tmpf > data_t.tmp
 
# Clean up.
rm tmpf
 
# Save sorted file.
mv data_t.tmp $(echo "$1" | cut -f 1 -d '.').res

analyze.sh

Code: QB64: [Select]

#!/bin/bash
 
for i in data/*.res; do
    awk 'NR>20{exit} {print $0}' $i
done > result.txt
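
For comparison, the counting that `sort | uniq -c | sort -nr` performs over the regrouped files can be sketched in a few lines of Python. This is a stand-in under assumptions, not the scripts above: it works on a flat string in memory, ignores the wrap-around chunk at the end of each regrouped file, and `top_chunks` is a made-up name.

```python
from collections import Counter

def top_chunks(seq: str, q: int, limit: int = 20):
    """Count every length-q chunk over all q offsets; return the `limit` most common."""
    counts = Counter()
    for f in range(q):                              # one pass per offset, like the q regrouped files
        for i in range(f, len(seq) - q + 1, q):
            counts[seq[i:i + q]] += 1
    return counts.most_common(limit)                # like sort | uniq -c | sort -nr | head
```

In random data nearly every chunk of 12 or more characters is unique, so anything with a count well above 1 stands out at the top of the list.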

4) Results
The file result.txt comes out to:

Code: QB64: [Select]

w0rk_&_n0~pl@y m@ke$.j@c 14
rk_&_n0~pl@y m@ke$.j@ck^ 14
pl@y m@ke$.j@ck^@^Dull b 14
n0~pl@y m@ke$.j@ck^@^Dul 14
ll w0rk_&_n0~pl@y m@ke$. 14
l@y m@ke$.j@ck^@^Dull b0 14
l w0rk_&_n0~pl@y m@ke$.j 14
k_&_n0~pl@y m@ke$.j@ck^@ 14
0rk_&_n0~pl@y m@ke$.j@ck 14
0~pl@y m@ke$.j@ck^@^Dull 14
~pl@y m@ke$.j@ck^@^Dull 14
_n0~pl@y m@ke$.j@ck^@^Du 14
_&_n0~pl@y m@ke$.j@ck^@^ 14
@y m@ke$.j@ck^@^Dull b0y 14
@ll w0rk_&_n0~pl@y m@ke$ 14
&_n0~pl@y m@ke$.j@ck^@^D 14
w0rk_&_n0~pl@y m@ke$.j@ 14
)@ll w0rk_&_n0~pl@y m@ke 2
 

(Say, bplus - was the 41-character instance that occurred twice the one at the end of my list? Didn't look for it but that's probably it!)

5) Analysis
Finally, we get to the end - and incidentally this is the only part I do by hand. (I *will* automate this if you make me). And what precisely do I do by hand? Press spacebar in the above data cluster til the columns align. Of course this would be unnecessary if I used a chunk size of 40 or 41 or whatever - but the game here is to assume nothing start to finish about the size of the repeated sub-string.

Code: QB64: [Select]

     w0rk_&_n0~pl@y m@ke$.j@c
       rk_&_n0~pl@y m@ke$.j@ck^
               pl@y m@ke$.j@ck^@^Dull b
            n0~pl@y m@ke$.j@ck^@^Dul
  ll w0rk_&_n0~pl@y m@ke$.
                l@y m@ke$.j@ck^@^Dull b0
   l w0rk_&_n0~pl@y m@ke$.j
        k_&_n0~pl@y m@ke$.j@ck^@
       0rk_&_n0~pl@y m@ke$.j@ck
              0~pl@y m@ke$.j@ck^@^Dull
              ~pl@y m@ke$.j@ck^@^Dull 
           _n0~pl@y m@ke$.j@ck^@^Du
         _&_n0~pl@y m@ke$.j@ck^@^
                 @y m@ke$.j@ck^@^Dull b0y
 @ll w0rk_&_n0~pl@y m@ke$
          &_n0~pl@y m@ke$.j@ck^@^D
     w0rk_&_n0~pl@y m@ke$.j@
)@ll w0rk_&_n0~pl@y m@ke
------------------------------------------
 @ll w0rk_&_n0~pl@y m@ke$.j@ck^@^Dull b0y
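
The by-hand alignment could be automated with a greedy overlap merge: repeatedly glue together the two fragments whose suffix and prefix overlap the most. The Python sketch below is one possible way to do it, not the method used above. `overlap` and `merge_fragments` are hypothetical names, the `min_overlap` threshold is an arbitrary guard against chance one-character overlaps, and the chunks are assumed to be intact (no leading or trailing spaces lost on the way).

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge_fragments(fragments, min_overlap: int) -> list:
    """Greedily merge the best-overlapping pair until no pair overlaps enough."""
    frags = list(dict.fromkeys(fragments))          # drop exact duplicates, keep order
    while True:
        best_k, best_pair = 0, None
        for a in frags:
            for b in frags:
                if a != b:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_pair = k, (a, b)
        if best_pair is None or best_k < min_overlap:
            return frags
        a, b = best_pair
        merged = a + b[best_k:]
        # Anything already contained in the merged string is redundant.
        frags = [f for f in frags if f not in merged] + [merged]
```

It makes sense to feed in only the chunks that share the top count (14 in the table above). Including the chunk that starts with `)` and occurs twice would extend the answer by that extra leading character.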
 

6) Remarks
... And there we have it. One pipeline, no scanning over and over. This is basically histograms without the graph part.

Again, see it described better here:

http://barnes.x10host.com/ordo%20ab%20chao/Ordo%20Ab%20Chao.pdf

... And before any wiseass wants to point out how much shorter bplus's solution is...let me remind them that... hm, actually... I'll let you figure out why for yourself :-)

bplus · « **Reply #13 on:** January 28, 2020, 12:15:42 am »

Quote

(Say, bplus - was the 41-character instance that occurred twice the one at the end of my list? Didn't look for it but that's probably it!)

Hey STxAxTIC, you didn't run my code? It took all of 30 secs to run. Here is a copy of the output (see attached).

None of your posted lists contain a 41-character string. The closest you got was 40. I did report that the first character of the 41-character string was tricky and probably a random accident next to your 40-character inserted string.

Correction: it takes 7.7 secs to run after the Sleep keypress to start.

SMcNeill · « **Reply #14 on:** January 28, 2020, 03:38:36 am »

If I was worried about trying to extract something unknown from a file filled with random data, I'd probably do something like this:

Code: QB64: [Select]

SCREEN _NEWIMAGE(800, 600, 32)
DIM SHARED text$
 
'load the data
OPEN "randoms.txt" FOR BINARY AS #1
l = LOF(1): text$ = SPACE$(l)
GET #1, 1, text$
CLOSE
 
REDIM SHARED WordArray(1000000) AS STRING
getwords 3, 50
 
 
SUB getwords (Min, Max)
    PRINT "One moment while I gather all word combinations with letters between "; Min; " and "; Max; "..."
    FOR i = Min TO Max
        FOR j = 1 TO LEN(text$)
            count = count + 1
            IF count > UBOUND(WordArray) THEN REDIM _PRESERVE WordArray(UBOUND(WordArray) * 2) AS STRING
            WordArray(count) = MID$(text$, j, i)
        NEXT
    NEXT
    PRINT count; " words counted.  Now sorting the list..."
    REDIM _PRESERVE WordArray(count) AS STRING 'free some unused memory.
    REDIM WordList(count) AS LONG, WordCount(count) AS LONG
    DIM nUnique AS LONG, maxcount AS LONG 'nUnique = number of distinct words seen so far
    FOR i = 1 TO count 'WordArray was filled starting at index 1
        w$ = WordArray(i)
        found = 0
        FOR j = 1 TO nUnique
            IF w$ = WordArray(WordList(j)) THEN
                WordCount(j) = WordCount(j) + 1 'counts duplicates beyond the first occurrence
                IF WordCount(j) > 3 THEN PRINT w$; " duplicated"; WordCount(j); "times"
                IF WordCount(j) > maxcount THEN maxcount = WordCount(j): best$ = w$
                found = -1
                EXIT FOR
            END IF
        NEXT
        IF found = 0 THEN nUnique = nUnique + 1: WordList(nUnique) = i 'first time we see this word
        IF i MOD 1000 = 0 THEN LOCATE 3, 1: PRINT i; count; nUnique
    NEXT
    PRINT best$; " was the most common repeated word. ("; maxcount; ")"
END SUB
 

Make a word list of all possible words, then simply count the number of times which those words occur in the data. The non-random data should repeat more times than the random data does, so one can then take this list, sort it, and just eyeball it for the most repeated pattern.

It seems to me that "ll " (l, l, space) is probably the most repeated pattern in the data, so eyeballing the data at that point would be a dern good place to start looking for your "inserted" pattern. ;) (Which makes sense, since you repeated it twice with "@ll " and "Dull " in your phrase.)

Of course, this little method isn't fast (in fact it's much slower than I would've thought it to be!), but if you can narrow down the word lists by changing the minimum or maximum size, it'll run much more efficiently. I only created it this way to play along with the idea that "you have no idea what might be embedded...". Since *anything* might be embedded in there, I figure the best bet is to check *everything*, though I didn't bother to look at 1 or 2 letter combos... :P

A nice sort would probably reduce the time here down to a fraction of what it currently is. Then you just check the current word and look to see if the next word is the same, and count them in one pass after they're sorted. I just didn't bother to add a sort routine in here, as I didn't want to complicate the illustration of the general process which I'd use to look for your non-random data in here. ;)
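
The sort-then-count idea from the paragraph above could look something like this in Python. It is only an illustration for a single word length, with `count_sorted_runs` as an invented name, and it leans on Python's built-in sort rather than a hand-written QB64 routine.

```python
def count_sorted_runs(text: str, length: int, min_count: int = 2):
    """Sort all substrings of `length`, then count equal neighbours in one pass."""
    words = sorted(text[i:i + length] for i in range(len(text) - length + 1))
    runs = []
    i = 0
    while i < len(words):
        j = i
        while j < len(words) and words[j] == words[i]:
            j += 1                                  # walk to the end of this run of equal words
        if j - i >= min_count:
            runs.append((words[i], j - i))
        i = j
    runs.sort(key=lambda r: r[1], reverse=True)     # most repeated first
    return runs
```

After sorting, identical words sit next to each other, so the counting is a single linear pass instead of comparing every word against every other word.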
