Author Topic: Generalized study of randomness  (Read 5696 times)


Offline STxAxTIC

  • Library Staff
  • Forum Resident
  • Posts: 1091
  • he lives
    • View Profile
Generalized study of randomness
« on: January 25, 2020, 10:17:31 pm »
Alright, so I think I've more-or-less maxed out this idea for now.

In a very tight nutshell, putting it all in place,

(1) I prototyped an idea about analyzing pseudo-randomness publicly here: https://www.qb64.org/forum/index.php?topic=2095.0

(2) ... And that led to what I figured was a "complete" study, https://www.qb64.org/forum/index.php?topic=2104.0, which is all contained in this download: http://barnes.x10host.com/ordo%20ab%20chao/Ordo%20Ab%20Chao.pdf

And now the general case has been worked out for arbitrary everything - it'll work on binary, numbers, strings, hex, whatever. The method doesn't care at all. What I will do next is write it up formally and add it to the above document, but in the meantime I want to pose an example I'll solve as a fun question.

I will attach to this post a file of at least 100,000 pseudo-random characters. I'll state up front that contained in the file are a few identical copies of an ordered string. The string occurs incredibly sparsely, is made of lots of weird characters, and is hard to pick out by eye, but it's certainly in the file.

What may interest some of you, before I write up my solution, is to try to find the ordered string on your own. Write whatever program you want, use whatever tool you want. For a quick example, if I put three copies of the name "ron77" in a random string, it might look like "kjFDdFhDSaGfFGron77kHj6l547d68h79ron778fk0900j90ron77ldshlkdhkdjfhdsf", and your program would have to find "ron77" in the string without knowing what to look for. Question is, can your code handle it for the file below?
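For anyone who wants a head start on the hunt: one general trick is to slide a fixed-length window over the data and histogram the windows - planted copies repeat, while random windows (almost) never do. A minimal shell sketch of that idea on the ron77 example above (knowing the window length in advance is a cheat; on real data you'd sweep a range of lengths):

```shell
# Slide a length-5 window over the example string and count repeated windows.
# Window length 5 matches "ron77" only for demo convenience; in the real
# problem you would sweep a range of lengths.
s='kjFDdFhDSaGfFGron77kHj6l547d68h79ron778fk0900j90ron77ldshlkdhkdjfhdsf'
n=5
for ((i = 0; i <= ${#s} - n; i++)); do
    echo "${s:i:n}"
done | sort | uniq -c | sort -rn | head -3
# the top line shows ron77 with a count of 3 -- the planted name surfaces immediately
```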

(Will post mine in a few days probably.)
* randoms.txt (Filesize: 98.19 KB, Downloads: 167)
« Last Edit: January 25, 2020, 10:23:50 pm by STxAxTIC »
You're not done when it works, you're done when it's right.

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
    • View Profile
Re: Generalized study of randomness
« Reply #1 on: January 26, 2020, 09:33:37 am »
So what? This is like Where's Waldo, only we don't know what Waldo looks like?

We will only know it's him if we find more than one of him?

piece of cake ;-))

Offline STxAxTIC

  • Library Staff
  • Forum Resident
  • Posts: 1091
  • he lives
    • View Profile
Re: Generalized study of randomness
« Reply #2 on: January 26, 2020, 09:50:16 am »
Bplus that's a wonderful way to put it!

Turns out it *is* a piece of cake - but I'm positive there are many ways to find Waldo, so to speak, with or without fancy notation. What I avoid doing is looping over the series a trillion times. It's all in how you prepare the data.

Bonus: Don't use INSTR
« Last Edit: January 26, 2020, 09:59:00 am by STxAxTIC »

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
    • View Profile
Re: Generalized study of randomness
« Reply #3 on: January 26, 2020, 10:15:50 am »
Quote
Bonus: Don't use INSTR

Big Bonus PLUS! don't type with your index figures.

Offline STxAxTIC

  • Library Staff
  • Forum Resident
  • Posts: 1091
  • he lives
    • View Profile
Re: Generalized study of randomness
« Reply #4 on: January 26, 2020, 10:28:55 am »
Quote
Big Bonus PLUS! don't type with your index figures.

(For a guy like me I need to avoid running my life with the middle finger, not the index.)

(Assuming you meant fingers, not figures.)
« Last Edit: January 26, 2020, 10:30:53 am by STxAxTIC »

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
    • View Profile
Re: Generalized study of randomness
« Reply #5 on: January 26, 2020, 11:07:07 am »
Quote
(For a guy like me I need to avoid running my life with the middle finger, not the index.)

(Assuming you meant fingers, not figures.)

Freud is looking up from where he is and smiling.
« Last Edit: January 26, 2020, 11:08:14 am by bplus »

Offline STxAxTIC

  • Library Staff
  • Forum Resident
  • Posts: 1091
  • he lives
    • View Profile
Re: Generalized study of randomness
« Reply #6 on: January 27, 2020, 11:02:17 am »
Welp, the downloads leveled off. I imagine maybe one, maybe two people had a real look. Any solutions yet? By tonight I'll spill the beans. Hint: it's not a lot of code (in the right language).

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
    • View Profile
Re: Generalized study of randomness
« Reply #7 on: January 27, 2020, 11:21:41 am »
I was just settling in to check out the txt file: number of chars, char range... Without INSTR I need to come up with a different strategy.

Offline STxAxTIC

  • Library Staff
  • Forum Resident
  • Posts: 1091
  • he lives
    • View Profile
Re: Generalized study of randomness
« Reply #8 on: January 27, 2020, 11:26:05 am »
Oh man, didn't wanna discourage you though. Whip up any scheme you want. I can imagine a few INSTR-based methods. The brute force factor is big with this, but it's certainly a way.

As for the character range - and this shouldn't matter much - it's basically every character you can hit on the keyboard.
« Last Edit: January 27, 2020, 11:33:36 am by STxAxTIC »

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
    • View Profile
Re: Generalized study of randomness
« Reply #9 on: January 27, 2020, 11:49:22 am »
Man! STxAxTIC it's like you are all work and no play! :)

Offline STxAxTIC

  • Library Staff
  • Forum Resident
  • Posts: 1091
  • he lives
    • View Profile
Re: Generalized study of randomness
« Reply #10 on: January 27, 2020, 12:19:27 pm »
Hahahahahahahahaha bplusssss!!!!

I knew you could do it! Do share the code! Or you wanna wait til after? I understand. But you surely found it!

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
    • View Profile
Re: Generalized study of randomness
« Reply #11 on: January 27, 2020, 12:42:49 pm »
Actually I did find the telling part on my very first test with string length of 15, but that was not the longest repeated string, was it?

Nope, it is 41, and the first character is a little tricky! So much so that I wonder if you know you have this: after inserting a 40-length string in 14 random places, one of them ended up with the same preceding character as another. What are the odds?

The 40-length string first appears at 2770, BUT the first 41-length string appears at 8451 and the 2nd at 10072. Like birthdays in a room of 30 people, you are likely to find a matching first character when randomly placing a 40-length string 14 times among random characters. (Note: I am continuously modifying code to get this data.)

OK done with study:
Code: QB64:
_TITLE "Find string x in random txt" 'b+ 2020-01-27
file$ = "randoms.txt"
OPEN file$ FOR BINARY AS #1
f$ = SPACE$(LOF(1))
GET #1, , f$
CLOSE #1
fl = LEN(f$)
PRINT "File String Length"; fl; "... press any to continue"
SLEEP
FOR ml = 43 TO 40 STEP -1 ' <<< start at 41 for max length string match, start at 40 to see string STxAxTIC likely inserted 14 times
    PRINT "Testing string length:"; ml
    FOR i = 1 TO fl
        match$ = MID$(f$, i, ml)
        matchi = i
        p = INSTR(i + ml, f$, match$)
        cnt = 0
        WHILE p
            cnt = cnt + 1
            IF cnt = 1 THEN foundF = foundF + 1: PRINT "A match for: "; match$
            PRINT "Match Length, match start, position, cnt:"; ml; ","; matchi; ","; p; ","; cnt
            p = INSTR(p + ml, f$, match$)
        WEND
        IF cnt THEN EXIT FOR
    NEXT
    IF foundF = 2 THEN EXIT FOR
NEXT
PRINT "done"

Updated code, more in the fashion of how I wanted to present the report.
« Last Edit: January 27, 2020, 01:37:21 pm by bplus »

Offline STxAxTIC

  • Library Staff
  • Forum Resident
  • Posts: 1091
  • he lives
    • View Profile
Re: Generalized study of randomness
« Reply #12 on: January 27, 2020, 09:23:39 pm »
Alright, so here is, I swear this time, the last installment of this randomness thing. I'm super happy with the whole thing start to finish, especially how bplus took the time to solve the problem. The solution below differs from bplus's by quite a bit. This one places heavy emphasis on data preparation so the algorithm can be dumb-simple. So here we go, in steps...

Note: See this PDF for a better explanation of everything all in one place: http://barnes.x10host.com/ordo%20ab%20chao/Ordo%20Ab%20Chao.pdf

1) Create the pseudo-random sequence

The data file attached at the top of this post was generated by this code:
Code: QB64:
n = 100000
asc1 = 32
asc2 = 126
ascd = asc2 - asc1
x1$ = "@ll w0rk_&_n0~pl@y m@ke$.j@ck^@^Dull b0y"
OPEN "seq_abc_0.txt" FOR OUTPUT AS #1
FOR i = 1 TO n
    IF (RND > 10 / n) THEN
        p = INT(RND * (ascd + 1)) + asc1
        PRINT #1, CHR$(p)
    ELSE
        FOR j = 1 TO LEN(x1$)
            PRINT #1, MID$(x1$, j, 1)
        NEXT
    END IF
NEXT
CLOSE #1

You can see there is a very rare chance that the string @ll w0rk_&_n0~pl@y m@ke$.j@ck^@^Dull b0y is inserted into what is otherwise a sea of letters, numbers, brackets, punctuation symbols - by all rights a random mess. Since each of the n = 100000 positions has a 10/n chance of triggering an insertion, you expect about 10 copies; this particular run happened to land 14.

2) Prepare data files

While I leave the formal explanation for this step in the PDF, the following code creates "regrouped" copies of the original sequence.
Code: QB64:
FOR q = 12 TO 24 STEP 4
    PRINT q
    FOR f = 0 TO q - 1
        'PRINT f
        OPEN "seq_abc_0.txt" FOR INPUT AS #1
        OPEN "seq_abc_" + LTRIM$(RTRIM$(STR$(q))) + "_" + LTRIM$(RTRIM$(STR$(f))) + ".txt" FOR OUTPUT AS #2
        y$ = ""
        qcount = 0
        ff = f
        q$ = ""
        DO WHILE NOT EOF(1)
            LINE INPUT #1, x$
            IF (ff > 0) THEN
                q$ = q$ + x$
                ff = ff - 1
            ELSE
                y$ = y$ + x$
                qcount = qcount + 1
                IF (qcount = q) THEN
                    PRINT #2, y$
                    y$ = ""
                    qcount = 0
                END IF
            END IF
        LOOP
        z$ = y$ + q$
        IF (LEN(z$) > 0) THEN
            IF (LEN(z$) < q) THEN
                PRINT #2, z$
            ELSE
                y$ = LEFT$(z$, q)
                z$ = RIGHT$(z$, LEN(z$) - LEN(y$))
                PRINT #2, y$
                PRINT #2, z$
            END IF
        END IF
        CLOSE #2
        CLOSE #1
    NEXT
NEXT
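For what it's worth, the same regrouping can be sketched with standard Unix tools instead of QB64. This assumes a one-character-per-line input like seq_abc_0.txt from step 1; `seq_demo.txt`, `q`, and `f` below are stand-in demo values, and one difference is that the QB64 code tacks the skipped prefix back onto the tail, which this one-liner simply drops:

```shell
# Demo stand-in for the one-character-per-line seq_abc_0.txt from step 1.
printf '%s\n' a b c d e f g h i j > seq_demo.txt

q=4 # group size (line width of the regrouped file)
f=1 # frame offset: characters to skip before grouping starts

# Flatten to one long string, drop the first f characters, re-wrap at width q.
tr -d '\n' < seq_demo.txt | tail -c +"$((f + 1))" | fold -w "$q"
# prints:
# bcde
# fghi
# j
```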

3) Processing

While QB64 is a worthy tool for anything, the Unix toolset is better adapted to string manipulation in files. Pasted below are four shell scripts; when I run ./main.sh, the entire analysis takes place and I'm handed a file called result.txt.

main.sh
Code: Bash:
#!/bin/bash

for k in 12 16 20 24; do
    ./combine.sh $k
done

for filename in data/*.dat; do
    ./core.sh $filename
done

./analyze.sh

combine.sh
Code: Bash:
#!/bin/bash

touch data/data.tmp

j=$1
for filename in data/seq_abc_"$j"_*.txt; do
    cat data/data.tmp $filename > data/tmpf && cat data/tmpf > data/data.tmp
done
rm data/tmpf

newfile=data/seq_abc_"$j"_all.dat
mv data/data.tmp $newfile

core.sh
Code: Bash:
#!/bin/bash

cp "$1" data_t.tmp

# Sort and count unique entries.
while read -r; do
    echo "$REPLY"
done < data_t.tmp > tmpf
sort tmpf | uniq -c | sort -nr > data_t.tmp

# Move the first column to the end.
awk '{first = $1; $1=""; print $0, "\t" first}' data_t.tmp > tmpf && cat tmpf > data_t.tmp

# Clean up.
rm tmpf

# Save sorted file.
mv data_t.tmp $(echo "$1" | cut -f 1 -d '.').res

analyze.sh
Code: Bash:
#!/bin/bash

for i in data/*.res; do
    awk 'NR>20{exit} {print $0}' $i
done > result.txt

4) Results
The file result.txt comes out to:

Code:
w0rk_&_n0~pl@y m@ke$.j@c 14
rk_&_n0~pl@y m@ke$.j@ck^ 14
pl@y m@ke$.j@ck^@^Dull b 14
n0~pl@y m@ke$.j@ck^@^Dul 14
ll w0rk_&_n0~pl@y m@ke$. 14
l@y m@ke$.j@ck^@^Dull b0 14
l w0rk_&_n0~pl@y m@ke$.j 14
k_&_n0~pl@y m@ke$.j@ck^@ 14
0rk_&_n0~pl@y m@ke$.j@ck 14
0~pl@y m@ke$.j@ck^@^Dull 14
~pl@y m@ke$.j@ck^@^Dull 14
_n0~pl@y m@ke$.j@ck^@^Du 14
_&_n0~pl@y m@ke$.j@ck^@^ 14
@y m@ke$.j@ck^@^Dull b0y 14
@ll w0rk_&_n0~pl@y m@ke$ 14
&_n0~pl@y m@ke$.j@ck^@^D 14
 w0rk_&_n0~pl@y m@ke$.j@ 14
)@ll w0rk_&_n0~pl@y m@ke 2

(Say, bplus - was the 41-character instance that occurred twice the one at the end of my list? Didn't look for it but that's probably it!)

5) Analysis
Finally, we get to the end - and incidentally this is the only part I do by hand. (I *will* automate this if you make me). And what precisely do I do by hand? Press spacebar in the above data cluster til the columns align. Of course this would be unnecessary if I used a chunk size of 40 or 41 or whatever - but the game here is to assume nothing start to finish about the size of the repeated sub-string.

Code:
     w0rk_&_n0~pl@y m@ke$.j@c
       rk_&_n0~pl@y m@ke$.j@ck^
               pl@y m@ke$.j@ck^@^Dull b
            n0~pl@y m@ke$.j@ck^@^Dul
  ll w0rk_&_n0~pl@y m@ke$.
                l@y m@ke$.j@ck^@^Dull b0
   l w0rk_&_n0~pl@y m@ke$.j
        k_&_n0~pl@y m@ke$.j@ck^@
      0rk_&_n0~pl@y m@ke$.j@ck
             0~pl@y m@ke$.j@ck^@^Dull
              ~pl@y m@ke$.j@ck^@^Dull
           _n0~pl@y m@ke$.j@ck^@^Du
         _&_n0~pl@y m@ke$.j@ck^@^
                 @y m@ke$.j@ck^@^Dull b0y
 @ll w0rk_&_n0~pl@y m@ke$
          &_n0~pl@y m@ke$.j@ck^@^D
     w0rk_&_n0~pl@y m@ke$.j@
)@ll w0rk_&_n0~pl@y m@ke
------------------------------------------
 @ll w0rk_&_n0~pl@y m@ke$.j@ck^@^Dull b0y

6) Remarks
... And there we have it. One pipeline, no scanning over and over. This is basically histograms without the graph part.
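To see the "histograms without the graph part" point in miniature, here is a self-contained toy: hide a phrase in lowercase junk, then run the window-and-count pass in one go. `W@LD0`, the junk generator, and the sizes are all made-up demo values; pointing the same pass at randoms.txt (with a window near 24) reproduces the table above.

```shell
# Build a toy "random" string: lowercase junk with the marker W@LD0 planted 3 times.
# Because the junk alphabet is a-z only, no random window can imitate the marker.
junk() { head -c 2000 /dev/urandom | tr -dc 'a-z'; }
s="$(junk)W@LD0$(junk)W@LD0$(junk)W@LD0$(junk)"

w=5 # window width; matches the marker length only for demo convenience
for ((i = 0; i <= ${#s} - w; i++)); do
    echo "${s:i:w}"
done | sort | uniq -c | sort -rn | head -5
# W@LD0 tops the list with a count of 3; everything else sits at 1 (rarely 2)
```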

Again, see it described better here:

http://barnes.x10host.com/ordo%20ab%20chao/Ordo%20Ab%20Chao.pdf

... And before any wiseass wants to point out how much shorter bplus's solution is...let me remind them that... hm, actually... I'll let you figure out why for yourself :-)
« Last Edit: January 27, 2020, 09:33:15 pm by STxAxTIC »

Offline bplus

  • Global Moderator
  • Forum Resident
  • Posts: 8053
  • b = b + ...
    • View Profile
Re: Generalized study of randomness
« Reply #13 on: January 28, 2020, 12:15:42 am »
Quote
(Say, bplus - was the 41-character instance that occurred twice the one at the end of my list? Didn't look for it but that's probably it!)

Hey STxAxTIC, you didn't run my code? It took all of 30 secs to run, here is a copy of the output (see attached).

None of your posted lists contain a 41-character string. The closest you got was 40. I did report that the first character of the 41-character string was tricky and probably a random accident on top of your 40-character inserted string.

 
41 char string_LI.jpg



Correction: it takes 7.7 secs to run after the SLEEP keypress to start.

« Last Edit: January 28, 2020, 12:27:00 am by bplus »

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Generalized study of randomness
« Reply #14 on: January 28, 2020, 03:38:36 am »
If I were worried about trying to extract something unknown from a file filled with random data, I'd probably do something like this:

Code: QB64:
SCREEN _NEWIMAGE(800, 600, 32)
DIM SHARED text$

'load the data
OPEN "randoms.txt" FOR BINARY AS #1
l = LOF(1): text$ = SPACE$(l)
GET #1, 1, text$
CLOSE #1

REDIM SHARED WordArray(1000000) AS STRING
getwords 3, 50

SUB getwords (Min, Max)
    PRINT "One moment while I gather all word combinations with letters between "; Min; " and "; Max; "..."
    FOR i = Min TO Max
        FOR j = 1 TO LEN(text$)
            count = count + 1
            IF count > UBOUND(WordArray) THEN REDIM _PRESERVE WordArray(UBOUND(WordArray) * 2) AS STRING
            WordArray(count) = MID$(text$, j, i)
        NEXT
    NEXT
    PRINT count; " words counted.  Now sorting the list..."
    REDIM _PRESERVE WordArray(count) AS STRING 'free some unused memory.
    REDIM WordCount AS LONG, WordList(count) AS LONG, WordCount(count) AS LONG
    FOR i = 0 TO count
        w$ = WordArray(i)
        FOR j = 0 TO WordCount
            IF LEN(WordArray(j)) > LEN(WordArray(i)) THEN EXIT FOR
            IF w$ = WordArray(WordList(j)) THEN
                WordCount(j) = WordCount(j) + 1
                IF WordCount(j) > maxcount THEN maxcount = WordCount(j): maxword$ = w$ 'track the current leader
                IF WordCount(j) > 3 THEN PRINT w$; " duplicated"; WordCount(j); "times"
                EXIT FOR
            END IF
        NEXT
        IF j = WordCount + 1 THEN WordCount = j: WordList(j) = i
        IF i MOD 1000 = 0 THEN LOCATE 3, 1: PRINT i; count; WordCount
    NEXT
    PRINT maxword$; " was the most common repeated word. ("; maxcount; ")"
END SUB

Make a word list of all possible words, then simply count the number of times those words occur in the data.  The non-random data should repeat more times than the random data does, so one can then take this list, sort it, and just eyeball it for the most repeated pattern.

It seems to me that "ll " (l, l, space) is probably the most repeated pattern in the data, so eyeballing the file at that point would be a dern good place to start looking for your "inserted" pattern.  ;)   (Which makes sense, since you repeated it twice with "@ll " and "Dull " in your phrase.)

Of course, this little method isn't fast (it's much slower than I would've thought it'd be!), but if you can narrow down the word list by changing the minimum or maximum size, it'll be much more efficient.  I only created it this way to play along with the idea that "you have no idea what might be embedded...".   Since *anything* might be embedded in there, I figure the best bet is to check *everything*, though I didn't bother to look at 1 or 2 letter combos...  :P



A nice sort would probably reduce the time here down to a fraction of what it currently is.  Then you just check the current word and look to see if the next word is the same, and count them in one pass after they're sorted.  I just didn't bother to add a sort routine in here, as I didn't want to complicate the illustration of the general process which I'd use to look for your non-random data in here.  ;)
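The one-pass count Steve describes - sort first so duplicates sit next to each other, then compare each word with the previous one - is a classic; a small awk sketch of the idea on some made-up words:

```shell
# After sorting, equal words are adjacent, so one pass with a single
# "previous word" variable counts every duplicate group.
printf '%s\n' ron77 kjF ron77 dFh ron77 aGf \
    | sort \
    | awk '$0 == prev { c++; next }
           { if (c > 1) print prev, c; prev = $0; c = 1 }
           END { if (c > 1) print prev, c }'
# prints: ron77 3
```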
« Last Edit: January 28, 2020, 03:43:29 am by SMcNeill »
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!