Alright, so here is, I swear this time, the last installment of this randomness thing. I'm super happy with the whole thing start to finish, especially how bplus took the time to solve the problem. The solution below differs from bplus's by quite a bit. This one places heavy emphasis on data preparation so the algorithm can be dumb-simple. So here we go, in steps...
Note: See this PDF for a better explanation of everything all in one place:
http://barnes.x10host.com/ordo%20ab%20chao/Ordo%20Ab%20Chao.pdf

1) Create the pseudo-random sequence

The data file attached at the top of this post was generated by this code:
n = 100000
asc1 = 32
asc2 = 126
ascd = asc2 - asc1
x1$ = "@ll w0rk_&_n0~pl@y m@ke$.j@ck^@^Dull b0y"
p = INT(RND * (ascd + 1)) + asc1
You can see there is a very rare chance that the string @ll w0rk_&_n0~pl@y m@ke$.j@ck^@^Dull b0y is inserted into what is otherwise a sea of letters, numbers, brackets, and punctuation symbols - by all rights a random mess.
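For the shell-minded, the same idea can be sketched like this - a sea of random printable characters with the marker string slipped in once in a long while. The output file name (sequence.txt) and the 1-in-5000 odds are placeholder assumptions, not the values actually used:

#!/bin/bash
# Sketch of the generator: n random printable characters (ASCII 32-126),
# with the marker string inserted at a rare random chance.
# sequence.txt and the 1-in-5000 odds are placeholders.
awk 'BEGIN {
    srand()
    n = 100000
    marker = "@ll w0rk_&_n0~pl@y m@ke$.j@ck^@^Dull b0y"
    for (i = 1; i <= n; i++) {
        if (rand() < 1 / 5000)
            printf "%s", marker                  # rare: hide the string
        else
            printf "%c", 32 + int(rand() * 95)   # one random char, 32..126
    }
    print ""
}' > sequence.txt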
2) Prepare data files

While I leave the formal explanation for this step in the PDF, the following code creates "regrouped" copies of the original sequence.
'PRINT f
y$ = ""
qcount = 0
ff = f
q$ = ""
q$ = q$ + x$
ff = ff - 1
y$ = y$ + x$
qcount = qcount + 1
y$ = ""
qcount = 0
z$ = y$ + q$
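The gist of the regrouping, in shell terms: re-cut the one-line sequence into fixed-width lines, once for every possible starting offset, so that any repeated run of characters is forced to land on identical whole lines in at least one of the copies. A sketch only - the width of 24 and the file names are placeholders, and a real run sweeps several widths, since we assume nothing about the length of the hidden string:

#!/bin/bash
# Sketch of the regrouping: one re-cut copy of the sequence per starting offset.
# The width of 24 and all file names are placeholders.
width=24
for ((offset = 0; offset < width; offset++)); do
    tail -c +"$((offset + 1))" sequence.txt | fold -w "$width" > "regroup_${width}_${offset}.dat"
done
# Stacking the copies into one file lets a single sort | uniq -c pass count
# every aligned occurrence at once.
cat regroup_${width}_*.dat > "combined_${width}.dat"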
3) Processing

While QB64 is a worthy tool for anything, the Unix toolset is better adapted to string manipulation in files. Pasted below are four shell scripts; when I run ./main.sh, the entire analysis takes place and I'm handed a file called result.txt. (A bare-bones invocation is shown after the scripts.)
main.sh

#!/bin/bash
for k in ...; do
    ./combine.sh $k
done
for filename in ...; do
    ./core.sh $filename
done
./analyze.sh
combine.sh
core.sh
#!/bin/bash
cp "$1" data_t.tmp
# Sort and count unique entries.
while read -r; do
    echo "$REPLY"
done < data_t.tmp > tmpf
sort tmpf | uniq -c | sort -nr > data_t.tmp
# Move the first column to the end.
awk '{first = $1; $1=""; print $0, "\t" first}' data_t.tmp > tmpf && cat tmpf > data_t.tmp
# Clean up.
rm tmpf
# Save sorted file.
mv data_t.tmp $(echo "$1" | cut -f 1 -d '.').res
analyze.sh

#!/bin/bash
for i in *.res; do
    awk 'NR>20{exit} {print $0}' $i
done > result.txt
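For reference, running the whole thing boils down to something like this (assuming the scripts and the regrouped data files sit in one directory):

chmod +x main.sh combine.sh core.sh analyze.sh
./main.sh             # combine -> core -> analyze
head result.txt       # peek at the top of the tally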
4) Results

The file result.txt comes out to:
w0rk_&_n0~pl@y m@ke$.j@c 14
rk_&_n0~pl@y m@ke$.j@ck^ 14
pl@y m@ke$.j@ck^@^Dull b 14
n0~pl@y m@ke$.j@ck^@^Dul 14
ll w0rk_&_n0~pl@y m@ke$. 14
l@y m@ke$.j@ck^@^Dull b0 14
l w0rk_&_n0~pl@y m@ke$.j 14
k_&_n0~pl@y m@ke$.j@ck^@ 14
0rk_&_n0~pl@y m@ke$.j@ck 14
0~pl@y m@ke$.j@ck^@^Dull 14
~pl@y m@ke$.j@ck^@^Dull 14
_n0~pl@y m@ke$.j@ck^@^Du 14
_&_n0~pl@y m@ke$.j@ck^@^ 14
@y m@ke$.j@ck^@^Dull b0y 14
@ll w0rk_&_n0~pl@y m@ke$ 14
&_n0~pl@y m@ke$.j@ck^@^D 14
w0rk_&_n0~pl@y m@ke$.j@ 14
)@ll w0rk_&_n0~pl@y m@ke 2
(Say, bplus - was the 41-character instance that occurred twice the one at the end of my list? Didn't look for it, but that's probably it!)
5) Analysis

Finally, we get to the end - and incidentally this is the only part I do by hand. (I *will* automate this if you make me; a sketch of one way to do it follows the alignment below.) And what precisely do I do by hand? Press the spacebar in the above data cluster until the columns align. Of course this would be unnecessary if I used a chunk size of 40 or 41 or whatever - but the game here is to assume nothing, start to finish, about the size of the repeated sub-string.
     w0rk_&_n0~pl@y m@ke$.j@c
       rk_&_n0~pl@y m@ke$.j@ck^
               pl@y m@ke$.j@ck^@^Dull b
            n0~pl@y m@ke$.j@ck^@^Dul
  ll w0rk_&_n0~pl@y m@ke$.
                l@y m@ke$.j@ck^@^Dull b0
   l w0rk_&_n0~pl@y m@ke$.j
        k_&_n0~pl@y m@ke$.j@ck^@
      0rk_&_n0~pl@y m@ke$.j@ck
             0~pl@y m@ke$.j@ck^@^Dull
              ~pl@y m@ke$.j@ck^@^Dull
           _n0~pl@y m@ke$.j@ck^@^Du
         _&_n0~pl@y m@ke$.j@ck^@^
                 @y m@ke$.j@ck^@^Dull b0y
 @ll w0rk_&_n0~pl@y m@ke$
          &_n0~pl@y m@ke$.j@ck^@^D
     w0rk_&_n0~pl@y m@ke$.j@
)@ll w0rk_&_n0~pl@y m@ke
------------------------------------------
 @ll w0rk_&_n0~pl@y m@ke$.j@ck^@^Dull b0y
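For anyone who would rather not press the spacebar: one way to automate the merge is to glue the chunks together greedily by their largest prefix/suffix overlap until a single string remains. This is only a sketch, not part of the pipeline above - it assumes the chunk strings (counts stripped) sit one per line in a file called chunks.txt:

#!/bin/bash
# Sketch: greedily glue chunks together by their largest overlap.
# chunks.txt (one chunk per line, counts removed) is an assumed input file.
awk '
function overlap(a, b,    m, k) {   # longest suffix of a that is also a prefix of b
    m = (length(a) < length(b)) ? length(a) : length(b)
    for (k = m; k > 0; k--)
        if (substr(a, length(a) - k + 1) == substr(b, 1, k)) return k
    return 0
}
{ chunk[NR] = $0 }
END {
    merged = chunk[1]; used[1] = 1
    for (pass = 2; pass <= NR; pass++) {
        best = 0; pick = 0
        for (i = 1; i <= NR; i++) {
            if (used[i]) continue
            if (index(merged, chunk[i])) { used[i] = 1; continue }   # already covered
            r = overlap(merged, chunk[i])   # would extend merged to the right
            l = overlap(chunk[i], merged)   # would extend merged to the left
            if (r > best) { best = r; pick = i; side = "R" }
            if (l > best) { best = l; pick = i; side = "L" }
        }
        if (pick == 0) break
        if (side == "R") merged = merged substr(chunk[pick], best + 1)
        else merged = substr(chunk[pick], 1, length(chunk[pick]) - best) merged
        used[pick] = 1
    }
    print merged
}' chunks.txt

Fed the chunk strings above, this should hand back the full sentence in one shot (stray ')' and all, if that last low-count line is included).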
6) Remarks

... And there we have it. One pipeline, no scanning over and over. This is basically histograms without the graph part.
Again, see it described better here:
http://barnes.x10host.com/ordo%20ab%20chao/Ordo%20Ab%20Chao.pdf

... And before any wiseass wants to point out how much shorter bplus's solution is... let me remind them that... hm, actually... I'll let you figure out why for yourself :-)