Which language provides the most random alphabetically sorted sequence?
Data
Sourced from comments in thread (English from image, Dutch from Vinny93, German from TJA, Turkish from some rando, Lexicographical from monogram)
Plot with Pearson Score
Code
gnuplot -p -e '
set xlabel "Base Sequence";
set ylabel "Alphabetic";
set xtics 1,1,12;
set ytics 1,1,12;
set title "Alphabetic Number Plot with Correlation Score";
set key outside left;
set size ratio 0.45;
stats "alphabetic.tab" using 1:2 name "E";
stats "" using 1:3 name "D";
stats "" using 1:4 name "G";
stats "" using 1:5 name "T";
stats "" using 1:6 name "L";
plot "" using 1:2 with lines title sprintf("%s (%.3f)", columnhead(2), E_correlation), \
     "" using 1:3 with lines title sprintf("%s (%.3f)", columnhead(3), D_correlation), \
     "" using 1:4 with lines title sprintf("%s (%.3f)", columnhead(4), G_correlation), \
     "" using 1:5 with lines title sprintf("%s (%.3f)", columnhead(5), T_correlation), \
     "" using 1:6 with lines title sprintf("%s (%.3f)", columnhead(6), L_correlation)
'
It looks like the most random language is Dutch (its score is closest to zero), and Turkish appears to be the least random (the 10, 11, 12 run probably skewed it).
However, Lexicographic also has a near-zero score despite being the most ordered, so I think Pearson is a bad measure here. A serial correlation test might be better.
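To make the Pearson vs. serial correlation point concrete, here is a rough Python sketch rather than anything from the plot above. It assumes the Lexicographical column is just 1..12 sorted as strings (alphabetic.tab itself isn't shown in the thread, so that is a guess), and it uses the plain lag-1 autocorrelation as one possible "serial correlation" measure. The function names are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def lag1_autocorrelation(ys):
    """Lag-1 serial correlation: how much each value resembles the next one."""
    n = len(ys)
    mean_y = sum(ys) / n
    num = sum((ys[i] - mean_y) * (ys[i + 1] - mean_y) for i in range(n - 1))
    den = sum((y - mean_y) ** 2 for y in ys)
    return num / den


# Assumption: base sequence is 1..12, and the lexicographic column is the
# same numbers sorted as strings ("1" < "10" < "11" < "12" < "2" < ...).
base = list(range(1, 13))
lexicographic = sorted(base, key=str)

pearson_lex = pearson(base, lexicographic)
serial_lex = lag1_autocorrelation(lexicographic)

print("order:  ", lexicographic)
print(f"pearson: {pearson_lex:.3f}")
print(f"lag-1:   {serial_lex:.3f}")
```

Under those assumptions Pearson comes out around 0.077, which matches the "near zero" observation, while the lag-1 value is about 0.208. That is higher, since most neighbours in the sequence differ by exactly 1, but it is still modest with only 12 points, so a proper test would need more thought than this.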
You put a lot of work into this.
Thank you for doing and sharing this
This is the second comment I’ve seen like this from you.
Please never stop.
I didn’t expect someone to put that much effort into it.
Thanks! This is awesome!