Author Topic: Print Unicode, UTF-8  (Read 6135 times)

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Print Unicode, UTF-8
« on: January 01, 2020, 04:46:05 pm »
A couple of quick routines which I worked up to print Unicode characters for us, and to convert UTF-8 encoded text to Unicode code points, so that they'll print properly for us. 

Code: QB64:
SCREEN _NEWIMAGE(800, 600, 32)
f = _LOADFONT("cyberbit.ttf", 16, "monospace")
_FONT f 'make the loaded font the active one

DIM a(3) AS _UNSIGNED _BYTE 'up to 4 bytes of one UTF-8 sequence
DIM b AS _UNSIGNED _BYTE 'bytes consumed by the last conversion

FOR j = 0 TO 65000 STEP 1000
    CLS
    COLOR _RGB32(255, 255, 0)
    PRINT "UNICODE Values"; j; " to"; j + 999
    COLOR -1
    FOR i = 0 TO 999
        PrintUniCode i + j
    NEXT
    SLEEP
NEXT

OPEN "Petr Text.txt" FOR BINARY AS #1

b = 1
DO UNTIL p + b > LOF(1) 'stop once the next sequence would start past the end of the file
    p = p + b
    FOR j = 0 TO 3
        GET #1, p + j, a(j)
    NEXT
    UTF8 = ConvertUTF8toUniCode(a(), b)
    PrintUniCode UTF8

    SLEEP
LOOP
CLOSE #1
END

FUNCTION ConvertUTF8toUniCode (a() AS _UNSIGNED _BYTE, b AS _UNSIGNED _BYTE)
    'b tells us how many bytes we used in this conversion,
    'so we know how far to move a pointer when reading from a file or UTF-8 encoded string
    SELECT CASE a(0) 'first byte is the control byte which tells us how many bytes we need to make our UTF-8 character
        CASE IS < 194: ConvertUTF8toUniCode = a(0): b = 1 'one byte character
        CASE 194 TO 223: ConvertUTF8toUniCode = (64 * (a(0) - 194) + a(1)): b = 2 'two byte character
        CASE 224 TO 239: b = 3 '3byte UTF-8 symbol  (I haven't sorted out the conversion for these yet)
        CASE 240 TO 255: b = 4 '4byte UTF-8 symbol
    END SELECT
END FUNCTION

SUB PrintUniCode (code AS LONG)
    IF code < 1 OR code > 65535 THEN EXIT SUB
    _MAPUNICODE code TO 0
    PRINT CHR$(0);
    _MAPUNICODE 0 TO 0
END SUB

Note: This uses some text that Petr shared on the forums here, in the past, so I hope he won't mind me using it for testing purposes. 

I haven't quite sorted out the work needed to convert 3 and 4 byte UTF-8 values to Unicode, but just being able to read and convert 1 and 2-byte characters should cover most documents which the general public would use normally.  One and two byte sequences reach every code point up to 2047 (U+07FF), which takes in the accented Latin letters plus Greek and Cyrillic.  Some common symbols, such as curly quotes and the euro sign, need three bytes.

I'm thinking 3-byte UTF-8 sequences are Japanese Kanji/Chinese characters, and such, and I have no idea what the 4-byte UTF-8 sequences represent.  Maybe some of the Emoji, or scientific symbols/notations?  I honestly don't know, as I personally tend to type exclusively in plain old unaccented American English, and the old 7-bit ASCII codes are usually enough for my needs...
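
For reference, here is a small sketch of which code point ranges need how many UTF-8 bytes. It is written in Python rather than QB64, purely as a cross-check against Python's built-in encoder; the function name `utf8_length` is made up for this illustration.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for one code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if code_point <= 0x7F:
        return 1  # ASCII
    if code_point <= 0x7FF:
        return 2  # accented Latin, Greek, Cyrillic, Hebrew, Arabic, ...
    if code_point <= 0xFFFF:
        return 3  # rest of the Basic Multilingual Plane, including CJK
    return 4      # supplementary planes, including most emoji


# Compare the arithmetic answer with what Python's own encoder produces.
for ch in ("A", "é", "Ж", "漢", "😀"):
    print(ch, hex(ord(ch)), utf8_length(ord(ch)), len(ch.encode("utf-8")))
```

So the guess above is roughly right: Chinese and Japanese characters land in the 3-byte range, and most emoji need 4 bytes.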

Still though, I've been breaking my brain working with the keyhit remapper, so I thought I'd swap gears and take a break for a small bit...

So I swap over to trying to decode Unicode, UTF-8, and ANSI/ASCII encodings instead.  LOL!!



Anywho...  If you folks who tend to speak and write in those languages with these extended code sets could test things out, I'd appreciate it.  As far as I can tell, it works with what I've tested it with so far, but my eyes are old, tired, and I don't know the languages well enough to tell an A-forward slash from an A-backward slash from an A-circle over top...   Hell, I don't even know what the heck you actually call those accents...

I hope it works as it should, but if not, just let me know what's wrong with it, and I'll try and sort it out, as long as somebody is willing to be patient and work with me and tell me exactly what the heck is off with the conversion processes so I can fix them.  ;)
* Petr Text.txt (Filesize: 0.22 KB, Downloads: 218)
« Last Edit: January 01, 2020, 06:10:12 pm by SMcNeill »
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Petr

  • Forum Resident
  • Posts: 1720
  • The best code is the DNA of the hops.
    • View Profile
Re: Print Unicode, UTF-8
« Reply #1 on: January 01, 2020, 05:22:06 pm »
Hi Steve,
it does not work properly. If I remember correctly, the point of that thread was to move characters above 127, which carry non-English diacritics, back down to letters compatible with English, i.e. characters up to 127? To do this, you would need a function that returns the locale identifier (LCID) of the local national language (PowerShell's Get-Culture can be used, or a similar command on Linux) and then use DATA to map the high characters from an ASCII table down to English characters. Accents will not be displayed correctly, but the text will be readable. I think this was the point?
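
The "readable without accents" idea can be sketched as follows. Note that this is a different technique from the LCID plus DATA table described above: it uses Unicode normalization from Python's standard library, and the helper name `strip_accents` is only an illustration.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Split accented letters into base letter + combining mark, then drop the marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Příliš žluťoučký kůň"))
```

Letters that have no decomposition (for example the Polish ł) pass through unchanged, so a lookup table would still be needed for those.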

I enclose the output of your program and also my ascii table, as is set by default for my language.

 
UTF8 output.JPG


 
BinASCII.JPG

Offline Petr

Re: Print Unicode, UTF-8
« Reply #2 on: January 01, 2020, 05:27:46 pm »
I can work on this program with you tomorrow; it's almost midnight here and I have to go to work tomorrow, so I need to get some sleep.

Offline SMcNeill

Re: Print Unicode, UTF-8
« Reply #3 on: January 01, 2020, 05:36:27 pm »
Quote
Hi Steve,
it does not work properly.…

It’s not using the ASCII table, or code pages for us.  Instead, we should be reading the UTF-8 directly, converting it, and then using the Unicode characters in the cyberbit.ttf font which comes with QB64. 
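
To illustrate the difference between the two readings, here is a hedged Python sketch (not QB64) that takes the two UTF-8 bytes of one Czech letter and interprets them both ways. Code page 437 is used only as an example of a single-byte code page.

```python
raw = "č".encode("utf-8")          # the two bytes C4 8D as they sit in a UTF-8 file

as_code_page = raw.decode("cp437")  # one character per byte: two wrong glyphs
as_utf8 = raw.decode("utf-8")       # both bytes combined: one character, U+010D

print(raw.hex(), repr(as_code_page), repr(as_utf8), hex(ord(as_utf8)))
```

Read through a code page, every byte becomes its own character; read as UTF-8, the lead byte and its continuation byte are combined into a single code point.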

Do the Unicode characters display properly at the start, or is something wrong with them to begin with, which isn’t working for you?

Offline SMcNeill

Re: Print Unicode, UTF-8
« Reply #4 on: January 01, 2020, 06:07:18 pm »
You're right Petr; it's not working properly there.

It's odd, because I get the correct outputs here:
Code: QB64:
SCREEN _NEWIMAGE(800, 600, 32)
f = _LOADFONT("cyberbit.ttf", 16, "monospace")
_FONT f 'make the loaded font the active one

DIM a(1) AS _UNSIGNED _BYTE 'first byte and, if needed, second byte

GOTO skip_test
FOR j = 0 TO 65000 STEP 1000
    CLS
    COLOR _RGB32(255, 255, 0)
    PRINT "UNICODE Values"; j; " to"; j + 999
    COLOR -1
    FOR i = 0 TO 999
        PrintUniCode i + j
    NEXT
    SLEEP
NEXT

skip_test:

OPEN "Petr Text.txt" FOR BINARY AS #1

DO UNTIL p >= LOF(1)
    p = p + 1
    GET #1, p, a(0)
    IF a(0) > 193 THEN
        GET #1, p + 1, a(1)
        p = p + 1
        '       i = a(0) * 256 + a(1)
        SELECT CASE a(0) '194 to 223
            CASE IS < 194: PrintUniCode i 'never reached, since a(0) > 193 here
            CASE 194 TO 223: PrintUniCode (a(1) + 64 * (a(0) - 194))
            CASE ELSE: PRINT "U:"; a(0), a(1) 'it's a 3 or 4 byte unicode
        END SELECT
    ELSE
        PrintUniCode a(0)
    END IF
    SLEEP
LOOP
CLOSE #1
END

SUB PrintUniCode (code AS LONG)
    IF code < 1 OR code > 65535 THEN EXIT SUB
    _MAPUNICODE code TO 0
    PRINT CHR$(0);
    _MAPUNICODE 0 TO 0
END SUB

So apparently, when I converted it to a Function, rather than a direct method (as above), I glitched something out.  The issue seems minor enough -- I just need to figure out why the function is giving me different results than the direct version is.  I'm on it!  :P

Offline SMcNeill

Re: Print Unicode, UTF-8
« Reply #5 on: January 01, 2020, 06:13:53 pm »
Scratch that.  It's working exactly as it should.  I simply wasn't sending it the proper sequence of bytes to convert.

The original code:

Code: QB64:
    p = p + b
    FOR j = 0 TO 3
        GET #1, p, a(j)
    NEXT

The glitch?  The GET never moved past position p.  It should have been:

        GET #1, p + j, a(j)

We need to get up to 4 bytes and send them to the convertor -- not send it the SAME byte 4 times in a row! 
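
The read pattern can be sketched like this, in Python rather than QB64 and only as an illustration: positions are 1-based to match GET, and `sequence_length`, `windows`, and `buggy_window` are hypothetical helper names. Well-formed input is assumed.

```python
def sequence_length(lead: int) -> int:
    """Bytes in the UTF-8 sequence that starts with this lead byte."""
    if lead < 0xC0:
        return 1  # ASCII, or a stray continuation byte that we simply step over
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4


def windows(data: bytes):
    """Return (position, window, length) triples; position is 1-based like GET."""
    out = []
    p, b = 0, 1
    while p + b <= len(data):
        p += b
        # bytes at p, p+1, p+2, p+3 (1-based), padded with zeros past the end
        window = data[p - 1:p + 3].ljust(4, b"\x00")
        b = sequence_length(window[0])
        out.append((p, window, b))
    return out


def buggy_window(data: bytes, p: int) -> bytes:
    """What the first version read: the byte at p, four times over."""
    return data[p - 1:p] * 4


sample = "aé€".encode("utf-8")
for entry in windows(sample):
    print(entry)
print(buggy_window(sample, 2))
```

Each window starts at the current position and the position then advances by the number of bytes the sequence actually used, which is what the `b` parameter is for.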

When you get a chance again later, give it a try one more time and see how it does for you.  I've updated the code in the original post, and I think it's got the issue corrected properly for us.  :)

Offline luke

  • Administrator
  • Seasoned Forum Regular
  • Posts: 324
    • View Profile
Re: Print Unicode, UTF-8
« Reply #6 on: January 01, 2020, 06:21:37 pm »
Here's a UTF-8 decoder I had in my back pocket that does all the characters:
Code:
DEFLNG A-Z
REDIM SHARED __utf8d(0) AS _BYTE

__utf8dl:
DATA 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DATA 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DATA 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DATA 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DATA 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9
DATA 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
DATA 8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
DATA &Ha,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H4,&H3,&H3
DATA &Hb,&H6,&H6,&H6,&H5,&H8,&H8,&H8,&H8,&H8,&H8,&H8,&H8,&H8,&H8,&H8
DATA &H0,&H1,&H2,&H3,&H5,&H8,&H7,&H1,&H1,&H1,&H4,&H6,&H1,&H1,&H1,&H1
DATA 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1
DATA 1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1
DATA 1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1
DATA 1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1

SUB utf8print (in$)
    IF UBOUND(__utf8d) = 0 THEN
        REDIM __utf8d(0 TO 399) AS _BYTE
        RESTORE __utf8dl
        FOR i& = 0 TO 399
            READ __utf8d(i&)
        NEXT i&
    END IF
    o$ = ""
    FOR i& = 1 TO LEN(in$)
        c~%% = ASC(in$, i&)
        typ~%% = __utf8d(c~%%)
        IF s~%% <> 0 THEN
            cp = (c~%% AND &H3F) OR (cp * 64)
        ELSE
            cp = (255 \ (2 ^ typ~%%)) AND c~%%
        END IF
        s~%% = __utf8d(256 + s~%% * 16 + typ~%%)

        IF s~%% = 0 THEN
            _MAPUNICODE cp& TO 0
            PRINT CHR$(0);
        ELSEIF s~%% = 1 THEN
            ERROR 2
        END IF
    NEXT i&
    _MAPUNICODE 0 TO 0
END SUB

FUNCTION fromhex$ (in$)
    IF LEN(in$) MOD 2 THEN ERROR 2
    o$ = SPACE$(LEN(in$) / 2)
    FOR i& = 1 TO LEN(in$) STEP 2
        c$ = MID$(in$, i&, 2)
        MID$(o$, (i& + 1) / 2) = _MK$(_BYTE, VAL("&H" + c$))
    NEXT i&
    fromhex$ = o$
END FUNCTION
If you do something like
Code:
utf8print fromhex$("D09220D187D0B0D189D0B0D18520D18ED0B3D0B020D0B6D0B8D0BB20D0B1D18B20D186D0B8D182D180D183D1813F20D094D0B02C20D0BDD0BE20D184D0B0D0BBD18CD188D0B8D0B2D18BD0B920D18DD0BAD0B7D0B5D0BCD0BFD0BBD18FD18021")
it works nicely.

Going up higher you have something like
Code:
utf8print fromhex$("E0A490")
which does work but you'll need a different font because cyberbit doesn't support it.

Unfortunately you can't do
Code:
utf8print fromhex$("F0908080")
because _MAPUNICODE is crappy and doesn't support code points at or above U+10000.
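
As a cross-check on those hex strings, here is a short Python sketch (not QB64) that decodes them with Python's built-in UTF-8 decoder. `code_point_from_hex` is a made-up helper name, and "D092" is just the first two bytes of the long Cyrillic example.

```python
def code_point_from_hex(hex_text: str) -> int:
    """Decode a hex dump of one UTF-8 sequence and return its code point."""
    return ord(bytes.fromhex(hex_text).decode("utf-8"))


for hex_text in ("D092", "E0A490", "F0908080"):
    cp = code_point_from_hex(hex_text)
    verdict = "fits in 16 bits" if cp <= 0xFFFF else "needs more than 16 bits"
    print(f"{hex_text} -> U+{cp:04X} ({verdict})")
```

The last one decodes to exactly U+10000, the first code point past the 65535 limit that PrintUniCode checks for earlier in the thread.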

Offline SMcNeill

Re: Print Unicode, UTF-8
« Reply #7 on: January 01, 2020, 07:16:57 pm »
Nice to see Luke is just as human when it comes to typos!  :)

There's a problem in the line here:             _MAPUNICODE cp& TO 0

Remove that &, and keep your variable type consistent inside that sub.  The only reason why it's working at the moment, is because of the DEFLNG A-Z at the top of code keeping the types matching for the program.  :)


Offline luke

Re: Print Unicode, UTF-8
« Reply #8 on: January 01, 2020, 07:29:57 pm »
Eugh, it was originally just a decoder function and I added that bit to make it a printer.

Maybe instead of a typo we'll call it emphasis? :)

Offline Petr

Re: Print Unicode, UTF-8
« Reply #9 on: January 02, 2020, 11:27:08 am »
Hi Steve! I tested it again using your updated first source code and now it works correctly. The font is clipped at the bottom, but that doesn't concern this program; it's a known font bug.

Good job!

 
UTF8 output.JPG


Offline SMcNeill

Re: Print Unicode, UTF-8
« Reply #10 on: January 02, 2020, 11:27:24 am »
Sorted out my own little decoder, which ends up giving the exact same results as Luke's, from what initial testing I've tried it with.

Code: QB64:
REDIM SHARED __utf8d(0) AS _BYTE
DIM a(3) AS _UNSIGNED _BYTE 'up to 4 bytes of one UTF-8 sequence
DIM b AS _UNSIGNED _BYTE 'bytes consumed by the last conversion

SCREEN _NEWIMAGE(800, 600, 32)
f = _LOADFONT("cyberbit.ttf", 16, "monospace")
_FONT f 'make the loaded font the active one

FOR j = 0 TO 65000 STEP 1000
    CLS
    COLOR _RGB32(255, 255, 0)
    PRINT "UNICODE Values"; j; " to"; j + 999
    COLOR -1
    FOR i = 0 TO 999
        PrintUniCode i + j
    NEXT
    SLEEP
NEXT

OPEN "Petr Text.txt" FOR BINARY AS #1

b = 1
DO UNTIL p + b > LOF(1) 'stop once the next sequence would start past the end of the file
    p = p + b
    FOR j = 0 TO 3
        GET #1, p + j, a(j)
    NEXT
    UTF8 = ConvertUTF8toUniCode(a(), b)
    PrintUniCode UTF8

    SLEEP
LOOP
CLOSE #1

Luke$ = fromhex$("D09220D187D0B0D189D0B0D18520D18ED0B3D0B020D0B6D0B8D0BB20D0B1D18B20D186D0B8D182D180D183D1813F20D094D0B02C20D0BDD0BE20D184D0B0D0BBD18CD188D0B8D0B2D18BD0B920D18DD0BAD0B7D0B5D0BCD0BFD0BBD18FD18021")

PRINT
b = 1: p = 0: l = LEN(Luke$)
DO UNTIL p >= LEN(Luke$)
    p = p + b
    FOR j = 0 TO 3
        IF p + j > l THEN a(j) = 0 ELSE a(j) = ASC(Luke$, p + j)
    NEXT
    UTF8 = ConvertUTF8toUniCode(a(), b)
    PrintUniCode UTF8
LOOP
PRINT
PRINT "Luke's Translation:"
utf8print Luke$
END

__utf8dl:
DATA 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DATA 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DATA 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DATA 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DATA 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9
DATA 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
DATA 8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
DATA &Ha,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H4,&H3,&H3
DATA &Hb,&H6,&H6,&H6,&H5,&H8,&H8,&H8,&H8,&H8,&H8,&H8,&H8,&H8,&H8,&H8
DATA &H0,&H1,&H2,&H3,&H5,&H8,&H7,&H1,&H1,&H1,&H4,&H6,&H1,&H1,&H1,&H1
DATA 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1
DATA 1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1
DATA 1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1
DATA 1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1

SUB utf8print (in$)
    IF UBOUND(__utf8d) = 0 THEN
        REDIM __utf8d(0 TO 399) AS _BYTE
        RESTORE __utf8dl
        FOR i& = 0 TO 399
            READ __utf8d(i&)
        NEXT i&
    END IF
    o$ = ""
    FOR i& = 1 TO LEN(in$)
        c~%% = ASC(in$, i&)
        typ~%% = __utf8d(c~%%)
        IF s~%% <> 0 THEN
            cp = (c~%% AND &H3F) OR (cp * 64)
        ELSE
            cp = (255 \ (2 ^ typ~%%)) AND c~%%
        END IF
        s~%% = __utf8d(256 + s~%% * 16 + typ~%%)

        IF s~%% = 0 THEN
            _MAPUNICODE cp TO 0
            PRINT CHR$(0);
        ELSEIF s~%% = 1 THEN
            ERROR 2
        END IF
    NEXT i&
    _MAPUNICODE 0 TO 0
END SUB

FUNCTION fromhex$ (in$)
    IF LEN(in$) MOD 2 THEN ERROR 2
    o$ = SPACE$(LEN(in$) / 2)
    FOR i& = 1 TO LEN(in$) STEP 2
        c$ = MID$(in$, i&, 2)
        MID$(o$, (i& + 1) / 2) = _MK$(_BYTE, VAL("&H" + c$))
    NEXT i&
    fromhex$ = o$
END FUNCTION

FUNCTION ConvertUTF8toUniCode& (a() AS _UNSIGNED _BYTE, b AS _UNSIGNED _BYTE)
    DIM first AS _UNSIGNED _BYTE, second AS _UNSIGNED _BYTE, third AS _UNSIGNED _BYTE, fourth AS _UNSIGNED _BYTE
    'b tells us how many bytes we used in this conversion,
    'so we know how far to move a pointer when reading from a file or UTF-8 encoded string
    SELECT CASE a(0) 'first byte is the control byte which tells us how many bytes we need to make our UTF-8 character
        CASE IS < 128
            ConvertUTF8toUniCode& = a(0)
            b = 1 'one byte character
        CASE 194 TO 223
            first = a(0) MOD 32
            second = a(1) MOD 64
            ConvertUTF8toUniCode& = 64 * first + second
            b = 2 'two byte character
        CASE 224 TO 239 '3byte UTF-8 symbol
            first = a(0) MOD 16
            second = a(1) MOD 64
            third = a(2) MOD 64
            ConvertUTF8toUniCode& = 4096 * first + 64 * second + third
            b = 3
        CASE 240 TO 255 '4byte UTF-8 symbol
            first = a(0) MOD 8
            second = a(1) MOD 64
            third = a(2) MOD 64
            fourth = a(3) MOD 64
            ConvertUTF8toUniCode& = CLNG(262144 * first + 4096 * second + 64 * third + fourth)
            b = 4
        CASE ELSE '128 to 193 is not a valid first byte
            ConvertUTF8toUniCode& = 0
            b = 1 'skip it so the pointer keeps moving
    END SELECT
END FUNCTION

SUB PrintUniCode (code AS LONG)
    IF code < 1 OR code > 65535 THEN EXIT SUB
    _MAPUNICODE code TO 0
    PRINT CHR$(0);
    _MAPUNICODE 0 TO 0
END SUB

My little decoder looks like this:

Code: QB64:
FUNCTION ConvertUTF8toUniCode& (a() AS _UNSIGNED _BYTE, b AS _UNSIGNED _BYTE)
    DIM first AS _UNSIGNED _BYTE, second AS _UNSIGNED _BYTE, third AS _UNSIGNED _BYTE, fourth AS _UNSIGNED _BYTE
    'b tells us how many bytes we used in this conversion,
    'so we know how far to move a pointer when reading from a file or UTF-8 encoded string
    SELECT CASE a(0) 'first byte is the control byte which tells us how many bytes we need to make our UTF-8 character
        CASE IS < 128
            ConvertUTF8toUniCode& = a(0)
            b = 1 'one byte character
        CASE 194 TO 223
            first = a(0) MOD 32
            second = a(1) MOD 64
            ConvertUTF8toUniCode& = 64 * first + second
            b = 2 'two byte character
        CASE 224 TO 239 '3byte UTF-8 symbol
            first = a(0) MOD 16
            second = a(1) MOD 64
            third = a(2) MOD 64
            ConvertUTF8toUniCode& = 4096 * first + 64 * second + third
            b = 3
        CASE 240 TO 255 '4byte UTF-8 symbol
            first = a(0) MOD 8
            second = a(1) MOD 64
            third = a(2) MOD 64
            fourth = a(3) MOD 64
            ConvertUTF8toUniCode& = CLNG(262144 * first + 4096 * second + 64 * third + fourth)
            b = 4
        CASE ELSE '128 to 193 is not a valid first byte
            ConvertUTF8toUniCode& = 0
            b = 1 'skip it so the pointer keeps moving
    END SELECT
END FUNCTION

No need for a whole string of DATA statements (I found myself completely lost as to what they were actually doing).  This routine only relies on a few simple MOD statements and basic math to get us to the same solution. 
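
The same MOD arithmetic can be checked against an independent decoder. The sketch below is a Python port (not the QB64 routine itself) with hypothetical names `decode_one` and `decode_all`. It assumes well-formed input, and it limits the 4-byte lead bytes to 240 through 244 where the routine above accepts 240 through 255.

```python
def decode_one(a: bytes):
    """Decode one UTF-8 sequence at the start of a; return (code_point, bytes_used)."""
    lead = a[0]
    if lead < 128:
        return lead, 1
    if 194 <= lead <= 223:
        return 64 * (lead % 32) + a[1] % 64, 2
    if 224 <= lead <= 239:
        return 4096 * (lead % 16) + 64 * (a[1] % 64) + a[2] % 64, 3
    if 240 <= lead <= 244:
        return (262144 * (lead % 8) + 4096 * (a[1] % 64)
                + 64 * (a[2] % 64) + a[3] % 64), 4
    raise ValueError("not a valid UTF-8 lead byte: %d" % lead)


def decode_all(data: bytes):
    """Walk the buffer, advancing by the number of bytes each sequence used."""
    out = []
    pos = 0
    while pos < len(data):
        cp, used = decode_one(data[pos:pos + 4])
        out.append(cp)
        pos += used
    return out


sample = "Aé€😀漢Ж"
print(decode_all(sample.encode("utf-8")))
print([ord(ch) for ch in sample])
```

Taking a byte MOD 32, 16, 8, or 64 just strips the marker bits off the top, and the multipliers 64, 4096, and 262144 shift each group of payload bits into place.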

You've got to love that there's always multiple ways to come to the same solution to a problem!  ;D

Offline Petr

Re: Print Unicode, UTF-8
« Reply #11 on: January 02, 2020, 11:39:06 am »
I do not have enough knowledge about the construction of UNICODE that I would ever think of such a solution.

Marked as best answer by SMcNeill on January 02, 2020, 07:12:00 am

Offline SMcNeill

Re: Print Unicode, UTF-8
« Reply #12 on: January 02, 2020, 12:10:34 pm »
Quote
I do not have enough knowledge about the construction of UNICODE that I would ever think of such a solution.

Take a look at this little program and see if it doesn't come close to handling any conversion needs you might need:

Code: QB64:
REDIM SHARED __utf8d(0) AS _BYTE
DIM a(3) AS _UNSIGNED _BYTE 'the UTF-8 bytes of one character
DIM b AS _UNSIGNED _BYTE 'how many of those bytes are in use

SCREEN _NEWIMAGE(800, 600, 32)
f = _LOADFONT("cyberbit.ttf", 16, "monospace")
_FONT f 'make the loaded font the active one

FOR j = 0 TO 255
    ASCIItoUTF8 j, a(), b
    PRINT j; ")";
    FOR i = 0 TO b - 1
        PRINT a(i);
    NEXT
    PRINT CHR$(j),
    PrintUniCode UTF8toUniCode(a(), b)
    PRINT
    SLEEP
NEXT
END

FUNCTION ASCIItoUnicode& (ASCII AS _UNSIGNED _BYTE)
    ASCIItoUnicode& = _MAPUNICODE(ASCII)
END FUNCTION

SUB ASCIItoUTF8 (ASCII AS _UNSIGNED _BYTE, a() AS _UNSIGNED _BYTE, b AS _UNSIGNED _BYTE)
    UnicodeToUTF8 _MAPUNICODE(ASCII), a(), b
END SUB

SUB UnicodeToUTF8 (Unicode AS LONG, a() AS _UNSIGNED _BYTE, b AS _UNSIGNED _BYTE) 'a() should be an array from 0 to 3 which we want to get the value back from.
    FOR i = 0 TO 3: a(i) = 0: NEXT 'reset all values to 0
    SELECT CASE Unicode
        CASE 0 TO 127
            a(0) = Unicode
            b = 1 'we return 1 byte
        CASE 128 TO 2047
            a(0) = 192 + Unicode \ 64
            a(1) = 128 + Unicode MOD 64
            b = 2 'we return 2 bytes
        CASE 2048 TO 65535
            a(2) = 128 + Unicode MOD 64
            r = Unicode \ 64
            a(1) = 128 + r MOD 64
            a(0) = 224 + r \ 64
            b = 3 'we return 3 bytes
        CASE 65536 TO 1114111
            a(3) = 128 + Unicode MOD 64
            r = Unicode \ 64
            a(2) = 128 + r MOD 64
            r = r \ 64
            a(1) = 128 + r MOD 64
            a(0) = 240 + r \ 64
            b = 4 'we return 4 bytes
        CASE ELSE
            PRINT "Invalid Value passed to UniCodeToUTF8 convertor!"; Unicode, HEX$(Unicode)
            ERROR 5
    END SELECT
END SUB

FUNCTION UTF8toUniCode& (a() AS _UNSIGNED _BYTE, b AS _UNSIGNED _BYTE)
    DIM first AS _UNSIGNED _BYTE, second AS _UNSIGNED _BYTE, third AS _UNSIGNED _BYTE, fourth AS _UNSIGNED _BYTE
    'b tells us how many bytes we used in this conversion,
    'so we know how far to move a pointer when reading from a file or UTF-8 encoded string
    SELECT CASE a(0) 'first byte is the control byte which tells us how many bytes we need to make our UTF-8 character
        CASE IS < 128
            UTF8toUniCode& = a(0)
            b = 1 'one byte character
        CASE 194 TO 223
            first = a(0) MOD 32
            second = a(1) MOD 64
            UTF8toUniCode& = 64 * first + second
            b = 2 'two byte character
        CASE 224 TO 239 '3byte UTF-8 symbol
            first = a(0) MOD 16
            second = a(1) MOD 64
            third = a(2) MOD 64
            UTF8toUniCode& = 4096 * first + 64 * second + third
            b = 3
        CASE 240 TO 255 '4byte UTF-8 symbol
            first = a(0) MOD 8
            second = a(1) MOD 64
            third = a(2) MOD 64
            fourth = a(3) MOD 64
            UTF8toUniCode& = CLNG(262144 * first + 4096 * second + 64 * third + fourth)
            b = 4
        CASE ELSE '128 to 193 is not a valid first byte
            UTF8toUniCode& = 0
            b = 1 'skip it so the pointer keeps moving
    END SELECT
END FUNCTION

SUB PrintUniCode (code AS LONG)
    IF code < 1 OR code > 65535 THEN EXIT SUB
    _MAPUNICODE code TO 0
    PRINT CHR$(0);
    _MAPUNICODE 0 TO 0
END SUB

If you notice, we have a loop from 0 to 255, which we use to get ASCII codes from: FOR j = 0 TO 255   

Then we convert ASCII to UTF8 with:     ASCIItoUTF8 j, a(), b

The next segment prints the UTF-8 codes for us:
    FOR i = 0 TO b - 1
        PRINT a(i);
    NEXT

Then, to prove that things are working as they should, we print the CHR$ character itself:
    PRINT CHR$(j), 

And, for comparison, we then print the unicode version from the UTF8-translation, just so we can compare and make certain that all of the pieces match and look identical for us: 
    PrintUniCode UTF8toUniCode(a(), b)   



I'm thinking there are enough conversion routines in the toolkit here now that people should be able to quickly/easily convert back and forth between ASCII, UTF-8, and Unicode, as their program needs.  :)

If I'm missing something, just let me know, but I think this, when inserted into a program with my key remapper, will allow me to create an easy international input/output system for use with my QB64 programs in the future -- with no real reliance on _MAPUNICODE and code pages being required.
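
The Unicode-to-UTF-8 direction used in UnicodeToUTF8 above can be cross-checked the same way. This is a Python port of that arithmetic, not QB64, and `encode_code_point` is a made-up name for the illustration.

```python
def encode_code_point(cp: int) -> bytes:
    """Build the UTF-8 bytes for one code point using integer division and remainders."""
    if cp < 0 or cp > 1114111:
        raise ValueError("outside the Unicode range: %d" % cp)
    if cp <= 127:
        return bytes([cp])
    if cp <= 2047:
        return bytes([192 + cp // 64, 128 + cp % 64])
    if cp <= 65535:
        r = cp // 64
        return bytes([224 + r // 64, 128 + r % 64, 128 + cp % 64])
    r = cp // 64
    r2 = r // 64
    return bytes([240 + r2 // 64, 128 + r2 % 64, 128 + r % 64, 128 + cp % 64])


for ch in ("A", "é", "€", "😀"):
    print(ch, encode_code_point(ord(ch)).hex(), ch.encode("utf-8").hex())
```

Each output line should show the same hex string twice: once from the arithmetic, once from Python's own encoder.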
 

Offline Unma

  • Newbie
  • Posts: 1
    • View Profile
Re: Print Unicode, UTF-8
« Reply #13 on: April 06, 2020, 08:55:59 am »
Just to join the club, I also have a QB64 function that converts UTF-8 strings into Unicode numbers.

Not knowing about your efforts, I did this as an aid to (hopefully automated) translating among many languages.
Once again I did something needless :-)

It is written in such a way that after 8 years, even I can understand what I was doing today. And that is not always the case :-)
Well, it may be helpful to somebody unfamiliar with the Unicode standard.

Code: QB64:
'------------------------------
' INPUT:  UTF-8 string
'------------------------------
' OUTPUT: ERROR   UTF2UNICODE < 0
'         ---------------------
'         OK      UTF2UNICODE >= 0 AND UTF2UNICODE <= &H10FFFF
'                 Recognised unicode character is removed from the beginning of argument. This can be turned off (see at the bottom).
'------------------------------
FUNCTION UTF2UNICODE& (txt$)
    DIM chlen AS INTEGER
    DIM hb AS LONG
    DIM db AS LONG
    DIM result AS LONG

    IF LEN(txt$) = 0 THEN
        UTF2UNICODE& = -1 'Invalid argument
        EXIT FUNCTION
    END IF
    result = -2 'Unspecified error
    chlen = 1
    hb = ASC(txt$) 'head-byte only

    IF (hb AND &B10000000) = 0 THEN ' ? 0xxx xxxx  TRUE=byte is ASCII character
        result = hb
    ELSE
        IF (hb AND &B11100000) = &B11000000 THEN ' ? 110x xxxx  TRUE=byte is 1st of two bytes
            'head-byte + data-byte
            ' 110xxxxx   10yyyyyy
            '---------------------
            chlen = 2
            'result = (hb AND &B00011111) * &B01000000
            result = (hb AND &H1F) * &H40 '            head-byte  shifted left 6 places  result | 0000 0000  0000 0000  0000 0xxx  xx00 0000 |
            db = ASC(MID$(txt$, 2, 1)) '                     data-byte
            'result = result OR (db AND &B00111111)
            result = result OR (db AND &H3F) '              data-byte  copied                 result | 0000 0000  0000 0000  0000 0xxx  xxyy yyyy |
        ELSE
            IF (hb AND &B11110000) = &B11100000 THEN ' ? 1110 xxxx  TRUE=byte is 1st of 3 bytes
                'head-byte + data-byte1 + data-byte2
                ' 1110xxxx   10yyyyyy     10zzzzzz
                '-----------------------------------
                chlen = 3
                'result = (hb AND &B00001111) * &B 0001 0000 0000 0000
                result = (hb AND &HF) * &H1000 '               head-byte   shifted left 12 places  result | 0000 0000  0000 0000  xxxx 0000  0000 0000 |
                db = ASC(MID$(txt$, 2, 1)) '                     data-byte1
                result = result OR ((db AND &H3F) * &H40) '  data-byte1  shifted left 6 places   result | 0000 0000  0000 0000  xxxx yyyy  yy00 0000 |
                db = ASC(MID$(txt$, 3, 1)) '                     data-byte2
                result = result OR (db AND &H3F) '           data-byte2  copied                  result | 0000 0000  0000 0000  xxxx yyyy  yyzz zzzz |
            ELSE
                IF (hb AND &B11111000) = &B11110000 THEN ' ? 1111 0xxx  TRUE=byte is 1st of 4
                    'head-byte + data-byte1 + data-byte2 + data-byte3
                    ' 11110xxx   10yyyyyy     10zzzzzz     10wwwwww
                    '------------------------------------------------
                    chlen = 4
                    'result = (hb AND &B00000111) * &B 0000 0100  0000 0000  0000 0000
                    'mask &H7 keeps all three payload bits; &H40000 (2^18) is the 18-place shift
                    result = (hb AND &H7) * &H40000 '              head-byte   shifted left 18 places  result | 0000 0000  000x xx00  0000 0000  0000 0000 |
                    db = ASC(MID$(txt$, 2, 1)) '                      data-byte1
                    result = result OR ((db AND &H3F) * &H1000) ' data-byte1  shifted left 12 places  result | 0000 0000  000x xxyy  yyyy 0000  0000 0000 |
                    db = ASC(MID$(txt$, 3, 1)) '                      data-byte2
                    result = result OR ((db AND &H3F) * &H40) '   data-byte2  shifted left 6 places   result | 0000 0000  000x xxyy  yyyy zzzz  zz00 0000 |
                    db = ASC(MID$(txt$, 4, 1)) '                      data-byte3
                    result = result OR (db AND &H3F) '           data-byte3  copied                  result | 0000 0000  000x xxyy  yyyy zzzz  zzww wwww |
                ELSE
                    'Not a head-byte.
                    result = hb
                END IF
            END IF
        END IF
    END IF
    IF chlen < LEN(txt$) THEN txt$ = MID$(txt$, chlen + 1) ELSE txt$ = "" ' By commenting this line, function will leave string-argument unchanged.
    UTF2UNICODE& = result
END FUNCTION


I intend to add some more "error codes" (negative returns) at some later point. Right now I have to deal with another issue.
It is obvious that there is bad blood between the QB64 IDE and my OS (Linux Mint).
By
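
One way the extra negative error codes might look is sketched below, in Python rather than QB64. Only -1 (invalid argument) comes from the function above; the values -3, -4, and -5 and all the names are assumptions made up for this illustration. Overlong encodings and surrogates are not checked.

```python
ERR_EMPTY = -1             # same meaning as the function above: invalid argument
ERR_TRUNCATED = -3         # hypothetical: sequence runs past the end of the data
ERR_BAD_CONTINUATION = -4  # hypothetical: a data byte is not of the form 10xxxxxx
ERR_BAD_LEAD = -5          # hypothetical: first byte cannot start a sequence


def utf2unicode(data: bytes) -> int:
    """Return the first code point in data, or a negative error code."""
    if not data:
        return ERR_EMPTY
    hb = data[0]
    if (hb & 0b10000000) == 0:
        return hb                      # 0xxxxxxx: plain ASCII
    if (hb & 0b11100000) == 0b11000000:
        length, result = 2, hb & 0x1F  # 110xxxxx
    elif (hb & 0b11110000) == 0b11100000:
        length, result = 3, hb & 0x0F  # 1110xxxx
    elif (hb & 0b11111000) == 0b11110000:
        length, result = 4, hb & 0x07  # 11110xxx
    else:
        return ERR_BAD_LEAD
    if len(data) < length:
        return ERR_TRUNCATED
    for db in data[1:length]:
        if (db & 0b11000000) != 0b10000000:
            return ERR_BAD_CONTINUATION
        result = (result << 6) | (db & 0x3F)  # shift in six payload bits
    return result


print(utf2unicode("€".encode("utf-8")), utf2unicode(b"\xe2\x82"), utf2unicode(b"\xc3\x28"))
```

Because every valid result is zero or greater, a caller can tell success from failure with a single sign test, which matches the OUTPUT convention in the header comment above.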

Offline RhoSigma

  • QB64 Developer
  • Forum Resident
  • Posts: 565
    • View Profile
Re: Print Unicode, UTF-8
« Reply #14 on: April 06, 2020, 09:13:16 am »
And a cross reference to a similar topic: https://qb64forum.alephc.xyz/index.php?topic=2248
« Last Edit: January 31, 2022, 07:11:11 pm by RhoSigma »
My Projects:   https://qb64forum.alephc.xyz/index.php?topic=809
GuiTools - A graphic UI framework (can do multiple UI forms/windows in one program)
Libraries - ImageProcess, StringBuffers (virt. files), MD5/SHA2-Hash, LZW etc.
Bonus - Blankers, QB64/Notepad++ setup pack