Author Topic: Print Unicode, UTF-8 (Read 20363 times)

SMcNeill · « **on:** January 01, 2020, 04:46:05 pm »

A couple of quick routines which I worked up to print Unicode characters for us, and to convert UTF-8 encoded text to Unicode code points, so that it'll print properly for us.

Code: QB64: [Select]

SCREEN _NEWIMAGE(800, 600, 32)
f = _LOADFONT("cyberbit.ttf", 16, "monospace")
_FONT f
 
 
FOR j = 0 TO 65000 STEP 1000
    CLS
    COLOR _RGB32(255, 255, 0)
    PRINT "UNICODE Values"; j; " to"; j + 999
    COLOR -1
    FOR i = 0 TO 999
        PrintUniCode i + j
    NEXT
    SLEEP
NEXT
 
 
CLS
 
OPEN "Petr Text.txt" FOR BINARY AS #1
DIM a(3) AS _UNSIGNED _BYTE
DIM i AS _UNSIGNED INTEGER
DIM b AS _UNSIGNED _BYTE
 
b = 1
DO UNTIL EOF(1)
    p = p + b
    FOR j = 0 TO 3
        GET #1, p+j, a(j)
    NEXT
    UTF8 = ConvertUTF8toUniCode(a(), b)
    PrintUniCode UTF8
 
    SLEEP
LOOP
 
 
FUNCTION ConvertUTF8toUniCode (a() AS _UNSIGNED _BYTE, b AS _UNSIGNED _BYTE)
    'b tells us how many bytes we used in this conversion,
    'so we know how far to move a pointer when reading from a file or UTF-8 encoded string
    SELECT CASE a(0) 'first byte is the control byte which tells us how many bytes we need to make our UTF-8 character
        CASE IS < 194: ConvertUTF8toUniCode = a(0): b = 1 'one byte character
        CASE 194 TO 223: ConvertUTF8toUniCode = (64 * (a(0) - 194) + a(1)): b = 2 'two byte character
        CASE 224 TO 239: b = 3: '3byte UTF-8 symbol  (I haven't sorted out the conversion for these yet)
        CASE 240 TO 255: b = 4 ' 4byte UTF-8 symbol
    END SELECT
END FUNCTION
 
SUB PrintUniCode (code AS LONG)
    IF code < 1 OR code > 65535 THEN EXIT SUB
    _MAPUNICODE code TO 0
    PRINT CHR$(0);
    _MAPUNICODE 0 TO 0
END SUB

Note: This uses some text that Petr shared on the forums here, in the past, so I hope he won't mind me using it for testing purposes.

I haven't quite sorted out the work needed to convert 3 and 4-byte UTF-8 values to Unicode, but just being able to read and convert 1 and 2-byte characters covers code points U+0000 to U+07FF. That takes in all the general Latin pages, plus Greek, Cyrillic, Hebrew, and Arabic. Typographic symbols such as curly quotes and the euro sign sit higher up and need 3 bytes.

As far as I can tell, 3-byte UTF-8 sequences cover the rest of the 16-bit range (U+0800 to U+FFFF), which is where the Japanese Kanji/Chinese characters and such live, and 4-byte sequences cover everything from U+10000 up: most of the Emoji, plus rarer scripts and symbols. I personally tend to type exclusively in plain old unaccented American English, and the old 7-bit ASCII codes are usually enough for my needs...
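For reference, the lead byte alone tells you how long a UTF-8 sequence is, and each length covers a fixed range of code points. A minimal sketch of that rule follows; `UTF8SeqLen%` is a hypothetical helper name that is not used anywhere else in this thread.

```qb64
'Hypothetical helper: sequence length implied by a UTF-8 lead byte.
'Returns 0 for bytes that cannot start a sequence.
PRINT UTF8SeqLen%(ASC("A")) '1 byte:  U+0000 to U+007F
PRINT UTF8SeqLen%(&HC3) '2 bytes: U+0080 to U+07FF
PRINT UTF8SeqLen%(&HE2) '3 bytes: U+0800 to U+FFFF
PRINT UTF8SeqLen%(&HF0) '4 bytes: U+10000 to U+10FFFF
PRINT UTF8SeqLen%(&H80) 'continuation byte, not a lead byte

FUNCTION UTF8SeqLen% (lead AS _UNSIGNED _BYTE)
    SELECT CASE lead
        CASE 0 TO 127: UTF8SeqLen% = 1
        CASE 194 TO 223: UTF8SeqLen% = 2
        CASE 224 TO 239: UTF8SeqLen% = 3
        CASE 240 TO 244: UTF8SeqLen% = 4
        CASE ELSE: UTF8SeqLen% = 0 '128-193 and 245-255 never start a valid sequence
    END SELECT
END FUNCTION
```

A reader loop can use the returned length to decide how many further bytes to fetch before converting.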

Still though, I've been breaking my brain working with the keyhit remapper, so I thought I'd swap gears and take a break for a small bit...

So I swap over to trying to decode Unicode, UTF-8, and ANSI/ASCII encodings instead. LOL!!

Anywho... If you folks who tend to speak and write in those languages with these extended code sets could test things out, I'd appreciate it. As far as I can tell, it works with what I've tested so far, but my eyes are old and tired, and I don't know the languages well enough to tell an A-forward slash from an A-backward slash from an A-circle over top... Hell, I don't even know what the heck you actually call those accents...

I hope it works as it should, but if not, just let me know what's wrong with it, and I'll try and sort it out, as long as somebody is willing to be patient and work with me and tell me exactly what the heck is off with the conversion processes so I can fix them. ;)

Petr · « **Reply #1 on:** January 01, 2020, 05:22:06 pm »

Hi Steve,
it does not work properly. If I remember correctly, the point of that thread was to map characters above 127, which carry non-English diacritics, back down to letters compatible with English, i.e. characters up to 127? To do this, you would need a function that returns the locale ID (LCID) of the local national language (PowerShell's Get-Culture can be used, or a similar command on Linux) and then use DATA to map the high characters of the ASCII table down to English characters. Accents will not be displayed correctly, but the text will be readable. I think this was the point?

I enclose the output of your program and also my ASCII table, as set by default for my language.

Petr · « **Reply #2 on:** January 01, 2020, 05:27:46 pm »

I can work on this program with you tomorrow. It's almost midnight here and I have to work in the morning, so I must go to sleep.

SMcNeill · « **Reply #3 on:** January 01, 2020, 05:36:27 pm »

Quote

Hi Steve,
it does not work properly.…

It’s not using the ASCII table, or code pages, for us. Instead, we should be reading the UTF-8 directly, then converting it and using the Unicode characters in the cyberbit.ttf font which comes with QB64.

Do the Unicode characters display properly at the start, or is something wrong with them to begin with?

SMcNeill · « **Reply #4 on:** January 01, 2020, 06:07:18 pm »

You're right Petr; it's not working properly there.

It's odd, because I get the correct outputs here:

Code: QB64: [Select]

SCREEN _NEWIMAGE(800, 600, 32)
f = _LOADFONT("cyberbit.ttf", 16, "monospace")
_FONT f
 
 
GOTO skip_test
FOR j = 0 TO 65000 STEP 1000
    CLS
    COLOR _RGB32(255, 255, 0)
    PRINT "UNICODE Values"; j; " to"; j + 999
    COLOR -1
    FOR i = 0 TO 999
        PrintUniCode i + j
    NEXT
    SLEEP
NEXT
 
skip_test:
CLS
 
OPEN "Petr Text.txt" FOR BINARY AS #1
DIM a(1) AS _UNSIGNED _BYTE
DIM i AS _UNSIGNED INTEGER
 
DO UNTIL EOF(1)
    p = p + 1
    GET #1, p, a(0)
    IF a(0) > 193 THEN
        GET #1, p + 1, a(1)
        p = p + 1
        '       i = a(0) * 256 + a(1)
        SELECT CASE a(0) '194 to 223
            CASE IS < 194: PrintUniCode i
            CASE 194 TO 223: PrintUniCode (a(1) + 64 * (a(0) - 194))
            CASE ELSE: PRINT "U:"; a(0), a(1) 'it's a 3 or 4 byte unicode
        END SELECT
    ELSE
        PrintUniCode a(0)
    END IF
    SLEEP
LOOP
 
 
SUB PrintUniCode (code AS LONG)
    IF code < 1 OR code > 65535 THEN EXIT SUB
    _MAPUNICODE code TO 0
    PRINT CHR$(0);
    _MAPUNICODE 0 TO 0
END SUB

So apparently, when I converted it to a Function, rather than a direct method (as above), I glitched something out. The issue seems minor enough -- I just need to figure out why the function is giving me different results than the direct version is. I'm on it! :P

SMcNeill · « **Reply #5 on:** January 01, 2020, 06:13:53 pm »

Scratch that. It's working exactly as it should. I simply wasn't sending it the proper sequence of bytes to convert.

The original code:

Code: QB64: [Select]

DO UNTIL EOF(1)
    p = p + b
    FOR j = 0 TO 3
        GET #1, p, a(j)
    NEXT
 

The glitch? That GET line should have read:

GET #1, p + j, a(j)

We need to get up to 4 bytes and send them to the converter -- not send it the SAME byte 4 times in a row!

When you get a chance again later, give it a try one more time and see how it does for you. I've updated the code in the original post, and I think it's got the issue corrected properly for us. :)

luke · « **Reply #6 on:** January 01, 2020, 06:21:37 pm »

Here's a UTF-8 decoder I had in my back pocket that does all the characters:

Code: [Select]

DEFLNG A-Z
REDIM SHARED __utf8d(0) AS _BYTE

__utf8dl:
DATA 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DATA 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DATA 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DATA 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DATA 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9
DATA 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
DATA 8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
DATA &Ha,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H4,&H3,&H3
DATA &Hb,&H6,&H6,&H6,&H5,&H8,&H8,&H8,&H8,&H8,&H8,&H8,&H8,&H8,&H8,&H8
DATA &H0,&H1,&H2,&H3,&H5,&H8,&H7,&H1,&H1,&H1,&H4,&H6,&H1,&H1,&H1,&H1
DATA 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1
DATA 1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1
DATA 1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1
DATA 1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1

SUB utf8print (in$)
    IF UBOUND(__utf8d) = 0 THEN
        REDIM __utf8d(0 TO 399) AS _BYTE
        RESTORE __utf8dl
        FOR i& = 0 TO 399
            READ __utf8d(i&)
        NEXT i&
    END IF
    o$ = ""
    FOR i& = 1 TO LEN(in$)
        c~%% = ASC(in$, i&)
        typ~%% = __utf8d(c~%%)
        IF s~%% <> 0 THEN
            cp = (c~%% AND &H3F) OR (cp * 64)
        ELSE
            cp = (255 \ (2 ^ typ~%%)) AND c~%%
        END IF
        s~%% = __utf8d(256 + s~%% * 16 + typ~%%)

        IF s~%% = 0 THEN
            _MAPUNICODE cp& TO 0
            PRINT CHR$(0);
        ELSEIF s~%% = 1 THEN
            ERROR 2
        END IF
    NEXT i&
    _MAPUNICODE 0 TO 0
END SUB

FUNCTION fromhex$ (in$)
    IF LEN(in$) MOD 2 THEN ERROR 2
    o$ = SPACE$(LEN(in$) / 2)
    FOR i& = 1 TO LEN(in$) STEP 2
        c$ = MID$(in$, i&, 2)
        MID$(o$, (i& + 1) / 2) = _MK$(_BYTE, VAL("&H" + c$))
    NEXT i&
    fromhex$ = o$
END FUNCTION

If you do something like

Code: [Select]

utf8print fromhex$("D09220D187D0B0D189D0B0D18520D18ED0B3D0B020D0B6D0B8D0BB20D0B1D18B20D186D0B8D182D180D183D1813F20D094D0B02C20D0BDD0BE20D184D0B0D0BBD18CD188D0B8D0B2D18BD0B920D18DD0BAD0B7D0B5D0BCD0BFD0BBD18FD18021")

it works nicely.

Going up higher you have something like

Code: [Select]

utf8print fromhex$("E0A490")

which does work, but you'll need a different font because cyberbit doesn't support it.

Unfortunately you can't do

Code: [Select]

utf8print fromhex$("F0908080")

because _MAPUNICODE is crappy and doesn't support code points from U+10000 upward.

SMcNeill · « **Reply #7 on:** January 01, 2020, 07:16:57 pm »

Nice to see Luke is just as human when it comes to typos! :)

There's a problem in the line here: _MAPUNICODE cp& TO 0

Remove that &, and keep your variable type consistent inside that sub. The only reason why it's working at the moment, is because of the DEFLNG A-Z at the top of code keeping the types matching for the program. :)
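A tiny sketch of why the stray suffix matters, assuming the usual QB64 default typing (an unsuffixed variable is a SINGLE unless a DEFxxx statement says otherwise). The variable name is borrowed from Luke's sub purely for illustration.

```qb64
'Without DEFLNG, cp and cp& are two different variables.
cp = 233 'unsuffixed, so this is a SINGLE named cp
PRINT cp 'the value just assigned
PRINT cp& 'a separate LONG variable that was never assigned

'With DEFLNG A-Z at the top of the program, an unsuffixed cp is already
'a LONG, so cp and cp& name the same variable and the mismatch is hidden.
```

Paste the sub into a program without that DEFLNG line and the decoder would build the code point in one variable and map the other.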

luke · « **Reply #8 on:** January 01, 2020, 07:29:57 pm »

Eugh, it was originally just a decoder function and I added that bit to make it a printer.

Maybe instead of a typo we'll call it emphasis? :)

Petr · « **Reply #9 on:** January 02, 2020, 11:27:08 am »

Hi Steve! I tested it again using your updated first source code and now it works correctly. The font is clipped, but that is a known font bug and has nothing to do with this program.

Good job!

SMcNeill · « **Reply #10 on:** January 02, 2020, 11:27:24 am »

Sorted out my own little decoder, which ends up giving the exact same results as Luke's in the initial testing I've done with it.

Code: QB64: [Select]

REDIM SHARED __utf8d(0) AS _BYTE
DIM a(3) AS _UNSIGNED _BYTE
DIM i AS _UNSIGNED INTEGER
DIM b AS _UNSIGNED _BYTE
 
SCREEN _NEWIMAGE(800, 600, 32)
f = _LOADFONT("cyberbit.ttf", 16, "monospace")
_FONT f
 
FOR j = 0 TO 65000 STEP 1000
    CLS
    COLOR _RGB32(255, 255, 0)
    PRINT "UNICODE Values"; j; " to"; j + 999
    COLOR -1
    FOR i = 0 TO 999
        PrintUniCode i + j
    NEXT
    SLEEP
NEXT
 
 
CLS
 
OPEN "Petr Text.txt" FOR BINARY AS #1
 
b = 1
DO UNTIL EOF(1)
    p = p + b
    FOR j = 0 TO 3
        GET #1, p + j, a(j)
    NEXT
    UTF8 = ConvertUTF8toUniCode(a(), b)
    PrintUniCode UTF8
 
    SLEEP
LOOP
CLOSE
 
 
Luke$ = fromhex$("D09220D187D0B0D189D0B0D18520D18ED0B3D0B020D0B6D0B8D0BB20D0B1D18B20D186D0B8D182D180D183D1813F20D094D0B02C20D0BDD0BE20D184D0B0D0BBD18CD188D0B8D0B2D18BD0B920D18DD0BAD0B7D0B5D0BCD0BFD0BBD18FD18021")
 
CLS
 
 
b = 1: p = 0: l = LEN(Luke$)
DO UNTIL p >= LEN(Luke$)
    p = p + b
    FOR j = 0 TO 3
        IF p + j > l THEN a(j) = 0 ELSE a(j) = ASC(Luke$, p + j)
    NEXT
    UTF8 = ConvertUTF8toUniCode(a(), b)
    PrintUniCode UTF8
LOOP
CLOSE
PRINT
PRINT
PRINT "Luke's Translation:"
PRINT
utf8print Luke$
 
_DELAY 3
 
__utf8dl:
DATA 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DATA 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DATA 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DATA 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DATA 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9
DATA 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
DATA 8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
DATA &Ha,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H3,&H4,&H3,&H3
DATA &Hb,&H6,&H6,&H6,&H5,&H8,&H8,&H8,&H8,&H8,&H8,&H8,&H8,&H8,&H8,&H8
DATA &H0,&H1,&H2,&H3,&H5,&H8,&H7,&H1,&H1,&H1,&H4,&H6,&H1,&H1,&H1,&H1
DATA 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1
DATA 1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1
DATA 1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1
DATA 1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1
 
SUB utf8print (in$)
    IF UBOUND(__utf8d) = 0 THEN
        REDIM __utf8d(0 TO 399) AS _BYTE
        RESTORE __utf8dl
        FOR i& = 0 TO 399
            READ __utf8d(i&)
        NEXT i&
    END IF
    o$ = ""
    FOR i& = 1 TO LEN(in$)
        c~%% = ASC(in$, i&)
        typ~%% = __utf8d(c~%%)
        IF s~%% <> 0 THEN
            cp = (c~%% AND &H3F) OR (cp * 64)
        ELSE
            cp = (255 \ (2 ^ typ~%%)) AND c~%%
        END IF
        s~%% = __utf8d(256 + s~%% * 16 + typ~%%)
 
        IF s~%% = 0 THEN
            _MAPUNICODE cp TO 0
            PRINT CHR$(0);
        ELSEIF s~%% = 1 THEN
            ERROR 2
        END IF
    NEXT i&
    _MAPUNICODE 0 TO 0
END SUB
 
 
 
FUNCTION fromhex$ (in$)
    IF LEN(in$) MOD 2 THEN ERROR 2
    o$ = SPACE$(LEN(in$) / 2)
    FOR i& = 1 TO LEN(in$) STEP 2
        c$ = MID$(in$, i&, 2)
        MID$(o$, (i& + 1) / 2) = _MK$(_BYTE, VAL("&H" + c$))
    NEXT i&
    fromhex$ = o$
END FUNCTION
 
 
 
 
 
 
 
FUNCTION ConvertUTF8toUniCode& (a() AS _UNSIGNED _BYTE, b AS _UNSIGNED _BYTE)
    DIM first AS _UNSIGNED _BYTE, second AS _UNSIGNED _BYTE, third AS _UNSIGNED _BYTE, fourth AS _UNSIGNED _BYTE
    'b tells us how many bytes we used in this conversion,
    'so we know how far to move a pointer when reading from a file or UTF-8 encoded string
    SELECT CASE a(0) 'first byte is the control byte which tells us how many bytes we need to make our UTF-8 character
        CASE IS < 128
            ConvertUTF8toUniCode& = a(0)
            b = 1 'one byte character
        CASE 194 TO 223
            first = a(0) MOD 32
            second = a(1) MOD 64
            ConvertUTF8toUniCode& = 64 * first + second
            b = 2 'two byte character
        CASE 224 TO 239 '3byte UTF-8 symbol
            first = a(0) MOD 16
            second = a(1) MOD 64
            third = a(2) MOD 64
            ConvertUTF8toUniCode& = 4096 * first + 64 * second + third
            b = 3
        CASE 240 TO 255 ' 4byte UTF-8 symbol
            first = a(0) MOD 8
            second = a(1) MOD 64
            third = a(2) MOD 64
            fourth = a(3) MOD 64
            ConvertUTF8toUniCode& = CLNG(262144 * first + 4096 * second + 64 * third + fourth)
            b = 4
    END SELECT
END FUNCTION
 
SUB PrintUniCode (code AS LONG)
    IF code < 1 OR code > 65535 THEN EXIT SUB
    _MAPUNICODE code TO 0
    PRINT CHR$(0);
    _MAPUNICODE 0 TO 0
END SUB
 

My little decoder looks like this:

Code: QB64: [Select]

FUNCTION ConvertUTF8toUniCode& (a() AS _UNSIGNED _BYTE, b AS _UNSIGNED _BYTE)
    DIM first AS _UNSIGNED _BYTE, second AS _UNSIGNED _BYTE, third AS _UNSIGNED _BYTE, fourth AS _UNSIGNED _BYTE
    'b tells us how many bytes we used in this conversion,
    'so we know how far to move a pointer when reading from a file or UTF-8 encoded string
    SELECT CASE a(0) 'first byte is the control byte which tells us how many bytes we need to make our UTF-8 character
        CASE IS < 128
            ConvertUTF8toUniCode& = a(0)
            b = 1 'one byte character
        CASE 194 TO 223
            first = a(0) MOD 32
            second = a(1) MOD 64
            ConvertUTF8toUniCode& = 64 * first + second
            b = 2 'two byte character
        CASE 224 TO 239 '3byte UTF-8 symbol
            first = a(0) MOD 16
            second = a(1) MOD 64
            third = a(2) MOD 64
            ConvertUTF8toUniCode& = 4096 * first + 64 * second + third
            b = 3
        CASE 240 TO 255 ' 4byte UTF-8 symbol
            first = a(0) MOD 8
            second = a(1) MOD 64
            third = a(2) MOD 64
            fourth = a(3) MOD 64
            ConvertUTF8toUniCode& = CLNG(262144 * first + 4096 * second + 64 * third + fourth)
            b = 4
    END SELECT
END FUNCTION

No need for a whole string of DATA statements, which left me completely lost as to what they were actually doing. This routine only relies on a few simple MOD statements and basic math to get us to the same solution.
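To see the arithmetic at work, here is the 3-byte case applied by hand to the euro sign, whose UTF-8 encoding is E2 82 AC. This is only an illustrative sketch; the byte values are an example chosen for the purpose and are not taken from the test file.

```qb64
'Worked example: decode the 3-byte sequence E2 82 AC (the euro sign).
DIM first AS LONG, second AS LONG, third AS LONG, code AS LONG
first = &HE2 MOD 16 '1110xxxx: keep the low 4 bits -> 2
second = &H82 MOD 64 '10yyyyyy: keep the low 6 bits -> 2
third = &HAC MOD 64 '10zzzzzz: keep the low 6 bits -> 44
code = 4096 * first + 64 * second + third
PRINT code, HEX$(code) '8364 and 20AC, i.e. U+20AC
```

Each MOD strips the marker bits from the top of a byte, and the multipliers 4096 and 64 are just shifts left by 12 and 6 bits.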

You've got to love that there's always multiple ways to come to the same solution to a problem! ;D

Petr · « **Reply #11 on:** January 02, 2020, 11:39:06 am »

I do not know enough about how Unicode is constructed to have ever thought of such a solution.

SMcNeill · « **Reply #12 on:** January 02, 2020, 12:10:34 pm »

Quote from: Petr on January 02, 2020, 11:39:06 am

I do not know enough about how Unicode is constructed to have ever thought of such a solution.

Take a look at this little program and see if it doesn't come close to handling any conversion needs you might have:

Code: QB64: [Select]

REDIM SHARED __utf8d(0) AS _BYTE
DIM a(3) AS _UNSIGNED _BYTE
DIM i AS _UNSIGNED INTEGER
DIM b AS _UNSIGNED _BYTE
 
SCREEN _NEWIMAGE(800, 600, 32)
f = _LOADFONT("cyberbit.ttf", 16, "monospace")
_FONT f
 
_CONTROLCHR OFF
FOR j = 0 TO 255
    ASCIItoUTF8 j, a(), b
    PRINT j; ")";
    FOR i = 0 TO b - 1
        PRINT a(i);
    NEXT
    PRINT CHR$(j),
    PrintUniCode UTF8toUniCode(a(), b)
    PRINT
    SLEEP
NEXT
 
FUNCTION ASCIItoUnicode& (ASCII AS _UNSIGNED _BYTE)
    ASCIItoUnicode& = _MAPUNICODE(ASCII)
END FUNCTION
 
SUB ASCIItoUTF8 (ASCII AS _UNSIGNED _BYTE, a() AS _UNSIGNED _BYTE, b AS _UNSIGNED _BYTE)
    UnicodeToUTF8 _MAPUNICODE(ASCII), a(), b
END SUB
 
SUB UnicodeToUTF8 (Unicode AS LONG, a() AS _UNSIGNED _BYTE, b AS _UNSIGNED _BYTE) 'a() should be an array from 0 to 3 which we want to get the value back from.
    FOR i = 0 TO 3: a(i) = 0: NEXT 'reset all values to 0
    SELECT CASE Unicode
        CASE 0 TO 127
            a(0) = Unicode
            b = 1 'we return 1 byte
        CASE 128 TO 2047
            a(0) = 192 + Unicode \ 64
            a(1) = 128 + Unicode MOD 64
            b = 2 'we return 2 bytes
        CASE 2048 TO 65535
            a(2) = 128 + Unicode MOD 64
            r = Unicode \ 64
            a(1) = 128 + r MOD 64
            a(0) = 224 + r \ 64
            b = 3 'we return 3 bytes
        CASE 65536 TO 1114111
            a(3) = 128 + Unicode MOD 64
            r = Unicode \ 64
            a(2) = 128 + r MOD 64
            r = r \ 64
            a(1) = 128 + r MOD 64
            a(0) = 240 + r \ 64
            b = 4 'we return 4 bytes
        CASE ELSE
            PRINT "Invalid Value passed to UniCodeToUTF8 convertor!"; Unicode, HEX$(Unicode)
            ERROR 5
    END SELECT
END SUB
 
FUNCTION UTF8toUniCode& (a() AS _UNSIGNED _BYTE, b AS _UNSIGNED _BYTE)
    DIM first AS _UNSIGNED _BYTE, second AS _UNSIGNED _BYTE, third AS _UNSIGNED _BYTE, fourth AS _UNSIGNED _BYTE
    'b tells us how many bytes we used in this conversion,
    'so we know how far to move a pointer when reading from a file or UTF-8 encoded string
    SELECT CASE a(0) 'first byte is the control byte which tells us how many bytes we need to make our UTF-8 character
        CASE IS < 128
            UTF8toUniCode& = a(0)
            b = 1 'one byte character
        CASE 194 TO 223
            first = a(0) MOD 32
            second = a(1) MOD 64
            UTF8toUniCode& = 64 * first + second
            b = 2 'two byte character
        CASE 224 TO 239 '3byte UTF-8 symbol
            first = a(0) MOD 16
            second = a(1) MOD 64
            third = a(2) MOD 64
            UTF8toUniCode& = 4096 * first + 64 * second + third
            b = 3
        CASE 240 TO 255 ' 4byte UTF-8 symbol
            first = a(0) MOD 8
            second = a(1) MOD 64
            third = a(2) MOD 64
            fourth = a(3) MOD 64
            UTF8toUniCode& = CLNG(262144 * first + 4096 * second + 64 * third + fourth)
            b = 4
    END SELECT
END FUNCTION
 
SUB PrintUniCode (code AS LONG)
    IF code < 1 OR code > 65535 THEN EXIT SUB
    _MAPUNICODE code TO 0
    PRINT CHR$(0);
    _MAPUNICODE 0 TO 0
END SUB

If you notice, we have a loop from 0 to 255, which we use to get ASCII codes from: FOR j = 0 TO 255

Then we convert ASCII to UTF8 with: ASCIItoUTF8 j, a(), b

The next segment prints the UTF-8 codes for us:
FOR i = 0 TO b - 1
PRINT a(i);
NEXT

Then, to prove that things are working as they should, we print the CHR$ character itself:
PRINT CHR$(j),

And, for comparison, we then print the Unicode version from the UTF-8 translation, just so we can make certain that all of the pieces match and look identical for us:
PrintUniCode UTF8toUniCode(a(), b)

I'm thinking there are enough conversion routines in the toolkit here now that people should be able to quickly and easily convert back and forth between ASCII, UTF-8, and Unicode, as their program needs. :)

If I'm missing something, just let me know, but I think this, when inserted into a program with my key remapper, will allow me to easily create an international input/output system for use with my QB64 programs in the future -- with no real reliance on _MAPUNICODE and code pages being required.
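One way to gain confidence in a pair of conversion routines like these is a round-trip test: encode every code point, decode the result, and count the mismatches. The sketch below uses its own compact string-based helpers, `Encode$` and `Decode&`, which are hypothetical stand-ins written for this example rather than the array-based routines above. It makes no attempt to skip the surrogate range, since only the arithmetic is being checked.

```qb64
'Round-trip check: encode every code point to UTF-8, then decode it again.
'Encode$ and Decode& are hypothetical helpers written for this sketch.
DIM cp AS LONG, bad AS LONG
FOR cp = 0 TO 1114111 'U+0000 to U+10FFFF
    IF Decode&(Encode$(cp)) <> cp THEN bad = bad + 1
NEXT
PRINT "Mismatches:"; bad

FUNCTION Encode$ (cp AS LONG)
    SELECT CASE cp
        CASE 0 TO 127
            Encode$ = CHR$(cp)
        CASE 128 TO 2047
            Encode$ = CHR$(192 + (cp \ 64)) + CHR$(128 + (cp MOD 64))
        CASE 2048 TO 65535
            Encode$ = CHR$(224 + (cp \ 4096)) + CHR$(128 + ((cp \ 64) MOD 64)) + CHR$(128 + (cp MOD 64))
        CASE ELSE
            Encode$ = CHR$(240 + (cp \ 262144)) + CHR$(128 + ((cp \ 4096) MOD 64)) + CHR$(128 + ((cp \ 64) MOD 64)) + CHR$(128 + (cp MOD 64))
    END SELECT
END FUNCTION

FUNCTION Decode& (s$)
    'Decodes one complete sequence; the length of s$ selects the formula.
    DIM b0 AS LONG
    b0 = ASC(s$)
    SELECT CASE LEN(s$)
        CASE 1: Decode& = b0
        CASE 2: Decode& = 64 * (b0 MOD 32) + (ASC(s$, 2) MOD 64)
        CASE 3: Decode& = 4096 * (b0 MOD 16) + 64 * (ASC(s$, 2) MOD 64) + (ASC(s$, 3) MOD 64)
        CASE 4: Decode& = 262144 * (b0 MOD 8) + 4096 * (ASC(s$, 2) MOD 64) + 64 * (ASC(s$, 3) MOD 64) + (ASC(s$, 4) MOD 64)
    END SELECT
END FUNCTION
```

If the encoder and decoder agree, the mismatch count stays at zero. The same loop could be pointed at UnicodeToUTF8 and UTF8toUniCode by swapping the two calls.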

Unma · « **Reply #13 on:** April 06, 2020, 08:55:59 am »

Just to join the club, I also have a QB64 function that converts the leading character of a UTF-8 string into its Unicode number.

Not knowing about your efforts, I wrote this as an aid to (hopefully automated) translation among many languages.
Once again I did something needless :-)

It is written in such a way that even 8 years from now I will be able to understand what I was doing today. And that is not always the case :-)
Well, it may be helpful to somebody unfamiliar with the Unicode standard.

Code: QB64: [Select]

'------------------------------
' INPUT:  UTF-8 string
'------------------------------
' OUTPUT: ERROR   UTF2UNICODE < 0
'         ---------------------
'         OK      UTF2UNICODE >= 0 AND UTF2UNICODE <= &H10FFFF
'                 The recognised Unicode character is removed from the beginning of the argument. This can be turned off (see at the bottom).
'------------------------------
FUNCTION UTF2UNICODE& (txt$)
    DIM chlen AS INTEGER
    DIM hb AS LONG
    DIM db AS LONG
    DIM result AS LONG
 
    IF LEN(txt$) = 0 THEN
        UTF2UNICODE& = -1 'Invalid argument
        EXIT FUNCTION
    END IF
    result = -2 'Unspecified error
    chlen = 1
    hb = ASC(txt$) 'head-byte only
 
    IF (hb AND &B10000000) = 0 THEN ' ? 0xxx xxxx  TRUE=byte is ASCII character
        result = hb
    ELSE
        IF (hb AND &B11100000) = &B11000000 THEN ' ? 110x xxxx  TRUE=byte is 1st of two bytes
            'head-byte + data-byte
            ' 110xxxxx   10yyyyyy
            '---------------------
            chlen = 2
            'result = (hb AND &B00011111) * &B01000000
            result = (hb AND &H1F) * &H40 '            head-byte  shifted left 6 places  result | 0000 0000  0000 0000  0000 0xxx  xx00 0000 |
            db = ASC(MID$(txt$, 2, 1)) '                     data-byte
            'result = result OR (db AND &B00111111)
            result = result OR (db AND &H3F) '              data-byte  copied                 result | 0000 0000  0000 0000  0000 0xxx  xxyy yyyy |
        ELSE
            IF (hb AND &B11110000) = &B11100000 THEN ' ? 1110 xxxx  TRUE=byte is 1st of 3 bytes
                'head-byte + data-byte1 + data-byte2
                ' 1110xxxx   10yyyyyy     10zzzzzz
                '-----------------------------------
                chlen = 3
                'result = (hb AND &B00001111) * &B 0001 0000 0000 0000
                result = (hb AND &HF) * &H1000 '               head-byte   shifted left 12 places  result | 0000 0000  0000 0000  xxxx 0000  0000 0000 |
                db = ASC(MID$(txt$, 2, 1)) '                     data-byte1
                result = result OR ((db AND &H3F) * &H40) '  data-byte1  shifted left 6 places   result | 0000 0000  0000 0000  xxxx yyyy  yy00 0000 |
                db = ASC(MID$(txt$, 3, 1)) '                     data-byte2
                result = result OR (db AND &H3F) '           data-byte2  copied                  result | 0000 0000  0000 0000  xxxx yyyy  yyzz zzzz |
            ELSE
                IF (hb AND &B11111000) = &B11110000 THEN ' ? 1111 0xxx  TRUE=byte is 1st of 4
                    'head-byte + data-byte1 + data-byte2 + data-byte3
                    ' 11110xxx   10yyyyyy     10zzzzzz     10wwwwww
                    '------------------------------------------------
                    chlen = 4
                    'result = (hb AND &B00000111) * &B 0000 0100  0000 0000  0000 0000
                    result = (hb AND &H7) * &H40000 '              head-byte   shifted left 18 places  result | 0000 0000  000x xx00  0000 0000  0000 0000 |
                    db = ASC(MID$(txt$, 2, 1)) '                      data-byte1
                    result = result OR ((db AND &H3F) * &H1000) ' data-byte1  shifted left 12 places  result | 0000 0000  000x xxyy  yyyy 0000  0000 0000 |
                    db = ASC(MID$(txt$, 3, 1)) '                      data-byte2
                    result = result OR ((db AND &H3F) * &H40) '   data-byte2  shifted left 6 places   result | 0000 0000  000x xxyy  yyyy zzzz  zz00 0000 |
                    db = ASC(MID$(txt$, 4, 1)) '                      data-byte3
                    result = result OR (db AND &H3F) '           data-byte3  copied                  result | 0000 0000  000x xxyy  yyyy zzzz  zzww wwww |
                ELSE
                    'Not a head-byte.
                    result = hb
                END IF
            END IF
        END IF
    END IF
    IF chlen < LEN(txt$) THEN txt$ = MID$(txt$, chlen + 1) ELSE txt$ = "" ' By commenting this line, function will leave string-argument unchanged.
    UTF2UNICODE& = result
END FUNCTION
 

I intend to add some more "error codes" (negative returns) at some later point. Right now I have to deal with another issue.
It is obvious that there is bad blood between the QB64 IDE and my OS (Linux Mint).
By
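On the error codes mentioned above: one check a stricter decoder might add is that every byte after the head-byte really is a continuation byte, i.e. has the bit pattern 10xxxxxx. The sketch below is only an assumption about what such a check could look like; `ValidTail%` is a hypothetical helper, not part of UTF2UNICODE&.

```qb64
'Hypothetical helper: returns -1 (true) when every byte after the first
'in seq$ has the 10xxxxxx pattern of a UTF-8 continuation byte, else 0.
PRINT ValidTail%(CHR$(&HC3) + CHR$(&HA9)) 'well-formed 2-byte sequence
PRINT ValidTail%(CHR$(&HC3) + "A") 'second byte is not a continuation byte

FUNCTION ValidTail% (seq$)
    DIM i AS LONG
    ValidTail% = -1
    FOR i = 2 TO LEN(seq$)
        'Mask off the top two bits and compare them with binary 10.
        IF (ASC(seq$, i) AND &HC0) <> &H80 THEN ValidTail% = 0: EXIT FUNCTION
    NEXT
END FUNCTION
```

A decoder could run a test like this on the first chlen bytes and return a negative value when it fails, instead of decoding whatever bytes happen to follow.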

RhoSigma · « **Reply #14 on:** April 06, 2020, 09:13:16 am »

And a cross reference to a similar topic: https://qb64forum.alephc.xyz/index.php?topic=2248
