QB64.org Forum

Active Forums => QB64 Discussion => Topic started by: MLambert on August 29, 2019, 05:30:24 am

Title: Using a different code page
Post by: MLambert on August 29, 2019, 05:30:24 am
Hi,

I need to scan data for 'odd' characters and replace them with 'english' equivalent.

For example ... find Ž  and replace with Z.

Can I use a code page to somehow look at the data ??

Mike
Title: Re: Using a different code page
Post by: Petr on August 29, 2019, 01:47:53 pm
Hi. Maybe. I have not tried it. I think the _MAPUNICODE function can return the Unicode code of a character, so if you write your own translation table, this should be possible.

Oh man. I am curious. I must try it.

I use Czech as my input language.
Title: Re: Using a different code page
Post by: SMcNeill on August 29, 2019, 01:49:57 pm
Try this little demo over here: https://www.qb64.org/forum/index.php?topic=1647.msg108765#msg108765

Wordlist for the demo is here: https://giusseppe.net/blog/wp-content/uploads/2015/10/dict_rae_txt.zip
Title: Re: Using a different code page
Post by: Petr on August 29, 2019, 02:04:25 pm
Of course! _MAPUNICODE is not needed! First we must know which language the input is in. In principle your program works correctly, but there is one small bug, which eats the space and the first character at the place where the text is replaced:

  [ This attachment cannot be displayed inline in 'Print Page' view ]  

Title: Re: Using a different code page
Post by: SMcNeill on August 29, 2019, 02:09:49 pm
Your text must not be Unicode, as it seems you only need to get one character and then replace it.  Are those extended ASCII characters?
Title: Re: Using a different code page
Post by: Petr on August 29, 2019, 02:13:06 pm
This is unicode.
Title: Re: Using a different code page
Post by: SMcNeill on August 29, 2019, 02:17:41 pm
Can you share a file with some text, and I’ll test it out in a bit.  ;)

From my testing with, "Ahoj svete. Toto je preložený text Goggle, takže muže být obtížné porozumet.", the characters here are extended ASCII characters and not actually encoded as Unicode. The "ž" is being stored and read as CHR$(158), which is different from the Spanish codes which I was translating in the demo (they're 2-character extended codes, which read as CHR$(195) + CHR$(whatever)).

You'd need to tweak the routine a bit to read single character input and replace it, instead of double character input as with my demo, but it should be a simple enough modification to make for someone.
Title: Re: Using a different code page
Post by: Petr on August 29, 2019, 02:37:41 pm
So. Done! Really very fast! :)

First, use this program to find the character codes of your national characters. Run it while your system is switched to your national keyboard layout, and write down the values you need.

Code: QB64:
DO 'print the code of every key pressed; Esc quits
    i$ = INKEY$
    IF LEN(i$) THEN PRINT ASC(i$), i$
LOOP UNTIL i$ = CHR$(27)

Then copy the codes you wrote down into this program (done here for Czech):

Code: QB64:
PRINT "Insert some Czech text to clipboard and then press enter."
INPUT nothing
t$ = _CLIPBOARD$ 'read the text to translate from the clipboard
FOR T = 1 TO LEN(t$)
    IF ASC(t$, T) > 128 OR ASC(t$, T) < 13 THEN O$ = O$ + replace$(ASC(t$, T)) ELSE O$ = O$ + MID$(t$, T, 1)
NEXT
PRINT O$ 'show the translated text

FUNCTION replace$ (ascii)
    SELECT CASE ascii
        CASE 236: replace$ = "e" 'these codes are the outputs of the previous program; 236 is originally ě, now e
        CASE 154: replace$ = "s"
        CASE 232: replace$ = "c"
        CASE 248: replace$ = "r"
        CASE 158: replace$ = "z"
        CASE 253: replace$ = "y"
        CASE 225: replace$ = "a"
        CASE 237: replace$ = "i"
        CASE 233: replace$ = "e"
        CASE 250: replace$ = "u"
        CASE 249: replace$ = "u"
        CASE ELSE: replace$ = CHR$(ascii) 'no entry in the table: keep the character unchanged
    END SELECT
END FUNCTION


Then take the text you need to translate to "english", copy it to the clipboard, and run the second program.

These are my outputs:

Original text:

Rumunská policie rozbila gang, který v zemi týral německé děti.
Ty na rumunský venkov přijely v rámci sociálního programu,
který měl napravit jejich problematické dospívání.
Místo pobytu s psychology...

program output:

Rumunska policie rozbila gang, ktery v zemi tyral nemecke deti.
Ty na rumunsky venkov prijely v ramci socialniho programu,
ktery mel napravit jejich problematicke dospivani.
Misto pobytu s psychology...







Title: Re: Using a different code page
Post by: Petr on August 29, 2019, 02:45:18 pm
Wait, Steve. So this is not unicode. Hmmm. But if I write anything in Czech in Word, the program will translate it correctly.
Title: Re: Using a different code page
Post by: SMcNeill on August 29, 2019, 02:47:34 pm
Just remember, those are all just characters which we find in the extended ASCII set.  If the text is stored in two-byte Unicode (as the Spanish dictionary I was altering words from was), you'd need to tweak it just a bit to work properly.  You'd want to read in those 2-byte extended character codes (like CHR$(195) + CHR$(165)) and then change those to represent the letter "a", without the accent.

It's that 2-byte character replacement in my demo, which was eating the extra character in your 1-byte encoded text, and which one needs to watch out for.  You need to know what you're dealing with somewhat, before you start replacing things all willy-nilly.  I'd at least back up my test data before running (or make certain to output to a 'translated' file), to make certain I didn't destroy anything irreversibly. 
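
A minimal QB64 sketch of that idea, assuming the text really is stored as two-byte sequences that start with CHR$(195): two bytes are consumed only when the lead byte is actually there, so single-byte text is never eaten. The sample string and the two mapped pairs are made up for illustration and are not taken from the demo.

```qb64
' Sketch only: strip accents from text whose accented letters are two-byte
' sequences starting with CHR$(195). Only two pairs are mapped as examples.
DIM src AS STRING, result AS STRING, i AS LONG

src = "p" + CHR$(195) + CHR$(165) + " tr" + CHR$(195) + CHR$(168) + "s" ' "på très"
i = 1
DO WHILE i <= LEN(src)
    IF ASC(src, i) = 195 AND i < LEN(src) THEN
        SELECT CASE ASC(src, i + 1)
            CASE 165: result = result + "a" ' å
            CASE 168: result = result + "e" ' è
            CASE ELSE: result = result + MID$(src, i, 2) ' unknown pair: keep both bytes
        END SELECT
        i = i + 2 ' consume the lead byte and the following byte together
    ELSE
        result = result + MID$(src, i, 1) ' ordinary single byte: copy it as it is
        i = i + 1
    END IF
LOOP
PRINT result ' prints: pa tres
```

Note that CHR$(195) is also a valid letter on its own in some single-byte code pages, so this only makes sense once you know the file uses the two-byte encoding.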



Quote
Wait, Steve. So this is not unicode. Hmmm. But if I write anything in Czech in Word, the program will translate it correctly.

I don't think it is.  I think you're just storing the code page with your Word document when you save it.  (A lot like how _MAPUNICODE needs a particular code page to work properly for us.)  ;)

Title: Re: Using a different code page
Post by: Ryster on August 29, 2019, 03:23:01 pm
Mr. SMcNeill once gave me a definition of how to use diacritical marks, and I have had the problem solved forever since ...
Title: Re: Using a different code page
Post by: MLambert on August 30, 2019, 04:15:35 am
Hi,
Thank you all for the input ... but to complicate things .. the data can be in many different languages and a mixture within the same file .. hence I cannot load a code page for a particular language to process a single file.

Also I can't just replace say .. 134, which may be a ž, with a z because in another language 134 may be a ý which needs to be replaced with a y.

Also some of the characters may be a double byte in length.

Thks again,

Mike
Title: Re: Using a different code page
Post by: SMcNeill on August 30, 2019, 07:15:49 am
The first thing you need to know is what encoding the files are using.  ANSI is 8-bit (0 to 255 character codes), ASCII is 7-bit (0 to 127 character codes, though many times nowadays it's 8 bits as well, with the leading bit always being 0), then there's UTF-8, UTF-16, UTF-32....   

Programs like Notepad will try and guess at the file's encoding -- the first few bytes should usually be a header which tells you what it is, exactly -- but that header is often missing or wrong, in which case you just need to "trial and error" until the text makes sense, unless you know the encoding and can enter it manually.  (Which is why Notepad has the option to choose the encoding when opening a file -- no matter how good it is at guessing with its "auto-detect", it can still get it wrong sometimes and need human alteration.)
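
A rough QB64 sketch of that header check, assuming the header in question is the Unicode byte-order mark (BOM). "input.txt" is a placeholder file name and GuessBOM$ is a hypothetical helper written for this example, not a built-in. A missing BOM proves nothing, so treat the result as a hint only.

```qb64
' Sketch only: guess a file's encoding from its byte-order mark, if it has one.
DIM raw AS STRING

OPEN "input.txt" FOR BINARY AS #1 ' placeholder file name
raw = SPACE$(LOF(1))
IF LEN(raw) > 0 THEN GET #1, 1, raw ' read the whole file into the string
CLOSE #1
PRINT GuessBOM$(raw)

FUNCTION GuessBOM$ (bytes AS STRING)
    ' The 4-byte UTF-32 marks are tested first, because the UTF-32
    ' little-endian mark begins with the same two bytes as the UTF-16 one.
    IF LEFT$(bytes, 3) = CHR$(239) + CHR$(187) + CHR$(191) THEN
        GuessBOM$ = "UTF-8"
    ELSEIF LEFT$(bytes, 4) = CHR$(255) + CHR$(254) + CHR$(0) + CHR$(0) THEN
        GuessBOM$ = "UTF-32 little-endian"
    ELSEIF LEFT$(bytes, 4) = CHR$(0) + CHR$(0) + CHR$(254) + CHR$(255) THEN
        GuessBOM$ = "UTF-32 big-endian"
    ELSEIF LEFT$(bytes, 2) = CHR$(255) + CHR$(254) THEN
        GuessBOM$ = "UTF-16 little-endian"
    ELSEIF LEFT$(bytes, 2) = CHR$(254) + CHR$(255) THEN
        GuessBOM$ = "UTF-16 big-endian"
    ELSE
        GuessBOM$ = "no BOM (could be ASCII, ANSI, or UTF-8 without a BOM)"
    END IF
END FUNCTION
```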

Only once you know the encoding can you go about converting it down to the standard 128 character values (standard ASCII).  To do that, you basically need to get a list of all the code values for that specific encoding, and then map them over to what would best represent them in your new encoding.

http://www.fileformat.info/info/charset/UTF-8/list.htm -- UTF-8 encoding codes can be found here, for example.

Looking at the chart above, we see that "è" is C3A8 in hex -- CHR$(195) + CHR$(168).  Since we know the text file is encoded in UTF-8, we can now convert all CHR$(195) + CHR$(168) characters into CHR$(ASC("e")) characters...
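
That one replacement could look something like this in QB64. ReplaceAll$ is a hypothetical helper written for this example (it is not a QB64 built-in), and the sample string is made up.

```qb64
' Sketch only: replace every UTF-8 encoded "è" (bytes 195, 168) with a plain "e".
DIM sample AS STRING
sample = "tr" + CHR$(195) + CHR$(168) + "s" ' "très" as UTF-8 bytes
PRINT ReplaceAll$(sample, CHR$(195) + CHR$(168), "e") ' prints: tres

FUNCTION ReplaceAll$ (text AS STRING, find AS STRING, replacement AS STRING)
    DIM result AS STRING
    DIM position AS LONG, hit AS LONG

    ' An empty search string would never advance, so return the text untouched.
    IF LEN(find) = 0 THEN ReplaceAll$ = text: EXIT FUNCTION

    position = 1
    DO
        hit = INSTR(position, text, find)
        IF hit = 0 THEN EXIT DO
        ' Copy everything before the match, then the replacement.
        result = result + MID$(text, position, hit - position) + replacement
        position = hit + LEN(find) ' continue searching after the match
    LOOP
    ReplaceAll$ = result + MID$(text, position) ' append whatever is left
END FUNCTION
```

Calling it once per two-byte sequence from the chart builds up the whole conversion, at the cost of one pass over the text per sequence.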

But, without knowing the encoding first, you're just blindly altering values, without being certain which ones actually need changing or not.  If your files have 2-byte characters, I'd guess them to be in UTF-8 format, and they should convert over to ASCII just by mapping the code page above to standard ASCII values.  If they're not UTF-8, then you need to know or detect what they're in (and hope you don't detect the wrong encoding), and then convert from whatever format they are in to the one which suits your needs.

The steps to solve this type of problem are:
1) Determine the encoding the file is currently in.
2) Determine the encoding you want to save them in.
3) Map a set of values from the first to the second.
4) Save the converted file to a different name, just in case you screwed up with step 1 or 2, so you don't corrupt your original data.
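
The four steps above can be sketched in QB64 roughly like this, assuming step 1 found a single-byte (ANSI) Czech file and step 2 chose plain ASCII. The file names are placeholders, and the byte values are the ones Petr listed earlier in this thread.

```qb64
' Sketch only: convert a single-byte (ANSI) Czech text file to plain ASCII
' and save the result under a different name.
DIM table(0 TO 255) AS STRING
DIM raw AS STRING, result AS STRING, i AS LONG

' Step 3: build the mapping. By default every byte maps to itself.
FOR i = 0 TO 255: table(i) = CHR$(i): NEXT
table(236) = "e": table(154) = "s": table(232) = "c": table(248) = "r"
table(158) = "z": table(253) = "y": table(225) = "a": table(237) = "i"
table(233) = "e": table(250) = "u": table(249) = "u"

' Read the whole original file into a string.
OPEN "original.txt" FOR BINARY AS #1 ' placeholder file name
raw = SPACE$(LOF(1))
IF LEN(raw) > 0 THEN GET #1, 1, raw
CLOSE #1

' Translate byte by byte through the table.
FOR i = 1 TO LEN(raw)
    result = result + table(ASC(raw, i))
NEXT

' Step 4: write to a new file so the original data stays untouched.
OPEN "translated.txt" FOR OUTPUT AS #2 ' placeholder file name
PRINT #2, result;
CLOSE #2
```

Building the string one character at a time is slow for very large files, but it keeps the sketch short.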
Title: Re: Using a different code page
Post by: MLambert on September 04, 2019, 12:54:26 am
Thank you for the help.

I have decided not to use a code page but anything above 128 is a 'special' character.

I know the combinations of special characters to look for and hence replace.

I can filter the data and validate my tables against the data.
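
One possible way to do that filtering in QB64, sketched here with a placeholder file name: count how often each byte value above 127 occurs, so the 'special' characters in a file can be checked against a replacement table. This is only an illustration of the idea, not the program actually used.

```qb64
' Sketch only: list every byte value above 127 found in a file and how often it occurs.
DIM counts(128 TO 255) AS LONG
DIM raw AS STRING, i AS LONG, b AS LONG

OPEN "data.txt" FOR BINARY AS #1 ' placeholder file name
raw = SPACE$(LOF(1))
IF LEN(raw) > 0 THEN GET #1, 1, raw
CLOSE #1

FOR i = 1 TO LEN(raw)
    b = ASC(raw, i)
    IF b > 127 THEN counts(b) = counts(b) + 1 ' anything outside 7-bit ASCII
NEXT

PRINT "Code", "Hex", "Count"
FOR b = 128 TO 255
    IF counts(b) > 0 THEN PRINT b, HEX$(b), counts(b)
NEXT
```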

Thks everyone for the help.

Your comments are stored in the old memory bank and will be used in the future.

Mike