Topic: Using a different code page

Offline MLambert

  • Forum Regular
  • Posts: 115
    • View Profile
Using a different code page
« on: August 29, 2019, 05:30:24 am »
Hi,

I need to scan data for 'odd' characters and replace them with their 'English' equivalents.

For example: find Ž and replace it with Z.

Can I use a code page to somehow look at the data?

Mike

Offline Petr

  • Forum Resident
  • Posts: 1720
  • The best code is the DNA of the hops.
    • View Profile
Re: Using a different code page
« Reply #1 on: August 29, 2019, 01:47:53 pm »
Hi. Maybe; I have not tried it. I think the _MAPUNICODE function can return the Unicode code of a letter, so if you write your own translation table, this should be possible.

Oh man, I am curious. I must try it.

I use Czech as my input language.
« Last Edit: August 29, 2019, 01:50:01 pm by Petr »

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3972
    • View Profile
    • Steve’s QB64 Archive Forum
Re: Using a different code page
« Reply #2 on: August 29, 2019, 01:49:57 pm »
Try this little demo over here: https://www.qb64.org/forum/index.php?topic=1647.msg108765#msg108765

Wordlist for the demo is here: https://giusseppe.net/blog/wp-content/uploads/2015/10/dict_rae_txt.zip
« Last Edit: August 29, 2019, 01:52:10 pm by SMcNeill »
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline Petr

« Reply #3 on: August 29, 2019, 02:04:25 pm »
Of course! MapUnicode is not needed! First we must know which language the input is in. In principle your program works correctly, but there is one small bug: it swallows the space and the first character at the place where the text is replaced:

  [ attachment not available ]


Offline SMcNeill

« Reply #4 on: August 29, 2019, 02:09:49 pm »
Your text must not be Unicode, as it seems you only need to get one character and then replace it.  Are those extended ASCII characters?

Offline Petr

« Reply #5 on: August 29, 2019, 02:13:06 pm »
This is Unicode.

Offline SMcNeill

« Reply #6 on: August 29, 2019, 02:17:41 pm »
Can you share a file with some text, and I’ll test it out in a bit.  ;)

From my testing with "Ahoj svete. Toto je preložený text Goggle, takže muže být obtížné porozumet.", the characters here are extended ASCII characters and not actually encoded as Unicode. The "ž" is being stored and read as CHR$(158), which is different from the Spanish codes I was translating in the demo (those are two-byte codes, which read as CHR$(195) + CHR$(whatever)).

You'd need to tweak the routine a bit to read single-character input and replace it, instead of the double-character input in my demo, but it should be a simple enough modification to make.
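
One way to tell the two cases apart is to print the raw byte codes of a sample. This is only a rough sketch: the ShowBytes helper and both sample strings are made up for illustration (the first stores ž as the single byte 158, the second stores é as the UTF-8 pair 195 + 169).

```qb64
' Sketch: print the byte codes of a string to see how accented letters are stored.
' ShowBytes and both sample strings are assumptions made up for this illustration.
oneByte$ = "mu" + CHR$(158) + "e" ' "muže" with ž stored as the single byte 158
twoByte$ = "caf" + CHR$(195) + CHR$(169) ' "café" with é stored as the pair 195, 169

ShowBytes oneByte$
ShowBytes twoByte$
END

SUB ShowBytes (text$)
    DIM i AS LONG
    FOR i = 1 TO LEN(text$)
        PRINT ASC(text$, i); ' one number per byte, all on the same line
    NEXT
    PRINT
END SUB
```

The first line of output should contain a lone 158, and the second a 195 followed by 169. A 195 lead byte in front of each accented letter is a strong hint that the text is two-byte encoded.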
« Last Edit: August 29, 2019, 02:37:51 pm by SMcNeill »

Offline Petr

« Reply #7 on: August 29, 2019, 02:37:41 pm »
So. Done! Really very fast! :)

First, use this program to find the codes of your national characters. Run it while your system is switched to your national keyboard layout, and write down the values you need.

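Code: QB64:
DO 'keep reading keys until Esc is pressed
    i$ = INKEY$
    IF LEN(i$) THEN PRINT ASC(i$), i$ 'show the code and the character of the key pressed
LOOP UNTIL i$ = CHR$(27)
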

Then enter the codes you wrote down into this program (done here for Czech):

Code: QB64:
PRINT "Copy some Czech text to the clipboard and then press Enter."
INPUT nothing$
t$ = _CLIPBOARD$ 'take the text to convert from the clipboard
FOR T = 1 TO LEN(t$)
    'replace extended (> 128) and low control (< 13) codes, copy everything else unchanged
    IF ASC(t$, T) > 128 OR ASC(t$, T) < 13 THEN O$ = O$ + replace$(ASC(t$, T)) ELSE O$ = O$ + MID$(t$, T, 1)
NEXT
PRINT O$

FUNCTION replace$ (ascii)
    'codes not listed below return an empty string, so those characters are dropped
    SELECT CASE ascii
        CASE 236: replace$ = "e" 'these codes come from the previous program; this one was originally ě, now e
        CASE 154: replace$ = "s" 'š
        CASE 232: replace$ = "c" 'č
        CASE 248: replace$ = "r" 'ř
        CASE 158: replace$ = "z" 'ž
        CASE 253: replace$ = "y" 'ý
        CASE 225: replace$ = "a" 'á
        CASE 237: replace$ = "i" 'í
        CASE 233: replace$ = "e" 'é
        CASE 250: replace$ = "u" 'ú
        CASE 249: replace$ = "u" 'ů
    END SELECT
END FUNCTION



Then take the text you need to convert to "English", copy it to the clipboard, and run the second program.

These are my outputs:

Original text:

Rumunská policie rozbila gang, který v zemi týral německé děti.
Ty na rumunský venkov přijely v rámci sociálního programu,
který měl napravit jejich problematické dospívání.
Místo pobytu s psychology...

Program output:

Rumunska policie rozbila gang, ktery v zemi tyral nemecke deti.
Ty na rumunsky venkov prijely v ramci socialniho programu,
ktery mel napravit jejich problematicke dospivani.
Misto pobytu s psychology...

Offline Petr

« Reply #8 on: August 29, 2019, 02:45:18 pm »
Wait, Steve. So this is not unicode. Hmmm. But if I write anything in Czech in Word, the program will translate it correctly.

Offline SMcNeill

« Reply #9 on: August 29, 2019, 02:47:34 pm »
Just remember, those are all just characters which we find in the extended ASCII set.  If the text is stored in two-byte Unicode (as the Spanish dictionary I was altering words from was), you'd need to tweak it just a bit to work properly.  You'd want to read in those 2-byte extended character codes (like CHR$(195) + CHR$(165)) and then change those to represent the letter "a", without the accent.

It's that 2-byte character replacement in my demo which was eating the extra character in your 1-byte encoded text, and which one needs to watch out for.  You need to know what you're dealing with somewhat before you start replacing things all willy-nilly.  I'd at least back up my test data before running (or make certain to output to a 'translated' file), to make certain I didn't destroy anything irreversibly.
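
A rough sketch of a loop that avoids eating the extra character: it consumes two bytes only when a 195 lead byte is followed by a second byte it recognises, and otherwise copies exactly one byte. The Strip195$ name, the sample string, and the three mappings are picked for illustration; this is not the code from the linked demo.

```qb64
' Sketch only: Strip195$ and its small table are hypothetical, not the demo's code.
sample$ = "a" + CHR$(195) + CHR$(165) + " b" + CHR$(158) + " c"
PRINT Strip195$(sample$)
END

FUNCTION Strip195$ (text$)
    DIM i AS LONG
    i = 1
    DO WHILE i <= LEN(text$)
        r$ = ""
        IF ASC(text$, i) = 195 AND i < LEN(text$) THEN
            SELECT CASE ASC(text$, i + 1)
                CASE 165: r$ = "a" ' 195 + 165
                CASE 168: r$ = "e" ' 195 + 168
                CASE 177: r$ = "n" ' 195 + 177
            END SELECT
        END IF
        IF LEN(r$) THEN
            result$ = result$ + r$: i = i + 2 ' known pair: replace it and skip both bytes
        ELSE
            result$ = result$ + MID$(text$, i, 1): i = i + 1 ' anything else: copy one byte
        END IF
    LOOP
    Strip195$ = result$
END FUNCTION
```

In the sample, the 195 + 165 pair becomes "a" and the space after it survives, while the single byte 158 passes through untouched, because it is not part of a recognised pair.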



Quote
Wait, Steve. So this is not unicode. Hmmm. But if I write anything in Czech in Word, the program will translate it correctly.

I don't think it is.  I think you're just storing the code page with your Word document when you save it.  (A lot like how _MAPUNICODE needs a particular page to work properly for us.)  ;)


Offline Ryster

  • Newbie
  • Posts: 77
    • View Profile
Re: Using a different code page
« Reply #10 on: August 29, 2019, 03:23:01 pm »
Mr. SMcNeill once gave me a definition of how to use diacritical marks, and that solved the problem for me forever ...
« Last Edit: August 30, 2019, 09:45:51 am by Ryster »

Offline MLambert

« Reply #11 on: August 30, 2019, 04:15:35 am »
Hi,
Thank you all for the input, but to complicate things, the data can be in many different languages, with a mixture within the same file. Hence I cannot load a code page for one particular language to process a single file.

Also, I can't just replace, say, 134 (which may be a ž) with a z, because in another language 134 may be a ý, which needs to be replaced with a y.

Also, some of the characters may be two bytes long.

Thanks again,

Mike

Offline SMcNeill

« Reply #12 on: August 30, 2019, 07:15:49 am »
The first thing you need to know is what encoding the files are using.  ANSI is 8-bit (0 to 255 character codes), ASCII is 7-bit (0 to 127 character codes, though many times nowadays it's 8 bits as well, with the leading bit always being 0), then there's UTF-8, UTF-16, UTF-32....   

Programs like Notepad will try to guess the file's encoding. The first few bytes should usually be a header (a byte-order mark) which tells you exactly what it is, but that header is often missing or wrong, in which case you just need to "trial and error" until the text makes sense, unless you know the encoding and can enter it manually.  (Which is why Notepad has the option to choose the encoding when opening a file: no matter how good it is at guessing with its "auto-detect", it can still get it wrong sometimes and need human alteration.)
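
As a rough sketch of that header check, assuming a file called "input.txt" (a placeholder) and a made-up helper GuessBOM$: it only recognises the UTF-8 and UTF-16 byte-order marks, leaves UTF-32 out, and a missing header tells you nothing certain.

```qb64
' Sketch: guess a file's encoding from its byte-order mark (BOM), if it has one.
' GuessBOM$ and "input.txt" are placeholder names for this illustration.
DIM n AS LONG
f$ = "input.txt"
IF NOT _FILEEXISTS(f$) THEN PRINT "Cannot find "; f$: END

OPEN f$ FOR BINARY AS #1
n = LOF(1)
IF n > 3 THEN n = 3 ' the BOMs checked below are at most 3 bytes long
head$ = SPACE$(n)
IF n > 0 THEN GET #1, 1, head$
CLOSE #1
PRINT GuessBOM$(head$)
END

FUNCTION GuessBOM$ (head$)
    GuessBOM$ = "no BOM: could be ASCII, ANSI, or UTF-8 without a header"
    IF LEFT$(head$, 3) = CHR$(239) + CHR$(187) + CHR$(191) THEN GuessBOM$ = "UTF-8"
    IF LEFT$(head$, 2) = CHR$(255) + CHR$(254) THEN GuessBOM$ = "UTF-16 little-endian"
    IF LEFT$(head$, 2) = CHR$(254) + CHR$(255) THEN GuessBOM$ = "UTF-16 big-endian"
END FUNCTION
```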

Only once you know the encoding can you go about converting it down to the standard 128 character values (standard ASCII).  To do that, you basically need to get a list of all the code values for that specific encoding, and then map them over to what would best represent them in your new encoding.

http://www.fileformat.info/info/charset/UTF-8/list.htm -- UTF-8 encoding codes can be found here, for example.

Looking at the chart above, we see that "è" is C3A8 in hex, which is CHR$(195) + CHR$(168).  Since we know the text file is encoded in UTF-8, we can now convert all CHR$(195) + CHR$(168) characters into CHR$(ASC("e")) characters...

But, without knowing the encoding first, you're just blindly altering values, without being certain which ones actually need changing or not.  If your files have 2-byte characters, I'd guess them to be in UTF-8 format, and they should convert over to ASCII just by mapping the code page above to standard ASCII values.  If they're not UTF-8, then you need to know/detect what they're in, (and hope you don't detect the wrong encoding),  and then convert from whatever format they are in, to the one which suits your needs.

The steps to solve this type of problem are:
1) Determine the encoding the file is currently in.
2) Determine the encoding you want to save them in.
3) Map a set of values from the first to the second.
4) Save the converted file to a different name, just in case you screwed up with step 1 or 2, so you don't corrupt your original data.
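
The four steps above can be sketched roughly as follows, assuming step 1 found UTF-8 and step 2 chose plain ASCII. The file names and the ToAscii$ helper are placeholders, and the table covers only six Spanish letters as an example of step 3.

```qb64
' Sketch of the four steps, assuming the input is UTF-8 and the target is plain ASCII.
' File names and the ToAscii$ helper are placeholders for this illustration.
inFile$ = "input.txt"
outFile$ = "input_ascii.txt" ' step 4: write to a new name, never over the original
IF NOT _FILEEXISTS(inFile$) THEN PRINT "Cannot find "; inFile$: END

OPEN inFile$ FOR BINARY AS #1
text$ = SPACE$(LOF(1))
IF LEN(text$) THEN GET #1, 1, text$ ' read the whole file as raw bytes
CLOSE #1

OPEN outFile$ FOR OUTPUT AS #2
PRINT #2, ToAscii$(text$); ' the trailing semicolon stops PRINT adding a line break
CLOSE #2
END

FUNCTION ToAscii$ (text$)
    DIM i AS LONG, p AS LONG
    ' step 3: second bytes of the UTF-8 pairs for á é í ñ ó ú, and the plain letter for each
    second$ = CHR$(161) + CHR$(169) + CHR$(173) + CHR$(177) + CHR$(179) + CHR$(186)
    plain$ = "aeinou"
    i = 1
    DO WHILE i <= LEN(text$)
        p = 0
        IF ASC(text$, i) = 195 AND i < LEN(text$) THEN p = INSTR(second$, MID$(text$, i + 1, 1))
        IF p THEN
            result$ = result$ + MID$(plain$, p, 1): i = i + 2 ' mapped pair: two bytes in, one out
        ELSE
            result$ = result$ + MID$(text$, i, 1): i = i + 1 ' unmapped byte: copy as it is
        END IF
    LOOP
    ToAscii$ = result$
END FUNCTION
```

Keeping the table in two parallel strings makes it easy to extend: add the second byte to second$ and the replacement letter at the same position in plain$.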
« Last Edit: August 30, 2019, 08:54:34 am by SMcNeill »

Offline MLambert

« Reply #13 on: September 04, 2019, 12:54:26 am »
Thank you for the help.

I have decided not to use a code page, but to treat anything above 128 as a 'special' character.

I know the combinations of special characters to look for and hence replace.

I can filter the data and validate my tables against the data.
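
A rough sketch of that kind of check, assuming a placeholder file name "input.txt": it counts how often each byte outside the 7-bit ASCII range occurs, so the codes in a replacement table can be compared against what is really in the data. It starts at 128 so that byte is counted too.

```qb64
' Sketch: tally every byte value from 128 to 255 found in a file.
' "input.txt" is a placeholder file name for this illustration.
DIM counts(128 TO 255) AS LONG
DIM i AS LONG, c AS LONG

IF NOT _FILEEXISTS("input.txt") THEN PRINT "Cannot find input.txt": END
OPEN "input.txt" FOR BINARY AS #1
text$ = SPACE$(LOF(1))
IF LEN(text$) THEN GET #1, 1, text$ ' read the whole file as raw bytes
CLOSE #1

FOR i = 1 TO LEN(text$)
    c = ASC(text$, i)
    IF c > 127 THEN counts(c) = counts(c) + 1
NEXT

FOR c = 128 TO 255
    IF counts(c) THEN PRINT c; "occurs"; counts(c); "times"
NEXT
```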

Thanks everyone for the help.

Your comments are stored in the old memory bank and will be used in the future.

Mike