• CharacterSet Translation

    From Scott Street@1:266/420.1 to All on Fri Feb 11 13:18:22 2022
    Hello Everyone!

    After much tinkering, I've been unable to get translations to work 100%. The biggest issue being CP866 -> UTF8. It seems that I can't get Golded+ to really do translation.

    The relevant bits from my golded.cfg (which I've tried on Linux and macOS):
    -paste-
    XLATPATH /fido/etc/golded/
    XLATLOCALSET UTF-8
    XLATCHARSETALIAS UTF-8 UTF8
    XLATCHARSET CP1125 UTF-8 1125_u8.chs
    XLATCHARSET CP437 UTF-8 437_u8.chs
    XLATCHARSET CP850 UTF-8 850_u8.chs
    XLATCHARSET CP865 UTF-8 865_u8.chs
    XLATCHARSET CP866 UTF-8 866_u8.chs
    XLATCHARSET LATIN-1 UTF-8 iso1_u8.chs
    XLATCHARSET KOI8-R UTF-8 koi8_u8.chs
    -end-

    I thought the problem was just the messages, so I wrote a PHP library to read JAM files, translate the message body text to UTF-8, and output the result to the terminal (the same terminal I use for GoldED+). My terminal (Apple's macOS Terminal.app) does indeed display the characters correctly; it just seems I can't get GoldED+ to do it as well.

    PHP code bits for reference:
    -paste-
    $xlated = mb_convert_encoding($line, "UTF-8", $msg_encoding);
    -end-

    $xlated is the body line string after mb_convert_encoding() takes the raw bytes ($line) and converts them from $msg_encoding, which is the message's CHRS (or CHARSET) kludge value, mapped earlier to a PHP-native character set name. See https://www.php.net/manual/en/function.mb-convert-encoding.php for more info on the PHP function.
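
    For readers without PHP at hand, the same per-line step can be sketched in Python. This is only an illustration, not the PHP library described above: the CHRS_TO_CODEC table and the translate_line() helper are hypothetical names, the fallback charset is an arbitrary choice, and the sketch assumes the kludge value has the form "CP866 2" (charset name followed by a level number).

```python
# Hypothetical mapping from CHRS kludge names to Python codec names.
# The names on the left mirror the XLATCHARSET lines in the golded.cfg above.
CHRS_TO_CODEC = {
    "CP437": "cp437",
    "CP850": "cp850",
    "CP865": "cp865",
    "CP866": "cp866",
    "CP1125": "cp1125",
    "LATIN-1": "latin-1",
    "KOI8-R": "koi8-r",
    "UTF-8": "utf-8",
}


def translate_line(raw: bytes, chrs_value: str, fallback: str = "cp437") -> str:
    """Decode one raw body line using the charset named in the CHRS kludge."""
    # Assumed kludge form: "<charset name> <level>", e.g. "CP866 2".
    parts = chrs_value.split()
    name = parts[0].upper() if parts else ""
    codec = CHRS_TO_CODEC.get(name, fallback)
    # errors="replace" substitutes U+FFFD for bytes the codec cannot map.
    return raw.decode(codec, errors="replace")


# Six CP866 bytes spelling a Russian greeting.
line = bytes([0x8F, 0xE0, 0xA8, 0xA2, 0xA5, 0xE2])
text = translate_line(line, "CP866 2")
print(text)                  # readable on a UTF-8 terminal
print(text.encode("utf-8"))  # the bytes actually sent to the terminal
```

    Each of the six input bytes becomes two bytes on output, which is the same CP866 -> UTF-8 expansion the PHP call performs.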


    In addition, I'm using the included translation files. The most troubling display is of messages from users with CP866 character sets.
    -file 866_u8.chs-
    ;
    ; This file is a charset conversion module in text form.
    ;
    ; Source file:
    ; http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP866.TXT
    ;
    100000 ; ID number (when >65535, all 255 chars will be translated)
    0 ; version number
    ;
    2 ; level number
    ;
    CP866
    UTF-8
    ;
    \0 \0 ; NULL
    \0 \d1 ; START OF HEADING
    \0 \d2 ; START OF TEXT
    \0 \d3 ; END OF TEXT
    \0 \d4 ; END OF TRANSMISSION
    \0 \d5 ; ENQUIRY
    \0 \d6 ; ACKNOWLEDGE
    \0 \d7 ; BELL
    \0 \d8 ; BACKSPACE
    \0 \d9 ; HORIZONTAL TABULATION
    \0 \d10 ; LINE FEED
    \0 \d11 ; VERTICAL TABULATION
    \0 \d12 ; FORM FEED
    \0 \d13 ; CARRIAGE RETURN
    \0 \d14 ; SHIFT OUT
    \0 \d15 ; SHIFT IN
    \0 \d16 ; DATA LINK ESCAPE
    \0 \d17 ; DEVICE CONTROL ONE
    \0 \d18 ; DEVICE CONTROL TWO
    \0 \d19 ; DEVICE CONTROL THREE
    \0 \d20 ; DEVICE CONTROL FOUR
    \0 \d21 ; NEGATIVE ACKNOWLEDGE
    \0 \d22 ; SYNCHRONOUS IDLE
    \0 \d23 ; END OF TRANSMISSION BLOCK
    \0 \d24 ; CANCEL
    \0 \d25 ; END OF MEDIUM
    \0 \d26 ; SUBSTITUTE
    \0 \d27 ; ESCAPE
    \0 \d28 ; FILE SEPARATOR
    \0 \d29 ; GROUP SEPARATOR
    \0 \d30 ; RECORD SEPARATOR
    \0 \d31 ; UNIT SEPARATOR
    \0 \d32 ; SPACE
    \0 \d33 ; EXCLAMATION MARK
    \0 \d34 ; QUOTATION MARK
    \0 \d35 ; NUMBER SIGN
    \0 \d36 ; DOLLAR SIGN
    \0 \d37 ; PERCENT SIGN
    \0 \d38 ; AMPERSAND
    \0 \d39 ; APOSTROPHE
    \0 \d40 ; LEFT PARENTHESIS
    \0 \d41 ; RIGHT PARENTHESIS
    \0 \d42 ; ASTERISK
    \0 \d43 ; PLUS SIGN
    \0 \d44 ; COMMA
    \0 \d45 ; HYPHEN-MINUS
    \0 \d46 ; FULL STOP
    \0 \d47 ; SOLIDUS
    \0 \d48 ; DIGIT ZERO
    \0 \d49 ; DIGIT ONE
    \0 \d50 ; DIGIT TWO
    \0 \d51 ; DIGIT THREE
    \0 \d52 ; DIGIT FOUR
    \0 \d53 ; DIGIT FIVE
    \0 \d54 ; DIGIT SIX
    \0 \d55 ; DIGIT SEVEN
    \0 \d56 ; DIGIT EIGHT
    \0 \d57 ; DIGIT NINE
    \0 \d58 ; COLON
    \0 \d59 ; SEMICOLON
    \0 \d60 ; LESS-THAN SIGN
    \0 \d61 ; EQUALS SIGN
    \0 \d62 ; GREATER-THAN SIGN
    \0 \d63 ; QUESTION MARK
    \0 \d64 ; COMMERCIAL AT
    \0 \d65 ; LATIN CAPITAL LETTER A
    \0 \d66 ; LATIN CAPITAL LETTER B
    \0 \d67 ; LATIN CAPITAL LETTER C
    \0 \d68 ; LATIN CAPITAL LETTER D
    \0 \d69 ; LATIN CAPITAL LETTER E
    \0 \d70 ; LATIN CAPITAL LETTER F
    \0 \d71 ; LATIN CAPITAL LETTER G
    \0 \d72 ; LATIN CAPITAL LETTER H
    \0 \d73 ; LATIN CAPITAL LETTER I
    \0 \d74 ; LATIN CAPITAL LETTER J
    \0 \d75 ; LATIN CAPITAL LETTER K
    \0 \d76 ; LATIN CAPITAL LETTER L
    \0 \d77 ; LATIN CAPITAL LETTER M
    \0 \d78 ; LATIN CAPITAL LETTER N
    \0 \d79 ; LATIN CAPITAL LETTER O
    \0 \d80 ; LATIN CAPITAL LETTER P
    \0 \d81 ; LATIN CAPITAL LETTER Q
    \0 \d82 ; LATIN CAPITAL LETTER R
    \0 \d83 ; LATIN CAPITAL LETTER S
    \0 \d84 ; LATIN CAPITAL LETTER T
    \0 \d85 ; LATIN CAPITAL LETTER U
    \0 \d86 ; LATIN CAPITAL LETTER V
    \0 \d87 ; LATIN CAPITAL LETTER W
    \0 \d88 ; LATIN CAPITAL LETTER X
    \0 \d89 ; LATIN CAPITAL LETTER Y
    \0 \d90 ; LATIN CAPITAL LETTER Z
    \0 \d91 ; LEFT SQUARE BRACKET
    \0 \d92 ; REVERSE SOLIDUS
    \0 \d93 ; RIGHT SQUARE BRACKET
    \0 \d94 ; CIRCUMFLEX ACCENT
    \0 \d95 ; LOW LINE
    \0 \d96 ; GRAVE ACCENT
    \0 \d97 ; LATIN SMALL LETTER A
    \0 \d98 ; LATIN SMALL LETTER B
    \0 \d99 ; LATIN SMALL LETTER C
    \0 \d100 ; LATIN SMALL LETTER D
    \0 \d101 ; LATIN SMALL LETTER E
    \0 \d102 ; LATIN SMALL LETTER F
    \0 \d103 ; LATIN SMALL LETTER G
    \0 \d104 ; LATIN SMALL LETTER H
    \0 \d105 ; LATIN SMALL LETTER I
    \0 \d106 ; LATIN SMALL LETTER J
    \0 \d107 ; LATIN SMALL LETTER K
    \0 \d108 ; LATIN SMALL LETTER L
    \0 \d109 ; LATIN SMALL LETTER M
    \0 \d110 ; LATIN SMALL LETTER N
    \0 \d111 ; LATIN SMALL LETTER O
    \0 \d112 ; LATIN SMALL LETTER P
    \0 \d113 ; LATIN SMALL LETTER Q
    \0 \d114 ; LATIN SMALL LETTER R
    \0 \d115 ; LATIN SMALL LETTER S
    \0 \d116 ; LATIN SMALL LETTER T
    \0 \d117 ; LATIN SMALL LETTER U
    \0 \d118 ; LATIN SMALL LETTER V
    \0 \d119 ; LATIN SMALL LETTER W
    \0 \d120 ; LATIN SMALL LETTER X
    \0 \d121 ; LATIN SMALL LETTER Y
    \0 \d122 ; LATIN SMALL LETTER Z
    \0 \d123 ; LEFT CURLY BRACKET
    \0 \d124 ; VERTICAL LINE
    \0 \d125 ; RIGHT CURLY BRACKET
    \0 \d126 ; TILDE
    \0 \d127 ; DELETE
    \d208 \d144 ; CYRILLIC CAPITAL LETTER A
    \d208 \d145 ; CYRILLIC CAPITAL LETTER BE
    \d208 \d146 ; CYRILLIC CAPITAL LETTER VE
    \d208 \d147 ; CYRILLIC CAPITAL LETTER GHE
    \d208 \d148 ; CYRILLIC CAPITAL LETTER DE
    \d208 \d149 ; CYRILLIC CAPITAL LETTER IE
    \d208 \d150 ; CYRILLIC CAPITAL LETTER ZHE
    \d208 \d151 ; CYRILLIC CAPITAL LETTER ZE
    \d208 \d152 ; CYRILLIC CAPITAL LETTER I
    \d208 \d153 ; CYRILLIC CAPITAL LETTER SHORT I
    \d208 \d154 ; CYRILLIC CAPITAL LETTER KA
    \d208 \d155 ; CYRILLIC CAPITAL LETTER EL
    \d208 \d156 ; CYRILLIC CAPITAL LETTER EM
    \d208 \d157 ; CYRILLIC CAPITAL LETTER EN
    \d208 \d158 ; CYRILLIC CAPITAL LETTER O
    \d208 \d159 ; CYRILLIC CAPITAL LETTER PE
    \d208 \d160 ; CYRILLIC CAPITAL LETTER ER
    \d208 \d161 ; CYRILLIC CAPITAL LETTER ES
    \d208 \d162 ; CYRILLIC CAPITAL LETTER TE
    \d208 \d163 ; CYRILLIC CAPITAL LETTER U
    \d208 \d164 ; CYRILLIC CAPITAL LETTER EF
    \d208 \d165 ; CYRILLIC CAPITAL LETTER HA
    \d208 \d166 ; CYRILLIC CAPITAL LETTER TSE
    \d208 \d167 ; CYRILLIC CAPITAL LETTER CHE
    \d208 \d168 ; CYRILLIC CAPITAL LETTER SHA
    \d208 \d169 ; CYRILLIC CAPITAL LETTER SHCHA
    \d208 \d170 ; CYRILLIC CAPITAL LETTER HARD SIGN
    \d208 \d171 ; CYRILLIC CAPITAL LETTER YERU
    \d208 \d172 ; CYRILLIC CAPITAL LETTER SOFT SIGN
    \d208 \d173 ; CYRILLIC CAPITAL LETTER E
    \d208 \d174 ; CYRILLIC CAPITAL LETTER YU
    \d208 \d175 ; CYRILLIC CAPITAL LETTER YA
    \d208 \d176 ; CYRILLIC SMALL LETTER A
    \d208 \d177 ; CYRILLIC SMALL LETTER BE
    \d208 \d178 ; CYRILLIC SMALL LETTER VE
    \d208 \d179 ; CYRILLIC SMALL LETTER GHE
    \d208 \d180 ; CYRILLIC SMALL LETTER DE
    \d208 \d181 ; CYRILLIC SMALL LETTER IE
    \d208 \d182 ; CYRILLIC SMALL LETTER ZHE
    \d208 \d183 ; CYRILLIC SMALL LETTER ZE
    \d208 \d184 ; CYRILLIC SMALL LETTER I
    \d208 \d185 ; CYRILLIC SMALL LETTER SHORT I
    \d208 \d186 ; CYRILLIC SMALL LETTER KA
    \d208 \d187 ; CYRILLIC SMALL LETTER EL
    \d208 \d188 ; CYRILLIC SMALL LETTER EM
    \d208 \d189 ; CYRILLIC SMALL LETTER EN
    \d208 \d190 ; CYRILLIC SMALL LETTER O
    \d208 \d191 ; CYRILLIC SMALL LETTER PE
    \d226 \d150 \d145 ; LIGHT SHADE
    \d226 \d150 \d146 ; MEDIUM SHADE
    \d226 \d150 \d147 ; DARK SHADE
    \d226 \d148 \d130 ; BOX DRAWINGS LIGHT VERTICAL
    \d226 \d148 \d164 ; BOX DRAWINGS LIGHT VERTICAL AND LEFT
    \d226 \d149 \d161 ; BOX DRAWINGS VERTICAL SINGLE AND LEFT DOUBLE
    \d226 \d149 \d162 ; BOX DRAWINGS VERTICAL DOUBLE AND LEFT SINGLE
    \d226 \d149 \d150 ; BOX DRAWINGS DOWN DOUBLE AND LEFT SINGLE
    \d226 \d149 \d149 ; BOX DRAWINGS DOWN SINGLE AND LEFT DOUBLE
    \d226 \d149 \d163 ; BOX DRAWINGS DOUBLE VERTICAL AND LEFT
    \d226 \d149 \d145 ; BOX DRAWINGS DOUBLE VERTICAL
    \d226 \d149 \d151 ; BOX DRAWINGS DOUBLE DOWN AND LEFT
    \d226 \d149 \d157 ; BOX DRAWINGS DOUBLE UP AND LEFT
    \d226 \d149 \d144 ; BOX DRAWINGS DOUBLE HORIZONTAL
    \d226 \d148 \d148 ; BOX DRAWINGS LIGHT UP AND RIGHT
    \d226 \d148 \d180 ; BOX DRAWINGS LIGHT UP AND HORIZONTAL
    \d226 \d148 \d172 ; BOX DRAWINGS LIGHT DOWN AND HORIZONTAL
    \d226 \d148 \d156 ; BOX DRAWINGS LIGHT VERTICAL AND RIGHT
    \d226 \d148 \d128 ; BOX DRAWINGS LIGHT HORIZONTAL
    \d226 \d148 \d188 ; BOX DRAWINGS LIGHT VERTICAL AND HORIZONTAL
    \d226 \d149 \d158 ; BOX DRAWINGS VERTICAL SINGLE AND RIGHT DOUBLE
    \d226 \d149 \d159 ; BOX DRAWINGS VERTICAL DOUBLE AND RIGHT SINGLE
    \d226 \d149 \d154 ; BOX DRAWINGS DOUBLE UP AND RIGHT
    \d226 \d149 \d148 ; BOX DRAWINGS DOUBLE DOWN AND RIGHT
    \d226 \d149 \d169 ; BOX DRAWINGS DOUBLE UP AND HORIZONTAL
    \d226 \d149 \d166 ; BOX DRAWINGS DOUBLE DOWN AND HORIZONTAL
    \d226 \d149 \d160 ; BOX DRAWINGS DOUBLE VERTICAL AND RIGHT
    \d226 \d149 \d144 ; BOX DRAWINGS DOUBLE HORIZONTAL
    \d226 \d149 \d172 ; BOX DRAWINGS DOUBLE VERTICAL AND HORIZONTAL
    \d226 \d149 \d167 ; BOX DRAWINGS UP SINGLE AND HORIZONTAL DOUBLE
    \d226 \d149 \d168 ; BOX DRAWINGS UP DOUBLE AND HORIZONTAL SINGLE
    \d226 \d149 \d164 ; BOX DRAWINGS DOWN SINGLE AND HORIZONTAL DOUBLE
    \d226 \d149 \d165 ; BOX DRAWINGS DOWN DOUBLE AND HORIZONTAL SINGLE
    \d226 \d149 \d153 ; BOX DRAWINGS UP DOUBLE AND RIGHT SINGLE
    \d226 \d149 \d152 ; BOX DRAWINGS UP SINGLE AND RIGHT DOUBLE
    \d226 \d149 \d146 ; BOX DRAWINGS DOWN SINGLE AND RIGHT DOUBLE
    \d226 \d149 \d147 ; BOX DRAWINGS DOWN DOUBLE AND RIGHT SINGLE
    \d226 \d149 \d171 ; BOX DRAWINGS VERTICAL DOUBLE AND HORIZONTAL SINGLE
    \d226 \d149 \d170 ; BOX DRAWINGS VERTICAL SINGLE AND HORIZONTAL DOUBLE
    \d226 \d148 \d152 ; BOX DRAWINGS LIGHT UP AND LEFT
    \d226 \d148 \d140 ; BOX DRAWINGS LIGHT DOWN AND RIGHT
    \d226 \d150 \d136 ; FULL BLOCK
    \d226 \d150 \d132 ; LOWER HALF BLOCK
    \d226 \d150 \d140 ; LEFT HALF BLOCK
    \d226 \d150 \d144 ; RIGHT HALF BLOCK
    \d226 \d150 \d128 ; UPPER HALF BLOCK
    \d209 \d128 ; CYRILLIC SMALL LETTER ER
    \d209 \d129 ; CYRILLIC SMALL LETTER ES
    \d209 \d130 ; CYRILLIC SMALL LETTER TE
    \d209 \d131 ; CYRILLIC SMALL LETTER U
    \d209 \d132 ; CYRILLIC SMALL LETTER EF
    \d209 \d133 ; CYRILLIC SMALL LETTER HA
    \d209 \d134 ; CYRILLIC SMALL LETTER TSE
    \d209 \d135 ; CYRILLIC SMALL LETTER CHE
    \d209 \d136 ; CYRILLIC SMALL LETTER SHA
    \d209 \d137 ; CYRILLIC SMALL LETTER SHCHA
    \d209 \d138 ; CYRILLIC SMALL LETTER HARD SIGN
    \d209 \d139 ; CYRILLIC SMALL LETTER YERU
    \d209 \d140 ; CYRILLIC SMALL LETTER SOFT SIGN
    \d209 \d141 ; CYRILLIC SMALL LETTER E
    \d209 \d142 ; CYRILLIC SMALL LETTER YU
    \d209 \d143 ; CYRILLIC SMALL LETTER YA
    \d208 \d129 ; CYRILLIC CAPITAL LETTER IO
    \d209 \d145 ; CYRILLIC SMALL LETTER IO
    \d208 \d132 ; CYRILLIC CAPITAL LETTER UKRAINIAN IE
    \d209 \d148 ; CYRILLIC SMALL LETTER UKRAINIAN IE
    \d208 \d135 ; CYRILLIC CAPITAL LETTER YI
    \d209 \d151 ; CYRILLIC SMALL LETTER YI
    \d208 \d142 ; CYRILLIC CAPITAL LETTER SHORT U
    \d209 \d158 ; CYRILLIC SMALL LETTER SHORT U
    \d194 \d176 ; DEGREE SIGN
    \d226 \d136 \d153 ; BULLET OPERATOR
    \d194 \d183 ; MIDDLE DOT
    \d226 \d136 \d154 ; SQUARE ROOT
    \d226 \d132 \d150 ; NUMERO SIGN
    \d194 \d164 ; CURRENCY SIGN
    \d226 \d150 \d160 ; BLACK SQUARE
    \d194 \d160 ; NO-BREAK SPACE
    END
    -file end-
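
    As a cross-check of individual rows in the file above, the UTF-8 bytes for any CP866 byte can be recomputed independently. The following is a rough Python sketch; chs_entry() is a hypothetical helper, and it relies on Python's built-in cp866 codec rather than on anything from GoldED+.

```python
def chs_entry(byte_value: int, codec: str = "cp866") -> str:
    """Render the UTF-8 bytes for one source byte in the \\dNNN style used above."""
    utf8 = bytes([byte_value]).decode(codec).encode("utf-8")
    codes = [f"\\d{b}" for b in utf8]
    # Single-byte results get a leading \0, mirroring the ASCII rows of the file.
    if len(codes) == 1:
        codes.insert(0, "\\0")
    return " ".join(codes)


print(chs_entry(0x80))  # CYRILLIC CAPITAL LETTER A
print(chs_entry(0xB0))  # LIGHT SHADE
print(chs_entry(0x41))  # LATIN CAPITAL LETTER A
```

    The three results can be compared by eye with the rows carrying the same character names in the listing.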

    My primary example message is from FIDONEWS, MsgID "2:5030/1081.117 61f6e5cd"

    My PHP script correctly converts the CP866 characters into UTF-8, but GoldED+ just makes a mess of it.
    The tagline of the message translates to "- And you would do art. Poetry, right?"
    and the origin (loosely) to "I advise you to rub with ant alcohol".
    The message appears to have been posted by a version of GoldED running on 32-bit Windows, so I have to believe proper character translation can be done!

    Sorry for the fairly large post; just tried to give as much information as possible in one shot.

    Any help is greatly appreciated!


    Scott

    ---
    * Origin: -={ The Digital Post }=- (1:266/420.1)
  • From Kai Richter@2:240/77 to Scott Street on Sat Feb 12 15:45:44 2022
    Hello Scott!

    11 Feb 22, Scott Street wrote to All:

    The biggest issue being CP866 -> UTF8.
    It seems that I can't get Golded+ to really do translation.

    As far as I remember, GoldED can't handle multibyte charsets.

    But GoldED can invoke/start an external editor to work with the message body.

    It was said that adding multibyte support to GoldED would be like rewriting the code from scratch.
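
    To illustrate why multibyte output is awkward for code that assumes one byte per character, here is a small Python sketch. It is not GoldED code and says nothing about GoldED's internals; it only shows the general effect.

```python
text = "Привет"
cp866_bytes = text.encode("cp866")
utf8_bytes = text.encode("utf-8")

# Characters, CP866 bytes, UTF-8 bytes: the last count is double the others.
print(len(text), len(cp866_bytes), len(utf8_bytes))

# Cutting after 3 bytes is harmless in CP866 (3 whole characters) ...
print(cp866_bytes[:3].decode("cp866"))
# ... but in UTF-8 it splits the second character in half.
print(utf8_bytes[:3].decode("utf-8", errors="replace"))
```

    Any logic that counts, wraps, or truncates text by bytes would need to be revisited for UTF-8, which fits the remark about a rewrite.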

    Regards

    Kai

    --- GoldED+/LNX 1.1.4.7
    * Origin: Monobox (2:240/77)