Cover
About this collection of procedures
Constants
Code Conversion predicates
Low level procedures for Jfilter
Jfilter Main procedures
This collection contains Scheme procedures for converting between the ISO 2022 Japanese character codes (JIS, EUC) and Shift-JIS, and for converting a certain number of 'zenkaku' characters into 'hankaku' characters within them. The procedures were first conceived to be compiled with HOBBIT, to help SCM speed up character-by-character text handling, at which interpreters are not good. I have tried to make this collection compliant with R5RS. This is free software distributed under the terms of the GNU General Public License.
This procedure collection deals with Japanese characters. The code of the primitives can be extended or reduced by the user to suit the purpose to which it is applied. The collection might also be rewritten to convert between ISO 2022 compliant GB 2312 or KS C5601 and EUC, because the code conversion algorithm would be the same.
Japanese is written with 2 phonetic scripts, 'katakana' and 'hiragana', and with ideographic 'kanji' characters (Chinese characters). The Latin alphabet is also used.
'hiragana' and 'kanji' are expressed in 2 bytes, and there are 2 versions of 'katakana': single-byte 'hankaku katakana' and 2-byte 'zenkaku katakana'. This collection converts any 'hankaku katakana' into 'zenkaku katakana'.
There are 3 ascii-derived code systems for Japanese: JIS, EUC and Shift-JIS.
JIS:
The basic Japanese character sets are defined by ISO 2022 and JIS (Japanese Industrial Standards):
- The JISX0201 character sets define control characters, ascii characters (#x01 through #x7e) and 'hankaku katakana' characters (#xa1 through #xdf).
- The JISX0208 character sets (first published in 1978 and revised in 1983) define 2-byte characters composed of a 7-bit 1st byte (#x21 through #x7e) and a 7-bit 2nd byte (#x21 through #x7e).
A character set is selected by an escape sequence, that is, the Escape character (#x1b) followed by designating characters, after which the JIS character codes follow. This method seems to be shared by Chinese GB 2312 and Korean KS C5601.
The control and ascii characters are common in all 3 character sets. The differences are in multi-byte characters.
EUC:
Every character is expressed in 8-bit bytes (single byte or multi byte). Setting the MSB of each byte of a 2-byte JIS code to 1 gives the EUC code.
The 'hankaku katakana' characters are the same as in JIS but preceded by #x8e.
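The two rules above can be sketched in a few lines. This is only an illustration: the procedure names are hypothetical, not taken from the collection, and bytes are passed as plain integers so that the arithmetic is visible.

```scheme
;; Sketch only: these names are illustrative, not the collection's own.

;; Set the MSB of one 7-bit JIS byte (#x21..#x7e) by adding #x80.
(define (jis-byte->euc-byte b)
  (+ b #x80))

;; Convert a 2-byte JIS code, given as two integers, to its EUC byte pair.
(define (jis-pair->euc-pair hi lo)
  (list (jis-byte->euc-byte hi) (jis-byte->euc-byte lo)))

;; An 8-bit 'hankaku katakana' byte (#xa1..#xdf) is kept and prefixed by #x8e.
(define (hankaku->euc-pair b)
  (list #x8E b))

;; The hiragana whose JIS code is #x2422 becomes #xa4 #xa2 in EUC.
(jis-pair->euc-pair #x24 #x22)
```

Because the high bit is never set in a 7-bit JIS byte, adding #x80 here is equivalent to a bitwise OR, which R5RS does not provide.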
Shift JIS:
This code system is widely used in MS-DOS derived operating systems, where text lines are often terminated with both CARRIAGE RETURN and NEWLINE. Every character is expressed in 8-bit bytes, and the ascii and 'hankaku katakana' characters are the same as in JIS.
The Shift JIS 2 byte characters are made from JIS as follows:
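The mapping is commonly described as the following arithmetic on the two JIS bytes. This is a minimal sketch of that common formulation, not the collection's own code; the name jis-pair->sjis-pair is hypothetical and the bytes are passed as integers.

```scheme
;; Sketch only: a widely used JIS -> Shift JIS byte-pair formula.
;; j1 and j2 are the 1st and 2nd JIS bytes, each in #x21..#x7e.
(define (jis-pair->sjis-pair j1 j2)
  (let* ((base (+ (quotient (- j1 #x21) 2) #x81))
         ;; 1st bytes above #x9f skip over the 'hankaku katakana' range.
         (s1 (if (> base #x9F) (+ base #x40) base))
         (s2 (if (odd? j1)
                 ;; Odd JIS rows use the lower half of the 2nd byte range,
                 ;; skipping #x7f.
                 (let ((x (+ j2 #x1F)))
                   (if (>= x #x7F) (+ x 1) x))
                 ;; Even JIS rows use the upper half.
                 (+ j2 #x7E))))
    (list s1 s2)))

;; JIS #x2422 becomes Shift JIS #x82 #xa0.
(jis-pair->sjis-pair #x24 #x22)
```

Two JIS rows share one Shift JIS 1st byte, which is why the 2nd byte depends on whether the JIS 1st byte is odd or even.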
Character code conversion programs such as nkf and qkc already exist. The only reason to add another similar program is that this one is written in Scheme. Because it is written in Scheme, you can modify or extend the code, and it is easy to link with other programs written in Scheme.
There are constants used for various purposes.
These are character codes not clearly defined in R5RS but used frequently in the procedures:
CHAR:ESCAPE: Escape character corresponding to hexadecimal #x1B
CHAR:RETURN: Carriage return corresponding to hexadecimal #x0D
The CHAR:RETURN is used in MS-DOS derived operating systems to structure the text with lines in combination with NEWLINE code.
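Such constants can be built with the R5RS procedure integer->char. The definitions below are a guess at how they might look, not a copy of the collection's source; the name crlf is hypothetical.

```scheme
;; Hypothetical definitions; the collection's own source may differ.
(define char:escape (integer->char #x1B))
(define char:return (integer->char #x0D))

;; A line ending as used in MS-DOS derived systems:
;; carriage return followed by newline.
(define crlf (list char:return #\newline))
```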
These are symbols used to denote the character sets:
'jis : denotes JIS character sets, 1978 or 1983.
'eucj : denotes Japanese EUC character sets.
'sjis : denotes Shift JIS character sets.
'ascii : denotes ascii character sets.
'binary : This is not a character set but denotes binary codes.
These are ISO 2022 Escape Sequences, each a list of Escape Sequence characters, denoting the property of the characters that follow:
JCCCF:ASCII: #x1B #x28 #x42
Following characters are ascii.
JCCCF:ROMAN: #x1B #x28 #x4A
Following characters are roman.
JCCCF:X0201: #x1B #x28 #x49
Following characters are 'hankaku katakana'.
JCCCF:LATIN1: #x1B #x2D #x41
Following characters are latin-1.
JCCCF:X0208-1978: #x1B #x24 #x40
JISX0208-1978 2 byte characters follow.
JCCCF:X0208-1983: #x1B #x24 #x42
JISX0208-1983 2 byte characters follow.
JCCCF:X0208-1978-2: #x1B #x24 #x28 #x40
Another way of denoting JISX0208-1978.
JCCCF:X0208-1983-2: #x1B #x24 #x28 #x42
Another way of denoting JISX0208-1983.
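Since each escape sequence is a list of characters, recognizing one at the head of a character list is a simple prefix comparison. The sketch below assumes that representation; the names esc, x0208-1983-seq, ascii-seq and starts-with? are hypothetical.

```scheme
;; Sketch only: escape sequences held as lists of characters.
(define esc (integer->char #x1B))

;; #x24 is #\$, #x28 is #\( and #x42 is #\B.
(define x0208-1983-seq (list esc #\$ #\B))
(define ascii-seq (list esc #\( #\B))

;; Returns #t if the character list s begins with the characters of prefix.
(define (starts-with? s prefix)
  (cond ((null? prefix) #t)
        ((null? s) #f)
        ((char=? (car s) (car prefix))
         (starts-with? (cdr s) (cdr prefix)))
        (else #f)))

;; A JIS stream that switches to JISX0208-1983 and then sends one character.
(starts-with? (list esc #\$ #\B #\$ #\") x0208-1983-seq)
```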
There are other escape sequences in ISO 2022 for Chinese and Korean but this collection does not use them.
Returns #t if the char c is the 1st byte of a Japanese EUC 2-byte character, otherwise #f.
Returns #t if the char c is the 2nd byte of a Japanese EUC 2-byte character, otherwise #f.
Returns #t if the char c is the 1st byte of a Shift JIS 2-byte character, otherwise #f.
Returns #t if the char c is the 2nd byte of a Shift JIS 2-byte character, otherwise #f.
Returns #t if the char c is a 'hankaku katakana' character, otherwise #f.
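Predicates of this kind amount to byte range tests. The sketch below uses the conventional ranges for JISX0208 based text; the exact ranges tested by the collection are not stated here, and all names are hypothetical. It also assumes that char->integer returns the byte value of the character, as it does in a byte-oriented implementation such as SCM.

```scheme
;; Sketch only: conventional byte ranges, hypothetical names.
(define (byte-in? c lo hi)
  (let ((n (char->integer c)))
    (and (>= n lo) (<= n hi))))

;; Both bytes of a Japanese EUC 2-byte character lie in #xa1..#xfe.
(define (euc-byte? c)
  (byte-in? c #xA1 #xFE))

;; 1st byte of a Shift JIS 2-byte character.
(define (sjis-first? c)
  (or (byte-in? c #x81 #x9F) (byte-in? c #xE0 #xEF)))

;; 2nd byte of a Shift JIS 2-byte character; #x7f is excluded.
(define (sjis-second? c)
  (or (byte-in? c #x40 #x7E) (byte-in? c #x80 #xFC)))

;; Single-byte 'hankaku katakana'.
(define (hankaku-katakana? c)
  (byte-in? c #xA1 #xDF))
```

Note that the 'hankaku katakana' range overlaps the EUC byte range, which is one reason why automatic code detection is unreliable.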
The procedures in this section are the primitives and service procedures to construct the Japanese character code conversion Filter procedures in the next section.
Returns the upper byte of the 2-byte integer w.
Returns the lower byte of the 2-byte integer w.
Returns the 2-byte integer whose upper char code is uc and whose lower char code is lc.
The above 3 procedures are 'helpers' used internally by the following procedures.
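R5RS has no bitwise operators, so helpers like these can be written with quotient and remainder. The names below are hypothetical, and uc and lc are assumed to be integer char codes.

```scheme
;; Sketch only: hypothetical names for the three helpers.
(define (word-upper w) (quotient w 256))     ; upper byte of w
(define (word-lower w) (remainder w 256))    ; lower byte of w
(define (make-word uc lc) (+ (* uc 256) lc)) ; combine two bytes

;; Splitting and rejoining #x2422 gives back the same integer.
(make-word (word-upper #x2422) (word-lower #x2422))
```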
Returns the 2-byte 'zenkaku katakana' character list in to-code that corresponds to the 1-byte 'hankaku katakana' character list c. Specify 'jis, 'sjis or 'eucj for to-code. This procedure also changes the argument c itself into the 'zenkaku katakana' character list.
Converts jis character list s into shift-jis character list and returns it. This procedure rewrites the argument s.
Converts shift-jis character list s into jis character list and returns it. This procedure rewrites the argument s.
Converts jis character list s into Japanese euc character list and returns it. This procedure rewrites the argument s.
Converts Japanese euc character list s into jis character list and returns it. This procedure rewrites the argument s.
Converts sjis character list s into Japanese euc character list and returns it. This procedure rewrites the argument s.
Converts Japanese euc character list s into sjis character list and returns it. This procedure rewrites the argument s.
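For the Shift JIS to JIS direction, the byte-pair arithmetic is commonly formulated as below. This is a sketch of that common formulation only; it works on two integers rather than on a character list, does not rewrite its arguments, and the name sjis-pair->jis-pair is hypothetical.

```scheme
;; Sketch only: a widely used Shift JIS -> JIS byte-pair formula.
;; s1 and s2 are the 1st and 2nd Shift JIS bytes as integers.
(define (sjis-pair->jis-pair s1 s2)
  (let* ((lead (if (>= s1 #xE0) (- s1 #x40) s1))
         ;; Each Shift JIS 1st byte covers an odd JIS row and the next even row.
         (odd-row (+ (* (- lead #x81) 2) #x21)))
    (if (>= s2 #x9F)
        ;; Upper half of the 2nd byte range: the even row.
        (list (+ odd-row 1) (- s2 #x7E))
        ;; Lower half: the odd row; undo the skip over #x7f first.
        (list odd-row (- (if (> s2 #x7F) (- s2 1) s2) #x1F)))))

;; Shift JIS #x82 #xa0 becomes JIS #x24 #x22.
(sjis-pair->jis-pair #x82 #xA0)
```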
Converts character list s from from-code into to-code and returns it. Specify different variables for prev-sequence and cur-sequence, so that the procedure can check whether the character currently being read changes the character type from prev-sequence; this is needed because Scheme always reads files SEQUENTIALLY. See the source code of JFILTER:CV for the use of these variables and of the procedure itself. The boolean add-cr controls whether the Carriage Return character is added before #\newline when you are converting into Shift JIS: set it to #t to add one and to #f not to. This procedure rewrites the argument s.
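The effect of add-cr can be pictured with the small sketch below. It is not the collection's code: the name add-cr-sketch is hypothetical, and unlike the procedure described above it builds a new list instead of rewriting its argument.

```scheme
;; Sketch only: insert a carriage return before every #\newline.
(define (add-cr-sketch chars)
  (let ((cr (integer->char #x0D)))
    (let loop ((in chars) (out '()))
      (cond ((null? in) (reverse out))
            ((char=? (car in) #\newline)
             ;; out is built in reverse, so newline is consed on last.
             (loop (cdr in) (cons #\newline (cons cr out))))
            (else (loop (cdr in) (cons (car in) out)))))))

;; A list holding "a", newline, "b" gains a carriage return before the newline.
(add-cr-sketch (list #\a #\newline #\b))
```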
zen2han? is a boolean argument which controls whether to convert a certain number of 'zenkaku' (2 byte) characters in EUCJP, SHIFT_JIS or JIS (ISO-2022-JP) into 'hankaku' (single byte) characters.
The procedure zen->han, local to jcccf:convert, searches the 'zenkaku' vectors for the 'zenkaku' character given in the character list char-list and returns the corresponding 'hankaku' character, which has the same index in the 'hankaku' character vector.
I made an arbitrary choice of 'zenkaku' characters corresponding to the following 'hankaku' characters:
SP,.:;?!"^~_-/|`'()[]{}<>+-=$%#&*
and
[0-9A-Za-z]*.
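The lookup by shared index can be sketched with two parallel vectors. The sketch below is an assumption about the general shape only: it holds a handful of 'zenkaku' characters as 2-byte JIS integers, uses hypothetical names, and returns #f when nothing matches.

```scheme
;; Sketch only: parallel vectors, hypothetical names, tiny sample table.
;; 'zenkaku' JIS codes: space, ?, !, 0, A.
(define zenkaku-codes (vector #x2121 #x2129 #x212A #x2330 #x2341))
(define hankaku-chars (vector #\space #\? #\! #\0 #\A))

;; Returns the 'hankaku' character at the index where code is found
;; in zenkaku-codes, or #f if code is not in the table.
(define (zen->han-sketch code)
  (let loop ((i 0))
    (cond ((= i (vector-length zenkaku-codes)) #f)
          ((= code (vector-ref zenkaku-codes i))
           (vector-ref hankaku-chars i))
          (else (loop (+ i 1))))))

(zen->han-sketch #x2341) ; the 'zenkaku' letter A
```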
Sets the newly encountered character property specified in sequence as the current property cur-sequence. The property which has been valid until now and has become obsolete is stored in prev-sequence. For sequence, specify one of the Escape Sequences described at the top of this document. This procedure is designed to change the passed arguments themselves rather than copies of them. The lists cur-sequence and prev-sequence must be different lists.
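Changing the passed lists themselves means mutating their pairs with set-car! and set-cdr!. The sketch below shows one way this could be done; it is an assumption, not the collection's code, and the names overwrite-list! and set-sequence-sketch! are hypothetical.

```scheme
;; Sketch only: overwrite the non-empty list dst in place with src.
(define (overwrite-list! dst src)
  (set-car! dst (car src))
  ;; append with '() makes a fresh copy of the tail of src.
  (set-cdr! dst (append (cdr src) '())))

;; Save the current property into prev-sequence, then install sequence.
(define (set-sequence-sketch! sequence cur-sequence prev-sequence)
  (overwrite-list! prev-sequence cur-sequence)
  (overwrite-list! cur-sequence sequence))

(define esc (integer->char #x1B))
(define cur (list esc #\( #\B))   ; currently ascii
(define prev (list esc #\( #\B))
(set-sequence-sketch! (list esc #\$ #\B) cur prev)
;; cur now holds ESC $ B and prev holds ESC ( B.
```

If the same list were passed as both cur-sequence and prev-sequence, the second overwrite would destroy the saved property, which illustrates why the two lists must be different.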
Writes character list s to the given port.
Converts one entire input in from-code into output in to-code. The input and output may each be a Scheme port, a filename string or #f. The to-code and from-code must each be one of 'jis, 'eucj, 'sjis, or #f. The ascii characters are recognized and passed through without any change.
The last 4 arguments, remove-cr, add-cr, check-length and han2zen?, are optional and can be omitted. The boolean arguments remove-cr and add-cr serve for conversion from or to Shift JIS and specify, if they are present, whether the Carriage Return must be removed or added. The numeric argument check-length specifies the number of characters to be read by the program in order to establish the code of input. The boolean argument han2zen? specifies whether to allow conversion of 'zenkaku' characters into 'hankaku', whatever the character code is.
In order to specify han2zen?, all of the optional arguments must be specified. In order to specify check-length, you must specify both remove-cr and add-cr, and remove-cr cannot be omitted when add-cr is specified.
You can specify #f for all the mandatory arguments, (cv-file #f #f #f #f), in which case input defaults to (current-input-port), output to (current-output-port), from-code to the code judged automatically by calling JUDGE-FILE, to-code to 'eucj, remove-cr to #t, add-cr to #f, check-length to 5000 and han2zen? to #f.
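The ordering rule for the optional arguments follows from their being positional. The sketch below shows how such defaults could be resolved from a rest argument in R5RS; it is an illustration with hypothetical names, not the body of cv-file, and it uses the default values stated above.

```scheme
;; Sketch only: resolve positional optional arguments with defaults.
;; Returns the list (remove-cr add-cr check-length han2zen?).
(define (optional-args-sketch . opts)
  (define (nth-or n default)
    (if (> (length opts) n) (list-ref opts n) default))
  (list (nth-or 0 #t)      ; remove-cr
        (nth-or 1 #f)      ; add-cr
        (nth-or 2 5000)    ; check-length
        (nth-or 3 #f)))    ; han2zen?

;; Giving add-cr requires giving remove-cr first.
(optional-args-sketch #f #t)
```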
Converts the Scheme string string from from-code into to-code and returns the converted, newly allocated string. This procedure uses the SLIB library package 'string-port. The to-code and from-code must each be one of 'jis, 'eucj or 'sjis. The ascii characters are passed through without any change. The procedure may be useful when called from other Scheme programs that convert text lines portion by portion. When compiled with HOBBIT, the code seems not to be clonable and runs slower than under the SCM interpreter. For users of SCM and HOBBIT, I recommend using the code from your ScmInit.scm.
;; An example: convert a text file line by line and display the result.
(require 'line-i/o)   ; SLIB package providing read-line and write-line
(define (convert-string file from-code to-code)
  (call-with-input-file file
    (lambda (inport)
      (do ((s (read-line inport) (read-line inport)))
          ((eof-object? s))
        (write-line (cv-string s from-code to-code))))))
The above procedure converts all the lines of a text file and displays them. The procedure cv-file is more efficient for this purpose. Although there is no limit on string length, CV-STRING is best used for converting one line of a text file at a time.
This is the body of CV-FILE and CV-STRING, which accepts ports and other parameters from those upper procedures and performs the character code conversion. It converts the contents of inport and writes the result to outport.
c-length is optional. Judges the code of input by reading up to c-length characters if c-length is given, or up to 5000 characters if it is not. Returns one of the symbols 'jis, 'sjis, 'eucj, 'ascii or 'binary according to the file specified as input. The input may be a filename string or a Scheme port. Beware that this procedure is far from perfect when the input contains many character codes common to all 3 character code sets, especially 'hankaku katakana' characters.
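The general idea of such a judgment, and the reason it cannot be perfect, can be seen in the deliberately crude sketch below. It is not the algorithm of the collection: the name judge-bytes-sketch is hypothetical, the input is a list of byte integers, 'binary is never returned, and any 8-bit text without decisive evidence is simply guessed to be 'eucj.

```scheme
;; Sketch only: a crude code guess over at most limit bytes.
(define (judge-bytes-sketch bytes limit)
  (let loop ((bs bytes) (n 0) (high? #f))
    (if (or (null? bs) (>= n limit))
        ;; No decisive byte seen: guess from whether any 8-bit byte appeared.
        (if high? 'eucj 'ascii)
        (let ((b (car bs)))
          (cond ((= b #x1B) 'jis)          ; an escape sequence starts here
                ;; Bytes #x80..#x9f other than #x8e and #x8f
                ;; do not occur in Japanese EUC text.
                ((and (>= b #x80) (<= b #x9F)
                      (not (= b #x8E)) (not (= b #x8F)))
                 'sjis)
                (else
                 (loop (cdr bs) (+ n 1) (or high? (>= b #x80)))))))))

(judge-bytes-sketch '(#x82 #xA0) 5000) ; one Shift JIS hiragana
```

Shift JIS text made only of 'hankaku katakana' or of characters whose 1st byte is #xe0 or above would be misjudged as 'eucj by this sketch, which is the kind of ambiguity the warning above refers to.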