Japanese Character Code Conversion Filter procedures in Scheme

Cover  
About this collection of procedures  
Constants  
Code Conversion predicates  
Low level procedures for Jfilter  
Jfilter Main procedures  


Cover

This collection contains Scheme procedures that convert between the ISO 2022 based Japanese character codes (JIS, EUC) and Shift-JIS, and that convert a certain number of 'zenkaku' characters into their 'hankaku' counterparts. They were first conceived to be compiled with HOBBIT so that SCM could speed up character-by-character text handling, which interpreters are not good at. I have tried to keep this collection compliant with R5RS. This is free software distributed under the terms of the GNU General Public License.

May 2002 By Dai INUKAI (inukai.d@jeans.ocn.ne.jp)


About this collection of procedures

This procedure collection deals with Japanese characters. The code of the primitives can be extended or reduced by the user to suit the purpose at hand. The collection might also be rewritten to convert between the ISO 2022 compliant GB 2312 or KS C5601 codes and EUC, because the conversion algorithm would be the same.

Japanese writing uses 2 phonetic scripts, 'katakana' and 'hiragana', together with ideographic 'kanji' characters (Chinese characters). The Latin alphabet is also used.

'hiragana' and 'kanji' are expressed in 2 bytes, and there are 2 versions of 'katakana': single byte 'hankaku katakana' and 2-byte 'zenkaku katakana'. This collection converts any 'hankaku katakana' into 'zenkaku katakana'.

There are 3 ascii derived code systems for Japanese: JIS, EUC and Shift-JIS.

JIS:

The Japanese character basics are defined by ISO 2022 or JIS (Japanese Industrial Standards):

- JISX0201 character sets define control characters, ascii characters (#x01 through #x7e) and 'hankaku' characters (#xa1 through #xdf).

- JISX0208 character sets (published in 1978 and revised in 1983) define 2-byte characters, each composed of a 7-bit 1st byte (#x21 through #x7e) and a 7-bit 2nd byte (#x21 through #x7e).

The character set in use is selected by an escape sequence: the Escape Character code (#x1b) followed by designating characters, after which the JIS character codes follow. The same method seems to be used for Chinese GB 2312 and Korean KS C5601.

The control and ascii characters are common to all 3 code systems. The differences are in the multi-byte characters.

EUC:

Every character is expressed in 8-bit bytes (single byte or multi byte). Setting the MSB of each byte of a 2-byte JIS code to 1 gives the corresponding EUC code.

The 'hankaku katakana' characters are the same as in JIS but preceded by #x8e.
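
The two EUC rules above can be sketched in a few lines. This is only an illustration working on byte values as integers; the names jis-bytes->euc-bytes and hankana-byte->euc-bytes are hypothetical and are not procedures of this collection. Addition is used instead of a bitwise OR because R5RS has no bitwise operators.

```scheme
;; Hypothetical helpers for illustration, not part of this collection.

;; j1 and j2 are the two 7-bit JIS bytes (#x21 through #x7e).
;; Setting the MSB of a 7-bit value is the same as adding #x80.
(define (jis-bytes->euc-bytes j1 j2)
  (list (+ j1 #x80) (+ j2 #x80)))

;; k is a single 'hankaku katakana' byte (#xa1 through #xdf).
;; In EUC it keeps its value but is preceded by #x8e.
(define (hankana-byte->euc-bytes k)
  (list #x8e k))
```

For instance, the hiragana letter A, JIS #x24 #x22, becomes #xa4 #xa2 in EUC.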

Shift JIS:

This code system is widely used in MS-DOS derived operating systems, where lines of text often end with both CARRIAGE RETURN and NEWLINE. Every character is expressed in 8-bit bytes, and the ascii and 'hankaku katakana' characters are the same as in JIS.

The Shift JIS 2 byte characters are made from JIS as follows:
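
A minimal sketch of the widely documented arithmetic for this conversion follows. It is not taken from the source of this collection; the name jis-bytes->sjis-bytes is hypothetical, and the bytes are handled as integers rather than as characters.

```scheme
;; Hypothetical helper for illustration, not part of this collection.
;; j1 and j2 are the two JIS bytes as integers (#x21 through #x7e).
;; Returns a list of the two Shift JIS bytes.
(define (jis-bytes->sjis-bytes j1 j2)
  (let ((row-offset (if (< j1 #x5f) #x70 #xb0))   ; low or high lead-byte range
        (cell-offset (cond ((even? j1) #x7e)      ; even row: upper half of trail range
                           ((> j2 #x5f) #x20)     ; odd row, skipping #x7f
                           (else #x1f))))         ; odd row, lower part
    ;; Two JIS rows share one Shift JIS lead byte.
    (list (+ (quotient (+ j1 1) 2) row-offset)
          (+ j2 cell-offset))))
```

With this sketch the hiragana letter A, JIS #x24 #x22, maps to Shift JIS #x82 #xa0, and the 'zenkaku' space, JIS #x21 #x21, maps to #x81 #x40.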

Character code conversion programs such as nkf and qkc already exist. The only reason to add another similar program is that this one is written in Scheme. But because it is written in Scheme, you can modify or extend the code, and it is easy to link with other programs written in Scheme.


Constants

There are constants used for various purposes.

These are characters not clearly defined in R5RS but used frequently in the procedures:

 
CHAR:ESCAPE: Escape character corresponding to hexadecimal #X1B
CHAR:RETURN: Carriage return  corresponding to hexadecimal #X0D

CHAR:RETURN is used in MS-DOS derived operating systems, in combination with the NEWLINE code, to end each line of text.
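
As an illustration of how such a constant can be used, here is a minimal sketch that drops a Carriage Return standing immediately before a NEWLINE. It assumes that integer->char maps #x0d to Carriage Return, which R5RS does not guarantee but which holds in ascii based implementations. The helper strip-cr is hypothetical and is not part of this collection.

```scheme
;; Assumption: integer->char follows ascii for codes below #x80.
(define char:escape (integer->char #x1b))
(define char:return (integer->char #x0d))

;; Hypothetical helper: returns a new character list in which every
;; Carriage Return directly followed by #\newline has been dropped.
(define (strip-cr chars)
  (cond ((null? chars) '())
        ((and (char=? (car chars) char:return)
              (pair? (cdr chars))
              (char=? (cadr chars) #\newline))
         (strip-cr (cdr chars)))
        (else (cons (car chars) (strip-cr (cdr chars))))))
```

A Carriage Return that is not followed by NEWLINE is left in place by this sketch.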

These are symbols used to denote the character sets:

 
'jis   : denotes JIS character sets, 1978  or 1983.
'eucj  : denotes Japanese EUC character sets.
'sjis  : denotes Shift JIS character sets.
'ascii : denotes ascii character sets.
'binary: This is not a character set but denotes binary codes.

These are ISO 2022 Escape Sequences, each a list of Escape Sequence characters, announcing the property of the characters that follow:

 
JCCCF:ASCII:        #x1B #x28 #x42
                  Following characters are ascii.
JCCCF:ROMAN:        #x1B #x28 #x4a
                  Following characters are roman.
JCCCF:X0201:        #x1B #x28 #x49
                  Following characters are 'hankaku katakana'.
JCCCF:LATIN1:       #x1B #x2D #x41
                  Following characters are latin-1.
JCCCF:X0208-1978:   #x1B #x24 #x40
                  JISX0208-1978 2 byte characters follow.
JCCCF:X0208-1983:   #x1B #x24 #x42
                  JISX0208-1983 2 byte characters follow.
JCCCF:X0208-1978-2: #x1B #x24 #x28 #x40
                  Another way of denoting JISX0208-1978.
JCCCF:X0208-1983-2: #x1B #x24 #x28 #x42
                  Another way of denoting JISX0208-1983.

There are other escape sequences in ISO 2022 for Chinese and Korean but this collection does not use them.
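
Since each sequence is a list of characters, recognizing one in the input amounts to a prefix comparison of two lists. The following is a minimal sketch under that assumption; the names sketch:x0208-1983 and sequence-prefix? are hypothetical, and the collection's own constants may be built differently.

```scheme
;; Assumption: integer->char follows ascii for codes below #x80.
;; A sequence modelled on JCCCF:X0208-1983 (#x1B #x24 #x42).
(define sketch:x0208-1983 (map integer->char (list #x1b #x24 #x42)))

;; Hypothetical helper: #t if the character list chars begins with
;; all the characters of the escape sequence seq.
(define (sequence-prefix? seq chars)
  (cond ((null? seq) #t)
        ((null? chars) #f)
        ((char=? (car seq) (car chars))
         (sequence-prefix? (cdr seq) (cdr chars)))
        (else #f)))
```

A reader would test the longer 4-character sequences before the 3-character ones only if a longer sequence could begin with a shorter one; with the table above that is not the case.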


Code Conversion predicates

procedure: eucj1? c

Returns #t if the char c is the 1st byte of a Japanese euc 2-byte character, otherwise #f.

procedure: eucj2? c

Returns #t if the char c is the 2nd byte of a Japanese euc 2-byte character, otherwise #f.

procedure: sjis1? c

Returns #t if the char c is the 1st byte of a shift-jis 2-byte character, otherwise #f.

procedure: sjis2? c

Returns #t if the char c is the 2nd byte of a shift-jis 2-byte character, otherwise #f.

procedure: hankana? c

Returns #t if the char c is a 'hankaku katakana' character, otherwise #f.
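
Predicates of this kind are range tests on the character code. The sketch below uses the commonly cited byte ranges for EUC and Shift JIS and the 'hankaku' range given earlier in this document; the ranges actually tested by this collection may differ, and all names prefixed with sketch- are hypothetical.

```scheme
;; Hypothetical helpers for illustration, not part of this collection.
(define (in-range? n lo hi)
  (and (>= n lo) (<= n hi)))

;; EUC: both bytes of a 2-byte character lie in #xa1 through #xfe.
(define (sketch-eucj1? c) (in-range? (char->integer c) #xa1 #xfe))
(define (sketch-eucj2? c) (in-range? (char->integer c) #xa1 #xfe))

;; Shift JIS: the 1st byte lies in #x81-#x9f or #xe0-#xef (assumed ranges).
(define (sketch-sjis1? c)
  (let ((n (char->integer c)))
    (or (in-range? n #x81 #x9f) (in-range? n #xe0 #xef))))

;; Shift JIS: the 2nd byte lies in #x40-#x7e or #x80-#xfc.
(define (sketch-sjis2? c)
  (let ((n (char->integer c)))
    (or (in-range? n #x40 #x7e) (in-range? n #x80 #xfc))))

;; 'hankaku katakana': #xa1 through #xdf.
(define (sketch-hankana? c) (in-range? (char->integer c) #xa1 #xdf))
```

Note that the ranges overlap: a byte such as #xb1 satisfies the EUC tests and the 'hankaku katakana' test at the same time, so a predicate alone cannot decide which code system a text uses.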


Low level procedures for Jfilter

The procedures in this section are the primitives and service procedures used to construct the Japanese character code conversion Filter procedures in the next section.

procedure: upper-byte w

Returns the upper byte of the 2-byte integer w.

procedure: lower-byte w

Returns the lower byte of the 2-byte integer w.

procedure: byte->word uc lc

Returns the 2-byte integer whose upper char code is uc and whose lower char code is lc.

The above 3 procedures are 'helpers' used internally in the following procedures.
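
A minimal sketch of such helpers, written with quotient and remainder because R5RS has no bitwise operators. It assumes that uc and lc are passed as integer codes; the names prefixed with sketch- are hypothetical and the collection's own definitions may differ.

```scheme
;; Hypothetical equivalents for illustration, not the collection's source.
(define (sketch-upper-byte w) (quotient w 256))
(define (sketch-lower-byte w) (remainder w 256))

;; Assumption: uc and lc are integer codes in the range 0 through 255.
(define (sketch-byte->word uc lc) (+ (* uc 256) lc))
```

Splitting a word and joining the two halves again gives back the original word, which is a convenient check when modifying the helpers.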

procedure: han->zen c to-code

Returns the 2-byte 'zenkaku katakana' character list, encoded in to-code, that corresponds to the 1-byte 'hankaku katakana' character list c. The to-code must be 'jis, 'sjis or 'eucj. This procedure also rewrites the argument c into the 'zenkaku katakana' character list.

procedure: jis->sjis s

Converts the jis character list s into a shift-jis character list and returns it. This procedure rewrites the argument s.

procedure: sjis->jis s

Converts the shift-jis character list s into a jis character list and returns it. This procedure rewrites the argument s.

procedure: jis->eucj s

Converts the jis character list s into a Japanese euc character list and returns it. This procedure rewrites the argument s.

procedure: eucj->jis s

Converts the Japanese euc character list s into a jis character list and returns it. This procedure rewrites the argument s.

procedure: sjis->eucj s

Converts the shift-jis character list s into a Japanese euc character list and returns it. This procedure rewrites the argument s.

procedure: eucj->sjis s

Converts the Japanese euc character list s into a shift-jis character list and returns it. This procedure rewrites the argument s.
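
Because these procedures rewrite their argument, the list passed in and the list returned are the same object, and a caller who still needs the original must copy it first. The sketch below only illustrates that behaviour with set-car!; map-chars! and euc-char->jis-char are hypothetical names, the per-character rule ignores escape sequences and 'hankaku katakana', and it is not the collection's implementation.

```scheme
;; Hypothetical helper: applies proc to every character of the list s
;; in place and returns s itself.
(define (map-chars! proc s)
  (let loop ((rest s))
    (if (pair? rest)
        (begin
          (set-car! rest (proc (car rest)))
          (loop (cdr rest)))))
  s)

;; Simplified rule for illustration: clear the MSB of an 8-bit character.
(define (euc-char->jis-char c)
  (let ((n (char->integer c)))
    (if (>= n #x80)
        (integer->char (- n #x80))
        c)))

;; The list must be freshly allocated; literal constants must not be mutated.
(define original (list (integer->char #xa4) (integer->char #xa2)))
(define result (map-chars! euc-char->jis-char original))
```

After the call, original and result are the same list and both hold the characters with codes #x24 and #x22.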

procedure: jcccf:convert s from-code to-code prev-sequence cur-sequence add-cr zen2han?

Converts the character list s from from-code into to-code and returns it. Pass different variables for prev-sequence and cur-sequence: because Scheme always reads files SEQUENTIALLY, the two are compared to check whether the character currently being read has changed type from prev-sequence. See the source code of JFILTER:CV for the use of these variables and of the procedure itself. The boolean add-cr controls whether the Carriage Return character is added before #\newline when you are converting into Shift JIS: #t adds it and #f does not. This procedure rewrites the argument s.

zen2han? is a boolean argument which controls whether to convert a certain number of 'zenkaku' (2 byte) characters in EUCJP, SHIFT_JIS or JIS (ISO-2022-JP) into 'hankaku' (single byte) characters.

The procedure zen->han, local to jcccf:convert, looks up the 'zenkaku' character given in the character list char-list in the 'zenkaku' vectors and returns the corresponding 'hankaku' character, which has the same index in the 'hankaku' character vector.

I made an arbitrary choice of 'zenkaku' characters corresponding to the following 'hankaku' characters:

SP,.:;?!"^~_-/|`'()[]{}<>+-=$%#&* 

and

[0-9A-Za-z]*.
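
The parallel vector lookup described above can be sketched as follows. The sketch makes several assumptions: the 'zenkaku' character is given as a single JIS X 0208 code (an integer) rather than as a character list, only 4 sample entries are listed, and the names zenkaku-codes, hankaku-chars and sketch-zen->han are hypothetical.

```scheme
;; Sample entries only: 'zenkaku' space, 0, A and a as JIS X 0208 codes,
;; and the 'hankaku' character stored at the same index.
(define zenkaku-codes (vector #x2121 #x2330 #x2341 #x2361))
(define hankaku-chars (vector #\space #\0 #\A #\a))

;; Hypothetical lookup: returns the 'hankaku' character with the same
;; index as code, or #f when code is not in the 'zenkaku' vector.
(define (sketch-zen->han code)
  (let loop ((i 0))
    (cond ((= i (vector-length zenkaku-codes)) #f)
          ((= code (vector-ref zenkaku-codes i))
           (vector-ref hankaku-chars i))
          (else (loop (+ i 1))))))
```

Returning #f for a code that is not listed lets the caller pass such characters through unchanged. Adding a character to the conversion means adding one entry at the same position in both vectors.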

procedure: jcccf:set-sequence! cur-sequence prev-sequence sequence

Sets the current property cur-sequence to the newly encountered character property given in sequence. The property that was valid until then, and is now obsolete, is stored in prev-sequence. The sequence must be one of the Escape Sequences described in the Constants section. This procedure changes the passed arguments themselves rather than copies of them, so the lists cur-sequence and prev-sequence must be different ones.
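
One way to change the passed lists themselves is to overwrite their first pair with set-car! and set-cdr!. The following is a minimal sketch of that idea, not the collection's source; replace-contents! and sketch-set-sequence! are hypothetical names, and all three lists are assumed to be non-empty.

```scheme
;; Hypothetical helper: makes the list target hold the same elements as
;; source by overwriting its first pair. (append x '()) copies x, so
;; target never shares structure with source.
(define (replace-contents! target source)
  (set-car! target (car source))
  (set-cdr! target (append (cdr source) '())))

;; Hypothetical equivalent of the behaviour described above.
(define (sketch-set-sequence! cur-sequence prev-sequence sequence)
  (replace-contents! prev-sequence cur-sequence)   ; remember the old property
  (replace-contents! cur-sequence sequence))       ; install the new one
```

If the same list were passed as both cur-sequence and prev-sequence, the first step would leave it unchanged and the old property would be lost, which is why the two lists must be different.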

procedure: jcccf:write-list s port

Writes character list s to the given port.


Jfilter Main procedures

procedure: cv-file input output from-code to-code remove-cr add-cr check-length zen2han?

Converts one entire input in from-code into output in to-code. The input and output may each be a Scheme port, a filename string or #f. The from-code and to-code must each be one of 'jis, 'eucj, 'sjis or #f. The ascii characters are recognized and passed through without any change.

The last 4 arguments, remove-cr, add-cr, check-length and zen2han?, are optional and can be omitted:

- remove-cr and add-cr, boolean arguments, serve for the conversion from or to Shift-JIS and specify whether Carriage Return characters, if they exist, must be removed, or whether they must be added.

- check-length, a number argument, specifies the number of characters to be read by the program in order to establish the code of input.

- zen2han?, a boolean argument, specifies whether to allow conversions of 'zenkaku' characters into 'hankaku', whatever the character code is.

The optional arguments are positional. remove-cr can not be omitted if you specify add-cr, you must specify both remove-cr and add-cr in order to specify check-length, and all of the optional arguments must be specified in order to specify zen2han?.

You can specify #f for all the mandatory arguments, as in (cv-file #f #f #f #f), in which case input defaults to (current-input-port), output to (current-output-port), from-code to the code judged automatically by calling JUDGE-FILE, and to-code to 'eucj. When omitted, remove-cr defaults to #t, add-cr to #f, check-length to 5000 and zen2han? to #f.
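
The positional rule for the optional arguments follows naturally from collecting them in a rest argument. Here is a minimal sketch of that mechanism using the default values stated above; cv-options is a hypothetical name, and the way cv-file itself reads its arguments may differ.

```scheme
;; Hypothetical helper: takes the optional arguments in order and
;; returns the list (remove-cr add-cr check-length zen2han?),
;; filling in the documented default for each one that is omitted.
(define (cv-options . opts)
  (define (nth-or-default n default)
    (let loop ((rest opts) (i n))
      (cond ((null? rest) default)
            ((= i 0) (car rest))
            (else (loop (cdr rest) (- i 1))))))
  (list (nth-or-default 0 #t)      ; remove-cr
        (nth-or-default 1 #f)      ; add-cr
        (nth-or-default 2 5000)    ; check-length
        (nth-or-default 3 #f)))    ; zen2han?
```

With no arguments the sketch returns all 4 defaults. To reach a later argument, every earlier one has to be supplied, exactly as described for cv-file.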

procedure: cv-string string from-code to-code

Converts the Scheme string string in from-code into to-code and returns the converted string, which is newly allocated. This procedure uses the SLIB library package 'string-port. The to-code and from-code must be one of 'jis, 'eucj or 'sjis. The ascii characters are passed through without any change. The procedure may be useful when called from other Scheme programs that convert text portion by portion. When compiled with HOBBIT, the code seems not to be clonable and is slower than when executed by the SCM interpreter. For users of SCM and HOBBIT, I recommend using the code from your ScmInit.scm.

 
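;; An example: convert a text file line by line and display the result.
;; READ-LINE and WRITE-LINE are provided by the SLIB package 'line-i/o.
(require 'line-i/o)

(define (convert-string file from-code to-code)
  (call-with-input-file file
    (lambda (inport)
      ;; Read one line at a time until the end of the file.
      (do ((s (read-line inport) (read-line inport)))
          ((eof-object? s))
        ;; Convert the line and write it to the current output port.
        (write-line (cv-string s from-code to-code))))))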

The above procedure converts all the lines of a text file and displays them. The procedure cv-file is more efficient for this purpose. Although there is no limit on string length, CV-STRING is better used for converting a line or a portion of a text file.

procedure: jfilter:cv inport outport from-code to-code remove-cr add-cr

The body of CV-FILE and CV-STRING: it accepts ports and other parameters from those upper procedures and performs the character code conversion. It converts the contents of inport and writes the result to outport.

procedure: judge-file input c-length

c-length is optional. Judges the code of input by reading up to c-length characters if it is given, or up to 5000 characters if it is not. Returns one of the symbols 'jis, 'sjis, 'eucj, 'ascii or 'binary according to the contents of input. The input may be a filename string or a Scheme port. Beware that this procedure is far from perfect when the input contains plenty of character codes common to all 3 character code systems, especially 'hankaku katakana' characters.
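
To show why such a judgement can fail, here is a toy heuristic working on a list of byte values. It is not the algorithm used by JUDGE-FILE: the name guess-code, the byte ranges and the extra result 'unknown are assumptions made for this sketch, and binary data is not detected at all.

```scheme
;; Hypothetical toy heuristic, not the collection's JUDGE-FILE.
;; bytes is a list of integers in the range 0 through 255.
(define (guess-code bytes)
  (let loop ((rest bytes) (saw-8bit? #f))
    (if (null? rest)
        (if saw-8bit? 'unknown 'ascii)
        (let ((b (car rest)))
          (cond ((= b #x1b) 'jis)             ; an escape sequence starts
                ;; #x80-#xa0 never occurs in EUC, except #x8e and #x8f.
                ((and (>= b #x80) (<= b #xa0)
                      (not (= b #x8e)) (not (= b #x8f)))
                 'sjis)
                ;; #xfd and #xfe never occur in Shift JIS.
                ((>= b #xfd) 'eucj)
                (else (loop (cdr rest)
                            (or saw-8bit? (>= b #x80)))))))))
```

The bytes #xa4 #xa2 are the hiragana letter A in EUC, but they are also 2 valid 'hankaku katakana' characters in Shift JIS, so the sketch can only answer 'unknown for them. A real detector has to weigh more evidence, which is why reading more characters helps.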


About this document

This document was generated by Dai Inukai on May, 23 2002 using texi2html