By the way, I would love to have a short tutorial about Unicode characters at ESUG.
Me too, but if nobody starts to look at and tries to understand leadingChar and friends,
it will not happen.
I sent a note about my analysis a while ago and nobody reacted.
"This name is obsolete since only the characters that will fit in a byte can be
^self allByteCharacters
=> all the senders should use allByteCharacters
During my journey into the leadingChar realm I took notes, and I share them with you here.
leadingChar: leadChar code: code
    "code must fit in 22 bits"
    code >= 16r400000 ifTrue: [
        self error: 'code is out of range'.
    ].
    "leadChar must fit in 8 bits"
    leadChar >= 256 ifTrue: [
        self error: 'lead is out of range'.
    ].
    "byte-range codes are stored without any leadChar"
    code < 256 ifTrue: [ ^self value: code ].
    ^self value: (leadChar bitShift: 22) + code.
^ (value bitAnd: 16r3FFFFF).
^ (value bitAnd: (16r3FC00000)) bitShift: -22.
^ EncodedCharSet charsetAt: self leadingChar
=> a character encodes its character set (via the leadingChar bits of its value).
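
To make the arithmetic above concrete, here is a small workspace sketch of the packing and unpacking, using only integer operations. The values chosen for leadChar and code are arbitrary examples picked for illustration, not taken from the image, and the variable names are mine.

```smalltalk
"Workspace sketch: pack a leadingChar and a code into one value, then unpack.
 The sample values are assumptions chosen only for illustration."
| leadChar code value unpackedCode unpackedLead |
leadChar := 5.            "hypothetical charset index, must be < 256"
code := 16r3042.          "hypothetical code, must be < 16r400000 (22 bits)"

"Same expression as in leadingChar:code: for code >= 256"
value := (leadChar bitShift: 22) + code.

"The low 22 bits hold the code"
unpackedCode := value bitAnd: 16r3FFFFF.

"The 8 bits above them hold the leadingChar"
unpackedLead := (value bitAnd: 16r3FC00000) bitShift: -22.

unpackedCode = code.       "true"
unpackedLead = leadChar.   "true"
```

Note that, because of the `code < 256` branch, the leadChar is dropped for byte-range codes, so for such characters the second mask always yields 0.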
Why are we using
^ 0.
^ 0
and I do not get why
I do not understand why Unicode is declared as 1 and not 0.
Unicode class>>initialize
    EncodedCharSet declareEncodedCharSet: self atIndex: 0+1.
    EncodedCharSet declareEncodedCharSet: self atIndex: 256.
I do not understand why Latin1 does not use declareEncodedCharSet
Latin1 class>>initialize
    "self initialize"
    compoundTextSequence := String streamContents:
        [ :s |
        s nextPut: (Character value: 27).
        s nextPut: $(.
        s nextPut: $B ].
    rightHalfSequence := String streamContents:
        [ :s |
        s nextPut: (Character value: 27).
        s nextPut: $-.
        s nextPut: $A ]
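
For readers unfamiliar with these two sequences: they look like ISO 2022 escape sequences (ESC ( B conventionally designates ASCII, ESC - A the right half of ISO 8859-1). That reading is mine and is not stated in the code. A quick workspace check of the bytes involved, with variable names made up for the occasion:

```smalltalk
"Workspace sketch: build the same two 3-character sequences and look at their bytes."
| esc compound rightHalf |
esc := Character value: 27.                       "ESC"
compound := String with: esc with: $( with: $B.   "same content as compoundTextSequence"
rightHalf := String with: esc with: $- with: $A.  "same content as rightHalfSequence"

compound asByteArray.    "bytes 27 40 66"
rightHalf asByteArray.   "bytes 27 45 65"
```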
I started to distribute the initialization into subclasses starting from this method:
declareEncodedCharSet: anEncodedCharSetOrLanguageEnvironmentClass atIndex: aNumber
    "this method is used to modularize the old initialize method:
    EncodedCharSets at: 0+1 put: Unicode.
    EncodedCharSets at: 1+1 put: JISX0208.
    EncodedCharSets at: 2+1 put: GB2312.
    EncodedCharSets at: 3+1 put: KSX1001.
    EncodedCharSets at: 4+1 put: JISX0208.
    EncodedCharSets at: 5+1 put: JapaneseEnvironment.
    EncodedCharSets at: 6+1 put: SimplifiedChineseEnvironment.
    EncodedCharSets at: 7+1 put: KoreanEnvironment.
    EncodedCharSets at: 8+1 put: GB2312.
    EncodedCharSets at: 12+1 put: KSX1001.
    EncodedCharSets at: 13+1 put: GreekEnvironment.
    EncodedCharSets at: 14+1 put: Latin2Environment.
    EncodedCharSets at: 15+1 put: RussianEnvironment.
    EncodedCharSets at: 17+1 put: Latin9Environment.
    EncodedCharSets at: 256 put: Unicode."
and indeed Latin1Environment was not part of the list.
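
Coming back to the question of why Unicode is declared at 1 and not 0, here is one possible reading of the `0+1` pattern in the list above, offered as a guess rather than as a statement about the actual lookup code: Smalltalk Arrays are 1-based, while a leadingChar can be 0, so a table indexed by leadingChar has to be shifted by one. The sketch uses a plain Array with symbols as stand-ins for the classes; `table` and `lookup` are names I made up.

```smalltalk
"Workspace sketch: a 1-based table addressed by a 0-based leadingChar.
 'table' and 'lookup' are hypothetical stand-ins, not the real EncodedCharSet code."
| table lookup |
table := Array new: 256.
table at: 0 + 1 put: #Unicode.
table at: 1 + 1 put: #JISX0208.
table at: 255 + 1 put: #Unicode.     "same slot as 'at: 256' in the list above"

"Hypothetical lookup: shift the 0-based leadingChar to a 1-based slot"
lookup := [ :leadingChar | table at: leadingChar + 1 ].

lookup value: 0.      "#Unicode"
lookup value: 1.      "#JISX0208"
lookup value: 255.    "#Unicode"
```

Under that reading, leadingChar 0 lands in slot 1, and slot 256 corresponds to leadingChar 255, the largest value the 8-bit field can hold.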
Now apparently we can remove Latin1 because
EncodedCharSets of EncodedCharSet do not contain Latin1
No senders
emitSequenceToResetStateIfNeededOn: aStream forState: state
nextPutRightHalfValue: ascii toStream: aStream withShiftSequenceIfNeededForTextConverterState: state
nextPutValue: ascii toStream: aStream withShiftSequenceIfNeededForTextConverterState: