Unicode
Unicode.julia_chartransform — Function
Unicode.julia_chartransform(c::Union{Char,Integer})
Map the Unicode character (Char) or codepoint (Integer) c to the corresponding “equivalent” character or codepoint, respectively, according to the custom equivalence used within the Julia parser (in addition to NFC normalization).
For example, 'µ' (U+00B5 micro) is treated as equivalent to 'μ' (U+03BC mu) by Julia’s parser, so julia_chartransform performs this transformation while leaving other characters unchanged:
julia> Unicode.julia_chartransform('µ')
'μ': Unicode U+03BC (category Ll: Letter, lowercase)
julia> Unicode.julia_chartransform('x')
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
julia_chartransform is mainly useful for passing to the Unicode.normalize function in order to mimic the normalization used by the Julia parser:
julia> s = "µö"
"µö"
julia> s2 = Unicode.normalize(s, compose=true, stable=true, chartransform=Unicode.julia_chartransform)
"μö"
julia> collect(s2)
2-element Vector{Char}:
'μ': Unicode U+03BC (category Ll: Letter, lowercase)
'ö': Unicode U+00F6 (category Ll: Letter, lowercase)
julia> s2 == string(Meta.parse(s))
true
Julia 1.8
This function was introduced in Julia 1.8.
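The description above says that integer codepoints are accepted as well as Char values, but only Char inputs are shown. A minimal sketch of the integer form follows; it only assumes that the result compares equal to the mapped codepoint, since the exact integer type returned is not stated here.

```julia
using Unicode

# U+00B5 (micro sign) passed as an integer codepoint rather than a Char
cp = Unicode.julia_chartransform(0x00B5)

# The result is a codepoint equal to U+03BC (Greek mu)
println(cp == 0x03BC)

# A codepoint with no parser-specific equivalent comes back unchanged
println(Unicode.julia_chartransform(0x0078) == 0x0078)
```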
Unicode.isassigned — Function
Returns true if the given char or integer is an assigned Unicode code point.
julia> Unicode.isassigned(101)
true
julia> Unicode.isassigned('\x01')
true
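For contrast, a short sketch that also checks a code point expected to be unassigned. It assumes U+0378, a reserved slot in the Greek and Coptic block, is still unassigned in the Unicode version bundled with your Julia build.

```julia
using Unicode

# 'e' (U+0065) is assigned; the integer and Char forms agree
println(Unicode.isassigned(0x65))
println(Unicode.isassigned('e'))

# U+0378 is assumed to be a reserved (unassigned) code point
println(Unicode.isassigned(0x0378))
```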
Unicode.isequal_normalized — Function
isequal_normalized(s1::AbstractString, s2::AbstractString; casefold=false, stripmark=false, chartransform=identity)
Return whether s1 and s2 are canonically equivalent Unicode strings. If casefold=true, ignores case (performs Unicode case-folding); if stripmark=true, strips diacritical marks and other combining characters.
As with Unicode.normalize, you can also pass an arbitrary function via the chartransform keyword (mapping Integer codepoints to codepoints) to perform custom normalizations, such as Unicode.julia_chartransform.
Examples
For example, the string "noël" can be constructed in two canonically equivalent ways in Unicode, depending on whether "ë" is formed from a single codepoint U+00EB or from the ASCII character 'e' followed by the U+0308 combining-diaeresis character.
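The two spellings described above can be compared as in the following sketch. The strings are written with explicit \u escapes so that the precomposed and combining forms stay distinct regardless of how an editor normalizes the file.

```julia
using Unicode

s1 = "no\u00EBl"    # precomposed ë (U+00EB)
s2 = "noe\u0308l"   # 'e' followed by the combining diaeresis (U+0308)

# Codepoint-by-codepoint comparison sees two different strings
println(s1 == s2)

# Canonical equivalence treats them as the same string
println(isequal_normalized(s1, s2))

# Optional keywords: ignore combining marks, or ignore case
println(isequal_normalized(s1, "noel", stripmark=true))
println(isequal_normalized(s1, "NO\u00CBL", casefold=true))
```

The first comparison should print false and the remaining three true, since only the first ignores canonical equivalence.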
Unicode.normalize — Function
Unicode.normalize(s::AbstractString; keywords...)
Unicode.normalize(s::AbstractString, normalform::Symbol)
Normalize the string s. By default, canonical composition (compose=true) is performed without ensuring Unicode versioning stability (stable=false), which produces the shortest possible equivalent string but may introduce composition characters not present in earlier Unicode versions.
Alternatively, finer control and additional transformations may be obtained by calling Unicode.normalize(s; keywords...), where any number of the following boolean keyword options (which all default to false except for compose) are specified:
- compose=false: do not perform canonical composition
- decompose=true: do canonical decomposition instead of canonical composition (compose=true is ignored if present)
- compat=true: compatibility equivalents are canonicalized
- casefold=true: perform Unicode case folding, e.g. for case-insensitive string comparison
- newline2lf=true, newline2ls=true, or newline2ps=true: convert various newline sequences (LF, CRLF, CR, NEL) into a linefeed (LF), line-separation (LS), or paragraph-separation (PS) character, respectively
- stripmark=true: strip diacritical marks (e.g. accents)
- stripignore=true: strip Unicode’s “default ignorable” characters (e.g. the soft hyphen or the left-to-right marker)
- stripcc=true: strip control characters; horizontal tabs and form feeds are converted to spaces; newlines are also converted to spaces unless a newline-conversion flag was specified
- rejectna=true: throw an error if unassigned code points are found
- stable=true: enforce Unicode versioning stability (never introduce characters missing from earlier Unicode versions)
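A few of these keyword options are illustrated in the sketch below. The input strings are made up for illustration and use \u escapes so the exact codepoints are visible; each line prints whether the result matches the string the option description leads one to expect.

```julia
using Unicode

# decompose=true: precomposed é (U+00E9) becomes 'e' + combining acute (U+0301)
d = Unicode.normalize("\u00E9", decompose=true)
println(d == "e\u0301")

# compat=true: the "fi" ligature (U+FB01) is replaced by the two letters "fi"
println(Unicode.normalize("\uFB01", compat=true) == "fi")

# newline2lf=true: CRLF and CR are both converted to LF
println(Unicode.normalize("a\r\nb\rc", newline2lf=true) == "a\nb\nc")

# stripignore=true: the soft hyphen (U+00AD) is a default-ignorable character
println(Unicode.normalize("co\u00ADop", stripignore=true) == "coop")

# stripcc=true: a horizontal tab is converted to a space
println(Unicode.normalize("a\tb", stripcc=true) == "a b")
```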
You can also use the chartransform keyword (which defaults to identity) to pass an arbitrary function mapping Integer codepoints to codepoints, which is called on each character in s as it is processed, in order to perform arbitrary additional normalizations. For example, by passing chartransform=Unicode.julia_chartransform, you can apply a few Julia-specific character normalizations that are performed by Julia when parsing identifiers (in addition to NFC normalization: compose=true, stable=true).
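As a sketch of a user-supplied chartransform, the function below maps one codepoint and passes all others through. The name ascii_apostrophe is hypothetical and not part of the Unicode module; the only assumption is the contract stated above, namely that the function takes an integer codepoint and returns a codepoint.

```julia
using Unicode

# Hypothetical transform (not provided by the Unicode module): map the right
# single quotation mark U+2019 to the ASCII apostrophe U+0027 and leave every
# other codepoint unchanged.
ascii_apostrophe(cp::Integer) = cp == 0x2019 ? UInt32(0x0027) : UInt32(cp)

s = Unicode.normalize("it\u2019s", chartransform=ascii_apostrophe)
println(s == "it's")
```

Returning a UInt32 in both branches keeps the function type-stable, which seems a reasonable choice for a callback invoked once per character.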
For example, NFKC corresponds to the options compose=true, compat=true, stable=true.
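The correspondence between the normal-form symbol and the keyword options can be checked directly. This sketch uses a made-up input that contains both a compatibility character and a combining sequence, so that both compat and compose have something to do.

```julia
using Unicode

# "fi" ligature (U+FB01), then "anc", then 'e' + combining acute (U+0301)
s = "\uFB01ance\u0301"

by_symbol   = Unicode.normalize(s, :NFKC)
by_keywords = Unicode.normalize(s, compose=true, compat=true, stable=true)

# Both call styles should give the same result
println(by_symbol == by_keywords)

# The ligature is expanded and the accent is composed into é (U+00E9)
println(by_symbol == "fianc\u00E9")
```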
Examples
julia> "é" == Unicode.normalize("é") #LHS: Unicode U+00e9, RHS: U+0065 & U+0301
true
julia> "μ" == Unicode.normalize("µ", compat=true) #LHS: Unicode U+03bc, RHS: Unicode U+00b5
true
julia> Unicode.normalize("JuLiA", casefold=true)
"julia"
julia> Unicode.normalize("JúLiA", stripmark=true)
"JuLiA"
Julia 1.8
The chartransform keyword argument requires Julia 1.8.
— Function