Thank you for your help.
Which is more suitable for using regular expressions on Japanese text that contains kanji: Boost.Regex or Onigmo?
I think either one would work, but are there areas where one is stronger than the other?
Thank you for your cooperation.
In a regular expression, . represents one character. In practice, however, the definition of "one character" is ambiguous. Results differ depending on whether the engine recognizes multi-byte characters, where a particular sequence of bytes forms a single character, as is the case for kanji. A regular expression engine that does not recognize character boundaries correctly will split the text at the wrong byte position and return unintended results. Unicode has not eliminated this problem: surrogate pairs and variation selectors still have to be considered.
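To make the ambiguity concrete, here is a minimal C++ sketch that counts the same UTF-8 text in three different units. The names UnitCounts and count_units are hypothetical, and the code assumes the input is well-formed UTF-8 (it performs no validation).

```cpp
#include <cstddef>
#include <string>

// How long the same text is, depending on what is counted as one "character".
struct UnitCounts {
    std::size_t bytes;        // UTF-8 bytes
    std::size_t utf16_units;  // UTF-16 code units (a surrogate pair counts as 2)
    std::size_t code_points;  // Unicode code points
};

// Assumption: `s` is well-formed UTF-8. Malformed input is not detected.
UnitCounts count_units(const std::string& s) {
    UnitCounts c{s.size(), 0, 0};
    for (unsigned char b : s) {
        if ((b & 0xC0) == 0x80) continue;      // continuation byte, not a new code point
        ++c.code_points;
        c.utf16_units += (b >= 0xF0) ? 2 : 1;  // 4-byte sequences need a surrogate pair in UTF-16
    }
    return c;
}
```

For example, the kanji U+20BB7 ("\xF0\xA0\xAE\xB7" in UTF-8) is 4 bytes, 2 UTF-16 code units, and 1 code point. U+845B followed by the variation selector U+E0100 is displayed as one kanji but is 7 bytes, 3 UTF-16 code units, and 2 code points.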
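On its own, Boost.Regex works only on char and wchar_t strings. It does not recognize multi-byte encodings such as Shift_JIS or UTF-8, and it does not handle surrogate pairs or variation selectors. (The Boost documentation describes optional ICU-based Unicode support through boost::u32regex, but that is a separate build configuration.)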
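The effect of an engine that matches per code unit can be sketched as follows. This example does not use Boost.Regex. It uses the standard library's std::regex on char strings as a stand-in for any engine that treats each char as one character, and the helper name full_match is hypothetical.

```cpp
#include <regex>
#include <string>

// Returns true if `pattern` matches the whole of `text`.
// std::regex on std::string treats every char (byte) as one "character".
bool full_match(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}
```

With the kanji U+6F22 encoded in UTF-8 ("\xE6\xBC\xA2"), full_match(text, ".") is false and full_match(text, "...") is true, because the engine sees three separate bytes rather than one character.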
Onigmo supports various encodings. However, I could not find any documentation on how it handles surrogate pairs or variation selectors (judging from the source code, it does seem to be aware of surrogate pairs).
I recommend choosing a regular expression engine only after deciding how far "one character" needs to be recognized for your purposes and checking how each engine behaves. Where an engine falls short, you will have to make up for it yourself.
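As one illustration of making up for a shortfall yourself, the following C++ sketch splits UTF-8 text into units in which a variation selector stays attached to the preceding code point. All function names are hypothetical, the input is assumed to be well-formed UTF-8, and this is not full grapheme cluster segmentation: it only handles the variation selector ranges U+FE00 to U+FE0F and U+E0100 to U+E01EF.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Length in bytes of the UTF-8 sequence starting with lead byte `b`.
// Assumption: well-formed input; a stray continuation byte is treated as length 1.
static std::size_t utf8_length(unsigned char b) {
    if (b >= 0xF0) return 4;
    if (b >= 0xE0) return 3;
    if (b >= 0xC0) return 2;
    return 1;
}

// Decode the code point stored in s[pos, pos + len).
static std::uint32_t decode(const std::string& s, std::size_t pos, std::size_t len) {
    unsigned char lead = static_cast<unsigned char>(s[pos]);
    if (len == 1) return lead;
    std::uint32_t cp = lead & (0xFF >> (len + 1));  // keep only the payload bits of the lead byte
    for (std::size_t i = 1; i < len; ++i)
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos + i]) & 0x3F);
    return cp;
}

static bool is_variation_selector(std::uint32_t cp) {
    return (cp >= 0xFE00 && cp <= 0xFE0F) || (cp >= 0xE0100 && cp <= 0xE01EF);
}

// Split `s` into units: each code point, plus any variation selectors that follow it.
std::vector<std::string> split_units(const std::string& s) {
    std::vector<std::string> units;
    for (std::size_t pos = 0; pos < s.size();) {
        std::size_t len = std::min(utf8_length(static_cast<unsigned char>(s[pos])),
                                   s.size() - pos);  // do not read past the end
        std::uint32_t cp = decode(s, pos, len);
        if (is_variation_selector(cp) && !units.empty())
            units.back().append(s, pos, len);  // attach selector to the previous unit
        else
            units.emplace_back(s, pos, len);   // start a new unit
        pos += len;
    }
    return units;
}
```

The resulting unit boundaries could then be used to check or adjust the positions that a regular expression engine reports, so that a match never ends between a kanji and its variation selector.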
© 2023 OneMinuteCode. All rights reserved.