Steven Bedrick

Python's ICU Bindings


Python’s Unicode-related functionality has improved greatly over the years, but sometimes one needs to go for the big guns and use ICU, the original software package for doing Unicode-aware text processing. It is ancient (seriously, its history is fascinating), venerable, and powerful, and can do basically anything one might possibly need to do involving Unicode. On the other hand, it is rather complex, and it is designed for use from C++ and Java. There are excellent Python bindings, but their APIs are not exactly Pythonic in flavor: they are basically just (very) thin wrappers around the underlying C++ APIs. They are, however, very well-designed and consistent, so once you figure out how one kind of thing works, you can usually leverage that to figure out how the rest of it works.

In that spirit, and in the spirit of leaving myself notes, the thing I need to do most often with ICU is to split a string into grapheme clusters. ICU has a family of classes for doing various kinds of string segmentation, and they all behave largely the same way.

Step zero is to install PyICU; ICU is a finicky and tricky beast, and so you’ll want to make sure to follow the installation instructions carefully. You may need to fuss around with environment variables a little bit, depending on how you installed ICU.
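
As a quick sanity check that everything is wired up (this assumes a reasonably recent PyICU, which exposes the underlying library’s version as icu.ICU_VERSION):

import icu

# if this prints a version number, PyICU can see your ICU installation
print(icu.ICU_VERSION)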

Once it’s installed, we are ready to party like it’s 1995. First, you will create an instance of icu.BreakIterator. There are factory methods for each of the various Unicode segmentation algorithms; for our purposes, we want the one that produces grapheme-cluster segments (i.e., “characters”, though note that the semantics of that word get hairy quickly).

Once you’ve got a BreakIterator, you can roll through it to get a series of offsets within the string at which there is a grapheme boundary:

import icu

s = "café"

grapheme_clusters = []

# note that if you want an iterator set up for a different locale, you
# would ask for one here instead of just using icu.Locale(), which gets
# the current/default locale:
b = icu.BreakIterator.createCharacterInstance(icu.Locale())
b.setText(s)
i = 0
for j in b:
    this_grapheme = s[i:j] # Python 3 lets us slice by code point
    grapheme_clusters.append(this_grapheme)
    i = j # don't forget to manually keep track of your start offset!

# at this point, grapheme_clusters will be a list of the constituent grapheme clusters
# in our input string.
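
For this particular input, the result is not terribly exciting (assuming the é was entered as a single precomposed code point, about which more below):

print(grapheme_clusters) # ['c', 'a', 'f', 'é']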

If you think that this looks like C++ written in Python, you’re not wrong. The good news is that if you decide later that you actually want to break your string into words, sentences, lines, etc. in a language/locale-aware way, there are equivalent BreakIterators whose APIs follow the same pattern of use. Note that Unicode’s definition of “word” and “sentence” may not align with your needs; this apparatus was designed for text entry UI controls and the like, and so the closer your application is to that, the better a fit it will be.
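For instance, here is the same dance at the word level (a minimal sketch; the "en_US" locale is just an example, and note that the Unicode word-break algorithm also emits segments for whitespace and punctuation):

import icu

s = "Hello there, world!"

b = icu.BreakIterator.createWordInstance(icu.Locale("en_US"))
b.setText(s)
i = 0
segments = []
for j in b:
    segments.append(s[i:j])
    i = j

# segments will be ['Hello', ' ', 'there', ',', ' ', 'world', '!']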

I typically find that it is useful to wrap all of this up in a function producing a generator, like so:

import icu

def each_grapheme(some_str):
  b = icu.BreakIterator.createCharacterInstance(icu.Locale())
  b.setText(some_str)
  i = 0
  for j in b:
    this_grapheme = some_str[i:j]
    yield this_grapheme
    i = j

This lets us do things like dissect complex strings with many combining marks:

import unicodedata

s = "٩(͡๏̯͡๏)۶"

for g_idx, g in enumerate(each_grapheme(s)):
  print(f"Grapheme {g_idx+1}:")
  for c_idx, c in enumerate(g): # for each code point in this grapheme cluster
    print("\t", c_idx+1, c, hex(ord(c)), unicodedata.name(c))

# Produces the following output:
# Grapheme 1:
# 	 1 ٩ 0x669 ARABIC-INDIC DIGIT NINE
# Grapheme 2:
# 	 1 ( 0x28 LEFT PARENTHESIS
# 	 2 ͡ 0x361 COMBINING DOUBLE INVERTED BREVE
# Grapheme 3:
# 	 1 ๏ 0xe4f THAI CHARACTER FONGMAN
# 	 2 ̯ 0x32f COMBINING INVERTED BREVE BELOW
# 	 3 ͡ 0x361 COMBINING DOUBLE INVERTED BREVE
# Grapheme 4:
# 	 1 ๏ 0xe4f THAI CHARACTER FONGMAN
# Grapheme 5:
# 	 1 ) 0x29 RIGHT PARENTHESIS
# Grapheme 6:
# 	 1 ۶ 0x6f6 EXTENDED ARABIC-INDIC DIGIT SIX    

In fact, I will often go a little bit further and build some other Unicode-related functionality into my each_grapheme() function. Specifically, I like to add support for handling equivalence via normalization:

import icu, unicodedata

def each_grapheme(some_str, norm_form='NFC'):
  if norm_form not in [False, "NFC", "NFKC", "NFD", "NFKD"]:
    raise ValueError("Invalid normalization specification: must be one of False, NFC, NFD, NFKC, NFKD. Consult a quality Unicode reference.")

  b = icu.BreakIterator.createCharacterInstance(icu.Locale())
  b.setText(some_str)
  i = 0

  for j in b:
    this_grapheme = some_str[i:j]
    if norm_form:
      yield unicodedata.normalize(norm_form, this_grapheme)
    else:
      yield this_grapheme

    i = j

This way, my grapheme clusters can be in whatever equivalence form I need for whatever it is that I’m doing (string matching, language modeling, etc.).
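
For example, here is what NFD does to our string from earlier (a small sketch; note that the decomposed é is still a single grapheme cluster, it just now contains two code points):

s = "café"

graphemes = list(each_grapheme(s, norm_form="NFD"))
print(graphemes) # ['c', 'a', 'f', 'é']
print(len(graphemes[-1])) # 2: LATIN SMALL LETTER E + COMBINING ACUTE ACCENT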

I should also note that there is a very nice Python library called uniseg that also does Unicode-standardized segmentation at the grapheme, word, etc. level, and does not involve dealing with ICU; you may find it an easier lift, and its API is certainly more pleasant. However, there are still reasons to reach for ICU:

  1. There are times when you need something beyond the basic behavior (locale-specific control, etc.).
  2. I haven’t benchmarked it, but I strongly suspect that ICU will be much more performant if you’re dealing with very large amounts of text.
  3. ICU does a ton of other useful Unicode-related things; for example, language- and locale-aware case-folding:

     icu.UnicodeString(some_str).toUpper(icu.Locale())

For “vanilla” English-language text this is redundant (the built-in upper() method will usually do the trick), but when working with multilingual text, or text that includes a lot of compound glyphs, it can be worth it to let ICU do the heavy lifting.
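
The classic demonstration is Turkish, where the locale genuinely changes the answer (a quick sketch; note the str() call to turn ICU’s UnicodeString back into a Python string):

import icu

s = "istanbul"

print(str(icu.UnicodeString(s).toUpper(icu.Locale("tr")))) # İSTANBUL, with U+0130 (dotted capital I)
print(str(icu.UnicodeString(s).toUpper(icu.Locale("en")))) # ISTANBUL, with a plain ASCII I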

