We recently needed to support Unicode URLs for a Django CMS project, and it turned out a little trickier than you’d have thought.

We had existing blog posts written in Hindi that had originally been created with WordPress. Searching online, we found that there was already a nice way to specify custom validation for post slugs, keeping the normal slug rules but using the full Unicode character set. Something like this:

import re
from django.core.exceptions import ValidationError

def validate_unicode_slug(s):
  if not re.search(r'^[-\w]+$', s, flags=re.U):  # re.U makes \w Unicode-aware
    raise ValidationError("Invalid Slug, please use only letters, numbers and hyphens.")

And then setting up your class like so:

from django.db import models

class UnicodeSlugField(models.CharField):
  default_validators = [validate_unicode_slug]

This would work fine with totally new content, but because these posts were being imported from WordPress, their slugs didn’t match the above regex. Essentially, some of the Unicode characters in them were not classed as letters or numbers (the \w character class).
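To make the failure concrete, here is a minimal sketch, assuming Python 3 and the standard re module. The sample words are my own illustrations, not slugs from the imported posts, and the helper name matches_word_slug is hypothetical. Devanagari vowel signs and similar marks are combining marks (Unicode categories Mc and Mn) rather than letters, and Python's \w does not match them:

```python
import re
import unicodedata

# Same pattern as the first validator; str patterns are Unicode-aware in Python 3.
SLUG_RE = re.compile(r'^[-\w]+$')

def matches_word_slug(s):
    """Hypothetical helper: True if s consists only of \\w characters and hyphens."""
    return bool(SLUG_RE.search(s))

# Illustrative sample: the word "Hindi" in Devanagari, written with escapes
# so the exact code points are unambiguous.
sample = "\u0939\u093f\u0902\u0926\u0940"

for ch in sample:
    # Print each code point, its Unicode category, and whether \w matches it.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), bool(re.match(r'\w', ch)))

print(matches_word_slug(sample))                # contains combining marks
print(matches_word_slug("\u0928\u092e\u0928"))  # base letters only
```

The first word mixes letters (category Lo) with combining marks, so the \w-based check rejects it, while the second word, made only of base letters, passes.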

This meant we needed a different approach:

def validate_unicode_slug(s):
  # Reject only characters that would be a problem in a URL; allow everything else.
  reg = r'[./\\ ?^$*{}\[\]()+<>\'"`|@#!£&%]'
  if re.search(reg, s, re.UNICODE):
    raise ValidationError("Invalid Slug, please use only letters, numbers and hyphens.")

Rather than rejecting anything that isn’t a word character, we check only for specific characters that would be a concern in the URL. Everything else is fair game, including the Hindi characters that we were importing from the WordPress database. When paired with a UTF-8 database, some matching views and a modern browser, hey presto, we have functioning Unicode URLs in Django.
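As a quick check of that behaviour outside Django, here is a rough standalone sketch, assuming Python 3. The helper name has_forbidden_chars and the sample slugs are hypothetical; the character set mirrors the one in the validator above:

```python
import re

# Characters rejected by the validator; anything not listed here is allowed.
FORBIDDEN = re.compile(r"""[./\\ ?^$*{}\[\]()+<>'"`|@#!£&%]""")

def has_forbidden_chars(slug):
    """Hypothetical helper: True if the slug contains any rejected character."""
    return bool(FORBIDDEN.search(slug))

# Illustrative slugs (not taken from the imported data).
hindi_slug = "\u0939\u093f\u0902\u0926\u0940-\u092c\u094d\u0932\u0949\u0917"

for candidate in (hindi_slug, "plain-ascii-slug", "hello world", "a/b", "50%"):
    print(repr(candidate), "rejected" if has_forbidden_chars(candidate) else "accepted")
```

One consequence of this design is that any character missing from the list is accepted, so punctuation such as a comma or colon would pass. Whether that matters depends on how the slugs are produced and used.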