Stephan Lonntorp
Oct 28, 2016

URL Transliteration for EPiServer CMS 10

A while back we built a website that had a Chinese language version, and we had a few issues with URL segments not looking very nice. It took me a while to figure out that what I was looking for is called transliteration. I implemented a really hacky way of modifying the URL segment that EPiServer produces, so that I could inject my transliterated page name instead of the page name in Chinese.

Now in CMS 10, the old UrlSegment class has been removed, and instead we have IUrlSegmentGenerator, IUrlSegmentCreator and IUrlSegmentLocator. You can read more about that in the release note CMS-3824.

The default implementations of these interfaces are all internal, so it's still a bit hacky to extend, but I've implemented a transliterating UrlSegmentGenerator and swapped out the implementation, so you don't have to.

OK, so what is transliteration, and why is this important?

Let's say we have a page named "伤寒论 勘误" (I don't know what that means, it's just some Chinese text that I copied). The default UrlSegmentGenerator would produce the URL segment "-", since everything but alphanumeric characters is stripped out, so the only thing that remains is the whitespace character in the name, which ends up as a hyphen.

Using transliteration, the Chinese characters are converted to their romanized (Latin alphabet) equivalents, so the same input string "伤寒论 勘误" is converted to "Shang Han Lun Kan Wu", and the transliterating UrlSegmentGenerator then produces the URL segment "shang-han-lun-kan-wu".
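To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is not the EPiServer code or the package's code: the function names are made up for illustration, the stripping rule only approximates what is described above, and the PINYIN table is a hypothetical lookup covering just the five characters in the example (a real implementation would rely on a full transliteration library).

```python
import re

# Hypothetical lookup table for illustration only; it covers just the
# five characters used in the example page name.
PINYIN = {"伤": "Shang", "寒": "Han", "论": "Lun", "勘": "Kan", "误": "Wu"}


def naive_segment(name: str) -> str:
    """Approximate the described default: whitespace becomes a hyphen,
    then everything except ASCII letters, digits and hyphens is dropped."""
    hyphenated = re.sub(r"\s+", "-", name.strip().lower())
    return re.sub(r"[^a-z0-9\-]", "", hyphenated)


def transliterate(name: str) -> str:
    """Replace each known character with its romanized form and
    normalize the spacing between the resulting words."""
    parts = [f" {PINYIN[ch]} " if ch in PINYIN else ch for ch in name]
    return " ".join("".join(parts).split())


def transliterating_segment(name: str) -> str:
    """Transliterate first, then apply the same stripping rule."""
    return naive_segment(transliterate(name))


print(naive_segment("伤寒论 勘误"))            # -
print(transliterate("伤寒论 勘误"))            # Shang Han Lun Kan Wu
print(transliterating_segment("伤寒论 勘误"))  # shang-han-lun-kan-wu
```

The point of the sketch is the ordering: transliteration has to happen before non-alphanumeric characters are stripped, otherwise there is nothing left to build a segment from.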

Granted, I don't know Chinese, so I can't verify that this is 100% correct. But I do know that "shang-han-lun-kan-wu" is a better representation than "-", since three pages in Chinese, in the same location, would have the URL segments "-", "-1" and "-2" using the default generator.
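The collision behaviour can be sketched in a few lines of Python. This is only an assumption about how the numbering works, inferred from the "-", "-1", "-2" example; the function name and the suffix rule are hypothetical and not taken from EPiServer.

```python
def unique_segment(segment: str, taken: set) -> str:
    """Return the segment unchanged if it is free among its siblings,
    otherwise append the first free counter (1, 2, 3, ...)."""
    if segment not in taken:
        return segment
    counter = 1
    while f"{segment}{counter}" in taken:
        counter += 1
    return f"{segment}{counter}"


# Three sibling pages whose names all collapse to "-".
taken = set()
for _ in range(3):
    segment = unique_segment("-", taken)
    taken.add(segment)
    print(segment)
```

With transliterated names, each page starts from a different segment, so the counter is only needed when two pages really do share a name.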

This approach should work for all languages, not just Chinese, but you'll have to test it for yourself. If you find any bugs, please let us know by sending a pull request.

The code is available at https://github.com/creunaab/EPi.UrlTransliterator, and a package with the same name should be available in the EPiServer NuGet feed shortly.


Comments

Oct 28, 2016 12:06 PM

Nice work!

Oct 28, 2016 03:08 PM

Yes, as you have pointed out, we have made it possible to change the default handling for URL segments.

We will officially support this from version 10.1.0 (no need to replace implementations in the container), where we have added encoding support as well (even if most browsers handle unencoded URLs with IRI characters, the recommendation is to encode such characters). It will be announced when we release 10.1.0.

Stephan Lonntorp Oct 28, 2016 03:25 PM

@Johan, care to elaborate? Have you implemented transliteration, or encoding, or both?

Oct 28, 2016 11:02 PM

There is a class UrlSegmentOptions, registered as a singleton in the IoC container, where you can specify which regexp a URL segment should be validated against (this exists in CMS 10 as well), meaning you can, for example, specify a regexp that allows Unicode characters. So you can replace the default instance with your own instance.

What we have added in 10.1 is encoding, that is, IRI URLs get encoded. In CMS 10 those URLs will not be encoded (most browsers will handle them correctly anyway). In CMS 10.1 we have also opened up simple address to allow IRI characters.
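As a rough illustration of what encoding an IRI segment means, here is a short Python sketch using the standard library's percent-encoding. It only shows the general technique; the function name is made up, and nothing here reflects how CMS 10.1 implements it.

```python
from urllib.parse import quote, unquote


def encode_segment(segment: str) -> str:
    """Percent-encode the UTF-8 bytes of every character that is not
    an unreserved ASCII character (letters, digits, '-', '.', '_', '~')."""
    return quote(segment, safe="")


encoded = encode_segment("伤寒论-勘误")
print(encoded)           # ASCII-only, percent-encoded form
print(unquote(encoded))  # decodes back to the original segment
print(encode_segment("shang-han-lun-kan-wu"))  # plain ASCII is unchanged
```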

Oct 28, 2016 11:07 PM

So to clarify, you do not need to replace IUrlSegmentGenerator in the IoC container; you can instead set the regexp on UrlSegmentOptions.
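The idea of validating a segment against a configurable pattern can be sketched in Python as follows. Both patterns and the function name are hypothetical and chosen for illustration; they are not the defaults or the API of UrlSegmentOptions.

```python
import re

# Hypothetical patterns: one restricted to ASCII, one that also accepts
# Unicode word characters (in Python 3, \w matches them for str patterns).
ASCII_ONLY = re.compile(r"[a-z0-9\-]+")
UNICODE_FRIENDLY = re.compile(r"[\w\-]+")


def is_valid_segment(segment: str, pattern: re.Pattern) -> bool:
    """A segment is valid when the whole string matches the pattern."""
    return pattern.fullmatch(segment) is not None


print(is_valid_segment("伤寒论-勘误", ASCII_ONLY))        # False
print(is_valid_segment("伤寒论-勘误", UNICODE_FRIENDLY))  # True
print(is_valid_segment("shang-han-lun-kan-wu", ASCII_ONLY))  # True
```

Swapping the pattern changes which characters are accepted in a segment; it does not convert characters from one script to another.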

Vincent Nov 1, 2016 12:54 AM

Nice work mate.

I can read Chinese, and I can confirm each Chinese character is translated to appropriate Pinyin. 

Stephan Lonntorp Nov 2, 2016 01:57 PM

@code monkey: Thanks!

@Johan: Being able to replace the regexp isn't really useful for transliteration though. I use another library for transliteration, and AFAIK that has nothing to do with regular expressions. It's nice that you've made it configurable, but being able to change a regular expression really just caters to the use case of generating URL segments with regular expressions.
