November Happy Hour will be moved to Thursday December 5th.

Stephan Lonntorp
Oct 28, 2016
  5092
(3 votes)

URL Transliteration for EPiServer CMS 10

A while back we built a website that had a chinese language version, and we had a few issues with URL Segments not looking very nice. I took me a while to figure out that what I was looking for is called Transliteration. I implemented a really hacky way of modifying the URL Segment that EPiServer produces, so that I could inject my transliterated page name, instead of the page name in chinese.

Now in CMS 10, the old UrlSegment class has been removed, and instead we have the IUrlSegmentGenerator, IUrlSegmentCreator and IUrlSegmentLocator, you can read more about that in the release note CMS-3824.

The default implementations for these interfaces are all internal, so it's still a bit hacky to extend, but, I've implemented a Transliterating UrlSegmentGenerator, and swaped out the implementation, so you don't have to.

OK, so what is transliteration, and why is this important?

Let's say we have a page named "伤寒论 勘误" (I don't know what that means, it's just some chinese text that I copied). The default UrlSegmentGenerator would produce the url "-", since everything but alphanumeric chars are stripped out, so the only thing that remains is the whitespace character in the name.

Using transliteration, the chinese characters are converted to their alphanumeric versions, so the same input string "伤寒论 勘误" is converted to "Shang Han Lun Kan Wu", and the Transliterating UrlSegmentGenerator then produces the url "shang-han-lun-kan-wu".

Granted, I don't know chinese, so I can't verify that this is 100% correct. But I do know that "shang-han-lun-kan-wu" is a better representation than "-", since three pages in chinese, in the same location, would have the urls "-", "-1" and"-2" using the default generator.

This approach should work for all languages, not just chinese, but you'll have to test it for yourself, if you find any bugs, please let us know by sending a pull request.

The code is available at https://github.com/creunaab/EPi.UrlTransliterator, and a package with the same name should be available in the EPiServer NuGet feed shortly.

Oct 28, 2016

Comments

Oct 28, 2016 12:06 PM

Nice work!

Oct 28, 2016 03:08 PM

Yes as you have pointed out we have made it possible to change the default handling for url segments.

We will officially support this from version 10.1.0 (no need to replace implementations in container) where we have added encoding support as well (even if most browsers handle unencoded urls with IRI characters the recommendation is to encode such characters). It will be announced when we release 10.1.0.

Stephan Lonntorp
Stephan Lonntorp Oct 28, 2016 03:25 PM

@Johan, care to elaborate? Have you implemented transliteration, or encoding? or both?

Oct 28, 2016 11:02 PM

There is a class UrlSegmentOptions registered as singleton in IOC container where you can specify which regexp an url segment should be validated against (this exist in cms 10 as well), meaning you can for example specify a regexp that allows unicode characters. So you can replace default instance with your own instance.

What we have added in 10.1 is encoding, that is that IRI urls gets encoded. In cms 10 those urls will not be encoded (most browsers will handle them correctly anyway). In cms 10.1 we have also opened up simple address to allow IRI characters.

Oct 28, 2016 11:07 PM

So to clarify you do not need to replace IUrlSegmentGenerator in IOC container, you can instead set the regexp on UrlSegmentOptions.

Vincent
Vincent Nov 1, 2016 12:54 AM

Nice work mate.

I can read Chinese, and I can confirm each Chinese character is translated to appropriate Pinyin. 

Stephan Lonntorp
Stephan Lonntorp Nov 2, 2016 01:57 PM

@code monkey: Thanks!

@Johan: Being able to replace the regexp isn't really useful for transliteration though, I use another library for transliteration, and AFAIK that has nothing to do with regular expressions. It's nice that you've made it configurable, but being able to change a regulare expression really just caters to a use case for using regular expressions to generate url segments.

Please login to comment.
Latest blogs
Optimizely SaaS CMS + Coveo Search Page

Short on time but need a listing feature with filters, pagination, and sorting? Create a fully functional Coveo-powered search page driven by data...

Damian Smutek | Nov 21, 2024 | Syndicated blog

Optimizely SaaS CMS DAM Picker (Interim)

Simplify your Optimizely SaaS CMS workflow with the Interim DAM Picker Chrome extension. Seamlessly integrate your DAM system, streamlining asset...

Andy Blyth | Nov 21, 2024 | Syndicated blog

Optimizely CMS Roadmap

Explore Optimizely CMS's latest roadmap, packed with developer-focused updates. From SaaS speed to Visual Builder enhancements, developer tooling...

Andy Blyth | Nov 21, 2024 | Syndicated blog

Set Default Culture in Optimizely CMS 12

Take control over culture-specific operations like date and time formatting.

Tomas Hensrud Gulla | Nov 15, 2024 | Syndicated blog