
Stephan Lonntorp
Oct 28, 2016

URL Transliteration for EPiServer CMS 10

A while back we built a website that had a Chinese language version, and we had a few issues with URL segments not looking very nice. It took me a while to figure out that what I was looking for is called transliteration. I implemented a really hacky way of modifying the URL segment that EPiServer produces, so that I could inject my transliterated page name instead of the page name in Chinese.

Now in CMS 10, the old UrlSegment class has been removed, and instead we have IUrlSegmentGenerator, IUrlSegmentCreator and IUrlSegmentLocator. You can read more about that in the release note CMS-3824.

The default implementations of these interfaces are all internal, so it's still a bit hacky to extend, but I've implemented a transliterating UrlSegmentGenerator and swapped out the default implementation, so you don't have to.

OK, so what is transliteration, and why is this important?

Let's say we have a page named "伤寒论 勘误" (I don't know what that means, it's just some Chinese text that I copied). The default UrlSegmentGenerator would produce the URL segment "-", since everything but alphanumeric characters is stripped out, so the only thing that remains is the whitespace character in the name, rendered as a hyphen.

Using transliteration, the Chinese characters are converted to their Latin-alphabet equivalents, so the same input string "伤寒论 勘误" is converted to "Shang Han Lun Kan Wu", and the transliterating UrlSegmentGenerator then produces the URL segment "shang-han-lun-kan-wu".
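To make the difference concrete, here is a minimal sketch in Python rather than the C# of the actual package. It is not the EPiServer implementation: `default_segment` only approximates the stripping behavior described above, and `PINYIN` is a hypothetical lookup table covering just the five characters in the example, standing in for a real transliteration library.

```python
import re

# Hypothetical lookup table for the five example characters only.
# A real transliteration library would cover far more of Unicode.
PINYIN = {"伤": "Shang", "寒": "Han", "论": "Lun", "勘": "Kan", "误": "Wu"}


def default_segment(name: str) -> str:
    """Approximate the described default: drop everything except ASCII
    letters, digits, whitespace and hyphens, then turn runs of whitespace
    into a single hyphen."""
    kept = re.sub(r"[^A-Za-z0-9\s-]", "", name)
    return re.sub(r"[\s-]+", "-", kept).lower()


def transliterate(name: str) -> str:
    """Replace each known character with its Latin form, space separated."""
    parts = [f" {PINYIN[ch]} " if ch in PINYIN else ch for ch in name]
    # Collapse the extra spaces introduced around each replacement.
    return re.sub(r"\s+", " ", "".join(parts)).strip()


def transliterating_segment(name: str) -> str:
    """Transliterate first, then apply the same stripping rules."""
    return default_segment(transliterate(name))


print(default_segment("伤寒论 勘误"))          # only the space survives
print(transliterate("伤寒论 勘误"))
print(transliterating_segment("伤寒论 勘误"))
```

The point of the design is the order of operations: transliteration runs before the stripping step, so there is something left to build a segment from.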

Granted, I don't know Chinese, so I can't verify that this is 100% correct. But I do know that "shang-han-lun-kan-wu" is a better representation than "-", since three pages in Chinese, in the same location, would have the URL segments "-", "-1" and "-2" using the default generator.
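The collision behavior can be illustrated with a small Python sketch. This is an assumption-laden illustration, not EPiServer's code: `unique_segment` is a hypothetical helper, and appending a bare counter to the segment is inferred only from the "-", "-1", "-2" example above.

```python
from typing import Set


def unique_segment(segment: str, taken: Set[str]) -> str:
    """Return `segment`, or `segment` plus the first free counter
    (1, 2, ...) when it is already used by a sibling page.
    The suffix scheme is an assumption based on the example."""
    if segment not in taken:
        return segment
    counter = 1
    while f"{segment}{counter}" in taken:
        counter += 1
    return f"{segment}{counter}"


# Three sibling pages whose names all collapse to "-".
siblings: Set[str] = set()
results = []
for _ in range(3):
    chosen = unique_segment("-", siblings)
    siblings.add(chosen)
    results.append(chosen)

print(results)
```

Every page whose name is stripped down to the same residue ends up distinguished only by a counter, which is why segments that carry no information about the page are a poor fit for navigation.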

This approach should work for all languages, not just Chinese, but you'll have to test it for yourself. If you find any bugs, please let us know by sending a pull request.

The code is available at https://github.com/creunaab/EPi.UrlTransliterator, and a package with the same name should be available in the EPiServer NuGet feed shortly.


Comments

Oct 28, 2016 12:06 PM

Nice work!

Oct 28, 2016 03:08 PM

Yes, as you have pointed out, we have made it possible to change the default handling for URL segments.

We will officially support this from version 10.1.0 (no need to replace implementations in the container), where we have added encoding support as well (even if most browsers handle unencoded URLs with IRI characters, the recommendation is to encode such characters). It will be announced when we release 10.1.0.

Stephan Lonntorp Oct 28, 2016 03:25 PM

@Johan, care to elaborate? Have you implemented transliteration, encoding, or both?

Oct 28, 2016 11:02 PM

There is a class UrlSegmentOptions, registered as a singleton in the IOC container, where you can specify which regexp a URL segment should be validated against (this exists in CMS 10 as well), meaning you can, for example, specify a regexp that allows Unicode characters. So you can replace the default instance with your own instance.

What we have added in 10.1 is encoding, that is, IRI URLs get encoded. In CMS 10 those URLs will not be encoded (most browsers will handle them correctly anyway). In CMS 10.1 we have also opened up simple address to allow IRI characters.

Oct 28, 2016 11:07 PM

So, to clarify, you do not need to replace IUrlSegmentGenerator in the IOC container; you can instead set the regexp on UrlSegmentOptions.

Vincent Nov 1, 2016 12:54 AM

Nice work mate.

I can read Chinese, and I can confirm each Chinese character is translated to the appropriate Pinyin.

Stephan Lonntorp Nov 2, 2016 01:57 PM

@code monkey: Thanks!

@Johan: Being able to replace the regexp isn't really useful for transliteration though. I use another library for transliteration, and AFAIK that has nothing to do with regular expressions. It's nice that you've made it configurable, but being able to change a regular expression really just caters to the use case of generating URL segments with regular expressions.
