Accented characters in URLs not being replaced

Danny WINBOURNE

Hi,

Accented chars in URLs not being replaced

We have found an issue where certain characters that should form part of the URL are being removed instead of replaced.

For example, we have the following Czech page title, which also forms the URL:

Náměty pro cvičení

Episerver correctly replaces the "á" with a standard "a".
However, the "ě" and the "č" are just being removed from the URL.

The URL is therefore /Namty-pro-cvieni/ when it should be /namety-pro-cviceni/.

Is there a "mapping" file somewhere which allows you to specify replacement characters?

#49055

Mar 02, 2011 16:20

Anders Hattestad

You can change this behavior by attaching a handler to this event:

UrlSegment.CreatingUrlSegment += new EventHandler<UrlSegmentEventArgs>(UrlSegment_CreatingUrlSegment);

Then you can change the replacement string or, as in this example, translate the page name to English when the URL segment ends up containing only "-" characters:

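        // Needs: using System; using System.Net; using System.Text;
        //        using System.Text.RegularExpressions; using EPiServer.Web;
        static void UrlSegment_CreatingUrlSegment(object sender, UrlSegmentEventArgs e)
        {
            // Start from the existing segment and fall back to the page name.
            string urlSegment = e.PageData.URLSegment;
            if (string.IsNullOrEmpty(urlSegment))
            {
                urlSegment = e.PageData.PageName ?? "";
            }
            string urlFriendlySegment = ReplaceIllegalChars(urlSegment);
            // If nothing but dashes is left, translate the page name to English instead.
            if (string.IsNullOrEmpty(urlFriendlySegment) || urlFriendlySegment.Replace("-", "") == "")
            {
                // Sanitize the translation too, since it may contain spaces or punctuation.
                urlFriendlySegment = ReplaceIllegalChars(TranslateUsingGoogle(e.PageData.PageName, e.PageData.LanguageID, "en"));
            }
            e.PageData.URLSegment = urlFriendlySegment;
        }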
        internal static string ReplaceIllegalChars(string inputString)
        {
            // Anything outside A-Z, a-z, 0-9, "-", "_" and "~" is treated as illegal.
            Regex regexFindInvalidUrlChars = new Regex(@"[^A-Za-z0-9\-_~]", RegexOptions.Compiled);

            StringBuilder builder = new StringBuilder(inputString);
            MatchCollection matches = regexFindInvalidUrlChars.Matches(inputString);
            for (int i = 0; i < matches.Count; i++)
            {
                // Use the replacement from the built-in character map when there is one.
                object mapped = UrlSegment.GetURLCharacterMap()[builder[matches[i].Index]];
                if (mapped != null)
                {
                    builder[matches[i].Index] = (char)mapped;
                }
                else
                {
                    // No mapping exists: mark the character for removal.
                    builder[matches[i].Index] = '?';
                }
            }
            // Drop every character that had no mapping.
            builder.Replace("?", "");
            return builder.ToString();
        }

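        public static string TranslateUsingGoogle(string text, string fromLang, string toLang)
        {
            if (fromLang == null)
                fromLang = "auto";
            if (toLang == null)
                toLang = "en";

            // URL-encode the values so spaces and non-ASCII characters survive in the query string.
            string address = string.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}",
                Uri.EscapeDataString(text ?? ""), Uri.EscapeDataString(fromLang + "|" + toLang));
            string html = new WebClient().DownloadString(address);

            // Scrape the translation out of the element with id=result_box.
            int start = html.IndexOf("id=result_box");
            if (start < 0)
                return "";
            html = html.Substring(start, Math.Min(500, html.Length - start));
            html = html.Substring(html.IndexOf(">") + 1);
            int end = html.IndexOf("</div");
            if (end >= 0)
                html = html.Substring(0, end);
            // Strip any remaining tags such as <span>.
            return new Regex("<[^>]*>").Replace(html, "");
        }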

#49060

Edited, Mar 02, 2011 20:37

Danny WINBOURNE

Thanks for the reply, but it doesn’t recognise the characters that are causing my issue, so it replaces them with nothing, just like the built-in functionality.

Is it possible to add items to the GetURLCharacterMap() collection?

#49077

Mar 03, 2011 9:33

Anders Hattestad

You have to write your own CreatingUrlSegment handler and, for instance, replace "ě" and "č" with "e" and "c" before you call the base method.
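
One way to do that replacement without listing every character by hand is to strip the diacritics with Unicode normalization. The following is only a minimal sketch under that assumption: the class name UrlSegmentText and the method RemoveDiacritics are hypothetical and not part of Episerver, and it uses only standard .NET calls (string.Normalize and CharUnicodeInfo.GetUnicodeCategory).

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper (not an Episerver API): strips accents before the URL segment is built.
public static class UrlSegmentText
{
    public static string RemoveDiacritics(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return string.Empty;
        }

        // FormD splits an accented letter into its base letter plus combining marks,
        // e.g. "ě" becomes "e" followed by a combining caron.
        string decomposed = input.Normalize(NormalizationForm.FormD);

        StringBuilder builder = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Keep everything except the combining marks themselves.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                builder.Append(c);
            }
        }

        // Recompose whatever is left into the usual composed form.
        return builder.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

With this, UrlSegmentText.RemoveDiacritics("Náměty pro cvičení") returns "Namety pro cviceni", which could then be passed to ReplaceIllegalChars from the handler above. Letters that have no Unicode decomposition, such as "ł" or "ø", are left untouched and would still need an explicit replacement.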

#49078

Mar 03, 2011 9:35