Can I mix standard characters and emojis?


10.Oct.2021

Yes, with 🏹.to you can create hybrid short links that mix characters and emojis such as 🏹.to/❤️Barcelona or 🏹.to/🐎Horse (which renders as: → 👉 xn--trc04f).

This is possible because we use the following Unicode character encoding: UTF-8

 


 

Emojis are encoded in UTF-8, so they convert cleanly to characters.

 

This is the encoding our site uses for both shortening and expanding URLs. You can see it when you expand a URL or when you create a new one: https://encodex.com/tools/unicode-decoder

 

UTF-8 can also represent the writing systems of most languages, meaning you can mix languages in URLs.

We only strip the unsafe characters ʼ , ˣ , ̏ so the remaining characters are safe to use without worrying about encoding errors.

 


 

You can safely use this encoding on your own sites as well. Be aware of one pitfall: if you use the encodeURIComponent function or other JavaScript code that encodes as UTF-8 (even when it doesn't say so explicitly), the short link reaches us as UTF-8 bytes, and decoding those bytes with the charset set to ASCII fails: https://encodex.com/tools/convert/unicode
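As a rough illustration of what UTF-8 percent-encoding does to a hybrid slug, here is a minimal Python sketch. It uses `urllib.parse.quote`, which behaves similarly (not identically) to JavaScript's encodeURIComponent; the slug value is a hypothetical example, and nothing here is taken from our actual implementation.

```python
from urllib.parse import quote, unquote

# Hypothetical slug mixing an emoji with ASCII letters.
# "\U0001F40E" is the horse emoji (U+1F40E).
slug = "\U0001F40EHorse"

# quote() encodes non-ASCII characters as UTF-8 and percent-escapes each byte.
# safe="" also escapes "/", so the slug stays a single path segment.
encoded = quote(slug, safe="")
print(encoded)  # %F0%9F%90%8EHorse

# unquote() reverses the process and decodes the bytes as UTF-8 by default.
decoded = unquote(encoded)
assert decoded == slug

# Forcing an ASCII decoder onto the same bytes fails, as described above.
try:
    slug.encode("utf-8").decode("ascii")
    ascii_decoding_failed = False
except UnicodeDecodeError:
    ascii_decoding_failed = True
```

The ASCII letters pass through unchanged, while the emoji becomes four percent-escaped bytes.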

 

This is not an issue on our site, since we care about mixing languages and Unicode, but in general it can happen wherever URLs are encoded.

If you mix languages without UTF-8, the resulting URL can come out garbled; with UTF-8 it should work fine.

 


 

UTF-8 is a variable-width encoding that can encode every Unicode character while remaining backward compatible with ASCII. This means that any valid ASCII text is also valid UTF-8 and is stored as exactly the same bytes in memory. The downside is that the same bytes can stand for different characters depending on which encoding is used to read them.
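Both properties can be checked with a short Python sketch. The sample strings are illustrative only, and Latin-1 is picked here simply as one example of a different encoding.

```python
ascii_text = "Barcelona"

# Pure ASCII text yields byte-for-byte identical output in ASCII and UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# The same bytes can mean different things depending on the decoder.
raw = "\u00e9".encode("utf-8")     # e-acute becomes two bytes: 0xC3 0xA9
as_utf8 = raw.decode("utf-8")      # one character: U+00E9
as_latin1 = raw.decode("latin-1")  # two characters: U+00C3 U+00A9 (mojibake)

print(same_bytes)      # True
print(raw)             # b'\xc3\xa9'
print(len(as_utf8))    # 1
print(len(as_latin1))  # 2
```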

 

To solve this, we use heuristics to detect whether a short link is UTF-8 encoded. As mentioned above, our charset is set to ASCII so that the URL can be decoded either way without issues. If you send us non-ASCII bytes through our API, the decoding errors you get back will differ depending on the language of your site (i.e. where you are sending them from).
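The exact heuristics are not described in this article. One common approach, shown here purely as an assumed sketch, is to attempt a strict UTF-8 decode and treat failure as "not UTF-8". The function name `looks_like_utf8` is hypothetical.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8 with at least one non-ASCII byte.

    Assumed heuristic for illustration, not the service's real detection logic.
    """
    if data.isascii():
        # Pure ASCII is valid UTF-8 too, but there is nothing to detect.
        return False
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


plain = looks_like_utf8(b"Horse")
emoji = looks_like_utf8("\U0001F40EHorse".encode("utf-8"))
latin1 = looks_like_utf8("\u00e9t\u00e9".encode("latin-1"))
print(plain, emoji, latin1)  # False True False
```

Random non-UTF-8 byte sequences rarely happen to be valid UTF-8, which is why this kind of check works reasonably well in practice, though it remains a heuristic.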

 

Please note that only ASCII characters take 1 byte each in UTF-8; other characters take 2 to 4 bytes. Emojis are standard Unicode characters, and most of them take 4 bytes each (more when an emoji is built from several code points). The utf8encode function changes the value of this encoding, but we don't support it.
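The byte counts are easy to verify. The following Python sketch measures a few sample characters; the selection is illustrative and written with escape sequences so the code points are unambiguous.

```python
samples = {
    "latin letter A": "A",                        # U+0041, ASCII
    "e-acute": "\u00e9",                          # U+00E9
    "euro sign": "\u20ac",                        # U+20AC
    "horse emoji": "\U0001F40E",                  # U+1F40E
    "red heart + variation selector": "\u2764\ufe0f",  # two code points
}

# Number of bytes each sample occupies once encoded as UTF-8.
sizes = {name: len(text.encode("utf-8")) for name, text in samples.items()}

for name, size in sizes.items():
    print(f"{name}: {size} byte(s)")
```

The red heart shows why counting "characters" is tricky: what appears as one emoji is two code points of 3 bytes each.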

 


 

 *If you want a shorter version of this article that's a lot more technical and uses mathematics, check https://encodex.com/blog/can-i-use-emojis-in-urls#technical*

 

Also published on Medium.

How can you mix characters and emojis to create a funny or useful link? :D Published by Aniket Yadav on 19 August 2017 on Encodex https://blog.encodex.com/category/tutorials/ . Spanish translation by Carmen Tornero García, independent professional freelance translator.

 

FAQ 

- Short answer: No, this standard doesn't exist, and we cannot implement it because it would break compatibility with old URL decoders. https://en.wikipedia.org/wiki/Percent-encoding#Example_of_percent-encoded_characters

- Short answer: No. We could offer a different API using UTF8-RAW or UTF-8 bytes, but those aren't subject to any standard either. https://en.wikipedia.org/wiki/UTF-8#Comparison_of_encodings

We currently support over 200 languages. Adding dedicated support for every language is not only tedious but also expensive, so it's better to focus on what most people use :) For this reason, we don't want to add a separate emoji URL encoding: it's not a standard, and Unicode is well supported by all browsers, so UTF-8 remains our main option. If you need non-ASCII characters or emojis in short links, use https://encodex.com/emoji

 

View this article on blog: https://blog.encodex.com/article/can-i-mix-standard-characters-and-emojis (published on 2017 Aug 19)

See also: Working with emojis in HTML5 - Emojis and incorporated characters in browsers' DOM

- Look, this week I bring you an article about Unicode incorporated characters, because I don't know whether we'll have room for anything else :) Published by Aniket Yadav on 10 August 2017 on Encodex: https://blog.encodex.com/article/unicode-characters

Now, to the technical part :)

UTF8 is a variable-width encoding that can encode every Unicode character while remaining backward compatible with ASCII. This means that any valid ASCII text is also valid UTF8 and is stored as exactly the same bytes in memory. The downside is that every code point from 128 upward needs bytes with the highest bit set; character 256 (decimal), for example, takes 2 bytes instead of 1. The forms discussed here are:

UTF8-RAW - each UCS-4 code point takes exactly 4 bytes (32 bits, i.e. an unsigned int). Note that UTF8-RAW is not a standard name; this fixed-width layout is what the Unicode standard calls UTF-32. It is convenient for processing Unicode strings, but not good for interchange because it's not compatible with decoders that expect ASCII. This form can represent any possible Unicode character and thus can represent emojis as well.

UTF8-NOBOM - each code point takes a variable number of bytes, at least 1 and at most 4. Code points up to U+FFFF take 1 to 3 bytes; code points above U+FFFF take 4 bytes, where the first byte has the form 11110xxx and each of the three following bytes has the form 10xxxxxx, the x positions holding the bits of the code point. This form is more compact than UTF8-RAW, and ASCII characters keep their single-byte values, but the multi-byte sequences still cannot be read by decoders that expect only ASCII.
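To make the bit layout concrete, here is a minimal Python sketch that encodes a single code point by hand following the standard UTF-8 rules, then compares the result with Python's built-in encoder and with the fixed 4-byte UTF-32 form. The function name `encode_code_point` is made up for this illustration.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:        # 1 byte:  0xxxxxxx (identical to ASCII)
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([       # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


horse = 0x1F40E  # horse emoji
utf8_bytes = encode_code_point(horse)
utf32_bytes = chr(horse).encode("utf-32-be")

print(utf8_bytes.hex())   # f09f908e
print(utf32_bytes.hex())  # 0001f40e

# The hand-written encoder agrees with the built-in one.
assert utf8_bytes == chr(horse).encode("utf-8")
```

For an emoji both forms take 4 bytes; the size advantage of the variable-width form shows up on ASCII text, where it needs 1 byte per character instead of 4.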
