Non-Latin characters in TLDs

GigaWatt

Unread post by GigaWatt » 2019-04-03, 23:47

I don't really know if this is a bug or a feature request :think:... the moderators can move the thread in either case.

OK... so, here's the problem. Copy/pasting URLs with a non-Latin TLD.

Example - copy/pasting a URL with a Latin TLD, where the rest of the URL is in Cyrillic:

Code: Select all

https://plusinfo.mk/заев-катица-и-сјо-ја-вратија-надежта-за/
https://plusinfo.mk/%D0%B7%D0%B0%D0%B5% ... %B7%D0%B0/

No problem, the browser converts the characters to %XX. But, when the TLD is not a Latin one, here's what happens.

Code: Select all

http://кто.рф/
http://кто.рф/

As you can see, it doesn't convert the characters to %XX.
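For reference, the %XX form of the path in the first example can be reproduced outside the browser. This is only a minimal sketch using Python's standard `urllib.parse.quote`; it illustrates percent-encoding in general and is not a description of how Pale Moon implements it. The variable names are made up for the example.

```python
from urllib.parse import quote, unquote

# First word of the path from the first example URL above.
segment = "заев"

# quote() encodes the text as UTF-8 and writes every non-ASCII byte
# as %XX, where XX is the byte value in hexadecimal.
encoded = quote(segment)
print(encoded)  # %D0%B7%D0%B0%D0%B5%D0%B2

# unquote() reverses the process and gives back the readable text.
assert unquote(encoded) == segment
```

Each Cyrillic letter takes two bytes in UTF-8, which is why every letter turns into two %XX groups.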

As I said, I don't know whether this is expected behavior, something that's not implemented yet, or a bug :|.

In the first example, I even had to "bend" the expected behavior a bit by copy/pasting the TLD first, then the rest of the URL in a text editor to get to the URL in the code LOL :). There was no way I could copy/paste the whole URL without the Cyrillic part of the URL being converted to %XX. But, that's not what happens when the TLD has non-Latin characters in it :|.

vannilla

Re: Non-Latin characters in TLDs

Unread post by vannilla » 2019-04-04, 01:11

It's intended behaviour.
The management of non-ASCII characters (i.e. anything above code 127 decimal, which UTF-8 encodes as more than one byte) is defined in different standards: RFC 3986 and RFC 3987 for URIs and IRIs, and the IDNA RFCs (RFC 5890 and its companions, with Punycode in RFC 3492) for domain names.
When non-ASCII characters are in the resource name (the part after the top-level domain), they are to be percent-encoded: each byte of the character's UTF-8 encoding is written as %XX, where XX is the byte's value in hexadecimal.
This is also true for spaces (%20, i.e. hexadecimal 20, decimal 32).
When non-ASCII characters are in the domain name itself (top-level or not), clients (so not only browsers) have to display them as-is, so users will read the address proper rather than a %XX-encoded mess.
Actually, those domain names are encoded specially (as Punycode, where each encoded label starts with "xn--"), but that encoding isn't normally visible to users. It is, however, useful for security: there are many characters (so-called "confusables") that look similar to each other, despite being completely different. The use of these confusables is a common phishing method.
If you check the website certificate or something like that and see the special encoding where you expected a plain ASCII name (say, in place of what looked like an ordinary lowercase letter), then someone is probably trying to pull a scam on you.
This is a very simplified explanation, but we're not here to discuss RFCs in their entirety.
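As an illustration of the "special encoding" of domain names mentioned above, here is a minimal sketch using Python's built-in `idna` codec (which implements the older IDNA 2003 rules; browsers may apply newer ones). The helper names `to_ascii_host` and `to_unicode_host` are hypothetical and exist only for this example.

```python
def to_ascii_host(host: str) -> str:
    """Return the ASCII form of a host name, one "xn--" label per non-ASCII label."""
    # The codec splits on dots and Punycode-encodes each label separately;
    # labels that are already ASCII pass through unchanged.
    return host.encode("idna").decode("ascii")


def to_unicode_host(host: str) -> str:
    """Return the readable Unicode form of an ASCII ("xn--") host name."""
    return host.encode("ascii").decode("idna")


ascii_host = to_ascii_host("кто.рф")
print(ascii_host)                   # the form that DNS and certificates use
print(to_unicode_host(ascii_host))  # back to the form shown in the address bar
```

The TLD label "рф" becomes "xn--p1ai" this way, while an all-ASCII name such as "plusinfo.mk" is returned as-is.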

GigaWatt

Re: Non-Latin characters in TLDs

Unread post by GigaWatt » 2019-04-04, 12:23

Thank you for the detailed explanation ;).

I wasn't aware that TLDs (and domain names in general) don't fall under the "convert to %XX" rule when copy/pasting URLs. That explains a lot. And yes, I know it's more convenient to actually see the TLD instead of a string of %XX characters, but apparently, this is a problem for some PHP platforms when recognizing and converting pasted links into hyperlinks in posts... that is why I was asking, to try and establish whether the problem is browser side or at the other end.

Once again, thank you ;).

vannilla

Re: Non-Latin characters in TLDs

Unread post by vannilla » 2019-04-04, 12:42

If PHP platforms can't handle multibyte names, that's a problem of the platform.
The RFC is there, the developers just have to follow it and build the interface from there.
I'm aware that this isn't as easy as saying it, but still.
Out of curiosity, what is the problem with pasting links?

GigaWatt

Re: Non-Latin characters in TLDs

Unread post by GigaWatt » 2019-04-05, 22:58

It's not that this particular platform can't handle UTF-8 characters. The problem is converting the URLs into hyperlinks in posts: it doesn't recognize URLs with internationalized (IDN) domain names, so it can't convert them to hyperlinks. When a post is made, all URLs with non-ASCII Latin or Cyrillic characters in the domain are posted as plain text, not as hyperlinks.

For example, phpBB recognizes the address with the IDN TLD in my previous post as a URL, so it converts it to a hyperlink. The platform I'm referring to doesn't do that.

On the other hand, it has no problem detecting the URL and converting it to a hyperlink if the TLD is a standard ASCII one (non-IDN). To sum up, if the URL is like this:

Code: Select all

http://somedomain.tld/%B7%D0%B0/

no problem, URL detection works and it converts it to a hyperlink. But if the TLD is like this:

Code: Select all

http://нешто.тлд/%B7%D0%B0/

it doesn't matter what comes after the TLD, it doesn't detect it as a URL. Basically, the problem is what comes right after http://. It doesn't recognize non-ASCII characters as part of a domain name, so it doesn't convert the address to a hyperlink.
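The behaviour described here is what one would expect from a link detector whose pattern only allows ASCII in the host part. The platform's real pattern is not known, so the following is purely a hypothetical sketch: `ASCII_ONLY_RE` stands in for that kind of detector and `UNICODE_RE` for one that accepts non-ASCII labels.

```python
import re

# Hypothetical detector that only accepts ASCII letters, digits, dots and
# hyphens in the host part.
ASCII_ONLY_RE = re.compile(r"https?://[A-Za-z0-9.-]+(?:/[^\s<>\"]*)?")

# Same idea, but \w is Unicode-aware for str patterns in Python 3,
# so Cyrillic (and other non-ASCII) labels are accepted as well.
UNICODE_RE = re.compile(r"https?://[\w.-]+(?:/[^\s<>\"]*)?")

post = "First http://somedomain.tld/%B7%D0%B0/ then http://нешто.тлд/%B7%D0%B0/ end"

print(ASCII_ONLY_RE.findall(post))  # finds only the ASCII-host URL
print(UNICODE_RE.findall(post))     # finds both URLs
```

Both patterns are deliberately loose and are not a full URL grammar; they only show why the second address would be left as plain text.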



PS: Would asking for a config switch in Pale Moon to also convert non-ASCII characters in TLDs to %XX be too much :P :D?

Moonchild

Re: Non-Latin characters in TLDs

Unread post by Moonchild » 2019-04-06, 02:07

This is not a limit of the browser. Converting to hyperlinks is by definition something done by the software presenting the content to you -- if it doesn't understand IDNs, then it needs to be taught how to convert them to punycode representation. Asking for a solution in the browser for this is the wrong place, sorry.
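A rough sketch of what such a conversion could look like on the software side, assuming absolute http(s) addresses and using only Python's standard library. The function name `iri_to_uri` is hypothetical, user:password parts are dropped for brevity, and the built-in `idna` codec follows the older IDNA 2003 rules.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Convert an address with non-ASCII characters to an all-ASCII URL."""
    parts = urlsplit(iri)
    # Host: Punycode ("xn--") form, label by label.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Path, query and fragment: percent-encode the UTF-8 bytes. "%" is kept
    # as safe so that escapes already present are not encoded a second time.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://кто.рф/"))
print(iri_to_uri("https://plusinfo.mk/за/"))  # https://plusinfo.mk/%D0%B7%D0%B0/
```

The readable address can still be used as the visible link text, with the converted form going into the hyperlink target.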
"Sometimes, the best way to get what you want is to be a good person." -- Louis Rossmann
"Seek wisdom, not knowledge. Knowledge is of the past; wisdom is of the future." -- Native American proverb
"Linux makes everything difficult." -- Lyceus Anubite

GigaWatt

Re: Non-Latin characters in TLDs

Unread post by GigaWatt » 2019-04-06, 16:06

Yes, I know that ;)... and that is what I was actually saying: the software doesn't recognize links with IDN TLDs, so it can't convert them to hyperlinks ;).
Moonchild wrote: Asking for a solution in the browser for this is the wrong place, sorry.
That was a joke :). I should've added j/k after that, sorry :). I know the problem is not browser side (well, at least now... I suspected as much, but wasn't sure ;)).

Moonchild

Re: Non-Latin characters in TLDs

Unread post by Moonchild » 2019-04-06, 17:52

GigaWatt wrote: That was a joke :). I should've added j/k after that, sorry :). I know the problem is not browser side (well, at least now... I suspected as much, but wasn't sure ;)).
But you posted it in the "Suggestions/feature requests" board, which is the place to make suggestions or feature requests for the browser, i.e. asking for browser solutions to problems... :) 8-)

GigaWatt

Re: Non-Latin characters in TLDs

Unread post by GigaWatt » 2019-04-07, 04:38

After vannilla explained that this is basically a platform-related issue (not related to the browser itself), I thought I'd make a little joke ;). I wasn't aware that this is how non-Latin TLDs are treated by browsers when copy/pasting links.

To be fair, I didn't try it out in any other browser :P... I just assumed Pale Moon is "special" (and it kind of is :)).
