Question about specific regex for email address validation

UCyborg
Fanatic
Posts: 171
Joined: 2019-01-10, 09:37

Question about specific regex for email address validation

Unread post by UCyborg » 2023-02-02, 09:04

The app I'm dealing with at work uses some weird regex to structurally validate mail addresses. The validation function looks like:

Code: Select all

function validateEmail(sEmail) {
  var filter = /^([\w-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w-]+\.)+))([a-zA-Z]{2,20}|[0-9]{1,3})(\]?)$/;
  if (filter.test(sEmail)) {
    return true;
  }

  return false;
}

regex101.com reports an error for this expression: "You cannot create a range with shorthand escape sequences." Since Pale Moon 32.0.0, the validation no longer passes for addresses like name.lastname@provider.com. Of course, the other browsers still stomach it.

The regex should probably be modified. What do you experts think? Hopefully, we're really just more compliant now...?

Maybe the first part in parentheses was meant to be written like: ([\w\.-]+)

Moonchild
Pale Moon guru
Posts: 35404
Joined: 2011-08-28, 17:27
Location: Motala, SE

Re: Question about specific regex for email address validation

Unread post by Moonchild » 2023-02-02, 09:36

UCyborg wrote:
2023-02-02, 09:04
Maybe the first part in parentheses was meant to be written like: ([\w\.-]+)
Correct.

Web browsers made an accepted concession for this in the past, because there was too much PEBCAK with web devs not understanding that - designates a range in a set. The concession is to parse the hyphen literally if either endpoint of the would-be range is a character class (such as \w) instead of a single character, except in Unicode mode, where the range rule is enforced strictly.
I never agreed with this, even when I first implemented it way back when, and the complacency about accepting invalid ranges still results in issues, as you can see. IMHO an invalid range should always throw.
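
To make the concession concrete, here is a minimal sketch. It assumes an engine that follows the current ECMAScript rules (a recent Node.js or one of the mainstream browsers); the helper name compilesInUnicodeMode is made up for illustration. In legacy mode the hyphen in [\w-\.] is taken literally, while the same class is rejected when the u flag is set.

```javascript
// Legacy (non-Unicode) mode: \w is a class, not a single character, so the
// hyphen after it cannot start a range and is treated as a literal hyphen.
const legacy = /^[\w-\.]+$/;
const legacyResults = {
  dotted: legacy.test('name.lastname'),   // word characters plus a dot
  hyphenated: legacy.test('first-last'),  // word characters plus a hyphen
};

// Unicode mode: compile the pattern from a string so that a rejected
// pattern shows up as a catchable SyntaxError instead of a parse failure.
function compilesInUnicodeMode(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

const strictOk = compilesInUnicodeMode('^[\\w-\\.]+$'); // invalid range, rejected
const fixedOk = compilesInUnicodeMode('^[\\w.-]+$');    // hyphen last, accepted

console.log(legacyResults, strictOk, fixedOk);
```

Moving the hyphen to the end of the set (or escaping it) gives a pattern that means the same thing in both modes.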

I'm fixing a typo which breaks this one in particular at the moment (seems too much copypasta has been going on with [\w-\.] being everywhere...? :P) Issue #2103 (UXP)
"Sometimes, the best way to get what you want is to be a good person." -- Louis Rossmann
"Seek wisdom, not knowledge. Knowledge is of the past; wisdom is of the future." -- Native American proverb
"Linux makes everything difficult." -- Lyceus Anubite

BenFenner
Astronaut
Posts: 588
Joined: 2015-06-01, 12:52
Location: US Southeast

Re: Question about specific regex for email address validation

Unread post by BenFenner » 2023-02-02, 14:19

While we're on the topic, e-mail validation is a tricky beast and something I have researched to death. There are "good enough" e-mail RegEx checks out there that I've been happy with in the past ( http://www.regular-expressions.info/email.html ) when paired with a 254-char length validation. There is also a total and complete e-mail spec RegEx example out there as a reference ( https://code.iamcal.com/php/rfc822/full_regexp.txt from https://www.iamcal.com/publish/articles ... sing_email ), but it is 22,174 characters long. While kept up to date, it is a beast and I much prefer things I understand and can debug myself. I've never tried to use it.

That said, I eventually found some issues with even the "good enough" RegEx I was using. For example, the e-mail spec allows for spaces and @ symbols in the local part. The e-mail spec actually says you should accept pretty much anything in the local part. To accept double quotes, spaces, and @ symbols in the local part, I loosened up the e-mail validation on all of my web apps and ended up with something I am even happier with. I think the key here is not to try to have the RegEx do everything. For example, the true length limit of an e-mail address is 254 characters. There is no reason to have the RegEx deal with that as well. Just check that independently. I also split the task of handling whitespace out into a separate validation. So I check three things: length, whitespace in the local part, and then the overall structure with RegEx.

I think my [Ruby] code says it best:

Code: Select all

  # Original e-mail format RegEx cribbed from: http://www.regular-expressions.info/email.html
  # Local part made much more permissive based on advice from: https://haacked.com/archive/2007/08/21/i-knew-how-to-validate-an-email-address-until-i.aspx
  # More fun reading: https://www.iamcal.com/publish/articles/php/parsing_email
  #
  # Don't attempt to make this RegEx self-commenting unless you know exactly what you're doing.
  # 1) Placing the RegEx on multiple lines allows for line comments only if you ignore whitespace in the RegEx and we
  #    have a space character in there we want to preserve.
  # 2) Building regular expressions from individual strings in Ruby is more difficult than you'd think. Interpolating via
  #    double-quoted strings or even into RegEx literals (it can be done!) cause unexpected character escaping or other
  #    nonsense.
  # For these reasons the RegEx is assigned as one long line but described individually below:
  #
  # Description          RegEx                                      Explanation
  # --------------------------------------------------------------------------------------------------------------------
  # Begin-string anchor: \A                                         Match the beginning of the string. The JavaScript version requires the less-ideal begin-line anchor `^`.
  # Local part:          [A-Z0-9 .!#$%&\'"*+\/\\=?^_`{|}~@-]{1,64}  1 to 64 characters inclusive of alphanumeric, a literal space, and .!#$%&'"*+/\=?^_`{|}~@- some of which are escaped with a backslash.
  # Delimiter:           @                                          1 required literal @ symbol. The e-mail delimiter we all know and love. (Note the local part can also include this character.)
  # Domain part:         (?:[A-Z0-9-]{1,63}\.){1,125}[A-Z]{2,63}    See: http://www.regular-expressions.info/email.html
  # End-string anchor:   \z                                         Match the end of the string. The JavaScript version requires the less-ideal end-line anchor `$`.
  #
  # The final resulting RegEx is designed to be used while ignoring case.
  EMAIL_FORMAT_REGEX        = /\A[A-Z0-9 .!#$%&\'"*+\/\\=?^_`{|}~@-]{1,64}@(?:[A-Z0-9-]{1,63}\.){1,125}[A-Z]{2,63}\z/i.freeze
  EMAIL_FORMAT_REGEX_FOR_JS = /^[A-Z0-9 .!#$%&\'"*+\/\\=?^_`{|}~@-]{1,64}@(?:[A-Z0-9-]{1,63}\.){1,125}[A-Z]{2,63}$/i.freeze

That's the main RegEx.
There is then also an overall length validation to make sure the address is 254 chars or fewer.
And then here is my whitespace validation:

Code: Select all

#
# This validator is used alongside normal length and RegEx format validations to ensure e-mail addresses are valid.
# This specific validator makes sure that if spaces exist in the local portion of the e-mail address that they are only
# included inside a double-quoted section of said local portion.
# Blanks are allowed; use a separate presence validation to enforce that if desired.
#
# Valid examples:
#  ''
#  'normal.valid@email.address'
#  'missing.domain.delimiter'
#  '"This e-mail address has quoted spaces in the local part"@valid.address'
#  '"This e-mail address is otherwise invalid"@'
#  '"Two or more of these"quoted.sections"are also completely valid"@email.address'
#
# Invalid examples:
#   'Spaces without quotes@domain.tld'
#   'Spaces but some are "not inside" quotes@domain.tld'
#   '"Two of these" quoted sections "but some spaces outside"@domain.tld'
#
class EmailAddressLocalWhitespaceValidator < ActiveModel::EachValidator

  ###################################################################
  #
  # #validate_each()
  #
  ###################################################################
  def validate_each(record, attribute, value)
    # Blanks are allowed.
    return if value.blank?

    attribute_name = attribute.to_s.titleize
    message        = ''

    if !self.class.whitespace_valid?(value)
      message = "The provided #{attribute_name} contains invalid spaces. If the e-mail address truly contains spaces they should be wrapped inside double quotes."
    end

    record.errors.add(attribute, message) if message.present?
  end



  ###################################################################
  #
  # Public Class Methods
  #
  ###################################################################
  def self.whitespace_valid?(email_address)
    return true if email_address.blank?  # Entire e-mail address was blank, so local part whitespace is valid.

    local = get_local_part(email_address)

    return true  if local.blank?           # Local part was blank or no domain delimiter (@) exists, so assume local part whitespace is valid.
    return true  if !local.include?(' ')   # No spaces in local part, so local part whitespace is valid.
    return false if local.count('"').odd?  # Spaces exist but with the wrong number of double-quotes, so local part whitespace is invalid.

    # The way split works, the even indexes always hold the unquoted parts regardless of where the quotes are.
    # We check the even indexes for [invalid] spaces.
    # For example:
    #  even even even "odd odd odd" even even "odd odd"@domain.tld
    #  "odd odd odd" even even even "odd odd" even even@domain.tld
    parts = local.split('"')
    parts.each_with_index do |part, index|
      return false if index.even? && part.include?(' ')
    end
    true
  end



  ###################################################################
  #
  # Private Class Methods
  #
  ###################################################################
  def self.get_local_part(email_address)
    return '' if email_address.blank?

    parts = email_address.split('@') # Split address by '@' so we can remove the domain part.
    parts.pop if parts.size > 1      # Remove the domain part from the array (in place) if it exists.
    parts.join('@')                  # Join the local part back together.
  end
  private_class_method :get_local_part
end

Of course this all relies on servers down the line you don't control also adhering to the spec and not blowing up on spec-legal but otherwise uncommon things like having two @ symbols in an address or whitespace in the local part. That should be monitored and taken into account. I've not run into any issues so far and it's been a few years in production.
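
For anyone who wants the same three checks on the client side, here is a rough JavaScript sketch of the idea. It is not a line-for-line port of the Ruby above: the names EMAIL_FORMAT, localWhitespaceValid and validateEmailAddress are made up for illustration, the local part is taken as everything before the last @, and unlike the Ruby validator the combined function rejects blank input.

```javascript
// Same pattern as EMAIL_FORMAT_REGEX_FOR_JS in the Ruby code above.
const EMAIL_FORMAT = /^[A-Z0-9 .!#$%&\'"*+\/\\=?^_`{|}~@-]{1,64}@(?:[A-Z0-9-]{1,63}\.){1,125}[A-Z]{2,63}$/i;

// Spaces in the local part are only acceptable inside double-quoted sections.
function localWhitespaceValid(address) {
  const at = address.lastIndexOf('@');
  const local = at === -1 ? address : address.slice(0, at);

  if (!local.includes(' ')) return true;

  // After splitting on double quotes, even indexes hold the unquoted parts.
  const parts = local.split('"');
  const quoteCount = parts.length - 1;
  if (quoteCount % 2 !== 0) return false;

  return parts.every((part, index) => index % 2 === 1 || !part.includes(' '));
}

// Three independent checks: length, whitespace in the local part, structure.
function validateEmailAddress(address) {
  if (typeof address !== 'string' || address.length === 0) return false;
  if (address.length > 254) return false;
  if (!localWhitespaceValid(address)) return false;
  return EMAIL_FORMAT.test(address);
}

console.log(validateEmailAddress('name.lastname@provider.com'));       // true
console.log(validateEmailAddress('Spaces without quotes@domain.tld')); // false
```

Keeping the checks separate means each one can be tested and reported on its own, which is the point made above about not having the RegEx do everything.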
Last edited by BenFenner on 2023-02-02, 15:49, edited 2 times in total.

BenFenner
Astronaut
Posts: 588
Joined: 2015-06-01, 12:52
Location: US Southeast

Re: Question about specific regex for email address validation

Unread post by BenFenner » 2023-02-02, 14:47

This has me curious, in the strictest sense should the 3rd hyphen in the RegEx partial below be escaped since it does not describe a range but instead describes a literal hyphen character?
I'm not entirely sure I understand which mistake web devs have been making that caused the concession mentioned above. I'd prefer not to be one of those web devs if possible.

Code: Select all

/\A[A-Z0-9 .!#$%&\'"*+\/\\=?^_`{|}~@-]{1,64}\z/i

BenFenner
Astronaut
Posts: 588
Joined: 2015-06-01, 12:52
Location: US Southeast

Re: Question about specific regex for email address validation

Unread post by BenFenner » 2023-02-02, 15:57

Seems the issue in the OP was figured out in a thread a day prior: viewtopic.php?f=70&t=29424

vannilla
Moon Magic practitioner
Posts: 2181
Joined: 2018-05-05, 13:29

Re: Question about specific regex for email address validation

Unread post by vannilla » 2023-02-02, 16:57

BenFenner wrote:
2023-02-02, 14:47
This has me curious, in the strictest sense should the 3rd hyphen in the RegEx partial below be escaped since it does not describe a range but instead describes a literal hyphen character?
On the matter of the hyphen: when used inside square brackets, a hyphen between two characters makes an interval, unless it is the first or last element, where it is treated as a literal character to match. Therefore:

[a-z] matches all (UTF-8) symbols between lowercase letter A and lowercase letter Z. By a coincidence caused by the ASCII inheritance, this means the whole lowercase Latin alphabet. In the form [a-z]+ it is often used to match "a non-empty English word", but that works only as long as the encoding (e.g. UTF-8) has contiguous Latin letters, which is why "classes" like \w exist, to avoid this encoding dependency.

[a-] matches either lowercase letter A or the hyphen.

[a--] asks for all (UTF-8) symbols between lowercase letter A and the hyphen, which is an empty set because in UTF-8 the hyphen comes before lowercase letter A. Depending on the engine, this regex either matches nothing or is rejected outright as an out-of-order range (JavaScript throws a syntax error).

[--a] matches all (UTF-8) symbols between the hyphen and lowercase letter A. Similar to the first example, by chance it matches a bunch of symbols, the digits, and all the uppercase Latin alphabet.

[a-z-] matches either all the symbols as in example one, or the hyphen.

[-a-z] is the same as the previous example.
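
The cases above can be checked quickly in JavaScript. This is only a sketch, assuming a standard ECMAScript engine; the helper name compiles is made up for illustration. The out-of-order range is built from a string so that the error can be caught instead of breaking the whole script.

```javascript
// Returns true if the pattern source compiles, false on a SyntaxError.
function compiles(source) {
  try {
    new RegExp(source);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

const results = {
  rangeMatchesLetter: /^[a-z]$/.test('m'),     // inside the a..z interval
  rangeRejectsHyphen: !/^[a-z]$/.test('-'),    // hyphen is only the range sign here
  trailingHyphen: /^[a-]$/.test('-'),          // last element, so a literal hyphen
  leadingHyphen: /^[-a-z]$/.test('-'),         // first element, so a literal hyphen
  bothEnds: /^[a-z-]$/.test('-'),              // range plus a literal hyphen
  hyphenToAUpper: /^[--a]$/.test('Q'),         // uppercase letters fall in the interval
  hyphenToADigit: /^[--a]$/.test('5'),         // so do the digits
  hyphenToARejectsB: !/^[--a]$/.test('b'),     // 'b' comes after 'a'
  outOfOrderCompiles: compiles('[a--]'),       // JavaScript refuses this range
};

console.log(results);
```

Every entry comes out true except outOfOrderCompiles, which is false because JavaScript throws on the [a--] range.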

As someone once said: "You have a problem. You try to solve it using a regex. Now you have two problems."

Moonchild
Pale Moon guru
Posts: 35404
Joined: 2011-08-28, 17:27
Location: Motala, SE

Re: Question about specific regex for email address validation

Unread post by Moonchild » 2023-02-02, 17:39

BenFenner wrote:
2023-02-02, 14:47
in the strictest sense should the 3rd hyphen in the RegEx partial below be escaped since it does not describe a range but instead describes a literal hyphen character?
In the strictest sense, a hyphen that is intended to be literal should always be escaped (and you can never go wrong by sticking to that rule). The concession against that was made because people would stubbornly just use it as an unescaped literal, then be confused when the regex broke and complain about it. This is why we now have this complex checking of whether a range is valid or not, treating the hyphen as a literal sometimes and as a range delimiter other times, and why /u behaves differently (actually properly) compared to non-/u.
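
As an illustration of the "always escape" rule, here is the validator from the first post with every literal hyphen escaped. This is a sketch rather than a recommended e-mail check: the rest of the pattern is left as it was, and the u flag is added here only to show that the escaped form also passes the strict parsing.

```javascript
// Same structure as the original validateEmail, but with \- wherever a
// literal hyphen is meant. The pattern no longer depends on the leniency
// that non-Unicode mode applies to invalid ranges.
function validateEmail(sEmail) {
  var filter = /^([\w\-.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w\-]+\.)+))([a-zA-Z]{2,20}|[0-9]{1,3})(\]?)$/u;
  return filter.test(sEmail);
}

console.log(validateEmail('name.lastname@provider.com'));      // true
console.log(validateEmail('first-last@my-host.example.org'));  // true
console.log(validateEmail('no-at-sign.example.com'));          // false
```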
"Sometimes, the best way to get what you want is to be a good person." -- Louis Rossmann
"Seek wisdom, not knowledge. Knowledge is of the past; wisdom is of the future." -- Native American proverb
"Linux makes everything difficult." -- Lyceus Anubite
