But How Do It Know? Dissecting Regular Expressions

To truly understand regular expressions, you have to actually use them. A lot of people who use regular expressions (regex) don’t care to understand how they work; they just Google one that looks like it might work for them, throw it in there, and hope for the best.

I don’t know how it works, it just does…

We’ll go through a few commonly needed regular expression examples for date formats, email addresses, and phone numbers. I’ll cut them up, break them down, and go through their pieces to show how they work to perform the task they were designed for.

Date Format Regex

There are a lot of fun ones in this category. Sometimes you need to enforce a certain date format, usually in an HTML text input or when parsing a date from an external source. Knowing that the format is valid is valuable before attempting to cast to a Date() object.

/^([01]*\d)[./\-]([0-3]*\d)[./\-](\d{2}|\d{4})$/

Synopsis

  • a one or two-digit month (if it’s two digits, it requires that the first digit is either a 0 or a 1),
  • a one or two-digit day (if it’s two digits, it requires that the first digit is a 0, 1, 2, or 3),
  • either a two-digit or a four-digit year,
  • and separators that are either periods, forward slashes, or hyphens.
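As a quick sanity check before the breakdown, here is a minimal sketch of the pattern in use. The helper name parseDateParts and the sample strings are my own, purely for illustration.

```javascript
// The date-format pattern from above, unchanged.
const DATE_FORMAT = /^([01]*\d)[./\-]([0-3]*\d)[./\-](\d{2}|\d{4})$/;

// Hypothetical helper: returns the captured segments, or null if the format is wrong.
function parseDateParts(input) {
  const match = DATE_FORMAT.exec(input);
  if (match === null) return null;
  const [, month, day, year] = match; // index 0 is the whole match, so skip it
  return { month, day, year };
}

console.log(parseDateParts("12/25/2020")); // { month: '12', day: '25', year: '2020' }
console.log(parseDateParts("1.5.99"));     // { month: '1', day: '5', year: '99' }
console.log(parseDateParts("2020-12-25")); // null (year-first order doesn't fit)
console.log(parseDateParts("12/25/202"));  // null (three-digit year)
```

Note that the captured segments come back as strings, so they still need converting before any arithmetic.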

Break it Down

  1. /: This denotes and opens a regex literal.
  2. ^: This caret character denotes the beginning of the string to be checked; this posits that the regular expression must match starting from the very first character of the string to be tested.
  3. (: This opens the first capture group. This capture group represents the month segment of the date string.
    1. [: This opens a new character set. This character set represents a leading zero or a one in the tens place of a two-digit month segment.
      1. 01: These are treated as literal characters; when inside this character set, they mean: “either a zero or a one.”
    2. ]: This closes the character set (the first digit of a two-digit month segment).
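    3. *: This quantifies the previous character set; it means “zero or more characters from the set,” so the leading digit may be absent, appear once, or repeat (a month of 007 passes). To allow at most one leading digit, as the synopsis describes, ? (zero or one) is the tighter quantifier.
    4. \d: This metacharacter means “any digit” (note the lack of a quantifier here; this means exactly one digit is required for a match).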
  4. ): This closes the first capture group (the month segment).
  5. [: This opens a new character set. This character set represents the separator character between the month segment and day segment of the date string.
    1. .: This period is treated as a literal period. Since it is within a character set, there is no need to escape it like \., as you would if it were outside a character set.
    2. /: This forward slash represents a literal forward slash. As was the case with the preceding period, it does not need to be escaped inside of a character set.
    3. \-: This escape sequence represents a literal hyphen. Depending on your regex engine and where the hyphen sits in the set, it may or may not need to be escaped. Because hyphens can denote ranges inside character sets, I personally always escape the ones I want to be “literal hyphens.” Escaping them when the engine doesn’t require it seems to do no damage to the function of the regex; it just helps my human brain read it better.
  6. ]: This closes the character set (the separator between the month and day segments).
  7. (: This opens the second capture group. This capture group represents the day segment of the date string.
    1. [: This opens a new character set. This character set represents the first digit of a two-digit day segment.
      1. 0-3: The hyphen here is not literal and makes this character set a range between zero and three.
    2. ]: This closes the character set (the first digit of a two-digit day segment).
    3. *: This quantifies the previous character set; it means “zero or more characters from zero through three,” so the leading digit may be absent, appear once, or repeat. Swapping in ? (zero or one) would limit it to at most one leading digit.
    4. \d: This metacharacter means “any digit” (note the lack of a quantifier here; this means exactly one digit is required for a match).
  8. ): This closes the second capture group (the day segment).
  9. [: This opens a new character set. This character set represents the separator character between the day segment and year segment of the date string.
    1. .: This period is treated as a literal period (does not need to be escaped when used within a character set).
    2. /: This forward slash represents a literal forward slash (does not need to be escaped when used within a character set).
    3. \-: This escape sequence represents a literal hyphen (does not need to be escaped when used within a character set, though I do for readability’s sake).
  10. ]: This closes the character set (the separator between the day and year segments).
  11. (: This opens the third capture group. This capture group represents the year segment of the date string.
    1. \d{2}: This sequence is for a two-digit year.
      1. \d: This metacharacter means “any digit.”
      2. {2}: This quantifies the previous metacharacter; this makes it mean: “exactly two digit-characters.”
    2. |: This pipe character means “or” here; in context it reads as: “two digits or four digits.”
    3. \d{4}: This sequence is for a four-digit year.
      1. \d: This metacharacter means “any digit.”
      2. {4}: This quantifies the previous metacharacter; this makes it mean: “exactly four digit-characters.”
  12. ): This closes the third capture group (the year segment).
  13. $: This character denotes the end of the string to be checked; this posits that the regular expression must match ending at the very last character of the string to be tested.
  14. /: This denotes and closes the regex literal.
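The breakdown shows that * permits any number of leading digits, and that the pattern checks shape rather than calendar validity. The sketch below compares the original against a variant that uses ? instead (my substitution, not the pattern above), then adds a round trip through Date. The name isRealDate and the four-digit-year restriction are assumptions made for the example.

```javascript
const original = /^([01]*\d)[./\-]([0-3]*\d)[./\-](\d{2}|\d{4})$/;
// Assumed variant: `?` (zero or one) in place of `*` (zero or more).
const stricter = /^([01]?\d)[./\-]([0-3]?\d)[./\-](\d{2}|\d{4})$/;

console.log(original.test("0001/25/2020")); // true
console.log(stricter.test("0001/25/2020")); // false
console.log(stricter.test("19/39/2020"));   // true: right shape, impossible date

// Hypothetical helper: accepts only four-digit years and real calendar dates.
function isRealDate(input) {
  const match = stricter.exec(input);
  if (match === null || match[3].length !== 4) return false;
  const [month, day, year] = match.slice(1).map(Number);
  const date = new Date(year, month - 1, day); // Date months are zero-based
  // Date rolls overflow forward (Feb 30 becomes Mar 1), so compare the parts back.
  return date.getFullYear() === year &&
         date.getMonth() === month - 1 &&
         date.getDate() === day;
}

console.log(isRealDate("02/29/2020")); // true (leap year)
console.log(isRealDate("02/30/2020")); // false
console.log(isRealDate("19/39/2020")); // false
```

The regex answers “does this look like a date?”; the Date round trip answers “is it one?”. Both checks are cheap, so there is little reason to skip either.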

Email Address Regex

This one is the bane of a lot of developers’ existences. Emails are notoriously hard to check for, and arguably there are billions of permutations to account for. One widely circulated email validation regex, often presented as the official RFC 5322 standard (the RFC itself defines a grammar, not a regex), is:

/^(([^<>()\[\]\\.,;:\s@"]+(\.[^<>()\[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/

A simpler but still extremely inclusive (99.99%) regex is:

/^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$/i

These regexes are behemoths, and they make me sad. Theoretically, they should account for all (or almost all) eligible email addresses that follow the RFC 5322 spec. We rarely deal with email addresses whose domain is an IP address, or email addresses that contain a boatload of special characters (in fact, we usually want to exclude “tricky” looking emails); however, we do want to allow people to use symbols like + and allow funky domain endings like .io, .co.uk, .guru, or .museum. A much more reasonable (and legible) but still fairly inclusive email regex would be something like:

/^\w+([\-+.]\w+)*@\w+([\-.]\w+)*\.\w+([\-.]\w+)*$/

Synopsis

  • the address segment begins with a word-character,
  • the address segment might contain hyphens, periods, or pluses,
  • there is exactly one “at” symbol (@),
  • the domain segment begins with a word-character,
  • the domain segment must have at least one period,
  • the domain segment must contain at least one word-character in the TLD,
  • the domain segment may contain more than one period in the TLD.
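To see the synopsis in action, here is a small sketch that runs the pattern over a handful of addresses. The sample addresses are made up for illustration.

```javascript
const EMAIL = /^\w+([\-+.]\w+)*@\w+([\-.]\w+)*\.\w+([\-.]\w+)*$/;

const samples = [
  "jane@example.com",         // plain address
  "jane.doe+news@example.io", // period and plus in the address segment
  "jane@mail.example.co.uk",  // several periods in the domain
  "jane@localhost",           // no period in the domain
  "jane..doe@example.com",    // two periods in a row
  "jane@@example.com",        // two at-symbols
  "o'brien@example.com",      // apostrophe is not a word-character
];

// Pair each sample with its result so the outcome is easy to scan.
const results = samples.map((sample) => [sample, EMAIL.test(sample)]);
for (const [sample, ok] of results) {
  console.log(ok ? "pass" : "fail", sample);
}
```

The first three pass and the last four fail. The apostrophe case is the trade-off of a legible pattern: that address is valid under the spec, but this regex turns it away.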

Break it Down

  1. /: This denotes and starts a regex literal.
  2. ^: This caret denotes the start of the test string; it posits that the pattern must match starting from the beginning of the string to be tested.
  3. \w: This metacharacter represents any word-character (a-z case-insensitive, 0-9, and underscore).
  4. +: This quantifier modifies the preceding metacharacter and makes it mean: “one or more word-characters.”
  5. (: This starts a capture group, we’ll call this the “additional address segment.”
    1. [: This starts a new character set.
      1. \-: This is a literal hyphen (note that it doesn’t need to be escaped but is for legibility).
      2. +: This is a literal plus sign.
      3. .: This is a literal period.
    2. ]: This ends the character set.
    3. \w: This metacharacter represents any word-character.
    4. +: This quantifier modifies the previous metacharacter to mean “one or more word-characters.”
  6. ): This ends the “additional address segment” capture group.
  7. *: This quantifier modifies the preceding capture group; it makes it mean “match this capture group zero or more times.” This effectively loops the capture group until it hits the next symbol in the pattern.
  8. @: This matches a literal at-symbol.
  9. \w: This metacharacter represents any word-character.
  10. +: This quantifier modifies the preceding metacharacter and makes it mean: “one or more word-characters.”
  11. (: This starts a capture group; we’ll call this the “additional pre-TLD domain segment” (it should look similar to the “additional address segment” capture group).
    1. [: This starts a new character set.
      1. \-: This is a literal hyphen (note that it doesn’t need to be escaped but is for legibility).
      2. .: This is a literal period.
    2. ]: This ends the character set.
    3. \w: This metacharacter represents any word-character.
    4. +: This quantifier modifies the previous metacharacter to mean “one or more word-characters.”
  12. ): This ends the “additional pre-TLD domain segment” capture group.
  13. *: This quantifier modifies the preceding capture group; it makes it mean “match this capture group zero or more times.” This effectively loops the capture group until it hits the next symbol in the pattern.
  14. \.: This escaped period means a literal period; this ensures that the domain contains at least one period.
  15. \w: This metacharacter represents any word-character.
  16. +: This quantifier modifies the preceding metacharacter and makes it mean: “one or more word-characters.” This ensures there is at least one character in the TLD.
  17. (: This starts a capture group, we’ll call this the “additional post-dot domain segment” (it should look similar to the “additional address segment” capture group and is in fact the same as the “additional pre-TLD domain segment”).
    1. [: This starts a new character set.
      1. \-: This is a literal hyphen (note that it doesn’t need to be escaped but is for legibility).
      2. .: This is a literal period.
    2. ]: This ends the character set.
    3. \w: This metacharacter represents any word-character.
    4. +: This quantifier modifies the previous metacharacter to mean “one or more word-characters.”
  18. ): This ends the “additional post-dot domain segment” capture group.
  19. *: This quantifier modifies the preceding capture group; it makes it mean “match this capture group zero or more times.” This effectively loops the capture group until it hits the next symbol in the pattern.
  20. $: This denotes the end of the string to be checked; it posits that the pattern must match all the way to the end of the string to be tested.
  21. /: This denotes and ends the regex literal.
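One consequence of putting * on a capture group is worth seeing for yourself: the group may match several times, but it only remembers its last pass. A minimal sketch, using a made-up address:

```javascript
const EMAIL = /^\w+([\-+.]\w+)*@\w+([\-.]\w+)*\.\w+([\-.]\w+)*$/;

const match = EMAIL.exec("first.middle.last+tag@mail.example.co.uk");

// Group 1 matched ".middle", ".last", and "+tag"; only the last pass is kept.
console.log(match[1]); // "+tag"

// Group 2 first swallows ".uk" too, then gives it back so \.\w+ has something to match.
console.log(match[2]); // ".co"

// Group 3 matched zero times, so it holds nothing.
console.log(match[3]); // undefined
```

So these groups are fine for validation, but they are the wrong tool for pulling every dotted segment out of an address; splitting on the at-symbol and periods is simpler for that.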

You could go even simpler, but you’d risk letting a lot of bad email addresses get through:

/^\S+@\S+$/
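The quantifiers matter a great deal in a pattern this short. As a sketch: with + on each side, almost anything around an at-symbol gets through, while a bare \S matches exactly one character. The variable names and sample strings are my own.

```javascript
const loose = /^\S+@\S+$/;       // one or more non-whitespace characters on each side
const noQuantifiers = /^\S@\S$/; // exactly one non-whitespace character on each side

console.log(loose.test("jane@example.com"));         // true
console.log(loose.test("@@@"));                      // true: clearly not an address
console.log(loose.test("jane doe@example.com"));     // false: contains a space
console.log(noQuantifiers.test("jane@example.com")); // false
console.log(noQuantifiers.test("a@b"));              // true
```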

Phone Number Regex

Sometimes when your form requires the input of a phone number, you have to check that the number looks like a valid format. This can be a daunting task, because there are a lot of ways to write even a 10-digit, US-style phone number. Without getting too crazy, we’ll start by saying this regex doesn’t support extensions and is meant to match a few common patterns of a 10-digit US-style phone number.

/^\(*(\d{3})\)*[\-. ]*(\d{3})[\-. ]*(\d{4})$/

Synopsis

  • captures a three-digit area code,
  • captures a three-digit prefix,
  • captures a four-digit line number,
  • allows the area code to be wrapped in parentheses,
  • allows for multiple separators, including spaces, hyphens, or periods.
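Because the three segments are captured, the same pattern can validate and normalize in one step. Here is a minimal sketch; the helper name normalizePhone, the output format, and the sample numbers are assumptions made for the example.

```javascript
const PHONE = /^\(*(\d{3})\)*[\-. ]*(\d{3})[\-. ]*(\d{4})$/;

// Hypothetical helper: rewrites any accepted format as 555-867-5309 style.
function normalizePhone(input) {
  const match = PHONE.exec(input);
  if (match === null) return null;
  const [, areaCode, prefix, lineNumber] = match;
  return `${areaCode}-${prefix}-${lineNumber}`;
}

console.log(normalizePhone("(555) 867-5309"));  // "555-867-5309"
console.log(normalizePhone("555.867.5309"));    // "555-867-5309"
console.log(normalizePhone("5558675309"));      // "555-867-5309"
console.log(normalizePhone("867-5309"));        // null: no area code
console.log(normalizePhone("555-867-5309 x2")); // null: extensions aren't supported
```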

Break it Down

  1. /: This denotes and starts a regex literal.
  2. ^: This caret denotes the start of the test string; it posits that the pattern must match starting from the beginning of the string to be tested.
  3. \(: This escape sequence represents a literal left paren character.
  4. *: This quantifier modifies the previous character, making it mean: “zero or more left parens.” That covers no paren and one paren, but also several in a row; ? (zero or one) would allow at most one.
  5. (: This begins the first capture group. This group represents the area code segment.
    1. \d: This metacharacter is any digit-character; zero through nine.
    2. {3}: This quantifier modifies the previous metacharacter, making it mean: “exactly three digit-characters.”
  6. ): This ends the first capture group (the area code segment).
  7. \): This escape sequence represents a literal right paren character.
  8. *: This quantifier modifies the previous character, making it mean: “zero or more right parens.” That covers no paren and one paren, but also several in a row; ? (zero or one) would allow at most one.
  9. [: This starts a new character set. This set represents the separator between the area code segment and the prefix segment.
    1. \-: This is a literal hyphen (note that the escape isn’t necessary but included for readability).
    2. .: This is a literal period.
    3.  : This is a literal space character (note that it may be difficult to see, but this inclusion of a space will match an actual space character and is not included for code legibility purposes).
  10. ]: This ends the character set.
  11. *: This quantifier modifies the previous character set; it makes it mean: “zero or more hyphens, periods, or spaces, in any combination.” This is what allows multiple separators in a row, or none at all.
  12. (: This begins the second capture group. This group represents the prefix segment.
    1. \d: This metacharacter is any digit-character; zero through nine.
    2. {3}: This quantifier modifies the previous metacharacter, making it mean: “exactly three digit-characters.”
  13. ): This ends the second capture group (the prefix segment).
  14. [: This starts a new character set. This set represents the separator between the prefix segment and the line number segment (and is the same as the character set between the area code segment and prefix segment).
    1. \-: This is a literal hyphen.
    2. .: This is a literal period.
    3.  : This is a literal space character.
  15. ]: This ends the character set.
  16. *: This quantifier modifies the previous character set; it makes it mean: “zero or more hyphens, periods, or spaces, in any combination.” This is what allows multiple separators in a row, or none at all.
  17. (: This begins the third capture group. This group represents the line number segment.
    1. \d: This metacharacter is any digit-character; zero through nine.
    2. {4}: This quantifier modifies the previous metacharacter, making it mean: “exactly four digit-characters.”
  18. ): This ends the third capture group (the line number segment).
  19. $: This denotes the end of the string to be checked; it posits that the pattern must match through to the end of the string to be tested.
  20. /: This denotes and ends the regex literal.
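The breakdown also hints at how forgiving this pattern is. The sketch below tries a few odd inputs against the original and against a variant that swaps every * for ? (my substitution, not the pattern above). Neither pattern checks that the parens are balanced.

```javascript
const original = /^\(*(\d{3})\)*[\-. ]*(\d{3})[\-. ]*(\d{4})$/;
// Assumed variant: `?` (zero or one) everywhere the original uses `*`.
const stricter = /^\(?(\d{3})\)?[\-. ]?(\d{3})[\-. ]?(\d{4})$/;

const oddInputs = [
  "((555))867-5309",  // doubled parens
  "555 - 867 - 5309", // three separator characters in a row
  "(555 867-5309",    // unbalanced paren
  "(555) 867-5309",   // an ordinary number, for comparison
];

// Record both results per input so the difference is easy to compare.
const comparison = oddInputs.map((input) => ({
  input,
  original: original.test(input),
  stricter: stricter.test(input),
}));
console.table(comparison);
```

The original accepts all four. The variant rejects the first two but still accepts the unbalanced paren; ruling that out takes an alternation between a parenthesized and a bare area code, which costs some legibility.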

TL;DR

Regular expressions can solve common problems. They come in multiple flavors, ranging from simple to extreme. Your best bet for understanding how they work is to break them down into their smallest chunks. Separate out the character sets and the capture groups. “Decode” the metacharacters. Apply the quantifiers. Pay attention to escape sequences and “space” characters. Use a tool like an online regex tester to help understand how they work (and if they work).
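The same advice works in reverse: you can write a regex as named chunks and assemble them, so the breakdown lives in the code. A minimal sketch using the date pattern from earlier; the variable names are my own.

```javascript
// Each chunk gets a name that says what it is for.
const month = /([01]*\d)/;
const day = /([0-3]*\d)/;
const year = /(\d{2}|\d{4})/;
const separator = /[./\-]/;

// .source is a literal's pattern text without the surrounding slashes,
// so the chunks can be joined as strings and compiled once.
const datePattern = new RegExp(
  "^" + [month.source, day.source, year.source].join(separator.source) + "$"
);

console.log(datePattern.test("12/25/2020")); // true
console.log(datePattern.test("12-25"));      // false: no year segment
```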