The Magic of RegEx: An Intro to Regular Expressions

Despite its bad rap and its esoteric nature, the black-magic voodoo science of regular expressions intrigues me. Few things in web development grip me like regex does. And somehow, like magic, like the magic of regex itself, I became “the regex guy.” Weekly, sometimes daily, I get asked to help write a regex. I, of course, help where I can. I’m no expert at regex, but I have picked up a few tricks and regex recipes along the way. It has been said to me on many occasions:

You have a problem? Solve it with a regex… Now you have 99 problems.

What is regex?

Regular expressions, regex for short, are a method of matching strings against a pattern. A regular expression is a pattern, defined in a specific syntax, that a regex engine uses to search the subject string and locate or replace the matched segments. There are different syntaxes and multiple engines; not all of them are the same and not all of them are compatible. The general principles and methods typically apply across the board, though.

Why use regex?

There are things you shouldn’t use a regex for. One is a simple string comparison, where you know ahead of time that there is very little variation in the possible matches. If you are checking whether a variable is equal to the word “potato,” you probably don’t need regex.

if (stringVariable === 'potato') {...}

If you have a variable that contains several words that could possibly contain the word “potato,” and you just need to see if “potato” is there, then you could use .indexOf().

if (stringVariable.indexOf('potato') > -1) {...}

However, say the variable actually holds a full sentence, and say you are looking for “potato” or “tomato” with any form of capitalization. And what if you want to return all instances of either word? This is a perfect use case for a regular expression.

var matches = stringVariable.match(/(potato|tomato)/mgi);

This regex returns a list of any and all matches of either the string “potato” or “tomato,” irrespective of case.
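To make that result concrete, here is a minimal sketch using a made-up sentence (the sample text and variable names are assumptions for illustration). The m flag and the parentheses are left out because neither changes the result for this pattern. Note that with the g flag, .match() returns null rather than an empty list when nothing matches.

```javascript
// Hypothetical sample sentence, used only for illustration.
var sentence = 'You say Potato, I say TOMATO; either way the potato wins.';

// g = find every match, i = ignore case.
var found = sentence.match(/potato|tomato/gi);
console.log(found); // [ 'Potato', 'TOMATO', 'potato' ]

// When nothing matches, .match() returns null, so guard before using the result.
var none = 'carrots only'.match(/potato|tomato/gi);
console.log(none); // null
```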

When to use regex?

Regex can be used in JavaScript and in server-side languages like C#, PHP, or Java. Most code editors and IDEs have a regular expression find feature. Some design and layout software can use regular expressions to find and replace text in text box flows.

How to use regex

For now, I’m going to focus on regex usage as it applies to JavaScript. JavaScript has multiple methods for working with regex; the most common are probably .match() and .replace(). There are also two common ways to create a regex: a regex literal, which appears between two forward slashes with optional flags (/[a-z0-9]/i), and new RegExp(). Newing up a RegExp lets you build a regular expression from a string, which means you can use variables when building the pattern.
Example

var stringVariable = 'you say potato, i say potato';
stringVariable.match(/potato/gi);
stringVariable.match(new RegExp('potato', 'gi'));
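Since the main reason to reach for new RegExp() is building a pattern from a variable, here is a rough sketch of that. The helper name escapeRegExp and the sample values are assumptions for illustration, not part of any library. The escaping step matters because characters like . or ? in the input would otherwise be read as regex syntax.

```javascript
// Hypothetical helper: escape characters that have special meaning in a regex
// so the input text is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

var stringVariable = 'you say potato, i say Potato';
var word = 'potato'; // could just as easily come from user input

var pattern = new RegExp(escapeRegExp(word), 'gi');
var hits = stringVariable.match(pattern);
console.log(hits); // [ 'potato', 'Potato' ]

// Without escaping, the dot matches any character; with escaping, only a literal dot.
var unescaped = new RegExp('a.c').test('abc');
var escaped = new RegExp(escapeRegExp('a.c')).test('abc');
console.log(unescaped, escaped); // true false
```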

The anatomy of a regex

For simplicity’s sake, we’re going to focus on regular expression literals. Regex literals are not strings; instead, they are enclosed between a matching pair of forward slashes (known as “delimiters”). Regex literals can also have optional flags, placed after the closing delimiter. Regexes are built from different types of parts that can be arranged in endless combinations. Some of the important ones are:

  • Capture groups: denoted by the use of parentheses (abc). Capture groups grab parts of a regular expression to be used later, usually for replacement purposes. There are different kinds of capture groups, including some that don’t actually capture. These will be discussed in another installment.
  • Character sets: denoted by the use of square brackets [a-z]. Character sets (sometimes called “character classes”) are used instead of a character literal to match any one character from a range or a list of characters. Character sets can use the caret as a “not these characters” modifier: [^abc]
  • Metacharacters: characters or sequences with a special meaning, most of them written as a character preceded by a backslash. There are a lot of metacharacters and I’m not going to cover them all. Here are a few commonly used ones:
    • .: matches any character except a line break (by default)
    • \d: matches any digit character
    • \w: matches a word character (alphanumeric and underscores)
    • \s: matches a whitespace character
    • \t: matches a tab character
    • \n: matches a newline character
  • Quantifiers: there are several quantifiers. Each quantifier modifies the acceptable number of previous characters, metacharacters, groups, or sets.
    • *: means “match the preceding character zero or more times”
    • +: means “match the preceding character one or more times”
    • {n}: means “match the preceding character exactly n times”
    • {n,}: means “match the preceding character n or more times”
    • {n,m}: means “match the preceding character between n and m times inclusive”
    • ?: means “match the preceding character zero or one time.” When it follows another quantifier, it instead means “match the fewest possible times that fulfills the previous quantifier” (this would look something like this: [a-z]+?)
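
The parts listed above can each be tried in isolation. The following is a small sketch, not an exhaustive tour; the sample strings and variable names are made up for illustration.

```javascript
// Capture groups: parentheses capture text that a replacement can reuse as $1, $2.
var swapped = 'Doe, Jane'.replace(/(\w+), (\w+)/, '$2 $1');
console.log(swapped); // Jane Doe

// Character sets: match any one vowel; a leading caret negates the set.
var vowels = 'regex'.match(/[aeiou]/g);
var nonVowels = 'regex'.match(/[^aeiou]/g);
console.log(vowels, nonVowels); // [ 'e', 'e' ] [ 'r', 'g', 'x' ]

// Metacharacters: \d matches a digit, \s matches a whitespace character.
var digits = 'room 101'.match(/\d/g);
var pieces = 'a b\tc'.split(/\s/);
console.log(digits, pieces); // [ '1', '0', '1' ] [ 'a', 'b', 'c' ]

// Quantifiers: + grabs as much as it can, +? as little as it can.
var greedy = 'aaaa'.match(/a+/)[0];
var lazy = 'aaaa'.match(/a+?/)[0];
var pairs = 'aaaa'.match(/a{2}/g);
console.log(greedy, lazy, pairs); // aaaa a [ 'aa', 'aa' ]
```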

With knowledge of these parts and how to combine them, you will have most of regex’s power at your disposal.
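
As one example of combining the parts, the sketch below uses capture groups, the \d metacharacter, and {n} quantifiers together to rearrange dates. The input text and the date format are assumptions chosen for illustration.

```javascript
// Hypothetical input: year-month-day dates inside a sentence.
var text = 'Released 2019-03-07, patched 2019-11-21.';

// (\d{4}) captures the year, (\d{2}) the month, (\d{2}) the day.
// The g flag makes .replace() rewrite every date, not just the first.
var reformatted = text.replace(/(\d{4})-(\d{2})-(\d{2})/g, '$2/$3/$1');
console.log(reformatted); // Released 03/07/2019, patched 11/21/2019.
```

This pattern only checks the shape of the text (four digits, two digits, two digits); it does not verify that the result is a real calendar date.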

TL;DR

Regex is extremely powerful, but it can be “overused.” There are cases where it makes more sense to use something other than regex. There are a lot of intricacies to regex, but knowing the individual pieces can help bridge the gap between “a science” and “an arcane mystery.”

Bryce Taylor