Regex: what they are and what they are used for (including in SEO)

Regular expressions, commonly known as regexes, are a powerful and versatile language used in various fields of computer science and programming, and an elegant solution for searching and manipulating text strings. Despite their apparent complexity, they can give us a great deal of support in refining data analysis and monitoring, including for SEO. Let’s look at what regular expressions are, which regex operators are the most common and the most useful for SEO, and how we can use them to filter results in tools such as Google Analytics and Google Search Console.

What Regex or Regular Expressions are

Regex or regular expressions are pattern matching tools used to identify and manipulate strings of text.

Put another way, it is a string of symbols for filtering and identifying a search pattern, or more precisely a sequence of characters that enables a function to filter, compare, or identify strings of characters or code.

Regexes are supported by almost all modern programming languages, including Python, JavaScript, and PHP, and are built into many development and data analysis tools and word processing applications. Their syntax may seem cryptic at first but, once understood, offers a level of control and precision that few other technologies can match. Indeed, regexes are also essential for data validation, string substitution, and log file manipulation, making them an essential element for anyone working with large amounts of data or text.

Therefore, we should not think that they are something useful only for programmers: digital marketing professionals, data analysts and SEO specialists can also benefit enormously from their use, as can we “ordinary” users, as we shall see.

Regex meaning and simple explanation

Let’s start again with basic definitions.

Regex comes from the contraction of “REGular EXpression” (sometimes we also find references such as regexp or RE).

They are composed of a combination of normal characters and metacharacters, which together form a powerful language for describing complex patterns of text.

Normal characters represent themselves and are used to look for literal matches in text: for example, the regex “dog” will find all occurrences of the word “dog” in a text.

Metacharacters, on the other hand, have special meanings and are used to construct more complex search patterns. Some of the most common metacharacters include:

  • the dot (.), which represents any single character;
  • the asterisk (*), which represents zero or more occurrences of the preceding character;
  • the question mark (?), which represents zero or one occurrence of the previous character.

These metacharacters can be combined to create powerful and flexible search patterns: for example, the regex “c.ne” will find every four-character sequence beginning with “c” and ending with “ne,” such as “cane” and “cone.” Understanding metacharacters and their combinations is critical to using regular expressions effectively.
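
The three metacharacters listed above can be tried out in a few lines of Python. This is only a minimal sketch: the word lists are invented for illustration, and `re.fullmatch` is used so that the whole word has to fit the pattern.

```python
import re

# "." stands for exactly one character, so "c.ne" needs four characters in total.
words = ["cane", "cone", "cine", "crane", "one"]
dot_matches = [w for w in words if re.fullmatch(r"c.ne", w)]

# "*" allows zero or more repetitions of the preceding character ("o" here).
star_matches = [w for w in ["ll", "lol", "lool"] if re.fullmatch(r"lo*l", w)]

# "?" makes the preceding character ("u" here) optional: zero or one occurrence.
optional_matches = [w for w in ["color", "colour", "colouur"] if re.fullmatch(r"colou?r", w)]

print(dot_matches)       # ['cane', 'cone', 'cine']
print(star_matches)      # ['ll', 'lol', 'lool']
print(optional_matches)  # ['color', 'colour']
```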

Where regular expressions are found

Regular expressions are used in search engines, in the search and replace dialogs of word processors and text editors, in text processing utilities and in lexical analysis. Many programming languages also provide regex functionality, either built in or through libraries, because the use cases are so numerous.

The most immediate way to understand what regexes are is to think of the “find” or “find and replace” function in Word, which searches for an exact character string within the document and optionally replaces it with another string, or of the search function in Web browsers (the one usually activated by pressing CTRL+F).

Speaking of Microsoft Word, a regex-like syntax (called “wildcards” in the application’s terminology) is available for advanced search and replace: this can be especially useful for those who work with large documents and need to find and replace complex patterns of text. For example, it can be used to find all occurrences of a certain date format or to replace all words beginning with a certain letter.

More generally, regexes are supported by almost all modern programming languages, including Python, JavaScript, PHP, Java, and many others. This makes them an indispensable tool for programmers who need to manipulate strings of text or parse large amounts of data. For example, in Python the re module provides functions for searching, replacing, and manipulating strings using regex. In JavaScript, regexes are built directly into the language and can be used with methods such as match, replace, search, and split.

In addition to programming, regexes are widely used in development and data analysis tools: for example, advanced text editors support regexes for search and replace within files. This is especially useful for programmers and web developers who need to edit large amounts of code or content. Data analysis tools such as Google Analytics and Google Search Console also support regexes to filter and segment data, allowing for more detailed and specific analysis.

The examples of regular expressions and their application

Understanding regexes takes some practice: once mastered, however, they become an indispensable tool for anyone working with large amounts of text or data.

These patterns can also be combined and modified to create extremely specific and powerful expressions.

To better understand how regexes work, let us consider a practical example.

Suppose we have a list of email addresses and we want to find all addresses that belong to a specific domain, such as “example.com.” A regex for this task might be “\b[A-Za-z0-9.%+-]+@example\.com\b.”

Let us analyze this regex in detail: the sequence “\b” represents a word boundary, ensuring that the match begins and ends at the edge of a word. The sequence “[A-Za-z0-9.%+-]+” represents the email username, which can contain letters, numbers and some special characters. The “@” symbol is a literal character that separates the username from the domain. Finally, “example\.com” represents the specific domain, with the dot (.) preceded by a backslash (\) to indicate that it should be interpreted as a literal character and not as a wildcard.
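
As a quick sketch of the domain search described above, written with the dot escaped, the pattern can be run with Python’s `re` module. The sample sentence and the variable names are assumptions made up for this example.

```python
import re

# Hypothetical sample text containing addresses from two different domains.
text = "Write to anna@example.com or bob.smith@example.com, not to carl@other.org."

# The dot in the domain is escaped so that it only matches a literal ".".
domain_pattern = re.compile(r"\b[A-Za-z0-9.%+-]+@example\.com\b")
matches = domain_pattern.findall(text)
print(matches)  # ['anna@example.com', 'bob.smith@example.com']

# Without the backslash, "." is a wildcard and also accepts other characters.
loose_hit = re.search(r"@example.com", "x@exampleXcom") is not None
strict_hit = re.search(r"@example\.com", "x@exampleXcom") is not None
print(loose_hit, strict_hit)  # True False
```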

If done manually, searching for all occurrences of an email address in a text document hundreds of pages long would be an arduous, time-consuming and error-prone task. Regular expressions, on the other hand, make it possible to define a search pattern that can quickly and accurately identify all desired matches.

Here are a few more examples.

A regex can be used to find all words beginning with a capital letter, to validate an email address, or to extract numbers from a text. Even more simply, the regex \d{4} matches any sequence of four digits, while \b[A-Za-z]+\b matches any word consisting only of letters.

Suppose we have a web server log file and we want to extract all IP addresses that have accessed the server. An IPv4 address is composed of four groups of numbers separated by periods, such as 192.168.1.1. The regex to identify an IP address might be something like \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b. This regular expression looks for four groups of one to three digits, separated by literal periods. The shorthand \d represents a digit, {1,3} specifies that we are looking for one to three occurrences of the preceding element, and \b indicates a word boundary, ensuring that there are no adjacent alphanumeric characters that would invalidate the match. Using this regex, we can quickly extract all IP addresses from the log file, saving time and reducing the possibility of errors compared to a manual search.
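
The log example can be sketched in Python as follows. The log lines are invented for illustration and do not follow any particular server’s format; note also that this simple pattern checks only the shape of the address, not that each group is in the 0-255 range.

```python
import re

# Hypothetical log lines: the address is assumed to appear somewhere in each line.
log_lines = [
    '192.168.1.1 - - [10/Oct/2023:13:55:36] "GET /index.html" 200',
    '10.0.0.25 - - [10/Oct/2023:13:56:01] "GET /about" 404',
    "malformed line without address",
]

# Four groups of one to three digits separated by literal dots.
ip_pattern = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

ips = [ip for line in log_lines for ip in ip_pattern.findall(line)]
print(ips)  # ['192.168.1.1', '10.0.0.25']

# Limitation: an impossible address with the right shape is still accepted.
accepts_out_of_range = ip_pattern.fullmatch("999.999.999.999") is not None
```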

In short, as we said, the ability to master regular expressions can transform the way we approach complex text manipulation tasks, making them more manageable and less time-consuming.

The history of Regex

Regular expressions have a fascinating history going back several decades, long before they became a common tool in the world of programming and computer science. Regexes originated in the context of mathematical theory and formal linguistics, and their evolution is closely linked to developments in the field of theoretical computer science.

According to reconstructions, the first formulation and formalization dates back to the 1940s, but it was not until the following decade that regexes began to make their way, thanks to the work of the American mathematician Stephen Cole Kleene, who described precisely a regular language. Kleene was working on the theory of automata and formal grammars, and he developed regular expressions as a way to describe formal languages. His work was published in 1951 in a paper titled “Representation of Events in Nerve Nets and Finite Automata,” where regular expressions were used to represent sets of strings that could be recognized by finite automata, a fundamental concept in the theory of formal languages and automata.

In the 1960s and 1970s, regexes began to find practical applications in computer science. One of the first significant uses of pattern matching in a computing context was in the programming language SNOBOL, developed by David J. Farber, Ralph E. Griswold, and Ivan P. Polonsky at Bell Labs. SNOBOL included powerful string manipulation capabilities, and pattern matching, a close relative of regular expressions, was a key part of these capabilities.

The real turning point for regexes came with the introduction of the Unix operating system in the 1970s. Ken Thompson, one of the creators of Unix, had first used them in the QED editor in 1966 and then implemented regular expressions in ed, the Unix text editor. Later, regexes were integrated into many other Unix tools, such as grep, sed, and awk, making them a standard tool for text manipulation and pattern searching. This further solidified the importance of regular expressions in the world of computing.

It was then in the 1980s that, thanks to the Perl programming language, which natively allowed their use, regular expressions became commonplace, and since then different syntaxes for writing regexes, such as the POSIX standard and the Perl syntax, have become established.

Today regexes can be used in JavaScript, Python, and other programming languages, and can become a versatile and powerful SEO tool.

What Regex are for

Due to their characteristics, regular expressions serve to simplify the search for common data and information within a document or set of resources, specifying precisely the rules that are used to describe the set of possible strings that you want to match to discover search results that, at first glance, may seem to have little in common.

With this tool you can, for example, include complex search strings, partial matches and wildcard characters, perform case-sensitive searches or set other advanced instructions, almost as if it were a small inline programming language for text searches.

Other advantages of regexes are that they work in much the same way across programming languages and platforms and that they are very practical to use, although it must be kept in mind that their language is relatively small and not all string processing tasks can be performed this way.

Why use Regex

Regular expressions are incredibly powerful tools that offer a wide range of practical applications. Fundamentally, one of their greatest benefits lies in their ability to simplify and automate complex text manipulation tasks: regexes allow advanced searches and substitutions to be performed with an accuracy and speed that would be impossible to achieve manually. Moreover, their flexibility and power allow complex problems to be tackled with elegant and concise solutions, reducing the time and effort required to complete repetitive and laborious tasks.

Concretely, one of the main advantages of regular expressions is that they offer the ability to perform complex searches in an extremely efficient way. With just a few lines of code, large amounts of text can be found and manipulated, saving time and reducing the possibility of errors.

Regexes are also highly flexible: they can be used to search for specific patterns, replace text, validate input, and much more. For example, they can be used to verify that an email address is in the correct format, to extract phone numbers from a document, or to find all occurrences of a word in a text. In addition, regexes are supported by almost all modern programming languages and are integrated into many development and data analysis tools, making them easily accessible and usable in a wide range of contexts.

What are the main practical applications of regular expressions

The practical applications of regular expressions are virtually endless.

As a rule, regexes are useful for avoiding all kinds of repetitive work by automating certain functions to save time and effort.

Among the possible applications of regexes are redirecting a set of pages, rewriting URLs on the Apache web server, and validating the format of an e-mail address entered in a form. As an informal rule, an e-mail address begins with a sequence of alphanumeric and special characters, followed by the at sign (@), followed by other alphanumeric characters, followed by a dot, followed by two or more letters. Codifying this informal rule in a regex, we get this result:

^[_A-Za-z0-9-\+]+(\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\.[A-Za-z0-9]+)*(\.[A-Za-z]{2,})$

or

^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$
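
As a rough illustration, the first of the two expressions above can be wrapped in a small Python check. The function name `is_valid_email` and the sample addresses are assumptions for this sketch, and the pattern encodes only the informal rule described here, not the full e-mail address standard.

```python
import re

# The first expression from the text, compiled once for reuse.
EMAIL_PATTERN = re.compile(
    r"^[_A-Za-z0-9-\+]+(\.[_A-Za-z0-9-]+)*"          # local part, optionally dotted
    r"@[A-Za-z0-9-]+(\.[A-Za-z0-9]+)*"               # domain labels
    r"(\.[A-Za-z]{2,})$"                             # final extension, two or more letters
)


def is_valid_email(address: str) -> bool:
    """Return True when the address fits the informal pattern (hypothetical helper)."""
    return EMAIL_PATTERN.match(address) is not None


print(is_valid_email("mario.rossi@example.com"))  # True
print(is_valid_email("user@domain"))              # False: no extension after the domain
print(is_valid_email("no-at-sign.example.com"))   # False: the @ is missing
```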

What a regex looks like (image credit: evemilano)

In the field of programming, regexes are used to parse and manipulate strings of text, making it easier to handle complex data. For example, they can be used to extract specific information from log files, to validate user input in web applications, or to parse configuration files.

In the SEO world, as we will see, regexes are used to parse server log data, filter specific URLs, and improve crawling and indexing strategies. For example, a regex can be used to identify all the pages on a website that contain a specific URL parameter, allowing the site’s structure to be optimized and search engine rankings to be improved.

Even in the field of computer security, regexes are used to detect suspicious patterns in system logs, identify vulnerabilities in source codes, and more. Their versatility and power make them an indispensable tool for anyone working with large amounts of text or data, offering elegant and efficient solutions to complex problems.

What a regex looks like: format and syntax

What probably scares off those unfamiliar with this language is the form, which at first glance appears complex and cryptic; yet, once familiar with their basic components, regexes become extremely powerful and versatile tools.
Typically, a regular expression includes a combination of text (of which it will give exact matches in search results), along with several operators that act more like wildcards to search for a pattern match.

In practical terms, as Dan Taylor explains in an interesting article, a regex can include just a single wildcard character, a match for one or more characters or a match for zero or more characters, as well as optional characters, subexpressions nested in parentheses and “or” functions. By combining these operations, it is possible to construct a complex expression that yields far-reaching but very specific results.

More precisely, you can use regular expressions to filter out simple strings (but it is “like hunting for flies with a rocket launcher,” says Giovanni Sacheli), use single characters (the simplest filter, remembering that character is any letter, number, symbol and space), take advantage of the special characters and meta-characters (the characters that have a special meaning), and then again the anchors (which serve to indicate in which position of the text to perform the analysis) and modifiers (which expand or restrict the portion of text to be analyzed).

The common operators of regular expressions

Each element of regexes has a specific function.

Metacharacters are the basic components of regexes and include symbols such as . (period), * (asterisk), + (plus), ? (question mark), and \ (backslash).

The dot (.) corresponds to any single character, while the asterisk (*) indicates zero or more occurrences of the preceding character. The plus (+) represents one or more occurrences of the previous character, and the question mark (?) indicates zero or one occurrence of the previous character. The backslash (\) is used as an escape character for metacharacters, allowing them to be treated as literal characters.

Character classes are another fundamental element of regexes. A character class is a set of characters enclosed in square brackets, such as [abc], which corresponds to any character a, b, or c. Character classes can also include ranges, such as [a-z], which corresponds to any lowercase letter. Quantifiers are used to specify the number of occurrences of a character or group of characters. For example, \d{3} corresponds to exactly three digits, while \d{1,3} corresponds to one, two, or three digits.
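
Character classes and quantifiers are easy to try out in Python. The sample strings below are invented for this sketch.

```python
import re

# [abc] matches any single character that is a, b, or c.
class_hits = re.findall(r"[abc]", "cab ride")
print(class_hits)  # ['c', 'a', 'b']

# [a-z]+ matches runs of lowercase letters; uppercase letters and digits break the run.
range_hits = re.findall(r"[a-z]+", "Gate b12")
print(range_hits)  # ['ate', 'b']

numbers = "7 42 128 2048"

# \d{3} between word boundaries: exactly three digits.
exactly_three = re.findall(r"\b\d{3}\b", numbers)
print(exactly_three)  # ['128']

# \d{1,3} between word boundaries: one, two, or three digits.
one_to_three = re.findall(r"\b\d{1,3}\b", numbers)
print(one_to_three)  # ['7', '42', '128']
```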

Examples of common regex operators include:

. (full stop) is a wildcard character, so it can represent any single character.

* (asterisk) selects a match for zero or more items.

+ (plus sign) selects a match for one or more items.

? (question mark) makes the previous character an optional part of the expression

\d (digit) sets a match for any digit 0-9.

| (pipe, vertical line) means an alternation (OR).

^ (circumflex accent) is used to denote the beginning of a string.

$ (dollar symbol) is used to denote the end of a string.

( ) (round brackets) are used to nest a sub-expression.

\ (backslash), if inserted before an operator or a special character, acts as an “escape”: it makes the special character be treated as a literal character.
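
A short Python sketch can show several of these operators working together on URL paths. The paths and variable names are made up for the example.

```python
import re

urls = ["/blog/seo-tips", "/shop/shoes", "/blog/regex-guide.html", "/about"]

# ^ anchors the match at the start; ( | ) groups two alternatives.
section_pattern = re.compile(r"^/(blog|shop)/")
in_sections = [u for u in urls if section_pattern.search(u)]
print(in_sections)  # ['/blog/seo-tips', '/shop/shoes', '/blog/regex-guide.html']

# $ anchors the match at the end; \. escapes the dot so it is taken literally.
html_pattern = re.compile(r"\.html$")
html_pages = [u for u in urls if html_pattern.search(u)]
print(html_pages)  # ['/blog/regex-guide.html']

# ? makes the preceding "s" optional, so both schemes are accepted.
scheme_pattern = re.compile(r"https?://example\.com")
both_schemes = all(
    scheme_pattern.fullmatch(u) for u in ["http://example.com", "https://example.com"]
)

# Parentheses also capture the part of the text that matched the sub-expression.
captured = section_pattern.search("/shop/shoes").group(1)
print(captured)  # shop
```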

Some programming languages, such as Javascript, also allow the inclusion of flags after the regex pattern itself, which can further influence the result, such as:

g returns all matches instead of just the first one.

i returns results without distinction between upper and lower case.

m activates the multi-line mode.

s activates “dotAll” mode, in which the dot also matches line breaks.

u activates full Unicode support.

y matches only from the position stored in the regex’s lastIndex property (“sticky” mode).

The combined use of these operators and flags makes it possible to create a complex logical language and to obtain very specific results on large, unordered datasets. It should also be remembered that the engine that runs regular expressions changes with the programming language we are using, and this can affect the way the main operators (whose basic syntax remains largely the same) are applied.
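
To illustrate how the engine changes from one language to another, here is a sketch of rough Python counterparts to some of the JavaScript flags. In Python the flags are constants of the `re` module rather than letters after the pattern, and “all matches” is obtained by calling `findall` instead of setting a g flag; the sample text is invented.

```python
import re

text = "SEO tips\nseo tools\nSeo audit"

# Without flags, search() returns only the first case-sensitive match.
first = re.search(r"seo", text).group()
print(first)  # seo

# Counterpart of "g" + "i": findall() returns every match, IGNORECASE ignores case.
all_case_insensitive = re.findall(r"seo", text, flags=re.IGNORECASE)
print(all_case_insensitive)  # ['SEO', 'seo', 'Seo']

# Counterpart of "m": with MULTILINE, ^ matches at the start of every line.
single_line = re.findall(r"^seo", text, flags=re.IGNORECASE)
multi_line = re.findall(r"^seo", text, flags=re.IGNORECASE | re.MULTILINE)
print(single_line)  # ['SEO']
print(multi_line)   # ['SEO', 'seo', 'Seo']

# Counterpart of "s": with DOTALL, the dot also matches the newline character.
without_dotall = re.search(r"tips.seo", text) is not None
with_dotall = re.search(r"tips.seo", text, flags=re.DOTALL) is not None
print(without_dotall, with_dotall)  # False True
```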

How to write a regex

Building an effective regex requires an understanding of the patterns you want to identify and a knowledge of metacharacters and quantifiers. Let’s start with a simple example that allows us to understand how to combine various elements of regexes to create complex and specific patterns.

Suppose we want to find all occurrences of email addresses in a text. A typical email address has the form username@domain.com. The regex to identify an email address might be something like \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b.

Let us analyze this regex step by step:

  • \b indicates a word boundary, ensuring that the email address is a complete word.
  • [A-Za-z0-9._%+-]+ corresponds to the username part of the email address. [A-Za-z0-9._%+-] is a character class that includes uppercase and lowercase letters, numbers, and some special characters. The + indicates that there must be at least one character.
  • @ is the at sign, which must be present in every email address.
  • [A-Za-z0-9.-]+ corresponds to the domain part of the email address. Similar to the username part, this character class includes letters, numbers, periods and hyphens.
  • \. corresponds to a literal dot, which is needed to separate the domain from the extension.
  • [A-Za-z]{2,} corresponds to the domain extension, which must consist of at least two letters. (This class is often written as [A-Z|a-z], but inside square brackets the pipe is a literal character, not an OR operator, so that version would also accept “|”.)
  • \b indicates another word boundary, ensuring that the email address is a complete word.
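
The step-by-step breakdown above can be checked with a minimal Python sketch. The sample sentence is an assumption, and the final character class is written without a pipe because inside square brackets a | is a literal character rather than an OR operator.

```python
import re

# Username, at sign, domain, literal dot, extension of at least two letters.
email_pattern = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")

text = "Contact sales@example.com or support@mail.example.org. Invalid: user@localhost"
found = email_pattern.findall(text)
print(found)  # ['sales@example.com', 'support@mail.example.org']

# Inside a character class "|" is just a character, so this class accepts it.
pipe_is_literal = re.fullmatch(r"[A-Z|a-z]{2,}", "c|m") is not None
print(pipe_is_literal)  # True
```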

The tools for testing regexes

It’s not enough to know how to write regular expressions; you also need to test them for correctness. In fact, there are numerous online tools that facilitate the creation and testing of regular expressions, making the process of learning and using regexes easier and more intuitive.

These tools offer user-friendly interfaces that allow you to enter a regex and sample text, immediately displaying the match results.

One of the most popular is Regex101, which supports several regex syntaxes (including PCRE, Python, and JavaScript) and offers a detailed explanation of each part of the regex. This is especially useful for those who are new to regex and want to better understand how regular expressions work. Regex101 also includes advanced features such as the ability to save and share regexes, making it a valuable tool for collaboration and learning.

Another very useful tool is RegExr, which offers an intuitive interface and an extensive library of examples and teaching resources. RegExr allows testing regexes in real time, highlighting matches in the example text and providing hints and explanations to improve understanding of regular expressions. RegExr also supports several regex syntaxes and offers advanced features such as the ability to save and share regexes.

Other notable online tools include RegexPal, RegexPlanet, and Debuggex, each with its own unique features and advanced functionality. These tools make it easier and more accessible to use regexes, allowing you to test and refine regular expressions quickly and efficiently.

Regex and SEO, how to leverage regular expressions

The ability to identify specific patterns within large amounts of data makes regexes particularly useful for website analysis and optimization. When used correctly, regexes can help improve crawling efficiency, identify indexing problems, and optimize site structure, thus helping to improve search engine rankings.

Thus, regular expressions can help us even in areas far from textual analysis, and in recent years they are finding their way into SEO as well, particularly to perform 301 redirects within the .htaccess file or to take control of data and filter out the parts that are less useful for performance analysis.

For example, they can be used to limit traffic sources to a single source, specific medium, or geographic target area, or even to explore queries used by different user segments, identify queries common to specific content areas, searches that direct traffic to specific parts of the site, and more.

In all these cases, we can create fairly simple regex expressions to achieve a basic “include” or “exclude” filter, or we can write longer expressions that work similarly to programming code to achieve complex and very specific results. Familiarizing ourselves with this tool helps us verify that SEO efforts are achieving the goals, ambitions, and results we had in mind.

How to use Regex in Google Analytics

One of the most common uses of regex for SEO is in Google Analytics, where regular expressions can be used to set filters to display only the data we are interested in, so as to isolate and analyze only particular traffic segments. In this sense, the expression is used to exclude results, rather than to generate a set of inclusive search results.

This is particularly useful for identifying pages with high bounce rates or for monitoring the performance of specific marketing campaigns.

In particular, you can exclude from reports the traffic coming from inside the company or local network by setting a filter such as 192.168.*.* to remove the entire range from 192.168.0.0 to 192.168.255.255; with regex expressions we can filter any range of IP addresses with just one string.
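
Outside Google Analytics, the same idea can be sketched in Python. The regex below is one possible way to express the 192.168.0.0 to 192.168.255.255 range, and the visitor list is invented for the example; it is not the exact filter syntax of any Analytics version.

```python
import re

# One possible regex for the whole 192.168.x.x range (an assumption for this sketch).
internal_pattern = re.compile(r"^192\.168\.\d{1,3}\.\d{1,3}$")

visitors = ["192.168.0.1", "192.168.255.255", "192.169.0.1", "10.192.168.1", "8.8.8.8"]

# Keep only the addresses that do NOT belong to the internal range.
external = [ip for ip in visitors if not internal_pattern.match(ip)]
print(external)  # ['192.169.0.1', '10.192.168.1', '8.8.8.8']
```

The ^ anchor matters here: without it, an address such as 10.192.168.1 would be treated as internal simply because it contains “192.168.”.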

Still, another useful feature is that of the advanced segments, which can serve to split non-branded keywords so that you can specifically analyze the results of specific SEO activities – the Regex in this case must exclude all branded keywords, in possible variants – or to filter traffic from social networks.

In his article, Taylor suggests a specific example of more complex usage of regular expressions in Google Analytics, based on the two brand names regex247 and regex365: if we want to filter the results that match any URL containing these brand names, such as regex247.biz or www.regex365.org, we can set a basic alternative expression, which is

.*regex247.*|.*regex365.*

that allows you to exclude from Analytics data all the corresponding Urls, including subfolder paths and specific page Urls that appear on those domain names.
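
To see what this alternative expression does, it can be tested in Python before being pasted into a filter. The list of referrers is hypothetical, and partial matching is assumed (hence `search`).

```python
import re

# The alternative expression from the example above.
brand_filter = re.compile(r".*regex247.*|.*regex365.*")

# Hypothetical hostnames and URLs to filter.
referrers = [
    "regex247.biz",
    "www.regex365.org/blog/post",
    "example.com",
    "shop.regex247.biz/cart",
]

excluded = [r for r in referrers if brand_filter.search(r)]
kept = [r for r in referrers if not brand_filter.search(r)]
print(excluded)  # ['regex247.biz', 'www.regex365.org/blog/post', 'shop.regex247.biz/cart']
print(kept)      # ['example.com']

# With partial matching, a shorter grouped form selects the same items.
short_filter = re.compile(r"regex(247|365)")
same_result = excluded == [r for r in referrers if short_filter.search(r)]
```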

From a practical point of view, before creating a regex filter in Google Analytics you need to choose the type of report you want (e.g. Behavior > Site content > All pages or Acquisition > All traffic > Source/Medium); to view the advanced filter options we need to click Advanced next to the search box that appears below the chart, at the top of the data table.

Here we can include or exclude data based on a certain dimension or metric: in the dropdown list, after selecting the dimension, we choose “Matching RegExp” and then insert the expression into the text box.

To create an “or” alternative expression in Google Analytics, it is sufficient to include the pipe character (the vertical bar, |) between the appropriate segments of the expression. Google Analytics regular expressions do not support “and” instructions within a single regular expression, but we can still add another filter to achieve the same effect.

In particular, just enter another regex under the first one by clicking “Add a dimension or metric” and adding all the useful expressions, which will be processed as a single logical “and” instruction when filtering data.
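
The “and” logic obtained by stacking filters can be imitated in Python by requiring every pattern to match. The page paths and the two patterns are illustrative assumptions.

```python
import re

pages = ["/blog/seo-guide", "/blog/news", "/shop/seo-book", "/blog/seo-tools?ref=mail"]

# Each pattern plays the role of one filter row; a page must satisfy all of them.
filters = [re.compile(r"^/blog/"), re.compile(r"seo")]
matching = [p for p in pages if all(f.search(p) for f in filters)]
print(matching)  # ['/blog/seo-guide', '/blog/seo-tools?ref=mail']

# For comparison, the pipe gives "or": a page needs to satisfy only one alternative.
either = [p for p in pages if re.search(r"^/blog/|seo", p)]
print(either)  # all four pages
```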

Practical advice for Regex in Google Analytics

Taylor also clarifies some aspects about the use of regular expressions in Analytics, noting first that a function “badly written can easily filter most or all data, including a match with unlimited wildcard characters”.

The good news is that, in many SEO cases, the filter is applied to the data only during reporting and therefore, by modifying or removing the regular expression, it is easy to restore the full visibility of the data.

To avoid problems, however, it is best to try regular expressions in one of the online testing tools first, to see whether they return the expected result: in other words, to run the regex in a “sandbox” before letting it loose on the entire data set.

Let’s give some more practical examples.

Suppose we want to filter all URLs that contain a specific parameter, such as utm_source, to analyze traffic from different marketing campaigns. A regex such as .*utm_source=.* can be used to identify all URLs that contain this parameter, allowing us to isolate and analyze the traffic generated by the campaigns. Another example would be the use of regexes to identify pages with certain URL patterns, such as all product pages in an e-commerce store. A regex such as /product/.* can be used to filter all URLs that contain /product/, allowing the performance of product pages to be analyzed. These are just a few examples of how regexes can be used to improve website analysis and optimization, demonstrating their usefulness and versatility in the context of SEO.
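
Both filters can be tried on a handful of URLs in Python before using them in a report. The URLs are invented, and the third pattern, which captures the value of the parameter, is an addition made for this sketch rather than something taken from the examples above.

```python
import re

urls = [
    "/product/red-shoes",
    "/landing?utm_source=newsletter&utm_medium=email",
    "/blog/post",
    "/product/bag?utm_source=facebook",
]

campaign_pattern = re.compile(r".*utm_source=.*")
product_pattern = re.compile(r"/product/.*")

campaign_urls = [u for u in urls if campaign_pattern.search(u)]
product_urls = [u for u in urls if product_pattern.search(u)]
print(campaign_urls)  # the landing page and the bag page
print(product_urls)   # ['/product/red-shoes', '/product/bag?utm_source=facebook']

# Hypothetical extension: capture the parameter value up to the next "&" or "#".
value_pattern = re.compile(r"[?&]utm_source=([^&#]+)")
sources = []
for u in urls:
    match = value_pattern.search(u)
    if match:
        sources.append(match.group(1))
print(sources)  # ['newsletter', 'facebook']
```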

How to use Regex in Google Search Console

Since April 2021, Google has also supported regexes in Search Console, using RE2 syntax to let webmasters include and exclude data within the user interface; in June 2021 it extended support to negative regex matching, which returns all the data that does not match the expression.

Among the usage notes for regexes in Search Console are the character limit for the filter (4096 characters, usually long enough for analysis needs) and the specific syntax, which is documented in the RE2 reference on GitHub.

Regex in Search Console can be used to filter queries containing a specific brand and variations that users might type: for example, for Facebook you could use variants (including spelling errors)

.*facebook.*|face*book.*|fb.*|fbook.*|f*book.*

that also help segment users who already know the brand. Another useful use is filtering users who find the site through terms of commercial intent, such as

.*(best|top|alternative|alternatives|vs|versus|review*).*
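
Before using filters like these in Search Console, they can be tried on a list of queries in Python. This is a sketch under a few assumptions: the queries are invented, the commercial-intent pattern is written with English keywords, and Python’s `re` engine is used in place of RE2 (for simple patterns like these the two behave the same way).

```python
import re

queries = [
    "facebook login",
    "fb marketplace",
    "best running shoes",
    "nike vs adidas",
    "weather tomorrow",
    "seo tool review",
]

brand_pattern = re.compile(r".*facebook.*|face*book.*|fb.*|fbook.*|f*book.*")
intent_pattern = re.compile(r".*(best|top|alternative|alternatives|vs|versus|review*).*")

brand_queries = [q for q in queries if brand_pattern.search(q)]
intent_queries = [q for q in queries if intent_pattern.search(q)]
print(brand_queries)   # ['facebook login', 'fb marketplace']
print(intent_queries)  # ['best running shoes', 'nike vs adidas', 'seo tool review']

# Caveat: short keywords match inside longer words, e.g. "top" inside "laptop".
false_positive = intent_pattern.search("laptop bag") is not None
```

Since the alternatives are plain substrings, it is worth checking the results for accidental matches like the one in the last line before drawing conclusions from the filtered data.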

More analytically, Regex can help us to verify the amount and type of traffic to a section of the website and to understand the intent of users, as suggested by a Google guide.

In the first case, we can use a regex that focuses on site-specific directories, so as to understand which queries are common for each content area. For example, the document says, “if the structure of the URL is example.com/[product]/[brand]/[size]/[color] and you want to display the traffic to the green shoes, but you are not interested in the brand or size, you can use shoes/.*/green”. A regular expression can also serve to analyze the types of queries that lead users to different sections of the website: for example, “you may be interested in queries containing words that introduce a question; a query filter such as what|how|when|why could show results that indicate that your content should easily answer questions, perhaps via a frequently asked questions section”, while with other filters we can “check which product names are used more or less often”.
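
Both filters quoted from the guide can be sketched in Python. The page URLs and queries are hypothetical, and the word-boundary variant at the end is a suggestion added for this example, not part of the guide.

```python
import re

pages = [
    "example.com/shoes/nike/42/green",
    "example.com/shoes/adidas/40/red",
    "example.com/shirts/nike/m/green",
]

# ".*" skips the brand and size segments between "shoes/" and "/green".
green_shoes_pattern = re.compile(r"shoes/.*/green")
green_shoes = [p for p in pages if green_shoes_pattern.search(p)]
print(green_shoes)  # ['example.com/shoes/nike/42/green']

queries = [
    "how to clean suede shoes",
    "green shoes sale",
    "what size are nike shoes",
    "somewhat cheap shoes",
]

# The plain alternation also matches "what" inside "somewhat".
question_pattern = re.compile(r"what|how|when|why")
loose_questions = [q for q in queries if question_pattern.search(q)]

# Word boundaries restrict the match to whole words.
strict_pattern = re.compile(r"\b(what|how|when|why)\b")
strict_questions = [q for q in queries if strict_pattern.search(q)]
print(loose_questions)   # includes 'somewhat cheap shoes'
print(strict_questions)  # ['how to clean suede shoes', 'what size are nike shoes']
```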
