regex: markdown links

2020-09-27

In the Markdown spec there are several ways to write a link. The most common way I’ve seen and used is a pair of brackets followed by a pair of parentheses: [](). There is, however, also the “reference link” syntax: two bracket pairs, the second of which is tied to a reference defined later (see footnote 1).

I had a project recently where it would be useful if I could reliably tease apart the different pieces of the link. Specifically, given the standard link format of [link text](resource "title") could I break out those three pieces?

David Wells gave me a good start with his post on regex to match markdown links. I’m not sure what I was doing wrong, but his approach didn’t quite work for me. Still, I now knew it was possible and had a lead!

For my project in particular, all of my links were relative paths, so I didn’t have to worry too much about whether or not they were fully qualified https links, etc. That said, I ultimately adopted David’s approach to the URL portion to make my solution more general purpose.

I came up with a few solutions.

The “simplest” (and by that I mean the fewest moving parts) is the following:

const pattern = /!?\[([^\]]*)\]\(([^\)]+)\)/gm

What I really like about this one is the first group: \[([^\]]*)\] (see footnote 2).

Breaking this down:

  • \[ means a literal [ character is expected. This is paired at the end with \]. This gives us our brackets surrounding the first part of the link.
  • The ( and its matching ) create a group, which will be useful when we want to use what’s captured by the regex later.
  • [^\]]* matches any character other than a closing bracket (]): the ^ at the start negates the character class. The * then repeats that zero or more times. Said another way, the only way to get out of this match is to actually have a ].

The second group does something very similar, except instead of looking for a literal [ and its paired ], it’s looking for parentheses. It also uses + rather than *, so at least one character is required between them.
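As a rough sketch of how this simple pattern might be used, String.prototype.matchAll exposes each capture group by index. The sample string and the variable names here are my own, made up purely for illustration:

```javascript
// Simple pattern from above: group 1 is the link text,
// group 2 is everything between the parentheses.
const simplePattern = /!?\[([^\]]*)\]\(([^\)]+)\)/gm

// Hypothetical sample input, invented for this example.
const sample = 'See [the docs](./docs/intro.md) and ![a logo](./img/logo.png "Logo").'

// matchAll needs the g flag and yields one match object per link.
const simpleLinks = [...sample.matchAll(simplePattern)].map((match) => ({
  raw: match[0],
  text: match[1],
  target: match[2],
}))

console.log(simpleLinks)
```

Note that for the image link the second group still holds both the path and the quoted title in one string, which is the part the next pattern tries to tease apart.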

Given that we’re actually talking about links, however, we can make the second group more robust. This is where David Wells’ approach proved most useful as it was really the inspiration for the whole thing:

const pattern = /!?\[(.+)?\]\(((https?:\/\/)?[A-Za-z0-9\:\/\. ]+)(\"(.+)\")?\)/gm

Focusing here on the second half of the link, we are making use of several groups: one for the link itself (with a nested sub group for the optional protocol) and one for the quoted title (with a nested sub group for the title text).

Unlike the first attempt, we’re not allowing just any character to be in the link. We’re asking up front: is it prefixed with http:// or https://? This is all optional, though. Interestingly, the protocol sub group only recognizes http and https, so other protocols are not captured by it, though this could be easily remedied with an or pipe (|).
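To illustrate the “or pipe” idea, here is a minimal sketch that widens the protocol sub group. The choice of ftp as the extra protocol and the sample links are assumptions picked for the example; they are not part of the original pattern:

```javascript
// Variation on the pattern above: the protocol sub group now accepts
// http://, https://, or ftp:// (ftp is an assumption, for illustration only).
const protocolPattern = /!?\[(.+)?\]\(((https?:\/\/|ftp:\/\/)?[A-Za-z0-9\:\/\. ]+)(\"(.+)\")?\)/gm

// Hypothetical sample links, one per line.
const protocolSample = [
  '[secure](https://example.com)',
  '[files](ftp://example.com/pub)',
  '[relative](./notes/regex.md)',
].join('\n')

// Group 3 holds the protocol when one is present, otherwise undefined.
const protocols = [...protocolSample.matchAll(protocolPattern)].map((match) => match[3])

console.log(protocols)
```

Further protocols could be added to the alternation in the same way.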

After that, some characters are not allowed in a URL, and the character class accounts for that by accepting only A-Z, a-z, 0-9, a :, a /, a . or a space (the space is what lets the match continue up to the opening quote of a title). The + then repeats that one or more times.
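One way to get a feel for what that character class accepts is to test the URL portion in isolation. In this sketch the ^ and $ anchors are my addition (so the whole string has to match), and the candidate targets are made up:

```javascript
// The URL portion of the pattern, anchored so the entire string must match.
const urlPart = /^(https?:\/\/)?[A-Za-z0-9\:\/\. ]+$/

// Made-up link targets for illustration.
const candidates = [
  './posts/regex.md',
  'https://example.com/a/b',
  'https://example.com/my-post',
  'page.html#top',
]

// Keep only the targets made up entirely of characters the class allows.
const accepted = candidates.filter((candidate) => urlPart.test(candidate))

console.log(accepted)
```

The last two candidates are rejected because - and # are not in the class, even though both are legal in URLs. If your links use them, the class would need to be extended.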

Finally, we look at the title which is always wrapped in "" if it’s present, but is optional overall.
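Putting the groups together, here is a small sketch of pulling the pieces out of a single link with this pattern. The sample link and the property names are hypothetical. Counting opening parentheses from the left, the groups number as 1 for the text, 2 for the URL, 3 for the protocol, 4 for the quoted title, and 5 for the title text:

```javascript
// Second pattern from above.
// Groups: 1 = text, 2 = URL, 3 = protocol, 4 = quoted title, 5 = title text.
const titledPattern = /!?\[(.+)?\]\(((https?:\/\/)?[A-Za-z0-9\:\/\. ]+)(\"(.+)\")?\)/gm

// Hypothetical input for illustration.
const titledSample = '[home](https://example.com "Home page")'

const [titledMatch] = [...titledSample.matchAll(titledPattern)]

const parts = {
  text: titledMatch[1],
  // The character class allows spaces, so group 2 also swallows the space
  // that separates the URL from the title; trim() removes it.
  url: titledMatch[2].trim(),
  protocol: titledMatch[3],
  title: titledMatch[5],
}

console.log(parts)
```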

My final, preferred solution combines these two approaches:

const pattern = /!?\[([^\]]*)?\]\(((https?:\/\/)?[A-Za-z0-9\:\/\. ]+)(\"(.+)\")?\)/gm
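As a sketch of how the final pattern could be wrapped up for reuse, the helper below returns the three pieces of every link in a string. The name parseLinks, the property names, and the sample line are all hypothetical:

```javascript
// Final pattern from above.
const finalPattern = /!?\[([^\]]*)?\]\(((https?:\/\/)?[A-Za-z0-9\:\/\. ]+)(\"(.+)\")?\)/gm

// Hypothetical helper: returns text, resource, and title for each link found.
function parseLinks(markdown) {
  return [...markdown.matchAll(finalPattern)].map((match) => ({
    text: match[1],
    resource: match[2].trim(), // drop the space that precedes a title
    title: match[5], // undefined when the link has no title
  }))
}

// Two links on one line, made up for illustration.
const line = '[one](./one.md) and [two](https://example.com/two "Second")'

console.log(parseLinks(line))
```

The two-link line is where combining the approaches pays off. With the greedy (.+)? from the second pattern, the same line produces a single match whose link text runs from “one” all the way to “two”, whereas the negated class stops at the first ].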

While this may look easy now, it took me a long time to actually understand it. And, even though I understand it now, I’m writing it down because I know when I come back later, I won’t be able to reason through it without these notes!

I guess that’s a fun part of regex: there’s so much more I can learn before I’m comfortable, but these sorts of experiences certainly remind me how powerful it can be once mastered!

Here’s a playground with examples of this in use.

Footnotes

  • 1 An example of a reference link might look like this:

    [reference][id]
    [id]: link
  • 2 I’m ignoring the !? for the moment as this is there to provide support for image links.

Hi there and thanks for reading! My name's Stephen. I live in Chicago with my wife, Kate, and dog, Finn. Want more? See about and get in touch!