@jdevalk
Created Jul 2, 2012
Regex to match meta description in content
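<?php
// Meta tags whose attributes use double quotes: name="..." content="..."
preg_match_all( '#<meta (name|content)="(.*)" (name|content)="(.*)"(\s+)?/?>#i', $content, $matches, PREG_SET_ORDER );
// The same pattern for attributes that use single quotes: name='...' content='...'
preg_match_all( "#<meta (name|content)='(.*)' (name|content)='(.*)'(\s+)?/?>#i", $content, $matches2, PREG_SET_ORDER );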

jdevalk (author) commented Jul 2, 2012

It should catch shit like this:

<meta name="description" content="<?php bloginfo('description'); ?>"/>

and this:

<meta name='description' content='<?php bloginfo('description'); ?>'/>

as well as "normal" meta tags like this:

<meta name="description" content="meta bla"/>
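
To make the target concrete, here is a rough sketch that runs the two patterns from the gist against the three samples above. The helper name `find_meta_tags` and the `$samples` variable are made up for illustration; only the patterns, `preg_match_all` and `PREG_SET_ORDER` come from the gist.

```php
<?php
// Hypothetical helper: runs both patterns from the gist and merges the results.
function find_meta_tags( string $content ): array {
    // Attributes in double quotes.
    preg_match_all( '#<meta (name|content)="(.*)" (name|content)="(.*)"(\s+)?/?>#i', $content, $matches, PREG_SET_ORDER );
    // Attributes in single quotes.
    preg_match_all( "#<meta (name|content)='(.*)' (name|content)='(.*)'(\s+)?/?>#i", $content, $matches2, PREG_SET_ORDER );
    return array_merge( $matches, $matches2 );
}

// The three samples, one tag per line. A nowdoc keeps both quote styles literal.
$samples = <<<'HTML'
<meta name="description" content="<?php bloginfo('description'); ?>"/>
<meta name='description' content='<?php bloginfo('description'); ?>'/>
<meta name="description" content="meta bla"/>
HTML;

foreach ( find_meta_tags( $samples ) as $m ) {
    // $m[1] and $m[3] are the attribute names, $m[2] and $m[4] their values.
    echo $m[1], '=', $m[2], ' | ', $m[3], '=', $m[4], PHP_EOL;
}
```

Keeping one tag per line matters in this sketch: `.` does not match a newline by default, so each `(.*)` stays inside its own line.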

ericmann commented Jul 2, 2012

First of all, I'd use a group for the " and ' characters and simplify things down to one regex match:

preg_match_all( '#<meta (name|content)=("|')(.*)("|') (name|content)=("|')(.*)("|')(\s+)?/?>#i', $content, $matches, PREG_SET_ORDER );

This has the added benefit of catching typos like: <meta name='description" content="blah' /> which I've actually seen in the wild once or twice.

If you're concerned about catching multiple whitespace issues, I'd recommend subbing in a \s* for every space as well:

preg_match_all( '#<meta\s*(name|content)=("|')(.*)("|')\s*(name|content)=("|')(.*)("|')(\s+)?/?>#i', $content, $matches, PREG_SET_ORDER );

If there are any other special conditions you want to handle, I'm sure we can address those as well.
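
As a quick sanity check of the single-pattern idea, the sketch below runs the first combined pattern against the mismatched-quote typo mentioned above. The pattern is kept in a nowdoc so that both quote characters can appear in it literally; that choice and the variable names are mine, not part of the suggestion.

```php
<?php
// The combined pattern, stored in a nowdoc so " and ' need no escaping.
$pattern = <<<'REGEX'
#<meta (name|content)=("|')(.*)("|') (name|content)=("|')(.*)("|')(\s+)?/?>#i
REGEX;

// Made-up input containing the mismatched-quote typo.
$content = <<<'HTML'
<meta name='description" content="blah' />
HTML;

preg_match_all( $pattern, $content, $matches, PREG_SET_ORDER );

foreach ( $matches as $m ) {
    // The extra quote groups shift the values to $m[3] and $m[7].
    echo $m[1], ' => ', $m[3], ', ', $m[5], ' => ', $m[7], PHP_EOL;
}
```

Note that the extra quote groups change the numbering: the attribute names are now in groups 1 and 5, and the values in groups 3 and 7.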

chrisle commented Jul 2, 2012

What about things like?
<meta http-equiv="Content-Type" content="text/html" >

How about terrible terrible whitespaces?
< meta name="keywords" content = "wikipedia,encyclopedia" >

ericmann commented Jul 2, 2012

What about things like?

<meta http-equiv="Content-Type" content="text/html" >

Since we're specifically looking for the name/content for the meta description and (I'm assuming) keywords, I think catching content types is a bit much.

How about terrible terrible whitespaces?

< meta name="keywords" content = "wikipedia,encyclopedia" >

That all depends on how many spaces you want to filter for. But you could just as easily add \s* before and after any words. The following would catch the specific situation you posted (it already matched the missing / at the end):

preg_match_all( '#<\s*meta\s*(name|content)\s*=\s*("|')(.*)("|')\s*(name|content)\s*=\s*("|')(.*)("|')(\s+)?/?>#i', $content, $matches, PREG_SET_ORDER );

jdevalk (author) commented Jul 2, 2012

Thanks Eric, that does catch a lot of weird cases :)

gerardjp commented Jul 3, 2012

You could catch the 'terrible' whitespace with an optional group like ..content(\s)?=.. but that would miss it if there are two spaces. \s+ would get one or more.

The trick is to set up a readable regex that catches a large share of the typos but, above all, stays maintainable, because there's always a typo you didn't think of.

As for readable, take a look at an old IP catcher I wrote: http://bit.ly/HNog0k If I remember correctly, it was Guido van Rossum who once said: code is more often read than written. Apologies for drifting off the subject, btw :)
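
One way to get that readability in PHP is PCRE's x (extended) flag, which ignores unescaped whitespace in the pattern and treats # as the start of a comment. The sketch below is only Eric's last pattern laid out that way, not the approach from the linked code. The ~ delimiter (needed because # now starts comments), the nowdoc and the variable names are my own choices.

```php
<?php
// Eric's last pattern, spread over several lines with the x flag.
// The comments must not contain the ~ delimiter.
$pattern = <<<'REGEX'
~
<\s*meta\s*                # "<meta", tolerating "< meta"
(name|content)\s*=\s*      # first attribute name and "="
("|')(.*)("|')             # quoted value, either quote style
\s*
(name|content)\s*=\s*      # second attribute name and "="
("|')(.*)("|')             # quoted value
(\s+)?/?>                  # optional space and slash before ">"
~ix
REGEX;

// chrisle's whitespace sample from earlier in the thread.
$content = '< meta name="keywords" content = "wikipedia,encyclopedia" >';

preg_match_all( $pattern, $content, $matches, PREG_SET_ORDER );

foreach ( $matches as $m ) {
    echo $m[1], ' => ', $m[3], ', ', $m[5], ' => ', $m[7], PHP_EOL;
}
```

The group numbers stay the same as in the one-line version, so code that reads `$matches` should not need to change.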

silasrm commented Jan 23, 2019

What about things like?
<meta http-equiv="Content-Type" content="text/html" >

Since we're specifically looking for the name/content for the meta description and (I'm assuming) keywords, I think catching content types is a bit much.

How about terrible terrible whitespaces?
< meta name="keywords" content = "wikipedia,encyclopedia" >

That all depends on how many spaces you want to filter for. But you could just as easily add \s* before and after any words. The following would catch the specific situation you posted (it already matched the missing / at the end):

preg_match_all( '#<\s*meta\s*(name|content)\s*=\s*("|')(.*)("|')\s*(name|content)\s*=\s*("|')(.*)("|')(\s+)?/?>#i', $content, $matches, PREG_SET_ORDER );

You need to escape the single quotes!
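
For reference, a sketch of what that looks like: the pattern sits in a single-quoted PHP string, so every literal ' inside it has to be written as \', otherwise the string ends at the first one and the line does not parse. The two test inputs are chrisle's samples from earlier in the thread; the variable names are just for illustration.

```php
<?php
// The same pattern with each literal single quote escaped as \'.
$pattern = '#<\s*meta\s*(name|content)\s*=\s*("|\')(.*)("|\')\s*(name|content)\s*=\s*("|\')(.*)("|\')(\s+)?/?>#i';

// chrisle's two samples, one per line.
$content = <<<'HTML'
<meta http-equiv="Content-Type" content="text/html" >
< meta name="keywords" content = "wikipedia,encyclopedia" >
HTML;

preg_match_all( $pattern, $content, $matches, PREG_SET_ORDER );

foreach ( $matches as $m ) {
    // Only the name/content tag is expected here; the http-equiv tag is skipped.
    echo $m[1], ' => ', $m[3], ', ', $m[5], ' => ', $m[7], PHP_EOL;
}
```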
