@jdevalk
Created July 2, 2012 18:41
Regex to match meta description in content
<?php
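// Match <meta> tags with double-quoted name/content attributes, in either order.
preg_match_all( '#<meta (name|content)="([^"]*)" (name|content)="([^"]*)"(\s+)?/?>#i', $content, $matches, PREG_SET_ORDER );
// Same for single-quoted attributes; [^']* keeps a value from running past its closing quote.
preg_match_all( "#<meta (name|content)='([^']*)' (name|content)='([^']*)'(\s+)?/?>#i", $content, $matches2, PREG_SET_ORDER );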
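
To show how the match sets might be consumed, here is a minimal sketch that pulls the description out of the double-quoted form. It uses a variation of the pattern above with [^"]* in place of .* so that a value cannot run past its closing quote. The function name extract_meta_description and the sample inputs are assumptions made for illustration; they are not part of the gist.

```php
<?php
// Hypothetical helper (not part of the original gist): returns the content of
// the first <meta name="description"> tag found, or null if there is none.
function extract_meta_description( $content ) {
    preg_match_all( '#<meta (name|content)="([^"]*)" (name|content)="([^"]*)"(\s+)?/?>#i', $content, $matches, PREG_SET_ORDER );

    foreach ( $matches as $match ) {
        // Groups 1 and 3 hold the attribute names, groups 2 and 4 their values,
        // so the attributes can appear in either order.
        $attributes = array(
            strtolower( $match[1] ) => $match[2],
            strtolower( $match[3] ) => $match[4],
        );

        if ( isset( $attributes['name'], $attributes['content'] )
            && 'description' === strtolower( $attributes['name'] ) ) {
            return $attributes['content'];
        }
    }

    return null;
}

// Sample input invented for this sketch.
$description = extract_meta_description( '<meta name="description" content="A short summary." />' );
```

Mapping the captured attribute names to their values avoids having to care whether name or content came first in the tag.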

chrisle commented Jul 2, 2012

What about things like?
<meta http-equiv="Content-Type" content="text/html" >

How about terrible terrible whitespaces?
< meta name="keywords" content = "wikipedia,encyclopedia" >


ericmann commented Jul 2, 2012

What about things like?

<meta http-equiv="Content-Type" content="text/html" >

Since we're specifically looking for the name/content for the meta description and (I'm assuming) keywords, I think catching content types is a bit much.

How about terrible terrible whitespaces?

< meta name="keywords" content = "wikipedia,encyclopedia" >

That all depends on how many spaces you want to filter for. But you could just as easily add \s* before and after any words. The following would catch the specific situation you posted (it already matched the missing / at the end):

preg_match_all( '#<\s*meta\s*(name|content)\s*=\s*("|')(.*)("|')\s*(name|content)\s*=\s*("|')(.*)("|')(\s+)?/?>#i', $content, $matches, PREG_SET_ORDER );

jdevalk (author) commented Jul 2, 2012

Thanks Eric, that does catch a lot of weird cases :)


gerardjp commented Jul 3, 2012

You could catch the 'terrible' whitespace parts with an optional group like ..content(\s)?=.. but that would miss the case where there are two spaces. \s+ would match one or more, and \s* zero or more.

The trick is to set up a readable regex that catches a large share of the typos but, above all, stays maintainable, because there's always a typo you didn't think of.

As for readable, take a look at an old IP catcher I wrote: http://bit.ly/HNog0k If I remember correctly, it was Guido van Rossum who once said that code is read more often than it is written. Apologies for straying from the subject, btw :)
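
As a rough illustration of the readability point, PCRE's x (extended) modifier lets a pattern be spread over several lines with a comment per part. The sketch below is one possible layout, not the regex from the gist: the function name find_meta_name_content_pairs is hypothetical, the delimiter is ~ because # starts a comment in extended mode, and it requires \s+ after "meta" and between the two attributes where the pattern earlier in the thread allows \s*.

```php
<?php
// Hypothetical helper name; the layout is one way to keep the pattern readable.
function find_meta_name_content_pairs( $content ) {
    // Nowdoc: no quote escaping needed, and backslashes are passed through as-is.
    $pattern = <<<'REGEX'
~
<\s*meta\s+                   # tag name, tolerating "< meta"
(name|content)\s*=\s*         # first attribute name, spaces allowed around =
(["'])(.*?)\2\s+              # quoted value, closed by the same quote character
(name|content)\s*=\s*         # second attribute name
(["'])(.*?)\5                 # second quoted value
\s*/?>                        # optional whitespace and self-closing slash
~ix
REGEX;

    preg_match_all( $pattern, $content, $matches, PREG_SET_ORDER );
    return $matches;
}

// The whitespace-heavy example from earlier in the thread.
$pairs = find_meta_name_content_pairs( '< meta name="keywords" content = "wikipedia,encyclopedia" >' );
```

In each match set, groups 1 and 4 hold the attribute names and groups 3 and 6 their values; groups 2 and 5 only record which quote character opened the value.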


silasrm commented Jan 23, 2019

What about things like?
<meta http-equiv="Content-Type" content="text/html" >

Since we're specifically looking for the name/content for the meta description and (I'm assuming) keywords, I think catching content types is a bit much.

How about terrible terrible whitespaces?
< meta name="keywords" content = "wikipedia,encyclopedia" >

That all depends on how many spaces you want to filter for. But you could just as easily add \s* before and after any words. The following would catch the specific situation you posted (it already matched the missing / at the end):

preg_match_all( '#<\s*meta\s*(name|content)\s*=\s*("|')(.*)("|')\s*(name|content)\s*=\s*("|')(.*)("|')(\s+)?/?>#i', $content, $matches, PREG_SET_ORDER );

You need to escape the single quotes!
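
For example, the quoted pattern might look like this once each ' inside the single-quoted PHP string is written as \'. This is a sketch of that fix only; the sample $content is the whitespace example from this thread, and nothing else about the pattern is changed.

```php
<?php
// Sample input: the whitespace example posted earlier in the thread.
$content = '< meta name="keywords" content = "wikipedia,encyclopedia" >';

// Same pattern as quoted above, with every single quote inside the
// single-quoted PHP string escaped as \' so the file parses.
preg_match_all( '#<\s*meta\s*(name|content)\s*=\s*("|\')(.*)("|\')\s*(name|content)\s*=\s*("|\')(.*)("|\')(\s+)?/?>#i', $content, $matches, PREG_SET_ORDER );
```

With the extra quote groups, the attribute values end up in groups 3 and 7 of each match set instead of groups 2 and 4.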