Created
July 2, 2012 18:41
Regex to match meta description in content
<?php
preg_match_all( '#<meta (name|content)="(.*)" (name|content)="(.*)"(\s+)?/?>#i', $content, $matches, PREG_SET_ORDER );
preg_match_all( "#<meta (name|content)='(.*)' (name|content)='(.*)'(\s+)?/?>#i", $content, $matches2, PREG_SET_ORDER );
What about things like this?
<meta http-equiv="Content-Type" content="text/html" >
Since we're specifically looking for the name/content for the meta description and (I'm assuming) keywords, I think catching content types is a bit much.
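To illustrate the point, here is a minimal sketch that runs the double-quote pattern from the gist over two tags. The sample markup in $content is made up for this example. Because the pattern requires name or content directly after "<meta ", the http-equiv tag is skipped and only the description tag is captured.

```php
<?php
// Hypothetical sample markup, one tag per line (not taken from the gist).
$content = implode( "\n", array(
    '<meta http-equiv="Content-Type" content="text/html" >',
    '<meta name="description" content="A short summary." />',
) );

// The double-quote pattern from the gist, unchanged.
preg_match_all( '#<meta (name|content)="(.*)" (name|content)="(.*)"(\s+)?/?>#i', $content, $matches, PREG_SET_ORDER );

// Only the description tag matches; the http-equiv tag is ignored.
foreach ( $matches as $match ) {
    echo $match[1], '=', $match[2], ', ', $match[3], '=', $match[4], "\n";
}
```

Note that the greedy (.*) can run across several tags when they sit on the same line, so this sketch keeps each tag on its own line.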
How about terrible, terrible whitespace?
< meta name="keywords" content = "wikipedia,encyclopedia" >
That all depends on how many spaces you want to filter for. But you could just as easily add \s* before and after any words. The following would catch the specific situation you posted (it already matched the missing / at the end):
preg_match_all( '#<\s*meta\s*(name|content)\s*=\s*("|')(.*)("|')\s*(name|content)\s*=\s*("|')(.*)("|')(\s+)?/?>#i', $content, $matches, PREG_SET_ORDER );
You need to escape the single quotes!
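A sketch of what that fix could look like: the pattern sits in a single-quoted PHP string, so every literal ' inside it has to be written as \', otherwise the string ends early and PHP reports a parse error. The sample input is the whitespace-heavy tag posted earlier in the thread; the $pattern variable is introduced here only for readability.

```php
<?php
// The whitespace-heavy tag from earlier in the thread.
$content = '< meta name="keywords" content = "wikipedia,encyclopedia" >';

// Same pattern as in the comment above, with each single quote inside the
// single-quoted PHP string escaped as \'.
$pattern = '#<\s*meta\s*(name|content)\s*=\s*("|\')(.*)("|\')\s*(name|content)\s*=\s*("|\')(.*)("|\')(\s+)?/?>#i';

preg_match_all( $pattern, $content, $matches, PREG_SET_ORDER );

foreach ( $matches as $match ) {
    // Groups 1 and 3 hold the first attribute, groups 5 and 7 the second.
    echo $match[1], ' = ', $match[3], "\n";
    echo $match[5], ' = ', $match[7], "\n";
}
```

The extra quote groups shift the numbering: the attribute values now sit in groups 3 and 7 rather than 2 and 4.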
You could catch the 'terrible' whitespace with an optional group, like ..content(\s)?=.. But that would miss it if there are two spaces. \s+ would match one or more, and \s* zero or more.
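A small comparison of the quantifiers, as a sketch. The three sample fragments and the array keys are invented for this example; each fragment has zero, one, or two spaces before the equals sign.

```php
<?php
// Hypothetical attribute fragments with zero, one, and two spaces before "=".
$samples = array( 'content="a"', 'content ="a"', 'content  ="a"' );

$patterns = array(
    'optional' => '#content(\s)?=#', // at most one whitespace character
    'plus'     => '#content\s+=#',   // one or more
    'star'     => '#content\s*=#',   // zero or more
);

$results = array();
foreach ( $patterns as $label => $pattern ) {
    foreach ( $samples as $sample ) {
        // preg_match() returns 1 on a match and 0 otherwise.
        $results[ $label ][] = preg_match( $pattern, $sample );
    }
    echo $label, ': ', implode( ', ', $results[ $label ] ), "\n";
}
// optional: 1, 1, 0
// plus: 0, 1, 1
// star: 1, 1, 1
```

Only \s* accepts all three fragments, which is why it is the usual choice for whitespace that may or may not be there.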
The trick is to set up a readable regex that catches as many of the typos as possible but, above all, stays maintainable, because there's always a typo you didn't think of.
As for readable, take a look at an old IP catcher I wrote: http://bit.ly/HNog0k If I remember correctly, it was Guido van Rossum who once said that code is more often read than written. Apologies for straying from the subject, btw :)
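One way to keep such a pattern readable is PCRE's x (extended) modifier, which ignores whitespace in the pattern and allows # comments. The following is only a sketch, not the pattern from the gist: the $readable_pattern name is made up, ~ is used as the delimiter so that # stays free for comments, a nowdoc avoids quote escaping, and the sketch swaps in lazy (.*?) values, a backreference for the closing quote, and \s+ after "meta".

```php
<?php
// Illustrative only: $readable_pattern is a hypothetical name.
$readable_pattern = <<<'REGEX'
~
<\s*meta\s+                 # tag opening, tolerating whitespace after "<"
(name|content)\s*=\s*       # 1: first attribute name
(["'])(.*?)\2               # 2: opening quote, 3: value, same quote closes
\s+
(name|content)\s*=\s*       # 4: second attribute name
(["'])(.*?)\5               # 5: opening quote, 6: value, same quote closes
\s*/?>                      # optional whitespace and slash before ">"
~ix
REGEX;

// The whitespace-heavy tag from earlier in the thread.
$content = '< meta name="keywords" content = "wikipedia,encyclopedia" >';

preg_match_all( $readable_pattern, $content, $matches, PREG_SET_ORDER );

foreach ( $matches as $match ) {
    echo $match[1], ' = ', $match[3], "\n";
    echo $match[4], ' = ', $match[6], "\n";
}
```

With one fragment per line and a comment next to each group, adding tolerance for the next unexpected typo means touching a single line instead of rereading the whole expression.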