<?php
preg_match_all( '#<meta (name|content)="(.*)" (name|content)="(.*)"(\s+)?/?>#i', $content, $matches, PREG_SET_ORDER );
preg_match_all( "#<meta (name|content)='(.*)' (name|content)='(.*)'(\s+)?/?>#i", $content, $matches2, PREG_SET_ORDER );
First of all, I'd use a group for the " and ' characters and simplify things down to one regex match:
preg_match_all( '#<meta (name|content)=("|')(.*)("|') (name|content)=("|')(.*)("|')(\s+)?/?>#i', $content, $matches, PREG_SET_ORDER );
This has the added benefit of catching typos like <meta name='description" content="blah' />, which I've actually seen in the wild once or twice.
If you're concerned about catching multiple whitespace issues, I'd recommend subbing in a \s* for every space as well:
preg_match_all( '#<meta\s*(name|content)=("|')(.*)("|')\s*(name|content)=("|')(.*)("|')(\s+)?/?>#i', $content, $matches, PREG_SET_ORDER );
If there are any other special conditions you want to handle, I'm sure we can address those as well.
What about things like?
<meta http-equiv="Content-Type" content="text/html" >
How about terrible terrible whitespaces?
< meta name="keywords" content = "wikipedia,encyclopedia" >
What about things like?
<meta http-equiv="Content-Type" content="text/html" >
Since we're specifically looking for the name/content for the meta description and (I'm assuming) keywords, I think catching content types is a bit much.
How about terrible terrible whitespaces?
< meta name="keywords" content = "wikipedia,encyclopedia" >
That all depends on how many spaces you want to filter for. But you could just as easily add \s* before and after any words. The following would catch the specific situation you posted (it already matched the missing / at the end):
preg_match_all( '#<\s*meta\s*(name|content)\s*=\s*("|')(.*)("|')\s*(name|content)\s*=\s*("|')(.*)("|')(\s+)?/?>#i', $content, $matches, PREG_SET_ORDER );
Thanks Eric, that does catch a lot of weird cases :)
You could catch the 'terrible' whitespace parts with an optional group like ..content(\s)?=.. but that would miss it if there are two spaces. \s+ would match one or more.
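To make the difference between those quantifiers concrete, here is a small sketch. The sample strings, array keys, and the extra \s* variant are made up for illustration and are not part of the original gist.

```php
<?php
// Hypothetical sample attributes with zero, one, and two spaces before "=".
$samples = array(
    'none' => 'content="x"',
    'one'  => 'content ="x"',
    'two'  => 'content  ="x"',
);

// Three ways to allow whitespace between the attribute name and "=".
$patterns = array(
    'optional'    => '#content(\s)?=#', // zero or one whitespace character
    'one_or_more' => '#content\s+=#',   // at least one whitespace character
    'any'         => '#content\s*=#',   // zero or more whitespace characters
);

// Record 1 (match) or 0 (no match) for every pattern/sample pair.
$results = array();
foreach ( $patterns as $label => $pattern ) {
    foreach ( $samples as $spaces => $sample ) {
        $results[ $label ][ $spaces ] = preg_match( $pattern, $sample );
    }
}

print_r( $results );
```

Only the \s* variant accepts all three samples, which is probably what you want between an attribute name and its "=".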
The trick is to set up a readable regex that catches a large share of the typos but, above all, stays maintainable, because there's always a typo you didn't think of.
As for readable, take a look at an old IP catcher I wrote: http://bit.ly/HNog0k If I remember correctly it was Guido van Rossum who once said: Code is more often read than written. Apologies for my deviation on the subject btw :)
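One way to get at that readability, sketched here as an assumption rather than anything from the gist itself, is PCRE's x modifier, which ignores whitespace in the pattern and allows # comments. The ~ delimiter, the nowdoc syntax, and the lazy (.*?) groups are my own choices for this sketch.

```php
<?php
// Nowdoc keeps both quote characters literal, so nothing needs escaping.
// "~" is the delimiter because "#" starts a comment in extended (x) mode.
$pattern = <<<'REGEX'
~
<\s*meta\s+                 # start of the tag, tolerating "< meta"
(name|content)\s*=\s*       # first attribute name
["'](.*?)["']               # first attribute value, either quote style
\s+
(name|content)\s*=\s*       # second attribute name
["'](.*?)["']               # second attribute value
\s*/?>                      # optional self-closing slash
~ix
REGEX;

// Sample input taken from the whitespace example earlier in the thread.
$content = '< meta name="keywords" content = "wikipedia,encyclopedia" >';

$found = preg_match( $pattern, $content, $match );

if ( $found ) {
    // Index 2 is the first attribute's value, index 4 the second's.
    echo $match[2], ' => ', $match[4], "\n";
}
```

Each line of the pattern can now be changed or commented on its own when a new typo variant turns up.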
You need to escape the single quotes!
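For illustration, this is roughly what the escaped version might look like. The pattern is the whitespace-tolerant one from the thread with each ' inside the single-quoted PHP string written as \'. The $content value is a made-up input assembled from the examples above.

```php
<?php
// Inside a single-quoted PHP string, a literal ' has to be written as \'.
$pattern = '#<\s*meta\s*(name|content)\s*=\s*("|\')(.*)("|\')\s*(name|content)\s*=\s*("|\')(.*)("|\')(\s+)?/?>#i';

// Hypothetical input: the whitespace case and the mixed-quote typo, one per line.
$content = "< meta name=\"keywords\" content = \"wikipedia,encyclopedia\" >\n"
         . "<meta name='description\" content=\"blah' />\n";

preg_match_all( $pattern, $content, $matches, PREG_SET_ORDER );

foreach ( $matches as $match ) {
    // Index 3 is the first attribute's value, index 7 the second's.
    echo $match[3], ' => ', $match[7], "\n";
}
```

Without the backslashes, PHP ends the string at the first ' and the line fails with a parse error before the regex is ever run.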
It should catch shit like this:
<meta name="description" content="<?php bloginfo('description'); ?>"/>
and this:
<meta name='description' content='<?php bloginfo('description'); ?>'/>
as well as "normal" meta tags like this:
<meta name="description" content="meta bla"/>
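A rough sketch of a pattern that handles those three cases follows. It assumes each tag sits on one line and has exactly two attributes, name and content, in either order. The lazy (.*?) groups and the ["'] character class are my own substitutions, not the gist's final code.

```php
<?php
// Lazy (.*?) stops at the first quote that is followed by the rest of the tag,
// so quotes inside the embedded PHP call do not end the value early.
$pattern = '#<\s*meta\s+(name|content)\s*=\s*["\'](.*?)["\']\s+(name|content)\s*=\s*["\'](.*?)["\']\s*/?>#i';

// The three cases from the comment above, kept literal in a nowdoc.
$content = <<<'HTML'
<meta name="description" content="<?php bloginfo('description'); ?>"/>
<meta name='description' content='<?php bloginfo('description'); ?>'/>
<meta name="description" content="meta bla"/>
HTML;

preg_match_all( $pattern, $content, $matches, PREG_SET_ORDER );

// Collect the value of the second attribute from every matched tag.
$values = array();
foreach ( $matches as $match ) {
    $values[] = $match[4];
}

print_r( $values );
```

A value that itself contains a quote followed directly by > or /> would still cut the match short, so this is a heuristic and not a substitute for a real parser.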