@jameshfisher
Last active April 18, 2023 14:53
Semantics for PHP

PHP semantics

Motivation

PHP is usually included in the top five or six most popular programming languages, as measured by various metrics implemented by e.g. TIOBE, LangPop, PYPL, lang-index. Alongside it sit C, Java, Objective-C, C++, C#, JavaScript, and Python. All of these have a formal semantics or at least a rigorous specification. C has ANSI and ISO specifications, much work on formal semantics, and even a formally verified compiler. Java has a language specification and a formal subset, "Featherweight Java". Objective-C has some specification in the form of its C subset, and decent documentation. C++, similarly, has C as a formally defined subset, is defined in an ISO standard, and has some work on formalizing fragments of it. C# has an ECMA standard and at least one paper formalizing it. JavaScript is really ECMAScript, which has a specification, and some work on the essence of JavaScript formalizes it and builds a reference interpreter. Python has an operational semantics.

PHP is notably different. It has no specification other than an informal and sparse "language reference". It is said to be defined by a reference implementation: the complex and optimized Zend interpreter, written in C.

Syntactic sugar

Many language features can be understood as syntactic sugar. Desugaring them leaves a smaller core language, with fewer syntactic forms to which we must assign semantics.

Variables

All variables are looked up by name in the environment, and the name may be computed at run time: if variable "x" maps to value v in the environment, and expression e evaluates to "x", then the expression ${e} evaluates to v. Variables can also be assigned dynamically: if e1 evaluates to "x", then the expression ${e1} = e2 assigns the value of e2 to the variable named "x".

This means that $x can be understood as syntactic sugar for ${'x'}, both as an expression and as the target of an assignment (or reference assignment).
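A minimal sketch of this desugaring, which can be run to check that the sugared and desugared forms agree. The function name and the variables inside it are hypothetical and exist only for illustration:

```php
<?php
// Illustrative only: dynamic_lookup_demo is a made-up name.
function dynamic_lookup_demo(): array {
    $x = 1;
    $name = 'x';

    $read_plain   = $x;        // sugared form
    $read_literal = ${'x'};    // desugared form with a literal name
    $read_dynamic = ${$name};  // name computed at run time

    ${$name} = 2;              // dynamic assignment: writes to $x
    ${'y' . 'z'} = 3;          // any string expression works: creates $yz

    return [$read_plain, $read_literal, $read_dynamic, $x, $yz];
}
```

All three reads return the same value, and the dynamic assignments are visible through the ordinary names $x and $yz afterwards.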

Variable variables

As well as $x, PHP variables can take the forms $$x, $$$x, etc., and even $$${'x'}. The $ in this case is a prefix operator and associates to the right, like $($($x)). The form $$$x is sugar for ${${${'x'}}}.
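The right-associativity can be checked with a chain of variables whose values name each other. This is only a sketch; the function name and the values are made up:

```php
<?php
// Illustrative only: $x names $y, which names $z.
function variable_variables_demo(): array {
    $x = 'y';
    $y = 'z';
    $z = 'value';

    return [
        $$x,            // ${$x}     -> $y -> 'z'
        $$$x,           // ${${$x}}  -> $z -> 'value'
        ${${${'x'}}},   // fully desugared form of $$$x
        $$${'x'},       // mixed form mentioned above
    ];
}
```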

Control structures

Omitted braces

Several control structures allow one to omit braces; this is purely syntactic:

if (e) s1;              ==>   if (e) { s1; }
if (e) s1; else s2;     ==>   if (e) { s1; } else { s2; }
while (e) s1;           ==>   while (e) { s1; }
etc.

Nested if-else

This is just sugar; it associates to the right:

if (e1) { s1; } else   if (e2) { s2; } else { s3; }
==>
if (e1) { s1; } else { if (e2) { s2; } else { s3; } }

elseif

PHP provides a keyword elseif, which is semantically identical to else if when braces are used. (With the alternative colon syntax, only elseif is accepted. The Zend implementation apparently optimizes it differently, but this is unimportant.)
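As a quick check of the last two rules, the following sketch writes the same three-way branch with elseif and with the fully nested form. Both function names are hypothetical:

```php
<?php
// Sugared: uses the elseif keyword.
function sign_elseif(int $n): string {
    if ($n < 0) {
        return 'negative';
    } elseif ($n === 0) {
        return 'zero';
    } else {
        return 'positive';
    }
}

// Desugared: the second if is nested inside the else block.
function sign_nested(int $n): string {
    if ($n < 0) {
        return 'negative';
    } else {
        if ($n === 0) {
            return 'zero';
        } else {
            return 'positive';
        }
    }
}
```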

If without else

The statement if (e) { s1; } is just sugar for the more general if (e) { s1; } else {}.

do-while

do-while should be understood as primitive, and not as

do { s; } while (e);     ==>  s; while (e) { s; }

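since s may contain break or continue statements. In the do-while these refer to the loop itself, but in the copy of s hoisted in front of the while they would refer to an enclosing loop, or to no loop at all.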
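A sketch of how the naive desugaring goes wrong. Both versions are placed inside an outer foreach so that the hoisted break still compiles; the function names and the log format are made up for illustration:

```php
<?php
// Primitive do-while: break leaves only the do-while loop.
function with_do_while(): array {
    $log = [];
    foreach ([1, 2] as $i) {
        do {
            $log[] = "body $i";
            break;
        } while (true);
        $log[] = "after $i";
    }
    return $log;
}

// Naive desugaring `s; while (e) { s; }`: the first copy of s now sits
// directly in the outer loop, so its break leaves the outer loop instead.
function with_naive_desugaring(): array {
    $log = [];
    foreach ([1, 2] as $i) {
        $log[] = "body $i";
        break;
        while (true) {          // never reached
            $log[] = "body $i";
            break;
        }
        $log[] = "after $i";
    }
    return $log;
}
```

The first function visits both elements of the outer loop; the second stops after the first body.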

for

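Since code blocks do not introduce a new scope, for can be desugared to while, provided s contains no continue (in a for loop, continue still evaluates e3, whereas in the while loop below it would skip it):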

for (e1; e2; e3) { s; }        ==>     e1; while (e2) { s; e3; }
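A small sketch comparing the two forms on a loop body without continue. The third function shows a for loop whose body does use continue; rewriting it with the rule above would not terminate, so that rewriting is deliberately not shown. All function names are hypothetical:

```php
<?php
// Sugared form.
function squares_for(int $n): array {
    $out = [];
    for ($i = 0; $i < $n; $i++) {
        $out[] = $i * $i;
    }
    return $out;
}

// Desugared form: e1; while (e2) { s; e3; }
function squares_while(int $n): array {
    $out = [];
    $i = 0;
    while ($i < $n) {
        $out[] = $i * $i;
        $i++;
    }
    return $out;
}

// continue in a for loop still runs $i++ before re-testing the condition.
function odds_for(int $n): array {
    $out = [];
    for ($i = 0; $i < $n; $i++) {
        if ($i % 2 === 0) {
            continue;
        }
        $out[] = $i;
    }
    return $out;
}
```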

foreach

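The manual implies that both forms of foreach are just sugar in terms of reset (which rewinds the array's internal pointer) and each (a built-in function that returns the current key/value pair and advances the pointer). Note that each was deprecated in PHP 7.2 and removed in PHP 8.0, and that since PHP 7 foreach no longer uses the internal array pointer, so this desugaring describes PHP 5 behaviour: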

foreach ($a as       $v) { s; }    ==>     reset($a); while (list(  , $v) = each($a)) { s; }
foreach ($a as $k => $v) { s; }    ==>     reset($a); while (list($k, $v) = each($a)) { s; }
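Since each is not available on current PHP versions, the following sketch emulates the same pointer-based loop with key, current and next, and compares it against a real foreach. This is an approximation of the desugaring above, not how Zend implements foreach; the function names are made up:

```php
<?php
// Sugared form: collect [key, value] pairs with foreach.
function pairs_foreach(array $a): array {
    $out = [];
    foreach ($a as $k => $v) {
        $out[] = [$k, $v];
    }
    return $out;
}

// Pointer-based form: key()/current()/next() stand in for each().
function pairs_pointer(array $a): array {
    $out = [];
    reset($a);
    // key() returns null once the pointer has moved past the last element.
    while (($k = key($a)) !== null) {
        $v = current($a);
        next($a);               // advance before running the body, as each() did
        $out[] = [$k, $v];
    }
    return $out;
}
```

Testing the key rather than the value keeps the loop going over elements whose value is false or null.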

Double-quoted string

The double-quoted string can be syntactically transformed into an expression using only single-quoted strings, e.g.

"foo $bar baz"  ==> 'foo ' . $bar . ' baz'

(This step is indeed taken by the Zend interpreter before execution.)
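The equivalence is easy to check for the simple case above. The two function names below are hypothetical:

```php
<?php
// Sugared: interpolation inside a double-quoted string.
function interpolated(string $bar): string {
    return "foo $bar baz";
}

// Desugared: single-quoted pieces joined with the concatenation operator.
function concatenated(string $bar): string {
    return 'foo ' . $bar . ' baz';
}
```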

Escaping

PHP's mechanism for escaping should be seen as echo in disguise:

?>foo bar baz<?php  ==> echo 'foo bar baz';
?>foo bar baz[EOF]  ==> echo 'foo bar baz';

This has one niggle: the PHP file starts out in escaped mode, so the file can simply be understood to start with a ?>; i.e., the first statement is always an echo. Thus, we can compile out the escaping to a language without escaping:

#!/usr/bin/env php
blah
<?php
echo 'foo';
?>

==>

echo "blah\n";
echo 'foo';
echo '';

This is indeed how Zend compiles the escaped text.
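One way to observe this from inside PHP is to capture output with output buffering and compare the escaped text against an explicit echo. A sketch, with made-up function names:

```php
<?php
// Text outside the PHP tags, captured via output buffering.
function render_inline(): string {
    ob_start();
    ?>foo bar baz<?php
    return ob_get_clean();
}

// The desugared form: an explicit echo of the same text.
function render_echo(): string {
    ob_start();
    echo 'foo bar baz';
    return ob_get_clean();
}
```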

PHP bytecode

PHP is described as an "interpreted language". However, this is a misnomer, and PHP can be compiled:

  • The Zend "interpreter" compiles a PHP file to bytecode, known as "opcodes", before execution. The compiled opcodes are pleasingly short. To view them, install the "Vulcan Logic Dumper" (VLD) PHP extension and run php -dvld.active=1 -dvld.execute=0 file.php. However, the opcodes are extremely underspecified.
  • The HipHop Virtual Machine (HHVM) compiles PHP to HipHop bytecode (HHBC), which has a much better specification. I don't know if it's possible to view the HHBC for a given file, though.

PHP can therefore be understood by specifying the compilation step, then specifying the semantics for the bytecode.

Resources
