Taint support for PHP

Wietse Venema
IBM T.J. Watson Research Center
Hawthorne, NY, USA

Introduction

This is a preliminary implementation of support for tainted variables in PHP. The goal is to help PHP application programmers find and eliminate opportunities for HTML script injection, SQL or shell code injection, or PHP control hijacking, before other people can exploit them. The implementation provides taint support for basic operators and for a selection of built-in functions and extensions. A list of what is implemented so far is at the end of this document.

The good news is that performance is better than I hoped it would be. However, the implementation is incomplete, so please don't be surprised when something is still missing. For example, I have not yet implemented taint support for object-specific operations, and taint checks assume that output has a Content-Type: of text/html. It also does not yet fully adhere to coding and documentation conventions. All this needs to be taken care of in future releases.

I need your feedback to make this code complete. I hope to do several quick 1-2 month release cycles in which I collect feedback, fill in missing things, and adjust course until things stabilize. Right now the code is based on PHP 5.2.3, but I expect to catch up with the current PHP release next time.

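This document covers the following topics:

A quick example
Introducing multiple flavors of taint
Using taint support with real PHP applications
Performance
Low-level implementation
Taint propagation policy
PHP core changes
Loose ends
Distant future
Feature summary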

A quick example

To give an idea of the functionality, consider this simple PHP program with an obvious HTML script injection bug:

<?php
$inputfield = $_GET['inputfield'];
echo "You entered: $inputfield\n";
?>

With default .ini settings, this program does exactly what the programmer wrote: it echoes the contents of the client's inputfield request attribute, including all the HTML script code that an attacker may have supplied along with it.

When I add one setting to a php.ini file, or the equivalent ini_set() call to the script itself, the program still produces the same output, but it also produces a warning:

Add to php.ini: taint_error_level = E_WARNING
Add to script: ini_set("taint_error_level", E_WARNING);
 
Warning: echo(): Argument contains data that is not converted with htmlspecialchars() or htmlentities() in /path/to/script on line 3

When I change the taint error level from E_WARNING to E_ERROR, script execution terminates before echo produces any output.

Finally, when I honor the warning message and convert the $inputfield value as shown below, the program becomes immune to HTML script injection and the warning message disappears.

<?php
$inputfield = htmlspecialchars($_GET['inputfield']);
echo "You entered: $inputfield\n";
?>

At this point I can either leave taint support turned on as a safety net in case someone introduces new mistakes, or I can disable taint support altogether. The run-time performance will not differ measurably, as long as the application does not trigger any alarms.

Introducing multiple flavors of taint

Conversion functions such as htmlspecialchars() exist not only for boring security reasons! They are also required for robustness. Without the proper output conversion, shell or SQL commands fail when given a legitimate name such as O'Reilly. Bugs like this are easily overlooked, because they trigger only with unusual data. However, these bugs are trivial to find with taint support, because you get the "missing conversion" warning message even when you test the program with ordinary data. This point is worth repeating, so I will repeat it now:

With taint support, you don't need malicious inputs to find out where a PHP script may have opportunities for HTML script injection, shell or SQL code injection, or PHP control hijacking.

To encourage programmers to use the RIGHT conversion function, I have implemented multiple flavors of taint. Each time data enters a PHP application from the web, from a database or from elsewhere, it may be "tainted" with zero or more taint flavors, so that the PHP engine can warn the programmer and suggest an appropriate conversion function.

In the case of the buggy example program, data is marked as "dangerous for use in HTML" (and other contexts :-) when it is received from the web. The echo() primitive detects the presence of this taint flavor in one of its arguments, issues a warning, and suggests using htmlspecialchars() or htmlentities().

The table below summarizes a number of taint flavors: it shows where a specific flavor may be added to data, where its presence may raise warnings, and how you get rid of the taint flavor. Please ignore the ugly TC_XXX names for now. That's low-level stuff that still needs to be hidden behind a user interface.

Taint flavor  When added                  Where it may raise warnings  How to remove
TC_HTML       Input from web or database  HTML output                  htmlspecialchars(),
                                                                       htmlentities()
TC_SHELL      Input from web or database  Shell command arguments      escapeshellcmd(),
                                                                       escapeshellarg()
TC_MYSQL      Input from web or database  mysql query parameters       mysql_escape_string(),
                                                                       mysql_real_escape_string()
TC_MYSQLI     Input from web or database  mysqli query parameters      mysqli_escape_string()
TC_SELF       Input from web              Parameters to eval(),        untaint($var, TC_SELF)
                                          include() and other
                                          operations that affect the
                                          PHP application itself

The fifth flavor, TC_SELF, is different from the other four. Instead of code injection, its purpose is to detect opportunities to hijack control over the PHP application itself. Currently, there is no conversion function that makes all data safe as input for eval(), include() etc. Instead, the application itself is supposed to verify that data is "good" and mark it as such. Until a better user interface exists, this means calling the low-level untaint() function directly.
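The flavor model described above can be pictured as a set of bit flags attached to each value. The following is a minimal sketch in C, not the actual implementation: the TC_XXX names come from the table, but the bit values, the struct, and the helper functions (from_web, convert, sink_would_warn) are hypothetical and exist only for illustration.

```c
#include <assert.h>

/* Flavor names follow the table above; the bit values are made up. */
enum {
    TC_NONE   = 0,
    TC_HTML   = 1 << 0,
    TC_SHELL  = 1 << 1,
    TC_MYSQL  = 1 << 2,
    TC_MYSQLI = 1 << 3,
    TC_SELF   = 1 << 4,
    TC_ALL    = TC_HTML | TC_SHELL | TC_MYSQL | TC_MYSQLI | TC_SELF
};

/* Hypothetical stand-in for a PHP string value plus its taint marks. */
struct tainted_str {
    const char *text;
    unsigned    marks;
};

/* Source: data from the web is marked with every flavor. */
struct tainted_str from_web(const char *text)
{
    struct tainted_str s = { text, TC_ALL };
    return s;
}

/* Conversion: clears exactly one flavor. A real conversion function
 * would also escape the text; that part is omitted here. */
struct tainted_str convert(struct tainted_str s, unsigned flavor)
{
    s.marks &= ~flavor;
    return s;
}

/* Sink: nonzero when the value carries a flavor this sink checks for. */
int sink_would_warn(struct tainted_str s, unsigned checked_flavors)
{
    return (s.marks & checked_flavors) != 0;
}

/* Self-check: an HTML conversion silences the HTML sink only. */
void flavor_demo(void)
{
    struct tainted_str in = from_web("O'Reilly");
    struct tainted_str html_safe = convert(in, TC_HTML);

    assert(sink_would_warn(in, TC_HTML));          /* raw input: warn */
    assert(!sink_would_warn(html_safe, TC_HTML));  /* converted: quiet */
    assert(sink_would_warn(html_safe, TC_SHELL));  /* wrong conversion for a shell */
}
```

The point of keeping one bit per flavor is visible in the last assertion: applying the HTML conversion does not make the value acceptable as a shell argument, so the programmer is steered toward the conversion that matches the sink.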

Using taint support with real PHP applications

I have built taint support with the cli, cgi and Apache loadable module SAPIs, and with the mysql and mysqli extensions. Other SAPIs and extensions will follow as time permits.

What about those other SAPIs and extensions? They will work just fine as long as you leave taint_error_level at its default setting. They may trigger false warnings when you raise the taint error level, because they don't know how to properly initialize certain bits that taint support relies on. This problem should not exist, but unfortunately there is a lot of PHP source code that does not use standard macros when initializing PHP data structures.

To build PHP with taint support:

# CLI and CGI
$ make distclean
$ ./configure --enable-taint --with-mysqli --with-mysql ...
$ make

# Apache module
$ make distclean
$ ./configure --enable-taint --with-apxs=/path/to/apxs ...
$ make

To experiment with taint support, copy the file taint_ini.php from the top-level PHP source directory to your PHP script directory, edit the file per the instructions below, and include it into the PHP script. The file begins like this:

# Enable warning messages without messing up web pages.
ini_set("taint_error_level", E_WARNING);
ini_set("log_errors", true);
ini_set("display_errors", false);

# Uncomment one of these if you don't want to log to the server's log.
# ini_set("error_log", "syslog");
# ini_set("error_log", "/path/to/errorlog");

# Temporary workaround to avoid false alarms. Unfortunately, $_SERVER[]
# contains a mixed bag of data: some is safe, and some highly dangerous.
untaint($_SERVER["SCRIPT_FILENAME"]);
untaint($_SERVER["PHP_SELF"]);
untaint($_SERVER["DOCUMENT_ROOT"]);
untaint($_SERVER["HTTP_HOST"]);	# Not entirely safe.
. . . several other lines . . .

Notes:

Performance

The performance is quite good: the overhead for "make test" is within 1-2% when comparing the user-mode CPU time of unmodified PHP against a PHP version with taint support (the number depends on the CPU used and on build options). I know that a fraction of that time is spent in non-PHP processing, but the bulk is spent in PHP and that is what really matters. If a better "macro" benchmark exists then I am of course interested.

The "bench.php" script that comes with PHP source is even less representative of applications: it is a loop-intensive affair that doesn't do any input or output. Nevertheless, it suffers only a modest overhead of 1-3%. This is good enough for a start; I can try to squeeze out more CPU cycles later if necessary.

As long as the application triggers no warnings, it does not make a measurable difference whether taint support is turned on or not. This is due to the way the support is implemented. Without going into detail, the trick is to avoid introducing extra conditional or unconditional jumps in the critical path.
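The text does not spell out how jumps are avoided, so the following C sketch is only one plausible reading, not the actual code: taint marks are combined with a plain bitwise OR that runs unconditionally, so the hot path contains no "is this value tainted?" test. The struct and function names are hypothetical.

```c
#include <assert.h>

typedef unsigned char taint_marks_t;

/* Hypothetical stand-in for an integer-valued memory cell. */
struct cell {
    long          value;
    taint_marks_t marks;
};

/* Addition with taint propagation. The marks are merged with a single
 * OR instruction; no branch is taken whether or not either operand is
 * tainted, and untainted operands (marks == 0) yield marks == 0. */
struct cell add_cells(struct cell a, struct cell b)
{
    struct cell r;
    r.value = a.value + b.value;
    r.marks = (taint_marks_t)(a.marks | b.marks);
    return r;
}

/* Self-check: the result carries the union of the operand marks. */
void propagate_demo(void)
{
    struct cell clean = { 1, 0 };
    struct cell dirty = { 2, 0x05 };

    assert(add_cells(clean, clean).marks == 0);
    assert(add_cells(clean, dirty).marks == 0x05);
    assert(add_cells(clean, dirty).value == 3);
}
```

Under this reading, a branch is needed only at a sink, where the marks are compared against the configured checks, which is consistent with the observation that cost shows up only when a warning is actually triggered.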

Low-level implementation

Taint support is implemented with some of the unused bits in the zval data structure. The zval is the PHP equivalent of a memory cell. Besides a type (string, integer, etc.) and value, each zval has a reference count and a flag that says whether the zval is a reference to yet another zval that contains the actual value.

Right now I am using seven bits, but there is room for more: 32-bit UNIX compilers such as GCC add 16 bits of padding to the current zval data structure, and this amount of padding isn't going to be smaller on 64-bit architectures. If I really have to squeeze the taint bits in-between the existing bits, the taint support performance hit goes up. If squeezing is necessary, all PHP code will need to be changed to use official initialization macros, so that expensive shift/mask operations can be avoided as much as possible.
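The padding argument can be checked with a small C sketch. The structs below are simplified models of the zval layout described above (value, reference count, type, reference flag), not the real PHP definitions, and the taint_marks field name is an assumption. On common 32-bit and 64-bit ABIs the compiler pads the struct to a multiple of its alignment, so one extra byte fits without growing it; the exact result is compiler- and ABI-dependent.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of the value union: integer, float, or string. */
union value_model {
    long   lval;
    double dval;
    struct { char *val; int len; } str;
};

/* Model of the layout without taint support. */
struct zval_model {
    union value_model value;
    unsigned int      refcount;
    unsigned char     type;
    unsigned char     is_ref;
    /* the compiler adds tail padding here */
};

/* Same layout with a hypothetical taint byte placed in the padding. */
struct zval_model_tainted {
    union value_model value;
    unsigned int      refcount;
    unsigned char     type;
    unsigned char     is_ref;
    unsigned char     taint_marks;
};

/* Number of bytes the taint byte adds to each cell. */
size_t taint_size_cost(void)
{
    return sizeof(struct zval_model_tainted) - sizeof(struct zval_model);
}

/* Self-check: expected to hold on typical x86 and x86-64 compilers. */
void padding_demo(void)
{
    assert(taint_size_cost() == 0);
}
```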

The preliminary configuration user interface is rather low-level, somewhat like MS-DOS file permissions :-( This is good enough for testing and debugging the taint support itself, but I would not want to have wires hanging out of the machine like this forever. The raw bits will need to be encapsulated so that applications can work with meaningful names and abstractions.

To give an idea of what the nuts and bolts look like, this is the preliminary list of bits, or should I say: binary properties, together with the parameters that control their handling:

Taint propagation policy

Before implementing the above policies, the first order of business was adding taint propagation to the PHP core: for each operator, including type conversion, a decision had to be made how to propagate taint from source operands to results.

The general taint propagation rules are:

Most of this taint propagation is finished, but there are a few minor issues that still need to be resolved.

While adding taint propagation I found that a lot of PHP source code fails to use the official macros when initializing a zval. In these cases I added another line of code to initialize the taint bits by hand. Also, more internal documentation (other than empty man page skeletons) could have reduced development time.

PHP core changes

To make the implementation manageable, most of the taint-specific code is implemented as one-line macro calls that either implement taint support, or that expand into nothing. This avoids massive amounts of scar tissue with #ifdef . . #endif around small pieces of code. These macros are defined (and documented!) in the file Zend/taint_marks.h.

In some cases an internal API had to be extended with an extra argument to propagate taint information. Where possible I preserved the old API as a #define that invokes the new API with a default taint argument, so that old code still compiles and works (unfortunately this trick is not possible with SAPI calls that are made through function pointers that are being passed around via data structures). Here is an example for the core function that copies a string into a hash. The change is an extra argument with the taint marks of the input string. In the example, the TAINT_MARKS_CC and TAINT_MARKS_DC macros are very much like the macros used by ZTS (thread-safe resources) support. They expand into nothing when taint support is not compiled in.

Old API:

ZEND_API int add_assoc_string_ex(zval *arg, char *key, uint
 key_len, char *str, int duplicate);

New API:

#define add_assoc_string_ex(__arg, __key, __len, __str, __duplicate) \
 add_assoc_string_ex_t(__arg, __key, __len, __str TAINT_MARKS_CC(TC_NONE), \
 __duplicate)

ZEND_API int add_assoc_string_ex_t(zval *arg, char *key, uint
 key_len, char *str TAINT_MARKS_DC(taint_marks), int duplicate);
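To make the macro trick concrete, here is a self-contained C sketch of the same pattern. The macro bodies are an assumption modelled on the ZTS macros TSRMLS_DC and TSRMLS_CC, which expand to a leading comma plus a parameter or argument; the actual definitions live in Zend/taint_marks.h and may differ. The store_string names and struct entry are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define HAVE_TAINT 1    /* remove this line to build without taint support */

#ifdef HAVE_TAINT
typedef unsigned char TAINT_MARKS_T;
# define TC_NONE 0
# define TAINT_MARKS_DC(name) , TAINT_MARKS_T name  /* extra parameter in a declaration */
# define TAINT_MARKS_CC(expr) , expr                /* extra argument in a call */
#else
# define TAINT_MARKS_DC(name)                       /* expands into nothing */
# define TAINT_MARKS_CC(expr)
#endif

/* Hypothetical container standing in for a hash entry. */
struct entry {
    char          text[32];
    unsigned char marks;
};

/* New API: takes the taint marks of the input string when enabled. */
int store_string_t(struct entry *e, const char *str TAINT_MARKS_DC(taint_marks))
{
    strncpy(e->text, str, sizeof(e->text) - 1);
    e->text[sizeof(e->text) - 1] = '\0';
#ifdef HAVE_TAINT
    e->marks = taint_marks;
#else
    e->marks = 0;
#endif
    return 0;
}

/* Old API, preserved as a macro that supplies a default taint argument. */
#define store_string(e, str) store_string_t(e, str TAINT_MARKS_CC(TC_NONE))

/* Self-check: old callers compile unchanged and produce untainted data. */
void macro_demo(void)
{
    struct entry e;

    assert(store_string(&e, "abc") == 0);
    assert(strcmp(e.text, "abc") == 0);
    assert(e.marks == 0);
}
```

With HAVE_TAINT undefined, both macros vanish and store_string_t() has its original two-parameter signature, so a single source line serves both builds without an #ifdef at each call site.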

The zend_parse_parameters() API was also extended, so that I could propagate the taint bits from function input arguments to function outputs, and so that I could enforce taint checks on input arguments. To the existing type modifiers |, ! and / I added two more: ` and '. Their meaning is defined in the table below. The example after the table is a fragment from the basename() function.

Type modifier  Meaning
`              Copy taint marks from the current PHP-level argument. The destination pointer is specified with the next C-level zend_parse_parameters() argument.
'              Enforce taint check on the current PHP-level argument. The taint check is specified with the next C-level zend_parse_parameters() argument.

PHP_FUNCTION(basename)
{
	. . .
#ifdef HAVE_TAINT
	TAINT_MARKS_T taint_marks = 0;

	if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s`|s'",
			&string, &string_len, &taint_marks,
			&suffix, &suffix_len, EG(taint_checks_self)) == FAILURE) {
		return;
	}
#else
	. . . old zend_parse_parameters() call . . .
#endif

With this change, the taint bits are copied from the string input argument to the taint_marks variable, which is used later to update the taint marks of the function result value. The suffix input argument is checked for taint that indicates it could be under attacker control: a malicious user who controls the suffix controls what part is removed from the end of the function result value, which may be undesirable. In this case I haven't figured out a way to hide the changes behind a bunch of macros. Perhaps someone will have a stroke of genius after seeing this.
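The behavior of the two modifiers can be illustrated with a toy parser in C. This is a simplified model, not zend_parse_parameters() itself: it handles only the s, |, ` and ' characters, the s specifier stores just a pointer (the real one also stores a length), and struct php_arg and parse_args() are hypothetical names.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

typedef unsigned char taint_marks_t;

/* Hypothetical stand-in for a PHP-level string argument. */
struct php_arg {
    const char   *str;
    taint_marks_t marks;
};

/*
 * Toy model of the extended type specification:
 *   s  string argument; next C argument is a const char ** destination
 *   `  copy marks of the current argument; next C argument is a taint_marks_t *
 *   '  taint check on the current argument; next C argument is an int mask
 *   |  the arguments that follow are optional
 * Returns 0 on success, -1 on a missing argument or a failed taint check.
 */
int parse_args(int argc, const struct php_arg *argv, const char *spec, ...)
{
    va_list ap;
    int i = -1;        /* index of the current PHP-level argument */
    int optional = 0;
    int rc = 0;

    va_start(ap, spec);
    for (; *spec != '\0' && rc == 0; spec++) {
        switch (*spec) {
        case '|':
            optional = 1;
            break;
        case 's': {
            const char **dest = va_arg(ap, const char **);
            i++;
            if (i < argc)
                *dest = argv[i].str;
            else if (!optional)
                rc = -1;
            break;
        }
        case '`': {
            /* Always consume the C argument, even if the PHP one is absent. */
            taint_marks_t *dest = va_arg(ap, taint_marks_t *);
            if (i >= 0 && i < argc)
                *dest = argv[i].marks;
            break;
        }
        case '\'': {
            int mask = va_arg(ap, int);
            if (i >= 0 && i < argc && (argv[i].marks & mask) != 0)
                rc = -1;
            break;
        }
        default:
            rc = -1;
        }
    }
    va_end(ap);
    return rc;
}

/* Self-check mirroring the basename() specification "s`|s'". */
void parse_demo(void)
{
    struct php_arg args[2] = { { "/a/b.php", 3 }, { ".php", 0 } };
    const char *path = NULL, *suffix = NULL;
    taint_marks_t marks = 0;

    assert(parse_args(2, args, "s`|s'", &path, &marks, &suffix, 16) == 0);
    assert(marks == 3);      /* marks of the path argument were copied */
    args[1].marks = 16;      /* now the suffix carries the checked flavor */
    assert(parse_args(2, args, "s`|s'", &path, &marks, &suffix, 16) == -1);
}
```

Because each modifier consumes exactly one extra C-level argument, it applies to whichever PHP-level argument was parsed most recently, which is why the position of ` and ' inside the specification string matters.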

Loose ends

I already mentioned the loose wires hanging out of the machine; the user interface for taint policy control will need to be made more suitable for people who aren't primarily interested in PHP core hacking.

Support for tainted objects is still incomplete. In particular, conversions between objects and non-objects may lose taint bits.

For now, I manually added taint support to a number of standard built-ins (file, process, *scanf, *printf, and a subset of the string functions) and extensions (mysql, mysqli). I hope this will be sufficient to get some experience with taint support.

Taint-unaware SAPIs and extensions will work properly as long as the taint error level is left at its default (i.e. disabled), and as long as these extensions are recompiled with the patched PHP header files. When taint checking is turned on, some SAPIs or extensions may trigger false alarms when they fail to use the official macros to initialize zval structures, thereby leaving some taint bits at uninitialized values.

I still hope that it will somehow be possible to annotate extensions so that taint support can be added without modifying lots of extension source code. However, having multiple flavors of taint, instead of just one, will make the job so much more interesting.

Other items on the TODO list:

Distant future

Currently, only data is labeled (and only with binary attributes). No corresponding attributes exist for sources and sinks (files, network connections, databases, authenticated users, etc.). If we knew that a connection is encrypted, or whether something is an intranet or extranet destination, or who the user is at the other end, then we could implement more sophisticated policies than the simple MS-DOS-like file permissions that I have implemented now.

But all this is miles beyond the immediate problem that I am trying to solve today: helping programmers find the holes in their own code before other people do.

Feature summary

This is the preliminary list of implemented features. The default taint marking and checking policies are good enough to gain some experience with taint support, and will have to be refined in the light of experience.


php.ini settings

taint_error_level (default: 0)
    error level for taint check warnings

taint_checks_shell (default: TC_SHELL)
    taint flavors detected in shell commands; use TC_SHELL to detect code injection opportunities
taint_checks_html (default: TC_HTML)
    taint flavors detected in HTML output; use TC_HTML to detect code injection opportunities
taint_checks_mysql (default: TC_MYSQL)
    taint flavors detected in mysql commands; use TC_MYSQL to detect code injection opportunities
taint_checks_mysqli (default: TC_MYSQLI)
    taint flavors detected in mysqli commands; use TC_MYSQLI to detect code injection opportunities
taint_checks_self (default: TC_SELF)
    taint flavors detected in eval(), include(), etc.; use TC_SELF to detect control hijack possibility
taint_checks_user1 (default: TC_USER1)
    application-controlled taint flavor
taint_checks_user2 (default: TC_USER2)
    application-controlled taint flavor

taint_marks_egpcs (default: TC_ALL)
    taint flavors added to data from the web (environment, get, post, cookie, server)
taint_marks_dbms (default: TC_SHELL | TC_HTML | TC_MYSQL | TC_MYSQLI)
    taint flavors added to data from database
taint_marks_other (default: 0)
    taint flavors added to data from other external sources

taint_marks_non_str (default: TC_SELF)
    taint flavors preserved when converting string to number or bool
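
The effect of the taint_marks_non_str setting can be sketched in C as a mask applied during conversion. This is an illustrative model, not the actual code: the bit values, the two cell structs, and to_number() are hypothetical. A plausible rationale, stated here as an assumption, is that a number cannot carry markup or quotes but can still steer the application, for example as a length or offset.

```c
#include <assert.h>
#include <stdlib.h>

/* Flavor names from this document; the bit values are made up. */
enum { TC_HTML = 1, TC_SHELL = 2, TC_MYSQL = 4, TC_MYSQLI = 8, TC_SELF = 16 };

/* Model of the taint_marks_non_str setting with its documented default. */
static unsigned taint_marks_non_str = TC_SELF;

struct str_cell { const char *text; unsigned marks; };
struct num_cell { long value; unsigned marks; };

/* String-to-number conversion: only the flavors listed in
 * taint_marks_non_str survive; all other marks are dropped. */
struct num_cell to_number(struct str_cell s)
{
    struct num_cell n;
    n.value = strtol(s.text, NULL, 10);
    n.marks = s.marks & taint_marks_non_str;
    return n;
}

/* Self-check: HTML and shell marks are dropped, TC_SELF is kept. */
void non_str_demo(void)
{
    struct str_cell s = { "42", TC_HTML | TC_SHELL | TC_SELF };
    struct num_cell n = to_number(s);

    assert(n.value == 42);
    assert(n.marks == TC_SELF);
}
```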

core

arithmetic operators
    propagate taint marks
bit-wise operators
    propagate taint marks
relational operators
    don't propagate taint marks
boolean operators
    partial propagation, may be removed entirely
zend_parse_parameters()
    additional type modifiers: ` reports the taint marks of a PHP argument, and ' enforces a taint check on a PHP argument
echo, print
    detect html injection possibility
eval, include
    detect control hijack possibility
exit
    detect html injection possibility

dir extension

opendir()
    detect control hijack possibility via pathname argument

exec extension

exec(), system(), passthru()
    detect shell command injection possibility
    detect html injection possibility
    taint mark input from command depending on taint_marks_other setting
escapeshellcmd(), escapeshellarg()
    propagate taint marks except TC_SHELL
shell_exec()
    detect shell command injection possibility
    taint mark input from command depending on taint_marks_other setting
proc_nice()
    detect control hijack possibility via priority argument

file extension

flock()
    detect control hijack possibility via operation argument
get_meta_tags()
    detect control hijack possibility via pathname, include_path
    taint mark input from file depending on taint_marks_other setting
file_get_contents()
    detect control hijack possibility via pathname, include path, offset, maxlen
    taint mark input from file depending on taint_marks_other setting
file_put_contents()
    detect control hijack possibility via pathname, flags
file()
    detect control hijack possibility via pathname, flags
    taint mark input from file depending on taint_marks_other setting
tempnam()
    detect control hijack possibility via both arguments
fopen()
    detect control hijack possibility via pathname, mode, include path arguments
popen()
    detect shell command injection possibility
    detect control hijack possibility via mode argument
fgets()
    detect control hijack possibility via length argument
    taint mark input from stream depending on taint_marks_other setting
fgetc()
    taint mark input from stream depending on taint_marks_other setting
fgetss()
    detect control hijack possibility via length, allowable tags
    taint mark input from stream depending on taint_marks_other setting
fscanf()
    detect control hijack possibility via format string
    taint mark input from stream depending on taint_marks_other setting
fseek()
    detect control hijack possibility via offset, whence
mkdir()
    detect control hijack possibility via pathname, mode, recursive arguments
rmdir()
    detect control hijack possibility via pathname argument
readfile()
    detect control hijack possibility via pathname, include path arguments
    taint mark input from file depending on taint_marks_other setting
    detect html injection possibility (depending on taint_marks_other setting)
umask()
    detect control hijack possibility via mode argument
fpassthru()
    taint mark input from file depending on taint_marks_other setting
    detect html injection possibility (depending on taint_marks_other setting)
rename()
    detect control hijack possibility via old name and new name arguments
unlink()
    detect control hijack possibility via pathname
ftruncate()
    detect control hijack possibility via size argument
copy()
    detect control hijack possibility via source or target arguments
fread()
    detect control hijack possibility via length argument
    taint mark input from stream depending on taint_marks_other setting
fgetcsv()
    taint mark input from stream depending on taint_marks_other setting
realpath()
    propagate taint marks from input argument
fnmatch()
    detect control hijack possibility via pattern or flags arguments

formatted_print extension

printf(), fprintf(), sprintf()
    detect control hijack possibility via format string
    propagate taint marks from input arguments
    detect html injection possibility (printf() only)

head extension

header()
    detect control hijack possibility via header name, replace, response code arguments

html extension

htmlentities(), htmlspecialchars()
    detect control hijack possibility via quote_style, charset, double_encode arguments
    propagate all taint marks except TC_HTML

mysql extension

mysql_connect()
    detect control hijack possibility via host, username, password
mysql_escape_string(), mysql_real_escape_string()
    propagate taint marks except TC_MYSQL
mysql_select_db()
    detect control hijack possibility via database name argument
mysql_query()
    detect sql injection possibility via query argument
mysql_fetch_array()
    detect control hijack possibility via result_type argument

mysqli extension

mysqli_connect()
    detect control hijack possibility via host, username, password
mysqli_real_escape_string()
    propagate taint marks except TC_MYSQLI
mysqli_select_db()
    detect control hijack possibility via database name argument
mysqli_query()
    detect sql injection possibility via query argument
mysqli_fetch_array()
    detect control hijack possibility via result_type argument

proc_open extension

proc_open()
    detect shell command injection possibility
    detect control hijack possibility via pathname argument

string extension

strcspn(), strspn()
    detect control hijack possibility via string2, start, length
trim(), rtrim(), ltrim()
    detect control hijack possibility via charlist argument
    propagate taint marks from input string
wordwrap()
    detect control hijack possibility via line width, break and cut arguments
    propagate taint marks from input string
explode()
    detect control hijack possibility via delimiter and limit arguments
    propagate taint marks from input string
implode()
    propagate taint marks from delimiter and input array members
strtok()
    detect control hijack possibility via delimiter argument
    propagate taint marks from input string
basename()
    detect control hijack possibility via suffix
    propagate taint marks from input pathname
dirname()
    propagate taint marks from input pathname
pathinfo()
    propagate taint marks from input pathname to dirname, basename, extension, filename results
stristr(), strstr()
    propagate taint marks from haystack argument
strpos()
    propagate taint marks from haystack argument
strchr(), strrchr()
    propagate taint marks from haystack argument
chunk_split()
    detect control hijack possibility via chunklen, end arguments
    propagate taint marks from input string
substr()
    detect control hijack possibility via start, length arguments
    propagate taint marks from input string
quotemeta()
    propagate taint marks from input string
ord()
    propagate taint marks from input string, subject to taint_marks_non_str setting
chr()
    propagate taint marks from input argument
ucfirst(), ucwords()
    propagate taint marks from input argument
strip_tags()
    detect control hijack possibility via allowable_tags argument
    propagate taint marks from input argument
parse_str()
    detect control hijack possibility if target is global name space
    propagate taint marks from input string
sscanf()
    detect control hijack possibility via format string
    propagate taint from input string
str_word_count()
    detect control hijack possibility via format, charlist arguments
    propagate taint from input string
money_format()
    detect control hijack possibility via format, charlist arguments
    propagate taint from number argument
str_split()
    detect control hijack possibility via length argument
    propagate taint from input string
substr_compare()
    detect control hijack possibility via offset, length, case sensitivity argument

taint extension

istainted(mixed expr)
    return taint bits from argument
taint(variable [, taint_mark])
    raise the specified taint bits on a variable (default: all)
untaint(variable [, taint_mark])
    clear the specified taint bits on a variable (default: all)