CAPEC 80

Using UTF-8 Encoding to Bypass Validation Logic

Attack Pattern ID: 80 (Detailed Attack Pattern Completeness: Complete)

Typical Severity: High

Status: Draft

Description

Summary

This attack is a specific variation on leveraging alternate encodings to bypass validation logic. The attacker encodes potentially harmful input in UTF-8 and submits it to an application that does not expect this encoding or does not validate it effectively, which makes input filtering difficult. UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. Legal UTF-8 characters are one to four bytes long. However, early versions of the UTF-8 specification got some entries wrong (in some cases they permitted overlong characters). UTF-8 encoders are supposed to use the "shortest possible" encoding, but naive decoders may accept encodings that are longer than necessary. According to RFC 3629, a particularly subtle form of this attack can be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters.
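To make the notion of an overlong character concrete, the following sketch builds two-, three- and four-byte encodings of a code point that only needs one byte, and checks how Python's built-in strict decoder reacts. The helper names overlong_encode and rejected_by_strict_decoder are hypothetical and exist only for this illustration.

```python
def overlong_encode(code_point: int, length: int) -> bytes:
    """Encode code_point in exactly `length` bytes (2 to 4), even if fewer would do.

    Hypothetical helper for illustration: when `length` exceeds the shortest
    form, the result is ill-formed (overlong) UTF-8.
    """
    limits = {2: 0x7FF, 3: 0xFFFF, 4: 0x1FFFFF}
    if length not in limits or not 0 <= code_point <= limits[length]:
        raise ValueError("unsupported length or code point out of range")
    lead_marker = {2: 0xC0, 3: 0xE0, 4: 0xF0}[length]
    # The lead byte carries the top bits; each trail byte carries 6 bits.
    out = [lead_marker | (code_point >> (6 * (length - 1)))]
    for shift in range(6 * (length - 2), -1, -6):
        out.append(0x80 | ((code_point >> shift) & 0x3F))
    return bytes(out)


def rejected_by_strict_decoder(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder refuses `data`."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    for n in (2, 3, 4):
        encoded = overlong_encode(0x2F, n)  # 0x2F is the slash character
        print(n, encoded.hex(" "), rejected_by_strict_decoder(encoded))
```

A decoder that follows the current specification refuses all three forms; a decoder that simply strips the marker bits and concatenates the payload bits would turn each of them back into a slash.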

Attack Execution Flow

Explore

Survey the application for user-controllable inputs:

Using a browser or an automated tool, an attacker follows all public links and actions on a web site. He records all the links, the forms, the resources accessed and all other potential entry-points for the web application.

Attack Step Techniques

ID	Attack Step Technique Description	Environments
1	Use a spidering tool to follow and record all links and analyze the web pages to find entry points. Make special note of any links that include parameters in the URL.	env-Web
2	Use a proxy tool to record all user input entry points visited during a manual traversal of the web application.	env-Web
3	Use a browser to manually explore the website and analyze how it is constructed. Many browser plugins are available to facilitate the analysis or automate the discovery.	env-Web

Indicators

ID	type	Indicator Description	Environments
1	Positive	Inputs are used by the application or the browser (DOM)	env-Web
2	Inconclusive	Using URL rewriting, parameters may be part of the URL path.	env-Web
3	Inconclusive	No parameters appear to be used on the current page. Even though none appear, the web application may still use them if they are provided.	env-Web
4	Negative	Applications that have only static pages or that simply present information without accepting input are unlikely to be susceptible.	env-Web

Outcomes

ID	type	Outcome Description
1	Success	A list of URLs, with their corresponding parameters (POST, GET, COOKIE, etc.) is created by the attacker.
2	Success	A list of application user interface entry fields is created by the attacker.
3	Success	A list of resources accessed by the application is created by the attacker.

Security Controls

ID	type	Security Control Description
1	Detective	Monitor velocity of page fetching in web logs. Humans who view a page and select a link from it will click far slower and far less regularly than tools. Tools make requests very quickly and the requests are typically spaced apart regularly (e.g. 0.8 seconds between them).
2	Detective	Create links on some pages that are visually hidden from web browsers. Using IFRAMES, images, or other HTML techniques, the links can be hidden from web browsing humans, but visible to spiders and programs. A request for the page, then, becomes a good predictor of an automated tool probing the application.
3	Preventative	Use CAPTCHA to prevent the use of the application by an automated tool.
4	Preventative	Actively monitor the application and either deny or redirect requests from origins that appear to be automated.
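The velocity-monitoring control in the table above can be sketched as a check on the spacing between one client's requests. This is only an illustration: the function name looks_automated and both thresholds are assumptions, and a real deployment would need to tune them against its own logs.

```python
from statistics import mean, pstdev
from typing import Sequence


def looks_automated(
    timestamps: Sequence[float],
    max_mean_gap: float = 1.0,   # assumed threshold, in seconds
    max_jitter: float = 0.1,     # assumed threshold, in seconds
    min_requests: int = 5,
) -> bool:
    """Flag a client whose requests are both fast and regularly spaced.

    `timestamps` are request times in seconds, sorted in ascending order.
    """
    if len(timestamps) < min_requests:
        return False  # too little data to judge
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    # Tools tend to have a short average gap and very little variation in it.
    return mean(gaps) <= max_mean_gap and pstdev(gaps) <= max_jitter


if __name__ == "__main__":
    spider = [0.0, 0.8, 1.6, 2.4, 3.2, 4.0]
    human = [0.0, 12.5, 15.0, 47.2, 60.1, 95.0]
    print(looks_automated(spider), looks_automated(human))
```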

Experiment

Probe entry points to locate vulnerabilities:

The attacker uses the entry points gathered in the "Explore" phase as a target list and injects various UTF-8 encoded payloads to determine if an entry point actually represents a vulnerability with insufficient validation logic and to characterize the extent to which the vulnerability can be exploited.

Attack Step Techniques

ID	Attack Step Technique Description	Environments
1	Try to use UTF-8 encoding of content in Scripts in order to bypass validation routines.	env-Web
2	Try to use UTF-8 encoding of content in HTML in order to bypass validation routines.	env-Web
3	Try to use UTF-8 encoding of content in CSS in order to bypass validation routines.	env-Web

Indicators

ID	type	Indicator Description	Environments
1	Positive	The application accepts user-controllable input.	env-Web

Outcomes

ID	type	Outcome Description
1	Success	The attacker's UTF-8 encoded payload is processed and acted on by the application without filtering or transcoding.
2	Failure	The application decodes the charset and filters the inputs.

Security Controls

ID	type	Security Control Description
1	Preventative	Implement input validation routines that filter or transcode for UTF-8 content.
2	Preventative	Specify the charset of the HTTP transaction/content.
3	Detective	Monitor inputs to web servers. Alert on unusual charset and/or characters.
4	Preventative	Actively monitor the application and either deny or redirect requests from origins that appear to be attack attempts.

Attack Prerequisites

The application's UTF-8 decoder accepts and interprets illegal UTF-8 characters or non-shortest-form UTF-8 encodings.

Input filtering and validation are not done properly, which leaves the door open to characters that are harmful to the target host.

Typical Likelihood of Exploit

Likelihood: High

Methods of Attack

Injection
Protocol Manipulation
API Abuse

Examples-Instances

Description

Perhaps the most famous UTF-8 attack was against unpatched Microsoft Internet Information Server (IIS) 4 and IIS 5 servers. If an attacker made a request such as http://servername/scripts/..%c0%af../winnt/system32/cmd.exe, the server didn't correctly handle %c0%af in the URL. What do you think %c0%af means? It's 11000000 10101111 in binary. The UTF-8 mapping rules for a two-byte sequence are 110xxxxx 10xxxxxx, so the payload bits are 00000 from the first byte and 101111 from the second. Therefore, the character is 00000101111, or 0x2F, the slash (/) character! The %c0%af is an invalid UTF-8 representation of the / character. Such an invalid UTF-8 escape is often referred to as an overlong sequence.

So when the attacker requested the tainted URL, he accessed http://servername/scripts/../../winnt/system32/cmd.exe. In other words, he walked out of the script's virtual directory, which is marked to allow program execution, up to the root and down into the system32 directory, where he could pass commands to the command shell, Cmd.exe.
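The bit arithmetic in this example, and the effect of checking the path before decoding it, can be reproduced with a toy decoder. This is a deliberately flawed sketch and not the actual IIS code: the decoder, the "/../" filter, and all function names are assumptions made for illustration.

```python
from urllib.parse import unquote_to_bytes


def naive_decode_two_byte(lead: int, trail: int) -> int:
    """Combine a 110xxxxx 10xxxxxx pair WITHOUT the shortest-form check (flawed on purpose)."""
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


def naive_utf8_decode(data: bytes) -> str:
    """Toy decoder: handles only 1- and 2-byte forms and accepts overlong ones."""
    chars = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte & 0xE0 == 0xC0 and i + 1 < len(data):
            chars.append(chr(naive_decode_two_byte(byte, data[i + 1])))
            i += 2
        else:
            chars.append(chr(byte))
            i += 1
    return "".join(chars)


path = "/scripts/..%c0%af../winnt/system32/cmd.exe"
raw = unquote_to_bytes(path)            # percent-decoding only, still bytes

# Hypothetical filter that runs BEFORE UTF-8 decoding: it sees no "/../".
filter_passes = b"/../" not in raw

# The flawed decoder then turns C0 AF into a slash.
decoded = naive_utf8_decode(raw)

if __name__ == "__main__":
    print(hex(naive_decode_two_byte(0xC0, 0xAF)))
    print(filter_passes)
    print(decoded)
```

The filter passes because the byte pair C0 AF is not a slash at the byte level, yet the decoded path contains the traversal sequence the filter was meant to block.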

Related Vulnerabilities

CVE-2000-0884

Attacker Skills or Knowledge Required

Skill or Knowledge Level: Low

An attacker can inject a different representation of a filtered character in UTF-8 format.

Skill or Knowledge Level: Medium

An attacker may craft subtle encoding of input data by using the knowledge that she has gathered about the target host.

Probing Techniques

The attacker may try to inject dangerous characters using alternative UTF-8 representations, such as invalid or overlong UTF-8 sequences. The attacker hopes that the targeted system does not filter all the possible representations of the malicious characters. Malicious inputs can be sent through an HTML form or directly encoded in the URL.

The attacker can use scripts or automated tools to probe for poor input filtering.

Indicators-Warnings of Attack

A web page that contains overly long UTF-8 codes constitutes a protocol anomaly and could be an indication that an attacker is attempting to exploit a vulnerability on the target host.

An attacker can use a fuzzer to probe for a UTF-8 encoding vulnerability. The fuzzer is likely to generate suspicious network activity that an intrusion detection system can notice.

An IDS filtering network traffic may be able to detect illegal UTF-8 characters.

Obfuscation Techniques

According to OWASP, sometimes cross-site scripting attackers attempt to hide their attacks in Unicode encoding.

Solutions and Mitigations

The Unicode Consortium recognized multiple representations to be a problem and has revised the Unicode Standard to make multiple representations of the same code point with UTF-8 illegal. The UTF-8 Corrigendum lists the newly restricted UTF-8 range (see References). Many current applications may not have been revised to follow this rule. Verify that your application conforms to the latest UTF-8 encoding specification. Pay extra attention to the filtering of illegal characters.

The exact response required from a UTF-8 decoder on invalid input is not uniformly defined by the standards. In general, there are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:

1. Insert a replacement character (e.g. '?' or U+FFFD, the Unicode replacement character).

2. Ignore the bytes.

3. Interpret the bytes according to a different character encoding (often the ISO-8859-1 character map).

4. Not notice and decode as if the bytes were some similar bit of UTF-8.

5. Stop decoding and report an error (possibly giving the caller the option to continue).

It is possible for a decoder to behave in different ways for different types of invalid input.
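Several of these behaviours can be observed directly with the error handlers of Python's built-in decoder. The sketch below feeds it the overlong slash C0 AF between two ASCII letters; the variable names are arbitrary. Behaviour 4 (silently decoding the bytes as if they were valid) is not something the built-in decoder offers.

```python
bad = b"a\xc0\xafb"  # 'a', an overlong slash, 'b'

results = {
    # Behaviour 1: substitute U+FFFD for each offending byte.
    "replace": bad.decode("utf-8", errors="replace"),
    # Behaviour 2: drop the offending bytes.
    "ignore": bad.decode("utf-8", errors="ignore"),
    # Behaviour 3: reinterpret the bytes as ISO-8859-1.
    "latin-1": bad.decode("latin-1"),
}

# Behaviour 5: stop and report an error (the default, errors="strict").
try:
    bad.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

if __name__ == "__main__":
    for name, text in results.items():
        print(name, ascii(text))
    print("strict", strict_error)
```

Note that dropping bytes joins the text on either side of them, so an "ignore" policy can itself create a string that a filter would have rejected.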

RFC 3629 only requires that UTF-8 decoders must not decode "overlong sequences" (where a character is encoded in more bytes than needed but still follows the general multi-byte bit patterns of UTF-8). The Unicode Standard requires a Unicode-compliant decoder to "...treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."

Overlong forms are one of the most troublesome types of UTF-8 data. The current RFC says they must not be decoded but older specifications for UTF-8 only gave a warning and many simpler decoders will happily decode them. Overlong forms have been used to bypass security validations in high profile products including Microsoft's IIS web server. Therefore, great care must be taken to avoid security issues if validation is performed before conversion from UTF-8, and it is generally much simpler to handle overlong forms before any input validation is done.

To maintain security in the case of invalid input, there are two options. The first is to decode the UTF-8 before doing any input validation checks. The second is to use a decoder that, in the event of invalid input, returns either an error or text that the application considers to be harmless. Another possibility is to avoid conversion out of UTF-8 altogether but this relies on any other software that the data is passed to safely handling the invalid data.
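The two options can be combined: decode strictly first, treat a decoding error as a rejected request, and only then run the validation checks on the decoded text. The following is a minimal sketch under those assumptions; canonical_path, InvalidInput, and the traversal check are hypothetical names and policy, not part of any particular framework.

```python
from urllib.parse import unquote_to_bytes


class InvalidInput(ValueError):
    """Raised when a request path must be rejected (hypothetical exception type)."""


def canonical_path(raw_url_path: str) -> str:
    """Percent-decode, strictly UTF-8 decode, then validate the decoded path."""
    raw = unquote_to_bytes(raw_url_path)
    try:
        # Strict decoding refuses overlong and other ill-formed sequences.
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise InvalidInput("ill-formed UTF-8 in request path") from exc
    # Validation happens only on the fully decoded text.
    if ".." in text.split("/"):
        raise InvalidInput("path traversal segment")
    return text


if __name__ == "__main__":
    print(canonical_path("/scripts/caf%C3%A9.html"))
    for attempt in ("/scripts/..%c0%af../winnt", "/scripts/%2e%2e/winnt"):
        try:
            canonical_path(attempt)
        except InvalidInput as err:
            print("rejected:", attempt, "-", err)
```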

Another consideration is error recovery. To guarantee correct recovery after corrupt or lost bytes, decoders must be able to recognize the difference between lead and trail bytes, rather than just assuming that bytes will be of the type allowed in their position.

For security reasons, a UTF-8 decoder must not accept UTF-8 sequences that are longer than necessary to encode a character. If you use a parser to decode the UTF-8 encoding, make sure that the parser filters invalid UTF-8 characters (invalid forms or overlong forms).

Look for overlong UTF-8 sequences starting with a malicious pattern. You can also use a UTF-8 decoder stress test to test your UTF-8 parser (see Markus Kuhn's UTF-8 and Unicode FAQ in the References section).
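One way to look for overlong sequences in raw input is to scan for the byte patterns that can only start an overlong form: lead bytes C0 and C1, E0 followed by 80 to 9F, and F0 followed by 80 to 8F. The sketch below does only that; the function name is hypothetical, and it does not cover other kinds of ill-formed input such as the obsolete five- and six-byte forms.

```python
from typing import List


def find_overlong_sequences(data: bytes) -> List[int]:
    """Return the offsets of lead bytes that begin an overlong UTF-8 form."""
    offsets = []
    for i, byte in enumerate(data):
        nxt = data[i + 1] if i + 1 < len(data) else None
        if byte in (0xC0, 0xC1):
            # Two-byte form whose payload would fit in a single byte.
            offsets.append(i)
        elif byte == 0xE0 and nxt is not None and 0x80 <= nxt <= 0x9F:
            # Three-byte form whose payload would fit in two bytes.
            offsets.append(i)
        elif byte == 0xF0 and nxt is not None and 0x80 <= nxt <= 0x8F:
            # Four-byte form whose payload would fit in three bytes.
            offsets.append(i)
    return offsets


if __name__ == "__main__":
    print(find_overlong_sequences(b"/scripts/..\xc0\xaf../winnt"))
    print(find_overlong_sequences("caf\u00e9 \u20ac".encode("utf-8")))
```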

Assume all input is malicious. Create a white list that defines all valid input to the software system based on the requirements specifications. Input that does not match against the white list should not be permitted to enter into the system. Test your decoding process against malicious input.
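A white list is often easiest to express as a pattern that the whole decoded input must match. The sketch below assumes a hypothetical requirement that only short file names with a few known extensions are valid; the pattern and the name is_allowed are illustrative and would be derived from the actual requirements specification.

```python
import re

# Hypothetical policy: 1 to 64 ASCII letters, digits, '_' or '-', then a known extension.
ALLOWED_FILENAME = re.compile(r"[A-Za-z0-9_-]{1,64}\.(?:html|css|js)")


def is_allowed(name: str) -> bool:
    """Accept `name` only if the entire string matches the white list."""
    # fullmatch avoids partial matches and trailing-newline surprises.
    return ALLOWED_FILENAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("index.html", "../cmd.exe", "a/b.html", "style.css\n"):
        print(repr(candidate), is_allowed(candidate))
```

The check is meant to run on text that has already been decoded; applied to still-encoded bytes it would face the same alternate-representation problem this pattern describes.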

Attack Motivation-Consequences

Privilege Escalation
Run Arbitrary Code
Data Modification
Denial of Service

Injection Vector

The injection vector is an illegal sequence of bytes that a decoder interprets as a character, or a "non-shortest form" UTF-8 encoding of a character.

Payload

The interpretation of malicious characters can cause unexpected responses from the target host.

Activation Zone

The request or command interpreter is responsible for interpreting the request sent by the client.

Payload Activation Impact

The malicious characters can defeat the data filtering mechanism and have many different outcomes such as path manipulation, remote code execution, etc.

Related Weaknesses

CWE-ID	Weakness Name	Weakness Relationship Type
173	Failure to Handle Alternate Encoding	Targeted
172	Encoding Error	Targeted
180	Incorrect Behavior Order: Validate Before Canonicalize	Targeted
181	Incorrect Behavior Order: Validate Before Filter	Targeted
171	Cleansing, Canonicalization, and Comparison Errors	Secondary
73	External Control of File Name or Path	Targeted
21	Pathname Traversal and Equivalence Errors	Targeted
74	Failure to Sanitize Data into a Different Plane ('Injection')	Secondary
20	Improper Input Validation	Secondary
697	Insufficient Comparison	Targeted
692	Incomplete Blacklist to Cross-Site Scripting	Targeted

Related Attack Patterns

Nature	Type	ID	Name	View(s) this relationship pertains to
PeerOf	Attack Pattern	64	Using Slashes and URL Encoding Combined to Bypass Validation Logic	Mechanism of Attack 1000
PeerOf	Attack Pattern	71	Using Unicode Encoding to Bypass Validation Logic	Mechanism of Attack 1000
ChildOf	Attack Pattern	267	Leverage Alternate Encoding	Mechanism of Attack (primary) 1000

Related Security Principles

Reluctance to Trust

Related Guidelines

RFC 3629 - http://www.faqs.org/rfcs/rfc3629.html

Purposes

Penetration

CIA Impact

Confidentiality Impact: High
Integrity Impact: High
Availability Impact: Medium

Technical Context

Architectural Paradigms	All
Frameworks	All
Platforms	All
Languages	All

References

G. Hoglund and G. McGraw. "Exploiting Software: How to Break Code". Addison-Wesley. February 2004.

CWE - Input Validation

David Wheeler - http://www.dwheeler.com/secure-programs/Secure-Programs-HOWTO/character-encoding.html

Michael Howard and David LeBlanc - Writing Secure Code, Chapter 12, Microsoft Press

Bruce Schneier - Crypto-Gram Newsletter, July 15, 2000 - http://www.schneier.com/crypto-gram-0007.html

Wikipedia page about UTF-8 - http://en.wikipedia.org/wiki/UTF-8

RFC 3629 - http://www.faqs.org/rfcs/rfc3629.html

IDS Evasion with Unicode, by Eric Hacker, Jan. 3, 2001 - http://www.securityfocus.com/infocus/1232

Corrigendum #1: UTF-8 Shortest Form - http://www.unicode.org/versions/corrigendum1.html

UTF-8 and Unicode FAQ for Unix/Linux, by Markus Kuhn - http://www.cl.cam.ac.uk/~mgk25/unicode.html

UTF-8 decoder capability and stress test, by Markus Kuhn - http://www.cl.cam.ac.uk/%7Emgk25/ucs/examples/UTF-8-test.txt

Content History

Submissions
Submitter	Organization	Date
G. Hoglund and G. McGraw. Exploiting Software: How to Break Code. Addison-Wesley, February 2004.	Cigital, Inc	2007-03-01

Modifications
Modifier	Organization	Date	Comments
Eric Dalci	Cigital, Inc	2007-02-13	Fleshed out content to CAPEC schema from the original descriptions in "Exploiting Software"
Sean Barnum	Cigital, Inc	2007-03-07	Review and revise
Richard Struse	VOXEM, Inc	2007-03-26	Review and feedback leading to changes in Name, Description and Related Attack Patterns
Sean Barnum	Cigital, Inc	2007-04-16	Modified pattern content according to review and feedback
Romain Gaucher	Cigital, Inc	2009-02-10	Created draft content for detailed description
Sean Barnum	Cigital Federal, Inc	2009-04-13	Reviewed and revised content for detailed description
