URL Encoding Explained: Percent-Encoding, Reserved Characters, and Common Mistakes
8 March, 2026 Updated: 2 March, 2026 Security
URLs are the universal addressing system of the web, but they were designed in an era when the internet was ASCII-only and the set of characters with special meaning in a URL was small. Today, developers constantly need to include arbitrary data — user names with spaces, search queries with ampersands, file paths with slashes, passwords with special characters — inside URL components. URL encoding (formally called percent-encoding) is the mechanism that makes this safe and unambiguous. This guide covers the full technical picture: the RFC 3986 rules, reserved versus unreserved characters, the difference between query encoding and form encoding, the double encoding vulnerability, and the encoding functions available in PHP, Python, and JavaScript. You can experiment with the URL encoder/decoder as you read.
What Is URL Encoding
Percent-encoding is a mechanism for representing arbitrary data in a URI (Uniform Resource Identifier) using only the ASCII characters that are safe to transmit across all systems. The name comes from the encoding format itself: each unsafe byte is represented as a percent sign % followed by two hexadecimal digits representing the byte's value.
Why It Exists
URLs were originally designed to carry a limited set of ASCII characters. The constraints come from several directions:
- Protocol safety: Older protocols and proxies sometimes strip or mangle certain bytes (control characters, bytes above 127, etc.)
- Delimiter ambiguity: Characters like ?, &, =, /, and # have specific structural meaning in a URL. If you want to include a literal & in a query parameter value, you must encode it so parsers do not interpret it as a separator.
- Non-ASCII characters: Unicode characters (e.g., Cyrillic, Chinese, emoji) must be encoded as their UTF-8 byte sequences, with each byte percent-encoded.
The Encoding Mechanism
The process is straightforward:
- Take the character you want to encode.
- Express it as its UTF-8 byte sequence (for ASCII, this is a single byte equal to the ASCII code; other characters produce two to four bytes).
- For each byte, write % followed by the two hexadecimal digits for that byte, preferably in uppercase.
Examples:
| Character | UTF-8 Byte(s) | Percent-Encoded |
|---|---|---|
| Space | 0x20 | %20 |
| & | 0x26 | %26 |
| = | 0x3D | %3D |
| # | 0x23 | %23 |
| @ | 0x40 | %40 |
| / | 0x2F | %2F |
| € | 0xE2 0x82 0xAC | %E2%82%AC |
| я | 0xD1 0x8F | %D1%8F |
The hexadecimal digits in a percent-encoded triplet are case-insensitive: %2f and %2F are equivalent, but RFC 3986 recommends uppercase.
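The mechanism described in this section can be sketched in a few lines of Python. This is a minimal illustration under the assumption that every byte outside the RFC 3986 unreserved set should be encoded; the names percent_encode and UNRESERVED are hypothetical and chosen only for this example.

```python
# Hypothetical helper illustrating the encoding steps; not a library API.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def percent_encode(text: str) -> str:
    """Percent-encode every byte that is not an unreserved character."""
    out = []
    # Step 1 and 2: express the text as its UTF-8 byte sequence.
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in UNRESERVED:
            out.append(ch)
        else:
            # Step 3: "%" followed by two uppercase hexadecimal digits.
            out.append(f"%{byte:02X}")
    return "".join(out)
```

In real code the standard library call urllib.parse.quote(text, safe="") covers the same ground; the hand-written loop is only meant to make the byte-by-byte process visible.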
Percent-Encoding Rules (RFC 3986)
The current authoritative specification for URIs is RFC 3986, published in January 2005. It supersedes the earlier RFC 2396 (1998) and RFC 2732. Understanding the distinction matters because older software and documentation may reference earlier behaviour, which differs in subtle ways. For example, RFC 2396 treated ! * ' ( ) as unreserved "mark" characters, whereas RFC 3986 classifies them as reserved sub-delims. The ~ tilde, which the still older RFC 1738 listed as unsafe, is unreserved in both RFC 2396 and RFC 3986, yet some legacy encoders continue to escape it.
The Core Rule
Every octet that does not belong to the unreserved character set and is not being used as a reserved delimiter must be percent-encoded when placed in a URI.
Formally, from RFC 3986:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
Any character not in the unreserved set must be percent-encoded unless it is a reserved character being used in its reserved capacity (as a delimiter).
Reserved vs Unreserved Characters
This distinction is the most important concept in URL encoding.
Unreserved Characters - Never Encode These
Unreserved characters are safe to use anywhere in a URI without encoding. They carry no special structural meaning. The RFC 3986 unreserved set is:
ALPHA / DIGIT / "-" / "." / "_" / "~"
That is: uppercase and lowercase letters A-Z and a-z, digits 0-9, hyphen, period, underscore, and tilde.
Reserved Characters - Context-Dependent
Reserved characters have syntactic meaning in URIs. They should only be percent-encoded when they appear in a position where their special meaning must be suppressed - typically inside parameter values.
: / ? # [ ] @ (gen-delims)
! $ & ' ( ) * + , ; = (sub-delims)
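As a rough sketch, the three character classes can be written down directly and compared against what Python's urllib.parse.quote does when no characters are marked safe. The constant names and the classify helper are hypothetical, and the check that ~ is left alone assumes Python 3.7 or later.

```python
from urllib.parse import quote

# Character classes copied from RFC 3986; the names are illustrative only.
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
UNRESERVED_MARKS = "-._~"


def classify(ch: str) -> str:
    """Return the RFC 3986 class of a single character."""
    if (ch.isascii() and ch.isalnum()) or ch in UNRESERVED_MARKS:
        return "unreserved"
    if ch in GEN_DELIMS:
        return "gen-delim"
    if ch in SUB_DELIMS:
        return "sub-delim"
    return "other"


def encode_value(value: str) -> str:
    """Encode a value so that no reserved character keeps its delimiter role."""
    return quote(value, safe="")
```

With safe="" every reserved character is escaped, which is the behaviour you want for a value embedded inside a component; unreserved characters pass through unchanged.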
Character Encoding Reference Table
| Character | Unreserved? | RFC 3986 rawurlencode | Form encoding urlencode | encodeURI (JS) | encodeURIComponent (JS) |
|---|---|---|---|---|---|
| A-Z, a-z, 0-9 | Yes | unchanged | unchanged | unchanged | unchanged |
| - . _ | Yes | unchanged | unchanged | unchanged | unchanged |
| ~ | Yes | unchanged | %7E | unchanged | unchanged |
| Space | No | %20 | + | %20 | %20 |
| & | Reserved | %26 | %26 | unchanged | %26 |
| = | Reserved | %3D | %3D | unchanged | %3D |
| + | Reserved | %2B | %2B | unchanged | %2B |
| # | Reserved | %23 | %23 | unchanged | %23 |
| / | Reserved | %2F | %2F | unchanged | %2F |
| ? | Reserved | %3F | %3F | unchanged | %3F |
| @ | Reserved | %40 | %40 | unchanged | %40 |
| [ ] | Reserved | %5B %5D | %5B %5D | %5B %5D | %5B %5D |
| ! ' ( ) * | Sub-delims | encoded | encoded | unchanged | unchanged |
| $ , ; | Sub-delims | encoded | encoded | unchanged | encoded |
Query String Encoding
The query string is the part of a URL after the ? character. It typically carries key-value pairs separated by &, with keys and values separated by =.
https://example.com/search?q=hello+world&lang=en&page=2
                           ^             ^       ^
                           key=value     key=val key=val
Encoding Rules in Query Strings
When constructing query strings programmatically:
- The ? delimiter is not part of the query value - it is the separator between path and query
- Within parameter values, encode & as %26 and = as %3D; otherwise the parser treats them as delimiters
- Encode # as %23; an unencoded # is treated as the start of the fragment
- Keys should also be encoded if they contain special characters
# Correct: encoding the & in a parameter value
https://example.com/page?content=cats+%26+dogs&lang=en
# Wrong: the parser sees three parameters, not two
https://example.com/page?content=cats+&+dogs&lang=en
# Parsed as: content="cats ", (orphan key "+dogs"), lang="en"
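The difference between the two URLs above can be checked with Python's standard query parser. This is only a sketch of how one form-decoding parser behaves; other parsers may treat the orphan key differently.

```python
from urllib.parse import parse_qsl

# keep_blank_values=True keeps keys that have no value, so the orphan is visible.
correct = parse_qsl("content=cats+%26+dogs&lang=en", keep_blank_values=True)
broken = parse_qsl("content=cats+&+dogs&lang=en", keep_blank_values=True)

print(correct)  # two pairs: the value contains a literal ampersand
print(broken)   # three pairs: the unencoded ampersand split the value
```

The encoded version yields two pairs with content set to "cats & dogs", while the unencoded version yields three pairs, one of which is a key with an empty value.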
Path Encoding
The path component of a URL uses / as the segment delimiter. This creates a special complication: if you need a literal / inside a path segment (for example, a filename containing a slash), you must encode it as %2F.
/files/reports%2F2026/summary.pdf
# "reports/2026" is a single path segment containing a literal slash
# as opposed to:
/files/reports/2026/summary.pdf
# which has four path segments: "files", "reports", "2026", "summary.pdf"
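A minimal sketch of building and splitting a path segment by segment in Python follows. The helper names build_path and split_path are hypothetical; the important detail is that each segment is encoded with no safe characters, and that splitting happens on the raw path before decoding.

```python
from urllib.parse import quote, unquote


def build_path(segments):
    """Join segments into a path, escaping any slash inside a segment."""
    return "/" + "/".join(quote(segment, safe="") for segment in segments)


def split_path(path):
    """Split on literal slashes first, then decode each segment."""
    return [unquote(part) for part in path.lstrip("/").split("/")]


path = build_path(["files", "reports/2026", "summary.pdf"])
segments = split_path(path)
```

Decoding before splitting would turn %2F into a real separator and silently change the number of segments, which is why the order of operations matters here.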
Security Implication: %2F Normalisation
Many web servers and reverse proxies normalise %2F back to / before passing the path to the application, while others reject or preserve it depending on configuration. This inconsistency has been exploited in directory traversal attacks. An attacker might encode ../ as ..%2F to bypass path validation that looks for literal ../ sequences, but a server that decodes it afterwards will traverse up the directory tree.
Always decode and normalise paths before performing security checks on them. Never validate the raw, still-encoded URL.
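One way to apply this rule is sketched below in Python, assuming a POSIX-style filesystem and a hypothetical web root of /srv/www. The function name is_inside_root is illustrative, and a production check would also need to consider symlinks and platform-specific separators.

```python
import posixpath
from urllib.parse import unquote


def is_inside_root(raw_path: str, root: str = "/srv/www") -> bool:
    """Decode once, normalise, then check the result stays under root."""
    decoded = unquote(raw_path)
    # Join relative to the root, then collapse "." and ".." segments.
    candidate = posixpath.normpath(posixpath.join(root, decoded.lstrip("/")))
    return candidate == root or candidate.startswith(root + "/")
```

A filter that searched the raw string for "../" would miss ..%2F entirely, whereas this check only looks at the path after decoding and normalisation.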
Form Encoding: application/x-www-form-urlencoded
When an HTML form is submitted with method="POST" and the default enctype, the browser encodes the form data using the application/x-www-form-urlencoded format. This format predates RFC 3986 and differs from it in one important way: spaces are encoded as + instead of %20.
# RFC 3986 / rawurlencode style (correct for URIs):
name=John%20Doe&city=New%20York
# application/x-www-form-urlencoded style (HTML form POST):
name=John+Doe&city=New+York
The + convention comes from early HTML and HTTP specifications. It applies only to the body of a POST request or the query string when generated by an HTML form. It does not apply to path segments or other URI components.
Newline Encoding in Forms
Form encoding also encodes newlines as %0D%0A (carriage return + line feed), not just %0A. Browsers normalise line breaks in form values to CRLF before encoding, following the line-ending convention used by Internet text protocols such as HTTP and MIME.
# A textarea value "line1\nline2" becomes:
line1%0D%0Aline2
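Python's urlencode does not normalise line endings on its own, so a client that wants to mimic a browser's form submission has to do it first. The following is a sketch under that assumption; form_encode is a hypothetical helper name.

```python
import re
from urllib.parse import urlencode


def form_encode(fields: dict) -> str:
    """Encode fields as application/x-www-form-urlencoded with CRLF newlines."""
    normalised = {
        # Turn lone CR or LF (and existing CRLF) into a single CRLF pair.
        key: re.sub(r"\r\n|\r|\n", "\r\n", value)
        for key, value in fields.items()
    }
    return urlencode(normalised)


body = form_encode({"name": "John Doe", "note": "line1\nline2"})
```

The resulting body uses + for the space in the name and %0D%0A for the line break in the note, matching the two form-encoding conventions described above.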
Double Encoding
Double encoding is both an accidental bug and an intentional attack technique.
How It Happens
If you percent-encode a string that is already percent-encoded, the % characters themselves get encoded as %25, producing double-encoded output:
Original: hello world
Encoded once: hello%20world
Encoded twice: hello%2520world
                    ^^^
                    %25 is the encoding of %
When the doubly-encoded string is decoded once, you get hello%20world (still encoded). A second decode yields the original hello world. This two-step decode can be exploited if different layers of a system decode the URL at different stages.
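The round trip is easy to reproduce with Python's standard library. This short sketch simply applies quote and unquote twice to show each intermediate state.

```python
from urllib.parse import quote, unquote

original = "hello world"
once = quote(original, safe="")   # the space becomes %20
twice = quote(once, safe="")      # the % itself becomes %25

decoded_once = unquote(twice)          # still percent-encoded
decoded_twice = unquote(decoded_once)  # back to the original text
```

Any layer that inspects decoded_once believing it to be fully decoded will be looking at data that a later layer can still transform.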
CVE-2001-0333: IIS Double Decode Vulnerability
One of the most notorious examples is CVE-2001-0333, a critical vulnerability in Microsoft IIS 5.0 and earlier. IIS performed URL decoding in two passes. An attacker could encode ../ (used for directory traversal) as ..%2F, and then encode the % as %25, producing ..%252F.
- Pass 1 decode: ..%252F becomes ..%2F
- Pass 2 decode: ..%2F becomes ../
This allowed attackers to traverse outside the web root directory and read or execute arbitrary files on the server - including system files and scripts outside the webroot. The fix was to decode only once and reject paths containing ../ sequences after decoding.
# Attack payload (%255c is a double-encoded backslash, which Windows accepts as a path separator):
GET /scripts/..%255c..%255cwinnt/system32/cmd.exe?/c+dir HTTP/1.0
# After double-decode on vulnerable IIS:
GET /scripts/..\..\winnt/system32/cmd.exe?/c+dir
The lesson: always decode once and validate after decoding, never before.
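A defensive version of this lesson might look like the Python sketch below. It assumes a deliberately conservative policy: decode exactly once, refuse anything that still looks percent-encoded, and refuse parent-directory segments written with either slash. The function name decode_path_once is hypothetical, and the policy would also reject legitimate paths that contain a literal "%20"-style sequence.

```python
import re
from urllib.parse import unquote

_PCT_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")


def decode_path_once(raw_path: str) -> str:
    """Decode a request path once and validate the decoded result."""
    decoded = unquote(raw_path)
    # A remaining triplet suggests the input was encoded more than once.
    if _PCT_TRIPLET.search(decoded):
        raise ValueError("path still contains percent-encoding after one decode")
    # Treat backslashes as separators too, as Windows servers do.
    if ".." in decoded.replace("\\", "/").split("/"):
        raise ValueError("path contains a parent-directory segment")
    return decoded
```

The validation runs only on the decoded string, so both the double-encoded payload and its single-encoded form are rejected before they reach the filesystem.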
Code Examples
PHP
PHP provides two functions for percent-encoding, serving different purposes:
<?php
declare(strict_types=1);
// urlencode() - application/x-www-form-urlencoded
// Spaces become +, not %20
// Use for HTML form data and query string values in PHP web forms
$query = urlencode('hello world & more');
// Result: "hello+world+%26+more"
// rawurlencode() - RFC 3986 compliant
// Spaces become %20
// Use for path segments and API query parameters
$path = rawurlencode('reports/2026');
// Result: "reports%2F2026"
$slug = rawurlencode('hello world');
// Result: "hello%20world"
// http_build_query() - builds a query string from an array
// Uses urlencode() internally (+ for spaces)
$params = [
'q' => 'cats & dogs',
'lang' => 'en',
'page' => 2,
];
$queryString = http_build_query($params);
// Result: "q=cats+%26+dogs&lang=en&page=2"
// For RFC 3986 compliant output, use the enc_type parameter:
$queryStringRfc = http_build_query($params, '', '&', PHP_QUERY_RFC3986);
// Result: "q=cats%20%26%20dogs&lang=en&page=2"
// Decoding:
$decoded = urldecode('hello+world'); // "hello world"
$decoded = rawurldecode('hello%20world'); // "hello world"
Python
from urllib.parse import quote, quote_plus, urlencode, unquote, unquote_plus
# quote() - RFC 3986 compliant (like rawurlencode in PHP)
# The safe parameter defaults to '/', preserving slashes in paths
encoded = quote('hello world & more')
# Result: 'hello%20world%20%26%20more'
# Encode path segments (no safe characters)
segment = quote('reports/2026', safe='')
# Result: 'reports%2F2026'
# quote_plus() - application/x-www-form-urlencoded (like urlencode in PHP)
# Spaces become +
encoded_form = quote_plus('hello world & more')
# Result: 'hello+world+%26+more'
# urlencode() - builds a query string from a dict
params = {
'q': 'cats & dogs',
'lang': 'en',
'page': 2,
}
query_string = urlencode(params)
# Result: 'q=cats+%26+dogs&lang=en&page=2'
# RFC 3986 compliant query string:
query_string_rfc = urlencode(params, quote_via=quote)
# Result: 'q=cats%20%26%20dogs&lang=en&page=2'
# Decoding:
decoded = unquote('hello%20world') # "hello world"
decoded_plus = unquote_plus('hello+world') # "hello world"
JavaScript
JavaScript provides two encoding functions with importantly different behaviour:
// encodeURI() - encodes a COMPLETE URI
// Leaves reserved characters intact (they may be needed as delimiters)
// Also leaves: A-Z a-z 0-9 - _ . ! ~ * ' ( )
const full = encodeURI('https://example.com/search?q=hello world&lang=en');
// Result: 'https://example.com/search?q=hello%20world&lang=en'
// Note: & and = are NOT encoded (they are kept as delimiters)
// encodeURIComponent() - encodes a COMPONENT (value) within a URI
// Encodes everything except: A-Z a-z 0-9 - _ . ! ~ * ' ( )
// This is what you should use for individual parameter values
const value = encodeURIComponent('cats & dogs');
// Result: 'cats%20%26%20dogs'
const value2 = encodeURIComponent('hello world');
// Result: 'hello%20world'
// Correct way to build a query string in JS:
const params = {
q: 'cats & dogs',
lang: 'en',
page: 2,
};
const queryString = Object.entries(params)
.map(([k, v]) => `${encodeURIComponent(k)}=${encodeURIComponent(v)}`)
.join('&');
// Result: 'q=cats%20%26%20dogs&lang=en&page=2'
// Modern alternative: URLSearchParams (handles encoding automatically)
const sp = new URLSearchParams(params);
const queryStringAlt = sp.toString();
// Result: 'q=cats+%26+dogs&lang=en&page=2'
// Note: URLSearchParams uses + for spaces (form encoding style)
// Decoding:
decodeURI('hello%20world'); // 'hello world'
decodeURIComponent('hello%20world'); // 'hello world'
Common Mistakes
1. Encoding the Full URL Instead of Its Components
The most frequent mistake is calling an encoding function on a complete, already-assembled URL. This encodes the delimiters (://, ?, &, =) that must remain literal.
// Wrong: encodes the delimiters
const brokenUrl = encodeURIComponent('https://example.com/search?q=hello world');
// Result: 'https%3A%2F%2Fexample.com%2Fsearch%3Fq%3Dhello%20world'
// This is a broken URL
// Correct: encode only the values
const query = encodeURIComponent('hello world');
const url = `https://example.com/search?q=${query}`;
// Result: 'https://example.com/search?q=hello%20world'
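The same principle in Python might be sketched as follows: each component is encoded separately and the delimiters are added only when the URL is assembled. The build_url helper and its parameters are hypothetical, and the https scheme is hard-coded purely for brevity.

```python
from urllib.parse import quote, urlencode, urlunsplit


def build_url(host: str, path: str, params: dict) -> str:
    """Assemble a URL from components, encoding each one on its own."""
    encoded_path = quote(path)  # default safe="/" keeps the segment separators
    encoded_query = urlencode(params, quote_via=quote)  # %20 instead of +
    return urlunsplit(("https", host, encoded_path, encoded_query, ""))


url = build_url("example.com", "/search", {"q": "hello world", "lang": "en"})
```

Because the delimiters never pass through an encoding function, they cannot be escaped by accident, and the values cannot inject new delimiters.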
2. Confusing + and %20 in APIs
The +-for-space convention only applies to application/x-www-form-urlencoded. Path segments, and APIs that parse query strings strictly per RFC 3986, expect %20 for a space. Sending a + to an API that does not apply form-decoding means the server will receive a literal + character.
<?php
// Wrong for REST APIs: uses + for spaces
$url = 'https://api.example.com/search?q=' . urlencode('hello world');
// Sends: ...?q=hello+world (literal plus sign if server doesn't form-decode)
// Correct for REST APIs:
$url = 'https://api.example.com/search?q=' . rawurlencode('hello world');
// Sends: ...?q=hello%20world
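What the receiving side sees depends on which decoder it applies. The Python sketch below contrasts the two standard-library decoders on the same inputs.

```python
from urllib.parse import unquote, unquote_plus

# RFC 3986 style decoding: + has no special meaning and stays a plus sign.
strict_plus = unquote("hello+world")
# Form decoding: + is turned into a space.
form_plus = unquote_plus("hello+world")

# %20 decodes to a space either way, and %2B to a plus either way.
strict_space = unquote("hello%20world")
form_space = unquote_plus("hello%20world")
literal_plus = unquote_plus("1%2B1")
```

Sending %20 for spaces and %2B for literal plus signs is therefore the form that both kinds of decoder interpret identically.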
3. Double Encoding in String Concatenation
When building URLs by concatenation over multiple steps, it is easy to accidentally encode an already-encoded string.
<?php
// Danger zone: encoding at two different layers
$value = rawurlencode('hello world'); // "hello%20world"
// ... value passed to another function that also encodes it:
$url = 'https://example.com/?q=' . rawurlencode($value);
// Result: ?q=hello%2520world (double-encoded!)
// Solution: encode exactly once, as late as possible
$rawValue = 'hello world';
$url = 'https://example.com/?q=' . rawurlencode($rawValue);
4. Not Encoding Special Characters in Query Values
Forgetting to encode user-supplied data that contains &, =, #, or + in query parameter values leads to broken URLs and potential injection:
// User input: "cats & dogs"
const userInput = 'cats & dogs';
// Wrong: the & breaks the query string
const brokenUrl = `https://example.com/search?q=${userInput}&lang=en`;
// Parsed as: q="cats ", (orphan " dogs"), lang="en"
// Correct:
const safeUrl = `https://example.com/search?q=${encodeURIComponent(userInput)}&lang=en`;
// Parsed as: q="cats & dogs", lang="en"
5. Assuming Case-Insensitivity Is Universal
While percent-encoded triplets are officially case-insensitive (%2f equals %2F), some servers, proxies, and caches treat them as case-sensitive strings. Always normalise to uppercase hex digits (RFC 3986 recommendation) for maximum compatibility.
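Normalising the hex digits can be done with a single regular-expression pass. The Python sketch below uses a hypothetical helper name, normalise_pct_case, and touches only well-formed triplets, leaving the rest of the string untouched.

```python
import re

_PCT_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")


def normalise_pct_case(url: str) -> str:
    """Uppercase the hex digits of every percent-encoded triplet."""
    return _PCT_TRIPLET.sub(lambda match: match.group(0).upper(), url)


normalised = normalise_pct_case("/files/a%2fb%e2%82%ac?q=Hello")
```

Only the triplets change; letters elsewhere in the path and query keep their original case, which matters because paths and query values are often case-sensitive.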
What to Internalise
URL encoding is a small but critical part of building correct, secure web applications. The key distinctions to internalise are: encode components, not complete URLs; use RFC 3986 (%20) for paths and API queries; use form encoding (+) only for HTML form submissions; and never validate a URL path before fully decoding it. Double encoding is both an accidental bug and a class of security vulnerability — encode exactly once, as close to the transport layer as possible.