URL Slugs: Rules, SEO Impact, and Transliteration

22 March, 2026 Web

A URL slug is the human-readable segment at the end of a URL path that identifies a specific resource. In https://example.com/blog/url-slug-best-practices, the slug is url-slug-best-practices. It is short, descriptive, and composed entirely of lowercase ASCII characters separated by hyphens. Getting slugs right matters for readability, SEO, and long-term URL stability. Getting them wrong — silently truncating Unicode, using underscores, or generating unstable identifiers — costs real traffic and creates maintenance debt. You can test the rules below against any title with the slug generator.


The Rules

A well-formed slug satisfies these invariants:

  • Lowercase only. URL paths are case-sensitive by spec (RFC 3986 makes only the scheme and host case-insensitive), but real-world handling varies: some servers and file systems match paths case-insensitively, and others do not. Either way, two representations of the same resource (/Blog/Post and /blog/post) invite duplicate content issues or broken links. Normalising to lowercase eliminates the problem entirely.
  • Hyphens as word separators, not underscores. Google has historically treated a hyphen as a word separator, so url-slug is read as two words: "url" and "slug". An underscore has been treated as a word joiner, so url_slug may be read as a single token. Google's URL structure guidelines recommend hyphens over underscores for this reason. Use hyphens.
  • Only [a-z0-9-] characters. Strip everything else: punctuation, special characters, emoji. Any character outside this set either needs percent-encoding (which harms readability) or causes inconsistent behaviour across systems.
  • No leading or trailing hyphens. A slug like -my-post- is technically valid but looks broken. Always trim hyphens from both ends after processing.
  • No consecutive hyphens. my--post is an artifact of the slugification process (typically from stripping punctuation that was surrounded by spaces). Collapse runs of hyphens to a single one.

The canonical regex that validates a correctly formed slug:

^[a-z0-9]+(?:-[a-z0-9]+)*$
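
As a minimal sketch of how the pattern above could be used as a validation gate, the Python snippet below wraps it in a helper. The name is_valid_slug is an illustrative assumption, not part of any library.

```python
import re

# The validation pattern from the text, compiled once at import time.
SLUG_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def is_valid_slug(slug: str) -> bool:
    """Return True when slug is lowercase [a-z0-9] words joined by single hyphens."""
    # fullmatch rejects a trailing newline, which a bare "$" would tolerate.
    return SLUG_RE.fullmatch(slug) is not None


print(is_valid_slug("url-slug-best-practices"))  # True
print(is_valid_slug("my--post"))                 # False
```

Running this check in tests or at the persistence layer catches slugs that bypassed the generator, for example ones typed in by hand.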

SEO Specifics

Hyphen vs Underscore - the Google Preference

Google's documentation on URL structure explicitly recommends hyphens over underscores. The practical consequence: a post titled "Node.js Best Practices" slugged as nodejs_best_practices risks being read as one long token rather than as the separate words "nodejs", "best", and "practices". The hyphenated version nodejs-best-practices is decomposed into individual words that match user queries.

Slug Stability and PageRank

A URL is an identity. When you change a slug - even to fix a typo - you create a new URL. The original URL has accumulated PageRank, inbound links, and cached positions in search indices. Without a 301 permanent redirect from the old slug to the new one, that equity is discarded. The practical rule: treat slugs as permanent the moment a page is indexed. Add the redirect if you must change a slug, but prefer not to change it at all.
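
One way to keep old slugs alive is a lookup table consulted before returning a 404. The sketch below is illustrative only: REDIRECTS and resolve_slug are hypothetical names, and a real application would typically store the mapping in its database and issue the 301 from its router.

```python
from __future__ import annotations

# Hypothetical redirect table: retired slug -> slug that replaced it.
REDIRECTS = {
    "slug-rules": "url-slug-best-practises",
    "url-slug-best-practises": "url-slug-best-practices",
}


def resolve_slug(slug: str, redirects: dict[str, str]) -> tuple[int, str]:
    """Return (200, slug) for a current slug, or (301, final_target) for a retired one."""
    seen = {slug}
    target = slug
    # Follow chains so the visitor gets a single 301 hop to the final URL.
    while target in redirects:
        target = redirects[target]
        if target in seen:
            raise ValueError(f"redirect loop at {target!r}")
        seen.add(target)
    return (200, slug) if target == slug else (301, target)


print(resolve_slug("slug-rules", REDIRECTS))  # (301, 'url-slug-best-practices')
```

Collapsing chains into one hop is a design choice of this sketch; it avoids sending crawlers through several redirects after a slug has been renamed more than once.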

URL Length

Google does not publish a strict character limit for URLs, but their crawlers handle shorter URLs more reliably, and shorter URLs display better in search results. The practical guidance is to keep the path under roughly 75 characters. This means slug generation must truncate long titles - at a word boundary, not in the middle of a word.

Canonical URLs

When the same content is accessible under multiple URLs (with and without trailing slash, HTTP vs HTTPS, www vs non-www), use a canonical link element to tell search engines which URL is authoritative. Slug generation is upstream of this concern, but consistent slug rules prevent accidental duplicate URLs at the slug level.


Unicode Transliteration

The majority of web content is not ASCII. Blog posts, product names, and user-generated content arrive in Russian, Chinese, Arabic, German, and hundreds of other scripts. Slugifying "Héllo Wörld" as an empty string or a string of percent-encoded bytes is wrong. The correct approach is transliteration: converting non-ASCII characters to their nearest ASCII equivalent before applying slug rules.

The Algorithm

  1. NFD normalisation. Unicode Normalisation Form D (Canonical Decomposition) decomposes precomposed characters into their base character plus combining mark(s). é (U+00E9, LATIN SMALL LETTER E WITH ACUTE) becomes e (U+0065) + ◌́ (U+0301, COMBINING ACUTE ACCENT). This separates the "letter" from the "decoration".
  2. Strip combining characters. Characters in the Unicode category Mn (Mark, Nonspacing) are the combining marks. Removing them converts é → e, ü → u, ñ → n, ç → c.
  3. Map remaining non-ASCII. NFD + strip handles Latin-script languages with diacritics. For non-Latin scripts (Cyrillic, Greek, Chinese, Arabic, Hebrew, Japanese), a transliteration table is needed. The ICU (International Components for Unicode) library provides Any-Latin transliteration that covers most scripts. A word like Привет (Russian for "Hello") becomes Privet; 北京 becomes běi jīng, which after stripping the tone marks becomes bei jing.
  4. Apply slug rules. Lowercase, replace non-[a-z0-9] with hyphens, collapse multiple hyphens, trim.

The result: "Héllo Wörld" → NFD → "He\u0301llo Wo\u0308rld" → strip marks → "Hello World" → lowercase + replace spaces → "hello-world".
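
Steps 1, 2 and 4 can be sketched with the Python standard library alone. The function name slugify_latin is an assumption for illustration; step 3 is deliberately left out because the standard library has no equivalent of ICU's Any-Latin, so non-Latin input collapses to an empty string here.

```python
import re
import unicodedata


def slugify_latin(text: str) -> str:
    """Slugify Latin-script text using NFD decomposition and mark stripping."""
    # Step 1: NFD splits "é" into "e" followed by U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop nonspacing combining marks (Unicode category Mn).
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Step 3 (script transliteration) is not implemented in this sketch.
    # Step 4: lowercase, turn every other run of characters into one hyphen, trim.
    lowered = stripped.lower()
    return re.sub(r"[^a-z0-9]+", "-", lowered).strip("-")


print(slugify_latin("Héllo Wörld"))  # hello-world
print(repr(slugify_latin("Привет")))  # '' (Cyrillic needs step 3)
```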


Implementation

PHP

PHP's intl extension (packaged by most distributions and recommended for Symfony applications) exposes ICU transliteration directly.

<?php

declare(strict_types=1);

function slugify(string $text, int $maxLength = 75): string
{
    // Transliterate any script to lowercase ASCII with the ICU chain Any-Latin; Latin-ASCII; Lower()
    $ascii = transliterator_transliterate('Any-Latin; Latin-ASCII; Lower()', $text);
    if ($ascii === false) {
        // Transliteration failed; let the caller apply its empty-slug fallback
        return '';
    }

    // Replace each run of characters outside [a-z0-9] with a single hyphen;
    // the + quantifier means consecutive hyphens never appear
    $text = (string) preg_replace('/[^a-z0-9]+/', '-', $ascii);

    // Trim hyphens from both ends
    $text = trim($text, '-');

    if ($text === '') {
        return '';
    }

    // Truncate at word boundary if too long
    if (strlen($text) > $maxLength) {
        $text = substr($text, 0, $maxLength);
        $lastHyphen = strrpos($text, '-');
        if ($lastHyphen !== false && $lastHyphen > $maxLength / 2) {
            $text = substr($text, 0, $lastHyphen);
        }
        $text = trim($text, '-');
    }

    return $text;
}

// Examples:
slugify('Héllo Wörld');           // "hello-world"
slugify('Привет мир');            // "privet-mir"
slugify('北京欢迎你');              // "bei-jing-huan-ying-ni"
slugify('PHP: The Right Way!');   // "php-the-right-way"

If the intl extension is unavailable, a fallback using iconv handles Latin-script diacritics:

<?php

declare(strict_types=1);

function slugifyFallback(string $text): string
{
    // Convert to ASCII using iconv transliteration (Latin scripts only)
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $text);
    $ascii = strtolower((string) $ascii);
    $ascii = preg_replace('/[^a-z0-9]+/', '-', $ascii);

    return trim((string) $ascii, '-');
}

The iconv approach does not handle Cyrillic, Chinese, or Arabic, and its //TRANSLIT output varies with the iconv implementation and the active locale. Prefer transliterator_transliterate for any site with non-Latin content.

JavaScript

JavaScript has no built-in transliteration for non-Latin scripts. NFD normalisation and diacritic stripping are native; for full transliteration, use the transliteration or slug npm packages.

// Native: handles Latin-script diacritics via NFD normalization
function slugify(text, maxLength = 75) {
    let slug = text
        .normalize('NFD')                     // decompose diacritics
        .replace(/[\u0300-\u036f]/g, '')      // strip combining marks
        .toLowerCase()
        .replace(/[^a-z0-9]+/g, '-')          // non-alphanumeric to hyphen
        .replace(/^-+|-+$/g, '');             // trim leading/trailing hyphens

    if (slug.length > maxLength) {
        slug = slug.slice(0, maxLength);
        const lastHyphen = slug.lastIndexOf('-');
        if (lastHyphen > maxLength / 2) {
            slug = slug.slice(0, lastHyphen);
        }
        slug = slug.replace(/^-+|-+$/g, '');
    }

    return slug;
}

// For non-Latin scripts, add the transliteration library:
// import { slugify } from 'transliteration';
// slugify('Привет мир') => 'privet-mir'

slugify('Héllo Wörld');          // "hello-world"
slugify('PHP: The Right Way!');  // "php-the-right-way"

Python

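The third-party Unidecode package provides table-based transliteration for most scripts in a single call. It maps each character independently of context, so results for some languages are rougher than ICU's; for example, Han characters always get Mandarin readings, even in Japanese text.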

from unidecode import unidecode  # pip install unidecode
import re


def slugify(text: str, max_length: int = 75) -> str:
    # Transliterate any script to ASCII
    text = unidecode(text)
    text = text.lower()
    # Replace non-alphanumeric characters with hyphens
    text = re.sub(r'[^a-z0-9]+', '-', text)
    text = text.strip('-')

    if len(text) > max_length:
        text = text[:max_length]
        last_hyphen = text.rfind('-')
        if last_hyphen > max_length // 2:
            text = text[:last_hyphen]
        text = text.strip('-')

    return text


# Examples:
slugify('Héllo Wörld')         # 'hello-world'
slugify('Привет мир')          # 'privet-mir'
slugify('北京欢迎你')            # 'bei-jing-huan-ying-ni'
slugify('PHP: The Right Way!') # 'php-the-right-way'

Edge Cases

All-Numeric Slugs

A slug like 12345 is valid by the character rules but ambiguous. Visitors and systems often cannot tell whether it is an ID or a meaningful path segment. Some frameworks route numeric segments to ID-based controllers rather than slug-based ones. If your titles can produce all-numeric slugs (e.g., a post titled "2026"), consider prepending a category prefix: post-2026 or year-2026.
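
A minimal sketch of the prefixing idea follows. The function name prefix_numeric_slug and the default prefix are assumptions, as is the choice to treat digit groups joined by hyphens (such as 2026-03) as numeric too.

```python
import re

# Digits only, optionally in hyphen-separated groups: "2026", "2026-03".
NUMERIC_SLUG_RE = re.compile(r"[0-9]+(?:-[0-9]+)*")


def prefix_numeric_slug(slug: str, prefix: str = "post") -> str:
    """Prepend a prefix when the slug contains no letters at all."""
    if NUMERIC_SLUG_RE.fullmatch(slug):
        return f"{prefix}-{slug}"
    return slug


print(prefix_numeric_slug("2026"))          # post-2026
print(prefix_numeric_slug("2026", "year"))  # year-2026
print(prefix_numeric_slug("top-10-tips"))   # top-10-tips
```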

Empty Result After Stripping

A title composed entirely of emoji, special characters, or a script your transliterator does not handle can produce an empty string after slugification. Never silently use an empty slug. Fallback strategies in order of preference:

  1. Use a hash of the original title (first 8 hex characters of SHA-1 is enough for this purpose).
  2. Use a UUID v4.
  3. Raise a validation error and require a manually entered slug.
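
The first and third strategies can be combined in a few lines. This is a sketch under stated assumptions: slug_or_hash is a hypothetical helper, and it only raises when the title itself is blank, since a blank title gives a hash nothing meaningful to identify.

```python
import hashlib


def slug_or_hash(slug: str, original_title: str) -> str:
    """Return slug, or the first 8 hex characters of the title's SHA-1 if slug is empty."""
    if slug:
        return slug
    if not original_title.strip():
        # Nothing to hash meaningfully: require a manually entered slug.
        raise ValueError("title is empty; a manual slug is required")
    digest = hashlib.sha1(original_title.encode("utf-8")).hexdigest()
    return digest[:8]


print(slug_or_hash("hello-world", "Héllo Wörld"))  # hello-world
print(len(slug_or_hash("", "🎉🎉🎉")))              # 8
```

The hash is deterministic, so re-running slug generation for the same title yields the same fallback slug.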

Reserved Words and System Paths

Do not allow slugs that conflict with system paths. Common conflicts: admin, api, static, assets, login, logout, register, feed, sitemap, robots. Maintain a blocklist and append a suffix when a generated slug matches a reserved word.
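
A blocklist check might look like the sketch below. RESERVED_SLUGS reuses the examples from the paragraph above; avoid_reserved and its default suffix are illustrative assumptions, and a real list should mirror the routes your application actually defines.

```python
# Reserved path segments taken from the examples in the text.
RESERVED_SLUGS = frozenset({
    "admin", "api", "static", "assets", "login", "logout",
    "register", "feed", "sitemap", "robots",
})


def avoid_reserved(slug: str, suffix: str = "page") -> str:
    """Append a suffix when the slug exactly matches a reserved path."""
    if slug in RESERVED_SLUGS:
        return f"{slug}-{suffix}"
    return slug


print(avoid_reserved("admin"))           # admin-page
print(avoid_reserved("administration"))  # administration
```

The comparison is an exact match on the whole slug, so titles that merely contain a reserved word are left alone.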

Very Long Titles

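Truncate at a word boundary, not mid-word. The implementation examples above show the pattern: truncate to maxLength, find the last hyphen in the truncated string, and cut there. This avoids slugs ending in ...best-practi. The examples only cut at the hyphen when it falls in the second half of the limit, so a slug dominated by one very long word is cut mid-word rather than shrunk to almost nothing.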


Collision Handling

When two different resources produce the same slug, you need a deterministic resolution strategy. The standard approach is sequential suffixing:

<?php

declare(strict_types=1);

function uniqueSlug(string $baseSlug, callable $exists): string
{
    if (!$exists($baseSlug)) {
        return $baseSlug;
    }

    $counter = 2;
    do {
        $candidate = $baseSlug . '-' . $counter;
        $counter++;
    } while ($exists($candidate));

    return $candidate;
}

// Usage:
$slug = uniqueSlug(
    slugify($title),
    static fn (string $s): bool => $articleRepository->existsBySlug($s),
);
// "my-post", "my-post-2", "my-post-3", etc.

Avoid appending a random hash as the default collision resolution. Hashes produce unstable, unguessable URLs and defeat the readability purpose of having a slug at all. Sequential suffixes (-2, -3) are predictable and human-friendly.


UUID-Based vs Human-Readable Slugs

Some applications avoid the collision and stability problems entirely by using UUID or ULID-based URLs:

/posts/01JPXK3G8EQ4FVZMCQ7N1BWSRH   ← ULID-based
/posts/a1b2c3d4                       ← short hash
/posts/my-post-title                  ← human-readable slug

The trade-off is explicit:

Property        UUID/ULID                    Human-readable slug
Uniqueness      Guaranteed by construction   Requires collision handling
Stability       Permanent, never changes     At risk when title is edited
Readability     None                         High
SEO value       Minimal - no keywords        Moderate - keywords in URL
Guessability    Zero                         Moderate
Implementation  Simple                       Requires transliteration + deduplication

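A middle-ground approach that works well for large content platforms: generate a human-readable slug, append 8 characters from the random tail of the record's ULID as a suffix, and never change it regardless of title edits:

/posts/my-post-title-7n1bwsrh

Take the suffix from the end of the ULID, not the start: the first 10 characters encode the creation timestamp, so the first 8 are identical for every record created within roughly the same second. With a tail suffix the slug is readable, very unlikely to collide (keep a unique constraint as a backstop), and stable because it is tied to the record ID rather than the title.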
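
A minimal sketch of building such a slug follows. The helper name slug_with_id is an assumption, and so is the choice to read the suffix from the tail of the ID: in a ULID the leading characters encode the creation time, so the tail is the random part.

```python
def slug_with_id(title_slug: str, record_id: str, suffix_len: int = 8) -> str:
    """Join a title slug with the last suffix_len characters of the record ID."""
    # ULIDs are uppercase Crockford base32; lowercase them to keep the slug rules.
    suffix = record_id[-suffix_len:].lower()
    # If the title produced no slug at all, the suffix alone still identifies the record.
    return f"{title_slug}-{suffix}" if title_slug else suffix


print(slug_with_id("my-post-title", "01JPXK3G8EQ4FVZMCQ7N1BWSRH"))
# my-post-title-7n1bwsrh
```

Because the suffix alone identifies the record, a router could look the record up by suffix and ignore the readable part, which lets old links keep working even if the title portion is ever regenerated.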


The Checklist

A good slug is lowercase, hyphen-separated, ASCII-only, bounded in length, and stable over time. The implementation details that matter most in practice are: use ICU transliteration (not naive iconv) for non-Latin scripts, truncate at word boundaries not character boundaries, handle empty-after-strip gracefully, and treat a published slug as immutable. Use 301 redirects when you must change a slug, maintain a blocklist of reserved paths, and if stability is more important than readability, embed a short ID suffix so the slug can survive title edits.
