Diff Algorithms Explained: LCS, Myers, and How Git Tracks Changes
14 March, 2026 Algorithms
Every time you run git diff, open a pull request, or deploy configuration with drift detection, a diff algorithm is working quietly underneath. Understanding how these algorithms work makes you a better developer - you'll know why certain diffs look the way they do, when to choose one format over another, and how to implement diffing in your own tools.
This article covers the theory and practice of text diffing: from the classic LCS dynamic programming approach to the O(ND) Myers algorithm that Git uses by default, the unified diff format, JSON structural diffing, and real code examples in Python, JavaScript, and PHP. You can follow along with the text diff tool open in another tab.
What Is a Diff
A diff is a compact description of how to transform one text into another. The word comes from the Unix diff utility (1974), but the concept is general: given strings A and B, produce a minimal set of edit operations that converts A into B.
The three fundamental operations are:
- Equal - this region is unchanged between A and B
- Insert - add lines/characters present in B but not in A
- Delete - remove lines/characters present in A but not in B
There is no "replace" operation at the primitive level - a replacement is just a delete followed by an insert.
Edit Distance (Levenshtein Distance)
Edit distance, also called Levenshtein distance after Vladimir Levenshtein who defined it in 1965, counts the minimum number of single-character edits needed to transform one string into another. The allowed operations are insert, delete, and substitute.
For example:
kittentosittinghas edit distance 3 (substitute k→s, substitute e→i, insert g)ABCDEtoACEhas edit distance 2 (delete B, delete D)
In the context of line-based diffing (as Git uses), "edit distance" usually means the number of inserted + deleted lines, ignoring substitutions (treating them as delete + insert pairs).
The Patch Metaphor
A diff output is sometimes called a patch - you "apply" the patch to the original file to produce the modified file. This is exactly what git apply and patch commands do. Patches are the foundation of:
- Version control - Git stores commits as trees of blobs, but patches are used for interchange, code review, and cherry-picking
- Code review - PR diffs show reviewers exactly what changed
- Deployment systems - config drift detection compares current state against desired state
- Document collaboration - Google Docs, Notion, and similar tools use operational transforms that are patch-like structures
Longest Common Subsequence (LCS)
The earliest practical approach to diffing is based on the Longest Common Subsequence problem.
Definition
A subsequence of a string is any subset of its characters taken in order, but not necessarily contiguous. For example, from ABCDE, the subsequence ACE is valid (take positions 1, 3, 5) but CAE is not (order is violated).
The LCS of two strings A and B is the longest sequence of elements that appears in both A and B in the same relative order. Finding the LCS tells you what the two strings have "in common" - the equal regions in the diff. Everything not in the LCS needs to be deleted from A or inserted from B.
For strings A = ABCDE and B = ACE:
- LCS =
ACE(length 3) - Characters to delete from A: B, D (positions 2 and 4)
- Characters to insert from B: none
- Edit distance = 2
Dynamic Programming Solution
The standard algorithm uses a 2D DP matrix. Let dp[i][j] = length of LCS of A[1..i] and B[1..j].
Recurrence:
dp[i][j] = dp[i-1][j-1] + 1 if A[i] == B[j]
dp[i][j] = max(dp[i-1][j], dp[i][j-1]) otherwise
Base case: dp[i][0] = dp[0][j] = 0 for all i, j.
Worked Example: "ABCDE" vs "ACE"
Building the DP matrix (rows = A = ABCDE, columns = B = ACE):
"" A C E
"" [ 0, 0, 0, 0 ]
A [ 0, 1, 1, 1 ]
B [ 0, 1, 1, 1 ]
C [ 0, 1, 2, 2 ]
D [ 0, 1, 2, 2 ]
E [ 0, 1, 2, 3 ]
Reading off: dp[5][3] = 3, so LCS length = 3. Backtracking through the matrix from bottom-right to top-left recovers the actual LCS: ACE.
Time and space complexity: O(mn) where m = len(A) and n = len(B). For large files (say 10,000 lines each), this means 100 million cells - slow and memory-intensive.
From LCS to Diff
Once you have the LCS, constructing the diff is straightforward: walk through both strings simultaneously. Characters in the LCS are "equal" regions; characters in A but not the LCS are deletions; characters in B but not the LCS are insertions.
Myers Algorithm
The LCS approach was the standard for years, but in 1986 Eugene W. Myers published a landmark paper: "An O(ND) Difference Algorithm and Its Variations" (Algorithmica, 1986). This algorithm is what Git uses by default and what most modern diff tools use.
Key Insight: Edit Graph
Myers frames the problem as a shortest path problem through an "edit graph". Imagine a grid where:
- The x-axis represents positions in string A (0 to m)
- The y-axis represents positions in string B (0 to n)
- Moving right = delete a character from A
- Moving down = insert a character from B
- Moving diagonally (down-right) = both strings have a matching character here (a "snake")
The shortest path from (0,0) to (m,n) through this graph - using as many diagonal moves as possible - is the optimal diff. The number of non-diagonal moves is the edit distance D.
O(ND) Complexity
The key insight is that if D (the edit distance) is small, you only need to explore a small portion of the grid. Myers uses a "greedy" search that explores paths diagonal by diagonal, parameterized by D.
- Time complexity: O(ND) where N = m + n and D = edit distance
- Space complexity: O(ND) in the basic version, reduceable to O(N) with the linear space refinement
Compare this to LCS: O(mn) time regardless of how similar the files are. For two files that are 95% identical (typical in code review), D is small and Myers is dramatically faster. For 10,000-line files with 500 changed lines, Myers explores roughly N*D = 20,000 * 500 = 10 million operations vs LCS's 100 million cells.
Why Git Uses Myers
Git's diff engine defaults to Myers because:
- Speed on similar files - code changes are usually small relative to file size
- Minimal diff - Myers produces the shortest diff (fewest edit operations)
- Well-understood - 40 years of implementation experience, highly optimised
Patience Diff
An optional alternative in Git is the patience diff algorithm (developed by Bram Cohen, also used by Mercurial). You can activate it with:
git diff --patience
Patience diff works better when code has been moved or refactored. Its strategy:
- Find lines that appear exactly once in both files (unique lines)
- Diff these unique lines to create anchor points
- Recursively diff the regions between anchors
The result is more human-readable for refactoring: instead of matching the wrong closing brace or a repeated return null;, it anchors on unique context and produces cleaner hunks. The trade-off is that it can be slower on files with few unique lines.
Unified Diff Format
The unified diff format is the standard output of diff -u and git diff. It is also the format used for patches exchanged between developers (mailing lists, git format-patch).
Structure
A unified diff file looks like this:
--- a/src/Calculator.php
+++ b/src/Calculator.php
@@ -12,7 +12,9 @@ class Calculator
public function divide(int $a, int $b): float
{
- return $a / $b;
+ if ($b === 0) {
+ throw new \DivisionByZeroError('Cannot divide by zero');
+ }
+ return $a / $b;
}
public function multiply(int $a, int $b): int
Breaking it down:
--- a/src/Calculator.php- the old version of the file (a/prefix is a Git convention)+++ b/src/Calculator.php- the new version (b/prefix)@@ -12,7 +12,9 @@- chunk header (hunk header):-12,7means the old file region starts at line 12 and spans 7 lines+12,9means the new file region starts at line 12 and spans 9 lines (2 lines added)- The text after
@@is optional context (method name, class name) added by Git for readability
- Lines with no prefix - context lines (unchanged, shown for context, typically 3 lines by default)
- Lines with
-prefix - deletions (present in old, absent in new) - Lines with
+prefix - insertions (absent in old, present in new)
Applying Patches
A unified diff can be applied with:
patch -p1 < my.patch # standard patch tool
git apply my.patch # git's version
The -p1 flag strips one path component prefix (the a/ or b/ Git convention).
Word-Level vs Line-Level Diff
Line-Level Diff (Default)
Standard Unix diff and Git work at the line level. The unit of comparison is a full line. This is efficient and works well for code, where lines are the natural unit of change.
Limitation: if you change one word in a long line, the entire line is marked as deleted and a new version of the line is marked as inserted. The line-level diff gives you no insight into what changed within the line.
Word-Level Diff
git diff --word-diff
This produces output like:
return [-$a / $b;-]{+$b === 0 ? throw new \DivisionByZeroError() : $a / $b;+}
Or in colour mode, the deleted and inserted words are highlighted within the line. Word-level diff is more informative for prose and for lines with many tokens.
Character-Level Diff
The most granular level. Useful for:
- Spotting a single changed character in a long string
- Comparing configuration values
- Reviewing SQL queries
Trade-off table:
| Level | Granularity | Noise level | Best for |
|---|---|---|---|
| Line | Low | Low | Code, structured text |
| Word | Medium | Medium | Prose, config files |
| Character | High | Can be high | Short strings, finding typos |
JSON Diff
Text diffing applied to JSON produces poor results. JSON files are often minified (single line), or formatted differently between two versions (different indentation, key ordering, trailing commas). A text diff will show the entire file as changed even if only one value changed.
Structural JSON Diff
A structural JSON diff compares the parsed data structures, not the raw text. The diff is expressed in terms of JSON paths and operations.
The standard format is RFC 6902 JSON Patch, which defines a set of operations on a JSON document:
| Operation | Description |
|---|---|
add |
Add a value at a path (or append to array) |
remove |
Remove the value at a path |
replace |
Replace the value at a path |
move |
Move value from one path to another |
copy |
Copy value from one path to another |
test |
Assert that a value at a path equals given value |
JSON Patch Example
Given these two JSON documents:
Before:
{
"name": "Alice",
"age": 30,
"roles": ["user"],
"address": {
"city": "London"
}
}
After:
{
"name": "Alice",
"age": 31,
"roles": ["user", "admin"],
"email": "alice@example.com"
}
The RFC 6902 JSON Patch to transform "before" into "after":
[
{ "op": "replace", "path": "/age", "value": 31 },
{ "op": "add", "path": "/roles/1", "value": "admin" },
{ "op": "add", "path": "/email", "value": "alice@example.com" },
{ "op": "remove", "path": "/address" }
]
JSON Patch is machine-readable, precise, and can be applied programmatically. It is used in REST APIs (the HTTP PATCH method commonly uses JSON Patch), configuration management, and event sourcing systems.
Limitations of Text Diff on JSON
Consider this simple change - only the age value changed:
// Before (compact)
{"name":"Alice","age":30,"roles":["user"]}
// After (compact)
{"name":"Alice","age":31,"roles":["user"]}
Text diff handles this fine. But if the formatter changed too:
// Before (compact)
{"name":"Alice","age":30,"roles":["user"]}
// After (pretty-printed)
{
"name": "Alice",
"age": 31,
"roles": [
"user"
]
}
Text diff shows every line as changed - completely unhelpful. Structural diff correctly shows only /age changed from 30 to 31.
Practical Use Cases
Version Control (Git)
Every git commit stores a snapshot, but git log -p, git diff, and PR interfaces reconstruct patches on demand using Myers diff. Merge conflict detection compares three versions: the common ancestor, the current branch, and the incoming branch.
Code Review
Pull request tools (GitHub, GitLab, Bitbucket) display unified diffs. Understanding the @@ chunk headers and context lines helps reviewers navigate large PRs quickly.
Deployment and Configuration Drift
Tools like Ansible, Terraform, and Kubernetes use diffing to show what will change before applying it. terraform plan output is essentially a structural diff of infrastructure state. Config drift detection alerts when running configuration diverges from desired state.
Database Migration Scripts
Comparing schema snapshots between environments uses structural diff techniques. Tools like Flyway and Liquibase track schema evolution as ordered patches.
API Contract Testing
Comparing OpenAPI specifications between API versions to detect breaking changes (removed fields, changed types) uses JSON structural diff.
Document Editing
Google Docs, Notion, and Figma use Operational Transforms (OT) or CRDTs (Conflict-free Replicated Data Types) - more sophisticated relatives of diff algorithms designed for real-time collaborative editing. They handle concurrent edits without conflicts.
Code Examples
Python: difflib
Python's standard library includes difflib with multiple diff strategies:
import difflib
old = """def calculate(a, b):
return a / b
"""
new = """def calculate(a, b):
if b == 0:
raise ValueError("Cannot divide by zero")
return a / b
"""
old_lines = old.splitlines(keepends=True)
new_lines = new.splitlines(keepends=True)
# Unified diff
diff = difflib.unified_diff(
old_lines,
new_lines,
fromfile='calculator_old.py',
tofile='calculator_new.py',
n=3 # context lines
)
print(''.join(diff))
# Similarity ratio (0.0 to 1.0)
matcher = difflib.SequenceMatcher(None, old, new)
print(f"Similarity: {matcher.ratio():.2%}")
# Word-level diff on a single line
old_line = "return a / b"
new_line = "return a / b if b != 0 else None"
words_old = old_line.split()
words_new = new_line.split()
delta = difflib.ndiff(words_old, words_new)
print(' '.join(delta))
difflib.SequenceMatcher uses a variant of the Ratcliff/Obershelp algorithm, not Myers, but it is fast and good enough for most in-process use cases.
JavaScript: diff library (jsdiff)
import { createPatch, diffWords, diffLines } from 'diff';
const oldStr = `function divide(a, b) {
return a / b;
}`;
const newStr = `function divide(a, b) {
if (b === 0) throw new Error('Division by zero');
return a / b;
}`;
// Unified diff patch
const patch = createPatch(
'calculator.js',
oldStr,
newStr,
'old',
'new'
);
console.log(patch);
// Word-level diff
const wordDiff = diffWords(
'The quick brown fox',
'The slow brown dog'
);
wordDiff.forEach(part => {
if (part.added) process.stdout.write('\x1b[32m' + part.value + '\x1b[0m');
else if (part.removed) process.stdout.write('\x1b[31m' + part.value + '\x1b[0m');
else process.stdout.write(part.value);
});
// Line-level diff with change detection
const changes = diffLines(oldStr, newStr);
changes.forEach(change => {
const prefix = change.added ? '+' : change.removed ? '-' : ' ';
change.value.split('\n').filter(Boolean).forEach(line => {
console.log(`${prefix} ${line}`);
});
});
Install with: npm install diff
PHP: SebastianBergmann/diff
<?php
declare(strict_types=1);
require 'vendor/autoload.php';
use SebastianBergmann\Diff\Differ;
use SebastianBergmann\Diff\Output\UnifiedDiffOutputBuilder;
use SebastianBergmann\Diff\Output\StrictUnifiedDiffOutputBuilder;
$old = <<<PHP
function divide(int \$a, int \$b): float
{
return \$a / \$b;
}
PHP;
$new = <<<PHP
function divide(int \$a, int \$b): float
{
if (\$b === 0) {
throw new \\DivisionByZeroError('Cannot divide by zero');
}
return \$a / \$b;
}
PHP;
// Standard unified diff output
$builder = new UnifiedDiffOutputBuilder(
"--- Original\n+++ New\n",
true // show line numbers
);
$differ = new Differ($builder);
echo $differ->diff($old, $new);
// Strict unified diff (compatible with patch command)
$strictBuilder = new StrictUnifiedDiffOutputBuilder([
'collapseRanges' => true,
'commonLineThreshold' => 6,
'contextLines' => 3,
'fromFile' => 'Calculator.php',
'fromFileDate' => '2026-01-01 00:00:00',
'toFile' => 'Calculator.php',
'toFileDate' => '2026-03-14 00:00:00',
]);
$strictDiffer = new Differ($strictBuilder);
echo $strictDiffer->diff($old, $new);
Install with:
composer require sebastian/diff
This library is used internally by PHPUnit to produce readable assertion failure messages.
JSON Diff in PHP
<?php
declare(strict_types=1);
function jsonPatch(array $before, array $after, string $path = ''): array
{
$ops = [];
foreach ($before as $key => $value) {
$currentPath = $path . '/' . $key;
if (!array_key_exists($key, $after)) {
$ops[] = ['op' => 'remove', 'path' => $currentPath];
} elseif (is_array($value) && is_array($after[$key])) {
$ops = array_merge($ops, jsonPatch($value, $after[$key], $currentPath));
} elseif ($value !== $after[$key]) {
$ops[] = ['op' => 'replace', 'path' => $currentPath, 'value' => $after[$key]];
}
}
foreach ($after as $key => $value) {
if (!array_key_exists($key, $before)) {
$ops[] = ['op' => 'add', 'path' => $path . '/' . $key, 'value' => $value];
}
}
return $ops;
}
$before = ['name' => 'Alice', 'age' => 30, 'roles' => ['user']];
$after = ['name' => 'Alice', 'age' => 31, 'roles' => ['user', 'admin'], 'email' => 'alice@example.com'];
$patch = jsonPatch($before, $after);
echo json_encode($patch, JSON_PRETTY_PRINT);
For production use, consider libraries like swaggest/json-diff which handle arrays, ordering, and RFC 6902 compliance properly.
Algorithm Comparison Summary
| Algorithm | Time | Space | Best for |
|---|---|---|---|
| LCS (DP) | O(mn) | O(mn) | Small inputs, educational use |
| Myers (default) | O(ND) | O(ND) | Code diffs, similar files |
| Patience | O(N log N) | O(N) | Refactored code, moved blocks |
| Histogram | O(N log N) | O(N) | Alternative to patience in recent Git |
Where m, n = file sizes; N = m+n; D = edit distance.
The histogram diff (git diff --histogram) is a refinement of patience diff that Git added later. It handles common lines better and is the default in some Git configurations.
Worth Knowing Deeply
Diff algorithms are a beautiful intersection of computer science theory and everyday developer tooling. The jump from O(mn) LCS to O(ND) Myers represents a fundamental insight: when files are mostly similar, you should only do work proportional to the differences, not to the total size. Patience and histogram diffs show that even well-understood problems have room for improvement when you think about human readability as a first-class goal.
Whether you are building a code review tool, a configuration management system, an API versioning checker, or just want to understand what git diff is doing, these algorithms are worth knowing deeply.