Rich Text to Markdown: How to Convert Google Docs, Word, and Notion Cleanly
1 May, 2026 Web
You hit Cmd+V, expect clean Markdown, and instead get a wall of <span style="font-family: 'Arial Narrow MT'; mso-font-charset: 0"> garbage that takes half an hour to scrub. I've done this more times than I care to count — pasting from a Google Doc into a GitHub README, watching the editor fill up with vendor noise, then spending twenty minutes with find-and-replace trying to recover something a decent converter should have handled in milliseconds. Rich text editors and Markdown speak different languages, and the bridge between them breaks in places nobody bothers to document.
This piece covers what actually survives the conversion, where the common sources — Google Docs, Word, Notion, Confluence — each introduce their own particular flavour of chaos, and how to clean up the output when the converter leaves cruft behind.
Why Markdown Wins for Long-Term Content
The core problem with Google Docs and Word is lock-in. Your content lives inside a proprietary format that requires proprietary tooling to open, edit, and export reliably. Markdown is just text. It renders the same in a GitHub README, a static-site generator, a wiki, an AI prompt, or a terminal. That portability is not a nice-to-have — it is the difference between content you own and content you rent from a vendor.
Markdown is also diff-able. Version-control systems understand line changes; they do not understand binary .docx deltas. If you are reviewing a pull request for documentation, you want to see what changed — not a flag that says "binary file differs." The ability to diff, review, and revert content is exactly why engineering teams reach for Markdown for documentation even when the rest of the organisation lives in Google Docs.
If you are new to Markdown's edge cases, the Markdown syntax reference is the place to start — it covers the CommonMark vs GFM distinctions this article assumes you already know.
How Rich Text to Markdown Conversion Actually Works
The pipeline has four steps: parse the source into an HTML DOM, sanitise that DOM to strip vendor noise and unsafe content, walk the tree node by node, and emit Markdown for each node type. That sequence matters. Skip the sanitisation step and you are converting raw vendor HTML — a mess of mso-* attributes, tracking pixels, and occasionally base64-encoded payloads sitting in <img src> — directly into Markdown. The output will be wrong.
HTML is the universal middle layer here because every major rich text source exposes it. When you copy from Google Docs, your clipboard receives both text/plain and text/html. Same for Word. Notion does the same. The raw HTML is ugly, but it is structured — and a library like Turndown can walk the DOM and apply per-node rules to emit Markdown. A sanitiser like DOMPurify runs first, with an allowlist of tags and attributes, so only the semantic signal reaches the converter.
One thing to understand clearly: this is a one-way operation. Markdown can express only a subset of what HTML can, so the conversion discards everything presentational: colours, fonts, column layouts, tracked changes. Converting back from Markdown to rich text recovers the structure but none of that. Round-tripping loses data. It always will.
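The "walk the tree and emit Markdown per node type" step can be sketched in a few lines. This is a minimal illustration using Python's standard-library `html.parser` rather than Turndown, it assumes the HTML has already been sanitised, and it handles only headings, paragraphs, bold, italic, and links. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class MarkdownEmitter(HTMLParser):
    """Walk HTML tag by tag and emit Markdown for a handful of node types."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    BLOCKS = HEADINGS | {"p"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []         # Markdown fragments, joined at the end
        self.href_stack = []  # hrefs of the <a> elements currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            # <h2> becomes "## ", <h3> becomes "### ", and so on
            self.out.append("#" * int(tag[1]) + " ")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("_")
        elif tag == "a":
            self.href_stack.append(dict(attrs).get("href") or "")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in self.BLOCKS:
            self.out.append("\n\n")  # blank line between blocks
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("_")
        elif tag == "a" and self.href_stack:
            self.out.append(f"]({self.href_stack.pop()})")

    def handle_data(self, data):
        self.out.append(data)


def html_to_markdown(html: str) -> str:
    emitter = MarkdownEmitter()
    emitter.feed(html)
    emitter.close()
    return "".join(emitter.out).strip()
```

Anything the emitter has no rule for (spans, divs, styles) contributes only its text, which is the "structure survives, presentation does not" behaviour in miniature.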
What Survives the Conversion
Most of what you care about comes through cleanly. The table below maps common HTML elements to their Markdown equivalents and notes where renderer compatibility matters.
| Element | Source HTML | Markdown output | Notes |
|---|---|---|---|
| Headings | <h1> to <h6> | # to ###### | ATX style, clean |
| Bold / italic | <strong> / <em> | **bold** / _italic_ | Mixed marks survive |
| Lists | <ul> / <ol> | - / 1. | Nested lists preserved |
| Links | <a href> | [text](url) | Title attribute kept |
| Tables | <table> | GFM pipe table | Requires GFM-compatible renderer |
| Code | <code> / <pre> | `inline` / fenced | Language hint when class is set |
| Strikethrough | <s>, <del> | ~~text~~ | GFM only |
| Blockquote | <blockquote> | > | Nested quotes survive |
| Horizontal rule | <hr> | --- | |
| Images | <img> | `![alt](src)` | See dedicated section below |
Tables and strikethrough require a GFM-compatible renderer — they are not part of CommonMark. If your destination is a strict CommonMark parser, treat those two rows as "does not survive."
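For the table row in particular, the emit step is mostly string assembly. The sketch below assumes the converter has already extracted the table into a list of rows (the first row being the header). The function name is hypothetical. It escapes pipe characters inside cells, since an unescaped pipe would split the cell in a GFM renderer.

```python
from typing import List


def to_gfm_table(rows: List[List[str]]) -> str:
    """Render rows (header first) as a GFM pipe table."""

    def cell(text: str) -> str:
        # Pipes must be escaped inside cells; newlines cannot appear in a row.
        return text.replace("|", "\\|").replace("\n", " ").strip()

    header, *body = rows
    width = len(header)
    lines = [
        "| " + " | ".join(cell(c) for c in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        # Pad short rows and truncate long ones to the header width.
        padded = (list(row) + [""] * width)[:width]
        lines.append("| " + " | ".join(cell(c) for c in padded) + " |")
    return "\n".join(lines)
```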
What Doesn't Survive (and Why)
Think of Markdown as structure, not presentation. Anything that is purely visual — anything a designer would care about — should be treated as collateral damage before you start.
Inline styles are the biggest category. color, font-family, text-align, background-color, custom CSS — all of it drops. Markdown has no syntax for these. A red heading becomes a regular heading. Centred text becomes left-aligned text.
Multi-column layouts and floats flatten to a single column. There is no Markdown concept of a float. The content survives; the layout does not.
Comments and tracked changes in Word and Google Docs are stripped entirely. A converter sees the accepted state of the document — the comment threads and pending revisions are not in the HTML payload.
Footnotes are the awkward edge case. GFM supports them, but most rich-text sources expose footnote references as superscript numbers linked to anchors at the bottom of the page. A converter that does not specifically handle this pattern will emit something structurally mangled. Most don't handle it.
Embedded objects — charts, drawings, equations rendered as images — become <img> references at best. If the source did not serialise them to image files, they disappear entirely.
Source-Specific Quirks
Google Docs
Google Docs is the most deceptive source. It wraps content in <b style="font-weight: normal"> — using bold tags with inline styles to unset the default bold behaviour — which means a naïve converter will emit **text** for content that was never actually bold. A smart sanitiser has to detect this pattern and strip the tag without applying the bold rule.
Lists come through as <ul style="list-style-type: disc"> or <ul style="list-style-type: circle">. That type information is lost in Markdown, where all unordered lists use the same bullet character. For most content this is irrelevant. For anyone who deliberately used different bullet styles to communicate structure, it is a problem.
Use the HTML clipboard stream, not text/plain. The plain-text version strips all structure.
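The fake-bold check comes down to reading the inline style before applying the bold rule. A minimal sketch, assuming the tag name and its `style` attribute are available as strings; the function name and the numeric threshold of 600 are my own assumptions, not something Google Docs documents.

```python
import re
from typing import Optional

FONT_WEIGHT = re.compile(r"font-weight\s*:\s*([^;]+)", re.IGNORECASE)


def is_real_bold(tag: str, style: Optional[str]) -> bool:
    """Return True only if a <b>/<strong> tag should emit ** markers."""
    if tag not in ("b", "strong"):
        return False
    match = FONT_WEIGHT.search(style or "")
    if not match:
        return True  # no override, so the tag's default bold applies
    weight = match.group(1).strip().lower()
    if weight in ("normal", "lighter"):
        return False  # the Google Docs wrapper pattern: bold tag, bold unset
    if weight.isdigit():
        return int(weight) >= 600  # assumed cut-off for "visually bold"
    return True  # "bold", "bolder"
```

A converter would call this inside its bold rule and fall through to plain text when it returns False.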
Microsoft Word
Word pastes Office Open XML wrapped in HTML, with mso-* styles on nearly every element and class="MsoNormal" on most paragraphs. The <o:p> empty paragraph element appears constantly — it needs to be stripped before conversion or it produces blank lines throughout the output.
Smart quotes (“ and ”) come through as Unicode characters. Most Markdown renderers handle this without incident, but some older parsers escape them. Worth checking in the preview step.
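A pre-clean pass for Word output might look like the following. It is a rough sketch built on regular expressions, which are brittle against arbitrary HTML (the mso-* pattern would also hit matching text in the body), so treat it as a first filter ahead of a real sanitiser rather than a replacement for one. All names are hypothetical.

```python
import re

# <o:p>...</o:p> and self-closing <o:p/>
O_P = re.compile(r"<o:p>.*?</o:p>|<o:p\s*/>", re.IGNORECASE | re.DOTALL)
# A single mso-* declaration inside a style attribute
MSO_DECL = re.compile(r"mso-[\w-]+\s*:\s*[^;\"']*;?\s*", re.IGNORECASE)
# style="" left behind once every declaration has been removed
EMPTY_STYLE = re.compile(r"\s+style=([\"'])\s*\1", re.IGNORECASE)
# class="MsoNormal", class=MsoNormal, class='MsoListParagraph', ...
MSO_CLASS = re.compile(r"\s+class=[\"']?Mso\w+[\"']?", re.IGNORECASE)
# Paragraphs that are empty after the steps above
EMPTY_P = re.compile(r"<p\b[^>]*>\s*</p>", re.IGNORECASE)


def preclean_word_html(html: str) -> str:
    """Strip the most common Word paste noise before sanitising."""
    for pattern in (O_P, MSO_DECL, EMPTY_STYLE, MSO_CLASS, EMPTY_P):
        html = pattern.sub("", html)
    return html
```

The order matters: empty paragraphs only become detectable once the `<o:p>` elements inside them are gone.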
Notion
Notion is the cleanest of the sources covered here. It already thinks in blocks, and its HTML export reflects that block structure with reasonable fidelity. Code blocks include the language as a class on the <code> element, and a good converter picks that up and emits a fenced block with the language hint.
The things that don't survive are Notion-specific features: toggles collapse into plain paragraphs, callouts become blockquotes, synced blocks lose their sync relationship entirely. The content is there. The Notion-specific behaviour is gone.
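Picking up the language hint is a small amount of code. The sketch below assumes the class follows the common `language-xxx` (or `lang-xxx`) naming convention; that naming is an assumption on my part, so check the HTML tab of whatever converter you use to see what your export actually emits. Function names are hypothetical.

```python
import re
from typing import Optional


def language_from_class(class_attr: Optional[str]) -> str:
    """Extract a language hint from a <code> element's class attribute."""
    for name in (class_attr or "").split():
        match = re.fullmatch(r"(?:language|lang)-([\w+#-]+)", name)
        if match:
            return match.group(1).lower()
    return ""


def fenced_block(code: str, class_attr: Optional[str] = None) -> str:
    """Wrap code in a fenced block, with a language hint when one is found."""
    fence = "`" * 3
    # Lengthen the fence if the code itself contains a run of backticks
    # that would otherwise close the block early.
    while fence in code:
        fence += "`"
    return f"{fence}{language_from_class(class_attr)}\n{code.rstrip()}\n{fence}"
```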
Confluence
Confluence macro output — {info}, {code}, {expand}, {warning} — comes through as <div class="confluence-information-macro"> with the macro type in the class name. A generic converter will treat this as an unstyled div and emit a plain paragraph. Inline JIRA issue links survive as plain hyperlinks; the issue metadata and macro context do not.
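One way to avoid the plain-paragraph outcome is a custom rule that reads the macro type from the class name and emits a labelled blockquote. The sketch below assumes the type appears as a suffix, as in `confluence-information-macro-warning`. That exact class shape is an assumption extrapolated from the pattern described above, and the function name is hypothetical.

```python
import re
from typing import Optional

# Assumed shape: "confluence-information-macro-<type>"
MACRO_CLASS = re.compile(r"confluence-information-macro-(\w+)")


def macro_to_blockquote(class_attr: str, text: str) -> Optional[str]:
    """Turn a Confluence macro div into a labelled Markdown blockquote.

    Returns None when the class does not look like a macro, so the
    caller can fall back to its default rule.
    """
    match = MACRO_CLASS.search(class_attr)
    if not match:
        return None
    label = match.group(1).capitalize()
    lines = text.strip().splitlines() or [""]
    quoted = "\n".join(f"> {line}".rstrip() for line in lines)
    return f"> **{label}:**\n{quoted}"
```

The label keeps the one piece of macro context that Markdown can still carry: what kind of callout it was.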
The Sanitisation Step Nobody Talks About
Pasting raw clipboard HTML directly into a Markdown converter is a security choice as much as a technical one. Tracking pixels, inline scripts, and base64-encoded payloads inside <img src> attributes all arrive on the clipboard as-is. A converter that skips sanitisation passes all of that through — or worse, embeds it silently in the output.
A good converter sanitises before conversion, running the clipboard HTML through a strict allowlist. Only tags that have semantic Markdown equivalents are permitted: <h1> through <h6>, <strong>, <em>, <a>, <ul>, <ol>, <li>, <blockquote>, <code>, <pre>, <table>, <img>, and a handful of structural containers. Everything else is stripped. Attributes get the same treatment — href and src are allowed; onclick, style (mostly), and data-* tracking attributes are not.
This is the same defence-in-depth thinking that applies to URL encoding and Base64 — if you accept untrusted input, sanitise before you parse, not after.
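To make the allowlist idea concrete, here is a minimal sketch using Python's standard-library `html.parser`. It is not DOMPurify and should not be treated as a hardened sanitiser; the tag and attribute lists, the URL scheme check, and all names are assumptions chosen for illustration. Note that this sketch also drops `data:` URIs from `src`, which is stricter than some converters.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {
    "h1", "h2", "h3", "h4", "h5", "h6", "p", "strong", "em", "a",
    "ul", "ol", "li", "blockquote", "code", "pre", "img", "br", "hr",
    "table", "thead", "tbody", "tr", "th", "td",
}
ALLOWED_ATTRS = {"a": {"href"}, "img": {"src", "alt"}}
DROP_WITH_CONTENT = {"script", "style"}  # remove the tag and its contents
VOID_TAGS = {"img", "br", "hr"}          # never emit a closing tag
SAFE_SCHEMES = {"", "http", "https", "mailto"}  # "" means a relative URL


def has_safe_scheme(url: str) -> bool:
    head = url.strip().split(":", 1)
    scheme = head[0].lower() if len(head) == 2 else ""
    return scheme in SAFE_SCHEMES


class AllowlistSanitiser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag, keep its text
        parts = [tag]
        for name, value in attrs:
            value = value or ""
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue  # style, onclick, data-*, class, ...
            if name in ("href", "src") and not has_safe_scheme(value):
                continue  # javascript:, data:, and anything unlisted
            parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitise(html: str) -> str:
    parser = AllowlistSanitiser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The design choice worth copying is the direction of the check: everything is rejected unless it is listed, so a new vendor attribute or an unfamiliar URL scheme fails closed.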
Images: Inline, Linked, or Lost?
Pasted images are the conversion step most likely to produce broken output. When you copy an image from a Google Doc or a Word file, the clipboard often carries a data:image/png;base64,... URI: the entire image encoded inline as a string that can run to hundreds of kilobytes. Most Markdown destinations do not render these. GitHub will not display a base64 data URI in a Markdown image tag. Neither will most static-site generators.
Three options, in order of how much they actually work:
- Keep the data URI. Only viable for small icons under a few kilobytes. For anything larger, the Markdown file becomes unwieldy and the URI often gets rejected at render time.
- Upload to a CDN, replace src. The right answer for serious migrations. Script the upload step, get back a stable URL, replace the data URI in the output. The Image to Base64 Data URIs article covers when inlining is actually acceptable.
- Drop the image, leave a placeholder. For content where the image is decorative, sometimes the honest move is to emit a placeholder image reference and handle it manually.
A converter that silently drops images without a placeholder is worse than one that leaves the data URI — at least the data URI signals that something was there.
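The first and third options can be combined in one pass over the converted Markdown: keep small data URIs, replace large ones with a visible marker. In this sketch the size threshold and the placeholder text are hypothetical choices, and the upload-to-CDN step is deliberately left out because it depends entirely on your hosting.

```python
import re

# Markdown image whose target is a base64 data URI
DATA_IMAGE = re.compile(
    r"!\[([^\]]*)\]\((data:image/[\w.+-]+;base64,[A-Za-z0-9+/=]+)\)"
)
MAX_INLINE_CHARS = 4096  # hypothetical cut-off for "small icon"


def handle_data_uris(markdown: str, max_inline_chars: int = MAX_INLINE_CHARS) -> str:
    """Keep small inline images; replace large ones with a placeholder."""

    def replace(match: "re.Match") -> str:
        alt, uri = match.group(1), match.group(2)
        if len(uri) <= max_inline_chars:
            return match.group(0)  # small enough to leave inline
        # Visible marker so the missing image is not lost silently
        return f"[image placeholder: {alt or 'untitled'}]"

    return DATA_IMAGE.sub(replace, markdown)
```

Searching the output for "image placeholder" afterwards gives you the list of images that still need a real home.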
Cleaning Up the Output
Even a well-configured converter leaves artefacts. Stray <br> tags that the sanitiser didn't strip. Double blank lines between sections. Escaped characters that didn't need escaping — \. where a plain . would have been fine, \- at the start of a line that wasn't actually a list item.
A short cleanup pass fixes most of this. Strip any remaining <br> with a regex substitution. Collapse runs of three or more newlines to two, which leaves a single blank line between blocks. Remove unnecessary escape sequences by matching \\([^\w\s]) and replacing with the bare character where it doesn't affect parsing.
The regex guide has ready-to-use patterns for exactly these cases — stripping <br>, normalising whitespace, and de-escaping common false positives.
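A cleanup pass along those lines might look like this. It is deliberately more conservative than a blanket de-escape: it only unescapes `\.` and `\-` directly after a letter, so an escaped `1\.` that really is protecting against an accidental ordered list stays escaped. It also turns `<br>` into a newline rather than deleting it, so adjacent words do not fuse; both are choices of this sketch, and the function name is hypothetical. It does not skip fenced code blocks, so run it before code is fenced or accept that code content may be touched.

```python
import re


def clean_markdown(md: str) -> str:
    """Tidy common converter artefacts in Markdown output."""
    # Leftover <br>, <br/>, <br /> become a newline
    md = re.sub(r"<br\s*/?>", "\n", md, flags=re.IGNORECASE)
    # Trailing spaces and tabs at the end of each line
    md = re.sub(r"[ \t]+$", "", md, flags=re.MULTILINE)
    # Three or more newlines collapse to two (one blank line)
    md = re.sub(r"\n{3,}", "\n\n", md)
    # Unescape "\." and "\-" only when they directly follow a letter
    md = re.sub(r"(?<=[A-Za-z])\\([.\-])", r"\1", md)
    return md
```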
When Not to Convert
If your destination renders HTML, converting to Markdown creates extra work for no gain. You would be taking a format the destination understands natively, converting it to a subset format, and introducing the possibility of information loss. Just keep the HTML.
If you need styling fidelity — brand colours, exact fonts, specific layout — Markdown is the wrong target. The conversion will work, and the output will look nothing like the source.
If the source is already a Markdown-aware tool — GitLab wikis, Bear, Obsidian — use the native export. Those tools emit clean Markdown directly. Running their output through an HTML-intermediary converter adds a lossy step that serves nobody.
Doing the Conversion
I built a rich text to Markdown converter that runs entirely in the browser. Paste from Google Docs, Word, or Notion and get GFM-compatible Markdown back. The conversion happens client-side — the document never leaves the page.
The interface has three tabs: the paste area, the Markdown output, and the raw HTML the converter received. That HTML tab is the diagnostic layer — if the output looks wrong, check what the sanitiser actually received before filing a bug. You can also upload .html files directly for batch-style processing.
Copy the Markdown output with one click, or download it as a .md file. The GFM output handles tables, strikethrough, and fenced code blocks with language hints. For strict CommonMark targets, check the output for tables and strikethrough; fenced code blocks are part of CommonMark and need no special handling.
Verifying the Output
Render it before you commit it. Paste the output into the Markdown previewer and look at the structure — heading hierarchy, table alignment, nested list indentation. Visual inspection in a renderer catches problems that reading raw Markdown misses.
For batch migrations, use the text diff tool to compare source HTML against converted Markdown at the structural level — you'll spot systematic losses quickly. If the same element type is broken in ten files, it's a converter configuration issue, not ten individual problems.
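A structural comparison does not need a full diff to be useful. The sketch below counts headings, links, and images on both sides and reports only the mismatches. The counting rules are simplistic assumptions (ATX headings only, inline links only, no awareness of code blocks), and the names are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from typing import Dict


class TagCounter(HTMLParser):
    """Count the structural elements present in the source HTML."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self.counts["headings"] += 1
        elif tag == "a" and dict(attrs).get("href"):
            self.counts["links"] += 1
        elif tag == "img":
            self.counts["images"] += 1


def count_markdown(md: str) -> Counter:
    return Counter({
        "headings": len(re.findall(r"^#{1,6} ", md, flags=re.MULTILINE)),
        "images": len(re.findall(r"!\[[^\]]*\]\([^)]*\)", md)),
        # (?<!!) excludes image syntax from the link count
        "links": len(re.findall(r"(?<!!)\[[^\]]*\]\([^)]*\)", md)),
    })


def structural_losses(html: str, md: str) -> Dict[str, int]:
    """Return {element: html_count - markdown_count} for mismatches only."""
    parser = TagCounter()
    parser.feed(html)
    md_counts = count_markdown(md)
    return {
        key: parser.counts[key] - md_counts[key]
        for key in ("headings", "links", "images")
        if parser.counts[key] != md_counts[key]
    }
```

Run it over every file in a batch and aggregate the results: the same key showing up in every report points at converter configuration rather than at the individual documents.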
For SEO-driven content, run the result through a word counter and verify that headings produce clean URL slugs — the URL slugs guide covers what makes a slug well-formed and what Markdown-to-HTML pipelines typically do to heading text.
Convert through HTML, sanitise before you parse, and treat anything purely visual as collateral damage. For a one-off conversion the in-browser tool covers 95% of cases cleanly. For repeat migrations — moving a Confluence space to a docs-as-code repo, or exporting a Notion database to a static site — build a cleanup script around a Markdown formatter and a regex pass, diff every file before you commit it, and budget for the images separately. The conversion itself is the easy part.