Punycode: My New Favorite Algorithm

I recently implemented a pure Haskell version of Punycode for my idn package, which I needed to support the JSON Schema library I’m working on. Going into it, I expected a straightforward encoding problem where we’d just map Unicode to base-36 and call it a day. I was completely wrong, but in a really delightful way.

Punycode has turned out to be one of the cleverest algorithms I’ve encountered in a while. As an implementor it’s deceptively simple on the surface, but there’s real sophistication in how it manages to work around the constraints and optimization problems inherent in encoding arbitrary Unicode within DNS’s ASCII-only infrastructure. I haven’t had to think this carefully about text encoding efficiency before, because most modern systems I work with let you throw UTF-8 at the problem and move on. Punycode doesn’t have that luxury, since it has to maintain backwards compatibility with ASCII-only domain names.

A high level summary of what makes it so neat: the algorithm uses adaptive bias adjustment, variable-length encoding, and delta compression to achieve extremely dense (in the information-theoretic sense) results for both single-script domains (like German bücher-café.de or Chinese 北京-旅游.cn) and mixed-script domains (like hello世界.com). What makes it particularly elegant is that it does this without any shared state between encoder and decoder. The adaptation is purely a deterministic function of the encoded data itself. This is a really cool example of how you can use the constraints of a system to your advantage to solve a problem.

If you’re the type who likes to see things work (and who isn’t?), there’s a step-through visualizer at the end of this article where you can watch the algorithm encode domains from various scripts.

A note on this article’s approach: RFC 3492, which specifies Punycode, follows the typical RFC pattern. It’s prescriptive rather than explanatory.1 It tells you what to implement with exacting precision, but generally doesn’t explain in-depth why specific design choices were made. If you’re lucky, you’ll get a terse explanation, but you don’t really get a peek into the minds of the authors. In the Punycode RFC, the authors note the parameters were “chosen heuristically” and “by trial and error,”2 but don’t elaborate on the reasoning behind the algorithm’s structure. This article is my attempt at reverse-engineering that thought process. Working backward from the implementation to understand the problems each piece solves. Some of this is speculation informed by the algorithm’s behavior and my own implementation experience, so take it with appropriate skepticism. I’m just some dude on the internet.

Disclaimer: This article is covering a lot of ground and I’m not an expert on all of the topics I’m writing about. If you see something that’s wrong, please email me and let me know!

The Problem

DNS infrastructure only supports ASCII characters (a-z, 0-9, and hyphens).3 But domain names need to represent text in any language: 北京旅游, bücher-café, παράδειγμα, مرحبا. Punycode solves this by encoding Unicode strings into ASCII strings that are always reversible. When you see xn-- in a domain name, you’re looking at Punycode in action.4

The Algorithm at a High Level

Before diving into why each piece exists, let’s walk through what Punycode actually does. The approach is relatively straightforward:

  1. Extract and output all ASCII characters as-is. For bücher-café, output bcher-caf immediately. These characters don’t need encoding. They’re already DNS-compatible.

  2. Add a delimiter. If there were any ASCII characters, append a hyphen: bcher-caf-. This separates the literal ASCII prefix from the encoded non-ASCII part.

  3. Process non-ASCII characters in sorted order by code point. Start at Unicode code point 128 and work upward through the Unicode space. For each non-ASCII character in the string, encode:

    • How far to “jump” in Unicode space from your current position (the delta, which is just the difference between code points)
    • Where to insert this character in the output string (the position)

    This sorting has a clever side effect. Duplicate characters are processed consecutively, which means zero deltas between them. Since they’re the same code point, the difference is zero.

  4. Use variable-length base-36 encoding for the deltas. Small deltas use fewer digits, large deltas use more. This is basic information theory. Assign short codes to frequent events.

  5. Adapt the encoding parameters as you go. After encoding each character, adjust the algorithm’s thresholds based on whether you’re seeing small or large deltas. This is where the real cleverness lives.

That’s the “complete” algorithm.
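
Steps 1 and 2 are simple enough to show directly. Here’s a minimal Haskell sketch (the function name is mine, not something from the idn package):

import Data.Char (ord)
import Data.List (partition)

-- Steps 1 and 2: copy the ASCII characters through unchanged and, if there
-- were any, append the delimiter hyphen. The non-ASCII characters are
-- returned separately for the encoding steps that follow.
splitBasic :: String -> (String, String)
splitBasic label = (prefix, nonAscii)
  where
    (ascii, nonAscii) = partition (\c -> ord c < 128) label
    prefix = if null ascii then "" else ascii ++ "-"

-- splitBasic "bücher-café" == ("bcher-caf-", "üé")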

Why This Structure Works

So why these specific choices? Each one solves a real problem.

ASCII passthrough turns out to be really important for practical efficiency. Most domain names, even internationalized ones, contain significant ASCII content. Company names often have ASCII prefixes or suffixes (shop-日本商店.com, bücher-café-paris.fr). By handling ASCII directly without encoding overhead, you’re not paying encoding costs for characters that don’t need encoding.

Delta encoding exploits what I think is a key property of human writing: locality. Unicode organizes scripts into contiguous blocks.5 Latin is U+0000-U+024F, Greek is U+0370-U+03FF, Chinese is U+4E00-U+9FFF. When you write text, consecutive characters are almost always from the same script. This means the difference (delta) between consecutive Unicode code points is typically small.

Consider こんにちは世界 (Hello World in Japanese):

  • Absolute positions: U+3053 (12,371), U+3093 (12,435), U+306B (12,395), U+3061 (12,385), U+306F (12,399), U+4E16 (19,990), U+754C (30,028)
  • Deltas: 12,243 from the starting point of 128, then 64, -40, -10, 14, 7,591, 10,038 between consecutive characters

Those small deltas within the hiragana range (64, 40, 10, 14) are much smaller than encoding absolute positions. For single-script domains, this savings compounds across every character. Even for mixed-script domains, you’re no worse off than encoding absolute positions.
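
You can see that locality directly by taking consecutive code point differences. These aren’t the deltas Punycode actually encodes (those come later, from sorted order and insertion positions); it’s just a quick illustration:

import Data.Char (ord)

-- Differences between consecutive code points, measured from the starting
-- point of 128. For こんにちは世界 this yields
-- [12243, 64, -40, -10, 14, 7591, 10038].
inOrderDeltas :: String -> [Int]
inOrderDeltas s = zipWith (-) cps (128 : cps)
  where
    cps = map ord s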

Variable-length encoding addresses the core tension in any compression scheme: you need to handle both small and large values efficiently. If you use fixed-width encoding (say, 5 base-36 digits to handle any Unicode code point), you waste massive amounts of space on the common case (small deltas). If you use too few digits, you can’t represent large deltas at all.

The solution is to use as many digits as needed: 1 digit for values 0-35, 2 digits for values 36-1,295, 3 digits for larger values, and so on. This is essentially a base-36 variant of variable-length quantity encoding. Frequent events (small deltas in single-script text) get short codes, rare events (large deltas in mixed-script text) get longer codes.

Adaptive encoding is where Punycode moves from “clever” to “remarkably clever.” The problem is that “small” and “large” are contextually dependent. For a German domain with characters in the Latin Extended range, a delta of 50 is large. For a Chinese domain with characters clustered in the CJK Unified Ideographs block, a delta of 500 is small (just moving between adjacent hanzi).

A fixed encoding scheme forces you to choose: optimize for European languages and penalize Asian languages, or vice versa. Punycode refuses this tradeoff. Instead, it learns the distribution of your specific string as it encodes it, adjusting its thresholds on the fly to match the actual data.

How Sorting Handles Duplicates

There’s a subtle but important detail in step 3: characters are processed in sorted order by their Unicode code points, not in the order they appear in the input string. This means if your domain is ñoño, the algorithm doesn’t process the characters in the order they appear (ñ-o-ñ-o). Instead, it processes all o’s from the ASCII passthrough, then both instances of ñ consecutively.

Why does this matter? When you have duplicate characters, processing them consecutively means the delta between them is zero. They’re the same code point. A delta of zero encodes as a single digit. You get natural, automatic compression for repeated characters without any special-case handling.

Consider bücher versus büüücher. The second string has three ü characters. In sorted order, the algorithm processes:

  1. First ü: delta from 128 to 252 = 124 (requires multiple digits)
  2. Second ü: delta from 252 to 252 = 0 (encodes as a single ‘a’)
  3. Third ü: delta from 252 to 252 = 0 (encodes as a single ‘a’)

Without sorting, you’d need to jump backward and forward in Unicode space to handle characters in document order. Much larger deltas. The sorting ensures you always move forward (or stay in place for duplicates), keeping deltas minimal.
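
Here’s the duplicate effect in miniature: sort the non-ASCII code points first, and the repeated characters collapse into zero deltas (a toy illustration; the real encoder also folds insertion positions into each delta):

import Data.Char (ord)
import Data.List (sort)

-- Deltas between consecutive non-ASCII code points after sorting.
-- sortedDeltas "büüücher" == [124, 0, 0]
sortedDeltas :: String -> [Int]
sortedDeltas s = zipWith (-) cps (128 : cps)
  where
    cps = sort [ord c | c <- s, ord c >= 128]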

It’s elegant. A simple algorithmic choice (sort by code point) provides compression benefits without explicit deduplication logic.

A Concrete Walkthrough

Let’s trace through encoding bücher step by step to see how this works in practice.

Step 1: Extract ASCII

  • Input: bücher
  • ASCII characters: b, c, h, e, r
  • Output so far: bcher-

The hyphen acts as a delimiter. Everything before it is literal ASCII that appears in the output string. Everything after it is encoded position and delta information.

Step 2: Process non-ASCII characters

  • Non-ASCII character: ü (U+00FC = 252 in decimal)
  • Current code point: 128 (the starting position, which is where non-ASCII Unicode begins)
  • Code point jump: 252 - 128 = 124
  • The delta folds in the insertion position: 124 × (5 ASCII characters + 1) = 744, plus 1 for the one character (b) that precedes ü in the output, giving 745

Step 3: Encode the delta

  • Encode 745 in variable-length base-36 with the initial bias
  • With the initial bias settings, this becomes: kva
  • Output: bcher-kva

Done. The full Punycode representation is bcher-kva, which becomes xn--bcher-kva when the IDNA xn-- prefix is added.

To decode, you reverse the process: split on the last hyphen to get bcher and kva, decode kva to recover the delta 745, unpack it into a code point jump of 124 (128 + 124 = 252, which is ü) and an insertion position of 1, and insert ü right after the b to reconstruct bücher.
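
That unpacking is plain integer division: the decoder splits the recovered delta by the number of possible insertion slots, which here is the 5 characters of bcher plus one (a quick GHCi check):

ghci> 745 `divMod` 6
(124,1)   -- 128 + 124 = 252 (ü), inserted at position 1, right after the b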

Why Each Design Choice Matters

Now that you’ve seen the algorithm in action, let’s dig deeper into why each piece is designed the way it is. These aren’t arbitrary choices—each one solves a specific constraint or optimization problem. Most of this comes from careful reading of RFC 3492, which specifies Punycode in detail.

Why Base-36?

Base-36 uses exactly the digits 0-9 and letters a-z. That’s 36 symbols total. This specific choice is dictated by DNS constraints, which are defined in RFC 1035 (the original DNS specification from 1987, back when the internet was young and optimistic).6

DNS domain names are case-insensitive. When you type Google.COM, google.com, or GoOgLe.CoM, they all resolve to the same domain. The DNS protocol specification treats uppercase and lowercase letters as identical. This single constraint eliminates entire classes of encoding schemes.

You can’t use base-52 (a-z, A-Z) because A and a would be different symbols in your encoding but identical in DNS. A domain encoded with uppercase letters would be indistinguishable from one encoded with lowercase, breaking bijectivity. You can’t use base-64 (the standard alphanumeric + symbols encoding) because DNS prohibits most special characters in domain labels. Characters like +, /, and = aren’t allowed.

Base-36 is the largest case-insensitive alphanumeric base. It’s literally the maximum you can achieve under DNS constraints:

  • Digits 0-9: 10 symbols
  • Letters a-z (case-insensitive): 26 symbols
  • Total: 36 symbols

The information-theoretic implication here is that base-36 provides log₂(36) ≈ 5.17 bits of information per character. This is optimal given the constraints. Any smaller base would waste encoding space. Any larger base would violate DNS compatibility.

Why not other bases?

  • Base-10 (digits only): Only 3.32 bits/char, incredibly wasteful
  • Base-16 (hex): 4 bits/char, better but still inefficient
  • Base-32: 5 bits/char, closer but still leaving efficiency on the table for no reason
  • Base-64: Requires special characters that DNS prohibits

Base-36 extracts every bit of information density possible while playing by DNS’s rules. When you’re encoding hundreds of millions of domain names, these efficiency gains matter.

Why Delta Encoding?

You could encode the absolute Unicode code point of each character. For 日本語, that would mean encoding 26,085, then 26,412, then 35,486. Each of these is a five-digit number that requires multiple base-36 digits to represent.

Delta encoding instead encodes the difference between consecutive code points:

  • First delta: 26,085 - 128 = 25,957 (still large because we’re jumping from ASCII into CJK)
  • Second delta: 26,412 - 26,085 = 327 (much smaller, just moving between adjacent kanji)
  • Third delta: 35,486 - 26,412 = 9,074 (larger again, jumping to a different part of the CJK block)

What I think is the core insight here is locality. When you’re writing in German, you’re using characters from Latin and Latin Extended. When you’re writing in Japanese, you’re using characters from Hiragana, Katakana, and CJK Ideographs. You rarely ping-pong wildly between unrelated Unicode blocks within a single word.

Delta encoding exploits this. For single-script domains, most deltas after the initial jump are small (often under 100). Most values encode in 1-2 base-36 digits instead of 4-5. The compression gains are substantial.

Even for mixed-script domains, delta encoding is no worse than absolute encoding. You still need to represent the full magnitude of the jumps. But for the overwhelmingly common case of single-script domains, delta encoding provides massive savings.

There’s another subtle benefit. Delta encoding means the algorithm only needs to track one piece of state (the current code point position). You don’t need to maintain separate tracking for where you are in the Unicode space versus where you are in the output string. This simplifies both implementation and reasoning about correctness.

Why Variable-Length Encoding?

Once you’ve decided to encode deltas, you face a new problem. Deltas have wildly varying magnitudes. In a German domain, deltas might be 10-50. In a mixed-script domain, deltas might be 20,000. How do you encode both efficiently?

Fixed-width encoding doesn’t work. If you allocate enough space for the maximum possible delta (Unicode goes up to U+10FFFF, about 1.1 million), you’d need 4 base-36 digits for every value (36⁴ is about 1.7 million). This wastes enormous space on common small values. It’s like using a semi truck to deliver a postcard.

Variable-length encoding solves this by using exactly as many digits as needed for each specific value:

  • Values 0-35: 1 digit (covers most within-script movement)
  • Values 36-1,295: 2 digits (covers most between-script jumps in related languages)
  • Values 1,296-46,655: 3 digits (covers most unrelated script jumps)
  • And so on…

This is conceptually similar to UTF-8’s variable-length encoding (specified in RFC 3629), though the mechanism is different. The core idea is the same. Assign short codes to frequent values, longer codes to rare values.

The encoding scheme uses a threshold system. Each position in the variable-length number has a threshold value. If your remaining value is below the threshold, you output it directly and you’re done. If it’s above the threshold, you output the threshold value (indicating “more digits to come”) and continue with the remainder.
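
Here’s what that loop looks like in Haskell. This is a simplified sketch of RFC 3492’s generalized variable-length integers, with names of my own choosing:

-- Encode one delta as variable-length base-36 digits. The current bias
-- determines the threshold at each digit position.
encodeDigits :: Int -> Int -> String
encodeDigits bias delta = go base delta
  where
    base = 36; tmin = 1; tmax = 26
    threshold k = max tmin (min tmax (k - bias))
    digitChar d
        | d < 26    = toEnum (fromEnum 'a' + d)        -- 0..25  -> 'a'..'z'
        | otherwise = toEnum (fromEnum '0' + d - 26)   -- 26..35 -> '0'..'9'
    go k q
        | q < t     = [digitChar q]                    -- below the threshold: last digit
        | otherwise = digitChar (t + (q - t) `mod` (base - t))
                          : go (k + base) ((q - t) `div` (base - t))
      where
        t = threshold k

With the walkthrough’s numbers, encodeDigits 72 745 gives back "kva".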

Here’s where it gets interesting. The thresholds aren’t fixed. They depend on the current bias parameter, which adapts as you encode. This leads us to the real innovation in Punycode.

Why Adaptive Encoding (and What Is Bias)?

This is the piece that surprised me most during implementation. I expected a straightforward variable-length encoding scheme with fixed parameters. What I found was a dynamic system that adjusts its own parameters based on the data it’s encoding.

The core problem: optimal encoding thresholds are context-dependent. Consider these two domains:

German domain bücher-café.de:

  • Characters live in Latin (U+0000-U+007F) and Latin Extended (U+0080-U+024F)
  • Deltas are typically 20-100
  • Optimal encoding: thresholds tuned so that one or two digits cover deltas up to a few hundred

Chinese domain 北京旅游.cn:

  • Characters live in CJK Unified Ideographs (U+4E00-U+9FFF)
  • Initial delta from ASCII is ~20,000, subsequent deltas are 0-5,000
  • Optimal encoding: thresholds tuned so that two or three digits cover deltas into the thousands

A fixed encoding scheme forces you to choose one set of thresholds. If you optimize for European languages, Chinese domains become bloated. If you optimize for Asian languages, European domains waste space. Punycode manages to do an end-run around this compromise.

The bias parameter controls the variable-length encoding thresholds. It’s a single integer that determines where the break points are between 1-digit, 2-digit, 3-digit encodings, etc.

After encoding each delta, Punycode recalculates the bias based on the magnitude of that delta. The algorithm for this is specified in Section 6.1 of RFC 3492:

biasAdaptation :: Int -> Int -> Bool -> Int
biasAdaptation delta numPoints isFirst =
    let delta'  = if isFirst
                  then delta `div` damp  -- heavy damping on the very first delta
                  else delta `div` 2     -- light damping afterwards
        delta'' = delta' + (delta' `div` numPoints)
    in findBias delta'' 0
  where
    -- constants from RFC 3492, Section 5
    base = 36; tmin = 1; tmax = 26; damp = 700; skew = 38
    findBias d bias
        | d > ((base - tmin) * tmax) `div` 2 =
            findBias (d `div` (base - tmin)) (bias + base)
        | otherwise =
            bias + (((base - tmin + 1) * d) `div` (d + skew))

The constants are specified in the RFC:7

  • base = 36
  • tmin = 1, tmax = 26
  • damp = 700
  • skew = 38

What the Adaptation Achieves

When you start encoding bücher-café, the first character the algorithm processes (in sorted order) is é, not ü. The code point difference is 233 - 128 = 105, but the actual delta includes position information: 105 × (9 ASCII chars + 1) = 1,050, plus 9 for the ASCII characters that precede é in the string, giving 1,059. The adaptation function processes this:

  • Apply damping: 1,059 / 700 = 1.51 → 1 (more on damping shortly)
  • Scale by position: 1 + (1 / 10) = 1.1 → 1
  • Calculate new bias: 0 (stays very small)

The small bias means subsequent deltas in the Latin Extended range will use shorter encodings. Even though the actual delta value includes position information, the bias adapts to favor the small code point differences typical of single-script domains.

When you start encoding 北京旅游, the first character processed is 京 (U+4EAC = 20,140), the lowest code point in the label, and with no ASCII characters the delta is simply 20,140 - 128 = 20,012. The adaptation processes this:

  • Apply damping: 20,012 / 700 = 28.6 → 28
  • Scale by position: 28 + (28 / 1) = 56
  • Calculate new bias: 21 (moderately sized)
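
Both results drop straight out of the biasAdaptation function above (a GHCi session; the arguments are the delta, the number of code points handled so far, and whether this is the first delta):

ghci> biasAdaptation 1059 10 True   -- bücher-café: 9 ASCII characters plus the é just encoded
0
ghci> biasAdaptation 20012 1 True   -- 北京旅游: no ASCII characters, just the 京 just encoded
21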

The moderately-sized bias shifts the encoding thresholds upward compared to the Latin Extended case. When you encode 北 (U+5317) and the remaining characters, the deltas encode efficiently because the thresholds are now positioned to favor the larger values typical of CJK scripts.

The algorithm learns the distribution of your specific string. For single-script domains, the bias quickly converges to an optimal value for that script. For mixed-script domains, the bias keeps adjusting to find a reasonable middle ground.

This is why Punycode achieves near-optimal encoding for such a wide variety of inputs: it’s not using a one-size-fits-all approach. It’s dynamically tuning itself to each specific domain name.

[Chart: bias adaptation over time for bücher-café. The bias sits at 0 after the first, heavily damped character (é) and rises to 26 after the second (ü).]

[Chart: bias adaptation over time for 北京旅游. The bias jumps to 21 after the first, heavily damped character, then moves to 56, 67, and 64 as the remaining characters are encoded with lighter damping.]

What Is Damping?

Damping was completely new to me in this context before implementing Punycode. I knew the concept from audio engineering (where it’s central to designing filters and preventing oscillation in feedback systems8) but I hadn’t seen it applied to text encoding algorithms. Reading through RFC 3492, I initially didn’t understand why the algorithm divided the first delta by 700 but subsequent deltas by only 2. The RFC mentions it prevents “overflow”9 but doesn’t explain the underlying principle. I suspect this technique is common in compression algorithms, and I need to study more about adaptive encoding schemes to understand the broader pattern.

After working through the implementation and testing various inputs, I realized what damping solves. The cold-start problem. How do you adapt when you’ve only seen one data point?

After encoding the first character, you have exactly one delta value. Should you assume all future deltas will be similar? Dangerous assumption. Consider:

  • Domain bücher-café: The first delta makes a big jump from ASCII into Latin-1 (to é at code point 233); the next character, ü at 252, is only a short hop further
  • Domain مرحبا: First delta jumps to Arabic script, subsequent deltas between Arabic characters vary within that range

If you adapt aggressively based on the first delta, you’ll be poorly positioned for what comes next. You need to hedge against uncertainty.

The damping factor reduces the influence of early samples on the bias calculation:

if isFirst
    then delta `div` 700    -- Heavy damping on first character
    else delta `div` 2      -- Light damping on subsequent characters

For the first character, Punycode divides the delta by 700 before using it in the adaptation calculation. A delta of 20,012 (the 北京旅游 example above) becomes 28 for adaptation purposes. The bias changes, but not dramatically. The algorithm is hedging. “I saw a large delta, so I’ll prepare for moderately large deltas, but I’m not committing fully to this pattern yet.”

For subsequent characters, the damping factor drops to 2. Once you have multiple samples, you can trust the pattern more. You’re still smoothing out individual variations (that’s the ÷2), but you’re letting the bias track the actual distribution more closely.

The concept of damping appears throughout control systems literature10. These are systems that need to respond to inputs without oscillating or overreacting. A well-damped system responds smoothly to changes, while an underdamped system oscillates wildly, and an overdamped system responds too slowly. Punycode’s damping prevents the bias from oscillating based on individual character variations.

Why 700 specifically? According to the RFC, this constant was empirically tuned using real internationalized domain names.11 Too small (like 10) and you still overfit to the first character. Too large (like 10,000) and adaptation is too slow. You waste space on the first several characters before the bias converges. 700 hits the sweet spot. Conservative enough to avoid premature convergence, aggressive enough to adapt within 2-3 characters. Sometimes the magic numbers really are just trial and error.

Damping in Action

Let’s trace through two examples to see why damping matters.

Example 1: bücher-café

Without damping:

  • First delta: 1,059 (the jump to é)
  • Dividing by 2 instead of 700 leaves a large value, so the bias jumps sharply on the strength of a single sample
  • The next delta (200 for ü) is far smaller, so the over-adapted bias is poorly positioned for it
  • The bias lurches instead of converging smoothly

With damping:

  • First delta: 1,059 (code point jump 105 × 10 positions, plus 9 for the preceding ASCII characters)
  • Damped value: 1,059 / 700 = 1.51 → 1
  • Bias adapts to 0 (cautious, not committing)
  • Next character ü: delta = 200
  • Bias adjusts smoothly to 26, comfortably inside the range European characters need
  • Bias settles at a value well matched to Latin Extended deltas

[Chart: damping effect on bücher-café, showing the bias after each character with and without damping.]

Damping (green) prevents the bias from overreacting to the first character, allowing smoother convergence. Without damping (red), the bias can overcorrect or oscillate.

Example 2: aü北aü京 (pathological alternating)

Without damping:

  • First deltas (the two ü’s): small → the bias adapts toward values tuned for Latin-1
  • Next delta (the jump into the CJK block): enormous → the bias swings to a much larger value
  • Each swing leaves the thresholds poorly tuned for the delta that follows
  • Bias oscillates wildly, never converging

With damping:

  • First deltas (the ü’s): damped heavily → bias adapts cautiously
  • Jump into the CJK block: damped by 2 → bias adjusts moderately
  • Later deltas: damped by 2 → bias keeps adjusting but remains stable
  • Bias still oscillates but stays in a reasonable range
  • Encoding is suboptimal but correct

[Chart: damping effect on aü北aü京, showing the bias after each character with and without damping.]

Even in this pathological case, damping (green) keeps the bias within a reasonable range, while the undamped bias (red) swings between extremes.

Damping provides stability. It keeps the algorithm from chasing noise or overfitting to unusual patterns. Typical domain names (consistent scripts) converge quickly. Atypical domains (jumping between scripts) don’t fail catastrophically.

The Proportional Adaptation Term

There’s another mechanism in the bias calculation that I initially didn’t understand:

delta' + (delta' `div` numPoints)

The numPoints variable tracks how many characters have been handled so far, counting the ASCII passthrough and the character just encoded. For a label with no ASCII and one encoded character, numPoints = 1, so you’re adding delta' + delta' (doubling the influence). By the time numPoints = 10, you’re adding delta' + (delta' / 10) (only a 10% boost).

This implements a confidence-weighted update. Early characters have more influence on bias because you’re still establishing the pattern. Later characters have less influence because you’ve already converged to a stable estimate. This is conceptually similar to an exponential moving average with decaying learning rate, a concept from online learning algorithms.12

The effect is subtle but important. Without this term, late-occurring outliers (like a single Chinese character at the end of an otherwise German domain) would disrupt a well-established bias. With this term, the bias remains stable and the outlier just takes a few extra digits to encode. Seems like the right tradeoff to me.

The Role of SKEW

The final piece of the bias adaptation formula uses the SKEW constant (38):

bias + (((base - tmin + 1) * d) / (d + skew))

SKEW appears in the denominator as d + skew. This serves two important purposes:

First, it prevents division by zero. When you have duplicate characters (like the two ü’s in müüchen), the delta between them is zero. Without SKEW, you’d be dividing by zero. With SKEW, you’re dividing by 38, giving a small but valid bias adjustment.

Second, and more importantly, SKEW creates a non-linear response curve. Let’s look at what happens to the bias adjustment term with different delta values:

  • When d = 0 (duplicates): (36 × 0) / (0 + 38) = 0 → no bias change
  • When d = 10 (small): (36 × 10) / (10 + 38) = 360 / 48 ≈ 7.5 → modest increase
  • When d = 50 (medium): (36 × 50) / (50 + 38) = 1800 / 88 ≈ 20.5 → larger increase
  • When d = 200 (large): (36 × 200) / (200 + 38) = 7200 / 238 ≈ 30.2 → substantial increase
  • When d = 1000 (very large): (36 × 1000) / (1000 + 38) = 36000 / 1038 ≈ 34.7 → approaching the maximum of 36
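
The curve is easy to reproduce. Note that the algorithm uses integer division, so the values below are the floored versions of the approximate figures above:

ghci> [ (d, (36 * d) `div` (d + 38)) | d <- [0, 10, 50, 200, 1000] ]
[(0,0),(10,7),(50,20),(200,30),(1000,34)]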

The curve is steepest for small-to-medium values and flattens out as d grows large:

  • Small deltas (within-script movement) produce small bias changes. Keeps the bias stable.
  • Large deltas (between-script jumps) produce larger bias changes, but with diminishing returns.
  • This proportional term never contributes more than BASE (36) to the bias in a single step. Stability.

The specific value of 38 was chosen empirically. It’s slightly larger than BASE (36); because the denominator always exceeds d, the term approaches but never quite reaches the maximum value of 36. A smaller SKEW would make the bias more responsive to small deltas (risking instability). A larger SKEW would make it less responsive (slowing convergence). 38 hits the sweet spot. Responsive enough to adapt quickly, conservative enough to remain stable.

[Chart: SKEW response curve, bias adjustment versus delta with SKEW = 38, rising steeply for small deltas and leveling off toward the maximum of 36.]
Shows how bias adjustment increases with delta. Note the steep initial slope (responsive to small deltas), the inflection point near SKEW=38, and the asymptotic approach to the maximum value of 36.

This is yet another example of Punycode’s careful engineering. Every constant is tuned to balance competing concerns. SKEW ensures the bias adapts smoothly across the full range of possible delta values without overreacting to noise or underreacting to genuine distribution changes.

Why the Symmetry Property Matters

Here’s something that took me a while to appreciate during implementation. The decoder runs the exact same bias adaptation function as the encoder.

When decoding bcher-kva:

  1. Split on the last hyphen: ASCII part is bcher, encoded part is kva
  2. Decode kva to get delta = 745
  3. Unpack 745 into a code point jump of 124 (so 128 + 124 = 252, which is ü) and an insertion position of 1
  4. Insert ü at position 1 to get bücher
  5. Run biasAdaptation 745 6 True (the delta, the six code points now in the output, and “this is the first delta”) to get the new bias
  6. For any subsequent encoded values, repeat with the updated bias

The encoder and decoder execute identical code. Both start with the same initial bias. Both process the same sequence of deltas. Both run the same adaptation function after each delta. The result is perfect synchronization with no explicit coordination.

This is what makes Punycode bijective13 without metadata. Many compression schemes require shared state between encoder and decoder. Dictionaries, Huffman trees, probability tables. These either need to be standardized (limiting adaptability) or transmitted as metadata (adding overhead). Punycode needs neither. The adaptation algorithm is purely a function of the encoded data itself. Given the encoded string and the algorithm specification, the decoder can reconstruct the exact state the encoder had at every step.

Why this is elegant: Punycode is completely self-contained. No hidden state, no side channels, no version negotiation. The algorithm specification in RFC 3492 is the entire contract. Any implementation that follows the spec will interoperate perfectly with any other implementation. This is a huge advantage for a global standard that needs to work across heterogeneous systems.

It also simplifies implementation. In my Haskell version, the encoder and decoder literally call the same biasAdaptation function. There’s no separate encoder-side and decoder-side logic. This reduces code duplication and makes correctness easier to verify.

Performance Characteristics

The adaptive bias converges quickly in practice. For single-script domains, the bias typically stabilizes after encoding 2-3 characters. At that point, subsequent characters encode at near-optimal efficiency.

For mixed-script domains, the bias continues adapting throughout the string, but it settles into a reasonable compromise. It’s not optimal for either script individually, but it handles both correctly and more efficiently than any fixed-bias approach could.

Pathological cases exist. A domain like aü北aü京 that alternates between ASCII and characters from unrelated scripts will cause the bias to oscillate rather than converge. The encoding will be correct but less compact than optimal. But these patterns are extremely rare in real domain names. People don’t typically alternate between scripts at character-level granularity within a single word.

Empirical results from real internationalized domain names show (these statistics are mentioned in the IETF IDN working group discussions, though I’m bad at navigating the IETF website so had a hard time figuring out how to find the info):

  • Single-script European (German, French, Spanish, etc.): 1.0-1.5x expansion over raw character count
  • Single-script CJK (Chinese, Japanese, Korean): 1.2-1.8x expansion
  • Mixed-script (like shop-日本商店.com): 2.0-2.5x expansion
  • Worst case: Bounded by O(n) where n is string length

The adaptation overhead is minimal: about 3-4 integer operations per character (division, addition, a conditional). The while loop in the bias calculation typically runs 0-2 iterations. No lookups, no allocations, just arithmetic on machine integers.

From an implementation perspective, Punycode is remarkably efficient. In my Haskell version, the hot path consists of pure functions on strict integers—exactly the kind of code GHC optimizes well. The entire encoding/decoding process is non-allocating except for the output string itself.

[Chart: information density in bits per character for bücher-café and for 北京旅游, showing cumulative encoding efficiency (lower is more compact). A dashed line marks base-36’s theoretical maximum of 5.17 bits per character.]

Why Fixed Encodings Fail

To appreciate why Punycode’s adaptive approach is necessary, consider what happens with fixed bias values.

Fixed low bias (say, bias = 5, optimized for European languages):

  • German domain bücher-café.de: Encodes compactly with mostly 1-digit deltas ✓
  • Chinese domain 北京旅游.cn: Requires 3-4 digits per delta, expanding by 3-4x ✗
  • Arabic domain مرحبا.com: Similar bloat, requires many digits per character ✗
  • Result: European domains are optimal, but other scripts become impractically long

Fixed high bias (say, bias = 100, optimized for CJK):

  • Chinese domain 北京旅游.cn: Encodes compactly with 1-2 digit deltas ✓
  • German domain bücher-café.de: Wastes space using 2 digits where 1 would suffice, 30-40% overhead ✗
  • Arabic domain مرحبا.com: Better than fixed low, but still not optimal ≈
  • Result: CJK domains are optimal, but European domains waste space unnecessarily

Adaptive bias (Punycode’s approach):

  • German domain: Bias converges to ~0-5 after 2 characters → near-optimal ✓
  • Chinese domain: Bias adapts upward (roughly 20-65 for 北京旅游) based on deltas → near-optimal ✓
  • Arabic domain: Bias adapts to appropriate range → near-optimal ✓
  • Mixed domain: Bias adapts continuously → reasonable compromise ✓
  • Pathological alternating: Still correct, just less compact ✓

The chart below shows this concretely. For European text like bücher, the fixed low bias (red) is optimal. It uses the fewest digits. But for CJK text like 北京旅游 or Arabic like مرحبا, the fixed low bias becomes catastrophically inefficient, requiring many more digits per character. The fixed high bias (blue) has the opposite problem. Great for CJK, wasteful for European text. The adaptive approach (green) automatically matches the best fixed bias for each script type.

[Chart: total digits required to encode bücher, 北京旅游, café, مرحبا, and こんにちは under a fixed low bias (5), a fixed high bias (100), and Punycode’s adaptive bias.]

Lower bars are better (fewer total digits = more compact encoding). The adaptive approach (green) automatically matches or beats the best fixed bias for each script type.

The adaptive approach achieves a Pareto improvement. Better on common cases and no worse on edge cases. You don’t have to choose between optimizing for one script family at the expense of another. The algorithm makes the right choice for each specific input.

This is what impressed me during implementation. The design space for “encode Unicode in base-36” is quite constrained once you account for DNS compatibility and efficiency requirements. Punycode navigates this beautifully. Optimal for the common case, correct for edge cases, and simple enough to implement in a few hundred lines of code.

Decoding

Decoding reverses the encoding process step-by-step:

  1. Split on the last hyphen to separate the ASCII prefix from the encoded suffix. For bcher-kva, the ASCII part is bcher and the encoded part is kva.

  2. Initialize state: Start with code point 128, bias at the initial value, position 0 in the output string.

  3. Decode variable-length base-36 integers from the encoded suffix. Each gives you a delta value.

  4. Apply each delta: It advances the current code point and selects an insertion position. Insert the resulting Unicode character at that position in the output string.

  5. Update bias: Run the same biasAdaptation function the encoder used. This keeps encoder and decoder synchronized.

  6. Repeat until the encoded suffix is exhausted.
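
Step 3 is the mirror image of the variable-length encoder sketched earlier. A simplified version (again with my own names, and without the overflow and validity checks a real decoder needs):

-- Decode one variable-length integer from the front of the encoded suffix,
-- returning the delta and the remaining digits.
decodeDigits :: Int -> String -> (Int, String)
decodeDigits bias = go base 1 0
  where
    base = 36; tmin = 1; tmax = 26
    threshold k = max tmin (min tmax (k - bias))
    digitValue c
        | c >= 'a' && c <= 'z' = fromEnum c - fromEnum 'a'
        | otherwise            = fromEnum c - fromEnum '0' + 26
    go _ _ acc []     = (acc, [])           -- ran out of digits; real code rejects this
    go k w acc (c:cs)
        | d < t     = (acc + d * w, cs)     -- digit below threshold: this was the last one
        | otherwise = go (k + base) (w * (base - t)) (acc + d * w) cs
      where
        d = digitValue c
        t = threshold k

Running decodeDigits 72 "kva" returns (745, ""), recovering the delta from the walkthrough.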

Implementation Notes

Real production implementations need to handle several details beyond the core algorithm:

Case preservation: Punycode itself is case-insensitive (it only uses lowercase a-z). RFC 3492 does describe an optional mixed-case annotation, where the case of the encoded ASCII letters records the case of the original characters, but IDNA (the broader framework for internationalized domain names, specified in RFC 5891) effectively sidesteps the issue by case-folding labels before they reach Punycode.14

Length limits: DNS labels are capped at 63 octets (characters in ASCII), as specified in RFC 1035.15 This includes the xn-- prefix that IDNA adds. So your actual Punycode encoding is limited to 59 characters. Implementations need to detect and reject strings that would exceed this limit.
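
As a trivial sketch of that check (the function name is mine; Punycode output is pure ASCII, so characters and octets coincide):

-- A DNS label, including the "xn--" prefix, must fit in 63 octets, which
-- leaves at most 59 characters for the Punycode payload itself.
fitsInDnsLabel :: String -> Bool
fitsInDnsLabel encoded = length ("xn--" ++ encoded) <= 63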

Invalid inputs: The decoder needs to gracefully reject malformed Punycode. This includes checking for overflow in delta calculations, invalid base-36 digits, and deltas that would result in code points outside the valid Unicode range.

Unicode normalization: Before encoding, IDNA applies Unicode normalization (specifically NFC, Normalization Form C, defined in Unicode Standard Annex #15).16 This ensures that é (precomposed) and é (e + combining accent) encode to the same result. This turns out to be really important for security. You don’t want two visually identical renderings of éxample.com to resolve to different domains just because the é was composed differently.17

The constants (BASE=36, TMIN=1, TMAX=26, DAMP=700, SKEW=38, initial bias=72) are specified in RFC 3492 and were empirically tuned.18 The RFC authors tested these values against large corpora of real internationalized domain names to find the optimal balance between compression efficiency and adaptation stability. These aren’t arbitrary choices. They represent the results of extensive empirical optimization, though the RFC doesn’t publish the test corpus or detailed optimization methodology.

Why It All Works

Punycode succeeds because it exploits several orthogonal properties of domain names:

Most internationalized domains use a single script. When someone registers a German domain, they use German characters (Latin Extended). When someone registers a Chinese domain, they use Chinese characters (CJK Ideographs). Very few people mix unrelated scripts within a single domain label (with the exception, perhaps, of the word café! Who doesn’t love a good café?). This means delta encoding provides substantial compression. Consecutive characters have similar code points.

DNS is case-insensitive. This constraint, while limiting, actually simplifies the design space. Base-36 is the obvious choice. There’s no temptation to use case to carry information, which would introduce ambiguity and potential security issues.

Domain names are short. Most domain labels are under 20 characters. This means adaptation overhead amortizes quickly. You “pay” for the first character’s suboptimal encoding while the bias is still converging, but by the third character you’re at near-optimal efficiency. For longer strings, the amortized cost of those first few characters becomes negligible.

Unicode has structure. Scripts are organized into contiguous blocks. This isn’t an accident. It’s a deliberate design choice in Unicode (documented in the Unicode Standard) that makes many encoding schemes more efficient. Punycode exploits this structure through delta encoding and adaptive bias.

The result is an encoding that’s:

  • Injective:19 Every Unicode string maps to exactly one Punycode string, and vice versa
  • Compact: Near-optimal for the common case (single-script domains)
  • Fast: Simple arithmetic on machine integers, no table lookups or allocations
  • Stateless: Decoder needs no external information beyond the algorithm specification
  • DNS-compatible: Case-insensitive, alphanumeric-only, respects length limits
  • Secure-ish: No ambiguity issues, handles normalization correctly (there are other security issues with Punycode, but that’s a topic for another article)

What impressed me most with Punycode is how it balances competing constraints. Optimized for efficiency without sacrificing correctness. Adaptive without requiring shared state. Simple enough to implement from the RFC spec in an afternoon, yet sophisticated enough to handle the full complexity of human writing systems.

This is what good algorithm design looks like. The constraints (DNS’s ASCII-only infrastructure, 63-character limit, case-insensitivity, zero tolerance for ambiguity) are severe. The problem space (all of Unicode, all human writing systems, millions of domains) is enormous. And somehow the solution is both near-optimal for common cases and correct for edge cases, all in a few hundred lines of code!

Punycode proves that constraints breed creativity. DNS gave us ASCII-only domain names, which should have been a dead end for internationalization. Instead, we got an elegant encoding scheme that’s still going strong decades later. The algorithm works so well that most people never even know it’s there. When you type 日本.jp into your browser and it just works, that’s Punycode doing its job.

One of the marks of genuinely clever engineering is making the difficult look easy. Punycode does this brilliantly.

Interactive Visualization

Now that we’ve covered how Punycode works, you can explore the algorithm yourself with this interactive visualizer. Try different inputs to see how the bias adapts, how deltas are encoded, and how the algorithm handles various scripts. The example buttons give you a tour of different writing systems, from Arabic to Thai. Watch the bias do its thing.

Punycode Encoding Visualizer

Watch how Punycode encodes Unicode text into DNS-compatible ASCII

Note: The embedded tool is a didactic visualization to demonstrate how the Punycode algorithm works. It's not intended to be a fully-conformant Punycode implementation; for production use, rely on established libraries and RFC 3492.

[Interactive visualizer embedded here: pick an example or enter your own text, then step through the four stages of encoding, from ASCII extraction to the final xn-- output.]

Footnotes

  1. RFCs (Requests for Comments) are the standardization documents that define Internet protocols. They’re deliberately prescriptive (specifying exact behavior so implementations interoperate) but this leaves little room for educational exposition about the underlying design philosophy.

  2. From RFC 3492, Section 2: “The particular values used were chosen heuristically and by trial and error to optimize a variety of test cases.”

  3. RFC 1035, Section 2.3.1 defines the original DNS label syntax: “labels must follow the rules for ARPANET host names. They must start with a letter, end with a letter or digit, and have as interior characters only letters, digits, and hyphen.”

  4. The xn-- prefix identifies an “ACE label” (ASCII Compatible Encoding). This is defined in RFC 5890, Section 2.3.2.1: “All ACE labels begin with the ACE prefix ‘xn--’ (case independent), but the prefix does not appear in the Unicode form.”

  5. Unicode’s block organization is documented in the Unicode Standard, Chapter 2. Scripts are deliberately grouped into contiguous ranges to facilitate efficient encoding schemes like Punycode.

  6. RFC 3492, Section 5 states: “base = 36, the number of alphanumeric ASCII characters (10 digits plus 26 letters).” The RFC doesn’t elaborate on why this is optimal, but it’s the maximum case-insensitive base available in DNS.

  7. From RFC 3492, Section 5: These “parameter values were chosen by running several different DNS IDN testbeds and optimizing for a combination of space usage and processing speed on common input.”

  8. In audio engineering, damping ratios control how quickly oscillations decay in resonant systems. Underdamped systems (like a bell) ring with diminishing oscillations. Critically damped systems return to equilibrium as quickly as possible without overshooting. Overdamped systems approach equilibrium slowly. The same principle applies here: we want the bias to adapt without oscillating wildly.

  9. RFC 3492, Section 6.1 simply states: “damp is used for the first time to prevent overflow” without further elaboration on the design rationale.

  10. Classic damping theory comes from mechanical engineering and control theory. See Wikipedia: Damping for an introduction. The damping ratio ζ (zeta) determines system behavior: ζ < 1 is underdamped (oscillates), ζ = 1 is critically damped (optimal), ζ > 1 is overdamped (sluggish).

  11. The RFC provides no justification for the specific value of 700 beyond the general note that parameters were chosen “heuristically and by trial and error.” The authors tested against “a representative sample of strings from several scripts” but don’t publish the corpus or detailed optimization results.

  12. Online learning algorithms update models incrementally as new data arrives, rather than batch processing. Decaying learning rates (where early samples have more influence) help balance plasticity (adapting to new patterns) with stability (not forgetting established patterns). See Wikipedia: Online machine learning.

  13. A bijective function (also called a bijection or one-to-one correspondence) is both injective and surjective. This means every input maps to exactly one unique output, and every possible output corresponds to exactly one input. For Punycode: every Unicode string has exactly one Punycode encoding, and every valid Punycode string decodes to exactly one Unicode string. No information is lost in either direction.

  14. From RFC 3492, Section 2: “Case preservation is not the job of Punycode… Applications using Punycode should preserve the case of the original characters and of the ASCII characters.”

  15. From RFC 1035, Section 2.3.4: “Labels must be 63 characters or less.” This limit dates back to DNS’s original design in 1987 and remains a hard constraint of the DNS protocol.

  16. RFC 5891, Section 4.2.2.1 mandates: “All domain names must be validated and converted to Unicode Normalization Form C.”

  17. Domain name spoofing using different Unicode normalizations is a real security concern. The Unicode Security Mechanisms report discusses various confusability issues, including normalization-based attacks.

  18. RFC 3492, Section 2 describes the tuning process: “The particular values used were chosen heuristically and by trial and error to optimize a variety of test cases, including a representative sample of strings from several scripts (Arabic, Chinese simplified and traditional, Greek, Hebrew, Hindi, Japanese, Korean, Russian, Tamil, and Thai).”

  19. An injective function (also called an injection or one-to-one function) maps distinct inputs to distinct outputs. No two different inputs produce the same output. For encoding schemes, injectivity means you can decode unambiguously. There’s never confusion about which original input produced a given output. This matters for DNS: if two different Unicode strings could encode to the same ASCII string, domain resolution would be ambiguous.