Arithmetic Coding: Assigning a Single Interval to an Entire Message

by Jennifer

Compression is often explained using “codes” that map symbols to bits—think Huffman coding, where each character gets its own bit pattern. Arithmetic coding works differently. Instead of giving every symbol a separate codeword, it represents the entire message as one fractional number within an interval between 0 and 1. This approach can achieve compression efficiency very close to the theoretical limit, especially when symbol probabilities are uneven or when a model can predict upcoming symbols well.

In this article, you will learn how arithmetic coding assigns a single interval to a message, why it can compress better than many prefix codes, and what practical issues matter when implementing it.

Why Arithmetic Coding Exists

Traditional entropy coders such as Huffman coding assign an integer number of bits to each symbol. That constraint can waste space when a symbol’s ideal information content is not an integer. For example, a symbol with probability 0.1 ideally needs about 3.32 bits, but Huffman cannot assign fractional bits per symbol. Over long messages, that rounding can accumulate into measurable overhead.

Arithmetic coding avoids this by encoding the whole message into one interval whose size reflects the message probability. Highly probable messages map to larger intervals (fewer bits), while unlikely messages map to tiny intervals (more bits). This is why arithmetic coding often performs better when paired with good probability models—an important topic in many compression and modelling modules within a data scientist course.

Core Idea: One Interval, Refined Step by Step

Arithmetic coding starts with a range, usually [0, 1). It then narrows that range as it processes each symbol.

  1. Start with the full interval: low = 0.0, high = 1.0

  2. Partition the interval according to symbol probabilities.
    Suppose an alphabet {A, B, C} has probabilities:

    • A: 0.5

    • B: 0.3

    • C: 0.2
    Then the interval [0, 1) is split into:

    • A: [0.0, 0.5)

    • B: [0.5, 0.8)

    • C: [0.8, 1.0)

  3. Choose the sub-interval for the next symbol and repeat.
    If the first symbol is B, the new interval becomes [0.5, 0.8).
    For the second symbol, you again split this new interval in the same probability proportions and select the portion for that symbol.

After processing all symbols, you end with a final interval [low, high). Any number inside this interval uniquely represents the message (given the same probability model). The encoder outputs enough bits to specify a value that lies within the final range.
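
For readers who like to see the loop spelled out, here is a minimal floating-point sketch of the encoder, assuming a fixed probability model. The function names are illustrative, and a production coder would use integer ranges with renormalisation rather than Python floats.

    # Minimal arithmetic-encoding sketch in floating point (illustration only).
    # A real implementation works with integer ranges and renormalisation.

    def cumulative_ranges(probs):
        """Map each symbol to its [cum_low, cum_high) slice of [0, 1)."""
        ranges, cum = {}, 0.0
        for sym, p in probs.items():
            ranges[sym] = (cum, cum + p)
            cum += p
        return ranges

    def encode_interval(message, probs):
        """Return the final [low, high) interval for the whole message."""
        ranges = cumulative_ranges(probs)
        low, high = 0.0, 1.0
        for sym in message:
            width = high - low
            sym_low, sym_high = ranges[sym]
            high = low + width * sym_high   # shrink the upper bound first
            low = low + width * sym_low     # then raise the lower bound
        return low, high

    probs = {"A": 0.5, "B": 0.3, "C": 0.2}
    print(encode_interval("BA", probs))     # roughly (0.5, 0.65)

The print statement reproduces the interval derived by hand in the next section.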

A Small Numerical Illustration

Let the message be “BA” using the probabilities above.

  • Start: [0.0, 1.0)

  • Symbol B selects [0.5, 0.8)

  • Now refine for A within [0.5, 0.8): the width is 0.3

    • A takes the first 50% of this subrange: width = 0.15

    • New interval becomes [0.5, 0.65)

So “BA” maps to [0.5, 0.65). The encoder can output a binary fraction that falls inside this interval (for example, 0.101 in binary, which equals 0.625 and lies within the bounds). The exact bitstream depends on implementation details like renormalisation, but the principle stays the same.
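
One way to picture where the output bits come from is to look for the shortest binary fraction whose whole dyadic cell fits inside the final interval, so that no later bit can push the value back outside it. The sketch below is a simplified illustration of that idea; the function name and the brute-force search over k are assumptions made for clarity.

    import math

    def dyadic_code(low, high):
        """Smallest k and integer m with [m/2**k, (m+1)/2**k) inside [low, high)."""
        k = 1
        while True:
            m = math.ceil(low * 2**k)       # smallest numerator not below low
            if (m + 1) / 2**k <= high:      # the whole dyadic cell fits in the interval
                return format(m, "0{}b".format(k)), m / 2**k
            k += 1

    print(dyadic_code(0.5, 0.65))  # ('100', 0.5): 0.100 in binary also lies in [0.5, 0.65)

Real coders do not search for k like this; they emit bits incrementally during renormalisation, but the end result is equivalent.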

Why It Can Compress Better Than Huffman

Arithmetic coding can get extremely close to the entropy of the source because it effectively supports fractional-bit coding across the message. The average code length approaches −log₂ P(message) without forcing per-symbol rounding.
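
As a rough, hand-checkable comparison, the snippet below contrasts the ideal cost −log₂ P(message) with a per-symbol Huffman code for the example alphabet. The Huffman lengths used (1 bit for A, 2 bits each for B and C) follow from merging the two least probable symbols first; the function names are illustrative.

    import math

    probs       = {"A": 0.5, "B": 0.3, "C": 0.2}
    huffman_len = {"A": 1,   "B": 2,   "C": 2}    # codeword lengths from a Huffman build

    def ideal_bits(message):
        """-log2 P(message): the cost arithmetic coding can approach."""
        p = math.prod(probs[s] for s in message)
        return -math.log2(p)

    def huffman_bits(message):
        """Bits used when every symbol gets an integer-length codeword."""
        return sum(huffman_len[s] for s in message)

    print(ideal_bits("BA"))    # about 2.74 bits
    print(huffman_bits("BA"))  # 3 bits
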

This advantage becomes more visible when:

  • Probabilities are skewed (some symbols are much more likely)

  • A context model predicts symbols well (e.g., text, image prediction residuals, or predictive coding outputs)

  • The alphabet is large or adaptive probabilities change over time

For learners building a strong foundation in information theory and probabilistic modelling, these ideas connect naturally to topics taught in a data science course in Pune, especially where modelling, probability estimation, and optimisation intersect with real-world systems.

Practical Implementation: Renormalisation and Precision

In theory, arithmetic coding works with real numbers. In practice, computers use finite-precision integers. Implementations maintain low/high bounds using large integers (for example, 32-bit or 64-bit ranges) and apply renormalisation:

  • When the high and low bounds share leading bits, those bits can be emitted because they will not change with future symbols.

  • The interval is scaled (shifted) so the remaining uncertainty stays within the numeric range.

  • Special handling is required for the “underflow” region (often called E3 scaling), where bounds get too close but do not share a stable leading bit.

These details are why arithmetic coding is considered “advanced”: the concept is clean, but robust implementations require careful boundary and precision management.
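
The sketch below shows only the renormalisation step in isolation, in the style of the classic Witten–Neal–Cleary coder. The precision, the bit sink, and the function names are assumptions for illustration; this is not a complete codec.

    PRECISION = 32
    HALF    = 1 << (PRECISION - 1)    # midpoint of the integer range
    QUARTER = 1 << (PRECISION - 2)    # quarter point, used for underflow checks

    def make_emitter(out_bits):
        def emit(bit, pending):
            out_bits.append(bit)
            out_bits.extend([1 - bit] * pending)   # flush deferred underflow bits
        return emit

    def renormalise(low, high, pending, emit):
        """Emit settled leading bits and rescale [low, high] back to full width."""
        while True:
            if high < HALF:                        # both bounds in lower half: a 0 bit is settled
                emit(0, pending)
                pending = 0
            elif low >= HALF:                      # both bounds in upper half: a 1 bit is settled
                emit(1, pending)
                pending = 0
                low -= HALF
                high -= HALF
            elif low >= QUARTER and high < HALF + QUARTER:
                pending += 1                       # E3 / underflow: straddles the middle, defer the bit
                low -= QUARTER
                high -= QUARTER
            else:
                return low, high, pending
            low, high = low << 1, (high << 1) | 1  # double the interval to restore precision

    bits = []
    low, high, pending = renormalise(0x90000000, 0x9FFFFFFF, 0, make_emitter(bits))
    print(bits)  # [1, 0, 0, 1]: the four leading bits shared by both bounds (hex 9)

In a complete coder this loop runs after every encoded symbol, and the deferred pending bits are how the E3 underflow case is resolved once the interval finally settles above or below the midpoint.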

Decoding: Reversing the Refinement

Decoding mirrors encoding:

  1. The decoder reads a number (from the bitstream) that lies in the final interval.

  2. Starting from [0, 1), it checks where that number falls within the probability partitions to determine the first symbol.

  3. It then refines the interval and repeats to recover each subsequent symbol.

As long as the decoder applies the same probability model updates as the encoder, the message is reconstructed exactly.
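
A minimal decoder sketch in the same floating-point style as the earlier encoding example is shown below. Passing the message length explicitly is an assumption made to keep the example short; real codecs typically use an end-of-message symbol or a stored length, plus the same integer renormalisation as the encoder.

    def decode(value, length, probs):
        """Recover `length` symbols from a value inside the final interval."""
        ranges, cum = {}, 0.0
        for sym, p in probs.items():                # same partition the encoder used
            ranges[sym] = (cum, cum + p)
            cum += p
        low, high = 0.0, 1.0
        message = []
        for _ in range(length):
            width = high - low
            for sym, (sym_low, sym_high) in ranges.items():
                lo, hi = low + width * sym_low, low + width * sym_high
                if lo <= value < hi:                # the value falls in this symbol's slice
                    message.append(sym)
                    low, high = lo, hi
                    break
        return "".join(message)

    probs = {"A": 0.5, "B": 0.3, "C": 0.2}
    print(decode(0.625, 2, probs))  # -> "BA"
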

Conclusion

Arithmetic coding represents a message as a single interval rather than a sequence of symbol codes. By repeatedly narrowing a range based on symbol probabilities, it can achieve compression performance close to the theoretical optimum, especially when used with strong predictive models. While implementation requires careful handling of finite precision and renormalisation, the payoff is excellent coding efficiency and flexibility. If you are exploring compression, probability modelling, or information theory through a data scientist course, arithmetic coding is a valuable technique to understand. It also fits naturally into broader applied learning in a data science course in Pune, where efficient representation and probabilistic reasoning show up in many practical domains.

Business Name: ExcelR – Data Science, Data Analytics Course Training in Pune

Address: 101 A ,1st Floor, Siddh Icon, Baner Rd, opposite Lane To Royal Enfield Showroom, beside Asian Box Restaurant, Baner, Pune, Maharashtra 411045

Phone Number: 098809 13504

Email Id: enquiry@excelr.com
