Daily Breach


TOON: How a New Data Format Slashes Your LLM Token Costs by 50% or More

[Figure: Token-Oriented Object Notation (TOON) overview graphic. Tagline: "Compact, human-readable serialization of JSON data for LLM prompts." Workflow: JSON → encode() → TOON → LLM, with a bar chart showing roughly 30–60% fewer tokens than JSON. Comparison footer: retrieval accuracy 73.9% (TOON) vs. 69.7% (JSON); best for repeated structures, tables, varying fields, and deep trees. Try it: npx @toon-format/cli *.json]

The Hidden Cost of JSON in the Age of AI

In the world of Large Language Models (LLMs), every character matters. Data scientists, engineers, and product managers are embracing AI, but they quickly run into the same practical constraint: token cost and latency.

The problem lies with our most common data interchange format: JSON (JavaScript Object Notation).

While JSON is a phenomenal, human-readable, and universally parsable format for general programming, it is wildly inefficient when feeding structured data into an LLM. Why? Because JSON’s rigor is its downfall in this context. Every single bracket ([ and ]), brace ({ and }), comma (,), and double quote (") counts as at least one token, and often more. When you have a large array of objects (say, 50 product records or analytics metrics), the vast majority of your tokens are spent on redundant punctuation instead of valuable data.

This is the limitation that TOON (Token-Oriented Object Notation) seeks to address. Its mission is simple: provide a structured data format for LLMs that conveys the exact same information as JSON but at a fraction of the token cost.

The TOON Solution: Data Density Over Punctuation

TOON is not designed to replace JSON everywhere; it is an engineered solution specifically for the LLM context window. It preserves the hierarchical structure and human readability we expect, but it strips away the ceremonial punctuation that wastes tokens. The results are compelling, often achieving 30% to over 60% token reduction depending on the structure of the data.

How does it achieve such dramatic savings? The core of TOON’s architecture relies on three key shifts:

1. Structural Clarity via Indentation

TOON eliminates the need for curly braces ({}) and square brackets ([]) to denote objects and arrays. Instead, it leverages a simple, fixed two-space indentation for nested objects, similar to YAML or Python.

For instance, a nested object structure like this in JSON:

{
  "user": {
    "id": 123,
    "name": "Ada"
  }
}

…becomes this concise structure in TOON:

user:
  id: 123
  name: Ada

This simple change immediately removes four structural tokens (the two pairs of braces), along with the quotes around every key.
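The indentation rule above can be sketched in a few lines of JavaScript. This is a toy illustration, not the library's actual implementation: it handles only flat values and nested plain objects (no arrays, no quoting).

```javascript
// Minimal sketch of TOON's indentation rule (not the library's code):
// emit "key: value" lines, with two-space indentation per nesting level.
function encodeObject(obj, depth = 0) {
  const pad = '  '.repeat(depth);
  return Object.entries(obj)
    .map(([key, value]) =>
      typeof value === 'object' && value !== null
        ? `${pad}${key}:\n${encodeObject(value, depth + 1)}`
        : `${pad}${key}: ${value}`
    )
    .join('\n');
}

console.log(encodeObject({ user: { id: 123, name: 'Ada' } }));
// user:
//   id: 123
//   name: Ada
```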

2. The Power of Tabular Arrays (The Game Changer)

The biggest token waste in JSON is the repetition of keys and structural punctuation within arrays of objects. Consider an array of three simple product records. In JSON, the keys (“sku”, “name”, etc.) and delimiters are repeated for every single item.

TOON solves this with its Tabular Array format. It declares the field structure once in a header, then lists the data rows on separate lines in a simple comma-separated (CSV-style) layout:

JSON (117 tokens):

{ "items": [ { "sku": "A1", "name": "Widget", … }, … ] }

TOON (49 tokens):

items[3]{sku,name,qty,price}:
  A1,Widget,2,9.99
  B2,Gadget,1,14.5
  ...

In the TOON format, the keys (sku,name,qty,price) are written just once. The LLM is instructed to parse the following lines based on that fixed schema, resulting in more than a 58% token reduction for this example.
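The tabular layout itself is easy to sketch. The following is a minimal illustration (an assumption, not the library's code) for a uniform array of flat objects; the real encoder additionally handles quoting, nested values, and alternative delimiters:

```javascript
// Minimal sketch of a TOON-style tabular block (not the library's
// implementation): declare the fields once in the header, then emit
// one comma-separated row per record.
function encodeTabular(name, rows) {
  const fields = Object.keys(rows[0]);
  const header = `${name}[${rows.length}]{${fields.join(',')}}:`;
  const lines = rows.map(row => '  ' + fields.map(f => row[f]).join(','));
  return [header, ...lines].join('\n');
}

const items = [
  { sku: 'A1', name: 'Widget', qty: 2, price: 9.99 },
  { sku: 'B2', name: 'Gadget', qty: 1, price: 14.5 },
];
console.log(encodeTabular('items', items));
// items[2]{sku,name,qty,price}:
//   A1,Widget,2,9.99
//   B2,Gadget,1,14.5
```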

3. LLM Guardrails and Tokenizer-Aware Formatting

TOON isn’t just about compression; it’s about reliability. It includes two features designed to make structured generation and parsing more reliable for the LLM:

  • Explicit Length Markers: All arrays include their length in the header (e.g., items[3]). This gives the LLM a clear target for how many elements to generate, reducing truncation errors.
  • Minimal and Deterministic Quoting: TOON is highly selective about quoting string values, only doing so if a value contains a space, comma, or another special character that would break parsing. It also offers an option to switch from a comma delimiter to a Tab or Pipe delimiter, which can provide even greater token savings and reduce quoting needs in specific use cases.
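On the consuming side, the explicit length marker enables a cheap integrity check. Here is a minimal sketch of such a parser-side guardrail (a hypothetical helper, not part of the library's API) that rejects truncated output instead of silently accepting it:

```javascript
// Hypothetical guardrail sketch (not the library's API): parse a
// TOON tabular block and verify the declared row count.
function parseTabular(toon) {
  const [header, ...rows] = toon.trim().split('\n');
  // Header shape: name[count]{field1,field2,...}:
  const match = header.match(/^(\w+)\[(\d+)\]\{([^}]*)\}:$/);
  if (!match) throw new Error('not a tabular TOON block');
  const [, name, count, fieldList] = match;
  const fields = fieldList.split(',');
  // The declared length must match the rows actually present, so a
  // truncated LLM response fails loudly rather than silently.
  if (rows.length !== Number(count)) {
    throw new Error(`expected ${count} rows, got ${rows.length}`);
  }
  const records = rows.map(line => {
    const values = line.trim().split(',');
    return Object.fromEntries(fields.map((f, i) => [f, values[i]]));
  });
  return { name, records };
}

console.log(parseTabular('users[2]{id,name}:\n  1,Alice\n  2,Bob').records);
```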

Technology and Implementation

The johannschopplich/toon repository provides a practical, production-ready implementation of this notation. It’s packaged as an easy-to-use library, making the barrier to entry minimal for any tech-literate team.

The core technology is an encode function that takes any standard JavaScript/JSON object and converts it into the ultra-compact TOON string format:

import { encode } from '@byjohann/toon'

const data = {
  // Your standard JSON object here
  users: [
    { id: 1, name: 'Alice' },
    { id: 2, name: 'Bob' }
  ]
}

// Convert JSON to TOON before sending to the LLM API
// Convert JSON to TOON before sending to the LLM API
const toonString = encode(data)
/* Output (drastically fewer tokens):
users[2]{id,name}:
  1,Alice
  2,Bob
*/

By adding this simple encoding step to their workflow, engineers can immediately realize significant operational cost savings and push larger, richer datasets into their LLM prompts without hitting context limits. In a world where AI cost is a major line item, Token-Oriented Object Notation is a powerful, architectural lever for efficiency.

Shubhendu Sen


About Author

Shubhendu Sen is a law graduate and former software developer with two years of professional experience, having worked on both frontend and backend development of web applications, primarily within the JavaScript ecosystem. He is currently pursuing a Master of Cyber Law and Information Security at NLIU Bhopal and is ISC2 Certified in Cybersecurity (CC). His interests include cyber law, malware research, security updates, and the practical implementation and audit of GRC frameworks.
