Local LLM Output: Why My Em-Dash Preprocessing Was Flawed


Hazrat Ummar Shaikh, June 21, 2026

When you're building systems that rely on processing text generated by Large Language Models, particularly for tasks like RAG, summarization, or structured data extraction, the quality and consistency of that text are paramount. I learned this lesson the hard way while developing llmclean, an internal Python library designed to normalize and sanitize LLM outputs before they hit downstream services. My initial assumption, born from years of wrestling with human-generated, often messy, web content, was that LLMs, especially local ones, would spew out a veritable wild west of Unicode characters and inconsistent punctuation.

Specifically, I was convinced that em-dashes (—) would be a common culprit: sometimes emitted as a true em-dash, sometimes replaced by double hyphens (--) or even single hyphens (-), with no consistency. My plan was to implement a robust em-dash normalizer that standardized everything to one format. It seemed like a no-brainer. After all, if you've ever dealt with user-generated content in a Discord bot or processed free-form text from any API, you know the pain of inconsistent punctuation. I even considered how this might affect my FastAPI backends, where strict data contracts are king.

Before committing to this feature, however, a nagging voice in my head (or perhaps it was just good engineering practice) told me to test my hypothesis. I decided to run a quick sweep across several popular local LLMs. What I found not only surprised me but completely reshaped the direction of llmclean 0.3.0 and highlighted three critical misconceptions I held about LLM output.


The Em-Dash Conundrum: My Flawed Assumption

My core assumption was simple: LLMs, trained on vast and diverse datasets, would reflect the full spectrum of human textual quirks, including the often-interchangeable use of em-dashes, en-dashes, and hyphens. In English typography, the em-dash (—) signifies a sudden break in thought or an emphatic pause, an en-dash (–) indicates a range or connection, and a hyphen (-) joins words. Humans frequently misuse or substitute these due to keyboard limitations or lack of typographic awareness.
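To make the distinction concrete, here is a small sketch that prints the Unicode code point and official name of each of the three characters. The `describe` helper is a hypothetical name used only for illustration; it is not part of llmclean.

```python
import unicodedata

# The three characters discussed above, written as escapes so they are
# unambiguous regardless of font or editor rendering.
DASHES = ["-", "\u2013", "\u2014"]


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in DASHES:
    print(describe(ch))
# U+002D HYPHEN-MINUS
# U+2013 EN DASH
# U+2014 EM DASH
```

Because these are three distinct code points, a string comparison or regex written against one of them will not match the other two.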

For structured parsing, inconsistent use of these characters is a nightmare. A simple regex looking for a specific pattern might fail if the dash type varies. My initial thought process went like this: if I'm building a system to, say, parse user commands from a Discord server (a task I've tackled extensively when building a custom ticket bot, as discussed in "Building a High-Performance Discord Ticket Bot in Python"), I'd want to normalize all text inputs. This "normalize-everything" mindset had subconsciously carried over to LLM outputs.
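As a rough illustration of that failure mode, the sketch below compares a pattern that expects a literal em-dash with one that tolerates the common substitutes. The patterns, the `STRICT` and `TOLERANT` names, and the sample lines are all assumptions made up for this example.

```python
import re

# Strict: only matches "label <em-dash> description".
STRICT = re.compile("^- (?P<label>.+?) \u2014 (?P<desc>.+)$")

# Tolerant: accepts an em-dash, an en-dash, or a run of two or more hyphens.
TOLERANT = re.compile("^- (?P<label>.+?) (?:\u2014|\u2013|-{2,}) (?P<desc>.+)$")

lines = [
    "- Item one -- a crucial point.",
    "- Item two --- and another thought.",
    "- Item three \u2013 a final piece of advice.",
    "- Item four \u2014 typed correctly.",
]

strict_hits = [bool(STRICT.match(line)) for line in lines]
tolerant_hits = [bool(TOLERANT.match(line)) for line in lines]

print(strict_hits)    # [False, False, False, True]
print(tolerant_hits)  # [True, True, True, True]
```

The strict pattern silently drops three of the four lines, which is exactly the kind of quiet data loss that makes normalizing first look attractive.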

I envisioned scenarios where an LLM might generate a bulleted list, for instance:

- Item one -- a crucial point.
- Item two --- and another thought.
- Item three – a final piece of advice.

And I wanted to transform it consistently to:

- Item one — a crucial point.
- Item two — and another thought.
- Item three — a final piece of advice.

This seemed like a sensible cleaning step for any text intended for further programmatic processing.
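A minimal sketch of the kind of normalizer I had in mind might look like the following. The function name `normalize_dashes` and the matching rule are assumptions for illustration, not llmclean's actual API: it treats an en-dash, an em-dash, or a run of two or more hyphens as an em-dash only when it sits between words with whitespace on both sides.

```python
import re

EM_DASH = "\u2014"

# Match a spaced dash between two non-space characters:
#   word <spaces> (-- | --- | en-dash | em-dash) <spaces> word
# Single hyphens ("no-brainer"), unspaced ranges, and a leading list
# bullet ("- Item") are deliberately left alone.
_SPACED_DASH = re.compile(r"(?<=\S)[ \t]+(?:-{2,}|\u2013|\u2014)[ \t]+(?=\S)")


def normalize_dashes(text: str) -> str:
    """Rewrite spaced dash variants as a single spaced em-dash."""
    return _SPACED_DASH.sub(f" {EM_DASH} ", text)


raw = "\n".join([
    "- Item one -- a crucial point.",
    "- Item two --- and another thought.",
    "- Item three \u2013 a final piece of advice.",
])

print(normalize_dashes(raw))
```

Run on the list above, this should produce the three em-dash lines shown as the target output, while leaving the leading `- ` bullets and hyphenated words untouched. Requiring whitespace on both sides is a design choice that trades recall for safety: it will miss an unspaced `word--word`, but it avoids rewriting ranges or command-line flags such as `--verbose`.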



Written by

Hazrat Ummar Shaikh

Android Developer with 4+ years of experience. Built production Android apps, Ktor backends, Discord bots, and SaaS products using Kotlin, Python, and MongoDB. Passionate about building robust systems and writing clean code.