Clarity-OMR vs Audiveris: 5 OMR Accuracy Tests
A deep-dive comparison of Clarity-OMR's machine learning approach against Audiveris's traditional computer vision for optical music recognition — with real benchmark data on 10 classical piano pieces.

The best open-source optical music recognition software depends on what you're scanning. Audiveris is the safer all-around pick with a 44.0 average quality score across 10 classical piano benchmarks. But Clarity-OMR — a brand-new ML-based OMR tool — scores roughly double to nearly triple Audiveris's numbers on clean, rhythmic pieces like Bartók (69.5 vs 25.9) and Joplin's The Entertainer (66.2 vs 33.9).
The real takeaway: OMR accuracy depends heavily on your source material. A developer going by [Clarity___ on r/MachineLearning](https://www.reddit.com/r/MachineLearning/comments/1ru8okj/p_ive_trained_my_own_omr_model_optical_music/) just dropped an entirely ML-driven optical music recognition pipeline that converts sheet music PDFs into MusicXML. It's called Clarity-OMR, and it's going head-to-head with Audiveris — the reigning open-source OMR engine that musicians have relied on for years. Both are free. Both output MusicXML. But they take radically different approaches to the same problem, and those differences produce some surprising results.
Let's break it down.
| Feature | Clarity-OMR | Audiveris |
|---|---|---|
| Approach | Deep learning (DaViT + Transformer) | Hybrid CV + NN glyph classifier |
| Input Format | Sheet music PDFs | PDFs and images |
| Output Format | MusicXML | MusicXML |
| Avg Quality Score | 42.8 (mir_eval, 10 pieces) | 44.0 (mir_eval, 10 pieces) |
| Best Single Score | 69.5 (Bartók) | ~44 (consistent across pieces) |
| Open Source | Yes | Yes |
| GPU Required | Yes (CUDA) | No |
| Language | Python / PyTorch | Java |
| Price | Free | Free |
| Maturity | New (2026) | Established (10+ years) |
Clarity-OMR's architecture is honestly pretty clever in the way it breaks the problem apart. Think of it like an assembly line with four specialized stations instead of one worker trying to do everything at once.
Stage 1: Staff Detection (YOLO). A YOLO model scans each PDF page and identifies individual staves. This is the divide-and-conquer step — instead of processing an entire page end-to-end (which would blur fine details), Clarity-OMR crops each staff and processes them individually at 192px height. That resolution choice matters. It preserves grace notes, articulation marks, and dynamic markings that full-page approaches tend to smear into noise.
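The resize math behind that choice is simple aspect-ratio scaling. A minimal sketch, assuming a hypothetical `(x0, y0, x1, y1)` bounding-box format from the detector (the function name and format are illustrative, not from the Clarity-OMR repo):

```python
def scaled_crop_size(bbox, target_height=192):
    """Compute the output size for a staff crop resized to a fixed
    192 px height while preserving aspect ratio, so small symbols
    (grace notes, articulations) keep their detail.
    `bbox` is a hypothetical (x0, y0, x1, y1) staff bounding box."""
    x0, y0, x1, y1 = bbox
    width, height = x1 - x0, y1 - y0
    scale = target_height / height
    return round(width * scale), target_height

# A staff detected on a 300 DPI page scan:
print(scaled_crop_size((150, 400, 2330, 560)))  # → (2616, 192)
```

Because width scales with the same factor as height, a wide staff simply produces a wide crop; nothing gets squashed, which is what keeps dense markings legible.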
Stage 2: Recognition (DaViT + Transformer Decoder). Each cropped staff image hits a DaViT-Base encoder paired with a custom Transformer decoder using RoPE positional encoding. The decoder outputs tokens from a 487-element music vocabulary — essentially a specialized language where each "word" represents a musical element like a note, rest, barline, or dynamic marking. The model uses DoRA rank-64 on all linear layers, a parameter-efficient fine-tuning technique that keeps the model size manageable without sacrificing quality.
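To make the vocabulary idea concrete, here's a toy detokenizer. The token names and IDs below are invented for illustration; the real 487-element vocabulary is defined by Clarity-OMR itself:

```python
# A toy slice of a music vocabulary in the spirit of Clarity-OMR's
# 487-token set; these names and IDs are invented for illustration.
VOCAB = {
    0: "<bos>", 1: "<eos>", 2: "barline",
    3: "clef-G2", 4: "time-4/4",
    5: "note-C4_quarter", 6: "note-E4_quarter",
    7: "note-G4_half", 8: "rest-quarter",
}

def detokenize(token_ids):
    """Map decoder output IDs back to symbolic music elements,
    dropping the special begin/end tokens."""
    return [VOCAB[t] for t in token_ids if VOCAB[t] not in ("<bos>", "<eos>")]

# One decoded staff: clef, time signature, three notes, barline.
print(detokenize([0, 3, 4, 5, 6, 7, 2, 1]))
```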
Stage 3: Grammar-Constrained Beam Search (FSA). This is the architecture's most distinctive piece. A finite state automaton enforces structural validity during decoding. The model literally can't output musically impossible sequences. Beat consistency, chord well-formedness, measure completeness — all checked in real time as tokens are generated.
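A stripped-down sketch of the beat-consistency piece of such a constraint, using an invented four-token vocabulary (this shows the general FSA idea, not Clarity-OMR's actual grammar):

```python
from fractions import Fraction

# Beat values for a toy vocabulary; illustrative, not Clarity-OMR's token set.
DURATION = {
    "note_quarter": Fraction(1, 4), "note_half": Fraction(1, 2),
    "rest_quarter": Fraction(1, 4), "barline": Fraction(0),
}

def allowed_next(tokens_so_far, measure_len=Fraction(4, 4)):
    """Return the set of tokens the grammar permits next: a token must
    keep the current measure at or under its full length, and 'barline'
    is legal only when the measure is exactly full. During beam search,
    every other token would be masked out before sampling."""
    filled = Fraction(0)
    for t in tokens_so_far:
        filled = Fraction(0) if t == "barline" else filled + DURATION[t]
    legal = set()
    for tok, dur in DURATION.items():
        if tok == "barline":
            if filled == measure_len:
                legal.add(tok)
        elif filled + dur <= measure_len:
            legal.add(tok)
    return legal

# Three quarter notes into a 4/4 bar: a half note would overflow,
# and a barline would be premature.
print(sorted(allowed_next(["note_quarter"] * 3)))  # → ['note_quarter', 'rest_quarter']
```

The payoff is exactly the guarantee described above: a five-beat 4/4 measure is unreachable, because the token that would create it never enters the beam.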
Stage 4: MusicXML Export. The validated token sequence gets converted to standard MusicXML.
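The export step can be sketched with the Python standard library alone. This is a minimal illustration of emitting a single MusicXML `<note>` element, not Clarity-OMR's actual writer:

```python
import xml.etree.ElementTree as ET

def note_to_musicxml(step, octave, duration_divs, note_type):
    """Render one validated token as a MusicXML <note> element.
    A minimal sketch of the export stage; element names follow the
    MusicXML standard, the function itself is illustrative."""
    note = ET.Element("note")
    pitch = ET.SubElement(note, "pitch")
    ET.SubElement(pitch, "step").text = step
    ET.SubElement(pitch, "octave").text = str(octave)
    ET.SubElement(note, "duration").text = str(duration_divs)
    ET.SubElement(note, "type").text = note_type
    return note

xml = ET.tostring(note_to_musicxml("C", 4, 1, "quarter"), encoding="unicode")
print(xml)
```

A real exporter also wraps notes in `<measure>` and `<part>` elements and declares the `<divisions>` time unit, but the element-building pattern is the same.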
Audiveris takes the opposite road entirely. It combines classical computer vision techniques — image processing, connected component analysis, and rule-based systems — with a neural network glyph classifier introduced in version 5.x. The result is a hybrid approach where symbol detection uses trained classifiers, but the overall pipeline still relies heavily on hand-crafted rules for layout analysis and musical interpretation.

The advantage? Rock-solid predictability. No GPU needed. No Python environment to wrangle. Download a JAR file, double-click, and go. And because Audiveris has been refined over many years of community contribution, it handles a genuinely wide variety of input formats and edge cases.
But even with NN classifiers for symbol recognition, the overall pipeline still depends on hand-crafted rules for layout analysis. Each new notation style or engraving quirk can require someone to manually add another rule. End-to-end ML models like Clarity-OMR learn these patterns from data instead.
According to [benchmarks shared by the developer](https://www.reddit.com/r/MachineLearning/comments/1ru8okj/p_ive_trained_my_own_omr_model_optical_music/), tested on 10 classical piano pieces with mir_eval, the headline numbers are 69.5 vs 25.9 on the Bartók selection and 66.2 vs 33.9 on The Entertainer.
Those aren't marginal differences. Clarity-OMR is scoring roughly double to nearly triple Audiveris on these pieces. For clean, professionally typeset, rhythmically clear classical music, the ML approach is in a different league.
Clarity-OMR's best performances are 2-3x better than Audiveris on the same pieces. But averages tell a different story — and averages are what matter in production.
The overall average across all 10 test pieces gives Audiveris a slight edge: 44.0 vs 42.8. That means Clarity-OMR's worst-case performances drag its average below an engine that doesn't use ML at all.
The developer is refreshingly honest about why: Clarity-OMR struggles "when the notes aren't properly on the stave." Slightly offset noteheads, unusual engraving conventions, dense polyphonic textures — these trip up the model. As of March 2026, this is the main limitation.
So the variance is the story. Audiveris is the tortoise — steady, predictable, mediocre everywhere. Clarity-OMR is the hare — brilliant when conditions align, fragile when they don't.
This is where Clarity-OMR's grammar FSA really earns its keep. By enforcing structural rules during decoding, the output is musically coherent even when individual note recognition stumbles. You won't get a 4/4 measure with five beats. You won't get a chord that violates basic voice-leading constraints.
Audiveris can sometimes output technically valid XML that's musically nonsensical — a misidentified time signature cascading into wrong beat groupings for an entire system. It's like a spell-checker that knows every word but can't tell if a sentence makes sense.
The grammar FSA is Clarity-OMR's secret weapon. It's like having a music theory professor checking every measure as it's decoded — catching structural errors that raw accuracy scores don't capture.
No sugarcoating this one — Audiveris wins in a landslide.

Audiveris: install Java 17+, download the package, and launch the GUI with drag-and-drop PDF import. Straightforward if you already have Java.
Clarity-OMR: install Python, install PyTorch with CUDA support, clone the inference repo from GitHub, download the model weights from Hugging Face, and run from the command line. If you've never set up a Python ML environment before, expect to spend at least an hour on dependencies alone (and that's if CUDA cooperates on the first try).
For a musician who just wants to digitize some Chopin? Audiveris. No question.
Both tools are fully open-source, but they invite fundamentally different kinds of contributions.
Audiveris needs Java developers who understand both music notation and image processing — a narrow intersection. Adding support for a new symbol type means writing new detection rules, new template matchers, new heuristics. It scales linearly with effort.
Clarity-OMR needs ML engineers and (critically) better training data. The training code is open-source, so anyone with GPU access can experiment. And the developer has identified clear improvement paths: better polyphonic training data, smarter grammar constraints, and more diverse synthetic score rendering. Add more data, retrain, get better results. That's a scaling curve hand-written rules can't match.
As of March 2026, Clarity-OMR is at version 1.0 with enormous room to grow. Audiveris has had over a decade to mature, and its improvement curve has naturally flattened.
Both tools are completely free and open-source. The only cost difference is hardware.
| Cost Factor | Clarity-OMR | Audiveris |
|---|---|---|
| Software License | Free (open source) | Free (open source) |
| Minimum Hardware | NVIDIA GPU with CUDA | Any CPU |
| Cloud GPU Cost | ~$0.50-2.00/hour if needed | $0 |
| Typical Setup Time | 30-60 minutes | 5 minutes |
The hidden cost with Clarity-OMR is the GPU requirement. If you don't own a CUDA-capable card, you're looking at either renting cloud GPU time or simply not using it. That's a meaningful barrier for casual users.
Don't skip this part. The developer benchmarked both tools on 10 classical piano pieces using mir_eval, the standard evaluation framework for music information retrieval. Here's the standout data:
| Piece | Clarity-OMR Score | Audiveris Score | Margin |
|---|---|---|---|
| Bartók (selected work) | 69.5 | 25.9 | +43.6 Clarity-OMR |
| The Entertainer (Joplin) | 66.2 | 33.9 | +32.3 Clarity-OMR |
| Overall Average (10 pieces) | 42.8 | 44.0 | -1.2 (Audiveris) |
The pattern is clear. Clarity-OMR's ceiling is dramatically higher than Audiveris's — but its floor is lower too. If you know your source material is clean and well-typeset, you can confidently pick Clarity-OMR. If you're processing a mixed batch of unknown quality? Audiveris gives you safer, more predictable results.
If you could cherry-pick scores, Clarity-OMR would clearly outperform Audiveris. But you don't get to cherry-pick in production.
As of March 2026, optical music recognition is hitting an inflection point. Rule-based approaches like Audiveris have been the default for over a decade. But ML-based approaches are catching up fast, and Clarity-OMR proves a single developer with the right architecture can reach competitive results against years of accumulated engineering.

The developer behind Clarity-OMR mentions a fourth possible approach: combining model-based recognition with general-purpose vision models. Imagine feeding a difficult passage to both Clarity-OMR and a vision-language model, then merging the results. That hybrid strategy could cover each system's blind spots — something worth watching as vision-language models continue to improve.
But let's not get ahead of ourselves. Right now, neither tool delivers perfect OMR. The overall quality scores (42.8 and 44.0 out of 100) tell you that optical music recognition remains a genuinely hard problem. Sheet music packs an absurd amount of information density into tiny spatial areas — far more than text OCR, which is the problem most people compare it to.
The tools that will win this space are the ones that can scale with data. And that favors the ML approach.
For reliability today: Audiveris. It's mature, CPU-friendly, and handles diverse inputs with less variance. A 44.0 vs 42.8 average gap is small, but Audiveris achieves it consistently.
For peak performance on clean scores: Clarity-OMR, and it's not close. Scoring 69.5 where Audiveris hits 25.9 isn't incremental improvement — it's a generational jump on the right material.
For long-term potential: Clarity-OMR. ML approaches improve with more data and compute. Rule-based approaches improve with more hand-written rules. One of those paths scales. The other doesn't.
For most users today: Start with Audiveris. If you're processing clean classical scores and have GPU access, test Clarity-OMR on your specific material. Both tools are free — the best approach is to run both on a sample and compare the MusicXML output directly.
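That side-by-side comparison doesn't have to be manual. A stdlib sketch that extracts each engine's pitch sequence from its MusicXML output and flags where they disagree (the element names come from the MusicXML standard; the comparison logic is just an illustration):

```python
import xml.etree.ElementTree as ET

def pitch_sequence(musicxml_text):
    """Extract the ordered (step, octave) pitches from a MusicXML string,
    so two engines' outputs can be compared note-by-note."""
    root = ET.fromstring(musicxml_text)
    return [
        (p.findtext("step"), p.findtext("octave"))
        for p in root.iter("pitch")
    ]

# Tiny stand-in documents; in practice, read each tool's exported file.
a = "<score><note><pitch><step>C</step><octave>4</octave></pitch></note></score>"
b = "<score><note><pitch><step>D</step><octave>4</octave></pitch></note></score>"
seq_a, seq_b = pitch_sequence(a), pitch_sequence(b)
mismatches = [i for i, (x, y) in enumerate(zip(seq_a, seq_b)) if x != y]
print(mismatches)  # note indices where the two engines disagree
```

Rhythm, dynamics, and layout errors won't show up in a pitch-only diff, so treat this as a first-pass screen before listening to or eyeballing the outputs.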
This is a space worth watching. Clarity-OMR is exactly the kind of scrappy, well-architected open-source project that tends to snowball with community support. Give it better training data and a year of iteration, and these benchmarks could look very different.
Sources
- [I've trained my own OMR model (optical music recognition)](https://www.reddit.com/r/MachineLearning/comments/1ru8okj/p_ive_trained_my_own_omr_model_optical_music/), posted by the developer on r/MachineLearning
FAQ

Does Clarity-OMR work on handwritten sheet music?
No. As of March 2026, Clarity-OMR is designed for professionally typeset sheet music PDFs. The model struggles even with typeset scores where notes are slightly offset from staff lines. Handwritten notation would require a completely different training dataset and likely architectural changes. For handwritten scores, commercial tools like PlayScore or manual transcription remain more practical options.
What GPU do you need to run Clarity-OMR?
You need an NVIDIA GPU with CUDA support. The model uses a DaViT-Base encoder with DoRA rank-64, which is relatively lightweight by modern ML standards. A card with 6-8 GB of VRAM (like an RTX 3060 or better) should handle inference comfortably. If you don't have a local GPU, cloud options like Google Colab Pro or Lambda Labs work — expect to pay roughly $0.50-2.00 per hour of processing time.
Can Clarity-OMR export MIDI?
Not directly. Clarity-OMR outputs MusicXML only. However, MusicXML can be easily converted to MIDI using free tools like MuseScore (import the MusicXML, export as MIDI) or command-line utilities like music21 in Python. The MusicXML output preserves richer musical information than MIDI — dynamics, articulations, text markings — so it's actually the better intermediate format.
Does Clarity-OMR run on Apple Silicon Macs?
Not natively. Clarity-OMR requires CUDA, which is NVIDIA-only. Apple Silicon Macs use Metal for GPU acceleration, and PyTorch's MPS backend doesn't support all the operations Clarity-OMR uses. Your best option is running it on a cloud GPU instance (Google Colab, AWS, or Lambda Labs) and uploading your PDFs. Audiveris runs natively on any Mac since it's Java-based and CPU-only.
How long does Clarity-OMR take per page?
Processing time depends on the number of staves per page and your GPU. Each staff goes through four pipeline stages (YOLO detection, DaViT encoding, beam search decoding, MusicXML export). On a modern GPU like an RTX 4070, expect roughly 5-15 seconds per page for a typical piano score with two staves. Orchestral scores with 15+ staves per page will take proportionally longer. Audiveris is generally faster per page since it runs on CPU without the overhead of neural network inference.