Shadman Ahmed
Software Architect
Software architect and AI tools enthusiast. I test, benchmark, and review AI models and developer tools so you don't have to.
123 Articles
47,660 Total Views
220K Words Written
All Articles (123 total)
AI Benchmarks Are Broken — This Book Explains Why
A new book by Moritz Hardt argues that benchmark rankings — not scores — are what actually matter. We tested his thesis against every major 2026 AI benchmark.
Claude Desktop: 5-Step Setup From MCP to Cowork
Set up the Claude desktop app from scratch — MCP extensions, Cowork agent, Computer Use, and power-user tips that'll save you hours.
OpenAI Japan's 5-Pillar Teen Safety Blueprint Explained
OpenAI Japan just launched its Teen Safety Blueprint — a framework combining age estimation, parental controls, and well-being safeguards to protect the 46% of Japanese high schoolers already using generative AI.
Krasis vs llama.cpp: Is 10x Faster LLM Inference Real?
Krasis LLM Runtime claims dramatically faster inference than llama.cpp for large MoE models on a single NVIDIA GPU. We break down the real numbers, the retracted benchmarks, and when each tool wins.
A $500 GPU Just Beat Claude Sonnet at Coding Tasks
ATLAS, a source-available AI system built by a Virginia Tech student, scores 74.6% on LiveCodeBench using a single $500 consumer GPU — outperforming Claude Sonnet's 71.4% at roughly $0.004 per task.
Google Opens Lyria 3 API: AI Music for 4 Cents a Track
Google Lyria 3 is now available to developers through the Gemini API at $0.04 per 30-second clip. Here's what you get, what's missing, and how it stacks up against Suno and Udio.
ChatGPT Becomes a Shopping Mall: 7 Retailers Already In
OpenAI just turned ChatGPT into a visual shopping assistant with product comparisons, image search, and feeds from Target, Sephora, Best Buy, and more — all powered by the Agentic Commerce Protocol.
Clarity-OMR vs Audiveris: 5 OMR Accuracy Tests
A deep-dive comparison of Clarity-OMR's machine learning approach against Audiveris's traditional computer vision for optical music recognition — with real benchmark data on 10 classical piano pieces.
5 Ways OpenAI Protects Sora 2 Users — And 3 Gaps
OpenAI details its five-layer safety system for Sora 2, including C2PA metadata, CSAM detection, and teen protections. But real-world testing reveals stubborn blind spots that watermarks and classifiers can't fix.
Grammarly AI Cloned 100+ Writers — A $5M Lawsuit and an Apology
Superhuman's CEO sat for a Decoder interview with The Verge's editor — one of the writers Grammarly's AI cloned without permission. It got tense.
ROCm 7 vs Vulkan on Mi50: 4-Model Benchmark Results
New benchmarks pit ROCm 7 nightly against Vulkan on an AMD Mi50 32GB running llama.cpp. Vulkan wins short-context dense inference, but ROCm dominates everything else — with a stability catch.
CRYSTAL Benchmark Exposes How AI Models Fake Reasoning
A new benchmark tested 20 multimodal AI models and found 19 of them cherry-pick reasoning steps while skipping actual thinking. The gap between accuracy and reasoning quality is alarming.