NOBODY IS TALKING ABOUT THE MOST EXPENSIVE STEP IN EVERY AI PIPELINE.
It isn't the model. It isn't the embeddings. It's PDF extraction, and most teams are still feeding their LLMs flattened garbage.
@nutrientdocs CLI quietly fixed this:
→ Messy PDF in, clean structured markdown out. Headings, tables, lists preserved, not mangled.
→ Runs locally. No signup, no API key. Free up to 1,000 docs/month.
→ One command: npm install -g @pspdfkit/pdf-to-markdown. Point it at a file or a whole folder.
→ On Claude Code or Cursor? Install it as a skill and your agent calls it automatically when it sees a PDF.
Your retrieval is only as good as your chunks. Your chunks are only as good as your extraction.
Everyone is optimizing the top of the stack. The real unlock is at the bottom.
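
Why the extraction step matters for chunks, as a rough sketch: if headings survive extraction, you can split on them and get one chunk per section. The function name `chunk_by_heading` and the sample text are made up for illustration, and this is not part of any tool mentioned above. It assumes ATX-style `#` headings in the markdown.

```python
import re

# Hypothetical helper, for illustration only: one chunk per heading section.
# Assumes ATX headings ("# Title"). It does not skip fenced code blocks,
# so a "# comment" line inside code would be misread as a heading.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_by_heading(markdown: str) -> list[dict]:
    chunks = []
    title, lines = None, []

    def flush():
        # Emit the current section if it has a title or any non-blank text.
        if title is not None or any(l.strip() for l in lines):
            chunks.append({"title": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks

structured = "# Pricing\n| Plan | Cost |\n|---|---|\n| Pro | $10 |\n\n# Limits\n- 5 users\n"
flattened = "Pricing Plan Cost Pro $10 Limits 5 users"

print([c["title"] for c in chunk_by_heading(structured)])
print([c["title"] for c in chunk_by_heading(flattened)])
```

With structure preserved, each section becomes its own titled chunk. With flattened text there is nothing to split on, so everything lands in one untitled blob.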