Forgeron3
Trends | Dec 22, 2025 | 6 min read

Multimodal for SMBs: demo or production?

Photos, diagrams, scanned invoices, meeting audio: the multimodal promise is attractive. Here’s what works today, what remains fragile, and where to honestly put your budget.

The Forgeron3 team, Marseille & Paris

The promise, in two sentences

An assistant that sees, reads, listens, and writes in a single conversation. You show it an invoice and it summarizes it; a technical sketch and it explains it; a meeting recording and it produces the minutes. It looks great in a demo.

Reading images and scans: production-ready

Use cases that work routinely for French SMBs:

  • Data extraction from scanned invoices and quotes. 95%+ accuracy with good OCR paired with a vision model.
  • Reading tables and charts. Solid on simple structures, fragile on stacked or multi-axis charts.
  • Identifying elements in field photos (construction, maintenance, quality control). Often best paired with a dedicated vertical model.

What’s still fragile: reading handwritten French, especially older handwriting.
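A figure like "95%+ accuracy" is easiest to trust once it has been measured on your own documents. The sketch below shows one way to do that: compare the fields extracted from a small sample of invoices against values checked by hand. The field names, the sample data, and the `field_accuracy` helper are illustrative assumptions, not part of any specific OCR or vision product.

```python
def field_accuracy(extracted, expected):
    """Return the share of expected fields that were extracted correctly.

    Both arguments are lists of dicts, one dict per document, in the same
    order. Field names and the normalization rule are assumptions made for
    this example.
    """
    def normalize(value):
        # Compare case-insensitively and ignore surrounding whitespace.
        return str(value).strip().lower()

    correct = 0
    total = 0
    for got, want in zip(extracted, expected):
        for field, want_value in want.items():
            total += 1
            if field in got and normalize(got[field]) == normalize(want_value):
                correct += 1
    return correct / total if total else 0.0


# Hypothetical hand-checked sample: two invoices, three fields each.
expected = [
    {"invoice_number": "FA-001", "total": "120.00", "supplier": "Dupont SARL"},
    {"invoice_number": "FA-002", "total": "89.50", "supplier": "Martin & Fils"},
]
# Hypothetical model output: one wrong total on the second invoice.
extracted = [
    {"invoice_number": "FA-001", "total": "120.00", "supplier": "dupont sarl"},
    {"invoice_number": "FA-002", "total": "39.50", "supplier": "Martin & Fils"},
]

accuracy = field_accuracy(extracted, expected)
print(f"Field-level accuracy: {accuracy:.0%}")  # prints "Field-level accuracy: 83%"
```

Counting per field rather than per document is a deliberate choice here: one wrong total on an otherwise correct invoice lowers the score without hiding the fields that were read correctly.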

Audio and meetings: partially production-ready

French audio transcription has progressed enormously since 2024. On clearly recorded meetings, whether in a studio or over video conference, models reach 95%+ accuracy.

Several challenges remain:

  • Multi-speaker meetings: models still confuse similar voices.
  • Noisy locations or hybrid meetings on low-bandwidth video.
  • Strong regional accents or highly specialized technical vocabulary.

Raw transcription is production-ready. Automatic summarization (structured minutes) still requires human review for high-stakes meetings.

Practical rule: The higher the stakes (legal, medical, contractual), the more a human must re-read. Multimodal speeds up the draft; it doesn’t replace the check.
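The practical rule can be written down as a small routing step in a document or meeting pipeline. This is a minimal sketch under stated assumptions: the category names come from the examples in the rule, while the `review_level` function and the two review levels ("full" and "spot_check") are hypothetical and would need to match your own process.

```python
# Assumption: these categories mirror the examples given in the rule.
HIGH_STAKES = {"legal", "medical", "contractual"}


def review_level(category: str) -> str:
    """Decide how much human re-reading an AI-generated draft should get.

    Returns "full" for high-stakes categories (a person re-reads the whole
    draft) and "spot_check" for everything else. Both labels are
    illustrative, not a standard.
    """
    if category.strip().lower() in HIGH_STAKES:
        return "full"
    return "spot_check"


# Hypothetical drafts produced by a multimodal assistant.
drafts = [
    ("Minutes of supplier contract negotiation", "contractual"),
    ("Weekly team stand-up summary", "internal"),
]

for title, category in drafts:
    print(f"{title}: {review_level(category)}")
```

Keeping the rule in one explicit function makes it easy to audit and to tighten later, for example by adding categories as you learn where errors are costly.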

Technical diagrams: still demo

On technical diagrams (architectural plans, electrical schematics, mechanical drawings, other complex drawings), generic models don’t hold up in production. They give plausible descriptions that are often wrong on the details.

Two serious options for these cases:

  1. A specialized vision model trained on your type of diagrams (high cost, reserved for large volumes).
  2. Pairing with an existing CAD system that decodes the semantics before passing them to the AI.

A sensible strategy for an SMB in 2026

  1. Document reading (invoices, contracts, scanned quotes): deploy.
  2. Meeting transcription with human review: deploy.
  3. Simple field image analysis (before/after, visual inspection): targeted pilot.
  4. Complex technical diagrams: wait another 12-18 months or go custom.

For general scoping, see Generative AI in 2026 and How to succeed with your project.

Test multimodal on your documents

Twenty minutes with a sample of your real invoices, field photos, or recordings. We look at quality, we size the effort.
