Computer Vision APIs — What They Can Do Without a PhD
The full landscape of computer vision tasks, which API to use for each, and what things cost
Computer vision is the field of making computers understand images and video. Five years ago, doing this required training your own neural network. Today, you call an API. This lesson maps the landscape so you know which tool solves which problem before you write a line of code.
The six core computer vision tasks
- Image classification — What is in this image? Returns a label and confidence score. "This is a cat (97%). This is a golden retriever (89%)."
- Object detection — Where are the objects in this image? Returns bounding boxes with labels. Useful for counting items, detecting faces, or finding products on shelves.
- OCR (optical character recognition) — Extract text from images. Receipts, invoices, ID documents, screenshots — convert pixels to structured text.
- Image generation — Create images from text prompts. DALL-E, Stability AI, Flux. Also: edit existing images, extend them (outpainting), or apply styles.
- Face detection and analysis — Detect faces, estimate age/emotion, verify identity. AWS Rekognition and Azure Face API are the main providers.
- Video analysis — All the above, applied to video frames. Object tracking, activity recognition, content moderation at scale.
Choosing the right API
- For image generation — DALL-E 3 (OpenAI) for highest quality prompts. Stability AI or Flux for cost-efficient batch generation. Flux is the fastest-improving option in 2026.
- For OCR and document extraction — Google Cloud Vision for general OCR. AWS Textract for structured documents (forms, tables). Azure Document Intelligence for invoices and receipts.
- For image understanding / multimodal — Claude claude-sonnet-4-6 or GPT-4o for analysing images with natural language reasoning. Better for complex questions than simple classification APIs.
- For custom object detection — Roboflow for training and deploying custom models via API. Google Vertex AI Vision for enterprise scale.
The specialist clinic analogy
Computer vision APIs are like medical specialists. You do not send every patient to a radiologist. You send the chest X-ray to radiology, the invoice dispute to accounting, and the "is this person who they say they are" to identity verification. Matching the task to the right specialist — rather than using one API for everything — is how you get accurate results at reasonable cost.
Cost comparison (approximate 2026 pricing)
- Google Cloud Vision OCR — $1.50 per 1,000 images for the first million. Very cost-effective for document processing at scale.
- DALL-E 3 image generation — $0.04–0.08 per image depending on size and quality. Expensive for batch generation, fine for user-triggered generation.
- Claude vision (claude-sonnet-4-6) — Priced per token. A 1MB image costs roughly $0.002–0.005 to analyse depending on the output length.
- Roboflow inference — Free tier with 10,000 predictions/month. Scales to $0.0008 per prediction at volume.
Start with multimodal models
If you are not sure which CV task you need, start by sending your image to Claude or GPT-4o with a plain English question. Multimodal models handle ambiguous tasks better than specialised APIs and cost less than you think.
Try this
Pick an image from your current project (or a product photo, receipt, or screenshot). Describe in one sentence what you want to know about it. Then identify which category it falls into: classification, detection, OCR, generation, or understanding. That tells you which lesson to focus on first.