Computer Vision

Vision Transformer (ViT)

Applying the Transformer architecture (originally built for text) to images, where it often matches or outperforms CNNs when trained on enough data.

Definition

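A model architecture that splits an image into fixed-size patches, treats each patch as a token, and applies the Transformer's self-attention mechanism to those patch tokens rather than to text tokens. ViT-based models have largely replaced CNNs as the state of the art for many computer vision tasks.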
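The patch-and-attend idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a real ViT: the function names `patchify` and `self_attention` are hypothetical, the image is a tiny grayscale grid, and the attention step uses the raw patch vectors as queries, keys, and values. A real ViT adds learned linear projections, position embeddings, multiple heads, and MLP layers, all omitted here.

```python
import math


def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Flatten one patch row by row into a single vector (one "token").
            tokens.append([image[top + r][left + c]
                           for r in range(patch) for c in range(patch)])
    return tokens


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Simplifying assumption: queries, keys, and values are the tokens
    themselves (no learned projection matrices).
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this patch to every patch, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Each output token is a weighted average of all patch vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


# A 4x4 toy image cut into 2x2 patches gives 4 tokens of length 4.
image = [[0.0, 0.1, 0.2, 0.3],
         [0.4, 0.5, 0.6, 0.7],
         [0.8, 0.9, 1.0, 1.1],
         [1.2, 1.3, 1.4, 1.5]]
tokens = patchify(image, 2)
mixed = self_attention(tokens)
print(len(tokens), len(tokens[0]))  # 4 4
```

Because every output token is a weighted average over all patches, each patch can draw on information from anywhere in the image in a single layer, which is the main contrast with the local filters of a CNN.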

Why it matters

ViT showed that the same Transformer architecture can serve both vision and language AI, which simplifies building multimodal models like GPT-4V that work with images and text together.

Related terms in Computer Vision

CNN (Convolutional Neural Network)

An AI architecture designed to look at pictures, sliding small filters across the image grid to find edges, shapes, and objects.

Computer Vision

Teaching computers to "see" and understand images and video just like humans do.

Object Detection

A task in which an AI finds and identifies multiple objects in an image, drawing boxes around each person, car, or sign it sees.

Image Segmentation

Teaching an AI to color-code every pixel in an image, identifying exactly where each object begins and ends.

