Computer Vision
Vision Transformer (ViT)
Applying the Transformer architecture (originally built for text) to images, where it matches or outperforms CNNs on many tasks when trained on enough data.
Definition
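A model architecture that applies the Transformer's self-attention mechanism to image patches rather than text tokens. The image is split into fixed-size patches, each patch is embedded as a vector, and the resulting sequence is processed much like the words of a sentence. ViT-based models have largely replaced CNNs as the state of the art for many computer vision tasks.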
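To make the definition concrete, here is a minimal NumPy sketch of the core idea: cut an image into patches, embed each patch as a token, and let every token attend to every other. It is an illustration under stated assumptions, not a full ViT. The function names (`image_to_patches`, `self_attention`), the 32x32 image, the 8x8 patch size, and the random weights are all hypothetical choices for this example, and position embeddings, the class token, multi-head attention, layer normalization, and the MLP blocks are left out.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    rows, cols = h // patch_size, w // patch_size
    # (rows, p, cols, p, c) -> (rows, cols, p, p, c) -> (num_patches, p*p*c)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # similarity of every patch pair
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                  # stand-in for a real image
patches = image_to_patches(image, patch_size=8)  # 16 patches of 8*8*3 values

# Random weights stand in for learned parameters (assumption for illustration).
embed_dim = 64
w_embed = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
tokens = patches @ w_embed                       # one token per patch

w_q, w_k, w_v = (rng.normal(scale=0.02, size=(embed_dim, embed_dim))
                 for _ in range(3))
out, weights = self_attention(tokens, w_q, w_k, w_v)

print(patches.shape, tokens.shape, out.shape)
```

The attention weights form a 16x16 matrix, one row per patch, so even in the first layer a patch in one corner can draw directly on information from the opposite corner. A convolutional filter, by contrast, only sees a small local neighborhood at each step.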
Why it matters
It brought vision and language AI onto a shared architecture, which makes it easier to build multimodal models like GPT-4V that handle images and text together.
Related terms in Computer Vision
CNN (Convolutional Neural Network)
An AI architecture designed to look at pictures, sliding small filters across the image grid to find edges, shapes, and objects.
Computer Vision
Teaching computers to "see" and understand images and video, much as humans do.
Object Detection
An AI that can find and identify multiple objects in an image, drawing boxes around each person, car, or sign it sees.
Image Segmentation
Teaching an AI to color-code every pixel in an image, identifying exactly where each object begins and ends.