
What is Knowledge Distillation

Knowledge Distillation is a model compression and knowledge transfer technique used to move the knowledge captured by a complex model (often a large deep learning model) into a simpler one. The fundamental principle is to train a smaller model (the student) to mimic the outputs of a larger model (the teacher), reducing computational resource consumption while retaining as much of the teacher's performance as possible.


The technique arose because deep learning models have grown increasingly complex and therefore costly to run at inference time. Applying knowledge distillation reduces model size and improves inference speed while keeping the accuracy loss small. In practice, the teacher model is run over the training data to produce soft labels (its full output probability distribution, usually softened with a temperature), and the student is then trained to match these soft labels, often alongside the ordinary hard-label loss.
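
As a concrete illustration, the following minimal sketch assumes PyTorch and already-constructed teacher and student classifiers; the function name, temperature T, and mixing weight alpha are illustrative choices, not part of the definition above.

    # Minimal sketch of one distillation training step (assumed setup: PyTorch,
    # a frozen `teacher`, a trainable `student`, and an `optimizer` for the student).
    import torch
    import torch.nn.functional as F

    def distillation_step(student, teacher, images, labels, optimizer,
                          T: float = 4.0, alpha: float = 0.5):
        teacher.eval()
        with torch.no_grad():
            # Soft labels: teacher logits softened by temperature T.
            soft_targets = F.softmax(teacher(images) / T, dim=1)

        student_logits = student(images)

        # Soft loss: KL divergence between the softened student and teacher
        # distributions. Scaling by T*T keeps gradient magnitudes comparable
        # across different temperatures.
        soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                             soft_targets, reduction="batchmean") * (T * T)

        # Hard loss: ordinary cross-entropy against the ground-truth labels.
        hard_loss = F.cross_entropy(student_logits, labels)

        loss = alpha * soft_loss + (1.0 - alpha) * hard_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The weighted sum of the soft and hard losses lets the student learn both from the teacher's richer probability distribution and from the original labels; the balance between them (alpha) and the temperature (T) are tuning choices.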


Knowledge distillation is widely used in areas such as image recognition, natural language processing, and speech recognition. In an image classification task, for instance, a large Convolutional Neural Network (CNN) can serve as the teacher model while a lightweight network acts as the student, as sketched below. As AI models continue to grow in complexity, knowledge distillation is expected to become increasingly common, especially for deployment on mobile and edge computing devices.
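
One possible teacher/student pairing for image classification is shown below; it assumes torchvision is available, and the small student architecture is only a hypothetical illustration, not a prescribed design.

    # Illustrative teacher/student pairing for a 1000-class image classifier.
    import torch.nn as nn
    import torchvision

    teacher = torchvision.models.resnet50()   # large CNN teacher (~25M parameters)

    student = nn.Sequential(                  # lightweight student network
        nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, 1000),                  # match the teacher's 1000 output classes
    )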


The advantages of knowledge distillation include significantly faster inference, higher efficiency, and lower memory usage. The main drawback is that the student model may fail to fully capture the teacher's knowledge, leading to some performance degradation. In addition, choosing appropriate architectures for both the teacher and the student is crucial for successful distillation.
