CS231n Assignment 3 Survival Guide: Transformer Image Captioning with PyTorch 2.6.0

张开发
2026/4/12 12:30:15 · 15 min read


## 1. Environment Setup and Data Preparation

Before starting the Transformer image captioning task in CS231n Assignment 3, a correct environment is the first step toward success. Many students run into compatibility problems right away, especially numerical-check failures caused by PyTorch version differences.

Key setup steps:

```bash
conda create -n assignment3 python=3.9
conda activate assignment3
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
pip install numpy matplotlib h5py imageio ipykernel
```

Note: you must use PyTorch 2.6.0. Different PyTorch versions produce different random sequences from `torch.manual_seed(231)`, which makes the numerical checks fail.

**COCO dataset tips:**

- Download `coco_captioning.zip` from the course website and extract it to the `/cs231n/datasets/coco_captioning` directory.
- Check `image_utils.py` and make sure the following fix is applied (newer Pillow releases removed the bare `Image.BICUBIC` constant):

```python
# Before
img = img.resize((width, height), Image.BICUBIC)
# After
img = img.resize((width, height), Image.Resampling.BICUBIC)
```

**Troubleshooting table:**

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Large numerical-check error | Wrong PyTorch version | Use exactly 2.6.0 |
| Dataset fails to load | Bad path or permissions | Check that the extraction path is absolute |
| Out of memory | Batch size too large | Reduce `batch_size` or use gradient accumulation |

## 2. Implementing the Core Transformer Modules

### 2.1 MultiHeadAttention

MultiHeadAttention is the heart of the Transformer; pay particular attention to the dimension reshuffling and the mask handling:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout=0.1):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.key = nn.Linear(embed_dim, embed_dim)
        self.query = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.attn_drop = nn.Dropout(dropout)
        self.n_head = num_heads
        self.emd_dim = embed_dim
        self.head_dim = self.emd_dim // self.n_head

    def forward(self, query, key, value, attn_mask=None):
        N, S, E = query.shape
        h, D = self.n_head, self.head_dim
        # Linear projections, then split into heads: (N, h, S, D)
        Q = self.query(query).view(N, S, h, D).transpose(1, 2)
        K = self.key(key).view(N, -1, h, D).transpose(1, 2)
        V = self.value(value).view(N, -1, h, D).transpose(1, 2)
        # Scaled dot-product attention scores
        attn_raw = torch.matmul(Q, K.transpose(-1, -2)) / math.sqrt(D)
        if attn_mask is not None:
            attn_score = attn_raw.masked_fill(attn_mask == 0, float('-inf'))
        else:
            attn_score = attn_raw
        # Softmax and dropout
        attn_weight = self.attn_drop(torch.softmax(attn_score, dim=-1))
        # Weighted sum per head, then concatenate heads and project
        attn_out = torch.matmul(attn_weight, V)
        output = self.proj(attn_out.transpose(1, 2).reshape(N, S, E))
        return output
```
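A quick way to validate the masking logic is a causality check: with a causal mask, perturbing a later token must not change the output at an earlier position. A minimal standalone sketch, using PyTorch's built-in `nn.MultiheadAttention` as a stand-in (the hand-written module above should pass the same test with `attn_mask=torch.tril(torch.ones(T, T))`):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
mha.eval()  # make the check deterministic

T = 4
x = torch.randn(1, T, 8)
# Bool mask for nn.MultiheadAttention: True = "may NOT attend"
# (upper triangle = future positions)
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
out, _ = mha(x, x, x, attn_mask=causal)
assert out.shape == (1, T, 8)

# Perturb the *last* token; earlier outputs must be unaffected
x2 = x.clone()
x2[0, -1] += 1.0
out2, _ = mha(x2, x2, x2, attn_mask=causal)
assert torch.allclose(out[0, 0], out2[0, 0], atol=1e-6)
print("causal attention check passed")
```

If this assertion fails for your own module, the mask is almost certainly being applied with the wrong orientation or to the wrong axis.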
**Debugging tips:**

- Verify backpropagation with `torch.autograd.gradcheck`.
- Hand-compute the result on a tiny batch (e.g. `batch_size=2`) and compare.
- Check that the attention weight matrix looks reasonable (it should not be all 0s or all 1s).

### 2.2 PositionalEncoding

The positional encoding must follow the formula exactly; otherwise the model's understanding of sequence order suffers:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, embed_dim, dropout=0.1, max_len=5000):
        super().__init__()
        assert embed_dim % 2 == 0
        pe = torch.zeros(1, max_len, embed_dim)
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_dim, 2)
                             * (-math.log(10000.0) / embed_dim))
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)
```

Tip: the positional-encoding dimension must match the word-embedding dimension, and it needs no trainable parameters. If the model cannot learn sequence order, check here first.

## 3. Building the TransformerDecoderLayer

The decoder layer implements three key sublayers: masked self-attention, cross-attention, and the FFN. Note the difference between Pre-LN and Post-LN:

```python
class TransformerDecoderLayer(nn.Module):
    def __init__(self, input_dim, num_heads, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(input_dim, num_heads, dropout)
        self.cross_attn = MultiHeadAttention(input_dim, num_heads, dropout)
        self.ffn = nn.Sequential(
            nn.Linear(input_dim, dim_feedforward),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dim_feedforward, input_dim),
        )
        self.norm1 = nn.LayerNorm(input_dim)
        self.norm2 = nn.LayerNorm(input_dim)
        self.norm3 = nn.LayerNorm(input_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, tgt, memory, tgt_mask=None):
        # Masked self-attention (Pre-LN: normalize before the sublayer)
        residual = tgt
        tgt = self.norm1(tgt)
        tgt = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt = residual + self.dropout(tgt)
        # Cross-attention over the encoded image features
        residual = tgt
        tgt = self.norm2(tgt)
        tgt = self.cross_attn(tgt, memory, memory)
        tgt = residual + self.dropout(tgt)
        # Position-wise feed-forward network
        residual = tgt
        tgt = self.norm3(tgt)
        tgt = self.ffn(tgt)
        tgt = residual + self.dropout(tgt)
        return tgt
```

**Architecture choices:**

- For small datasets, Pre-LN (LayerNorm first) is more stable.
- For large datasets, Post-LN may reach better final performance.
- The dropout rate after the residual connections is typically set to 0.1-0.3.
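Before moving on, the sinusoidal formula from section 2.2 is worth verifying in isolation. A standalone sketch that rebuilds the encoding table directly from the formula and probes a few known values:

```python
import math
import torch

# Standalone rebuild of the sinusoidal table used by PositionalEncoding
max_len, embed_dim = 50, 16
position = torch.arange(max_len).unsqueeze(1)                # (50, 1)
div_term = torch.exp(torch.arange(0, embed_dim, 2)
                     * (-math.log(10000.0) / embed_dim))     # (8,)
pe = torch.zeros(max_len, embed_dim)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

# Position 0 must be [0, 1, 0, 1, ...] since sin(0)=0 and cos(0)=1
assert torch.allclose(pe[0, 0::2], torch.zeros(embed_dim // 2))
assert torch.allclose(pe[0, 1::2], torch.ones(embed_dim // 2))
# Every entry is bounded, and every position gets a distinct code
assert pe.abs().max().item() <= 1.0
assert torch.unique(pe, dim=0).shape[0] == max_len
print("positional encoding checks passed")
```

If the position-0 check fails, the sin/cos slices are swapped; if positions collide, `div_term` was likely computed over the wrong range.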
## 4. Training and Tuning the Full Model

### 4.1 Assembling CaptioningTransformer

Integrate the components into the complete image captioning model:

```python
import copy
import torch
import torch.nn as nn

class CaptioningTransformer(nn.Module):
    def __init__(self, word_to_idx, input_dim, wordvec_dim,
                 num_heads=4, num_layers=2, max_length=50):
        super().__init__()
        self._start = word_to_idx['<START>']
        self._null = word_to_idx['<NULL>']
        self.visual_proj = nn.Linear(input_dim, wordvec_dim)
        self.embedding = nn.Embedding(len(word_to_idx), wordvec_dim)
        self.pos_enc = PositionalEncoding(wordvec_dim, max_len=max_length)
        decoder_layer = TransformerDecoderLayer(wordvec_dim, num_heads)
        self.transformer = nn.ModuleList(
            [copy.deepcopy(decoder_layer) for _ in range(num_layers)])
        self.classifier = nn.Linear(wordvec_dim, len(word_to_idx))

    def forward(self, features, captions):
        # Project image features into the word-embedding space: (N, 1, W)
        memory = self.visual_proj(features).unsqueeze(1)
        # Word embeddings plus positional encoding
        caption_emb = self.embedding(captions)
        caption_emb = self.pos_enc(caption_emb)
        # Causal mask: position i may only attend to positions <= i
        T = captions.size(1)
        tgt_mask = torch.tril(torch.ones(T, T, device=captions.device))
        # Run the stacked decoder layers
        output = caption_emb
        for layer in self.transformer:
            output = layer(output, memory, tgt_mask)
        # Project to vocabulary scores
        scores = self.classifier(output)
        return scores
```

### 4.2 Training Tips and Hyperparameters

**Optimization strategy:**

- Use AdamW (better suited to Transformers than plain Adam).
- Learning-rate warmup is critical for stable training.
- Label smoothing helps reduce overfitting.

**Recommended hyperparameters:**

| Parameter | Suggested value | Notes |
| --- | --- | --- |
| batch_size | 64-128 | adjust for GPU memory |
| learning_rate | 3e-4 | use together with warmup |
| warmup_steps | 4000 | LR increases linearly to the peak |
| dropout | 0.1 | applies to embeddings and attention alike |
| weight_decay | 0.01 | prevents overfitting |

**Loss function** (the loss on `<NULL>` padding tokens is masked out; `null_idx` is passed in explicitly since this is a free function, not a method):

```python
import torch.nn.functional as F

def captioning_loss(scores, targets, null_idx):
    # Mask out positions holding the <NULL> padding token
    mask = (targets != null_idx)
    loss = F.cross_entropy(
        scores.view(-1, scores.size(-1)),
        targets.view(-1),
        reduction='none',
    )
    loss = (loss * mask.view(-1)).sum() / mask.sum()
    return loss
```
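The warmup schedule from the table can be wired up with `torch.optim.lr_scheduler.LambdaLR`. A minimal sketch under stated assumptions: the `nn.Linear` model is a stand-in for the captioning model, and the inverse-square-root decay after warmup is one common Transformer choice, not the only option:

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 1004)  # stand-in for the captioning model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

warmup_steps = 4000

def lr_lambda(step):
    # Linear ramp to the peak LR, then inverse-sqrt decay
    step = max(step, 1)
    return min(step / warmup_steps, (warmup_steps / step) ** 0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# The LR climbs during warmup and decays afterwards
lrs = []
for _ in range(8000):
    optimizer.step()   # in real training: loss.backward() comes first
    scheduler.step()
    lrs.append(optimizer.param_groups[0]['lr'])

assert lrs[3999] > lrs[100]   # still climbing during warmup
assert lrs[-1] < lrs[3999]    # decaying after warmup
```

At step 4000 the factor reaches 1.0, so the learning rate peaks at exactly the base `3e-4` before decaying.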
### 4.3 Evaluation and Generation

Implementing beam search noticeably improves caption quality. This sketch handles a single image (`features` has batch size 1) and assumes the model also stores `self._end = word_to_idx['<END>']`:

```python
def beam_search(self, features, beam_size=3, max_length=30):
    with torch.no_grad():
        # Initial beam: just the <START> token
        start_vec = torch.tensor([[self._start]], device=features.device)
        beams = [(start_vec, 0.0)]
        for _ in range(max_length):
            new_beams = []
            for seq, score in beams:
                # Sequences that already emitted <END> pass through unchanged
                if seq[0, -1] == self._end:
                    new_beams.append((seq, score))
                    continue
                # Log-probabilities for the next token; call forward with the
                # raw features so the visual projection happens inside
                output = self.forward(features, seq)
                log_probs = F.log_softmax(output[:, -1], dim=-1)
                topk_probs, topk_ids = log_probs.topk(beam_size, dim=-1)
                # Expand this beam with the top-k continuations
                for i in range(beam_size):
                    new_seq = torch.cat([seq, topk_ids[:, i:i+1]], dim=-1)
                    new_score = score + topk_probs[0, i].item()
                    new_beams.append((new_seq, new_score))
            # Keep the k highest-scoring beams
            beams = sorted(new_beams, key=lambda x: x[1], reverse=True)[:beam_size]
        return beams[0][0]
```

**Common generation problems and fixes:**

- Repetition: add a repetition penalty.
- Captions too short: adjust length normalization.
- Irrelevant content: improve beam-search diversity.

## 5. Extension: Vision Transformer (ViT)

Although the assignment focuses on captioning, implementing ViT deepens your understanding of Transformers in vision tasks.

### 5.1 PatchEmbedding

```python
class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided conv both cuts the image into patches and projects them
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, x):
        B, C, H, W = x.shape
        x = self.proj(x).flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)
        cls_tokens = self.cls_token.expand(B, -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)
        x = x + self.pos_embed
        return x
```

### 5.2 Training ViT on Small Datasets

- Strong data augmentation: MixUp, CutMix, RandAugment
- Knowledge distillation: a CNN teacher model guiding the ViT
- Progressive training: train at low resolution first, then fine-tune at high resolution
- Regularization: DropPath, label smoothing

**ViT vs CNN comparison:**

| Metric | ViT-B/16 | ResNet-50 | Notes |
| --- | --- | --- | --- |
| Parameters | 86M | 25M | ViT is usually larger |
| Compute | 17.6G | 4.1G | more FLOPs required |
| Small-data performance | fair | good | CNN inductive biases help |
| Large-data performance | excellent | good | ViT scales better |
| Training speed | slower | faster | ViT needs more epochs |
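The patch arithmetic in `PatchEmbedding` is easy to get wrong, so it pays to check the shapes once. A standalone sketch of the same strided-conv trick, with constants matching ViT-B/16 (rebuilt inline rather than importing the class above):

```python
import torch
import torch.nn as nn

img_size, patch_size, in_chans, embed_dim = 224, 16, 3, 768
num_patches = (img_size // patch_size) ** 2      # 14 * 14 = 196

# The strided conv slices the image into non-overlapping patches
# and projects each one to embed_dim in a single operation
proj = nn.Conv2d(in_chans, embed_dim,
                 kernel_size=patch_size, stride=patch_size)

x = torch.randn(2, in_chans, img_size, img_size)
patches = proj(x).flatten(2).transpose(1, 2)     # (B, 196, 768)
assert patches.shape == (2, num_patches, embed_dim)

# Prepending the class token yields the 197-token sequence ViT-B/16 expects
cls_token = torch.zeros(2, 1, embed_dim)
tokens = torch.cat((cls_token, patches), dim=1)
assert tokens.shape == (2, num_patches + 1, embed_dim)
print("patch embedding shapes OK")
```

If the first assertion fails, `img_size` is usually not divisible by `patch_size`, which silently drops border pixels.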
## 6. Advanced Debugging and Performance Optimization

### 6.1 Gradient Checking and Numerical Stability

Make sure gradients flow correctly through each module:

```python
def grad_check():
    # gradcheck needs float64 inputs for the default tolerances,
    # and dropout must be disabled or the check is nondeterministic
    model = MultiHeadAttention(embed_dim=64, num_heads=4).double()
    model.eval()
    x = torch.randn(2, 10, 64, dtype=torch.float64, requires_grad=True)
    ok = torch.autograd.gradcheck(
        lambda t: model(t, t, t).sum(), x, eps=1e-6, atol=1e-4)
    print("Gradient check passed:", ok)
```

**Common numerical problems:**

- NaN/Inf appears: check the value range going into softmax.
- Exploding gradients: add gradient clipping (`torch.nn.utils.clip_grad_norm_`).
- Unstable training: try a smaller learning rate or the Pre-LN structure.

### 6.2 Mixed-Precision Training

Use PyTorch AMP to speed up training and reduce memory usage (in PyTorch 2.6 the `torch.cuda.amp` entry points are deprecated in favor of `torch.amp`):

```python
scaler = torch.amp.GradScaler('cuda')
for data, target in dataloader:
    optimizer.zero_grad()
    with torch.amp.autocast('cuda'):
        output = model(data)
        loss = criterion(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

### 6.3 Memory Optimization

- Gradient checkpointing trades compute time for memory. Note that `checkpoint` cannot be called on an `nn.ModuleList` directly; checkpoint each decoder layer instead:

```python
from torch.utils.checkpoint import checkpoint

for layer in self.transformer:
    output = checkpoint(layer, output, memory, tgt_mask, use_reentrant=False)
```

- Activation recomputation: the same `torch.utils.checkpoint` mechanism applies to any activation-heavy block.
- Batch splitting: implement gradient accumulation manually.

## 7. Visualization and Result Analysis

### 7.1 Attention Visualization

Understanding what the model attends to:

```python
import matplotlib.pyplot as plt

def visualize_attention(image, caption, attention_weights):
    # Assumes a 7x7 grid of spatial attention weights per generated word
    fig = plt.figure(figsize=(15, 15))
    len_caption = len(caption)
    for i in range(len_caption):
        attn = attention_weights[:, i].reshape(7, 7)
        ax = fig.add_subplot((len_caption + 1) // 2, 2, i + 1)
        ax.set_title(caption[i], fontsize=12)
        img = ax.imshow(image)
        ax.imshow(attn, cmap='gray', alpha=0.6, extent=img.get_extent())
    plt.tight_layout()
    return fig
```

### 7.2 Evaluation Metrics

Beyond BLEU, you can also implement advanced metrics such as CIDEr and SPICE:

```python
def compute_cider(candidate, references, df, n=4):
    """
    candidate:  generated caption (str)
    references: list of reference captions (list of str)
    df:         corpus document-frequency statistics
    n:          maximum n-gram size
    """
    # CIDEr computation logic goes here
    ...
```

**Typical evaluation results:**

| Model | BLEU-1 | BLEU-4 | METEOR | CIDEr | SPICE |
| --- | --- | --- | --- | --- | --- |
| Baseline | 0.65 | 0.25 | 0.23 | 0.85 | 0.15 |
| + attention | 0.68 | 0.28 | 0.25 | 0.92 | 0.17 |
| + beam search | 0.71 | 0.31 | 0.27 | 1.05 | 0.19 |
| Full model | 0.75 | 0.35 | 0.30 | 1.20 | 0.22 |

## 8. Further Thoughts and Frontier Directions

### 8.1 The Evolution of Transformers in CV

- Hierarchical design: Swin Transformer's windowed attention
- Hybrid architectures: combining convolution with attention
- Efficient attention: linear and sparse attention variants
- Self-supervised learning: MAE, MoCo v3, and related methods

### 8.2 Exploration Beyond the Assignment

- Try CLIP-style contrastive learning
- Implement DETR-style object detection
- Explore combining diffusion models with Transformers
- Study lightweight Transformer deployment

**Recommended reading:**

- Attention Is All You Need (the original Transformer paper)
- An Image is Worth 16x16 Words (the ViT paper)
- Swin Transformer: Hierarchical Vision Transformer
- Masked Autoencoders Are Scalable Vision Learners
