Stop Grinding on C3D! Reimplement the Two-Stream Network from Scratch in PyTorch and Train Your First Action Recognition Model on UCF101

Zhang Kaifa

2026/4/21 18:34:01 · 15 min read


How does a computer tell playing tennis apart from playing golf the way a human does when it spots unusual behavior in surveillance footage? That is the core problem video action recognition solves. Unlike static image classification, video analysis must capture both spatial appearance and temporal motion patterns, which is exactly what makes the Two-Stream network's design so elegant. This article walks you through implementing this classic architecture from scratch in PyTorch and completing an end-to-end action recognition project on the UCF101 dataset.

1. Environment Setup and Data Preparation

1.1 Setting Up the Development Environment

Python 3.8 with PyTorch 1.10 is the recommended, verified stable combination. Create the environment with conda:

```bash
conda create -n action_rec python=3.8
conda activate action_rec
pip install torch==1.10.0 torchvision==0.11.1 opencv-contrib-python==4.5.5.64
```

Note: the TV-L1 optical flow API used later lives in OpenCV's contrib build, hence opencv-contrib-python rather than plain opencv-python. If you are on CUDA 11.3, install the matching cu113 build of torch; run nvidia-smi first to confirm your CUDA version.

1.2 Preparing the UCF101 Dataset

UCF101 contains 13,320 video clips covering 101 classes of human actions. After downloading and extracting, the directory layout should look like this:

```
UCF101/
├── ApplyEyeMakeup/
│   ├── v_ApplyEyeMakeup_g01_c01.avi
│   └── ...
├── Archery/
│   ├── v_Archery_g01_c01.avi
│   └── ...
└── ...
```

We need to split the dataset into training and test sets. The official release ships three predefined train/test splits; for a quick start, the helper below uses a stratified random split instead:

```python
import os
from sklearn.model_selection import train_test_split

def split_ucf101(data_path, test_size=0.2):
    classes = sorted(os.listdir(data_path))
    video_paths, labels = [], []
    for label, class_name in enumerate(classes):
        class_dir = os.path.join(data_path, class_name)
        videos = [os.path.join(class_dir, v) for v in os.listdir(class_dir)]
        video_paths.extend(videos)
        labels.extend([label] * len(videos))
    # stratify keeps the class distribution identical across train and test
    return train_test_split(video_paths, labels,
                            test_size=test_size, stratify=labels)
```
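If you prefer to follow the first official split instead, the files shipped with UCF101 (trainlist01.txt and friends) list one entry per line; the train lists use a "ClassName/video.avi label" format with 1-indexed labels (worth double-checking against your download). A minimal parser sketch:

```python
def parse_ucf_list(lines):
    """Parse official UCF101 train-list lines of the form 'Class/video.avi label'."""
    paths, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = line.split()
        paths.append(parts[0])
        labels.append(int(parts[1]) - 1)  # official labels are 1-indexed
    return paths, labels

# tiny inline sample in the assumed format
sample = [
    "ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1",
    "Archery/v_Archery_g01_c01.avi 2",
]
paths, labels = parse_ucf_list(sample)
# labels come back 0-indexed for use with CrossEntropyLoss
```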
2. Optical Flow Extraction in Practice

2.1 TV-L1 Optical Flow

The temporal stream of a Two-Stream network takes optical flow sequences as input. We use OpenCV's TV-L1 implementation:

```python
import os
import cv2
import numpy as np

def extract_optical_flow(video_path, save_dir, num_frames=10):
    cap = cv2.VideoCapture(video_path)
    ret, prev_frame = cap.read()
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    tvl1 = cv2.optflow.createOptFlow_DualTVL1()
    flows = []
    for _ in range(num_frames):
        ret, curr_frame = cap.read()
        if not ret:
            break
        curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
        flow = tvl1.calc(prev_gray, curr_gray, None)
        # normalize to [0, 255] and store as uint8 to save disk space
        flow = (flow - flow.min()) / (flow.max() - flow.min()) * 255
        flows.append(flow.astype(np.uint8))
        prev_gray = curr_gray
    cap.release()
    # save as one stacked array of shape (num_frames, H, W, 2)
    flow_stack = np.stack(flows, axis=0)
    np.save(os.path.join(save_dir,
                         os.path.basename(video_path)[:-4] + '.npy'),
            flow_stack)
```

Tip: TV-L1 is more accurate than the classic Farneback algorithm but considerably more expensive. For scenarios with tight real-time requirements, consider deep-learning flow estimators such as FlowNet2.

2.2 Visualizing Optical Flow

Understanding what the flow actually looks like is invaluable when debugging the model. The standard HSV encoding maps flow direction to hue and magnitude to brightness:

```python
import cv2
import numpy as np

def visualize_flow(flow):
    hsv = np.zeros((flow.shape[0], flow.shape[1], 3), dtype=np.uint8)
    hsv[..., 1] = 255                      # full saturation
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv[..., 0] = ang * 180 / np.pi / 2    # direction -> hue
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # magnitude -> value
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```
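The temporal stream implemented in the next section expects a 10-channel tensor per clip. A common convention, assumed here, is to stack the u/v components of 5 consecutive flow fields along the channel axis; a numpy sketch of that reshaping:

```python
import numpy as np

# stand-in for a loaded .npy flow stack: 5 frames, 224x224, 2 components (u, v)
flow_stack = np.random.rand(5, 224, 224, 2).astype(np.float32)

# (frames, H, W, 2) -> (frames, 2, H, W) -> (frames*2, H, W)
temporal_input = flow_stack.transpose(0, 3, 1, 2).reshape(-1, 224, 224)
# channel order is u1, v1, u2, v2, ..., giving the (10, H, W) input the network expects
```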
3. Implementing the Two-Stream Architecture

3.1 Spatial Stream CNN

The spatial stream classifies single RGB frames; we build it on ResNet-50:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SpatialStream(nn.Module):
    def __init__(self, num_classes=101):
        super().__init__()
        base_model = resnet50(pretrained=True)
        # everything except the final fc layer
        self.features = nn.Sequential(*list(base_model.children())[:-1])
        self.classifier = nn.Linear(base_model.fc.in_features, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.classifier(x)
```

3.2 Temporal Stream CNN

The temporal stream consumes stacked optical flow, so the first convolution must accept a different number of input channels:

```python
class TemporalStream(nn.Module):
    def __init__(self, num_classes=101):
        super().__init__()
        base_model = resnet50(pretrained=True)
        # replace the first conv to accept 10 flow channels (5 frames x u/v)
        original_conv1 = base_model.conv1
        self.conv1 = nn.Conv2d(10, 64, kernel_size=7, stride=2,
                               padding=3, bias=False)
        # initialize from pretrained weights: average over the RGB channels,
        # then replicate the result across all 10 flow channels
        with torch.no_grad():
            mean_weight = original_conv1.weight.mean(dim=1, keepdim=True)
            self.conv1.weight = nn.Parameter(mean_weight.repeat(1, 10, 1, 1))
        self.features = nn.Sequential(self.conv1,
                                      *list(base_model.children())[1:-1])
        self.classifier = nn.Linear(base_model.fc.in_features, num_classes)

    def forward(self, x):
        # input shape: (batch, 10, H, W)
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.classifier(x)
```

3.3 Fusion Strategies

A comparison of the main fusion approaches:

| Fusion strategy | Implementation complexity | Accuracy | Compute cost |
|---|---|---|---|
| Late averaging | Low | Medium | Low |
| Mid-level fusion | High | High | Medium |
| Joint training | Highest | Highest | High |

We implement mid-level fusion, a good compromise between accuracy and complexity:

```python
class TwoStreamNet(nn.Module):
    def __init__(self, num_classes=101):
        super().__init__()
        self.spatial = SpatialStream(num_classes)
        self.temporal = TemporalStream(num_classes)
        # fusion layer over the concatenated per-stream class scores
        self.fc = nn.Linear(num_classes * 2, num_classes)

    def forward(self, rgb, flow):
        spatial_out = self.spatial(rgb)
        temporal_out = self.temporal(flow)
        combined = torch.cat([spatial_out, temporal_out], dim=1)
        return self.fc(combined)
```
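For comparison, the simplest row of the table, late averaging, needs no extra parameters at all: run each stream independently and average their softmax scores. A numpy sketch with made-up logits for a 4-class toy case:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# made-up per-stream logits for one clip over 4 classes
spatial_logits = np.array([2.0, 0.5, 0.1, -1.0])
temporal_logits = np.array([0.3, 1.8, 0.2, -0.5])

# average probabilities, not logits, so each stream votes on a common scale
avg_probs = (softmax(spatial_logits) + softmax(temporal_logits)) / 2
pred = int(np.argmax(avg_probs))
# here the spatial stream's more confident vote wins: pred == 0
```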
4. Training and Optimization Tips

4.1 Data Augmentation

Video augmentation has to account for both the spatial and the temporal dimension:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.RandomHorizontalFlip(),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

flow_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 10, std=[0.5] * 10)
])
```

4.2 The Training Loop

A straightforward loop with a step-decay learning rate schedule:

```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def train_model(model, dataloader, criterion, optimizer, epochs=50):
    model.train()
    for epoch in range(epochs):
        for rgb, flow, labels in dataloader:
            rgb, flow, labels = rgb.to(device), flow.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(rgb, flow)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
        # decay the learning rate by 10x every 10 epochs
        if epoch > 0 and epoch % 10 == 0:
            for param_group in optimizer.param_groups:
                param_group['lr'] *= 0.1
```

4.3 GPU Memory Optimization

Video data makes it easy to run out of GPU memory. Common remedies:

- Gradient accumulation: train with small batches and step the optimizer only after several accumulated backward passes.
- Mixed-precision training: use PyTorch's native automatic mixed precision (torch.cuda.amp, which superseded the older apex library's AMP as of PyTorch 1.6).
- Frame sampling: sample frames randomly instead of loading long consecutive runs.

The mixed-precision pattern looks like this:

```python
scaler = torch.cuda.amp.GradScaler()

for rgb, flow, labels in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        outputs = model(rgb, flow)
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```
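Of the remedies above, gradient accumulation is the easiest to retrofit into an existing loop. A minimal sketch on a toy linear model (the model, data, and accum_steps value are illustrative placeholders, not from the article):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

accum_steps = 4  # effective batch size = mini-batch size * accum_steps
mini_batches = [(torch.randn(2, 4), torch.randint(0, 2, (2,)))
                for _ in range(8)]

updates = 0
optimizer.zero_grad()
for i, (x, y) in enumerate(mini_batches):
    # divide so the accumulated gradients average rather than sum
    loss = criterion(model(x), y) / accum_steps
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        updates += 1
# 8 mini-batches with accum_steps=4 -> 2 optimizer updates
```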
5. Evaluation and Deployment

5.1 Evaluation Metrics

Beyond overall accuracy, the confusion matrix reveals which classes the model mixes up:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def evaluate(model, dataloader):
    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for rgb, flow, labels in dataloader:
            outputs = model(rgb.to(device), flow.to(device))
            _, preds = torch.max(outputs, 1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.numpy())
    cm = confusion_matrix(all_labels, all_preds)
    plt.figure(figsize=(20, 20))
    sns.heatmap(cm, annot=True, fmt='d')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
```

5.2 Lightweight Deployment

Export a production-ready model with TorchScript:

```python
example_rgb = torch.rand(1, 3, 224, 224).to(device)
example_flow = torch.rand(1, 10, 224, 224).to(device)
traced_model = torch.jit.trace(model, (example_rgb, example_flow))
traced_model.save('two_stream.pt')
```

In real projects I have found that moving optical flow computation out of the model and into a preprocessing step significantly improves inference efficiency. For real-time systems, flow features can be computed ahead of time and cached.
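Before exporting the full two-stream model, it is worth sanity-checking the trace/save/load round trip on something tiny. A sketch using an in-memory buffer (the toy nn.Linear stands in for TwoStreamNet; on disk you would pass a filename instead of the buffer):

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(3, 2).eval()
example = torch.rand(1, 3)

traced = torch.jit.trace(model, example)

# save/load through a buffer; torch.jit.save also accepts a file path
buffer = io.BytesIO()
torch.jit.save(traced, buffer)
buffer.seek(0)
loaded = torch.jit.load(buffer)

# the reloaded module must reproduce the original outputs exactly
same = torch.allclose(model(example), loaded(example))
```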
