Building a Retinaface+CurricularFace Training Dataset with a Python Crawler

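The performance of a face recognition model depends heavily on the quality and diversity of its training data. This article shows how to use Python web crawling to efficiently build a high-quality face dataset for a Retinaface+CurricularFace pipeline.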

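1. Project Background and Requirements

In real face recognition projects, we often run short of training data. Public datasets are easy to obtain, but they frequently fall short for a specific scenario. For example, we may need images covering different lighting conditions, a range of head poses, and a broad spread of ethnicities and ages, which existing datasets may not cover well.

Retinaface is a strong face detection model that localizes faces and facial landmarks precisely, while CurricularFace improves recognition accuracy through a curriculum learning strategy. To get the most out of both models, however, we need a large amount of high-quality labeled data.

Collecting data by hand is slow and labor-intensive. A Python crawler can automate image collection from the web and greatly speed up dataset construction. Below I share a complete crawling workflow to help you build your own high-quality face dataset quickly.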

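2. Crawler Design and Technology Choices

When building a crawler for a face dataset, several factors matter: diversity of data sources, image quality, copyright, and collection efficiency. With these in mind, I chose the following stack:

Core tools:

Requests for fetching web pages
BeautifulSoup for HTML parsing
OpenCV for a first-pass image quality check
Multithreading to speed up collection

To reduce legal risk, we collect data only from sites that permit crawling, keep the request rate strictly limited, and follow each site's robots.txt. We also add a reasonable delay between requests so that we do not put excessive load on the target site.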
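The robots.txt check described above can be sketched with the standard library's `urllib.robotparser`. This is a minimal illustration, not part of the original collector: the helper names `is_allowed` and `crawl_delay`, the user-agent string, and the example robots.txt content are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def crawl_delay(robots_txt, user_agent, default=1.0):
    """Return the Crawl-delay for user_agent in seconds, or default if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


# Hypothetical robots.txt content, used only for illustration.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

if __name__ == "__main__":
    agent = "FaceDataCollector"  # hypothetical user-agent name
    print(is_allowed(EXAMPLE_ROBOTS, agent, "https://example.com/gallery/1.jpg"))
    print(is_allowed(EXAMPLE_ROBOTS, agent, "https://example.com/private/1.jpg"))
    print(crawl_delay(EXAMPLE_ROBOTS, agent))
```

In a real crawl the rules would be loaded from the site itself, for example with `RobotFileParser.set_url()` followed by `read()`, and the returned delay could be used in place of a hard-coded sleep.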

import os
import random
import time
from concurrent.futures import ThreadPoolExecutor

import cv2
import insightface  # used later for face detection and alignment
import requests
from bs4 import BeautifulSoup


class FaceDataCollector:
    def __init__(self, save_dir="./face_dataset"):
        self.save_dir = save_dir
        # Browser-like User-Agent sent with every request
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        os.makedirs(save_dir, exist_ok=True)

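3. Implementing the Crawler Step by Step

3.1 Choosing Data Sources and a Collection Strategy

Choosing suitable data sources is the first step. I recommend drawing on several sources to keep the dataset diverse: celebrity datasets, face datasets published by academic institutions, and avatar sites that permit crawling.

In practice, we need to analyze the page structure of the target site and find the actual image URLs. Taking a public image gallery as an example, image links can be extracted like this: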

    def extract_image_links(self, url):
        """Extract all portrait image links from the target page."""
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()  # treat HTTP error codes as failures
            soup = BeautifulSoup(response.content, 'html.parser')

            image_links = []
            # Adjust the selector to match the HTML structure of each site
            for img_tag in soup.find_all('img', {'class': 'portrait'}):
                img_url = img_tag.get('src')
                if img_url and 'portrait' in img_url:
                    image_links.append(img_url)
            return image_links
        except Exception as e:
            print(f"Error extracting image links: {e}")
            return []
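The `src` attributes collected this way are often relative or protocol-relative, and the same image can appear several times on a page. The sketch below shows one way to clean the list before downloading. The helper name `normalize_image_links` and the example URLs are hypothetical, and it uses only `urllib.parse` from the standard library.

```python
from urllib.parse import urljoin, urlparse


def normalize_image_links(page_url, raw_links):
    """Resolve links against page_url, keep http(s) only, drop duplicates."""
    seen = set()
    result = []
    for link in raw_links:
        if not link:
            continue
        absolute = urljoin(page_url, link.strip())
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skips data: URIs and other non-web schemes
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


if __name__ == "__main__":
    links = ["/img/a.jpg", "b.jpg", "//cdn.example.com/c.jpg", "/img/a.jpg"]
    for url in normalize_image_links("https://example.com/gallery/page1.html", links):
        print(url)
```

Order is preserved, so the index later passed to the download step still follows the order in which images appeared on the page.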

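3.2 Downloading Images and Filtering by Quality

Not every downloaded image is suitable for training. We run a first-pass filter so that only images that are large enough and sharp enough are kept: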

    def download_and_validate_image(self, img_url, index):
        """Download an image and check its quality."""
        temp_path = os.path.join(self.save_dir, f"temp_{index}.jpg")
        try:
            # Download the image
            response = requests.get(img_url, headers=self.headers, timeout=15)
            if response.status_code != 200:
                return False

            # Save to a temporary file
            with open(temp_path, 'wb') as f:
                f.write(response.content)

            # Decode with OpenCV; None means the file is not a readable image
            img = cv2.imread(temp_path)
            if img is None:
                return False

            # Reject images that are too small
            height, width = img.shape[:2]
            if width < 200 or height < 200:
                return False

            # Estimate sharpness with the variance of the Laplacian
            gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
            fm = cv2.Laplacian(gray, cv2.CV_64F).var()
            if fm < 100:  # reject blurry images
                return False

            # Rename images that pass every check
            final_path = os.path.join(self.save_dir, f"face_{index}_{int(fm)}.jpg")
            os.replace(temp_path, final_path)
            return True
        except Exception as e:
            print(f"Error downloading or validating image: {e}")
            return False
        finally:
            # Clean up the temporary file if the image was rejected or an error occurred
            if os.path.exists(temp_path):
                os.remove(temp_path)
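To make the sharpness score less of a black box, here is a dependency-free sketch of the same idea. It applies the 4-neighbour Laplacian kernel (the one `cv2.Laplacian` uses with its default aperture) and takes the variance of the responses. The function name `laplacian_variance` and the two toy images are illustrative only. Unlike OpenCV, this version skips border pixels, so its values will not match `cv2.Laplacian(...).var()` exactly.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    gray is a list of rows of numeric intensities, at least 3x3 in size.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


# Two toy 5x5 "images": one with no edges, one with an edge at every pixel.
flat = [[128] * 5 for _ in range(5)]
checker = [[255 * ((x + y) % 2) for x in range(5)] for y in range(5)]

if __name__ == "__main__":
    print(laplacian_variance(flat))
    print(laplacian_variance(checker))
```

A blurry image has few strong edges, so its Laplacian responses cluster near zero and the variance is low. The cutoff of 100 used in the article is an empirical value that likely needs tuning per source, since the score also depends on resolution and image content.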

3.3 Speeding Up Collection with Multithreading

To improve throughput, we use a thread pool to download several images concurrently:

    def batch_download_images(self, image_urls, max_workers=5):
        """Download images in bulk with a thread pool."""
        # Note: this method uses the random module, so the file needs `import random`.
        successful_downloads = 0
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = []
            for i, img_url in enumerate(image_urls):
                # Random delay before each submission to avoid hammering the site
                time.sleep(0.5 + random.random())
                futures.append(executor.submit(self.download_and_validate_image, img_url, i))

            for future in futures:
                if future.result():
                    successful_downloads += 1

        print(f"Successfully downloaded {successful_downloads}/{len(image_urls)} images")
        return successful_downloads
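In the method above, the delay runs in the submitting thread, so tasks enter the pool roughly one per second regardless of `max_workers`. An alternative, sketched below under my own assumptions, is a small thread-safe limiter that each worker calls just before its request. `RateLimiter` and `run_throttled` are hypothetical names and not part of the original collector.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Allow at most one acquire() per min_interval seconds across all threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = time.monotonic()

    def acquire(self):
        # Reserve the next free time slot under the lock, then sleep outside it
        # so that waiting threads do not block each other's bookkeeping.
        with self._lock:
            slot = max(time.monotonic(), self._next_allowed)
            self._next_allowed = slot + self.min_interval
        delay = slot - time.monotonic()
        if delay > 0:
            time.sleep(delay)


def run_throttled(tasks, min_interval=0.5, max_workers=5):
    """Run zero-argument callables in a thread pool, spacing their start times."""
    limiter = RateLimiter(min_interval)

    def worker(task):
        limiter.acquire()
        return task()

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(worker, tasks))


if __name__ == "__main__":
    jobs = [lambda i=i: f"downloaded item {i}" for i in range(4)]
    print(run_throttled(jobs, min_interval=0.2))
```

With this design the request rate is bounded by `min_interval`, while slow downloads can still overlap because the wait happens before each request rather than between submissions.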

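4. Data Cleaning and Preprocessing

The raw images need careful cleaning and preprocessing before they can be used for training. This stage directly affects the performance of the final model.

4.1 Face Detection and Alignment

We run Retinaface on the downloaded images to detect and align faces, keeping only images that contain exactly one detected face: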

    def align_and_crop_faces(self, input_dir, output_dir):
        """Detect and align faces with Retinaface."""
        # Requires `import insightface` among the file's imports.
        os.makedirs(output_dir, exist_ok=True)

        detector = insightface.app.FaceAnalysis()
        detector.prepare(ctx_id=0, det_size=(640, 640))

        processed_count = 0
        for img_name in os.listdir(input_dir):
            img_path = os.path.join(input_dir, img_name)
            img = cv2.imread(img_path)
            if img is None:
                continue

            # Face detection
            faces = detector.get(img)
            if len(faces) == 1:  # keep only images with exactly one face
                face = faces[0]
                # Bounding box and the five detector keypoints; recent insightface
                # releases expose the keypoints as `kps` (older ones used `landmark`)
                bbox = face.bbox.astype(int)
                landmarks = face.kps.astype(int)

                # Align and save the face; face_alignment is a helper method
                # that is not shown in this article
                aligned_face = self.face_alignment(img, landmarks)
                output_path = os.path.join(output_dir, f"aligned_{processed_count}.jpg")
                cv2.imwrite(output_path, aligned_face)
                processed_count += 1

        return processed_count
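The article does not show the `face_alignment` helper, so the following is only a sketch of the geometry such a helper typically relies on, not the author's implementation. It computes the 2x3 similarity matrix (rotation, uniform scale, translation) that moves the two eye landmarks onto fixed positions in a 112x112 crop. The target coordinates are the eye points of the five-point template commonly used with ArcFace-style models, which is an assumption here. The function names are hypothetical, and "left" means the eye that appears on the left side of the image.

```python
# Eye centres in a 112x112 crop, from the five-point template commonly used
# for ArcFace-style models (an assumption, not taken from the article).
TEMPLATE_LEFT_EYE = (38.2946, 51.6963)
TEMPLATE_RIGHT_EYE = (73.5318, 51.5014)


def eye_alignment_matrix(left_eye, right_eye,
                         dst_left=TEMPLATE_LEFT_EYE,
                         dst_right=TEMPLATE_RIGHT_EYE):
    """Return a 2x3 similarity matrix that maps the eyes onto the template."""
    src1, src2 = complex(*left_eye), complex(*right_eye)
    dst1, dst2 = complex(*dst_left), complex(*dst_right)
    if src1 == src2:
        raise ValueError("eye landmarks must be distinct")
    # A similarity transform is z -> a*z + b in the complex plane.
    a = (dst2 - dst1) / (src2 - src1)  # rotation and uniform scale
    b = dst1 - a * src1                # translation
    return [[a.real, -a.imag, b.real],
            [a.imag, a.real, b.imag]]


def apply_matrix(matrix, point):
    """Apply a 2x3 affine matrix to an (x, y) point."""
    x, y = point
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])


if __name__ == "__main__":
    m = eye_alignment_matrix((100.0, 120.0), (160.0, 100.0))
    print(apply_matrix(m, (100.0, 120.0)))
    print(apply_matrix(m, (160.0, 100.0)))
```

The resulting matrix can be converted to a float32 NumPy array and passed to `cv2.warpAffine(img, M, (112, 112))` to produce the aligned crop. Two points determine a similarity transform exactly. A least-squares fit over all five landmarks is more robust to a noisy eye detection, and insightface ships such a routine as `insightface.utils.face_align.norm_crop`.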

4.2 Data Augmentation

To increase diversity, we apply data augmentation to the aligned face images:

    def augment_face_data(self, input_dir, output_dir, augmentations_per_image=5):
        """Generate several augmented versions of each face image."""
        # Note: this method uses the random module, so the file needs `import random`.
        os.makedirs(output_dir, exist_ok=True)

        # Augmentation helpers; each takes an image and returns a modified copy
        augmentations = [
            self.random_rotate,
            self.random_brightness,
            self.random_contrast,
            self.add_gaussian_noise,
            self.random_flip
        ]

        for img_name in os.listdir(input_dir):
            img_path = os.path.join(input_dir, img_name)
            img = cv2.imread(img_path)
            if img is None:
                continue

            # Save the original image
            base_name = os.path.splitext(img_name)[0]
            cv2.imwrite(os.path.join(output_dir, f"{base_name}_original.jpg"), img)

            # Generate augmented versions, each using 3 distinct random transforms
            for i in range(augmentations_per_image):
                augmented = img.copy()
                for augment in random.sample(augmentations, 3):
                    augmented = augment(augmented)
                output_path = os.path.join(output_dir, f"{base_name}_aug_{i}.jpg")
                cv2.imwrite(output_path, augmented)
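The augmentation helpers referenced above (`random_rotate`, `random_brightness`, and so on) are not shown in the article. The sketch below illustrates the pattern with three stand-ins on plain nested lists of grayscale values, so it runs without OpenCV or NumPy. All names and parameter ranges are hypothetical, and a real pipeline would apply the same ideas to image arrays.

```python
import random


def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def adjust_brightness(img, delta):
    """Add delta to every pixel and clip to the 0-255 range."""
    return [[min(255, max(0, p + delta)) for p in row] for row in img]


def add_noise(img, rng, sigma=5.0):
    """Add zero-mean Gaussian noise, then round and clip to 0-255."""
    return [[min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
            for row in img]


def augment(img, rng, k=2):
    """Apply k distinct, randomly chosen augmentations in random order."""
    ops = [
        flip_horizontal,
        lambda im: adjust_brightness(im, rng.randint(-40, 40)),
        lambda im: add_noise(im, rng),
    ]
    for op in rng.sample(ops, k):
        img = op(img)
    return img


if __name__ == "__main__":
    image = [[0, 128, 255], [10, 20, 30]]
    rng = random.Random(42)  # seeded so that a run can be reproduced
    print(augment(image, rng))
```

Passing an explicit `random.Random` instance rather than using the module-level functions makes each augmented set reproducible from its seed, which helps when comparing training runs.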