记录一次文字点选验证码的破解(中篇)

4 minute read

接上回汉字定位,本篇来讲讲汉字的分类识别

数据集准备

首先来做汉字的截取

import cv2


with open('results/comp4_det_test_hanzi.txt', 'r') as f:
    for line in f:
        vals = line.strip().split()
        fname, p, x1, y1, x2, y2 = vals
        fname = 'jiyan/data/valid/' + fname + '.jpg'
        x1 = int(float(x1))
        x2 = int(float(x2))
        y1 = int(float(y1))
        y2 = int(float(y2))
        p = float(p)
        img = cv2.imread(fname)
        if p > 0.5:  # 只有置信度大于50%时截取汉字
            crop_img = img[y1:y2, x1:x2]
            resized_image = cv2.resize(crop_img, (60, 60))  # 统一大小为60x60
            cv2.imwrite(fname+'.{}.jpg'.format(p), resized_image)

我总共截取了10w张图片,不过肯定有一些误识别的,这里花了半个小时肉眼过滤。10w张图片一个人标注非常慢,决定借助打码平台,吐槽一下云打码的Python client写的真是一言难尽。将图片的文件路径随机写到30个文件里,开启多线程,还是很快的,大概花了5个小时搞定。

import concurrent.futures
import string
import logging

import redis

from ydm import YDMHttp

redis_conn = redis.StrictRedis()
logger = logging.getLogger("label")


def login(username, password, appid, appkey):
    client = YDMHttp(username, password, appid, appkey)
    uid = client.login()
    logger.info('uid: %s' % uid)
    balance = client.balance()
    logger.info('balance: %s' % balance)
    return client


def worker(file_path, client):
    with open(file_path, 'r') as f:
        for line in f:
            path = line.strip()
            if redis_conn.sismember('path', path):
                continue
            else:
                redis_conn.sadd('path', path)
            try:
                cid, result = client.decode(codetype=2001, filename=path, timeout=20)

                logger.info('cid: %s, result: %s' % (cid, result))
            except Exception as e:
                logger.warning(e)
            else:
                if (result == '看不清' or len(result) != 1 or result in string.ascii_lowercase) and cid != -3003:
                    try:
                        yundama.report(cid)
                    except Exception as e:
                        print(e)
                else:
                    redis_conn.hset("result", path, result)


def batch(files):
    with concurrent.futures.ThreadPoolExecutor(max_workers=files) as executor:
        future_to_file = {executor.submit(worker, file_path): file_path for file_path in files}
        for future in concurrent.futures.as_completed(future_to_file):
            file_path = future_to_file[future]
            try:
                future.result()
            except Exception as exc:
                logger.warning(exc, exc_info=True)
            else:
                logger.info("complete task: %s" % file_path)
 

完成后对汉字数量做一个统计,我这边分离出的汉字有1158个,需要修改配置文件中的分类大小为1158,图片大小为60x60。

另外需要把标注的图片名加上汉字的unicode,文件名例如verifyCode1580446215_u4e3b.jpg,并且把所有unicode都放入labels.txt,这里给出转换代码。

'我'.encode('unicode-escape').decode().replace('\\', '')
'u6211'

训练

回到gsxt_captcha根目录,将训练集放在chinese_classify/data/train/,验证集放在chinese_classify/data/valid

接着执行,生成train.txt,valid.txt。

cd chinese_classify/data
sh gen_txt.sh

回到根目录,开始训练。

cd ../../
pwd  # gsxt_captcha
sh classify_train.sh

20000轮训练后,average loss 0.03,跑一轮测试。

sh classify_test.sh

top1成功率95%,top100%成功率99%,能用了。

git仓库已同步更新代码~

Updated: