OCR recognition problem #12

czcxwe · 2020-03-08T07:44:25Z

Problem description

The code I forked did not work as-is,
so I made some changes:

    def depoint(self, img):
        """Denoise a binarized image: whiten any pixel with more than 4 white neighbours."""
        pixdata = img.load()
        w, h = img.size
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                count = 0
                if pixdata[x, y - 1] > 245:  # above
                    count = count + 1
                if pixdata[x, y + 1] > 245:  # below
                    count = count + 1
                if pixdata[x - 1, y] > 245:  # left
                    count = count + 1
                if pixdata[x + 1, y] > 245:  # right
                    count = count + 1
                if pixdata[x - 1, y - 1] > 245:  # upper left
                    count = count + 1
                if pixdata[x - 1, y + 1] > 245:  # lower left
                    count = count + 1
                if pixdata[x + 1, y - 1] > 245:  # upper right
                    count = count + 1
                if pixdata[x + 1, y + 1] > 245:  # lower right
                    count = count + 1
                if count > 4:
                    pixdata[x, y] = 255
        return img

    def imge2string(self, image, threshold):
        """
        Convert an image to a string.
        Binarize at `threshold`, then denoise.
        """

        # Convert to grayscale
        image = image.convert('L')
        # Binarize
        image = image.point(lambda x: 255 if x > threshold else 0)
        # Denoise further
        image = self.depoint(image)
        # Recognize. This step still fails: tesserocr returns an empty result
        result = tesserocr.image_to_text(image)
        print(str(threshold) + " recognized captcha: " + str(result))
        return result

    def crack_code(self):
        '''
        Recognize the captcha automatically.
        '''
        image = Image.open('./data/crack_code.jpeg')
        # Grayscale conversion happens inside imge2string

        # Binarization threshold
        threshold = 127
        s1 = self.imge2string(image, threshold)
        s2 = self.imge2string(image, threshold + 20)
        s3 = self.imge2string(image, threshold - 20)
        # Submit a result only when at least two thresholds agree
        if s1 == s2 or s1 == s3:
            return self.send_code(str(s1))
        elif s2 == s3:
            return self.send_code(str(s2))

The problem occurs at result = tesserocr.image_to_text(image).
However I recognize or preprocess the image, tesserocr always returns an empty result.
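
One thing that may be worth trying, sketched below under several assumptions: very small images often come back empty from Tesseract, so the sketch upscales the image before binarizing and asks tesserocr to treat it as a single text line. The helper names (`normalize`, `ocr_small_captcha`, `vote`), the scale factor of 3, and the assumption that the captcha contains only ASCII letters and digits are all hypothetical and not taken from the repository. `vote` also ignores empty strings, so three empty results are never submitted as a match.

```python
import re
from collections import Counter


def normalize(raw):
    """Keep only ASCII letters and digits (assumed captcha alphabet)."""
    return re.sub(r'[^0-9A-Za-z]', '', raw or '')


def ocr_small_captcha(image, threshold=127, scale=3):
    """Upscale, binarize, and read a PIL image as a single text line."""
    # Imported lazily so the pure-Python helpers work without the OCR stack.
    from PIL import Image
    import tesserocr

    gray = image.convert('L')
    w, h = gray.size
    # Enlarge first: tiny glyphs are a common cause of empty OCR output.
    enlarged = gray.resize((w * scale, h * scale), Image.LANCZOS)
    binary = enlarged.point(lambda p: 255 if p > threshold else 0)
    text = tesserocr.image_to_text(
        binary, lang='eng', psm=tesserocr.PSM.SINGLE_LINE
    )
    return normalize(text)


def vote(results):
    """Return the non-empty result seen at least twice, else None."""
    counts = Counter(r for r in results if r)
    if not counts:
        return None
    best, n = counts.most_common(1)[0]
    return best if n >= 2 else None
```

In `crack_code`, this would amount to calling `ocr_small_captcha` at the three thresholds and sending the code only when `vote` returns something other than `None`. Whether this is enough for these captchas is untested here.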

czcxwe · 2020-03-08T07:45:21Z

The modified code consists of member functions of the CrackCode class in CrackVerifyCode.py.

dengwen168 · 2020-03-31T07:48:44Z

Hello, I ran into the following problem while using this. How can I fix it?

    File "C:\Users\john1\Desktop\PI\cnki\CNKI-download-master\CNKI-download-master\CrackVerifyCode.py", line 34, in get_image
        self.current_url = re.search(r'(.*?)#', current_url).group(1)
    AttributeError: 'NoneType' object has no attribute 'group'
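
The `AttributeError` means `re.search` returned `None`, which happens when `current_url` contains no `#`. A minimal sketch of a guarded version follows. The function name `strip_fragment` is hypothetical, and falling back to the whole URL is an assumption about what the caller wants. The guard only avoids the crash; it does not make the returned page the expected one.

```python
import re


def strip_fragment(current_url):
    """Return the part of the URL before the first '#'.

    Falls back to the unchanged URL when there is no '#',
    instead of calling .group() on None.
    """
    match = re.search(r'(.*?)#', current_url)
    if match is None:
        return current_url
    return match.group(1)
```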

czcxwe · 2020-03-31T07:59:38Z

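CNKI changed its verification flow and now performs a second IP check (the page returned is not the target page), so the document download code no longer works.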

dengwen168 · 2020-03-31T08:32:58Z

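Oh, I don't actually need to download documents. I only need to collect the keywords and abstract from the detail pages. That should still work, right?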

czcxwe · 2020-03-31T08:54:25Z

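That won't work either. You would have to rewrite the logic that detects the captcha link, and use a cloud provider's OCR service to recognize the small images (they are too small to OCR well).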

dengwen168 · 2020-03-31T09:27:09Z

OK, thanks. It looks like I'll have to dig into this myself.
