Want to extract information by separate two parts #571

youpengbo2018 · 2021-12-21T07:11:41Z

[China_ptent.pdf](https://github.com/jsvine/pdfplumber/files/7750238/China_ptent.pdf)

Hi, I would like to extract this file's data by splitting each page into two parts, based on the picture.
This is the code I have tried so far:
import pdfplumber as pr
import pandas as pd
filename = r'F:\GB2652.pdf'
pdf = pr.open(filename)
page = pdf.pages[0]
table = page.extract_tables(table_settings={
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "join_tolerance": 50
})
Thank you for your help.

youpengbo2018 · 2021-12-21T07:12:52Z

Author

China_ptent.pdf

0 replies

jsvine · 2021-12-24T03:30:53Z

Maintainer

Hi @youpengbo2018, and thanks for your interest in this library. Rather than treating the information as tables, you may have better luck with a more customized approach. Luckily, the PDFs appear to have an obvious vertical line (see page.lines), and then the ----------------- marks as horizontal lines, separating the portions of each page. I would try identifying those, and then using page.crop(...) to select each section that you calculate from that.

To get the horizontal dividers, try:

import re
horizontal_dividers = [ w for w in page.extract_words()
  if re.search(r"^-{10,}$", w["text"]) ]
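One way the dividers could then feed page.crop(...) is sketched below. This is only an illustration under stated assumptions, not the code from any notebook: the helper names (find_dividers, section_boxes, extract_sections) are made up for this example, the column separator is assumed to be the longest vertical entry in page.lines (falling back to the page midpoint if there is none), and a record that continues from one column or page into the next is not stitched back together.

```python
import re

DIVIDER_RE = re.compile(r"^-{10,}$")


def find_dividers(words):
    """Return the words that consist solely of ten or more hyphens."""
    return [w for w in words if DIVIDER_RE.search(w["text"])]


def section_boxes(dividers, x0, x1, top, bottom):
    """Split the column (x0..x1, top..bottom) at each divider inside it.

    Returns (x0, top, x1, bottom) tuples, the order page.crop() expects.
    """
    # Keep only dividers whose horizontal midpoint falls in this column.
    in_column = sorted(
        (d for d in dividers if x0 <= (d["x0"] + d["x1"]) / 2 < x1),
        key=lambda d: d["top"],
    )
    boxes = []
    current_top = top
    for d in in_column:
        if d["top"] > current_top:  # skip empty slices
            boxes.append((x0, current_top, x1, d["top"]))
        current_top = d["bottom"]
    if bottom > current_top:
        boxes.append((x0, current_top, x1, bottom))
    return boxes


def extract_sections(page):
    """Yield the text of each section of a two-column page, left column first."""
    px0, ptop, px1, pbottom = page.bbox
    dividers = find_dividers(page.extract_words())
    # Assumption: the longest vertical line on the page separates the columns.
    vertical = [ln for ln in page.lines if abs(ln["x0"] - ln["x1"]) < 1]
    if vertical:
        split_x = max(vertical, key=lambda ln: ln["bottom"] - ln["top"])["x0"]
    else:
        split_x = (px0 + px1) / 2
    for left, right in ((px0, split_x), (split_x, px1)):
        for box in section_boxes(dividers, left, right, ptop, pbottom):
            yield page.crop(box).extract_text()
```

With an open PDF, list(extract_sections(pdf.pages[0])) would give one string per section; the box arithmetic in section_boxes can be checked without a PDF by passing it plain dictionaries.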

3 replies

youpengbo2018 Dec 24, 2021
Author

Thank you for your help. Could you describe this in more detail? I have not used crop to extract information before. In addition, is it possible to keep using extract_tables to extract the information? I use the code below to extract the information and store it in a DataFrame. The only remaining problem is that extract_tables sometimes does not return the last row with the (--------------) mark, for example on page 11 of the China_ptent file.

china-patent.pdf

This is the code I have used to achieve the goal.

from decimal import Decimal
import pdfplumber as pr
import re
import pandas as pd
import time

def get_table(pagenum, pdf):
    page = pdf.pages[pagenum]
    table = page.extract_tables(table_settings={
        "vertical_strategy":"text",
        "horizontal_strategy":"text",
        "join_tolerance":50
    })
    return table


def get_page_information2():
    filename = r'F:\FMGKGB2652.pdf'
    pdf = pr.open(filename)
    lastline = ""
    for i in range(9,12):
    # for i in range(len(pdf.pages)):
        print(i)
        table2 = get_table(i, pdf)[0]
        list1 = []
        list2 = []
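        # Note: range(len(table2) - 1) leaves out the last row of the table.
        for j in range(len(table2)-1):
            list1.append(table2[j][0])
            try:
                list2.append(table2[j][1])
            except IndexError:
                # Row has no second column.
                pass

        # Join once after the loop so both names exist even for a one-row table.
        leftpage = ''.join(list1)
        rightpage = ''.join(list2)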
        lines = (lastline + leftpage + rightpage).split('-----------------------------------------')
        # print("==========================================lines",lines)
        print(rightpage[-100:])
        for k, line in enumerate(lines):
            # print("enum",k,line)
            line = line.replace("\r\n", "").replace("\n", "").replace('\r', "")
            if k < len(lines) - 1:
                yield line
            else:
                lastline = line

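def extract_value(pattern, value):
    # First match without the trailing "(NN)" code that ends the pattern, or None.
    matches = re.findall(pattern, value)
    return matches[0][:-4] if matches else None

def extract_value_last(pattern, value):
    # First match as-is (used for the final field, which has no following code), or None.
    matches = re.findall(pattern, value)
    return matches[0] if matches else None
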
def get_var_value(value):       # extract different column value by extract_value method
        valuelist=['v10','v51','v22','v21','v43','v71','v72','v74','v54','v57']
        list8=[]
        pattern10=r'\(10\).*?\(\d\d\)'
        pattern51=r'\(51\).*?\(\d\d\)'
        pattern21=r'\(21\).*?\(\d\d\)'
        pattern22=r'\(22\).*?\(\d\d\)'
        pattern71=r'\(71\).*?\(\d\d\)'
        pattern72=r'\(72\).*?\(\d\d\)'
        pattern54=r'\(54\).*?\(\d\d\)'
        pattern57=r'\(57\).*'
        pattern43 = r'\(43\).*?\(\d\d\)'
        pattern74 = r'\(74\).*?\(\d\d\)'
        v10=extract_value(pattern10,value)
        v51=extract_value(pattern51,value)
        v21 = extract_value(pattern21, value)
        v22 = extract_value(pattern22, value)
        v71 = extract_value(pattern71, value)
        v72 = extract_value(pattern72,value)
        v43 = extract_value(pattern43, value)
        v74 = extract_value(pattern74, value)
        v54 = extract_value(pattern54,value)
        v57 = extract_value_last(pattern57,value)
        list8.append(v51)
        list8.append(v10)
        list8.append(v21)
        list8.append(v22)
        list8.append(v43)
        list8.append(v71)
        list8.append(v72)
        list8.append(v74)
        list8.append(v54)
        list8.append(v57)
        return list8


if __name__ == '__main__':
    list7=[]
    now=time.localtime()
    i=0
    for item in get_page_information2():
        print("++++++++++++++++++++++++++++++++++++++record：",i, item)
        i+=1
        list7.append(get_var_value(item))

    df1=pd.DataFrame(list7)
    df1.to_csv('F:\\patenttest.csv')
    # print(df1)
    print(time.localtime())
    print(now)

jsvine Dec 24, 2021
Maintainer

Hi @youpengbo2018, here's a notebook that demonstrates what I was talking about — hopefully this clarifies things: https://notebooksharing.space/view/3cf2635360372719bc95152b027cdfcdb7938658d1575b90d6450399870bcfd1

Since the PDF doesn't really have proper tables, .extract_tables(...) will probably not get you the results you desire.
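Once the text of a single section is available (for example from page.crop(...).extract_text()), the numbered fields such as (21) or (57) could be pulled out with one pattern rather than one pattern per field. The sketch below is only an illustration: parse_fields and FIELD_RE are hypothetical names, and it assumes every field starts with a two-digit code in parentheses and runs until the next such code, so a value that itself contains something like "(12)" would be cut short.

```python
import re

# A two-digit code in parentheses, then everything up to the next code or the end.
FIELD_RE = re.compile(r"\((\d{2})\)(.*?)(?=\(\d{2}\)|\Z)", re.DOTALL)


def parse_fields(section_text):
    """Map each two-digit code to its value, with line breaks removed."""
    fields = {}
    for code, value in FIELD_RE.findall(section_text):
        cleaned = value.replace("\r", "").replace("\n", "").strip()
        # Keep the first occurrence if a code appears more than once.
        fields.setdefault(code, cleaned)
    return fields
```

A list of these dictionaries, one per section, can be passed straight to pd.DataFrame(...), with missing codes showing up as empty cells.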

youpengbo2018 Dec 27, 2021
Author

Hi, thank you for the help. It really helped me a lot.
