non-standard title table extraction problem #645

BrianCKLu · 2022-04-22T09:50:11Z


Hi, thanks for providing such a useful library.
I would like to ask whether there is a way to convert tables with a non-standard title row (Figure 1 and Figure 2) into the form shown in Figure 3
(removing "spec.").

import os
import pdfplumber
import pandas as pd

def check_folder_exist(seg_path: str):
    """
    Create the folder for the output xlsx files if it does not exist yet.
    """
    os.makedirs(seg_path, exist_ok=True)

def Pdf_to_excel(pdf_filename, page_list, outputfolder, id: str):
    # Create the output folders once, before reading any page
    check_folder_exist(outputfolder)
    xlsx_savefold = os.path.join(outputfolder, "xlsx")
    check_folder_exist(xlsx_savefold)
    table_filename = os.path.join(xlsx_savefold, id + ".xlsx")

    with pdfplumber.open(pdf_filename) as pdf:
        for page_num in page_list:
            if page_num == 0:
                continue  # the first page (index 0) is skipped
            page = pdf.pages[page_num]
            tables = page.extract_tables()  # detect tables automatically; returns a list
            for i, table in enumerate(tables):
                # Turn missing cells (None) into "" and drop line breaks inside cells
                table = [[(cell or "").replace("\n", "") for cell in row] for row in table]
                table_df = pd.DataFrame(table[1:], columns=table[0])
                sheet_name = "p" + str(page_num + 1) + "_" + str(i)
                if not os.path.isfile(table_filename):
                    # First table: create the workbook
                    table_df.to_excel(table_filename, index=False, sheet_name=sheet_name)
                else:
                    # Later tables: add a sheet to the existing workbook
                    with pd.ExcelWriter(table_filename, mode="a", if_sheet_exists="replace", engine="openpyxl") as writer:
                        table_df.to_excel(writer, index=False, sheet_name=sheet_name)
    
outputfolder = "your output folder"
pdffile = "pdf file"
id = "U2501"
page_list = [0, 1]

Pdf_to_excel(pdffile, page_list, outputfolder, id)

Many thanks!

jsvine · 2022-04-22T14:37:29Z

Maintainer

Hi @BrianCKLu, and thanks for the kind words. If the tables you want to transform will always look like the examples you have provided, you could do something like the following, and then integrate it into your code:

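def fix_table_header(table_rows):
    # Treat None and whitespace-only header cells as blank
    header_normalized = [ (x or "").strip() for x in table_rows[0] ]
    header_has_blanks = any(x == "" for x in header_normalized)
    # Only merge when there is a second row to take the names from
    if header_has_blanks and len(table_rows) > 1:
        # Non-empty cells of the second row overwrite the header cell above them
        # (note: this modifies table_rows[0] in place)
        for i, alt in enumerate(table_rows[1]):
            alt = (alt or "").strip()
            if alt:
                table_rows[0][i] = alt
        # Drop the second row, which is now merged into the header
        table_rows = table_rows[:1] + table_rows[2:]
    return table_rows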

Demonstrating:

fix_table_header([
    [ "A", "B", "To discard", None, "D"],
    [ "", "", "C1", "C2", ""],
    [ 1, 2, 3, 4, 5 ],
])

... returns:

[
    ['A', 'B', 'C1', 'C2', 'D'],
    [1, 2, 3, 4, 5]
]
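
To integrate this into the extraction loop from the question, one option is to fix the header first and then apply the same cell cleanup as the original code (None becomes an empty string, line breaks are removed). The following is only a minimal sketch: `clean_cell` and `prepare_table` are hypothetical helper names, and the sample rows are made up for illustration, since the actual tables in Figures 1 and 2 are not reproduced here.

```python
def fix_table_header(table_rows):
    # Same logic as the function above
    header_normalized = [(x or "").strip() for x in table_rows[0]]
    header_has_blanks = any(x == "" for x in header_normalized)
    if header_has_blanks and len(table_rows) > 1:
        for i, alt in enumerate(table_rows[1]):
            alt = (alt or "").strip()
            if alt:
                table_rows[0][i] = alt
        table_rows = table_rows[:1] + table_rows[2:]
    return table_rows


def clean_cell(cell):
    # Hypothetical helper: None becomes "", line breaks inside a cell are dropped
    return (cell or "").replace("\n", "")


def prepare_table(table):
    # Hypothetical helper: copy the rows so the extracted table is left untouched,
    # merge the two-row header, then clean every cell
    fixed = fix_table_header([list(row) for row in table])
    return [[clean_cell(cell) for cell in row] for row in fixed]


# Made-up rows in the shape that page.extract_tables() returns for one table
raw = [
    ["Item", "Spec.", None, "Unit"],
    ["", "Min", "Max", ""],
    ["Voltage", "1.0", "1.2\n", "V"],
]

prepared = prepare_table(raw)
header, body = prepared[0], prepared[1:]
print(header)
print(body)
```

In the question's loop, `header` and `body` would then take the place of `table[0]` and `table[1:]` when building the DataFrame, for example `pd.DataFrame(body, columns=header)`.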
