PDF Tables to CSV

A simple Python program that uses PDFPlumber to take Tables out of PDFs and put them into a CSV.

PDF to CSV Magic

The Problem:

I am currently working on a project that requires me to copy data from existing PDFs into a speadsheet so that I can use the data to make graphs.

The version of excel that we are using does not have the function that extracts data from PDFs. So, my co-workers have been copy and pasting the text. Unfortunatly, when the text is added to a worksheet, it is no longer formatted in columns. Seperating the data across columns is also not an option, because the large numbers use commas. It's basically a mess and it takes a lot of time, and no one is happy. I mean, we can keep doing it the conventional way... but that's just sooooooo boring. And I hate doing rote work.

What to do? What to do?

Well, the infomation is already in text. We don't need any fancy OCR or NLP applications to extract our data. There must be an easier way to get this done.

In steps PDF Plumber, a module for Python.

The Solution:

Using this python module, I was able to parse through the PDF documents and extract just the tables. It's good enough for what I need at this stage. (See code below)

Currently, in the python shell, the user is prompted to add the path to the file that needs to be converted. Then, the user can choose from which page they want the parsing to begin.

The program then creates a CSV file using the name of the original file and appends tables from the PDF to the CSV file.

I am not sure if I needed to import os, but the program is small, so it isn't hurting anyone.

Next Steps:

Now, the program is currently in it's infancy. I can use it, but what about my boss?? What about my co-worker? Do I have the time to teach them basic python? Do they have the time (or want) to learn basic python?

No.

The best solution I can think of for this is to use Tkinter to make a file selection menu that then executes the file, or files, that need to be changed. Adding bulk conversion will save everyone a lot of time.

Also, adding a GUI will help others to better navagate the program and cause less user error.

Now, I just have to figure out how to get Tkinter to cooperate with me.

Python Code I used

import pdfplumber

import os, csv, re

doc_path = input("Please input full path of PDF:\n")

x = input("Which Page?\n")

with pdfplumber.open(doc_path) as pdf:

name_for_file = re.search(r"/([\w\d \_]*).pdf", doc_path)

csv_file = name_for_file.group(1) +'_CSV.csv'

for i in pdf.pages:

table = i.extract_tables()

for j in table:

with open(csv_file, "a") as csvfile:

writer = csv.writer(csvfile, dialect='excel')

try:

writer.writerows(j)

except:

print("unicode error")

Revision of the Code

Newest version uses Tkinter for selection of files and execution of program.

from tkinter import *

from tkinter import ttk, filedialog

#import io

import pdfplumber

#import numpy as np

import os, csv, re

#import pandas as pd

class ReadFileApp:

def __init__(self, master):

self.label = ttk.Label(master, text = "Convert PDF to CSV")

self.label.grid(row = 0, column = 0, columnspan = 2)

ttk.Button(master, text = "Open File",

command = self.select_file).grid(row = 2, column = 1)

ttk.Button(master, text = "to CSV",

command = self.pdf_table_extract).grid(row = 2, column = 2)

def select_file(self):

filename = filedialog.askopenfilename(initialdir=".")

self.infile = open(filename, "r")

#self.infile = io.TextIOWrapper(self.infile, encoding='utf8', newline='')

print(self.infile.name)

def pdf_table_extract(self, start_page=0):

with pdfplumber.open(self.infile.name) as pdf:

name_for_file = re.search(r"([\w\d\/ \_]*).pdf", self.infile.name)

csv_file = name_for_file.group(1) +'_CSV.csv'

for i in pdf.pages:

table = i.extract_tables()

for j in table:

with open(csv_file, "a") as csvfile:

writer = csv.writer(csvfile, dialect='excel')

try:

writer.writerows(j)

except:

print("unicode error")

print("CSV Saved as", csv_file)

def main():

main = Tk()

app = ReadFileApp(main)

main.mainloop()

if __name__ == "__main__": main()

The thing I want to do next is to release it as a package so that users don't need to install python and/or integrate it into a webpage for online hosting so nothing needs to be installed.