Extracting tabular data from a PDF: An example using Python and regular expressions

It is not uncommon for us to need to extract text from a PDF. For small PDFs with minimal data or text it's fairly straightforward to extract the data manually by using 'save as' or simply copying and pasting the data you need. For a recent project, however, we were asked to extract detailed address information from a directory (the National Directory of Drug and Alcohol Abuse Treatment Programs) with more than 700 pages, definitely not a job to be done manually.

The addresses in the PDF were arranged in three columns. Fortunately, the formatting was reasonably consistent throughout the document – phone numbers tended to be in the same format, address elements tended to be in the same order – this definitely makes the job easier. Here is an example of what the data looks like:

[Image: sample of the directory's three-column address listing]

Extracting to raw text is not ideal

In the interest of extracting the data programmatically, we started with a brief investigation of the various options. We found several good options for converting PDFs to raw text, including the R package tm, the Python package PyPDF2, pdftotext (part of Xpdf), and pdf2ascii. Extracting the data with these tools produced something that looked like this:

'Changing Echoes\n7632 Pool Station Road\nAngels Camp, CA 95222\n(209) 785-3667\nIntake: (800) 633-7066\nSA s TX DT BU s RS RL s SF PI s AH\n\nANTIOCH\nAddiction Research and Treatment Inc\n(ART)\n3707 Sunset Lane\nAntioch, CA 94509\n(925) 522-0124\nMethadone and Buprenorphine Clients Only\nSA s TX DT MM DM BU s OP s PW s SF \nMD PI s SS s SP F4 F28 F30 F81\n'
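For reference, a minimal raw-text extraction along these lines, using the PyPDF2 API of that era, might look like this (the file name is a placeholder):

from PyPDF2 import PdfFileReader

# read every page and concatenate the raw text (no formatting survives)
with open('national_directory.pdf', 'rb') as f:
    reader = PdfFileReader(f)
    rawtext = ''
    for i in range(reader.getNumPages()):
        rawtext += reader.getPage(i).extractText()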

We quickly found, though, that raw text was not going to give us enough detail or 'signposts' to work with. Keeping the formatting detail that you can see in the PDF would be valuable for extracting the data.

Extracting to HTML provided the detail we needed

In terms of keeping the useful detail, we found that converting the data to HTML was a good option. Another option would be XML, but we found that the XML output produced too much text. I should note that there is also a newer program called Tabula that is geared toward extracting data from PDFs. In our admittedly very quick experiments with Tabula, it appeared that the text areas needed to be selected manually, so this didn't seem to be an option for us. There is a command line version of Tabula, and it's possible that it is a better option than it seemed; we look forward to learning more about it.

We used the Python module pdfminer

In the end we chose to use Python, converting the PDF to HTML using pdfminer and then using regular expressions to pull out the pieces we needed. We had a couple of false starts with pdfminer, though. We found several snippets of code that seemed to fit our project perfectly, but each time we ran them we got errors such as NameError: name 'process_pdf' is not defined. Ultimately we found that these errors were due to the fact that the pdfminer API underwent significant changes in November 2013 that rendered older code unworkable. See, for example, the discussion here (be sure to scroll down to where it says 'API Changes'). Fortunately, there are a few good snippets of code using the new API. Thanks in particular to code from David Burke, we were on our way to reading the PDF. The function below enabled us to read the PDF into one giant string.

(By the way, the full code for this script is on GitHub but keep in mind that this post and the code should be used as a guide only. Naturally the regular expressions you use would depend on your PDF formatting).

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import HTMLConverter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
import re
import csv


def convert_pdf_to_html(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = HTMLConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means process all pages
    caching = True
    pagenos = set()  # empty set means process all page numbers
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    html = retstr.getvalue()  # avoid shadowing the built-in str
    retstr.close()
    return html
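The later steps assume the result is stored in one big string; a call like this (the file name is a placeholder) produces it:

biginputstring = convert_pdf_to_html('national_directory.pdf')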

Running this bit of code converts the entire PDF into one huge string of HTML that looks like this:

'<span style="font-family: XZVLBD+GaramondPremrPro-LtDisp; font-size:12px">Changing Echoes\n<br>7632 Pool Station Road\n<br>Angels Camp, CA 95222\n<br>(209) 785-3667\n<br>Intake: (800) 633-7066\n<br>SA </span><span style="font-family: GDBVNW+Wingdings-Regular; font-size:11px">s</span><span style="font-family: UQGGBU+GaramondPremrPro-LtDisp; font-size:12px"> TX DT BU </span><span style="font-family: GDBVNW+Wingdings-Regular; font-size:11px">s</span><span style="font-family: UQGGBU+GaramondPremrPro-LtDisp; font-size:12px"> RS RL OP PH </span><span style="font-family: GDBVNW+Wingdings-Regular; font-size:11px">s</span><span style="font-family: UQGGBU+GaramondPremrPro-LtDisp; font-size:12px"> CO CJ \n<br></span><span style="font-family: GDBVNW+Wingdings-Regular; font-size:11px">s</span><span style="font-family: UQGGBU+GaramondPremrPro-LtDisp; font-size:12px"> SF PI </span><span style="font-family: GDBVNW+Wingdings-Regular; font-size:11px">s</span><span style="font-family: UQGGBU+GaramondPremrPro-LtDisp; font-size:12px"> AH SP\n<br></span></div>'

A mess! But the HTML tags help us to distinguish the different addresses and identify specific pieces.

Now for the regular expressions

A few challenges to contend with

We are fortunate that the authors of the PDF were relatively consistent in their formatting, which allowed us to use regular expressions. Nevertheless, there were several challenges in pulling the data out of this PDF, including:

  1. Differing amounts of address or program information: some addresses have a suite number; some have three lines under the program name, some have one; most street addresses start with a number followed by letters, but not all; and so on. You can see this in the example image above.
  2. There is quite a bit of variation in the little codes at the end of each address. They can be a two-letter code or a letter followed by a number; they can be two letters followed by a dash; and there is a ton of variation in how many of them appear.
  3. Most programs have a main phone number with consistent formatting, but some have one or two additional phone numbers we need to pull out.

Explanation of a couple of key regular expression concepts used below

To help you follow some of the regular expressions below, I'm outlining a couple of key concepts here (a small worked example follows the list). Note that we relied heavily on the great software RegexBuddy to help us assemble the expressions.

  • To pull out text between bookends use (?<=BOOKEND1) TEXT YOU WANT (?=BOOKEND2) where BOOKEND1 identifies the beginning bookend (and will not be included) and BOOKEND2 identifies the ending bookend (and will also not be included). So, for example, using (?<=A).*(?=E) on the text ABCDE will pull out BCD. The bookends are called 'positive lookbehind' and 'positive lookahead'.

  • Categories of characters you'll see below include:

    • [A-Z] any capital letter
    • \d a digit
    • \w a word character (letters, digits and underscores)
    • \s whitespace (spaces, tabs and newlines)
  • Curly brackets denote the number of repetitions you want. So, for example, if you had the text ABCDE and you wanted to pull out ABC you could use [A-Z]{3}. (Also, {1,} means one or more, and {0,3} means anywhere from none up to three.)

  • As in many other languages, the pipe (|) character is an 'or'. So A|B matches either A or B.
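Putting a few of these concepts together, here is a small worked example in plain Python (the strings are toy examples):

import re

# positive lookbehind/lookahead: pull out the text between the bookends
re.search('(?<=A).*(?=E)', 'ABCDE').group(0)   # 'BCD'

# a character category with a repetition count: exactly three capital letters
re.search('[A-Z]{3}', 'ABCDE').group(0)        # 'ABC'

# alternation with the pipe: match either side
re.findall('cat|dog', 'cat dog bird')          # ['cat', 'dog']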

RegEx Step 1. Break the data into address chunks

We decided that the first step would be to try to split the text into address chunks, where all the address information for one program would be in one chunk, separated from the others. From a testing and sanity perspective this seemed to be the best option. Fortunately, in this instance we had formatting to help us. You can see in the example above that the program name, in this case 'Changing Echoes', is preceded by a span tag with particular styling and ends with br/span/div tags, so we used these in a regular expression to split the data. We used Python's finditer, which returns an iterator allowing us to cycle through the results. Here is what the code looks like:

pattern = '(?<=<span style="font-family: XZVLBD\+GaramondPremrPro-LtDisp; font-size:12px">)(.*?)(?=<br></span></div>)'

for programinfo in re.finditer(pattern, biginputstring, re.DOTALL):
    # ... process each address chunk here ...

In a regular expression, the . character normally matches any character except a newline. The re.DOTALL flag tells Python to let . match any character, including newlines (\n).
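A quick illustration of the difference:

import re

re.search('A.*E', 'ABC\nDE')             # no match: '.' stops at the newline
re.search('A.*E', 'ABC\nDE', re.DOTALL)  # matches 'ABC\nDE'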

RegEx Step 2. Pull out the easy bits

Four parts of the address were very easy to pull out.

  • Identifying the program name was straightforward because it was the only bit of text not to begin with a <br> tag and it ended with a linefeed character (\n).
  • The city, state, zip always followed the same pattern: text, a comma, two capital letters, then five digits (or five digits, a hyphen, and four more digits).
  • The main phone number was also relatively easy, following the same pattern each time.
  • The alternate phone numbers, if they existed, were also easy to identify. We don't care too much about these so we joined them into one string separated by a semi-colon.

Here is what this code looked like:

programname = re.search('^(?!<br>).*(?=\\n)', programinfo.group(0))
# since some programs have odd characters in the name we need to escape;
# also convert each match object to a plain string right away
programname = re.escape(programname.group(0)) if programname else ''

citystatezip = re.search('(?<=>)([a-zA-Z\s]+, [A-Z]{2} \d{5}(?:-\d{4})?)(?=\\n)', programinfo.group(0))
citystatezip = citystatezip.group(0) if citystatezip else ''

mainphone = re.search('(?<=<br>)\(\d{3}\) \d{3}-\d{4}x{0,1}\d{0,}(?=\\n)', programinfo.group(0))
mainphone = mainphone.group(0) if mainphone else ''

# we don't care much about the alternate numbers, so join them with semi-colons
altphones = re.findall('(?<=<br>)[a-zA-Z\s]+: \(\d{3}\) \d{3}-\d{4}x{0,1}\d{0,}(?=\\n)', programinfo.group(0))
altphones = '; '.join(altphones)

Note that re.search returns a match object with numbered group references: group(0) is the entire matched text, and group(1), group(2), and so on refer to the parenthesized sub-groups. So programinfo.group(0) pulls out the whole address chunk matched in Step 1. (re.findall, by contrast, simply returns a list of matching strings.)
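For example, on one of the phone numbers above:

m = re.search('\((\d{3})\) (\d{3})-(\d{4})', '(209) 785-3667')
m.group(0)   # '(209) 785-3667' -- the entire match
m.group(1)   # '209'            -- the first parenthesized group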

RegEx Step 3. Pull out the semi-easy bit: the street address

The street address almost always followed the traditional pattern of a number followed by a street name followed by a street type (e.g., Road, Way, Boulevard). But not always! There are addresses with no numbers, for example 'Adams Avenue'. This left us with a rough approach that seemed to work but might miss some unusual situations. What we ultimately did was write a multi-piece regular expression with three parts:

  1. Identify a line that begins with numbers, has a space and then is followed by letters OR…
  2. Identify a line that starts with words rather than numbers and ends with Avenue or Road or Boulevard and so on OR…
  3. Identify post office boxes

Here is what this piece looked like:

addressBegNum = '((?<=>)(\s{0,})\d{1,}[\w\s-]+(?=\\n))'
addressAltNum = '(?<=>)([\w\s]+)(Avenue|Street|Way|Road|Boulevard|Highway|Cutoff)([\w\s]{0,})(?=\\n)'
addressPOBox = '(?<=>)([\w\s]{0,})P\.O\. Box.*?(?=\\n)'
address = re.search(addressBegNum + '|' + addressAltNum + '|' + addressPOBox, programinfo.group(0))
address = re.escape(address.group(0)) if address else ''
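As a quick, hypothetical check, here are the three alternatives run against lines shaped like the ones in this PDF (the > and newline stand in for the surrounding HTML; the strings are made up):

samples = ['>7632 Pool Station Road\n',   # starts with a number
           '>Adams Avenue\n',             # no number, ends with a street type
           '>P.O. Box 613\n']             # post office box
for s in samples:
    m = re.search(addressBegNum + '|' + addressAltNum + '|' + addressPOBox, s)
    print(m.group(0) if m else None)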

RegEx Step 4. Pull out the ending codes

This turned out to be fairly tricky given that there was so much diversity in the codes at the end of the address lines. We had trouble coming up with a single regular expression that fit all the different code situations. Instead, the approach we took was to identify the first code and then, since the codes always come at the end of the address, select everything through the very last character. With that information in hand we could easily delete all the HTML pieces (those that occur between <>). We could also delete the little 's' character that the Wingdings font renders as a diamond (the diamond in the image above) and do a little additional formatting.

I'll be the first to admit that this is NOT the simplest regular expression that could have been written to do this job!

# find text that can start with a space or not, then has a capital letter followed by
# either a single capital letter or 1-2 digits. This can be followed by a newline character
# a dash or white space followed by.... you get the picture.
codes = re.search('(?<=>)(\s{0,1})([A-Z]{1})([A-Z]{1}|\d{1,2})(\\n|([-\s]{1,})(\\n|<|([A-Z]{1})([A-Z]{1}|\d{1,2}))).*', programinfo.group(0), re.DOTALL)

codes = re.sub('</span>', ',', codes.group(0))
codes = re.sub('<.*?>', '', codes)   # drop everything between <>
codes = re.sub(' , ', ', ', codes)   # replace ' , ' with just ', '
codes = re.sub('s', '', codes)       # remove the Wingdings 's' (the diamond)
codes = re.sub(',,', ',', codes)     # collapse doubled commas
codes = re.sub('\n', '', codes)      # drop newlines
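To make the cleanup chain concrete, here is a hypothetical before/after on a fragment shaped like the HTML above (style attributes shortened to '...'):

fragment = 'SA </span><span style="...">s</span><span style="..."> TX DT BU </span>'

step = re.sub('</span>', ',', fragment)  # 'SA ,<span style="...">s,<span style="..."> TX DT BU ,'
step = re.sub('<.*?>', '', step)         # 'SA ,s, TX DT BU ,'
step = re.sub(' , ', ', ', step)         # (no change on this fragment)
step = re.sub('s', '', step)             # 'SA ,, TX DT BU ,'
step = re.sub(',,', ',', step)           # 'SA , TX DT BU ,'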

RegEx Step 5. Use pieces from #2 above to help us with the other bits

The remaining address pieces are also tricky. As many as two additional lines of address-related information could occur in an address: more detail about the program name, suite numbers, building details and so on. The only thing consistent about these was that they occurred after the program name and before the city/state/zip (and either before or after the street information, if it exists). So we took advantage of the order of things to roughly pull out the final bits. Essentially we wrote:

  1. If there is a program name and a street address grab the line between them.
  2. If there is an address and a city/state/zip grab the line between them.
  3. If there is no address but there is a city/state/zip take all of the text between the program name and the city/state/zip. In this example, there is the added complication that there might be more than one line.

Here is what the code looked like:

# If we have a program name and an address, call the line between
# them (if it exists) 'altinfo1'
if (programname != '') and (address != ''):
    altinfo1 = re.search('(?<=' + programname + ').*(?=' + address + '\\n)', programinfo.group(0), re.DOTALL)

    if altinfo1:
        altinfo1 = re.sub('<.*>|\\n', '', altinfo1.group(0))
    else:
        altinfo1 = ''

# If we have an address and a citystatezip, call the line between
# them (if it exists) 'altinfo2'
if (address != '') and (citystatezip != ''):
    altinfo2 = re.search('(?<=' + address + ').*(?=' + citystatezip + '\\n)', programinfo.group(0), re.DOTALL)

    if altinfo2:
        altinfo2 = re.sub('<.*>|\\n', '', altinfo2.group(0))
    else:
        altinfo2 = ''

# If we DO NOT have an address but we DO have a citystatezip then pull
# out altinfo1 and altinfo2 if they exist.
if (address == '') and (citystatezip != ''):
    altinfoTmp = re.search('(?<=' + programname + ').*(?=' + citystatezip + '\\n)', programinfo.group(0), re.DOTALL)
    altinfo1 = ''
    altinfo2 = ''
    if altinfoTmp:
        altinfoTmp = re.sub('<.*>', '', altinfoTmp.group(0))
        altinfoTmp = altinfoTmp.strip().splitlines()
        n = len(altinfoTmp)
        altinfo1 = altinfoTmp[0]
        if n == 2:
            altinfo2 = altinfoTmp[1]
        if n > 2:
            # join everything after the first line with semi-colons
            altinfo2 = '; '.join(altinfoTmp[1:])

Write to CSV

We assembled our list of extracted pieces and wrote it as a row in a CSV file. If we were going to keep working with this data in Python, writing to a dict might have been a better option, but the intention here was to share with a non-Python user.
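A minimal sketch of that step, in the same Python 2 style as the rest of the script (the field names and output file name are our own):

import csv

# open the output file once, before looping over the address chunks
outfile = open('addresses.csv', 'wb')  # Python 2: csv wants binary mode
writer = csv.writer(outfile)
writer.writerow(['program', 'address', 'altinfo1', 'altinfo2',
                 'city_state_zip', 'main_phone', 'alt_phones', 'codes'])

# ...then, at the bottom of the loop over programinfo matches:
writer.writerow([programname, address, altinfo1, altinfo2,
                 citystatezip, mainphone, altphones, codes])

outfile.close()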

Final notes

Those with experience writing regular expressions will notice right away that the address extraction could have been done with simpler, more elegant regular expressions. For the purpose at hand, though, the code above worked well. In terms of timing, reading the PDF itself takes about 20 minutes, but pulling out the pieces for more than 10,000 addresses is quick: even though the PDF has more than 15 million characters, it takes less than one minute to extract all the address information and write it to a CSV.
