written by cail • posted in How-To

Update 2010-2-9 23:56:56

While v0.11 works great on Mac and Linux, it does NOT work on Windows. So, here is pyPdfRename v0.12, which has been tested in all three environments. Enjoy!

Update 2010-2-9 15:43:13

pyPdfRename v0.11 is here.

I just used it to process my download folder, which had over 2000 files. In the end, this Python script successfully extracted DOI information from 1001 PDF files and renamed them, and it identified another 763 files as PDFs. Everything left over was other goodies I had downloaded from the Internet.

It does exactly what I want it to do. I am very proud of it!

Change Log, from v0.1 to v0.11

  • Better handling of articles from Nature Publishing Group
  • Better processing of PubMed information, so less manual correction is needed
  • Added a filter to identify non-PDF files based on file extension

It is my very first Python script: naive, buggy, ... but it just works
(Screenshots: pyrenamepdf 1.png and pyrenamepdf 2.png)

What does it do?

Renames every PDF file in the designated folder using the following pattern:

Last name of the first author, last name of the last author, year of publication, abbreviated journal name

You can modify the script to include other information in the file name.
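For example (author names made up), a paper whose first author is Smith and last author is Jones, published in 2009 in J. Cell Biol., would come out roughly as:

    Smith_Jones-2009_JCellBiol.pdf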

How does it work?

  1. Lists all the binary files in the designated folder and excludes those that are not PDF files (such as Word, Excel, and PowerPoint files)
  2. Uses pyPdf to extract the text from each PDF file and searches it for a DOI (some PDF files, such as scanned ones, those published before the 1990s, and password-protected/secured ones, cannot be read by pyPdf; the script does its best to skip them, but if it breaks, just move the offending file out of the directory and run the script again)
  3. Uses the extracted DOI and NCBI's eutils to fetch whatever information PubMed has about the paper (see the condensed sketch after this list)
  4. Writes a log entry for each PDF file to a text file (title, authors, journal name, publication date, etc.)
  5. Renames the PDF file and moves it to a dedicated folder
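For the curious, here is a condensed sketch of steps 2 and 3, assuming pyPdf is installed. The helper name doi_to_pubmed_xml and its catch-all DOI pattern are only for illustration; the release below uses several publisher-specific patterns and also passes a tool/email to eutils:

# condensed sketch: pull text from one PDF, look for a DOI-shaped string,
# and ask NCBI eutils whether PubMed knows about it
import re, urllib, pyPdf

def doi_to_pubmed_xml(pdf_path):
    reader = pyPdf.PdfFileReader(file(pdf_path, 'rb'))
    text = ' '.join(reader.getPage(i).extractText()
                    for i in range(reader.getNumPages()))
    match = re.search(r'10\.\d{4}/\S+', text)  # naive catch-all DOI pattern
    if not match:
        return None
    params = urllib.urlencode({'db': 'pubmed', 'term': match.group(0), 'retmax': 1})
    url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?' + params
    return urllib.urlopen(url).read()  # esearch XML containing the PubMed ID, if any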

Why did I write this script?

Most of my collected scientific papers are in PDF format and stored in my download folder, which holds hundreds, if not thousands, of files with strange names. I really want to take good care of them, but I don't want to do it manually.

The program Papers is very popular. When I tried to use it to import my giant PDF collection, it crashed again and again, even on a subset. It is just not for me: when I am onto a paper, I read it either in Preview or embedded in the browser, and the file stays in a folder under whatever name it had when I downloaded it. As the days pass, the files keep piling up, and I want to organize them.
I did some googling and found Simon's excellent code "Query PubMed for citation information using a DOI and Python". If I could automatically extract the DOI from each PDF file, then, in conjunction with Simon's script, I should be able to have a program do the organization for me!
Python has been on my learning list for a long time, so I decided to use this project as my very first Python learning material. With some googling, I had a draft version in a day. But when I tried to process a folder of 50 PDF files from very different journals, the problem appeared: a single pattern match cannot extract the DOI from files of different journals, because the way DOI information is presented varies a lot between publishers. It took me another 2-3 days to refine the matching ... and now here is release 0.1, fully tested on Ubuntu/Linux.

How to use it?

  1. If you are using Windows, please install Python. If you are using Linux/Mac, Python should already be installed.
  2. Download pyPdfRename and extract it.
  3. Open your favorite command-line terminal (on Windows: Win+R, then type cmd) and go to the pyPdfRename folder.
  4. Locate your_target_folder, which contains the PDF files you want to rename (the script will NOT descend into any subdirectory of the target folder), and type:
    python pyRenamePdf.py your_target_folder
  5. The screen should be scrolling now. If an error occurs, just move the offending file out of the way; this should rarely be necessary. The newest version of pyRenamePdf.py is awesome and, of course, smart!
  6. Enjoy!
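After a run, the renamed files end up in a renamed/ subfolder of your target folder, alongside a dated log file that records the title, authors, journal, abstract and PMID for each paper. The layout looks roughly like this (file names hypothetical):

    your_target_folder/
        log-09Feb2010.txt
        renamed/
            Smith_Jones-2009_JCellBiol.pdf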

The script is at a very early stage of development; I am just putting it here as-is. Feel free to use it, and I would love to hear your feedback/bug reports/feature requests.

Source code of pyRenamePdf.py version 0.1

#!/usr/bin/env python

# Script to rename all PDF files in the folder, based on DOI info
# (c) Liang Cai, 2010
# http://en.dogeno.us
# v0.1
import os, os.path, shutil, re, glob, unicodedata, pyPdf
from time import gmtime, strftime, sleep

# very inspired by pythonquery.py
# Simple script to query pubmed for a DOI
# (c) Simon Greenhill, 2007
# http://simon.net.nz/
import urllib
from xml.dom import minidom

# http://www.istihza.com/makale_pypdf.html
import warnings
warnings.simplefilter("ignore", DeprecationWarning)


# inspired by pyPdf/pdf.py line 607
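# reads one line of the stream backwards from the current seek position; used
# below to check that a candidate binary file ends with the PDF '%%EOF' trailer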
def readNextEndLine(stream):
    line = ''
    while True:
        x = stream.read(1)
        stream.seek(-2, 1)
        if x == '\n' or x == '\r':
            while x == '\n' or x == '\r':
                x = stream.read(1)
                stream.seek(-2, 1)
            stream.seek(1, 1)
            break
        else:
            line = x + line
    return line


# inspired by http://www.velocityreviews.com/forums/t320964-p2-determine-file-type-binary-or-text.html
def is_binary(buff):
    """Return true if the given filename is binary"""
    non_text = 0
    all_text = 0
    for i in range(len(buff)):
        a = ord(buff[i])
        all_text = all_text + 1
        if (a < 8) or (a > 13 and a < 32) or (a > 126):
            non_text = non_text + 1
        if all_text == 4096:
            break
    # print non_text, all_text # enable for debug
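    # treat the buffer as binary when more than ~0.09% of the sampled bytes are non-text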
    if non_text > all_text * 0.0009:
        return 1
    else:
        return 0


# inspired by http://code.activestate.com/recipes/511465/
def getPDFContent(path):
    content = ''
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, 'rb'))
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + ' '
    # Collapse whitespace
    content = ' '.join(content.replace(u'\xa0', ' ').strip().split())
    return content


def get_citation_from_doi(query, email='i@cail.cn', tool='pyRenamePdf.py', database='pubmed'):
    params = {
        'db':database,
        'tool':tool,
        'email':email,
        'term':query,
        'usehistory':'y',
        'retmax':1
    }
# try to resolve the PubMed ID of the DOI
    url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?' + urllib.urlencode(params)
    data = urllib.urlopen(url).read()

# parse XML output from PubMed...
    xmldoc = minidom.parseString(data)
    ids = xmldoc.getElementsByTagName('Id')

# nothing found, exit
    if len(ids) == 0 :
        print url
        raise Exception, 'DoiNotFound'

# get ID
    id = ids[0].childNodes[0].data

# remove unwanted parameters
    params.pop('term')
    params.pop('usehistory')
    params.pop('retmax')
# and add new ones...
    params['id'] = id

    params['retmode'] = 'xml'

# get citation info:
    url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?' + urllib.urlencode(params)
    data = urllib.urlopen(url).read()

    return data


def text_output(xml):
    """Makes a simple text output from the XML returned from efetch"""

    xmldoc = minidom.parseString(xml)

    title = xmldoc.getElementsByTagName('ArticleTitle')[0]
    title = title.childNodes[0].data[:-1]
    LOGFILE.writelines(title + '\n')

    authors = xmldoc.getElementsByTagName('AuthorList')[0]
    authors = authors.getElementsByTagName('Author')
    authorlist = []

    for author in authors:
        # inspired http://www.peterbe.com/plog/unicode-to-ascii
        LastName = author.getElementsByTagName('LastName')[0].childNodes[0].data.encode('ascii','ignore')
        Initials = author.getElementsByTagName('Initials')[0].childNodes[0].data.encode('ascii','ignore')
        authorLong = '%s %s' % (LastName, Initials)
        author = LastName
        authorlist.append(authorLong)
    LOGFILE.writelines(', '.join(authorlist) + '\n')

    journalinfo = xmldoc.getElementsByTagName('Journal')[0]
    if journalinfo.getElementsByTagName('ISOAbbreviation'):
        journal = journalinfo.getElementsByTagName('ISOAbbreviation')[0].childNodes[0].data
    else:
        journal = journalinfo.getElementsByTagName('Title')[0].childNodes[0].data

    journalinfo = journalinfo.getElementsByTagName('JournalIssue')[0]
    year = journalinfo.getElementsByTagName('Year')[0].childNodes[0].data
    month = journalinfo.getElementsByTagName('Month')[0].childNodes[0].data
    LOGFILE.writelines('%s %s, %s\n' % (year, month, journal))

    abstract = xmldoc.getElementsByTagName('AbstractText')[0]
    abstract = abstract.childNodes[0].data.encode('ascii','ignore')
    LOGFILE.writelines(abstract + '\n')

    pmid = xmldoc.getElementsByTagName('PMID')[0]
    pmid = pmid.childNodes[0].data[:-1]
    LOGFILE.writelines('PMID:' + pmid + '\n')

    print title
    print authorlist[0], ',', authorlist[-1]
    print journal, year
    print ''

    # output = author.replace(' ','') + '_' + journal.replace('.','').replace(' ','') + '-' + year + '_' + title[:140].replace(' ','-').replace('/','')
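    # note: 'author' leaked out of the for loop above, so it now holds the last author's last name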
    output = authors[0].getElementsByTagName('LastName')[0].childNodes[0].data.encode('ascii','ignore').replace(' ','') + '_' + author.replace(' ','') + '-' + year + '_' + journal.replace('.','').replace(' ','')

    return output


def replace_all(text, dic):
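    """Apply every key -> value substitution in dic to text."""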
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text


from sys import argv, exit

# the main routine below runs at module level; run the script directly, e.g.
#   python pyRenamePdf.py your_target_folder
if len(argv) == 1:
    print 'Usage: %s <folder>' % argv[0]
    print ' e.g. %s ./' % argv[0]
    exit()

if os.path.exists(os.path.join(argv[1],'renamed/')) == 0:
    os.mkdir(os.path.join(argv[1],'renamed/'))

today = strftime('%d%b%Y', gmtime())
LOGFILE = open(os.path.join(argv[1],'log-' + today + '.txt'), 'ab')
FileCounter = 0

for input in glob.glob(os.path.join(argv[1],'*')) :
    if os.path.isdir(input) :
        continue

    FileCounter = FileCounter + 1
    VersionDelete = 0
    testfile = open(input, 'rb')
    if is_binary(testfile.read()) :
        testfile.seek(-1, 2)
        line = ''
        while not line:
            line = readNextEndLine(testfile)
        if line[:5] != '%%EOF' :
            # print 'pyPdf EOF marker not found'
            testfile.close()
            continue
    else :
        # print 'not binary file'
        testfile.close()
        continue
    testfile.close()
    print input

    extractfull = getPDFContent(input).encode('ascii', 'xmlcharrefreplace')
    # print extractfull[:6999] # enable for debug

    extractDOI = re.search('(?<=doi)/?:?\s?[0-9\.]{7}/\S*[0-9]', extractfull.lower().replace('&#338;','-'))
    if not extractDOI :
        extractDOI = re.search('(?<=doi).?10.1073/pnas\.\d+', extractfull.lower().replace('pnas','/pnas')) # PNAS fix
    if not extractDOI :
        extractDOI = re.search('10\.1083/jcb\.\d{9}', extractfull.lower()) # JCB fix
    # print extractDOI # enable for debug
 
    if extractDOI :
        cleanDOI = extractDOI.group(0).replace(':','').replace(' ','')
        if re.search('^/', cleanDOI) :
            cleanDOI = cleanDOI[1:]

        if re.search('^10.1096', cleanDOI) : # FASEB J fix
            cleanDOI = cleanDOI[:20]

        if re.search('^10.1083', cleanDOI) : # JCB second fix
            cleanDOI = cleanDOI[:21]

        if len(cleanDOI) > 40 :
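            # the match looks too long (body text glued onto the DOI); trim it
            # where letters take over again after the numeric part of the suffix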
            cleanDOItemp = re.sub(r'\d\.\d', '000', cleanDOI)
            reps = {'.':'A', '-':'0'}
            cleanDOItemp = replace_all(cleanDOItemp[8:], reps)
            digitStart = 0
            for i in range(len(cleanDOItemp)) :
                if cleanDOItemp[i].isdigit() :
                    digitStart = 1
                if cleanDOItemp[i].isalpha() and digitStart :
                    break
            cleanDOI = cleanDOI[0:(8+i)]

    else :
        print '$$ Doi Fail Extract', input
        continue # break
    
    print cleanDOI
    LOGFILE.writelines('doi:' + cleanDOI + '\n')

    getDOI = 1
    while getDOI :
        getDOI = 0
        try:
            citation = get_citation_from_doi(cleanDOI)
        except:
            getDOI = 1
            cleanDOI = cleanDOI[0:-1] # most nature articles

    while not citation :
        sleep(10)
        print 'internet not connected? trying PubMed again'
        citation = get_citation_from_doi(cleanDOI)

    newFilename = os.path.join(argv[1], 'renamed/', '%s.pdf' % text_output(citation))
    if os.path.isfile(newFilename) :
        shutil.move(input, os.path.join(argv[1], 'renamed/', '%s_%s.pdf' % (text_output(citation), FileCounter)))
    else :
        shutil.move(input, '%s' % newFilename)

    LOGFILE.writelines('\n\n')
