Update 2010-2-9 23:56:56
While v0.11 works great on Mac and Linux, it does NOT work on Windows. So here is pyPdfRename v0.12, which has been tested in all three environments. Enjoy!
Update 2010-2-9 15:43:13
pyPdfRename v0.11 is here.
I just used it to process my download folder, which had over 2000 files. In the end, this Python script successfully extracted the DOI information of 1001 PDF files and renamed them. It identified another 763 PDF files. All the rest are other goodies I downloaded from the Internet.
It does exactly what I want it to do. I am very proud of it!
Change Log, from v0.1 to v0.11
- Better handling of articles from Nature Publishing Group
- Better processing of PubMed information - less manual correction needed
- Added a filter to identify non-PDF files based on file extension
It is my very first python script - naive, buggy, ... it just works


What does it do?
It renames all the PDF files in the designated folder using the following pattern:
Last name of the first author, last name of the last author, publication year, abbreviated journal name
You can change the script to build the file name from other information.
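The naming pattern can be sketched in a few lines. This is a Python 3 illustration rather than the script's own code (the script itself is Python 2); `build_filename` is a hypothetical helper, the author names are made up, and the separators are taken from the `output = ...` line of the source code further down.

```python
def build_filename(first_author, last_author, year, journal_abbrev):
    """Compose FirstAuthor_LastAuthor-Year_Journal.pdf (hypothetical helper)."""
    def ascii_only(name):
        # Drop non-ASCII characters and spaces, as the script does for last names
        return name.encode("ascii", "ignore").decode("ascii").replace(" ", "")

    # Journal abbreviations lose their dots and spaces
    journal = journal_abbrev.replace(".", "").replace(" ", "")
    return "%s_%s-%s_%s.pdf" % (
        ascii_only(first_author), ascii_only(last_author), year, journal)


# Made-up example values, for illustration only
print(build_filename("Smith", "de la Cruz", "2009", "J. Cell Biol."))
```

To put other information in the file name, such as the title, this is the one place that would need to change.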
How does it do it?
- List all the binary files in the designated folder, and exclude those that are not PDF files (such as Word, Excel and PowerPoint files)
- Use pyPdf to extract the text from each PDF file and search it for the DOI. (pyPdf cannot extract text from some PDF files, such as scanned ones, those published before the 1990s, and those that are password protected/secured; the script tries its best to skip these files. If the script still breaks, just move the offending file out of the directory and run the script again.)
- Use the extracted DOI and the NCBI eutils to obtain any information in PubMed related to the PDF file
- Generate a text file that logs this information for each PDF file (such as the title, the authors, the journal name, the publication date, etc.)
- Rename the PDF file and move it to a specific folder
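The PubMed lookup step can be sketched as follows. This is a minimal Python 3 illustration, not the script's code: the base URL and parameter names are the ones the script uses, while `esearch_url`, `first_pubmed_id`, the e-mail address, the DOI and the sample XML are all made up for the example. Nothing is fetched from the network here.

```python
from urllib.parse import urlencode
from xml.dom import minidom

# Base URL as used by the script
ESEARCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"


def esearch_url(doi, email="you@example.org", tool="pyRenamePdf.py"):
    """Build the esearch URL that looks up a DOI in PubMed."""
    params = {"db": "pubmed", "tool": tool, "email": email,
              "term": doi, "usehistory": "y", "retmax": 1}
    return ESEARCH + urlencode(params)


def first_pubmed_id(esearch_xml):
    """Return the first <Id> in an esearch response, or None if nothing matched."""
    ids = minidom.parseString(esearch_xml).getElementsByTagName("Id")
    return ids[0].childNodes[0].data if ids else None


# Made-up response for illustration; a real one comes from fetching esearch_url(...)
sample = ("<eSearchResult><Count>1</Count>"
          "<IdList><Id>12345678</Id></IdList></eSearchResult>")
print(esearch_url("10.1000/example.doi"))
print(first_pubmed_id(sample))
```

Once the PubMed ID is known, the script passes it to efetch with retmode set to xml and reads the title, authors, journal and year from the returned record.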
Why did I write this script?
Most of the scientific papers I have collected are in PDF format and stored in my download folder, which has hundreds if not thousands of files with strange names. I really want to take good care of them, but I don't want to do it manually.

The program Papers is very popular. When I tried to use it to import my giant PDF collection, it crashed again and again, even on a subset. It is just not for me: when I am reading a paper, I read it either with Preview or embedded in the browser, and the file is stored in a folder under the original name it had when I downloaded it. As the days pass, files pile up there and I want to organize them.
I did some googling and found Simon's excellent code "Query PubMed for citation information using a DOI and Python". If I could automatically extract the DOI from a PDF file, then in conjunction with Simon's script I should be able to have a program do the organizing for me automatically!
Python has been on my learning list for a long time. I decided to use this project as my very first Python learning material. With some googling, I had a draft version in a day. But when I tried to process a folder of 50 PDF files from very different journals, a problem appeared: a single pattern match cannot extract the DOI from files of different journals, because the presentation of DOI information varies a lot between publishers. It took me another 2-3 days to perfect the matching script ... and now, here is release 0.1, fully tested on Ubuntu/Linux.
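The idea of trying several patterns in turn can be sketched like this. It is a simplified Python 3 illustration: the two patterns are loosely modelled on the ones in the source code below but are not the script's exact expressions, and `find_doi` is a hypothetical name.

```python
import re

# Ordered from general to publisher-specific; simplified for illustration
DOI_PATTERNS = [
    # "doi" label, optional separators, then a DOI that ends in a digit
    re.compile(r"doi[:/\s]*(10\.\d{4,9}/\S*\d)"),
    # JCB-style DOI printed without any "doi" label
    re.compile(r"(10\.1083/jcb\.\d{9})"),
]


def find_doi(text):
    """Return the first DOI-like string found in text, or None."""
    lowered = text.lower()
    for pattern in DOI_PATTERNS:
        match = pattern.search(lowered)
        if match:
            return match.group(1)
    return None


# Made-up snippets of extracted PDF text
print(find_doi("Published online. doi: 10.1000/abc.2009.123 Received"))
print(find_doi("no identifier here"))
```

Keeping the patterns in an ordered list means a new publisher quirk only needs one more entry, with the most general pattern tried first.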
How to use it?
- If you are using Windows, please install Python. If you are using Linux/Mac, Python should already be installed.
- Download pyPdfRename and extract it.
- Use your favorite command line terminal (on Windows, press Win+R and type in cmd) and go to the pyPdfRename folder.
- Locate your_target_folder, which has the PDF files you want to rename (the Python script will NOT visit any directory inside the target folder), and type in:
python pyRenamePdf.py your_target_folder
- The screen should be scrolling now. If an error occurs, just move the related file out of the way. Errors should be minimal. The newest version of pyRenamePdf.py is awesome and, of course, smart!
- Enjoy!
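To see in advance which files the script is likely to pick up, the selection rule can be approximated as below. This is a Python 3 sketch under assumptions: it only mirrors two checks from the source code (subdirectories are skipped, and the last line of the file must start with the %%EOF marker), it leaves out the binary-content test, and `ends_with_pdf_eof` and `candidate_pdfs` are hypothetical names.

```python
import glob
import os


def ends_with_pdf_eof(path, tail_size=1024):
    """True if the last non-empty line of the file starts with %%EOF."""
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        size = handle.tell()
        # Only the end of the file matters, so read at most tail_size bytes
        handle.seek(max(0, size - tail_size))
        tail = handle.read().rstrip(b"\r\n")
    last_line = tail.splitlines()[-1] if tail else b""
    return last_line.startswith(b"%%EOF")


def candidate_pdfs(folder):
    """Top-level files in folder that end like a PDF; subfolders are not visited."""
    return sorted(path for path in glob.glob(os.path.join(folder, "*"))
                  if os.path.isfile(path) and ends_with_pdf_eof(path))
```

Calling `candidate_pdfs("your_target_folder")` returns the paths that pass both checks; files inside subfolders are never listed.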
The script is at a very early stage of development. I am just putting it here as it is. Feel free to use it; I would love to hear your feedback, bug reports and feature requests.
Source code of pyRenamePdf.py version 0.1
#!/usr/bin/env python
# Script to rename all PDF files in the folder, based on DOI info
# (c) Liang Cai, 2010
# http://en.dogeno.us
# v0.1
import os, os.path, shutil, re, glob, unicodedata, pyPdf
import time  # needed for time.sleep() in the retry loop
from time import gmtime, strftime
# very inspired by pythonquery.py
# Simple script to query pubmed for a DOI
# (c) Simon Greenhill, 2007
# http://simon.net.nz/
import urllib
from xml.dom import minidom
# http://www.istihza.com/makale_pypdf.html
import warnings
warnings.simplefilter("ignore", DeprecationWarning)
# inspired by pyPdf/pdf.py line 607
def readNextEndLine(stream):
    line = ''
    while True:
        x = stream.read(1)
        stream.seek(-2, 1)
        if x == '\n' or x == '\r':
            while x == '\n' or x == '\r':
                x = stream.read(1)
                stream.seek(-2, 1)
            stream.seek(1, 1)
            break
        else:
            line = x + line
    return line
# inspired by http://www.velocityreviews.com/forums/t320964-p2-determine-file-type-binary-or-text.html
def is_binary(buff):
    """Return true if the given buffer looks like binary data"""
    non_text = 0
    all_text = 0
    # only the first 4096 bytes are inspected
    for i in range(len(buff)):
        a = ord(buff[i])
        all_text = all_text + 1
        if (a < 8) or (a > 13 and a < 32) or (a > 126):
            non_text = non_text + 1
        if all_text == 4096:
            break
    # print non_text, all_text # enable for debug
    if non_text > all_text * 0.0009:
        return 1
    else:
        return 0
# inspired by http://code.activestate.com/recipes/511465/
def getPDFContent(path):
    content = ''
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, 'rb'))
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + ' '
    # Collapse whitespace
    content = ' '.join(content.replace(u'\xa0', ' ').strip().split())
    return content
def get_citation_from_doi(query, email='i@cail.cn', tool='pyRenamePdf.py', database='pubmed'):
    params = {
        'db':database,
        'tool':tool,
        'email':email,
        'term':query,
        'usehistory':'y',
        'retmax':1
    }
    # try to resolve the PubMed ID of the DOI
    url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?' + urllib.urlencode(params)
    data = urllib.urlopen(url).read()
    # parse XML output from PubMed...
    xmldoc = minidom.parseString(data)
    ids = xmldoc.getElementsByTagName('Id')
    # nothing found, exit
    if len(ids) == 0 :
        print url
        raise Exception, 'DoiNotFound'
    # get ID
    id = ids[0].childNodes[0].data
    # remove unwanted parameters
    params.pop('term')
    params.pop('usehistory')
    params.pop('retmax')
    # and add new ones...
    params['id'] = id
    params['retmode'] = 'xml'
    # get citation info:
    url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?' + urllib.urlencode(params)
    data = urllib.urlopen(url).read()
    return data
def text_output(xml):
    """Makes a simple text output from the XML returned from efetch"""
    xmldoc = minidom.parseString(xml)
    title = xmldoc.getElementsByTagName('ArticleTitle')[0]
    title = title.childNodes[0].data[:-1]
    LOGFILE.writelines(title + '\n')
    authors = xmldoc.getElementsByTagName('AuthorList')[0]
    authors = authors.getElementsByTagName('Author')
    authorlist = []
    for author in authors:
        # inspired http://www.peterbe.com/plog/unicode-to-ascii
        LastName = author.getElementsByTagName('LastName')[0].childNodes[0].data.encode('ascii','ignore')
        Initials = author.getElementsByTagName('Initials')[0].childNodes[0].data.encode('ascii','ignore')
        authorLong = '%s %s' % (LastName, Initials)
        author = LastName
        authorlist.append(authorLong)
    LOGFILE.writelines(', '.join(authorlist) + '\n')
    journalinfo = xmldoc.getElementsByTagName('Journal')[0]
    if journalinfo.getElementsByTagName('ISOAbbreviation'):
        journal = journalinfo.getElementsByTagName('ISOAbbreviation')[0].childNodes[0].data
    else:
        journal = journalinfo.getElementsByTagName('Title')[0].childNodes[0].data
    journalinfo = journalinfo.getElementsByTagName('JournalIssue')[0]
    year = journalinfo.getElementsByTagName('Year')[0].childNodes[0].data
    month = journalinfo.getElementsByTagName('Month')[0].childNodes[0].data
    LOGFILE.writelines('%s %s, %s\n' % (year, month, journal))
    abstract = xmldoc.getElementsByTagName('AbstractText')[0]
    abstract = abstract.childNodes[0].data.encode('ascii','ignore')
    LOGFILE.writelines(abstract + '\n')
    pmid = xmldoc.getElementsByTagName('PMID')[0]
    pmid = pmid.childNodes[0].data[:-1]
    LOGFILE.writelines('PMID:' + pmid + '\n')
    print title
    print authorlist[0], ',', authorlist[-1]
    print journal, year
    print ''
    # output = author.replace(' ','') + '_' + journal.replace('.','').replace(' ','') + '-' + year + '_' + title[:140].replace(' ','-').replace('/','')
    output = authors[0].getElementsByTagName('LastName')[0].childNodes[0].data.encode('ascii','ignore').replace(' ','') + '_' + author.replace(' ','') + '-' + year + '_' + journal.replace('.','').replace(' ','')
    return output
def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text
if __name__ == '__main__':
    from sys import argv, exit
    if len(argv) == 1:
        print 'Usage: %s <folder>' % argv[0]
        print ' e.g. %s ./' % argv[0]
        exit()
    if os.path.exists(os.path.join(argv[1],'renamed/')) == 0:
        os.mkdir(os.path.join(argv[1],'renamed/'))
    today = strftime('%d%b%Y', gmtime())
    LOGFILE = open(os.path.join(argv[1],'log-' + today + '.txt'), 'ab')
    FileCounter = 0
    for input in glob.glob(os.path.join(argv[1],'*')) :
        if os.path.isdir(input) :
            continue
        FileCounter = FileCounter + 1
        VersionDelete = 0
        testfile = open(input, 'rb')
        if is_binary(testfile.read()) :
            testfile.seek(-1, 2)
            line = ''
            while not line:
                line = readNextEndLine(testfile)
            if line[:5] != '%%EOF' :
                # print 'pyPdf EOF marker not found'
                continue
        else :
            # print 'not binary file'
            continue
        testfile.close()
        print input
        extractfull = getPDFContent(input).encode('ascii', 'xmlcharrefreplace')
        # print extractfull[:6999] # enable for debug
        extractDOI = re.search('(?<=doi)/?:?\s?[0-9\.]{7}/\S*[0-9]', extractfull.lower().replace('Œ','-'))
        if not extractDOI :
            extractDOI = re.search('(?<=doi).?10.1073/pnas\.\d+', extractfull.lower().replace('pnas','/pnas')) # PNAS fix
        if not extractDOI :
            extractDOI = re.search('10\.1083/jcb\.\d{9}', extractfull.lower()) # JCB fix
        # print extractDOI # enable for debug
        if extractDOI :
            cleanDOI = extractDOI.group(0).replace(':','').replace(' ','')
            if re.search('^/', cleanDOI) :
                cleanDOI = cleanDOI[1:]
            if re.search('^10.1096', cleanDOI) : # FASEB J fix
                cleanDOI = cleanDOI[:20]
            if re.search('^10.1083', cleanDOI) : # JCB second fix
                cleanDOI = cleanDOI[:21]
            if len(cleanDOI) > 40 :
                cleanDOItemp = re.sub(r'\d\.\d', '000', cleanDOI)
                reps = {'.':'A', '-':'0'}
                cleanDOItemp = replace_all(cleanDOItemp[8:], reps)
                digitStart = 0
                for i in range(len(cleanDOItemp)) :
                    if cleanDOItemp[i].isdigit() :
                        digitStart = 1
                    if cleanDOItemp[i].isalpha() and digitStart :
                        break
                cleanDOI = cleanDOI[0:(8+i)]
        else :
            print '$$ Doi Fail Extract', input
            continue # break
        print cleanDOI
        LOGFILE.writelines('dox:' + cleanDOI + '\n')
        getDOI = 1
        while getDOI :
            getDOI = 0
            try:
                citation = get_citation_from_doi(cleanDOI)
            except:
                getDOI = 1
                cleanDOI = cleanDOI[0:-1] # most nature articles
        while not citation :
            time.sleep(10)
            print 'internet not connected? try hard to access the pubmed'
            citation = get_citation_from_doi(cleanDOI)
        newFilename = os.path.join(argv[1], 'renamed/', '%s.pdf' % text_output(citation))
        if os.path.isfile(newFilename) :
            shutil.move(input, os.path.join(argv[1], 'renamed/', '%s_%s.pdf' % (text_output(citation), FileCounter)))
        else :
            shutil.move(input, '%s' % newFilename)
        LOGFILE.writelines('\n\n')
- Author: cail
- License: Attribution-NonCommercial-NoDerivs, CC BY-NC-ND 3.0
- Original URL: http://en.dogeno.us/?p=6486
- Last modified: July 26, 2010, 23:55 PDT
8 Responses to “My very first python script: organizing scientific papers with pyRenamePdf.py”
I am improving the script on a folder that has over 2000 files ... stay tuned, v0.11 will be out soon
Cool! I always wondered whether there was such a tool to organize PDF papers. Thanks!
The software looks cool. But what's the need to keep a copy of the paper? My sense is that as long as you work in a reasonable research institute, you can always access papers as needed.
For Windows, follow this part (http://docs.python.org/using/windows.html#excursus-setting-environment-variables) after the installation to make Python work in the command window
Two reasons:
1. Even if you are affiliated with an institute, when you are traveling and want to access a paper, you need not only an internet connection but also a VPN to obtain a paper you have already read
2. Keep a copy on the hard drive, index it, and you can use a desktop search engine, e.g. Spotlight, to get it whenever you want - the log file created during the renaming process has the abstract of the paper
If this Python script cannot efetch the PubMed abstract, please let me know. Thanks
The script works perfectly! I had been searching for something like this for some time, thanks cail.cn.
I would just like to change the output to "PMID.pdf"; what should I change? I am sorry if the question is too foolish, I am absolutely ignorant about these things
thanks!
F
Hi F, I don't understand which output you mean. The log file?