Automating DOI extraction with Python

A few months ago, I was working on the DOI Scraper. This Python script reads a .bib file, finds the articles that lack a DOI (Digital Object Identifier), and fetches the missing DOIs from the Crossref API. It then updates the .bib file with the new data.
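
The Crossref lookup at the heart of that workflow can be sketched as follows. This is not the code from doi_scraper.py: the function names (build_query_url, first_doi, lookup_doi) are hypothetical, and the sketch uses the standard library's urllib instead of requests so that it runs without extra dependencies. It queries the public Crossref REST endpoint https://api.crossref.org/works with the query.bibliographic parameter and takes the DOI of the top-ranked result.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_URL = "https://api.crossref.org/works"


def build_query_url(title, author=None):
    """Build a Crossref search URL for a bibliographic query (hypothetical helper)."""
    params = {"query.bibliographic": title, "rows": 1}
    if author:
        params["query.author"] = author
    return CROSSREF_URL + "?" + urllib.parse.urlencode(params)


def first_doi(payload):
    """Return the DOI of the first search result, or None if there is none."""
    items = payload.get("message", {}).get("items", [])
    return items[0].get("DOI") if items else None


def lookup_doi(title, author=None, fetch=None):
    """Look up a DOI by title; `fetch` maps a URL to decoded JSON (injectable for tests)."""
    if fetch is None:
        def fetch(url):
            # Real network call; only reached when no fetch function is supplied.
            with urllib.request.urlopen(url, timeout=10) as response:
                return json.load(response)
    return first_doi(fetch(build_query_url(title, author)))
```

The top hit is only Crossref's best guess, so comparing its title against the .bib entry before accepting the DOI is a sensible safeguard.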

Why Did I Create This?

As a researcher, I find reference management a critical yet often time-consuming task. When writing my articles and notes in LaTeX, I find it particularly useful to include the DOIs of the cited articles in the manuscript for easy access. However, when I download the .bib file from Google Scholar, the DOIs are often missing, and searching for them manually proved to be a real headache. To save time, I decided to automate the process and share this tool with you all.

Prerequisites

  • Python
  • requests library

How to Get Started

  1. Clone the repository or snag the doi_scraper.py file.
  2. Install the required dependencies with:
pip install requests

How to Use

Place your input .bib file in the script’s directory and adjust these variables in doi_scraper.py to fit your needs:

input_file = 'input.bib'   # Name of the input .bib file
output_file = 'output.bib' # Name of the output .bib file
INDENT_PRE = 4             # Number of spaces before the field name
INDENT_POST = 16           # Number of spaces after the field name

Then run the script:

python doi_scraper.py

Example

Before

@article{Cuadra2020,
title            = {Effect of equivalence ratio fluctuations on planar detonation discontinuities},
author   = {Cuadra, Alberto and Huete, C{\'e}sar and Vera, Marcos},
year    = 2020,
journal  = {Journal of Fluid Mechanics},
publisher    = {Cambridge University Press},
volume       = 903,
pages= {A30 1--39}
}
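
The entry above has no doi field, which is the condition the script looks for. A minimal sketch of that check, assuming simple, well-formed entries: the function name entries_missing_doi and both regular expressions are hypothetical, and this is not a full BibTeX parser (it would be confused by an "@" entry header appearing inside a field value, for instance).

```python
import re

# Matches the start of an entry such as "@article{Cuadra2020,".
ENTRY_START = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
# Matches a "doi = ..." field at the beginning of a line, ignoring case.
DOI_FIELD = re.compile(r"^\s*doi\s*=", re.IGNORECASE | re.MULTILINE)


def entries_missing_doi(bib_text):
    """Return the citation keys of @article entries that have no doi field."""
    starts = list(ENTRY_START.finditer(bib_text))
    missing = []
    for i, match in enumerate(starts):
        # An entry's body runs until the next entry starts (or the text ends).
        end = starts[i + 1].start() if i + 1 < len(starts) else len(bib_text)
        body = bib_text[match.end():end]
        if match.group(1).lower() == "article" and not DOI_FIELD.search(body):
            missing.append(match.group(2))
    return missing
```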

After

@article{Cuadra2020,
    title            = {Effect of equivalence ratio fluctuations on planar detonation discontinuities},
    author           = {Cuadra, Alberto and Huete, C{\'e}sar and Vera, Marcos},
    year             = 2020,
    journal          = {Journal of Fluid Mechanics},
    publisher        = {Cambridge University Press},
    volume           = 903,
    pages            = {A30 1--39},
    doi              = {10.1017/jfm.2020.651}
}
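
The aligned layout in the "After" entry can be reproduced with a few lines of Python. This sketch assumes that INDENT_POST is the width to which the field name is padded before " = ", an interpretation inferred from the example output rather than from the script itself. The helper names format_field and format_entry are hypothetical.

```python
INDENT_PRE = 4    # spaces before the field name (same value as in the settings)
INDENT_POST = 16  # assumption: width the field name is padded to before " = "


def format_field(name, value, last=False):
    """Format one field line; every line but the last ends with a comma."""
    line = " " * INDENT_PRE + name.ljust(INDENT_POST) + " = " + value
    return line if last else line + ","


def format_entry(entry_type, key, fields):
    """Format a whole entry from a dict that maps field names to raw values."""
    lines = ["@%s{%s," % (entry_type, key)]
    names = list(fields)
    for i, name in enumerate(names):
        lines.append(format_field(name, fields[name], last=(i == len(names) - 1)))
    lines.append("}")
    return "\n".join(lines)
```

Values are passed through unchanged, so braces belong to the value itself, for example format_entry("article", "Cuadra2020", {"year": "2020", "doi": "{10.1017/jfm.2020.651}"}).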

License

This project is licensed under the MIT License.

Alberto Cuadra-Lara
Postdoctoral Researcher at