Kimi generated.
extract_paper
Introduction
This Python script extracts text content from PDF documents and formats it into a structured JSON file. It uses the argparse, glob, os, and tqdm libraries for argument parsing, file path expansion, environment variable setup, and progress indication, respectively. The script also uses the mbapy library for PDF processing and JSON serialization.
Parameters
- -i, --input: The directory path containing the PDF files to be processed.
- -o, --output: The output file name for the JSON file that will store the extracted data. Defaults to _mbapy_extract_paper.json.
- -b, --backend: The backend library to use for PDF conversion. Defaults to pdfminer.
- -l, --log: A flag to enable logging. Defaults to False.
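The options above could be declared with argparse roughly as follows. This is a minimal sketch, not the script's actual source: the function name build_parser and the "." default for --input are assumptions made for illustration, while the flag names and the other defaults are taken from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="extract_paper")
    # Assumption: the default "." is not documented; it is used here for convenience.
    parser.add_argument("-i", "--input", type=str, default=".",
                        help="directory containing the PDF files to process")
    parser.add_argument("-o", "--output", type=str,
                        default="_mbapy_extract_paper.json",
                        help="name of the JSON file that stores the extracted data")
    parser.add_argument("-b", "--backend", type=str, default="pdfminer",
                        help="backend library used for PDF conversion")
    # store_true makes the flag default to False and become True when given.
    parser.add_argument("-l", "--log", action="store_true",
                        help="enable logging")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Using store_true for -l matches the documented default of False without requiring the user to pass a value.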
Behavior
- The script sets environment variables to control the behavior of certain libraries.
- It parses command line arguments to get the input directory, output file name, backend for PDF conversion, and logging flag.
- It finds all PDF files in the specified input directory.
- For each PDF file, the script attempts to extract bookmarks and convert the PDF to plain text.
- The extracted text and bookmarks are formatted into a structured data format.
- The structured data for all PDF files are compiled into a JSON object.
- The JSON object is saved to a file with the specified output name in the input directory.
- The script provides a progress bar to indicate the completion status.
- It allows the user to stop early by pressing the letter e.
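The core of the steps above (find the PDFs, process each one, skip failures, write one JSON file into the input directory) can be sketched as below. This is a hedged illustration only: extract_all and the extract_one callback are hypothetical names standing in for the mbapy calls, whose signatures are not shown here, and keying the result by file name is an assumption. The progress bar and the early stop on e are left out.

```python
import glob
import json
import os
from typing import Callable, Dict


def extract_all(input_dir: str, output_name: str,
                extract_one: Callable[[str], dict]) -> Dict[str, dict]:
    """Run extract_one on every PDF in input_dir and save the results as JSON."""
    results: Dict[str, dict] = {}
    # Non-recursive search; whether the real script recurses is not documented.
    for pdf_path in sorted(glob.glob(os.path.join(input_dir, "*.pdf"))):
        try:
            # Assumption: results are keyed by file name.
            results[os.path.basename(pdf_path)] = extract_one(pdf_path)
        except Exception as exc:
            # Skip PDFs that cannot be parsed instead of aborting the run.
            print(f"skipping {pdf_path}: {exc}")
    # The output file is written inside the input directory.
    out_path = os.path.join(input_dir, output_name)
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(results, fh, ensure_ascii=False, indent=2)
    return results
```

Passing the extraction step in as a callback keeps the sketch independent of any particular PDF backend, so a stub can be swapped in when experimenting.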
Notes
- The script uses the glob module to find all PDF files in the given directory.
- The mbapy library is expected to provide functions such as get_section_bookmarks, convert_pdf_to_txt, and format_paper_from_txt.
- The script includes error handling to skip over PDF files that cannot be parsed.
- Logging is configurable and can be enabled by the user through the --log flag.
- The Configs.err_warning_level is set to a high value to suppress warnings if logging is not enabled.
Examples
To run the script on a directory of PDFs and save the extracted data to a file named extracted_papers.json:
mbapy-cli extract_paper -i . -o extracted_papers.json
To run the script with logging enabled and using a specific backend for PDF conversion:
mbapy-cli extract_paper -i . -l -b pdfminer.six