How to Remove Strikethrough Text from PDFs Using Python

In this blog post, I will share my journey of developing a Python-based solution to remove strikethrough text from PDFs. This solution is specifically designed for PDFs where strikethrough is applied as a style rather than an annotation.

The Challenge

Strikethrough text in PDFs can be tricky to handle, especially when it is applied as a style rather than as an annotation. Standard PDF manipulation libraries often fall short in these cases, so I used Python to build a practical workaround.

The Solution

The solution involves three main steps: converting the PDF to a DOCX file, removing the strikethrough text from the DOCX file, and converting the modified DOCX file back to a PDF.

Dependencies

Before diving into the code, install the necessary Python dependencies. You will need:
• pdf2docx for converting PDF to DOCX
• python-docx for manipulating DOCX files
• docx2pdf for converting DOCX back to PDF (it drives Microsoft Word, so Word must be installed, and it runs on Windows and macOS only)

You can install these dependencies using pip:

pip install pdf2docx python-docx docx2pdf

Step-by-Step Guide to Remove Strikethrough Text from PDFs

Step 1: Convert PDF to DOCX

The first step is to convert the PDF file to a DOCX file. This allows us to manipulate the text more easily. We use the pdf2docx library for this conversion. Here is the code for the conversion function:

import sys

from pdf2docx import Converter

def convert_pdf_to_word(pdf_file, docx_file):
    """Convert PDF to DOCX format."""
    try:
        cv = Converter(pdf_file)
        cv.convert(docx_file, start=0, end=None)
        cv.close()
        print(f"Converted PDF to DOCX: {pdf_file} -> {docx_file}")
    except Exception as e:
        print(f"Error during PDF to DOCX conversion: {e}")
        sys.exit(1)

In this function, we create an instance of the Converter class, passing pdf_file as an argument. Calling convert with start=0 and end=None converts every page, and close releases the resources the converter holds. On success, the function prints a confirmation message. If an error occurs, the exception is caught, an error message is printed, and the script exits with status 1.
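One detail worth noting: in the function above, cv.close() is skipped if convert raises. Below is a minimal sketch of one way to guarantee the cleanup, using contextlib.closing from the standard library. The function name convert_pdf_to_word_safely and the converter_cls parameter are hypothetical, introduced here only so the cleanup logic can be exercised without a real PDF. When converter_cls is omitted, the sketch falls back to the same pdf2docx Converter used above.

```python
from contextlib import closing


def convert_pdf_to_word_safely(pdf_file, docx_file, converter_cls=None):
    """Convert a PDF to DOCX, closing the converter even if conversion fails.

    converter_cls is a hypothetical hook for testing; when it is omitted,
    the real pdf2docx Converter is imported and used.
    """
    if converter_cls is None:
        from pdf2docx import Converter
        converter_cls = Converter

    # closing() calls cv.close() when the with-block exits,
    # whether it finishes normally or raises.
    with closing(converter_cls(pdf_file)) as cv:
        cv.convert(docx_file, start=0, end=None)
```

Unlike the original function, this variant lets the exception propagate, so the caller decides whether to print a message and exit.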

Step 2: Remove Strikethrough Text

Once we have the DOCX file, we can remove the strikethrough text. This step involves iterating through the paragraphs and runs in the DOCX file and checking for the strikethrough style. We use the python-docx library for this task. Here is the code for the strikethrough removal function:

import sys

from docx import Document

def remove_strikethrough_text(docx_file):
    """Remove all strikethrough text from a DOCX file."""
    try:
        document = Document(docx_file)
        modified = False
        for paragraph in document.paragraphs:
            for run in paragraph.runs:
                if run.font.strike:
                    print(f"Removing strikethrough text: {run.text}")
                    run.text = ''
                    modified = True
        if modified:
            modified_docx_file = docx_file.replace('.docx', '_modified.docx')
            document.save(modified_docx_file)
            print(f"Strikethrough text removed. Saved to: {modified_docx_file}")
            return modified_docx_file
        else:
            print("No strikethrough text found.")
            return docx_file
    except Exception as e:
        print(f"Error during strikethrough text removal: {e}")
        sys.exit(1)

In this function, we create an instance of the Document class, passing docx_file as an argument. We iterate through each paragraph in the document and then through each run within that paragraph. If the strike attribute of the run's font is True, we print the text being removed and set the run's text to an empty string. If any strikethrough text was removed, we save the modified document to a new file with _modified appended to the original filename and return that path. If no strikethrough text was found, we return the original DOCX path unchanged.
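The function above has two gaps worth knowing about. In python-docx, document.paragraphs returns only body-level paragraphs, so text inside table cells is not visited. Also, run.font.strike covers single strikethrough only, while double strikethrough is reported separately as run.font.double_strike. The following is a minimal sketch of how both could be covered. The helper names iter_paragraphs, is_struck, and clear_struck_runs are my own, not part of python-docx. Both font properties are tri-state (True, False, or None when inherited), so like the original check, this only catches strikethrough set directly on the run.

```python
def iter_paragraphs(container):
    """Yield paragraphs from a document or table cell, descending into tables."""
    yield from container.paragraphs
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                # Cells hold their own paragraphs and possibly nested tables.
                yield from iter_paragraphs(cell)


def is_struck(run):
    """Return True if single or double strikethrough is set on the run itself."""
    font = run.font
    # None (inherited) and False are both treated as "not struck".
    return bool(font.strike or font.double_strike)


def clear_struck_runs(document):
    """Blank out every struck, non-empty run; return how many were cleared."""
    cleared = 0
    for paragraph in iter_paragraphs(document):
        for run in paragraph.runs:
            # Skipping empty runs avoids double counting merged cells,
            # which can appear more than once in row.cells.
            if is_struck(run) and run.text:
                run.text = ""
                cleared += 1
    return cleared
```

Calling clear_struck_runs(Document(docx_file)) would stand in for the nested loop in remove_strikethrough_text. Headers and footers are still not covered by this sketch.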

Step 3: Convert DOCX Back to PDF

The final step is to convert the modified DOCX file back to a PDF file. This ensures that the strikethrough text is removed in the final PDF. We use the docx2pdf library for this conversion. Here is the code for the conversion function:

import sys

from docx2pdf import convert

def convert_docx_to_pdf(docx_file, output_pdf):
    """Convert DOCX back to PDF format."""
    try:
        convert(docx_file, output_pdf)
        print(f"Converted DOCX to PDF: {docx_file} -> {output_pdf}")
    except Exception as e:
        print(f"Error during DOCX to PDF conversion: {e}")
        sys.exit(1)

This function calls docx2pdf's convert function, passing docx_file and output_pdf as arguments to perform the conversion. On success, it prints a confirmation message. If an error occurs, the exception is caught, an error message is printed, and the script exits with status 1.
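Because docx2pdf works by automating Microsoft Word, this step can only succeed on Windows or macOS with Word installed. A small guard like the one sketched below could fail early with a clearer message than the generic error above. The function names are hypothetical, and the check only looks at the platform. It does not verify that Word is actually installed.

```python
import sys


def docx2pdf_is_supported(platform_name=None):
    """Return True on platforms where docx2pdf can drive Microsoft Word."""
    if platform_name is None:
        platform_name = sys.platform
    # sys.platform is "win32" on Windows and "darwin" on macOS.
    return platform_name in ("win32", "darwin")


def ensure_docx2pdf_supported(platform_name=None):
    """Raise a clear error before attempting a conversion that cannot work."""
    if not docx2pdf_is_supported(platform_name):
        raise RuntimeError(
            "docx2pdf needs Microsoft Word, which is only available on "
            "Windows and macOS; this platform is not supported."
        )
```

Calling ensure_docx2pdf_supported() at the start of the script, before Step 1, would avoid doing the PDF to DOCX conversion on a machine that cannot finish the job.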

Main Execution Block

The following block of code is the main execution section of the script. It starts by checking if the script is being run directly. It then verifies that exactly one command-line argument is provided and that the specified PDF file exists. If these conditions are met, the script defines intermediate file paths and performs the three main steps: converting the PDF to a DOCX file, removing strikethrough text from the DOCX file, and converting the modified DOCX back to a PDF. After completing these steps, it prints the location of the modified PDF file and cleans up the intermediate DOCX files. If an error occurs during execution, it is caught and printed, and the script exits with a non-zero status.

import os
import sys

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: python {sys.argv[0]} <pdf_file_path>")
        sys.exit(1)

    pdf_file = sys.argv[1]

    if not os.path.exists(pdf_file):
        print(f"Error: File not found - {pdf_file}")
        sys.exit(1)

    try:
        # Define intermediate file paths
        base_name = os.path.splitext(pdf_file)[0]
        temp_docx_file = f"{base_name}.docx"
        modified_docx_file = f"{base_name}_modified.docx"
        output_pdf_file = f"{base_name}_modified.pdf"

        # Step 1: Convert PDF to DOCX
        convert_pdf_to_word(pdf_file, temp_docx_file)

        # Step 2: Remove strikethrough text
        final_docx_file = remove_strikethrough_text(temp_docx_file)

        # Step 3: Convert modified DOCX back to PDF
        convert_docx_to_pdf(final_docx_file, output_pdf_file)

        print(f"Modified PDF saved to: {output_pdf_file}")

        # Clean up intermediate DOCX files
        if os.path.exists(temp_docx_file):
            os.remove(temp_docx_file)
        if final_docx_file != temp_docx_file and os.path.exists(final_docx_file):
            os.remove(final_docx_file)

    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)
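One fragile spot in the path handling: remove_strikethrough_text builds its output name with docx_file.replace('.docx', '_modified.docx'), which rewrites every occurrence of '.docx' in the path, including one in a directory name. As a minimal sketch, the three derived paths could instead be computed with pathlib, which only touches the file name. The function derive_paths is a hypothetical helper, not part of the original script.

```python
from pathlib import Path


def derive_paths(pdf_file):
    """Return (temp_docx, modified_docx, output_pdf) paths next to pdf_file."""
    pdf_path = Path(pdf_file)
    stem = pdf_path.stem  # file name without its final extension
    # with_name() swaps only the last path component, so directory
    # names that happen to contain ".docx" are left untouched.
    temp_docx = pdf_path.with_name(f"{stem}.docx")
    modified_docx = pdf_path.with_name(f"{stem}_modified.docx")
    output_pdf = pdf_path.with_name(f"{stem}_modified.pdf")
    return temp_docx, modified_docx, output_pdf
```

For an input such as reports.docx_archive/q1.pdf, the string replacement would also rename the directory part of the path, while this version changes only q1.pdf.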

Complete Script

import sys
import os
from pdf2docx import Converter
from docx import Document
from docx2pdf import convert

def convert_pdf_to_word(pdf_file, docx_file):
    """Convert PDF to DOCX format."""
    try:
        cv = Converter(pdf_file)
        cv.convert(docx_file, start=0, end=None)
        cv.close()
        print(f"Converted PDF to DOCX: {pdf_file} -> {docx_file}")
    except Exception as e:
        print(f"Error during PDF to DOCX conversion: {e}")
        sys.exit(1)

def remove_strikethrough_text(docx_file):
    """Remove all strikethrough text from a DOCX file."""
    try:
        document = Document(docx_file)
        modified = False

        for paragraph in document.paragraphs:
            for run in paragraph.runs:
                if run.font.strike:
                    print(f"Removing strikethrough text: {run.text}")
                    run.text = ''
                    modified = True

        if modified:
            modified_docx_file = docx_file.replace('.docx', '_modified.docx')
            document.save(modified_docx_file)
            print(f"Strikethrough text removed. Saved to: {modified_docx_file}")
            return modified_docx_file
        else:
            print("No strikethrough text found.")
            return docx_file
    except Exception as e:
        print(f"Error during strikethrough text removal: {e}")
        sys.exit(1)

def convert_docx_to_pdf(docx_file, output_pdf):
    """Convert DOCX back to PDF format."""
    try:
        convert(docx_file, output_pdf)
        print(f"Converted DOCX to PDF: {docx_file} -> {output_pdf}")
    except Exception as e:
        print(f"Error during DOCX to PDF conversion: {e}")
        sys.exit(1)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: python {sys.argv[0]} <pdf_file_path>")
        sys.exit(1)

    pdf_file = sys.argv[1]

    if not os.path.exists(pdf_file):
        print(f"Error: File not found - {pdf_file}")
        sys.exit(1)

    try:
        # Define intermediate file paths
        base_name = os.path.splitext(pdf_file)[0]
        temp_docx_file = f"{base_name}.docx"
        modified_docx_file = f"{base_name}_modified.docx"
        output_pdf_file = f"{base_name}_modified.pdf"

        # Step 1: Convert PDF to DOCX
        convert_pdf_to_word(pdf_file, temp_docx_file)

        # Step 2: Remove strikethrough text
        final_docx_file = remove_strikethrough_text(temp_docx_file)

        # Step 3: Convert modified DOCX back to PDF
        convert_docx_to_pdf(final_docx_file, output_pdf_file)

        print(f"Modified PDF saved to: {output_pdf_file}")

        # Clean up intermediate DOCX files
        if os.path.exists(temp_docx_file):
            os.remove(temp_docx_file)
        if final_docx_file != temp_docx_file and os.path.exists(final_docx_file):
            os.remove(final_docx_file)

    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)
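Note that the cleanup at the end of the script only runs when all three steps succeed. Each helper calls sys.exit(1) on failure, and the SystemExit it raises is not a subclass of Exception, so it bypasses the except block and leaves the intermediate DOCX files on disk. One possible alternative, sketched below, is to keep the intermediates in a tempfile.TemporaryDirectory, which is removed however the block exits. The function run_pipeline_in_temp_dir and its parameters are hypothetical. The three callables are meant to be the script's own convert_pdf_to_word, remove_strikethrough_text, and convert_docx_to_pdf.

```python
import tempfile
from pathlib import Path


def run_pipeline_in_temp_dir(pdf_file, output_pdf, pdf_to_docx, strip_docx, docx_to_pdf):
    """Run the three steps, keeping intermediate DOCX files in a temp directory."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        # The intermediate DOCX lives in the temp directory, not next to the PDF.
        temp_docx = str(Path(tmp_dir) / f"{Path(pdf_file).stem}.docx")

        pdf_to_docx(pdf_file, temp_docx)         # Step 1
        final_docx = strip_docx(temp_docx)       # Step 2 (returns the path to use)
        docx_to_pdf(final_docx, output_pdf)      # Step 3
    # The temp directory and everything in it are gone at this point,
    # even if one of the steps raised or called sys.exit().
    return output_pdf
```

A call might look like run_pipeline_in_temp_dir(pdf_file, output_pdf_file, convert_pdf_to_word, remove_strikethrough_text, convert_docx_to_pdf), which would replace the manual os.remove calls.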

Run the Script

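Execute the script by running the following command, replacing <script_name> with the name of your script file and <pdf_file_path> with the path to your PDF file. The output is written next to the input with _modified appended to the file name, so report.pdf produces report_modified.pdf.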

python <script_name>.py <pdf_file_path>


This Python-based solution removes strikethrough text from PDFs by combining the pdf2docx, python-docx, and docx2pdf libraries. By converting the PDF to DOCX, modifying the DOCX, and converting it back to PDF, we can remove the struck text while leaving the remaining text in place. Because the document is rebuilt through DOCX along the way, the layout of the output may differ somewhat from the original, so it is worth reviewing the result. For PDFs where strikethrough is applied as a style, this approach offers a practical way to produce clean, professional documents.

Punam Khond

Punam Khond is a Technical Consultant at Perficient with diverse experience in software development. She excels at solving complex problems and delivering high-quality solutions. Punam is passionate about technology and enjoys collaborating with teams and continuously exploring new innovations.
