Stream API in Java: Enhancements and Use Cases
Punam Khond, Author at Perficient Blogs
Published: Tue, 15 Jul 2025 14:08:09 +0000
https://blogs.perficient.com/2025/07/15/stream-api-in-java-enhancements-and-use-cases/

Working with collections in Java used to involve a lot of loops and verbose code. That changed significantly with the introduction of the Stream API in Java 8. It introduced a functional approach to data processing, resulting in cleaner, more concise, and easier-to-read code.

This blog walks through the basics of the Stream API and dives into some of the newer enhancements introduced in Java 9 and beyond. Along the way, real-life examples help make the concepts more practical and relatable.

What is the Stream API All About?

Think of a stream as a pipeline of data. It does not store data, but rather processes elements from a source, such as a list or array. Operations are chained in a fluent style, making complex data handling tasks much more straightforward.

Creating Streams

Streams can be created from many different sources:

List<String> names = List.of("Alice", "Bob", "Charlie");
Stream<String> nameStream = names.stream();

Stream<Integer> numberStream = Stream.of(1, 2, 3);

String[] array = {"A", "B", "C"};
Stream<String> arrayStream = Arrays.stream(array);
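
Streams can also be generated rather than read from an existing collection. The sketch below is illustrative only: the class and method names are made up for this example. It shows the three-argument Stream.iterate added in Java 9 and a primitive IntStream range.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class StreamSourcesDemo {

    // Bounded Stream.iterate (Java 9+): seed, continue-condition, next-value function.
    static List<Integer> powersOfTwoUpTo(int limit) {
        return Stream.iterate(1, n -> n <= limit, n -> n * 2)
            .collect(Collectors.toList());
    }

    // Primitive range 1..n (inclusive), boxed so it can be collected into a List.
    static List<Integer> oneTo(int n) {
        return IntStream.rangeClosed(1, n)
            .boxed()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(powersOfTwoUpTo(100)); // [1, 2, 4, 8, 16, 32, 64]
        System.out.println(oneTo(5));             // [1, 2, 3, 4, 5]
    }
}
```

The bounded form of Stream.iterate stops as soon as the condition fails, so no separate limit() call is needed.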

Use Case: Processing Employee Data

Here is a familiar example processing a list of employees:

List<Employee> employees = Arrays.asList(
    new Employee("Alice", 70000),
    new Employee("Bob", 50000),
    new Employee("Charlie", 120000),
    new Employee("David", 90000)
);

List<Employee> highEarners = employees.stream()
    .filter(e -> e.getSalary() > 80000)
    .map(e -> new Employee(e.getName().toUpperCase(), e.getSalary()))
    .sorted((e1, e2) -> Double.compare(e2.getSalary(), e1.getSalary()))
    .collect(Collectors.toList());

This snippet filters employees earning above 80k, transforms their names to uppercase, sorts them by salary in descending order, and collects the result into a new list.
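
The post does not show the Employee class, so here is a minimal, self-contained sketch of the same pipeline. The Employee shape (a name, a salary, and two accessors) and the helper names are assumptions made for this example. For brevity it returns the uppercase names instead of new Employee objects.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class HighEarnersDemo {

    // Hypothetical Employee type; the original post does not define it.
    static final class Employee {
        private final String name;
        private final double salary;

        Employee(String name, double salary) {
            this.name = name;
            this.salary = salary;
        }

        String getName() { return name; }
        double getSalary() { return salary; }
    }

    // Same sample data as the snippet above.
    static List<Employee> sampleEmployees() {
        return Arrays.asList(
            new Employee("Alice", 70000),
            new Employee("Bob", 50000),
            new Employee("Charlie", 120000),
            new Employee("David", 90000));
    }

    // Filter above 80k, sort by salary descending, then map to uppercase names.
    static List<String> highEarnerNames(List<Employee> employees) {
        return employees.stream()
            .filter(e -> e.getSalary() > 80000)
            .sorted(Comparator.comparingDouble(Employee::getSalary).reversed())
            .map(e -> e.getName().toUpperCase())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(highEarnerNames(sampleEmployees())); // [CHARLIE, DAVID]
    }
}
```

Comparator.comparingDouble(...).reversed() is an alternative to the hand-written Double.compare lambda and reads closer to the intent.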

Common Stream Operations

Streams typically use two types of operations:

  • Intermediate (like filter, map, sorted) — These are lazy and build up the processing pipeline.
  • Terminal (like collect, reduce, forEach) — These trigger execution and produce a result.

A combination of these can handle most common data transformations in Java applications.
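
To make the laziness of intermediate operations concrete, the following sketch counts how often a filter predicate runs before and after a terminal operation is invoked. The class name and the counter are illustrative and not part of the original post.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazinessDemo {

    // Returns {predicate calls before the terminal op, predicate calls after it}.
    static int[] countPredicateCalls() {
        AtomicInteger calls = new AtomicInteger();
        List<String> names = List.of("Alice", "Bob", "Charlie");

        // Only intermediate operations here, so nothing is executed yet.
        Stream<String> pipeline = names.stream()
            .filter(n -> {
                calls.incrementAndGet();
                return n.length() > 3;
            })
            .map(String::toUpperCase);
        int beforeTerminal = calls.get();

        // The terminal operation pulls every element through the pipeline.
        List<String> result = pipeline.collect(Collectors.toList());
        int afterTerminal = calls.get();

        System.out.println(result); // [ALICE, CHARLIE]
        return new int[] {beforeTerminal, afterTerminal};
    }

    public static void main(String[] args) {
        int[] counts = countPredicateCalls();
        System.out.println(counts[0] + " -> " + counts[1]); // 0 -> 3
    }
}
```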

Stream API Enhancements (Java 9+)

New features made streams even more powerful:

  1. takeWhile() and dropWhile()

Great for slicing a stream based on a condition:

Stream.of(100, 90, 80, 70, 60)
    .takeWhile(n -> n >= 80)
    .forEach(System.out::println);
// Outputs: 100, 90, 80
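
The example above covers only takeWhile(). The sketch below adds dropWhile() and contrasts both with filter() on unsorted input, since that is where the difference shows. The class and method names are made up for this illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

class SliceDemo {

    // Skips the leading run of scores >= 80 and keeps everything after it.
    static List<Integer> dropHigh(List<Integer> scores) {
        return scores.stream().dropWhile(n -> n >= 80).collect(Collectors.toList());
    }

    // Keeps the leading run of scores >= 80 and stops at the first failure.
    static List<Integer> takeHigh(List<Integer> scores) {
        return scores.stream().takeWhile(n -> n >= 80).collect(Collectors.toList());
    }

    // Tests every element, regardless of position.
    static List<Integer> filterHigh(List<Integer> scores) {
        return scores.stream().filter(n -> n >= 80).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(dropHigh(List.of(100, 90, 80, 70, 60))); // [70, 60]
        System.out.println(takeHigh(List.of(100, 70, 90)));         // [100]
        System.out.println(filterHigh(List.of(100, 70, 90)));       // [100, 90]
    }
}
```

On ordered streams, takeWhile() and dropWhile() depend on encounter order, so they are most useful when the data is already sorted by the condition being tested.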

  2. Stream.ofNullable()

Helps avoid null checks by returning an empty stream if the value is null.

Stream.ofNullable(null).count(); // Returns 0
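
Stream.ofNullable() is most useful inside flatMap(), where a lookup may return null. The sketch below assumes a hypothetical name-to-email map; the class, method, and data are illustrative only.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class OfNullableDemo {

    // Looks up each name; names without an entry produce an empty stream
    // and drop out of the result, so no explicit null check is needed.
    static List<String> emailsFor(List<String> names, Map<String, String> emailByName) {
        return names.stream()
            .flatMap(name -> Stream.ofNullable(emailByName.get(name)))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> emails = Map.of(
            "Alice", "alice@example.com",
            "Carol", "carol@example.com");
        // Bob has no entry, so he is skipped.
        System.out.println(emailsFor(List.of("Alice", "Bob", "Carol"), emails));
        // [alice@example.com, carol@example.com]
    }
}
```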

 

  3. Collectors Enhancements

Grouping and filtering in one go:

Map<String, List<Employee>> grouped = employees.stream()
    .collect(Collectors.groupingBy(
        Employee::getDepartment,
        Collectors.filtering(e -> e.getSalary() > 80000, Collectors.toList())
    ));

Also helpful is Collectors.flatMapping() when flattening nested data structures during grouping.
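
As a rough illustration of Collectors.flatMapping(), the sketch below collects the distinct skills found in each department. The Employee shape used here (a department plus a list of skills) is an assumption for this example and does not come from the original post.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class FlatMappingDemo {

    // Hypothetical employee with a nested collection to flatten.
    static final class Employee {
        final String department;
        final List<String> skills;

        Employee(String department, List<String> skills) {
            this.department = department;
            this.skills = skills;
        }
    }

    // Groups by department and flattens each employee's skills into one set per group.
    static Map<String, Set<String>> skillsByDepartment(List<Employee> employees) {
        return employees.stream()
            .collect(Collectors.groupingBy(
                e -> e.department,
                Collectors.flatMapping(e -> e.skills.stream(), Collectors.toSet())));
    }

    static List<Employee> sampleEmployees() {
        return List.of(
            new Employee("Eng", List.of("Java", "SQL")),
            new Employee("Eng", List.of("Java", "Go")),
            new Employee("Ops", List.of("Bash")));
    }

    public static void main(String[] args) {
        // Eng -> {Java, SQL, Go}, Ops -> {Bash} (set iteration order is unspecified)
        System.out.println(skillsByDepartment(sampleEmployees()));
    }
}
```

Without flatMapping(), the downstream collector would receive whole lists and produce a set of lists per department rather than a single flat set.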

Parallel Streams in Java

Parallel streams in Java are a powerful feature that enables concurrent data processing, leveraging multiple threads to enhance performance, particularly with large datasets. Here is a closer look at how you can use parallel streams effectively.

Leveraging Parallel Streams

For large datasets, parallelStream() allows you to split work across multiple threads, which can significantly improve performance:

double totalHighSalary = employees.parallelStream()
    .filter(e -> e.getSalary() > 80000)
    .mapToDouble(Employee::getSalary)
    .sum();

This approach can speed up processing, but it is essential to exercise caution with shared resources due to potential concurrency issues.
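
The concurrency concern usually arises when a lambda writes to a shared collection. The sketch below shows the unsafe pattern only as a comment and implements the safe alternative, which lets the stream build the result through collect(). The class and method names are illustrative.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class ParallelSafetyDemo {

    // Unsafe pattern, shown for contrast only:
    //   List<Integer> out = new ArrayList<>();
    //   IntStream.rangeClosed(1, n).parallel().forEach(out::add); // races on a non-thread-safe list
    //
    // Safe pattern: no shared mutable state; collect() merges per-thread
    // partial results and preserves encounter order for ordered sources.
    static List<Integer> squaresOfEvens(int upperBound) {
        return IntStream.rangeClosed(1, upperBound)
            .parallel()
            .filter(n -> n % 2 == 0)
            .map(n -> n * n)
            .boxed()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(squaresOfEvens(10)); // [4, 16, 36, 64, 100]
    }
}
```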

Use Case: Grouping by Department

Parallel streams can also be useful for grouping data, which is particularly helpful in applications like payroll, HR, or dashboard services:

Map<String, List<Employee>> departmentWise = employees.parallelStream()
    .collect(Collectors.groupingBy(Employee::getDepartment));

Grouping employees by department in parallel can make reporting and analysis more efficient, especially with large datasets.

Best Practices for Using Parallel Streams

To get the most out of parallel streams, keep these tips in mind:

    1. Use for Large, CPU-Intensive Tasks: Ideal for processing large datasets with intensive computations.
    2. Avoid Shared Mutable Data: Ensure operations are thread-safe and don’t modify shared state.
    3. Measure Performance: Always benchmark to confirm that parallelism is improving speed.
    4. Use Concurrent Collectors: When collecting results, use thread-safe collectors like toConcurrentMap().
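
As one possible illustration of the concurrent-collector tip, the sketch below uses Collectors.groupingByConcurrent(), a sibling of toConcurrentMap(), to count words by length in a parallel stream. The class name and data are made up for this example.

```java
import java.util.List;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;

class ConcurrentGroupingDemo {

    // All worker threads accumulate into a single ConcurrentMap,
    // instead of building per-thread maps that must be merged afterwards.
    static ConcurrentMap<Integer, Long> countByLength(List<String> words) {
        return words.parallelStream()
            .collect(Collectors.groupingByConcurrent(String::length, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Lengths 1, 2, 2, 3 -> {1=1, 2=2, 3=1}
        System.out.println(countByLength(List.of("a", "bb", "cc", "ddd")));
    }
}
```

Concurrent collectors do not preserve encounter order within groups, which is fine for counts and sums but may matter when each group is collected into a list.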

When Not to Use Parallel Streams

  1. For Small Datasets: The overhead of managing multiple threads can outweigh the benefits when working with small collections, making sequential streams more efficient.
  2. In I/O-Heavy Operations: Tasks involving file access, database queries, or network calls don’t benefit much from parallelism and may even perform worse due to thread blocking.

Conclusion

The Java Stream API streamlines data processing by replacing boilerplate-heavy code with expressive, functional patterns. The enhancements introduced in Java 9 and beyond, including advanced collectors and conditional stream slicing, provide even more powerful ways to handle data. With a little practice, working with streams becomes second nature, and the code ends up both cleaner and faster to write.

Automate Release Notes to Confluence with Bitbucket Pipelines
Published: Fri, 14 Feb 2025 05:44:45 +0000
https://blogs.perficient.com/2025/02/13/automate-release-notes-to-confluence-with-bitbucket-pipelines/

In this blog post, I will share my journey of implementing an automated solution to publish release notes for service deployments to Confluence using Bitbucket Pipelines. The goal was to streamline our release process and ensure all relevant information was easily accessible to our team. By connecting Bitbucket and Confluence, we removed a manual step from our workflow.

Step 1: Setting Up the Pipeline

We configured our Bitbucket pipeline to include a new step for publishing release notes. This involved writing a script in the bitbucket-pipelines.yml file to gather the necessary information (SHA, build number, and summary of updates).

Step 2: Generating Release Notes

We pulled the summary of updates from our commit messages and release notes. To ensure the quality of the summaries, we emphasized the importance of writing detailed and informative commit messages.

Step 3: Publishing to Confluence

Using the Confluence Cloud REST API, we automated the creation of Confluence pages. We created a parent page titled “Releases” and configured the script to publish each release as a new child page under it.

Repository Variables

We used several repository variables to keep sensitive information secure and make the script more maintainable:

  • REPO_TOKEN: The token used to authenticate with the Bitbucket API.
  • CONFLUENCE_USERNAME: The username for Confluence authentication.
  • CONFLUENCE_TOKEN: The token for Confluence authentication.
  • CONFLUENCE_SPACE_KEY: The key to the Confluence space where the release notes are published.
  • CONFLUENCE_ANCESTOR_ID: The ID of the parent page under which new release notes pages are created.
  • CONFLUENCE_API_URL: The URL of the Confluence API endpoint.

Script Details

Here is the script we used in our bitbucket-pipelines.yml file, along with an explanation of each part:

Step 1: Define the Pipeline Step

- step: &release-notes
      name: Publish Release Notes
      image: atlassian/default-image:3
  • Step Name: The step is named “Publish Release Notes”.
  • Docker Image: Uses the atlassian/default-image:3 Docker image for the environment.

Step 2: List Files

script:
  - ls -la /src/main/resources/
  • List Files: The ls -la command lists the files in the specified directory to ensure the necessary files are present.

Step 3: Extract Release Number

- RELEASE_NUMBER=$(grep '{application_name}.version' /src/main/resources/application.properties | cut -d'=' -f2)
  • Extract Release Number: The grep command extracts the release number from the application.properties file where the property {application_name}.version should be present.

Step 4: Create Release Title

- RELEASE_TITLE="Release - $RELEASE_NUMBER Build- $BITBUCKET_BUILD_NUMBER Commit- $BITBUCKET_COMMIT"
  • Create Release Title: Construct the release title using the release number, Bitbucket build number, and commit SHA.

Step 5: Get Commit Message

- COMMIT_MESSAGE=$(git log --format=%B -n 1 ${BITBUCKET_COMMIT})
  • Get Commit Message: The git log command retrieves the commit message for the current commit.

Step 6: Check for Pull Request

- |
  if [[ $COMMIT_MESSAGE =~ pull\ request\ #([0-9]+) ]]; then
    PR_NUMBER=$(echo "$COMMIT_MESSAGE" | grep -o -E 'pull\ request\ \#([0-9]+)' | sed 's/[^0-9]*//g')
  • Check for Pull Request: The script checks if the commit message contains a pull request number.
  • Extract PR Number: If a pull request number is found, it is extracted using grep and sed.

Step 7: Fetch Pull Request Description

RAW_RESPONSE=$(wget --no-hsts -qO- --header="Authorization: Bearer $REPO_TOKEN" "https://api.bitbucket.org/2.0/repositories/$BITBUCKET_WORKSPACE/$BITBUCKET_REPO_SLUG/pullrequests/${PR_NUMBER}")
PR_DESCRIPTION=$(echo "$RAW_RESPONSE" | jq -r '.description')
echo "$PR_DESCRIPTION" > description.txt
  • Fetch PR Description: Uses wget to fetch the pull request description from the Bitbucket API.
  • Parse Description: Parses the description using jq and saves it to description.txt.

Step 8: Prepare JSON Data

 AUTH_HEADER=$(echo -n "$CONFLUENCE_USERNAME:$CONFLUENCE_TOKEN" | base64 | tr -d '\n')
 JSON_DATA=$(jq -n --arg title "$RELEASE_TITLE" \
                    --arg type "page" \
                    --arg space_key "$CONFLUENCE_SPACE_KEY" \
                    --arg ancestor_id "$CONFLUENCE_ANCESTOR_ID" \
                    --rawfile pr_description description.txt \
                    '{
                      title: $title,
                      type: $type,
                      space: {
                        key: $space_key
                      },
                      ancestors: [{
                        id: ($ancestor_id | tonumber)
                      }],
                      body: {
                        storage: {
                          value: $pr_description,
                          representation: "storage"
                        }
                      }
                    }')
  echo "$JSON_DATA" > json_data.txt
  • Prepare Auth Header: Encodes the Confluence username and token for authentication.
  • Construct JSON Payload: Uses jq to construct the JSON payload for the Confluence API request.
  • Save JSON Data: Saves the JSON payload to json_data.txt.

Step 9: Publish to Confluence

  wget --no-hsts --method=POST --header="Content-Type: application/json" \
      --header="Authorization: Basic $AUTH_HEADER" \
      --body-file="json_data.txt" \
      "$CONFLUENCE_API_URL" -q -O -
  if [[ $? -ne 0 ]]; then
    echo "HTTP request failed"
    exit 1
  fi
  • Send POST Request: Uses wget to send a POST request to the Confluence API, which creates the release notes page.
  • Error Handling: Checks if the HTTP request failed and exits with an error message if it did.

Script

# Service for publishing release notes
- step: &release-notes
      name: Publish Release Notes
      image: atlassian/default-image:3
      script:
        - ls -la /src/main/resources/
        - RELEASE_NUMBER=$(grep '{application_name}.version' /src/main/resources/application.properties | cut -d'=' -f2)
        - RELEASE_TITLE="Release - $RELEASE_NUMBER Build- $BITBUCKET_BUILD_NUMBER Commit- $BITBUCKET_COMMIT"
        - COMMIT_MESSAGE=$(git log --format=%B -n 1 ${BITBUCKET_COMMIT})
        - |
          if [[ $COMMIT_MESSAGE =~ pull\ request\ #([0-9]+) ]]; then
            PR_NUMBER=$(echo "$COMMIT_MESSAGE" | grep -o -E 'pull\ request\ \#([0-9]+)' | sed 's/[^0-9]*//g')
            RAW_RESPONSE=$(wget --no-hsts -qO- --header="Authorization: Bearer $REPO_TOKEN" "https://api.bitbucket.org/2.0/repositories/$BITBUCKET_WORKSPACE/$BITBUCKET_REPO_SLUG/pullrequests/${PR_NUMBER}")
            PR_DESCRIPTION=$(echo "$RAW_RESPONSE" | jq -r '.description')
            echo "$PR_DESCRIPTION" > description.txt
            AUTH_HEADER=$(echo -n "$CONFLUENCE_USERNAME:$CONFLUENCE_TOKEN" | base64 | tr -d '\n')
            JSON_DATA=$(jq -n --arg title "$RELEASE_TITLE" \
                              --arg type "page" \
                              --arg space_key "$CONFLUENCE_SPACE_KEY" \
                              --arg ancestor_id "$CONFLUENCE_ANCESTOR_ID" \
                              --rawfile pr_description description.txt \
                              '{
                                title: $title,
                                type: $type,
                                space: {
                                  key: $space_key
                                },
                                ancestors: [{
                                  id: ($ancestor_id | tonumber)
                                }],
                                body: {
                                  storage: {
                                    value: $pr_description,
                                    representation: "storage"
                                  }
                                }
                              }')
            echo "$JSON_DATA" > json_data.txt
            wget --no-hsts --method=POST --header="Content-Type: application/json" \
              --header="Authorization: Basic $AUTH_HEADER" \
              --body-file="json_data.txt" \
              "$CONFLUENCE_API_URL" -q -O -
            if [[ $? -ne 0 ]]; then
              echo "HTTP request failed"
              exit 1
            fi
          fi

Outcomes and Benefits

  • The automation significantly reduced the manual effort required to publish release notes.
  • The project improved our overall release process efficiency and documentation quality.

Conclusion

Automating the publication of release notes to Confluence using Bitbucket Pipelines has been a game-changer for our team. It has streamlined our release process and ensured all relevant information is readily available. I hope this blog post provides insights and inspiration for others looking to implement similar solutions.

How to Implement Spring Expression Language (SpEL) Validator in Spring Boot: A Step-by-Step Guide
Published: Wed, 12 Feb 2025 07:07:48 +0000
https://blogs.perficient.com/2025/02/12/how-to-implement-spring-expression-language-spel-validator-in-spring-boot-a-step-by-step-guide/

In this blog post, I will guide you through the process of implementing a Spring Expression Language (SpEL) validator in a Spring Boot application. SpEL is a powerful expression language that supports querying and manipulating an object graph at runtime. By the end of this tutorial, you will have a working example of using SpEL for validation in your Spring Boot application.

Step 1: Set Up Your Spring Boot Project

First things first, let’s set up your Spring Boot project. Head over to Spring Initializr and create a new project with the following dependencies:

  • Spring Boot Starter Web
  • Thymeleaf (for the form interface)
    <dependencies>
    	<dependency>
    		<groupId>org.springframework.boot</groupId>
    		<artifactId>spring-boot-starter-web</artifactId>
    		<version>3.4.2</version>
    	</dependency>
    	<dependency>
    		<groupId>org.springframework.boot</groupId>
    		<artifactId>spring-boot-starter-thymeleaf</artifactId>
    		<version>3.4.2</version>
    	</dependency>
    </dependencies>
    

Step 2: Create the Main Application Class

Next, we will create the main application class to bootstrap our Spring Boot application.

package com.example.demo;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class DemoApplication {

    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }
}

Step 3: Create a Model Class

Create a SpelExpression class to hold the user input.

package com.example.demo.model;

public class SpelExpression {
    private String expression;

    // Getters and Setters
    public String getExpression() {
        return expression;
    }

    public void setExpression(String expression) {
        this.expression = expression;
    }
}


Step 4: Create a Controller

Create a controller to handle user input and validate the SpEL expression.

package com.example.demo.controller;

import com.example.demo.model.SpelExpression;
import org.springframework.expression.ExpressionParser;
import org.springframework.expression.spel.SpelParseException;
import org.springframework.expression.spel.standard.SpelExpressionParser;
import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.ModelAttribute;
import org.springframework.web.bind.annotation.PostMapping;

@Controller
public class SpelController {

    private final ExpressionParser parser = new SpelExpressionParser();

    @GetMapping("/spelForm")
    public String showForm(Model model) {
        model.addAttribute("spelExpression", new SpelExpression());
        return "spelForm";
    }

    @PostMapping("/validateSpel")
    public String validateSpel(@ModelAttribute SpelExpression spelExpression, Model model) {
        try {
            parser.parseExpression(spelExpression.getExpression());
            model.addAttribute("message", "The expression is valid.");
        } catch (SpelParseException e) {
            model.addAttribute("message", "Invalid expression: " + e.getMessage());
        }
        return "result";
    }
}

Step 5: Create Thymeleaf Templates

Create Thymeleaf templates for the form and the result page.

spelForm.html

<!DOCTYPE html>
<html xmlns:th="http://www.thymeleaf.org">
<head>
    <title>SpEL Form</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            background-color: #f4f4f9;
            color: #333;
            margin: 0;
            padding: 0;
            display: flex;
            justify-content: center;
            align-items: center;
            height: 100vh;
        }
        .container {
            background-color: #fff;
            padding: 20px;
            border-radius: 8px;
            box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
            text-align: center;
        }
        h1 {
            color: #4CAF50;
        }
        form {
            margin-top: 20px;
        }
        label {
            display: block;
            margin-bottom: 8px;
            font-weight: bold;
        }
        input[type="text"] {
            width: 100%;
            padding: 8px;
            margin-bottom: 20px;
            border: 1px solid #ccc;
            border-radius: 4px;
        }
        button {
            padding: 10px 20px;
            background-color: #4CAF50;
            color: #fff;
            border: none;
            border-radius: 4px;
            cursor: pointer;
        }
        button:hover {
            background-color: #45a049;
        }
    </style>
</head>
<body>
    <div class="container">
        <h1>SpEL Expression Validator</h1>
        <form th:action="@{/validateSpel}" th:object="${spelExpression}" method="post">
            <div>
                <label>Expression:</label>
                <input type="text" th:field="*{expression}" />
            </div>
            <div>
                <button type="submit">Validate</button>
            </div>
        </form>
    </div>
</body>
</html>

result.html

<!DOCTYPE html>
<html xmlns:th="http://www.thymeleaf.org">
<head>
    <title>Validation Result</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            background-color: #f4f4f9;
            color: #333;
            margin: 0;
            padding: 0;
            display: flex;
            justify-content: center;
            align-items: center;
            height: 100vh;
        }
        .container {
            background-color: #fff;
            padding: 20px;
            border-radius: 8px;
            box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
            text-align: center;
        }
        h1 {
            color: #4CAF50;
        }
        p {
            font-size: 18px;
        }
        a {
            display: inline-block;
            margin-top: 20px;
            padding: 10px 20px;
            background-color: #4CAF50;
            color: #fff;
            text-decoration: none;
            border-radius: 4px;
        }
        a:hover {
            background-color: #45a049;
        }
    </style>
</head>
<body>
    <div class="container">
        <h1>Validation Result</h1>
        <p th:text="${message}"></p>
        <a href="/spelForm">Back to Form</a>
    </div>
</body>
</html>

Step 6: Run the Application

Now, it’s time to run your Spring Boot application. To test the SpEL validator, navigate to http://localhost:8080/spelForm in your browser.

For Valid Expression

[Screenshots: Expression Validator form and Expression Validator result]

For Invalid Expression

[Screenshots: Expression Validator form and Expression Validator result]

Conclusion

By following this guide, you have implemented a SpEL validator in your Spring Boot application. Note that the validator only parses the expression, so it confirms that the syntax is valid without evaluating it against any object. Keep exploring SpEL for more dynamic and sophisticated solutions. Happy coding!

How to Remove Strikethrough Text from PDFs Using Python
Published: Tue, 14 Jan 2025 14:06:33 +0000
https://blogs.perficient.com/2025/01/14/how-to-remove-strikethrough-text-from-pdfs-using-python/

In this blog post, I will share my journey of developing a Python-based solution to remove strikethrough text from PDFs. This solution is specifically designed for PDFs where strikethrough is applied as a style rather than an annotation.

The Challenge

Strikethrough text in PDFs can be tricky to handle, especially when it is applied as a style. Standard PDF manipulation libraries often fall short in these cases. Determined to find a solution, I used Python to build a practical approach.

The Solution

The solution involves three main steps: converting the PDF to a DOCX file, removing the strikethrough text from the DOCX file, and converting the modified DOCX file back to a PDF.

Dependencies

Before diving into the code, install the necessary Python dependencies. You will need:
• pdf2docx for converting PDF to DOCX
• python-docx for manipulating DOCX files
• docx2pdf for converting DOCX back to PDF

You can install these dependencies using pip:

pip install pdf2docx python-docx docx2pdf

Step-by-Step Guide to Remove Strikethrough Text from PDFs

Step 1: Convert PDF to DOCX

The first step is to convert the PDF file to a DOCX file. This allows us to manipulate the text more easily. We use the pdf2docx library for this conversion. Here is the code for the conversion function:

import sys

from pdf2docx import Converter


def convert_pdf_to_word(pdf_file, docx_file):
    """Convert PDF to DOCX format."""
    try:
        cv = Converter(pdf_file)
        cv.convert(docx_file, start=0, end=None)
        cv.close()
        print(f"Converted PDF to DOCX: {pdf_file} -> {docx_file}")
    except Exception as e:
        print(f"Error during PDF to DOCX conversion: {e}")
        sys.exit(1)

In this function, we create an instance of the Converter class, passing the pdf_file as an argument. The convert method of the Converter class is called to perform the conversion, and the close method is called to release any resources the converter uses. If the conversion is successful, a message is printed indicating the conversion. If an error occurs, an exception is caught, and an error message is printed.

Step 2: Remove Strikethrough Text

Once we have the DOCX file, we can remove the strikethrough text. This step involves iterating through the paragraphs and runs in the DOCX file and checking for the strikethrough style. We use the python-docx library for this task. Here is the code for the strikethrough removal function:

import sys

from docx import Document


def remove_strikethrough_text(docx_file):
    """Remove all strikethrough text from a DOCX file."""
    try:
        document = Document(docx_file)
        modified = False
        for paragraph in document.paragraphs:
            for run in paragraph.runs:
                if run.font.strike:
                    print(f"Removing strikethrough text: {run.text}")
                    run.text = ''
                    modified = True
        if modified:
            modified_docx_file = docx_file.replace('.docx', '_modified.docx')
            document.save(modified_docx_file)
            print(f"Strikethrough text removed. Saved to: {modified_docx_file}")
            return modified_docx_file
        else:
            print("No strikethrough text found.")
            return docx_file
    except Exception as e:
        print(f"Error during strikethrough text removal: {e}")
        sys.exit(1)

In this function, we create an instance of the Document class, passing the docx_file as an argument. We iterate through each paragraph in the document and then through each run within the paragraph. If the strike attribute of the run’s font is True, we print a message indicating the removal of the strikethrough text and set the run’s text to an empty string. If any strikethrough text was removed, we save the modified document to a new file with _modified appended to the original filename. If no strikethrough text was found, we return the original DOCX file. Note that document.paragraphs covers only the paragraphs in the document body, so text inside tables is not visited by this loop.

Step 3: Convert DOCX Back to PDF

The final step is to convert the modified DOCX file back to a PDF file. This ensures that the strikethrough text is removed in the final PDF. We use the docx2pdf library for this conversion. Here is the code for the conversion function:

import sys

from docx2pdf import convert


def convert_docx_to_pdf(docx_file, output_pdf):
    """Convert DOCX back to PDF format."""
    try:
        convert(docx_file, output_pdf)
        print(f"Converted DOCX to PDF: {docx_file} -> {output_pdf}")
    except Exception as e:
        print(f"Error during DOCX to PDF conversion: {e}")
        sys.exit(1)

In this function, we call the convert function, passing the docx_file and output_pdf as arguments to perform the conversion. If the conversion is successful, a message is printed indicating the conversion. If an error occurs, the exception is caught, an error message is printed, and the script exits. Note that docx2pdf relies on Microsoft Word being installed, so this step runs on Windows or macOS.

Main Execution Block

The following block of code is the main execution section of the script. It starts by checking if the script is being run directly. It then verifies that the correct number of command-line arguments is provided and that the specified PDF file exists. If these conditions are met, the script defines intermediate file paths and performs the three main steps: converting the PDF to a DOCX file, removing strikethrough text from the DOCX file, and converting the modified DOCX back to a PDF. After completing these steps, it prints the location of the modified PDF file and cleans up any intermediate files. If errors occur during execution, they are caught and printed, and the script exits gracefully.

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: python {sys.argv[0]} <pdf_file_path>")
        sys.exit(1)

    pdf_file = sys.argv[1]

    if not os.path.exists(pdf_file):
        print(f"Error: File not found - {pdf_file}")
        sys.exit(1)

    try:
        # Define intermediate file paths
        base_name = os.path.splitext(pdf_file)[0]
        temp_docx_file = f"{base_name}.docx"
        modified_docx_file = f"{base_name}_modified.docx"
        output_pdf_file = f"{base_name}_modified.pdf"

        # Step 1: Convert PDF to DOCX
        convert_pdf_to_word(pdf_file, temp_docx_file)

        # Step 2: Remove strikethrough text
        final_docx_file = remove_strikethrough_text(temp_docx_file)

        # Step 3: Convert modified DOCX back to PDF
        convert_docx_to_pdf(final_docx_file, output_pdf_file)

        print(f"Modified PDF saved to: {output_pdf_file}")

        # Clean up intermediate DOCX files
        if os.path.exists(temp_docx_file):
            os.remove(temp_docx_file)
        if final_docx_file != temp_docx_file and os.path.exists(final_docx_file):
            os.remove(final_docx_file)

    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)

Complete Script

import sys
import os
from pdf2docx import Converter
from docx import Document
from docx2pdf import convert

def convert_pdf_to_word(pdf_file, docx_file):
    """Convert PDF to DOCX format."""
    try:
        cv = Converter(pdf_file)
        cv.convert(docx_file, start=0, end=None)
        cv.close()
        print(f"Converted PDF to DOCX: {pdf_file} -> {docx_file}")
    except Exception as e:
        print(f"Error during PDF to DOCX conversion: {e}")
        sys.exit(1)

def remove_strikethrough_text(docx_file):
    """Remove all strikethrough text from a DOCX file."""
    try:
        document = Document(docx_file)
        modified = False

        for paragraph in document.paragraphs:
            for run in paragraph.runs:
                if run.font.strike:
                    print(f"Removing strikethrough text: {run.text}")
                    run.text = ''
                    modified = True

        if modified:
            modified_docx_file = docx_file.replace('.docx', '_modified.docx')
            document.save(modified_docx_file)
            print(f"Strikethrough text removed. Saved to: {modified_docx_file}")
            return modified_docx_file
        else:
            print("No strikethrough text found.")
            return docx_file
    except Exception as e:
        print(f"Error during strikethrough text removal: {e}")
        sys.exit(1)

def convert_docx_to_pdf(docx_file, output_pdf):
    """Convert DOCX back to PDF format."""
    try:
        convert(docx_file, output_pdf)
        print(f"Converted DOCX to PDF: {docx_file} -> {output_pdf}")
    except Exception as e:
        print(f"Error during DOCX to PDF conversion: {e}")
        sys.exit(1)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: python {sys.argv[0]} <pdf_file_path>")
        sys.exit(1)

    pdf_file = sys.argv[1]

    if not os.path.exists(pdf_file):
        print(f"Error: File not found - {pdf_file}")
        sys.exit(1)

    try:
        # Define intermediate file paths
        base_name = os.path.splitext(pdf_file)[0]
        temp_docx_file = f"{base_name}.docx"
        modified_docx_file = f"{base_name}_modified.docx"
        output_pdf_file = f"{base_name}_modified.pdf"

        # Step 1: Convert PDF to DOCX
        convert_pdf_to_word(pdf_file, temp_docx_file)

        # Step 2: Remove strikethrough text
        final_docx_file = remove_strikethrough_text(temp_docx_file)

        # Step 3: Convert modified DOCX back to PDF
        convert_docx_to_pdf(final_docx_file, output_pdf_file)

        print(f"Modified PDF saved to: {output_pdf_file}")

        # Clean up intermediate DOCX files
        if os.path.exists(temp_docx_file):
            os.remove(temp_docx_file)
        if final_docx_file != temp_docx_file and os.path.exists(final_docx_file):
            os.remove(final_docx_file)

    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)

Run the Script

Execute the script by running the following command, replacing <script_name> with the name of your script file and <pdf_file_path> with the path to your PDF file:

python <script_name>.py <pdf_file_path>


Conclusion

This Python-based solution removes strikethrough text from PDFs by combining the pdf2docx, python-docx, and docx2pdf libraries. It converts the PDF to DOCX, deletes the runs styled as strikethrough, and converts the result back to PDF. Because the document passes through two format conversions, layout in complex PDFs may shift, so it is worth reviewing the output. For PDFs where strikethrough is applied as a style, this gives a practical way to produce clean, professional documents.
