Translating Different Content Types Using Helsinki-NLP ML Model from Hugging Face

Introduction

In Hugging Face, a translation model is a pre-trained deep learning model that can be used for machine translation tasks. These models are pre-trained on large amounts of multilingual data and fine-tuned on translation-specific datasets.

To use a translation model in Hugging Face, we typically load the model using the from_pretrained() function, which fetches the pre-trained weights and configuration. Then, we can use the model to translate text by passing the source language text as input and obtaining the translated text as output.

Hugging Face’s translation models are implemented in the Transformers library, which is a popular open-source library for natural language processing (NLP) tasks. The library provides a unified interface and a set of powerful tools for working with various NLP models, including translation models.

Let’s start by implementing a translation model using the Helsinki-NLP model from Hugging Face:

  1. Install the necessary libraries: Use pip to install the transformers library, which includes the translation models, together with torch and sentencepiece, which the Marian model and tokenizer depend on.
    pip install transformers torch sentencepiece
  2. Load the translation model: Use the from_pretrained() function to load a pre-trained translation model. We need to specify the model’s name or identifier. For example, to load the English-to-French translation model, we can use the following code.
    from transformers import MarianMTModel, MarianTokenizer
    
    model_name = "Helsinki-NLP/opus-mt-en-fr"
    model = MarianMTModel.from_pretrained(model_name)
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    
  3. Tokenize the input: Before translating the text, we need to tokenize it using the appropriate tokenizer. The tokenizer splits the text into smaller units, such as words or subwords, that the model understands.
    source_text = "Translate this English text to French."
    encoded_input = tokenizer.encode(source_text, return_tensors="pt")
    
  4. Translate the text: Pass the encoded input to the translation model to obtain the translated output.
    translated_output = model.generate(encoded_input)
    
  5. Decode the output: Use the tokenizer to convert the generated token IDs back into readable text.
    translated_text = tokenizer.decode(translated_output[0], skip_special_tokens=True)
    

    These steps provide a basic outline of implementing a translation model in Hugging Face.
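
Helsinki-NLP publishes its OPUS-MT checkpoints on the Hugging Face Hub under names that follow the pattern Helsinki-NLP/opus-mt-{source}-{target}, so switching language pairs mostly comes down to changing the model name passed to from_pretrained(). The helper below is a minimal sketch of that idea. The name opus_mt_model_name is hypothetical and introduced here for illustration, and not every language pair has a published checkpoint, so the returned name should be checked on the Hub before loading it.

```python
def opus_mt_model_name(source_lang, target_lang):
    """Build an OPUS-MT model identifier such as 'Helsinki-NLP/opus-mt-en-fr'.

    Hypothetical helper: it only formats the name and does not check
    that a checkpoint for this language pair exists on the Hub.
    """
    source = source_lang.strip()
    target = target_lang.strip()
    if not source or not target:
        raise ValueError("Both source and target language codes are required")
    return f"Helsinki-NLP/opus-mt-{source}-{target}"


print(opus_mt_model_name("en", "fr"))  # Helsinki-NLP/opus-mt-en-fr
```

The resulting string can be used as model_name in step 2 above.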

Handling Metadata, HTML Body, and Plain Text

With the fundamentals of the Helsinki-NLP Hugging Face model in hand, let us get started by translating various forms of content, including plain text, HTML body content, and metadata.

It is essential to determine the type of content we will be dealing with before we start translating.

The following functions differentiate between plain text, HTML, and metadata:

  • is_plain_text(content): Returns true when the content contains neither an HTML tag nor Python string delimiters.
  • is_html_content(content): Detects HTML content by checking for the html tag.
  • is_python_string(content): Recognizes metadata passed as a Python string expression by its delimiters (quotes or curly braces).
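
As a rough illustration of the same idea, the three checks can be folded into a single classifier built on the standard library. This is a sketch under assumptions rather than the functions used in this post: detect_content_type and HTML_TAG are hypothetical names, metadata is assumed to arrive as a Python dict literal, and any markup tag (not only a full html document) counts as HTML.

```python
import ast
import re

# Matches an opening or closing tag such as <p>, </div> or <html lang="en">
HTML_TAG = re.compile(r"</?[a-zA-Z][^<>]*>")


def detect_content_type(content):
    """Return 'metadata', 'html' or 'plain_text' for a piece of content."""
    text = content.strip()

    # Metadata: a string that parses as a Python dict literal
    if text.startswith("{") and text.endswith("}"):
        try:
            if isinstance(ast.literal_eval(text), dict):
                return "metadata"
        except (ValueError, SyntaxError, TypeError):
            pass  # Not a valid literal, fall through to the other checks

    # HTML: any markup tag, not only a complete <html> document
    if HTML_TAG.search(text):
        return "html"

    return "plain_text"


print(detect_content_type("<p>Hello</p>"))        # html
print(detect_content_type("{'title': 'Hello'}"))  # metadata
print(detect_content_type("Hello, world"))        # plain_text
```

Parsing the candidate metadata with ast.literal_eval means a string that merely starts and ends with braces is not mistaken for metadata.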

The following approaches demonstrate the translation of each content type:

  1. Translating Metadata Content: Metadata often consists of structured data in the form of key-value pairs such as name and title. This function translates only the values of the selected fields and leaves the keys unchanged:
    def translate_metadata_content(metadata, model, tokenizer, fields_to_translate):
        translated_metadata = {}
        # Loop through each field and translate it where requested
        for key, value in metadata.items():
            # Translate the value if it is a string and included in fields_to_translate
            if isinstance(value, str) and key in fields_to_translate:
                value_tokens = tokenizer.encode(value, return_tensors='pt')
                translated_value_tokens = model.generate(value_tokens, max_length=100)
                translated_value = tokenizer.decode(translated_value_tokens[0], skip_special_tokens=True)
            else:
                # Leave non-string values and unselected fields unchanged
                translated_value = value

            translated_metadata[key] = translated_value

        # Serialize to a JSON string (requires "import json"); ensure_ascii=False
        # keeps accented characters readable instead of escaping them
        return json.dumps(translated_metadata, ensure_ascii=False)
    
  2. Translating Plain Text Content: Plain text is the simplest case. We tokenize the text, pass it to the translation model, and decode the result:
    def translate_plainText(content, model, tokenizer):
        # Tokenize the plain text content (input longer than the model limit is truncated)
        encoded = tokenizer(content, return_tensors="pt", padding=True, truncation=True)

        # Translate the text using beam search
        translated_tokens = model.generate(**encoded, max_length=1024, num_beams=4, early_stopping=True)
        return tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
    
  3. Translating HTML Body Content: Because of the markup, HTML body content needs extra processing. This function translates the text and writes it back into the original tag structure:
    def translate_html_content(content, model, tokenizer):
        # Parse the original HTML so that its tag structure can be reused
        soup = BeautifulSoup(content, 'html.parser')

        # Translate the whole document in one pass (truncation=True cuts off overly long input)
        encoded = tokenizer(content, return_tensors="pt", padding=True, truncation=True)
        translated_tokens = model.generate(**encoded, max_length=1024, num_beams=4, early_stopping=True)
        translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)

        # Parse the translated output as HTML as well
        new_soup = BeautifulSoup(translated_text, 'html.parser')

        # Copy the translated text back into the original structure, tag by tag;
        # this relies on the model returning the same tags in the same order
        for original_tag, translated_tag in zip(soup.find_all(), new_soup.find_all()):
            if original_tag.string:
                original_tag.string = translated_tag.get_text()
        return soup.prettify()
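
The HTML function above sends the markup itself through the model. An alternative worth considering is to translate each text node separately so that tags and attributes never reach the model. The sketch below illustrates that alternative with the standard library only. It is not the approach used in this post: TextNodeTranslator and translate_html_text_nodes are hypothetical names, and translate stands for any callable that maps a string to its translation, for example a small wrapper around translate_plainText.

```python
from html import escape
from html.parser import HTMLParser


class TextNodeTranslator(HTMLParser):
    """Re-emit HTML while translating only the visible text nodes."""

    RAW_TEXT_TAGS = {"script", "style"}  # content that must not be translated

    def __init__(self, translate):
        super().__init__(convert_charrefs=True)
        self.translate = translate  # callable: str -> str
        self.parts = []
        self.raw_depth = 0

    def handle_starttag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())  # keep attributes verbatim
        if tag in self.RAW_TEXT_TAGS:
            self.raw_depth += 1

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())  # self-closing tag, e.g. <br/>

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")
        if tag in self.RAW_TEXT_TAGS and self.raw_depth:
            self.raw_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if self.raw_depth or not stripped:
            self.parts.append(data)  # whitespace, scripts and styles pass through
            return
        # Preserve surrounding whitespace and translate only the text itself
        lead = data[: len(data) - len(data.lstrip())]
        trail = data[len(data.rstrip()):]
        self.parts.append(lead + escape(self.translate(stripped), quote=False) + trail)

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def translate_html_text_nodes(html, translate):
    parser = TextNodeTranslator(translate)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# str.upper stands in for a real translation function
print(translate_html_text_nodes('<p class="x">Hello <b>world</b></p>', str.upper))
# <p class="x">HELLO <b>WORLD</b></p>
```

The trade-off is that each text node is translated without its neighbours, so a sentence split across inline tags loses some context.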
    

Putting it All Together

To bring everything together, a central function chooses the translation method according to the content type:

import ast
import json
import subprocess
import sys

# Install the third-party packages if they are not already available
try:
    import sacremoses
    import transformers
    from bs4 import BeautifulSoup
except ImportError:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'torch', 'transformers',
                           'sentencepiece', 'sacremoses', 'beautifulsoup4'])
    import sacremoses
    import transformers
    from bs4 import BeautifulSoup

from transformers import MarianMTModel, MarianTokenizer


def translate_content(content):
    # Load the translation model and tokenizer
    model_name = 'Helsinki-NLP/opus-mt-en-fr'
    model = MarianMTModel.from_pretrained(model_name)
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    tokenizer.src_tokenizer = sacremoses.MosesTokenizer()
    tokenizer.tgt_tokenizer = sacremoses.MosesTokenizer()

    # Check if the input is HTML or plain text or metadata
    if is_html_content(content):
        translated_content = translate_html_content(content, model, tokenizer)
    elif is_python_string(content):
        print("Content is a Python string expression.")
        fields_to_translate = ['title', 'name']
        content = content.replace("null", "None")
        # literal_eval only parses Python literals, so unlike eval it cannot run arbitrary code
        metadata = ast.literal_eval(content)
        if not isinstance(metadata, dict):
            # A quoted string is not key-value metadata, so translate it as text
            return translate_plainText(str(metadata), model, tokenizer)
        translated_content = translate_metadata_content(metadata, model, tokenizer, fields_to_translate)
    elif is_plain_text(content):
        translated_content = translate_plainText(content, model, tokenizer)

    return translated_content

       
def translate_metadata_content(metadata, model, tokenizer, fields_to_translate):
    # Paste the body of the first snippet above here to translate metadata values.
    ...

def translate_plainText(content, model, tokenizer):
    # Paste the body of the second snippet above here to translate plain text.
    ...

def translate_html_content(content, model, tokenizer):
    # Paste the body of the third snippet above here to translate HTML content.
    ...

def is_html_content(content):
    # Match "<html" rather than "<html>" so that tags with attributes,
    # such as <html lang="en">, are detected as well
    return "<html" in content.lower()

def is_plain_text(content):
    return not is_html_content(content) and not is_python_string(content)

def is_python_string(content):
    return (content.startswith("'") and content.endswith("'")) or \
           (content.startswith('"') and content.endswith('"')) or \
           (content.startswith("{") and content.endswith("}"))
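
One design point worth noting: translate_content loads the model and tokenizer on every call, which is slow when many pieces of content are translated in a row. A possible refinement is to cache the loaded pair per model name. The sketch below shows the caching pattern only. get_translator and load_calls are hypothetical names, and the placeholder strings stand in for the real from_pretrained() calls so that the example runs without downloading a model.

```python
from functools import lru_cache

load_calls = []  # records each real load, only to make the caching visible


@lru_cache(maxsize=None)
def get_translator(model_name):
    """Load the model and tokenizer for model_name once and reuse them."""
    load_calls.append(model_name)
    # Placeholder for the expensive step. In the script above this would be:
    #   model = MarianMTModel.from_pretrained(model_name)
    #   tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = f"model:{model_name}"
    tokenizer = f"tokenizer:{model_name}"
    return model, tokenizer


first = get_translator("Helsinki-NLP/opus-mt-en-fr")
second = get_translator("Helsinki-NLP/opus-mt-en-fr")
print(first is second)  # True: the second call reuses the cached pair
print(len(load_calls))  # 1: the loader body ran only once
```

With this in place, translate_content could call get_translator(model_name) instead of loading the model itself.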

Example Usage

Here are examples of using the provided functions with different content types.

# Example usage with HTML content:
html_content = """
<html>
<head>
    <title>Example HTML</title>
</head>
<body>
    <h1>Hello, world!</h1>
    <p>This is a sample HTML content to be translated.</p>
</body>
</html>
"""
translated_html = translate_content(html_content)
print(translated_html)


# Example usage with plain text:
plain_text = "plain text content for testing translation functionality "

translated_text = translate_content(plain_text)
print(translated_text)


# Example usage with metadata
metadata ="{'title':'title for testing translation of metadata value'}"


translated_metadata = translate_content(metadata)
print(translated_metadata)

Save the complete script to a file and run it with the following command:

python your_file_name.py
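
One limitation to keep in mind is that translate_plainText tokenizes with truncation=True, so input longer than the model’s maximum input length is cut off rather than translated. A common workaround is to split long text into smaller chunks and translate them one at a time. The following is a minimal sketch under assumptions: split_into_chunks and translate_long_text are hypothetical names, the sentence split is deliberately naive, the character budget is only a rough stand-in for the model’s token limit, and translate is any callable that maps a string to its translation.

```python
import re


def split_into_chunks(text, max_chars=400):
    """Group sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # the current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate_long_text(text, translate, max_chars=400):
    """Translate text chunk by chunk and join the translated chunks."""
    return " ".join(translate(chunk) for chunk in split_into_chunks(text, max_chars))


print(split_into_chunks("One. Two. Three.", max_chars=9))
# ['One. Two.', 'Three.']
```

In the script above, translate could be lambda chunk: translate_plainText(chunk, model, tokenizer).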

Conclusion

The Helsinki-NLP models from Hugging Face are a capable tool for translating different types of content, including plain text, HTML body content, and metadata. With the MarianMTModel and MarianTokenizer classes from the Transformers library, translating text from one language to another takes only a few lines of code.

Kanchan Bawane

Kanchan is a Technical Consultant at Perficient with a keen interest in various technologies and in working for communities. She is enthusiastic about sharing her knowledge, viewpoints, and experiences with others. She has also delivered various Coveo solutions utilizing different frameworks.
