
Translating Different Content Types Using Helsinki-NLP ML Model from Hugging Face


Introduction

In Hugging Face, a translation model is a pre-trained deep learning model that can be used for machine translation tasks. These models are pre-trained on large amounts of multilingual data and fine-tuned on translation-specific datasets.

To use a translation model in Hugging Face, we typically load the model using the from_pretrained() function, which fetches the pre-trained weights and configuration. Then, we can use the model to translate text by passing the source language text as input and obtaining the translated text as output.

Hugging Face’s translation models are implemented in the Transformers library, which is a popular open-source library for natural language processing (NLP) tasks. The library provides a unified interface and a set of powerful tools for working with various NLP models, including translation models.

Let’s start by implementing a translation model using the Helsinki-NLP model from Hugging Face:

  1. Install the necessary libraries: Install the transformers library, which includes the translation models, using pip:
    pip install transformers
  2. Load the translation model: Use the from_pretrained() function to load a pre-trained translation model. We need to specify the model's name or identifier. For example, to load the English-to-French translation model, we can use the following code.
    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-en-fr"
    model = MarianMTModel.from_pretrained(model_name)
    tokenizer = MarianTokenizer.from_pretrained(model_name)

  3. Tokenize the input: Before translating the text, we need to tokenize it using the appropriate tokenizer. The tokenizer splits the text into smaller units, such as words or subwords, that the model understands.
    source_text = "Translate this English text to French."
    encoded_input = tokenizer.encode(source_text, return_tensors="pt")

  4. Translate the text: Pass the encoded input to the translation model to obtain the translated output.
    translated_output = model.generate(encoded_input)

  5. Decode the output: Decode the generated tokens back into readable text using the tokenizer, skipping special tokens.
    translated_text = tokenizer.decode(translated_output[0], skip_special_tokens=True)


    These steps provide a basic outline of implementing a translation model in Hugging Face.
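The last three steps can be folded into a single helper function. This is a minimal sketch (the translate name is ours, not part of the Transformers API); it works with any model/tokenizer pair loaded as shown in step 2:

```python
def translate(text, model, tokenizer):
    # Step 3: tokenize the source text into model-ready tensors
    encoded_input = tokenizer.encode(text, return_tensors="pt")
    # Step 4: generate the translated token ids
    translated_output = model.generate(encoded_input)
    # Step 5: decode the first sequence back into a string
    return tokenizer.decode(translated_output[0], skip_special_tokens=True)
```

For example, translate("Translate this English text to French.", model, tokenizer) returns the French translation when called with the Marian model and tokenizer from step 2.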

Handling Metadata, HTML Body, and Plain Text

With the fundamentals of the Helsinki-NLP Hugging Face model in hand, let us get started translating various forms of content, including plain text, HTML body content, and metadata.

It is essential to determine the type of content we will be dealing with before we start translating.

Methods used to differentiate between plain text, HTML, and Metadata are as follows:

  • is_plain_text(content): Determines whether the content is plain text by checking for the absence of HTML tags and Python string delimiters.
  • is_html_content(content): Checks for the presence of an html tag to identify HTML content.
  • is_python_string(content): Recognizes metadata expressed as a Python string based on specific delimiters.
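These checks reduce to simple string tests. Here is a minimal sketch (the same definitions appear in the consolidated script later in this post); note that the html check is deliberately naive, so an HTML fragment without an html tag will be classified as plain text:

```python
def is_html_content(content):
    # Naive check: look for an <html> tag anywhere in the content
    return "<html>" in content.lower()

def is_python_string(content):
    # Metadata arrives as a quoted string or a dict-like literal
    return (content.startswith("'") and content.endswith("'")) or \
           (content.startswith('"') and content.endswith('"')) or \
           (content.startswith("{") and content.endswith("}"))

def is_plain_text(content):
    # Anything that is neither HTML nor a Python string literal
    return not is_html_content(content) and not is_python_string(content)
```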

The following approaches demonstrate the translation of the different content types:

  1. Translating Metadata Content: Metadata often consists of structured data in the form of key-value pairs, such as name, title, etc. This function translates just the values of the metadata object while leaving the keys as they are:
    def translate_metadata_content(metadata, model, tokenizer, fields_to_translate):
        translated_metadata = {}
        # Loop through each field and perform the translation process
        for key, value in metadata.items():
            # Translate the value if it is a string and included in fields_to_translate
            if isinstance(value, str) and key in fields_to_translate:
                value_tokens = tokenizer.encode(value, return_tensors='pt')
                translated_value_tokens = model.generate(value_tokens, max_length=100)
                translated_value = tokenizer.decode(translated_value_tokens[0], skip_special_tokens=True)
            else:
                translated_value = value

            translated_metadata[key] = translated_value

        return json.dumps(translated_metadata)

  2. Translating Plain Text Content: Plain text translation is straightforward. To translate plain text from one language to another, we'll use our translation model:
    def translate_plainText(content, model, tokenizer):
        # Tokenize the plain text content
        encoded = tokenizer(content, return_tensors="pt", padding=True, truncation=True)

        # Translate the text
        translated_tokens = model.generate(**encoded, max_length=1024, num_beams=4, early_stopping=True)
        return tokenizer.decode(translated_tokens[0], skip_special_tokens=True)

  3. Translating HTML Body Content: Because of the markup, HTML body content requires special handling. This method translates the text within an HTML body:
    def translate_html_content(content, model, tokenizer):
        # Parse the HTML content
        soup = BeautifulSoup(content, 'html.parser')

        # Translate the text
        translated_text = model.generate(**tokenizer(content, return_tensors="pt", padding=True, truncation=True),
                                         max_length=1024, num_beams=4, early_stopping=True)
        translated_text = tokenizer.decode(translated_text[0], skip_special_tokens=True)

        # Create a new soup with the translated text
        new_soup = BeautifulSoup(translated_text, 'html.parser')

        # Replace the text in the original HTML structure
        for original_tag, translated_tag in zip(soup.find_all(), new_soup.find_all()):
            if original_tag.string:
                original_tag.string = translated_tag.get_text()
        return soup.prettify()


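Feeding whole markup through the model, as in the HTML approach above, can garble tags on longer documents. A more conservative pattern is to translate each text segment between tags separately and leave the tags untouched. Here is a minimal standard-library sketch; the translate_fn parameter is a stand-in for a call such as translate_plainText(segment, model, tokenizer):

```python
import re

def translate_html_segments(html, translate_fn):
    # Split into alternating text / tag segments; tags match <...>
    parts = re.split(r'(<[^>]+>)', html)
    translated = []
    for part in parts:
        if part.startswith('<') or not part.strip():
            # Leave tags and whitespace-only segments untouched
            translated.append(part)
        else:
            translated.append(translate_fn(part))
    return ''.join(translated)
```

Calling translate_html_segments(html_content, lambda s: translate_plainText(s, model, tokenizer)) translates the visible text while preserving the original tags and attributes exactly.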
Putting it All Together

To bring everything together, we provide a central function that dispatches to the appropriate translation method according to the content type:

import sys
import subprocess
import json
import sacremoses
from transformers import MarianMTModel, MarianTokenizer


# Install necessary packages if not already installed
try:
    import transformers
    import sacremoses
except ImportError:
    subprocess.check_call(['pip', 'install', 'torch', 'transformers', 'sacremoses'])
    import transformers
    import sacremoses

from transformers import MarianMTModel, MarianTokenizer
from bs4 import BeautifulSoup



def translate_content(content):
    # Load the translation model and tokenizer
    model_name = 'Helsinki-NLP/opus-mt-en-fr'
    model = MarianMTModel.from_pretrained(model_name)
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    tokenizer.src_tokenizer = sacremoses.MosesTokenizer()
    tokenizer.tgt_tokenizer = sacremoses.MosesTokenizer()

    # Check if the input is HTML or plain text or metadata
    if is_html_content(content):
        translated_content = translate_html_content(content,model,tokenizer)     
    elif is_python_string(content):
        print("Content is a Python string expression.")
        fields_to_translate=['title','name']
        content = content.replace("null", "None")
        metadata = eval(content)  # note: eval is unsafe for untrusted input; ast.literal_eval is safer
        translated_content = translate_metadata_content(metadata,model,tokenizer,fields_to_translate)
    elif is_plain_text(content):
        translated_content = translate_plainText(content,model,tokenizer)
        
    return translated_content          
       
def translate_metadata_content(metadata, model, tokenizer, fields_to_translate):
    # Utilize the code snippet from the first point above to translate metadata values.
    ...

def translate_plainText(content, model, tokenizer):
    # Utilize the code snippet from the second point above to translate plain text.
    ...

def translate_html_content(content, model, tokenizer):
    # Utilize the code snippet from the third point above to translate HTML content.
    ...

def is_html_content(content):
    return "<html>" in content.lower()

def is_plain_text(content):
    return "<html>" not in content.lower() and not is_python_string(content)
    
def is_python_string(content):
    return (content.startswith("'") and content.endswith("'")) or \
           (content.startswith('"') and content.endswith('"')) or \
           (content.startswith("{") and content.endswith("}"))
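To run the script from the command line as suggested below, the content has to reach translate_content somehow. A minimal sketch of an entry point that reads the content from a file named on the command line (the main name and file-argument convention are ours, not part of the original script):

```python
import sys

def main(argv, translate_fn):
    # Read the content file named on the command line and translate it
    path = argv[1]
    with open(path, encoding="utf-8") as f:
        content = f.read()
    return translate_fn(content)
```

Guarded by if __name__ == "__main__": print(main(sys.argv, translate_content)), the script then accepts a path to a file containing the HTML, plain text, or metadata to translate.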

Example Usage

Here are examples of using the provided functions with different content types.

# Example usage with HTML content:
html_content = """
<html>
<head>
    <title>Example HTML</title>
</head>
<body>
    <h1>Hello, world!</h1>
    <p>This is a sample HTML content to be translated.</p>
</body>
</html>
"""
translated_html = translate_content(html_content)
print(translated_html)


# Example usage with plain text:
plain_text = "plain text content for testing translation functionality "

translated_text = translate_content(plain_text)
print(translated_text)


# Example usage with metadata
metadata = "{'title': 'title for testing translation of metadata value'}"


translated_metadata = translate_content(metadata)
print(translated_metadata)

Run this script separately using the command below.

python your_file_name.py

Conclusion:

The Helsinki-NLP model from Hugging Face is a powerful tool for translating different types of content, including plain text, HTML body content, and metadata. Using the pre-trained models in the Transformers library, we can easily translate text from one language to another.



Kanchan Bawane

Kanchan is a Technical Consultant at Perficient with a keen interest in various technologies and working for communities. She is enthusiastic about sharing her knowledge, viewpoints, and experiences with others. She has also delivered various Coveo solutions utilizing different frameworks.
