
Translating Different Content Types Using Helsinki-NLP ML Model from Hugging Face


Introduction

In Hugging Face, a translation model is a pre-trained deep learning model that can be used for machine translation tasks. These models are pre-trained on large amounts of multilingual data and fine-tuned on translation-specific datasets.

To use a translation model in Hugging Face, we typically load the model using the from_pretrained() function, which fetches the pre-trained weights and configuration. Then, we can use the model to translate text by passing the source language text as input and obtaining the translated text as output.

Hugging Face’s translation models are implemented in the Transformers library, which is a popular open-source library for natural language processing (NLP) tasks. The library provides a unified interface and a set of powerful tools for working with various NLP models, including translation models.

Let’s start by implementing a translation model using the Helsinki-NLP model from Hugging Face:

  1. Install the necessary libraries: Install the transformers library, which includes the translation models, using pip:
    pip install transformers
  2. Load the translation model: Use the from_pretrained() function to load a pre-trained translation model. We need to specify the model's name or identifier. For example, to load the English-to-French translation model, we can use the following code.
    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-en-fr"
    model = MarianMTModel.from_pretrained(model_name)
    tokenizer = MarianTokenizer.from_pretrained(model_name)

  3. Tokenize the input: Before translating the text, we need to tokenize it using the appropriate tokenizer. The tokenizer splits the text into smaller units, such as words or subwords, that the model understands.
    source_text = "Translate this English text to French."
    encoded_input = tokenizer.encode(source_text, return_tensors="pt")

  4. Translate the text: Pass the encoded input to the translation model to obtain the translated output.
    translated_output = model.generate(encoded_input)

  5. Decode the output: Decode the generated tokens back into readable text using the tokenizer, skipping special tokens.
    translated_text = tokenizer.decode(translated_output[0], skip_special_tokens=True)


    These steps provide a basic outline of implementing a translation model in Hugging Face.
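The last three steps can be folded into a single helper function. This is a minimal sketch (the translate name is ours, not part of the Transformers API); it works with any model/tokenizer pair loaded as shown in step 2:

```python
def translate(text, model, tokenizer):
    # Step 3: tokenize the source text into model-ready tensors
    encoded_input = tokenizer.encode(text, return_tensors="pt")
    # Step 4: generate the translated token ids
    translated_output = model.generate(encoded_input)
    # Step 5: decode the first sequence back into a string
    return tokenizer.decode(translated_output[0], skip_special_tokens=True)
```

For example, translate("Translate this English text to French.", model, tokenizer) returns the French translation when called with the Marian model and tokenizer from step 2.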

Handling Metadata, HTML Body, and Plain Text

With the fundamentals of the Helsinki-NLP Hugging Face model in hand, let us get started translating various forms of content, including plain text, HTML body content, and metadata.

It is essential to determine the type of content we will be dealing with before we start translating.

Methods used to differentiate between plain text, HTML, and Metadata are as follows:

  • is_plain_text(content): Determines whether the content is plain text by checking for the absence of HTML tags and Python string delimiters.
  • is_html_content(content): Checks for the presence of an html tag to identify HTML content.
  • is_python_string(content): Recognizes metadata expressed as a Python string based on specific delimiters.
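These checks reduce to simple string tests. Here is a minimal sketch (the same definitions appear in the consolidated script later in this post); note that the html check is deliberately naive, so an HTML fragment without an html tag will be classified as plain text:

```python
def is_html_content(content):
    # Naive check: look for an <html> tag anywhere in the content
    return "<html>" in content.lower()

def is_python_string(content):
    # Metadata arrives as a quoted string or a dict-like literal
    return (content.startswith("'") and content.endswith("'")) or \
           (content.startswith('"') and content.endswith('"')) or \
           (content.startswith("{") and content.endswith("}"))

def is_plain_text(content):
    # Anything that is neither HTML nor a Python string literal
    return not is_html_content(content) and not is_python_string(content)
```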

The following approaches demonstrate the translation of the different content types:

  1. Translating Metadata Content: Metadata often consists of structured data in the form of key-value pairs, such as name, title, etc. This function translates just the values of the metadata object while leaving the keys as they are:
    def translate_metadata_content(metadata, model, tokenizer, fields_to_translate):
        translated_metadata = {}
        # Loop through each field and perform the translation process
        for key, value in metadata.items():
            # Translate the value if it is a string and included in fields_to_translate
            if isinstance(value, str) and key in fields_to_translate:
                value_tokens = tokenizer.encode(value, return_tensors='pt')
                translated_value_tokens = model.generate(value_tokens, max_length=100)
                translated_value = tokenizer.decode(translated_value_tokens[0], skip_special_tokens=True)
            else:
                translated_value = value

            translated_metadata[key] = translated_value

        return json.dumps(translated_metadata)

  2. Translating Plain Text Content: Plain text translation is straightforward. To translate plain text from one language to another, we'll use our translation model:
    def translate_plainText(content, model, tokenizer):
        # Tokenize the plain text content
        encoded = tokenizer(content, return_tensors="pt", padding=True, truncation=True)

        # Translate the text
        translated_tokens = model.generate(**encoded, max_length=1024, num_beams=4, early_stopping=True)
        return tokenizer.decode(translated_tokens[0], skip_special_tokens=True)

  3. Translating HTML Body Content: Because of the markup, HTML body content requires special handling. This method translates the text within an HTML body:
    def translate_html_content(content, model, tokenizer):
        # Parse the HTML content
        soup = BeautifulSoup(content, 'html.parser')

        # Translate the text
        translated_text = model.generate(**tokenizer(content, return_tensors="pt", padding=True, truncation=True),
                                         max_length=1024, num_beams=4, early_stopping=True)
        translated_text = tokenizer.decode(translated_text[0], skip_special_tokens=True)

        # Create a new soup with the translated text
        new_soup = BeautifulSoup(translated_text, 'html.parser')

        # Replace the text in the original HTML structure
        for original_tag, translated_tag in zip(soup.find_all(), new_soup.find_all()):
            if original_tag.string:
                original_tag.string = translated_tag.get_text()
        return soup.prettify()


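Feeding whole markup through the model, as in the HTML approach above, can garble tags on longer documents. A more conservative pattern is to translate each text segment between tags separately and leave the tags untouched. Here is a minimal standard-library sketch; the translate_fn parameter is a stand-in for a call such as translate_plainText(segment, model, tokenizer):

```python
import re

def translate_html_segments(html, translate_fn):
    # Split into alternating text / tag segments; tags match <...>
    parts = re.split(r'(<[^>]+>)', html)
    translated = []
    for part in parts:
        if part.startswith('<') or not part.strip():
            # Leave tags and whitespace-only segments untouched
            translated.append(part)
        else:
            translated.append(translate_fn(part))
    return ''.join(translated)
```

Calling translate_html_segments(html_content, lambda s: translate_plainText(s, model, tokenizer)) translates the visible text while preserving the original tags and attributes exactly.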
Putting it All Together

To bring everything together, we provide a central function that dispatches to the appropriate translation method according to the content type:

import sys
import subprocess
import json
import sacremoses
from transformers import MarianMTModel, MarianTokenizer


# Install necessary packages if not already installed
try:
    import transformers
    import sacremoses
except ImportError:
    subprocess.check_call(['pip', 'install', 'torch', 'transformers', 'sacremoses'])
    import transformers
    import sacremoses

from transformers import MarianMTModel, MarianTokenizer
from bs4 import BeautifulSoup



def translate_content(content):
    # Load the translation model and tokenizer
    model_name = 'Helsinki-NLP/opus-mt-en-fr'
    model = MarianMTModel.from_pretrained(model_name)
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    tokenizer.src_tokenizer = sacremoses.MosesTokenizer()
    tokenizer.tgt_tokenizer = sacremoses.MosesTokenizer()

    # Check if the input is HTML or plain text or metadata
    if is_html_content(content):
        translated_content = translate_html_content(content,model,tokenizer)     
    elif is_python_string(content):
        print("Content is a Python string expression.")
        fields_to_translate=['title','name']
        content = content.replace("null", "None")
        metadata = eval(content)  # note: eval is unsafe for untrusted input; ast.literal_eval is safer
        translated_content = translate_metadata_content(metadata,model,tokenizer,fields_to_translate)
    elif is_plain_text(content):
        translated_content = translate_plainText(content,model,tokenizer)
        
    return translated_content          
       
def translate_metadata_content(metadata, model, tokenizer, fields_to_translate):
    # Utilize the code snippet from the first point above to translate metadata values.
    ...

def translate_plainText(content, model, tokenizer):
    # Utilize the code snippet from the second point above to translate plain text.
    ...

def translate_html_content(content, model, tokenizer):
    # Utilize the code snippet from the third point above to translate HTML content.
    ...

def is_html_content(content):
    return "<html>" in content.lower()

def is_plain_text(content):
    return "<html>" not in content.lower() and not is_python_string(content)
    
def is_python_string(content):
    return (content.startswith("'") and content.endswith("'")) or \
           (content.startswith('"') and content.endswith('"')) or \
           (content.startswith("{") and content.endswith("}"))
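To run the script from the command line as suggested below, the content has to reach translate_content somehow. A minimal sketch of an entry point that reads the content from a file named on the command line (the main name and file-argument convention are ours, not part of the original script):

```python
import sys

def main(argv, translate_fn):
    # Read the content file named on the command line and translate it
    path = argv[1]
    with open(path, encoding="utf-8") as f:
        content = f.read()
    return translate_fn(content)
```

Guarded by if __name__ == "__main__": print(main(sys.argv, translate_content)), the script then accepts a path to a file containing the HTML, plain text, or metadata to translate.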

Example Usage

Here are examples of using the provided functions with different content types.

# Example usage with HTML content:
html_content = """
<html>
<head>
    <title>Example HTML</title>
</head>
<body>
    <h1>Hello, world!</h1>
    <p>This is a sample HTML content to be translated.</p>
</body>
</html>
"""
translated_html = translate_content(html_content)
print(translated_html)


# Example usage with plain text:
plain_text = "plain text content for testing translation functionality "

translated_text = translate_content(plain_text)
print(translated_text)


# Example usage with metadata
metadata = "{'title': 'title for testing translation of metadata value'}"


translated_metadata = translate_content(metadata)
print(translated_metadata)

Run this script separately using the command below.

python your_file_name.py

Conclusion:

The Helsinki-NLP model from Hugging Face is a powerful tool for translating different types of content, including plain text, HTML body content, and metadata. Using the pre-trained models in the Transformers library, we can easily translate text from one language to another.



Kanchan Bawane

Kanchan is a Technical Consultant at Perficient with a keen interest in various technologies and working for communities. She is enthusiastic about sharing her knowledge, viewpoints, and experiences with others. She has also delivered various Coveo solutions utilizing different frameworks.
