Introduction
In Hugging Face, a translation model is a pre-trained deep learning model that can be used for machine translation tasks. These models are pre-trained on large amounts of multilingual data and fine-tuned on translation-specific datasets.
To use a translation model in Hugging Face, we typically load the model using the from_pretrained() function, which fetches the pre-trained weights and configuration. Then, we can use the model to translate text by passing the source language text as input and obtaining the translated text as output.
Hugging Face’s translation models are implemented in the Transformers library, which is a popular open-source library for natural language processing (NLP) tasks. The library provides a unified interface and a set of powerful tools for working with various NLP models, including translation models.
Let’s start by implementing a translation model using the Helsinki-NLP model from Hugging Face:
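- Install the necessary libraries: Use pip to install the transformers library, which includes the translation models. The Marian tokenizer used below also depends on the sentencepiece package, and the examples run on PyTorch.
pip install transformers sentencepiece torch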
- Load the translation model: Use the from_pretrained() function to load a pre-trained translation model. We need to specify the model's name or identifier. For example, to load the English-to-French translation model, we can use the following code:
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fr"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)
- Tokenize the input: Before translating the text, we need to tokenize it using the appropriate tokenizer. The tokenizer splits the text into smaller units, such as words or subwords, that the model understands.
source_text = "Translate this English text to French."
encoded_input = tokenizer.encode(source_text, return_tensors="pt")
- Translate the text: Pass the encoded input to the translation model to obtain the translated output.
translated_output = model.generate(encoded_input)
- Decode the output: Convert the generated token IDs back into readable text with the tokenizer.
translated_text = tokenizer.decode(translated_output[0], skip_special_tokens=True)
These steps provide a basic outline of implementing a translation model in Hugging Face.
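The steps above can be combined into one reusable helper. The following is a minimal sketch rather than part of the original walkthrough: the function names `load_translator` and `translate_batch` are made up for illustration, and it assumes transformers, sentencepiece, and torch are installed. It tokenizes a list of sentences in one call and decodes every output with `batch_decode`.

```python
from typing import List


def load_translator(model_name: str = "Helsinki-NLP/opus-mt-en-fr"):
    """Load a MarianMT model and its tokenizer (weights are downloaded on first use)."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    return model, tokenizer


def translate_batch(texts: List[str], model, tokenizer, max_length: int = 512) -> List[str]:
    """Translate a list of source sentences with a single generate() call."""
    if not texts:
        return []
    # Calling the tokenizer directly returns input_ids and attention_mask;
    # padding lets sentences of different lengths share one batch.
    encoded = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**encoded, max_length=max_length)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

A typical call would be `model, tokenizer = load_translator()` followed by `translate_batch(["Translate this English text to French."], model, tokenizer)`.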
Handling Metadata, HTML Body, and Plain Text
With the fundamentals of the Helsinki-NLP Hugging Face model in hand, let us get started by translating various forms of content, including plain text, HTML body content, and metadata.
It is essential to determine the type of content we will be dealing with before we start translating.
The following functions differentiate between plain text, HTML, and metadata:
- is_plain_text(content): By looking for the presence of HTML tags and Python string identifiers, this function can tell if the content is plain text.
- is_html_content(content): Checks for the presence of the <html> tag to identify HTML content.
- is_python_string(content): Recognizes metadata in Python strings based on specific delimiters.
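The checks described above rely on a literal `<html>` tag and on leading and trailing delimiters. As a hedged alternative, the sketch below uses only the standard library: a regular expression that matches any tag, and `ast.literal_eval` to confirm that the content really is a Python dict literal. The name `detect_content_type` and its return values are assumptions made for this illustration, not part of the article's code.

```python
import ast
import re

# Matches an opening or closing tag such as <p>, </div> or <a href="...">.
_TAG_RE = re.compile(r"</?[a-zA-Z][^<>]*>")


def detect_content_type(content: str) -> str:
    """Return 'html', 'metadata', or 'plain_text' for a piece of content."""
    stripped = content.strip()
    if _TAG_RE.search(stripped):
        return "html"
    try:
        # literal_eval only accepts Python literals, so it never runs code.
        value = ast.literal_eval(stripped)
    except (ValueError, SyntaxError, TypeError):
        return "plain_text"
    # Only dict literals are treated as metadata in this sketch.
    return "metadata" if isinstance(value, dict) else "plain_text"
```

As in the article's ordering, HTML is checked first, so a metadata string whose values contain markup would be classified as HTML.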
The following approaches demonstrate the translation of each content type:
- Translating Metadata Content: Metadata often consists of structured data in the form of key-value pairs such as name and title. This function translates only the values of the metadata object and leaves the keys as they are:
import json

def translate_metadata_content(metadata, model, tokenizer, fields_to_translate):
    translated_metadata = {}
    # Loop through each field and perform the translation process
    for key, value in metadata.items():
        # Translate the value if it is a string and included in fields_to_translate
        if isinstance(value, str) and key in fields_to_translate:
            value_tokens = tokenizer.encode(value, return_tensors='pt')
            translated_value_tokens = model.generate(value_tokens, max_length=100)
            translated_value = tokenizer.decode(translated_value_tokens[0], skip_special_tokens=True)
        else:
            translated_value = value
        translated_metadata[key] = translated_value
    # ensure_ascii=False keeps accented characters readable in the JSON output
    return json.dumps(translated_metadata, ensure_ascii=False)
- Translating Plain Text Content: Plain text translation is a simple technique. To translate plain text from one language to another, we’ll use our translation model:
def translate_plainText(content, model, tokenizer):
    # Tokenize the plain text content
    encoded = tokenizer(content, return_tensors="pt", padding=True, truncation=True)
    # Translate the text
    translated_tokens = model.generate(**encoded, max_length=1024, num_beams=4, early_stopping=True)
    return tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
- Translating HTML Body Content: Because HTML contains markup, the body content needs extra processing. This function translates the text of the HTML body:
from bs4 import BeautifulSoup

def translate_html_content(content, model, tokenizer):
    # Parse the original HTML so that its structure can be reused
    soup = BeautifulSoup(content, 'html.parser')
    # Translate the text
    translated_text = model.generate(**tokenizer(content, return_tensors="pt", padding=True, truncation=True), max_length=1024, num_beams=4, early_stopping=True)
    translated_text = tokenizer.decode(translated_text[0], skip_special_tokens=True)
    # Create a new soup with the translated text
    new_soup = BeautifulSoup(translated_text, 'html.parser')
    # Replace the text in the original HTML structure
    for original_tag, translated_tag in zip(soup.find_all(), new_soup.find_all()):
        if original_tag.string:
            original_tag.string = translated_tag.get_text()
    return soup.prettify()
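The translate_html_content function above sends the whole document, tags included, through the model and then pairs the original and translated tags by position. That pairing depends on the model reproducing the markup faithfully. A hedged alternative is to translate each text node on its own and leave the tags untouched. The sketch below illustrates this with the standard library's `html.parser` instead of BeautifulSoup, purely to stay dependency-free. The class `TextNodeTranslator`, the function `translate_html_text_nodes`, and the `translate` callable (any function from string to string, for example a wrapper around the model) are all hypothetical names.

```python
from html import escape
from html.parser import HTMLParser
from typing import Callable, List


class TextNodeTranslator(HTMLParser):
    """Re-emit HTML unchanged except for text nodes, which are translated."""

    SKIP_TAGS = {"script", "style"}  # their contents are code, not prose

    def __init__(self, translate: Callable[[str], str]):
        super().__init__(convert_charrefs=True)
        self.translate = translate
        self.parts: List[str] = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        # get_starttag_text() returns the tag exactly as written in the source.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        self.parts.append(f"</{tag}>")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")

    def handle_data(self, data):
        core = data.strip()
        if self.skip_depth or not core:
            self.parts.append(data)  # keep whitespace and script/style bodies as-is
            return
        # Preserve the whitespace around the text so inline tags stay separated.
        lead = data[: len(data) - len(data.lstrip())]
        trail = data[len(data.rstrip()):]
        self.parts.append(lead + escape(self.translate(core), quote=False) + trail)


def translate_html_text_nodes(html: str, translate: Callable[[str], str]) -> str:
    """Translate only the text nodes of an HTML string."""
    parser = TextNodeTranslator(translate)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

To use it with the model, `translate` could be `lambda text: translate_plainText(text, model, tokenizer)`. One trade-off of this design is that sentences split across inline tags are translated in fragments, which can cost some context.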
Putting it All Together
To bring everything together, a central function selects the translation method according to the content type:
import ast
import json
import subprocess
import sys

# Install necessary packages if not already installed
try:
    import transformers
    import sacremoses
except ImportError:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'torch', 'transformers', 'sentencepiece', 'sacremoses'])
    import transformers
    import sacremoses

from transformers import MarianMTModel, MarianTokenizer
from bs4 import BeautifulSoup


def translate_content(content):
    # Load the translation model and tokenizer
    model_name = 'Helsinki-NLP/opus-mt-en-fr'
    model = MarianMTModel.from_pretrained(model_name)
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    tokenizer.src_tokenizer = sacremoses.MosesTokenizer()
    tokenizer.tgt_tokenizer = sacremoses.MosesTokenizer()

    # Check if the input is HTML or plain text or metadata
    if is_html_content(content):
        translated_content = translate_html_content(content, model, tokenizer)
    elif is_python_string(content):
        print("Content is a Python string expression.")
        fields_to_translate = ['title', 'name']
        content = content.replace("null", "None")
        # literal_eval parses Python literals without executing arbitrary code
        metadata = ast.literal_eval(content)
        translated_content = translate_metadata_content(metadata, model, tokenizer, fields_to_translate)
    elif is_plain_text(content):
        translated_content = translate_plainText(content, model, tokenizer)
    return translated_content


def translate_metadata_content(metadata, model, tokenizer, fields_to_translate):
    # Utilize the code snippet from the first point above to translate metadata values.
    ...


def translate_plainText(content, model, tokenizer):
    # Utilize the code snippet from the second point above to translate plain text
    ...


def translate_html_content(content, model, tokenizer):
    # Utilize the code snippet from the third point above to translate html content
    ...


def is_html_content(content):
    return "<html>" in content.lower()


def is_plain_text(content):
    return "<html>" not in content.lower() and not is_python_string(content)


def is_python_string(content):
    return (content.startswith("'") and content.endswith("'")) or \
        (content.startswith('"') and content.endswith('"')) or \
        (content.startswith("{") and content.endswith("}"))
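Note that translate_content loads the model and tokenizer on every call, which is slow when many pieces of content are translated in one run. One possible refinement, sketched below under the assumption that a single process handles all requests, is to cache the loaded objects with `functools.lru_cache`. The function name `get_translator` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_translator(model_name: str = "Helsinki-NLP/opus-mt-en-fr"):
    """Load the model and tokenizer once per model name and reuse them afterwards."""
    # Lazy import: transformers is only needed when a model is first requested.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    return model, tokenizer
```

Inside translate_content, the two from_pretrained calls could then be replaced with `model, tokenizer = get_translator(model_name)`, so that only the first call pays the loading cost.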
Example Usage
Here are examples of using the provided functions with different content types.
# Example usage with HTML content:
html_content = """
<html>
<head>
<title>Example HTML</title>
</head>
<body>
<h1>Hello, world!</h1>
<p>This is a sample HTML content to be translated.</p>
</body>
</html>
"""
translated_html = translate_content(html_content)
print(translated_html)

# Example usage with plain text:
plain_text = "plain text content for testing translation functionality "
translated_text = translate_content(plain_text)
print(translated_text)

# Example usage with metadata:
metadata = "{'title':'title for testing translation of metadata value'}"
translated_metadata = translate_content(metadata)
print(translated_metadata)
Save the script to a file and run it with the command below.
python your_file_name.py
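One caveat when running the script on longer documents: translate_plainText passes truncation=True to the tokenizer, so any input beyond the model's maximum input length is cut off without warning. A common workaround is to split the text into smaller pieces and translate them one by one. The sketch below shows a simple way to do that. The function name `split_into_chunks`, the regular expression, and the 400-character budget are illustrative assumptions; the character count is only a rough proxy for the token limit.

```python
import re
from typing import List

# Split after ., ! or ? when followed by whitespace. This is a rough heuristic,
# not a real sentence segmenter.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_chunks(text: str, max_chars: int = 400) -> List[str]:
    """Group sentences into chunks of at most max_chars characters where possible."""
    chunks: List[str] = []
    current = ""
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            # The sentence fits, or it starts a new chunk. A single sentence
            # longer than the budget still becomes its own chunk.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to translate_plainText, and the results joined with spaces to rebuild the translated document.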
Conclusion
The Helsinki-NLP models from Hugging Face are capable tools for translating different types of content, including plain text, HTML, and metadata. With the Marian model classes in the Transformers library, we can translate text from one language to another in a few lines of code.