
Convert a Text File from UTF-8 Encoding to ANSI using Python in AWS Glue


To convert a text file from UTF-8 to ANSI in AWS Glue, you will typically work with Python or PySpark. Note that "ANSI" is not a specific encoding: in a Windows context it usually refers to Windows-1252 (or a similar 8-bit code page).
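
To make the difference concrete, here is a small illustrative sketch (not part of the Glue job itself) that encodes the same string both ways. It assumes "ANSI" means Windows-1252, which Python exposes under the codec name `cp1252`, with `windows-1252` as an alias. The sample string is made up.

```python
# Illustrative only: compare the bytes produced by UTF-8 and Windows-1252.
text = "café €5"

utf8_bytes = text.encode("utf-8")    # é takes 2 bytes, € takes 3 bytes
ansi_bytes = text.encode("cp1252")   # every character takes exactly 1 byte

# ASCII characters are identical in both encodings; only é and € differ.
print(utf8_bytes)                        # b'caf\xc3\xa9 \xe2\x82\xac5'
print(ansi_bytes)                        # b'caf\xe9 \x805'
print(len(utf8_bytes), len(ansi_bytes))  # 10 7
```

A reader that expects ANSI but receives the UTF-8 bytes would display `Ã©` in place of `é`, which is the usual symptom that a conversion like this is needed.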

AWS Glue runs on Apache Spark, which uses UTF-8 as its default encoding. Because Spark's plain-text writer produces UTF-8 output, the conversion to ANSI has to be handled when the file is written. There are a few workarounds; this guide uses plain Python with boto3.

Here’s a step-by-step guide to converting a text file from UTF-8 to ANSI using Python in AWS Glue. Assume you’re working with a plain text file and want to output a similarly formatted file in ANSI encoding.

General Process Flow

Technical Approach Step-By-Step Guide

Step 1: Add the import statements to the code

import boto3
import codecs

Step 2: Specify the source/target file paths & S3 bucket details

# Initialize S3 client
s3_client = boto3.client('s3')
s3_key_utf8 = 'utf8_file_path/filename.txt'
s3_key_ansi = 'ansi_file_path/filename.txt'

# Specify S3 bucket and file paths
bucket_name = 'your-s3-bucket-name'  # Replace with your S3 bucket name
input_key = s3_key_utf8   # S3 path/name of the input UTF-8 encoded file
output_key = s3_key_ansi  # S3 path/name to save the ANSI encoded file

Step 3: Write a function to convert the text file from UTF-8 to ANSI, based on the parameters supplied (S3 bucket name, source-file, target-file)

# Function to convert UTF-8 file to ANSI (Windows-1252) and upload back to S3
def convert_utf8_to_ansi(bucket_name, input_key, output_key):
    # Download the UTF-8 encoded file from S3
    response = s3_client.get_object(Bucket=bucket_name, Key=input_key)
    # Read the file content from the response body (UTF-8 encoded)
    utf8_content = response['Body'].read().decode('utf-8')
    # Convert the content to ANSI encoding (Windows-1252)
    # 'ignore' silently drops characters that have no Windows-1252 equivalent
    ansi_content = utf8_content.encode('windows-1252', 'ignore')
    # Upload the converted file to S3 (in ANSI encoding)
    s3_client.put_object(Bucket=bucket_name, Key=output_key, Body=ansi_content) 
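
The `'ignore'` handler drops any character that Windows-1252 cannot represent, so data can be lost without warning. As a rough comparison, the sketch below runs one made-up sample string through Python's standard `ignore`, `replace`, and `strict` error handlers. It is an illustration only and does not touch S3.

```python
# Illustrative only: how Python's built-in error handlers treat characters
# that have no Windows-1252 equivalent (here, the two Japanese characters).
sample = "Price: 10€ 日本"

ignored = sample.encode("windows-1252", "ignore")    # unmappable characters are dropped
replaced = sample.encode("windows-1252", "replace")  # unmappable characters become b"?"

try:
    sample.encode("windows-1252", "strict")          # default handler: raises an error
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    print(f"strict failed: {exc.reason}")

print(ignored)   # b'Price: 10\x80 '
print(replaced)  # b'Price: 10\x80 ??'
```

If silently dropped characters are not acceptable for your data, `'replace'` or `'strict'` makes the loss visible instead of hiding it.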

Step 4: Call the function that converts the text file from UTF-8 to ANSI

# Call the function to convert the file 
convert_utf8_to_ansi(bucket_name, input_key, output_key) 
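
The function in Step 3 reads the whole object into memory, which is fine for small files. For larger files, one possible variation is to transcode in chunks using the `codecs` module imported in Step 1. The sketch below is an illustration under assumptions, not the method used in this guide: `transcode_stream`, `chunk_chars`, and `errors` are hypothetical names, and the helper works on any binary file-like objects, so it is exercised here with in-memory buffers rather than S3.

```python
import codecs
import io


def transcode_stream(src, dst, chunk_chars=64 * 1024, errors="replace"):
    """Copy UTF-8 bytes from src to dst as Windows-1252, chunk by chunk.

    src and dst are binary file-like objects. This is a hypothetical helper
    for illustration, not part of the original job.
    """
    # The stream reader decodes incrementally, so a multi-byte UTF-8 sequence
    # split across two reads is still decoded correctly.
    reader = codecs.getreader("utf-8")(src)
    while True:
        chunk = reader.read(chunk_chars)
        if not chunk:  # empty string means end of input
            break
        dst.write(chunk.encode("windows-1252", errors))


# Tiny chunk size on purpose, to force reads that split multi-byte characters.
source = io.BytesIO(("naïve café €5\n" * 3).encode("utf-8"))
target = io.BytesIO()
transcode_stream(source, target, chunk_chars=4)
print(target.getvalue().decode("windows-1252"))
```

In a Glue job, `response['Body']` from `get_object` exposes a `read()` method and so could plausibly serve as `src`, with the result written to a temporary file and uploaded afterwards; that wiring is an assumption and is not tested here.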


Venkatanaidu Gallipalli

Venkatanaidu Gullipalli is a highly experienced data engineering leader with 15 years of expertise in designing, developing, and managing end-to-end data solutions using industry-leading tools such as IICS, AWS Glue, Python, Snowflake, and ADF. He has successfully led numerous data integration projects and delivered scalable, high-performance data pipelines for clients across various sectors.
