Encoding and Binary Data

Serial communication deals with bytes. How those bytes map to characters — or whether they represent characters at all — depends on your encoding choice. This guide covers when to use each encoding and how to handle binary protocols that have no character representation.


Most mcserial read and write tools accept an encoding parameter that controls how strings are converted to and from bytes:

// Text with UTF-8 (default)
// write_serial(port="/dev/ttyUSB0", data="Hello, 世界\r\n", encoding="utf-8")
// Raw bytes as Latin-1
// write_serial(port="/dev/ttyUSB0", data="\x01\x03\x00\x00\x00\x01", encoding="latin-1")

When reading, mcserial decodes incoming bytes using the specified encoding. Invalid byte sequences are replaced with the Unicode replacement character (�) rather than throwing an error — this is the errors="replace" behavior in Python.
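
The replacement behavior can be reproduced with Python's built-in bytes.decode. The following is a minimal sketch using only the standard library; it does not call mcserial, and the sample bytes are made up for illustration:

```python
# Bytes as they might arrive from a device: ASCII text with one stray 0xFF byte.
raw = b"OK\xff\r\n"

# errors="replace" substitutes U+FFFD for each undecodable byte instead of raising.
decoded = raw.decode("utf-8", errors="replace")
print(ascii(decoded))  # 'OK\ufffd\r\n'

# The default (errors="strict") raises UnicodeDecodeError on the same input.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```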


UTF-8 is the default encoding for string operations (binary-oriented tools such as rs485_scan_addresses default to Latin-1 instead). UTF-8 encodes ASCII characters as the same single bytes 0x00–0x7F and encodes non-ASCII characters as multi-byte sequences.

When to use UTF-8:

  • ASCII text protocols (AT commands, console output)
  • Devices that explicitly use UTF-8 (modern systems, JSON/XML output)
  • Human-readable text where you want proper Unicode support

// Reading UTF-8 text
// read_serial_line(port="/dev/ttyUSB0", encoding="utf-8")
{
  "line": "Temperature: 23.5°C",
  "bytes_read": 21
}
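
Byte counts and character counts diverge as soon as a non-ASCII character appears, because ° occupies two bytes in UTF-8. A minimal sketch in plain Python, assuming the device terminates the line with a single "\n" that is included in the byte count (the example above does not state this):

```python
# Assumed wire content: the line shown above plus a single "\n" terminator.
line = "Temperature: 23.5°C\n"
encoded = line.encode("utf-8")

char_count = len(line)      # 20 characters
byte_count = len(encoded)   # 21 bytes: "°" is encoded as C2 B0

degree_bytes = "°".encode("utf-8")
print(char_count, byte_count, degree_bytes.hex(" "))  # 20 21 c2 b0
```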

Latin-1 (ISO-8859-1) maps bytes 0x00–0xFF directly to Unicode code points U+0000–U+00FF. This makes it a perfect passthrough for raw binary data — every possible byte value round-trips through encoding and decoding unchanged.
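
The round-trip property is easy to check in plain Python. This sketch uses only the standard library and compares against UTF-8, which expands every byte above 0x7F into two bytes:

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding with Latin-1 maps byte N to code point U+00NN.
as_text = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in as_text]

# Encoding back yields the identical byte string.
round_tripped = as_text.encode("latin-1")

# The same text encoded as UTF-8 is longer: 128 one-byte + 128 two-byte characters.
utf8_length = len(as_text.encode("utf-8"))
print(round_tripped == all_bytes, utf8_length)  # True 384
```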

When to use Latin-1:

  • Binary protocols (Modbus RTU, proprietary framing)
  • Data with arbitrary byte values (firmware blobs, encrypted payloads)
  • Protocol analysis where you need to see exact bytes

// Writing a Modbus RTU request (address 1, read holding registers)
// write_serial(
//   port="/dev/ttyUSB0",
//   data="\x01\x03\x00\x00\x00\x01\x84\x0A",
//   encoding="latin-1"
// )
{
  "bytes_written": 8
}

mcserial defaults to Latin-1 for binary-oriented operations like rs485_scan_addresses precisely because it ensures byte-for-byte fidelity.


ASCII only covers bytes 0x00–0x7F. Bytes outside this range cause encoding errors.

When to use ASCII:

  • Strict validation of 7-bit protocols
  • Legacy systems that only accept ASCII
  • When you want encoding to fail loudly on invalid data

// This would fail if the device sends bytes > 0x7F
// read_serial(port="/dev/ttyUSB0", encoding="ascii")
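
The same strictness can be applied on the client side before or after a transfer. Below is a small sketch in plain Python; is_seven_bit is a hypothetical helper name, and how mcserial itself reports such a failure is not shown here:

```python
def is_seven_bit(data: bytes) -> bool:
    """Return True if every byte is in the ASCII range 0x00-0x7F."""
    try:
        data.decode("ascii")  # strict by default: raises on any byte > 0x7F
    except UnicodeDecodeError:
        return False
    return True


print(is_seven_bit(b"AT+CSQ\r\n"))  # True
print(is_seven_bit(b"\x80OK"))      # False
```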

For precise binary control, use write_serial_bytes instead of write_serial. It accepts a list of integer byte values (0–255) and writes them directly:

// Write exact bytes without encoding conversion
// write_serial_bytes(port="/dev/ttyUSB0", data=[0x01, 0x03, 0x00, 0x00, 0x00, 0x01, 0x84, 0x0A])
{
  "bytes_written": 8
}

This is clearer than escaping bytes in a string and avoids any encoding ambiguity.
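
Frames are often documented as hex strings. One way to turn such a string into the integer list that write_serial_bytes expects is Python's bytes.fromhex; the variable names here are illustrative:

```python
# Hex text as it might appear in a datasheet or a capture log.
hex_text = "01 03 00 00 00 01 84 0A"

# bytes.fromhex ignores the spaces; list() yields integers in the range 0-255.
byte_list = list(bytes.fromhex(hex_text))
print(byte_list)  # [1, 3, 0, 0, 0, 1, 132, 10]
```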


For binary analysis, use the serial://{port}/raw resource. It returns data as a hex dump with printable ASCII annotations:

Read resource: serial:///dev/ttyUSB0/raw

Output format:

00000000 01 03 02 00 64 B9 AF |....d..|

This shows:

  • Offset (00000000)
  • Hex bytes (01 03 02 00 64 B9 AF)
  • ASCII representation (. for non-printable, actual character otherwise)
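
For data captured some other way, a similar dump can be produced on the client side. The sketch below imitates the three-column layout described above; hex_dump is a hypothetical helper, and the exact spacing and padding of the real resource may differ:

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as 'offset hex-bytes |ascii|', one line per `width` bytes."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Printable ASCII (0x20-0x7E) is shown as-is, everything else as "."
        ascii_part = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
        lines.append(f"{offset:08X} {hex_part} |{ascii_part}|")
    return "\n".join(lines)


print(hex_dump(b"\x02OK 23.5\r\n"))
# 00000000 02 4F 4B 20 32 33 2E 35 0D 0A |.OK 23.5..|
```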

Modbus RTU is a binary protocol with CRC-16 error checking. Always use Latin-1 or write_serial_bytes:

// Request: Read holding register 0 from device 1
// write_serial(
//   port="/dev/ttyUSB0",
//   data="\x01\x03\x00\x00\x00\x01\x84\x0A",
//   encoding="latin-1"
// )

// Read response
// read_serial(port="/dev/ttyUSB0", size=7, encoding="latin-1")
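
The trailing two bytes of each frame are the CRC-16, transmitted low byte first. Here is a sketch of the standard Modbus CRC (initial value 0xFFFF, reflected polynomial 0xA001) in plain Python; the helper names add_crc and crc_ok are illustrative and not part of mcserial:

```python
def crc16_modbus(data: bytes) -> int:
    """Bitwise CRC-16/MODBUS: init 0xFFFF, reflected polynomial 0xA001."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def add_crc(payload: bytes) -> bytes:
    """Append the CRC in wire order (low byte first)."""
    return payload + crc16_modbus(payload).to_bytes(2, "little")


def crc_ok(frame: bytes) -> bool:
    """Check that the last two bytes match the CRC of everything before them."""
    if len(frame) < 4:
        return False
    return crc16_modbus(frame[:-2]) == int.from_bytes(frame[-2:], "little")


request = add_crc(bytes([0x01, 0x03, 0x00, 0x00, 0x00, 0x01]))
print(request.hex(" "))  # 01 03 00 00 00 01 84 0a
```

To send the result, pass list(request) to write_serial_bytes, or request.decode("latin-1") to write_serial with encoding="latin-1". A response read with Latin-1 can be converted back with text.encode("latin-1") before calling crc_ok.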

NMEA sentences are pure ASCII with a simple checksum:

// NMEA sentences are safe with UTF-8 or ASCII
// read_serial_line(port="/dev/ttyUSB0", encoding="utf-8")
{
  "line": "$GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,46.9,M,,*47"
}
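
The NMEA checksum is the XOR of every character between "$" and "*", written as two hex digits after the "*". A minimal validation sketch in plain Python; nmea_checksum_ok is a hypothetical helper name:

```python
from functools import reduce
from operator import xor


def nmea_checksum_ok(sentence: str) -> bool:
    """Validate the '*hh' checksum of a single NMEA 0183 sentence."""
    sentence = sentence.strip()
    if not sentence.startswith("$") or "*" not in sentence:
        return False
    # Everything between "$" and the last "*" is covered by the checksum.
    body, _, given = sentence[1:].rpartition("*")
    try:
        expected = int(given, 16)
    except ValueError:
        return False
    computed = reduce(xor, body.encode("ascii", errors="replace"), 0)
    return computed == expected


good = "$GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,46.9,M,,*47"
print(nmea_checksum_ok(good))  # True
```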

Some protocols embed binary data within text framing. Handle these by:

  1. Reading with Latin-1 to preserve all bytes
  2. Parsing the text portions as needed
  3. Extracting binary payloads by position

// Read with Latin-1 to preserve everything
// read_serial(port="/dev/ttyUSB0", size=100, encoding="latin-1")
// Then parse: "DATA:" prefix + 4-byte length + binary payload + "\r\n"
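
The three steps above can be sketched as follows for the "DATA:" layout. This assumes the 4-byte length is an unsigned big-endian integer, which the format description does not specify; parse_data_frame is a hypothetical helper:

```python
import struct

PREFIX = b"DATA:"
TERMINATOR = b"\r\n"


def parse_data_frame(text: str) -> bytes:
    """Extract the binary payload from a Latin-1 decoded 'DATA:' frame."""
    raw = text.encode("latin-1")  # undo the Latin-1 decoding, byte for byte
    if not raw.startswith(PREFIX):
        raise ValueError("missing DATA: prefix")
    header_end = len(PREFIX) + 4
    if len(raw) < header_end:
        raise ValueError("truncated length field")
    # ASSUMPTION: length is big-endian unsigned 32-bit (">I").
    (length,) = struct.unpack(">I", raw[len(PREFIX):header_end])
    payload_end = header_end + length
    if raw[payload_end:payload_end + 2] != TERMINATOR:
        raise ValueError("missing terminator or truncated payload")
    return raw[header_end:payload_end]


# A frame as it would look after reading with encoding="latin-1".
sample = "DATA:\x00\x00\x00\x03\xff\x00\x7f\r\n"
payload = parse_data_frame(sample)
print(payload.hex(" "))  # ff 00 7f
```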

When decoding with UTF-8 (or any multi-byte encoding), invalid sequences are replaced with � (U+FFFD). This is intentional — it prevents crashes and makes problems visible.

Symptoms of encoding mismatch:

| What You See | Likely Cause |
|---|---|
| Scattered � in output | Binary data decoded as UTF-8 |
| Truncated strings | Multi-byte sequence split across reads |
| Missing bytes | XON/XOFF stripping 0x11/0x13 |

Diagnosis:

// Switch to Latin-1 to see raw bytes
// read_serial(port="/dev/ttyUSB0", encoding="latin-1")
// Or use the hex dump resource for full visibility
// Read resource: serial:///dev/ttyUSB0/raw
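
For the split-sequence case, one client-side option is to collect raw bytes (for example via Latin-1 reads converted back to bytes) and decode them with an incremental decoder, which buffers an incomplete sequence until the next chunk arrives. A sketch using Python's codecs module; the two chunks are made-up sample data:

```python
import codecs

# "°" is C2 B0 in UTF-8; here it is split across two separate reads.
chunks = [b"23.5\xc2", b"\xb0C\r\n"]

# Decoding each chunk on its own corrupts the split character.
naive = "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)

# An incremental decoder keeps the dangling 0xC2 until 0xB0 arrives.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
joined = "".join(decoder.decode(chunk) for chunk in chunks)
joined += decoder.decode(b"", final=True)  # flush anything still buffered

print(ascii(naive))   # '23.5\ufffd\ufffdC\r\n'
print(ascii(joined))  # '23.5\xb0C\r\n'
```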

| Encoding | Byte Range | Use Case |
|---|---|---|
| utf-8 | Multi-byte | Text protocols, console I/O, JSON (default) |
| latin-1 | 0x00–0xFF → U+0000–U+00FF | Binary protocols, raw byte passthrough |
| ascii | 0x00–0x7F | Strict 7-bit validation |
| (bytes) | 0–255 | write_serial_bytes for explicit binary |

Rules of thumb:

  1. Text you can read? Use UTF-8 (default)
  2. Binary protocol? Use Latin-1 or write_serial_bytes
  3. Seeing � characters? You’re decoding binary as UTF-8 — switch to Latin-1
  4. Need to analyze raw bytes? Use the serial://{port}/raw resource

Many binary protocols include error-checking bytes (CRC, checksum). When constructing frames:

  1. Build the data portion as a byte array
  2. Calculate the check value
  3. Append the check bytes
  4. Send via write_serial_bytes or Latin-1

Example: Simple XOR checksum

# In your MCP client or preprocessing
data = [0x01, 0x03, 0x00, 0x00, 0x00, 0x01]
checksum = 0
for b in data:
    checksum ^= b  # XOR every data byte into the running checksum
frame = data + [checksum]
# frame = [0x01, 0x03, 0x00, 0x00, 0x00, 0x01, 0x03]

Then send:

// write_serial_bytes(port="/dev/ttyUSB0", data=[1, 3, 0, 0, 0, 1, 3])
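
The receiving side can verify the same checksum. With XOR, a valid frame has a convenient property: XOR-ing every byte including the checksum yields zero. A small sketch with hypothetical helper names:

```python
from functools import reduce
from operator import xor


def xor_checksum(data) -> int:
    """XOR of all byte values in `data` (0 for empty input)."""
    return reduce(xor, data, 0)


def build_frame(data) -> list:
    """Return the data bytes followed by their XOR checksum."""
    return list(data) + [xor_checksum(data)]


def frame_ok(frame) -> bool:
    """A frame is valid when data XOR checksum cancels out to zero."""
    return len(frame) >= 2 and xor_checksum(frame) == 0


frame = build_frame([0x01, 0x03, 0x00, 0x00, 0x00, 0x01])
print(frame, frame_ok(frame))  # [1, 3, 0, 0, 0, 1, 3] True
```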