UTF-16

UTF-16 Encoder / Decoder

Online Free Unicode & UTF-16 Encoding Tool

Why Use Our UTF-16 Encoder / Decoder?

23 Formats: Hex, binary, Base64 & more

BE & LE: Both endianness modes

Inspector: Byte-level analysis

UTF-8 Compare: Side-by-side comparison

Private: 100% browser-based

Free: No signup required

The Complete Guide to UTF-16 Encoding and Decoding: Understanding the Native Encoding of Windows, Java, and JavaScript

In the landscape of character encodings, UTF-16 occupies a unique and critically important position. While UTF-8 dominates the web and network protocols, UTF-16 is the native internal encoding used by some of the most widely deployed software systems in the world. Microsoft Windows uses UTF-16 for its entire internal string representation, the Windows API, file system paths, and the Windows Registry. Java and the Java Virtual Machine (JVM) use UTF-16 as the native encoding for all String and char objects. JavaScript, despite being a web language, also represents strings internally as sequences of UTF-16 code units. The .NET framework and C# use UTF-16 for all string handling. SQL Server's NVARCHAR and NCHAR data types store text as UTF-16. Understanding UTF-16 encoding is therefore not an academic exercise but a practical necessity for anyone working with these platforms. Our free UTF-16 encoder decoder online provides the most comprehensive tool available for encoding text into UTF-16 code units, decoding UTF-16 sequences back to readable text, analyzing the UTF-16 structure of any text, and comparing UTF-16 encoding with UTF-8—all running entirely in your browser with complete privacy.

The story of UTF-16 begins with the original Unicode standard, which initially assumed that 65,536 codepoints (representable in 16 bits) would be sufficient to encode every character in every writing system. Based on this assumption, the original encoding was UCS-2 (Universal Character Set, 2 bytes), which used exactly two bytes per character with a simple one-to-one mapping between codepoints and 16-bit code units. When it became clear that 65,536 codepoints were not enough—particularly after the addition of historic scripts, mathematical notation, musical symbols, and eventually emoji—Unicode was extended to support over 1.1 million codepoints (up to U+10FFFF). UTF-16 was designed as a backward-compatible extension of UCS-2 that uses a clever mechanism called surrogate pairs to represent codepoints above U+FFFF while maintaining full compatibility with existing UCS-2 data for the Basic Multilingual Plane (BMP). This history is important because many legacy systems were designed around UCS-2 and later upgraded to UTF-16, and understanding the transition helps diagnose encoding issues that still occur in production systems today.

How UTF-16 Encoding Works: BMP Characters and Surrogate Pairs

UTF-16 is a variable-length encoding that uses either one or two 16-bit code units per character. Characters in the Basic Multilingual Plane (BMP), which includes codepoints from U+0000 to U+FFFF, are encoded as a single 16-bit code unit whose value equals the codepoint. This covers the vast majority of commonly used characters: all of ASCII, Latin extended characters, Greek, Cyrillic, Hebrew, Arabic, CJK unified ideographs, Japanese hiragana and katakana, Korean hangul syllables, and thousands of symbols and punctuation marks. For these characters, the encoding is identical to the older UCS-2 format and is straightforward. For example, the letter "A" (U+0041) is encoded as the 16-bit value 0x0041, and the Chinese character "你" (U+4F60) is encoded as 0x4F60.
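
Because JavaScript strings are themselves sequences of UTF-16 code units, this one-to-one mapping for BMP characters can be observed directly with charCodeAt() (a minimal sketch for illustration, not the tool's implementation):

```javascript
// For BMP characters, the UTF-16 code unit equals the codepoint.
// charCodeAt() returns the raw 16-bit code unit at a given index.
const a = "A".charCodeAt(0);    // 0x0041
const ni = "你".charCodeAt(0);  // 0x4F60

console.log(a.toString(16).toUpperCase());   // "41"
console.log(ni.toString(16).toUpperCase());  // "4F60"
```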

Characters outside the BMP—known as supplementary characters, with codepoints from U+10000 to U+10FFFF—are encoded using surrogate pairs: two consecutive 16-bit code units. The encoding process works as follows: first, subtract 0x10000 from the codepoint to get a 20-bit value (since the supplementary range spans exactly 2^20 codepoints). Then split this 20-bit value into two 10-bit halves. The high 10 bits are added to 0xD800 to produce the high surrogate (range D800–DBFF), and the low 10 bits are added to 0xDC00 to produce the low surrogate (range DC00–DFFF). The high surrogate must always come first, followed immediately by the low surrogate. This design cleverly reserves the range D800–DFFF so that those values are never assigned to actual characters, ensuring that surrogate code units can always be unambiguously identified. For example, the emoji 🌍 (U+1F30D) is encoded as the surrogate pair D83C DF0D: subtracting 0x10000 gives 0xF30D (binary: 0000 1111 0011 0000 1101), the high 10 bits are 0x003C added to 0xD800 giving 0xD83C, and the low 10 bits are 0x030D added to 0xDC00 giving 0xDF0D.
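
The subtract-and-split steps above can be sketched as a small JavaScript function (an illustrative sketch; the tool's own implementation may differ):

```javascript
// Encode a supplementary codepoint (U+10000..U+10FFFF) as a surrogate pair.
// Returns the [high, low] 16-bit code units.
function toSurrogatePair(codepoint) {
  const v = codepoint - 0x10000;       // 20-bit value
  const high = 0xD800 + (v >> 10);     // top 10 bits -> high surrogate
  const low = 0xDC00 + (v & 0x3FF);    // bottom 10 bits -> low surrogate
  return [high, low];
}

const [hi, lo] = toSurrogatePair(0x1F30D); // 🌍
console.log(hi.toString(16), lo.toString(16)); // "d83c" "df0d"
```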

Byte Order: Big Endian vs. Little Endian

Since UTF-16 uses 16-bit code units, the question of byte order (endianness) becomes important when the encoding is stored as a sequence of bytes. In big endian (BE) byte order, the most significant byte comes first: the code unit 0x0041 is stored as the bytes 00 41. In little endian (LE) byte order, the least significant byte comes first: 0x0041 is stored as 41 00. Both orders are valid and widely used. Windows systems typically use UTF-16 LE, while network protocols and many file formats use UTF-16 BE. The Byte Order Mark (BOM), the character U+FEFF placed at the beginning of a UTF-16 stream, serves to indicate the byte order: if the first two bytes are FE FF, the stream is big endian; if they are FF FE, it is little endian. Our UTF-16 encoding tool online supports both byte orders with explicit selection, plus optional BOM insertion, giving users complete control over the output format.
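
As a rough illustration of byte-order serialization with an optional BOM, one could use a DataView, whose setUint16 method takes an endianness flag (a sketch under those assumptions, not this tool's code):

```javascript
// Serialize a string's UTF-16 code units to bytes in a chosen byte order,
// optionally prefixing the BOM (U+FEFF).
function utf16Bytes(text, littleEndian, addBom) {
  const units = addBom ? [0xFEFF] : [];
  for (let i = 0; i < text.length; i++) units.push(text.charCodeAt(i));
  const view = new DataView(new ArrayBuffer(units.length * 2));
  units.forEach((u, i) => view.setUint16(i * 2, u, littleEndian));
  return new Uint8Array(view.buffer);
}

utf16Bytes("A", false, true); // BE with BOM: FE FF 00 41
utf16Bytes("A", true, true);  // LE with BOM: FF FE 41 00
```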

The endianness distinction is one of the key differences between UTF-16 and UTF-8. UTF-8 is a byte-oriented encoding where byte order is irrelevant—each byte has a fixed meaning regardless of the platform's native endianness. UTF-16, being a 16-bit encoding, inherently requires a decision about byte order, which introduces complexity in data interchange. This is why the BOM exists and why UTF-16 files often begin with FE FF or FF FE. When working with our free UTF-16 encode decode tool, understanding endianness is essential for producing output that is correctly interpreted by the target system.

UTF-16 vs. UTF-8: When to Use Each Encoding

The choice between UTF-16 and UTF-8 depends on the context, and understanding the trade-offs is important for developers and system administrators. UTF-8 is universally preferred for web content, network protocols, JSON, XML, and file interchange because it is byte-order-independent, backward-compatible with ASCII, and typically more compact for text that is primarily ASCII (since ASCII characters use only one byte in UTF-8 versus two bytes in UTF-16). UTF-16, however, is more compact than UTF-8 for text that consists primarily of BMP characters above U+07FF: in particular CJK text, which uses two bytes in UTF-16 but three bytes in UTF-8 (characters from U+0080 to U+07FF take two bytes in either encoding). This makes UTF-16 the more space-efficient choice for Chinese, Japanese, and Korean text processing, which is one reason why many Asian-market software systems have historically preferred UTF-16.
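
This trade-off is easy to measure in JavaScript: TextEncoder yields the UTF-8 bytes, while a string's length (in code units) times two gives its UTF-16 byte count without a BOM (a rough sketch for illustration):

```javascript
// Compare storage size of a string in UTF-8 vs UTF-16 (no BOM).
// A surrogate pair counts as two code units, i.e. four UTF-16 bytes.
function byteCounts(text) {
  return {
    utf8: new TextEncoder().encode(text).length,
    utf16: text.length * 2,
  };
}

byteCounts("hello"); // { utf8: 5, utf16: 10 } -- ASCII favors UTF-8
byteCounts("你好");  // { utf8: 6, utf16: 4 }  -- CJK favors UTF-16
```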

Beyond storage efficiency, UTF-16 has practical advantages in certain programming contexts. Random access to characters within a string is simpler in UTF-16 for BMP-only text, since each BMP character is exactly one code unit. While supplementary characters break this simplicity (requiring two code units), the vast majority of practical text consists entirely of BMP characters, making UTF-16 indexing reliable in most cases. This is why Java's charAt() method returns a char (16-bit value) that corresponds to a single UTF-16 code unit, and why JavaScript's string indexing also operates on UTF-16 code units. Our compare feature allows users to see both UTF-8 and UTF-16 encodings side by side for any text, making these trade-offs concrete and visible.

Practical Use Cases for UTF-16 Encoding and Decoding

Windows developers frequently need to work with UTF-16 when interfacing with the Windows API. All "W" (wide) versions of Windows API functions accept and return UTF-16 strings. When debugging these strings—examining them in memory dumps, log files, or crash reports—the ability to decode UTF-16 byte sequences back to readable text is essential. Our UTF-16 decoder online free handles both big endian and little endian byte sequences, automatically detects the format, and correctly processes surrogate pairs to reconstruct the original text including emoji and supplementary characters.

Java developers work with UTF-16 constantly, even when they may not be aware of it. Every Java String is internally a sequence of UTF-16 code units, and operations like String.length() return the number of code units (not characters), which differs from the character count when surrogate pairs are present. Understanding this distinction is crucial for correctly handling modern text that includes emoji. Our tool's inspector shows exactly how each character maps to UTF-16 code units, making these internals visible and comprehensible. JavaScript developers face identical issues—the string.length property in JavaScript returns the number of UTF-16 code units, and Array.from(string) or the spread operator [...string] is needed to correctly iterate over characters including surrogate pairs.
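
The code-unit pitfalls described above are easy to demonstrate in JavaScript (a minimal example mirroring that behavior):

```javascript
// length counts UTF-16 code units; spreading the string walks codepoints.
const s = "Hello🌍";
console.log(s.length);        // 7 (5 ASCII units + 1 surrogate pair)
console.log([...s].length);   // 6 (actual characters/codepoints)
console.log(s.codePointAt(5).toString(16)); // "1f30d"
```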

Database professionals working with SQL Server's NVARCHAR columns, which store text as UTF-16 LE, need to understand the encoding when debugging character storage issues, migrating data between systems, or optimizing storage. The ability to encode text to UTF-16 and see the resulting byte sequences helps verify that data is being stored correctly and helps diagnose issues where characters are corrupted during import or export operations. Our UTF-16 utility online free provides all the tools needed for these database-related tasks.

Network protocol developers and security researchers often encounter UTF-16 encoded data in Windows protocol captures, SMB traffic, Active Directory data, and various Microsoft protocol implementations. Decoding these UTF-16 byte sequences quickly and accurately is essential for protocol analysis and security auditing. The batch processing mode allows multiple encoded sequences to be decoded simultaneously, speeding up analysis workflows significantly.

Understanding Surrogate Pairs: The Key to Supplementary Characters

Surrogate pairs are the mechanism that extends UTF-16 beyond the BMP's 65,536 characters to cover the full Unicode range of over 1.1 million codepoints. The concept is elegant: a range of 2,048 code unit values (D800–DFFF) is permanently reserved—these values are never assigned to actual characters—and divided into two sub-ranges: high surrogates (D800–DBFF, 1,024 values) and low surrogates (DC00–DFFF, 1,024 values). Each combination of one high surrogate and one low surrogate represents one supplementary character, giving 1,024 × 1,024 = 1,048,576 additional characters, which when added to the BMP's 63,488 non-surrogate codepoints gives the full Unicode range of 1,112,064 codepoints.
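
Decoding is the inverse arithmetic: strip the surrogate offsets, recombine the two 10-bit halves, and add 0x10000 back (an illustrative sketch):

```javascript
// Combine a high/low surrogate pair back into one supplementary codepoint.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

fromSurrogatePair(0xD83C, 0xDF0D).toString(16); // "1f30d" -> 🌍
```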

Common sources of surrogate pair issues include emoji (virtually all emoji codepoints are in the supplementary planes), mathematical symbols (many are in the Mathematical Alphanumeric Symbols block, U+1D400–U+1D7FF), historic scripts (Egyptian hieroglyphs, cuneiform, etc.), and musical symbols (U+1D100–U+1D1FF). When any of these characters appear in text processed by UTF-16-based systems, surrogate pairs are generated, and code that assumes one code unit equals one character will produce incorrect results. Our tool's surrogate pair analysis clearly shows which characters generate surrogate pairs and what the high and low surrogate values are, making it an invaluable educational and debugging resource.

Security and Encoding Issues

UTF-16 has its own set of security considerations that differ from UTF-8. Unpaired surrogates—a high surrogate without a following low surrogate, or a low surrogate without a preceding high surrogate—are technically invalid UTF-16 but can occur in real-world data, particularly in systems that allow arbitrary 16-bit values to be stored in string fields. Some security vulnerabilities have exploited the difference between how UTF-16 validators and UTF-16 consumers handle these edge cases. Our tool correctly identifies and handles unpaired surrogates, displaying them clearly in the inspector so users can identify potentially problematic text.
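
A scan for unpaired surrogates can be sketched as follows (an illustration of the check described above, not this tool's validator):

```javascript
// Walk a string's code units and flag unpaired surrogates:
// a high surrogate not followed by a low one, or a lone low surrogate.
function findUnpairedSurrogates(s) {
  const issues = [];
  for (let i = 0; i < s.length; i++) {
    const u = s.charCodeAt(i);
    if (u >= 0xD800 && u <= 0xDBFF) {             // high surrogate
      const next = s.charCodeAt(i + 1);           // NaN at end of string
      if (next >= 0xDC00 && next <= 0xDFFF) i++;  // valid pair: skip low half
      else issues.push({ index: i, unit: u });
    } else if (u >= 0xDC00 && u <= 0xDFFF) {      // lone low surrogate
      issues.push({ index: i, unit: u });
    }
  }
  return issues;
}

findUnpairedSurrogates("ok\uD800!"); // one unpaired high surrogate at index 2
findUnpairedSurrogates("Hello🌍");   // [] -- valid pair
```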

The byte order mark (BOM) can also cause security-relevant issues. In some contexts, a BOM at the beginning of a file or string can cause parsing errors, introduce invisible characters that break string comparisons, or confuse system components that don't expect it. Our tool provides explicit BOM control, allowing users to add or remove BOMs as needed and understand their effect on the encoded output.

Tips for Best Results

When encoding text, start with the default "Hex Code Units (spaced)" format to see the raw UTF-16 code units clearly. This format shows each 16-bit code unit as a four-digit hexadecimal value, making it easy to identify BMP characters (single code units) and supplementary characters (surrogate pairs). Switch to other formats as needed: "Raw Bytes (Big Endian)" or "Raw Bytes (Little Endian)" for byte-level output, "Surrogate Pair Details" for a detailed breakdown of supplementary characters, and language-specific formats (JavaScript, Java, C#, Python, Rust) for direct use in source code.

When decoding, always try "Auto Detect" first. The auto-detection algorithm recognizes common UTF-16 encoded formats including spaced hex code units, \u-escaped strings, 0x-prefixed values, Base64-encoded UTF-16 data, and HTML entities. If auto-detection gives unexpected results, manually select the correct format. For byte-level input, ensure you select the correct endianness—the same bytes produce different text when interpreted as big endian versus little endian.
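
As a small illustration of one such decode path, spaced hex code units can be parsed back to text; String.fromCharCode accepts raw code units and valid surrogate pairs recombine naturally (a sketch, not the tool's parser):

```javascript
// Parse spaced hex code units (e.g. "D83C DF0D") back into a string.
function decodeHexCodeUnits(hex) {
  const units = hex.trim().split(/\s+/).map((h) => parseInt(h, 16));
  return String.fromCharCode(...units);
}

decodeHexCodeUnits("0048 0069"); // "Hi"
decodeHexCodeUnits("D83C DF0D"); // "🌍"
```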

The Inspector tab is your best tool for understanding text at a deep level. It shows each character's codepoint, name, whether it uses a surrogate pair, the exact code unit values, and the byte representation in both endianness modes. The UTF-8 comparison tab provides immediate insight into the storage efficiency differences between the two encodings for your specific text, which is valuable for making informed encoding decisions in application design.

Conclusion: The Essential UTF-16 Tool for Modern Development

Our UTF-16 encoder decoder is a comprehensive Unicode analysis environment that combines encoding and decoding in twenty-three formats, both endianness modes with optional BOM, character-by-character inspection with full Unicode metadata, UTF-8 vs UTF-16 comparison, batch processing, and an interactive Unicode reference system—all running entirely in your browser with complete privacy. Whether you need to encode UTF-16 online, decode UTF-16 online, debug surrogate pair issues in Java or JavaScript, analyze Windows API string data, generate byte arrays for .NET applications, or understand the encoding differences between UTF-8 and UTF-16 for your specific text content, our free online UTF-16 encoder decoder delivers accurate, professional results instantly and without any signup or data upload. This tool bridges the gap between the theoretical understanding of UTF-16 encoding and the practical need to work with it daily in real software development, making it an essential bookmark for every developer working with Windows, Java, JavaScript, .NET, or any system that uses UTF-16 as its native string encoding.

Frequently Asked Questions

What is the difference between UTF-16 and UTF-8?

UTF-16 is a variable-length character encoding that uses 16-bit code units (2 bytes each). Characters in the Basic Multilingual Plane (U+0000–U+FFFF) use one code unit (2 bytes), while supplementary characters (U+10000–U+10FFFF) use two code units (4 bytes) via surrogate pairs. UTF-8 uses 8-bit code units and 1–4 bytes per character. Key differences: UTF-8 is byte-order independent while UTF-16 has endianness concerns; UTF-8 is more efficient for ASCII-heavy text; UTF-16 is more efficient for CJK text; UTF-16 is the native encoding of Windows, Java, JavaScript, and .NET.

What are surrogate pairs and when do they occur?

Surrogate pairs are pairs of 16-bit code units used in UTF-16 to represent characters with codepoints above U+FFFF (outside the Basic Multilingual Plane). A high surrogate (D800–DBFF) followed by a low surrogate (DC00–DFFF) together encode one supplementary character. They occur for emoji (😀, 🌍, 🎉), musical symbols (𝄞), mathematical symbols, historic scripts, and other supplementary plane characters. For example, 🌍 (U+1F30D) becomes the surrogate pair D83C DF0D. Understanding surrogate pairs is critical when working with string operations in Java, JavaScript, and C#.

What is the difference between Big Endian and Little Endian UTF-16?

Since UTF-16 uses 16-bit code units, each unit occupies two bytes. The order of these bytes matters: Big Endian (BE) stores the most significant byte first (0x0041 = 00 41), while Little Endian (LE) stores the least significant byte first (0x0041 = 41 00). Windows typically uses UTF-16 LE, while network protocols often use UTF-16 BE. The Byte Order Mark (BOM, U+FEFF) at the start of a file indicates the byte order: FE FF = Big Endian, FF FE = Little Endian.

Why do JavaScript strings use UTF-16, and why does string.length differ from the character count?

JavaScript was created in 1995 when Unicode was still limited to 16 bits (UCS-2). JavaScript strings were designed as sequences of 16-bit code units. When Unicode expanded beyond 65,536 characters, JavaScript's string representation effectively became UTF-16 with surrogate pairs. This means string.length returns the number of UTF-16 code units, not characters. The emoji "🌍" has length 2 in JavaScript because it requires a surrogate pair. To correctly count characters, use [...string].length or Array.from(string).length.

What output formats does the tool support?

The tool supports 23 output formats: Hex Code Units (spaced and compact), Hex with 0x and \u prefixes, Decimal and Octal code units, Binary, Raw Bytes (BE and LE), Base64 (BE and LE), Surrogate Pair Details, Unicode Codepoints, JSON Escape, C# String Literal, Java/Kotlin Escape, JavaScript \u format, Python UTF-16 Bytes, HTML Entities (decimal and hex), Rust String Literal, and Byte Arrays (BE and LE). Each format is precisely formatted for direct use in its intended context.

What does the UTF-8 comparison feature show?

The Compare tab shows both UTF-8 and UTF-16 encoding for each character in your text, side by side. For each character, you can see its codepoint, the UTF-8 bytes (with byte count), and the UTF-16 code units (with byte count). A summary shows total bytes for each encoding, letting you see which is more efficient for your specific text. ASCII text is 50% smaller in UTF-8, while CJK text is 33% smaller in UTF-16. This helps developers make informed encoding decisions.

Can I use this tool for Windows development and debugging?

Absolutely. The Windows API uses UTF-16 LE for all "wide" (W) string functions. You can use this tool to: encode strings to UTF-16 LE byte sequences for comparison with memory dumps, decode UTF-16 LE bytes from debugger output or crash reports, generate C# or C++ string literals with proper Unicode escapes, verify correct encoding of file paths and registry values, and understand surrogate pair handling in Windows string operations. Set the byte order to "Little Endian" and use the "Raw Bytes (LE)" or "Byte Array (LE)" format for Windows-compatible output.

Is my text kept private?

Yes, completely. The UTF-16 Encoder / Decoder runs 100% in your web browser using JavaScript. Your text is processed entirely on your local device and is never transmitted to any server, stored in any database, or accessible by any third party. This makes the tool safe for use with confidential text, passwords, personal data, proprietary code, and any other sensitive content.

What is the BOM and should I include it?

BOM (Byte Order Mark) is the Unicode character U+FEFF placed at the beginning of a UTF-16 stream. In UTF-16 BE, it appears as bytes FE FF; in UTF-16 LE, it appears as FF FE. Its purpose is to indicate the byte order to the reader. Including a BOM is recommended when creating standalone UTF-16 files, as it eliminates ambiguity about endianness. However, many protocols and contexts (like network streams or API calls) don't use or expect a BOM. Our tool gives you explicit control with the "Add BOM" option.

How can I find surrogate pair issues in my text?

Paste your text into the tool and check the Inspector tab. You'll see which characters use surrogate pairs (shown with orange "Surrogate" badges). Each surrogate pair counts as 2 in Java's String.length() and JavaScript's string.length, but represents only 1 character. For example, "Hello🌍" has length 7 in Java/JS (5 ASCII code units + 2 surrogate code units) but only 6 characters. The Inspector makes this immediately visible. Use Java's codePointCount() or JS's [...string].length for correct character counts.