Quick: English text ~1 byte/char (UTF-8). Non-English text (accented Latin, CJK) ~2-3 bytes/char. Todo label ~50-100 bytes.
UTF-8 Encoding (Most Common)
| Character Type | Bytes | Examples |
|---|---|---|
| ASCII | 1 | a-z, A-Z, 0-9, punctuation |
| Latin-1 Supplement / Latin Extended | 2 | é, ñ, ü, ø |
| CJK (Basic Multilingual Plane) | 3 | 中, 日, 한 |
| Emoji | 4 | 😀, 🎉, 🚀 |
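
The byte counts in the table can be checked directly. This is a minimal sketch: `utf8_len` is a hypothetical helper name, not part of any library, and it simply encodes the string and counts the resulting bytes.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes the string occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# One sample character from each row of the table above.
for ch in ["a", "é", "中", "😀"]:
    print(ch, utf8_len(ch))
# a 1
# é 2
# 中 3
# 😀 4
```

Note that `len(text)` in Python counts code points, so `len("😀")` is 1 even though the character takes 4 bytes in UTF-8.
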
In-Memory String Representation
In-memory size differs from on-disk size: each language runtime decodes text into its own internal format.
| Language | Internal Format | Bytes/Char |
|---|---|---|
| Python 3 | Flexible (Latin-1/UCS-2/UCS-4) | 1, 2, or 4 |
| JavaScript | UTF-16 | 2 or 4 |
| Java | UTF-16 (Latin-1 for compact strings since Java 9) | 1 or 2 (4 for supplementary characters) |
| Go / Rust | UTF-8 | 1-4 |
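
To see how the same text differs across these formats, the sketch below compares UTF-8 and UTF-16 byte counts and approximates the per-character width CPython picks for a string. The helper names are hypothetical, the width rule is an assumption based on the flexible representation in the table (1 byte if every code point fits in Latin-1, 2 if it fits in UCS-2, otherwise 4), and fixed per-object overhead is ignored.

```python
def utf8_bytes(text: str) -> int:
    """Bytes used by a UTF-8 runtime (e.g. Go, Rust) for the text payload."""
    return len(text.encode("utf-8"))


def utf16_bytes(text: str) -> int:
    """Bytes used by a UTF-16 runtime (e.g. JavaScript) for the text payload."""
    # "utf-16-le" avoids the 2-byte byte-order mark that plain "utf-16" prepends.
    return len(text.encode("utf-16-le"))


def cpython_bytes_per_char(text: str) -> int:
    """Approximate the per-character width CPython chooses (assumed rule)."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:      # every code point fits in Latin-1
        return 1
    if highest < 0x10000:    # every code point fits in UCS-2
        return 2
    return 4                 # at least one code point needs UCS-4


for sample in ["hello", "héllo", "中文", "hi😀"]:
    print(sample, utf8_bytes(sample), utf16_bytes(sample),
          cpython_bytes_per_char(sample))
```

Under this rule a single emoji widens every character in the string to 4 bytes, which is why the width is decided per string and not per character.
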
String Size Estimates
| Content Type | Chars | Bytes (UTF-8) |
|---|---|---|
| Todo label | 20-80 | ~50-100 |
| Tweet | 280 | ~280-560 |
| Email subject | 50-100 | ~50-150 |
| Username | 3-30 | ~3-30 |
| URL | 50-200 | ~50-200 |
| Paragraph | ~500 | ~500-800 |
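
These estimates are most useful when multiplied out for capacity planning. The following is a rough sketch: `measure` and `projected_bytes` are hypothetical helpers, the sample labels are made up for illustration, and the one-million-item count is an arbitrary assumption.

```python
def measure(text: str) -> tuple[int, int]:
    """Return (character count, UTF-8 byte count) for a string.

    The character count is in code points, not user-perceived characters.
    """
    return len(text), len(text.encode("utf-8"))


def projected_bytes(item_count: int, bytes_per_item: int) -> int:
    """Raw text payload for item_count strings, ignoring storage overhead."""
    return item_count * bytes_per_item


print(measure("Buy milk and eggs"))   # (17, 17)
print(measure("牛乳と卵を買う"))        # (7, 21)

# One million todo labels at the table's ~50-100 bytes each.
low = projected_bytes(1_000_000, 50)
high = projected_bytes(1_000_000, 100)
print(low, high)                      # 50000000 100000000
```

So one million todo labels amount to roughly 50-100 MB of raw text before any index, row, or object overhead.
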