The String module provides functionality for encoding, escaping and formatting strings.
It may also apply a number of fixes to the built-in String object.
Adds entries to UsefulJS.featureSupport:
Whether \u{...} escape sequences can be used.
Formats a string with placeholder fields replaced with argument values.
UsefulJS.String.sprintf(fmt[, field1[, field2 ...]])
fmt
String
field1 ... fieldn
String|Number
Returns: String. The formatted string
Throws: TypeError: unrecognized format code; unexpected argument type for the field.
var company = "Holding Holdings, Inc",
    year = (new Date()).getFullYear(),
    copyright = UsefulJS.String.sprintf("Copyright (C) %04d %s. All rights reserved.", year, company);
The general syntax of a sprintf format field is:
%[argno][flags][width][precision]type
Items in square brackets are optional and for some type values they may be ignored.
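To make the slots concrete, here is a minimal sketch of how a single format field could be split into its parts with a regular expression. The names FIELD_PATTERN and parseField are hypothetical and not part of UsefulJS; the library's own parser may work quite differently.

```javascript
// Hypothetical pattern for %[argno][flags][width][precision]type.
// Not taken from UsefulJS; it only mirrors the syntax described above.
var FIELD_PATTERN = /^%(?:(\d+)\$)?([-+ 0#,]*)(\*|\d+)?(?:\.(\*|\d+))?([diuxXobBfeEcs%])$/;

function parseField(field) {
    var m = FIELD_PATTERN.exec(field);
    if (!m) {
        throw new TypeError("unrecognized format code: " + field);
    }
    return {
        argno: m[1] ? parseInt(m[1], 10) : null,      // e.g. 1 for "1$"
        flags: m[2],                                  // e.g. "#0"
        width: m[3] === undefined ? null : m[3],      // digits or "*"
        precision: m[4] === undefined ? null : m[4],  // digits or "*"
        type: m[5]
    };
}

parseField("%1$#010b");
// { argno: 1, flags: "#0", width: "10", precision: null, type: "b" }
```

An unsupported type such as "%ld" does not match the pattern, so this sketch reports it with a TypeError.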
%ld sequence for a long int value. Let me put it like this: JAVASCRIPT IS NOT C! C is a strongly typed language and the compiler cares very much about the distinction between an int and a long int. JavaScript is a weakly typed language; not only am I unable to make the distinction but I don't even care to.
Field type | Purpose | Notes |
---|---|---|
%d | Signed integer | Range is ±(2^53 - 1). Locale-aware |
%i | 32-bit signed integer | Range is -2^31 to 2^31 - 1. Not locale-aware |
%u | 32-bit unsigned integer | Range is 0 to 2^32 - 1 |
%x / %X | Unsigned integer in hexadecimal notation | Range is the same as %u. %X uses uppercase characters and '0X' with the # flag |
%o | Unsigned integer in octal notation | Range is the same as %u |
%b / %B | Unsigned integer in binary notation | Range is the same as %u. %B uses uppercase '0B' with # flag |
%f | Floating point value in decimal format | Locale-aware |
%e / %E | Floating point value in exponential format | %E uses an uppercase 'E' for the exponent part; exponent value is a minimum of two digits |
%c | A single character or surrogate pair | Codepoint in the range 0 to 0x10ffff |
%s | A character string | - |
%% | A literal '%' character | Note that x% is not a valid percentage format in certain locales |
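The integer ranges in the table follow from how JavaScript represents numbers. The sketch below only illustrates those limits with standard operators; that sprintf uses these exact coercions internally is an assumption, and the helper names are hypothetical.

```javascript
// Largest integer a JavaScript Number can hold exactly: 2^53 - 1.
var maxExactInteger = Math.pow(2, 53) - 1;

// Bitwise OR coerces to a 32-bit signed integer (-2^31 to 2^31 - 1).
function toInt32(n) {
    return n | 0;
}

// Unsigned right shift coerces to a 32-bit unsigned integer (0 to 2^32 - 1).
function toUint32(n) {
    return n >>> 0;
}

toInt32(2147483648); // -2147483648: one past the signed maximum wraps around
toUint32(-1);        // 4294967295: the unsigned maximum, 2^32 - 1
```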
%p and %n fields and I wouldn't even if I could since they're features in search of a use case. In twenty-five years I've never once used %g which behaves like %f or %e depending on the magnitude and has a funky interaction with the precision field. So that one's out too.
Generally, each field in the format string consumes the next argument. You can change this behaviour by specifying the argument number in the first slot in the format field. The syntax is n$ where n is the argument number. It must be a minimum of 1 since argument 0 is the format string itself. To illustrate:
UsefulJS.String.sprintf("%1$3u %1$#04x %1$#010b", 23); // " 23 0x17 0b00010111"
Flags control the appearance of the formatted value. You can use as many of them as you like, though some may be ignored.
Flag | Purpose | Notes |
---|---|---|
- | Left-align padded values | Fixed width fields are right-aligned unless this flag is set; left aligned fields ignore the '0' flag |
+ | Prefix positive numbers with '+' | Only used when the field indicates a signed value: %d, %i, %e, %f |
<SPACE> | Prefix positive numbers with 'NBSP' (non-breaking space) | Only used when the field indicates a signed value |
0 | Pad numeric fields with '0' characters; otherwise NBSP characters are used | - |
# | Prefixes binary, octal and hex values with a base identifier | Hex values are prefixed with '0x', octal values with '0' and binary values with '0b'. Affects the behaviour of the width field |
, | Group the digits in the output value (e.g. "1,000" rather than "1000") | Only used when the field is %d or %f |
The width field follows the flags and is a number that specifies the minimum width of the formatted field. Formatted values wider than this are not truncated. Field width is applied after all other formatting is complete, so precision and prefixes are taken into account when calculating how much padding is required:
UsefulJS.String.sprintf("%+7.4f", 1.0); // "+1.0000"
UsefulJS.String.sprintf("%+8.4f", 1.0); // " +1.0000"
A dynamic width can be specified with the '*' character. This consumes an argument. To illustrate:
UsefulJS.String.sprintf("%*d", 2, 1); // " 1"
The width field also controls the output width with the %s field type:
UsefulJS.String.sprintf("%8s", "expand"); // "  expand"
UsefulJS.String.sprintf("%-8s", "expand"); // "expand  "
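As a rough illustration of the width rule for strings, the sketch below pads to a minimum width and never truncates. The padField helper is hypothetical, and using NBSP as the pad character for %s is an assumption based on the flags table.

```javascript
var NBSP = "\u00a0"; // assumed pad character, per the flags table

// Hypothetical helper: pads text to a minimum width.
// Right-aligns by default; leftAlign mirrors the '-' flag.
function padField(text, width, leftAlign) {
    var missing = width - text.length;
    if (missing <= 0) {
        return text; // wider values are left untouched
    }
    var padding = new Array(missing + 1).join(NBSP);
    return leftAlign ? text + padding : padding + text;
}

padField("expand", 8, false); // two NBSPs followed by "expand"
padField("expand", 8, true);  // "expand" followed by two NBSPs
padField("expanded!", 8);     // "expanded!" (unchanged)
```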
The precision field follows the width and is a number preceded by a '.' character that specifies how many digits after the decimal point are to be displayed when used with the %f and %e field types. The default value is 6. Trailing zeroes in the output are not suppressed:
UsefulJS.String.sprintf("%f", 1); // "1.000000"
UsefulJS.String.sprintf("%.2f", 1); // "1.00"
As with the width, you can specify precision dynamically with the '*' character:
UsefulJS.String.sprintf("%.*f", 2, 1); // "1.00"
When used with the %s field type, the precision value controls the output width, truncating if required:
UsefulJS.String.sprintf("%.8s", "truncated"); // "truncate"
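The two meanings of precision can be sketched as follows. applyPrecision is a hypothetical helper, it ignores locale handling entirely, and it truncates strings by UTF-16 code unit, which may not be what the library does.

```javascript
// Hypothetical helper showing how precision is interpreted per field type.
function applyPrecision(type, value, precision) {
    if (type === "f") {
        // Digits after the decimal point; the default of 6 matches the text above.
        return Number(value).toFixed(precision === undefined ? 6 : precision);
    }
    if (type === "s") {
        // Maximum output length; longer strings are truncated.
        var text = String(value);
        return precision === undefined ? text : text.slice(0, precision);
    }
    throw new TypeError("type not covered by this sketch: " + type);
}

applyPrecision("f", 1);              // "1.000000"
applyPrecision("f", 1, 2);           // "1.00"
applyPrecision("s", "truncated", 8); // "truncate"
```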
The %d and %f fields are locale aware (that is, assuming that the UsefulJS.Number module is available). This means that the decimal separator and digits for the current locale are observed:
UsefulJS.Locale.current = "fr";
UsefulJS.String.sprintf("%.*f", 2, 1); // "1,00"
UsefulJS.Locale.current = "hi";
UsefulJS.String.sprintf("%.*f", 2, 1); // "१.००"
You can enable grouped output with the ',' (comma) flag to improve readability:
UsefulJS.Locale.current = "en-IN"; UsefulJS.String.sprintf("%,d", 100000); // "1,00,000"
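The "en-IN" result above uses a group of three digits followed by groups of two. A minimal sketch of that kind of grouping, assuming the group sizes and separator are supplied by locale data, might look like this. groupDigits is a hypothetical helper, not the UsefulJS.Number implementation.

```javascript
// Hypothetical helper: groups a string of digits from the right.
// primary is the size of the last group, secondary the size of the others.
function groupDigits(digits, separator, primary, secondary) {
    secondary = secondary || primary;
    if (digits.length <= primary) {
        return digits;
    }
    var groups = [digits.slice(-primary)];
    var rest = digits.slice(0, -primary);
    while (rest.length > secondary) {
        groups.unshift(rest.slice(-secondary));
        rest = rest.slice(0, -secondary);
    }
    groups.unshift(rest);
    return groups.join(separator);
}

groupDigits("100000", ",", 3, 2); // "1,00,000" (Indian style)
groupDigits("100000", ",", 3);    // "100,000"
```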
Beyond this, the formats used for numbers are fixed. If you have more complex formatting requirements (e.g. currency or suppressing trailing zeroes), you should format the numbers as a separate step and use %s fields.
Here is a simple but functional internationalization framework:
var I18n = {
    strings : {
        de : {
            question : "Was ist das Ergebnis der Multiplikation %1$.1f von %2$.1f?",
            ...
        },
        en : {
            question : "What do you get when you multiply %1$.1f by %2$.1f?",
            ...
        },
        ...
    },
    resolve : function(key/*, arg1, arg2, ... */) {
        var fmt = I18n.strings[UsefulJS.Locale.current][key];
        if (!fmt) {
            return key;
        }
        // Copy the arguments; args[0] is the key
        var args = Array.from(arguments);
        // Replace the key with the format string
        args[0] = fmt;
        return UsefulJS.String.sprintf.apply(null, args);
    }
};
UsefulJS.Locale.current = "de";
I18n.resolve("question", 6, 9); // "Was ist das Ergebnis der Multiplikation 6,0 von 9,0?"
Note the use of positional parameters %1$.1f and %2$.1f. This is particularly important for %s fields. When your stringtable entries are translated, the word order can change very radically and there is no way of distinguishing one %s from another when the argument order is fixed. Specifying which arguments to use in the format string means that substitution won't produce gobbledegook.
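The effect of positional parameters on word order can be shown with a deliberately tiny formatter that understands only %s and %n$s. formatStrings is a hypothetical stand-in written for this illustration; it is not how UsefulJS.String.sprintf is implemented.

```javascript
// Hypothetical mini-formatter: supports only %s and %n$s.
function formatStrings(fmt) {
    var args = arguments;
    var next = 1; // argument 0 is the format string itself
    return fmt.replace(/%(?:(\d+)\$)?s/g, function (match, argno) {
        var index = argno ? parseInt(argno, 10) : next++;
        return String(args[index]);
    });
}

// Sequential fields consume arguments in order.
formatStrings("%s eats %s", "cat", "fish");
// "cat eats fish"

// A reordered sentence can reuse the same arguments.
formatStrings("%2$s is eaten by %1$s", "cat", "fish");
// "fish is eaten by cat"
```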
The functions in the
Returns its input, UTF-8 encoded.
UsefulJS.String.encode.toUtf8(s)
s
String
Returns: String
UTF-8 is a character encoding that uses a variable number of 8-bit bytes to encode a single character. The bit pattern in the first byte of the sequence says how many bytes are in an encoded sequence. UTF-8 has a number of intrinsic advantages over other character encodings:
char data type can handle UTF-8 data.
Important values like the null terminator retain their special meanings so that UTF-8 encoded data can be used with, for example, std::string.
No more than four bytes are required to represent any defined character. Here are some examples:
UsefulJS.String.escape.js(UsefulJS.String.encode.toUtf8("$")); // "$"; unchanged
UsefulJS.String.escape.js(UsefulJS.String.encode.toUtf8("£")); // "\xc2\xa3"
UsefulJS.String.escape.js(UsefulJS.String.encode.toUtf8("€")); // "\xe2\x82\xac"
UsefulJS.String.escape.js(UsefulJS.String.encode.toUtf8("💩")); // "\xf0\x9f\x92\xa9"
Following the Noncharacter FAQ, noncharacter codepoints are encoded like any other. However, codepoints that represent isolated halves of a surrogate pair are encoded as "\xef\xbf\xbd", the UTF-8 encoding of U+FFFD, the replacement character. Note that no "BOM" (byte-order mark) is emitted - this is absolutely not needed for UTF-8. If, for some reason, the receiving program expects one, you can prefix the encoded string with "\xef\xbb\xbf".
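For readers who want to see the bit patterns, here is a minimal sketch of an encoder with the behaviour described above: one output character per byte, and isolated surrogates replaced by U+FFFD. toUtf8Sketch is a hypothetical name and this is not the library's source.

```javascript
// Hypothetical encoder: returns a string in which each character is one UTF-8 byte.
function toUtf8Sketch(s) {
    var out = "", i, cp, low;
    for (i = 0; i < s.length; i++) {
        cp = s.charCodeAt(i);
        // Combine a valid surrogate pair into a single codepoint.
        if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < s.length) {
            low = s.charCodeAt(i + 1);
            if (low >= 0xdc00 && low <= 0xdfff) {
                cp = 0x10000 + ((cp - 0xd800) << 10) + (low - 0xdc00);
                i++;
            }
        }
        // Anything still in the surrogate range is an isolated half.
        if (cp >= 0xd800 && cp <= 0xdfff) {
            cp = 0xfffd;
        }
        if (cp < 0x80) {
            out += String.fromCharCode(cp);
        } else if (cp < 0x800) {
            out += String.fromCharCode(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
        } else if (cp < 0x10000) {
            out += String.fromCharCode(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
        } else {
            out += String.fromCharCode(0xf0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3f),
                0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
        }
    }
    return out;
}

toUtf8Sketch("\u20ac") === "\xe2\x82\xac"; // true: the euro sign takes three bytes
toUtf8Sketch("\ud83d") === "\xef\xbf\xbd"; // true: an isolated surrogate becomes U+FFFD
```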
UTF-8 decodes its input, returning a regular String.
UsefulJS.String.encode.fromUtf8(s)
s
String
Returns: String
Valid UTF-8 sequences in the input are decoded to the corresponding codepoints. This includes sequences that decode to noncharacters. Invalid sequences are decoded to U+FFFD, the replacement character. Invalid sequences are:
Canonicalizes line-endings as "\n".
UsefulJS.String.encode.lf(s)
s
String
Returns: String
Strips carriage returns out of its argument and returns the result.
Canonicalizes line-endings as "\r\n".
UsefulJS.String.encode.crlf(s)
s
String
Returns: String
Replaces all newlines in its argument with carriage return / newline pairs. If called twice on the same string, carriage returns will not be doubled-up.
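One way to get the "no doubling" property is to strip carriage returns before adding them back. The sketch below takes that approach; toLf and toCrlf are hypothetical names, and that the library works this way is an assumption.

```javascript
// Hypothetical equivalents of encode.lf and encode.crlf.
function toLf(s) {
    // Strip every carriage return.
    return s.replace(/\r/g, "");
}

function toCrlf(s) {
    // Normalize first so existing "\r\n" pairs are not turned into "\r\r\n".
    return toLf(s).replace(/\n/g, "\r\n");
}

toLf("one\r\ntwo\n");           // "one\ntwo\n"
toCrlf("one\ntwo\r\n");         // "one\r\ntwo\r\n"
toCrlf(toCrlf("one\ntwo\r\n")); // "one\r\ntwo\r\n" (calling it twice changes nothing)
```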
UTF-8 encoding and decoding may trivially be done using the deprecated escape and unescape functions:
var s = "...",
    encoded = unescape(encodeURIComponent(s)),
    decoded = decodeURIComponent(escape(encoded));
This works because URLs must be UTF-8 encoded with individual octets %-escaped. escape/unescape are completely unaware of this and treat the UTF-8 byte values as individual characters.
I chose to implement my own codec for a number of reasons. The primary reason is the dependence on deprecated functions which may stop working at any time. The other is control over error handling; decodeURIComponent throws on invalid input (with, given the context, a slightly weird "Malformed URI" error) while I prefer to replace bad sequences with a replacement character and carry on.
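The difference in error handling can be seen by comparing the escape-based decoder with a lenient one. The sketch below uses the standard TextDecoder as a stand-in for "replace and carry on"; it is not the UsefulJS codec, and decodeStrict and decodeLenient are hypothetical names.

```javascript
// Strict: the escape/decodeURIComponent trick throws on bad input.
function decodeStrict(binary) {
    return decodeURIComponent(escape(binary));
}

// Lenient: TextDecoder (non-fatal by default) substitutes U+FFFD for bad sequences.
function decodeLenient(binary) {
    var bytes = Uint8Array.from(binary, function (ch) {
        return ch.charCodeAt(0) & 0xff;
    });
    return new TextDecoder("utf-8").decode(bytes);
}

decodeLenient("\xe2\x82\xac"); // "€"
decodeLenient("a\xffb");       // "a\ufffdb": 0xff can never appear in valid UTF-8

try {
    decodeStrict("a\xffb");
} catch (e) {
    // e is a URIError
}
```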
Functions in the UsefulJS.String.escape namespace are used to escape potentially problematic characters in strings before they're used in various contexts.
Escapes a String for use in HTML
UsefulJS.String.escape.html(s)
s
String
Returns: String
This is a basic escape function, only escaping &<>'"/. It's intended to prevent you from shooting yourself in the foot when using innerHTML. The better approach is to use document.createTextNode or element.setAttribute, which turn strings into plain old data that will never be interpreted. The function does not emit character entities like &reg; since these have not been required for years - simply specify a UTF-8 charset in a <meta> element.
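A minimal sketch of an escaper covering the six characters listed above follows. The exact entity forms chosen here (for example &#39; for the apostrophe) are assumptions, and escapeHtmlSketch is a hypothetical name.

```javascript
// Assumed replacements for the six characters; the library may emit different forms.
var HTML_ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    "'": "&#39;",
    '"': "&quot;",
    "/": "&#x2F;"
};

function escapeHtmlSketch(s) {
    return String(s).replace(/[&<>'"\/]/g, function (ch) {
        return HTML_ESCAPES[ch];
    });
}

escapeHtmlSketch("<a href='x'>T&C</a>");
// "&lt;a href=&#39;x&#39;&gt;T&amp;C&lt;&#x2F;a&gt;"
```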
Escapes a String for use in JavaScript
UsefulJS.String.escape.js(s)
s
String
Returns: String
This uses a fairly paranoid escape algorithm: only ASCII alphanumeric characters (A-Z, a-z and 0-9) are not escaped. Otherwise, codepoints below U+0100 are escaped using the form \xHH where 'H' is a hexadecimal digit. The escaping of other codepoints depends on the capabilities of the browser: if the \u{H...H} escaping style is supported, this form will be used; if not the escape sequence will use the \uHHHH style.
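The rules above can be sketched as follows. escapeJsSketch is a hypothetical helper; the useBraces parameter stands in for the library's feature detection, and lowercase hex digits are assumed from the earlier examples.

```javascript
// Left-pads a hex number with zeroes to the given width.
function hex(n, width) {
    var h = n.toString(16);
    while (h.length < width) {
        h = "0" + h;
    }
    return h;
}

// Hypothetical escaper: useBraces selects \u{H...H} over \uHHHH.
function escapeJsSketch(s, useBraces) {
    var out = "", i, unit, cp;
    for (i = 0; i < s.length; i++) {
        unit = s.charCodeAt(i);
        if (/[A-Za-z0-9]/.test(s.charAt(i))) {
            out += s.charAt(i);              // ASCII alphanumerics pass through
        } else if (unit < 0x100) {
            out += "\\x" + hex(unit, 2);     // \xHH
        } else if (useBraces) {
            cp = s.codePointAt(i);
            if (cp > 0xffff) {
                i++;                         // skip the second half of the pair
            }
            out += "\\u{" + cp.toString(16) + "}";
        } else {
            out += "\\u" + hex(unit, 4);     // one escape per UTF-16 code unit
        }
    }
    return out;
}

escapeJsSketch("a b");                // "a\\x20b"
escapeJsSketch("\ud83d\udca9", true); // "\\u{1f4a9}"
escapeJsSketch("\ud83d\udca9");       // "\\ud83d\\udca9"
```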
Escapes a String for use in regular expressions
UsefulJS.String.escape.regex(s)
s
String
Returns: String
String values passed to the RegExp constructor may need to be escaped if you don't want certain characters to be interpreted as part of the regular expression syntax. This function backslash-escapes the following characters: \.-[]{}()^$|+?*. Note that '/' does not need to be escaped unless you're in the habit of constructing dynamic regexes with eval.
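A minimal sketch that backslash-escapes exactly the characters listed above is shown here; escapeRegexSketch is a hypothetical name rather than the library's source.

```javascript
// Hypothetical escaper for the characters \ . - [ ] { } ( ) ^ $ | + ? *
function escapeRegexSketch(s) {
    return String(s).replace(/[\\.\-\[\]{}()^$|+?*]/g, "\\$&");
}

escapeRegexSketch("1+1=2?"); // "1\\+1=2\\?"

// The escaped text matches only itself when used in a RegExp.
var literalDot = new RegExp("^" + escapeRegexSketch("a.b") + "$");
literalDot.test("a.b"); // true
literalDot.test("axb"); // false
```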
The fixes for the String module implement a number of ES5/6 methods and are defined in the _string namespace of the fix options. Fixes are applied automatically apart from the padLeft and padRight options which, as library extensions, must be explicitly enabled.
String prototype methods are implemented generically so that you can apply any value to them:
String.prototype.includes.call([1,2,3,4], ",3,")
As per spec, however, they will throw a TypeError if the this value (the first argument to call) is null or undefined.
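To show what "generic" means in practice, here is a minimal sketch of a startsWith-style function that converts whatever it is called on to a string. It is a simplified illustration with a hypothetical name, not the fix shipped by the library, and it omits the spec's rejection of RegExp arguments.

```javascript
// Hypothetical generic method; call it with .call() on any value.
function startsWithSketch(search, pos) {
    "use strict"; // so that a null or undefined this value is not replaced by the global object
    if (this === null || this === undefined) {
        throw new TypeError("startsWithSketch called on null or undefined");
    }
    var text = String(this);
    var needle = String(search);
    var start = Math.min(Math.max(pos | 0, 0), text.length);
    return text.slice(start, start + needle.length) === needle;
}

startsWithSketch.call("useful", "use");    // true
startsWithSketch.call(12345, "123");       // true: the number is converted to "12345"
startsWithSketch.call("useful", "ful", 3); // true
```

Calling it with null or undefined as the this value throws a TypeError.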
Adds a startsWith method to String.prototype if it is not implemented natively.
string.startsWith(t)
See the MDN documentation for details.
Adds an endsWith method to String.prototype if it is not implemented natively.
string.endsWith(t)
See the MDN documentation for details.
Adds an includes method to String.prototype if it is not implemented natively.
string.includes(t[, pos])
See the MDN documentation for details.
Adds a trim method to String.prototype if it is not implemented natively.
string.trim()
See the MDN documentation for details. The pattern used for whitespace is Unicode-aware.
Adds a trimLeft method to String.prototype if it is not implemented natively.
string.trimLeft()
See the MDN documentation for details.
Adds a trimRight method to String.prototype if it is not implemented natively.
string.trimRight()
See the MDN documentation for details.
Adds a repeat method to String.prototype if it is not implemented natively.
string.repeat(count)
See the MDN documentation for details.
Adds a padLeft method to String.prototype.
string.padLeft(padTo, padWith)
Adds 0 or more copies of the character in padWith to the start of string so that it is at least padTo characters long and returns the result. If string is already long enough, no padding is applied.
Adds a padRight method to String.prototype.
string.padRight(padTo, padWith)
Adds 0 or more copies of the character in padWith to the end of string so that it is at least padTo characters long and returns the result. If string is already long enough, no padding is applied.
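The padding behaviour described for padLeft and padRight can be sketched with two standalone functions. The names are hypothetical, and falling back to a space when padWith is empty is an assumption; the library's default is not stated here.

```javascript
// Hypothetical standalone versions of padLeft and padRight.
function padLeftSketch(text, padTo, padWith) {
    text = String(text);
    var ch = String(padWith).charAt(0) || " "; // assumed fallback pad character
    while (text.length < padTo) {
        text = ch + text;
    }
    return text;
}

function padRightSketch(text, padTo, padWith) {
    text = String(text);
    var ch = String(padWith).charAt(0) || " "; // assumed fallback pad character
    while (text.length < padTo) {
        text = text + ch;
    }
    return text;
}

padLeftSketch("7", 3, "0");    // "007"
padRightSketch("ab", 4, ".");  // "ab.."
padLeftSketch("long", 2, "0"); // "long" (already long enough)
```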
Adds a fromCodePoint factory method to String if it is not implemented natively.
String.fromCodePoint(codePoint1[, codePoint2, ...])
See the MDN documentation for details.
Adds a codePointAt method to String.prototype if it is not implemented natively.
string.codePointAt(pos)
See the MDN documentation for details.