UsefulJS.String

The String module provides functionality for encoding, escaping and formatting strings. It may also apply a number of fixes to the built-in String object.

Static properties

UsefulJS.featureSupport.string

Adds entries to UsefulJS.featureSupport:

codePointEscaping

Whether \u{...} escape sequences can be used.
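
A minimal sketch of how such a check could be made, assuming nothing about how UsefulJS itself performs the detection. The idea is to compile a tiny snippet containing the escape at run time, so that an engine which rejects the syntax raises a catchable error instead of failing to parse the whole script. The name supportsCodePointEscapes is hypothetical.

```javascript
// Hypothetical feature test; not taken from the UsefulJS source.
function supportsCodePointEscapes() {
    try {
        // "\\u{41}" reaches the Function constructor as \u{41}, which is
        // the letter "A" when code point escapes are understood.
        return new Function("return '\\u{41}' === 'A';")() === true;
    } catch (e) {
        // Engines that reject the syntax throw a SyntaxError here.
        return false;
    }
}

var codePointEscaping = supportsCodePointEscapes();
```

Note that the Function constructor may be blocked by a strict Content Security Policy, in which case this sketch reports false.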

Formatting strings

UsefulJS.String.sprintf

Formats a string with placeholder fields replaced with argument values.

Syntax

UsefulJS.String.sprintf(fmt[, field1[, field2 ...]])

Parameters

fmt String
The string to format.
field1 ... fieldn String|Number
Ordered replacements for field placeholders in the format string.

Returns: String. The formatted string

Throws: TypeError: unrecognized format code; unexpected argument type for the field.

Usage

var company = "Holding Holdings, Inc", year = (new Date()).getFullYear(),
    copyright = UsefulJS.String.sprintf("Copyright (C) %04d %s. All rights reserved.", year, company);

Format field syntax

The general syntax of a sprintf format field is:

%[argno][flags][width][precision]type

Items in square brackets are optional and for some type values they may be ignored.
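
To make the layout of a field concrete, here is a rough sketch of a regular expression that splits a single field into its parts. The pattern and the parseField helper are illustrative assumptions based only on the syntax line above; they are not the parser UsefulJS uses.

```javascript
// Hypothetical pattern: %[argno$][flags][width][.precision]type
var FIELD = /^%(?:(\d+)\$)?([-+ 0#,]*)(\*|\d+)?(?:\.(\*|\d+))?([diuxXobBfeEcs%])$/;

// Splits one format field into its components, or returns null if the
// text does not look like a field.
function parseField(field) {
    var m = FIELD.exec(field);
    if (!m) {
        return null;
    }
    return {
        argno: m[1] ? parseInt(m[1], 10) : null,
        flags: m[2],
        width: m[3] || null,        // a number as text, or "*"
        precision: m[4] || null,    // a number as text, or "*"
        type: m[5]
    };
}

parseField("%1$#010b");
// { argno: 1, flags: "#0", width: "10", precision: null, type: "b" }
```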

What, no length field? If you have a C or Perl background you may have used, for example, the %ld sequence for a long int value. Let me put it like this: JAVASCRIPT IS NOT C! C is a strongly typed language and the compiler cares very much about the distinction between an int and long int. JavaScript is a weakly typed language; not only am I unable to make the distinction but I don't even care to.

Field types

Field type	Purpose	Notes
%d	Signed integer	Range is ±(2⁵³ - 1). Locale-aware
%i	32-bit signed integer	Range is -2³¹ to 2³¹ - 1. Not locale-aware
%u	32-bit unsigned integer	Range is 0 to 2³² - 1
%x / %X	Unsigned integer in hexadecimal notation	Range is the same as %u. %X uses uppercase characters and '0X' with the # flag
%o	Unsigned integer in octal notation	Range is the same as %u
%b / %B	Unsigned integer in binary notation	Range is the same as %u. %B uses uppercase '0B' with # flag
%f	Floating point value in decimal format	Locale-aware
%e / %E	Floating point value in exponential format	%E uses an uppercase 'E' for the exponent part; exponent value is a minimum of two digits
%c	A single character or surrogate pair	Codepoint in the range 0 to 0x10ffff
%s	A character string	-
%%	A literal '%' character	Note that x% is not a valid percentage format in certain locales

A number of fields familiar to C programmers are unimplemented. There is no way that I could implement the %p and %n fields and I wouldn't even if I could since they're features in search of a use case. In twenty five years I've never once used %g which behaves like %f or %e depending on the magnitude and has a funky interaction with the precision field. So that one's out too.

Argno

Generally, each field in the format string consumes the next argument. You can change this behaviour by specifying the argument number in the first slot in the format field. The syntax is n$ where n is the argument number. It must be a minimum of 1 since argument 0 is the format string itself. To illustrate:

UsefulJS.String.sprintf("%1$3u %1$#04x %1$#010b", 23); // " 23 0x17 0b00010111"

Flags

Flags control the appearance of the formatted value. You can use as many of them as you like, though some may be ignored.

Flag	Purpose	Notes
-	Left-align padded values	Fixed width fields are right-aligned unless this flag is set; left aligned fields ignore the '0' flag
+	Prefix positive numbers with '+'	Only used when the field indicates a signed value: %d, %i, %e, %f
<SPACE>	Prefix positive numbers with 'NBSP' (non-breaking space)	Only used when the field indicates a signed value
0	Pad numeric fields with '0' characters; otherwise NBSP characters are used	-
#	Prefixes binary, octal and hex values with a base identifier	Hex values are prefixed with '0x', octal values with '0' and binary values with '0b'. Affects the behaviour of the width field
,	Group the digits in the output value (e.g. "1,000" rather than "1000")	Only used when the field is %d or %f
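
The effect of the '#' flag on the unsigned field types can be approximated with the built-in Number.prototype.toString. This is a sketch only: radixSketch is a hypothetical helper, and it uses >>> 0 to force the value into the unsigned 32-bit range, which wraps out-of-range input. How UsefulJS itself treats out-of-range values is not stated here.

```javascript
// Hypothetical helper showing base prefixes for %x, %X, %o, %b and %B.
function radixSketch(value, type, alt) {
    var radix = { x: 16, X: 16, o: 8, b: 2, B: 2 }[type],
        prefix = { x: "0x", X: "0X", o: "0", b: "0b", B: "0B" }[type],
        // >>> 0 converts to an unsigned 32-bit integer (wrapping, not clamping)
        digits = (value >>> 0).toString(radix);
    if (type === "X") {
        digits = digits.toUpperCase();
    }
    return (alt ? prefix : "") + digits;
}

radixSketch(23, "x", true);   // "0x17"
radixSketch(23, "b", false);  // "10111"
radixSketch(255, "X", true);  // "0XFF"
```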

Width

The width field follows the flags and is a number that specifies the minimum width of the formatted field. Formatted values wider than this are not truncated. Field width is applied after all other formatting is complete, so precision and prefixes are taken into account when calculating how much padding is required:

UsefulJS.String.sprintf("%+7.4f", 1.0);       //  "+1.0000"
UsefulJS.String.sprintf("%+8.4f", 1.0);       // " +1.0000"

A dynamic width can be specified with the '*' character. This consumes an argument. To illustrate:

UsefulJS.String.sprintf("%*d", 2, 1);         // " 1"

The width field also controls the output width with the %s field type:

UsefulJS.String.sprintf("%8s", "expand");  // "  expand"
UsefulJS.String.sprintf("%-8s", "expand"); // "expand  "
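
The rule that width is applied last can be sketched as a padding step that only ever sees the finished text. applyWidth is a hypothetical helper, not part of UsefulJS, and it pads with whatever character it is given; an ordinary space is used below for readability.

```javascript
// Hypothetical helper: pads already-formatted text out to a minimum width.
// Text that is already wide enough is returned untouched, never truncated.
function applyWidth(formatted, width, leftAlign, padChar) {
    var pad = "";
    while (formatted.length + pad.length < width) {
        pad += padChar;
    }
    return leftAlign ? formatted + pad : pad + formatted;
}

applyWidth("+1.0000", 8, false, " ");  // " +1.0000"
applyWidth("+1.0000", 7, false, " ");  // "+1.0000"
applyWidth("expand", 8, true, " ");    // "expand  "
```

This deliberately ignores the interaction between the '0' flag and a sign or base prefix, where the padding belongs after the prefix rather than before it.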

Precision

The precision field follows the width and is a number preceded by a '.' character that specifies how many digits after the decimal point are to be displayed when used with the %f and %e field types. The default value is 6. Trailing zeroes in the output are not suppressed:

UsefulJS.String.sprintf("%f", 1);             // "1.000000"
UsefulJS.String.sprintf("%.2f", 1);           // "1.00"

As with the width, you can specify precision dynamically with the '*' character:

UsefulJS.String.sprintf("%.*f", 2, 1);        // "1.00"

When used with the %s field type, the precision value controls the output width, truncating if required:

UsefulJS.String.sprintf("%.8s", "truncated"); // "truncate"
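
The two meanings of precision map onto two built-ins: Number.prototype.toFixed for numbers and String.prototype.slice for strings. The following sketch assumes that mapping for illustration; precisionSketch is a hypothetical name, and the result is not locale-aware.

```javascript
// Hypothetical helper: applies a precision to a number (fixed decimals)
// or to a string (maximum length). The default of 6 applies to numbers only.
function precisionSketch(value, precision) {
    if (typeof value === "number") {
        return value.toFixed(precision === undefined ? 6 : precision);
    }
    return precision === undefined ? String(value) : String(value).slice(0, precision);
}

precisionSketch(1);               // "1.000000"
precisionSketch(1, 2);            // "1.00"
precisionSketch("truncated", 8);  // "truncate"
```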

Localization

The %d and %f fields are locale aware (that is, assuming that the UsefulJS.Number module is available). This means that the decimal separator and digits for the current locale are observed:

UsefulJS.Locale.current = "fr";
UsefulJS.String.sprintf("%.*f", 2, 1);        // "1,00"
UsefulJS.Locale.current = "hi";
UsefulJS.String.sprintf("%.*f", 2, 1);        // "१.००"

You can enable grouped output with the ',' (comma) flag to improve readability:

UsefulJS.Locale.current = "en-IN";
UsefulJS.String.sprintf("%,d", 100000);       // "1,00,000"

Beyond this, the formats used for numbers are fixed. If you have more complex formatting requirements (e.g. currency or suppressing trailing zeroes), you should format the numbers as a separate step and use %s fields.
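
As an illustration of the separate-step approach, the sketch below formats a currency value with the standard Intl.NumberFormat API and then drops the result into a %s field. The choice of Intl is an assumption made for the example; any formatter that returns a string would do. The typeof guard only keeps the snippet runnable where UsefulJS is not loaded.

```javascript
// Step 1: format the number with a dedicated formatter.
var price = new Intl.NumberFormat("en-US", {
    style: "currency",
    currency: "USD"
}).format(1234.5);   // "$1,234.50"

// Step 2: insert the finished text through a %s field.
var message = typeof UsefulJS !== "undefined"
    ? UsefulJS.String.sprintf("Total due: %s", price)
    : "Total due: " + price;   // fallback so the sketch runs standalone
```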

Internationalization

Here is a simple but functional internationalization framework:

var I18n = {
    strings : {
        de : {
            question : "Was ist das Ergebnis der Multiplikation %1$.1f von %2$.1f?",
            ...
        },
        en : {
            question : "What do you get when you multiply %1$.1f by %2$.1f?",
            ...
        },
        ...
    },
    
    resolve : function(key/*, arg1, arg2, ... */) {
        var fmt = I18n.strings[UsefulJS.Locale.current][key];
        if (!fmt) {
            return key;
        }
        // Copy all of the arguments: [key, arg1, arg2, ...]
        var args = Array.from(arguments);
        // Replace the key at the front with the format string
        args[0] = fmt;
        return UsefulJS.String.sprintf.apply(null, args);
    }       
};

UsefulJS.Locale.current = "de";
I18n.resolve("question", 6, 9);  // "Was ist das Ergebnis der Multiplikation 6,0 von 9,0?"

Note the use of positional parameters %1$.1f and %2$.1f. This is particularly important for %s fields. When your stringtable entries are translated, the word order can change very radically and there is no way of distinguishing one %s from another when the argument order is fixed. Specifying which arguments to use in the format string means that substitution won't produce gobbledegook.

String encoding

The functions in the UsefulJS.String.encode namespace are used for interchange with backend processes.

UsefulJS.String.encode.toUtf8

Returns its input, UTF-8 encoded.

Syntax

UsefulJS.String.encode.toUtf8(s)

Parameters

s String
The string to encode

Returns: String

Description

UTF-8 is a character encoding that uses a variable number of 8-bit bytes to encode a single character. The bit pattern in the first byte of the sequence says how many bytes are in an encoded sequence. UTF-8 has a number of intrinsic advantages over other character encodings:

Any of the 1.1 million or so codepoints defined by the Unicode standard can be represented unambiguously
Error recovery is a simple matter of scanning forward in the decode stream to the next start byte
Being 8-bit, byte ordering is a non-issue
Backwards compatibility with ASCII means that any program which uses an 8-bit char data type can handle UTF-8 data. Important values like the null terminator retain their special meanings so that UTF-8 encoded data can be used with, for example, std::string.

No more than four bytes are required to represent any defined character. Here are some examples:

UsefulJS.String.escape.js(UsefulJS.String.encode.toUtf8("$"));  // "\x24"; the encoded byte is unchanged
UsefulJS.String.escape.js(UsefulJS.String.encode.toUtf8("£"));  // "\xc2\xa3"
UsefulJS.String.escape.js(UsefulJS.String.encode.toUtf8("€"));  // "\xe2\x82\xac"
UsefulJS.String.escape.js(UsefulJS.String.encode.toUtf8("💩")); // "\xf0\x9f\x92\xa9"

Following the Noncharacter FAQ, noncharacter codepoints are encoded like any other. However, codepoints that represent isolated halves of a surrogate pair are encoded as "\xef\xbf\xbd", the UTF-8 encoding of U+FFFD, the replacement character. Note that no "BOM" (byte-order mark) is emitted - this is absolutely not needed for UTF-8. If, for some reason, the receiving program expects one, you can prefix the encoded string with "\xef\xbb\xbf".
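
The encoding rules described above can be sketched in a few lines. This is an illustrative encoder written for this page, not the UsefulJS implementation; toUtf8Sketch is a hypothetical name. Like the real function it returns a string in which each character holds one byte value.

```javascript
// Hypothetical UTF-8 encoder: returns a string of byte-valued characters.
function toUtf8Sketch(s) {
    var out = "", i, cp, lo;
    for (i = 0; i < s.length; i++) {
        cp = s.charCodeAt(i);
        if (cp >= 0xd800 && cp <= 0xdbff) {
            // High surrogate: only valid when a low surrogate follows
            lo = s.charCodeAt(i + 1);   // NaN at the end of the string
            if (lo >= 0xdc00 && lo <= 0xdfff) {
                cp = 0x10000 + ((cp - 0xd800) << 10) + (lo - 0xdc00);
                i++;
            } else {
                cp = 0xfffd;            // isolated high surrogate
            }
        } else if (cp >= 0xdc00 && cp <= 0xdfff) {
            cp = 0xfffd;                // isolated low surrogate
        }
        if (cp < 0x80) {
            out += String.fromCharCode(cp);
        } else if (cp < 0x800) {
            out += String.fromCharCode(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
        } else if (cp < 0x10000) {
            out += String.fromCharCode(0xe0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
        } else {
            out += String.fromCharCode(0xf0 | (cp >> 18),
                0x80 | ((cp >> 12) & 0x3f),
                0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
        }
    }
    return out;
}

toUtf8Sketch("\u20ac");  // "\xe2\x82\xac"
toUtf8Sketch("\ud800");  // "\xef\xbf\xbd"
```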

UsefulJS.String.encode.fromUtf8

UTF-8 decodes its input, returning a regular String.

Syntax

UsefulJS.String.encode.fromUtf8(s)

Parameters

s String
The string to decode

Returns: String

Description

Valid UTF-8 sequences in the input are decoded to the corresponding codepoints. This includes sequences that decode to noncharacters. Invalid sequences are decoded to U+FFFD, the replacement character. Invalid sequences are:

Sequences that decode to a codepoint >0x10ffff
Truncated sequences where one or more continuation bytes are missing
Continuation bytes where a start byte is expected
Sequences that decode to one half of a surrogate pair (codepoints between 0xd800 and 0xdfff)
Overlong sequences. Overlong sequences are ones where more bytes than necessary have been used to encode the codepoint, for example, the NUL character "\x00" being encoded as "\xc0\x80".
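
The checks in the list above can be sketched as a small decoder. This is an illustration written for this page rather than the UsefulJS code; fromUtf8Sketch is a hypothetical name. It assumes every character in the input holds a byte value (0 to 255), and it emits one U+FFFD per rejected start byte, which is only one of several reasonable policies.

```javascript
// Hypothetical UTF-8 decoder: input is a string of byte-valued characters.
function fromUtf8Sketch(s) {
    var out = "", i = 0, b, need, cp, min, j, c, ok;
    while (i < s.length) {
        b = s.charCodeAt(i++);
        if (b < 0x80) {
            out += String.fromCharCode(b);
            continue;
        }
        // The start byte gives the sequence length; min is the smallest
        // codepoint that genuinely needs that many bytes.
        if (b >= 0xc0 && b < 0xe0) {
            need = 1; cp = b & 0x1f; min = 0x80;
        } else if (b >= 0xe0 && b < 0xf0) {
            need = 2; cp = b & 0x0f; min = 0x800;
        } else if (b >= 0xf0 && b < 0xf8) {
            need = 3; cp = b & 0x07; min = 0x10000;
        } else {
            out += "\ufffd";    // continuation byte or invalid start byte
            continue;
        }
        ok = true;
        for (j = 0; j < need; j++) {
            c = s.charCodeAt(i);            // NaN past the end of the input
            if (!(c >= 0x80 && c <= 0xbf)) {
                ok = false;                 // truncated sequence
                break;
            }
            cp = (cp << 6) | (c & 0x3f);
            i++;
        }
        if (!ok || cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) {
            out += "\ufffd";    // truncated, overlong, too large or surrogate
        } else if (cp >= 0x10000) {
            cp -= 0x10000;      // split into a surrogate pair
            out += String.fromCharCode(0xd800 + (cp >> 10), 0xdc00 + (cp & 0x3ff));
        } else {
            out += String.fromCharCode(cp);
        }
    }
    return out;
}

fromUtf8Sketch("\xe2\x82\xac");  // "\u20ac"
fromUtf8Sketch("\xe2\x82");      // "\ufffd" (truncated)
```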

UsefulJS.String.encode.lf

Canonicalizes line-endings as "\n".

Syntax

UsefulJS.String.encode.lf(s)

Parameters

s String
The string to encode

Returns: String

Description

Strips carriage returns out of its argument and returns the result.
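
Taken literally, that description amounts to a one-line replace. The sketch below assumes exactly that behaviour; lfSketch is a hypothetical name.

```javascript
// Hypothetical equivalent: remove every carriage return.
function lfSketch(s) {
    return String(s).replace(/\r/g, "");
}

lfSketch("one\r\ntwo\r\n");  // "one\ntwo\n"
lfSketch("one\rtwo");        // "onetwo"
```

As the second call shows, a lone carriage return is removed rather than converted to a newline under this reading.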

UsefulJS.String.encode.crlf

Canonicalizes line-endings as "\r\n".

Syntax

UsefulJS.String.encode.crlf(s)

Parameters

s String
The string to encode

Returns: String

Description

Replaces all newlines in its argument with carriage return / newline pairs. If called twice on the same string, carriage returns will not be doubled-up.
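
One way to get the "no doubling" property is to match an optional carriage return along with each newline, so that an existing pair is replaced by itself. This is a sketch under that assumption; crlfSketch is a hypothetical name, and it leaves a lone carriage return alone, which may or may not match UsefulJS.

```javascript
// Hypothetical equivalent: every "\n" or "\r\n" becomes "\r\n".
function crlfSketch(s) {
    return String(s).replace(/\r?\n/g, "\r\n");
}

crlfSketch("one\ntwo\r\nthree");              // "one\r\ntwo\r\nthree"
crlfSketch(crlfSketch("one\ntwo\r\nthree"));  // same result
```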

Implementation notes

UTF-8 encoding and decoding may trivially be done using the deprecated escape and unescape functions:

var s = "...",
    encoded = unescape(encodeURIComponent(s)),
    decoded = decodeURIComponent(escape(encoded));

This works because encodeURIComponent UTF-8 encodes its argument and %-escapes the individual octets, as URLs require. escape and unescape are completely unaware of this and treat the UTF-8 byte values as individual characters.

I chose to implement my own codec for two reasons. The primary reason is the dependence on deprecated functions which may stop working at any time. The other is control over error handling; decodeURIComponent throws on invalid input (with, given the context, a slightly weird "Malformed URI" error) while I prefer to replace bad sequences with a replacement character and carry on.
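
The error-handling difference is easy to demonstrate. In the sketch below, decodeOrNull is a hypothetical wrapper around the built-in approach; it catches the URIError that decodeURIComponent raises for malformed input, at which point the whole string has been rejected rather than just the bad sequence.

```javascript
// Hypothetical wrapper around the escape/decodeURIComponent trick.
function decodeOrNull(bytes) {
    try {
        return decodeURIComponent(escape(bytes));
    } catch (e) {
        if (e instanceof URIError) {
            return null;    // the entire input is lost, not one sequence
        }
        throw e;
    }
}

decodeOrNull("\xc2\xa3");     // "\u00a3"
decodeOrNull("ok \xc2 ok");   // null: one truncated sequence spoils it all
```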

String escaping

Functions in the UsefulJS.String.escape namespace are used to escape potentially problematic characters in strings before they're used in various contexts.

UsefulJS.String.escape.html

Escapes a String for use in HTML

Syntax

UsefulJS.String.escape.html(s)

Parameters

s String
The string to escape

Returns: String

Description

This is a basic escape function, only escaping &<>'"/. It's intended to prevent you from shooting yourself in the foot when using innerHTML. The better approach is to use document.createTextNode or element.setAttribute, which treat strings as plain old data that will never be interpreted as markup. The function does not emit character entities like &reg; since these have not been required for years - simply specify a UTF-8 charset in a <meta> element.
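
A function of this kind is typically a lookup table plus one replace call. The sketch below is an illustration only: escapeHtmlSketch and HTML_ESCAPES are hypothetical names, and the particular references chosen for the quote and slash characters are assumptions, since the exact output of UsefulJS is not documented here.

```javascript
// Hypothetical table covering the six characters named above.
var HTML_ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    "'": "&#39;",
    "\"": "&quot;",
    "/": "&#47;"
};

function escapeHtmlSketch(s) {
    return String(s).replace(/[&<>'"\/]/g, function (ch) {
        return HTML_ESCAPES[ch];
    });
}

escapeHtmlSketch("<b>Tom & Jerry</b>");
// "&lt;b&gt;Tom &amp; Jerry&lt;&#47;b&gt;"
```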

UsefulJS.String.escape.js

Escapes a String for use in JavaScript

Syntax

UsefulJS.String.escape.js(s)

Parameters

s String
The string to escape

Returns: String

Description

This uses a fairly paranoid escape algorithm: only ASCII alphanumeric characters (A-Z, a-z and 0-9) are not escaped. Otherwise, codepoints below U+0100 are escaped using the form \xHH where 'H' is a hexadecimal digit. The escaping of other codepoints depends on the capabilities of the browser: if the \u{H...H} escaping style is supported, this form will be used; if not the escape sequence will use the \uHHHH style.
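
The algorithm can be sketched with a single replace over everything that is not an ASCII letter or digit. escapeJsSketch is a hypothetical name, and for brevity the sketch always uses the \uHHHH form above U+00FF, so an astral character comes out as two escaped surrogates rather than one \u{H...H} sequence.

```javascript
// Hypothetical escaper: \xHH below U+0100, \uHHHH otherwise.
function escapeJsSketch(s) {
    return String(s).replace(/[^A-Za-z0-9]/g, function (ch) {
        var code = ch.charCodeAt(0),
            hex = code.toString(16);
        if (code < 0x100) {
            return "\\x" + (hex.length < 2 ? "0" + hex : hex);
        }
        return "\\u" + ("0000" + hex).slice(-4);
    });
}

escapeJsSketch("a b");      // "a\\x20b", i.e. the text a\x20b
escapeJsSketch("\u20ac");   // "\\u20ac", i.e. the text \u20ac
```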

UsefulJS.String.escape.regex

Escapes a String for use in regular expressions

Syntax

UsefulJS.String.escape.regex(s)

Parameters

s String
The string to escape

Returns: String

Description

String values passed to the RegExp constructor may need to be escaped if you don't want certain characters to be interpreted as part of the regular expression syntax. This function backslash-escapes the following characters: \.-[]{}()^$|+?*. Note that '/' does not need to be escaped unless you're in the habit of constructing dynamic regexes with eval.
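
A sketch of the idea, using the character list above: match any of the special characters and put a backslash in front of it. escapeRegexSketch is a hypothetical name and this is not the UsefulJS source.

```javascript
// Hypothetical escaper for the characters \ . - [ ] { } ( ) ^ $ | + ? *
function escapeRegexSketch(s) {
    // "$&" in the replacement stands for the matched character
    return String(s).replace(/[\\.\-\[\]{}()^$|+?*]/g, "\\$&");
}

var literal = "1+1=2?",
    re = new RegExp("^" + escapeRegexSketch(literal) + "$");

re.test("1+1=2?");   // true: matched literally
re.test("11=2");     // false: "+" and "?" are no longer quantifiers
```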

Fixes

The fixes for the String module implement a number of ES5/6 methods and are defined in the _string namespace of the fix options. Fixes are applied automatically apart from the padLeft and padRight options which, as library extensions, must be explicitly enabled.

String prototype methods are implemented generically so that you can apply any value to them:

String.prototype.includes.call([1,2,3,4], ",3,")

As per spec, however, they will throw a TypeError if the this value (the first argument to call) is null or undefined.
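
The generic behaviour comes down to two steps: reject null and undefined, then convert whatever is left to a string. The standalone sketch below shows that shape; includesGeneric is a hypothetical function that takes the this value as an ordinary parameter, and it omits other details of the specification such as rejecting a RegExp search value.

```javascript
// Hypothetical standalone version of a generic "includes".
function includesGeneric(thisValue, search, pos) {
    if (thisValue === null || thisValue === undefined) {
        throw new TypeError("includes called on null or undefined");
    }
    // Any other value is converted with String(), so arrays, numbers
    // and objects all work.
    return String(thisValue).indexOf(String(search), pos || 0) !== -1;
}

includesGeneric([1, 2, 3, 4], ",3,");  // true: the array becomes "1,2,3,4"
includesGeneric(12345, "34");          // true
```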

startsWith

Adds a startsWith method to String.prototype if it is not implemented natively.

Syntax

string.startsWith(t)

Description

See the MDN documentation for details.

endsWith

Adds an endsWith method to String.prototype if it is not implemented natively.

Syntax

string.endsWith(t)

Description

See the MDN documentation for details.

includes

Adds an includes method to String.prototype if it is not implemented natively.

Syntax

string.includes(t[, pos])

Description

See the MDN documentation for details.

trim

Adds a trim method to String.prototype if it is not implemented natively.

Syntax

string.trim()

Description

See the MDN documentation for details. The pattern used for whitespace is Unicode-aware.

trimLeft

Adds a trimLeft method to String.prototype if it is not implemented natively.

Syntax

string.trimLeft()

Description

See the MDN documentation for details.

trimRight

Adds a trimRight method to String.prototype if it is not implemented natively.

Syntax

string.trimRight()

Description

See the MDN documentation for details.

repeat

Adds a repeat method to String.prototype if it is not implemented natively.

Syntax

string.repeat(count)

Description

See the MDN documentation for details.

padLeft

Adds a padLeft method to String.prototype.

Syntax

string.padLeft(padTo, padWith)

Description

Adds 0 or more copies of the character in padWith to the start of string so that it is at least padTo characters long and returns the result. If string is already long enough, no padding is applied.

padRight

Adds a padRight method to String.prototype.

Syntax

string.padRight(padTo, padWith)

Description

Adds 0 or more copies of the character in padWith to the end of string so that it is at least padTo characters long and returns the result. If string is already long enough, no padding is applied.
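
Both padding methods follow the same pattern, shown here as standalone functions so that nothing is added to String.prototype. The names padLeftSketch and padRightSketch are hypothetical, and the sketch assumes padWith is a single character, as the descriptions state.

```javascript
// Hypothetical standalone equivalents of padLeft and padRight.
function padLeftSketch(s, padTo, padWith) {
    var out = String(s);
    while (out.length < padTo) {
        out = padWith + out;
    }
    return out;
}

function padRightSketch(s, padTo, padWith) {
    var out = String(s);
    while (out.length < padTo) {
        out += padWith;
    }
    return out;
}

padLeftSketch("7", 3, "0");     // "007"
padRightSketch("ab", 4, ".");   // "ab.."
padLeftSketch("long", 2, " ");  // "long": already long enough
```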

fromCodePoint

Adds a fromCodePoint factory method to String if it is not implemented natively.

Syntax

String.fromCodePoint(codePoint1[, codePoint2, ...])

Description

See the MDN documentation for details.

codePointAt

Adds a codePointAt method to String.prototype if it is not implemented natively.

Syntax

string.codePointAt(pos)

Description

See the MDN documentation for details.
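
A short usage example showing how the two code point methods relate to each other and to the older UTF-16 based methods. It uses only the standard methods, whether native or supplied by the fix.

```javascript
var poo = String.fromCodePoint(0x1f4a9);   // one character, two UTF-16 units

poo.length;           // 2
poo.codePointAt(0);   // 128169 (0x1f4a9): the whole code point
poo.charCodeAt(0);    // 55357 (0xd83d): just the high surrogate
poo.codePointAt(1);   // 56489 (0xdca9): index 1 lands on the low surrogate
```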