UsefulJS.String

The String module provides functionality for encoding, escaping and formatting strings. It may also apply a number of fixes to the built-in String object.

Static properties

UsefulJS.featureSupport.string

Adds entries to UsefulJS.featureSupport:

Formatting strings

UsefulJS.String.sprintf

Formats a string with placeholder fields replaced with argument values.

Syntax
UsefulJS.String.sprintf(fmt[, field1[, field2 ...]])
Parameters

Returns: String. The formatted string

Throws: TypeError: unrecognized format code; unexpected argument type for the field.

Usage
var company = "Holding Holdings, Inc", year = (new Date()).getFullYear(),
    copyright = UsefulJS.String.sprintf("Copyright (C) %04d %s. All rights reserved.", year, company);

Format field syntax

The general syntax of a sprintf format field is:

%[argno][flags][width][precision]type

Items in square brackets are optional and for some type values they may be ignored.
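
To make the field grammar concrete, here is a minimal sketch of how a single field could be split into its parts. parseField and FIELD_RE are hypothetical names used for illustration only; this is not the library's parser, and the set of flag and type characters is taken from the tables in this document.

```javascript
// Hypothetical helper, not part of UsefulJS.
// The pattern mirrors %[argno][flags][width][precision]type.
var FIELD_RE = /^%(?:(\d+)\$)?([-+ 0#,]*)(\*|\d+)?(?:\.(\*|\d+))?([diuxXobBfeEcs%])$/;

function parseField(field) {
    var m = FIELD_RE.exec(field);
    if (!m) {
        // Mirrors the documented behaviour for unrecognized format codes
        throw new TypeError("unrecognized format field: " + field);
    }
    return {
        argno: m[1] ? parseInt(m[1], 10) : null,     // 1-based argument number, if given
        flags: m[2],                                 // e.g. "#0"; empty string when absent
        width: m[3] === undefined ? null : m[3],     // digits or "*", kept as a string
        precision: m[4] === undefined ? null : m[4], // digits or "*", kept as a string
        type: m[5]
    };
}
```

With this sketch, parseField("%1$#010b") yields argno 1, flags "#0", width "10" and type "b", while a C-style "%ld" is rejected with a TypeError.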

What, no length field? If you have a C or Perl background you may have used, for example, the %ld sequence for a long int value. Let me put it like this: JAVASCRIPT IS NOT C! C is a strongly typed language and the compiler cares very much about the distinction between an int and long int. JavaScript is a weakly typed language; not only am I unable to make the distinction but I don't even care to.

Field types

Field type  Purpose                                     Notes
%d          Signed integer                              Range is ±(2^53 - 1). Locale-aware
%i          32-bit signed integer                       Range is -2^31 to 2^31 - 1. Not locale-aware
%u          32-bit unsigned integer                     Range is 0 to 2^32 - 1
%x / %X     Unsigned integer in hexadecimal notation    Range is the same as %u. %X uses uppercase characters and '0X' with the # flag
%o          Unsigned integer in octal notation          Range is the same as %u
%b / %B     Unsigned integer in binary notation         Range is the same as %u. %B uses uppercase '0B' with the # flag
%f          Floating point value in decimal format      Locale-aware
%e / %E     Floating point value in exponential format  %E uses an uppercase 'E' for the exponent part; exponent value is a minimum of two digits
%c          A single character or surrogate pair        Codepoint in the range 0 to 0x10ffff
%s          A character string                          -
%%          A literal '%' character                     Note that x% is not a valid percentage format in certain locales

A number of fields familiar to C programmers are unimplemented. There is no way that I could implement the %p and %n fields and I wouldn't even if I could, since they're features in search of a use case. In twenty-five years I've never once used %g, which behaves like %f or %e depending on the magnitude and has a funky interaction with the precision field. So that one's out too.

Argno

Generally, each field in the format string consumes the next argument. You can change this behaviour by specifying the argument number in the first slot in the format field. The syntax is n$ where n is the argument number. It must be a minimum of 1 since argument 0 is the format string itself. To illustrate:

UsefulJS.String.sprintf("%1$3u %1$#04x %1$#010b", 23); // " 23 0x17 0b00010111"

Flags

Flags control the appearance of the formatted value. You can use as many of them as you like, though some may be ignored.

Flag     Purpose                                                                    Notes
-        Left-align padded values                                                   Fixed width fields are right-aligned unless this flag is set; left-aligned fields ignore the '0' flag
+        Prefix positive numbers with '+'                                           Only used when the field indicates a signed value: %d, %i, %e, %f
<SPACE>  Prefix positive numbers with 'NBSP' (non-breaking space)                   Only used when the field indicates a signed value
0        Pad numeric fields with '0' characters; otherwise NBSP characters are used -
#        Prefixes binary, octal and hex values with a base identifier               Hex values are prefixed with '0x', octal values with '0' and binary values with '0b'. Affects the behaviour of the width field
,        Group the digits in the output value (e.g. "1,000" rather than "1000")     Only used when the field is %d or %f

Width

The width field follows the flags and is a number that specifies the minimum width of the formatted field. Formatted values wider than this are not truncated. Field width is applied after all other formatting is complete, so precision and prefixes are taken into account when calculating how much padding is required:

UsefulJS.String.sprintf("%+7.4f", 1.0);       //  "+1.0000"
UsefulJS.String.sprintf("%+8.4f", 1.0);       // " +1.0000"
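
The "width is applied last" rule can be sketched as follows. formatFixedSketch is a hypothetical helper, not the library's code: it handles only the '+' flag, and it pads with a plain space where, going by the flags table, the library pads with NBSP.

```javascript
// Hypothetical helper, not part of UsefulJS.
function formatFixedSketch(value, width, precision, plus) {
    // 1. Apply the precision to the magnitude
    var body = Math.abs(value).toFixed(precision);
    // 2. Add the sign prefix
    var sign = value < 0 ? "-" : (plus ? "+" : "");
    var out = sign + body;
    // 3. Only now pad to the minimum width (assumption: plain space, right-aligned)
    while (out.length < width) {
        out = " " + out;
    }
    return out;
}

formatFixedSketch(1.0, 7, 4, true); //  "+1.0000"
formatFixedSketch(1.0, 8, 4, true); // " +1.0000"
```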

A dynamic width can be specified with the '*' character. This consumes an argument. To illustrate:

UsefulJS.String.sprintf("%*d", 2, 1);         // " 1"

The width field also controls the output width with the %s field type:

UsefulJS.String.sprintf("%8s", "expand");  // "  expand"
UsefulJS.String.sprintf("%-8s", "expand"); // "expand  "

Precision

The precision field follows the width and is a number preceded by a '.' character that specifies how many digits after the decimal point are to be displayed when used with the %f and %e field types. The default value is 6. Trailing zeroes in the output are not suppressed:

UsefulJS.String.sprintf("%f", 1);             // "1.000000"
UsefulJS.String.sprintf("%.2f", 1);           // "1.00"

As with the width, you can specify precision dynamically with the '*' character:

UsefulJS.String.sprintf("%.*f", 2, 1);        // "1.00"

When used with the %s field type, the precision value controls the output width, truncating if required:

UsefulJS.String.sprintf("%.8s", "truncated"); // "truncate"
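
For %s fields, the interplay of precision (a maximum) and width (a minimum) might be sketched like this. fitString is a hypothetical name, the pad character defaults to a plain space as an assumption, and lengths are counted in UTF-16 code units, which may differ from what the library does.

```javascript
// Hypothetical helper, not part of UsefulJS.
function fitString(s, width, precision, leftAlign, padChar) {
    padChar = padChar || " ";            // assumption: plain space rather than NBSP
    if (precision != null && s.length > precision) {
        s = s.slice(0, precision);       // precision truncates
    }
    while (s.length < (width || 0)) {    // width only ever adds padding
        s = leftAlign ? s + padChar : padChar + s;
    }
    return s;
}

fitString("truncated", null, 8);    // "truncate"
fitString("expand", 8);             // "  expand"
fitString("expand", 8, null, true); // "expand  "
```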

Localization

The %d and %f fields are locale aware (that is, assuming that the UsefulJS.Number module is available). This means that the decimal separator and digits for the current locale are observed:

UsefulJS.Locale.current = "fr";
UsefulJS.String.sprintf("%.*f", 2, 1);        // "1,00"
UsefulJS.Locale.current = "hi";
UsefulJS.String.sprintf("%.*f", 2, 1);        // "१.००"

You can enable grouped output with the ',' (comma) flag to improve readability:

UsefulJS.Locale.current = "en-IN";
UsefulJS.String.sprintf("%,d", 100000);       // "1,00,000"

Beyond this, the formats used for numbers are fixed. If you have more complex formatting requirements (e.g. currency or suppressing trailing zeroes), you should format the numbers as a separate step and use %s fields.
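
As an illustration of the "format separately, then use %s" advice, here is a minimal sketch using the standard Intl.NumberFormat API for a currency value. priceLine is a hypothetical helper, and the sprintf function is passed in as a parameter so that the sketch does not assume anything about how UsefulJS is loaded.

```javascript
// Hypothetical helper, not part of UsefulJS.
function priceLine(sprintf, amount) {
    // Intl.NumberFormat takes care of the currency symbol, grouping and decimals
    var price = new Intl.NumberFormat("en-US", { style: "currency", currency: "USD" }).format(amount);
    // The pre-formatted value is substituted through a plain %s field
    return sprintf("Total due: %s", price);
}
```

Called as priceLine(UsefulJS.String.sprintf, 1234.5), this should produce "Total due: $1,234.50", since %s inserts the string unchanged.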

Internationalization

Here is a simple but functional internationalization framework:

var I18n = {
    strings : {
        de : {
            question : "Was ist das Ergebnis der Multiplikation %1$.1f von %2$.1f?",
            ...
        },
        en : {
            question : "What do you get when you multiply %1$.1f by %2$.1f?",
            ...
        },
        ...
    },
    
    resolve : function(key/*, arg1, arg2, ... */) {
        var fmt = I18n.strings[UsefulJS.Locale.current][key];
        if (!fmt) {
            return key;
        }
        // Copy all of the arguments, including the key in slot 0
        var args = Array.from(arguments);
        // Overwrite the key with the format string
        args[0] = fmt;
        return UsefulJS.String.sprintf.apply(null, args);
    }       
};

UsefulJS.Locale.current = "de";
I18n.resolve("question", 6, 9);  // "Was ist das Ergebnis der Multiplikation 6,0 von 9,0?"

Note the use of positional parameters %1$.1f and %2$.1f. This is particularly important for %s fields. When your stringtable entries are translated, the word order can change very radically and there is no way of distinguishing one %s from another when the argument order is fixed. Specifying which arguments to use in the format string means that substitution won't produce gobbledegook.

String encoding

The functions in the UsefulJS.String.encode namespace are used for interchange with backend processes.

UsefulJS.String.encode.toUtf8

Returns its input, UTF-8 encoded.

Syntax
UsefulJS.String.encode.toUtf8(s)
Parameters

Returns: String

Description

UTF-8 is a character encoding that uses a variable number of 8-bit bytes to encode a single character. The bit pattern in the first byte of the sequence says how many bytes are in an encoded sequence. UTF-8 has a number of intrinsic advantages over other character encodings:

No more than four bytes are required to represent any defined character. Here are some examples:

UsefulJS.String.escape.js(UsefulJS.String.encode.toUtf8("$"));  // "$"; unchanged
UsefulJS.String.escape.js(UsefulJS.String.encode.toUtf8("£"));  // "\xc2\xa3"
UsefulJS.String.escape.js(UsefulJS.String.encode.toUtf8("€"));  // "\xe2\x82\xac"
UsefulJS.String.escape.js(UsefulJS.String.encode.toUtf8("💩")); // "\xf0\x9f\x92\xa9"

Following the Noncharacter FAQ, noncharacter codepoints are encoded like any other. However, codepoints that represent isolated halves of a surrogate pair are encoded as "\xef\xbf\xbd", the UTF-8 encoding of U+FFFD, the replacement character. Note that no "BOM" (byte-order mark) is emitted - this is absolutely not needed for UTF-8. If, for some reason, the receiving program expects one, you can prefix the encoded string with "\xef\xbb\xbf".
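
The encoding rules above can be sketched as follows. toUtf8Sketch is a hypothetical stand-in, not the library's implementation; it assumes, going by the examples above, that the result is a "byte string" in which every character has a code between 0 and 255.

```javascript
// Hypothetical stand-in for UsefulJS.String.encode.toUtf8.
function toUtf8Sketch(s) {
    var out = "", i, cp, lo;
    for (i = 0; i < s.length; i++) {
        cp = s.charCodeAt(i);
        // Combine a high surrogate with a following low surrogate
        if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < s.length) {
            lo = s.charCodeAt(i + 1);
            if (lo >= 0xdc00 && lo <= 0xdfff) {
                cp = 0x10000 + ((cp - 0xd800) << 10) + (lo - 0xdc00);
                i++;
            }
        }
        // Anything still in the surrogate range is an isolated half
        if (cp >= 0xd800 && cp <= 0xdfff) {
            cp = 0xfffd;
        }
        if (cp < 0x80) {
            out += String.fromCharCode(cp);
        } else if (cp < 0x800) {
            out += String.fromCharCode(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
        } else if (cp < 0x10000) {
            out += String.fromCharCode(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
        } else {
            out += String.fromCharCode(0xf0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3f),
                0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
        }
    }
    return out;
}

toUtf8Sketch("€");      // "\xe2\x82\xac"
toUtf8Sketch("\ud83d"); // "\xef\xbf\xbd" (isolated surrogate half)
```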

UsefulJS.String.encode.fromUtf8

UTF-8 decodes its input, returning a regular String.

Syntax
UsefulJS.String.encode.fromUtf8(s)
Parameters

Returns: String

Description

Valid UTF-8 sequences in the input are decoded to the corresponding codepoints. This includes sequences that decode to noncharacters. Invalid sequences are decoded to U+FFFD, the replacement character. Invalid sequences are:

UsefulJS.String.encode.lf

Canonicalizes line-endings as "\n".

Syntax
UsefulJS.String.encode.lf(s)
Parameters

Returns: String

Description

Strips carriage returns out of its argument and returns the result.

UsefulJS.String.encode.crlf

Canonicalizes line-endings as "\r\n".

Syntax
UsefulJS.String.encode.crlf(s)
Parameters

Returns: String

Description

Replaces all newlines in its argument with carriage return / newline pairs. If called twice on the same string, carriage returns will not be doubled-up.
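
The behaviour described for lf and crlf can be approximated with two one-line replacements. lfSketch and crlfSketch are hypothetical names; the library's own handling of edge cases may differ.

```javascript
// Hypothetical stand-ins for UsefulJS.String.encode.lf and crlf.
function lfSketch(s) {
    // Strip every carriage return
    return s.replace(/\r/g, "");
}

function crlfSketch(s) {
    // Match "\n" with an optional preceding "\r" so existing pairs are not doubled up
    return s.replace(/\r?\n/g, "\r\n");
}

lfSketch("a\r\nb");               // "a\nb"
crlfSketch("a\nb");               // "a\r\nb"
crlfSketch(crlfSketch("a\nb"));   // "a\r\nb" (unchanged by the second call)
```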

Implementation notes

UTF-8 encoding and decoding may trivially be done using the deprecated escape and unescape functions:

var s = "...",
    encoded = unescape(encodeURIComponent(s)),
    decoded = decodeURIComponent(escape(encoded));

This works because URLs must be UTF-8 encoded with individual octets %-escaped. escape/unescape are completely unaware of this and treat the UTF-8 byte values as individual characters.

I chose to implement my own codec for a number of reasons. The primary reason is the dependence on deprecated functions which may stop working at any time. The other is control over error handling; decodeURIComponent throws on invalid input (with, given the context, a slightly weird "Malformed URI" error) while I prefer to replace bad sequences with a replacement character and carry on.
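
The difference in error handling can be seen in a short sketch. decodeStrict uses the escape/decodeURIComponent trick shown above and throws on bad input; decodeLenient uses the standard TextDecoder API (available in newer browsers and Node.js) purely to demonstrate the replace-and-carry-on policy. Both names are hypothetical, and neither is how the library itself decodes.

```javascript
// Assumption: the input is a "byte string" (every character code is 0-255).
function decodeStrict(byteString) {
    // Throws URIError ("malformed URI") on invalid UTF-8
    return decodeURIComponent(escape(byteString));
}

function decodeLenient(byteString) {
    var bytes = new Uint8Array(byteString.length), i;
    for (i = 0; i < byteString.length; i++) {
        bytes[i] = byteString.charCodeAt(i) & 0xff;
    }
    // The default (non-fatal) mode substitutes U+FFFD for malformed sequences
    return new TextDecoder("utf-8").decode(bytes);
}

decodeLenient("\xc2\xa3"); // "£"
decodeLenient("a\xffb");   // "a\ufffdb"
```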

String escaping

Functions in the UsefulJS.String.escape namespace are used to escape potentially problematic characters in strings before they're used in various contexts.

UsefulJS.String.escape.html

Escapes a String for use in HTML

Syntax
UsefulJS.String.escape.html(s)
Parameters

Returns: String

Description

This is a basic escape function that escapes only the characters &<>'"/. It's intended to prevent you from shooting yourself in the foot when using innerHTML. The better approach is to use document.createTextNode or element.setAttribute, which turn strings into plain old data that will never be interpreted. The function does not emit character entities like &reg; since these have not been required for years - simply specify a UTF-8 charset in a <meta> element.
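
A minimal sketch of an escape function covering the same six characters follows. escapeHtmlSketch and HTML_ESCAPES are hypothetical names, and the particular entity spellings (for example &#39; for the apostrophe) are an assumption; the library may emit different but equivalent references.

```javascript
// Hypothetical stand-in for UsefulJS.String.escape.html.
var HTML_ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    "'": "&#39;",
    '"': "&quot;",
    "/": "&#x2F;"
};

function escapeHtmlSketch(s) {
    // A single pass, so the '&' of an emitted entity is never re-escaped
    return String(s).replace(/[&<>'"\/]/g, function (ch) {
        return HTML_ESCAPES[ch];
    });
}

escapeHtmlSketch("<b>T&C</b>"); // "&lt;b&gt;T&amp;C&lt;&#x2F;b&gt;"
```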

UsefulJS.String.escape.js

Escapes a String for use in JavaScript

Syntax
UsefulJS.String.escape.js(s)
Parameters

Returns: String

Description

This uses a fairly paranoid escape algorithm: only ASCII alphanumeric characters (A-Z, a-z and 0-9) are not escaped. Otherwise, codepoints below U+0100 are escaped using the form \xHH where 'H' is a hexadecimal digit. The escaping of other codepoints depends on the capabilities of the browser: if the \u{H...H} escaping style is supported, this form will be used; if not the escape sequence will use the \uHHHH style.
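
Read literally, the algorithm described above might look like the following sketch. escapeJsSketch and hexPad are hypothetical names, the useBraces parameter stands in for the library's feature detection of \u{H...H}, and the lowercase hex digits follow the examples elsewhere in this document. The library's actual set of unescaped characters may be wider than this.

```javascript
// Hypothetical helpers, not part of UsefulJS.
function hexPad(n, len) {
    var h = n.toString(16);
    while (h.length < len) {
        h = "0" + h;
    }
    return h;
}

function escapeJsSketch(s, useBraces) {
    var out = "", i, code;
    for (i = 0; i < s.length; i++) {
        // With braces, read a whole codepoint; otherwise work on UTF-16 code units
        code = useBraces ? s.codePointAt(i) : s.charCodeAt(i);
        if (code > 0xffff) {
            i++; // skip the low half of the surrogate pair just consumed
        }
        if ((code >= 0x30 && code <= 0x39) || (code >= 0x41 && code <= 0x5a) ||
                (code >= 0x61 && code <= 0x7a)) {
            out += String.fromCharCode(code);        // 0-9, A-Z, a-z pass through
        } else if (code < 0x100) {
            out += "\\x" + hexPad(code, 2);
        } else if (useBraces) {
            out += "\\u{" + code.toString(16) + "}";
        } else {
            out += "\\u" + hexPad(code, 4);
        }
    }
    return out;
}

escapeJsSketch("a b");                // "a\\x20b"
escapeJsSketch("\ud83d\udca9", true); // "\\u{1f4a9}"
escapeJsSketch("\ud83d\udca9");       // "\\ud83d\\udca9"
```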

UsefulJS.String.escape.regex

Escapes a String for use in regular expressions

Syntax
UsefulJS.String.escape.regex(s)
Parameters

Returns: String

Description

String values passed to the RegExp constructor may need to be escaped if you don't want certain characters to be interpreted as part of the regular expression syntax. This function backslash-escapes the following characters: \.-[]{}()^$|+?*. Note that '/' does not need to be escaped unless you're in the habit of constructing dynamic regexes with eval.
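
A minimal sketch of such an escape function, covering exactly the characters listed above, could look like this. escapeRegexSketch is a hypothetical name, not the library's implementation.

```javascript
// Hypothetical stand-in for UsefulJS.String.escape.regex.
function escapeRegexSketch(s) {
    // "$&" in the replacement stands for the matched character
    return String(s).replace(/[\\.\-\[\]{}()^$|+?*]/g, "\\$&");
}

// Match user-supplied text literally inside a dynamic regex
var userText = "1+1=2?";
var re = new RegExp("^" + escapeRegexSketch(userText) + "$");
re.test("1+1=2?"); // true
re.test("11=2");   // false: '+' and '?' are no longer quantifiers
```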

Fixes

The fixes for the String module implement a number of ES5/6 methods and are defined in the _string namespace of the fix options. Fixes are applied automatically apart from the padLeft and padRight options which, as library extensions, must be explicitly enabled.

String prototype methods are implemented generically so that you can apply any value to them:

String.prototype.includes.call([1,2,3,4], ",3,")

As per spec, however, they will throw a TypeError if the this value (the first argument passed to call) is null or undefined.
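
Both behaviours can be checked with the native methods. This snippet uses only standard JavaScript; the variable names are arbitrary.

```javascript
// The array is converted to the string "1,2,3,4" before searching
var found = String.prototype.includes.call([1, 2, 3, 4], ",3,"); // true

// null and undefined cannot be converted, so a TypeError is thrown
var threwTypeError = false;
try {
    String.prototype.includes.call(null, "x");
} catch (e) {
    threwTypeError = e instanceof TypeError; // true
}
```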

startsWith

Adds a startsWith method to String.prototype if it is not implemented natively.

Syntax
string.startsWith(t)
Description

See the MDN documentation for details.

endsWith

Adds an endsWith method to String.prototype if it is not implemented natively.

Syntax
string.endsWith(t)
Description

See the MDN documentation for details.

includes

Adds an includes method to String.prototype if it is not implemented natively.

Syntax
string.includes(t[, pos])
Description

See the MDN documentation for details.

trim

Adds a trim method to String.prototype if it is not implemented natively.

Syntax
string.trim()
Description

See the MDN documentation for details. The pattern used for whitespace is Unicode-aware.

trimLeft

Adds a trimLeft method to String.prototype if it is not implemented natively.

Syntax
string.trimLeft()
Description

See the MDN documentation for details.

trimRight

Adds a trimRight method to String.prototype if it is not implemented natively.

Syntax
string.trimRight()
Description

See the MDN documentation for details.

repeat

Adds a repeat method to String.prototype if it is not implemented natively.

Syntax
string.repeat(count)
Description

See the MDN documentation for details.

padLeft

Adds a padLeft method to String.prototype.

Syntax
string.padLeft(padTo, padWith)
Description

Adds 0 or more copies of the character in padWith to the start of string so that it is at least padTo characters long and returns the result. If string is already long enough, no padding is applied.

padRight

Adds a padRight method to String.prototype.

Syntax
string.padRight(padTo, padWith)
Description

Adds 0 or more copies of the character in padWith to the end of string so that it is at least padTo characters long and returns the result. If string is already long enough, no padding is applied.
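
As standalone functions, the padding behaviour described for padLeft and padRight might be sketched as below. padLeftSketch and padRightSketch are hypothetical names; the sketch assumes padWith is a single character and returns the input unchanged if padWith is empty, which may not match the library.

```javascript
// Hypothetical stand-ins for the padLeft / padRight extensions.
function padLeftSketch(s, padTo, padWith) {
    s = String(s);
    if (!padWith) {
        return s;               // nothing to pad with
    }
    while (s.length < padTo) {
        s = padWith + s;
    }
    return s;
}

function padRightSketch(s, padTo, padWith) {
    s = String(s);
    if (!padWith) {
        return s;
    }
    while (s.length < padTo) {
        s = s + padWith;
    }
    return s;
}

padLeftSketch("7", 3, "0");     // "007"
padRightSketch("ab", 4, ".");   // "ab.."
padLeftSketch("12345", 3, "0"); // "12345" (already long enough)
```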

fromCodePoint

Adds a fromCodePoint factory method to String if it is not implemented natively.

Syntax
String.fromCodePoint(codePoint1[, codePoint2, ...])
Description

See the MDN documentation for details.

codePointAt

Adds a codePointAt method to String.prototype if it is not implemented natively.

Syntax
string.codePointAt(pos)
Description

See the MDN documentation for details.
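
A short illustration of why fromCodePoint and codePointAt matter for characters outside the Basic Multilingual Plane, using only the standard methods:

```javascript
// U+1F4A9 needs a surrogate pair in a JavaScript string
var pile = String.fromCodePoint(0x1f4a9);

var info = {
    length: pile.length,            // 2 UTF-16 code units
    codeUnit: pile.charCodeAt(0),   // 0xd83d, only the high surrogate
    codePoint: pile.codePointAt(0)  // 0x1f4a9, the whole codepoint
};
```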