Getting Started with BCP47
What is BCP 47?
BCP 47 (Best Current Practice 47) is the IETF standard that defines how human languages are identified in internet protocols. It is specified by two RFCs:
- RFC 5646: the syntax for language tags (e.g., en-US, zh-Hans-CN)
- RFC 4647: the rules for matching language tags to available resources
A BCP 47 tag is a sequence of subtags separated by hyphens:
language [-script] [-region] [-variant]* [-extension]* [-privateuse]
For example:
| Tag | Meaning |
|---|---|
| en | English |
| en-US | English as used in the United States |
| zh-Hans-CN | Chinese, Simplified script, as used in China |
| sr-Latn | Serbian written in the Latin script |
| de-1901 | German, traditional orthography (1901 variant) |
| x-myapp | Entirely private-use tag |
The canonical source of valid subtags is the IANA Language Subtag Registry.
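To make the tag structure above concrete, here is a minimal base-R sketch that labels each subtag by its shape alone. `classify_subtags()` is a hypothetical helper written for this illustration, not a function of the package. It assumes simple tags: it does not recognize extended language subtags, does not enforce subtag order, and lumps everything after a single-character subtag together.

```r
# Hypothetical helper (not part of the package): label each subtag of a
# BCP 47 tag by its shape. Extended language subtags are not recognized
# and subtag order is not enforced.
classify_subtags <- function(tag) {
  subtags <- strsplit(tag, "[-_]")[[1]]
  kinds <- character(length(subtags))
  in_tail <- FALSE  # TRUE once a single-character subtag has been seen
  for (i in seq_along(subtags)) {
    s <- subtags[i]
    if (nchar(s) == 1) in_tail <- TRUE
    kinds[i] <- if (in_tail) {
      "extension or private use"
    } else if (i == 1 && grepl("^[A-Za-z]{2,3}$", s)) {
      "language"
    } else if (grepl("^[A-Za-z]{4}$", s)) {
      "script"                       # four letters
    } else if (grepl("^([A-Za-z]{2}|[0-9]{3})$", s)) {
      "region"                       # two letters or three digits
    } else if (grepl("^([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})$", s)) {
      "variant"                      # 5-8 characters, or a digit plus 3
    } else {
      "other"
    }
  }
  stats::setNames(kinds, subtags)
}

classify_subtags("zh-Hans-CN")
classify_subtags("de-1901")
```

Shape alone is enough here because the subtag kinds have non-overlapping lengths and character classes. Whether a subtag is actually registered is a separate question.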
Parsing
bcp_parse() decomposes a tag into its named components.
All subtags are returned in lower-case. Both hyphens (-)
and underscores (_) are accepted as separators.
bcp_parse("en-US")
#> $language
#> [1] "en"
#>
#> $extlang
#> NULL
#>
#> $script
#> [1] NA
#>
#> $region
#> [1] "us"
#>
#> $variants
#> NULL
#>
#> $extensions
#> list()
#>
#> $private
#> NULL
bcp_parse("zh-Hans-CN")
#> $language
#> [1] "zh"
#>
#> $extlang
#> NULL
#>
#> $script
#> [1] "hans"
#>
#> $region
#> [1] "cn"
#>
#> $variants
#> NULL
#>
#> $extensions
#> list()
#>
#> $private
#> NULL
The returned list always has the same structure:
tag <- bcp_parse("sr-Latn-RS-rozaj-x-custom")
names(tag)
#> [1] "language" "extlang" "script" "region" "variants"
#> [6] "extensions" "private"
- language: primary language subtag (e.g., "sr")
- extlang: extended language subtags (NULL if absent)
- script: four-letter script subtag (NA if absent)
- region: two-letter or three-digit region subtag (NA if absent)
- variants: variant subtags (NULL if absent)
- extensions: named list of extension sequences
- private: private-use subtags (NULL if absent)
# Pure private-use tag
bcp_parse("x-myapp-v2")
#> $language
#> [1] NA
#>
#> $extlang
#> NULL
#>
#> $script
#> [1] NA
#>
#> $region
#> [1] NA
#>
#> $variants
#> NULL
#>
#> $extensions
#> list()
#>
#> $private
#> [1] "myapp" "v2"
# Extension subtags
bcp_parse("en-US-u-ca-gregory")
#> $language
#> [1] "en"
#>
#> $extlang
#> NULL
#>
#> $script
#> [1] NA
#>
#> $region
#> [1] "us"
#>
#> $variants
#> NULL
#>
#> $extensions
#> $extensions$u
#> [1] "ca" "gregory"
#>
#>
#> $private
#> NULL
Language Matching
bcp_match_language() implements the RFC 4647 “Lookup”
scheme. Given an ordered list of language preferences and a set of
available tags, it returns the best match.
Matching works by progressively stripping the rightmost subtag from each preference until a match is found:
en-US → en (strip region)
zh-Hans-CN → zh-Hans → zh (strip region, then script)
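The truncation steps above can be sketched in base R. `fallback_chain()` and `lookup_match()` are hypothetical helpers written for this illustration, not the package's implementation. Returning `NA` when nothing matches is an assumption of the sketch. Following RFC 4647, a single-character subtag left at the end after truncation is removed as well.

```r
# Hypothetical helpers (not the package's implementation): a base-R sketch
# of the RFC 4647 Lookup scheme.

# Build the fallback chain for one preference, most specific first,
# e.g. "zh-Hans-CN" -> "zh-Hans-CN", "zh-Hans", "zh".
fallback_chain <- function(pref) {
  subtags <- strsplit(pref, "-", fixed = TRUE)[[1]]
  chain <- character(0)
  while (length(subtags) > 0) {
    chain <- c(chain, paste(subtags, collapse = "-"))
    subtags <- subtags[-length(subtags)]
    # A single-character subtag left at the end is dropped too
    if (length(subtags) > 0 && nchar(subtags[length(subtags)]) == 1) {
      subtags <- subtags[-length(subtags)]
    }
  }
  chain
}

# Try each preference in order and return the first available tag that
# equals one of its fallbacks, ignoring case.
lookup_match <- function(preferences, available, default = NA_character_) {
  available_lower <- tolower(available)
  for (pref in preferences) {
    for (candidate in tolower(fallback_chain(pref))) {
      hit <- match(candidate, available_lower)
      if (!is.na(hit)) {
        return(available[hit])  # keep the original casing from `available`
      }
    }
  }
  default  # assumption of this sketch: NA unless the caller supplies a default
}

fallback_chain("zh-Hans-CN")
lookup_match(c("de-AT", "en-GB"), c("en", "fr"))
```

Each preference is exhausted before the next one is tried, so an exact or truncated match for an earlier preference always wins over a later preference.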
# User prefers en-US; only 'en' is available — falls back to the base language
bcp_match_language("en-US", c("en", "fr", "de"))
#> [1] "en"
# Multiple preferences: de-AT and its fallback de are unavailable, so en-GB is tried and falls back to en
bcp_match_language(c("de-AT", "en-GB"), c("en", "fr"))
#> [1] "en"
# Prefer Traditional Chinese, fall back to Simplified, then English
bcp_match_language(
c("zh-Hant-TW", "zh-Hans", "en"),
c("zh-Hans", "en", "fr")
)
#> [1] "zh-Hans"
# No match — return a default value
bcp_match_language("pt-BR", c("fr", "de"), default = "en")
#> [1] "en"
Matching is case-insensitive. The original casing from available is preserved in the return value:
bcp_match_language("EN-US", c("en-US", "fr"))
#> [1] "en-US"
Validation
bcp_validate() checks whether the language, script, and
region subtags in a tag appear in the IANA Language Subtag Registry. It
downloads and caches the registry on first use.
bcp_validate("en-US") # TRUE: both 'en' and 'US' are registered
bcp_validate("zh-Hans-CN") # TRUE
bcp_validate("xx-ZZ") # FALSE: 'xx' is not a registered language
bcp_validate("en-Xxxx") # FALSE: 'Xxxx' is not a registered script
Note that validation only checks that each subtag is present in the registry. It does not check whether a combination of subtags is meaningful (e.g., en-Hans would pass validation even though English is not normally written in the Simplified Han script).
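The difference between per-subtag membership and meaningful combinations can be illustrated with a toy example. Both `toy_registry` and `is_registered()` are hypothetical and written for this illustration. The registry excerpt is a small hand-picked list of subtags, and the function does not reflect how the package performs its check internally.

```r
# Hypothetical, hand-written excerpt of registered subtags (lower-cased).
# The real IANA registry holds thousands of entries.
toy_registry <- list(
  language = c("en", "zh", "sr", "de"),
  script   = c("hans", "latn"),
  region   = c("us", "cn", "rs")
)

# Check each component on its own; combinations are never examined.
is_registered <- function(language, script = NA, region = NA) {
  ok_language <- tolower(language) %in% toy_registry$language
  ok_script <- is.na(script) || tolower(script) %in% toy_registry$script
  ok_region <- is.na(region) || tolower(region) %in% toy_registry$region
  ok_language && ok_script && ok_region
}

is_registered("en", region = "US")    # TRUE
is_registered("en", script = "Hans")  # TRUE, although the pairing is unusual
is_registered("xx", region = "ZZ")    # FALSE
```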
Normalization
bcp_normalize() applies the canonicalization rules from
RFC 5646:
- Deprecated languages are replaced with their preferred values (e.g., iw → he for Hebrew)
- Default scripts are suppressed (e.g., en-Latn-US → en-US, since Latin is the default script for English)
- Canonical casing is applied: language lower-case, script title-case, region upper-case
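The casing rule in the list above needs no registry lookup, so it can be sketched in a few lines of base R. `canonical_case()` is a hypothetical helper written for this illustration, not the package's implementation. It handles casing only: replacing deprecated subtags and suppressing default scripts both need registry data and are left out.

```r
# Hypothetical helper (not the package's implementation): apply the casing
# convention by subtag shape and position.
canonical_case <- function(tag) {
  subtags <- tolower(strsplit(tag, "[-_]")[[1]])
  # The first subtag (the language) stays lower-case, so start at the second
  for (i in seq_along(subtags)[-1]) {
    s <- subtags[i]
    if (nchar(s) == 1) break  # extension or private-use tail stays lower-case
    if (grepl("^[a-z]{4}$", s)) {
      # Four letters: script, title-case
      subtags[i] <- paste0(toupper(substr(s, 1, 1)), substr(s, 2, 4))
    } else if (grepl("^[a-z]{2}$", s)) {
      # Two letters: region, upper-case
      subtags[i] <- toupper(s)
    }
  }
  paste(subtags, collapse = "-")
}

canonical_case("ZH-HANS-CN")  # "zh-Hans-CN"
canonical_case("sr_latn")     # "sr-Latn"
```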
bcp_normalize("en-us") # "en-US" (region uppercased)
bcp_normalize("ZH-HANS-CN") # "zh-Hans-CN" (language lowercased, script title-cased)
bcp_normalize("en-Latn-US") # "en-US" (Latn is the default/suppress script for 'en')
bcp_normalize("sr-latn") # "sr-Latn" (Latn is NOT the default for Serbian)
The IANA Registry
Both bcp_validate() and bcp_normalize()
rely on the IANA Language Subtag Registry. You can access it directly
with bcp_process_registry(), which returns a tidy
tibble:
reg <- bcp_process_registry()
nrow(reg)
# Registry metadata
attr(reg, "last_update")
# Browse languages
reg[reg$type == "language", c("subtag", "description", "suppress_script")]
# Find deprecated languages and their preferred replacements
deprecated <- reg[reg$type == "language" & !is.na(reg$preferred_value), ]
head(deprecated[, c("subtag", "description", "preferred_value")])
Caching
To avoid downloading the registry on every call, use
bcp_cache_update() to save it locally:
# Download and cache the registry
bcp_cache_update()
# Inspect the cache
bcp_cache_path() # file path
bcp_cache_size() # size on disk
# Refresh to get the latest registry
bcp_cache_update(overwrite = TRUE)
# Remove the cache
bcp_cache_clear()
After caching, bcp_validate() and bcp_normalize() will load from the local file automatically.