Getting Started with BCP47

library(BCP47)

What is BCP 47?

BCP 47 (Best Current Practice 47) is the IETF standard that defines how human languages are identified in internet protocols. It is specified by two RFCs:

RFC 5646 — the syntax for language tags (e.g., en-US, zh-Hans-CN)
RFC 4647 — the rules for matching language tags to available resources

A BCP 47 tag is a sequence of subtags separated by hyphens:

language [-extlang]* [-script] [-region] [-variant]* [-extension]* [-privateuse]

For example:

Tag	Meaning
`en`	English
`en-US`	English as used in the United States
`zh-Hans-CN`	Chinese, Simplified script, as used in China
`sr-Latn`	Serbian written in the Latin script
`de-1901`	German, traditional orthography (1901 variant)
`x-myapp`	Entirely private-use tag

The canonical source of valid subtags is the IANA Language Subtag Registry.

Parsing

bcp_parse() decomposes a tag into its named components. All subtags are returned in lower-case. Both hyphens (-) and underscores (_) are accepted as separators.

bcp_parse("en-US")
#> $language
#> [1] "en"
#> 
#> $extlang
#> NULL
#> 
#> $script
#> [1] NA
#> 
#> $region
#> [1] "us"
#> 
#> $variants
#> NULL
#> 
#> $extensions
#> list()
#> 
#> $private
#> NULL

bcp_parse("zh-Hans-CN")
#> $language
#> [1] "zh"
#> 
#> $extlang
#> NULL
#> 
#> $script
#> [1] "hans"
#> 
#> $region
#> [1] "cn"
#> 
#> $variants
#> NULL
#> 
#> $extensions
#> list()
#> 
#> $private
#> NULL

The returned list always has the same structure:

tag <- bcp_parse("sr-Latn-RS-rozaj-x-custom")
names(tag)
#> [1] "language"   "extlang"    "script"     "region"     "variants"  
#> [6] "extensions" "private"

language: primary language subtag (e.g., "sr")
extlang: extended language subtags (NULL if absent)
script: four-letter script subtag (NA if absent)
region: two-letter or three-digit region subtag (NA if absent)
variants: variant subtags (NULL if absent)
extensions: named list of extension sequences
private: private-use subtags (NULL if absent)

# Pure private-use tag
bcp_parse("x-myapp-v2")
#> $language
#> [1] NA
#> 
#> $extlang
#> NULL
#> 
#> $script
#> [1] NA
#> 
#> $region
#> [1] NA
#> 
#> $variants
#> NULL
#> 
#> $extensions
#> list()
#> 
#> $private
#> [1] "myapp" "v2"

# Extension subtags (here the Unicode locale extension 'u')
bcp_parse("en-US-u-ca-gregory")
#> $language
#> [1] "en"
#> 
#> $extlang
#> NULL
#> 
#> $script
#> [1] NA
#> 
#> $region
#> [1] "us"
#> 
#> $variants
#> NULL
#> 
#> $extensions
#> $extensions$u
#> [1] "ca"      "gregory"
#> 
#> 
#> $private
#> NULL
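
The component layout above can be approximated in base R by classifying subtags by their shape (length and character class). This is a minimal sketch under stated assumptions, not the package's implementation: parse_sketch() is a hypothetical helper, and it ignores extlang subtags, grandfathered tags, and any registry lookup.

```r
# Hypothetical helper: split a tag and classify subtags by shape only.
# Simplified on purpose: no extlang, no grandfathered tags, no validation.
parse_sketch <- function(tag) {
  parts <- strsplit(tolower(tag), "[-_]")[[1]]
  out <- list(language = NA_character_, script = NA_character_,
              region = NA_character_, variants = NULL,
              extensions = list(), private = NULL)
  i <- 1
  n <- length(parts)
  # Primary language: two or three letters
  if (i <= n && grepl("^[a-z]{2,3}$", parts[i])) {
    out$language <- parts[i]
    i <- i + 1
  }
  # Script: four letters
  if (i <= n && grepl("^[a-z]{4}$", parts[i])) {
    out$script <- parts[i]
    i <- i + 1
  }
  # Region: two letters or three digits
  if (i <= n && grepl("^([a-z]{2}|[0-9]{3})$", parts[i])) {
    out$region <- parts[i]
    i <- i + 1
  }
  # Variants: 5-8 alphanumerics, or a digit followed by 3 alphanumerics
  while (i <= n && grepl("^([a-z0-9]{5,8}|[0-9][a-z0-9]{3})$", parts[i])) {
    out$variants <- c(out$variants, parts[i])
    i <- i + 1
  }
  # Singletons: "x" starts private use, any other starts an extension
  key <- NULL
  while (i <= n) {
    p <- parts[i]
    if (p == "x") {
      if (i < n) out$private <- parts[(i + 1):n]
      break
    }
    if (nchar(p) == 1) {
      key <- p
      out$extensions[[key]] <- character(0)
    } else if (!is.null(key)) {
      out$extensions[[key]] <- c(out$extensions[[key]], p)
    }
    i <- i + 1
  }
  out
}

parse_sketch("en-US-u-ca-gregory")$extensions$u
#> [1] "ca"      "gregory"
```

Because the classification depends only on position and shape, the sketch accepts tags whose subtags are not registered.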

Language Matching

bcp_match_language() implements the RFC 4647 “Lookup” scheme. Given an ordered list of language preferences and a set of available tags, it returns the best match.

Matching works by progressively stripping the rightmost subtag from each preference until a match is found:

en-US → en  (strip region)
zh-Hans-CN → zh-Hans → zh  (strip region, then script)
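
The truncation sequence can be sketched in base R. fallback_chain() is a hypothetical helper written for illustration, not a function exported by the package; it follows the RFC 4647 Lookup rule that a single-character subtag left at the end of a candidate is removed as well.

```r
# Hypothetical helper: list the progressively truncated forms of a tag,
# most specific first.
fallback_chain <- function(tag) {
  parts <- strsplit(tag, "-", fixed = TRUE)[[1]]
  chain <- character(0)
  while (length(parts) > 0) {
    chain <- c(chain, paste(parts, collapse = "-"))
    # Strip the rightmost subtag
    parts <- parts[-length(parts)]
    # Never leave a singleton (e.g. "u" or "x") dangling at the end
    if (length(parts) > 0 && nchar(parts[length(parts)]) == 1) {
      parts <- parts[-length(parts)]
    }
  }
  chain
}

fallback_chain("zh-Hans-CN")
#> [1] "zh-Hans-CN" "zh-Hans"    "zh"

paste(fallback_chain("en-US-u-ca-gregory"), collapse = " > ")
#> [1] "en-US-u-ca-gregory > en-US-u-ca > en-US > en"
```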

# User prefers en-US; only 'en' is available — falls back to the base language
bcp_match_language("en-US", c("en", "fr", "de"))
#> [1] "en"

# Multiple preferences: de-AT and its fallback de are unavailable, so en-GB is tried and falls back to en
bcp_match_language(c("de-AT", "en-GB"), c("en", "fr"))
#> [1] "en"

# Prefer Traditional Chinese, fall back to Simplified, then English
bcp_match_language(
  c("zh-Hant-TW", "zh-Hans", "en"),
  c("zh-Hans", "en", "fr")
)
#> [1] "zh-Hans"

# No match — return a default value
bcp_match_language("pt-BR", c("fr", "de"), default = "en")
#> [1] "en"

Matching is case-insensitive. The original casing from available is preserved in the return value:

bcp_match_language("EN-US", c("en-US", "fr"))
#> [1] "en-US"
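
One way to get this behaviour is to compare lower-cased copies and then index back into the original vector. The helper match_ci() below is hypothetical and only illustrates the idea; the package may implement it differently.

```r
# Hypothetical helper: case-insensitive exact match that returns the
# entry from `available` exactly as the caller wrote it.
match_ci <- function(preference, available) {
  hit <- match(tolower(preference), tolower(available))
  if (is.na(hit)) NA_character_ else available[hit]
}

match_ci("EN-US", c("en-US", "fr"))
#> [1] "en-US"
match_ci("FR", c("en-US", "fr"))
#> [1] "fr"
```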

Validation

bcp_validate() checks whether the language, script, and region subtags in a tag appear in the IANA Language Subtag Registry. It downloads and caches the registry on first use.

bcp_validate("en-US") # TRUE — both 'en' and 'US' are registered
bcp_validate("zh-Hans-CN") # TRUE
bcp_validate("xx-ZZ") # FALSE — 'xx' is not a registered language
bcp_validate("en-Xxxx") # FALSE — 'Xxxx' is not a registered script

Note that validation only checks that each subtag is registered. It does not check whether the combination of subtags is meaningful (e.g., en-Hans passes validation even though English is not normally written in the Simplified Han script).
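
The per-subtag nature of the check can be illustrated with a hand-written subset of registered subtags. Both toy_registry and validate_sketch() are hypothetical names used only for this sketch; the real function consults the full registry.

```r
# Toy subset of registered subtags, for illustration only. The real
# registry holds thousands of entries.
toy_registry <- list(
  language = c("en", "zh", "sr"),
  script   = c("latn", "hans", "cyrl"),
  region   = c("us", "cn", "rs")
)

# Each subtag is looked up independently of the others.
validate_sketch <- function(language, script = NA, region = NA) {
  optional_ok <- function(x, set) is.na(x) || tolower(x) %in% set
  tolower(language) %in% toy_registry$language &&
    optional_ok(script, toy_registry$script) &&
    optional_ok(region, toy_registry$region)
}

validate_sketch("en", script = "Hans")  # registered, but an unusual pairing
#> [1] TRUE
validate_sketch("xx", region = "ZZ")
#> [1] FALSE
```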

Normalization

bcp_normalize() applies the canonicalization and formatting conventions described in RFC 5646:

Deprecated languages are replaced with their preferred values (e.g., iw → he for Hebrew)
Default scripts are suppressed (e.g., en-Latn-US → en-US, since Latin is the default script for English)
Canonical casing is applied: language lower-case, script title-case, region upper-case

bcp_normalize("en-us") # "en-US"   (region uppercased)
bcp_normalize("ZH-HANS-CN") # "zh-Hans-CN"  (language lowercased, script title-cased)
bcp_normalize("en-Latn-US") # "en-US"   (Latn is the default/suppress script for 'en')
bcp_normalize("sr-latn") # "sr-Latn" (Latn is NOT the default for Serbian)
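
The casing rule needs no registry access. RFC 5646 (section 2.1.1) describes it by position: every subtag is lower-case, except that two-letter subtags are upper-cased and four-letter subtags are title-cased when they are neither first in the tag nor after a singleton. A minimal base R sketch of that rule follows; canonical_case() is a hypothetical helper, and it does not perform the deprecated-language or default-script steps.

```r
# Hypothetical helper: apply the positional casing convention only.
canonical_case <- function(tag) {
  parts <- tolower(strsplit(tag, "[-_]")[[1]])
  after_singleton <- FALSE
  for (i in seq_along(parts)) {
    p <- parts[i]
    if (i > 1 && !after_singleton) {
      if (nchar(p) == 2) {
        # Region-shaped subtag: upper-case
        parts[i] <- toupper(p)
      } else if (nchar(p) == 4) {
        # Script-shaped subtag: title-case
        parts[i] <- paste0(toupper(substr(p, 1, 1)), substr(p, 2, 4))
      }
    }
    # Everything after a singleton (e.g. "x", "u") stays lower-case
    if (nchar(p) == 1) after_singleton <- TRUE
  }
  paste(parts, collapse = "-")
}

canonical_case("ZH-HANS-CN")
#> [1] "zh-Hans-CN"
canonical_case("EN-ca-X-CA")
#> [1] "en-CA-x-ca"
```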

The IANA Registry

Both bcp_validate() and bcp_normalize() rely on the IANA Language Subtag Registry. You can access it directly with bcp_process_registry(), which returns a tidy tibble:

reg <- bcp_process_registry()
nrow(reg)

# Registry metadata
attr(reg, "last_update")

# Browse languages
reg[reg$type == "language", c("subtag", "description", "suppress_script")]

# Find deprecated languages and their preferred replacements
deprecated <- reg[reg$type == "language" & !is.na(reg$preferred_value), ]
head(deprecated[, c("subtag", "description", "preferred_value")])
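
The preferred_value and suppress_script columns are what drive the first two normalization rules. The sketch below applies them to a hand-written toy table that reuses the column names shown above; the table and normalize_sketch() are assumptions made for illustration and do not query the real registry or reproduce the package's code.

```r
# Toy rows reusing the column names above (hand-written, not downloaded).
toy <- data.frame(
  type            = "language",
  subtag          = c("iw", "he", "en", "sr"),
  preferred_value = c("he", NA, NA, NA),
  suppress_script = c("Hebr", "Hebr", "Latn", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper covering language and script only.
normalize_sketch <- function(language, script = NA_character_, reg = toy) {
  row <- reg[reg$type == "language" & reg$subtag == language, ]
  # Rule 1: replace a deprecated language with its preferred value
  if (nrow(row) == 1 && !is.na(row$preferred_value)) {
    language <- row$preferred_value
    row <- reg[reg$type == "language" & reg$subtag == language, ]
  }
  # Rule 2: drop the script when it is the language's suppress-script
  if (nrow(row) == 1 && !is.na(script) && !is.na(row$suppress_script) &&
      tolower(script) == tolower(row$suppress_script)) {
    script <- NA_character_
  }
  kept <- c(language, script)
  paste(kept[!is.na(kept)], collapse = "-")
}

normalize_sketch("iw")
#> [1] "he"
normalize_sketch("en", "Latn")
#> [1] "en"
normalize_sketch("sr", "Latn")
#> [1] "sr-Latn"
```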

Caching

To avoid downloading the registry on every call, use bcp_cache_update() to save it locally:

# Download and cache the registry
bcp_cache_update()

# Inspect the cache
bcp_cache_path() # file path
bcp_cache_size() # size on disk

# Refresh to get the latest registry
bcp_cache_update(overwrite = TRUE)

# Remove the cache
bcp_cache_clear()

After caching, bcp_validate() and bcp_normalize() will load from the local file automatically.
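
The general pattern behind such a cache is "fetch only when the file is missing or a refresh is requested". The sketch below shows that pattern with a stand-in fetch function and a temporary file; cached_lines() and fake_fetch() are hypothetical and say nothing about where or how the package actually stores its cache.

```r
# Hypothetical helper: return the cached file contents, calling `fetch`
# only when the cache file is missing or `overwrite = TRUE`.
cached_lines <- function(path, fetch, overwrite = FALSE) {
  if (overwrite || !file.exists(path)) {
    writeLines(fetch(), path)
  }
  readLines(path)
}

# Stand-in for a download that counts how often it runs
calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1
  c("Type: language", "Subtag: en")
}

path <- tempfile(fileext = ".txt")
first  <- cached_lines(path, fake_fetch)                    # fetches and writes
second <- cached_lines(path, fake_fetch)                    # served from disk
third  <- cached_lines(path, fake_fetch, overwrite = TRUE)  # forces a refresh
calls
#> [1] 2
unlink(path)
```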