Work with Language Tags • BCP47

BCP47 provides tools to parse, validate, normalize, and match language tags that follow the BCP 47 standard. BCP 47 (Best Current Practice 47) is the IETF standard for identifying human languages in internet protocols. It defines tags such as en-US (English, United States).

The package bundles access to the IANA Language Subtag Registry, the authoritative source of valid language, script, region, and variant subtags.

Installation

You can install the development version of BCP47 from GitHub with:

# install.packages('pak')
pak::pak('christopherkenny/BCP47')

Core Functions

Function	Description
`bcp_parse()`	Decompose a tag into its subtag components
`bcp_validate()`	Check whether subtags appear in the IANA registry
`bcp_normalize()`	Apply canonical casing and substitute preferred values
`bcp_match_language()`	Find the best available language for a set of preferences
`bcp_process_registry()`	Download and parse the IANA registry
`bcp_cache_*()`	Manage the local registry cache

Examples

Parsing

bcp_parse() decomposes a tag into its RFC 5646 components. All subtags are returned in lowercase, and components missing from the tag come back as NA, NULL, or an empty list.

library(BCP47)

bcp_parse('en-US')
#> $language
#> [1] "en"
#> 
#> $extlang
#> NULL
#> 
#> $script
#> [1] NA
#> 
#> $region
#> [1] "us"
#> 
#> $variants
#> NULL
#> 
#> $extensions
#> list()
#> 
#> $private
#> NULL
bcp_parse('zh-Hans-CN')
#> $language
#> [1] "zh"
#> 
#> $extlang
#> NULL
#> 
#> $script
#> [1] "hans"
#> 
#> $region
#> [1] "cn"
#> 
#> $variants
#> NULL
#> 
#> $extensions
#> list()
#> 
#> $private
#> NULL
bcp_parse('de-1901')
#> $language
#> [1] "de"
#> 
#> $extlang
#> NULL
#> 
#> $script
#> [1] NA
#> 
#> $region
#> [1] NA
#> 
#> $variants
#> [1] "1901"
#> 
#> $extensions
#> list()
#> 
#> $private
#> NULL
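
To make the decomposition concrete, here is a minimal base R sketch that classifies subtags by their shape, following the RFC 5646 syntax for script (four letters), region (two letters or three digits), and variant subtags. It is not the package's implementation: `parse_sketch()` is a hypothetical helper, it handles only simple language[-script][-region][-variant...] tags, and it ignores extlang, extension, and private-use subtags.

```r
# Hypothetical helper, for illustration only; bcp_parse() may work differently.
parse_sketch <- function(tag) {
  parts <- tolower(strsplit(tag, "-", fixed = TRUE)[[1]])
  out <- list(
    language = parts[1],
    script   = NA_character_,
    region   = NA_character_,
    variants = character(0)
  )
  for (p in parts[-1]) {
    no_region   <- is.na(out$region)
    no_variants <- length(out$variants) == 0
    if (grepl("^[a-z]{4}$", p) && is.na(out$script) && no_region && no_variants) {
      # Four letters directly after the language: a script subtag
      out$script <- p
    } else if (grepl("^([a-z]{2}|[0-9]{3})$", p) && no_region && no_variants) {
      # Two letters or three digits: a region subtag
      out$region <- p
    } else if (grepl("^([a-z0-9]{5,8}|[0-9][a-z0-9]{3})$", p)) {
      # 5 to 8 alphanumerics, or a digit followed by 3 alphanumerics: a variant
      out$variants <- c(out$variants, p)
    } else {
      stop("Subtag not handled by this sketch: ", p)
    }
  }
  out
}

parse_sketch('zh-Hans-CN')$script
#> [1] "hans"
parse_sketch('de-1901')$variants
#> [1] "1901"
```

Unlike the output shown above, this sketch returns an empty character vector rather than NULL when a tag has no variants.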

Language Matching

bcp_match_language() implements the RFC 4647 “Lookup” scheme. It finds the best available language for a user’s ordered list of preferences, progressively stripping subtags to find a match.

# User prefers en-US, then French. Only 'en' and 'de' are available.
bcp_match_language(c('en-US', 'fr'), c('en', 'de'))
#> [1] "en"

# Prefer Traditional Chinese, fall back to Simplified, then English
bcp_match_language(
  c('zh-Hant-TW', 'zh-Hans', 'en'),
  c('zh-Hans', 'en', 'fr')
)
#> [1] "zh-Hans"

# No match: return the default
bcp_match_language('pt-BR', c('fr', 'de'), default = 'en')
#> [1] "en"
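
The progressive stripping can be sketched in a few lines of base R. This is an illustration of the RFC 4647 Lookup scheme under simplifying assumptions, not the package's own code: `lookup_sketch()` is a hypothetical helper, comparison is case-insensitive, and the `*` wildcard range is not given any special treatment.

```r
# Hypothetical helper, for illustration only; not how bcp_match_language()
# is necessarily implemented.
lookup_sketch <- function(preferences, available, default = NA_character_) {
  avail_lower <- tolower(available)
  for (pref in tolower(preferences)) {
    parts <- strsplit(pref, "-", fixed = TRUE)[[1]]
    while (length(parts) > 0) {
      candidate <- paste(parts, collapse = "-")
      hit <- match(candidate, avail_lower)
      if (!is.na(hit)) {
        # Return the tag as the caller spelled it
        return(available[hit])
      }
      # Truncate from the end; a trailing single-character subtag
      # (extension or private-use singleton) is removed as well
      parts <- parts[-length(parts)]
      while (length(parts) > 0 && nchar(parts[length(parts)]) == 1) {
        parts <- parts[-length(parts)]
      }
    }
  }
  default
}

lookup_sketch(c('zh-Hant-TW', 'zh-Hans', 'en'), c('zh-Hans', 'en', 'fr'))
#> [1] "zh-Hans"
```

In this call, the first preference is tried as zh-hant-tw, zh-hant, and zh. None is available, so the second preference, zh-Hans, is the first to match.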

Validation and Normalization

bcp_validate() and bcp_normalize() check and canonicalize tags against the IANA registry. They download (and cache) the registry on first use.

# Check whether subtags are registered
bcp_validate('en-US') # TRUE
bcp_validate('xx-ZZ') # FALSE: 'xx' is not a registered language subtag

# Canonicalize casing and suppress default scripts
bcp_normalize('en-us') # "en-US"  (region uppercased)
bcp_normalize('en-Latn-US') # "en-US"  (Latn is the default script for English)
bcp_normalize('sr-latn') # "sr-Latn" (Latn is not the default for Serbian)
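
The casing half of normalization needs no registry lookup. RFC 5646 (section 2.1.1) recommends lowercase for every subtag, except that two-letter subtags are uppercased and four-letter subtags are title-cased when they are neither first in the tag nor after a singleton. The base R sketch below applies only that rule. `case_sketch()` is a hypothetical helper, not part of the package, and it does not suppress default scripts or substitute preferred values, since both require registry data.

```r
# Hypothetical helper, for illustration only; casing rule from RFC 5646 2.1.1.
case_sketch <- function(tag) {
  parts <- tolower(strsplit(tag, "-", fixed = TRUE)[[1]])
  after_singleton <- FALSE
  for (i in seq_along(parts)) {
    n <- nchar(parts[i])
    if (n == 1) {
      # Everything after an extension or private-use singleton stays lowercase
      after_singleton <- TRUE
    }
    if (i == 1 || after_singleton) {
      next
    }
    if (n == 2) {
      parts[i] <- toupper(parts[i])
    } else if (n == 4) {
      parts[i] <- paste0(toupper(substr(parts[i], 1, 1)), substr(parts[i], 2, 4))
    }
  }
  paste(parts, collapse = "-")
}

case_sketch('sr-latn')
#> [1] "sr-Latn"
case_sketch('en-ca-x-ca')
#> [1] "en-CA-x-ca"
```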

Registry Access

The IANA registry is parsed into a tidy data frame you can query directly:

reg <- bcp_process_registry()
head(reg)

# Find all scripts
reg[reg$type == 'script', c('subtag', 'description')]

# Check the registry date
attr(reg, 'last_update')

Cache Management

Registry data is cached locally to avoid repeated downloads:

bcp_cache_path() # where the cache lives
bcp_cache_size() # how big it is
bcp_cache_update() # refresh from IANA
bcp_cache_clear() # delete the cache