Provides the vocabulary data that the wordpiece algorithm uses to tokenize text into meaningful subword chunks. The included vocabularies were retrieved from <https://huggingface.co/bert-base-cased/resolve/main/vocab.txt> and <https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt> and parsed into an R-friendly format.
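To illustrate how a vocabulary of this kind is consumed, here is a minimal sketch of greedy longest-match-first wordpiece tokenization for a single word. It is not the code of this package or of the wordpiece package: the function name `wordpiece_tokenize`, its parameters, and the tiny `toy_vocab` are hypothetical and chosen only for illustration. The `##` continuation prefix and the `[UNK]` token follow the conventions of the BERT vocabularies.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into vocabulary pieces, longest match first.

    Pieces that do not start the word carry the continuation prefix.
    If any part of the word cannot be matched, the whole word maps
    to the unknown token.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# Hypothetical toy vocabulary; the real BERT vocabularies hold tens of
# thousands of entries, one token per line in vocab.txt.
toy_vocab = {"[UNK]", "token", "##ize", "##s", "un", "##believ", "##able"}

print(wordpiece_tokenize("tokenizes", toy_vocab))
print(wordpiece_tokenize("unbelievable", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With the toy vocabulary, "tokenizes" splits into a leading piece and two continuation pieces, while a word with no matching pieces falls back to the unknown token. In practice the lookup set would come from the cased or uncased vocabulary provided by this package.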
| Field | Value |
|---|---|
| Version | 2.0.0 |
| Depends | R (≥ 3.5.0) |
| Suggests | testthat (≥ 3.0.0) |
| Published | 2022-03-03 |
| DOI | 10.32614/CRAN.package.wordpiece.data |
| Author | Jonathan Bratt [aut], Jon Harmon [aut, cre], Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning [cph], Google, Inc [cph] (original BERT vocabularies) |
| Maintainer | Jon Harmon <jonthegeek at gmail.com> |
| BugReports | https://github.com/macmillancontentscience/wordpiece.data/issues |
| License | Apache License (≥ 2) |
| URL | https://github.com/macmillancontentscience/wordpiece.data |
| NeedsCompilation | no |
| Materials | README, NEWS |
| CRAN checks | wordpiece.data results |
| Reference manual | wordpiece.data.pdf |
| Package source | wordpiece.data_2.0.0.tar.gz |
| Windows binaries | r-devel: wordpiece.data_2.0.0.zip, r-release: wordpiece.data_2.0.0.zip, r-oldrel: wordpiece.data_2.0.0.zip |
| macOS binaries | r-release (arm64): wordpiece.data_2.0.0.tgz, r-oldrel (arm64): wordpiece.data_2.0.0.tgz, r-release (x86_64): wordpiece.data_2.0.0.tgz, r-oldrel (x86_64): wordpiece.data_2.0.0.tgz |
| Old sources | wordpiece.data archive |
| Reverse imports | wordpiece |
Please use the canonical form https://CRAN.R-project.org/package=wordpiece.data to link to this page.