Supplementary data for "Reassessing authorship of the Book of Mormon using delta and nearest shrunken centroid classification."


Abstract/Contents

Abstract

Fifty novels written by fifteen different authors (twelve male, three female) were selected from the Chadwyck-Healey Nineteenth Century American Literature collection. Selection was based solely on publication date (that is, chronological proximity to the 1830 publication of the Book of Mormon). From these texts we extracted frequency data for the most frequently occurring words and employed hierarchical clustering (the "hclust" function in R with complete linkage) to group the texts by similarity. Shown below are three dendrograms produced using three different feature sets.
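The clustering step described above can be illustrated with a minimal sketch. It is written in plain Python rather than R, and it is not the authors' script: the four sample names, the three-word frequency profiles, and the choice of Euclidean distance are all assumptions made for illustration. Only the complete-linkage rule (the distance between two clusters is the largest distance between any pair of their members) comes from the description above.

```python
from itertools import combinations
from math import sqrt

def euclidean(a, b):
    # Straight-line distance between two frequency profiles (assumed metric).
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(vectors):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, height)."""
    clusters = [(i,) for i in range(len(vectors))]
    merges = []

    def dist(pair):
        # Complete linkage: the largest pairwise distance between members.
        a, b = pair
        return max(euclidean(vectors[i], vectors[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Merge the two clusters that are closest under complete linkage.
        a, b = min(combinations(clusters, 2), key=dist)
        merges.append((a, b, dist((a, b))))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return merges

# Hypothetical relative frequencies of three common words in four samples.
names = ["author_A_text_1", "author_A_text_2", "author_B_text_1", "author_B_text_2"]
profiles = [
    [0.060, 0.030, 0.035],
    [0.061, 0.030, 0.035],
    [0.045, 0.040, 0.025],
    [0.047, 0.041, 0.026],
]

merges = complete_linkage(profiles)
for a, b, height in merges:
    left = "+".join(names[i] for i in a)
    right = "+".join(names[i] for i in b)
    print(f"merge {left} with {right} at height {height:.5f}")
```

In this toy data the two samples from each hypothetical author merge first, and the two author groups join last. The merge heights play the role of the branch heights in a dendrogram.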

PLEASE NOTE: Shortly after publication of the online "Advance Access" version of "Reassessing authorship of the Book of Mormon using delta and nearest shrunken centroid classification," Jockers discovered an error he had made in preprocessing the textual data. In tokenizing the texts, Jockers's original script failed to account for the possible presence of two hyphens ("--") as a substitute for the em dash. As a result, a very small number of spurious word types were created, and an even smaller number of legitimate word types were miscounted. For example, the string "age--and" was tokenized as a single word type instead of being counted as one instance each of the words "age" and "and." This string occurred once in the entire corpus, in one of the Pratt samples, so the counts for "age" and "and" in that Pratt sample were each off by one. There were other similar cases.
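The effect of the double-hyphen error can be reproduced with a small sketch. The original tokenization script is not included in this record, so both functions below are hypothetical stand-ins, and the sample sentence is invented. The sketch only shows how a pattern that keeps hyphens inside tokens turns "age--and" into one spurious word type, and how splitting on runs of two or more hyphens restores the expected counts.

```python
import re
from collections import Counter

def tokenize_naive(text):
    # Keeps hyphens inside tokens, so "age--and" survives as one word type.
    return re.findall(r"[a-z'-]+", text.lower())

def tokenize_fixed(text):
    # Treat two or more hyphens as an em dash substitute and split on them;
    # single hyphens inside compound words are left untouched.
    text = re.sub(r"-{2,}", " ", text.lower())
    return re.findall(r"[a-z'-]+", text)

# Invented example sentence, not taken from the corpus.
sample = "In his old age--and not before--he wrote."

naive = Counter(tokenize_naive(sample))
fixed = Counter(tokenize_fixed(sample))

print(naive["age--and"], naive["age"], naive["and"])
print(fixed["age--and"], fixed["age"], fixed["and"])
```

With the naive pattern, "age" and "and" are each undercounted by one and a spurious type appears, which matches the kind of discrepancy described in the note.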

Jockers became aware of this error on January 9, 2009, and immediately corrected his tokenization script and reprocessed the data. Witten then reran both the winnowing algorithm and the NSC and Delta procedures. The minor corrections to the data file did not change the winnowed set of words used by NSC. In all but one case, the classification results given by NSC were also unchanged. The only change in classification occurred in chapter 147 (Alma 52) of the Book of Mormon. Instead of Rigdon being the most likely candidate and Spalding the second most likely, NSC reported the reverse: Spalding as most likely and Rigdon as second most likely. In the original results, NSC ranked Rigdon at 0.4646 and Spalding at 0.4628. With the corrected data, NSC ranked Rigdon at 0.4626 and Spalding at 0.46525.

Description

Type of resource software, multimedia
Date created 2008

Creators/Contributors

Author Jockers, Matthew L.
Author Witten, Daniela M.
Author Criddle, Craig S.

Subjects

Subject authorship attribution
Subject machine learning
Subject Book of Mormon
Subject Joseph Smith
Subject Sidney Rigdon
Subject Oliver Cowdery
Subject Parley Pratt
Subject Solomon Spalding
Subject Solomon Spaulding
Subject Spalding-Rigdon Theory
Subject Nearest Shrunken Centroids
Subject Burrows's Delta
Genre Dataset

Bibliographic information

Related Publication Matthew L. Jockers; Daniela M. Witten; Craig S. Criddle. "Reassessing authorship of the Book of Mormon using delta and nearest shrunken centroid classification." Literary and Linguistic Computing, 2008; doi: 10.1093/llc/fqn040.
Location https://purl.stanford.edu/rs276tc2764

Access conditions

Use and reproduction
User agrees that, where applicable, content will not be used to identify or to otherwise infringe the privacy or confidentiality rights of individuals. Content distributed via the Stanford Digital Repository may be subject to additional license and use restrictions applied by the depositor.
License
This work is licensed under a Creative Commons Attribution 3.0 Unported license (CC BY).

Preferred citation

Preferred Citation
Matthew L. Jockers; Daniela M. Witten; Craig S. Criddle. (2008). Supplementary data for "Reassessing authorship of the Book of Mormon using delta and nearest shrunken centroid classification." Stanford Digital Repository. Available at: http://purl.stanford.edu/rs276tc2764
