The Rhythm of Names

One of the fundamental properties of English prosody is a preference for alternations between strong and weak beats. This preference for rhythmic alternation is expressed in several ways:

      • Stress patterns in polysyllabic words like “testimony” and “obligatory” – as well as nonce words like “supercalifragilisticexpialidocious” – alternate between strong and weak beats.
      • Stress patterns on words change over time so that they maintain rhythmic alternation in the contexts in which they typically appear.
      • Over 90% of formal English poetry like that written by Shakespeare and Milton follows iambic or trochaic meter, i.e., weak-strong or strong-weak units.
      • Speakers insert disyllabic expletives in polysyllabic word in places that create or reinforce rhythmic alternation (e.g., we say “Ala-bloody-bama” or “Massa-bloody-chusetts” not “Alabam-blood-a” or “Massachu-bloody-setts”)
This blog examines whether the preference for rhythmic alternation affects naming patterns. Consider the following names of lakes in the United States:
      1. Glacier Lake
      2. Guitar Lake*
      3. Lake Louise
      4. Lake Ellen*
The names (a) and (c) preserve rhythmic alternation whereas the asterisked (b) and (d) create a stress clash with two consecutive stressed syllables.

As you can see, lake names in the US can either begin or end with “Lake”. More than 90% end in “Lake” reflecting the standard modifier + noun word order in English. But the flexibility allows us to test whether particular principles (e.g., linguistic, cultural) affect the choice between “X Lake” and “Lake X.” In the case of rhythmic alternation, we would expect weak-strong “iambic” words like “Louise” and “Guitar” to be more common in names beginning than ending in “Lake.”

To test this hypothesis, I used the Quanteda package to pull all the names of lakes and reservoirs in the USGS database of placenames that contained the word “Lake” plus just one other word before or after it. Using the “nsyllable” function in Quanteda, I whittled the list down to names whose non-“lake” word contained just two syllables. Finally, I pulled random samples of 500 names each from those beginning and ending with Lake, then manually coded the stress patterns on the non-lake word in each name.

Coding details for these steps follow. First, we’ll load our place name data frame and take a look at the variable names in the data frame, which are generally self-explanatory:


Next, we’ll filter to lakes and reservoirs based on the FEATURE_CLASS variable, and convert names to lower case. We’ll then flag lake and reservoir names that either begin or end with the word “lake”, filtering out those in neither category:

temp <- filter(placeNames, FEATURE_CLASS %in% c("Lake","Reservoir"))
temp$FEATURE_NAME <- tolower(temp$FEATURE_NAME)
temp$first_word <- 0
temp$last_word <- 0
temp$first_word[grepl("^lake\\b",temp$FEATURE_NAME)] <- 1
temp$last_word[grepl("\\blake$",temp$FEATURE_NAME)] <- 1
temp <- filter(temp, first_word + last_word > 0)
We’ll use the ntoken function in the Quanteda text analytics package to find names that contain just two words. By definition given the code so far, one of these two words is “lake.” We’ll separate out the other word, and use the nsyllable function in Quanteda to pull out just those words containing two syllables (i.e., “disyllabic” words). These will be the focus of our analysis.
temp$nWords <- ntoken(temp$FEATURE_NAME, remove_punct=TRUE)
temp <- filter(temp, nWords == 2)
temp$num_syl <- nsyllable(temp$otherWord)
temp <- filter(temp, num_syl == 2)
temp$otherWord <- temp$FEATURE_NAME
temp$otherWord <- gsub("^lake\\b","",temp$otherWord)
temp$otherWord <- gsub("\\blake$","",temp$otherWord)
temp$otherWord <- trimws(temp$otherWord)
  Given the large number of names with “lake” either in first or last position plus a two syllable word (30,391 names), we’ll take a random sample of 500 names beginning with “lake” and 500 ending with “lake”, combine into single data frame, and save as a csv file.
lake_1 <- filter(temp, first_word == 1) %>% sample_n(500)
lake_2 <- filter(temp, last_word == 1) %>% sample_n(500)
lakeSample <- rbind(lake_1,lake_2)
write.csv(lakeSample,file="lake stress clash sample.csv",row.names=FALSE)

I manually coded each of the disyllabic non-“lake” words in each name for whether it had strong-weak (i.e., “trochaic”) or weak-strong (“iambic”) stress. This coding was conducted blind to whether the name began or ended in “Lake.” Occasionally, I came across words like “Cayuga” that the nsyllable function erred in classifying as containing two syllables. I dropped these 23 words – 2.3% of the total – from the analysis (18 in names beginning with “Lake” and 5 in names ending in “Lake”).

Overall, 90% of the non-lake words had trochaic stress, which is consistent with the dominance of this stress pattern in the disyllabic English lexicon. However, as predicted from the preference for rhythmic alternation, iambic stress was almost 5x more common in names beginning than ending with “Lake” (16.4% vs. 3.4%, x2 = 44.81, p < .00001).

Place names provide a rich resource for testing the potential impact of linguistic and cultural factors on the layout of our “namescape.” For example, regional differences in the distribution of violent words in US place names are associated with long-standing regional variation in attitudes towards violence. Large databases of place names along with R tools for text analytics offer many opportunities for similar analyses.

From “Sh*t’s Creek” to “Schitt’s Creek”: On Padding Surnames with Extra Letters

We typically think of English and related spelling systems as mapping orthographic units or graphemes onto units of speech sounds, or phonemes. For instance, each of the three letters in “pen” maps to the three phonemes /p/, /ɛ/, and /n/ in the spoken version of the word. But there is considerable flexibility in the English spelling system, enabling other information to be encoded while still preserving phonemic mapping. For example, padding the ends of disyllabic words with extra unpronounced letters indicates that accent or stress should be placed on the second syllabic instead of the more common English pattern of first syllable stress(e.g., compare “trusty” with “trustee”, “gravel” with “gazelle”, or “rivet” with “roulette).

Proper names provide a rich resource for exploring how spelling systems are used to convey more than sound. Consider “Gerry” and “Gerrie” for example. These names are pronounced the same, but the final vowel /i/ is spelled differently. The difference is associated with gender: Between 1880 and 2016, 99% of children named “Gerrie” have been girls compared with 32% of children named “Gerry.” More generally, as documented in the code below, name-final “ie” is more associated with girls than boys. On average, names ending in “ie” and “y” are given to girls 84% and 66% of the time respectively (i.e., names ending in the sound /i/ tend to be given to girls, but more so if spelled with “ie” than “y”).

#Data frame I created from US Census dataset of baby names 
load("/Users/MHK/R/Baby Names/NamesOverall.RData")
sumNames$final_y_ie <- grepl("y$|ie$",sumNames$Name)
final_y_ie <- filter(sumNames, final_y_ie==TRUE)
final_y_ie$prop_f <- final_y_ie$femaleTotal/final_y_ie$allTotal
    Welch Two Sample t-test

t = 20.624, df = 7024.5, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1684361 0.2038195
sample estimates:
mean of x mean of y 
0.8412466 0.6551188 
Capitalizing the first letter in proper nouns is perhaps the most well-known example of how we use the flexibility in spelling systems to convey information beyond pronunciation. More subtly, we sometimes increase the prominence of proper names by padding with extra unpronounced letters as in “Penn” versus “pen” and “Kidd” versus “kid.” An interesting question is what factors influence whether or not a name is padded. Which brings us to Schitt’s Creek. The title of the popular series plays exactly on the fact that padding the name with extra letters that don’t affect pronunciation hides the expletive. This suggests a hypothesis: Padded names should be more common when the unpadded version contains negative sentiment, which might carry over via psychological “contagion” from the surname to the person. So, surnames like “Grimm” and “Sadd” should be more common than surnames like “Winn.”

I tested this hypothesis using a data set of surnames occurring at least 100 times in the 2010 US Census. Specifically, I flagged all monosyllabic names that ended in double letters. I restricted to monosyllabic names since letter doubling can affect accent placement, as noted above, which could create differences between the padded and unpadded versions. Next, I stripped off the final letter from these names and matched to a sentiment dictionary. Finally, I tested whether surnames were more likely to be padded if the unpadded version expressed negative sentiment.

The following R code walks through these steps. We’ll start by first reading the downloaded csv file of surnames into a data frame, and then converting the surnames from upper case in the Census file to lower case for mapping to a sentiment dictionary:

surnames <- read.csv("/Users/mike/R/Names/Names_2010Census.csv",header=TRUE)
surnames$name <- tolower(surnames$name)

Next, we’ll flag all monosyllabic names using the nsyllable function in the Quanteda package, identify those with a final double letter, and place into a new data frame:

surnames$num_syllables <- nsyllable(surnames$name)
surnames$finalDouble <- grepl("(.)\\1$", surnames$name)
oneSylFinalDouble <- filter(surnames, num_syllables == 1 & finalDouble == TRUE)
oneSylFinalDouble <- select(oneSylFinalDouble, name, num_syllables, finalDouble)

Finally, we’ll create a variable that strips each name of its final letter (e.g., “grimm” becomes “grim”) and match the latter to a sentiment dictionary, VADER (“Valence  Aware  Dictionary  for sEntiment Reasoning”) specifically. We’ll then put the matching words into a new data frame:

oneSylFinalDouble$Stripped <- substr(oneSylFinalDouble$name,1,nchar(oneSylFinalDouble$name)-1)
vader <- read.csv("/Users/MHK/R/Lexicon/vader_sentiment_words.csv",header=TRUE)
vader$word <- as.character(vader$word)
vader <- select(vader,word,mean_sent)
oneSylFinalDouble <- left_join(oneSylFinalDouble, vader, by=c("Stripped"="word"))
sentDouble <- filter(oneSylFinalDouble, !(

The final set of surnames is small – just 36 cases after removing one duplicate. It’s a small enough dataset to list them all:

select(sentDouble,name, mean_sent) %>% arrange(mean_sent)
      name mean_sent
1     warr      -2.9
2   cruell      -2.8
3    stabb      -2.8
4    grimm      -2.7
5     robb      -2.6
6     sinn      -2.6
7     bann      -2.6
8  threatt      -2.4
9    hurtt      -2.4
10  grieff      -2.2
11    fagg      -2.1
12    sadd      -2.1
13   glumm      -2.1
14   liess      -1.8
15   crapp      -1.6
16    nagg      -1.5
17    gunn      -1.4
18   trapp      -1.3
19   stopp      -1.2
20   dropp      -1.1
21    cutt      -1.1
22   dragg      -0.9
23    rigg      -0.5
24    wagg      -0.2
25  stoutt       0.7
26    topp       0.8
27    fann       1.3
28    fitt       1.5
29  smartt       1.7
30    yess       1.7
31   gladd       2.0
32    hugg       2.1
33    funn       2.3
34    wonn       2.7
35    winn       2.8
36    loll       2.9
Of these 36 cases, 24 have negative sentiment when the final letter is removed and only 12 positive sentiment, a significant skew toward padding surnames that would express negative sentiment if unpadded as determined through a one-tailed binomial test:

binom.test(24,36, alternative=c("greater"))
    Exact binomial test

data:  24 and 36
number of successes = 24, number of trials = 36, p-value = 0.03262
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
 0.516585 1.000000
sample estimates:
probability of success 

An alternative explanation for this pattern is that there are more surnames with negative than positive sentiment overall, providing greater opportunity for negative surnames to be padded with extra letters. However, if anything, there are slightly more surnames with positive than negative sentiment in the Census database (294 vs. 254).

In sum, US surnames are more likely to be padded with extra letters when the unpadded version would express negative rather than positive sentiment. These results align with other naming patterns that indicate an aversion toward negative sentiment. Such aversions are consistent with nominal realism or the cross-cultural tendency to transfer connotations from a name to the named.

Finally, in case you’re wondering, no—“Schitt,” “Shitt,” nor “Sh*t” appear in the US Census database of surnames (at least in 2010). However, “Dicke,” “Asse,” and “Paine” do appear, illustrating another way to pad proper names besides letter doubling: Adding a final unpronounced “e.” But that’s a tale for another blog….

Sentiment Analysis of Surnames

Sentiment analysis is typically applied to connected text such as product reviews. However, it can also be extended to names, potentially delivering rich insights into psychology and culture. Globally and historically, names hold important familial, cultural, and religious significance. The foundation for much of this significance is a concept called nominal realism, which holds that the name imbues characteristics into the named. For instance, personal names in many cultures are based on totemic animals so that valued traits of the totem are transferred to the human namesake. We see nominal realism in our own culture from our tendency to name sports teams such as the Detroit Lions and Chicago Bears after predatory animals with stereotypically aggressive dispositions rather than, say, the Cincinnati Sloths, Chicago Sheep, or Green Bay Guinea Pigs.

My own research has examined nominal realism in names by documenting biases toward positive versus negative sentiment in names. For example, in cultures around the world, people emphasize the positive more than the negative in everyday speech. But I found a much more pronounced focus on the positive in a sentiment analysis of US place names. The positivity bias is especially large in names of cities and towns – which are closely connected to the self – than names of natural features. More recently, I’ve shown that business names also show a strong bias toward positive over negative words – with consequences for business performance. Specifically, revenues of businesses containing negative words are significantly lower than those for businesses containing positive or neutral words. In this post, I will extend sentiment analysis to surnames such as “Smith” and “Jones”. Surnames are interesting since technically they have no meaning, although they may at one time. Today’s “Shoemakers” for example are probably no more likely to be in that profession than those with other surnames (though I suppose this is an assumption that warrants testing). That said,  sentiment analysis would code surnames like “Grief” and “Coward” as negative while “Hardy” and “Courage” would be coded positive. Nominal realism would predict that negative surnames would be less common than positive surnames given fears that negative characteristics of the name would carry over to the named. I tested this hypothesis using a data set of surnames occurring at least 100 times in the 2010 US Census. We’ll start the analysis by first reading the downloaded csv file into a data frame, and then streamlining to just the two key variables used in the analysis, the name and count of occurrences:
surnames <- read.csv("/Users/mike/R/Names/Names_2010Census.csv",header=TRUE)
surnames <- select(surnames, name, count)
      name   count
1    SMITH 2442977
2  JOHNSON 1932812
3 WILLIAMS 1625252
4    BROWN 1437026
5    JONES 1425470
6   GARCIA 1166120
Next, we’ll convert the surnames to lower case for matching to a sentiment dictionary:
surnames$name <- tolower(surnames$name)
We’ll identify surnames with positive or negative sentiment using the AFINN sentiment lexicon, specifically the 2011 version. Each of the 2477 words in this lexicon is coded with an integer score ranging from -5 to +5 with negative/positive values reflecting sentiment valence and magnitude. I downloaded this lexicon, saving in Excel which we’ll load and merge with the surnames data frame. We’ll then remove all non-matching surnames (i.e., the vast majority of names like “Baker” and “Smith” with neutral sentiment).
affin <- read_excel("/Users/mike/R/AFFIN_Sentiment_Lexicon.xlsx", sheet="AFINN-111")
surnames <- left_join(surnames,affin,by=c("name"="Word"))
surname_sent <- filter(surnames,!
This leaves us with 332 surnames with a coded sentiment score, representing 13% of the words in the AFINN sentiment lexicon. We can look at a few surnames randomly selected from those with positive and negative sentiment to get a sense for them:
filter(surname_sent, Sentiment > 0) %>% slice_sample(n=10)
        name count Sentiment
1       free  9923         1
2      mercy   575         2
3    freedom   138         2
4       gift  1490         2
5   straight  4307         1
6       fair 18609         2
7      spark   472         1
8     heaven   625         2
9      hardy 80252         2
10 brilliant   491         4

filter(surname_sent, Sentiment < 0) %>% slice_sample(n=10)
      name count Sentiment
1     fail   754        -2
2  failing   717        -2
3   sullen   401        -2
4    angry   154        -3
5     bias  6518        -1
6     sore   115        -1
7    moody 64429        -1
8    blind   835        -1
9     glum   118        -2
10    lack  2661        -2
Next, we’ll test whether surnames with positive sentiment occur more frequently than those with negative sentiment, as nominal realism would predict. Consistent with other word frequency analyses that include words with a huge frequency range (100-2,442,977), we’ll first convert the frequency counts to logs and use those values in a t-test:
t.test(log10(surname_sent$count[surname_sent$Sentiment >0]),log10(surname_sent$count[surname_sent$Sentiment < 0]))
    Welch Two Sample t-test

data:  log10(surname_sent$count[surname_sent$Sentiment > 0]) and log10(surname_sent$count[surname_sent$Sentiment < 0])
t = 3.5516, df = 322.6, p-value = 0.0004399
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1283081 0.4469758
sample estimates:
mean of x mean of y 
 3.130305  2.842663 
Results were in the predicted direction, with mean frequency of positive surnames ~1350 and negative surnames ~700, or almost 2:1. I replicated the results using another sentiment lexicon. In sum, surname usage in the US shows a bias toward positive sentiment/avoidance of negative sentiment similar to those seen in US place and business names. It would be interesting to test whether there are significant consequences of having a negative surname (e.g., like the analysis of negative business names described above).