Monitoring systemic risk with R

Hi everyone!

This is my very first post! Many researchers, students and practitioners from all over the world write to me regularly about my R package SystemicR. I'm glad to contribute to the community, but questions about data management and plotting often come up, so I suppose the package documentation could be improved. As I receive more and more emails (which I always answer), I have to go a step further. To help the community in the most efficient way, I am launching a blog! The purpose is to introduce my package through a tutorial. I hope you'll find what you were looking for!

Tutorial: load, estimate and plot

First of all, you have to install and load the SystemicR package (it's available on CRAN, so that's the easy part):

# Install and load SystemicR
install.packages("SystemicR")
library(SystemicR)


See? By the way, I use RStudio with R 3.6.3; please let me know in the comments if you have problems with more recent versions.
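
If you do run into trouble, including the output of sessionInfo() (base R) with your message makes it much easier to reproduce the problem:

# Include this output when reporting an issue
sessionInfo()
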
Then, we have to deal with data input: (i) the data I used in my research paper entitled "Systemic Risk: a Network Approach", or (ii) your own data. Let's begin with my data (included in the package):

# Data management
data("data_stock_returns")
head(data_stock_returns)
data("data_state_variables")
head(data_state_variables)


That's it. Nothing too difficult. But things can be a bit more complicated if you choose to import your own data. First of all, be careful about the input file: please import a .txt or .csv file. This will avoid 90% of the issues people write to me about. Then, you have to be aware of the format of the variables that the functions take as input. The first column is named "Date", but the variable is not stored as a date: I used a character format when I created this data frame. If you import your own data, please use the "dd/mm/yyyy" format. Furthermore, my advice is to import the data from a .txt or .csv file, using one of the following commands:

# Import data
df_my_data <- read.table(file = "My_CSV_File", sep = ";")
df_my_data <- read.csv(file = "My_CSV_File", sep = ";")
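
If your dates are currently stored as Date objects (or as ISO "yyyy-mm-dd" strings), a minimal sketch for converting them to the expected "dd/mm/yyyy" character format could look like this (the column name is illustrative):

# Illustrative: convert a Date column to "dd/mm/yyyy" character strings
df_my_data$Date <- format(as.Date(df_my_data$Date), "%d/%m/%Y")
class(df_my_data$Date)   # should now be "character"
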

And please be careful about the sep argument: it must match the separator actually used in your file (";" in the example above). Note that read.table() defaults to header = FALSE, so you may also need to set header = TRUE depending on your file.
Once data management is done, the remaining part of the code is straightforward. The package SystemicR is a toolbox that provides R users with useful functions to estimate and plot systemic risk measures.

Let's begin with f_CoVaR_Delta_CoVaR_i_q(). This function computes the CoVaR and the ΔCoVaR of a given financial institution i for a given quantile q. I developed this function following each step of Adrian and Brunnermeier's (2016) research article:

# Compute CoVaR_i_q and Delta_CoVaR_i_q
f_CoVaR_Delta_CoVaR_i_q(data_stock_returns)


Then, having estimated this static measure, let's move on to the dynamic one using f_CoVaR_Delta_CoVaR_i_q_t(). Again, I developed this function following each step of Adrian and Brunnermeier's (2016) research article:

# Compute CoVaR_i_q_t , Delta_CoVaR_i_q_t and Delta_CoVaR_t
l_result <- f_CoVaR_Delta_CoVaR_i_q_t(data_stock_returns, data_state_variables)
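
The result is stored in a list-like object; a quick way to see what it contains before plotting (element names may differ across package versions):

# Inspect the top-level structure of the returned object
str(l_result, max.level = 1)
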


Of course, other systemic risk measures can be estimated. Following Billio et al. (2012), the function f_correlation_network_measures() estimates degree, closeness centrality and eigenvector centrality. This function also estimates SR and volatility as in Hasse (2020):

# Compute topological risk measures from correlation-based financial networks
l_result <- f_correlation_network_measures(data_stock_returns)

Last but not least, we shall now plot the evolution of one of the systemic risk measures using the function f_plot():

# Plot Delta_CoVaR_t and SR_t
f_plot(l_result$Delta_CoVaR_t)
f_plot(l_result$SR)


And that's it! Before the end of the year, I will do my best to propose a new version of this package, including updated data and other systemic risk measures. Any suggestions are welcome and you can contact me if needed. Last, please do not hesitate to share your thoughts or questions in the comments below!

References

Adrian, T., & Brunnermeier, M. K. (2016). CoVaR. American Economic Review, 106(7), 1705-1741.

Billio, M., Getmansky, M., Lo, A. W., & Pelizzon, L. (2012). Econometric measures of connectedness and systemic risk in the finance and insurance sectors. Journal of Financial Economics, 104(3), 535-559.

Hasse, J.-B. (2020). Systemic Risk: A Network Approach. AMSE Working Paper.

Four (4) Different Ways to Calculate DCF Based ‘Equity Cash Flow (ECF)’ – Part 2 of 4

This is Part 2 of a 4-part series on calculating Equity Cash Flow (ECF) using R.  If you missed Part 1, be certain to read that first part before proceeding: the content builds on the information and data described there.

The previous post (Part 1) is located here.

'ECF – Method 2' is defined as follows:

ECF = FCFF − after-tax CFd

The equation appears innocent enough, though there are many underlying terms that require definition in order to understand the calculation. In words, 'ECF – Method 2' equals Free Cash Flow to the Firm (FCFF) minus after-tax Debt Cash Flow (CFd).

Reference details of the 5-year capital project's fully integrated financial statements, developed in R, at the following link.  The R output is formatted in Excel; zoom for detail.

https://www.dropbox.com/s/lx3uz2mnei3obbb/financial_statements.pdf?dl=0
The first order of business is to define the terms necessary to calculate FCFF.




Next, pretax Debt Cash Flow (CFd) and its components are defined as follows:
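
For readers who want the algebra in one place, here is a compact restatement using the variable names from the code below (Δ denotes the year-over-year change); it simply mirrors the ECF_2 function:

gcf    = ni + bd + chg_DTL_net − gain + sp + ie·(1 − T_) − ii·(1 − T_)
OWC    = (cash + ar + inv + pe) − (ap + wp + itp)
FCFF   = gcf − CapX − ΔOWC
CFd_AT = ie·(1 − T_) − ΔN,  where N = LTD + cpltd + np
ECF (Method 2) = FCFF − CFd_AT
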

The following data, taken from the financial statements, are added to the 'data' tibble from the prior article:
data <- data %>%
  mutate(ie    = c(0, 10694, 8158, 527, 627, 717 ),
         np    = c(31415, 9188, 13875,  16500, 18863, 0),
         LTD   = c(250000, 184952, 0, 0, 0, 0),
         cpltd = c(0, 20550, 0, 0, 0, 0),
         ni    =  c(0, 47584,  141355,  262035, 325894, 511852),
         bd    =  c(0, 62500,  62500,   62500,   62500,   62500),
         chg_DTL_net = c(0, 35000,  55000,  35000, -25000, -100000),
         cash  = c(30500,  61250, 92500, 110000, 125750, 0),
         ar    = c(0, 61250,  92500,  110000,  125750, 0),
         inv   = c(30500, 61250, 92500, 110000,  125750, 0),
         pe    = c(915, 1838, 2775, 3300, 3773, 0),
         ap    = c(30500, 73500, 111000, 132000, 150900, 0),
         wp    = c(0, 5513, 8325, 9900, 11318, 0),
         itp   = c(0, -819.377,  9809,  34923, 60566, 0),
         CapX  = c(500000,0,0,0,0,0),
         gain  = c(0,0,0,0,0,162500),
         sp  = c(0,0,0,0,0,350000))
View the tibble:
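
One way to do this, reusing the rotate() helper defined in Part 1, is:

# Display the updated tibble with each line item as a row
rotate(data)
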



All of the above calculations are implemented in the R function ECF_2 below.

'ECF – Method 2' R function:
ECF_2 <- function(a) {
  
  ECF2 <- tibble(T_          = a$T_,
                 ie          = a$ie,
                 ii          = a$ii,
                 Year        = c(0:(length(ii)-1)),
                 ni          = a$ni,
                 bd          = a$bd,
                 chg_DTL_net = a$chg_DTL_net,
                 gain        = - a$gain,
                 sp          = a$sp,
                 ie_AT       = ie*(1-a$T_),                   # after-tax interest expense (added back)
                 ii_AT       = - ii*(1-a$T_),                 # after-tax interest income (removed)
                 gcf         = ni + bd + chg_DTL_net + gain + sp +
                               ie_AT + ii_AT,
                 OCA         = a$cash + a$ar + a$inv + a$pe,  # operating current assets
                 OCL         = a$ap + a$wp + a$itp,           # operating current liabilities
                 OWC         = OCA - OCL,                     # operating working capital
                 chg_OWC     = OWC - lag(OWC, default=0),
                 CapX        = - a$CapX,
                 FCFF1       = gcf + CapX - chg_OWC,          # free cash flow to the firm
                 N           = a$LTD + a$cpltd + a$np,        # interest-bearing debt
                 chg_N       = N - lag(N, default=0),
                 CFd_AT      = ie*(1-T_) - chg_N,             # after-tax debt cash flow
                 ECF2        = FCFF1 - CFd_AT )               # ECF Method 2 = FCFF - after-tax CFd
  
  ECF2 <- rotate(ECF2)
  return(ECF2)
  
}

Run the R function and view the output.
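
Mirroring the call pattern from Part 1 (the object name is just illustrative):

# Run the ECF Method 2 function and view the result
ECF_method_2 <- ECF_2(data)
ECF_method_2
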



R output formatted in Excel ('ECF – Method 2'):



'ECF – Method 2' agrees with the prior results from 'ECF – Method 1' in each year.  Any differences are due to rounding error.

This ECF calculation example is taken from my newly published textbook, 'Advanced Discounted Cash Flow (DCF) Valuation using R.'  It is discussed in far greater detail there, along with the development of the integrated financials using R as well as numerous advanced DCF valuation modeling approaches – some never before published. Importantly, the text clearly explains why these ECF calculation methods are mathematically exactly equivalent, even though the individual components appear vastly different.

Reference my website for further details.

https://www.leewacc.com/

Next up, ‘ECF – Method 3’ …

Brian K. Lee, MBA, PRM, CMA, CFA




 

Four (4) Different Ways to Calculate DCF Based ‘Equity Cash Flow (ECF)’ – Part 1 of 4



Over the next several days, I will present 4 different methods of correctly calculating Equity Cash Flow (ECF) using R.  The valuation technique of discounted cash flow (DCF) estimates equity value (E) as the present value of forecasted ECF.  The appropriate discount rate for this flow definition is the cost of equity capital (Ke).

'ECF – Method 1' is defined as follows:

ECF = DIV − ΔPIC + ΔMS − II·(1 − T)

where DIV is dividends, PIC is paid-in capital (so −ΔPIC is net new equity), MS is the marketable securities balance, II is interest income, T is the corporate tax rate, and Δ denotes the year-over-year change.
Note: ECF is not simply 'dividends.'  A common misconception is that discounted dividends (DIV) provide equity value.  An example of this is the common 'dividend growth' equity valuation model found in many corporate finance texts.  All 'dividend growth' models that discount dividends (DIV) at the cost of equity capital (Ke) are incorrect unless 1) forecasted marketable securities (MS) balances are zero, and 2) there is no issuance or repurchase of equity shares.

The data assume a 5-year hypothetical capital project.  A single revenue-producing asset is purchased at the end of 'Year 0' and is sold at the end of 'Year 5.'  The $500,000 asset is purchased assuming 50% debt and 50% equity financing.

Further, the data used to estimate ECF in this example are taken from fully integrated pro forma financial statements and other relevant data assumptions, including the corporate tax rate.  This particular example only requires financial data from the integrated pro forma income statements and balance sheets.  These 2 pro forma financial statements are shown below with the relevant data rows highlighted.



https://www.dropbox.com/s/xwy97flxe99gqr9/financials.pdf?dl=0

The above link provides access to a PDF of all financial statement pro forma data and is easily zoomable for viewing purposes.

The relevant data used to calculate ECF are initially placed in a tibble.

library(tidyverse)


data <- tibble(Year = c(0:5),
               div  = c(0, 2379, 7068, 13102, 16295, 1249876),       # dividends (DIV)
               MS   = c(0, 0, 7226, 350948, 698648, 0),              # marketable securities balances
               ii   = c(0, 0, 0, 253, 12283, 24453),                 # interest income
               pic  = c(250000, 250000, 250000, 250000, 250000, 0),  # paid-in capital
               T_   = c(0.25, 0.40, 0.40, 0.40, 0.40, 0.40))         # corporate tax rate

data



An R function is created to rotate the data into standard financial-data presentation format (each line item occupies a single row instead of a column).

rotate <- function(r) {
  
  p <- t(as.matrix(as_tibble(r)))   # transpose so each line item becomes a row
  
  return(p)
  
}

View the rotated data:

rotate(data)



An R function reads in the appropriate data, performs the necessary calculations, and outputs the data.  The R output is then placed in a spreadsheet for formatting purposes.

'ECF – Method 1' R function:
ECF_1 <- function(a) {
  
  ECF1 <- tibble( T_             = a$T_,
                  pic            = a$pic,
                  chg_pic        = pic - lag(pic, default=0),
                  MS             = a$MS,
                  ii             = a$ii,
                  Year           = c(0:(length(T_)-1)),
                  div            = a$div,
                  net_new_equity = -chg_pic,                  # minus the change in paid-in capital
                  chg_MS         = MS - lag(MS, default=0),   # change in marketable securities
                  ii_AT          = -ii*(1-T_),                # after-tax interest income (removed)
                  ECF1           = div + net_new_equity +
                                   chg_MS + ii_AT )           # ECF Method 1
  
  ECF1 <- rotate(ECF1)
  
  return(ECF1)
  
}

View the R output:

ECF_method_1 <- ECF_1(data)
ECF_method_1



Excel formatting applied to R Output



It is quite evident there is far more than just dividends (DIV) involved in the proper calculation of ECF.  Use of a ‘dividend growth’ equity valuation model in this instance would result in significant model error.

This ECF calculation example is taken from my newly published textbook, ‘Advanced Discounted Cash Flow (DCF) Valuation using R.’  It is discussed in far greater detail along with development of the integrated financials using R as well as numerous, advanced DCF valuation modeling approaches – some never before published.

Reference my website for further details.

https://www.leewacc.com/

Next up, ‘ECF – Method 2’ …

Brian K. Lee, MBA, PRM, CMA, CFA

New R textbook for machine learning

Mathematics and Programming for Machine Learning with R – Chapter 2: Logic

Have a look at the FREE attached pdf of Chapter 2 on Logic and R from my recently published textbook,

Mathematics and Programming for Machine Learning with R: From the Ground Up, by William B. Claster (Author)
~430 pages, over 400 exercises.
We discuss how to code machine learning algorithms in R, starting from scratch. The first 4 chapters cover Logic, Sets, Probability and Functions. I am sharing Chapter 2, on Logic and R, here, and will probably also release chapters 9 and 10, on the math for neural networks, shortly. The text is on sale at Amazon here:
https://www.amazon.com/Mathematics-Programming-Machine-Learning-R-dp-0367507854/dp/0367507854/ref=mt_other?_encoding=UTF8&me=&qid=1623663440

I will try to add an errata page as well.

The useR! 2021 (virtual) conference: 5-9 JULY, 2021

useR! conferences are non-profit conferences organized by community volunteers for the community, supported by the R Foundation. Attendees include R developers and users who are data scientists, business intelligence specialists, analysts, statisticians from academia and industry, and students.

The useR! 2021 conference will be the first R conference that is global by design, both in audience and leadership. Leveraging a diversity of experiences and backgrounds helps us to make the conference accessible and inclusive in as many ways as possible and to grow the global community of R users, giving new talent access to this amazing ecosystem. Being virtual makes the conference more accessible to minoritized individuals, and we strive to leverage that potential. We are paying special attention to the needs of people with a disability to ensure that they can attend and contribute to the conference as conveniently as possible. Going fully virtual and global allows us to reimagine what an R conference can offer to presenters and attendees from across the globe and from diverse backgrounds.

We have an awesome lineup of keynotes:

Paul Murrell (R Core), Edzer Pebesma (Statistics), Heidi Seibold (Research Software Engineering), Jeroen Ooms (R Programming), Meenakshi Kushwaha (R in Action), Catherine Gicheru and Katharine Hayhoe (Communications), Jonathan Godfrey, Kristian Lum, Achim Zeileis and Dorothy Gordon (Responsible programming).

We will have a full day of tutorials in four tracks and different time zones. The 22 selected tutorials will be taught in different languages (English, Spanish or French) and they are aimed at beginner, intermediate and advanced users. Have a look at the tutorial schedule.

You can register from the link on the registration page.
We determined our fee structure based on two important points:
      • We want everyone to be able to participate!
      • We would appreciate it if you pay what you can!
We have been able to extend our Early Bird window until registration closes thanks to great support from paying participants and our sponsors.
Fees are adapted to your country of residence. The rate for academic participants also applies to non-profit organizations and government employees, and the student rate also applies to retired people. Freelancers are encouraged to select the rate they feel applies best to them.

Additionally, fees are waived if your employer does not have funds to cover the conference fees. There is no need to convince anyone — you only have to tell us that you don’t have the supporting funds. If you are from a “Low Income” country, your fees are automatically waived!

We encourage both active users of R and those curious about R to attend the useR! 2021 conference.

Looking forward to meeting you at the conference!

The useR! 2021 organizing committee


Zoom talk on “Building dashboards in R/shiny (and improve them with logs and user feedback)” from the Grenoble (FR) R user group

The next talk of Grenoble’s R user group will be on May 25, 2021 at 5PM (FR) and is free and open to all:

Building dashboards in R/shiny (and improve them with logs and user feedback)

        The main goal of this seminar is to demonstrate how R/Shiny app developers can collect data from the visitors of their app (such as logs or feedback) in order to improve it. As an example, we will use the CaDyCo project, a dynamic and collaborative cartography project hosted by USMB and UGA.


Hope to see you there!

EdOptimize – An Open Source K-12 Learning Analytics Platform





Important Links

  1. Open-source code of our platform –  https://github.com/PlaypowerLabs/EdOptimize
  2. Live Platform Analytics Dashboard – https://playpowerlabs.shinyapps.io/edopt_platform_analytics/
  3. Live Curriculum Analytics Dashboard – https://playpowerlabs.shinyapps.io/edopt_curriculum_analytics/
  4. Live Implementation Analytics Dashboard – https://playpowerlabs.shinyapps.io/edopt_implementation_analytics/

Introduction
Data from EdTech platforms have a tremendous potential to positively impact student learning outcomes. EdTech leaders are now realizing that learning analytics data can be used to take decisive actions that make online learning more effective. By using the EdOptimize Platform, we can rapidly begin to observe the implementation of digital learning programs at scale. The data insights from the EdOptimize Platform can enable education stakeholders to make evidence-based decisions that are aimed at creating improved digital learning systems, programs, and their implementation.

The EdOptimize Platform is a collection of 3 extensive data dashboards. All 3 dashboards are developed using R Shiny, and all of the code, from data simulation through data processing, is native R. These dashboards contain many actionable learning analytics that we have designed from our years of work with various school districts in the US.

Here are the brief descriptions of each of the dashboards:

  1. Platform Analytics: To discover trends and power users in the online learning platform. You can use the user behavior data in this platform to identify actions that can increase user retention and engagement. See the dashboard in action here: https://playpowerlabs.shinyapps.io/edopt_platform_analytics/
  2. Curriculum Analytics: To identify learning patterns in the digital curriculum products. Using this dashboard, you can locate content that needs change and see classroom pacing analytics. You can also look at assessment data, item analysis, and standards performance of the curriculum users. See the dashboard in action here: https://playpowerlabs.shinyapps.io/edopt_curriculum_analytics/
  3. Implementation Analytics: To track the implementation of the digital programs in school districts. This dashboard will help districts make the most out of their online learning programs. See the dashboard in action here: https://playpowerlabs.shinyapps.io/edopt_implementation_analytics/

Data Processing Workflow:

To learn more about the platform in detail, please head over to https://github.com/PlaypowerLabs/EdOptimize#readme

About Playpower Labs
Playpower Labs is one of the world’s leading EdTech consulting companies. Our award-winning research team has worked with many different types of educational data. Examples include event data, assessment data, user behavior data, web analytics data, license and entitlement data, roster data, eText data, item response data, time-series data, panel data, hierarchical data, skill and standard data, assignment data, knowledge structure data, school demographic data, and more.

If you need professional help with this platform or any other EdTech data project, please contact our Chief Data Scientist Nirmal Patel at [email protected]. He will be happy to have a conversation with you!

Zoom talk on “Version control and git for beginners” from the Grenoble (FR) R user group

The next talk of Grenoble’s R user group will be on May 06, 2021 at 5PM (FR) and is free and open to all:

Version control and git for beginners

    Git is a version control system whose original purpose was to help groups of developers work collaboratively on big software projects. Git manages the evolution of a set of files – called a repository – in a sane, highly structured way. Git has been re-purposed by the data science community. In addition to using it for source code, we use it to manage the motley collection of files that make up typical data analytical projects, which often consist of data, figures, reports, and, yes, source code. In this presentation aimed at beginners, we will try to give you an understanding of what git is, how it is integrated in RStudio, and how it can help you make your projects more reproducible, enable you to share your code with the world, and collaborate in a seamless manner. (abstract strongly inspired by the intro from happygitwithr.com)


Hope to see you there!

NPL Markets and R Shiny

Our mission at NPL Markets is to enable smart decision-making through innovative trading technology, advanced data analytics and a new comprehensive trading ecosystem. We focus specifically on the illiquid asset market, a market that is partially characterized by its unstructured data.

  Platform Overview


NPL Markets fully embraces R and Shiny to create an interactive platform where sellers and buyers of illiquid credit portfolios can interact with each other and use sophisticated tooling to structure and analyse credit data. 

Creating such a platform with R Shiny was a challenge: while R Shiny is extremely well suited to analyzing data, it is not entirely a generalist web framework. Perhaps it was not intended as such, but our team at NPL Markets has managed to create an extremely productive setup.

Our development setup has a self-built 'hot reload' library, which we may release to a wider audience once it is sufficiently tested internally. This library allows us to update front-end and server code on the fly without restarting the entire application.

In addition to our development setup, our production environment uses robust error handling and reporting, preventing crashes that require an application to restart.

More generally, we use continuous integration and deployment and automatically create unit tests for any newly created R functions, allowing us to quickly iterate.

If you would like to know more about what we built with R and Shiny, reach out to us via our website: www.nplmarkets.com.

The Rhythm of Names

One of the fundamental properties of English prosody is a preference for alternations between strong and weak beats. This preference for rhythmic alternation is expressed in several ways:

      • Stress patterns in polysyllabic words like “testimony” and “obligatory” – as well as nonce words like “supercalifragilisticexpialidocious” – alternate between strong and weak beats.
      • Stress patterns on words change over time so that they maintain rhythmic alternation in the contexts in which they typically appear.
      • Over 90% of formal English poetry like that written by Shakespeare and Milton follows iambic or trochaic meter, i.e., weak-strong or strong-weak units.
      • Speakers insert disyllabic expletives into polysyllabic words at places that create or reinforce rhythmic alternation (e.g., we say "Ala-bloody-bama" or "Massa-bloody-chusetts", not "Alabam-bloody-a" or "Massachu-bloody-setts").
This blog examines whether the preference for rhythmic alternation affects naming patterns. Consider the following names of lakes in the United States:
      1. Glacier Lake
      2. Guitar Lake*
      3. Lake Louise
      4. Lake Ellen*
Names (1) and (3) preserve rhythmic alternation, whereas the asterisked (2) and (4) create a stress clash with two consecutive stressed syllables.

As you can see, lake names in the US can either begin or end with "Lake". More than 90% end in "Lake", reflecting the standard modifier + noun word order in English. But the flexibility allows us to test whether particular principles (e.g., linguistic, cultural) affect the choice between "X Lake" and "Lake X." In the case of rhythmic alternation, we would expect weak-strong "iambic" words like "Louise" and "Guitar" to be more common in names beginning with "Lake" than in names ending with it.

To test this hypothesis, I used the Quanteda package to pull all the names of lakes and reservoirs in the USGS database of placenames that contained the word “Lake” plus just one other word before or after it. Using the “nsyllable” function in Quanteda, I whittled the list down to names whose non-“lake” word contained just two syllables. Finally, I pulled random samples of 500 names each from those beginning and ending with Lake, then manually coded the stress patterns on the non-lake word in each name.

Coding details for these steps follow. First, we’ll load our place name data frame and take a look at the variable names in the data frame, which are generally self-explanatory:

# Packages used below
library(dplyr)      # filter(), sample_n()
library(quanteda)   # ntoken(), nsyllable()

setwd("/Users/MHK/R/Placenames")
load("placeNames.RData")
colnames(placeNames)
 [1] "FEATURE_ID"     "FEATURE_NAME"   "FEATURE_CLASS"  "STATE_ALPHA"    "STATE_NUMERIC" 
 [6] "COUNTY_NAME"    "COUNTY_NUMERIC" "PRIM_LAT_DEC"   "PRIM_LONG_DEC"  "ELEV_IN_M"

Next, we’ll filter to lakes and reservoirs based on the FEATURE_CLASS variable, and convert names to lower case. We’ll then flag lake and reservoir names that either begin or end with the word “lake”, filtering out those in neither category:

temp <- filter(placeNames, FEATURE_CLASS %in% c("Lake","Reservoir"))
temp$FEATURE_NAME <- tolower(temp$FEATURE_NAME)
temp$first_word <- 0
temp$last_word <- 0
temp$first_word[grepl("^lake\\b",temp$FEATURE_NAME)] <- 1
temp$last_word[grepl("\\blake$",temp$FEATURE_NAME)] <- 1
temp <- filter(temp, first_word + last_word > 0)
We’ll use the ntoken function in the Quanteda text analytics package to find names that contain just two words. By definition given the code so far, one of these two words is “lake.” We’ll separate out the other word, and use the nsyllable function in Quanteda to pull out just those words containing two syllables (i.e., “disyllabic” words). These will be the focus of our analysis.
temp$nWords <- ntoken(temp$FEATURE_NAME, remove_punct=TRUE)
temp <- filter(temp, nWords == 2)
# isolate the non-"lake" word in each two-word name
temp$otherWord <- temp$FEATURE_NAME
temp$otherWord <- gsub("^lake\\b","",temp$otherWord)
temp$otherWord <- gsub("\\blake$","",temp$otherWord)
temp$otherWord <- trimws(temp$otherWord)
# keep only names whose other word has exactly two syllables
temp$num_syl <- nsyllable(temp$otherWord)
temp <- filter(temp, num_syl == 2)
Given the large number of names with "lake" in either first or last position plus a two-syllable word (30,391 names), we'll take a random sample of 500 names beginning with "lake" and 500 ending with "lake", combine them into a single data frame, and save the result as a csv file.
lake_1 <- filter(temp, first_word == 1) %>% sample_n(500)
lake_2 <- filter(temp, last_word == 1) %>% sample_n(500)
lakeSample <- rbind(lake_1,lake_2)
write.csv(lakeSample,file="lake stress clash sample.csv",row.names=FALSE)

I manually coded each of the disyllabic non-“lake” words in each name for whether it had strong-weak (i.e., “trochaic”) or weak-strong (“iambic”) stress. This coding was conducted blind to whether the name began or ended in “Lake.” Occasionally, I came across words like “Cayuga” that the nsyllable function erred in classifying as containing two syllables. I dropped these 23 words – 2.3% of the total – from the analysis (18 in names beginning with “Lake” and 5 in names ending in “Lake”).

Overall, 90% of the non-lake words had trochaic stress, which is consistent with the dominance of this stress pattern in the disyllabic English lexicon. However, as predicted from the preference for rhythmic alternation, iambic stress was almost 5x more common in names beginning than ending with "Lake" (16.4% vs. 3.4%, χ² = 44.81, p < .00001).
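
For completeness, here is a sketch of how that comparison could be computed in R, assuming the manual coding added a (hypothetical) stress column with values "iambic" and "trochaic" to the sample file created above:

# Hypothetical sketch: 'stress' is the manually coded column
coded <- read.csv("lake stress clash sample.csv")
tab <- table(lake_first = coded$first_word, stress = coded$stress)
prop.table(tab, margin = 1)   # share of iambic vs. trochaic words by name type
chisq.test(tab)               # chi-squared test of association
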

Place names provide a rich resource for testing the potential impact of linguistic and cultural factors on the layout of our “namescape.” For example, regional differences in the distribution of violent words in US place names are associated with long-standing regional variation in attitudes towards violence. Large databases of place names along with R tools for text analytics offer many opportunities for similar analyses.