Zoom talk on “Building dashboards in R/shiny (and improve them with logs and user feedback)” from the Grenoble (FR) R user group

The next talk of Grenoble’s R user group will be on May 25, 2021 at 5PM (FR) and is free and open to all:

Building dashboards in R/shiny (and improve them with logs and user feedback)

        The main goal of this seminar is to demonstrate how R/Shiny app developers can collect data from the visitors of their app (such as logs or feedback) to improve it. We will use the CaDyCo project, a dynamic and collaborative cartography project hosted by USMB and UGA, as an example.

Hope to see you there!

EdOptimize – An Open Source K-12 Learning Analytics Platform

Important Links

  1. Open-source code of our platform –  https://github.com/PlaypowerLabs/EdOptimize
  2. Live Platform Analytics Dashboard – https://playpowerlabs.shinyapps.io/edopt_platform_analytics/
  3. Live Curriculum Analytics Dashboard – https://playpowerlabs.shinyapps.io/edopt_curriculum_analytics/
  4. Live Implementation Analytics Dashboard – https://playpowerlabs.shinyapps.io/edopt_implementation_analytics/

Data from EdTech platforms have a tremendous potential to positively impact student learning outcomes. EdTech leaders are now realizing that learning analytics data can be used to take decisive actions that make online learning more effective. By using the EdOptimize Platform, we can rapidly begin to observe the implementation of digital learning programs at scale. The data insights from the EdOptimize Platform can enable education stakeholders to make evidence-based decisions that are aimed at creating improved digital learning systems, programs, and their implementation.

The EdOptimize Platform is a collection of 3 extensive data dashboards. All 3 dashboards are developed using R Shiny, and all the code, from data simulation through data processing, is native R. These dashboards contain many actionable learning analytics that we have designed from our years of work with various school districts in the US.

Here are the brief descriptions of each of the dashboards:

  1. Platform Analytics: To discover trends and power users in the online learning platform. You can use the user behavior data in this platform to identify actions that can increase user retention and engagement. See the dashboard in action here: https://playpowerlabs.shinyapps.io/edopt_platform_analytics/
  2. Curriculum Analytics: To identify learning patterns in the digital curriculum products. Using this dashboard, you can locate content that needs change and see classroom pacing analytics. You can also look at assessment data, item analysis, and standards performance of the curriculum users. See the dashboard in action here: https://playpowerlabs.shinyapps.io/edopt_curriculum_analytics/
  3. Implementation Analytics: To track the implementation of the digital programs in school districts. This dashboard will help districts make the most out of their online learning programs. See the dashboard in action here: https://playpowerlabs.shinyapps.io/edopt_implementation_analytics/

    Data Processing Workflow :

    To learn more about the platform in detail please head over to – https://github.com/PlaypowerLabs/EdOptimize#readme

    About Playpower Labs
Playpower Labs is one of the world’s leading EdTech consulting companies. Our award-winning research team has worked with many different types of educational data. Examples include event data, assessment data, user behavior data, web analytics data, license and entitlement data, roster data, eText data, item response data, time-series data, panel data, hierarchical data, skill and standard data, assignment data, knowledge structure data, school demographic data, and more.

If you need professional help with this platform or any other EdTech data project, please contact our Chief Data Scientist Nirmal Patel at [email protected]. He will be happy to have a conversation with you!

Zoom talk on “Version control and git for beginners” from the Grenoble (FR) R user group

The next talk of Grenoble’s R user group will be on May 06, 2021 at 5PM (FR) and is free and open to all:

Version control and git for beginners

    Git is a version control system whose original purpose was to help groups of developers work collaboratively on big software projects. Git manages the evolution of a set of files – called a repository – in a sane, highly structured way. Git has been re-purposed by the data science community. In addition to using it for source code, we use it to manage the motley collection of files that make up typical data analytical projects, which often consist of data, figures, reports, and, yes, source code. In this presentation aimed at beginners we will try to give you an understanding of what git is, how it is integrated in RStudio and how it can help you make your projects more reproducible, enable you to share your code with the world and collaborate in a seamless manner. (abstract strongly inspired by the intro from happygitwithr.com)

Hope to see you there!

NPL Markets and R Shiny

Our mission at NPL Markets is to enable smart decision-making through innovative trading technology, advanced data analytics and a new comprehensive trading ecosystem. We focus specifically on the illiquid asset market, a market that is partially characterized by its unstructured data.

  Platform Overview

NPL Markets fully embraces R and Shiny to create an interactive platform where sellers and buyers of illiquid credit portfolios can interact with each other and use sophisticated tooling to structure and analyse credit data. 

Creating such a platform with R Shiny was a challenge. While R Shiny is extremely well suited to analyzing data, it is not entirely a generalist web framework. Perhaps it was not intended as such, but our team at NPL Markets has managed to create an extremely productive setup.

Our development setup has a self-built ‘hot reload’ library, which we may release to a wider audience once it is sufficiently tested internally. This library allows us to update front-end and server code on the fly without restarting the entire application.

In addition to our development setup, our production environment uses robust error handling and reporting, preventing crashes that require an application to restart.
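As a rough illustration of what error handling that avoids a restart can look like in R, here is a minimal base-R sketch. It is not NPL Markets' code: `safe_call()` is a hypothetical helper invented for this example, and real reporting would send the message somewhere more useful than the console.

```r
# Hypothetical helper (not from NPL Markets): run a function and, if it
# throws an error, report the message and return a fallback value instead.
safe_call <- function(expr_fun, fallback = NULL) {
  tryCatch(
    expr_fun(),
    error = function(e) {
      message("Handled error: ", conditionMessage(e))  # report instead of crashing
      fallback
    }
  )
}

ok  <- safe_call(function() 1 + 1)                         # normal result
bad <- safe_call(function() stop("bad input"), fallback = NA)  # error is caught
```

The calling code keeps running after the failed call and can decide what to do with the fallback value.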

More generally, we use continuous integration and deployment and automatically create unit tests for any newly created R functions, allowing us to quickly iterate.

If you would like to know more about what we built with R and Shiny, reach out to us via our website www.nplmarkets.com

The Rhythm of Names

One of the fundamental properties of English prosody is a preference for alternations between strong and weak beats. This preference for rhythmic alternation is expressed in several ways:

      • Stress patterns in polysyllabic words like “testimony” and “obligatory” – as well as nonce words like “supercalifragilisticexpialidocious” – alternate between strong and weak beats.
      • Stress patterns on words change over time so that they maintain rhythmic alternation in the contexts in which they typically appear.
      • Over 90% of formal English poetry like that written by Shakespeare and Milton follows iambic or trochaic meter, i.e., weak-strong or strong-weak units.
      • Speakers insert disyllabic expletives in polysyllabic words in places that create or reinforce rhythmic alternation (e.g., we say “Ala-bloody-bama” or “Massa-bloody-chusetts” not “Alabam-bloody-a” or “Massachu-bloody-setts”)
This blog examines whether the preference for rhythmic alternation affects naming patterns. Consider the following names of lakes in the United States:
      a. Glacier Lake
      b. Guitar Lake*
      c. Lake Louise
      d. Lake Ellen*
The names (a) and (c) preserve rhythmic alternation whereas the asterisked (b) and (d) create a stress clash with two consecutive stressed syllables.

As you can see, lake names in the US can either begin or end with “Lake”. More than 90% end in “Lake” reflecting the standard modifier + noun word order in English. But the flexibility allows us to test whether particular principles (e.g., linguistic, cultural) affect the choice between “X Lake” and “Lake X.” In the case of rhythmic alternation, we would expect weak-strong “iambic” words like “Louise” and “Guitar” to be more common in names beginning than ending in “Lake.”

To test this hypothesis, I used the Quanteda package to pull all the names of lakes and reservoirs in the USGS database of placenames that contained the word “Lake” plus just one other word before or after it. Using the “nsyllable” function in Quanteda, I whittled the list down to names whose non-“lake” word contained just two syllables. Finally, I pulled random samples of 500 names each from those beginning and ending with Lake, then manually coded the stress patterns on the non-lake word in each name.

Coding details for these steps follow. First, we’ll load our place name data frame and take a look at the variable names in the data frame, which are generally self-explanatory:


Next, we’ll filter to lakes and reservoirs based on the FEATURE_CLASS variable, and convert names to lower case. We’ll then flag lake and reservoir names that either begin or end with the word “lake”, filtering out those in neither category:

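library(dplyr)  # provides filter()

# Keep lakes and reservoirs only, and lower-case the names
temp <- filter(placeNames, FEATURE_CLASS %in% c("Lake","Reservoir"))
temp$FEATURE_NAME <- tolower(temp$FEATURE_NAME)
# Flag names that begin or end with the whole word "lake"
temp$first_word <- 0
temp$last_word <- 0
temp$first_word[grepl("^lake\\b",temp$FEATURE_NAME)] <- 1
temp$last_word[grepl("\\blake$",temp$FEATURE_NAME)] <- 1
# Drop names in neither category
temp <- filter(temp, first_word + last_word > 0)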
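To see what the two regular expressions do, here is a minimal base-R sketch on a few made-up names. `toy_names` is a hypothetical vector introduced only for illustration; it is not taken from the USGS data.

```r
# Toy names invented for illustration; not rows from the USGS file
toy_names <- c("lake louise", "glacier lake", "lakeview pond", "snowflake")

# \\b is a word boundary, so "lake" must stand alone as a whole word
starts_with_lake <- grepl("^lake\\b", toy_names)
ends_with_lake   <- grepl("\\blake$", toy_names)

data.frame(toy_names, starts_with_lake, ends_with_lake)
```

The word boundary is what keeps “lakeview pond” and “snowflake” out of the two categories.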
We’ll use the ntoken function in the Quanteda text analytics package to find names that contain just two words. By definition given the code so far, one of these two words is “lake.” We’ll separate out the other word, and use the nsyllable function in Quanteda to pull out just those words containing two syllables (i.e., “disyllabic” words). These will be the focus of our analysis.
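library(quanteda)  # ntoken(); depending on the quanteda version, nsyllable() may need a companion package

temp$nWords <- ntoken(temp$FEATURE_NAME, remove_punct=TRUE)
temp <- filter(temp, nWords == 2)
# Isolate the non-"lake" word first, since the syllable count depends on it
temp$otherWord <- temp$FEATURE_NAME
temp$otherWord <- gsub("^lake\\b","",temp$otherWord)
temp$otherWord <- gsub("\\blake$","",temp$otherWord)
temp$otherWord <- trimws(temp$otherWord)
temp$num_syl <- nsyllable(temp$otherWord)
temp <- filter(temp, num_syl == 2)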
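The word-count and word-extraction steps can be sketched in base R on two toy names. This is only an illustration: splitting on whitespace is a simple stand-in for the tokenizer, the names are made up, and syllable counting is left out because it needs a syllable dictionary.

```r
# Two made-up names, already lower-cased
toy_names <- c("lake louise", "glacier lake")

# Stand-in for a token count: split on whitespace and count the pieces
n_words <- lengths(strsplit(toy_names, "\\s+"))

# Remove "lake" from whichever end it occupies, then trim the leftover space
other_word <- trimws(gsub("^lake\\b|\\blake$", "", toy_names))

data.frame(toy_names, n_words, other_word)
```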
Given the large number of names with “lake” in either first or last position plus a two-syllable word (30,391 names), we’ll take a random sample of 500 names beginning with “lake” and 500 ending with “lake”, combine them into a single data frame, and save it as a csv file.
lake_1 <- filter(temp, first_word == 1) %>% sample_n(500)
lake_2 <- filter(temp, last_word == 1) %>% sample_n(500)
lakeSample <- rbind(lake_1,lake_2)
write.csv(lakeSample,file="lake stress clash sample.csv",row.names=FALSE)
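Random draws change from run to run unless the random number generator is seeded. The sketch below shows one way to make such a sample repeatable. It is an assumption for illustration, not something the original analysis is known to have done: `draw_sample()` is a hypothetical helper, the data frame is a toy, and the seed value is arbitrary.

```r
# Toy data frame standing in for the filtered lake names
toy <- data.frame(id = 1:20, first_word = rep(c(1, 0), each = 10))

# Hypothetical helper: seed the generator, then draw n rows at random
draw_sample <- function(df, n, seed) {
  set.seed(seed)
  df[sample(nrow(df), n), , drop = FALSE]
}

first_draw  <- draw_sample(toy[toy$first_word == 1, ], 5, seed = 42)
second_draw <- draw_sample(toy[toy$first_word == 1, ], 5, seed = 42)
identical(first_draw, second_draw)  # same seed, same rows
```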

I manually coded each of the disyllabic non-“lake” words in each name for whether it had strong-weak (i.e., “trochaic”) or weak-strong (“iambic”) stress. This coding was conducted blind to whether the name began or ended in “Lake.” Occasionally, I came across words like “Cayuga” that the nsyllable function erred in classifying as containing two syllables. I dropped these 23 words – 2.3% of the total – from the analysis (18 in names beginning with “Lake” and 5 in names ending in “Lake”).

Overall, 90% of the non-lake words had trochaic stress, which is consistent with the dominance of this stress pattern in the disyllabic English lexicon. However, as predicted from the preference for rhythmic alternation, iambic stress was almost 5x more common in names beginning than ending with “Lake” (16.4% vs. 3.4%, χ² = 44.81, p < .00001).
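The test itself is a chi-square test on a 2 x 2 table of name position by stress pattern. The raw counts are not given in the text, so the counts in this sketch are back-calculated from the reported percentages and exclusions (482 and 495 usable names after dropping 18 and 5). They are an assumption, but with R's default continuity correction they should give a statistic close to the reported one.

```r
# Counts back-calculated from the reported figures (an assumption):
# 500 - 18 = 482 usable "Lake X" names, 500 - 5 = 495 usable "X Lake" names,
# with about 16.4% and 3.4% iambic words respectively.
counts <- matrix(c(79, 403,
                   17, 478),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(position = c("Lake X", "X Lake"),
                                 stress   = c("iambic", "trochaic")))

round(100 * prop.table(counts, margin = 1), 1)  # row percentages

# chisq.test() applies Yates' continuity correction to 2 x 2 tables by default
result <- chisq.test(counts)
result$statistic
result$p.value
```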

Place names provide a rich resource for testing the potential impact of linguistic and cultural factors on the layout of our “namescape.” For example, regional differences in the distribution of violent words in US place names are associated with long-standing regional variation in attitudes towards violence. Large databases of place names along with R tools for text analytics offer many opportunities for similar analyses.

Tutorial: Cleaning and filtering data from Qualtrics surveys, and creating new variables from existing data

Hi fellow R users (and Qualtrics users),

As many Qualtrics surveys produce really similar output datasets, I created a tutorial with the most common steps to clean and filter data from datasets directly downloaded from Qualtrics.

You will also find some useful codes to handle data such as creating new variables in the dataframe from existing variables with functions and logical operators.

The tutorial is presented in the format of a downloadable R code with  explanations and annotations of each step. You will also find a raw Qualtrics dataset to work with.

Link to the tutorial: https://github.com/angelajw/QualtricsDataCleaning

This dataset comes from a Qualtrics survey with an experiment format (control and treatment conditions), but the codes can be applicable to non-experimental datasets as well, as many cleaning steps are the same.

New Version of the Package “Vecsets” is on CRAN

The base “sets” tools follow the algebraic definition that each element of a set must be unique. Since it’s often helpful to compare all elements of two vectors, this toolset treats every element as unique for counting purposes. For ease of use, all functions in vecsets have an argument multiple which, when set to FALSE, reverts them to the base set tools functionality.
For this revision, the code for most functions was rewritten to increase processing speed dramatically. A new function, “vperm,” generates all permutations of all possible combinations taken N elements at a time from an input vector.
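The difference between the two views can be sketched in base R. Note that `multiset_intersect()` below is a hypothetical helper written for this illustration. It is not the vecsets implementation and does not use the package.

```r
# Hypothetical helper: an intersection that keeps repeated elements,
# each repeated as often as it occurs in BOTH vectors
multiset_intersect <- function(x, y) {
  common <- intersect(x, y)
  unlist(lapply(common, function(v) rep(v, min(sum(x == v), sum(y == v)))))
}

x <- c(1, 1, 2, 3)
y <- c(1, 1, 1, 3, 4)

base_result     <- intersect(x, y)           # duplicates collapsed: 1 3
multiset_result <- multiset_intersect(x, y)  # duplicates counted:   1 1 3
```

The base function reports that 1 and 3 are shared; the multiset version also reports that two copies of 1 are shared.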

Hackathon sponsored by NanoString Technologies

NanoString Technologies is sponsoring a hackathon with DevPost to spur development of tools and methods in the form of R packages for its GeoMx Spatial Biology platform.  GeoMx allows measurement of protein and RNA expression within selected regions of a tissue slide versus having to measure the slide as a whole.  This allows you to compare expression between tumor and non-tumor cells within a tumor biopsy slide, for example.

A dataset from a collection of kidney biopsy samples is available for use in the hackathon and packages that either create new and insightful graphs or data analysis methods are encouraged.  NanoString has developed some infrastructure packages based on ExpressionSets that are available on Bioconductor to allow developers to focus more on graphs and methods versus data input.

A top prize of $10,000 is available to the top winner with other prizes for second and third place winners.  Details can be found at the hackathon website.



ExpDes: An R Package for ANOVA

Analysis of variance (ANOVA) is a common way of analysing experiments. However, depending on the design and/or the analysis scheme, it can be a hard task. ExpDes, short for Experimental Designs, is a package that aims to make this task easier. Devoted to fixed models and balanced experiments, ExpDes lets the user handle additional treatments in a single run, supports several experimental designs, and produces standard, easy-to-interpret output.

The main purpose of the package ExpDes is to analyze simple experiments under completely randomized designs (crd()), randomized block designs (rbd()) and Latin square designs (latsd()). It also enables the analysis of treatments in a factorial design with 2 and 3 factors (fat2.crd(), fat3.crd(), fat2.rbd(), fat3.rbd()) and the analysis of split-plot designs (split2.crd(), split2.rbd()). It can additionally analyze experiments with one additional treatment in completely randomized and randomized block designs with 2 or 3 factors (fat2.ad.crd(), fat2.ad.rbd(), fat3.ad.crd() and fat3.ad.rbd()).

After loading the package and reading and attaching the data, a single command is required to analyze any situation. For instance, consider a completely randomized design for a qualitative factor. 
 crd(treat, resp, quali = TRUE, mcomp = "tukey") 
Here Tukey's test is used for the multiple comparisons; other tests are implemented in the package. For a quantitative factor, linear and nonlinear regression models can be used instead.
 crd(treat, resp, quali = FALSE, sigF = 0.05) 
Analysis of Variance Table
            DF      SS      MS      Fc      Pr>Fc
Treatament  3   214.88  71.626  6.5212  0.0029622
Residuals   20  219.67  10.984      
Total       23  434.55  
CV = 3.41 %

Shapiro-Wilk normality test
p-value:  0.91697 
According to Shapiro-Wilk normality test at 5% of significance, 
residuals can be considered normal.

Homogeneity of variances test
p-value:  0.1863216 
According to the test of bartlett at 5% of significance, 
residuals can be considered homocedastic.

Tukey's test
Groups  Treatments  Means
  a            A    102.1983
   b           B    96.74333
   b           C    95.05833
   b           D    94.74333
 a <- crd(treat, resp)

The object “a” contains information about residuals and means, among other things, so that graphs of averages and further analyses can be produced. Graphical ExpDes (GExpDes) extensions are being worked on and new features are in progress. More information is available in DOI 10.4236/am.2014.519280 and at https://cran.r-project.org/web/packages/ExpDes/index.html.
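For readers who want to see the pieces that such a report combines, here is a minimal base-R sketch of a one-way ANOVA for a completely randomized design. It does not call ExpDes and is not its internal code. The data are simulated, and the treatment means, standard deviation and seed are arbitrary assumptions.

```r
# Simulated data: 4 treatments x 6 replicates (values are made up)
set.seed(1)  # arbitrary seed so the simulated data are repeatable
treat <- factor(rep(c("A", "B", "C", "D"), each = 6))
resp  <- rnorm(24, mean = rep(c(102, 97, 95, 95), each = 6), sd = 3)

fit <- aov(resp ~ treat)

summary(fit)                  # ANOVA table
shapiro.test(residuals(fit))  # normality of residuals
bartlett.test(resp ~ treat)   # homogeneity of variances
TukeyHSD(fit)                 # pairwise comparisons of treatment means
```

With 24 observations and 4 treatments, the table has 3 degrees of freedom for treatments and 20 for residuals, the same layout as the output shown above.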

From “Sh*t’s Creek” to “Schitt’s Creek”: On Padding Surnames with Extra Letters

We typically think of English and related spelling systems as mapping orthographic units or graphemes onto units of speech sounds, or phonemes. For instance, each of the three letters in “pen” maps to one of the three phonemes /p/, /ɛ/, and /n/ in the spoken version of the word. But there is considerable flexibility in the English spelling system, enabling other information to be encoded while still preserving phonemic mapping. For example, padding the ends of disyllabic words with extra unpronounced letters indicates that accent or stress should be placed on the second syllable instead of the more common English pattern of first syllable stress (e.g., compare “trusty” with “trustee”, “gravel” with “gazelle”, or “rivet” with “roulette”).

Proper names provide a rich resource for exploring how spelling systems are used to convey more than sound. Consider “Gerry” and “Gerrie” for example. These names are pronounced the same, but the final vowel /i/ is spelled differently. The difference is associated with gender: Between 1880 and 2016, 99% of children named “Gerrie” have been girls compared with 32% of children named “Gerry.” More generally, as documented in the code below, name-final “ie” is more associated with girls than boys. On average, names ending in “ie” and “y” are given to girls 84% and 66% of the time respectively (i.e., names ending in the sound /i/ tend to be given to girls, but more so if spelled with “ie” than “y”).

#Data frame I created from US Census dataset of baby names 
load("/Users/MHK/R/Baby Names/NamesOverall.RData")
sumNames$final_y_ie <- grepl("y$|ie$",sumNames$Name)
final_y_ie <- filter(sumNames, final_y_ie==TRUE)
final_y_ie$prop_f <- final_y_ie$femaleTotal/final_y_ie$allTotal
    Welch Two Sample t-test

t = 20.624, df = 7024.5, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1684361 0.2038195
sample estimates:
mean of x mean of y 
0.8412466 0.6551188 
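The call that produced the output above is not shown. A comparison of this kind can be sketched as a Welch t-test on the proportion of girls per name, grouped by spelling. The data frame below is a toy: only the “gerrie” and “gerry” proportions come from the text, the other values are invented, and the column names are assumptions.

```r
# Toy data; only the gerrie/gerry proportions come from the text
toy <- data.frame(
  Name   = c("gerrie", "annie", "katie", "gerry", "billy", "kelly"),
  prop_f = c(0.99, 0.98, 0.97, 0.32, 0.05, 0.85)
)

# Classify each name by its final spelling
toy$ending <- ifelse(grepl("ie$", toy$Name), "ie", "y")

# t.test() uses the Welch (unequal variances) version by default
result <- t.test(prop_f ~ ending, data = toy)
result$method
result$estimate
```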
Capitalizing the first letter in proper nouns is perhaps the most well-known example of how we use the flexibility in spelling systems to convey information beyond pronunciation. More subtly, we sometimes increase the prominence of proper names by padding with extra unpronounced letters as in “Penn” versus “pen” and “Kidd” versus “kid.” An interesting question is what factors influence whether or not a name is padded. Which brings us to Schitt’s Creek. The title of the popular series plays exactly on the fact that padding the name with extra letters that don’t affect pronunciation hides the expletive. This suggests a hypothesis: Padded names should be more common when the unpadded version contains negative sentiment, which might carry over via psychological “contagion” from the surname to the person. So, surnames like “Grimm” and “Sadd” should be more common than surnames like “Winn.”

I tested this hypothesis using a data set of surnames occurring at least 100 times in the 2010 US Census. Specifically, I flagged all monosyllabic names that ended in double letters. I restricted to monosyllabic names since letter doubling can affect accent placement, as noted above, which could create differences between the padded and unpadded versions. Next, I stripped off the final letter from these names and matched to a sentiment dictionary. Finally, I tested whether surnames were more likely to be padded if the unpadded version expressed negative sentiment.

The following R code walks through these steps. We’ll start by first reading the downloaded csv file of surnames into a data frame, and then converting the surnames from upper case in the Census file to lower case for mapping to a sentiment dictionary:

surnames <- read.csv("/Users/mike/R/Names/Names_2010Census.csv",header=TRUE)
surnames$name <- tolower(surnames$name)

Next, we’ll flag all monosyllabic names using the nsyllable function in the Quanteda package, identify those with a final double letter, and place into a new data frame:

surnames$num_syllables <- nsyllable(surnames$name)
surnames$finalDouble <- grepl("(.)\\1$", surnames$name)
oneSylFinalDouble <- filter(surnames, num_syllables == 1 & finalDouble == TRUE)
oneSylFinalDouble <- select(oneSylFinalDouble, name, num_syllables, finalDouble)

Finally, we’ll create a variable that strips each name of its final letter (e.g., “grimm” becomes “grim”) and match the latter to a sentiment dictionary, specifically VADER (“Valence Aware Dictionary and sEntiment Reasoner”). We’ll then put the matching words into a new data frame:

oneSylFinalDouble$Stripped <- substr(oneSylFinalDouble$name,1,nchar(oneSylFinalDouble$name)-1)
vader <- read.csv("/Users/MHK/R/Lexicon/vader_sentiment_words.csv",header=TRUE)
vader$word <- as.character(vader$word)
vader <- select(vader,word,mean_sent)
oneSylFinalDouble <- left_join(oneSylFinalDouble, vader, by=c("Stripped"="word"))
sentDouble <- filter(oneSylFinalDouble, !(is.na(mean_sent)))
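The flag, strip, and match steps can be tried on a handful of names in base R. The sketch below uses a tiny hypothetical lexicon (`toy_lexicon`) whose three scores are the ones listed for these names in the results. It uses `merge()` in place of `left_join()` and is not the full VADER file.

```r
# Toy surnames; "smith" is included as a name without a final double letter
toy_surnames <- c("grimm", "winn", "sadd", "smith")

# (.)\\1$ matches any character followed by itself at the end of the string
has_final_double <- grepl("(.)\\1$", toy_surnames)
padded   <- toy_surnames[has_final_double]
stripped <- substr(padded, 1, nchar(padded) - 1)  # drop the last letter

# Hypothetical three-word lexicon standing in for the sentiment dictionary
toy_lexicon <- data.frame(word = c("grim", "win", "sad"),
                          mean_sent = c(-2.7, 2.8, -2.1))

matched <- merge(data.frame(name = padded, Stripped = stripped),
                 toy_lexicon, by.x = "Stripped", by.y = "word")
matched
```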

The final set of surnames is small – just 36 cases after removing one duplicate. It’s a small enough dataset to list them all:

select(sentDouble,name, mean_sent) %>% arrange(mean_sent)
      name mean_sent
1     warr      -2.9
2   cruell      -2.8
3    stabb      -2.8
4    grimm      -2.7
5     robb      -2.6
6     sinn      -2.6
7     bann      -2.6
8  threatt      -2.4
9    hurtt      -2.4
10  grieff      -2.2
11    fagg      -2.1
12    sadd      -2.1
13   glumm      -2.1
14   liess      -1.8
15   crapp      -1.6
16    nagg      -1.5
17    gunn      -1.4
18   trapp      -1.3
19   stopp      -1.2
20   dropp      -1.1
21    cutt      -1.1
22   dragg      -0.9
23    rigg      -0.5
24    wagg      -0.2
25  stoutt       0.7
26    topp       0.8
27    fann       1.3
28    fitt       1.5
29  smartt       1.7
30    yess       1.7
31   gladd       2.0
32    hugg       2.1
33    funn       2.3
34    wonn       2.7
35    winn       2.8
36    loll       2.9
Of these 36 cases, 24 have negative sentiment when the final letter is removed and only 12 positive sentiment, a significant skew toward padding surnames that would express negative sentiment if unpadded as determined through a one-tailed binomial test:

binom.test(24,36, alternative=c("greater"))
    Exact binomial test

data:  24 and 36
number of successes = 24, number of trials = 36, p-value = 0.03262
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
 0.516585 1.000000
sample estimates:
probability of success 
             0.6666667 
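The one-tailed p-value is simply the probability of seeing 24 or more negative names out of 36 if negative and positive were equally likely. As a check, it can be computed directly from the binomial distribution and compared with `binom.test()`:

```r
n_names    <- 36  # padded surnames that matched the sentiment dictionary
n_negative <- 24  # of which the unpadded form is negative

# P(X >= 24) for X ~ Binomial(36, 0.5), summed term by term
p_by_hand <- sum(dbinom(n_negative:n_names, size = n_names, prob = 0.5))

# The same quantity from the exact test used in the text
p_from_test <- binom.test(n_negative, n_names, alternative = "greater")$p.value

c(by_hand = p_by_hand, binom_test = p_from_test)
```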

An alternative explanation for this pattern is that there are more surnames with negative than positive sentiment overall, providing greater opportunity for negative surnames to be padded with extra letters. However, if anything, there are slightly more surnames with positive than negative sentiment in the Census database (294 vs. 254).

In sum, US surnames are more likely to be padded with extra letters when the unpadded version would express negative rather than positive sentiment. These results align with other naming patterns that indicate an aversion toward negative sentiment. Such aversions are consistent with nominal realism or the cross-cultural tendency to transfer connotations from a name to the named.

Finally, in case you’re wondering, no: none of “Schitt,” “Shitt,” or “Sh*t” appears in the US Census database of surnames (at least in 2010). However, “Dicke,” “Asse,” and “Paine” do appear, illustrating another way to pad proper names besides letter doubling: adding a final unpronounced “e.” But that’s a tale for another blog….