Regression Modeling Strategies Short Course 2026, with Frank Harrell; May 14, 15, 18, 19

Frank Harrell’s Regression Modeling Strategies online seminar will take place May 14, 15, 18, and 19.

This workshop covers principled strategies for building, validating, and interpreting multivariable regression models for a wide range of outcomes, with emphasis on predictive accuracy, avoiding overfitting, and interpreting estimated effects. It explores spline methods, data reduction, benefits of Bayesian modeling, robust semiparametric ordinal, longitudinal, and survival models, and rigorous resampling-based validation, illustrated with applied case studies and R examples. More details here.

Along with the 1-day Introduction to R, Regression, and the rms Package, these virtual seminars are offered through Instats, in association with the A.S.A.

Reactive Shiny Apps and Deployment with Google Cloud Run: Intermediate R Shiny Workshop

Join our workshop on Reactive Shiny Apps and Deployment with Google Cloud Run: Intermediate R Shiny Workshop, which is part of our Workshops for Ukraine series!


Here’s some more info: 


Title: Reactive Shiny Apps and Deployment with Google Cloud Run: Intermediate R Shiny Workshop 

Date: Thursday, May 21st, 18:00 – 20:00 CET (Rome, Berlin, Paris timezone) 

Speaker: Alfredo Hernández Sánchez is a Marie Skłodowska Curie ERA Postdoctoral Fellow at Vilnius University, where he leads the FIRSA project on financial regulation and innovation in Europe. His work combines applied research, data analysis, and reproducible computational methods, with a strong interest in turning research outputs into accessible digital tools such as dashboards and interactive web applications. He works extensively with R, Quarto, and Shiny in academic and policy-oriented settings.

www.alfredohs.com 

Description: This workshop is designed for people who already know the basics of Shiny and want to build apps that are more robust, more reactive, and easier to maintain. We will look at practical reactive patterns, app structure, and some common choices that make Shiny dashboards easier to develop as they become more complex. In the second part of the workshop, I will show how a Shiny app can move from local development to a public deployment on Google Cloud Run, using a real dashboard project as an example. The session will give participants a practical introduction to a cloud-based workflow for publishing and maintaining Shiny applications in a highly customizable environment. Basic familiarity with Shiny is assumed, and some previous experience building simple apps will help participants get the most out of the session.

Minimal registration fee: 20 euro (or 20 USD or 800 UAH)







Please note that the registration confirmation is sent 1 day before the workshop to all registered participants rather than immediately after registration.


How can I register?



  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

  • Fill in the registration form, attaching a screenshot of a donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after donation).

If you are not personally interested in attending, you can also contribute by sponsoring a student, who will then be able to participate for free. If you choose to sponsor a student, all proceeds will also go directly to organisations working in Ukraine. You can either sponsor a particular student or leave it up to us to allocate the sponsored place to a student who has signed up for the waiting list.


How can I sponsor a student?


  • Save your donation receipt (after the donation is processed, there is an option to enter your email address on the website to which the donation receipt is sent)

  • Fill in the sponsorship form, attaching the screenshot of the donation receipt (please attach the screenshot of the donation receipt that was emailed to you rather than the page you see after the donation). You can indicate whether you want to sponsor a particular student or we can allocate this spot ourselves to the students from the waiting list. You can also indicate whether you prefer us to prioritize students from developing countries when assigning place(s) that you sponsored.


If you are a university student and cannot afford the registration fee, you can also sign up for the waiting list here. (Note that you are not guaranteed to participate by signing up for the waiting list).



You can also find more information about this workshop series, a schedule of our future workshops, and a list of our past workshops, for which you can get the recordings and materials, here.


Looking forward to seeing you during the workshop!

CougarStats: a free and open-source Statistics web app for Teaching and Learning

Hello,

I’d like to share CougarStats, a free and open-source R Shiny web app I developed to support the teaching and learning of Statistics. CougarStats runs entirely in a browser and is designed for accessibility and ease of use. You can explore the app here: https://www.cougarstats.ca/ 
 
The name CougarStats is inspired by Mount Royal University’s athletics mascot, the cougar, symbolizing strength and agility, and by the app’s focus on statistics. 
 
Key features of CougarStats
 
  • Descriptive Statistics: Compute measures like mean, median, mode, quartiles, IQR, standard deviation, and identify potential outliers. 
  • Data Visualization: Construct Boxplots, Histograms, and Scatterplots. 
  • Probability: Calculate marginal, joint, union, and conditional probability for contingency tables; exact and cumulative probabilities for Binomial, Poisson, Negative Binomial and Hypergeometric distributions; and cumulative probabilities for the Normal distribution. 
  • Sample Size Estimation: Determine the required sample sizes for various scenarios. 
  • Statistical Inference: Construct confidence intervals and conduct hypothesis tests for one and two samples (mean, proportion, and standard deviation). 
  • ANOVA: Perform one-way Analysis of Variance with an option to conduct Bonferroni post hoc tests. 
  • Regression and Correlation: Fit simple linear, multiple linear, and logistic regression models, and compute the Pearson correlation coefficient. 
  • Categorical Data Analysis: Perform the Chi-Square test of independence (with or without Yates’ continuity correction) and Fisher’s exact test. 
  • Nonparametric Tests: Perform the Mann-Whitney U Test, Kruskal-Wallis test etc.
 
I would be delighted if you could explore CougarStats and share it with your students and colleagues who might find it useful.
 
Thank you for your time, and I look forward to hearing your thoughts.
 
Sincerely, 
Ashok


Ashok Krishnamurthy, PhD
Associate Professor
Department of Mathematics and Computing
Mount Royal University
4825 Mount Royal Gate SW
Calgary, AB, T3E 6K6 Canada

grouper: An R package for Optimal Group Assignment

Introduction

Universities are increasingly using collaborative learning pedagogies, which can benefit learners through deeper understanding of course content and teamwork skills. However, the realisation of these sought-after benefits depends on how educators assign learners to groups.

Educators have formulated various mathematical models to perform this assignment. Some have developed models that prioritised maximising students’ project preferences. Others developed a model that prioritised students’ preferences, group sizes and group composition. Yet other models address related, but distinct, problems such as assigning students to elective courses or incorporating staff workload into student-to-project supervisor assignments.

Whichever approach is used, it is apparent that there is a need for an algorithmic solution for the assignment. This would ease the burden on the instructor, while providing an objective procedure for the assignment. Our contribution is an R package grouper that offers two flexible group allocation strategies.

Optimisation Models

grouper provides two distinct integer linear programming optimisation models.

library(grouper)
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)

Preference-Based Assignment

The Preference-Based Assignment (PBA) model allows educators to assign student groups to topics to maximise overall student preferences for those topics. The topics can be viewed as project titles. The model allows for repetitions of each project title. This formulation also allows each project team to comprise multiple sub-groups. This is useful in cases where the project requires teams with different functionality to work together, e.g. where one team works on a front-end while the other develops a back-end model.

To execute the optimisation routine, an instructor prepares:

      1. A group composition table listing the member students within each self-formed group
      2. A preference matrix containing the preference that each self-formed group has for each topic.
      3. A YAML file defining the remaining parameters of the model.

Examples

Consider the following simple dataset with 8 students:
pba_gc_ex002
#>   id grouping
#> 1  1        1
#> 2  2        1
#> 3  3        2
#> 4  4        2
#> 5  5        3
#> 6  6        3
#> 7  7        4
#> 8  8        4

Each student is in a self-formed group of size 2, indicated via the grouping column. Suppose that, for this set of students, the instructor wishes to assign students into two topics, with each topic having two sub-groups. This requires the preference matrix to have 4 columns – one for each topic-subgroup combination. Remember that the ordering of topics/subtopics in the preference matrix should be:

Topic1-Subtopic1, Topic2-Subtopic1, Topic1-Subtopic2, Topic2-Subtopic2

There should also be 4 rows in the preference matrix, one for each self-formed group.

pba_prefmat_ex002
#>      col1 col2 col3 col4
#> [1,]    4    3    2    1
#> [2,]    3    4    2    1
#> [3,]    1    2    4    3
#> [4,]    1    2    3    4
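
For readers preparing their own inputs, the two objects above could be built by hand roughly as follows. This is a minimal sketch in base R: the names my_gc and my_prefmat are hypothetical, and it assumes grouper only needs a data frame with an id column and a grouping column, plus a numeric matrix with one row per self-formed group. The package documentation remains the authority on the exact format.

```r
# Hypothetical hand-built inputs mirroring the packaged example data.
# Group composition: 8 students in 4 self-formed pairs.
my_gc <- data.frame(
  id = 1:8,
  grouping = rep(1:4, each = 2)
)

# Preference matrix: one row per self-formed group, one column per
# topic-subtopic combination, in the order T1-S1, T2-S1, T1-S2, T2-S2.
my_prefmat <- matrix(c(4, 3, 2, 1,
                       3, 4, 2, 1,
                       1, 2, 4, 3,
                       1, 2, 3, 4),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(NULL, paste0("col", 1:4)))

# Sanity checks before handing the inputs to the optimiser.
n_groups <- length(unique(my_gc$grouping))
stopifnot(nrow(my_prefmat) == n_groups,  # one row per self-formed group
          ncol(my_prefmat) == 2 * 2)     # n_topics x B columns
```

Checking the dimensions up front makes a mismatch between the group table and the preference matrix visible before the solver is called.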

The YAML file for this model contains the following parameters:

n_topics: 2
B: 2
R: 1
nmin: 2
nmax: 2
rmin: 1
rmax: 1

B corresponds to the number of sub-topics per topic, while rmin and rmax denote the minimum and maximum number of repetitions of each topic. nmin and nmax denote the minimum and maximum number of members in each sub-topic group.
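
Before running the solver, it can help to check that the parameters leave room for every student. The arithmetic below is only a sketch under an assumed reading of the parameters (every topic is run between rmin and rmax times, each repetition has B sub-topic groups, and each sub-topic group holds nmin to nmax students). It is a necessary condition, not a sufficient one, because self-formed groups cannot be split.

```r
# Parameter values copied from the YAML example above.
params <- list(n_topics = 2, B = 2, nmin = 2, nmax = 2, rmin = 1, rmax = 1)
n_students <- 8

# Assumed reading of the parameters (see the text above):
# seats = topics x repetitions x sub-topics per topic x members per sub-topic.
min_seats <- with(params, n_topics * rmin * B * nmin)
max_seats <- with(params, n_topics * rmax * B * nmax)

# The class size must fall between the smallest and largest seat count.
feasible <- n_students >= min_seats && n_students <= max_seats
feasible
```

In this example both bounds equal the class size, so every sub-topic group must be filled with exactly two students.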

It is possible to assign each self-formed group to its optimal choice of topic and sub-topic combination. In our solution, we should see that group 1 is assigned to sub-topic 1 of topic 1, group 2 is assigned to sub-topic 1 of topic 2, and so on.

df_ex002_list <- extract_student_info(pba_gc_ex002, "preference", 
                                     self_formed_groups = 2, 
                                     pref_mat = pba_prefmat_ex002)
yaml_ex002_list <- extract_params_yaml(system.file("extdata", 
                                         "pba_params_ex002.yml",  
                                          package = "grouper"),
                                      "preference")
m2 <- prepare_model(df_ex002_list, yaml_ex002_list, "preference")
result2 <- solve_model(m2, with_ROI(solver="glpk"))

assign_groups(result2, assignment = "preference", 
              dframe=pba_gc_ex002, yaml_ex002_list, 
              group_names="grouping")
#>   topic2 subtopic rep group size
#> 1      1        1   1     1    2
#> 2      2        1   1     2    2
#> 3      1        2   1     3    2
#> 4      2        2   1     4    2

Diversity-Based Assignment

The Diversity-Based Assignment (DBA) model enables educators to assign students to groups and topics with the dual, but weighted, aims of maximising diversity (based on student attributes) within groups and balancing specific skill levels across different groups.

To execute the DBA optimisation routine, the instructor prepares:

      1. A group composition table containing:
        1. the member students within each self-formed group,
        2. the demographics that will be used to compute pairwise dissimilarity between students, and
        3. a numeric measure of each student’s skill.
      2. A YAML file defining the remaining parameters of the model.

Examples

Consider the following dataset, which comes with the package. There are 4 students in total.
dba_gc_ex001
#>   id major skill groups
#> 1  1     A     1      1
#> 2  2     A     1      2
#> 3  3     B     3      3
#> 4  4     B     3      4

It is intuitive that an assignment into two groups of size two, based on the diversity of majors alone, should pair students from different majors, so that each group contains one student from major A and one from major B.

The corresponding YAML file for this exercise, dba_params_ex001.yml, consists of the following lines:

n_topics:  2
R:  1
nmin: 2
nmax: 2
rmin: 1
rmax: 1

To run the assignment, we can use the following commands. We can use either the gurobi solver, or the glpk solver for this example. Both are equally fast.

# Indicate appropriate columns using integer ids.
df_ex001_list <- extract_student_info(dba_gc_ex001, "diversity",
                                      demographic_cols = 2, 
                                      skills = 3, 
                                      self_formed_groups = 4)

yaml_ex001_list <- extract_params_yaml(system.file("extdata", 
                                         "dba_params_ex001.yml",  
                                         package = "grouper"),
                                       "diversity")
m1 <- prepare_model(df_ex001_list, yaml_ex001_list,
                    assignment="diversity",w1=0.5, w2=0.5)

result3 <- solve_model(m1, with_ROI(solver="glpk"))
assign_groups(result3, assignment = "diversity", 
              dframe=dba_gc_ex001, 
              group_names="groups")
#>   topic rep group id major skill
#> 1     1   1     2  2     A     1
#> 2     1   1     3  3     B     3
#> 3     2   1     1  1     A     1
#> 4     2   1     4  4     B     3

We can see that students 2 and 3 have been assigned to topic 1, repetition 1, and students 1 and 4 to topic 2, repetition 1. Both w1 and w2 are set to 0.5, which means the skills and demographic inputs are given equal weight in the optimisation.

At present, the routines use the daisy function from the cluster package to compute a pairwise dissimilarity matrix between students. However, it is also possible to supply your own custom dissimilarity matrix. Consider the following dataset of 4 students:

dba_gc_ex003
#>   year   major self_groups id
#> 1    1    math           1  1
#> 2    2 history           2  2
#> 3    3    dsds           3  3
#> 4    4    elts           4  4

Now consider a situation where we wish to consider years 1 and 2 different from years 3 and 4, and math and dsds (STEM majors) to be different from elts and history (non-STEM majors). For each difference, we assign a score of 1. This means that students 1 and 2 would have a dissimilarity score of 1 due to their difference in majors. Students 1 and 3 would also have a score of 1, but due to their difference in years. Students 1 and 4 would have score of 2, due to their differences in majors and in years. The overall dissimilarity matrix would be:

d_mat <- matrix(c(0, 1, 1, 2,
                  1, 0, 2, 1,
                  1, 2, 0, 1,
                  2, 1, 1, 0), nrow=4, byrow = TRUE)
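
The same matrix can also be derived from the student attributes instead of being typed by hand. The sketch below is one way to do this in base R. The data frame students is a hand-typed stand-in for dba_gc_ex003, and the helper vectors later_years and stem are hypothetical names introduced for illustration.

```r
# Hand-typed stand-in for dba_gc_ex003 (same values as printed above).
students <- data.frame(
  year = 1:4,
  major = c("math", "history", "dsds", "elts"),
  self_groups = 1:4,
  id = 1:4
)

# Binary indicators for the two distinctions described in the text.
later_years <- students$year >= 3                  # years 3-4 vs years 1-2
stem <- students$major %in% c("math", "dsds")      # STEM vs non-STEM

# Each differing indicator contributes a score of 1 to the pair.
d_mat_check <- outer(later_years, later_years, "!=") +
  outer(stem, stem, "!=")

# Compare against the hand-typed matrix from the text.
d_mat_hand <- matrix(c(0, 1, 1, 2,
                       1, 0, 2, 1,
                       1, 2, 0, 1,
                       2, 1, 1, 0), nrow = 4, byrow = TRUE)
all(d_mat_check == d_mat_hand)
```

Building the matrix from indicators scales to larger classes and avoids transcription errors in the hand-typed version.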

To run the optimisation for this model, we can execute the following code:

df_ex003_list <- extract_student_info(dba_gc_ex003, "diversity",
                                       skills = NULL,
                                       self_formed_groups = 3,
                                       d_mat=d_mat)
yaml_ex003_list <- extract_params_yaml(system.file("extdata",   
                                         "dba_params_ex003.yml",
                                         package = "grouper"), 
                                       "diversity")
m3 <- prepare_model(df_ex003_list, yaml_ex003_list, w1=1.0, w2=0.0)
result <- solve_model(m3, with_ROI(solver="glpk"))

assign_groups(result, "diversity", dba_gc_ex003,
              group_names="self_groups")
#>   topic rep group year   major id
#> 1     1   1     1    1    math  1
#> 2     1   1     4    4    elts  4
#> 3     2   1     2    2 history  2
#> 4     2   1     3    3    dsds  3

As you can see, the members within each group are maximally different from each other: they differ both in their year band and in their type of major. Notice that we specified

skills = NULL

and

w2 = 0.0

This ensures that no skills columns were taken into account in this optimisation.

Gurobi Optimiser

While the routines above use the glpk optimiser, we recommend using the Gurobi optimiser. The latter is commercial software that runs to completion much faster than glpk. For more information, please refer to this website. Note that academic licenses are available from Gurobi.

Shiny Applications

The package provides numerous options for each of the two optimisation models. However, there are also two shiny applications included with the package. They may be useful if one only needs a straightforward group assignment. 

To run the DBA shiny app, the following code will suffice:
library(shiny)
runApp(appDir=system.file("shiny", "dbaWebApp", package="grouper"))

# Analogous code for PBA app:
# runApp(appDir=system.file("shiny", "pbaWebApp", package="grouper"))

Here is a screen shot of the diversity-based shiny application.



The system folders with the shiny apps also contain example csv files for use with the apps.

More Details

The two optimisation models are flexibly parametrised. Here are some of the features:
    • Define the number of repetitions for each topic.
    • Define the max. and min. number of group members for each topic.
The vignettes also contain the precise mathematical formulation of the optimisation models. For full details, please refer to these links:

Understanding R’s `describe()` Function: A Complete Guide to Summary Statistics

Introduction to describe()

The describe() function from R’s psych package (Revelle, 2023) provides a comprehensive statistical summary of your dataset. Unlike R’s base summary() function, it includes additional metrics that are particularly useful for data exploration and assumption checking.

library(psych)
describe(your_data)

Breaking Down the Output Columns

Here’s what each column in the output represents:

Column | Description | Formula/Calculation | Ideal Use Case
--- | --- | --- | ---
vars | Variable index number | n/a | Tracking variable order
n | Complete cases | length(na.omit(x)) | Data completeness check
mean | Arithmetic average | sum(x)/n | Normally distributed data
sd | Standard deviation | sqrt(var(x)) | Measuring spread
median | 50th percentile | quantile(x, 0.5) | Skewed distributions
trimmed | Mean after removing extremes | mean(x, trim=0.1) | Robust central tendency
mad | Median absolute deviation (scaled) | mad(x), i.e. 1.4826*median(abs(x-median(x))) | Outlier-resistant spread
min | Minimum value | min(x) | Range assessment
max | Maximum value | max(x) | Range assessment
range | Max minus min | max(x)-min(x) | Total spread
skew | Distribution asymmetry | sum((x-mean(x))^3)/(n*sd(x)^3) | Detecting skew direction
kurtosis | Tailedness (excess) | sum((x-mean(x))^4)/(n*sd(x)^4)-3 | Outlier propensity
se | Standard error | sd(x)/sqrt(n) | Precision of mean estimate
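
To see how these columns arise, the statistics can be recomputed with base R alone. The following is a minimal sketch for mtcars$mpg that assumes describe() is called with its default settings; the vector name stats is introduced here for illustration. Note that base R's mad() multiplies the raw median absolute deviation by 1.4826 by default.

```r
x <- mtcars$mpg
n <- length(na.omit(x))

# Recompute the main describe() columns by hand.
stats <- c(
  mean     = mean(x),
  sd       = sd(x),
  median   = median(x),
  trimmed  = mean(x, trim = 0.1),          # drops 10% from each tail
  mad      = mad(x),                       # scaled by 1.4826 by default
  skew     = sum((x - mean(x))^3) / (n * sd(x)^3),
  kurtosis = sum((x - mean(x))^4) / (n * sd(x)^4) - 3,
  se       = sd(x) / sqrt(n)
)
round(stats, 2)
```

The rounded values can be compared with the describe(mtcars$mpg) output shown in the practical example later in this guide.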

Key Statistics and Their Interpretation

Central Tendency

  • Mean vs. Median: Differences indicate skewness
  • Trimmed Mean: Removes influence of outliers (default drops top/bottom 10%)

Variability

  • SD vs. MAD: Use MAD when outliers are present
  • Range: Simple but outlier-sensitive

Distribution Shape

  • Skewness:
    • >0: Right-tailed
    • <0: Left-tailed
    • 0: Symmetric
  • Kurtosis (Excess):
    • >0: Heavy-tailed (more outliers than normal)
    • <0: Light-tailed

Practical Examples

Example 1: MPG from mtcars

describe(mtcars$mpg)

Output Interpretation:

   vars  n   mean    sd median trimmed   mad min  max range skew kurtosis   se
1     1 32 20.09 6.03   19.2   19.70 5.41 10.4 33.9  23.5 0.61    -0.37 1.07
  • Right-skewed (mean > median, positive skew)
  • Light-tailed (negative kurtosis)
  • SD (6.03) > MAD (5.41): Suggests some outlier influence

When to Use Which Statistic

Scenario | Recommended Statistics
--- | ---
Normal Distribution | Mean, SD
Skewed Data | Median, IQR, MAD
Outlier Detection | MAD, trimmed mean, kurtosis
Parametric Testing | Mean, SE
Nonparametric Analysis | Median, IQR

Extending the Functionality

Adding IQR

The default describe() doesn’t show IQR, but you can add it:

library(dplyr)
describe(mtcars) %>% 
  mutate(IQR = apply(mtcars, 2, IQR, na.rm = TRUE))

Comparing Groups

Use describeBy() for grouped statistics:

describeBy(mtcars$mpg, group = mtcars$cyl)

Conclusion

R’s describe() function provides a powerful starting point for exploratory data analysis. By understanding each statistic it provides, you can:

  • Detect data quality issues
  • Choose appropriate analysis methods
  • Understand your variables’ distributions
  • Make informed decisions about data transformations

For formal reporting, consider supplementing these metrics with visualization and statistical tests.

Pro Tip: Always visualize your data alongside these statistics – numbers tell part of the story, but plots reveal the full picture!

Happy coding!


Reference:
Revelle, W. (2023). psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University.

Understanding Statistical Coefficients: From Regression to Variation

The Data Analyst’s Guide to  Statistical Coefficients

What Are Coefficients?

In statistics and data analysis, coefficients are numerical measures that quantify relationships between variables or characteristics of data distributions. They serve as fundamental indicators in statistical modeling and data interpretation.

1. Regression Coefficient

Definition

The regression coefficient measures the relationship between an independent variable (X) and a dependent variable (Y).

Formula

For linear model Y = aX + b:

  • a: Regression coefficient (change in Y per unit change in X)
  • b: Intercept

R Implementation

# Linear regression example
model <- lm(mpg ~ wt, data = mtcars)
summary(model)

# Extract coefficients
coef(model)

Interpretation

A coefficient of -5.34 for vehicle weight (wt, measured in units of 1,000 lbs) means that each additional 1,000 lbs is associated with an average decrease of 5.34 mpg.
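
The interpretation can be checked numerically. The sketch below, using only base R, recomputes the slope from the covariance formula for simple linear regression and confirms that two predictions one unit of wt apart (wt is recorded in units of 1,000 lbs in mtcars) differ by exactly the slope. The object names are illustrative.

```r
fit <- lm(mpg ~ wt, data = mtcars)
slope <- unname(coef(fit)["wt"])

# In simple linear regression the slope equals cov(x, y) / var(x).
slope_by_hand <- cov(mtcars$wt, mtcars$mpg) / var(mtcars$wt)

# Predicted mpg for two cars that differ by one unit of wt.
pred <- predict(fit, newdata = data.frame(wt = c(3, 4)))
diff_pred <- unname(diff(pred))

c(slope = slope, by_hand = slope_by_hand, diff = diff_pred)
```

All three numbers agree, which is what "change in Y per unit change in X" means for a fitted line.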

2. Coefficient of Determination (R²)

Definition

R-squared represents the proportion of variance in the dependent variable explained by the model (0-1 scale).

R Code

# Get R-squared value
summary(model)$r.squared

Guidelines

  • R² = 0.75 → Model explains 75% of data variation
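  • Higher values indicate a closer in-sample fit, but R² never decreases when predictors are added, so use adjusted R² (summary(model)$adj.r.squared) to compare models of different sizes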
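
As a quick check on the definition, R² can be recomputed from the residual and total sums of squares. This is a small sketch with illustrative object names. For a regression with a single predictor, the result also equals the squared Pearson correlation.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# R-squared = 1 - (residual sum of squares / total sum of squares).
sse <- sum(residuals(fit)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_by_hand <- 1 - sse / sst

# With one predictor, R-squared is the squared correlation.
r2_cor <- cor(mtcars$mpg, mtcars$wt)^2

r2_reported <- summary(fit)$r.squared
c(by_hand = r2_by_hand, squared_cor = r2_cor, reported = r2_reported)
```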

3. Coefficient of Variation (CV)

Definition

CV is a standardized measure of dispersion expressed as percentage of the mean.

Formula

CV% = (Standard Deviation / Mean) × 100%

R Function

# Calculate CV
cv <- function(x) {
  (sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE)) * 100
}

# Example usage
cv(mtcars$mpg)

Interpretation Benchmarks

  • CV < 15%: Low variability
  • 15-30%: Moderate variability
  • >30%: High variability
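
One property worth keeping in mind when applying these benchmarks: the CV is unchanged by a change of units, but it is not unchanged when a constant is added. The sketch below illustrates this with a self-contained copy of the cv() function. The conversion factor from US miles per gallon to kilometres per litre and the variable names are included for illustration only.

```r
cv <- function(x) {
  (sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)) * 100
}

mpg <- mtcars$mpg
kpl <- mpg * 0.425144   # rescaling: US miles per gallon to km per litre
shifted <- mpg + 100    # shifting: adds a constant to every value

# Rescaling leaves the CV unchanged; shifting makes it smaller.
c(mpg = cv(mpg), kpl = cv(kpl), shifted = cv(shifted))
```

This is why the CV is normally reserved for ratio-scale variables with a meaningful zero.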

4. Correlation Coefficient

Definition

Measures the strength and direction of linear relationship between two variables (-1 to 1).

R Implementation

# Calculate correlation
cor(mtcars$mpg, mtcars$wt)

# Correlation matrix
cor(mtcars[, c("mpg", "wt", "hp")])

Interpretation

  • 1: Perfect positive correlation
  • -1: Perfect negative correlation
  • 0: No linear correlation
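
A correlation of zero rules out only a linear relationship. The toy data below, made up for illustration, have a perfect quadratic relationship and yet a Pearson correlation of zero.

```r
# y is completely determined by x, but the relationship is not linear.
x <- -3:3
y <- x^2

r <- cor(x, y)
r
```

Plotting the data before interpreting a correlation coefficient guards against this kind of misreading.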

Other Common Coefficients

Coefficient | Description | R Package/Function
--- | --- | ---
Skewness | Measures distribution asymmetry | moments::skewness()
Kurtosis | Measures tail heaviness | moments::kurtosis()
Concordance | Assesses agreement | epiR::epi.ccc()

Implementation in R

Comprehensive Analysis

library(psych)

# Descriptive statistics (includes multiple coefficients)
describe(mtcars)

# Full regression output
summary(lm(mpg ~ ., data = mtcars))

Custom Coefficient Calculations

# Multi-coefficient function
data_analysis <- function(x) {
  list(
    mean = mean(x),
    sd = sd(x),
    cv = cv(x),
    skewness = moments::skewness(x),
    kurtosis = moments::kurtosis(x)
  )
}

lapply(mtcars[, 1:4], data_analysis)

Visualization

library(ggplot2)
ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(title = "MPG vs Weight with Regression Line",
       x = "Weight (1,000 lbs)",
       y = "Miles per Gallon")

Key Takeaways

  1. Select coefficients based on analytical goals:
    • Variable relationships → Regression/Correlation coefficients
    • Model evaluation → R-squared
    • Variability comparison → CV
  2. R advantages:
    • Built-in functions for all major coefficients
    • Seamless integration of statistical and visual analysis
  3. Best practices:
    • Understand assumptions behind each coefficient
    • Combine statistical results with domain knowledge
    • Clearly distinguish between different coefficients
  4. Advanced applications:
    # Robust regression (for outlier-resistant coefficients)
    library(MASS)
    rlm(mpg ~ wt, data = mtcars)
    
    # Standardized coefficients
    library(lm.beta)
    lm.beta(model)

By mastering these statistical coefficients and their R implementations, you’ll be equipped to conduct more rigorous data analysis and communicate results effectively. Remember that coefficients are tools – their proper interpretation always depends on context and research questions.

Happy coding!

Producing Systematic Literature Reviews with Bibliometrix R and Biblioshiny

Summer School in Science Mapping (SSSM) 2025 – I International Edition

Title: Producing Systematic Literature Reviews with Bibliometrix R and Biblioshiny
Date & Location: June 9-13, 2025 – Naples, ITA

We are pleased to announce the upcoming Summer School in Science Mapping (SSSM) 2025 – I International Edition, an intensive training program focused on conducting Systematic Literature Reviews using the Bibliometrix R package and its shiny-app Biblioshiny.

Organized by the academic spin-off K-Synth in collaboration with the Department of Economics and Statistics at the University of Naples Federico II, the school will be held in Naples, Italy, from June 9 to June 13, 2025.

Aim and Scope

The SSSM 2025 is an intensive training program tailored for early-career researchers and academics seeking to enhance their expertise in bibliometric methods and scientific mapping. By integrating theoretical foundations with practical sessions, the school equips participants with robust skills in citation analysis, co-citation techniques, science mapping, and reproducible workflows for scholarly evaluation. Designed as both a learning and networking opportunity, SSSM 2025 fosters methodological development and international collaboration in a dynamic, research-oriented environment.

The school’s content covers:

– Overview of bibliometric theory and methods
– Query design and data retrieval from major scientific databases
– Descriptive, relational, and structural bibliometric analyses in R
– Practical training in Bibliometrix R package and Biblioshiny app
– Applications to real-world research review cases

Lecturers and Guest Speakers

The school will be led by Professors Massimo Aria and Corrado Cuccurullo, the developers of Bibliometrix and Biblioshiny.

Additionally, the 2025 edition will feature the following keynotes by distinguished international scholars in the field of scientometrics:

– Nicolas Robinson-Garcia (University of Granada), Scientific Director of the Computational Social Sciences and Humanities Unit (U-CHASS)
– Manuel Jesús Cobo Martín (University of Cádiz), lead developer of the SciMAT software
– Nicola De Bellis (University of Modena and Reggio Emilia), Coordinator of the Bibliometric Office and author of influential studies in the evaluation of scientific research

Target Audience and Prerequisites

This Summer School is designed for PhD students, postdoctoral researchers, and academics affiliated with universities or research institutions. Participants are expected to have a basic knowledge of R programming and be familiar with RStudio.

Registration and Fees

Registration is open on the official Bibliometrix website (check the Summer School section):

https://www.bibliometrix.org/sssm/

For any inquiries, feel free to contact the organizing committee at: [email protected]

R programming book in Greek language

For anyone interested, “Programming in R” (title translated) is a free book on R programming written in Greek.

It presents a programmer’s point of view of R, for beginners (in fact for people with absolutely no programming experience) to advanced programmers. 

Thus, the book does not delve deeply into data science, machine learning, statistics and other such topics where R is broadly used.

Its emphasis is on R as a programming language.

The book may be useful to people who are interested in R programming and can read (or are willing to translate) Greek text.

Chapter titles:

1. Getting started with R
2. The basic elements of R language
3. The essential tools of an R programmer
4. Common object types
5. Functions and functional programming
6. Classes and object-oriented programming
7. Collaboration with other programming languages
8. Package creation
9. Data and content

Link: https://repository.kallipos.gr/handle/11419/8588?&locale=en (in English, with link to download PDF)
DOI: http://dx.doi.org/10.57713/kallipos-100
ISBN: 978-618-5667-90-0

TALL – Text Analysis for ALL, a new R Shiny app for NLP and Text Mining workflows

TALL – Text Analysis for ALL is an R Shiny app that includes a wide set of methodologies specifically tailored for various text analysis tasks. It aims to address the needs of researchers without extensive programming skills, providing a versatile and general-purpose tool for analyzing textual data. With TALL, researchers can leverage a wide range of text analysis techniques without the burden of extensive programming knowledge, enabling them to extract valuable insights from textual data in a more efficient and accessible manner.

Setup

TALL can be installed in two ways, depending on whether you want the stable version or the latest development version.

Official release

You can install the official release of TALL, which is updated monthly, from the Comprehensive R Archive Network (CRAN).

if (!require("pak", quietly=TRUE)) install.packages("pak")
pak::pkg_install("tall")

Development release

If you want access to the most recent features and updates not yet available on CRAN, you can install the development version directly from our GitHub repository with:

if (!require("pak", quietly=TRUE)) install.packages("pak")
pak::pkg_install("massimoaria/tall")

Run TALL

Load the library with:

library("tall")

and then run TALL shiny app with:

tall()

Introduction

In the age of information abundance, researchers across diverse disciplines are confronted with the formidable task of analyzing voluminous textual data. Textual data, encompassing research articles, social media posts, customer reviews, and survey responses, harbors invaluable insights that can propel knowledge advancement in various fields, ranging from social sciences to healthcare and beyond. Researchers endeavor to analyze textual data to unveil patterns, discern trends, extract meaningful information, and gain deeper understandings of diverse phenomena. By leveraging sophisticated natural language processing (NLP) techniques and machine learning algorithms, researchers can delve into the semantic and syntactic structures of texts, perform topic detection, polarity detection, and text summarization, among other analyses. Additionally, the advent of digital platforms and the exponential growth of online content have generated unprecedented volumes of textual data that were previously inaccessible or challenging to acquire.

Researchers can harness the power of these textual resources to delve into novel research questions, corroborate existing theories, and generate groundbreaking insights. Through the utilization of computational tools and methodologies, researchers can efficiently process and analyze expansive volumes of text, substantially reducing the time and effort expended compared to manual analysis. Furthermore, there is a burgeoning recognition of the need for text analysis tools tailored to individuals who may not possess in-depth programming expertise. While programming languages like R and Python offer powerful capabilities for data analysis, not all researchers have the time or resources to acquire proficiency in these languages. To address this challenge, a growing number of user-friendly text analysis tools have emerged, providing researchers with a viable alternative to traditional programming-based approaches. These tools empower researchers from diverse backgrounds to effectively process and analyze textual data, fostering a more inclusive research environment and democratizing access to the transformative power of text analysis.

For researchers who lack programming skills, TALL offers a viable solution, providing an intuitive interface that allows researchers to interact with data and perform analyses without the need for extensive programming knowledge.

TALL offers a comprehensive workflow for data cleaning, pre-processing, statistical analysis, and visualization of textual data, by combining state-of-the-art text analysis techniques into an R Shiny app.

TALL workflow

TALL seamlessly integrates the functionalities of a suite of R packages designed for NLP tasks with the user-friendly interface of web applications through the Shiny package environment.

The TALL workflow streamlines the discovery and analysis of textual data by systematically processing and exploring its content. This comprehensive framework empowers researchers with a versatile toolkit for text analysis, enabling them to efficiently navigate and extract meaningful insights from large volumes of textual data.

By leveraging the strengths of both R packages and Shiny’s interactive web interface, TALL provides a powerful and accessible platform for researchers to work through the following workflow:

  1. Import and Manipulation

  2. Pre-processing and Cleaning

  3. Statistical Text Analysis and Dynamic Visualization

Some screenshots from TALL

Import text from multiple file formats

Edit, split, and add external information

Automatic Lemmatization and PoS-Tagging through LLM

Language, Model, and Analysis Term Selection

Tagging Special Entities through multiple regex

 

Semantic Tagging

Automatic Multi-word creation

Multi-word creation by a list and Custom Term List

OVERVIEW – Descriptive statistics, concordance analysis and word frequency distributions

WORDS – Multiple methods for Topic Detection

DOCUMENTS – Main approaches for entire texts

“Introduction to R, Regression, and the rms Package”: short course by Frank Harrell

On May 11th 2026, Professor Frank Harrell will lead a workshop, Introduction to R, Regression, and the rms Package, covering foundational R and RStudio skills, linear and multiple regression concepts, and an introduction to the rms package for model fitting and diagnostics. It also introduces a reproducible workflow using Quarto, with a case study demonstration typical for empirical research. This one-day virtual course is offered through Instats.