Refine your R Skills with Free Access to DataCamp

Introducing Free Week 

DataCamp is excited to announce its Free Week, commencing Monday, 18 July at 8 AM ET. Anyone interested in developing their R programming skills or improving their data literacy can enjoy unlimited free access for a week, until 24 July at 11:59 PM ET.

For those who are new to R or are supporting teams using R, now is the time to dig deeper into the language. DataCamp offers R courses ranging from introductory to advanced topics, so you'll find learning opportunities no matter your level.

Free Access for Individuals 

To access DataCamp Free Week as an individual, you only need your email address to sign up on the Free Week page.
Once you've signed up, you will have access to the entire library of 387 courses, 83 projects, 55 practice sessions, and 15 assessments across Python, R, SQL, Power BI, Tableau, and more. If you are an R loyalist, you'll be happy to hear that 149 of those courses specialize in your favorite programming language.
Free Week access for individuals starts at 8 AM ET on 18 July and finishes at 11:59 PM ET on 24 July.
Not only that: to ensure you finish the week off with confidence, you will also have access to a range of supporting resources.

Moreover, for intermediate and advanced learners solely interested in developing their R skills, DataCamp's skill and career tracks are perfect: they are specially curated for developing R programming knowledge across a range of focus areas.

Free Access for Teams

DataCamp recognizes and appreciates companies' increasing dependence on data-driven decisions. To that end, they created DataCamp for Business to help businesses upskill their workforce to become data-driven.
In DataCamp for Business, you and your team will have access to the following resources:
  • Build learning programs with DataCamp's team to create custom learning paths just for your team (e.g., if your team specializes in R, they will help create R-specialized learning pathways)
  • Report on your ROI using DataCamp’s visualization and spreadsheet tools to thoroughly understand your team’s progress
  • Upskill your company by investing in professional development for all roles and skill levels
  • Get started and scale using DataCamp’s integrations

If you want to start developing your team today, sign up for DataCamp Teams during Free Week. You will receive seven days' free access from the point of sign-up, and your payment will automatically renew after that period.
They cannot wait for you to join the 2,500 other companies that have upskilled their teams with DataCamp. They are proud to have a diverse list of clients, including top companies from consulting, the Fortune 1000, and 180+ government agencies.

Why this is the year you should take the stage at EARL 2022…

EARL is Europe’s largest R community event dedicated to showcasing commercial applications of the R language. As a conference, it has always lived up to its promise of connecting and inspiring R users with creative suggestions and solutions, sparking new ideas, solving problems and sharing perspectives to advance the community. 

2022 marks the return of face-to-face EARL (6th – 8th September at the Tower Hotel in London), now run by Ascent, the new home of Mango Solutions. Over the past eight years, EARL has attracted fascinating presentations from engaging, authentic speakers, both experienced and first-timers.

This year, we're keen to understand how recent global events and trends that have disrupted our view of 'normal' have impacted, changed or driven your R projects: from inspirational innovation to reducing operational cost and creating richer customer experiences. If you have an interesting application of R, our call for abstracts is now open and we're inviting you to share your synopsis with us. The deadline for submissions is Thursday 30th June.

Maybe you've built a Shiny app that helps detect bias, or you've been on a data journey you'd like to share. Perhaps you've built a data science syllabus for young minds or created an NLP tool to automate clinical processes. If you are searching for inspiration, potential applications of R might come under the following categories:
  • Responding to global events with R
  • The role of R in the business data science toolbox
  • Overcoming the challenges of using R commercially
  • Efficient R: dealing with huge data
  • Sustainable R / R for good
  • R tools & packages (eg. Shiny R, Purrr)
  • Building your R community
  • Women in R
  • The future of R in enterprise: 2022 and beyond
We are also looking for short-form submissions: 10-minute lightning talks on a wide range of applications.

What’s presenting at EARL really like?  

We asked our 2019 presenters what prompted their decision to speak at our last in-person EARL, and what advice they would give to others considering submitting an abstract for EARL 2022.

For Mitchell Stirling, Capacity and Modelling Manager at Heathrow Airport, the opportunity to present helped fulfil a professional ambition. "I discussed with my line manager, slightly tongue in cheek, that it should be an ambition in 2019 when he signed off a conference attendance in Scotland the previous year. As the work I'd been doing developed in 2019 and the opportunity presented itself, I started to think 'why not?' – this is interesting and if I can show it interestingly, hopefully others would agree. I was slightly wary of the technical nature of the event, with my exposure to coding in R still better measured in minutes than hours (never mind days), but a reassurance that people would be interested in the 'what' and 'why' as well as the 'how' won me over."

Dr Zhanna Mileeva, a Data Scientist at N Brown Group, confirmed that making a contribution to the data science community was an important factor in her decision to submit an abstract: "After some research I found the EARL conference as a great cross-sector forum for R users to share Data Science, AI and ML engineering knowledge, discuss modern business problems and pathways to solutions. It was a fantastic opportunity to contribute to this community, learn from it and re-charge with some fresh ideas."

In past years EARL has attracted speakers from across the globe, and at our last in-person event Harold Selman, Lead Data Scientist at Ordina (NL), came from the Netherlands to speak at the conference. "I knew the EARL conference as a visitor and had given some presentations in The Netherlands, so I decided to give it a shot. The staff of the EARL conference are very helpful and open to questions, which made being a speaker very pleasant."

Some of our presenters have enjoyed the experience so much they have presented more than once. Chris Billingham, Lead Data Scientist at Manchester Airport Group's digital agency MAG-O, is one such speaker. "I've had the good fortune to present twice at EARL. I saw it as an opportunity to challenge myself to present at the biggest R conference in the UK."

How to submit your abstract. 

Feeling inspired? You can find the abstract submission form on our website. Here are our recommendations for a successful submission.
  • Topic: Your topic can relate to any real-world application of R. We aim to represent a range of industry sectors and a balance of technical and strategic content.
  • Clarity: The talk synopsis should provide an overview of the topic and why you believe it will be of interest or resonate with the audience. We suggest an introduction or problem statement alongside any supporting facts that determine the talk objectives or expected takeaways.
  • Storytelling: Aim to demonstrate how the tools and techniques you used helped to transform and translate value with a clear and compelling narrative.
  • Approval: Before you submit, it’s a good idea to ensure your application has been approved by your wider organisation and/or team.
  • Novelty: Is the application particularly new or innovative? If your application of R is new or distinctive and not widely written about in the industry, please provide as much supporting information as you can for review purposes.
  • Target audience: 34% of our attendees are R practitioners and 46% of delegates typically have senior or leadership roles – consider the alignment of your proposal with these audiences.
We hope these hints and tips have been helpful – but if you have any questions, feel free to get in touch at [email protected].

EARL your way: book your tickets now!

EARL tickets are now live for purchase here. Every possible ticket combination is on offer; here is a quick summary of what you can expect. You can choose a jam-packed 3-day conference pass, or a 1- or 2-day option to customise an itinerary that works for you.

Grab your EARLy bird tickets right away: for two weeks and two weeks only, we are delighted to offer an unlimited number of tickets with discounts of 15-25% on all ticket options, depending on whether you are NHS, not-for-profit or academic.

Team networking.

Why not bring your colleagues along for a much-needed team social at the largest commercial R event in the UK? With plenty of networking opportunities among brands in similar markets, there will be ample time to swap market experiences over coffee, at lunch or at our evening reception. We are certainly proud to be a part of such an enthusiastic community.

Full or half day workshop on day 1.

We are running a 1-day series of workshops to kick off EARL on 6th September, covering all areas of R: from explainable machine learning, time series visualisation and functional programming with purrr, to an introduction to plumber APIs and having some fun making games in Shiny. There is plenty of choice across the morning and afternoon session agendas.

Full conference pass.

Our all-access pass to EARL gives you a 1-day workshop, the full 2-day conference and access to the evening reception at the unforgettable Drapers Hall on day 2 – the former home of Henry VIII. We have an impressive line-up of keynotes, including mathematician, science presenter and all-round badass Hannah Fry; Top 100 Global Innovator in Data & Analytics Harry Powell; and the unmissable Financial Times columnist John Burn-Murdoch. To add to the excitement, we have approved use cases from Bumble, Samaritans, the BBC, Meta, the Bank of England, Dogs Trust and the NHS, alongside partners RStudio and many more.

1 or 2-day conference pass.

If you would like access to the keynotes, session talks and an abundance of networking opportunities, you can choose a 1- or 2-day pass aligned to your areas of interest. The 2-day conference pass also gives you access to the main evening reception.

Evening reception.

This year we have opted for an unforgettable experience at Drapers Hall (the former home of Henry VIII), where you will be able to network with colleagues, delegates and speakers over drinks, canapés and dinner in remarkable surroundings. Transport is provided via a private London red bus transfer. This year promises a heavyweight line-up, use cases from leading brands and the opportunity, at last, to share and network to your heart’s content. We look forward to meeting you. Book your tickets now.

useR! 2022 – all virtual – is next week!


Hello!

The all-virtual useR! 2022 conference opens next week, on 20 June, with 6 keynotes, 18 tutorials, and dozens of talks and posters to choose from.

Keynote speakers include Paula Moraga, Amanda Cox, the Afrimapr project, Julia Silge, Sebastian Meyer, and Mine Dogucu.

See the program overview at the conference website:
https://user2022.r-project.org/program/overview/

If you haven’t yet enrolled, sign up here
(free for people from low-income countries; otherwise fees range from $6 to $85, depending on the country and whether you are a student, in academia, or in industry):
https://user2022.r-project.org/participate/registration/

Sincerely,
The organizing committee of useR! 2022
— 


If you’re curious, here are the fees.

Conference Fees

Conference fees help pay for the virtual platform, honoraria for tutorial and keynote speakers, and other expenses. The fees depend on the country where you live and whether you work in industry, work in academia, or are a student. The academia rate applies to non-profit organizations and government employees, and the student rate also applies to retired people. Freelancers are encouraged to select the rate that best applies.
             High Income   Higher Middle Income   Lower Middle Income   Low Income
Industry         $85               $29                    $12             waived
Academia         $65               $22                     $9             waived
Student          $45               $14                     $6             waived

Tutorial Fees

The fees listed below are for two tutorials. If you book only one tutorial, you get a 50% discount; if you select three tutorials, the fee is 50% higher (see the quick check after the table).
             High Income   Higher Middle Income   Lower Middle Income   Low Income
Industry         $75               $22                     $9             waived
Academia         $55               $19                     $8             waived
Student          $35               $12                     $5             waived
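
As a quick sanity check of that pricing rule (a minimal sketch in R; the $75 figure is the high-income industry rate from the table above):

two_tutorials <- 75
c(one = 0.5 * two_tutorials, two = two_tutorials, three = 1.5 * two_tutorials)
#>   one   two three
#>  37.5  75.0 112.5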



You can sign up here:
https://user2022.r-project.org/participate/registration/

useR! 2022 is almost here!


Hello!

The all-virtual useR! 2022 conference opens on 20 June – less than 1 month from now – with 6 keynotes, 18 tutorials, and dozens of talks and posters to choose from. Tutorial spots are first-come, first-reserved, and some sessions are already sold out!

If you’ve already registered, we thank you – we’re looking forward to welcoming you at the new conference platform next month.

If you haven’t yet enrolled, now’s the time! Sign up at our website: https://user2022.r-project.org/participate/registration/

We’re excited about this year’s lineup, and we can’t wait to share it all with you!

Sincerely,
The organizing committee of useR! 2022


Natural Gas Prices Are Again on an Unsustainable Upward Trajectory …

I identified a real-time market condition in the natural gas market (symbol: UNG) on LinkedIn and followed up in a prior R-bloggers post here. The market ultimately declined 42% following the earlier breach of its identified unsustainable Stealth Support Curve.


[Chart from earlier R-bloggers post]



In short order, this same market is once again on another such path. Once this market meaningfully breaks its current Stealth Support Curve, the magnitude of its ultimate decline is naturally indeterminate and is in no way guaranteed to again be ‘large.’ Nonetheless, it is highly improbable for market price evolution to continue along its current trajectory: if market lows were to continue adhering to the current Stealth Support Curve (below), prices would have to increase at least +1,800% within a month. Thus, forecasting an impending breach of the Stealth Support Curve is not difficult. The relevant questions are 1) when will the breach occur, and 2) what will be the magnitude of the decline that follows?

Past and Current Stealth Support Curves

 
Stealth Support Curve Formulas

Both support curves take the hyperbolic form P(t) = a / (t + c) + b, where t is the bar (trading-day) index with t1 = 9/28/2011; the fitted values of a, b and c for each curve are given in the R code below.
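
To make the form concrete, here is a minimal R sketch (the helper function is my own addition; the parameter values are those of the first fitted curve from the code below). The vertical asymptote at t = -c is the key: as the bar index approaches it, any price adhering to the curve must rise without bound, which is why continued adherence is untenable.

stealth_curve <- function(t, a, b, c) {
  a / (t + c) + b   # hyperbolic support curve; vertical asymptote at t = -c
}

# First fitted curve: support level as the bar index nears t = 2555
stealth_curve(t = c(2400, 2500, 2540, 2550), a = -444.56, b = 6.26, c = -2555.01)
#> approx. 9.13 14.34 35.88 94.99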

R Code
library(tidyverse)
library(readxl)

# Original data source - https://www.nasdaq.com/market-activity/funds-and-etfs/ung/historical

# Download reformatted data (columns/headings) from my github site and save to a local drive

# https://github.com/123blee/Stealth_Curves.io/blob/main/UNG_prices_4_11_2022.xlsx

# Load your local data file
ung <- read_excel("... Insert your local file path here .../UNG_prices_4_11_2022.xlsx")
ung


# Convert 'Date and Time' to 'Date' column
ung[["Date"]] <- as.Date(ung[["Date"]])
ung

bars <- nrow(ung)

# Add bar indicator as first tibble column
ung <- ung %>%
  add_column(t = 1:nrow(ung), .before = "Date")
ung

# Add 40 future days to the tibble for projection of the Stealth Curve once added
future <- 40
ung <- ung %>%
  add_row(t = (bars+1):(bars+future))

# Market Pivot Lows using 'Low' Prices
# Chart 'Low' UNG prices
xmin <- 2250
xmax <- bars + future
ymin <- 0
ymax <- 25
plot.new()
background <- c("azure1")
chart_title_low <- c("Natural Gas (UNG) \nDaily Low Prices ($)")
u <- par("usr") 
rect(u[1], u[3], u[2], u[4], col = background) 
par(ann=TRUE)
par(new=TRUE)
t <- ung[["t"]]
Price <- ung[["Low"]]
plot(x = t, y = Price, main = chart_title_low, type = "l", col = "blue",
     ylim = c(ymin, ymax),
     xlim = c(xmin, xmax))

# Add 1st Stealth Support Curve to tibble
# Stealth Support Curve parameters
a <-   -444.56  
b <-      6.26  
c <-  -2555.01  

ung <- ung %>%
  mutate(Stealth_Curve_Low_1 = a/(t + c) + b)
ung

# Omit certain Stealth Support Curve values from charting
ung[["Stealth_Curve_Low_1"]][1:2275] <- NA
ung[["Stealth_Curve_Low_1"]][2550:xmax] <- NA
ung

# Add 1st Stealth Curve to chart
lines(t, ung[["Stealth_Curve_Low_1"]])

# Add 2nd Stealth Support Curve to tibble
# Stealth Support Curve parameters
a <-   -277.30  
b <-      8.70  
c <-  -2672.65 

ung <- ung %>%
  mutate(Stealth_Curve_Low_2 = a/(t + c) + b)
ung

# Omit certain Stealth Support Curve values from charting
ung[["Stealth_Curve_Low_2"]][1:2550] <- NA
ung[["Stealth_Curve_Low_2"]][2660:xmax] <- NA
ung

# Add 2nd Stealth Curve to chart
lines(t, ung[["Stealth_Curve_Low_2"]])


The current low price is $22.68 on 4-11-2022 (Market close = $23.30).

The contents of this article are in no way meant to imply trading, investing, and/or hedging advice. Consult your personal financial expert(s) for all such matters. Details of Stealth Curve parameterization are found in my Amazon text, ‘Stealth Curves: The Elegance of Random Markets’.

Brian K. Lee, MBA, PRM, CMA, CFA

Connect with me on LinkedIn.        

Generate an MS Excel Workbook from inside RMarkdown

I make a presentation every week or so with {RMarkdown}. Invariably, one or more associates, those not fluent in R, will ask, “Can I get a copy of your ‘Excel’?” I don’t do much with Microsoft Excel directly; however, I’ve made creating a workbook part of my workflow using {openxlsx}. Now I can immediately fire off a matching MS Excel workbook after a discussion and look responsive. Here is an example:
```{r setup}
  library(tidyverse)
  library(openxlsx)

  # Write one worksheet per chunk: the chunk label becomes the sheet name,
  # the current df_to_excel is written out, the most recent ggplot is
  # inserted beside it, and the workbook is re-saved so it stays current.
  make_one_sheet <- function(){ 
    sheet_name <- knitr::opts_current$get()$label
    addWorksheet(wb, sheet = sheet_name)
    writeData(wb, sheet = sheet_name, x = df_to_excel) 
    insertPlot(wb, sheet = sheet_name, 
      startCol = ncol(df_to_excel) + 2 ) 
    saveWorkbook(wb, file = "my_excel.xlsx", 
      overwrite = TRUE) 
  }

  wb <- createWorkbook()
```
I define make_one_sheet() in the setup chunk and call it at the end of each chunk; it writes out the data and an image of the ggplot to tie everything together. The chunk label becomes the worksheet name, and the workbook itself is also created in the setup chunk. Saving the workbook with every chunk simplifies adding new chunks as the deck is developed.
```{r dot_plot}
  plot_data <- iris %>% filter(Sepal.Length > 5) 

  plot_data %>% ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +  
    geom_point() 

  df_to_excel <- plot_data %>% 
    select(Species, Sepal.Length, Sepal.Width)  

  make_one_sheet()
```
Each chunk has the same four steps: process the data for the specific ggplot, create the ggplot as you normally would, reorganize the data frame for the worksheet, and finally make the worksheet and save the workbook. Then you can do another ggplot:
```{r histogram}
  plot_data <- iris %>%  group_by(Species) %>%
    mutate(Group = cut(Petal.Length, breaks = 0:7)) %>% 
    group_by(Group, Species) %>% tally()

  plot_data %>% ggplot(aes(x = Group, y = n, fill = Species)) +
    geom_bar(stat = "identity", position = "stack")

  df_to_excel <- plot_data %>%
    pivot_wider(id_cols = "Group", names_from = "Species", values_from = "n")
  
  make_one_sheet()
```
Sometimes you want to pivot data for the worksheet, change some of the column names, etc., for the workbook. Now you have your slides and a companion Excel workbook!
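
As a minimal sketch of that kind of tidy-up (the chunk label, filter, and renamed headers are my own illustration, not from the original deck), you might rename columns to friendlier worksheet headers before calling make_one_sheet():
```{r petal_scatter}
  plot_data <- iris %>% filter(Petal.Width > 0.5)

  plot_data %>% ggplot(aes(x = Petal.Length, y = Petal.Width, color = Species)) +
    geom_point()

  df_to_excel <- plot_data %>%
    select(Species, Petal.Length, Petal.Width) %>%
    rename(`Petal Length` = Petal.Length, `Petal Width` = Petal.Width)

  make_one_sheet()
```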

AFEchidna: a new package for solving mixed linear models on plant and animal datasets

Mixed linear models (MLMs) are linear models with a combination of fixed and random effects to explain the degree of variation in traits of interest, such as milk yield in cows or volume growth in forest trees. MLMs are widely used in the analysis of progeny test data in plants and animals. Nowadays, most software uses the Restricted Maximum Likelihood (REML) method to estimate the variance components of random effects, and then estimates the fixed effects and predicts the random effects. Such genetic analysis software includes ASReml, Echidna, SAS, BLUPF90, and the R packages sommer, breedR, etc. Echidna is free software developed in 2018 by Professor Gilmour, the main developer of ASReml. It also uses the REML method to estimate parameter values, and its syntax and functionality are very close to those of ASReml. It is the most powerful free software for animal and plant genetic evaluation, but its usage is a little complicated, which may be difficult for ordinary users.

Here, I present AFEchidna, a newly released R package based on the Echidna software, and demonstrate how to use a mixed linear model to generate solutions for variance components, genetic parameters, and BLUPs of random effects.

The main functions in AFEchidna:
  • get.es0.file: generate es0 file
  • echidna(): specify mixed linear model
  • Var(): output variance components
  • plot(): model diagnostic plots
  • pin(): calculate genetic parameters
  • predict(): model predictions
  • coef(): model equation solutions
  • model.comp():  compare different models
  • update(): run a new model
library(AFEchidna)
setwd("D:\\Rdata")
get.es0.file(dat.file="Provenance.csv")  # generate .es file
get.es0.file(es.file="Provenance.es")    # generate .es0 file

Specify a mixed model:

m1.esr <- echidna( fixed=height~1+Prov,
                   random=~Block*Female,
                   residual=~units,
                   es0.file='Provenance.es0')

Output related results:
> Var(m1.esr)
          Term   Sigma       SE   Z.ratio
1     Residual 2.52700 0.131470 19.221115
2        Block 0.10749 0.089924  1.195343
3       Female 0.18950 0.083980  2.256490
4 Block:Female 0.19762 0.086236  2.291618
> pin(m1.esr, mulp=c(Va~4*V3,
+ Vp~V1+V3+V4,
+ h2~4*V3/(V1+V3+V4)), digit=5)
 Term Estimate      SE
1  Va  0.75801 0.33592
11 Vp  2.91415 0.14973
12 h2  0.26011 0.10990
> coef(m1.esr)$fixed
  Term Level    Effect         SE
1 Prov    11 0.0000000  0.0000000
2 Prov    12 -1.6656325 0.3741344
3 Prov    13 -0.6237406 0.3701346
4 Prov     0 -1.2201920 0.3769892
5   mu     1 11.5120637 0.3619919
> coef(m1.esr)$random %>% head
    Term Level    Effect        SE
1  Block     2 0.3110440 0.1854718
2  Block     3 0.1268193 0.1858290
3  Block     4 0.2055624 0.1858158
4  Block     5 -0.2918516 0.1866871
5  Block     0 -0.3515741 0.1878287
6 Female   191 -0.1633745 0.3433451

An easy way to run a batch analysis across several traits:
mt.esr <- update(m1.esr,trait=~height+diameter+volume,batch=TRUE)
> Var(mt.esr)

V1-Residual; V2-Block; V3-Female; V4-Block.Female
Converge: 1 means TRUE; 0 means FALSE.

              V1       V2      V3      V4    V1.se    V2.se   V3.se   V4.se  Converge  maxit
height    2.5270 0.107490 0.18950 0.19762  0.13147 0.089924 0.08398 0.08624         1      4
diameter 16.8810 0.167130 0.80698 0.69100  0.87659 0.198100 0.41470 0.50067         1      5
volume    0.0037 0.000084 0.00022 0.00016  0.00019 0.000077 0.00010 0.00011         1      6
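
The diagnostic and model-comparison helpers listed earlier slot into the same workflow. A hedged sketch (the update()/model.comp() argument syntax here is my assumption, not verified against the AFEchidna documentation):

plot(m1.esr)   # model diagnostic plots

# refit without the Block:Female interaction and compare the two models (assumed syntax)
m2.esr <- update(m1.esr, random=~Block+Female)
model.comp(m1.esr, m2.esr)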

$55,000 in Awards for Energy & Buildings Hackathon, Sponsored by NYSERDA

The New York State Energy Research & Development Authority (NYSERDA) is partnering with Onboard Data to host a $55,000 Global Energy & Buildings Hackathon. We’re inviting all engineers, data scientists and software developers, whether they are professionals, professors, researchers or students, to participate. More below…


Challenge participants will propose exciting new ideas that can improve our world’s buildings. The hackathon will share data from 200+ buildings with participants. This data set is rich and one of a kind: it is normalized from the equipment, systems and IoT devices found within buildings.
We seek submissions that positively impact or accelerate the decarbonization of New York State buildings. 

Total awards are $55,000. Sign-ups stay open until April 15th, and the competition runs from April 22nd to May 30th. More can be found here: www.rtemhackathon.com.

Advance the next generation of building technology!

Batched Imputation for High Dimensional Missing Data Problems

High-dimensional data spaces are notoriously associated with slower computation, whether for imputation or some other operation. As such, many imputation methods run out of gas in high-dimensional contexts and fail to converge (e.g., MICE, PCA-based approaches, etc., depending on the size of the data). Further, though some approaches to high-dimensional imputation exist, most are limited by being unable to simultaneously and natively handle mixed-type data.

To address these problems of inefficient or slow computation, as well as the complexities associated with mixed-type data, I recently released the first version of a new algorithm, hdImpute, for fast, accurate imputation of high-dimensional missing data. The algorithm is built on top of missForest and missRanger, with chained random forests as the imputation engine.

The benefit of hdImpute lies in simplifying the computational burden via a batch process comprising a few stages. First, the algorithm divides the data space into smaller subsets based on cross-feature correlations (i.e., column-wise rather than row-wise subsets). It then imputes each batch, whose size is controlled by a user-supplied hyperparameter, batch, and continues through subsequent batches until all are individually imputed. A final step joins the completed, imputed subsets and returns a single, completed data frame.
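
As a conceptual sketch of that batching idea (my illustration of the logic, not hdImpute’s actual internals; it leans on missRanger, which the package builds on):

library(missRanger)

# split the rank-ordered features into column-wise batches, impute each
# batch on its own, then join the completed subsets back together
impute_in_batches <- function(data, ranked_features, batch = 2) {
  splits <- split(ranked_features,
                  cut(seq_along(ranked_features), batch, labels = FALSE))
  completed <- lapply(splits, function(cols) missRanger(data[, cols, drop = FALSE]))
  dplyr::bind_cols(completed)  # note: column order follows the ranked batches
}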

Let’s walk through a brief demo with some simulated data. First, create the data.

# load a couple libraries
library(tidyverse)
library(hdImpute) # using v0.1.0 here

# create the data
{
set.seed(1234)

data <- data.frame(X1 = c(1:6),
                   X2 = c(rep("A", 3),
                          rep("B", 3)),
                   X3 = c(3:8),
                   X4 = c(5:10),
                   X5 = c(rep("A", 3),
                          rep("B", 3)),
                   X6 = c(6,3,9,4,4,6))

data <- data[rep(1:nrow(data), 500), ] # expand/duplicate rows

data <- data[sample(1:nrow(data)), ] # shuffle rows
}
Next, take a look to make sure we are indeed working with mixed-type data. 

# quick check to make sure we have mixed-type data
data %>% 
map(., class) %>% 
unique()

> [[1]] [1] "integer" [[2]] [1] "character" [[3]] [1] "numeric"
Good to go: three data classes represented in our data object. Practically, the value of this feature is that there is no requirement for lengthy preprocessing of the data, unless desired by the user of course.

Finally, introduce some NAs into our data object, and store the result in d.

# produce NAs (30%)
d <- missForest::prodNA(data, noNA = 0.30) %>% 
  as_tibble()
Importantly, the package allows for two approaches to using hdImpute: in stages (allowing more flexibility, e.g., tying different stages into different points of a modeling pipeline), or as a single call to hdImpute(). Let’s take a look at the first approach, via stages.

Approach 1: Stages

To use hdImpute in stages, three functions are used, which comprise each of the stages of the algorithm: 

  1. feature_cor(): creates the correlation matrix. Note: depending on the size and dimensionality of the data, as well as the speed of the machine, this preprocessing step could take some time. 

  2. flatten_mat(): flattens the correlation matrix from the previous stage, and ranks the features based on absolute correlations. The input for flatten_mat() should be the stored output from feature_cor().

  3. impute_batches(): creates batches based on the feature rankings from flatten_mat(), and then imputes missing values for each batch, until all batches are completed. Then, joins the batches to give a completed, imputed data set.

Here’s the full workflow.

# stage 1: calculate correlation matrix and store as matrix/array
all_cor <- feature_cor(d)

# stage 2: flatten and rank the features by absolute correlation and store as df/tbl
flat_mat <- flatten_mat(all_cor) # can set return_mat = TRUE to print the flattened and ranked feature list

# stage 3: impute, join, and return
imputed1 <- impute_batches(data = d,
                           features = flat_mat, 
                           batch = 2)

# take a look at the completed data
imputed1

d # compare to the original if desired
d # compare to the original if desired
Of note, setting return_mat = TRUE returns all cross-feature correlations, as previously mentioned. Calling the stored object (regardless of the value passed to return_mat) returns the vector of features ranked by absolute correlation from flatten_mat(). Thus the default for return_mat is FALSE, as inspecting the cross-feature correlations isn’t necessary, though users certainly can. All that is required for imputation in stage 3 is the vector of ranked features (passed to the features argument), which is split into batches based on position in the ranked vector.

That’s it! We have a completed data frame returned via hdImpute. The speed and accuracy of the algorithm are better understood in larger scale benchmarking experiments. But I hope the logic is clear enough from this simple demonstration. 

Approach 2: Single Call

Finally, consider the simpler, yet less flexible, approach of making a single call to hdImpute(). There are only two required arguments: data (the original data with missing values) and batch (the number of batches to create from the data object). Here’s what this approach might look like using our simulated data from before:

# fit and store in imputed2
imputed2 <- hdImpute(data = d, batch = 2)

# take a look
imputed2

# make sure you get the same results compared to the stages approach above
imputed1 == imputed2

Visually Comparing

Though several methods and packages exist to explore imputation error, a quick way to compare the imputed values with the original (complete) data is to visualize the contours of the relationships between a handful of features. For example, take a look at the contour plot of the original data.

data %>%
  select(X1:X3) %>% 
  ggplot(aes(x = X1, y = X3, fill = X2)) + 
  geom_density_2d() + 
  theme_minimal() + 
  labs(title = "Original (Complete) Data")


Next, here’s the same plot but for the imputed set. 

imputed2 %>% 
  select(X1:X3) %>% 
  ggplot(aes(x = X1, y = X3, fill = X2)) + 
  geom_density_2d() + 
  theme_minimal() + 
  labs(title = "Imputed (Complete) Data")



As expected, the imputed version shows a bit of error (looser distributions) relative to the original, but the overall pattern is quite similar, suggesting hdImpute did a fair job of imputing plausible values.
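
To complement the plots with a number, here is a quick, rough check of my own (not part of hdImpute): the share of originally missing cells whose imputed values exactly match the originals. Exact matching is strict for numeric columns, so read it as a floor rather than a definitive accuracy measure.

mask <- is.na(d)                          # where values were missing before imputation
imp <- as.matrix(imputed2[names(data)])   # align column order, just in case
mean(imp[mask] == as.matrix(data)[mask])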

Contribute

To dig more into the package, check out the repo, which includes source code, a getting started vignette, tests, and all other documentation.

As the software is in its infancy, contributions at any level are welcome, from bug reports to feature suggestions. Feel free to open an issue or PR to request or contribute a change.

Thanks, and I hope this helps ease high-dimensional imputation!