WhatsR – An R-Package for processing exported WhatsApp Chat Logs

WhatsApp is one of the most heavily used mobile instant messaging applications in the world. It is especially popular for everyday communication with friends and family, and most users communicate through the app on a daily or weekly basis. Interestingly, WhatsApp users can extract a log file from each of their chats. This log file contains all textual communication in the chat that has not been manually deleted and is not too far in the past.

This logging of digital communication is on the one hand interesting for researchers seeking to investigate interpersonal communication, social relationships, and linguistics, and can on the other hand also be interesting for individuals seeking to learn more about their own chatting behavior (or their social relationships).

The WhatsR R-package enables users to transform exported WhatsApp chat logs into a usable data frame object with one row per sent message and multiple variables of interest. In this blog post, I will demonstrate how the package can be used to process and visualize chat log files.


Installing the Package
The package can be installed either from CRAN or from GitHub for the most up-to-date version. I recommend installing the GitHub version to get the most recent features and bug fixes.
# from CRAN
# install.packages("WhatsR")

# from GitHub
devtools::install_github("gesiscss/WhatsR")
The package also needs to be attached before it can be used. For creating nicer plots, I recommend also installing and attaching the patchwork package.
# installing patchwork package
install.packages("patchwork")

# attaching packages
library(WhatsR)
library(patchwork)
Obtaining a Chat Log
You can export one of your own chat logs from your phone to your email address as explained in this tutorial. If you do this, I recommend using the “without media” export option, as this allows you to export more messages.

If you don’t want to use one of your own chat logs, you can create an artificial chat log with the same structure as a real one but with made up text using the WhatsR package!
## creating chat log for demonstration purposes

# setting seed for reproducibility
set.seed(1234)

# simulating chat log
# (and saving it automatically as a .txt file in the working directory)
create_chatlog(n_messages = 20000,
               n_chatters = 5,
               n_emoji = 5000,
               n_diff_emoji = 50,
               n_links = 999,
               n_locations = 500,
               n_smilies = 2500,
               n_diff_smilies = 10,
               n_media = 999,
               n_sdp = 300,
               startdate = "01.01.2019",
               enddate = "31.12.2023",
               language = "english",
               time_format = "24h",
               os = "android",
               path = getwd(),
               chatname = "Simulated_WhatsR_chatlog")
Parsing Chat Log File
Once you have a chat log on your device, you can use the WhatsR package to import the chat log and parse it into a usable data frame structure.
data <- parse_chat("Simulated_WhatsR_chatlog.txt", verbose = TRUE)
Checking the parsed Chat Log
You should now have a data frame object with one row per sent message and 19 variables with information extracted from the individual messages. For a detailed overview of what each column contains and how it is computed, you can check the related open-source publication for the package. A tabular overview is also included below.
## Checking the chat log
dim(data)
colnames(data)

DateTime: Timestamp for the date and time the message was sent, formatted as yyyy-mm-dd hh:mm:ss
Sender: Name of the sender of the message as saved in the contact list of the exporting phone, or their telephone number. Messages inserted by WhatsApp into the chat are coded as “WhatsApp System Message”
Message: Text of user-generated messages with all information contained in the exported chat log
Flat: Simplified version of the message with emojis, numbers, punctuation, and URLs removed. Better suited for some text mining or machine learning tasks
TokVec: Tokenized version of the Flat column. Instead of one text string, each cell contains a list of individual words. Better suited for some text mining or machine learning tasks
URL: A list of all URLs or domains contained in the message body
Media: A list of all media attachment filenames contained in the message body
Location: A list of all shared location URLs or indicators in the message body, or indicators for shared live locations
Emoji: A list of all emoji glyphs contained in the message body
EmojiDescriptions: A list of all emojis as textual representations contained in the message body
Smilies: A list of all smileys contained in the message body
SystemMessage: Messages that are inserted by WhatsApp into the conversation and not generated by users
TokCount: Number of user-generated tokens per message
TimeOrder: Order of messages according to the timestamps on the exporting phone
DisplayOrder: Order of messages as they appear in the exported chat log
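To get a feel for these variables, you can peek directly at a few of the parsed columns (a quick sketch using only the column names listed above):
# Looking at a few of the parsed columns
head(data[, c("DateTime", "Sender", "Message", "TokCount")])

# TokVec and the emoji columns are list columns, so inspect them with str()
str(data$TokVec[1:3])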

Checking Descriptives of Chat Logs
Now you can have a first look at the overall statistics of the chat log. You can check the number of messages, sent tokens, and chat participants, the dates of the first and last message, the timespan of the chat, and the number of emojis, smileys, links, media files, and locations in the chat log.
# Summary statistics
summarize_chat(data, exclude_sm = TRUE)
We can also check the distribution of the number of tokens per message and a set of summary statistics for each individual chat participant.
# Summary statistics
summarize_tokens_per_person(data, exclude_sm = TRUE)
Visualizing Chat Logs
The chat characteristics can now be visualized using the custom functions from the WhatsR package. These functions are essentially wrappers around ggplot2 with some options for customizing the plots. Most plots offer multiple ways of visualizing the data, and WhatsApp system messages can be excluded with exclude_sm = TRUE. Let's try it out:

Amount of sent messages
# Plotting amount of messages
p1 <- plot_messages(data, plot = "bar", exclude_sm = TRUE)
p2 <- plot_messages(data, plot = "cumsum", exclude_sm = TRUE)
p3 <- plot_messages(data, plot = "heatmap", exclude_sm = TRUE)
p4 <- plot_messages(data, plot = "pie", exclude_sm = TRUE)

# Printing plots with patchwork package
(free(p3) | free(p2)) / (free(p1) | free(p4))
Four different ways of visualizing the amount of sent messages in a WhatsApp chat log.

Amount of sent tokens
# Plotting amount of tokens
p5 <- plot_tokens(data, plot = "bar", exclude_sm = TRUE)
p6 <- plot_tokens(data, plot = "box", exclude_sm = TRUE)
p7 <- plot_tokens(data, plot = "violin", exclude_sm = TRUE)
p8 <- plot_tokens(data, plot = "cumsum", exclude_sm = TRUE)

# Printing plots with patchwork package
(free(p5) | free(p6)) / (free(p7) | free(p8))
Four different ways of visualizing the amount of sent tokens in a WhatsApp chat log.

Amount of sent tokens over time
# Plotting amount of tokens over time
p9 <- plot_tokens_over_time(data,
                            plot = "year",
                            exclude_sm = TRUE)

p10 <- plot_tokens_over_time(data,
                             plot = "day",
                             exclude_sm = TRUE)

p11 <- plot_tokens_over_time(data,
                             plot = "hour",
                             exclude_sm = TRUE)

p12 <- plot_tokens_over_time(data,
                             plot = "heatmap",
                             exclude_sm = TRUE)

p13 <- plot_tokens_over_time(data,
                             plot = "alltime",
                             exclude_sm = TRUE)

# Printing plots with patchwork package
(free(p9) | free(p10)) / (free(p11) | free(p12))
Four different ways of visualizing the amount of sent tokens over time in a WhatsApp chat log.

Amount of sent links
# Plotting amount of links
p14 <- plot_links(data, plot = "bar", exclude_sm = TRUE)
p15 <- plot_links(data, plot = "splitbar", exclude_sm = TRUE)
p16 <- plot_links(data, plot = "heatmap", exclude_sm = TRUE)
p17 <- plot_links(data, plot = "cumsum", exclude_sm = TRUE)

# Printing plots with patchwork package
(free(p14) | free(p15)) / (free(p16) | free(p17))
Four different ways of visualizing the amount of sent links in a WhatsApp chat log.

Amount of sent smilies
# Plotting amount of smilies
p18 <- plot_smilies(data, plot = "bar", exclude_sm = TRUE)
p19 <- plot_smilies(data, plot = "splitbar", exclude_sm = TRUE)
p20 <- plot_smilies(data, plot = "heatmap", exclude_sm = TRUE)
p21 <- plot_smilies(data, plot = "cumsum", exclude_sm = TRUE)

# Printing plots with patchwork package
(free(p18) | free(p19)) / (free(p20) | free(p21))
Four different ways of visualizing the amount of sent smileys in a WhatsApp chat log.

Amount of sent emoji
# Plotting amount of emoji
p22 <- plot_emoji(data,
                  plot = "bar",
                  min_occur = 300,
                  exclude_sm = TRUE,
                  emoji_size = 5)

p23 <- plot_emoji(data,
                  plot = "splitbar",
                  min_occur = 70,
                  exclude_sm = TRUE,
                  emoji_size = 5)

p24 <- plot_emoji(data,
                  plot = "heatmap",
                  min_occur = 300,
                  exclude_sm = TRUE)

p25 <- plot_emoji(data,
                  plot = "cumsum",
                  min_occur = 300,
                  exclude_sm = TRUE)

# Printing plots with patchwork package
(free(p22) | free(p23)) / (free(p24) | free(p25))
Four different ways of visualizing the amount of sent emoji in a WhatsApp chat log.

Distribution of reaction times
# Plotting distribution of reaction times
p26 <- plot_replytimes(data,
                       type = "replytime",
                       exclude_sm = TRUE)
p27 <- plot_replytimes(data,
                       type = "reactiontime",
                       exclude_sm = TRUE)

# Printing plots with patchwork package
free(p26) | free(p27)
Average reply times and reaction times for each individual chat participant in a WhatsApp chat log.

Lexical Dispersion
A lexical dispersion plot visualizes where specific words occur within a text corpus. Because the simulated chat log in this example uses lorem ipsum text, in which all words occur similarly often, we add the string “testword” to a random subsample of messages. For visualizing real chat logs, this would of course not be necessary.
# Adding "testword" to random subset of messages for demonstration         # purposes
set.seed(12345)
word_additions <- sample(dim(data)[1],50)
data$TokVec[word_additions]
sapply(data$TokVec[word_additions],function(x){c(x,"testword")})
data$Flat[word_additions] <- sapply(data$Flat[word_additions],
  function(x){x <- paste(x,"testword");return(x)})
Now you can create the lexical dispersion plot:
# Plotting lexical dispersion plot
plot_lexical_dispersion(data,
                        keywords = c("testword"),
                        exclude_sm = TRUE)
Lexical dispersion plot for the occurrence of the word “testword” in the simulated WhatsApp chat log.

Response Networks
# Plotting response network
plot_network(data,
             edgetype = "n",
             collapse_sessions = TRUE,
             exclude_sm = TRUE)
Network graph showing how often each chat participant directly responded to the previous message (a subsequent message is counted as a “response” here).

Issues and long-term availability
Unfortunately, WhatsApp chat logs are a moving target when it comes to parsing and visualization. The structure of exported WhatsApp chat logs keeps changing from time to time. On top of that, the structure of chat logs differs between chats exported from different operating systems (Android & iOS) and between different time (am/pm vs. 24h format) and language (e.g. English & German) settings on the exporting phone. When the structure changes, the WhatsR package can become limited in its functionality or stop working entirely until it is updated and tested. Should you encounter any issues, reports on the GitHub issues page are always welcome. Should you want to contribute to improving or maintaining the package, pull requests and collaborations are also welcome!
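Before analyzing a freshly parsed chat log, it can also help to run a quick sanity check so that parsing problems caused by a format change become visible early. The helper below is only a sketch (it is not part of WhatsR) and uses only the columns described above:
# Minimal sanity check for a parsed chat log (sketch, not part of WhatsR)
check_parse <- function(chat) {
  data.frame(
    n_messages     = nrow(chat),
    missing_time   = sum(is.na(chat$DateTime)),
    missing_sender = sum(is.na(chat$Sender)),
    empty_messages = sum(is.na(chat$Message))
  )
}

check_parse(data)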

Text processing and stemming for classification tasks in a master data management context

Author: Genrikh Ananiev

Problem description
In business settings, you often encounter narrowly specialized tasks that require a special approach because they do not fit into the standard flow of data processing and model building. One such task is the classification of new products in a master data management (MDM) process.

Example 1
Suppose you work for a large company (a supplier) that produces and/or sells products, including through wholesale intermediaries (distributors). Your distributors are often obliged (towards the company you work for) to regularly report their own sales of your products, the so-called sell-out. Distributors are not always able to report the sold products using your company's product codes; more often they use their own codes and their own product names, which differ from the names in your system. Accordingly, your database needs a table matching the distributors' product names to the product codes in your own accounting system. The more distributors, the more name variations exist for the same product. If you have a large assortment portfolio, this becomes a problem that is solved by the manual, labor-intensive maintenance of such matching tables whenever new product name variations arrive in your accounting system.

If we treat the names of such products as document texts, and the codes of your accounting system (to which these variations are tied) as classes, we obtain a multi-class text classification task. The matching table (which operators maintain manually) can be treated as a training sample, and a classification model built on it could reduce the effort operators spend classifying the stream of new names for existing products. However, the classic approach of working with the text “as is” will not save you here, as discussed below.
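For illustration, a matching table of this kind might look like the following (made-up rows, not data from any package):

# Toy example of a manually maintained matching table (made-up data)
matching <- data.frame(
  distributor_name = c("Chablis Dom.Christian Moreau 0.75L",
                       "DOM CHRISTIAN MOREAU CHABLIS 750ML",
                       "Rioja Bodegas LAN Crianza"),
  product_code     = c("P001", "P001", "P002")
)
matching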

Example 2

Your company's database receives sales (or price) data for products from external analytical (marketing) agencies or from scraping third-party websites. Here too, the same product will be written differently in each data source. In this example the task can be even harder than in Example 1, because business users often need to analyze not only your own products but also the assortment of your direct competitors, so the number of classes (reference products) to which the variations are tied increases sharply.

What is special about this class of tasks?

First, there are a lot of classes (in fact, as many classes as you have products), and if you have to work not only with your company's products but also with competitors', new classes can appear every day. It therefore becomes pointless to train a model once and reuse it repeatedly to predict new products.

Second, the number of documents (different variations of the same product) per class is not well balanced: a class may contain just a single document, or it may contain many more.

Why does the classic approach to multi-class text classification work poorly?

Let's consider the shortcomings of the classic text processing approach step by step:

• Stop words. In tasks like this there are no stop words in the commonly accepted sense of any text processing package.

• Tokenization. In classic packages, out-of-the-box splitting of text into words relies on punctuation or spaces. In this class of tasks (where the length of the input text field is often limited), it is common to receive product names without spaces, where words are not clearly separated but can only be told apart visually by changes in case, by digits, or by a switch to another language. How would out-of-the-box tokenization in your favorite programming language handle the wine name “Dom.CHRISTIANmoreau0,75LPLtr.EtFilChablis”? (Unfortunately, this is not a joke; a rough illustration follows after this list.)

• Stemming. Product names are not text in the classic sense of the task (such as news articles, service reviews, or newspaper headlines), where a suffix can be identified and discarded. Product names often contain abbreviations and truncated words for which it is unclear how to identify such a suffix. There are also brand names from other language groups (for example, French or Italian brands) that do not lend themselves to normal stemming.

• Reducing the document-term matrix. When building a document-term matrix, your language's package usually offers to reduce the sparsity of the matrix by removing words (matrix columns) whose frequency is below some minimum threshold. In classic tasks this really does help improve quality and reduce the cost of training the model, but not in tasks like this. As noted above, the distribution of classes is far from balanced: a class can easily consist of a single product name (for example, a rare and expensive brand that was sold for the first time and so far appears only once in the training sample). The classic sparsity-reduction approach therefore degrades the quality of the classifier.

• Training the model. Usually some model is trained on the texts (LibSVM, a naive Bayes classifier, neural networks, or something else) and then used repeatedly. In our case, new classes can appear daily and a class may contain only a single document. It therefore makes no sense to spend a long time training one large model; any algorithm with online training, for example a kNN classifier with one nearest neighbor, is enough.
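To make the tokenization point concrete, here is a rough, self-contained illustration (not the abbrevTexts implementation) of how case and digit transitions can serve as additional word boundaries for the wine name above:

# Rough illustration of tokenizing a concatenated product name
# (simplified sketch, not the abbrevTexts implementation)
x <- "Dom.CHRISTIANmoreau0,75LPLtr.EtFilChablis"

# out-of-the-box splitting on whitespace finds a single "word"
strsplit(x, "\\s+")[[1]]

# insert breaks at lower->UPPER, UPPER->lower and letter->digit transitions,
# then split on punctuation and spaces
y <- gsub("([a-z])([A-Z])", "\\1 \\2", x)
y <- gsub("([A-Z]{2,})([a-z])", "\\1 \\2", y)
y <- gsub("([A-Za-z])([0-9])", "\\1 \\2", y)
strsplit(y, "[[:punct:][:space:]]+")[[1]]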

Next, we will compare classification using the traditional approach with classification based on the proposed package. We will use tidytext as an auxiliary package.

Case example

devtools::install_github('edvardoss/abbrevTexts')
library(abbrevTexts)
library(tidytext)  # text processing
library(dplyr)     # data processing
library(stringr)   # data processing
library(SnowballC) # traditional stemming approach
library(tm)        # needed only for tidytext internals

The package includes two data sets with wine names: the original wine names from external data sources, “rawProducts”, and the unified wine names written according to the company's master data standards, “standardProducts”. The rawProducts table contains many spelling variations of the same product; these variations are reduced to one product in standardProducts through a many-to-one relationship on the “StandartId” key column. P.S. The variations in the “rawProducts” table were generated programmatically, but made as similar as possible to how product names arrive from various external sources in my experience (although in places I may have overdone it).

data(rawProducts, package = 'abbrevTexts')
head(rawProducts)




data(standardProducts, package = 'abbrevTexts')
head(standardProducts)


Train and test split

set.seed(1234)
trainSample <- sample(x = seq(nrow(rawProducts)), size = .9*nrow(rawProducts))
testSample <- setdiff(seq(nrow(rawProducts)), trainSample)
testSample

Create dataframes for 'no stemming mode' and 'traditional stemming mode'

df <- rawProducts %>% mutate(prodId = row_number(),
                             rawName = str_replace_all(rawName, pattern = '\\.', '. ')) %>%
  unnest_tokens(output = word, input = rawName) %>% count(StandartId, prodId, word)

df.noStem <- df %>% bind_tf_idf(term = word, document = prodId, n = n)

df.SnowballStem <- df %>% mutate(wordStm = SnowballC::wordStem(word)) %>%
  bind_tf_idf(term = wordStm, document = prodId, n = n)
Create document terms matrix

dtm.noStem <- df.noStem %>%
  cast_dtm(document = prodId, term = word, value = tf_idf) %>% data.matrix()

dtm.SnowballStem <- df.SnowballStem %>%
  cast_dtm(document = prodId, term = wordStm, value = tf_idf) %>% data.matrix()
Create knn model for 'no stemming mode' and calculate accuracy

knn.noStem <- class::knn1(train = dtm.noStem[trainSample,],
                          test = dtm.noStem[testSample,],
                          cl = rawProducts$StandartId[trainSample])
mean(knn.noStem == rawProducts$StandartId[testSample])
Accuracy is: 0.4761905 (47%)

Create knn model for 'stemming mode' and calculate accuracy

knn.SnowballStem <- class::knn1(train = dtm.SnowballStem[trainSample,],
                                test = dtm.SnowballStem[testSample,],
                                cl = rawProducts$StandartId[trainSample])
mean(knn.SnowballStem == rawProducts$StandartId[testSample])
Accuracy is: 0.5 (50%)

abbrevTexts primer

Below is an example on the same data, but using the functions from the abbrevTexts package.

Separating words by case

df <- rawProducts %>% mutate(prodId = row_number(),
                             rawNameSplitted = makeSeparatedWords(rawName)) %>%
  unnest_tokens(output = word, input = rawNameSplitted)
print(df)


As you can see, the text was tokenized correctly: transitions between lower and upper case inside concatenated words are taken into account, and so are punctuation marks between words written together without spaces.

Creating a stemming dictionary based on a training sample of words

After a long search among different stemming implementations, I came to the conclusion that traditional methods based on language rules are not suitable for such specific tasks, so I had to look for my own approach. In the end I arrived at a solution based on unsupervised learning, which is insensitive to the language of the text and to how strongly the words in the training sample are abbreviated.

The function takes as input a vector of words, the minimum word length for the training sample, and the minimum share of the parent word's length required to consider a child word an abbreviation of the parent word. It then does the following:

1. Discard words shorter than the set threshold
2. Discard words consisting only of numbers
3. Sort the words in descending order of their length
4. For each word in the list:
  4.1 Keep only the words that are shorter than the current word but no shorter than the current word's length multiplied by the minimum share
  4.2 From these filtered words, select those that are the beginning of the current word


Let's say that we fix min.share = 0.7.
At this intermediate stage (step 4.2), we get a parent-child table in which pairs like the following can be found:

Note that each pair meets the condition that the child word is no shorter than 70% of the length of the parent word.
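To make this intermediate stage concrete, here is a minimal sketch of steps 1 to 4 applied to a handful of made-up words (only an illustration of the idea, not the makeAbbrStemDict source):

# Simplified illustration of steps 1-4 (not the makeAbbrStemDict source)
find_abbr_pairs <- function(words, min.len = 3, min.share = 0.7) {
  w <- unique(words)
  w <- w[nchar(w) >= min.len & !grepl("^[0-9]+$", w)]   # steps 1-2
  w <- w[order(nchar(w), decreasing = TRUE)]            # step 3
  pairs <- lapply(w, function(parent) {                 # step 4
    cand  <- w[nchar(w) < nchar(parent) &
               nchar(w) >= min.share * nchar(parent)]   # step 4.1
    child <- cand[startsWith(parent, cand)]             # step 4.2
    if (length(child) > 0) data.frame(parent = parent, child = child)
  })
  do.call(rbind, pairs)
}

find_abbr_pairs(c("bodegas", "bodeg", "bode", "chablis", "chabli"))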

However, some of the pairs found cannot be considered word abbreviations, because different parents are reduced to the same child, for example:

In such cases my function keeps only one pair.

Let's go back to the example with unambiguous word abbreviations.

If you look a little more closely, you can see that the word 'bodeg' is shared by these two pairs, and this shared word allows you to connect the pairs into one chain of abbreviations without violating our initial condition on the length a word must have to be considered an abbreviation of another word:
bodegas -> bodeg -> bode
So we arrive at a table of the following form:


Such chains can be of arbitrary length, and the found pairs can be assembled into such chains recursively. This brings us to the fifth stage: determining the final child for every member of each constructed chain of word abbreviations.

5. Recursively iterate through the found pairs to determine the final (terminal) child for all members of the chains
6. Return the abbreviation dictionary
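Steps 5 and 6 can be illustrated in the same spirit with a small sketch that repeatedly follows parent -> child links until every word maps to its terminal child (again just an illustration, assuming one child per parent and no cycles, not the package source):

# Simplified illustration of step 5 (not the makeAbbrStemDict source)
collapse_chains <- function(pairs) {
  map <- setNames(pairs$child, pairs$parent)
  repeat {
    # follow one more link for every child that is itself a parent
    nxt <- ifelse(map %in% names(map), map[map], map)
    if (identical(unname(nxt), unname(map))) break
    map <- setNames(unname(nxt), names(map))
  }
  data.frame(parent = names(map), terminal.child = unname(map))
}

pairs <- data.frame(parent = c("bodegas", "bodeg"), child = c("bodeg", "bode"))
collapse_chains(pairs)  # both "bodegas" and "bodeg" end up at terminal child "bode"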

The makeAbbrStemDict function automatically parallelizes the work across several threads, loading all processor cores, so it is worth keeping this in mind for large volumes of text.

abrDict <- makeAbbrStemDict(term.vec = df$word, min.len = 3, min.share = .6)
head(abrDict) # We can see the parent word, intermediate results, and the final result (terminal child)


Outputting the stemming dictionary as a table is also convenient because it lets you selectively and simply delete some of the stemming rows in the dplyr paradigm.

Let's say that we want to exclude the parent word “abruzz” and the terminal child group “absolu” from the stemming dictionary:

abrDict.reduced <- abrDict %>% filter(parent != 'abruzz', terminal.child != 'absolu')
print(abrDict.reduced)


Compare the simplicity and clarity of this solution with what is offered on Stack Overflow:

Text-mining with the tm-package – word stemming

Stemming using the abbreviation dictionary

df.AbbrStem <- df %>% left_join(abrDict %>% select(parent, terminal.child), by = c('word' = 'parent')) %>%
  mutate(wordAbbrStem = coalesce(terminal.child, word)) %>% select(-terminal.child)
print(df.AbbrStem)


TF-IDF for stemmed words

df.AbbrStem <- df.AbbrStem %>% count(StandartId, prodId, wordAbbrStem) %>%
  bind_tf_idf(term = wordAbbrStem, document = prodId, n = n)
print(df.AbbrStem)


Create document terms matrix

dtm.AbbrStem <- df.AbbrStem %>%
  cast_dtm(document = prodId, term = wordAbbrStem, value = tf_idf) %>% data.matrix()

Create knn model for 'abbrevTexts mode' and calculate accuracy

knn.AbbrStem <- class::knn1(train = dtm.AbbrStem[trainSample,],
                            test = dtm.AbbrStem[testSample,],
                            cl = rawProducts$StandartId[trainSample])
mean(knn.AbbrStem == rawProducts$StandartId[testSample])
Accuracy for “abbrevTexts”: 0.8333333 (83%)

As you can see, we have achieved a significant improvement in classification quality on the test sample.
tidytext is a convenient package for a small corpus of texts, but with a large corpus the abbrevTexts package is also perfectly suitable for preprocessing and normalization, and in such specific tasks it usually gives better accuracy than the traditional approach.

edvardoss/abbrevTexts: Functions that will make life less sad when working with abbreviated text for multiclassification tasks (github.com)

Lyric Analysis with NLP and Machine Learning using R: Part One – Text Mining

June 22
By Debbie Liske

This is Part One of a three-part tutorial series, originally published on the DataCamp online learning platform, in which you will use R to perform a variety of analytic tasks on a case study of musical lyrics by the legendary artist Prince. The three tutorials cover the following:


Musical lyrics may represent an artist's perspective, but popular songs reveal what society wants to hear. Lyric analysis is no easy task: because lyrics are often structured so differently from prose, they require caution with assumptions and a uniquely discriminant choice of analytic techniques. Musical lyrics permeate our lives and influence our thoughts with subtle ubiquity. The concept of Predictive Lyrics is beginning to buzz and is increasingly prevalent as a subject of research papers and graduate theses. This case study will just touch on a few pieces of this emerging subject.



Prince: The Artist

To celebrate the inspiring and diverse body of work left behind by Prince, you will explore the sometimes obvious, but often hidden, messages in his lyrics. However, you don't have to like Prince's music to appreciate the influence he had on the development of many genres globally. Rolling Stone magazine listed Prince as the 18th best songwriter of all time, just behind the likes of Bob Dylan, John Lennon, Paul Simon, Joni Mitchell and Stevie Wonder. Lyric analysis is slowly finding its way into data science communities as the possibility of predicting “Hit Songs” approaches reality.

Prince was a man bursting with music – a wildly prolific songwriter, a virtuoso on guitars, keyboards and drums and a master architect of funk, rock, R&B and pop, even as his music defied genres. – Jon Pareles (NY Times)

In this tutorial, Part One of the series, you'll utilize text mining techniques on a set of lyrics using the tidy text framework. Tidy datasets have a specific structure in which each variable is a column, each observation is a row, and each type of observational unit is a table. After cleaning and conditioning the dataset, you will create descriptive statistics and exploratory visualizations while looking at different aspects of Prince's lyrics.

Check out the article here!




(Reprinted by permission of the DataCamp online learning platform)