Loading the data frame

We loaded the data frame of tweets collected from Twitter with the twitteR package. We then used the tm package to build a corpus and clean it.

library(tm)
## Loading required package: NLP
library(readr)
library(wordcloud)
## Loading required package: RColorBrewer
library(readxl)
library(utf8)
covid0320<-read_xlsx("covid0320.xlsx")

covid0320$text<-as_utf8(covid0320$text, normalize = FALSE)

Building the corpus

A corpus is “a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject”. The tm package provides functions such as VCorpus(), used below, to create a corpus.

library(tm)
mydata<-VCorpus(VectorSource(covid0320$text))
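
As a minimal sketch of what the resulting object looks like, the snippet below builds a corpus from a made-up character vector and inspects it. The `toy_text` vector and the `toy_corpus` name are assumptions for illustration only; they stand in for `covid0320$text` and `mydata`.

```r
library(tm)

# Hypothetical stand-in for covid0320$text (not the real tweet data)
toy_text <- c("Primo tweet di prova", "Secondo tweet di prova")

# Each element of the character vector becomes one document in the corpus
toy_corpus <- VCorpus(VectorSource(toy_text))

n_docs <- length(toy_corpus)            # number of documents in the corpus
first_doc <- content(toy_corpus[[1]])   # text of the first document

print(n_docs)
print(first_doc)
```

Checking a document with `content()` before and after each cleaning step is a simple way to confirm that a transformation did what was intended.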

Cleaning

Once the data have been loaded into the workspace, they need to be cleaned. Our goal at this step is to extract independent terms (words) from the text before we start counting how frequently they appear. Since R is case sensitive, we first convert the entire text to lowercase, so that the same word written as “write” and “Write” is not counted as two different terms. We then remove URLs, emojis, non-Italian words, punctuation, numbers, extra whitespace and stop words. We also create a custom stop-word list (the myStopwords vector).

# Convert to lowercase so that "Write" and "write" count as the same term
mydata <- tm_map(mydata, content_transformer(tolower))
# Remove URLs first, while their punctuation is still intact
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
mydata <- tm_map(mydata, content_transformer(removeURL))
# Replace every non-word character (punctuation, symbols) with a space
mydata <- tm_map(mydata, content_transformer(gsub), pattern = "\\W", replacement = " ")
# Drop anything that is neither a letter nor whitespace
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
mydata <- tm_map(mydata, content_transformer(removeNumPunct))
# Custom stop words: the Italian list (keeping "r") plus the search terms
myStopwords <- c(setdiff(stopwords("italian"), c("r")), "coronavirus", "covid")
mydata <- tm_map(mydata, removeWords, myStopwords)
# Remove extra whitespace
mydata <- tm_map(mydata, stripWhitespace)
# Remove numbers
mydata <- tm_map(mydata, removeNumbers)
# Remove punctuation
mydata <- tm_map(mydata, removePunctuation)
# Write each cleaned document to a text file in the working directory
writeCorpus(mydata)
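
The order of the regular-expression steps affects what is left of a URL. The following base R sketch applies the same two patterns used above to a single made-up string, in both orders. The `tweet` string and the helper names `remove_url`, `replace_nonword` and `squish` are hypothetical and are not part of the original analysis.

```r
# Hypothetical tweet text, used only for illustration
tweet <- "Nuovi dati sul contagio https://t.co/AbC123 #news"

remove_url      <- function(x) gsub("http[^[:space:]]*", "", x)
replace_nonword <- function(x) gsub("\\W", " ", x)
# Collapse runs of whitespace and trim the ends
squish          <- function(x) trimws(gsub("[[:space:]]+", " ", x))

# URL removed first: the whole link disappears
url_first <- squish(replace_nonword(remove_url(tolower(tweet))))

# Non-word characters replaced first: the link is split into pieces,
# and only the leading "https" is removed afterwards
nonword_first <- squish(remove_url(replace_nonword(tolower(tweet))))

print(url_first)
print(nonword_first)
```

When the non-word replacement runs first, fragments of the link such as "t", "co" and the short code survive as separate tokens and would later be counted as terms.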

Building a term matrix and revealing word frequencies

After the cleaning process, we build a matrix (terms by documents) that logs the number of times each term appears in our clean dataset.

tdm<-TermDocumentMatrix(mydata)
# transform tdm into a matrix
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
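
To make the structure of the terms-by-documents matrix concrete, here is a small base R sketch that builds one by hand for three made-up documents and then derives the sorted frequency table in the same way as above. It does not use tm, and the names `docs`, `m_toy`, `v_toy` and `d_toy` are assumptions for illustration.

```r
# Hypothetical, already-cleaned documents
docs <- c("casa mare casa", "mare casa", "casa sole mare")

# Split each document into its terms
tokens <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary: every distinct term, in alphabetical order
terms <- sort(unique(unlist(tokens)))

# One column per document, one row per term, cells hold counts
m_toy <- vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = terms))),
  integer(length(terms))
)
rownames(m_toy) <- terms

# Total frequency of each term across documents, most frequent first
v_toy <- sort(rowSums(m_toy), decreasing = TRUE)
d_toy <- data.frame(word = names(v_toy), freq = v_toy)

print(m_toy)
print(d_toy)
```

Row sums give the overall frequency of each term, which is exactly what the word cloud and bar plot consume.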

Wordcloud and Barplot

We build a word cloud and a bar plot as visual representations of the text data. Both show the most important words, based on their frequency: the word cloud draws at most 50 words, and the bar plot shows the 10 most frequent ones.

set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=50, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))
## Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
## 50, : fotografiesegnanti could not be fit on page. It will not be plotted.

barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word,        
        col =heat.colors(10), main ="Most frequent words",        
        ylab = "Word frequencies")