Data frame load

We loaded the data frame of tweets collected from Twitter with the twitteR package. We then used the tm package to build a corpus and clean it.

# as_utf8() (utf8 package) validates the tweet text and converts it to UTF-8
covid0320$text <- as_utf8(covid0320$text, normalize = FALSE)

Building the corpus

A corpus is “a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject”. The tm package provides the Corpus() function to create a corpus.
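The call that creates the corpus is not shown here. The following is a minimal sketch, assuming the tweet text is stored in `covid0320$text` and the corpus is named `mydata`, as in the cleaning step. The two example tweets are made up for illustration.

```r
library(tm)

# Hypothetical stand-in for the real tweet data frame
covid0320 <- data.frame(
  text = c("Oggi resto a casa #iorestoacasa",
           "Nuovi dati sul contagio http://example.com"),
  stringsAsFactors = FALSE
)

# VectorSource() treats each element of the character vector as one document
mydata <- Corpus(VectorSource(covid0320$text))

length(mydata)          # number of documents in the corpus
content(mydata[[1]])    # text of the first document
```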



Once we have successfully loaded the data into the workspace, it is necessary to clean it. Our goal at this step is to obtain independent terms (words) from the data before we start counting how frequently they appear. Since R is case sensitive, we first convert the entire text to lowercase, so that the same word written as “write” and “Write” is not counted as two different terms. We then remove URLs, emojis, non-Italian words, punctuation, numbers, extra whitespace and stop words. We also define a custom stop word list (the myStopwords vector).

# convert to lowercase
mydata <- tm_map(mydata, content_transformer(tolower))
# remove URLs first, while "http://..." is still in one piece
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
mydata <- tm_map(mydata, content_transformer(removeURL))
# replace the remaining non-word characters (punctuation, emojis) with a space
mydata <- tm_map(mydata, content_transformer(gsub), pattern = "\\W", replacement = " ")
# drop anything that is neither a letter nor whitespace
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]+", "", x)
mydata <- tm_map(mydata, content_transformer(removeNumPunct))
# remove Italian stop words plus custom, corpus-specific terms
myStopwords <- c(stopwords("italian"), "coronavirus", "covid")
mydata <- tm_map(mydata, removeWords, myStopwords)
# remove numbers
mydata <- tm_map(mydata, removeNumbers)
# remove punctuation
mydata <- tm_map(mydata, removePunctuation)
# remove extra whitespace last, after all other removals
mydata <- tm_map(mydata, stripWhitespace)
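
The order of the cleaning steps matters: URLs have to be removed before non-word characters are replaced, otherwise the link is broken into pieces and only its "http" fragment is deleted. The following base R sketch illustrates this on a single made-up tweet. The helper names `replaceNonWord` and `squeeze` are hypothetical and used only here.

```r
# Made-up tweet used only for illustration
tweet <- "Restate a casa! https://t.co/abc123 #covid"

removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
replaceNonWord <- function(x) gsub("\\W", " ", x)
squeeze <- function(x) trimws(gsub("[[:space:]]+", " ", x))

# URLs removed first: the whole link disappears
url_first <- squeeze(replaceNonWord(removeURL(tolower(tweet))))
url_first   # "restate a casa covid"

# Non-word characters replaced first: only the "https" fragment is removed,
# and the rest of the link is left behind as stray tokens
url_last <- squeeze(removeURL(replaceNonWord(tolower(tweet))))
url_last    # "restate a casa t co abc123 covid"
```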

Building a Term Matrix and Revealing Word Frequencies

After the cleaning process, we build a term-document matrix (terms by documents) that logs the number of times each term appears in our clean dataset.

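# build the term-document matrix from the cleaned corpus
tdm <- TermDocumentMatrix(mydata)
# transform tdm into a matrix
m <- as.matrix(tdm)
# total frequency of each term, most frequent first
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)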
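
To make the frequency computation concrete, here is a small sketch on a hand-made term-document matrix. The terms and counts are invented for illustration; in the real analysis the matrix comes from the cleaned corpus.

```r
# Toy term-document matrix: rows are terms, columns are documents (made-up counts)
m <- matrix(
  c(2, 0, 1,
    0, 1, 0,
    1, 1, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("casa", "virus", "italia"), c("doc1", "doc2", "doc3"))
)

# Summing each row gives the total count of a term over all documents
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
d   # italia = 5, casa = 3, virus = 1
```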

Wordcloud and Barplot

We build a word cloud and a bar plot to give a visual representation of the text data. Both show the most important words, selected by frequency: up to 50 words in the word cloud and the 10 most frequent words in the bar plot.

wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 50, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
barplot(d[1:10, ]$freq, las = 2, names.arg = d[1:10, ]$word,
        col = heat.colors(10), main = "Most frequent words",
        ylab = "Word frequencies")

Filtering of the most common hashtags

We selected all the hashtags in the corpus and kept the 60 most frequent.

covid0320$text <- as.character(covid0320$text, encoding = "UTF-8")
tweet_dfm <- dfm(mycorp, remove_punct = TRUE)
dfm <- dfm(mycorp)
# keep only the features that start with "#"
tag_dfm <- dfm_select(tweet_dfm, pattern = "#*")
# names of the 60 most frequent hashtags
toptag <- names(topfeatures(tag_dfm, 60))
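
The idea of selecting hashtags and ranking them by frequency can also be sketched in base R. This is an illustration on made-up tweets, not the method used by the quanteda functions; the name `toptag_demo` is hypothetical.

```r
# Made-up tweets used only for illustration
tweets <- c("Restiamo a casa #iorestoacasa #covid19",
            "Nuovi dati #covid19",
            "Grazie ai medici #iorestoacasa #covid19 #italia")

# Extract every token that starts with "#" followed by letters, digits or "_"
lower <- tolower(tweets)
tags <- unlist(regmatches(lower, gregexpr("#[[:alnum:]_]+", lower)))

# Count the hashtags and keep the two most frequent ones
tag_freq <- sort(table(tags), decreasing = TRUE)
toptag_demo <- names(head(tag_freq, 2))
toptag_demo   # "#covid19" "#iorestoacasa"
```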

Constructing the feature co-occurrence matrix of the hashtags

We built an adjacency matrix of the most important hashtags, which allowed us to draw the network of the most frequent hashtags. An adjacency matrix is a square matrix used to represent a finite graph. The elements of the matrix indicate whether pairs of vertices are adjacent or not in the graph. In this case the graph is undirected, so the adjacency matrix is symmetric.
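
As a small illustration of why co-occurrence counts give a symmetric adjacency matrix, consider the following base R sketch. The document-feature matrix `x` is made up, and the cross-product is used here only as a simple way to count co-occurrences; it is not meant to reproduce the internals of the fcm() function.

```r
# Toy document-feature matrix: rows are tweets, columns are hashtags
# (1 = the hashtag appears in the tweet); the values are made up
x <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("tweet", 1:3), c("#covid19", "#iorestoacasa", "#italia"))
)

# t(x) %*% x counts, for each pair of hashtags, the tweets containing both
adj <- crossprod(x)
diag(adj) <- 0          # ignore a hashtag's co-occurrence with itself

adj
isSymmetric(adj)        # TRUE: the graph is undirected
```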

tag_fcm <- fcm(tag_dfm)
toptag <- names(topfeatures(tag_dfm, 60))
toptag_fcm <- fcm_select(tag_fcm, pattern = toptag)
textplot_network(toptag_fcm, min_freq = 0.1, edge_alpha = 0.8, edge_size = 5)

Filtering of the usernames

We built an adjacency matrix of the most important users, which allowed us to draw their network based on frequency.

user_dfm <- dfm_select(dfm, pattern = "@*")
topuser <- names(topfeatures(user_dfm, 50))

Network of the most frequent users

We drew the network of the top users.

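# build the co-occurrence matrix of usernames, then keep only the top users
user_fcm <- fcm(user_dfm)
user_fcm <- fcm_select(user_fcm, pattern = topuser)
textplot_network(user_fcm, min_freq = 0.1, edge_color = "orange", edge_alpha = 0.8, edge_size = 5)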