Loading the data frame

We loaded the data frame of tweets collected from Twitter with the twitteR package. We then used the tm package to build a corpus and to clean it.

library(tm)
## Loading required package: NLP
library(readr)
library(wordcloud)
## Loading required package: RColorBrewer
library(readxl)
library(utf8)
covid0320<-read_xlsx("covid0320.xlsx")

covid0320$text<-as_utf8(covid0320$text, normalize = FALSE)
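
The tweets themselves had been collected beforehand through the Twitter API. A minimal sketch of that collection step with the twitteR package might look like the following; the API keys, the search query and the number of tweets are placeholders, not the values actually used:

library(twitteR)
# authenticate against the Twitter API (the four keys below are placeholders)
setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
# search Italian-language tweets on the topic and convert the result to a data frame
tweets <- searchTwitter("coronavirus OR covid", n = 5000, lang = "it")
covid0320 <- twListToDF(tweets)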

Building the corpus

A corpus is “a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject”. In the tm package, the VCorpus() function creates such a corpus from a text source.

library(tm)
mydata<-VCorpus(VectorSource(covid0320$text))
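
To check that the corpus was built as expected, we can inspect a couple of documents; this quick look is not part of the original pipeline:

# summary and content of the first two documents
inspect(mydata[1:2])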

Cleaning

Once we have successfully loaded the data into the workspace, it is necessary to clean it. Our goal at this step is to create independent terms (words) from the data file before we start counting how frequently they appear. Since R is case sensitive, we first convert the entire text to lowercase, so that the same word (e.g. “write” and “Write”) is not counted as two different terms. We then remove URLs, emojis and other non-alphabetic characters, punctuation, numbers, extra whitespace and stop words. We also define a custom stop-word list (the myStopwords vector).

# convert everything to lowercase
mydata <- tm_map(mydata, content_transformer(tolower))
# replace non-word characters with spaces
mydata <- tm_map(mydata, content_transformer(gsub), pattern = "\\W", replacement = " ")
# remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
mydata <- tm_map(mydata, content_transformer(removeURL))
# remove everything that is not a letter or a space (numbers, punctuation, emojis)
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
mydata <- tm_map(mydata, content_transformer(removeNumPunct))
# remove Italian stop words and the custom stop-word list
mydata <- tm_map(mydata, removeWords, stopwords("italian"))
myStopwords <- c(setdiff(stopwords("italian"), c("r")), "coronavirus", "covid")
mydata <- tm_map(mydata, removeWords, myStopwords)
# remove extra whitespace
mydata <- tm_map(mydata, stripWhitespace)
# Remove numbers
mydata <- tm_map(mydata, removeNumbers)
# Remove punctuations
mydata <- tm_map(mydata, removePunctuation)
writeCorpus(mydata)
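
As a sanity check, we can compare a raw tweet with its cleaned counterpart (again, just a quick look, not part of the original pipeline):

# original text of the first tweet
covid0320$text[1]
# the same tweet after the cleaning steps
as.character(mydata[[1]])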

Building a term-document matrix and revealing word frequencies

After the cleaning process, we build a matrix (terms by documents) that logs the number of times each term appears in our cleaned dataset.

tdm<-TermDocumentMatrix(mydata)
# transform tdm into a matrix
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
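
Before plotting, it can be helpful to look at the head of the frequency table and at the terms above some count; the threshold of 100 used below is only illustrative:

# ten most frequent terms with their counts
head(d, 10)
# terms that appear at least 100 times (illustrative threshold)
findFreqTerms(tdm, lowfreq = 100)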

Wordcloud and Barplot

We build a word cloud and a bar plot to give a visual representation of the text data. Both show the most important words, selected by a frequency threshold.

set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=50, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))
## Warning in wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words =
## 50, : fotografiesegnanti could not be fit on page. It will not be plotted.

barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word,        
        col =heat.colors(10), main ="Most frequent words",        
        ylab = "Word frequencies")
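
As an alternative to fixing the number of bars at ten, the plot can be driven by an explicit frequency threshold, in line with the threshold-based selection mentioned above; the cutoff of 50 is arbitrary:

# keep only the words above the chosen frequency threshold
d_top <- subset(d, freq >= 50)
barplot(d_top$freq, las = 2, names.arg = d_top$word,
        col = heat.colors(nrow(d_top)),
        main = "Words above the frequency threshold",
        ylab = "Word frequencies")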

Filtering of the most common hashtags

We selected all the hashtags in the corpus and extracted the most frequent ones.

library(quanteda)
## Package version: 2.0.1
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, stopwords
## The following objects are masked from 'package:NLP':
## 
##     meta, meta<-
## The following object is masked from 'package:utils':
## 
##     View
library(readtext)
# make sure the text column is a UTF-8 character vector
covid0320$text <- enc2utf8(as.character(covid0320$text))
mycorp<-corpus(covid0320$text)
tweet_dfm <- dfm(mycorp, remove_punct = TRUE)
dfm<- dfm(mycorp)
tag_dfm <- dfm_select(tweet_dfm, pattern = ("#*"))
toptag <- names(topfeatures(tag_dfm, 50))
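
topfeatures() also returns the counts themselves, so the most common hashtags can be printed together with their frequencies; the number 10 is arbitrary:

# ten most frequent hashtags with their counts
topfeatures(tag_dfm, 10)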

Constructing a feature co-occurrence matrix of the hashtags

We built a co-occurrence (adjacency) matrix of the most important hashtags, which allowed us to draw the network of the most frequent hashtags. An adjacency matrix is a square matrix used to represent a finite graph: its elements indicate whether pairs of vertices are adjacent in the graph. In this case the graph is undirected, so the adjacency matrix is symmetric.

tag_fcm <- fcm(tag_dfm)
toptag <- names(topfeatures(tag_dfm, 60))
topgat_fcm <- fcm_select(tag_fcm, pattern = toptag)
textplot_network(topgat_fcm, min_freq = 0.1, edge_alpha = 0.8, edge_size = 5)
## Registered S3 method overwritten by 'network':
##   method            from    
##   summary.character quanteda
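
To make the link between the co-occurrence matrix and the graph explicit, the selected fcm can also be converted to a plain matrix and passed to igraph (a sketch assuming the igraph package, which is not used in the original script; fcm() stores only the upper triangle by default, hence mode = "upper"):

library(igraph)
# turn the feature co-occurrence matrix into an ordinary adjacency matrix
adj <- as.matrix(topgat_fcm)
# build an undirected, weighted graph from the upper triangle
g <- graph_from_adjacency_matrix(adj, mode = "upper", weighted = TRUE, diag = FALSE)
is_directed(g)   # FALSE: the hashtag network is undirected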

Filtering of the usernames

We built a co-occurrence matrix of the most important users, which allowed us to draw their network based on frequency.

user_dfm <- dfm_select(dfm, pattern = "@*")
topuser <- names(topfeatures(user_dfm, 50))
head(topuser)
## [1] "@zaiapresidente"  "@italianairforce" "@lorepregliasco"  "@segnanti"       
## [5] "@francescatotolo" "@regione_sicilia"
user_fcm <- fcm(user_dfm)
head(user_fcm)
## Feature co-occurrence matrix of: 6 by 6 features.
##                 features
## features         @carlostagnaro @lanf040264 @feliceaquila @terminologia
##   @carlostagnaro              0           0             0             0
##   @lanf040264                 0           0             0             0
##   @feliceaquila               0           0             0             0
##   @terminologia               0           0             0             0
##   @isabellarauti              0           0             0             0
##   @fpcgilvvf                  0           0             0             0
##                 features
## features         @isabellarauti @fpcgilvvf
##   @carlostagnaro              0          0
##   @lanf040264                 0          0
##   @feliceaquila               0          0
##   @terminologia               0          0
##   @isabellarauti              0          0
##   @fpcgilvvf                  0          0

Network of the most frequent users

We draw the network of the top users.

user_fcm <- fcm_select(user_fcm, pattern = topuser)
textplot_network(user_fcm, min_freq = 0.1, edge_color = "orange", edge_alpha = 0.8, edge_size = 5)