Title: | Analyzing the Opinions in a Big Text Document |
---|---|
Description: | Designed for performing impact analysis of opinions in a digital text document (DTD). The package allows a user to assess the extent to which a theme or subject within a document impacts the overall opinion expressed in the document. The package can be applied to a wide range of opinion-based DTDs, including commentaries on social media platforms (such as 'Facebook', 'Twitter' and 'YouTube'), online product reviews, and so on. The utility of 'opitools' was originally demonstrated in Adepeju and Jimoh (2021) <doi:10.31235/osf.io/c32qh> in the assessment of COVID-19 impacts on neighbourhood policing using Twitter data. Further examples can be found in the vignette of the package. |
Authors: | Monsuru Adepeju [cre, aut], |
Maintainer: | Monsuru Adepeju <[email protected]> |
License: | GPL-3 |
Version: | 2.0.0 |
Built: | 2024-11-12 03:41:08 UTC |
Source: | https://github.com/manalytics/opitools |
A list of keywords relating to the COVID-19 pandemic
covid_theme
A dataframe containing one variable:
keys: list of keywords
A DTD containing individual comments on a video showing the first debate between two US presidential nominees (Donald Trump and Hillary Clinton) in Sept. 2016. (Credit: NBC News).
debate_dtd
A dataframe containing one variable:
text: individual text records
The DTD only includes the comments posted within the first 24 hours of the video being posted. All individual comments in which the names of both candidates are mentioned are filtered out.
This function assesses the impacts of a theme (or subject) on the overall opinion computed for a DTD. Different themes in a DTD can be identified by the keywords used in the DTD. These keywords can be extracted by any analytical means available to the user, e.g. the word_imp function. The keywords must be collated and supplied to this function through the theme_keys argument (see below).
opi_impact(textdoc, theme_keys=NULL, metric = 1, fun = NULL, nsim = 99, alternative="two.sided", quiet=TRUE)
textdoc |
An |
theme_keys |
(a list) A one-column dataframe (of any length) containing a list of keywords relating to the theme or secondary subject to be investigated. The keywords can also be supplied as a vector of characters. |
metric |
(an integer) Specify the metric to utilize
for the calculation of opinion score. Default: |
fun |
A user-defined function given that parameter
|
nsim |
(an integer) Number of replicas (ESD) to generate.
See detailed documentation in the |
alternative |
(a character) Default: |
quiet |
(TRUE or FALSE) To suppress processing
messages. Default: |
This function calculates the statistical significance value (p-value) of an opinion score by comparing the observed score (from the opi_score function) with the expected scores (distribution) from the opi_sim function. The formula is given as p = (S.beat + 1)/(S.total + 1), where S.total is the total number of replicas (nsim) specified, and S.beat is the number of replicas in which the expected scores are greater than the observed score (see further details in Adepeju and Jimoh, 2021).
Details of the statistical significance of the impacts of a secondary subject B on the opinion concerning the primary subject A.
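As a quick illustration of the p-value formula above, the following minimal R sketch (not part of 'opitools'; all scores are made up) computes p = (S.beat + 1)/(S.total + 1) for a handful of hypothetical replicas:

```r
# Hypothetical observed opinion score and expected (replica) scores
observed_score  <- 12.5
expected_scores <- c(10.1, 13.2, 9.8, 14.0, 11.7)

S.total <- length(expected_scores)                # number of replicas (nsim)
S.beat  <- sum(expected_scores > observed_score)  # replicas beating the observed score

p <- (S.beat + 1) / (S.total + 1)
p  # 2 replicas beat 12.5, so p = (2 + 1)/(5 + 1) = 0.5
```

Note that with nsim = 99 in practice, the smallest attainable p-value is 1/100 = 0.01.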
(1) Adepeju, M. and Jimoh, F. (2021). An Analytical Framework for Measuring Inequality in the Public Opinions on Policing – Assessing the impacts of COVID-19 Pandemic using Twitter Data. https://doi.org/10.31235/osf.io/c32qh
# Application in marketing research:
# data -> 'reviews_dtd'
# theme_keys -> 'refreshment_theme'
# RQ2a: "Do the refreshment outlets impact customers'
# opinion of the services at the Piccadilly train station?"
## execute function
output <- opi_impact(textdoc = reviews_dtd,
                     theme_keys = refreshment_theme,
                     metric = 1,
                     fun = NULL,
                     nsim = 99,
                     alternative = "two.sided",
                     quiet = TRUE)
# To print results
print(output)
# extracting the pvalue in order to answer RQ2a
output$pvalue
Given a DTD, this function computes the overall opinion score based on the proportion of text records classified as expressing positive, negative or neutral sentiment.
The function first transforms
the text document into a tidy-format dataframe, described as the
observed sentiment document (OSD)
(Adepeju and Jimoh, 2021),
in which each text record is assigned a sentiment class based
on the summation of all sentiment scores expressed by the words in
the text record.
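To make the idea concrete, here is a toy R sketch of turning classified records into a single polarity-style score. The formula (P - N)/(P + N) * 100 is one common choice; whether it corresponds exactly to metric = 1 in opi_score is an assumption here, not a statement of the package's internals:

```r
# Toy sentiment classes for five text records (illustrative only)
sentiment <- c("positive", "negative", "positive", "neutral", "positive")

P <- sum(sentiment == "positive")   # count of positive records
N <- sum(sentiment == "negative")   # count of negative records

# A common polarity-style opinion score (assumed form, not
# necessarily identical to opitools' metric = 1)
score <- (P - N) / (P + N) * 100
score  # 3 positive vs 1 negative gives (3 - 1)/(3 + 1) * 100 = 50
```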
opi_score(textdoc, metric = 1, fun = NULL)
textdoc |
An |
metric |
(an integer) Specify the metric to utilize for
the calculation of opinion score. Valid values include
|
fun |
A user-defined function given that |
An opinion score is derived from all the sentiments (i.e. positive, negative and neutral) expressed within a text document. We deploy a lexicon-based approach (Taboada et al. 2011) using the AFINN lexicon (Nielsen, 2011).
Returns an opi_object
containing details of the
opinion measures from the text document.
(1) Adepeju, M. and Jimoh, F. (2021). An Analytical Framework for Measuring Inequality in the Public Opinions on Policing – Assessing the impacts of COVID-19 Pandemic using Twitter Data. https://doi.org/10.31235/osf.io/c32qh
(2) Malshe, A. (2019). Data Analytics Applications. Online book available at: https://ashgreat.github.io/analyticsAppBook/index.html. Date accessed: 15th December 2020.
(3) Taboada, M. et al. (2011). Lexicon-based methods for sentiment analysis. Computational Linguistics, 37(2), pp. 267-307.
(4) Lowe, W. et al. (2011). Scaling policy preferences from coded political texts. Legislative Studies Quarterly, 36(1), pp. 123-155.
(5) Razorfish (2009). Fluent: The Razorfish Social Influence Marketing Report. Accessed: 24th February, 2021.
(6) Nielsen, F. A. (2011). "A new ANEW: Evaluation of a word list for sentiment analysis in microblogs", Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages, pp. 93-98.
# Use police/pandemic posts on Twitter
# Experiment with a standard metric (e.g. metric 1)
score <- opi_score(textdoc = policing_dtd,
                   metric = 1,
                   fun = NULL)
# print result
print(score)

# Example using a user-defined opinion score -
# a demonstration with a component of the SIM opinion
# score function (by Razorfish, 2009). The opinion
# function can be expressed as:
myfun <- function(P, N, O){
  score <- (P + O - N)/(P + O + N)
  return(score)
}
# Run analysis
score <- opi_score(textdoc = policing_dtd,
                   metric = 5,
                   fun = myfun)
# print results
print(score)
This function simulates the expectation distribution of the
observed opinion score (computed using the opi_score
function).
The resulting tidy-format dataframe can be described as the
expected sentiment document (ESD)
(Adepeju and Jimoh, 2021).
opi_sim(osd_data, nsim=99, metric = 1, fun = NULL, quiet=TRUE)
osd_data |
A list (dataframe). An |
nsim |
(an integer) Number of replicas (ESD) to simulate. Recommended values are 99, 999, 9999, and so on. Since the run time is proportional to the number of replicas, a moderate number of simulations, such as 999, is recommended. Default: |
metric |
(an integer) Specify the metric to utilize for the
calculation of the opinion score. Default: |
fun |
A user-defined function given that parameter
|
quiet |
(TRUE or FALSE) To suppress processing
messages. Default: |
Employs a non-parametric randomization-testing approach in order to generate the expectation distribution of the observed opinion scores (see details in Adepeju and Jimoh 2021).
Returns a list of expected opinion scores with length equal to the number of simulations (nsim) specified.
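The randomization idea can be sketched in a few lines of base R. This is a generic permutation illustration, not the exact scheme implemented in opi_sim: the theme labels are shuffled to break any real association with sentiment, and a score is recomputed on each shuffle. The score function and field names below are assumptions for the sketch:

```r
set.seed(1)

# Toy OSD-like data: a sentiment class and a theme indicator per record
sentiment <- sample(c("positive", "negative", "neutral"), 60, replace = TRUE)
keywords  <- sample(c("present", "absent"), 60, replace = TRUE, prob = c(0.3, 0.7))

# Assumed polarity-style score (for illustration only)
score_fun <- function(s) {
  P <- sum(s == "positive"); N <- sum(s == "negative")
  (P - N) / (P + N) * 100
}

# 99 replicas: shuffle the theme labels, then score the records
# labelled "absent" under the shuffled labelling
expected <- replicate(99, {
  k <- sample(keywords)
  score_fun(sentiment[k == "absent"])
})

length(expected)  # 99 expected scores, e.g. for hist(expected)
```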
(1) Adepeju, M. and Jimoh, F. (2021). An Analytical Framework for Measuring Inequality in the Public Opinions on Policing – Assessing the impacts of COVID-19 Pandemic using Twitter Data. https://doi.org/10.31235/osf.io/c32qh
# Prepare an osd data from the output
# of the opi_score function.
score <- opi_score(textdoc = policing_dtd,
                   metric = 1,
                   fun = NULL)
# extract OSD
OSD <- score$OSD
# note that OSD is shorter in length
# than policing_dtd, meaning that some
# text records were not classified

# Bind a fictitious indicator column
osd_data2 <- data.frame(cbind(OSD,
                keywords = sample(c("present", "absent"), nrow(OSD),
                                  replace = TRUE, c(0.35, 0.65))))

# generate expected distribution
exp_score <- opi_sim(osd_data2,
                     nsim = 99,
                     metric = 1,
                     fun = NULL,
                     quiet = TRUE)
# preview the distribution
hist(exp_score)
A tidy-format list (dataframe) showing the resulting classification of each text record into positive, negative or neutral sentiment. The second column of the dataframe consists of the labels present and absent, indicating whether any of the secondary keywords exist in a text record.
osd_data
A dataframe with the following variables:
ID: numeric id of a text record with a valid resultant sentiment score and classification.
sentiment: the sentiment class assigned to each text record.
keywords: indicator showing whether a secondary keyword is present or absent in a text record.
A text document (a DTD) containing Twitter posts (for an anonymous geographical location 'A') on police/policing. The DTD also includes posts that express sentiments on policing in relation to the COVID-19 pandemic (secondary subject B).
policing_dtd
A dataframe containing one variable:
text: individual text records
List of words relating to refreshments that can be found at the Piccadilly Train Station (Manchester)
refreshment_theme
A dataframe containing one variable:
keys: list of keywords
A text document (a DTD) containing the customer reviews of the Piccadilly train station (Manchester), downloaded from www.tripadvisor.co.uk. The reviews cover the period from July 2016 to March 2021.
reviews_dtd
A dataframe containing one variable:
text: individual text records
List of signages at the Piccadilly Train Station (Manchester)
signage_theme
A dataframe containing one variable:
keys: list of keywords
A text document (a DTD) containing Twitter posts (for an anonymous geographical location 2) on police/policing (primary subject A). The DTD includes posts that express sentiments on policing in relation to the COVID-19 pandemic (secondary subject B).
tweets
A dataframe with the following variables:
text: individual text records
group: real/arbitrary groups of text records
This function examines whether the distribution of word frequencies in a text document follows Zipf's distribution (Zipf 1936). Zipf's distribution is considered the ideal distribution of a perfect natural language text.
word_distrib(textdoc)
textdoc |
|
The Zipf's distribution is most easily observed by plotting the data on a log-log graph, with the axes being log(word rank order) and log(word frequency). For a perfect natural language text, the relationship between the word rank and the word frequency should have a negative slope with all points falling on a straight line. Any deviation from the straight line can be considered an imperfection attributable to the texts within the document.
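The log-log check described above can be reproduced on synthetic Zipfian counts with a few lines of base R; the 1/rank frequencies below are fabricated for illustration:

```r
# Synthetic word frequencies following an ideal Zipf law (freq ~ 1/rank)
rank <- 1:100
freq <- round(1000 / rank)

# On a log-log scale the relationship should be a straight line
# with slope close to -1 for a perfect natural language text
fit   <- lm(log(freq) ~ log(rank))
slope <- unname(coef(fit)[2])
slope  # approximately -1

# plot(log(rank), log(freq))  # points fall on (nearly) a straight line
```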
A list of word ranks and their respective frequencies, and a plot showing the relationship between the two variables.
Zipf, G. (1936). The Psychobiology of Language. London: Routledge.
# Get an n x 1 text document
tweets_dat <- data.frame(text = tweets[, 1])
plt <- word_distrib(textdoc = tweets_dat)
plt
Produces a wordcloud which represents the level of importance of each word (across different text groups) within a text document, according to a specified measure.
word_imp(textdoc, metric= "tf", words_to_filter=NULL)
textdoc |
An |
metric |
(character) The measure for determining the level of
importance of each word within the text document. Options
include |
words_to_filter |
A pre-defined vector of words (terms) to filter out from the DTD prior to assessing word importance. Default: |
The function determines the most important words across various groupings of a text document. The measure options include tf (term frequency) and tf-idf (term frequency-inverse document frequency). The idea of tf is to rank words in the order of their number of occurrences across the text document, whereas tf-idf highlights words that are not used very much overall but are characteristic of particular groups in the document.
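The distinction between the two measures can be seen in a hand-rolled sketch. The two tiny 'groups' and their words below are fabricated; real use would rely on word_imp (or, e.g., tidytext) rather than this manual computation:

```r
# Two toy text groups, already tokenized into words
docs  <- list(g1 = c("police", "road", "police", "help"),
              g2 = c("police", "virus", "mask", "virus"))
terms <- unique(unlist(docs))

# Term frequency: share of each word within its group
tf <- sapply(docs, function(d) sapply(terms, function(t) sum(d == t) / length(d)))

# Inverse document frequency: log(number of groups / groups containing the word)
idf <- sapply(terms, function(t) log(length(docs) / sum(sapply(docs, function(d) t %in% d))))

tf_idf <- tf * idf

# "police" occurs in every group, so its idf (hence tf-idf) is 0;
# "virus" is frequent in g2 only, so it gets the top tf-idf there
round(tf_idf, 3)
```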
Graphical representation of word importance according to the specified metric. A wordcloud is used to represent word importance if tf is specified, while a facet-wrapped histogram is used if tf-idf is specified. The wordcloud represents each word with a size corresponding to its level of importance. In the facet-wrapped histograms, words are ranked in each group (histogram) in order of importance.
Silge, J. and Robinson, D. (2016) tidytext: Text mining and analysis using tidy data principles in R. Journal of Open Source Software, 1, 37.
# words to filter out
wf <- c("police", "policing")
output <- word_imp(textdoc = policing_dtd,
                   metric = "tf",
                   words_to_filter = wf)