R-Packages
R-Package: ReinforcementLearning
This package performs model-free reinforcement learning in R. The implementation enables the learning of an optimal policy based on sample sequences consisting of states, actions and rewards. In addition, it supplies multiple predefined reinforcement learning algorithms, such as experience replay.
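To illustrate what learning a policy from sample sequences of states, actions and rewards can look like, the following is a minimal base-R sketch of tabular Q-learning. It does not use the package's own functions; the function name `q_learning`, the column names, the toy data and the parameter values are assumptions made for this example.

```r
# Toy experience tuples (state, action, reward, next state).
# All values are made up for illustration.
experience <- data.frame(
  State     = c("s1", "s1", "s2", "s2"),
  Action    = c("left", "right", "left", "right"),
  Reward    = c(0, 1, 0, 10),
  NextState = c("s1", "s2", "s1", "s2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: repeatedly replays the observed tuples and applies
# the standard Q-learning update Q(s,a) += alpha * (r + gamma * max Q(s',.) - Q(s,a)).
q_learning <- function(data, alpha = 0.1, gamma = 0.5, iter = 200) {
  states  <- sort(unique(c(data$State, data$NextState)))
  actions <- sort(unique(data$Action))
  Q <- matrix(0, nrow = length(states), ncol = length(actions),
              dimnames = list(states, actions))
  for (pass in seq_len(iter)) {
    for (i in seq_len(nrow(data))) {
      s <- data$State[i]
      a <- data$Action[i]
      target <- data$Reward[i] + gamma * max(Q[data$NextState[i], ])
      Q[s, a] <- Q[s, a] + alpha * (target - Q[s, a])
    }
  }
  Q
}

Q <- q_learning(experience)

# Greedy policy: the action with the highest learned value in each state.
policy <- colnames(Q)[apply(Q, 1, which.max)]
names(policy) <- rownames(Q)
print(policy)
```

Replaying the same stored tuples over several passes is a simple form of learning from previously collected experience; the package itself may organize this differently.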
R-Package: SentimentAnalysis
This package performs sentiment analysis of textual content in R. The implementation utilizes various existing dictionaries, such as Harvard IV, or finance-specific dictionaries. Furthermore, it can also create customized dictionaries. The latter uses LASSO regularization as a statistical approach to select relevant terms based on an exogenous response variable.
R-Package: textsampler
The textsampler R-Package samples texts from a predefined text source. This implementation uses tidy data principles and works seamlessly with existing text mining packages such as tm, tidytext, and rvest. In addition, it supplies multiple built-in text datasets for a hassle-free sampling of words, sentences, and texts.
Teaching Materials
Slide Deck: Tidy Data Manipulation in R
This slide deck presents an introduction to tidy data manipulation in R. The main learning goals are:
- Tidy data manipulation: Learn how to manipulate data using the “dplyr” R-package
- Pipe operator: Learn how to increase code readability using pipes
- Joins: Learn how to efficiently join separate datasets in R
The slides can be downloaded here.
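As a small taste of the three learning goals, the sketch below combines dplyr verbs, the pipe operator and a join in one pipeline. The tables `orders` and `customers` and all their values are made up for this example and are not taken from the slides.

```r
library(dplyr)

# Two hypothetical tables that share the key column customer_id.
orders <- tibble(
  order_id    = 1:4,
  customer_id = c(1, 2, 2, 3),
  amount      = c(10, 25, 5, 40)
)
customers <- tibble(
  customer_id = c(1, 2, 3),
  name        = c("Ada", "Ben", "Cleo")
)

# Each step feeds its result into the next one via the pipe.
revenue <- orders %>%
  filter(amount >= 10) %>%                       # keep larger orders only
  left_join(customers, by = "customer_id") %>%   # add customer names
  group_by(name) %>%
  summarise(total = sum(amount), .groups = "drop") %>%
  arrange(desc(total))

print(revenue)
```

Reading the pipeline top to bottom mirrors the order in which the operations run, which is the readability gain that pipes are meant to provide.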
Slide Deck: Exploratory Text Analysis in R
This slide deck presents an introduction to exploratory text analysis in R. The main learning goals are:
- Exploratory text analysis: Learn how to gain an initial understanding of text data
- Tidy text analysis: Learn how to perform text analysis in a “tidy” way using tidytext
- Corpus analysis: Understand how to explore text corpora and perform tf-idf document weighting in R
The slides can be downloaded here.
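To make the tf-idf weighting mentioned above concrete, here is a minimal base-R sketch. It deliberately avoids tidytext so that it runs without extra packages; the three toy documents are invented, and the weighting used is the plain variant (relative term frequency times the natural log of the number of documents divided by the document frequency).

```r
# Three made-up documents.
docs <- c(
  d1 = "the market rallied as the earnings beat forecasts",
  d2 = "the movie was slow but the acting was great",
  d3 = "earnings fell and the market dropped"
)

# Lowercase and split on anything that is not a letter.
tokens <- strsplit(tolower(docs), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))

# Term-document count matrix: one row per term, one column per document.
counts <- vapply(
  tokens,
  function(tok) as.numeric(table(factor(tok, levels = vocab))),
  numeric(length(vocab))
)
rownames(counts) <- vocab
colnames(counts) <- names(docs)

tf    <- sweep(counts, 2, colSums(counts), "/")     # relative term frequency
idf   <- log(length(docs) / rowSums(counts > 0))    # inverse document frequency
tfidf <- tf * idf                                   # idf recycles along rows

# Highest-weighted term in each document.
print(apply(tfidf, 2, function(col) names(which.max(col))))
```

A term such as "the" that occurs in every document receives an idf of zero and therefore a tf-idf weight of zero, which is why the weighting highlights document-specific vocabulary.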
Datasets
SentimentDictionaries
This library provides domain-specific dictionaries for sentiment analysis. Each dictionary consists of words that statistically feature a positive or negative polarity in movie reviews or financial filings. The dictionaries are extracted from two different corpora, namely, IMDb movie reviews and U.S. regulated Form 8-K filings. Details are available from the following reference.
- Pröllochs N, Feuerriegel S, Neumann D (2018): Statistical Inferences for Polarity Identification in Natural Language, PLOS One, 13(12), pp. 1-21
Details
This library contains the following dictionary resources in CSV format.
- Movie reviews dictionary: This dictionary contains words that feature a positive or negative connotation in IMDb movie reviews (DictionaryIMDB.csv).
- Financial filings dictionary: This dictionary contains words that feature a positive or negative connotation in U.S. regulated 8-K filings (Dictionary8K.csv).
The individual columns of each dictionary are as follows:
- Words: This column lists the individual dictionary entries. We provide stems instead of complete words as stemming is part of the document preprocessing.
- Scores: This column denotes the polarity score of each entry.
- Idf: This column denotes the inverse document frequency (idf) of each entry.
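The column layout above can be used, for example, as follows. This is only a sketch: the dictionary entries and scores are invented stand-ins (not values from the actual files), a comma-separated layout is assumed, the helper `score_document` is hypothetical, and averaging the matched scores is just one simple way to aggregate them.

```r
# Invented stand-in for a dictionary file with the columns Words, Scores, Idf.
csv_text <- "Words,Scores,Idf
great,0.8,1.2
bore,-0.6,2.1
wast,-0.9,2.5"
dict <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Hypothetical helper: mean polarity score of all stems found in the dictionary.
# The input is assumed to be stemmed already, since the dictionary stores stems.
score_document <- function(stems, dict) {
  matched <- dict[match(stems, dict$Words, nomatch = 0), ]
  if (nrow(matched) == 0) return(NA_real_)
  mean(matched$Scores)
}

stems <- c("a", "great", "film", "but", "bore", "end")
doc_score <- score_document(stems, dict)
print(doc_score)
```

With a real file, the `text = csv_text` argument would be replaced by the path to DictionaryIMDB.csv or Dictionary8K.csv.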
Usage in R
We also provide both dictionaries in the form of a package for the statistical software R. You can install SentimentDictionaries from GitHub with:
# install.packages("devtools")
devtools::install_github("nproellochs/SentimentDictionaries", subdir = "R-package")
Both dictionaries can be easily used in combination with the SentimentAnalysis R package.
SentimentDictionaries on GitHub: https://github.com/nproellochs/SentimentDictionaries
NegatedSentences
This repository provides annotations of negation scopes for 500 sentences from IMDb movie reviews. The dataset is labeled manually by two external persons (Annotator A and Annotator B). Each sentence contains at least one explicit negation phrase from the list of Jia et al. (2009). The labeled sentences can, for example, be used in machine learning models that aim at learning accurate negation scopes for sentiment analysis. Details are available from the following reference.
- Pröllochs N, Feuerriegel S, Neumann D (2017): Understanding Negations in Information Processing: Learning from Replicating Human Behavior, Working Paper, available at arXiv
Details
This library contains the following resources in CSV format.
- Negation Labels Annotator A: This file contains the annotations from Annotator A (sentences_annotator_a.csv).
- Negation Labels Annotator B: This file contains the annotations from Annotator B (sentences_annotator_b.csv).
The individual columns of each resource are as follows:
- Id: This column assigns a unique Id to each sentence.
- Sentence: This column contains the sentences that are labeled by the two human annotators.
- IsNegated: This column contains the negation pattern for each sentence. The value T denotes that a word is marked as negated by the human annotator, whereas F denotes that the word is marked as not negated.
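Because both annotators label the same sentences, a natural first check is how often their word-level labels agree. The sketch below assumes that IsNegated stores one T or F per word, separated by spaces; this encoding is an assumption and should be verified against the files. The sentence, the labels and the helper `parse_labels` are made up for illustration.

```r
# Hypothetical helper: turn a label string such as "F T T" into a logical vector.
# Assumes space-separated T/F labels, one per word.
parse_labels <- function(x) {
  strsplit(trimws(x), "\\s+")[[1]] == "T"
}

# Made-up example sentence with labels from two annotators.
sentence <- "this is not very good"
labels_a <- parse_labels("F F T T T")
labels_b <- parse_labels("F F T T F")

words <- strsplit(sentence, " ")[[1]]

# Words that each annotator placed inside the negation scope.
scope_a <- words[labels_a]
scope_b <- words[labels_b]

# Share of words on which both annotators agree.
agreement <- mean(labels_a == labels_b)
print(agreement)
```

Averaging this share across all 500 sentences would give a simple overall agreement rate between Annotator A and Annotator B.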
NegatedSentences on GitHub: https://github.com/nproellochs/NegationDataset
NYSE CIK Ticker Symbol Master List
This CSV file links EDGAR CIK numbers to stock ticker symbols. The list includes all companies listed on the New York Stock Exchange (NYSE) as of February 18, 2018. The file also includes additional columns such as company name and market capitalization. The file can be downloaded here.