Creating Word Clouds in R

It is relatively easy to create a customized word cloud in R using the package ‘wordcloud’.

Start by installing and then loading the relevant libraries. Loading these packages will also attach additional libraries (‘NLP’ and ‘RColorBrewer’).

# install.packages('wordcloud') # uncomment to install the package
# install.packages('tm')      # uncomment to install the package
# install.packages('SnowballC')  # uncomment to install the package
library(tm)
## Loading required package: NLP
library(SnowballC)
library(wordcloud)
## Loading required package: RColorBrewer
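If you prefer not to toggle the install lines by hand, one common pattern (a sketch, not part of the original workflow) is to install a package only when it is missing and then load it:

```r
# Install any required package that is not yet available, then load it.
# requireNamespace() checks availability without attaching the package.
pkgs <- c("tm", "SnowballC", "wordcloud")
for (p in pkgs) {
  if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
  library(p, character.only = TRUE)
}
```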

Example 1: “The Raven” by Poe

As our first example we will use a famous poem by Edgar Allan Poe: “The Raven”. Make sure that the file ‘RAVEN.txt’ is located in your working directory or that its path is correctly referenced. We will first load the text and preprocess it using the function ‘tm_map’. Several common text-cleaning steps are illustrated below.

rav <- readLines('RAVEN.txt')                       # read your text file
rav2 <- Corpus(VectorSource(rav))                   # convert lines of text into a corpus
rav2 <- tm_map(rav2, content_transformer(tolower))      # convert the text to lower case
rav2 <- tm_map(rav2, removeWords, stopwords("english"))     # remove common English stopwords
rav2 <- tm_map(rav2, removeNumbers)                 # remove numbers
rav2 <- tm_map(rav2, removePunctuation)                 # remove punctuation
rav2 <- tm_map(rav2, stripWhitespace)               # eliminate extra white spaces

You can also remove specific words of your own choosing, as long as the text has already been converted to lower case.

rav2 <- tm_map(rav2, removeWords, c('said', 'upon'))        # Here, I removed two words

Once you have removed the words you believe to be meaningless, unnecessary, or otherwise useless, you can compute word frequencies (the critical variable for generating word clouds).

#------------------ COMPUTE FREQUENCIES OF WORDS -----------------
out1 <- TermDocumentMatrix(rav2)                    # build a term-document matrix
# out1$dimnames$Terms                           # a list of unique words can be accessed
out2 <- sort(rowSums(as.matrix(out1)),decreasing=TRUE)  # compute frequencies of words
as.matrix(out2)[1:10,]                          # check 10 most common words and their frequencies
##      door     raven   chamber nevermore      bird    lenore   nothing 
##        14        11        11        11        10         8         7 
##     still       thy      soul 
##         7         7         6
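As a quick sanity check that does not depend on ‘tm’, word frequencies can also be tabulated in base R (a sketch; the regular-expression tokenizer below is an assumption and is cruder than tm’s preprocessing steps):

```r
# Count word frequencies in a character vector using only base R:
# lower-case, split on runs of non-letters, drop empty strings, tabulate.
count.words <- function(txt) {
  words <- unlist(strsplit(tolower(txt), "[^a-z]+"))
  words <- words[nchar(words) > 0]
  sort(table(words), decreasing = TRUE)
}
count.words(c("Nevermore, said the Raven.", "Quoth the Raven: Nevermore!"))
```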

Interestingly, it is ‘door’ (and not ‘raven’ or ‘nevermore’) that is the most common word in the poem. You can now generate a word cloud. First, we will generate one that relies mostly on default settings, with limited parameterization.

#------------------ WORD CLOUD USING MOSTLY DEFAULT SETTINGS -----
# plot wordcloud with limited parameterization (mostly default settings)
wordcloud(words = names(out2), freq = out2, min.freq = 3,
          max.words=50, random.order=FALSE, rot.per=0.1, 
          colors=brewer.pal(8, "Set1"))

Let’s now build a more carefully customized word cloud. First, decide which words to keep based on their frequency. For example…

n <- 2                                  # say, two occurrences minimum
wordc <- sum(out2>=n)                           # count words occurring at least n times
wordc                                   # number of words with at least n occurrences
## [1] 107

In this case, we retained 107 words.

Now that you know how many words you have, you can define your own colors (for ‘The Raven’, they should be “dark and dreary”). It may be useful to plot your color gradient first to check the colors and adjust as needed.

mycol.F <- colorRampPalette(c("darkseagreen1", "black"))    # colorRampPalette returns a palette function
mycols <- mycol.F(wordc)                        # apply this new function to define number of colors
mycols
##   [1] "#C1FFC1" "#BFFCBF" "#BDFABD" "#BBF7BB" "#B9F5B9" "#B7F2B7" "#B6F0B6"
##   [8] "#B4EEB4" "#B2EBB2" "#B0E9B0" "#AEE6AE" "#ACE4AC" "#ABE2AB" "#A9DFA9"
##  [15] "#A7DDA7" "#A5DAA5" "#A3D8A3" "#A2D6A2" "#A0D3A0" "#9ED19E" "#9CCE9C"
##  [22] "#9ACC9A" "#98CA98" "#97C797" "#95C595" "#93C293" "#91C091" "#8FBE8F"
##  [29] "#8EBB8E" "#8CB98C" "#8AB68A" "#88B488" "#86B286" "#84AF84" "#83AD83"
##  [36] "#81AA81" "#7FA87F" "#7DA57D" "#7BA37B" "#79A179" "#789E78" "#769C76"
##  [43] "#749974" "#729772" "#709570" "#6F926F" "#6D906D" "#6B8D6B" "#698B69"
##  [50] "#678967" "#658665" "#648464" "#628162" "#607F60" "#5E7D5E" "#5C7A5C"
##  [57] "#5B785B" "#597559" "#577357" "#557155" "#536E53" "#516C51" "#506950"
##  [64] "#4E674E" "#4C654C" "#4A624A" "#486048" "#475D47" "#455B45" "#435943"
##  [71] "#415641" "#3F543F" "#3D513D" "#3C4F3C" "#3A4C3A" "#384A38" "#364836"
##  [78] "#344534" "#324332" "#314031" "#2F3E2F" "#2D3C2D" "#2B392B" "#293729"
##  [85] "#283428" "#263226" "#243024" "#222D22" "#202B20" "#1E281E" "#1D261D"
##  [92] "#1B241B" "#192119" "#171F17" "#151C15" "#141A14" "#121812" "#101510"
##  [99] "#0E130E" "#0C100C" "#0A0E0A" "#090C09" "#070907" "#050705" "#030403"
## [106] "#010201" "#000000"
plot(rep(1,wordc), col=mycols, pch=15, cex=6,
     axes=F, xlab='', ylab='')                  # check your colors
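A minimal illustration of what colorRampPalette() returns (a sketch; the endpoint hex codes follow from R’s built-in color definitions): it produces a palette *function* that interpolates between the endpoint colors.

```r
# colorRampPalette() returns a function; calling it with an integer n
# yields n hex colors interpolated between the endpoint colors.
pal <- colorRampPalette(c("darkseagreen1", "black"))
cols <- pal(5)
cols[1]                     # first color is darkseagreen1 ("#C1FFC1")
cols[5]                     # last color is black ("#000000")
```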

You are now ready to plot the word cloud for ‘The Raven’.

wordcloud(words = names(out2), freq = out2, 
          max.words=wordc,                      # how many words to include
          min.freq=n,                           # minimum frequency
          scale=c(4,.5),                        # range of font sizes of words
          random.order=FALSE,                   # default (prints words in order of their frequency)
          rot.per=0.5,                          # controls rotation
          colors=mycols,                        # colors of words
          vfont=c("gothic english", "plain"))   # font ('gothic english' seems appropriate here)

Example 2: Summarize your own paper as a word cloud

One obvious application of word clouds is to convert your paper or abstract into a word cloud. You can then use it as an appealing visualization of words frequently used in your own text. Here, I loaded one of our own papers:

Tyler and Kowalewski 2017, Surrogate taxa and fossils as reliable proxies of spatial biodiversity patterns in marine benthic communities. Proceedings of the Royal Society B 284: 20162839. http://dx.doi.org/10.1098/rspb.2016.2839

This paper is stored in the file ‘tyler.txt’.

rav <- readLines('tyler.txt')                       # read your text file
rav2 <- Corpus(VectorSource(rav))                   # convert lines of text into a corpus
rav2 <- tm_map(rav2, content_transformer(tolower))      # convert the text to lower case
rav2 <- tm_map(rav2, removeWords, stopwords("english"))     # remove common English stopwords
rav2 <- tm_map(rav2, removeNumbers)                 # remove numbers
rav2 <- tm_map(rav2, removePunctuation)                 # remove punctuation
rav2 <- tm_map(rav2, stripWhitespace)               # eliminate extra white spaces
rav2 <- tm_map(rav2, removeWords, c('figure', 'datasets'))  # remove some custom words
out1 <- TermDocumentMatrix(rav2)                    # build a term-document matrix
# out1$dimnames$Terms                           # a list of unique words can be accessed
out2 <- sort(rowSums(as.matrix(out1)),decreasing=TRUE)  # compute frequencies of words
as.matrix(out2)[1:20,]                          # check 20 most common words and their frequencies
##     mollusk   diversity   estimates     spatial communities assemblages 
##          60          46          40          37          24          22 
##  nonmollusk  ecosystems    habitats     species comparisons        data 
##          22          20          20          20          19          18 
##  localities      marine     however   community    fidelity       table 
##          18          17          17          16          16          16 
##  consistent       death 
##          16          15

Interesting words to note here are ‘consistent’ and ‘however’. You could, of course, remove these if you want to focus strictly on topical words; in this example, however, I will keep both. Now we will again customize our colors.

n <- 7                                  # 7 occurrences minimum
wordc <- sum(out2>=n)                           # count words occurring at least n times
wordc                                   # number of words with at least n occurrences
## [1] 78
mycol.F <- colorRampPalette(c('forestgreen', 'yellow3', 'coral1'))  # colorRampPalette returns a palette function
mycols <- mycol.F(wordc)                        # apply this new function to define number of colors
plot(rep(1,wordc), col=mycols, pch=15, cex=6,
     axes=F, xlab='', ylab='')                  # check your colors

Now we can generate the word cloud and use it in talks, insert it into posters, or post it on a website.

wordcloud(words = names(out2), freq = out2, 
          max.words=wordc,                      # how many words to include
          min.freq=n,                           # minimum frequency
          scale=c(4,.5),                        # range of font sizes of words
          random.order=FALSE,                   # default (prints words in order of their frequency)
          rot.per=0.5,                          # controls rotation
          colors=mycols,                        # colors of words
          vfont=c("sans serif", "plain"))       # font
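To reuse the figure in a talk or poster, you can draw it into a file with a graphics device (a sketch; the file name, dimensions, and resolution are arbitrary choices, and the objects ‘out2’, ‘wordc’, ‘n’, and ‘mycols’ are those defined above):

```r
# Open a PNG device, draw the word cloud into it, then close the device.
png("mycloud.png", width = 1600, height = 1600, res = 300)
wordcloud(words = names(out2), freq = out2,
          max.words = wordc, min.freq = n,
          scale = c(4, .5), random.order = FALSE,
          rot.per = 0.5, colors = mycols)
dev.off()
```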

And, of course, as always, cite the packages you used (including those loaded in the background).

# you can get references using 'citation' function. Disabled here to prevent excessive printout.
# citation("tm")
# citation("SnowballC")
# citation("wordcloud")
# citation("slam")
# citation("RColorBrewer")
# citation("NLP")

Ingo Feinerer and Kurt Hornik (2018). tm: Text Mining Package. R package version 0.7-4. https://CRAN.R-project.org/package=tm

Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text Mining Infrastructure in R. Journal of Statistical Software 25(5): 1-54. URL: http://www.jstatsoft.org/v25/i05/.

Milan Bouchet-Valat (2014). SnowballC: Snowball stemmers based on the C libstemmer UTF-8 library. R package version 0.5.1. https://CRAN.R-project.org/package=SnowballC

Ian Fellows (2014). wordcloud: Word Clouds. R package version 2.5. https://CRAN.R-project.org/package=wordcloud

Kurt Hornik, David Meyer and Christian Buchta (2018). slam: Sparse Lightweight Arrays and Matrices. R package version 0.1-43. https://CRAN.R-project.org/package=slam

Erich Neuwirth (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. https://CRAN.R-project.org/package=RColorBrewer

Kurt Hornik (2017). NLP: Natural Language Processing Infrastructure. R package version 0.1-11. https://CRAN.R-project.org/package=NLP

Comments/Questions/Corrections: Michal Kowalewski (kowalewski@ufl.edu)

Peer-review: This document has NOT been peer-reviewed.

Our Sponsors: National Science Foundation (Sedimentary Geology and Paleobiology Program), National Science Foundation (Earth Rates Initiative), Paleontological Society, Society of Vertebrate Paleontology