It is relatively easy to create a customized word cloud in R using the package ‘wordcloud’.
Start by installing and then loading the relevant libraries. Loading these packages will also attach their dependencies (‘NLP’ and ‘RColorBrewer’).
# install.packages('wordcloud') # uncomment to install the package
# install.packages('tm') # uncomment to install the package
# install.packages('SnowballC') # uncomment to install the package
library(tm)
## Loading required package: NLP
library(SnowballC)
library(wordcloud)
## Loading required package: RColorBrewer
As our first example we will use a famous poem by Edgar Allan Poe: ‘The Raven’. Make sure that the file ‘RAVEN.txt’ is located in your working directory or that its path is correctly referenced. We will first load the text and preprocess it by cleaning it with the function ‘tm_map’. Several common steps you may consider when cleaning text files are illustrated below.
rav <- readLines('RAVEN.txt') # read in your text file
rav2 <- Corpus(VectorSource(rav)) # convert lines of text into a corpus
rav2 <- tm_map(rav2, content_transformer(tolower)) # convert the text to lower case
rav2 <- tm_map(rav2, removeWords, stopwords("english")) # remove common English stopwords
rav2 <- tm_map(rav2, removeNumbers) # remove numbers
rav2 <- tm_map(rav2, removePunctuation) # remove punctuation
rav2 <- tm_map(rav2, stripWhitespace) # eliminate extra white spaces
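Under the hood, most of these transformations amount to simple string operations. A minimal base-R sketch (on a made-up snippet, without the ‘tm’ package; stopword removal is omitted for brevity) illustrates what the individual cleaning steps do:

```r
# a made-up snippet of text (illustration only, not read from the file)
txt <- "Once upon a midnight dreary, while I pondered, weak and weary..."
txt <- tolower(txt)                  # convert to lower case
txt <- gsub("[[:digit:]]+", "", txt) # remove numbers
txt <- gsub("[[:punct:]]+", "", txt) # remove punctuation
txt <- gsub("\\s+", " ", txt)        # collapse extra white space
txt
```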
You can also remove specific words of your choice, as long as you make sure to convert all words to lower case first.
rav2 <- tm_map(rav2, removeWords, c('said', 'upon')) # Here, I removed two words
Once you have removed the words you believe to be meaningless, unnecessary, or otherwise useless, you can compute word frequencies (the critical variable for generating word clouds).
#------------------ COMPUTE FREQUENCIES OF WORDS -----------------
out1 <- TermDocumentMatrix(rav2) # reformat into a term-document matrix
# out1$dimnames$Terms # a list of unique words can be accessed
out2 <- sort(rowSums(as.matrix(out1)),decreasing=TRUE) # compute frequencies of words
as.matrix(out2)[1:10,] # check 10 most common words and their frequencies
## door raven chamber nevermore bird lenore nothing
## 14 11 11 11 10 8 7
## still thy soul
## 7 7 6
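The frequency counting above can be mimicked in base R with ‘table’, which is handy for checking results on a small example. A minimal sketch with a made-up vector of already-cleaned words:

```r
# toy vector of already-cleaned words (illustration only)
words <- c("door", "raven", "door", "nevermore", "door", "raven")
freq <- sort(table(words), decreasing = TRUE) # word frequencies, most common first
names(freq)[1] # the most common word
freq[[1]]      # and its count
```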
Interestingly, it is ‘door’ (and not ‘raven’ or ‘nevermore’) that is the most common word in ‘The Raven’. You can now generate a word cloud. First, we will generate one based mostly on default settings, with limited parameterization.
#------------------ WORD CLOUD USING MOSTLY DEFAULT SETTINGS -----
# plot wordcloud with limited parameterization (mostly default settings)
wordcloud(words = names(out2), freq = out2, min.freq = 3,
max.words=50, random.order=FALSE, rot.per=0.1,
colors=brewer.pal(8, "Set1"))
Let’s now make a more carefully customized word cloud. First, decide which words to keep based on their frequency. For example:
n <- 2 # say, two occurrences minimum
wordc <- sum(out2>=n) # count how many words occur at least n times (here, 2)
wordc # number of words with at least n occurrences
## [1] 107
In this case we retained 107 words.
Now that you know how many words you have, you can define your own colors (for ‘The Raven’, they should be “dark and dreary”). It may be useful to plot your color gradient first to check the colors and adjust as needed.
mycol.F <- colorRampPalette(c("darkseagreen1", "black")) # this function creates a function
mycols <- mycol.F(wordc) # apply this new function to define number of colors
mycols
## [1] "#C1FFC1" "#BFFCBF" "#BDFABD" "#BBF7BB" "#B9F5B9" "#B7F2B7" "#B6F0B6"
## [8] "#B4EEB4" "#B2EBB2" "#B0E9B0" "#AEE6AE" "#ACE4AC" "#ABE2AB" "#A9DFA9"
## [15] "#A7DDA7" "#A5DAA5" "#A3D8A3" "#A2D6A2" "#A0D3A0" "#9ED19E" "#9CCE9C"
## [22] "#9ACC9A" "#98CA98" "#97C797" "#95C595" "#93C293" "#91C091" "#8FBE8F"
## [29] "#8EBB8E" "#8CB98C" "#8AB68A" "#88B488" "#86B286" "#84AF84" "#83AD83"
## [36] "#81AA81" "#7FA87F" "#7DA57D" "#7BA37B" "#79A179" "#789E78" "#769C76"
## [43] "#749974" "#729772" "#709570" "#6F926F" "#6D906D" "#6B8D6B" "#698B69"
## [50] "#678967" "#658665" "#648464" "#628162" "#607F60" "#5E7D5E" "#5C7A5C"
## [57] "#5B785B" "#597559" "#577357" "#557155" "#536E53" "#516C51" "#506950"
## [64] "#4E674E" "#4C654C" "#4A624A" "#486048" "#475D47" "#455B45" "#435943"
## [71] "#415641" "#3F543F" "#3D513D" "#3C4F3C" "#3A4C3A" "#384A38" "#364836"
## [78] "#344534" "#324332" "#314031" "#2F3E2F" "#2D3C2D" "#2B392B" "#293729"
## [85] "#283428" "#263226" "#243024" "#222D22" "#202B20" "#1E281E" "#1D261D"
## [92] "#1B241B" "#192119" "#171F17" "#151C15" "#141A14" "#121812" "#101510"
## [99] "#0E130E" "#0C100C" "#0A0E0A" "#090C09" "#070907" "#050705" "#030403"
## [106] "#010201" "#000000"
plot(rep(1,wordc), col=mycols, pch=15, cex=6,
axes=F, xlab='', ylab='') # check your colors
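Note that ‘colorRampPalette’ returns a function: you pass it the endpoint colors, and the returned function takes the number of colors you want along the ramp. A small self-contained example (the endpoint colors here are arbitrary):

```r
pal.F <- colorRampPalette(c("red", "blue")) # returns a palette function
pal <- pal.F(5) # request 5 colors along the red-to-blue ramp
pal # a vector of 5 hex color codes, from "#FF0000" to "#0000FF"
```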
You are now ready to plot the word cloud for ‘The Raven’.
wordcloud(words = names(out2), freq = out2,
max.words=wordc, # defines how many words to include
min.freq=n, # defines minimum frequency
scale=c(4,.5), # defines range of fonts sizes of words
random.order=FALSE, # default (prints words in order of decreasing frequency)
rot.per=0.5, # controls rotation
colors=mycols, # defines colors of words
vfont=c("gothic english", "plain")) # define font ('gothic english' seems appropriate here)
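To use the word cloud outside R (e.g., in a poster or slide), you can write it to a file by wrapping the plotting call in a graphics device such as ‘png’. A minimal sketch with made-up frequencies (the file name ‘raven_cloud.png’ and the frequencies below are arbitrary examples, not taken from the poem):

```r
library(wordcloud) # also loads RColorBrewer

# made-up word frequencies for illustration
freq <- c(door = 14, raven = 11, nevermore = 11, bird = 10, lenore = 8)

png("raven_cloud.png", width = 800, height = 800, res = 150) # open a PNG device
wordcloud(words = names(freq), freq = freq, min.freq = 1,
          scale = c(4, 1), random.order = FALSE)
dev.off() # close the device and write the file to disk
```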
One obvious application of word clouds is to convert your own paper or abstract into a word cloud. You can then use it as a cool visualization of words frequently used in your own text. Here, I loaded one of our own papers:
Tyler and Kowalewski 2017, Surrogate taxa and fossils as reliable proxies of spatial biodiversity patterns in marine benthic communities. Proceedings of the Royal Society B 284: 20162839. http://dx.doi.org/10.1098/rspb.2016.2839
This paper is stored in the file ‘tyler.txt’.
rav <- readLines('tyler.txt') # read in your text file
rav2 <- Corpus(VectorSource(rav)) # convert lines of text into a corpus
rav2 <- tm_map(rav2, content_transformer(tolower)) # convert the text to lower case
rav2 <- tm_map(rav2, removeWords, stopwords("english")) # remove common English stopwords
rav2 <- tm_map(rav2, removeNumbers) # remove numbers
rav2 <- tm_map(rav2, removePunctuation) # remove punctuation
rav2 <- tm_map(rav2, stripWhitespace) # eliminate extra white spaces
rav2 <- tm_map(rav2, removeWords, c('figure', 'datasets')) # remove some words
out1 <- TermDocumentMatrix(rav2) # reformat into a term-document matrix
# out1$dimnames$Terms # a list of unique words can be accessed
out2 <- sort(rowSums(as.matrix(out1)),decreasing=TRUE) # compute frequencies of words
as.matrix(out2)[1:20,] # check 20 most common words and their frequencies
## mollusk diversity estimates spatial communities assemblages
## 60 46 40 37 24 22
## nonmollusk ecosystems habitats species comparisons data
## 22 20 20 20 19 18
## localities marine however community fidelity table
## 18 17 17 16 16 16
## consistent death
## 16 15
Interesting words to notice here are ‘consistent’ and ‘however’. You could, of course, remove those if you want to focus strictly on topical words, but I will keep them in this example. Now we will again customize our colors.
n <- 7 # 7 occurrences minimum
wordc <- sum(out2>=n) # count how many words occur at least n times (here, 7)
wordc # number of words with at least n occurrences
## [1] 78
mycol.F <- colorRampPalette(c('forestgreen', 'yellow3', 'coral1')) # this function creates a function
mycols <- mycol.F(wordc) # apply this new function to define number of colors
plot(rep(1,wordc), col=mycols, pch=15, cex=6,
axes=F, xlab='', ylab='') # check your colors
Now we can generate the word cloud and use it in talks, insert it into posters, or post it on your website.
wordcloud(words = names(out2), freq = out2,
max.words=wordc, # defines how many words to include
min.freq=n, # defines minimum frequency
scale=c(4,.5), # defines range of fonts sizes of words
random.order=FALSE, # default (prints words in order of decreasing frequency)
rot.per=0.5, # controls rotation
colors=mycols, # defines colors of words
vfont=c("sans serif", "plain")) # define font
And, of course, as always, cite the packages you used (including those loaded in the background).
# you can get references using the 'citation' function; disabled here to prevent excessive printout
# citation("tm")
# citation("SnowballC")
# citation("wordcloud")
# citation("slam")
# citation("RColorBrewer")
# citation("NLP")
Ingo Feinerer and Kurt Hornik (2018). tm: Text Mining Package. R package version 0.7-4. https://CRAN.R-project.org/package=tm
Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text Mining Infrastructure in R. Journal of Statistical Software 25(5): 1-54. http://www.jstatsoft.org/v25/i05/
Milan Bouchet-Valat (2014). SnowballC: Snowball stemmers based on the C libstemmer UTF-8 library. R package version 0.5.1. https://CRAN.R-project.org/package=SnowballC
Ian Fellows (2014). wordcloud: Word Clouds. R package version 2.5. https://CRAN.R-project.org/package=wordcloud
Kurt Hornik, David Meyer and Christian Buchta (2018). slam: Sparse Lightweight Arrays and Matrices. R package version 0.1-43. https://CRAN.R-project.org/package=slam
Erich Neuwirth (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. https://CRAN.R-project.org/package=RColorBrewer
Kurt Hornik (2017). NLP: Natural Language Processing Infrastructure. R package version 0.1-11. https://CRAN.R-project.org/package=NLP
Comments/Questions/Corrections: Michal Kowalewski (kowalewski@ufl.edu)
Peer-review: This document has NOT been peer-reviewed.
Our Sponsors: National Science Foundation (Sedimentary Geology and Paleobiology Program), National Science Foundation (Earth Rates Initiative), Paleontological Society, Society of Vertebrate Paleontology
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.