This text analysis is part of my poster (co-authored with Rodrigo Werle and Sarah Marinho) presented at the 2020 virtual North Central Weed Science Society (NCWSS) annual meeting (December 2020). Here I present the part of that abstract focused on weed species ranked among the top 100 words from 2001 through 2020 (the 2020 meeting proceedings are now included in the analysis). I also run the text analysis with less code than I used for the 2020 NCWSS meeting, and that shorter version is what I show here. If you are only interested in the final figure, please scroll to the bottom of this page.

First, we have to load the packages needed for this analysis. Please run the code below:

library(tidyverse)
library(tidytext)
library(textreadr)
library(pdftools)
library(ggtext)
# if you do not have any of these packages installed, please run install.packages("name_of_the_package")

I have downloaded all NCWSS proceedings and added them to a folder named “docs” (you can name the folder as you choose). You can find all PDFs here (folder “docs”).

I used the str_c function to build the paths to all PDFs in the folder “docs”. The pdfs output contains the paths for all 20 NCWSS proceedings.

# pattern is a regular expression, so "\\.pdf$" matches names ending in ".pdf"
pdfs <- str_c("docs", "/", list.files("docs", pattern = "\\.pdf$"), 
              sep = "")
pdfs 
##  [1] "docs/nc2001.pdf" "docs/nc2002.pdf" "docs/nc2003.pdf" "docs/nc2004.pdf"
##  [5] "docs/nc2005.pdf" "docs/nc2006.pdf" "docs/nc2007.pdf" "docs/nc2008.pdf"
##  [9] "docs/nc2009.pdf" "docs/nc2010.pdf" "docs/nc2011.pdf" "docs/nc2012.pdf"
## [13] "docs/nc2013.pdf" "docs/nc2014.pdf" "docs/nc2015.pdf" "docs/nc2016.pdf"
## [17] "docs/nc2017.pdf" "docs/nc2018.pdf" "docs/nc2019.pdf" "docs/nc2020.pdf"

Next, you will store the PDF names. If you run the code below, list.files returns the same file names as above, without the “docs/” prefix.

pdf_names <- list.files("docs", pattern = "\\.pdf$") # regular expression for names ending in ".pdf"
pdf_names
##  [1] "nc2001.pdf" "nc2002.pdf" "nc2003.pdf" "nc2004.pdf" "nc2005.pdf"
##  [6] "nc2006.pdf" "nc2007.pdf" "nc2008.pdf" "nc2009.pdf" "nc2010.pdf"
## [11] "nc2011.pdf" "nc2012.pdf" "nc2013.pdf" "nc2014.pdf" "nc2015.pdf"
## [16] "nc2016.pdf" "nc2017.pdf" "nc2018.pdf" "nc2019.pdf" "nc2020.pdf"

Here is where the magic occurs: I will use the map function from the purrr package (part of the core tidyverse). Using map saves both code and time.

pdfs_text <- map(pdfs, pdftools::pdf_text)

This “magic” is called iteration: instead of running the analysis separately for each year, we can run it for all years at once. Running pdfs_text alone returns all proceedings organized as a list. I am not printing pdfs_text here because the output is large. Note that pdfs_text is not yet tidy enough for the analysis.
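
To make the idea of iteration concrete, here is a minimal sketch that compares a for loop with map. The function fake_pdf_text and the file paths are hypothetical stand-ins so the example runs without any PDFs; in the real analysis pdftools::pdf_text takes that place.

```r
library(purrr)

# Hypothetical stand-in for pdftools::pdf_text(): one string per "page"
fake_pdf_text <- function(path) {
  c(paste(path, "page 1"), paste(path, "page 2"))
}

paths <- c("docs/a.pdf", "docs/b.pdf") # hypothetical file paths

# Loop version: fill a list one element at a time
loop_result <- vector("list", length(paths))
for (i in seq_along(paths)) {
  loop_result[[i]] <- fake_pdf_text(paths[i])
}

# map() version: apply the function to every path in one call
map_result <- map(paths, fake_pdf_text)

# Both approaches should give the same list
identical(loop_result, map_result)
```

Both versions return a list with one character vector per file; map simply hides the bookkeeping of the loop.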

The iteration with the map function should be followed by the tibble function to organize the proceedings by year.

pdf <- tibble(document = pdf_names, text = pdfs_text) %>% 
  mutate(year = 2001:2020) # adding a column for each year
pdf
## # A tibble: 20 x 3
##    document   text         year
##    <chr>      <list>      <int>
##  1 nc2001.pdf <chr [211]>  2001
##  2 nc2002.pdf <chr [211]>  2002
##  3 nc2003.pdf <chr [205]>  2003
##  4 nc2004.pdf <chr [188]>  2004
##  5 nc2005.pdf <chr [229]>  2005
##  6 nc2006.pdf <chr [216]>  2006
##  7 nc2007.pdf <chr [248]>  2007
##  8 nc2008.pdf <chr [215]>  2008
##  9 nc2009.pdf <chr [165]>  2009
## 10 nc2010.pdf <chr [111]>  2010
## 11 nc2011.pdf <chr [174]>  2011
## 12 nc2012.pdf <chr [145]>  2012
## 13 nc2013.pdf <chr [143]>  2013
## 14 nc2014.pdf <chr [107]>  2014
## 15 nc2015.pdf <chr [120]>  2015
## 16 nc2016.pdf <chr [110]>  2016
## 17 nc2017.pdf <chr [136]>  2017
## 18 nc2018.pdf <chr [125]>  2018
## 19 nc2019.pdf <chr [233]>  2019
## 20 nc2020.pdf <chr [174]>  2020
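
The year column above is typed in by hand as 2001:2020, which relies on the files being listed in year order with no gaps. As a hedged alternative, the year could be pulled from the file name itself. This is only a sketch, and it assumes every file name contains exactly one four-digit year, as in the names shown above.

```r
library(stringr)

# A few file names in the same style as pdf_names
pdf_names_example <- c("nc2001.pdf", "nc2002.pdf", "nc2020.pdf")

# Extract the first run of four digits and convert it to an integer
years <- as.integer(str_extract(pdf_names_example, "\\d{4}"))
years
```

Inside the pipeline this would become mutate(year = as.integer(str_extract(document, "\\d{4}"))).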

As you can see in the tibble (data frame) above, text is a list column: the proceedings of each year are stored as a character vector with one element per PDF page (e.g., <chr [248]>).

Now that we have a tibble, we can proceed with tokenization using the unnest_tokens function.

pdf1 <- pdf %>% 
  unnest(text) %>% # text is a list column; unnest gives one row per page
  mutate(text = str_to_lower(text), # making all text lower case
         # str_replace_all replaces every match on a page, not only the first
         text = str_replace_all(text, "2,4-d", 
                                "twofourd"), # protect 2,4-d from tokenization
         text = str_replace_all(text, "marestail", 
                                "horseweed")) %>% # marestail = horseweed
  unnest_tokens(word, text, strip_numeric = TRUE)

pdf1 %>% 
  slice_head(n = 10)
## # A tibble: 10 x 3
##    document    year word        
##    <chr>      <int> <chr>       
##  1 nc2001.pdf  2001 industry    
##  2 nc2001.pdf  2001 donations   
##  3 nc2001.pdf  2001 of          
##  4 nc2001.pdf  2001 intellectual
##  5 nc2001.pdf  2001 property    
##  6 nc2001.pdf  2001 rights      
##  7 nc2001.pdf  2001 to          
##  8 nc2001.pdf  2001 universities
##  9 nc2001.pdf  2001 thomas      
## 10 nc2001.pdf  2001 s

Notice that I used the mutate function to change 2,4-D to “twofourd”, because tokenization would otherwise split it into separate pieces. Because horseweed has more than one common name, I also replaced marestail with horseweed.
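
The effect of this replacement can be checked on a single made-up sentence. The sketch below is an illustration only (the sentence is invented); it tokenizes the text with and without the protective replacement. It uses str_replace_all, which changes every match in a string, whereas str_replace changes only the first match in each string.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Invented sentence that mentions each term twice
toy <- tibble(
  text = "2,4-D controlled marestail; 2,4-D plus glyphosate controlled marestail"
)

# Without protection: 2,4-d does not survive tokenization as one token
raw_tokens <- toy %>% 
  mutate(text = str_to_lower(text)) %>% 
  unnest_tokens(word, text, strip_numeric = TRUE)

# With protection: replace every occurrence before tokenizing
protected_tokens <- toy %>% 
  mutate(text = str_to_lower(text),
         text = str_replace_all(text, "2,4-d", "twofourd"),
         text = str_replace_all(text, "marestail", "horseweed")) %>% 
  unnest_tokens(word, text, strip_numeric = TRUE)

count(protected_tokens, word, sort = TRUE)
```

In the protected version both mentions of the herbicide are counted under “twofourd” and both mentions of the weed under “horseweed”.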

Next, we need to remove the “stopwords”. Stopwords are words like “in”, “and”, “at”, “their”, “about”, etc. The get_stopwords function from the tidytext package has five “stopword” sources; I will combine them all and store the result in stopwords. See below:

stopwords <- get_stopwords("en", source = c("smart")) %>% 
  bind_rows(get_stopwords("en", source = c("marimo"))) %>% 
  bind_rows(get_stopwords("en", source = c("nltk"))) %>% 
  bind_rows(get_stopwords("en", source = c("stopwords-iso"))) %>% 
  bind_rows(get_stopwords("en", source = c("snowball")))

stopwords %>% 
  slice_head(n=10)
## # A tibble: 10 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           smart  
##  2 a's         smart  
##  3 able        smart  
##  4 about       smart  
##  5 above       smart  
##  6 according   smart  
##  7 accordingly smart  
##  8 across      smart  
##  9 actually    smart  
## 10 after       smart
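
Since map was already used for the PDFs, the same idea could shorten this step too. The sketch below is one possible alternative, not the code used for the figure: it loops over the five source names and then keeps each word once, because the same word can appear in several lexicons. The names sources and stopwords_unique are my own.

```r
library(dplyr)
library(purrr)
library(tidytext)

# The five sources used above
sources <- c("smart", "marimo", "nltk", "stopwords-iso", "snowball")

# One get_stopwords() call per source, stacked into a single tibble,
# then reduced to one row per distinct word
stopwords_unique <- map(sources, ~ get_stopwords("en", source = .x)) %>% 
  bind_rows() %>% 
  distinct(word)

nrow(stopwords_unique)
```

Removing the duplicates does not change what anti_join removes; it only makes the stopword table smaller.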

Now that I have a tibble called “stopwords”, I will use the anti_join function to remove the stopwords from pdf1.

pdf2 <- pdf1 %>%
  anti_join(stopwords, by = "word")
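
If anti_join is new to you, here is a toy illustration with made-up words: it keeps only the rows of the first table whose word has no match in the second table.

```r
library(dplyr)

# Made-up tokens and a made-up stopword list
words <- tibble(word = c("the", "waterhemp", "and", "glyphosate"))
stops <- tibble(word = c("the", "and", "of"))

# Rows of `words` with no matching `word` in `stops`
kept <- anti_join(words, stops, by = "word")
kept
```

Only “waterhemp” and “glyphosate” remain, in their original order.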

The get_stopwords function, even with all sources combined, is not enough to remove all the words I want to exclude from this analysis. For example, I do not want words like “virtual”, “kansas”, “werle”, “proceedings”, etc. I have manually built a custom list of “stopwords” for weed science meetings; please check the WSSA text analysis. I am loading that list, which I made in my previous analysis, from the source file “stop_words.R”. You can find “stop_words.R” here.

source("stop_words.R")

The list is saved as stop_tibble, which is also used with the anti_join function. As described above, anti_join will remove all “stopwords” in stop_tibble from pdf2. Notice that here I am also using mutate to bring back 2,4-D.

pdf3 <- pdf2 %>% 
  anti_join(stop_tibble, by = c("word")) %>% # stop_tibble is in the source code
  mutate(word = str_replace(word, "twofourd", "2,4-d")) # bring back 2,4-d

Next, I will count words by year, arrange the counts in descending order, group by year, rank the words (row_number), and keep the top 100 words for each year.

pdf4 <- pdf3 %>% 
  count(year, word) %>% 
  arrange(year, -n) %>% 
  group_by(year) %>% 
  mutate(rank = row_number()) %>% 
  filter(rank <= 100)

Now I have the top 100 words for each year (NCWSS proceedings):

pdf4 %>% 
  slice_head(n = 10)
## # A tibble: 200 x 4
## # Groups:   year [20]
##     year word           n  rank
##    <int> <chr>      <int> <int>
##  1  2001 control      716     1
##  2  2001 weed         632     2
##  3  2001 glyphosate   475     3
##  4  2001 applied      427     4
##  5  2001 herbicide    369     5
##  6  2001 corn         358     6
##  7  2001 treatments   339     7
##  8  2001 soybean      257     8
##  9  2001 common       240     9
## 10  2001 yield        226    10
## # … with 190 more rows
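
The count, arrange, and rank steps can be verified by hand on a tiny made-up data set. The sketch below keeps the top 2 words per year instead of the top 100; the data are invented for illustration.

```r
library(dplyr)

# Invented tokens: one row per word occurrence
toy_words <- tibble(
  year = c(2001, 2001, 2001, 2001, 2001, 2001, 2002, 2002, 2002),
  word = c("weed", "weed", "weed", "corn", "corn", "kochia",
           "corn", "corn", "weed")
)

toy_rank <- toy_words %>% 
  count(year, word) %>%          # occurrences of each word per year
  arrange(year, desc(n)) %>%     # most frequent first within each year
  group_by(year) %>% 
  mutate(rank = row_number()) %>% # 1, 2, 3, ... within each year
  filter(rank <= 2) %>%          # keep the top 2 per year
  ungroup()

toy_rank
```

Note that row_number gives every word a distinct rank, so words with equal counts are ranked in the order in which they appear after arrange.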

In this analysis I am interested only in weeds present in the top 100 words in 2001 and 2020. Therefore, I am using the if_else function to create new columns that highlight the selected weed species. You can select any other words if you want, as I did with herbicides in my poster at the 2020 NCWSS meeting.

pdf5 <- pdf4 %>% 
  mutate(highlight = if_else(word %in% c("amaranth", "palmer", 
                                         "kochia", "horseweed", 
                                         "grass", "nightshade", 
                                         "waterhemp", "velvetleaf", 
                                         "ragweed", "sunflower",
                                         "foxtail"), TRUE, FALSE),
       variable_col = if_else(highlight == TRUE, word, "NA"))

pdf5 %>% 
  slice_head(n = 5)
## # A tibble: 100 x 6
## # Groups:   year [20]
##     year word           n  rank highlight variable_col
##    <int> <chr>      <int> <int> <lgl>     <chr>       
##  1  2001 control      716     1 FALSE     NA          
##  2  2001 weed         632     2 FALSE     NA          
##  3  2001 glyphosate   475     3 FALSE     NA          
##  4  2001 applied      427     4 FALSE     NA          
##  5  2001 herbicide    369     5 FALSE     NA          
##  6  2002 weed         834     1 FALSE     NA          
##  7  2002 control      658     2 FALSE     NA          
##  8  2002 glyphosate   624     3 FALSE     NA          
##  9  2002 corn         445     4 FALSE     NA          
## 10  2002 applied      330     5 FALSE     NA          
## # … with 90 more rows
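
The first rows shown above are all FALSE, so here is a small made-up example in which some words are highlighted. It is a sketch with an invented word list; the vector name weeds is my own.

```r
library(dplyr)

weeds <- c("kochia", "waterhemp") # words to highlight (illustrative subset)

toy <- tibble(word = c("control", "kochia", "waterhemp", "corn"))

toy_highlight <- toy %>% 
  mutate(highlight = word %in% weeds,                    # TRUE for selected words
         variable_col = if_else(highlight, word, "NA"))  # text "NA", not a missing value

toy_highlight
```

Here variable_col holds the word itself for highlighted rows and the character string "NA" for all other rows, which is what the plotting code later filters on.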

Now the tibble is ready, so I will proceed with data visualization. First, I will set the font family, colors, and theme.

#Set theme
library(extrafont)
extrafont::loadfonts()
font_family <- 'Helvetica'
title_family <- ".New York"
background <- "#1D1D1D"
text_colour <- "white"
axis_colour <- "white"
plot_colour <- "black"
theme_style <- theme(text = element_text(family = font_family),
                  rect = element_rect(fill = background),
                  plot.background = element_rect(fill = background, color = NA),
                  plot.title = element_markdown(family = title_family,
                                            face = 'bold', size = 80, colour = text_colour),
                  plot.subtitle = element_markdown(family = title_family, 
                                                   size = 40, colour = text_colour),
                  plot.caption = element_markdown(family = title_family,
                                              size = 25, colour = text_colour, hjust = 0),
                  panel.background = element_rect(fill = background, color = NA),
                  panel.border = element_blank(),
                  panel.grid.major.y = element_blank(),
                  panel.grid.major.x = element_blank(),
                  panel.grid.minor.x = element_blank(),
                  plot.margin = unit(c(3, 0.5, 0.5, 0.5), "cm"), # top, right, bottom, left
                  axis.title.y = element_text(face = 'bold', size = 40, 
                                              colour = text_colour),
                  axis.title.x = element_blank(),
                  axis.text.x.bottom = element_text(size = 45, colour= axis_colour, 
                                                    vjust = 17),
                  axis.text.x.top = element_text(size = 45, colour= axis_colour, 
                                                    vjust = -14),
                  axis.text.y = element_text(size = 30, colour = text_colour),
                  axis.ticks = element_blank(),
                  axis.line = element_blank(),
                  legend.text = element_text(size = 20, colour= text_colour),
                  legend.title = element_text(size = 25, colour= text_colour),
                  legend.position="none") 


theme_set(theme_classic() + theme_style)

#Set colour palette
cols <- c("#F2D9F3", "#F2D9F3", "#00E5E5", "#DEB887", 
          "#FAC8C8", "#39393A", "#FA9664", 
          "#FF4040", "#48DE7A", "#942DC7", 
          "#F5F5DC", "#FAFA00")
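
A note on how these colours reach the words: when scale_color_manual receives an unnamed vector, the colours are matched to the values of the colour variable in order, so the result depends on how those values sort. A named vector makes the pairing explicit. The sketch below only illustrates the technique; the word-to-colour pairs are hypothetical and are not the assignment used in the figure.

```r
library(ggplot2)

# Hypothetical pairing of words and colours (illustration only)
named_cols <- c("NA"      = "#39393A", # non-highlighted words
                kochia    = "#FA9664",
                waterhemp = "#FAFA00")

# With names, each colour is matched to its word regardless of sort order
colour_scale <- scale_color_manual(values = named_cols)

named_cols[["kochia"]]
```

With a named palette, adding or removing a highlighted word does not shift the colours of the others.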

Then I will plot the data. The idea here is to see the trend in weeds within the top 100 words from 2001 through 2020.

figure <- pdf5 %>% 
  ggplot(aes(x = year, y = rank, group = word)) +
  geom_line(data = pdf5 %>% filter(variable_col == "NA"),
                                      color = "#39393A", size = 4) +
  geom_point(data = pdf5 %>% filter(variable_col == "NA"),
                                      color = "#39393A", size = 10) +
  geom_line(data = pdf5 %>% filter(variable_col != "NA"),
                                       aes(color = variable_col), size = 4) +
  geom_point(data = pdf5 %>% filter(variable_col != "NA"),
                                       aes(color = variable_col), size = 10) +
  scale_y_reverse(breaks = 100:1, sec.axis = dup_axis()) +
  scale_x_continuous(breaks = seq(2001, 2020, 2), limits= c(1999.8, 2021.2), 
                     expand = c(.05, .05), sec.axis = dup_axis()) +
  geom_text(data = pdf5 %>% filter(year == 2001), # year is an integer column
            aes(label = word, x = 2000.8, color = variable_col),
            hjust = "right",
            fontface = "bold",
            size = 11) +
  geom_text(data = pdf5 %>% filter(year == 2020), # year is an integer column
            aes(label = word, x = 2020.2, color = variable_col),
            hjust = "left",
            fontface = "bold",
            size = 11) +
  coord_cartesian(ylim = c(101,1)) +
   scale_color_manual(values = cols) +
  labs(title = "<b style='color:red;'>NCWSS</b> annual meeting 
       proceedings text analysis from 2001 through 2020",
       subtitle = "Figure shows the rank of top 100 words of 2001 (left) 
       and 2020 (right) <b style='color:red;'>NCWSS</b> annual meeting proceedings. 
       Common weed species names are highlighted to <br> describe 
       their change across 20 years.", 
       y= "Rank",
       caption = "Visualization: @maxwelco adapted from @JaredBraggins | Source: NCWSS") 


#Export plot 
ggsave("top_weeds.png", plot = figure, width = 40, height = 60, dpi = 400, 
       limitsize = FALSE) # limitsize = FALSE allows the very large image

Check the figure carefully. What were scientists in the society focused on in 2001? What has changed in 20 years? What hasn’t? Draw your own conclusions.

This figure was adapted from one of @JaredBraggins's Tidy Tuesday visualizations.


Click here to learn more about Tidy Text with Julia Silge.

Thanks to Rodrigo Werle for reviewing this post.