# Setup ------------------------------------------------------------------------------------
#Clear memory
rm(list = ls(all=T))
# devtools::install_github("AlbertRapp/tidychatmodels")
# Load libraries and install if not installed already
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
tidyverse, # Grammar for data and graphics
here, # Relative file paths
tidytext, # Text analysis in a tidy format
arrow, # Files that are fast to write and read
ggrepel, # Repulsive text labels in ggplot2
ggtext, # Fancy text in ggplot2
knitr, # For kable and other rmarkdown things
gridExtra, # Arrange multiple plots into one
colorspace, # Fancy stuff with colors like darken and lighten
ggwordcloud, # Word clouds with ggplot2
tidychatmodels, # Access large language models directly in R
dotenv, # To create environment variable of API key
reticulate # Use Python code within Rmarkdown
)
#Set some output options
knitr::opts_chunk$set(include = TRUE, warning = FALSE, message = FALSE,
fig.width = 10, fig.height = 5)
# Amazon Color Palette
my_pal <- c("#ff9900", "#146eb4", "#232f3e", "#f2f2f2", "#000000")
line_colour = my_pal[3]
weak_text = my_pal[3]
strong_text = my_pal[5]
back_colour = my_pal[4]
# Define my ggplot theme as a function to avoid repetitive code
my_theme <- function(base_size = 18) {
theme_minimal(base_size = base_size) +
theme(
panel.grid = element_blank(),
panel.background = element_rect(fill = back_colour,
color = back_colour),
plot.background = element_rect(fill = back_colour,
colour = back_colour),
plot.caption.position = "plot",
plot.title.position = "plot",
plot.title = element_textbox_simple(size = rel(2),
# family = main_font,
color = strong_text,
margin = margin(4, 0, 10, 4)),
plot.subtitle = element_textbox_simple(size = rel(1.25),
# family = main_font,
colour = weak_text,
margin = margin(0, 4, 6, 4)),
axis.title.y = element_text(size = rel(1.2),
# family = main_font,
colour = strong_text,
margin = margin(0, 6 , 0, 4)),
axis.title.x = element_text(size = rel(1.2),
# family = main_font,
colour = strong_text,
margin = margin(6, 0 , 2, 0)),
axis.text = element_text(size = rel(1.1),
# family = main_font,
colour = weak_text,
margin = margin(0, 0, 0, 6)),
plot.caption = element_textbox_simple(size = rel(0.8),
colour = weak_text,
# family = main_font,
hjust = 0.5, # Seems to be ignored
margin = margin(4,0,2,4)),
legend.title = element_text(size = rel(1),
# family = main_font,
colour = strong_text),
legend.text = element_text(size = rel(0.9),
# family = main_font,
colour = weak_text)
)
}
I made this write-up for a few different reasons. First and foremost, I have dabbled in text mining a bit over the past year or so, and thought it would be interesting to take on a small project. Second, I wanted to curate a data set for the Tidy Tuesday project. By reading in the raw PDF files of Amazon’s annual reports and turning them into “tidy” data frames, I hope that I can contribute something useful to Tidy Tuesday, which I have had a great experience participating in. Third, I wanted to explore the relevance of basic text mining techniques when LLMs are also an option. Finally, I wanted to get into working with LLMs programmatically so that I can learn to use them for tasks such as summarization and document comparison.
I believe that LLMs have limitations but are strong tools if used correctly. People and organizations should be embracing such tools and finding ways to use them to complete tasks more efficiently where appropriate. However, LLMs are not perfect, and many people do not have access to the paid features of LLMs. At many organizations, there are strict rules around how tools like ChatGPT can be used. Concerns about issues such as privacy, transparency, accuracy, computational intensity, and cost are all valid, and as a result, more basic NLP methods remain relevant.
I hope to provide some insight into which use cases LLMs are appropriate for, and which are better served by more basic techniques. At the end, I also make an attempt to use an LLM in tandem with some more basic text mining techniques, to highlight the idea that these methods can complement one another. That being said, I am not an expert and I do not claim that my methods are the most efficient.
Note that the majority of the basic text mining techniques used here were learned from the book Text Mining With R by Julia Silge and David Robinson, which is available for free online (https://www.tidytextmining.com/).
As a publicly-traded company, Amazon releases an annual report every year (with a December 31st year end). An annual report is essentially a summary of the company’s performance over the past year. It includes details on how well the company did financially, what goals were achieved, and what challenges it faced. The report also provides information about Amazon’s plans and ambitions for the future.
I chose to use Amazon’s annual reports as the group of documents (corpus) to analyze for a few reasons: Amazon is an extremely well-known company, it has an interesting history (it went from selling books to selling everything under the sun), and there are nearly twenty years of annual reports available online. Other examples of things we can analyze with similar methods are articles, social media posts, and open-ended survey responses. While we group these annual reports by year, we could, for example, group articles or social media posts by author to compare different authors.
The main goals here are to learn and explore methods. We will focus on showing how different methods can be used to understand key themes in the annual reports. Specifically, we want to understand the key themes of the most recent report, and how those themes differ from the themes of other years’ reports.
Let us start with some very simple techniques and see what sorts of insights we can glean. Below is the code I used to read in the PDFs and save them as .feather files. However, it is commented out because it takes a long time to run. Instead, I load the already extracted data.
# # Setup---------------------------------------------------------------------------------------------
#
# # Clear memory
# rm(list = ls(all.names = TRUE))
#
# # Load libraries and install if not installed already
# if (!require("pacman")) install.packages("pacman")
# pacman::p_load(
# tidyverse, # Grammar for data and graphics
# here, # Relative file paths
# magrittr, # More pipeable functions (eg., set_colnames())
# pdftools, # Read PDFs Into R
# feather, # Files that are fast to write and read
# furrr # Run functions from purrr package (e.g., map_dfr()) in parallel
# )
#
#
# # Set up parallel computing for purrr (furrr) functions
# future::plan("multicore")
#
#
#
# # Read in Amazon Annual Reports ---------------------------------------------------------------------------------------
#
#
# # Function to read in one report
# read_one_standard <- function(year_numeric) {
#
# year_string = as.character(year_numeric)
#
# # Read in one pdf with name according to pattern
# pdftools::pdf_text(pdf = here(paste0("Amazon_Budgets/PDFs/", year_string, "-Annual-Report.pdf"))) |> # has difficulty with 2019 and 2020 reports
# as.data.frame() |>
# magrittr::set_colnames("text") |>
# # Add Column indicating year
# mutate(year = year_string)
#
# }
#
# # Function to read in one report using ocr (more computationally intensive)
# # This is only necessary because there is something odd about the 2019 and 2020 PDFs
# read_one_ocr <- function(year_numeric) {
#
# year_string = as.character(year_numeric)
#
# # Read in one pdf with name according to pattern
# pdftools::pdf_ocr_text(pdf = here(paste0("Amazon_Budgets/PDFs/", year_string, "-Annual-Report.pdf"))) |>
# as.data.frame() |>
# magrittr::set_colnames("text") |>
# # Add Column indicating year
# mutate(year = year_string)
#
# }
#
#
# # Run the function on all years (except 2019 and 2020) in parallel and rbind into one tidy dataframe
# well_behaved_reports <- future_map_dfr(c(2005:2018, 2021:2023), read_one_standard) # This will take a while to run. Go grab a coffee or something.
#
# # Run the function on 2019 and 2020 in parallel and rbind into one tidy dataframe
# poorly_behaved_reports <- future_map_dfr(2019:2020, read_one_ocr) # This also will take a while to run. Go grab a second coffee or something.
#
# # Rbind all reports into one dataframe
# all_reports <- rbind(well_behaved_reports, poorly_behaved_reports)
#
#
# # Write Data ------------------------------------------------------------------------------------------------------------
#
# # Save in .feather format
# write_feather(all_reports, here("Amazon_Budgets/Data/Intermediate/all_reports_ocr_uncleaned.feather"))
# Read data from all years to start from this point in the .Rmd
all_reports_untokenized <- read_feather(here("Amazon_Budgets/Data/Intermediate/all_reports_ocr_uncleaned.feather"))
# Tokenize
all_reports <- all_reports_untokenized |>
unnest_tokens(
word, # Name of column of words in new dataframe
text # Name of column containing text in original dataframe
)
# Save tidy data tokenized by single word
write_feather(all_reports, here("Amazon_Budgets/Data/Intermediate/all_reports_tokenized.feather"))
First, I use basic functions to do something really simple - generate a list of the top ten most-used words across the nineteen annual reports in our data.
# Ten most-common words
all_reports |>
group_by(word) |>
summarise("Word Count" = n()) |>
slice_max(order_by = `Word Count`, n = 10) |>
kable(
col.names = c("Word", "Count")
)
Word | Count |
---|---|
and | 37970 |
of | 29008 |
the | 26705 |
to | 19861 |
in | 17945 |
our | 16328 |
we | 12183 |
for | 8489 |
as | 7451 |
a | 7248 |
As you can see, the results are profoundly uninteresting. We basically get a list of ten of the most common words in the English language. To get something more interesting, let’s remove a list of stop words. Stop words are words that are used commonly and are unlikely to be interesting for our analysis. Examples include words such as “the”, “it” and “or”.
# Remove standard list of common words such as "and", "the", "or"
all_reports_no_stop <- all_reports |>
anti_join(stop_words, by = "word")
# A New List of the Most-Common Words
all_reports_no_stop |>
group_by(word) |>
summarise("Word Count" = n()) |>
slice_max(order_by = `Word Count`, n = 10) |>
kable(
col.names = c("Word", "Count")
)
Word | Count |
---|---|
cash | 3967 |
million | 3708 |
net | 3704 |
tax | 3596 |
december | 3486 |
31 | 3352 |
sales | 3275 |
income | 3119 |
financial | 2982 |
stock | 2939 |
It is clear from the above table that there are still words that are unlikely to be interesting for our analysis, such as individual numbers, which appear because of dates or financial tables in the document. The words “December” and “31” appear because Amazon uses a fiscal year end of December 31st. Although I remove the names of all months because I think they will not be very interesting, looking into the word “December” and finding that it appears frequently because it is the month of Amazon’s fiscal year-end has given us our first insight, albeit a fairly trivial one.
# Custom list of stop words to remove
my_stop_words <- c(
str_to_lower(month.name), # List of month names
letters # List of letters
)
# Remove the custom stop words above, anything containing a number, and anything containing an underscore
all_reports_no_stop <- all_reports_no_stop |>
filter(!(word %in% my_stop_words), # List of custom stop words
!str_detect(word, "\\d"), # Anything containing a number
!str_detect(word, "_")) # Words containing an underscore
# A new list of the most-common words, this time for the 2023 report only
most_common_no_stop <- all_reports_no_stop |>
filter(year == 2023) |>
group_by(word) |>
summarise("Word Count" = n()) |>
slice_max(order_by = `Word Count`, n = 10)
most_common_no_stop |>
kable(
col.names = c("Word", "Count")
)
Word | Count |
---|---|
cash | 211 |
services | 211 |
billion | 200 |
tax | 198 |
including | 182 |
net | 177 |
sales | 174 |
income | 166 |
operating | 163 |
financial | 148 |
What if we try to do the same thing by asking an LLM to give us the most common words? First of all, it is usually not practical to feed an LLM an entire document that is as long as one of these annual reports. This is a hurdle we will have to creatively deal with later. For now, for the purposes of demonstration, let’s give an LLM (here I use Mistral’s large model for cost reasons) the first 500 words of the document and ask it for the most common ones. Then let’s compare its response to a manually programmed answer, which I know is correct. To interact with the LLM here, I use R’s {tidychatmodels} package. Note that I comment out all of the code that involves API calls in this document to avoid paying for API calls unnecessarily every time I rebuild my website.
# # Turn text into a form that an LLM can analyze easily
# text_to_analyze <- all_reports_no_stop |>
# filter(year == 2023) |>
# slice_head(n = 500) |> # Providing an entire report would be costly
# select("word") |>
# unlist() |>
# paste(collapse = " ")
#
# # Test mistral chat through tidychatmodels
# mistral_chat <- create_chat('mistral', Sys.getenv('MISTRAL_API_KEY')) |>
# add_model('mistral-large-latest') |>
# add_params(temperature = 0.0
# # ,max_tokens = 20000
# ) |>
# add_message(
# role = 'system',
# message = 'You receive text and return the ten most-common words in that text along with the number of times each of those words appears. Do not return anything else'
# ) |>
# add_message(
# paste(
# text_to_analyze
# )
# ) |>
# perform_chat()
# # Output
# result <- mistral_chat |> extract_chat(silent = TRUE)
#
# # Save result
#
# write_rds(result, "Amazon_Budgets/Data/mistral_word_count.rds")
result <- read_rds("Amazon_Budgets/Data/mistral_word_count.rds")
# Print the result from the LLM
llm_result <- result |>
filter(role == "assistant") |>
select(message) |>
unlist() |>
paste(collapse = " ")
paste("LLM Result", llm_result)
## [1] "LLM Result ```\n1. amazon 34\n2. customers 20\n3. yoy 17\n4. revenue 11\n5. cost 10\n6. aws 9\n7. delivery 8\n8. serve 7\n9. progress 7\n10. unit 6\n```"
# Manual Result
all_reports_no_stop |>
filter(year == 2023) |>
slice_head(n = 500) |>
group_by(word) |>
summarise(count = n()) |>
arrange(desc(count)) |>
slice_head(n = 10) |>
kable(caption = "Manual Result",
col.names = c("Word", "Count"))
Word | Count |
---|---|
customers | 13 |
yoy | 9 |
aws | 8 |
cost | 7 |
revenue | 6 |
amazon | 5 |
lower | 5 |
selection | 5 |
serve | 5 |
we’ve | 5 |
It seems that the LLM is acting in a way that we do not expect. Perhaps it is counting the number of times strings appear, even if they are contained within larger strings. For example, it might have counted “amazon” and “amazon.com” both as appearances of “amazon”, even though we just want to count “amazon.” Let’s use simple tidyverse code to see how many words really contain the substring “amazon.”
# Manually get the number of words Containing the string "amazon"
amazon_appr <- all_reports_no_stop |>
filter(year == 2023) |>
slice_head(n = 500) |>
filter(str_detect(word, "amazon")) |> # detects all words containing "amazon"
nrow()
The word “amazon” appears 9 times, including its appearances within other words. While this gets us closer to the LLM’s results, the results still do not align. This highlights the opacity of an LLM’s approach to solving specific problems, though certain models such as OpenAI’s o1 try to mitigate this issue. It is also worth noting that the LLM takes some time to produce a result, while our traditional methods with {tidytext} are effectively instantaneous.
An easy critique is that my prompt is not very good. I could have been more specific. However, that is part of the point I want to make. When prompting an LLM for a specific result, we need to provide a very specific and carefully-worded prompt. Even then, however, the process for arriving at the result is opaque, and it is hard to know whether the result is correct without manually checking it. Really, what we have done here is use a needlessly complex tool for a very simple task. It is as if we wanted to dig a hole to plant a flower and brought an excavator to do the job. Maybe somebody who is very skilled with an excavator could do it, but a shovel is still the more appropriate and cost-effective tool.
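For illustration only, a more tightly specified prompt might look something like the one below. This is a hypothetical rewording rather than the prompt I actually sent, and I have not tested whether it resolves the discrepancy.
# A hypothetical, more specific system prompt (not the one used above)
specific_prompt <- paste(
"You will receive a space-separated list of already-tokenized words.",
"Count exact matches only; a word that merely contains another word",
"(for example, 'amazon.com') must not be counted as that word ('amazon').",
"Return the ten most frequent words with their exact counts, and nothing else."
)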
Let’s continue with some more traditional text mining techniques, and show how we can use them to make some fun plots in {ggplot2}.
# Prepare data for the scatter plot
scatter_data <- all_reports_no_stop |>
mutate(first_second_half = if_else(year %in% 2005:2013, "first_half", "second_half")) |>
group_by(first_second_half, word) |>
summarise(word_count = n()) |>
pivot_wider(names_from = first_second_half, values_from = word_count) |>
mutate(second_half = second_half * 9/10, # Normalize second half to account for the fact it has one more year
difference = abs(first_half - second_half))
# Scatterplot
ggplot(scatter_data, aes(x = first_half, y = second_half)) +
geom_point(color = my_pal[1], alpha = 0.65, size = 2.5) +
geom_text_repel(label = if_else(scatter_data$first_half > 1000 | scatter_data$second_half > 1000 | scatter_data$difference > 300, scatter_data$word, ""),
color = weak_text,
size = 6) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = line_colour) + # 45-degree line
labs(title = "Comparing Word Frequencies Across Two Time Periods",
subtitle = "Note that the frequencies of words used in the 2014 to 2023
period are multiplied by 9/10 to account for the fact that that this period
is one year longer than the other.",
x = "2005 to 2013 Annual Reports", y = "2014 to 2023 Annual Reports") +
my_theme()
From this, we can glean a few interesting pieces of information. First, we see that words like “Cash”, “Tax”, and “Costs” are used at relatively similar frequencies in both the first nine annual reports and the last ten. However, words like “Billion” and “AWS” are used much more frequently in the last ten years of annual reports than the first nine. The word “Billion” likely appears more in the second half, and “Million” in the first half, because the company has grown a great deal and now deals in billions of dollars rather than millions. “AWS” refers to Amazon Web Services. From this graphic we might guess that Amazon Web Services has become an increasingly important part of the company over time. On the flip side, we see that the words “stock” and “financial” were used more frequently in the first nine annual reports. We might want to investigate this more to understand why.
One attribute of these reports that might cause an issue for our analysis is total word count. If the annual reports get substantially shorter over time, we will likely see a downward trend in most words. However, such trends would not actually be meaningful. Let’s take a look at the word counts of the annual reports by year, after removing the stop words discussed earlier. We see that although the total number of words is generally fairly similar from year to year, the differences between certain years are meaningful; the 2023 report has nearly 40% more words than the 2010 report. Therefore, although I tried to make the two periods in the previous section comparable by adjusting for the number of years, it might be more appropriate to adjust for the total number of words.
# Do annual reports vary much in word count?
annual_word_count <- all_reports_no_stop |>
mutate(year = as.numeric(year)) |>
group_by(year) |>
summarise(word_count = n())
# Scatterplot
ggplot(annual_word_count, aes(x = year, y = word_count)) +
geom_point(color = my_pal[1], alpha = 0.65, size = 3.5) +
geom_line(color = my_pal[1], size = 0.75) +
labs(title = "Total Word Counts Over Time",
subtitle = "Stop Words Removed",
x = "", y = "Total Words Exluding Stop Words") +
my_theme()
Next, let us illustrate the effect of normalizing our frequencies for the total number of words used in each annual report, while also exploring the usage of the term “AWS”, which we identified earlier as a word that was used much more frequently from 2014 to 2023 than from 2005 to 2013. To do so, let’s look at both the raw frequency of the term by year, and a frequency adjusted for the total number of words in the annual report. The adjustment involves dividing the occurrences of the word we want to look at by the total word count for a given year, and then multiplying by the average word count across all of the years.
# Prepare data for the line graph
one_word_line_data <- all_reports_no_stop |>
group_by(year, word) |>
mutate(year = as.numeric(year)) |>
summarise(word_count = n(), .groups = "drop") |>
filter(word == "aws") |>
complete(year = full_seq(year, 1), word = "aws", fill = list(word_count = 0)) |> # Otherwise years with 0 observations are lost
pivot_wider(names_from = word, values_from = word_count)
# Scatterplot
raw_freq_p <- ggplot(one_word_line_data, aes(x = year, y = aws)) +
geom_line(color = my_pal[1], size= 0.75) +
geom_point(color = my_pal[1], alpha = 0.65, size = 3.5) +
labs(title = "Raw Frequency of The Term AWS Over Time",
x = "", y = "Occurences") +
my_theme()
# Average Word Count
avg_word_count <- mean(annual_word_count$word_count)
# Join data for just aws with data for total word count
adjusted_word_line_data <- annual_word_count |>
mutate(year = as.numeric(year)) |>
full_join(one_word_line_data, by = "year") |>
mutate(aws = replace_na(aws, 0)) |> # Treat years with no occurrences as zero
mutate(aws = aws/word_count*avg_word_count)
# Line Graph
adj_freq_p <- ggplot(adjusted_word_line_data, aes(x = year, y = aws)) +
geom_line(color = my_pal[1], size= 0.75) +
geom_point(color = my_pal[1], alpha = 0.65, size = 3.5) +
labs(title = "Word-Count-Adjusted Frequency of The Term AWS Over Time",
x = "", y = "Adjusted Occurences") +
my_theme()
gridExtra::grid.arrange(raw_freq_p, adj_freq_p,
ncol = 1)
Although AWS launched in the mid-2000s, we see that the term was hardly used in Amazon’s annual reports until around 2010, when it started to be used much more. Then, in the 2015 annual report the term was used almost 70 times. AWS seems to have remained an important topic in the reports since. Adjusting the series for the total word counts in the annual reports does not make a terribly large difference with this corpus of documents, but it could in other cases.
By looking at changes in the usage of certain words over time and adjusting for total word counts, we have begun to look at something a little bit more interesting. However, there are more systematic approaches to finding which words are relatively common in which annual reports than what we have done so far. One approach is to combine term frequency (tf) with inverse document frequency (idf) to generate a tf-idf statistic. In the book Text Mining With R, Julia Silge and David Robinson describe tf-idf as follows:
“The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.”
By looking at which words have the highest tf-idf statistics in a given annual report, we are able to get an idea of which words typify that annual report.
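Concretely, tf-idf is the product of two quantities: the term frequency (how often a word appears in a report, divided by that report’s total word count) and the inverse document frequency (the natural log of the number of reports divided by the number of reports containing the word). Below is a minimal sketch of what {tidytext}’s bind_tf_idf() computes for us later in this section, using the tokenized all_reports data frame from earlier.
# Hand-rolled tf-idf, to illustrate what bind_tf_idf() does for us below
n_reports <- n_distinct(all_reports$year) # Number of documents in the corpus
manual_tf_idf <- all_reports |>
count(year, word, name = "word_count") |>
group_by(year) |>
mutate(tf = word_count / sum(word_count)) |> # Term frequency within each report
group_by(word) |>
mutate(idf = log(n_reports / n_distinct(year))) |> # Inverse document frequency
ungroup() |>
mutate(tf_idf = tf * idf)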
First, let’s look at the words with the highest tf-idf statistics from the 2023 annual report, to see which words are used most in the 2023 annual report relative to other years’ reports.
# Filter words containing numbers out of dataframe with stop words
all_reports_no_numbers <- all_reports |>
filter(!str_detect(word, "\\d"),
!str_detect(word, "_"))
# Generate table of tf-idf stats for all words and years
tf_idf_all <- all_reports_no_numbers |>
group_by(year, word) |>
summarise(word_count = n()) |>
bind_tf_idf(word, year, word_count)
# Table of just 2023 report
tf_idf_2023 <- tf_idf_all |>
filter(year == "2023")
tf_idf_2023 |>
slice_max(order_by = `tf_idf`, n = 10) |>
kable()
year | word | word_count | tf | idf | tf_idf |
---|---|---|---|---|---|
2023 | primitives | 31 | 0.0006762 | 2.9444390 | 0.0019911 |
2023 | genai | 18 | 0.0003927 | 2.9444390 | 0.0011561 |
2023 | stores | 66 | 0.0014397 | 0.4595323 | 0.0006616 |
2023 | primitive | 10 | 0.0002181 | 2.9444390 | 0.0006423 |
2023 | ai | 18 | 0.0003927 | 1.5581446 | 0.0006118 |
2023 | cybersecurity | 17 | 0.0003708 | 1.3350011 | 0.0004951 |
2023 | chips | 12 | 0.0002618 | 1.8458267 | 0.0004832 |
2023 | fms | 7 | 0.0001527 | 2.9444390 | 0.0004496 |
2023 | ending | 8 | 0.0001745 | 2.2512918 | 0.0003929 |
2023 | perishable | 6 | 0.0001309 | 2.9444390 | 0.0003854 |
The above table shows the words with the ten highest tf-idf statistics among words in Amazon’s 2023 annual report. Note that we do not have to remove a list of common words (e.g., “the”, “and”) here, because those words will be common in all reports, and therefore have low tf-idf statistics. We can see that in the most recent year, Amazon’s annual report focused on some hot topics in the world of tech, such as AI, chips, and cybersecurity. The words “primitives” and “primitive” refer to small foundational building blocks of larger processes within the company.
One way that we can visualize this is with a word cloud. Here, I make a word cloud of the 100 words with the highest tf-idf statistics from the 2023 annual report, with size and a color gradient mapped to the tf-idf statistic.
# Combos of two letters. To be removed for word cloud
letter_combo_list <- as.vector(outer(letters,letters, paste0))
# Function to make wordcloud of top tf-idf's for just one year
word_cloud_one_year <- function(year_numeric, max_size = 35) {
tf_idf_all |>
filter(year == year_numeric) |>
slice_max(order_by = `tf_idf`, n = 50) |>
filter(!str_detect(word, "_"), # Was picking up underline
!str_detect(word, "www."), # web addresses confuse ggplot
!(word %in% letters),
!(word %in% letter_combo_list)) |>
# mutate(word = str_replace(word, "\\.", "")) |> # Web addresses can trip up ggplot
ggplot(aes(label = word, size = tf_idf, color = tf_idf)) +
scale_size_area(max_size = max_size) +
scale_color_gradient(high = darken(my_pal[1], 0.15), low = lighten(my_pal[1], 0.15)) +
labs(title = paste("Words That Distinguish The", as.character(year_numeric), "Report")) +
geom_text_wordcloud() +
my_theme()
}
# Run the function for the year 2023
word_cloud_one_year(2023)
list_of_plots <- lapply(c(2005, 2010, 2015, 2020), word_cloud_one_year, max_size = 10)
grid.arrange(grobs = list_of_plots, ncol = 2)
Before moving on, I should note that everything we have done up to this point could be done with n-grams instead of single words. An n-gram is a sequence of words. For example a bi-gram is a sequence of two words. Sequences of words can often provide quite different insights than words taken individually. However, my goal is not to rewrite everything already written in Text Mining With R, so for more information you can look there.
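To give a flavour of what that looks like, here is a minimal sketch (using the untokenized all_reports_untokenized data frame from earlier) that tokenizes the reports into bigrams instead of single words and counts them by year:
# Tokenize into bigrams (pairs of consecutive words) instead of single words
all_reports_bigrams <- all_reports_untokenized |>
unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
count(year, bigram, sort = TRUE)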
Here, we are entering territory where an LLM can genuinely be useful. Let’s start by getting an LLM to summarize the key themes from the 2023 annual report. Although this sounds extremely simple (I initially assumed that one could just feed an entire annual report to an LLM and ask for a summary, which is not the case), doing it efficiently involves quite a few steps. Moreover, despite my attempts to use R alone for this, I found the need to use LangChain, which has no R implementation. Therefore, I use Python for these next sections. For simpler tasks, both {tidychatmodels} (created by Albert Rapp) and {tidyllm} (created by Eduard Brüll) are good R packages.
Before moving on, I must give credit where credit is due. The majority of the concepts and a lot of the code used for summarization here came from Greg Kamradt’s tutorials (@GregKamradt on Twitter), specifically this tutorial:
Furthermore, the combination of my more limited Python skills and this being my first foray into LangChain meant that I did have to ask an LLM for some help with the Python code. The irony in that is not lost on me.
First, I set up my API and test that I can use it to ping the desired LLM. I hid the code setting up the API key because API keys are like passwords and should be kept secret. For the rest of this exercise, I use OpenAI’s GPT-4o-mini.
#
# from langchain.chat_models import ChatOpenAI
#
# # For use later
# llm = ChatOpenAI(
# model_name="gpt-4o-mini", # Specify the GPT-4o Mini model
# temperature=0,
# max_retries=2,
# openai_api_key=OPENAI_API_KEY # Replace with your actual OpenAI API key
# )
#
Next, let’s load our reports. In this case I read the PDFs with Python and saved them all together as a .json file beforehand. The commented-out code below accomplishes that. To begin, I extract just the most recent report and work on developing a summary of it.
# # Define the base directory where the reports are located
# base_directory = r"path_to_directory"
#
# # Function to read PDF and extract text
# def read_pdf(file_path):
# reader = PdfReader(file_path)
# text = ""
# for i, page in enumerate(reader.pages):
# if i == 1: # Get only second page (index 1)
# text = page.extract_text() # Store the text of the second page
# break # No need to process further pages
# return text
#
#
# # Function to process all reports from 2005 to 2023
# def process_reports(start_year, end_year, base_directory):
# reports = []
# for year in range(start_year, end_year + 1):
# file_path = os.path.join(base_directory, f"{year}-Annual-Report.pdf")
#
# # Check if the file exists
# if os.path.exists(file_path):
# print(f"Processing: {file_path}")
# pdf_text = read_pdf(file_path) # Reads only the single page extracted by read_pdf()
#
# if pdf_text: # Ensure the extracted page is not empty
# # No need to split text into chunks since we're dealing with a single page
# documents = [Document(page_content=pdf_text)] # Store as a single Document
# reports.append((year, documents))
# else:
# print(f"File for {year} not found: {file_path}")
#
# return reports
#
# # Run the function to read in all of the PDFs
# pyth_all_years = process_reports(2005, 2023, base_directory)
#
#
# ## Save the read-in PDFs as JSON files
# my_output_path = "my_output_path"
#
# # Save the reports to a JSON file
# def save_reports_to_json(reports, output_path):
# # Convert reports to a dictionary
# data = {year: [doc.page_content for doc in documents] for year, documents in reports}
#
# # Save to JSON
# with open(output_path, "w", encoding="utf-8") as f:
# json.dump(data, f, ensure_ascii=False, indent=4)
# print(f"Reports saved to {output_path}")
#
# # Save as .json
# save_reports_to_json(pyth_all_years, my_output_path)
#
import os
import json
from langchain.schema import Document # Make sure to import the Document class
input_path = path_to_my_file # Path hidden
# Function to load the JSON of all of the reports that we already saved
def load_reports_from_json(input_path):
with open(input_path, "r", encoding="utf-8") as f:
data = json.load(f)
# Convert the dictionary back to LangChain Document objects
reports = []
for year, chunks in data.items():
documents = [Document(page_content=chunk) for chunk in chunks]
reports.append((int(year), documents))
return reports
# Load all reports
all_reports = load_reports_from_json(input_path)
# Function to pull out reports for a list of specified years
def get_reports_for_years(all_reports, target_years):
selected_reports = {}
for year, documents in all_reports:
if year in target_years:
selected_reports[year] = " ".join([doc.page_content for doc in documents])
return selected_reports # Returns a dictionary of year -> combined report text
# Run function to get report for the year 2023
report_2023 = get_reports_for_years(all_reports, [2023])
# Run the function to get the reports for the other comparison years (2013 through 2021)
other_reports = get_reports_for_years(all_reports, list(range(2013, 2022)))
# Combine 2023 report into a single string
report_2023 = " ".join(report_2023.values())
# Combine all other reports into a single string
other_reports = " ".join(other_reports.values())
Now we have an object containing the 2023 report (target document) and an object containing all of the other reports (the rest of the corpus). As alluded to earlier, the main difficulty we run into is that it is not feasible to feed an entire annual report to the LLM. Even if it doesn’t exceed the token limit, doing so would be expensive. This is an even bigger problem if we want to feed decades’ worth of reports to the LLM. So what do we do?
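Before answering that, it is worth putting rough numbers on the problem. A common rule of thumb is that one English word corresponds to roughly 1.3 tokens, so a quick back-of-the-envelope check in R on the annual_word_count data frame from earlier gives a sense of scale (this is a crude approximation rather than an exact tokenizer count, and those counts exclude stop words, so the full reports are longer still):
# Very rough token estimate per report, assuming ~1.3 tokens per word (rule of thumb)
annual_word_count |>
mutate(approx_tokens = round(word_count * 1.3))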
I will avoid going into too much depth about exactly what I do here. Almost everything here is explained well in Greg Kamradt’s tutorial, which I linked above. At a high level, the process is as follows:
1. Split each report into chunks of text.
2. Embed each chunk (turn it into a numeric vector) using OpenAI’s embeddings.
3. Cluster the embeddings with k-means.
4. Take the chunk closest to each cluster centroid as a representative of that cluster.
5. Have the LLM summarize each representative chunk.
6. Have the LLM combine those chunk summaries into one overall summary.
Furthermore, to reduce cost and avoid exceeding token limits, I look at the last ten years of reports rather than all of the reports.
# # Loaders
# from langchain.schema import Document
#
# # Splitters
# from langchain.text_splitter import RecursiveCharacterTextSplitter
#
# # Model
# from langchain_openai import ChatOpenAI
#
# # Embedding Support
# from langchain_community.vectorstores import FAISS
# from langchain_openai import OpenAIEmbeddings
#
# # Summarizer we'll use for Map Reduce
# from langchain.chains.summarize import load_summarize_chain
#
# # Data stuff
# import numpy as np
#
# # Clustering
# from sklearn.cluster import KMeans
#
# def process_and_cluster_report(report_text, openai_api_key, num_clusters=10, chunk_size=10000, chunk_overlap=1000):
# """
# Processes a single report, splits it into chunks, embeds the chunks, and performs clustering.
#
# Args:
# report_text (str): The text content of the report to process.
# openai_api_key (str): API key for OpenAI.
# num_clusters (int): The number of clusters for K-means clustering. Default is 10.
# chunk_size (int): The maximum size of each text chunk, in characters. Default is 10000.
# chunk_overlap (int): The overlap between chunks, in characters. Default is 1000.
#
# Returns:
# tuple: Contains:
# - docs (list): List of chunked documents.
# - vectors (numpy.ndarray): Array of embeddings for each chunk.
# - kmeans (KMeans): Fitted KMeans model.
# """
# text_splitter = RecursiveCharacterTextSplitter(
# separators=["\n\n", "\n", "\t"], chunk_size=chunk_size, chunk_overlap=chunk_overlap
# )
# docs = text_splitter.create_documents([report_text])
#
# embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
# vectors = np.array(embeddings.embed_documents([x.page_content for x in docs]))
#
# kmeans = KMeans(n_clusters=num_clusters, random_state=97).fit(vectors)
# return docs, vectors, kmeans
#
# # Process and cluster the 2023 report
# docs_2023, vectors_2023, kmeans_2023 = process_and_cluster_report(
# report_text=report_2023,
# openai_api_key=OPENAI_API_KEY,
# num_clusters=10 # Remember to change between testing and implementation
# )
#
# # Process and cluster other reports
# docs_other, vectors_other, kmeans_other = process_and_cluster_report(
# report_text=other_reports,
# openai_api_key=OPENAI_API_KEY,
# num_clusters=150 # Remember to change between testing and implementation
# )
# # Function to find the centroid embeddings
# def find_closest_embeddings(vectors, kmeans):
# """
# Finds the closest embeddings to the centroids of the clusters.
#
# Args:
# vectors (numpy.ndarray): Array of embeddings.
# kmeans (KMeans): Fitted KMeans model.
#
# Returns:
# list: Sorted indices of the closest points to each cluster centroid.
# """
# closest_indices = []
# for i in range(kmeans.n_clusters):
# distances = np.linalg.norm(vectors - kmeans.cluster_centers_[i], axis=1)
# closest_index = np.argmin(distances)
# closest_indices.append(closest_index)
# return sorted(closest_indices)
#
#
# # Find the centroid embeddings
# selected_indices_2023 = find_closest_embeddings(vectors_2023, kmeans_2023)
# selected_indices_other = find_closest_embeddings(vectors_other, kmeans_other)
# ## Use prompt template
#
# from langchain import PromptTemplate
#
# def summarize_chunks(docs, selected_indices, llm):
# """
# Summarizes selected chunks of a report.
#
# Args:
# docs (list): List of document chunks.
# selected_indices (list): Indices of selected chunks.
# llm: Language model for summarization.
#
# Returns:
# list: List of summaries for each selected chunk.
# """
# from langchain import PromptTemplate
# from langchain.chains.summarize import load_summarize_chain
#
# map_prompt = """You will be given a single passage of one of Amazon's annual reports. This section will be enclosed in triple backticks (```)
# Your goal is to give a summary of the key theme(s) from this section. Your response should be one to two sentences.
# ```{text}``` Summary of Key Themes:"""
#
# map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])
#
# map_chain = load_summarize_chain(
# llm=llm,
# chain_type="stuff",
# prompt=map_prompt_template
# )
#
# summary_list = []
# selected_docs = [docs[doc] for doc in selected_indices]
# for i, doc in enumerate(selected_docs):
# chunk_summary = map_chain.run([doc])
# summary_list.append(chunk_summary)
#
# # Print only the first two chunk summaries as examples
# if i < 2:
# print(f"Summary #{i + 1} (chunk #{selected_indices[i]}) - Preview: {chunk_summary[:250]} \n")
#
# return summary_list
#
#
#
# # Summarize chunks
# summary_2023 = summarize_chunks(docs_2023, selected_indices_2023, llm)
# summary_other = summarize_chunks(docs_other, selected_indices_other, llm)
#
# # Save summaries so we don't have to call the API every time we run
# import json
# import os
# Save summaries to the folder
file_path = os.path.join(folder_path, "summaries.json") # Folder path hidden
# with open(file_path, "w") as f:
# json.dump({"summary_2023": summary_2023, "summary_other": summary_other}, f)
# Access the summaries
with open(file_path, "r") as f:
loaded_summaries = json.load(f)
summary_2023 = loaded_summaries["summary_2023"]
summary_other = loaded_summaries["summary_other"]
Next, I combine the chunk summaries into an overall summary of the 2023 report.
# def summarize_report_summaries(summary_list, llm):
# """
# Generates a verbose summary from a list of summaries for a report or set of reports.
#
# Args:
# summary_list (list): A list of summaries for the report(s).
# llm (LangChainModel): A language model instance for summarization.
#
# Returns:
# str: A verbose summary of the provided summaries.
# """
# # Combine all summaries into one string
# combined_summaries = "\n".join(summary_list)
#
# # Wrap the combined summaries into a Document
# summaries_document = Document(page_content=combined_summaries)
#
# # Print the token count for debugging
# total_tokens = llm.get_num_tokens(summaries_document.page_content)
# print(f"Combined summaries contain {total_tokens} tokens.")
#
# # Define the prompt for summarizing summaries
# combine_prompt = """
# You will be given a series of summaries from one annual report or a group of annual reports from the company Amazon.
# The summaries will be enclosed in triple backticks (```). Your goal is to give a verbose summary of the
# key themes of the annual report(s). Please write the summary in one to two paragraphs.
#
# ```{text}```
# SUMMARY:
# """
# combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])
#
# # Create the summarization chain
# reduce_chain = load_summarize_chain(
# llm=llm,
# chain_type="stuff",
# prompt=combine_prompt_template,
# )
#
# # Generate the verbose summary
# verbose_summary = reduce_chain.run([summaries_document])
#
# # Print and return the verbose summary
# print(verbose_summary)
# return verbose_summary
#
# # Generate a verbose summary for report_2023
# complete_summary_2023 = summarize_report_summaries(summary_list=summary_2023, llm=llm)
# Save summaries to the folder
file_path = os.path.join(folder_path, "complete_summary_2023.json") # Folder path hidden
# with open(file_path, "w") as f:
# json.dump({"complete_summary_2023": complete_summary_2023},f)
# Load the complete summary back into Python
with open(file_path, "r") as f:
loaded_data = json.load(f)
# Access the summary
complete_summary_2023 = loaded_data["complete_summary_2023"]
While this is definitely cool, and would have blown my mind just a few years ago, it doesn’t really tell us how the 2023 annual report compares to the other annual reports like the tf-idf statistics did. Perhaps these are the key themes that the LLM would grab from any of the annual reports.
I have not found an established method to compare a document to others in a corpus of documents using an LLM. Therefore, in this section I am just exploring procedures that I have come up with myself. I do not proclaim these to be the best methods to use, and in my opinion the results leave some obvious room for improvement. I would love to hear feedback about how this process could be improved.
As an initial attempt, I use a simple process that leverages the summaries I already made earlier: I give the LLM the chunk summaries from the 2023 report as a “target summary”, along with the chunk summaries from the other reports, and ask it to identify what makes the target unique.
Below is the LLM’s output.
# def compare_reports(target_summary, other_summaries, llm):
# """
# Compares the target report summary to other report summaries to identify unique themes.
#
# Args:
# target_summary (str): Summary of the target report.
# other_summaries (list): Summaries of other reports.
# llm: Language model for comparison.
#
# Returns:
# str: A summary describing how the target report differs from others.
# """
# from langchain.chains import LLMChain
# from langchain.prompts import PromptTemplate
#
# compare_prompt = """You will be provided with a target summary and summaries of other reports.
# Compare the target summary to the others and identify what makes it unique. Provide your answer in three to four sentences.
# Target Summary: {target_summary}
# Other Summaries: {other_summaries}
# Key Differences:"""
#
# compare_prompt_template = PromptTemplate(
# template=compare_prompt,
# input_variables=["target_summary", "other_summaries"]
# )
#
# compare_chain = LLMChain(llm=llm, prompt=compare_prompt_template)
# result = compare_chain.run(
# {"target_summary": target_summary, "other_summaries": " ".join(other_summaries)}
# )
# return result
#
#
# # Compare summaries
# comparison_result = compare_reports(
# target_summary=" ".join(summary_2023),
# other_summaries=summary_other,
# llm=llm
# )
# Save summaries to the folder
file_path = os.path.join(folder_path, "comparison_result.json") # folder path hidden
# with open(file_path, "w") as f:
# json.dump({"comparison_result": comparison_result},f)
# Load the complete summary back into Python
with open(file_path, "r") as f:
loaded_data = json.load(f)
# Access the summary
comparison_result = loaded_data["comparison_result"]
print(comparison_result)
## The target summary is unique in its comprehensive coverage of Amazon's financial growth, operational strategies, and the multifaceted challenges it faces in 2023. It emphasizes not only the company's revenue increases and innovations in customer experience but also delves into specific aspects such as the development of "fulfillment primitives," investments in generative AI, and the importance of human capital. Additionally, it highlights a range of risks, including legal challenges and complexities in financial management, which are less detailed in the other summaries. Overall, the target summary presents a more nuanced and detailed analysis of Amazon's strategic focus and operational landscape compared to the other summaries, which tend to be more general or focused on specific themes.
Although these results are a little bland, they are a start.
A second option that I came up with involves augmenting our LLM prompt with some of the results from our traditional text mining methods. I augment the LLM approach above by also feeding the LLM the top tf-idf terms from the target document. Below is the LLM’s output.
# Get R df of top 10 tf_idf terms that appear at least 12 times from 2023
top_tf_idf_2023 <- tf_idf_2023 |>
ungroup() |>
filter(word_count >= 12) |>
arrange(desc(tf_idf)) |>
slice_head(n = 10) |>
select(word)
reticulate::r_to_py(top_tf_idf_2023)
# top_tf_idf_2023 = list(r.top_tf_idf_2023['word']) # Keep as a list of terms so ", ".join() below behaves as intended
#
# def compare_reports_with_top_tfidf(target_summary, other_summaries, top_tfidf_terms, llm):
# """
# Compares the target report summary to other report summaries, using precomputed top TF-IDF terms.
#
# Args:
# target_summary (str): Summary of the target report.
# other_summaries (list): Summaries of other reports.
# top_tfidf_terms (list): List of top TF-IDF terms for the target report.
# llm: Language model for comparison.
#
# Returns:
# str: A summary describing how the target report differs from others.
# """
# from langchain.chains import LLMChain
# from langchain.prompts import PromptTemplate
#
# # Integrate unique terms into the LLM prompt
# compare_prompt = """You will be provided with three inputs: A summary of key themes from a target report. Key Terms that are used more frequently in the target report than
# the others. And summaries of key themes from the other reports in the corpus.
#
# Identify what makes the target report unique by comparing the summary of key themes from the target document to the summaries of key themes
# from the other documents, and taking into consideration the key terms provided. Give your answer in four to six sentences.
#
# Target Summary: {target_summary}
# Key Terms : {unique_terms}
# Other Summaries: {other_summaries}
# Key Differences:"""
#
# compare_prompt_template = PromptTemplate(
# template=compare_prompt,
# input_variables=["target_summary", "unique_terms", "other_summaries"]
# )
#
# compare_chain = LLMChain(llm=llm, prompt=compare_prompt_template)
# result = compare_chain.run(
# {
# "target_summary": target_summary,
# "unique_terms": ", ".join(top_tfidf_terms),
# "other_summaries": " ".join(other_summaries),
# }
# )
# return result
#
#
# # Assuming `r.top_tf_idf_2023` contains the top TF-IDF terms as a list
# comparison_result_tf_idf = compare_reports_with_top_tfidf(
# target_summary=" ".join(summary_2023),
# other_summaries=summary_other,
# top_tfidf_terms=top_tf_idf_2023,
# llm=llm
# )
# Save summaries to the folder
file_path = os.path.join(folder_path, "comparison_result_tf_idf.json") # Folder path hidden
# with open(file_path, "w") as f:
# json.dump({"comparison_result_tf_idf": comparison_result_tf_idf},f)
# Load the complete summary back into Python
with open(file_path, "r") as f:
loaded_data = json.load(f)
# Access the summary
comparison_result_tf_idf = loaded_data["comparison_result_tf_idf"]
print(comparison_result_tf_idf)
## The target report stands out from the other summaries primarily due to its comprehensive focus on Amazon's financial growth in 2023, emphasizing specific metrics such as revenue increases across various segments and the complexities of financial management, including tax regulations and cash flow dynamics. Unlike the other reports, which generally highlight Amazon's commitment to innovation and customer satisfaction, the target report delves deeper into operational strategies, such as the development of "fulfillment primitives" and investments in generative AI, showcasing a more nuanced understanding of how these elements contribute to Amazon's competitive edge.
##
## Additionally, the target report uniquely addresses the importance of human capital, detailing Amazon's strategies for employee development, safety, and engagement, which are less emphasized in the other summaries. It also highlights specific risks related to legal claims and operational inefficiencies, providing a more detailed risk assessment that is not as prevalent in the other reports. Overall, the target report presents a more intricate and multifaceted view of Amazon's current operational landscape, financial health, and strategic priorities, setting it apart from the broader themes discussed in the other documents.
These give us a better idea of how the 2023 report is unique compared to the other reports in the corpus.
I feel that I may have bitten off more than I can chew with this part of this little write-up, so I will not dive too far down this particular rabbit hole here. Again, my goal with this section was to learn and share what I learned. I am keen to keep working on this sort of problem and, should time permit, I plan to come back to this and make a future post about more robust approaches.