Many naturally occurring things follow a normal distribution.
I'm curious whether paragraph lengths follow a normal distribution too.
Note: paragraph length could be measured in characters or in words.
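To make the two measures concrete, here is a minimal sketch in base R. The helper name `para_measures` is hypothetical (it is not part of the script further down), and it assumes that words are simply whatever is separated by whitespace.

```r
# Hypothetical helper, not from the original script: measure one paragraph
# in characters and in words.
para_measures <- function(paragraph) {
  # Assumption: a "word" is any run of non-whitespace characters
  words <- strsplit(trimws(paragraph), "\\s+")[[1]]
  words <- words[words != ""]
  c(chars = nchar(paragraph), words = length(words))
}

para_measures("It is a truth universally acknowledged")
```

The analysis below only uses the character count; swapping in the word count would just mean summing a different per-line number.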
I took a look at the character lengths of paragraphs in Jane Austen's six major novels. That's by no means representative of English literature generally, but it's a start (the reproducible R code is below). Here are the results.
First, here is the distribution of paragraph lengths, in characters, across every paragraph of all six novels:
Since a few extremely long paragraphs appear to skew the distribution to the right, this plot shows what happens when the longest 1% of paragraphs are excluded:
And this one shows what happens when the longest 10% are excluded. It's getting more normal, and it does start to exhibit some normal-ish characteristics (mode closer to the center, mostly monotonically increasing to the left of the mode, mostly monotonically decreasing to the right of it), but I'm not sure it can quite be called normal. Overall, it would be best described as a log-normal distribution.
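One rough way to check the log-normal idea: if the lengths are log-normal, their logarithms should look roughly normal, so the skew should mostly disappear after taking logs. Below is a minimal sketch under that assumption. The helpers `skewness` and `compare_skew` are hypothetical names, and the data is simulated (log-normal by construction, with arbitrary parameters), so it says nothing about the Austen results themselves.

```r
# Hypothetical helpers, not from the original script.
# Sample skewness: roughly 0 for symmetric data, positive for a long right tail.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Compare the skewness of the raw lengths with the skewness of their logs
compare_skew <- function(para_length) {
  x <- para_length[!is.na(para_length) & para_length > 0]
  c(raw = skewness(x), logged = skewness(log(x)))
}

# Simulated stand-in data (NOT the Austen paragraphs); parameters are arbitrary
set.seed(1)
fake_lengths <- rlnorm(5000, meanlog = 6, sdlog = 1)
compare_skew(fake_lengths)

# With the script's data, the same check would be:
# compare_skew(df$para_length)
```

A normal Q-Q plot of the logged lengths (`qqnorm()` followed by `qqline()`) would be a natural visual companion to this number.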
# install.packages('janeaustenr')
library(janeaustenr)
library(tidyverse)
library(ggthemes)
janeaustenr::austen_books() %>% as.data.frame %>% head(200)
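# Assume a paragraph is a run of non-empty lines that ends at an empty line
df <- janeaustenr::austen_books()
df <- df %>%
  mutate(char = nchar(text)) %>%
  mutate(empty = ifelse(text == "", 1, 0))

# Walk the lines, summing characters until an empty line closes the paragraph
df$para_length <- NA_real_
chars <- 0
for (i in 1:nrow(df)) {
  row <- df[i, ]
  if (row$empty == 0) {
    # row$char is a plain number; df[i, "char"] would return a 1x1 data frame
    chars <- chars + row$char
  }
  if (row$empty == 1) {
    # Store the total on the paragraph's last line (0 for repeated empty lines)
    df[i - 1, "para_length"] <- chars
    chars <- 0
  }
  if (i %% 5000 == 0) { print(i) }  # progress indicator
}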

# All paragraphs (rows with a non-zero paragraph total)
df %>%
  filter(!is.na(para_length), para_length != 0) %>%
  ggplot(aes(para_length)) +
  geom_histogram(bins = 30, fill = "#F8766D") + theme_classic() +
  ggtitle("Paragraph length in Characters",
          subtitle = "All paragraphs")

# Drop the longest 1% of paragraphs
df %>%
  filter(!is.na(para_length), para_length != 0) %>%
  filter(para_length < quantile(para_length, 0.99)) %>%
  ggplot(aes(para_length)) +
  geom_histogram(bins = 30, fill = "#00BA38") + theme_classic() +
  ggtitle("Paragraph length in Characters",
          subtitle = "Without longest 1% of paragraphs")

# Drop the longest 10% of paragraphs
df %>%
  filter(!is.na(para_length), para_length != 0) %>%
  filter(para_length < quantile(para_length, 0.90)) %>%
  ggplot(aes(para_length)) +
  geom_histogram(bins = 30, fill = "#619CFF") + theme_classic() +
  ggtitle("Paragraph length in Characters",
          subtitle = "Without longest 10% of paragraphs")
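
The row-by-row loop above is slow on a text this size, which is why it prints progress. A vectorized alternative might look like the sketch below. The function name `para_lengths` is hypothetical, it uses base R only, and it differs from the loop in one way: a final paragraph that is not followed by an empty line is still counted.

```r
# Hypothetical helper, not from the original script: paragraph lengths in
# characters from a character vector of lines, using the same
# "empty line ends a paragraph" assumption.
para_lengths <- function(text) {
  empty <- text == ""
  # The counter goes up at every empty line, so all lines of one paragraph
  # share the same id
  para_id <- cumsum(empty)
  # Sum characters per id; empty lines contribute 0 characters
  totals <- tapply(nchar(text), para_id, sum)
  # Drop groups made up only of empty lines
  as.vector(totals[totals > 0])
}

# Toy input: two paragraphs separated by two empty lines
para_lengths(c("ab", "cde", "", "", "f", ""))

# With the novels: para_lengths(janeaustenr::austen_books()$text)
```

Like the loop, this counts chapter headings and other short standalone lines as paragraphs, and the totals exclude line breaks.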