stevecondylios/distribution-of-paragraph-length.md

## distribution-of-paragraph-length.md

      
Question: Does paragraph length follow a normal distribution?

Many naturally occurring things follow a normal distribution.
I'm curious whether paragraph lengths follow a normal distribution too.
Note: paragraph length could be measured in characters or in words.
Answer:

I took a look at the character lengths of paragraphs in Jane Austen's six major novels. That's by no means representative of English literature generally, but it's a start (the reproducible R code is below). Here are the results.
First, here is the distribution of character counts for every paragraph in all six novels:

Since a few extremely long paragraphs appear to skew the distribution to the right, this plot shows what happens when the longest 1% of paragraphs are excluded:

And this one shows what happens when the longest 10% are excluded. The distribution is getting more normal, although I'm not sure it can quite be called normal. It does start to exhibit some normal-ish characteristics (mode closer to the center, mostly monotonically increasing to the left of the mode, mostly monotonically decreasing to the right of the mode), but it would be best described as a log-normal distribution.
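One rough way to check the log-normal description is to compare how closely the raw lengths and the log-transformed lengths track a normal Q-Q line. The sketch below is only an illustration: the helper name `compare_normality` is made up here, and it runs on simulated log-normal data rather than on the Austen paragraphs, so it says nothing about the actual results.

```r
# Hypothetical helper (not part of the original analysis): returns the
# correlation between sample quantiles and theoretical normal quantiles,
# once for the raw lengths and once for their logarithms.
# A value closer to 1 means the points sit closer to a straight Q-Q line.
compare_normality <- function(lengths) {
  lengths <- lengths[!is.na(lengths) & lengths > 0]
  qq_cor <- function(x) {
    qq <- qqnorm(x, plot.it = FALSE)  # theoretical (x) vs. sample (y) quantiles
    cor(qq$x, qq$y)
  }
  c(raw = qq_cor(lengths), log = qq_cor(log(lengths)))
}

# Simulated stand-in data, NOT the Austen paragraphs; parameters are arbitrary
set.seed(1)
simulated <- rlnorm(5000, meanlog = 5.5, sdlog = 0.9)
compare_normality(simulated)
```

If paragraph lengths really are close to log-normal, the `log` value should be nearer to 1 than the `raw` value when the same function is applied to the non-missing, non-zero `para_length` values computed below.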

Reproducible R Code

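# install.packages(c("janeaustenr", "tidyverse"))
library(janeaustenr)
library(tidyverse)  # provides dplyr (mutate, filter) and ggplot2 (including theme_classic)

# Peek at the raw data: one row per line of text, plus a `book` column
janeaustenr::austen_books() %>% as.data.frame() %>% head(200)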

# Assume it's a paragraph if it's followed by an empty line

df <- janeaustenr::austen_books()

df <- df %>% 
  mutate(char = nchar(text)) %>% 
  mutate(empty = ifelse(text == "", 1, 0))

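df$para_length <- NA_integer_
chars <- 0L

for (i in seq_len(nrow(df))) {

  if (df$empty[i] == 0) {
    # Non-empty line: add its characters to the paragraph in progress
    chars <- chars + df$char[i]
  } else {
    # Empty line: the paragraph has ended, so store its total on the previous row
    if (i > 1) df$para_length[i - 1] <- chars
    chars <- 0L
  }

  if (i %% 5000 == 0) print(i)  # progress indicator
}

# Store a final paragraph that is not followed by an empty line
if (chars > 0) df$para_length[nrow(df)] <- chars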

# Keep only the rows that carry a paragraph total
paras <- df %>%
  filter(!is.na(para_length), para_length != 0)

# All paragraphs
paras %>%
  ggplot(aes(para_length)) +
  geom_histogram(bins = 30, fill = "#F8766D") + theme_classic() +
  ggtitle("Paragraph length in Characters",
    subtitle = "All paragraphs")

# Drop paragraphs at or above the 99th percentile of length
paras %>%
  filter(para_length < quantile(para_length, 0.99)) %>%
  ggplot(aes(para_length)) +
  geom_histogram(bins = 30, fill = "#00BA38") + theme_classic() +
  ggtitle("Paragraph length in Characters",
    subtitle = "Without longest 1% of paragraphs")

# Drop paragraphs at or above the 90th percentile of length
paras %>%
  filter(para_length < quantile(para_length, 0.90)) %>%
  ggplot(aes(para_length)) +
  geom_histogram(bins = 30, fill = "#619CFF") + theme_classic() +
  ggtitle("Paragraph length in Characters",
    subtitle = "Without longest 10% of paragraphs")
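The row-by-row loop above can also be written without a loop, and the same grouping gives word counts as well as character counts (the question notes that length could be measured either way). This is a minimal sketch in base R, assuming paragraphs are runs of non-empty lines separated by empty strings; `paragraph_lengths` and the toy input are hypothetical names introduced for illustration, and splitting on whitespace is only one possible definition of a word.

```r
# Hypothetical helper: one row per paragraph, with its length in characters
# and in whitespace-separated words.
paragraph_lengths <- function(lines) {
  empty <- lines == ""
  # Every empty line bumps the counter, so all lines of one paragraph
  # share the same id
  para_id <- cumsum(empty)
  keep <- !empty
  # Words per line: trim, then split on runs of whitespace
  words <- lengths(strsplit(trimws(lines[keep]), "\\s+"))
  data.frame(
    chars = as.vector(tapply(nchar(lines[keep]), para_id[keep], sum)),
    words = as.vector(tapply(words, para_id[keep], sum))
  )
}

# Toy input: two paragraphs separated by two empty lines
toy <- c("It is a truth", "universally acknowledged,", "", "",
         "that a single man", "")
paragraph_lengths(toy)
```

As in the loop, character counts are sums of per-line `nchar()` values, so line breaks inside a paragraph are not counted. Passing `janeaustenr::austen_books()$text` should give one row per paragraph across all six novels.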