@stevecondylios
Last active June 5, 2022 05:48
Are paragraph lengths normally distributed?

Question: Does paragraph length follow a normal distribution?

Many naturally occurring things follow a normal distribution.

I'm curious whether paragraph lengths follow a normal distribution too.

Note: paragraph length could be measured in characters or in words.
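
To make the two measures concrete, here is a minimal base R sketch. The helper `count_words()` and the sample paragraph are made up for illustration, and splitting on whitespace is only one possible definition of a word.

```r
# Hypothetical helper: count words by splitting on runs of whitespace
count_words <- function(x) {
  lengths(strsplit(trimws(x), "\\s+"))
}

# Made-up sample paragraph (not taken from the novels)
para <- "This is a short sample paragraph. It has two sentences."

nchar(para)        # length in characters, spaces and punctuation included
count_words(para)  # length in words
```

The analysis below uses the character measure only.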

Answer:

I took a look at the character lengths of paragraphs in Jane Austen's six major novels. That's by no means representative of English literature generally, but it's a start (here is reproducible R code). Here are the results.

First, here is the distribution of character counts across every paragraph of all six novels:

[Figure: histogram of paragraph length in characters, all paragraphs]

Since a few extremely long paragraphs appear to skew the distribution to the right, the next plot shows what happens when the longest 1% of paragraphs are excluded:

[Figure: histogram of paragraph length in characters, without the longest 1% of paragraphs]

And this looks at what happens when the longest 10% are excluded. The distribution is getting more normal, although I'm not sure it can quite be called normal. It does start to exhibit some normal-ish characteristics: the mode is closer to the center, and counts are mostly increasing to the left of the mode and mostly decreasing to the right of it. Overall, it would be best described as a log-normal distribution.

[Figure: histogram of paragraph length in characters, without the longest 10% of paragraphs]
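
One rough way to check the log-normal idea is to see whether the logarithm of the lengths looks normal. The sketch below is not part of the original analysis. It uses simulated log-normal numbers as a stand-in for paragraph lengths (the parameters are arbitrary, not estimated from the novels), and the helper `ppcc()` is a hypothetical name for the correlation between the sorted values and normal quantiles, which sits close to 1 for roughly normal data.

```r
# Hypothetical helper: probability-plot correlation against the normal distribution
ppcc <- function(x) {
  x <- sort(x)
  cor(x, qnorm(ppoints(length(x))))
}

set.seed(1)
# Simulated stand-in for paragraph lengths (NOT the Austen data)
sim_lengths <- rlnorm(2000, meanlog = 5.5, sdlog = 0.9)

ppcc(sim_lengths)       # raw scale: pulled down by the long right tail
ppcc(log(sim_lengths))  # log scale: should be much closer to 1

# Visual check on the log scale
qqnorm(log(sim_lengths))
qqline(log(sim_lengths))
```

With the real data, the same calls could be run on the non-missing, non-zero `para_length` values produced by the code below.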

Reproducible R Code

# install.packages('janeaustenr')
library(janeaustenr)
library(tidyverse) # dplyr verbs and ggplot2 (theme_classic() comes from ggplot2)

janeaustenr::austen_books() %>% as.data.frame %>% head(200)

# Assume it's a paragraph if it's followed by an empty line

df <- janeaustenr::austen_books()

df <- df %>% 
  mutate(char = nchar(text)) %>% 
  mutate(empty = ifelse(text == "", 1, 0))

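df$para_length <- NA_real_
chars <- 0

# Walk the lines in order, summing characters until an empty line closes the paragraph.
# Note: a final paragraph with no empty line after it is never recorded.
for(i in 1:nrow(df)){

  if(df$empty[i] == 0){
    # df$char[i] is a plain number; df[i, "char"] would return a 1x1 tibble
    chars <- chars + df$char[i]
  }

  # Store the total on the paragraph's last line, then reset the counter
  if(df$empty[i] == 1 && i > 1){
    df$para_length[i-1] <- chars
    chars <- 0
  }

  if(i %% 5000 == 0) { print(i)} # progress indicator
}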

df %>% 
  filter(!is.na(para_length), para_length != 0) %>% 
  ggplot(aes(para_length)) +
  geom_histogram(fill = "#F8766D") + theme_classic() +
  ggtitle("Paragraph length in Characters",
    subtitle="All paragraphs") 

df %>% 
  filter(!is.na(para_length), para_length != 0) %>% 
  filter(quantile(para_length, 0.99) > para_length) %>% 
  ggplot(aes(para_length)) +
  geom_histogram(fill = "#00BA38") + theme_classic() +
  ggtitle("Paragraph length in Characters",
    subtitle="Without longest 1% of paragraphs") 

df %>% 
  filter(!is.na(para_length), para_length != 0) %>% 
  filter(quantile(para_length, 0.90) > para_length) %>% 
  ggplot(aes(para_length)) +
  geom_histogram(fill="#619CFF") + theme_classic() +
  ggtitle("Paragraph length in Characters",
    subtitle="Without longest 10% of paragraphs") 
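
The row-by-row loop above is slow on tens of thousands of lines. As an alternative, here is a minimal vectorised sketch of the same "an empty line ends a paragraph" rule in base R, shown on a made-up vector of lines rather than the novels. The names `lines`, `para_id` and `keep` are illustrative. Unlike the loop, this version also counts a final paragraph that has no empty line after it, and for the real data one would probably want to group by `book` as well so that paragraphs cannot run across novels.

```r
# Made-up input: text lines, with "" marking paragraph breaks
lines <- c("First line of para one,", "second line.", "",
           "Para two.", "", "",
           "Para three, no trailing blank")

# Each empty line bumps the counter, so all lines of one paragraph share an id
para_id <- cumsum(lines == "")

# Drop the empty lines, then sum characters within each paragraph id
keep <- lines != ""
para_length <- tapply(nchar(lines[keep]), para_id[keep], sum)

para_length
```

Consecutive empty lines simply skip an id, so no zero-length paragraphs are produced and no `para_length != 0` filter is needed.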