Skip to content

Instantly share code, notes, and snippets.

@archiewood
Created July 11, 2024 18:45
Show Gist options
  • Save archiewood/db477aae35255b6739e7adfefbfcae17 to your computer and use it in GitHub Desktop.
Save archiewood/db477aae35255b6739e7adfefbfcae17 to your computer and use it in GitHub Desktop.
Word frequency analysis in SQL
with cleaned_docs as (
select
tweet_id,
regexp_replace(regexp_replace(text, '\\n', ' ', 'g'), 'https[^\\s]+', '', 'g') as cleaned_text
from tweets
where text is not null
),
tokenized_docs as (
select
tweet_id,
unnest(string_split(cleaned_text, ' ')) as token
from cleaned_docs
),
stop_words as (
select
unnest(en) as word
from 'https://raw.githubusercontent.com/stopwords-iso/stopwords-iso/master/python/stopwordsiso/stopwords-iso.json'
),
filtered_tokens as (
select
tweet_id,
regexp_replace(lower(token), '[^a-z]', '', 'g') as token
from tokenized_docs
where
length(token) > 1
and token not in (select word from stop_words)
and token is not null
),
word_frequencies as (
select
token,
count(*) as freq
from filtered_tokens
where token is not null
group by token
order by freq desc
)
select * from word_frequencies;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment