Jeff James jrjames83

4 qid:1 # 7555 rambo
3 qid:1 # 1370 rambo
3 qid:1 # 1369 rambo
2 qid:1 # 13258 rambo
4 qid:1 # 1368 rambo
1 qid:1 # 31362 rambo
1 qid:1 # 61410 rambo
0 qid:1 # 319074 rambo
0 qid:1 # 10296 rambo
0 qid:1 # 35868 rambo
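The lines above appear to be graded relevance judgments in the LETOR/SVMlight style: a relevance grade, a query id, and a trailing comment that seems to hold the document id and the query text ("rambo"). A minimal parsing sketch, not part of the original gist, assuming exactly that layout:

# Sketch only: parse "<grade> qid:<n> # <doc_id> <query text>" judgment lines.
def parse_judgment(line):
    left, _, comment = line.partition("#")
    grade, qid = left.split()                      # e.g. "4", "qid:1"
    doc_id, _, query = comment.strip().partition(" ")
    return int(grade), int(qid.split(":")[1]), doc_id, query

print(parse_judgment("4 qid:1 # 7555 rambo"))      # (4, 1, '7555', 'rambo')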

Me: is tf-idf sensitive to word order?

ChatGPT: TF-IDF (term frequency-inverse document frequency) is a numerical statistic that reflects how important a word is to a document within a collection of documents. It is commonly used in text analysis and information retrieval to score how relevant a document is to a particular query and to measure the similarity between documents.

TF-IDF is based only on the frequency of words within a document and across the collection, so it is not sensitive to word order. The TF-IDF score for a word is the same no matter where the word appears in the document or which words it appears next to; reordering a document's words leaves its TF-IDF representation unchanged.

For example, consider the following two sentences:
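A minimal illustration (my sketch, assuming scikit-learn's TfidfVectorizer is available; the two sentences are placeholders, not the ones from the original transcript): because TF-IDF ignores word order, both sentences below get identical vectors.

from sklearn.feature_extraction.text import TfidfVectorizer

# Two sentences containing the same words in a different order
docs = [
    "the dog chased the cat",
    "the cat chased the dog",
]

X = TfidfVectorizer().fit_transform(docs)

# TF-IDF is a bag-of-words statistic, so the two rows are identical.
print((X[0] - X[1]).nnz == 0)   # True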

jrjames83 / amazon_review_scraper.py
Created November 27, 2020 19:03
Scrape Amazon reviews using Python 3, BeautifulSoup and pandas.
import requests
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime
import logging
headers = {
    "authority": "www.amazon.com",
    "pragma": "no-cache",
    "cache-control": "no-cache",
    # (additional headers in the original gist are not shown here)
}
import string
# Some random document
document = """BigQuery sure makes life easier for data scientists. You can query data for insights, build high quality ML models and easily interface with other Google Cloud services."""
# Remove punctuation
doc_wo_punct = document.translate(str.maketrans('', '', string.punctuation))
# Some keywords we'd like to extract
keywords = ["bigquery", "ML", "insights", "SQL", "analysis"]
-- WITH keyphrases AS (
--   SELECT '\\b' || name || '\\b' as keyword
--   FROM `bigquery-public-data.usa_names.usa_1910_2013`
-- ), docs AS (
--   SELECT title as doc
--   FROM `bigquery-samples.reddit.full`