@gd3kr
Created February 15, 2024 06:30
Download a JSON List of twitter bookmarks
/*
the twitter api is stupid. it is stupid and bad and expensive. hence, this.
Literally just paste this in the JS console on the bookmarks tab and the script will automatically scroll to the bottom of your bookmarks and keep a track of them as it goes.
When finished, it downloads a JSON file containing the raw text content of every bookmark.
for now it stores just the text inside the tweet itself, but if you're reading this why don't you go ahead and try to also store other information (author, tweetLink, pictures, everything). come on. do it. please?
*/
let tweets = []; // Text of every bookmarked tweet collected so far
const scrollInterval = 1000;
const scrollStep = 5000; // Pixels to scroll on each step
let previousTweetCount = 0;
let unchangedCount = 0;
const scrollToEndIntervalID = setInterval(() => {
    window.scrollBy(0, scrollStep);
    const currentTweetCount = tweets.length;
    if (currentTweetCount === previousTweetCount) {
        unchangedCount++;
        if (unchangedCount >= 2) { // Stop once the count has stayed unchanged for 2 consecutive checks
            console.log('Scraping complete');
            console.log('Total tweets scraped: ', tweets.length);
            console.log('Downloading tweets as JSON...');
            clearInterval(scrollToEndIntervalID); // Stop scrolling
            observer.disconnect(); // Stop observing DOM changes
            downloadTweetsAsJson(tweets); // Download the tweets list as a JSON file
        }
    } else {
        unchangedCount = 0; // Reset counter if new tweets were added
    }
    previousTweetCount = currentTweetCount; // Update previous count for the next check
}, scrollInterval);
function updateTweets() {
    document.querySelectorAll('[data-testid="tweetText"]').forEach(tweetElement => {
        const tweetText = tweetElement.innerText; // Extract text content
        if (!tweets.includes(tweetText)) { // Check if the tweet's text is not already in the array
            tweets.push(tweetText); // Add new tweet's text to the array
            console.log("tweets scraped: ", tweets.length);
        }
    });
}
// Initially populate the tweets array
updateTweets();
// Create a MutationObserver to observe changes in the DOM
const observer = new MutationObserver(mutations => {
    mutations.forEach(mutation => {
        if (mutation.addedNodes.length) {
            updateTweets(); // Call updateTweets whenever new nodes are added to the DOM
        }
    });
});
// Start observing the document body for child list changes
observer.observe(document.body, { childList: true, subtree: true });
function downloadTweetsAsJson(tweetsArray) {
    const jsonData = JSON.stringify(tweetsArray); // Convert the array to JSON
    const blob = new Blob([jsonData], { type: 'application/json' });
    const url = URL.createObjectURL(blob);
    const link = document.createElement('a');
    link.href = url;
    link.download = 'tweets.json'; // Specify the file name
    document.body.appendChild(link); // Append the link to the document
    link.click(); // Programmatically click the link to trigger the download
    document.body.removeChild(link); // Clean up and remove the link
}
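
The script stops once the tweet count has stayed the same for a set number of consecutive checks (2 in the code above). That stop condition can be isolated into a small function, which makes the threshold easy to raise if bookmarks load slowly. This is only a minimal sketch: `createStallDetector` and `maxUnchangedChecks` are hypothetical names that do not appear in the gist.

```javascript
// Hypothetical helper, not part of the gist: reports when a counter has
// stopped growing for `maxUnchangedChecks` consecutive checks.
function createStallDetector(maxUnchangedChecks) {
    let previousCount = 0;   // Same starting value as previousTweetCount in the gist
    let unchangedChecks = 0;
    return function shouldStop(currentCount) {
        // Count consecutive checks with no growth; any change resets the streak.
        unchangedChecks = currentCount === previousCount ? unchangedChecks + 1 : 0;
        previousCount = currentCount;
        return unchangedChecks >= maxUnchangedChecks;
    };
}

// Possible use inside the setInterval callback (assumption, shown as a comment
// because it needs the browser page to run):
//   const shouldStop = createStallDetector(5);
//   if (shouldStop(tweets.length)) { /* clearInterval, disconnect, download */ }
```

A larger threshold makes the script wait longer before giving up, at the cost of a slower finish.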
@premrajnarkhede

function updateTweets() {
    document.querySelectorAll('article[data-testid="tweet"]').forEach(tweetElement => {
        const authorName = tweetElement.querySelector('[data-testid="User-Name"]').innerText;
        const handle = tweetElement.querySelector('[role="link"]').href.split('/').pop();
        const tweetText = tweetElement.querySelector('[data-testid="tweetText"]').innerText;
        const time = tweetElement.querySelector('time').getAttribute('datetime');

        const retweets = tweetElement.querySelector('[data-testid="retweet"]').innerText;
        const likes = tweetElement.querySelector('[data-testid="like"]').innerText;
        const replies = tweetElement.querySelector('[data-testid="reply"]').innerText;

        // Compare against the stored tweetText field (the pushed objects have no `text` property)
        const isTweetNew = !tweets.some(tweet => tweet.tweetText === tweetText);
        if (isTweetNew) {
            tweets.push({
                authorName,
                handle,
                tweetText,
                time,
                retweets,
                likes,
                replies
            });
            console.log("tweets scraped: ", tweets.length);
        }
    });
}

Use this to capture the other details as well. Thanks @gd3kr for the amazingly simple code.

@luighifeodrippe

I made some updates to the X scraping script to improve interaction data extraction and accuracy. A future update could include extracting the URL of images or videos within the tweet.

1. Comprehensive Interaction Data Extraction: The script now effectively captures detailed interaction metrics for each tweet, including replies, reposts, likes, bookmarks, and views, by filtering for relevant aria-labels.

2. Robust Parsing Logic: Introduced a new function, extractInteractionDataFromString, utilizing regular expressions to parse interaction metrics from the aria-label text, ensuring accurate numeric conversion of interaction data.

3. Keyword-Based Numeric Extraction: Added an auxiliary function, extractNumberForKeyword, to precisely extract numerical values associated with specific interaction terms, enhancing the script's parsing capability.

4. Refined Tweet Uniqueness Check: Improved the logic for identifying new tweets, reducing duplicates and ensuring a cleaner dataset.

5. Maintained JSON Export Functionality: Preserved the functionality to compile and download the scraped data as a JSON file, facilitating easy data export and analysis.
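
To illustrate points 2 and 3, here is a minimal sketch of keyword-based parsing of an aria-label string. It is not the code from the linked gist: the function names are borrowed from the description above, but the bodies, the `KEYWORD_PATTERNS` table, and the assumed label format ("12 replies, 3 reposts, 1,024 likes, ...") are all assumptions made for illustration.

```javascript
// Hypothetical keyword table: regex fragments for the singular and plural
// form of each interaction term. The exact wording used by X is an assumption.
const KEYWORD_PATTERNS = {
    replies: 'repl(?:y|ies)',
    reposts: 'reposts?',
    likes: 'likes?',
    bookmarks: 'bookmarks?',
    views: 'views?',
};

// Return the number that directly precedes the keyword, or 0 if it is absent.
function extractNumberForKeyword(label, keywordPattern) {
    const regex = new RegExp(`(\\d[\\d,]*)\\s+(?:${keywordPattern})\\b`, 'i');
    const match = label.match(regex);
    // Strip thousands separators before converting to an integer.
    return match ? parseInt(match[1].replace(/,/g, ''), 10) : 0;
}

// Build an object with one numeric field per interaction term.
function extractInteractionDataFromString(label) {
    const result = {};
    for (const [name, pattern] of Object.entries(KEYWORD_PATTERNS)) {
        result[name] = extractNumberForKeyword(label, pattern);
    }
    return result;
}
```

Terms missing from the label come back as 0, so a tweet with no reposts still yields a complete object.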

A huge thanks to @gd3kr and @premrajnarkhede for your insights and contributions that inspired these enhancements!

https://gist.github.com/luighifeodrippe/6e67c74bf5123ee0763cc0dd78b9411d

@typicallyze

Is it possible to export the links of the tweets as well?

@TarasShu

could you make a tutorial how to use it?

@iamhitarth

iamhitarth commented Feb 17, 2024

> Is it possible to export the links of the tweets as well?

@typicallyze you can add the following line to the updateTweets function in @premrajnarkhede's script, and it will capture the tweet link as well:

const link = tweetElement.querySelector('a[href*="/status/"]').href;

Then add link to the object being pushed to tweets (after replies), and you'll be good to go.
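
One caveat: the first anchor matching `/status/` may not be the bare tweet URL, and `querySelector` returns null when nothing matches. A minimal sketch of a helper that guards against both cases is below. `toPermalink` is a hypothetical name, and the URL shape `https://<host>/<handle>/status/<id>` is an assumption about how X formats these links.

```javascript
// Hypothetical helper: reduce any status-related href to the tweet permalink.
// Returns null when the input is missing or does not look like a status URL.
function toPermalink(href) {
    if (!href) return null;
    // Capture origin, handle, and numeric tweet id; ignore anything after the id
    // (for example "/photo/1" or "/analytics").
    const match = href.match(/^(https?:\/\/[^/]+)\/([^/]+)\/status\/(\d+)/);
    return match ? `${match[1]}/${match[2]}/status/${match[3]}` : null;
}

// Possible use inside updateTweets (shown as a comment because it needs the page DOM):
//   const link = toPermalink(tweetElement.querySelector('a[href*="/status/"]')?.href);
```

The optional chaining in the usage comment keeps a tweet without a matching anchor from throwing a TypeError.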

@iamhitarth

> could you make a tutorial how to use it?

It's pretty simple, @TarasShu.

  1. Go to your bookmarks page on Twitter.
  2. Press Command + Option + C (Ctrl + Shift + C on Windows or Linux) to open the Inspector.
  3. Switch to the Console tab near the top of the Inspector.
  4. Paste in the full script from above and press Enter.
  5. Wait while the script scrolls through your bookmarks. When it reaches the end, tweets.json is downloaded to your machine.

@BromigoTools

BromigoTools commented Apr 21, 2024

> Is it possible to export the links of the tweets as well?

Hello. With the help of ChatGPT, I combined the code from this gist and its comments so that it also collects links. I ran into some errors along the way and fed those back into ChatGPT as well. Here is the final code, which works, although some tweets were not collected and I don't know why. It may still be useful for further development.

I don't understand this code myself, and I would be glad if someone could tell me how to collect all tweets and verify that 100% of them were collected.

let tweets = []; // Initialize an empty array to hold all tweet elements

const scrollInterval = 1000;
const scrollStep = 5000; // Pixels to scroll on each step

let previousTweetCount = 0;
let unchangedCount = 0;

const scrollToEndIntervalID = setInterval(() => {
    window.scrollBy(0, scrollStep);
    const currentTweetCount = tweets.length;
    if (currentTweetCount === previousTweetCount) {
        unchangedCount++;
        if (unchangedCount >= 2) { // Stop once the count has stayed unchanged for 2 consecutive checks
            console.log('Scraping complete');
            console.log('Total tweets scraped: ', tweets.length);
            console.log('Downloading tweets as JSON...');
            clearInterval(scrollToEndIntervalID); // Stop scrolling
            observer.disconnect(); // Stop observing DOM changes
            downloadTweetsAsJson(tweets); // Download the tweets list as a JSON file
        }
    } else {
        unchangedCount = 0; // Reset counter if new tweets were added
    }
    previousTweetCount = currentTweetCount; // Update previous count for the next check
}, scrollInterval);


function updateTweets() {
    document.querySelectorAll('article[data-testid="tweet"]').forEach(tweetElement => {
        const authorNameElement = tweetElement.querySelector('[data-testid="User-Name"]');
        const handleElement = tweetElement.querySelector('[role="link"]');
        const tweetTextElement = tweetElement.querySelector('[data-testid="tweetText"]');
        const timeElement = tweetElement.querySelector('time');
        const retweetsElement = tweetElement.querySelector('[data-testid="retweet"]');
        const likesElement = tweetElement.querySelector('[data-testid="like"]');
        const repliesElement = tweetElement.querySelector('[data-testid="reply"]');
        const linkElement = tweetElement.querySelector('a[href*="/status/"]');

        // Check if all required elements exist
        if (authorNameElement && handleElement && tweetTextElement && timeElement && retweetsElement && likesElement && repliesElement && linkElement) {
            const authorName = authorNameElement.innerText;
            const handle = handleElement.href.split('/').pop();
            const tweetText = tweetTextElement.innerText;
            const time = timeElement.getAttribute('datetime');
            const retweets = retweetsElement.innerText;
            const likes = likesElement.innerText;
            const replies = repliesElement.innerText;
            const link = linkElement.href;

            const isTweetNew = !tweets.some(tweet => tweet.tweetText === tweetText);
            if (isTweetNew) {
                tweets.push({
                    authorName,
                    handle,
                    tweetText,
                    time,
                    retweets,
                    likes,
                    replies,
                    link
                });
                console.log("tweets scraped: ", tweets.length);
            }
        }
    });
}

// Initially populate the tweets array
updateTweets();

// Create a MutationObserver to observe changes in the DOM
const observer = new MutationObserver(mutations => {
    mutations.forEach(mutation => {
        if (mutation.addedNodes.length) {
            updateTweets(); // Call updateTweets whenever new nodes are added to the DOM
        }
    });
});

// Start observing the document body for child list changes
observer.observe(document.body, { childList: true, subtree: true });

function downloadTweetsAsJson(tweetsArray) {
    const jsonData = JSON.stringify(tweetsArray); // Convert the array to JSON
    const blob = new Blob([jsonData], { type: 'application/json' });
    const url = URL.createObjectURL(blob);
    const link = document.createElement('a');
    link.href = url;
    link.download = 'tweets.json'; // Specify the file name
    document.body.appendChild(link); // Append the link to the document
    link.click(); // Programmatically click the link to trigger the download
    document.body.removeChild(link); // Clean up and remove the link
}
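
On the question of why some tweets go missing: two things in the script above can drop bookmarks. The `if` check skips any tweet that lacks one of the eight elements (for example, a tweet with no text), and the uniqueness check treats two different tweets with identical text as one. Whether these account for every missing tweet is an assumption. A minimal sketch of a uniqueness check keyed on the tweet link is below. `addTweet` and the `Map` store are hypothetical names that are not in the script.

```javascript
// Hypothetical helper: store tweets in a Map keyed by permalink, so two
// tweets with the same text but different links are both kept.
function addTweet(store, tweet) {
    // Fall back to a composite key when no link was captured.
    const key = tweet.link ?? `${tweet.handle}|${tweet.time}|${tweet.tweetText}`;
    if (store.has(key)) return false; // Already collected
    store.set(key, tweet);
    return true;
}

// Possible use (assumption): replace the `tweets` array with
//   const store = new Map();
// call addTweet(store, {...}) inside updateTweets, and export with
//   downloadTweetsAsJson([...store.values()]);
```

Comparing the final `store.size` with the number of bookmarks you expect is one rough way to check how complete the export is.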
