
@vecna
Last active May 1, 2019 07:43
Collection of commands and scripts. Goal: import CSV files into MongoDB and generate aggregated results as static CSV|JSON files served via nginx.

Assuming there is a directory into which files get copied, the scripts below:

  • first check the file's format against the expected one (which can be the format of fbcrawl or webXray, at the moment)
  • run mongoimport on the CSV
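
As a minimal sketch of the format check, assuming the only marker is the 'reactions,' string that the scripts in this gist look for in the header line: the helper name is_fbcrawl_csv is hypothetical, and no webXray check is shown because its expected header is not given here.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the first line of the file given as $1
# contains 'reactions,' (the marker the gist's scripts use for fbcrawl).
is_fbcrawl_csv() {
    head -1 "$1" | grep -q 'reactions,'
}
```

It could be used as `if is_fbcrawl_csv "$file"; then ... fi` before calling mongoimport.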

Files must be named in this format: nameofwhatitis_date.csv

  • the script removes everything from the first underscore (_) to the end of the name
  • the remaining nameofwhatitis is used as the MongoDB collection name, and identifies the data collected
  • for fbcrawl, it should be the name of the page
  • for webXray, it should be the name of the country analyzed, or the name of the site analyzed
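
The naming rule above can be sketched as a small function. It mirrors the sed expression used by the import scripts; the function name collection_name and the sample file name are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical helper: strip the directory part, then drop everything
# from the first underscore onward, as the import scripts do.
collection_name() {
    basename "$1" | sed -e 's/_.*//'
}

collection_name "newfiles/somepage_2019-05-01.csv"   # prints: somepage
```

Note that a file name without any underscore is left unchanged, extension included, so it is worth sticking to the nameofwhatitis_date.csv pattern.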
#!/bin/sh -x
# Watch newfiles/ and import each fbcrawl CSV that lands there into MongoDB.
# -p: do not fail if the directories already exist
mkdir -p newfiles/ archived/ errors/

while true;
do
    # block until a file is created in newfiles/
    inotifywait -e create newfiles/
    newfile=`/bin/ls newfiles/ | head -1`
    # the header line must contain 'reactions,' (the fbcrawl format)
    if ! head -1 "newfiles/$newfile" | grep -q 'reactions,'; then
        echo "wrong format spotted in newfiles/$newfile"
        mv "newfiles/$newfile" errors/
    else
        # collection name: the file name up to the first underscore
        cname=`echo "$newfile" | sed -e 's/_.*//'`
        command='db["'"$cname"'"].createIndex({"post_id":1},{unique:true})'
        mongo fbcrawl --eval "$command"
        mongoimport -d fbcrawl -c "$cname" --file "newfiles/$newfile" --headerline --type=csv
        mv "newfiles/$newfile" archived/
    fi
    echo "processing of $newfile complete"
done

manual import

The script above works well if one file is copied at a time; if more than one is copied at once, some files might remain unprocessed in the 'newfiles/' directory. If the watcher above is not working, or if we have to do a manual import, this script might help:

# Manual import of a single fbcrawl CSV, given as the first argument.
if [ -z "$1" ]; then
    echo "this script expects one argument: the full path of the .csv to be imported"
    exit 1
else
    newfile=`basename "$1"`
    # the header line must contain 'reactions,' (the fbcrawl format)
    if ! head -1 "$1" | grep -q 'reactions,'; then
        echo "wrong format spotted in $1, nothing done"
    else
        # collection name: the file name up to the first underscore
        cname=`echo "$newfile" | sed -e 's/_.*//'`
        command='db["'"$cname"'"].createIndex({"post_id":1},{unique:true})'
        mongo fbcrawl --eval "$command"
        mongoimport -d fbcrawl -c "$cname" --file "$1" --headerline --type=csv
        echo "moving the file to the 'archived/' directory"
        mv "$1" archived/
    fi
    echo "processing of $1 complete"
fi
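
For files left behind in 'newfiles/' when several were copied at once, one possible approach is to loop over them and hand each to the manual import script. This is only a sketch: the function name import_leftovers is hypothetical, and the importer is passed as an argument because the gist does not say under which file name the manual import script is saved.

```shell
#!/bin/sh
# Hypothetical sketch: run an importer command on every .csv in a directory.
# $1 = directory to scan
# $2 = importer command (assumed: the manual import script, under whatever name you saved it)
import_leftovers() {
    for f in "$1"/*.csv; do
        [ -e "$f" ] || continue    # the glob matched nothing
        "$2" "$f"
    done
}
```

With the manual import script saved as, say, ./manual-import.sh (an assumed name), the call would be `import_leftovers "$PWD/newfiles" ./manual-import.sh`, run from the directory that contains 'archived/'.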