vecna/csv-2-mongodb.md

## csv-2-mongodb.md

      
Assuming there is a directory into which files get copied, the script below would:

- check the format of each new file against the expected one (at the moment, either the fbcrawl or the webXray format)
- run `mongoimport` on the CSV
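As a rough illustration of the format check, the sketch below tests whether the header line of a CSV lists a given column. `has_column` is a hypothetical helper, not part of the scripts in this document, and it is stricter than a plain `grep` because it matches a whole comma-separated field. `reactions` is the column the fbcrawl check relies on; no webXray header is assumed here, and the sample file is made up.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the header (first line) of the CSV file $1
# contains column $2 as a whole, unquoted, comma-separated field.
has_column() {
    head -n 1 "$1" | tr -d '\r' | tr ',' '\n' | grep -qxF -- "$2"
}

# Usage with a throwaway file (sample data, not real fbcrawl output).
tmp=$(mktemp)
printf 'post_id,reactions,text\n1,5,hello\n' > "$tmp"
if has_column "$tmp" reactions; then
    echo "header looks like fbcrawl"
else
    echo "unexpected header"
fi
rm -f "$tmp"
```

Matching a whole field avoids accepting a header that merely contains the word inside a longer column name.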

Files must be named following this pattern: `nameofwhatitis_date.csv`

- the script strips everything from the first underscore (`_`) to the end of the filename
- the remaining `nameofwhatitis` is used as the collection name, and identifies the data collected
- for fbcrawl, it should be the name of the page
- for webXray, it should be the name of the country analyzed, or the name of the site analyzed
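The naming rule above can be sketched with plain POSIX parameter expansion. `collection_name` is a hypothetical helper introduced only for illustration, and the filenames are made-up examples.

```shell
#!/bin/sh
# Hypothetical helper: derive the collection name from a CSV filename
# by dropping everything from the first underscore onwards.
collection_name() {
    base=$(basename "$1")        # strip any leading directory
    printf '%s\n' "${base%%_*}"  # remove the longest suffix starting at "_"
}

collection_name "newfiles/somepage_2019-03-01.csv"   # prints "somepage"
collection_name "italy_2019-03-01.csv"               # prints "italy"
```

A filename without any underscore is returned unchanged, extension included, so such a file would end up in a collection named after the whole filename.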

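```shell
#!/bin/sh -x
# Watch newfiles/ and import every well-formed CSV into the fbcrawl database.
mkdir -p newfiles/ archived/ errors/

while true;
do
    # close_write fires once the copy has finished, so the header is readable
    inotifywait -e close_write newfiles/
    newfile=`/bin/ls newfiles/ | head -1`
    # an fbcrawl export is recognised by the 'reactions,' column in its header
    if ! head -1 "newfiles/$newfile" | grep -q 'reactions,'; then
        echo "wrong format spotted in newfiles/$newfile";
        mv "newfiles/$newfile" errors/
    else
        # collection name: the filename up to the first underscore
        cname=`echo "$newfile" | sed -e 's/_.*//'`
        # unique index on post_id, so duplicate posts are rejected
        command='db["'$cname'"].createIndex({"post_id":1},{unique:true})'
        mongo fbcrawl --eval "$command"
        mongoimport -d fbcrawl -c "$cname" --file "newfiles/$newfile" --headerline --type=csv
        mv "newfiles/$newfile" archived/
    fi
    echo "processing of $newfile complete"
done
```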

## Manual import

The script above works well if one file is copied at a time. If several files are copied at once, some of them might remain unprocessed in the `newfiles/` directory. If the watcher is not working, or if a manual import is needed, this script might help:
```shell
#!/bin/sh
# Manual import: takes the full path of the CSV file as its only argument.
if [ -z "$1" ]; then
    echo "this script expects one argument: the full path of the .csv to be imported"
    exit 1
else
    newfile=`basename "$1"`
    # an fbcrawl export is recognised by the 'reactions,' column in its header
    if ! head -1 "$1" | grep -q 'reactions,'; then
        echo "wrong format spotted in $1, nothing done";
    else
        cname=`echo "$newfile" | sed -e 's/_.*//'`
        command='db["'$cname'"].createIndex({"post_id":1},{unique:true})'
        mongo fbcrawl --eval "$command"
        mongoimport -d fbcrawl -c "$cname" --file "$1" --headerline --type=csv
        echo "moving the file to the 'archived/' directory"
        mv "$1" archived/
    fi
    echo "processing of $1 complete"
fi
```
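
Since files can pile up in `newfiles/` when several are copied at once, one possible way to drain them is to loop over the leftovers and hand each one to the manual import script. The sketch below is only an assumption about how that could look: `import_pending` is a hypothetical helper, and the example runs `echo` as a dry run instead of a real import.

```shell
#!/bin/sh
# Hypothetical helper: run a command once for every CSV left in a directory.
# $1 is the directory, the remaining arguments are the command to run.
import_pending() {
    dir=$1
    shift
    for f in "$dir"/*.csv; do
        [ -e "$f" ] || continue   # the glob matched nothing
        "$@" "$f"
    done
}

# Dry run: print what would be imported instead of importing it.
import_pending newfiles echo "would import"
```

For a real run, the `echo "would import"` part would be replaced by the manual import script, saved under whatever name was chosen for it, and started from the directory that contains `archived/`.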