Assuming there is a directory into which files get copied, the scripts below:
- check that the file format matches the expected one (currently either the fbcrawl or the webXray format)
- run mongoimport on the CSV
Files must follow this naming convention: nameofwhatitis_date.csv
- the script removes everything from the first underscore (_) to the end
- the remaining nameofwhatitis is used as the collection name, and identifies the data collected:
- for fbcrawl, it should be the name of the page
- for webXray, it should be the name of the country or of the site analyzed
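For example, the collection name can be derived from the filename like this (the filename used here is hypothetical):

```shell
# hypothetical filename following the nameofwhatitis_date.csv convention
newfile="BBCnews_2019-03-01.csv"
# strip everything from the first underscore to the end, as the scripts below do
cname=`echo "$newfile" | sed -e 's/_.*//'`
echo "$cname"   # prints "BBCnews"
```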
#!/bin/sh -x
mkdir -p newfiles/
mkdir -p archived/
mkdir -p errors/
while true
do
    inotifywait -e create newfiles/
    newfile=`/bin/ls newfiles/ | head -1`
    verify=`head -1 "newfiles/$newfile" | grep 'reactions,'`
    if [ -z "$verify" ]; then
        echo "wrong format spotted in newfiles/$newfile"
        mv "newfiles/$newfile" errors/
    else
        cname=`echo "$newfile" | sed -e 's/_.*//'`
        command='db["'$cname'"].createIndex({"post_id":1},{unique:true})'
        mongo fbcrawl --eval "$command"
        mongoimport -d fbcrawl -c "$cname" --file "newfiles/$newfile" --headerline --type=csv
        mv "newfiles/$newfile" archived/
    fi
    echo "processing of $newfile complete"
done
The script above works well if one file is copied at a time; if several arrive at once, some files might remain unprocessed in the 'newfiles/' directory. In case the watcher above is not working, or a manual import is needed, this script might help:
#!/bin/sh
if [ -z "$1" ]; then
    echo "this script expects one argument: the full path of the .csv to be imported"
    exit 1
else
    newfile=`basename "$1"`
    verify=`head -1 "$1" | grep 'reactions,'`
    if [ -z "$verify" ]; then
        echo "wrong format spotted in $1, nothing done"
    else
        cname=`echo "$newfile" | sed -e 's/_.*//'`
        command='db["'$cname'"].createIndex({"post_id":1},{unique:true})'
        mongo fbcrawl --eval "$command"
        mongoimport -d fbcrawl -c "$cname" --file "$1" --headerline --type=csv
        echo "moving the file to the 'archived/' directory"
        mv "$1" archived/
    fi
    echo "processing of $1 complete"
fi
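The format check used by both scripts can be tried in isolation. This sketch writes a minimal, hypothetical fbcrawl-style CSV to a temporary file and verifies that its header contains the 'reactions,' column:

```shell
# hypothetical two-line CSV with an fbcrawl-style header
printf 'post_id,reactions,comments\n123,10,2\n' > /tmp/demo_fbcrawl.csv
verify=`head -1 /tmp/demo_fbcrawl.csv | grep 'reactions,'`
if [ -z "$verify" ]; then
    echo "wrong format"
else
    echo "format ok"   # this branch is taken for the header above
fi
rm /tmp/demo_fbcrawl.csv
```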