@MichaelSasser
Last active July 2, 2020 23:26

I used these scripts together with wordlist-dedup to safely deduplicate a whole collection of wordlists. In the dedup script, the tempdir variable originally pointed to a temporary directory on another hard drive to speed up sorting. In the file below it is set to /tmp; you can choose a different location if you like.
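As a minimal sketch of the temporary-directory idea, the function below lets the caller pick the scratch directory instead of hard-coding it. The names sort_with_tempdir and SORT_TMPDIR are assumptions made for this illustration and do not appear in the original scripts; only the -T and -o options of sort are relied upon.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the original scripts): sort a file while
# keeping sort's temporary files in a directory chosen by the caller.
# SORT_TMPDIR is an assumed variable name; it falls back to /tmp.
sort_with_tempdir() {
    local in="$1" out="$2"
    local tempdir="${SORT_TMPDIR:-/tmp}"
    # -T: directory for sort's temporary files (e.g. on a second drive)
    # -o: write the sorted result to this file
    sort -T "$tempdir" -o "$out" -- "$in"
}
```

It could be called as `SORT_TMPDIR=/mnt/scratch sort_with_tempdir in.txt out.txt`, where /mnt/scratch stands in for whatever mount point the other drive has.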

  • dedup does the actual work. Assume you have a file named filename.ext. The script first writes the sorted lines to filename_sorted.ext and then deduplicates that file into a third one, filename_sorted_dedup.ext.
  • to_txt renames the files so that they end in .txt and strips the _sorted_dedup suffix from their names.
  • cleanup deletes every file that is not a deduplicated (_sorted_dedup) file.

Move the three scripts into the folder whose files you want to deduplicate file by file.
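The naming scheme used by dedup can be sketched in isolation. The function derive_names below is a hypothetical helper written only for this illustration; it uses the same parameter expansions as the dedup script to print the two file names derived from an input path.

```bash
#!/usr/bin/env bash
# Hypothetical helper (illustration only): print the intermediate and final
# file names that the dedup script derives from an input path.
derive_names() {
    local path="$1"
    local directory filename extension stem
    directory=$(dirname -- "$path")
    filename=$(basename -- "$path")
    extension="${filename##*.}"   # text after the last dot
    stem="${filename%.*}"         # text before the last dot
    printf '%s\n' "${directory}/${stem}_sorted.${extension}"
    printf '%s\n' "${directory}/${stem}_sorted_dedup.${extension}"
}

derive_names "lists/filename.ext"
# prints:
#   lists/filename_sorted.ext
#   lists/filename_sorted_dedup.ext
```

Because only the last dot is treated as the extension separator, a name such as rockyou.2020.txt would keep rockyou.2020 as its stem.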
#!/usr/bin/env bash
# wordlist-dedup
# Copyright (c) 2020 Michael Sasser <Michael@MichaelSasser.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
echo "Current directory: $(pwd)"
shopt -s globstar
for file in **/*.*; do # Whitespace-safe and recursive
    # Skip directories whose names happen to contain a dot
    [[ -f "$file" ]] || continue
    # Keep the deduplicated results
    if [[ "$file" == *"_sorted_dedup."* ]]; then
        continue
    fi
    # Never delete the scripts themselves
    if [[ "$file" == "dedup" ]] || [[ "$file" == "cleanup" ]] || [[ "$file" == "to_txt" ]]; then
        continue
    fi
    echo "Deleting file: \"${file}\"."
    rm -- "$file"
done
#!/usr/bin/env bash
# wordlist-dedup
# Copyright (c) 2020 Michael Sasser <Michael@MichaelSasser.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
function dedup() {
    local fullfile directory filename extension in out_sort out_dedup tempdir
    fullfile=$(realpath "$1")
    directory=$(dirname "$fullfile")
    filename=$(basename -- "$fullfile")
    extension="${filename##*.}"
    filename="${filename%.*}"
    in="${directory}/${filename}.${extension}"
    out_sort="${directory}/${filename}_sorted.${extension}"
    out_dedup="${directory}/${filename}_sorted_dedup.${extension}"
    tempdir="/tmp" # Directory sort uses for its temporary files
    echo " -> Sorting"
    sort -T "$tempdir" -o "$out_sort" -- "$in" || exit
    sleep 1
    echo " -> Deduplicating lines"
    wordlist-dedup "$out_sort" "$out_dedup" || exit
}
echo "Current directory: $(pwd)"
shopt -s globstar
for file in **/*.*; do # Whitespace-safe and recursive
    # Skip directories whose names happen to contain a dot
    [[ -f "$file" ]] || continue
    if [[ "$file" == *"_sorted_dedup."* ]]; then
        echo "Ignoring file: \"${file}\" has already been processed."
        continue
    fi
    if [[ "$file" == "dedup" ]] || [[ "$file" == "cleanup" ]] || [[ "$file" == "to_txt" ]]; then
        echo "Ignoring file: \"${file}\" is one of these scripts; deduplicating it would be bad!"
        continue
    fi
    echo "Sorting and deduplicating lines in file: \"${file}\"."
    dedup "$file"
done
#!/usr/bin/env bash
# wordlist-dedup
# Copyright (c) 2020 Michael Sasser <Michael@MichaelSasser.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
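echo "Current directory: $(pwd)"
shopt -s globstar
for file in **/*.*; do # Whitespace-safe and recursive
    # Skip directories whose names happen to contain a dot
    [[ -f "$file" ]] || continue
    # Never rename the scripts themselves
    if [[ "$file" == "dedup" ]] || [[ "$file" == "cleanup" ]] || [[ "$file" == "to_txt" ]]; then
        continue
    fi
    fullfile=$(realpath "$file")
    directory=$(dirname "$fullfile")
    filename=$(basename -- "$fullfile")
    extension="${filename##*.}"
    filename="${filename%.*}"
    in="${directory}/${filename}.${extension}"
    # Strip a trailing "_sorted_dedup" from the name and use the .txt extension
    out="${directory}/${filename%"_sorted_dedup"}.txt"
    # Nothing to do if the file already has its final name
    [[ "$in" == "$out" ]] && continue
    echo "Renaming file: \"${in}\" -> \"${out}\"."
    mv -- "$in" "$out"
done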