psychemedia/Dockerfile

## README.md

      
    Audiogrep / Videogrep Tools

Docker image containing several tools for tinkering with audio and video files.
The Dockerfile is an edited version of the one in kevinhughes27/audiogrep-docker; it adds a patch, some additional utilities, and a shared folder. The other files from that repository are still required to build the image.
Original audiogrep docs here: antiboredom/audiogrep
See also these examples of what audiogrep can do.
audiogrep also makes use of:

cmusphinx/pocketsphinx for automatic transcription
pydub to splice files together

#create shared folder on host
mkdir -p files

#Build the image; the build context must also contain the other files from: https://github.com/kevinhughes27/audiogrep-docker
docker build -t psychemedia/avgrep .
#Transcribe an audio file
docker run --volume "${PWD}/files":/avgrepfiles --tty --interactive --rm psychemedia/avgrep  audiogrep --input /avgrepfiles/MYFILE.mp3 --transcribe
#The transcription seems to chunk the audio file and produce a separate transcript file for each chunk
#The audiogrep search seems to want a single transcript with a different filename
#Create the single transcript file in the shared folder, next to the audio file, so the container can see it
cat files/MYFILE*.txt >> files/MYFILE.mp3.transcription.txt

#Generate a supercut
docker run --volume "${PWD}/files":/avgrepfiles --tty --interactive --rm psychemedia/avgrep  audiogrep --input /avgrepfiles/MYFILE.mp3 --search 'transparency | honest | health' --output /avgrepfiles/supercut.mp3 --regex --output-mode word
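
The transcript-merging step above can be made repeatable with a small helper. This is a minimal sketch, not part of audiogrep: `merge_transcripts` is a hypothetical name, and it assumes the chunk transcripts match `MYFILE*.txt` and that the merged file belongs next to the audio file as `MYFILE.mp3.transcription.txt`. Chunks are joined in the shell's glob order, so it also assumes the chunk names sort in playback order.

```shell
# merge_transcripts DIR NAME: join the per-chunk transcripts DIR/NAME*.txt
# into DIR/NAME.mp3.transcription.txt (hypothetical helper, see assumptions above).
merge_transcripts() (
  dir=$1
  name=$2
  out="$dir/$name.mp3.transcription.txt"
  partial="$out.tmp"            # does not end in .txt, so the glob below skips it
  : > "$partial"                # start from an empty file on every run
  for f in "$dir/$name"*.txt; do
    if [ ! -f "$f" ]; then continue; fi       # the glob matched nothing
    if [ "$f" = "$out" ]; then continue; fi   # skip the merged file from an earlier run
    cat "$f" >> "$partial"
  done
  mv "$partial" "$out"
)
```

Calling `merge_transcripts files MYFILE` rebuilds the merged file from scratch each time, so rerunning it does not append the same chunks twice.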

videogrep is also included in the container, but untested. Original videogrep docs here: antiboredom/videogrep
See also this example of what videogrep can do.
To help grab files from YouTube, youtube_dl is also included in the container.
Usage is along the lines of:
docker run --volume "${PWD}/files":/avgrepfiles --tty --interactive --rm psychemedia/avgrep  youtube-dl --extract-audio --audio-format mp3 -o '/avgrepfiles/%(id)s.mp3' 'https://www.youtube.com/watch?v=YOUTUBE_ID'
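
Every command in this README repeats the same `docker run` prefix, and the container path in the mount has to match the paths passed to the tools. A shell function could keep that in one place. The sketch below assumes the image name from the build step and the `/avgrepfiles` mount point declared in the Dockerfile; `avgrep` is a hypothetical function name, not something the image provides.

```shell
# avgrep TOOL [ARGS...]: run a tool from the psychemedia/avgrep image with
# ./files on the host mounted at /avgrepfiles in the container.
avgrep() {
  docker run --volume "${PWD}/files":/avgrepfiles \
    --tty --interactive --rm psychemedia/avgrep "$@"
}
```

With that defined, transcription becomes `avgrep audiogrep --input /avgrepfiles/MYFILE.mp3 --transcribe`, and the other tools are called the same way.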

Using a couple of test audio files with UK English speakers, I couldn't replicate anything like the original demos. Transcription was poor, the timing seemed really off (and didn't match the searched-for words), and some of the splices were very long segments (minutes long). In the transcript, only single words seemed to be identified, so I'm not sure how phrase identification is supposed to work.
I haven't looked at the code, but it might be worth generating a few reports over the extracted words to help identify sensible phrases. Something like nltk concordancing relative to a single word or multiple words would add another dimension to the reporting, and help the user spot keyword-keyed phrases in the text rather than in the audio. (Adding the ability for the concordancer to act on OR'd words is a feature we can perhaps take away from audiogrep - I'll add it to my to do list! ;-)
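
As a rough illustration of the kind of report meant here, the sketch below prints each keyword with a few words of context on either side (a keyword-in-context listing) and accepts OR'd keywords. It is a plain awk stand-in, not nltk's concordancer; `kwic` is a hypothetical name. It assumes the transcript is just whitespace-separated words, with no punctuation or timing data attached to them.

```shell
# kwic PATTERN FILE [WIDTH]: list every word in FILE that matches PATTERN,
# with WIDTH words of context on each side (default 3).
# PATTERN is a lowercase extended regex, so 'a|b|c' ORs several keywords.
kwic() (
  pattern=$1
  file=$2
  width=${3:-3}
  awk -v pat="^(${pattern})\$" -v width="$width" '
    # Flatten the transcript into one running list of words.
    { for (i = 1; i <= NF; i++) words[++n] = $i }
    END {
      for (i = 1; i <= n; i++) {
        # Whole-word, case-insensitive match against the OR-ed keywords.
        if (tolower(words[i]) !~ pat) continue
        line = ""
        for (j = i - width; j <= i + width; j++) {
          if (j < 1 || j > n) continue
          word = words[j]
          if (j == i) word = "[" word "]"   # mark the keyword itself
          if (line != "") line = line " "
          line = line word
        }
        print line
      }
    }' "$file"
)
```

For example, `kwic 'transparency|honest|health' files/MYFILE.mp3.transcription.txt 5` would list every hit for any of the three words with five words of context on each side, which makes it easier to pick out candidate phrases before cutting any audio.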

  
## Dockerfile
#Based on https://github.com/kevinhughes27/audiogrep-docker
# DOCKER-VERSION 1.4.0
FROM ubuntu:14.04

RUN apt-get update
RUN apt-get install -y software-properties-common

# FFMPEG
#The ffmpeg repository used in the original Dockerfile needed updating
#Note that ffmpeg is not available in the standard Ubuntu 14.04 repositories: http://www.faqforge.com/linux/how-to-install-ffmpeg-on-ubuntu-14-04/
RUN apt-add-repository ppa:mc3man/trusty-media
RUN apt-get update
RUN apt-get install -y ffmpeg

# PocketSphinx
RUN apt-get install -y pocketsphinx-utils
RUN apt-get install -y pocketsphinx-hmm-wsj1
RUN apt-get install -y pocketsphinx-lm-wsj

# python
RUN apt-get install -y git python python-pip python-dev

# audiogrep
RUN git clone https://github.com/antiboredom/audiogrep.git
RUN cd audiogrep && pip install -r requirements.txt && \
 chmod +x audiogrep/audiogrep.py && cp audiogrep/audiogrep.py /usr/bin/audiogrep

#RUN pip install audiogrep

RUN pip install moviepy
RUN pip install videogrep


#Tools to support grabbing of a/v files
#youtube_dl via https://electricarchaeology.ca/2016/04/19/audiogrep/
RUN pip install youtube_dl

RUN mkdir -p /avgrepfiles
VOLUME /avgrepfiles