VaSQL

Programming with Microphone and Wearable Gloves using Speech, Language, and Gesture Recognition

Introduction

Programming can often be a tedious and repetitive task, and when continued for long hours every day it may eventually cause Repetitive Strain Injury (RSI). RSI is a serious problem that causes pain in the hands and loss of productivity. As jobs in the software industry keep growing (India being the second largest producer of programmers), the number of potential RSI cases grows every year at an alarming rate, and steps need to be taken to address the root of the problem. In addition, the sedentary lifestyle that comes with such work has contributed to a rise in chronic diseases such as obesity, diabetes, hypertension, and cardiovascular disease since the 1990s. One such step could be to replace traditional computer input devices, like the keyboard and mouse, with emerging technologies such as automatic speech recognition and wearable devices (watches, gloves, etc.), enabling people to be more active at work.

Current work

Research in speech recognition began in 1952 at Bell Labs. In the 1960s, Reddy developed one of the first continuous speech recognition systems [1]. His research group went on to use Hidden Markov Models (HMMs), which proved highly effective and became the dominant approach in the 1980s. Members of the group later founded Dragon Systems, whose Dragon NaturallySpeaking (DNS) remains one of the most accurate commercial speech recognition products. Huang developed the Sphinx-II system, the first to perform speaker-independent, large-vocabulary, continuous speech recognition [2]. Dragon's technology was also used with Apple's digital assistant Siri. Since 2006, the US National Security Agency has made use of speech recognition for keyword spotting. Today, many aspects of speech recognition have been taken over by deep learning, along with Long Short-Term Memory (LSTM), a recurrent neural network by Hochreiter and Schmidhuber (1997) [3]. The use of deep feedforward (non-recurrent) networks for acoustic modeling, introduced by Hinton and Deng (2009) in work with Microsoft Research, reduced the word error rate by 30% and was quickly adopted across the field [4].

There have also been a few notable attempts at programming by speech. One was developed by Rudd (2013), who writes Python code in the Emacs editor using a voice setup built on Dragon NaturallySpeaking, Dragonfly, and several custom commands that enable code browsing, selection, editing, and templates [5]. Another is Silvius, an open source system for writing programs by voice developed by Williams-King (2016) using the Kaldi speech recognition toolkit and the VoxForge and TED-LIUM speech models [6]. Regarding wearable input devices, Rowberg (2015) has developed KeyGlove, a wearable, wireless, open-source input device. It uses customizable touch combinations and gestures to enter text, control the mouse, switch between applications, perform multiple operations with a single action, and even play games. Built for physically relaxed single-handed operation, it is also well suited for handicapped or disabled users and for those who are prone to or suffer from RSI.
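As a rough illustration of how such voice setups are wired, the sketch below defines a small Dragonfly grammar. The spoken phrases and the templates they insert are hypothetical examples written for this proposal, not commands taken from Rudd's or Williams-King's configurations, and the grammar assumes a Dragonfly-compatible speech engine (e.g. DNS via Natlink, or Kaldi) is already running.

```python
# A minimal, hypothetical Dragonfly grammar for code dictation.
# The phrases and templates are illustrative only.
from dragonfly import Grammar, MappingRule, Dictation, Text, Key

class CodeDictationRule(MappingRule):
    mapping = {
        # spoken phrase             -> action performed in the active editor
        "if block":                   Text("if () {}") + Key("left:4"),
        "select star from <table>":   Text("SELECT * FROM %(table)s;"),
        "save file":                  Key("c-s"),
        "scroll down":                Key("pgdown"),
    }
    extras = [Dictation("table")]   # free dictation captured into %(table)s

grammar = Grammar("code dictation")
grammar.add_rule(CodeDictationRule())
grammar.load()   # registered with whatever engine Dragonfly is configured for
```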

Objective and Scope of study

While the systems mentioned above provide an excellent way to program by voice, they rely entirely on it. This can lead to fatigue of the vocal cords, and so they cannot be used for long programming sessions. Since the objective here is to minimize repetitive strain on the fingers and to remove the necessity of sitting in front of a computer (reducing sedentary work), the proposal is to build a system that combines speech recognition with wearable technology (a wireless glove) to assist in programming, as an alternative to traditional keyboard and mouse input.

The scope of this study covers the following programming languages: SQL, JavaScript, HTML, and CSS (web programming languages). All experiments will make use of existing technology, software, or blueprints rather than custom designs. Prototypes may be developed as necessary to complete the experiments, but no attempt will be made to produce a finished product.

Methodology

The study will start with research on programming using only speech recognition via a microphone, with a display. A review of existing open source speech recognition systems will be carried out, comparing them on accuracy and speed with free-to-use commercial digital assistants such as Amazon's Alexa, Microsoft's Cortana, and Google's Assistant. Different ways of dictating text, SQL, JavaScript, HTML, and CSS will be tested and optimized for low vocal strain and intuitiveness. This could include dictating keywords and identifiers in various cases, templates for blocks (such as conditions, loops, functions, and classes), and commands for scrolling through text, selecting it, and modifying it, along with other controls such as saving a file or accessing a terminal. To minimize the need for a display, tests will look for effective ways to write, read, and modify longer, more complex programs without one. A few volunteers recruited for these tests will also be observed to see how long a continuous session they can manage with such a system (with and without a display) versus the traditional keyboard-and-mouse setup.
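To make the idea of a dictation command sheet concrete, the sketch below maps a handful of spoken phrases to code templates with a cursor position. The phrases, templates, and the "|" cursor marker are assumptions made for illustration; the actual command sheet would be the outcome of the dictation tests described above.

```python
# Hypothetical spoken phrases mapped to code templates for the languages in
# scope. "|" marks where the cursor should land after the template is inserted.
TEMPLATES = {
    "select rows":     "SELECT | FROM table_name WHERE condition;",  # SQL
    "define function": "function name(|) {\n}",                      # JavaScript
    "anchor tag":      '<a href="|"></a>',                           # HTML
    "style rule":      "selector {\n  |\n}",                         # CSS
}

def expand(phrase):
    """Return (text_to_insert, cursor_offset) for a known phrase, else None."""
    template = TEMPLATES.get(phrase)
    if template is None:
        return None
    cursor = template.index("|")
    return template.replace("|", "", 1), cursor

print(expand("select rows"))   # ('SELECT  FROM table_name WHERE condition;', 7)
```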

The next phase of research would begin with building a few prototypes of KeyGlove, and a few algorithms could be tested for their sensitivity, specificity, and speed in gesture detection. Similar to the speech recognition tests, different ways of gesturing text, SQL, JavaScript, HTML, and CSS will be tested, optimizing them for low hand and wrist strain (as per a doctor's recommendation) and intuitiveness. Writing long and complex programs with this system would also be tried out, with and without a display, with the help of volunteers, noting how long a session they can manage before becoming fatigued and comparing this against the traditional setup. These tests will then be repeated with a system combining speech recognition and the wearable glove. The interface (voice commands and gestures) would remain the same, and participants will be encouraged to use both. Results will be recorded with and without a display and compared against the separate individual systems.
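As a rough sketch of the gesture side, the snippet below decodes a KeyGlove-style "touchset": a combination of simultaneously closed contacts mapped to a keystroke or editor command. The contact names and assignments are hypothetical examples for this proposal, not KeyGlove's shipped touchset definitions.

```python
# Sketch of gesture decoding for a KeyGlove-style device. Each entry maps a
# set of simultaneously closed contacts to a character or editor command.
# All names and assignments below are hypothetical.
TOUCHSET = {
    frozenset({"thumb_tip", "index_tip"}):          "a",
    frozenset({"thumb_tip", "middle_tip"}):         "e",
    frozenset({"thumb_tip", "index_tip", "palm"}):  "<select-line>",
    frozenset({"index_tip", "middle_tip"}):         "<scroll-down>",
}

def decode(active_contacts):
    """Map the set of currently closed contacts to a key or command, or None."""
    return TOUCHSET.get(frozenset(active_contacts))

# The detection loop would call decode() once the contact state has been
# stable for a debounce interval.
print(decode({"thumb_tip", "index_tip"}))   # -> "a"
```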

The final phase of the project would experiment with different display technologies, along with a virtual screen space concept. The display would project only a part of the virtual screen space at a time, similar to the camera in 3D games, allowing a much wider virtual display on a narrow physical device. Experiments could be carried out with a wearable watch, a phone, a tablet, and a projector. Participants would be asked to perform a programming test while standing and while walking, and results would be compared for each type of display and for no display at all.
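A minimal sketch of the virtual screen space idea follows: the physical display shows a window into a larger virtual screen, centred on the cursor and clamped at the edges. All dimensions are illustrative character-cell counts, not measurements from any particular device.

```python
# The physical display shows a window into a larger virtual screen; the window
# follows the cursor the way a camera follows the player in a 3D game.
VIRTUAL_COLS, VIRTUAL_ROWS = 240, 120    # virtual screen size (illustrative)
VIEW_COLS, VIEW_ROWS = 40, 12            # what a watch/phone shows at once

def viewport(cursor_col, cursor_row):
    """Top-left corner of the visible window, centred on the cursor and
    clamped to the edges of the virtual screen."""
    left = min(max(cursor_col - VIEW_COLS // 2, 0), VIRTUAL_COLS - VIEW_COLS)
    top  = min(max(cursor_row - VIEW_ROWS // 2, 0), VIRTUAL_ROWS - VIEW_ROWS)
    return left, top

print(viewport(5, 3))      # near the origin -> clamped to (0, 0)
print(viewport(120, 60))   # middle of the virtual screen -> (100, 54)
```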

Possible outcome

In the first phase of the project, an easy-to-use voice command sheet will be produced for programming. It should be adequate for SQL, JavaScript, HTML, and CSS, but not necessarily for other languages. Following this, the volunteer programming test could be carried out. The second phase would similarly involve designing a gesture command sheet and testing it with volunteers, both with and without a display. The same programming test would then be carried out with a combination of both systems. The final phase would involve developing a prototype display with a virtual screen space and testing participants separately with different displays (wearable watch, phone, tablet, or projector). The same experiment could be repeated with participants walking around instead of standing still. Based on the conclusions from these experiments, a final prototype could be pieced together.

References

  1. Juang B.H.; Rabiner L.R. (2004). "Automatic Speech Recognition – A Brief History of the Technology Development". Elsevier Encyclopedia of Language and Linguistics.
  2. Huang, Xuedong; Baker, James; Reddy, Raj (2014). "A Historical Perspective of Speech Recognition". Communications of the ACM. 57 (1): 94–103. doi:10.1145/2500887.
  3. Hochreiter, S; Schmidhuber, J (1997). "Long Short-Term Memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276.
  4. Hinton, Geoffrey; Deng, Li; Yu, Dong; Dahl, George; Mohamed, Abdel-Rahman; Jaitly, Navdeep; Senior, Andrew; Vanhoucke, Vincent; Nguyen, Patrick; Sainath, Tara; Kingsbury, Brian (2012). "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The shared views of four research groups". IEEE Signal Processing Magazine. 29 (6): 82–97. Bibcode:2012ISPM...29...82H. doi:10.1109/MSP.2012.2205597.
  5. Rudd, Tavis (2013). "Using Python to Code by Voice". PyCon US 2013.
  6. Williams-King, David (2016). "Coding by Voice with Open Source Speech Recognition". The Eleventh HOPE.