@aseemk
Last active January 28, 2019 14:52
Node.js script to extract i18n strings from source code.

This is a little script I wrote to automatically scour a git repo's source code and extract i18n strings.

It searches specifically for gettext-style __() calls, which happen to be used by @mashpie's i18n-node and @jeresig's i18n-node-2 (which I'm using), but which are also used by other tools.

It searches for these strings in a pretty liberal way that works with JavaScript, CoffeeScript, and probably most other similar languages: it just looks for (non-word character), __, either ( or whitespace, followed by a single- or double-quoted string.
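As a rough illustration of that matching rule, here is a minimal sketch in plain JavaScript rather than the Streamline CoffeeScript the script itself uses. The pattern is a hand port of the script's regex (with only the `g` flag), and `extractStrings` and the sample source lines are hypothetical names and data made up for this example.

```javascript
// Hand-ported pattern: a non-word character, `__`, then `(` or whitespace,
// then a single- or double-quoted string.
const I18N_CALL_REGEX = /\W__[(\s]('(.+?[^\\])'|"(.+?[^\\])")/g;

// Hypothetical helper: returns the unique i18n strings found in `code`.
function extractStrings(code) {
  const found = new Set();
  for (const match of code.matchAll(I18N_CALL_REGEX)) {
    // Group 2 holds single-quoted contents, group 3 double-quoted contents.
    found.add(match[2] || match[3]);
  }
  return [...found];
}

// Made-up sample input covering both call styles.
const sample = [
  "res.render('login', {title: __('Log In')});", // JavaScript style
  'button.text __ "Log Out"',                    // CoffeeScript style
  "my__('not a call')",                          // `__` follows a word character
].join('\n');

console.log(extractStrings(sample)); // [ 'Log In', 'Log Out' ]
```

The third sample line is skipped because `__` directly follows a letter, which is what the leading non-word-character requirement is for.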

It then writes these strings to a JSON file, showing you which strings have been added and removed since the last run. The JSON format is the same one used by the above-mentioned modules and others, so there should be no conflicts.
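The added/removed comparison can be sketched roughly as follows, again in plain JavaScript. `updateStrings` is a hypothetical helper, not part of the script, and the sample map is invented; like the script, it maps every string to itself and serializes with 4-space indentation.

```javascript
// Hypothetical helper: compares the keys already in the strings file with the
// strings just extracted, and builds the new file contents.
function updateStrings(oldMap, newStrs) {
  const oldStrs = Object.keys(oldMap);
  const oldSet = new Set(oldStrs);
  const newSet = new Set(newStrs);

  const added = newStrs.filter((s) => !oldSet.has(s));
  const removed = oldStrs.filter((s) => !newSet.has(s));

  // Map each string to itself, the same shape the script writes out.
  const newMap = {};
  for (const s of newStrs) newMap[s] = s;

  return { added, removed, json: JSON.stringify(newMap, null, 4) };
}

// Made-up data mirroring the example run.
const oldMap = { 'Sign Up': 'Sign Up', Login: 'Login', Logout: 'Logout' };
const result = updateStrings(oldMap, ['Sign Up', 'Log In', 'Log in', 'Log Out']);

console.log(result.added);   // [ 'Log In', 'Log in', 'Log Out' ]
console.log(result.removed); // [ 'Login', 'Logout' ]
```

Note that the file is rebuilt from the extracted strings only, so any key that no longer appears in the source is dropped on the next run.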

Example run:

$ ./scripts/i18n._coffee 

Reading current strings...
31 current strings found.

Searching files for strings...
14 matching files found.

Extracting strings from files...
32 total strings found.

3 strings added:
  - Log In
  - Log in
  - Log Out

2 strings removed:
  - Login
  - Logout

This tool probably isn't perfect, but it gives me peace of mind knowing that forgetting a code path will no longer mean missed strings. It doesn't have to replace the runtime updating of these tools; it supplements them nicely.

This tool is written in Streamline syntax, but the only async part (besides the file I/O, which doesn't have to be) is the call to git. You could easily rewrite this without Streamline if you just wrap the rest in a function and pass that as a callback to exec().
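For anyone attempting that rewrite, the callback shape might look something like this sketch in plain JavaScript. `parseGrepOutput` and `findCandidateFiles` are hypothetical names; the command string is the one the script uses, and, as in the script, the directory is interpolated unquoted, so a path containing spaces would need extra care.

```javascript
const { exec } = require('child_process');
const path = require('path');

// Hypothetical helper: turns git grep's stdout into a list of file paths,
// dropping Markdown files as the script does (and empty lines, which the
// script does not bother with).
function parseGrepOutput(stdout) {
  return stdout
    .trim()
    .split('\n')
    .filter((file) => file !== '' && path.extname(file) !== '.md');
}

// Hypothetical wrapper: runs the script's git grep command and passes the
// file list to a Node-style callback.
function findCandidateFiles(dir, callback) {
  const command = `git grep -I --word-regexp --name-only -e '__' -- ${dir}`;
  exec(command, (err, stdout) => {
    // git grep exits non-zero when nothing matches; that also arrives as `err`.
    if (err) return callback(err);
    callback(null, parseGrepOutput(stdout));
  });
}
```

The rest of the script would then live inside the callback, e.g. `findCandidateFiles(__dirname, (err, files) => { /* extract and write strings */ })`.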

Feedback welcome! If you try this out, let me know how you like it. Cheers.

#!/usr/bin/env _coffee
#
# Helper script to search all of our files for i18n strings and update our
# strings file. Helpful in case we missed a code path during testing.
#
# Specifically, searches for gettext `__()` calls in our checked-in files.
#
$ = require 'underscore'
echo = console.log
{exec} = require 'child_process'
FS = require 'fs'
Path = require 'path'
## CONSTANTS:
STRINGS_FILEPATH = "#{__dirname}/locales/en.json"
# High-level search for files that look like they may contain __ calls.
# -I (no long option) means exclude binary files.
# https://www.kernel.org/pub/software/scm/git/docs/git-grep.html
GIT_GREP_COMMAND = """
git grep -I --word-regexp --name-only -e '__' -- #{__dirname}
"""
# Tailored regex to match our `__()` calls and extract the strings.
# http://www.regular-expressions.info/reference.html =)
# XXX Is this a bad idea? Brittle? Or good enough and safe?
# TODO We don't do this currently, but do we want to detect and support calls
# with heredoc (triple-quoted) strings too? Are they even good for i18n tho?
I18N_CALL_REGEX = ///
\W # `__` cannot follow a letter, number, or another underscore
__
[(\s] # "calling" means either an `(` or whitespace (CoffeeScript)
( # and the string is either...
' # single-quoted...
(.+? # (match anything, but lazily, not greedily)
[^\\]) # and the closing quote is one that's *not* preceded by a `\`
'
| # or...
" # double-quoted...
(.+? # (match anything, but lazily, not greedily)
[^\\]) # and the closing quote is one that's *not* preceded by a `\`
"
)
///gi
## MAIN:
# Read in the current set of strings:
echo '\nReading current strings...'
oldStrs = Object.keys require STRINGS_FILEPATH
echo "#{oldStrs.length} current strings found."
# Grep our checked-in files for a rough match of files:
echo '\nSearching files for strings...'
files = exec GIT_GREP_COMMAND, _
files = files.trim().split '\n'
# Filter out Markdown files since they're only documentation right now:
# (And if we ever used Markdown for user-facing content, I bet we could just
# translate the whole Markdown file itself.)
files = files.filter (file) -> (Path.extname file) isnt '.md'
echo "#{files.length} matching files found."
# Search matching files for all instances of, and extract, i18n strings:
echo "\nExtracting strings from files..."
newStrsMap = {}
for file in files
    code = FS.readFile "#{__dirname}/#{file}", 'utf8', _
    while match = I18N_CALL_REGEX.exec code
        if str = match[2]
            newStrsMap[str] = str
        else
            console.error 'Anomaly!', match
newStrs = Object.keys newStrsMap
echo "#{newStrs.length} total strings found."
# Compare the old vs. new strings:
added = $(newStrs).difference oldStrs
removed = $(oldStrs).difference newStrs
echo "\n#{added.length} strings added:\n -", added.join '\n - '
echo "\n#{removed.length} strings removed:\n -", removed.join '\n - '
# Finally, update the JSON!
FS.writeFile STRINGS_FILEPATH, (JSON.stringify newStrsMap, null, 4), _
@svrin
svrin commented Feb 22, 2014

It seems like you only check for single-quoted string matches in line 74. I think that to also work with double-quoted strings, the line needs to be adjusted to:

if str = match[2] or match[3]
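
A quick way to check the group numbering behind this suggestion is the sketch below, written in plain JavaScript with a hand port of the script's pattern (flags omitted). The variable names and the two sample calls are made up for illustration.

```javascript
// Hand-ported pattern: group 1 is the whole quoted string (quotes included),
// group 2 the single-quoted contents, group 3 the double-quoted contents.
const regex = /\W__[(\s]('(.+?[^\\])'|"(.+?[^\\])")/;

const singleQuoted = regex.exec(" __('Log In')");
const doubleQuoted = regex.exec(' __("Log Out")');

console.log(singleQuoted[2], singleQuoted[3]); // Log In undefined
console.log(doubleQuoted[2], doubleQuoted[3]); // undefined Log Out
```

Under this port, a double-quoted call leaves group 2 undefined, so a check on `match[2]` alone would fall through to the script's 'Anomaly!' branch.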
