Skip to content

Instantly share code, notes, and snippets.

@valyala
Created December 5, 2011 18:33
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save valyala/1434692 to your computer and use it in GitHub Desktop.
Save valyala/1434692 to your computer and use it in GitHub Desktop.
Word Chains solution
// This is a solution for the Word Chains problem described
// at http://socialcam.com/jobs/problems .
//
// The problem.
//
// Two words are connected by a word chain if it is possible to change one into
// the other by making a series of single-character changes, such that every
// intermediate form is also a word. For example, CAT and DOG are connected with
// a word chain because CAT, COT, COG and DOG are all words. DEMONIC and
// UMBRELLA are not.
//
// Write a program that takes a list of words (for example /usr/share/dict/words
// on a unix system) and then reads pairs of words on stdin and prints 'YES' if
// the words are connected by a chain, and 'NO' if they are not. The program
// should take the path to the word list as a command-line argument, and should
// then loop, reading pairs of whitespace-delimited words on stdin and printing
// 'YES' or 'NO.'
//
// Only one operation is allowed between words in the chain. That operation may
// consist of changing any single character, deleting any single character
// or inserting any single character at any position. All comparisons should be
// case insensitive.
//
//
// The solution.
//
// The core concept of the solution is subword - a tuple of a word's substring
// and a character position. Word's substring is created from a word by deleting
// a character at the given position. It is possible creating L distinct
// subwords from a word with length L. Word's substring is always one character
// shorter than the corresponding word. Distinct subwords can have equal
// substrings if they differ by character position.
// For example, ('oo', 0), ('fo', 1), ('fo', 2) are all possible subwords
// for the word 'foo'.
//
// If two words have a common subword, then one word can be converted into
// another word by character substitution at the given position.
// If a word matches a subword's substring, then this word can be converted
// into subword's word by character deletion at the given position. Character
// deletion is equivalent to character insertion for the matching word.
// Obviously subwords can be used for building all possible one-character
// transformations for dictionary words irregardless of alphabet used in words.
//
// The solution is split into two phases:
//
// 1. Dictionary words preprocessing.
// 2. Input words lookup.
//
// The first phase has the following steps:
//
// 1. Read all dictionary words into memory.
// This step has O(W) computational complexity and O(S) space complexity,
// where:
// - W is the total number of dictionary words;
// - S is the total number of chars in all dictionary words.
//
// 2. For each word from the dictionary create all possible subwords. If word's
// subword matches already existing subword, then connect the corresponding
// words.
// This step has O(S) computational complexity and O(S*E1) space coplexity,
// where:
// - E1 is the average number of connections found per each subword.
//
// 3. For each word from the dictionary find all subwords with substrings
// matching the given word. Connect the word to found subwords' words.
// This step has O(W) computational complexity and O(W*E2) space complexity,
// where:
// - E2 is the average number of connections found per each word.
//
// 4. Step 3 creates disconnected set of connected graphs from words.
// Each connected graph in this set represents all the words, which can be
// transformed from each other. If words belong to distinct connected graphs,
// then the transformation is impossible. Let's assign distinct colors
// for each connected graph and brush each word in each graph with
// the corresponding color.
// This step has O(W) computational complexity and O(W) space complexity.
//
// 5. Create a lookup table, which maps from word to color.
// This step has O(W) computational complexity and O(S) space complexity.
// Subwords can be deleted after this step.
//
// The second phase consists of one step:
//
// 1. For each word pair verify whether two given words belong to the same
// graph by comparing their colors obtained from the lookup table.
// This step has O(P) computational complexity and O(1) space complexity,
// where:
// - P is the total number of word pairs.
//
// For words from natural languages E1 and E2 can be considered constants.
// In this case computational complexity of the first phase is O(S).
// Storage required per each word from a natural language can be considered
// constant, so the solution requires O(S) space.
//
// It is worth noting algorithm's computational complexity doesn't depend
// on the size of alphabet used in dictionary words. I.e. words containing
// arbitrary unicode characters will be processed with the same speed as words
// containing only Latin chars. (Though uncode support is missing
// in the current implementation).
//
// This program reads 99K words from /usr/share/dict/words in 0.24 seconds
// on my laptop with the speed of ~413K words/s. Steps 1-4 require 16Mb of RAM
// (162 bytes per word) on 64-bit build. 32-bit build consumes 15Mb of RAM
// (152 bytes per word).
// Step 5 requires only a few Mb of RAM for storing a mapping between dictionary
// words and colors.
// After the dictionary load complete, it processes 50K word pairs from the same
// /usr/share/dict/words file in 0.09 seconds (i.e. ~556K word pairs/s).
//
// The program uses the following optimization tricks, which help increasing
// performance and decreasing memory usage:
//
// - Custom optimized containers are used instead of generic STL containers.
// - Memory for already known quantity of structures is allocated in bulk.
// - Small integer types are used in structs if possible instead of generic wide
// types.
// - Pointers are substituted by offsets from base address. Offsets have smaller
// sizes comparing to pointers (especially on x64 builds).
// - Access patterns to memory in loops is close to linear.
// - Unions are used for packing mutually exclusive objects into the same
// location.
// - Objects are stored by value in data containers. This saves memory
// by eliminating a pointer to value from data container and eliminates
// a memory dereference during access to such values.
// - Unneeded memory copies are avoided where possible.
// - Pragma pack() directive is used for tight structures' packing.
//
// Author: Aliaksandr Valialkin <valyala@gmail.com>.
#include <cassert> // for assert()
#include <cctype> // for tolower(), isspace()
#include <cstddef> // for size_t
#include <cstdlib> // for exit(), EXIT_SUCCESS/EXIT_FAILURE
#include <fstream> // for ifstream
#include <iostream> // for cin/cout/cerr
#include <istream>
#include <ostream>
#include <stdint.h> // for uint*_t
// Denies copy constructor and assignment operator for the given class.
// These operations are implicitly implemented by the compiler unless
// explicilty declared. Sometimes implicit implementation of these operations
// won't match your needs. For example, if a class contains pointers to objects
// it owns, implicit copying will lead to use after free errors.
//
// This macro MUST be placed only at the end of class definition!
#define DENY_COPYING(classname) \
private: \
classname(const classname &); \
void operator = (const classname &);
// It's unclear how to obtain SIZE_MAX in C++. So define it here.
#ifndef SIZE_MAX
# define SIZE_MAX ((size_t)-1)
#endif
using namespace std;
namespace
{
// Debug stream.
//
// This stream is completely eliminated if NDEBUG macro is present, so feel free
// scattering debug messages in the code wherever you want. This won't hurt
// performance in optimized builds.
class debug_ostream
{
private:
ostream &_output;
public:
explicit debug_ostream(ostream &output) : _output(output) {}
debug_ostream &operator << (ostream &(*const manipulator)(ostream &))
{
(void)manipulator;
#ifndef NDEBUG
manipulator(_output);
#endif
return *this;
}
template <class T>
debug_ostream &operator << (const T &v)
{
(void)v;
#ifndef NDEBUG
_output << v;
#endif
return *this;
}
};
// Debug stream initialization.
debug_ostream cdbg(cerr);
// Calculates the number of bytes required for storing the value N.
template <size_t N> struct bytes
{
static const size_t n = 1 + bytes<(N/256)>::n;
};
template<> struct bytes<0>
{
static const size_t n = 0;
};
// Determines a type suitable for holding the given number of bytes.
template <size_t Bytes> struct size_type;
template<> struct size_type<1> { typedef uint8_t t; };
template<> struct size_type<2> { typedef uint16_t t; };
// Moves object's contents from *src to *dst.
// It differs from assignment operator in two aspects:
// - dst contents is undefined before the move, so there is no need in freeing
// up dst resources during the move.
// - src contents can be modified during the move, so it can be left in
// the state, which consumes minimal CPU at destruction time.
//
// These aspects allow writing fast move() specializations.
template <class ValueType>
void move(ValueType *const dst, ValueType *const src)
{
// This move() implementation is suitable only for POD types.
// Non-POD types must provide own move() specialization.
*dst = *src;
}
// Stack implementation optimized for performance and memory usage.
// It packs multiple items in chunks up to ChunkSize items per chunk.
// This improves memory locality and shaves off memory required for 'next'
// pointers comparing to canonical list implementations.
//
// See http://en.wikipedia.org/wiki/Unrolled_linked_list for details.
//
// The stack owns items pushed into it. It uses move() template function
// for moving items from/to the stack.
#pragma pack(push, 1)
template <class ValueType, size_t ChunkSize>
class fast_stack
{
private:
typedef typename size_type<bytes<ChunkSize>::n>::t chunk_size_t;
static const chunk_size_t _MAX_CHUNK_SIZE = ((chunk_size_t)-1);
// Pointer to the next stack chunk.
fast_stack *_next;
// Space for items in the current chunk.
char _v[sizeof(ValueType) * ChunkSize];
// The number of items in the current chunk.
chunk_size_t _chunk_size;
ValueType *_get(const size_t n) const
{
assert(_chunk_size <= ChunkSize);
assert(n < _chunk_size);
return ((ValueType *)_v) + n;
}
fast_stack(fast_stack *const next, ValueType *const v) :
_next(next), _chunk_size(1)
{
assert(ChunkSize > 0);
assert(ChunkSize <= _MAX_CHUNK_SIZE);
cdbg << "Allocated stack chunk for " << ChunkSize << " items" << endl;
move(_get(0), v);
}
~fast_stack()
{
for (size_t i = 0; i < _chunk_size; ++i) {
_get(i)->~ValueType();
}
cdbg << "Released stack chunk for " << ChunkSize << " items" << endl;
}
public:
// Moves the given item onto the stack using move() template function.
// May update a pointer to the stack top.
static void push(fast_stack **const top_p, ValueType *const v)
{
assert(top_p != 0);
fast_stack *top = *top_p;
if (top == 0 || top->_chunk_size == ChunkSize) {
*top_p = new fast_stack(*top_p, v);
}
else {
assert(top->_chunk_size < ChunkSize);
++(top->_chunk_size);
move(top->_get(top->_chunk_size - 1), v);
}
}
// Pops the item from the top of the stack to *to_v using move() template
// function. May update a pointer to the stack top.
static void pop(fast_stack **const top_p, ValueType *const to_v)
{
assert(top_p != 0);
fast_stack *top = *top_p;
assert(top != 0);
assert(top->_chunk_size > 0);
move(to_v, top->_get(top->_chunk_size - 1));
--(top->_chunk_size);
if (top->_chunk_size == 0) {
*top_p = top->_next;
delete top;
}
}
// Moves items from the array to the stack using move() template function.
// Returns the top of the created stack.
static fast_stack *move_from_array(ValueType *begin, const size_t n)
{
assert(begin != 0);
const ValueType *const end = begin + n;
fast_stack *top = 0;
while (begin != end) {
push(&top, begin);
++begin;
}
return top;
}
// Destroys the given stack.
static void destroy(fast_stack *top)
{
while (top != 0) {
fast_stack *next = top->_next;
delete top;
top = next;
}
}
// Calls visitor for each item on the stack.
// The order of visited items is unspecified.
template <class Visitor>
static void for_each(const fast_stack *top, Visitor *const visitor)
{
while (top != 0) {
assert(top->_chunk_size > 0);
for (size_t i = 0; i < top->_chunk_size; ++i) {
ValueType *const v = top->_get(i);
(*visitor)(v);
}
top = top->_next;
}
}
template <class Predicate>
static ValueType *find_if(const fast_stack *top, Predicate *const predicate)
{
while (top != 0) {
assert(top->_chunk_size > 0);
for (size_t i = 0; i < top->_chunk_size; ++i) {
ValueType *const v = top->_get(i);
if ((*predicate)(const_cast<const ValueType *const>(v))) {
return v;
}
}
top = top->_next;
};
return 0;
}
DENY_COPYING(fast_stack)
};
#pragma pack(pop)
// A list optimized for storage space.
// It stores up to InlineChunkSize values in the object memory itself,
// then falls back to dynamically allocated storage.
//
// The list owns added items. It uses move() template function for moving
// items into the list.
#pragma pack(push, 1)
template <class ValueType, size_t InlineChunkSize, size_t ExternalChunkSize>
class compact_list
{
private:
// This type must be able handling up to InlineChunkSize + 1 distinct integer
// values. The (InlineChunkSize + 1) value is used as a flag indicating
// external storage is used instead of inline storage.
typedef typename size_type<bytes<InlineChunkSize+1>::n>::t
inline_storage_size_t;
typedef fast_stack<ValueType, ExternalChunkSize> external_storage_t;
static const size_t _MAX_INLINE_STORAGE_SIZE = ((inline_storage_size_t)-1);
union {
char inline_storage[sizeof(ValueType) * InlineChunkSize];
external_storage_t *external_storage;
} _u;
inline_storage_size_t _inline_storage_size;
ValueType *_get_from_inline_storage(const size_t n) const
{
assert(n < _inline_storage_size);
return ((ValueType *)_u.inline_storage) + n;
}
public:
compact_list() : _inline_storage_size(0)
{
assert(InlineChunkSize > 0);
assert(InlineChunkSize < _MAX_INLINE_STORAGE_SIZE);
}
~compact_list()
{
if (_inline_storage_size <= InlineChunkSize) {
for (size_t i = 0; i < _inline_storage_size; i++) {
_get_from_inline_storage(i)->~ValueType();
}
}
else {
assert(_u.external_storage != 0);
external_storage_t::destroy(_u.external_storage);
}
}
// Moves the given item into the list using move() template function.
void add(ValueType *const v)
{
if (_inline_storage_size < InlineChunkSize) {
assert(_inline_storage_size < _MAX_INLINE_STORAGE_SIZE);
++_inline_storage_size;
move(_get_from_inline_storage(_inline_storage_size - 1), v);
}
else {
if (_inline_storage_size == InlineChunkSize) {
cdbg << "Switching from inline to external storage, because list " <<
"size is greater than " << InlineChunkSize << endl;
assert(_inline_storage_size < _MAX_INLINE_STORAGE_SIZE);
++_inline_storage_size;
_u.external_storage = external_storage_t::move_from_array(
_get_from_inline_storage(0), InlineChunkSize);
}
assert(_u.external_storage != 0);
external_storage_t::push(&_u.external_storage, v);
}
}
// Calls visitor for each item on the list.
// The order of visited items is unspecified.
template <class Visitor>
void for_each(Visitor *const visitor) const
{
if (_inline_storage_size <= InlineChunkSize) {
for (size_t i = 0; i < _inline_storage_size; ++i) {
ValueType *const v = _get_from_inline_storage(i);
(*visitor)(v);
}
}
else {
assert(_u.external_storage != 0);
external_storage_t::for_each(_u.external_storage, visitor);
}
}
template <class Predicate>
ValueType *find_if(Predicate *const predicate) const
{
if (_inline_storage_size <= InlineChunkSize) {
for (size_t i = 0; i < _inline_storage_size; ++i) {
ValueType *const v = _get_from_inline_storage(i);
if ((*predicate)(const_cast<const ValueType *const>(v))) {
return v;
}
}
return 0;
}
assert(_u.external_storage != 0);
return external_storage_t::find_if(_u.external_storage, predicate);
}
DENY_COPYING(compact_list)
};
#pragma pack(pop)
// Simple hash table, which uses less memory comparing to sets and maps
// provided by STL.
//
// ValueType must implement equal_to() and hash() methods.
// Hash table owns objects put into it. It uses move() template function
// for object's movement. Objects put into hash table shouldn't
// modify hash() result.
template <class ValueType, size_t OptimalBucketSize>
class simple_hashtable
{
public:
typedef compact_list<ValueType, OptimalBucketSize, OptimalBucketSize * 2>
bucket_t;
private:
const size_t _buckets_count;
bucket_t *const _buckets;
template <class Visitor>
class _value_visitor
{
private:
Visitor &_visitor;
public:
explicit _value_visitor(Visitor &visitor) : _visitor(visitor) {}
void operator () (ValueType *const v) const
{
_visitor(v);
}
};
class _value_finder
{
private:
const ValueType *const _v;
const simple_hashtable *const _htb;
public:
_value_finder(const ValueType *const v,
const simple_hashtable *const htb) : _v(v), _htb(htb) {}
bool operator () (const ValueType *const v) const
{
assert(_htb->get_bucket(_v) == _htb->get_bucket(v));
return _v->equal_to(v);
}
};
ValueType *_find(const bucket_t *const bucket, const ValueType *const v) const
{
assert(bucket == get_bucket(v));
const _value_finder vf(v, this);
return bucket->find_if(&vf);
}
public:
explicit simple_hashtable(const size_t buckets_count) :
_buckets_count(buckets_count), _buckets(new bucket_t[buckets_count])
{
assert(_buckets_count > 0);
cdbg << "*** Allocated " << (_buckets_count * sizeof(_buckets[0])) <<
" bytes for " << _buckets_count << " hashtable buckets" << endl;
}
~simple_hashtable()
{
assert(_buckets != 0);
delete[] _buckets;
cdbg << "*** Deleted " << (_buckets_count * sizeof(_buckets[0])) <<
" bytes allocated for " << _buckets_count << " hashtable buckets" <<
endl;
}
bucket_t *get_bucket(const ValueType *const v) const
{
assert(_buckets_count > 0);
assert(_buckets != 0);
// TODO: use something faster than modulo operation?
size_t n = v->hash() % _buckets_count;
return _buckets + n;
}
// Moves the given item into hashtable using move() template function.
// Returns 0 on success or a pointer to already existing equivalent
// item on failure.
ValueType *insert(ValueType *const v)
{
bucket_t *const bucket = get_bucket(v);
ValueType *const tmp = _find(bucket, v);
if (tmp == 0) {
bucket->add(v);
}
else {
assert(v->equal_to(tmp));
}
return tmp;
}
// Returns a pointer to an item with value equal to the given value.
// Returns 0 if there is no item with the given value.
ValueType *find(const ValueType *const v) const
{
return _find(get_bucket(v), v);
}
// Call visitor for each item in the hashtable.
// The order of visited items is unspecified.
template <class Visitor>
void for_each(Visitor *const visitor) const
{
assert(_buckets != 0);
const _value_visitor<Visitor> vv(visitor);
for (size_t i = 0; i < _buckets_count; ++i) {
_buckets[i].for_each(vv);
}
}
DENY_COPYING(simple_hashtable)
};
// Jenkin's hash parts.
//
// See http://en.wikipedia.org/wiki/Jenkins_hash_function .
// C++ doesn't provide decent hash implementations, so we have to implement
// it here.
struct jenkins_hash
{
static uint32_t mix(uint32_t h, const char c)
{
h += c;
h += (h << 10);
h ^= (h >> 6);
return h;
}
static uint32_t final(uint32_t h)
{
h += (h << 3);
h ^= (h >> 11);
h += (h << 15);
return h;
}
};
// Tiny and fast string implementation.
// It builds strings on top of the provided buffer.
// It doesn't allocate additional memory.
#pragma pack(push, 1)
class tiny_string
{
public:
// Maximum string size value must fit this type.
// This type must be unsigned.
typedef uint8_t ts_size_t;
private:
// Buffer size value must fit this type.
// This type must be unsigned.
typedef uint32_t offset_t;
static const offset_t _MAX_STRING_SIZE = ((ts_size_t)-1);
static const offset_t _MAX_INLINE_STRING_SIZE = sizeof(offset_t);
static const offset_t _MAX_BUF_SIZE = ((offset_t)-2);
static const offset_t _TMP_STRING_OFFSET = ((offset_t)-1);
union {
// Offset of the string in the buffer.
offset_t offset;
// Inline string allows eliminating additional memory dereference.
char inline_string[_MAX_INLINE_STRING_SIZE];
} _u;
// The size of the string.
// Strings shorter than (_MAX_INLINE_STRING_SIZE + 1) are stored
// in _u.inline_string.
ts_size_t _size;
static char *_buf;
static offset_t _buf_size;
static offset_t _next_offset;
// Temporary buffer is used for holding a string read from input stream.
static char _tmp_buf[_MAX_STRING_SIZE];
void _verify() const
{
assert(_buf != 0);
assert(_next_offset <= _buf_size);
if (_size > _MAX_INLINE_STRING_SIZE) {
if (_u.offset != _TMP_STRING_OFFSET) {
assert(_u.offset <= _next_offset - _size);
}
}
}
static void _set_string(tiny_string *const ts, const offset_t offset,
const ts_size_t size)
{
if (size <= _MAX_INLINE_STRING_SIZE) {
const char *const src = (offset != _TMP_STRING_OFFSET) ? (_buf + offset) : _tmp_buf;
char *const dst = ts->_u.inline_string;
// This loop is faster than memcpy().
for (size_t i = 0; i < size; ++i) {
dst[i] = src[i];
}
}
else {
ts->_u.offset = offset;
}
ts->_size = size;
}
public:
// Sets the buffer, which will be used for strings building.
// The buffer must contain whitespace-delimited strings.
static void set_buffer(char *const buf, const size_t buf_size)
{
assert(_buf == 0);
assert(_MAX_BUF_SIZE <= SIZE_MAX);
if (buf_size > _MAX_BUF_SIZE) {
cerr << "Cannot reserve " << buf_size << " chars for tiny strings." <<
"Widen tiny_string::offset_t for handling more chars" << endl;
::exit(EXIT_FAILURE);
}
_buf = buf;
assert((offset_t) buf_size == buf_size);
_buf_size = buf_size;
_next_offset = 0;
cdbg << "*** Set a buffer for holding " << _buf_size <<
" tiny string chars" << endl;
}
// Builds the next string from the underlying buffer.
// Builds the string with zero length if the end of the buffer reached.
static void get_next_string(tiny_string *const ts)
{
assert(_buf != 0);
assert(_next_offset <= _buf_size);
// skip leading whitespace
char c = 0;
while (_next_offset < _buf_size) {
c = _buf[_next_offset++];
if (!::isspace(c)) {
break;
}
}
if (_next_offset == _buf_size) {
// end of buffer reached
_set_string(ts, _TMP_STRING_OFFSET, 0);
return;
}
// lowercase string contents.
const offset_t s_offset = _next_offset - 1;
assert(!::isspace(c));
do {
_buf[_next_offset - 1] = ::tolower(c);
c = _buf[_next_offset++];
} while (_next_offset < _buf_size && !::isspace(c));
// put trailing character into string.
if (!::isspace(c)) {
_buf[_next_offset - 1] = ::tolower(c);
}
// calculate string size
assert(s_offset + 1 < _next_offset);
const offset_t s_size = _next_offset - s_offset - 1;
if (s_size > _MAX_STRING_SIZE) {
cerr << "Input string [";
cerr.write(_buf + s_offset, s_size);
cerr << "] containing " << s_size << " chars is longer than " <<
"the allowed maximum " << _MAX_STRING_SIZE << " chars. Widen " <<
"tiny_string::ts_size_t for handling longer strings" << endl;
::exit(EXIT_FAILURE);
}
assert((ts_size_t) s_size == s_size);
_set_string(ts, s_offset, s_size);
}
// Reads the string from input stream.
// The string is read into a static temporary buffer, so the next call
// to read() overwrites the previous string in the temporary buffer.
static void read(istream &input, tiny_string *const ts)
{
// skip leading whitespace
int c;
do {
c = input.get();
} while (::isspace(c));
if (!input.good()) {
// end of input stream
_set_string(ts, _TMP_STRING_OFFSET, 0);
return;
}
// read and lowercase string contents.
offset_t s_size = 0;
assert(!::isspace(c));
do {
_tmp_buf[s_size++] = ::tolower(c);
c = input.get();
} while (s_size < _MAX_STRING_SIZE && !::isspace(c) && input.good());
if (!::isspace(c) && input.good()) {
assert(s_size == _MAX_STRING_SIZE);
cerr << "Tiny string [";
cerr.write(_tmp_buf, _MAX_STRING_SIZE);
cerr << "]... cannot be longer than " << _MAX_STRING_SIZE << " chars" <<
endl;
::exit(EXIT_FAILURE);
}
_set_string(ts, _TMP_STRING_OFFSET, s_size);
}
size_t size() const
{
_verify();
return _size;
}
const char *data() const
{
_verify();
if (_size <= _MAX_INLINE_STRING_SIZE) {
return _u.inline_string;
}
return (_u.offset != _TMP_STRING_OFFSET) ? (_buf + _u.offset) : _tmp_buf;
}
bool equal_to(const tiny_string *const ts) const
{
if (_size != ts->_size) {
return false;
}
const char *const s1 = data();
const char *const s2 = ts->data();
for (ts_size_t i = 0; i < _size; ++i) {
if (s1[i] != s2[i]) {
return false;
}
}
return true;
}
uint32_t hash() const
{
uint32_t h = 0;
const char *const s = data();
for (ts_size_t i = 0; i < _size; ++i) {
h = jenkins_hash::mix(h, s[i]);
}
return jenkins_hash::final(h);
}
#ifndef NDEBUG
void debug_print(ostream &output) const
{
output << "[";
output.write(data(), size());
output << "]";
}
#endif
};
#pragma pack(pop)
char *tiny_string::_buf;
tiny_string::offset_t tiny_string::_buf_size;
tiny_string::offset_t tiny_string::_next_offset;
char tiny_string::_tmp_buf[tiny_string::_MAX_STRING_SIZE];
#ifndef NDEBUG
ostream &operator << (ostream &output, const tiny_string &ts)
{
ts.debug_print(output);
return output;
}
#endif
// This class holds information related to a dictionary word:
// - A list of words, which can be transformed into the word via a single
// character substitution, character insertion or character deletion.
// - A color of the word. Initially all the words have no color (zero value).
// Words are colorified via graph traversing after all dictionary words
// are populated and all word connections are built.
#pragma pack(push, 1)
class word
{
public:
// The type of word index.
// Widen it in order to handle more words.
// Must be unsigned.
typedef uint32_t idx_t;
// The type of word color.
// Widen it in order to handle more colors.
// Must be unsigned.
typedef uint16_t color_t;
// Null index is used solely for performance optimization purposes.
static const idx_t NULL_IDX;
// A stack of words. It is used for traversing graphs of connected words
// during words brushing.
// Use large stack chunks for performance reasons.
typedef fast_stack<idx_t, 4000> stack_t;
private:
static word *_words;
static idx_t _words_count;
static idx_t _next_idx;
// ((idx_t)-1) is already occupied by NULL_IDX.
static const idx_t _MAX_WORDS_COUNT = ((idx_t)-2);
// It is observed that words from /usr/share/dict/words have ~3 connections
// per word, so store up to 4 connections inline for performance reasons :)
typedef compact_list<idx_t, 4, 8> connected_words_t;
// List of connected words.
connected_words_t _connected_words;
// String associated with the word.
const tiny_string _ts;
// Word's color.
color_t _color;
static idx_t _get_idx(const word *const w)
{
assert(w >= _words);
assert(w - _words < _next_idx);
return w - _words;
}
class _words_brusher
{
private:
stack_t **const _ws_p;
const color_t _color;
public:
_words_brusher(stack_t **const ws_p, const color_t color) :
_ws_p(ws_p), _color(color)
{
assert(_ws_p != 0);
assert(_color != 0);
}
void operator () (const idx_t *const idx) const
{
word *const w = word::get_by_idx(*idx);
if (w->get_color() == _color) {
return;
}
w->set_color(_color);
idx_t tmp = *idx;
stack_t::push(_ws_p, &tmp);
}
};
class _idx_finder
{
private:
const word::idx_t _idx;
public:
_idx_finder(const word::idx_t idx) : _idx(idx) {}
bool operator () (const word::idx_t *const idx) const
{
return (_idx == *idx);
}
};
explicit word(const tiny_string *const ts) :
_connected_words(), _ts(*ts), _color(0) {}
public:
static word *get_by_idx(const idx_t idx)
{
assert(idx < _next_idx);
return _words + idx;
}
// Reserves space for the given number of words.
static void reserve_words(const size_t words_count)
{
assert(_words == 0);
if (words_count > _MAX_WORDS_COUNT) {
cerr << "Cannot allocate space for " << words_count << " words. Maximum " <<
"number of words is " << _MAX_WORDS_COUNT <<
"Widen word::idx_t for handling more words" << endl;
::exit(EXIT_FAILURE);
}
if (words_count > SIZE_MAX / sizeof(_words[0])) {
cerr << "Cannot allocate space for " << words_count << " words" << endl;
::exit(EXIT_FAILURE);
}
const size_t buf_size = words_count * sizeof(_words[0]);
_words = (word *) operator new (buf_size);
assert((idx_t) words_count == words_count);
_words_count = words_count;
_next_idx = 0;
cdbg << "*** Allocated " << buf_size << " bytes for " << words_count <<
" words" << endl;
}
// Releases space allocated by reserve_words().
static void release_words()
{
assert(_words != 0);
assert(_words_count <= _MAX_WORDS_COUNT);
assert(_next_idx <= _words_count);
for (idx_t i = 0; i < _next_idx; ++i) {
get_by_idx(i)->~word();
}
operator delete (_words);
_words = 0;
cdbg << "*** Freed " << (_words_count * sizeof(_words[0])) <<
" bytes allocated for words" << endl;
}
// Adds new word into words storage.
static void add_word(const tiny_string *const ts)
{
assert(_words != 0);
assert(_words_count <= _MAX_WORDS_COUNT);
assert(_next_idx < _words_count);
++_next_idx;
word *const w = get_by_idx(_next_idx - 1);
new (w) word(ts);
cdbg << "Read " << w << endl;
}
bool is_connected_to(const word *const w) const
{
const idx_t idx = _get_idx(w);
const _idx_finder idf(idx);
const idx_t *const tmp = _connected_words.find_if(&idf);
if (tmp != 0) {
assert(*tmp == idx);
return true;
}
return false;
}
void connect_to(const word *const w)
{
assert(w->_ts.size() == _ts.size() ||
w->_ts.size() + 1 == _ts.size() ||
w->_ts.size() == _ts.size() + 1);
assert(!is_connected_to(w));
idx_t idx = _get_idx(w);
_connected_words.add(&idx);
}
void brush_connected_words(stack_t **const ws_p, const color_t color)
{
assert(color == _color);
cdbg << "Brushing words connected to " << this << " into color=" <<
color << endl;
const _words_brusher wb(ws_p, color);
_connected_words.for_each(&wb);
}
color_t get_color() const
{
return _color;
}
void set_color(const color_t color)
{
cdbg << "Brushing " << this << " with color=" << color << endl;
assert(_color == 0);
_color = color;
}
const tiny_string *get_string() const
{
return &_ts;
}
// Calls word_visitor for each word in the dictionary.
// The order of visited words is unspecified.
template <class WordVisitor>
static void for_each(WordVisitor *const word_visitor)
{
assert(_words != 0);
assert(_next_idx <= _words_count);
for (idx_t i = 0; i < _next_idx; ++i) {
const idx_t idx = i;
(*word_visitor)(_words + idx, idx);
}
}
#ifndef NDEBUG
void debug_print(ostream &output) const
{
output << "word(" << _ts << ")";
}
#endif
DENY_COPYING(word)
};
#pragma pack(pop)
const word::idx_t word::NULL_IDX = ((word::idx_t)-1);
word *word::_words;
word::idx_t word::_words_count;
word::idx_t word::_next_idx;
#ifndef NDEBUG
ostream &operator << (ostream &output, const word *const w)
{
w->debug_print(output);
return output;
}
#endif
// This class holds information related to a subword.
// Subword is a tuple (word, position), where subword's string can
// be created by removing a character from the word at the given position.
#pragma pack(push, 1)
class subword
{
private:
// Word's index.
word::idx_t _idx;
// Position of the character to delete in the word.
tiny_string::ts_size_t _pos;
word *_word() const
{
return word::get_by_idx(_idx);
}
const tiny_string *_get_string() const
{
return _word()->get_string();
}
size_t _substring_size() const
{
const tiny_string *const ts = _get_string();
assert(ts->size() > 0);
assert(_pos <= ts->size());
const size_t ts_size = ts->size();
return _pos == ts_size ? ts_size : ts_size - 1;
}
static bool _substring_equal_to(const subword *const sw1,
const subword *const sw2)
{
assert(sw1->_substring_size() == sw2->_substring_size());
const tiny_string *const ts1 = sw1->_get_string();
const tiny_string *const ts2 = sw2->_get_string();
const tiny_string::ts_size_t pos1 = sw1->_pos;
const tiny_string::ts_size_t pos2 = sw2->_pos;
assert(pos1 <= pos2);
assert(pos1 <= ts1->size());
assert(pos2 <= ts2->size());
const char *const s1 = ts1->data();
const char *const s2 = ts2->data();
for (size_t i = 0; i < pos1; ++i) {
if (s1[i] != s2[i]) {
return false;
}
}
for (size_t i = pos1 + 1; i <= pos2; ++i) {
assert(i < ts1->size());
assert(i <= ts2->size());
if (s1[i] != s2[i - 1]) {
return false;
}
}
for (size_t i = pos2 + 1; i < ts2->size(); ++i) {
assert(i < ts1->size());
if (s1[i] != s2[i]) {
return false;
}
}
return true;
}
public:
subword(const word::idx_t idx, const tiny_string::ts_size_t pos) :
_idx(idx), _pos(pos)
{
assert(_pos <= _get_string()->size());
}
void connect_to(const subword *const sw)
{
word *const w1 = _word();
word *const w2 = sw->_word();
if (w1->is_connected_to(w2)) {
cdbg << w1 << " is already connected to " << w2 << endl;
return;
}
cdbg << "Connecting " << w1 << " to " << w2 << " via char ";
if (_get_string()->size() == sw->_get_string()->size()) {
cdbg << "substitution";
}
else {
assert(_get_string()->size() + 1 == sw->_get_string()->size());
cdbg << "insertion";
}
cdbg << " at position " << (size_t)sw->_pos << endl;
w1->connect_to(w2);
w2->connect_to(w1);
}
// Compares only subwords' substrings, but not character deletion positions.
bool substring_equal_to(const subword *const sw) const
{
if (_substring_size() != sw->_substring_size()) {
return false;
}
return (_pos <= sw->_pos) ? _substring_equal_to(this, sw) :
_substring_equal_to(sw, this);
}
bool equal_to(const subword *const sw) const
{
return (_pos == sw->_pos && substring_equal_to(sw));
}
uint32_t hash() const
{
// Subwords with the same substring must have identical hash value,
// so they will trap into the same hash table bucket.
// So don't mix _pos into hash value.
// This property is exploited when building links between words.
const tiny_string *const ts = _get_string();
const char *const s = ts->data();
uint32_t h = 0;
for (size_t i = 0; i < _pos; ++i) {
h = jenkins_hash::mix(h, s[i]);
}
for (size_t i = _pos + 1; i < ts->size(); ++i) {
h = jenkins_hash::mix(h, s[i]);
}
return jenkins_hash::final(h);
}
};
#pragma pack(pop)
// Mapping between a word string and a color.
#pragma pack(push, 1)
struct word_to_color
{
tiny_string ts;
word::color_t color;
word_to_color(const tiny_string *const ts_, const word::color_t color_) :
ts(*ts_), color(color_)
{
assert(ts.size() > 0);
}
bool equal_to(const word_to_color *const wc) const
{
return ts.equal_to(&wc->ts);
}
uint32_t hash() const
{
return ts.hash();
}
};
#pragma pack(pop)
// Subwords hashtable is used for building connections between words and
// for words' brushing.
// The hashtable is deleted after words are brushed.
typedef simple_hashtable<subword, 2> subwords_t;
subwords_t *subwords;
// Lookup table is used for mapping between dictionary words and
// the corresponding colors.
typedef simple_hashtable<word_to_color, 2> lookup_table_t;
lookup_table_t *lookup_table;
void get_stats_for_dict_words(const char *const string_buffer,
const size_t string_buffer_size, size_t *const words_count_p,
size_t *const total_words_size_p)
{
cdbg << "*** Populating dict word stats" << endl;
size_t words_count = 0;
size_t total_words_size = 0;
size_t i = 0;
while (i < string_buffer_size) {
char c = string_buffer[i++];
if (::isspace(c)) {
continue;
}
++words_count;
do {
++total_words_size;
if (i == string_buffer_size) {
break;
}
c = string_buffer[i++];
} while (!::isspace(c));
}
cdbg << "*** Total words count: " << words_count << endl;
cdbg << "*** Total chars in all words: " << total_words_size << endl;
*words_count_p = words_count;
*total_words_size_p = total_words_size;
}
void read_dict_words(char *const string_buffer, const size_t string_buffer_size)
{
cdbg << "*** Reading dict words from the input" << endl;
tiny_string::set_buffer(string_buffer, string_buffer_size);
tiny_string ts;
while (true) {
tiny_string::get_next_string(&ts);
size_t s_size = ts.size();
if (s_size == 0) {
cdbg << "*** End of input stream reached" << endl;
break;
}
word::add_word(&ts);
}
}
void build_subwords_for_word(const word *const w, const word::idx_t idx)
{
cdbg << "Building subwords for " << w << endl;
const tiny_string *const ts = w->get_string();
for (size_t pos = 0; pos < ts->size(); ++pos) {
subword sw(idx, pos);
subword *const tmp = subwords->insert(&sw);
if (tmp != 0) {
assert(sw.equal_to(tmp));
sw.connect_to(tmp);
}
}
}
void build_subwords_for_dict_words()
{
cdbg << "*** Building subwords for dict words" << endl;
word::for_each(&build_subwords_for_word);
}
class word_connections_builder
{
private:
subword *const _sw;
public:
word_connections_builder(subword *const sw) : _sw(sw) {}
void operator () (subword *const sw) const
{
assert(subwords->get_bucket(_sw) == subwords->get_bucket(sw));
if (sw->substring_equal_to(_sw)) {
_sw->connect_to(sw);
}
}
};
void build_connections_for_word(word *const w, const word::idx_t idx)
{
cdbg << "Building connections for " << w << endl;
const tiny_string *const ts = w->get_string();
subword sw(idx, ts->size());
const word_connections_builder wcb(&sw);
subwords->get_bucket(&sw)->for_each(&wcb);
}
void build_connections_for_dict_words()
{
cdbg << "*** Building connections for dict words" << endl;
word::for_each(&build_connections_for_word);
}
void brush_connected_graph(word::stack_t **const ws_p, word::idx_t idx,
const word::color_t color)
{
word *w = word::get_by_idx(idx);
assert(w->get_color() == 0);
assert(color > 0);
word::idx_t null_idx = word::NULL_IDX;
word::stack_t::push(ws_p, &null_idx);
w->set_color(color);
while (idx != word::NULL_IDX) {
w = word::get_by_idx(idx);
w->brush_connected_words(ws_p, color);
word::stack_t::pop(ws_p, &idx);
}
}
class dict_word_colorifier
{
private:
word::stack_t *_ws;
word::color_t _next_color;
public:
dict_word_colorifier() : _ws(0), _next_color(0)
{
// Pre-allocate words' stack here in order to decrease the number of silly
// new()/delete() calls for each brush_connected_graph() call.
word::idx_t null_idx = word::NULL_IDX;
word::stack_t::push(&_ws, &null_idx);
assert(_ws != 0);
}
~dict_word_colorifier()
{
assert(_ws != 0);
word::idx_t idx;
word::stack_t::pop(&_ws, &idx);
assert(idx == word::NULL_IDX);
assert(_ws == 0);
}
void operator () (word *const w, const word::idx_t idx)
{
if (w->get_color() != 0) {
assert(w->get_color() <= _next_color);
return;
}
++_next_color;
if (_next_color <= 0) {
cerr << "Color space exhausted. Widen word::color_t for handling " <<
"more colors" << endl;
::exit(EXIT_FAILURE);
}
brush_connected_graph(&_ws, idx, _next_color);
}
word::color_t get_next_color() const
{
return _next_color;
}
DENY_COPYING(dict_word_colorifier)
};
void colorify_dict_words()
{
cdbg << "*** Colorifying dict words" << endl;
dict_word_colorifier dwc;
word::for_each(&dwc);
cdbg << "*** Total colors=" << dwc.get_next_color() << endl;
}
void add_word_to_lookup_table(const word *const w, const word::idx_t idx)
{
(void)idx;
word_to_color wtc(w->get_string(), w->get_color());
word_to_color *const tmp = lookup_table->insert(&wtc);
if (tmp != 0) {
assert(tmp->equal_to(&wtc));
assert(tmp->color == wtc.color);
cdbg << "Skipping duplicate " << w << endl;
}
}
void build_lookup_table()
{
cdbg << "*** Building a lookup table for dict words" << endl;
word::for_each(&add_word_to_lookup_table);
}
bool get_word_color(istream &input, word::color_t *const color)
{
tiny_string ts;
tiny_string::read(input, &ts);
if (ts.size() == 0) {
cdbg << "End of input stream reached" << endl;
return false;
}
assert(ts.size() > 0);
cdbg << "Obtaining the color for word " << ts << endl;
const word_to_color dummy_wtc(&ts, 0);
const word_to_color *wtc = lookup_table->find(&dummy_wtc);
if (wtc == 0) {
cdbg << "Cannot find word " << ts << " in the dict" << endl;
*color = 0;
}
else {
assert(wtc->equal_to(&dummy_wtc));
*color = wtc->color;
assert(*color > 0);
cdbg << "The word " << ts << " has color=" << *color << endl;
}
return true;
}
void read_word_pairs(istream &input)
{
cdbg << "*** Started reading word pairs" << endl;
while (input.good()) {
word::color_t color1, color2;
if (!get_word_color(input, &color1) || !get_word_color(input, &color2)) {
break;
}
if (color1 == color2 && color1 != 0) {
cout << "YES" << endl;
}
else {
cout << "NO" << endl;
}
}
cdbg << "*** Finished reading word pairs" << endl;
}
void show_usage(const char *const program_name)
{
assert(program_name != 0);
cerr << "Usage: " << program_name << " path/to/dict/file" << endl;
cerr << "\tFeed word pairs to the input and the program will answer" << endl;
cerr << "\tYES or NO for each pair depending on special rules" << endl;
cerr << "\tdescribed at http://socialcam.com/jobs/problems " << endl;
cerr << "\t(Word Chains problem)" << endl;
}
void read_dict_to_string_buffer(const char *const dict_filename,
char **const string_buffer, size_t *const string_buffer_size)
{
ifstream *const dict_file = new ifstream(dict_filename);
if (!(*dict_file)) {
cerr << "Cannot open the file [" << dict_filename << "]" << endl;
::exit(EXIT_FAILURE);
}
// Try reading the whole file into memory in one read() syscall.
// mmap() could be faster, but it isn't portable.
filebuf *const fbuf = dict_file->rdbuf();
const size_t file_size = fbuf->pubseekoff(0, dict_file->end, dict_file->in);
*string_buffer = new char[file_size];
cdbg << "*** Allocated " << file_size << " bytes for string buffer" << endl;
fbuf->pubseekpos(0, dict_file->in);
const size_t bytes_read = fbuf->sgetn(*string_buffer, file_size);
if (bytes_read != file_size) {
cerr << "Expected to read " << file_size << " bytes from dict file, " <<
"but read only " << bytes_read << " bytes" << endl;
::exit(EXIT_FAILURE);
}
dict_file->close();
delete dict_file;
*string_buffer_size = file_size;
}
void destroy_string_buffer(char *const string_buffer,
const size_t string_buffer_size)
{
delete[] string_buffer;
cdbg << "*** Released " << string_buffer_size << " bytes allocated " <<
"for string buffer" << endl;
(void)string_buffer_size;
}
} // end of anonymous namespace
int main(const int argc, const char *const argv[]) {
if (argc != 2) {
assert(argc >= 1);
const char *const program_name = argv[0];
assert(program_name != 0);
show_usage(program_name);
::exit(EXIT_FAILURE);
}
const char *const dict_filename = argv[1];
assert(dict_filename != 0);
char *string_buffer;
size_t string_buffer_size;
read_dict_to_string_buffer(dict_filename, &string_buffer,
&string_buffer_size);
size_t words_count, total_words_size;
get_stats_for_dict_words(string_buffer, string_buffer_size, &words_count,
&total_words_size);
assert(total_words_size >= words_count);
if (words_count == 0) {
cerr << "File " << dict_filename << " contains zero dict words" << endl;
::exit(EXIT_FAILURE);
}
word::reserve_words(words_count);
read_dict_words(string_buffer, string_buffer_size);
assert(subwords == 0);
subwords = new subwords_t(total_words_size);
build_subwords_for_dict_words();
build_connections_for_dict_words();
colorify_dict_words();
assert(lookup_table == 0);
lookup_table = new lookup_table_t(words_count);
build_lookup_table();
delete subwords;
subwords = 0;
word::release_words();
read_word_pairs(cin);
delete lookup_table;
lookup_table = 0;
destroy_string_buffer(string_buffer, string_buffer_size);
cdbg << "*** Bye!" << endl;
return EXIT_SUCCESS;
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment