The 504s appear to begin at 16:54:27, which seems to be the start of the snowball.
None of the content changes leading up to that point seem particularly remarkable in the number of subscription contents they generated:
irb(main):086:0> pp SubscriptionContent.where(content_change_id: 21300..21335, digest_run_subscriber_id: nil).joins(:content_change).group(:content_change_id, "content_changes.content_id", "content_changes.created_at").order("content_changes.created_at").count
{[21300, "25000916-b720-4734-ada9-5e132cfb531f", Thu, 08 Feb 2018 16:19:57 UTC +00:00]=>128,
[21301, "c28088ef-056f-49f3-932f-8d52e2fc6094", Thu, 08 Feb 2018 16:21:52 UTC +00:00]=>128,
[21302, "a2d1f8b0-7f1c-476e-818e-cc8de5ba0660", Thu, 08 Feb 2018 16:22:01 UTC +00:00]=>1142,
[21303, "77ecb4f6-db29-4988-8480-76e3b851994a", Thu, 08 Feb 2018 16:23:55 UTC +00:00]=>128,
[21304, "40045056-cd2e-4f7a-90be-3f7896153c6c", Thu, 08 Feb 2018 16:25:31 UTC +00:00]=>128,
[21305, "7fccc33a-4df5-4df7-af61-48bb0d8407ed", Thu, 08 Feb 2018 16:27:00 UTC +00:00]=>128,
[21306, "76125ba7-8ed6-4c5a-8cc0-fec717b2197c", Thu, 08 Feb 2018 16:28:38 UTC +00:00]=>128,
[21307, "4d8cd561-9235-4463-ada2-182c9ebf6b97", Thu, 08 Feb 2018 16:30:07 UTC +00:00]=>128,
[21308, "fd844695-2d93-4702-bd93-65211383c07e", Thu, 08 Feb 2018 16:31:14 UTC +00:00]=>861,
[21309, "57d72545-48cb-49f2-8772-9cb9ff8ec672", Thu, 08 Feb 2018 16:32:42 UTC +00:00]=>737,
[21310, "2553904c-e9a7-4afe-b193-7973d7d21171", Thu, 08 Feb 2018 16:32:57 UTC +00:00]=>2169,
[21311, "e288bf95-c490-4d10-98cb-b6cc4f4f2e34", Thu, 08 Feb 2018 16:34:33 UTC +00:00]=>1448,
[21312, "03be0dc5-c4be-4063-9d1a-116c54c74a4c", Thu, 08 Feb 2018 16:37:06 UTC +00:00]=>1491,
[21313, "c90dd329-ea2a-413a-b927-a67bd3cbbb84", Thu, 08 Feb 2018 16:44:21 UTC +00:00]=>10001,
[21314, "d04468d8-16a7-42a1-9f97-3508e6635b4e", Thu, 08 Feb 2018 16:44:43 UTC +00:00]=>451,
[21315, "5fd954f4-7631-11e4-a3cb-005056011aef", Thu, 08 Feb 2018 16:46:07 UTC +00:00]=>1196,
[21316, "e2bc228e-e675-4269-9871-bda0fdda3cde", Thu, 08 Feb 2018 16:47:27 UTC +00:00]=>730,
[21317, "131e0adb-68e7-48a9-be6d-5896bcba51df", Thu, 08 Feb 2018 16:47:39 UTC +00:00]=>3728,
[21318, "41011c51-ed1a-44c7-a533-0ebe9d3f92e6", Thu, 08 Feb 2018 16:48:03 UTC +00:00]=>2169,
[21319, "5faa235a-7631-11e4-a3cb-005056011aef", Thu, 08 Feb 2018 16:49:06 UTC +00:00]=>1196,
[21320, "834d05c3-7a29-4c18-a6eb-9653b826ee12", Thu, 08 Feb 2018 16:49:28 UTC +00:00]=>304,
[21321, "67c92acb-7a04-4db0-b648-a15bd5980d98", Thu, 08 Feb 2018 16:51:00 UTC +00:00]=>1196,
[21322, "5faa23f6-7631-11e4-a3cb-005056011aef", Thu, 08 Feb 2018 16:52:58 UTC +00:00]=>1196,
[21323, "51959543-9fff-4768-a610-4af3e978378a", Thu, 08 Feb 2018 16:54:29 UTC +00:00]=>730,
[21324, "5faa3004-7631-11e4-a3cb-005056011aef", Thu, 08 Feb 2018 16:55:55 UTC +00:00]=>1196,
[21325, "5faa3004-7631-11e4-a3cb-005056011aef", Thu, 08 Feb 2018 16:55:56 UTC +00:00]=>1196,
[21326, "5faa3004-7631-11e4-a3cb-005056011aef", Thu, 08 Feb 2018 16:56:01 UTC +00:00]=>1196,
[21327, "5faa3004-7631-11e4-a3cb-005056011aef", Thu, 08 Feb 2018 16:56:08 UTC +00:00]=>1196,
[21328, "5faa3004-7631-11e4-a3cb-005056011aef", Thu, 08 Feb 2018 16:56:13 UTC +00:00]=>1196,
[21329, "5faa3004-7631-11e4-a3cb-005056011aef", Thu, 08 Feb 2018 16:56:18 UTC +00:00]=>1196,
[21330, "5faa3004-7631-11e4-a3cb-005056011aef", Thu, 08 Feb 2018 16:56:26 UTC +00:00]=>1196,
[21331, "5faa3004-7631-11e4-a3cb-005056011aef", Thu, 08 Feb 2018 16:56:30 UTC +00:00]=>1196,
[21332, "5faa3004-7631-11e4-a3cb-005056011aef", Thu, 08 Feb 2018 16:56:37 UTC +00:00]=>1196,
[21333, "5faa3004-7631-11e4-a3cb-005056011aef", Thu, 08 Feb 2018 16:56:40 UTC +00:00]=>1196,
[21334, "5faa3004-7631-11e4-a3cb-005056011aef", Thu, 08 Feb 2018 16:56:50 UTC +00:00]=>1196,
[21335, "5faa3004-7631-11e4-a3cb-005056011aef", Thu, 08 Feb 2018 16:56:51 UTC +00:00]=>1196}
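One pattern that does stand out in the output above is that content changes 21324 to 21335 all share the same content ID. A minimal Ruby sketch for surfacing such repeats, assuming rows shaped like the keys of the hash above (the helper name `repeated_content_ids` is hypothetical and not part of Email Alert API):

```ruby
# Hypothetical helper: takes [content_change_id, content_id, created_at] rows
# and returns a hash of content_id => [content_change_ids] for every
# content_id that appears more than once.
def repeated_content_ids(rows)
  rows
    .group_by { |_id, content_id, _created_at| content_id }
    .select { |_content_id, group| group.size > 1 }
    .transform_values { |group| group.map(&:first) }
end

# A few rows copied from the output above (timestamps kept as strings).
rows = [
  [21322, "5faa23f6-7631-11e4-a3cb-005056011aef", "2018-02-08 16:52:58"],
  [21323, "51959543-9fff-4768-a610-4af3e978378a", "2018-02-08 16:54:29"],
  [21324, "5faa3004-7631-11e4-a3cb-005056011aef", "2018-02-08 16:55:55"],
  [21325, "5faa3004-7631-11e4-a3cb-005056011aef", "2018-02-08 16:55:56"],
  [21326, "5faa3004-7631-11e4-a3cb-005056011aef", "2018-02-08 16:56:01"],
]

pp repeated_content_ids(rows)
```

Run against the full set of rows, this would list every content ID that was submitted more than once in the window, along with the content change IDs it produced.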
There don't seem to be remarkable amounts of any type of traffic before the problems emerge, just a general build-up.
January 31st 2018, 00:00:00.000 3
February 1st 2018, 00:00:00.000 6
February 2nd 2018, 00:00:00.000 7,248
February 5th 2018, 00:00:00.000 164
February 6th 2018, 00:00:00.000 11,226
February 7th 2018, 00:00:00.000 58,327
February 8th 2018, 00:00:00.000 80,929
Subscription contents for each content change (plucking out duplicates)
SELECT COUNT(*)
FROM
(
SELECT DISTINCT CONCAT(content_id, ' ', public_updated_at) AS uniq, id
FROM content_changes
WHERE created_at >= '2018-02-08 16:00' AND created_at < '2018-02-08 18:00'
ORDER BY id ASC
) AS cc
INNER JOIN subscription_contents AS sc ON sc.content_change_id = cc.id AND sc.digest_run_subscriber_id IS NULL;
count
--------
679778
(1 row)
Note: because id is part of the SELECT DISTINCT list, and assuming id is unique per row, this query cannot actually collapse duplicate content changes. That is consistent with the count matching the plain join below.
Subscription contents associated with content changes
SELECT COUNT(*)
FROM subscription_contents
INNER JOIN content_changes ON subscription_contents.content_change_id = content_changes.id
WHERE content_changes.created_at >= '2018-02-08 16:00' AND content_changes.created_at < '2018-02-08 18:00'
AND subscription_contents.digest_run_subscriber_id IS NULL;
count
--------
679778
(1 row)
Email Alert API appeared to be functioning correctly and didn't produce additional requests to Govdelivery.
However, Email Alert Service was having problems communicating with Email Alert API and was receiving 504 responses. By retrying these requests, it populated Email Alert API with duplicate notifications, which Email Alert API duly sent out.
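The failure mode described above can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: the request is assumed to have been processed even though the caller saw a 504, and `FakeApi`, `send_with_retries` and the 202 success status are hypothetical stand-ins, not the real Email Alert API or Email Alert Service code.

```ruby
# Hypothetical in-memory stand-in for the API. It stores every request it
# receives, but the caller sees a 504 for the first `slow_responses` calls,
# modelling a gateway timing out while the app still completes the work.
class FakeApi
  attr_reader :stored

  def initialize(slow_responses:)
    @slow_responses = slow_responses
    @stored = []
  end

  def create_content_change(payload)
    @stored << payload # the work is done regardless of what the caller sees
    if @slow_responses > 0
      @slow_responses -= 1
      504
    else
      202 # assumed success status
    end
  end
end

# Hypothetical client: resends the same payload whenever it sees a 504.
def send_with_retries(api, payload, max_attempts = 5)
  max_attempts.times do
    status = api.create_content_change(payload)
    return status unless status == 504
  end
  504
end

api = FakeApi.new(slow_responses: 2)
status = send_with_retries(api, { "content_id" => "5faa3004-7631-11e4-a3cb-005056011aef" })
puts "final status: #{status}, stored copies: #{api.stored.size}"
# prints "final status: 202, stored copies: 3"
```

The client only ever sees one success, yet three copies of the same content change have been stored, each of which would go on to generate its own set of emails.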
Cause of requests to Email Alert API returning 504
The calls from Notify to /status-updates have shown evidence of receiving 504 responses on a number of prior occasions. It therefore seems to be a common pattern that a large volume of emails can cause 504s. If this happens while content-change requests are coming in, those requests can also receive 504s, which is what caused the content-change requests to be retried.
Load balancing may be sending too much stuff to backend-3
This doesn't seem to be the case according to these graphs
Resolution Ideas
Return a 409 conflict response in Email Alert API if we receive a duplicate content change - DONE
Isolate the email-alert-api public endpoints to run on a separate app, so as to limit the effects
Perform testing in the staging environment to see what request volume we can actually handle and set rate-limiting accordingly (this was tested previously, so may need adjusting)
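The first idea above (a 409 for duplicate content changes) could look roughly like the following Ruby sketch. It assumes a duplicate is identified by content_id plus public_updated_at, the same pair the SQL above concatenates. `ContentChangeHandler` is a hypothetical in-memory illustration, not the change that was actually made to Email Alert API, and 202 is an assumed success status.

```ruby
require "set"

# Hypothetical handler: remembers which (content_id, public_updated_at) pairs
# it has already accepted and rejects repeats with a 409.
class ContentChangeHandler
  def initialize
    @seen = Set.new
  end

  # Returns an HTTP status code: 202 if accepted, 409 if it is a duplicate.
  def call(content_id:, public_updated_at:)
    key = [content_id, public_updated_at]
    # Set#add? returns nil when the key is already present.
    return 409 unless @seen.add?(key)

    202
  end
end

handler = ContentChangeHandler.new
change = {
  content_id: "5faa3004-7631-11e4-a3cb-005056011aef",
  public_updated_at: "2018-02-08 16:55:55",
}

puts handler.call(**change) # first attempt is accepted
puts handler.call(**change) # a retry of the same change is rejected
```

In a real app the check would need to be backed by the database (for example a uniqueness constraint) rather than process memory, so that it holds across multiple workers and restarts.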