@YaroSpace
Created September 17, 2014 15:07
ComPort Architecture draft
Architecture description
-
Aggregator
Orchestrates the overall process from fetching to updating the DB;
schedules and manages aggregation jobs and their stages across the different modules
does:
fetch :all | :latest - accepts a block with a strategy to determine the latest
jobs :all | :current - (AggregationJob - status, stop, pause, resume)
update_job_status - hook to be called by other modules
knows:
resource_type :forum | :blog
resource_url
resource_download_schema - how to download sections/categories/topics/pages with posts
resource_parsing_schema - how to extract posts and their attributes from downloaded data
resource_normalize_schema - how to map parsed data to internal db schema
resource_validation_schema - how to validate parsed data
strategy - action on errors, callbacks
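
Taking the description above literally, a minimal Ruby sketch of the Aggregator could look like the following; the AggregationJob struct, the method bodies and the two-argument update_job_status signature are assumptions of the sketch, not part of the draft.

```ruby
# Hypothetical sketch only - names not listed in the draft are assumptions.
AggregationJob = Struct.new(:id, :status)

class Aggregator
  def initialize(settings)
    @resource_type = settings[:type]     # :forum | :blog
    @resource_url  = settings[:url]
    @schemas       = settings[:schemas]  # download/parsing/normalize/validation
    @strategy      = settings[:strategy] # e.g. :pause_on_error
    @jobs          = []
  end

  # fetch :all or :latest; the optional block supplies the strategy
  # that decides what counts as "latest"
  def fetch(scope = :all, &latest_strategy)
    job = AggregationJob.new(@jobs.size + 1, :running)
    @jobs << job
    # the Downloader would be started here with @schemas[:download_schema]
    job.id
  end

  def jobs(filter = :all)
    filter == :current ? @jobs.select { |j| j.status == :running } : @jobs
  end

  # hook called back by Downloader/Parser/Normalizer/Validator
  def update_job_status(job_id, changes)
    # react according to @strategy, e.g. pause the pipeline on error
  end
end
```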
Downloader
Executes download jobs - iterates over provided download_schema,
collecting pages with posts
Updates download_job status
does:
get :job_id, :all | options - options restrict the download
jobs :all | :current - (DownloadJob - status, stop, pause, resume)
knows:
download_schema - supplied by Aggregator
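
A matching sketch for the Downloader, again with hypothetical names; the page fetching is only stubbed, and passing the job id to update_job_status is an assumption (step 2b below writes the call without it).

```ruby
# Hypothetical sketch - DownloadJob and the fetching details are assumptions.
DownloadJob = Struct.new(:id, :status, :pages)

class Downloader
  def initialize(download_schema, aggregator)
    @schema     = download_schema   # supplied by the Aggregator
    @aggregator = aggregator
    @jobs       = []
  end

  # get :job_id, :all | options - options restrict the download
  def get(job_id, options = :all)
    job = DownloadJob.new(job_id, :running, [])
    @jobs << job
    # iterate over @schema here, collecting pages with posts into job.pages
    job.status = :done
    @aggregator.update_job_status(job.id, download: 'ok')
  rescue => error
    job.status = :failed
    @aggregator.update_job_status(job.id, download: error)
  end

  def jobs(filter = :all)
    filter == :current ? @jobs.select { |j| j.status == :running } : @jobs
  end
end
```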
Parser
Extracts posts and their attributes (category, topic, title, author, datetime)
from the download batch.
Updates parser_job status
does:
parse :job_id - extracts posts and attributes from job_id batch. Handles
links, images, emoji, etc.
jobs :all | :current - (ParseJob - status, stop, pause, resume)
knows:
parsing_schema - supplied by Aggregator (see the combined Parser/Normalizer/Validator sketch after the Validator section)
Normalizer
Maps parsed data to DB schema.
Updates normalizer_job status
does:
normalize :job_id - maps the data in the job_id batch to the DB schema
jobs :all | :current - (NormalizerJob - status, stop, pause, resume)
knows:
normalize_schema - supplied by Aggregator
Validator
Validates parsed data and marks OK/Check
Updates validator_job status
does:
validate :job_id - validates the data in job_id batch
jobs :all | :current - (ValidatorJob - status, stop, pause, resume)
knows:
validation_schema - supplied by Aggregator, :manual | :strategy
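
The Parser, Normalizer and Validator are described above with the same shape: a job per batch, a schema supplied by the Aggregator, and a status callback. A single hedged skeleton can therefore sketch all three; the Stage base class, the StageJob struct and every method body are assumptions of the sketch.

```ruby
# Hypothetical skeleton shared by Parser, Normalizer and Validator.
StageJob = Struct.new(:id, :status, :result)

class Stage
  def initialize(schema, aggregator)
    @schema     = schema      # parsing/normalize/validation schema from the Aggregator
    @aggregator = aggregator
    @jobs       = []
  end

  def jobs(filter = :all)
    filter == :current ? @jobs.select { |j| j.status == :running } : @jobs
  end

  private

  # run one stage job, reporting 'ok' or the error back to the Aggregator
  def run(job_id, label)
    job = StageJob.new(job_id, :running)
    @jobs << job
    job.result = yield
    job.status = :done
    @aggregator.update_job_status(job.id, label => 'ok')
  rescue => error
    job.status = :failed
    @aggregator.update_job_status(job.id, label => error)
  end
end

class Parser < Stage
  # parse :job_id - extract posts and attributes (category, topic, title,
  # author, datetime) from the batch; handle links, images, emoji, etc.
  def parse(job_id, batch)
    run(job_id, :parse) { batch.map { |page| extract_posts(page) } }
  end

  def extract_posts(page)
    { raw: page } # placeholder: apply @schema to the page here
  end
end

class Normalizer < Stage
  # normalize :job_id - map parsed posts onto the internal DB schema
  def normalize(job_id, posts)
    run(job_id, :normalize) { posts } # placeholder mapping
  end
end

class Validator < Stage
  # validate :job_id - mark each post OK or Check, per the validation schema
  def validate(job_id, posts)
    run(job_id, :validate) { posts.map { |post| { post: post, mark: :ok } } }
  end
end
```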
Execution flow
-
1.
a) Aggregator.new settings = {
:type => :forum,
:url => 'www.forum.com',
:schemas => {},
:strategy => :pause_on_error,
}
b) Aggregator.fetch :all
c) returns aggregator job_id
d) calls Downloader
2.
a) Downloader.get :job_id, settings[:download_schema] = {
:section_url => '/threads?s=',
:sections_range => '2..section_end',
:section_end => "find_css('.section').last",
:category_page => 'c=',
:categories_range => '1..categories_end',
:categories_end => "find_css('.categories-page').last",
:topics_page_url => 'page=',
:topics_pages_range => '1..topics_pages_end',
:topics_pages_end => "find_css('.topics-page').last",
:topic_url => 'topic=',
:topic_no => "topics_css('.topic').last",
:posts_page_url => 'page=',
:posts_pages_range => '1..posts_pages_end',
:posts_pages_end => "find_css('.posts-page').last",
}
b) On success - call Aggregator.update_job_status(download: 'ok')
on failure - call Aggregator.update_job_status(download: error)
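
A hedged sketch of the nested walk this schema implies (sections, categories, topic pages, topics, post pages); last_of stands in for evaluating the find_css/topics_css lookups against a fetched page, and the URL assembly is only illustrative.

```ruby
# Hypothetical schema walk - last_of and the URL assembly are stand-ins.
def collect_post_page_urls(schema, base_url)
  urls = []
  (2..last_of(schema[:section_end])).each do |s|
    (1..last_of(schema[:categories_end])).each do |c|
      (1..last_of(schema[:topics_pages_end])).each do |tp|
        (1..last_of(schema[:topic_no])).each do |t|
          (1..last_of(schema[:posts_pages_end])).each do |pp|
            urls << base_url + schema[:section_url] + s.to_s +
                    '&' + schema[:category_page] + c.to_s +
                    '&' + schema[:topics_page_url] + tp.to_s +
                    '&' + schema[:topic_url] + t.to_s +
                    '&' + schema[:posts_page_url] + pp.to_s
          end
        end
      end
    end
  end
  urls
end

# Stand-in: the real Downloader would run the schema's CSS lookup here
def last_of(_css_lookup)
  2
end
```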
3.
a) Parser.parse :job_id, settings[:parser_schema] - on job status change event for :downloader
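
The draft only names the hook; as one possible wiring (everything below is an assumption), a handler for the :downloader status change could look up the downloaded batch and hand it to the Parser for the same job.

```ruby
# Hypothetical event wiring for step 3a.
class PipelineHooks
  def initialize(downloader:, parser:)
    @downloader = downloader
    @parser     = parser
  end

  # e.g. the Downloader reports update_job_status(42, download: 'ok');
  # that status change kicks off the Parser for the same job.
  def update_job_status(job_id, changes)
    return unless changes[:download] == 'ok'
    batch = @downloader.jobs.find { |j| j.id == job_id }&.pages || []
    @parser.parse(job_id, batch)
  end
end
```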
@dmitry

dmitry commented Sep 17, 2014

I don't believe Downloader.get will work universally well. I would like to be able to extend the Aggregator to chats, products, and so on.

@dmitry

dmitry commented Sep 17, 2014

For the downloader, parser and normalizer I would like to use a DSL instead of plain mapping.
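
Purely to illustrate this suggestion (none of the following is in the draft), a small block-based DSL could replace the plain download_schema hash from step 2a along these lines:

```ruby
# Hypothetical DSL sketch - an alternative to the plain schema hash.
class SchemaDSL
  attr_reader :levels

  def initialize(&block)
    @levels = []
    instance_eval(&block)
  end

  # declare one nesting level: sections, categories, topic pages, ...
  def level(name, url_param:, range: nil, last: nil)
    @levels << { name: name, url_param: url_param, range: range, last: last }
  end
end

download_schema = SchemaDSL.new do
  level :section,     url_param: '/threads?s=', range: 2.., last: '.section'
  level :category,    url_param: 'c=',          range: 1.., last: '.categories-page'
  level :topics_page, url_param: 'page=',       range: 1.., last: '.topics-page'
  level :topic,       url_param: 'topic=',                  last: '.topic'
  level :posts_page,  url_param: 'page=',       range: 1.., last: '.posts-page'
end
```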
