
===============================================================
loops - Linked Objects for Organization and Processing Services
===============================================================

loops agents - running on client systems and other services,
collecting information and transferring it to the loops server.

  ($Id$)

This package does not depend on Zope or the other loops packages
but represents a standalone application.

But we need a reactor for working with Twisted; in order not to block
testing when running the reactor we use ``reactor.iterate()`` calls
wrapped in a ``tester`` object.

  >>> from loops.agent.tests import tester


Basic Implementation, Agent Core
================================

The agent uses Twisted's cooperative multitasking model.

This means that all calls to services (like crawler, transporter, ...)
return a deferred that must be supplied with a callback method (and in
most cases also an errback method).

  >>> from loops.agent import core
  >>> agent = core.Agent()

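To illustrate the pattern - this is generic Twisted usage, not part of
the agent API - attaching callbacks to a deferred looks like this::

  from twisted.internet.defer import Deferred

  def onSuccess(result):
      print 'collected:', result

  def onError(failure):
      print 'error:', failure.getErrorMessage()

  d = Deferred()            # stands in for a deferred returned by a service
  d.addCallback(onSuccess)
  d.addErrback(onError)
  d.callback('some data')   # the service fires this when its work is done
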

Configuration Management
========================

Functionality

- Storage of configuration parameters
- An interface to the browser-based user interface that allows
  editing the configuration parameters

All configuration parameters are always accessible via the ``config``
attribute of the agent object.

  >>> config = agent.config

This already provides all needed sections (transport, crawl, ui), so
we can directly put information into these sections by loading a
string with the corresponding assignment.

  >>> config.load('transport.serverURL = "http://loops.cy55.de"')
  >>> config.transport.serverURL
  'http://loops.cy55.de'

The settings may also use indexed access; thus we can model
configuration parameters with multiple instances (like crawling
jobs).

  >>> config.load('''
  ... crawl[0].type = "filesystem"
  ... crawl[0].directory = "documents/projects"
  ... ''')
  >>> config.crawl[0].type
  'filesystem'
  >>> config.crawl[0].directory
  'documents/projects'

Subsections are created automatically when they are first accessed.

  >>> config.load('ui.web.port = 8081')
  >>> config.ui.web.port
  8081

The ``setdefault()`` method allows us to retrieve a value, setting it
to a default if it is not found, in one statement.

  >>> config.ui.web.setdefault('port', 8080)
  8081
  >>> config.transport.setdefault('userName', 'loops')
  'loops'

  >>> sorted(config.transport.items())
  [('__name__', 'transport'), ('serverURL', 'http://loops.cy55.de'), ('userName', 'loops')]

We can output a configuration in a form that is ready for loading
just by converting it to a string representation.

  >>> print config
  crawl[0].directory = 'documents/projects'
  crawl[0].type = 'filesystem'
  transport.serverURL = 'http://loops.cy55.de'
  transport.userName = 'loops'
  ui.web.port = 8081

The configuration may also be saved to a file - for testing purposes
let's use the loops.agent package directory for storage; normally it
would be stored in the user's home directory.

  >>> import os
  >>> os.environ['HOME'] = os.path.dirname(core.__file__)

  >>> config.save()

  >>> fn = config.getDefaultConfigFile()
  >>> fn
  '....loops.agent.cfg'

  >>> print open(fn).read()
  crawl[0].directory = 'documents/projects'
  crawl[0].type = 'filesystem'
  transport.serverURL = 'http://loops.cy55.de'
  transport.userName = 'loops'
  ui.web.port = 8081

The simplified syntax
---------------------

  >>> config.load('''
  ... ui(
  ...     web(
  ...         port=11080,
  ... ))
  ... crawl[1](
  ...     type='outlook',
  ...     folder='inbox',
  ... )
  ... ''')
  >>> print config.ui.web.port
  11080

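The indexed crawl section has been filled via the same call syntax:

  >>> config.crawl[1].type
  'outlook'
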
Cleaning up
-----------

  >>> os.unlink(fn)


Scheduling
==========

Configuration (per job)

- schedule, repeating pattern, conditions (sketched below)
- following job(s), e.g. to start a transfer immediately after a crawl

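A sketch in terms of the parameter names used in the configuration
example further down; the values and the comments are illustrative
assumptions, only the parameter names appear in this document::

  crawl[0].starttime = 1199999999   # start time in seconds since the epoch
  crawl[0].repeat = 3600            # repeating pattern; 0 presumably means "run once"
  crawl[0].transport = 'dummy'      # the following job, started after the crawl
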
How does this work?
-------------------

  >>> from time import time

  >>> from loops.agent.schedule import Job
  >>> class TestJob(Job):
  ...     def execute(self):
  ...         d = super(TestJob, self).execute()
  ...         print 'executing'
  ...         return d

  >>> scheduler = agent.scheduler

The ``schedule()`` method accepts the start time as an optional second
argument; if it is not given the current time is used, i.e. the job is
started immediately.

  >>> startTime = scheduler.schedule(TestJob())

  >>> tester.iterate()
  executing

We can set up a more realistic example using the dummy crawler and
transporter classes from the testing package.

  >>> from loops.agent.testing import crawl
  >>> from loops.agent.testing import transport

  >>> crawlJob = crawl.CrawlingJob()
  >>> transporter = transport.Transporter(agent)
  >>> transportJob = transporter.createJob()
  >>> crawlJob.successors.append(transportJob)
  >>> startTime = scheduler.schedule(crawlJob)

The Job class offers two callback hooks: ``whenStarted`` and
``whenFinished``. Use these to get notified when a job starts or
finishes.

  >>> def finishedCB(job, result):
  ...     print 'Crawling finished, result:', result
  >>> crawlJob.whenFinished = finishedCB

Now let the reactor run...

  >>> tester.iterate()
  Crawling finished, result: [<loops.agent.testing.crawl.DummyResource ...>]
  Transferring: Dummy resource data for testing purposes.

Using configuration with scheduling
-----------------------------------

Let's start with a fresh agent, directly supplying the configuration
(just for testing).

  >>> config = '''
  ... crawl[0].type = 'dummy'
  ... crawl[0].directory = '~/documents'
  ... crawl[0].pattern = '*.doc'
  ... crawl[0].starttime = %s
  ... crawl[0].transport = 'dummy'
  ... crawl[0].repeat = 0
  ... transport.serverURL = 'http://loops.cy55.de'
  ... ''' % int(time())

  >>> agent = core.Agent(config)

We also register our dummy crawling job and transporter classes, as
we cannot perform real crawling and transfers when testing.

  >>> agent.crawlTypes = dict(dummy=crawl.CrawlingJob)
  >>> agent.transportTypes = dict(dummy=transport.Transporter)

  >>> agent.scheduleJobsFromConfig()

  >>> tester.iterate()
  Transferring: Dummy resource data for testing purposes.


Crawling
========

General
-------

Functionality

- search for new or changed resources according to the search and
  filter criteria
- keep a record of resources already transferred in order to avoid
  duplicate transfers (?)

Configuration (per crawl job)

- predefined metadata

Local File System
-----------------

Configuration (per crawl job; see the sketch below)

- directories to search
- filter criteria, e.g. file type

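A filesystem crawl job could thus be configured like this; ``type``
and ``directory`` are used earlier in this document, ``pattern``
appears in the scheduling example above, and the values are
illustrative::

  crawl[0].type = 'filesystem'
  crawl[0].directory = '~/documents/projects'
  crawl[0].pattern = '*.doc'
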
Metadata sources

- path, filename

Implementation and documentation: see loops/agent/crawl/filesystem.py
and .../filesystem.txt.

E-Mail Clients
--------------

Configuration (per crawl job)

- folders to search
- filter criteria (e.g. sender, receiver, subject patterns)

Metadata sources

- folder names (path)
- header fields (sender, receiver, subject, ...)

Special handling

- HTML vs. plain text content: if a mail contains both HTML and plain
  text parts the transfer may be limited to one of these parts
  (configuration setting)
- attachments may be ignored (configuration setting; useful when
  attachments are copied to the local filesystem and transferred from
  there anyway)

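A mail crawl job might then be configured as follows; ``type`` and
``folder`` appear in the simplified-syntax example above, while the
filter and handling parameter names are made-up illustrations::

  crawl[1].type = 'outlook'
  crawl[1].folder = 'inbox'
  crawl[1].subject = 'project*'       # hypothetical filter setting
  crawl[1].ignoreAttachments = True   # hypothetical handling setting
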

Transport
=========

Configuration

- ``transport.serverURL``: URL of the target loops site, e.g.
  "http://z3.loops.cy55.de/bwp/d5"
- ``transport.userName``, ``transport.password``: credentials for
  logging in to loops
- ``transport.machineName``: name under which the client computer is
  known to the loops server
- ``transport.method``, e.g. "PUT"

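Put together, a transport configuration might read as follows (all
values are illustrative)::

  transport.serverURL = 'http://z3.loops.cy55.de/bwp/d5'
  transport.userName = 'loops'
  transport.password = 'secret'
  transport.machineName = 'wombat'
  transport.method = 'PUT'
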
The following information is intended for the default transfer
protocol/method, HTTP PUT, but probably also pertains to other
protocols such as FTP.

Format/Information structure
----------------------------

- Metadata URL (for storing or accessing metadata sets - optional, see below):
  ``$loopsSiteURL/resource_meta/$machine_name/$user/$app/$path.xml``
- Resource URL (for storing or accessing the real resources):
  ``$loopsSiteURL/resource_data/$machine_name/$user/$app/$path``
- ``$app`` names the type of application providing the resource, e.g.
  "filesystem" or "mail"
- ``$path`` represents the full path, possibly with a drive
  specification in front (for filesystem resources on Windows), with
  special characters URL-escaped

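A minimal sketch of how such a resource URL might be assembled; the
function and the decision to escape all special characters, including
"/" and ":", are illustrative, not part of the agent API::

  import urllib

  def resourceURL(siteURL, machineName, user, app, path):
      # URL-escape all special characters in the path, including '/' and ':'
      return '/'.join([siteURL, 'resource_data', machineName, user,
                       app, urllib.quote(path, safe='')])

  resourceURL('http://loops.cy55.de', 'wombat', 'jim', 'filesystem',
              'C:/docs/report.doc')
  # 'http://loops.cy55.de/resource_data/wombat/jim/filesystem/C%3A%2Fdocs%2Freport.doc'
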
Note that the URL uniquely identifies the resource on the local
computer, so a resource transferred with exactly the same location
(path and filename) on the local computer as a previously transferred
resource will overwrite the old version; thus the classification of
the resource within loops won't get lost. (This is of no relevance to
emails.)

Metadata sets are XML files with metadata for the associated resource.
Usually a metadata set has the extension ".xml"; if the extension is
".zip" the metadata file is a compressed file that will be expanded on
the server.

Data files may also be compressed, in which case there must be a
corresponding entry in the associated metadata set.


Logging
=======

Configuration

- log format(s)
- log file(s) (or other forms of persistence)

Example
-------

We set the logging configuration to log level 20 (INFO), using the
standard log handler that prints to ``sys.stdout``.

  >>> agent.config.logging.standard = 20
  >>> logger = agent.logger
  >>> logger.setup()

Then we can log an event, providing a dictionary with the data to be
logged.

  >>> logger.log(dict(object='job', event='start'))
  20... event:start object:job

We can also look at the logging records collected in the logger.

  >>> len(logger)
  1

  >>> print logger[-1]
  20... event:start object:job


Software Loader
===============

Configuration (general)

- source list: URL(s) of site(s) providing updated or additional
  packages

Configuration (per install/update job)

- command: install, update, remove
- package names

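No loader examples are given in this document; a configuration might
look like the following sketch, in which the section name and all
parameter names are assumptions::

  loader.sources = ['http://loops.cy55.de/packages']
  loader[0].command = 'update'
  loader[0].packages = ['loops.agent']
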

Browser-based User Interface
============================

The user interface is provided via a browser-based application
based on Twisted and Nevow.