
===============================================================
loops - Linked Objects for Organization and Processing Services
===============================================================

loops agents - running on client systems and other services,
collecting information and transferring it to the loops server.

  ($Id$)

This package does not depend on Zope or the other loops packages
but represents a standalone application.

But we need a reactor for working with Twisted; in order not to block
testing when running the reactor we use ``reactor.iterate()`` calls
wrapped in a ``tester`` object.

  >>> from loops.agent.tests import tester


Basic Implementation, Agent Core
================================

The agent uses Twisted's cooperative multitasking model.

This means that all calls to services (like crawler, transporter, ...)
return a deferred that must be supplied with a callback method (and in
most cases also an errback method).

  >>> from loops.agent import core
  >>> agent = core.Agent()

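To illustrate the pattern - this is generic Twisted usage, not part of
the agent API - attaching callbacks to a deferred looks like this::

  from twisted.internet.defer import Deferred

  def onSuccess(result):
      print 'collected:', result

  def onError(failure):
      print 'error:', failure.getErrorMessage()

  d = Deferred()            # stands in for a deferred returned by a service
  d.addCallback(onSuccess)
  d.addErrback(onError)
  d.callback('some data')   # the service fires this when its work is done
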

Configuration Management
========================

Functionality

- Storage of configuration parameters
- An interface to the browser-based user interface that allows
  editing the configuration parameters

All configuration parameters are always accessible via the ``config``
attribute of the agent object.

  >>> config = agent.config

This already provides all needed sections (transport, crawl, ui), so
we can directly put information into these sections by loading a
string with the corresponding assignment.

  >>> config.load('transport.serverURL = "http://loops.cy55.de"')
  >>> config.transport.serverURL
  'http://loops.cy55.de'

The settings may also use indexed access; thus we can model
configuration parameters with multiple instances (like crawling
jobs).

  >>> config.load('''
  ... crawl[0].type = "filesystem"
  ... crawl[0].directory = "documents/projects"
  ... ''')
  >>> config.crawl[0].type
  'filesystem'
  >>> config.crawl[0].directory
  'documents/projects'

Subsections are created automatically when they are first accessed.

  >>> config.load('ui.web.port = 8081')
  >>> config.ui.web.port
  8081

The ``setdefault()`` method allows us to retrieve a value, setting it
to a default if it is not found, in one statement.

  >>> config.ui.web.setdefault('port', 8080)
  8081
  >>> config.transport.setdefault('userName', 'loops')
  'loops'

  >>> sorted(config.transport.items())
  [('__name__', 'transport'), ('serverURL', 'http://loops.cy55.de'), ('userName', 'loops')]

We can output a configuration in a form that is ready for loading
just by converting it to a string representation.

  >>> print config
  crawl[0].directory = 'documents/projects'
  crawl[0].type = 'filesystem'
  transport.serverURL = 'http://loops.cy55.de'
  transport.userName = 'loops'
  ui.web.port = 8081

The configuration may also be saved to a file - for testing purposes
let's use the loops.agent package directory for storage; normally it
would be stored in the user's home directory.

  >>> import os
  >>> os.environ['HOME'] = os.path.dirname(core.__file__)

  >>> config.save()

  >>> fn = config.getDefaultConfigFile()
  >>> fn
  '....loops.agent.cfg'

  >>> print open(fn).read()
  crawl[0].directory = 'documents/projects'
  crawl[0].type = 'filesystem'
  transport.serverURL = 'http://loops.cy55.de'
  transport.userName = 'loops'
  ui.web.port = 8081

The simplified syntax
---------------------

  >>> config.load('''
  ... ui(
  ...     web(
  ...         port=11080,
  ... ))
  ... crawl[1](
  ...     type='outlook',
  ...     folder='inbox',
  ... )
  ... ''')
  >>> print config.ui.web.port
  11080

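The indexed crawl section has been filled via the same call syntax:

  >>> config.crawl[1].type
  'outlook'
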
Cleaning up
-----------

  >>> os.unlink(fn)


Scheduling
==========

Configuration (per job)

- schedule, repeating pattern, conditions (sketched below)
- following job(s), e.g. to start a transfer immediately after a crawl

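A sketch in terms of the parameter names used in the configuration
example further down; the values and the comments are illustrative
assumptions, only the parameter names appear in this document::

  crawl[0].starttime = 1199999999   # start time in seconds since the epoch
  crawl[0].repeat = 3600            # repeating pattern; 0 presumably means "run once"
  crawl[0].transport = 'dummy'      # the following job, started after the crawl
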
How does this work?
-------------------

  >>> from time import time

  >>> from loops.agent.schedule import Job
  >>> class TestJob(Job):
  ...     def execute(self):
  ...         d = super(TestJob, self).execute()
  ...         print 'executing'
  ...         return d

  >>> scheduler = agent.scheduler

The ``schedule()`` method accepts the start time as an optional second
argument; if it is not given the current time is used, i.e. the job is
started immediately.

  >>> startTime = scheduler.schedule(TestJob())

  >>> tester.iterate()
  executing

We can set up a more realistic example using the dummy crawler and
transporter classes from the testing package.

  >>> from loops.agent.testing import crawl
  >>> from loops.agent.testing import transport

  >>> crawlJob = crawl.CrawlingJob()
  >>> transporter = transport.Transporter(agent)
  >>> transportJob = transporter.createJob()
  >>> crawlJob.successors.append(transportJob)
  >>> startTime = scheduler.schedule(crawlJob)

The Job class offers two callback hooks: ``whenStarted`` and
``whenFinished``. Use these to get notified when a job starts or
finishes.

  >>> def finishedCB(job, result):
  ...     print 'Crawling finished, result:', result
  >>> crawlJob.whenFinished = finishedCB

Now let the reactor run...

  >>> tester.iterate()
  Crawling finished, result: [<loops.agent.testing.crawl.DummyResource ...>]
  Transferring: Dummy resource data for testing purposes.

Using configuration with scheduling
-----------------------------------

Let's start with a fresh agent, directly supplying the configuration
(just for testing).

  >>> config = '''
  ... crawl[0].type = 'dummy'
  ... crawl[0].directory = '~/documents'
  ... crawl[0].pattern = '*.doc'
  ... crawl[0].starttime = %s
  ... crawl[0].transport = 'dummy'
  ... crawl[0].repeat = 0
  ... transport.serverURL = 'http://loops.cy55.de'
  ... ''' % int(time())

  >>> agent = core.Agent(config)

We also register our dummy crawling job and transporter classes, as
we cannot perform real crawling and transfers when testing.

  >>> agent.crawlTypes = dict(dummy=crawl.CrawlingJob)
  >>> agent.transportTypes = dict(dummy=transport.Transporter)

  >>> agent.scheduleJobsFromConfig()

  >>> tester.iterate()
  Transferring: Dummy resource data for testing purposes.


Crawling
========

General
-------

Functionality

- search for new or changed resources according to the search and
  filter criteria
- keep a record of resources already transferred in order to avoid
  duplicate transfers (?)

Configuration (per crawl job)

- predefined metadata

Local File System
-----------------

Configuration (per crawl job; see the sketch below)

- directories to search
- filter criteria, e.g. file type

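A filesystem crawl job could thus be configured like this; ``type``
and ``directory`` are used earlier in this document, ``pattern``
appears in the scheduling example above, and the values are
illustrative::

  crawl[0].type = 'filesystem'
  crawl[0].directory = '~/documents/projects'
  crawl[0].pattern = '*.doc'
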
Metadata sources

- path, filename

Implementation and documentation: see loops/agent/crawl/filesystem.py
and .../filesystem.txt.

E-Mail Clients
--------------

Configuration (per crawl job)

- folders to search
- filter criteria (e.g. sender, receiver, subject patterns)

Metadata sources

- folder names (path)
- header fields (sender, receiver, subject, ...)

Special handling

- HTML vs. plain text content: if a mail contains both HTML and plain
  text parts the transfer may be limited to one of these parts
  (configuration setting)
- attachments may be ignored (configuration setting; useful when
  attachments are copied to the local filesystem and transferred from
  there anyway)

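A mail crawl job might then be configured as follows; ``type`` and
``folder`` appear in the simplified-syntax example above, while the
filter and handling parameter names are made-up illustrations::

  crawl[1].type = 'outlook'
  crawl[1].folder = 'inbox'
  crawl[1].subject = 'project*'       # hypothetical filter setting
  crawl[1].ignoreAttachments = True   # hypothetical handling setting
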

Transport
=========

Configuration

- ``transport.serverURL``: URL of the target loops site, e.g.
  "http://z3.loops.cy55.de/bwp/d5"
- ``transport.userName``, ``transport.password``: credentials for
  logging in to loops
- ``transport.machineName``: name under which the client computer is
  known to the loops server
- ``transport.method``, e.g. "PUT"

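Put together, a transport configuration might read as follows (all
values are illustrative)::

  transport.serverURL = 'http://z3.loops.cy55.de/bwp/d5'
  transport.userName = 'loops'
  transport.password = 'secret'
  transport.machineName = 'wombat'
  transport.method = 'PUT'
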
The following information is intended for the default transfer
protocol/method, HTTP PUT, but probably also pertains to other
protocols such as FTP.

Format/Information structure
----------------------------

- Metadata URL (for storing or accessing metadata sets - optional, see below):
  ``$loopsSiteURL/resource_meta/$machine_name/$user/$app/$path.xml``
- Resource URL (for storing or accessing the real resources):
  ``$loopsSiteURL/resource_data/$machine_name/$user/$app/$path``
- ``$app`` names the type of application providing the resource, e.g.
  "filesystem" or "mail"
- ``$path`` represents the full path, possibly with a drive
  specification in front (for filesystem resources on Windows), with
  special characters URL-escaped

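A minimal sketch of how such a resource URL might be assembled; the
function and the decision to escape all special characters, including
"/" and ":", are illustrative, not part of the agent API::

  import urllib

  def resourceURL(siteURL, machineName, user, app, path):
      # URL-escape all special characters in the path, including '/' and ':'
      return '/'.join([siteURL, 'resource_data', machineName, user,
                       app, urllib.quote(path, safe='')])

  resourceURL('http://loops.cy55.de', 'wombat', 'jim', 'filesystem',
              'C:/docs/report.doc')
  # 'http://loops.cy55.de/resource_data/wombat/jim/filesystem/C%3A%2Fdocs%2Freport.doc'
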
Note that the URL uniquely identifies the resource on the local
computer, so a resource transferred with exactly the same location
(path and filename) on the local computer as a previously transferred
resource will overwrite the old version; thus the classification of
the resource within loops won't get lost. (This is of no relevance to
emails.)

Metadata sets are XML files with metadata for the associated resource.
Usually a metadata set has the extension ".xml"; if the extension is
".zip" the metadata file is a compressed file that will be expanded on
the server.

Data files may also be compressed, in which case there must be a
corresponding entry in the associated metadata set.


Logging
=======

Configuration

- log format(s)
- log file(s) (or other forms of persistence)

Example
-------

We set the logging configuration to log level 20 (INFO), using the
standard log handler that prints to ``sys.stdout``.

  >>> agent.config.logging.standard = 20
  >>> logger = agent.logger
  >>> logger.setup()

Then we can log an event, providing a dictionary with the data to be
logged.

  >>> logger.log(dict(object='job', event='start'))
  20... event:start object:job

We can also look at the logging records collected in the logger.

  >>> len(logger)
  1

  >>> print logger[-1]
  20... event:start object:job


Software Loader
===============

Configuration (general)

- source list: URL(s) of site(s) providing updated or additional
  packages

Configuration (per install/update job)

- command: install, update, remove
- package names

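No loader examples are given in this document; a configuration might
look like the following sketch, in which the section name and all
parameter names are assumptions::

  loader.sources = ['http://loops.cy55.de/packages']
  loader[0].command = 'update'
  loader[0].packages = ['loops.agent']
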

Browser-based User Interface
============================

The user interface is provided via a browser-based application
based on Twisted and Nevow.