loops/loops/classifier
2024-10-02 12:54:34 +02:00
..
testdata start working on Python3-compatible version 2024-09-23 17:30:48 +02:00
__init__.py start working on Python3-compatible version 2024-09-23 17:30:48 +02:00
base.py classifier: Python3 fixes 2024-09-26 22:30:52 +02:00
browser.py start working on Python3-compatible version 2024-09-23 17:30:48 +02:00
classifier_macros.pt start working on Python3-compatible version 2024-09-23 17:30:48 +02:00
configure.zcml start working on Python3-compatible version 2024-09-23 17:30:48 +02:00
interfaces.py start working on Python3-compatible version 2024-09-23 17:30:48 +02:00
README.txt classifier: Python3 fixes 2024-09-26 22:30:52 +02:00
sample.py classifier: Python3 fixes 2024-09-26 22:30:52 +02:00
standard.py start working on Python3-compatible version 2024-09-23 17:30:48 +02:00
tests.py avoid testing deprecation errors with Python3.12 2024-10-02 12:54:34 +02:00
testsetup.py storage: tests with Python3 OK 2024-09-28 09:03:56 +02:00

===============================================================
loops - Linked Objects for Organization and Processing Services
===============================================================

Automatic classification of resources.


Setting up a loops Site and Utilities
=====================================

Let's do some basic set up

  >>> from zope import component, interface
  >>> from zope.traversing.api import getName
  >>> from zope.app.testing.setup import placefulSetUp, placefulTearDown
  >>> site = placefulSetUp(True)

and build a simple loops site with a concept manager and some concepts
(with a relation registry, a catalog, and all the type machinery - what
in real life is done via standard ZCML setup or via local utility
configuration):

  >>> from loops.classifier.testsetup import TestSite
  >>> t = TestSite(site)
  >>> concepts, resources, views = t.setup()

  >>> len(concepts), len(resources)
  (15, 0)

Let's now add an external collection that reads in a set of resources
from external files so we have something to work with.

  >>> from loops.concept import Concept
  >>> from loops.setup import addObject
  >>> from loops.common import adapted
  >>> from loops.classifier.testsetup import dataDir

  >>> tExternalCollection = concepts['extcollection']
  >>> coll01 = addObject(concepts, Concept, 'coll01',
  ...                    title='Collection One', conceptType=tExternalCollection)
  >>> aColl01 = adapted(coll01)
  >>> aColl01.baseAddress = dataDir
  >>> aColl01.address = ''

  >>> aColl01.update()
  >>> len(resources)
  7
  >>> rnames = list(sorted(resources.keys()))
  >>> rnames[0]
  'cust_im_contract_webbg_20071015.txt'


Filename-based Classification
=============================

Let's first look at the external address (i.e. the file name) of the
resource we want to classify.

  >>> r1 = resources[rnames[0]]
  >>> adapted(r1)
  <loops.resource.ExternalFileAdapter object ...>
  >>> adapted(r1).externalAddress
  'cust_im_contract_webbg_20071015.txt'

OK, that's what we need. So we get the preconfigured classifier
(see testsetup.py) and let it classify the resource.

  >>> classifier = adapted(concepts['fileclassifier'])

Before just processing the resource we'll have a look at the details
and follow the classifier step by step.

  >>> from loops.classifier.base import InformationSet
  >>> from loops.classifier.interfaces import IExtractor, IAnalyzer
  >>> infoSet = InformationSet()
  >>> for name in classifier.extractors.split():
  ...     print('extractor:', name)
  ...     extractor = component.getAdapter(adapted(r1), IExtractor, name=name)
  ...     infoSet.update(extractor.extractInformationSet())
  extractor: filename

  >>> infoSet
  {'filename': 'cust_im_contract_webbg_20071015'}

Let's now use the sample analyzer - an example that interprets very carefully
the underscore-separated parts of the filename.

  >>> analyzer = component.getAdapter(classifier, name=classifier.analyzer)
  >>> statements = analyzer.extractStatements(infoSet)
  >>> statements
  []

So there seems to be something missing - we have to create concepts
that may be identified as being candidates for classification.

  >>> tInstitution = addObject(concepts, Concept, 'institution',
  ...                     title='Institution', conceptType=concepts['type'])
  >>> cust_im = addObject(concepts, Concept, 'im_editors',
  ...                     title='im Editors', conceptType=tInstitution)
  >>> cust_mc = addObject(concepts, Concept, 'mc_consulting',
  ...                     title='MC Management Consulting', conceptType=tInstitution)

  >>> tDoctype = addObject(concepts, Concept, 'doctype',
  ...                     title='Document Type', conceptType=concepts['type'])
  >>> dt_note = addObject(concepts, Concept, 'dt_note',
  ...                     title='Note', conceptType=tDoctype)
  >>> dt_contract = addObject(concepts, Concept, 'dt_contract',
  ...                     title='Contract', conceptType=tDoctype)

  >>> tPerson = concepts['person']
  >>> webbg = addObject(concepts, Concept, 'webbg',
  ...                     title='Gerald Webb', conceptType=tPerson)
  >>> smitha = addObject(concepts, Concept, 'smitha',
  ...                     title='Angelina Smith', conceptType=tPerson)
  >>> watersj = addObject(concepts, Concept, 'watersj',
  ...                     title='Jerry Waters', conceptType=tPerson)
  >>> millerj = addObject(concepts, Concept, 'millerj',
  ...                     title='Jeannie Miller', conceptType=tPerson)

  >>> t.indexAll(concepts, resources)

  >>> from zope.catalog.interfaces import ICatalog
  >>> cat = component.getUtility(ICatalog)

  >>> statements = analyzer.extractStatements(infoSet)
  >>> len(statements)
  3

So we are now ready to have the whole stuff run in one call.

  >>> classifier.process(r1)
  
Classifier fileclassifier: Assigning: ...

  >>> list(sorted([c.title for c in r1.getConcepts()]))
  ['Collection One', 'Contract', 'External File', 'Gerald Webb', 'im Editors']

  >>> for name in rnames[1:]:
  ...     classifier.process(resources[name])

Classifier fileclassifier: Assigning: ...

  >>> len(webbg.getResources())
  4
  >>> len(webbg.getResources((concepts['ownedby'],)))
  3

We can repeat the process without getting additional assignments.

  >>> for name in rnames[1:]:
  ...     classifier.process(resources[name])

Classifier fileclassifier: Already assigned: ...

  >>> len(webbg.getResources())
  4


Fin de partie
=============

  >>> placefulTearDown()