
=================================================
Text Transformations, e.g. for Full-text Indexing
=================================================

  ($Id$)

If a required converter program is not available, a warning is written
to Zope's server log. To be able to test this, we install a log handler
for that logger:

  >>> from zope.testing.loggingsupport import InstalledHandler
  >>> log = InstalledHandler('zope.server')
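
The kind of check this handler is meant to observe can be sketched as
follows. This is only an illustration under stated assumptions, not the
package's actual code: the helper name ``checkAvailable`` and the wording
of the message are hypothetical, and the sketch uses Python 3's
``shutil.which``. Only the logger name 'zope.server' is taken from the
text above.

```python
import logging
import shutil

# Logger name taken from the text above; everything else is an assumption.
logger = logging.getLogger('zope.server')


def checkAvailable(progname):
    """Return True if progname is found on the PATH.

    Hypothetical helper: if the program is missing, a warning is
    logged and False is returned, so the caller can skip the conversion.
    """
    if shutil.which(progname) is None:
        logger.warning('Converter program %s is not available.', progname)
        return False
    return True
```

With a handler installed on 'zope.server', a missing program would show up
as one warning record, while an available program would leave the log empty.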

The test files are in a subdirectory of the text package:

  >>> import os
  >>> from cybertools import text
  >>> testdir = os.path.join(os.path.dirname(text.__file__), 'testfiles')

HTML
----

  >>> from cybertools.text.html import htmlToText
  >>> html = open(os.path.join(testdir, 'selfhtml.html')).read()
  >>> plainText = htmlToText(html.decode('ISO8859-15'))
  >>> '<p>' in html
  True
  >>> '<p>' in plainText
  False
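
What "no tags in the result" means can be illustrated with a small
stand-alone sketch. It uses the standard library's ``HTMLParser`` under
Python 3 and is not the implementation of ``htmlToText``; the names
``TextExtractor`` and ``stripTags`` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect character data and drop all tags (hypothetical helper).

    Note: the content of <script> and <style> elements is character
    data too, so this naive version would keep it.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def stripTags(html):
    """Return the text content of an HTML string without any markup."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts).strip()
```

Feeding it a paragraph such as ``<p>Mary had a little lamb.</p>`` yields
the sentence alone, so a check like ``'<p>' in result`` is false.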

PDF Files
---------

Now let's convert a PDF file:

  >>> from cybertools.text.pdf import PdfTransform
  >>> transform = PdfTransform(None)
  >>> f = open(os.path.join(testdir, 'mary.pdf'))

This will be transformed to plain text:

  >>> result = transform(f)

Let's check the log; it should be empty:

  >>> print log

So what is in the plain text result?

  >>> words = result.split()
  >>> len(words)
  89
  >>> u'lamb' in words
  True
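
The introduction mentions external converter programs. How a transform
might hand a file to such a program can be sketched as below. This is an
assumption about the general approach, not the code of ``PdfTransform``:
the function name ``runConverter`` is hypothetical, the actual command
lines used by the package are not shown in this document, and a Python
one-liner stands in for the converter so that the sketch runs anywhere
(Python 3).

```python
import subprocess
import sys


def runConverter(command, data):
    """Pipe data (bytes) through command and return its output as text.

    Hypothetical helper; raises CalledProcessError if the command
    exits with a non-zero status.
    """
    proc = subprocess.run(command, input=data, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, check=True)
    return proc.stdout.decode('utf-8', 'replace')


# Stand-in "converter": copies its standard input to standard output.
echo = [sys.executable, '-c',
        'import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())']
words = runConverter(echo, b'Mary had a little lamb').split()
```

The resulting ``words`` list can then be checked in the same way as the
transform results in this document, by length and by membership.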

Word Documents
--------------

  >>> from cybertools.text.doc import DocTransform
  >>> transform = DocTransform(None)
  >>> f = open(os.path.join(testdir, 'mary.doc'))
  >>> result = transform(f)
  >>> print log
  >>> words = result.split()
  >>> len(words)
  89
  >>> u'lamb' in words
  True

RTF Files
---------

  >>> from cybertools.text.rtf import RtfTransform
  >>> transform = RtfTransform(None)
  >>> f = open(os.path.join(testdir, 'mary.rtf'))
  >>> result = transform(f)
  >>> print log
  >>> words = result.split()
  >>> len(words)
  90
  >>> u'lamb' in words
  True

PowerPoint Presentations
------------------------

  >>> from cybertools.text.ppt import PptTransform
  >>> transform = PptTransform(None)
  >>> f = open(os.path.join(testdir, 'mary.ppt'))
  >>> result = transform(f)
  >>> print log
  >>> words = result.split()
  >>> len(words)
  102
  >>> u'lamb' in words
  True

Excel Spreadsheets
------------------

  >>> from cybertools.text.xls import XlsTransform
  >>> transform = XlsTransform(None)
  >>> f = open(os.path.join(testdir, 'mary.xls'))
  >>> result = transform(f)
  >>> print log
  >>> words = result.split()
  >>> len(words)
  89
  >>> u'lamb' in words
  True

OpenOffice
----------

  >>> from cybertools.text.ooffice import OOTransform
  >>> transform = OOTransform(None)
  >>> f = open(os.path.join(testdir, 'mary.odt'))
  >>> result = transform(f)
  >>> print log
  >>> words = result.split()
  >>> len(words)
  89
  >>> u'lamb' in words
  True

  >>> f = open(os.path.join(testdir, 'mary.ods'))
  >>> result = transform(f)
  >>> len(result.split())
  89

  >>> f = open(os.path.join(testdir, 'mary.odp'))
  >>> result = transform(f)
  >>> len(result.split())
  99
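
A single transform can serve text documents, spreadsheets, and
presentations because OpenDocument files are ZIP archives that keep
their content in a member named ``content.xml``. The following sketch
shows that idea with the standard library only (Python 3). It is not the
implementation of ``OOTransform``; the function name ``odfToText`` is
hypothetical, and the tiny in-memory archive merely stands in for a real
file such as ``mary.odt``.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def odfToText(fileobj):
    """Return the text found in content.xml of an ODF file object."""
    with zipfile.ZipFile(fileobj) as archive:
        root = ET.fromstring(archive.read('content.xml'))
    # itertext() yields all character data in document order; the
    # pieces are joined with spaces so that words do not run together.
    return ' '.join(t.strip() for t in root.itertext() if t.strip())


# Build a minimal stand-in archive in memory (not a valid ODF document).
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as archive:
    archive.writestr('content.xml',
                     '<doc><p>Mary had a little</p><p>lamb</p></doc>')
buf.seek(0)
words = odfToText(buf).split()
```

Because only the archive layout matters here, the same function would be
applicable to ``.odt``, ``.ods``, and ``.odp`` files alike.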