Former Member

Training example based taxonomy

I created a file system repository containing text files of keywords to provide an initial training set for a taxonomy. According to the Classification Inbox, "text/plain", "text" and "plain" have been included among the keywords for most of the categories created. How do I stop this from happening, and is it possible to remove keywords that have been incorrectly learned from documents that have since been automatically classified? Is there a "do not use these words" list that these can be added to?

I have to use the example-based taxonomy as this is a project requirement.

Thanks,

-Richard


3 Answers

  • Best Answer
    Apr 26, 2004 at 11:42 AM

    No, unfortunately it's not possible to influence the extracted keywords directly. I assume that the technical keywords you mention are attributes of your documents.

    But this only occurs when the documents contain very little content text. Could that be the case in your example? If so, sample-based classification will not work anyway.
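A rough way to check for this before training is to count the words in each sample file and flag the thin ones. This is only a sketch: the 40-word threshold and the function names are assumptions for illustration, since no exact figure is given here, and it does not touch TREX or the portal at all.

```python
from pathlib import Path

# Assumed threshold, not a documented TREX value: tune it against your own results.
MIN_WORDS = 40


def count_words(text):
    """Count whitespace-separated words in a piece of text."""
    return len(text.split())


def find_thin_samples(folder, min_words=MIN_WORDS):
    """Return the names of .txt sample files with fewer than min_words words."""
    thin = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if count_words(text) < min_words:
            thin.append(path.name)
    return thin
```

Running `find_thin_samples` on the repository folder lists the files that are most likely to end up with attribute values as their keywords.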

    Regards Matthias


  • Former Member
    Jun 24, 2004 at 03:46 PM

    Update:

    I've since been involved in a project where we demonstrated both Query and Example based taxonomies on the same content.

    Query based taxonomy took a lot longer to set up and we needed to have several reviews of queries and results to get something suitable, but it has the advantage of allowing rules based on CM properties.

    Example based taxonomy was nice and quick to set up once we had secured example documents (don't underestimate this - it took weeks). We eventually took a large document that summarised every category they wanted to use, broke it up into individual documents of a couple of paragraphs each and fed these into TREX. We found it was pretty accurate with about 6 lines of text to work on.
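The splitting step could be scripted along these lines. It is a minimal sketch, assuming the summary is plain text with paragraphs separated by blank lines; the function names and the `example_NNN.txt` file naming are hypothetical, and assigning each resulting file to a category is still a manual step.

```python
import re
from pathlib import Path


def split_into_examples(text, paragraphs_per_doc=2):
    """Split text on blank lines and group the paragraphs into small chunks."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        "\n\n".join(paragraphs[i:i + paragraphs_per_doc])
        for i in range(0, len(paragraphs), paragraphs_per_doc)
    ]


def write_examples(chunks, out_dir):
    """Write each chunk to its own numbered text file and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for n, chunk in enumerate(chunks, start=1):
        path = out / f"example_{n:03d}.txt"  # hypothetical naming scheme
        path.write_text(chunk, encoding="utf-8")
        written.append(str(path))
    return written
```

The output folder can then be placed in the file system repository used for the training documents.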

    -R


  • Former Member
    Apr 26, 2004 at 12:57 PM

    Hi!

    Yes, it's picking up attributes as keywords, normally the filename, extension and MIME type. The example documents contained just 3 - 6 keywords and synonyms that we wanted to base the category on, as the initial training set.

    How small can a sample document be before it picks up attributes?

    Even when we've obtained real documents, it sometimes learns the wrong words from the text, such as the company name or the author. Even if we manually reclassify documents, this does not appear to affect the keywords it's using, so the next time similar documents are added (i.e. company documents with the company name on them) they go back into the same category again.

    Cheers,

    -Richard


    • Hi Karsten,

      There is no advantage over the taxonomy training iView. It is just that, if there are a lot of documents, the manual classification of training documents would be cumbersome. I would also use QBT instead of TBT, but Richard said TBT is a project requirement.

      regards

      Gabriel