on 03-25-2004 11:34 AM
I created a file system repository containing some text files of keywords in order to provide an initial training set for a taxonomy. According to the Classification Inbox, "text/plain", "text" and "plain" have been included among the keywords for most of the categories created. How do I stop this happening, and is it possible to remove keywords that have been incorrectly learned from documents that have since been automatically classified? Is there a "do not use these words" list these can be added to?
I have to use the example-based taxonomy as this is a project requirement.
Thanks,
-Richard
No, unfortunately it's not possible to influence the extracted keywords directly. I assume that the mentioned technical keywords are attributes of your documents.
But this only occurs when the documents contain very little content text. Can this be the case in your example? If so, sample-based classification will not work anyway!
Regards Matthias
Update:
I've since been involved in a project where we demonstrated both Query and Example based taxonomies on the same content.
Query-based taxonomy took a lot longer to set up, and we needed several reviews of the queries and results to get something suitable, but it has the advantage of allowing rules based on CM properties.
Example-based taxonomy was nice and quick to set up once we had secured example documents (don't underestimate this - it took weeks). We eventually took a large document that summarised every category they wanted to use, broke it up into individual documents of a couple of paragraphs each and fed these into TREX. We found it was pretty accurate with about 6 lines of text to work on.
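For anyone wanting to repeat the splitting step, here is a minimal sketch in plain Python. It assumes the summary document is available as plain text with blank lines between paragraphs; the function names, the `example_NNN.txt` file naming and the default of two paragraphs per file are illustrative choices, not anything TREX or the portal requires.

```python
import re
from pathlib import Path


def split_into_examples(text, paragraphs_per_doc=2):
    """Split text on blank lines and group the paragraphs into small example documents."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        "\n\n".join(paragraphs[i:i + paragraphs_per_doc])
        for i in range(0, len(paragraphs), paragraphs_per_doc)
    ]


def write_examples(chunks, out_dir, prefix="example"):
    """Write each chunk to its own .txt file (hypothetical naming) and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, chunk in enumerate(chunks, start=1):
        path = out / f"{prefix}_{n:03d}.txt"
        path.write_text(chunk, encoding="utf-8")
        paths.append(path)
    return paths
```

The resulting files could then be placed in the file system repository that feeds the taxonomy; which chunk belongs to which category still has to be decided by hand.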
-R
Hi!
Yes, it's picking up attributes as keywords, normally the filename, extension and MIME type. The example documents contained just the 3 - 6 keywords and synonyms we wanted to base the category on as the initial training set.
How small can a sample document be before it picks up attributes?
Even when we've obtained real documents, it sometimes learns the wrong words from the text, such as the company name or the author. Even if we manually reclassify documents, this does not appear to affect the keywords it's using, so the next time similar documents are added (i.e. company documents with the company name on them) they go back into the category again.
Cheers,
-Richard
Hi Richard,
A possibility that came to my mind is to start with a query-based taxonomy, using as queries the keywords you put in the small training documents at the beginning. Then, after a certain number of documents have been classified, you can switch the taxonomy to "training-based" in the index administration iView. The more documents have been correctly classified by query in your categories, the better the training-based classification should work.
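To illustrate the idea only, here is a toy sketch in plain Python of the first phase: assigning documents to categories by keyword queries, so that the assigned documents can later serve as training examples. This is not the portal or TREX API; the function name, the category names and the sample data are all made up for the example.

```python
import re


def classify_by_keywords(documents, category_keywords):
    """Return {category: [document names]} for documents containing any category keyword."""
    assignments = {category: [] for category in category_keywords}
    for name, text in documents.items():
        # Compare on lower-cased whole words so "Invoice" matches "invoice".
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        for category, keywords in category_keywords.items():
            if words & {k.lower() for k in keywords}:
                assignments[category].append(name)
    return assignments


# Hypothetical sample data
documents = {
    "a.txt": "Invoice for travel expenses",
    "b.txt": "Holiday request form",
}
category_keywords = {
    "Finance": ["invoice", "expenses"],
    "HR": ["holiday", "leave"],
}
seed_sets = classify_by_keywords(documents, category_keywords)
```

Each list in `seed_sets` corresponds to the documents a keyword query would have put into that category, which is the material the training-based classification would then learn from.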
regards
Gabriel
Hi Gabriel,
What would be the advantage over the taxonomy training iView? There you search for documents with which to train the nodes of an example-based taxonomy, and the advanced search interface offers basically the same possibilities for a single query as the Taxonomy Maintenance UI does for query-based taxonomies (QBT).
Generally speaking, though, I'd use QBT anyhow in a case where the criteria seem to be focussed so much on single keywords.
Regards,
Karsten