
Tuesday, September 6, 2011

Using Rule-Based or Supervised Document Classification for Legacy Data Migration in a PLM Implementation

I was reading about Latent Semantic Analysis (LSA, also called Latent Semantic Indexing) recently and the controversy over whether Google used this technique to rank its search results, though the consensus seems to be that Google uses far more sophisticated statistical methods of text analysis. Latent semantic analysis is a natural language processing technique for analyzing relationships between a set of documents and the terms they contain, by producing a set of concepts related to those documents and terms. LSA assumes that words that are close in meaning will occur close together in text.
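To make the construction concrete, here is a minimal sketch of LSA: reduce a TF-IDF term-document matrix with truncated SVD so that documents sharing related vocabulary land near each other in a low-dimensional "concept" space. The library choice (scikit-learn) and the sample documents are my own assumptions for illustration, not something prescribed in the discussion above.

```python
# Minimal LSA sketch: TF-IDF followed by truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "bracket assembly drawing revision",
    "assembly drawing for the bracket",
    "purchase order for raw steel",
    "steel purchase requisition form",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)        # sparse term-document matrix (docs x terms)

lsa = TruncatedSVD(n_components=2)   # project into 2 latent "concepts"
concepts = lsa.fit_transform(X)      # related documents end up close together
print(concepts)
```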

Latent semantic indexing is closely related to LSA and is used in an assortment of information retrieval and text processing applications, although its primary use is automated document categorization. Document classification (or categorization) assigns an electronic document to one or more categories based on its contents. Document classification tasks can be divided into two sorts: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification for documents, and unsupervised document classification (also known as document clustering), where the classification must be done entirely without reference to external information. There is also semi-supervised document classification, where only part of the document set is labeled by the external mechanism (rule-based classification falls in this family). There are open source tools, like MALLET, for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and more.
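A quick sketch of the supervised flavor, since that is the one proposed below for migration work: train a classifier on a small set of human-labeled documents and let it predict categories for the rest. I use scikit-learn here for brevity (MALLET supports the same workflow in Java); the training documents and labels are invented examples.

```python
# Supervised document classification: labeled examples in, predictions out.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "engineering change order for bracket",
    "change notice affecting assembly",
    "supplier quality audit report",
    "incoming inspection report for castings",
]
train_labels = ["change", "change", "quality", "quality"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)

print(clf.predict(["change request for the casting drawing"]))  # expected: ['change']
```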

The reason I bring up this topic is my experience with legacy data import during PLM implementations. Legacy data migration is tough, to say the least. Stephen Porter gives a good overview here: The PLM State: What’s the big deal about data migration?. Some of my real-life experiences include:

1. Manually scanning historical documents, manually classifying them into folders, and uploading them to the PLM environment using vendor tools, as part of an FDA-regulated organization’s implementation.

2. Extracting legacy data from a commercial document management system, mapping the data to the vendor’s PLM system, cleansing the legacy data, and finally importing it.

3. Legacy system consolidation: merging numerous home-grown legacy systems into one commercial PLM system.

None of these processes was scalable or easy to get started with, and the time they would take could not be reliably estimated up front. In such scenarios, wouldn’t using rule-based or supervised document classification make sense? Admittedly, CAD data would be difficult to handle, and historical revisions or intermediate iterations of files between releases might be lost; but for non-CAD data, such techniques could save a huge investment of time and labor in legacy data migrations.
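As a closing sketch of what that hybrid could look like: cheap rules catch the obviously identifiable legacy documents, and a trained classifier (such as the pipeline from the earlier sketch) handles the remainder. The rule patterns and category names below are hypothetical examples, not drawn from any particular PLM vendor’s schema.

```python
# Hybrid classification: rule-based pass first, supervised fallback second.
import re

RULES = [
    (re.compile(r"\b(ECO|engineering change)\b", re.I), "Change Order"),
    (re.compile(r"\b(SOP|standard operating procedure)\b", re.I), "Procedure"),
]

def classify(text, fallback_model):
    """Try the hand-written rules; fall back to the supervised model."""
    for pattern, category in RULES:
        if pattern.search(text):
            return category                    # rule-based path
    return fallback_model.predict([text])[0]  # supervised path

# e.g. classify("ECO 1042: bracket redesign", clf) -> "Change Order"
```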