I was reading about Latent Semantic Analysis (LSA, also called latent semantic indexing) recently and the controversy over whether Google used this technique to rank their search results, though the consensus seems to be that they use much more sophisticated statistical methods of text analysis. Latent semantic analysis (LSA) is a natural language processing technique for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. Latent semantic indexing is closely related to LSA and is used in an assortment of information retrieval and text processing applications, although its primary use is for automated document categorization.
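To make the idea concrete, here is a minimal LSA sketch, assuming a scikit-learn stack and a tiny made-up corpus (both are illustrative choices on my part, not anything Google or a PLM vendor uses): build a TF-IDF term-document matrix, then reduce it with truncated SVD so each document gets a position in a small "concept" space.

```python
# Minimal LSA sketch: TF-IDF term-document matrix reduced with truncated SVD.
# Library choice (scikit-learn) and the tiny corpus are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "engine assembly drawing for the pump housing",
    "pump housing revision notes and change order",
    "quality inspection report for supplier parts",
    "supplier audit checklist and inspection results",
]

# Build the term-document matrix.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# Project documents into a small number of latent "concepts".
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_concepts = lsa.fit_transform(X)

for doc, vec in zip(docs, doc_concepts):
    print(f"{vec.round(2)}  {doc}")
```

Documents that share related vocabulary end up close together in that reduced space, which is what the "words close in meaning occur in similar text" assumption buys you.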
Document classification/categorization is used to assign an electronic document to one or more categories based on its contents. Document classification tasks can be divided into two sorts: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification for documents, and unsupervised document classification (also known as document clustering), where the classification must be done entirely without reference to external information. There is also semi-supervised document classification, where only some of the documents are labeled by the external mechanism (for example, rule based).
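As a quick, hedged illustration of the difference (the documents, labels, and scikit-learn pipeline below are assumptions for illustration, not from any real project): a supervised classifier learns from labeled examples and then predicts labels for new documents, while clustering groups unlabeled documents with no external labels at all.

```python
# Sketch of supervised classification vs. unsupervised clustering.
# Documents, labels, and the scikit-learn stack are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

train_docs = [
    "engineering change order for bracket redesign",
    "deviation report for out-of-spec weld",
    "standard operating procedure for incoming inspection",
]
train_labels = ["change_order", "quality_record", "procedure"]

# Supervised: learn from labeled examples, then predict labels for new documents.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_docs, train_labels)
print(clf.predict(["change order raised for housing redesign"]))

# Unsupervised: group unlabeled documents into clusters, no external labels used.
vec = TfidfVectorizer()
X = vec.fit_transform(train_docs)
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))
```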
There are open-source tools like Mallet for statistical natural language processing, document classification, clustering, topic modeling, information extraction, etc.
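Mallet itself is a Java toolkit driven from the command line; purely to illustrate what topic modeling produces, here is a minimal sketch using scikit-learn's LDA instead of Mallet (the corpus and topic count are made up):

```python
# Rough illustration of topic modeling (the kind of task Mallet is used for).
# This uses scikit-learn's LDA rather than Mallet itself; the corpus and
# topic count are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "pump housing drawing revision and release",
    "drawing release notes for the housing assembly",
    "supplier audit findings and corrective actions",
    "corrective action plan from the supplier audit",
]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the top words for each discovered topic.
terms = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-4:][::-1]]
    print(f"topic {i}: {', '.join(top)}")
```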
The reason I bring up this topic is my experience with legacy data import during PLM implementations. Legacy data migration is tough, to say the least. Stephen Porter gives a good overview here: The PLM State: What’s the big deal about data migration?
Some of my real-life experiences include:

1. Manual scanning of historical documents, manual classification of those documents into folders, and uploading them to the PLM environment using vendor tools during an FDA-regulated organization’s implementation.
2. Legacy data extraction from a commercial document management system, mapping the data to the vendor PLM system, cleaning the legacy data, and finally data import.
3. Legacy system consolidation: merging numerous homegrown legacy systems into one commercial PLM system.
None of the processes we used were scalable or easy to start with, and the amount of time required could not be guaranteed. In such scenarios, wouldn’t using Rule-Based or Supervised Document Classification make sense? Arguably, CAD data would be difficult to handle, and historical revisions or intermediate iterations of files between releases might be lost, but for non-CAD data such techniques would probably save a huge investment of time and labor in legacy data migrations.
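As a rough sketch of what that could look like (the categories, keyword rules, and scikit-learn stack below are all my own assumptions, not a vendor workflow): a rule-based pass labels the portion of the legacy dump it can confidently match, and a supervised classifier trained on those rule-labeled documents suggests categories for the rest before import into the PLM system.

```python
# Hedged sketch of the idea above: rule-based labeling of part of a legacy
# document dump, then a supervised classifier to categorize the remainder
# before import into the PLM system. Categories, keyword rules, and the
# scikit-learn stack are illustrative assumptions, not a vendor workflow.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

RULES = {  # simple keyword rules standing in for a real rule engine
    "change_order": ["eco", "change order", "change notice"],
    "quality_record": ["ncr", "deviation", "capa"],
    "procedure": ["sop", "work instruction"],
}

def rule_label(text):
    """Return a category if any rule keyword matches, else None."""
    lowered = text.lower()
    for category, keywords in RULES.items():
        if any(kw in lowered for kw in keywords):
            return category
    return None

legacy_docs = [
    "ECO-1042: change order for pump bracket",
    "NCR filed for out-of-tolerance casting",
    "SOP for incoming material inspection",
    "Notes from supplier review meeting",  # no rule matches
]

labeled = [(d, rule_label(d)) for d in legacy_docs]
train = [(d, y) for d, y in labeled if y is not None]
unlabeled = [d for d, y in labeled if y is None]

# Train on the rule-labeled subset, then predict categories for the rest.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit([d for d, _ in train], [y for _, y in train])
for doc, pred in zip(unlabeled, clf.predict(unlabeled)):
    print(f"{pred}: {doc}")
```

In practice the predictions would go to a human reviewer rather than straight into the PLM system, but even that review is far cheaper than classifying every legacy document by hand.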