petermr edited this page Aug 10, 2020 · 7 revisions

AMI word analysis

NOTE: this is a picocli subcommand, but it is not fully integrated. It was developed from AMIArgProcessor/WordArgProcessor. It should be "relatively easy" to transfer to a Lucene pipeline under ami word. It is often run automatically in parallel with ami search, which complicates debugging; the two should not be coupled.

architecture

current stack (deepest first) 2020-08-10

WordCollectionFactory.transformWordStream(List<String>) line: 148	
WordCollectionFactory.createWordList() line: 136	
WordCollectionFactory.extractWords() line: 104	
WordArgProcessor.extractWords() line: 168	
<called from argProcessor or AMIWordsTool>

WordArgProcessor

WordArgProcessor.extractWords() runs:

		getOrCreateWordCollectionFactory();
		wordCollectionFactory.extractWords();

WordCollectionFactory

extractWords essentially calls:

List<String> createWordList()

The raw words are extracted by currentCTree.extractWords() and, if non-null, fed into transformWordStream(rawWords):

			List<String> rawWords = currentCTree.extractWords();
			wordList = (rawWords == null) ? null : transformWordStream(rawWords);
			// wordList is null when no raw words were extracted, so guard before calling size()
			if (wordList != null && wordsTool != null && wordsTool.getVerbosityInt() >= 2) {
				LOG.debug("wordsTool " + wordList.size());
			}

List<String> transformWordStream(List<String> transformedWords) (old version)

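Takes the token stream (as a List) and applies the selected transformations in a fixed order: abbreviations, capitalized words, lowercasing, stopword filtering, then Porter stemming.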

		AMIArgProcessor wordArgProcessor = (AMIArgProcessor) amiArgProcessor;
		if (amiArgProcessor.getChosenWordTypes().contains(AMIArgProcessor.ABBREVIATION)) {
			transformedWords = createAbbreviations(transformedWords);
		}
		if (amiArgProcessor.getChosenWordTypes().contains(AMIArgProcessor.CAPITALIZED)) {
			transformedWords = createCapitalized(transformedWords);
		} 
		if (amiArgProcessor.getWordCaseList().contains(AMIArgProcessor.IGNORE)) {
			transformedWords = toLowerCase(transformedWords);
		}
		List<WordSetWrapper> stopWordSetList = wordArgProcessor.getStopwordSetList();
		for (WordSetWrapper stopWordSet : stopWordSetList) {
			transformedWords = applyStopwordFilter(stopWordSet, transformedWords);
		}
		if (amiArgProcessor.getStemming()) {
			transformedWords = LuceneUtils.applyPorterStemming(transformedWords);
		}
		return transformedWords;
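
The order of the steps above matters: lowercasing runs before the stopword filters, so a lowercase stopword list can also remove "The". The following is a minimal sketch of that interaction, not the AMI implementation. It assumes the stopword filter is a case-sensitive set lookup (WordSetWrapper's actual matching is not shown on this page), and the class WordStreamOrderSketch and its helpers are hypothetical stand-ins for the methods named above.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Illustrative only: simplified stand-ins for two of the filters in transformWordStream. */
class WordStreamOrderSketch {

    /** Stand-in for toLowerCase(List<String>): lowercases every token. */
    static List<String> toLowerCase(List<String> words) {
        return words.stream()
                .map(w -> w.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    /** Stand-in for applyStopwordFilter: drops tokens found in the set (assumed case-sensitive). */
    static List<String> applyStopwordFilter(Set<String> stopwords, List<String> words) {
        return words.stream()
                .filter(w -> !stopwords.contains(w))
                .collect(Collectors.toList());
    }

    /** Same order as the code above: optional lowercasing, then each stopword set in turn. */
    static List<String> transform(List<String> words, boolean ignoreCase,
                                  List<Set<String>> stopwordSets) {
        List<String> transformed = words;
        if (ignoreCase) {
            transformed = toLowerCase(transformed);
        }
        for (Set<String> stopwords : stopwordSets) {
            transformed = applyStopwordFilter(stopwords, transformed);
        }
        return transformed;
    }
}
```

Under these assumptions, calling transform on the tokens "The", "Zika", "virus", "and", "the", "mosquito" with the stopword set {"the", "and"} keeps "The" when ignoreCase is false and removes it when ignoreCase is true.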

List<String> transformWordStream(AMIWordsTool wordsTool, List<String> transformedWords) (new version)

As above, but each step is switched on or off by picocli options read from AMIWordsTool.

		if (wordsTool.isAbbreviation()) {
			transformedWords = createAbbreviations(transformedWords);
		}
		if (wordsTool.isCapital()) {
			transformedWords = createCapitalized(transformedWords);
		} 
		if (wordsTool.isIgnoreCase()) {
			transformedWords = toLowerCase(transformedWords);
		}
		List<WordSetWrapper> stopWordSetList = wordsTool.getStopWordsSetList();
		for (WordSetWrapper stopWordSet : stopWordSetList) {
			transformedWords = applyStopwordFilter(stopWordSet, transformedWords);
		}
		if (wordsTool.isStemming()) {
			transformedWords = LuceneUtils.applyPorterStemming(transformedWords);
		}
		return transformedWords;
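
One way to keep the transformation independent of how the options are parsed is to build the list of enabled steps from plain booleans and then apply them in order. The sketch below illustrates that idea only; it does not use picocli, and the Options class, buildSteps, and the helper bodies are hypothetical. In particular, it assumes createCapitalized keeps only words starting with an upper-case letter, which this page does not confirm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

/** Illustrative only: builds the filter chain from boolean options, then applies it. */
class OptionDrivenPipelineSketch {

    /** Hypothetical stand-in for the boolean getters on AMIWordsTool. */
    static class Options {
        final boolean capital;
        final boolean ignoreCase;

        Options(boolean capital, boolean ignoreCase) {
            this.capital = capital;
            this.ignoreCase = ignoreCase;
        }
    }

    /** Assumed behaviour: keep only words whose first character is upper case. */
    static List<String> createCapitalized(List<String> words) {
        return words.stream()
                .filter(w -> !w.isEmpty() && Character.isUpperCase(w.charAt(0)))
                .collect(Collectors.toList());
    }

    /** Lowercases every token. */
    static List<String> toLowerCase(List<String> words) {
        return words.stream()
                .map(w -> w.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    /** Collects the enabled steps in the same order as the if-chain above. */
    static List<UnaryOperator<List<String>>> buildSteps(Options options) {
        List<UnaryOperator<List<String>>> steps = new ArrayList<>();
        if (options.capital) {
            steps.add(OptionDrivenPipelineSketch::createCapitalized);
        }
        if (options.ignoreCase) {
            steps.add(OptionDrivenPipelineSketch::toLowerCase);
        }
        return steps;
    }

    /** Applies each enabled step to the output of the previous one. */
    static List<String> transform(Options options, List<String> words) {
        List<String> transformed = words;
        for (UnaryOperator<List<String>> step : buildSteps(options)) {
            transformed = step.apply(transformed);
        }
        return transformed;
    }
}
```

Because transform depends only on the Options values, it can be exercised in a unit test without constructing a command line, which may help when debugging the word analysis separately from ami search.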