Event: FORCE Machine Learning Hackathon, Stavanger 2018 https://events.agilescientific.com/event/force-hackathon
Project: Extract good shows from 400 well reports https://events.agilescientific.com/project/extract-oil-show-confidence-from-pdf-reports
Team: Jesse Lord, Anne Estoppey, Kaouther Hadji, Chris Olsen, Florian Basier
Methodology:
- OCR PDFs (provided: https://drive.google.com/drive/folders/1ew1Re1LhLSOvculHYPrzqcKiYU3Oo3rU)
- Extract text from PDFs
- Find common phrases, beyond the obvious "good shows"
- For each phrase, search for a 4 digit number that is nearby in the document
- SMEs label each of the matches as a good show, or not
- Train/Test split, vectorization, then linear regression
- Visualise matches at depth on a map, and colour code confidence for each match. Run the UI using NPM, then choose the included csv once presented with the interface.
The pipeline module contains the code blocks required to perform the above methodology. Each block typically outputs to a csv, that is picked up by the next:
- phrases_csv will generate bigrams and trigrams from a list of indicators
- extract_pdf will read pdfs (expected to be in data\reports) to a dataframe
- count_phrases stores the count of phrases (from step 1) found in the pdf text dataframe
- top_phrases shows the top n phrases using the count_phrases csv: most hits and most docs
- extract_phrases searches for nearby 4 digit numbers in each document, for each phrase, and stores the word distance between phrase and depth
- label starts an interactive labelling of the matches found in step 5
- wb_xy uses regex to get the wellbore from each filename, then the coordinates found in data\csv\wb_latlon.csv to assign a location to each wellbore
- learn performs a train/test split, applies a count vectorizer, adds depth and distance as features, then performs logistic regression, outputting auc and confidence
Results: Using only the phrase 'good show', 44 matches with depth were found. After splitting there were only 11 records left to test on. While the results were perfect, this clearly isn't enough of a sample size to draw any meaningful conclusions. Our next step, if we had the time, was to select common phrases generated from the indicators vocabulary, and run this pipeline again.
The count/top phrase searches for trigrams did not complete before the event was done, but we did have a look at the top bigrams:
Top 20 phrases across all documents (most hits):
as f 3369.0
very f 3250.0
very fine 2842.0
v f 2249.0
f wh 2076.0
v sl 1804.0
as above 1802.0
red f 1800.0
cut f 1783.0
v bl 1608.0
v blk 1580.0
yel brn 1578.0
slightly calcareous 1535.0
wh f 1420.0
sl slty 1287.0
dk brn 1184.0
bl wh 1179.0
wh cut 1114.0
yel wh 1082.0
yel f 1010.0
dtype: float64
Top 20 phrases across all documents (most docs):
as f 336
very f 314
red f 299
very fine 286
f wh 268
slightly calcareous 252
f v 252
cut f 237
as cut 235
as v 232
slightly silty 225
mud f 211
f mud 202
v f 193
f hydrocarbon 187
very calcareous 183
f pyrite 183
v sl 172
pale yel 167
red brown 165
dtype: int64
Top phrases (hits and docs):
as above
as cut
as f
as v
bl wh
cut f
dk brn
f hydrocarbon
f mud
f pyrite
f v
f wh
mud f
pale yel
red brown
red f
sl slty
slightly calcareous
slightly silty
v bl
v blk
v f
v sl
very calcareous
very f
very fine
wh cut
wh f
yel brn
yel f
yel wh
Update: Trigrams finally finished. Top 20 phrases across all documents (most hits):
bl wh f 492.0
yel wh f 418.0
wh cut f 346.0
dull yel f 236.0
as cut f 233.0
wh cut fluor 225.0
bl wh cut 210.0
dk yel brn 207.0
white cut f 199.0
blue white f 194.0
blue white fluor 193.0
dull yel fluor 192.0
bl wh fluor 186.0
yel wh cut 181.0
white cut fluor 179.0
brn oil stn 155.0
dark yellow brown 150.0
f very fine 145.0
wk bl wh 141.0
white cut fluorescence 138.0
dtype: float64
Top 20 phrases across all documents (most docs):
as cut f 102
f very fine 79
pale yellow brown 70
dark yellow brown 62
f hydrocarbon show 61
wh cut f 60
dk yel brn 59
white cut f 57
white cut fluor 53
yel wh f 53
dull yellow f 51
dull yel f 51
dark yellowish brown 49
dull yellow fluor 49
white cut fluorescence 47
yellow white f 45
yel wh cut 44
yellow white fluor 43
wh cut fluor 41
dull yellow fluorescence 38
dtype: int64
Top phrases (hits and docs):
as cut f
bl wh cut
bl wh f
bl wh fluor
blue white f
blue white fluor
brn oil stn
dark yellow brown
dark yellowish brown
dk yel brn
dull yel f
dull yel fluor
dull yellow f
dull yellow fluor
dull yellow fluorescence
f hydrocarbon show
f very fine
pale yellow brown
wh cut f
wh cut fluor
white cut f
white cut fluor
white cut fluorescence
wk bl wh
yel wh cut
yel wh f
yellow white f
yellow white fluor