- Case study focused around photo OCR
- Three reasons to do this
- Look at how a complex system can be put together
- The idea of a machine learning pipeline
- What to do next
- How to do it
- Some more interesting ideas
- Applying machine learning to tangible problems
- Artificial data synthesis
What is the photo OCR problem?
Photo OCR = photo optical character recognition
With growth of digital photography, lots of digital pictures
One idea which has interested many people is getting computers to understand those photos
The photo OCR problem is getting computers to read text in an image
Possible applications for this would include
Make searching easier (e.g. searching for photos based on words in them)
Car navigation
OCR of documents is a comparatively easy problem
- From photos it's really hard
OCR pipeline
- Look through image and find text
- Do character segmentation
- Do character classification
- Optionally, some systems also run a spell check after this
- We're not focussing on such systems though
- Pipelines are common in machine learning
- Separate modules which may each be a machine learning component or data processing component
- If you're designing a machine learning system, pipeline design is one of the most important questions
- The performance of each module often has a big impact on the overall performance of the pipeline
- You would often have different engineers working on each module
- Offers a natural way to divide up the workload
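The three stages above can be sketched as separate functions chained together. This is only a toy illustration: the function names are hypothetical, and a string stands in for the image so the sketch stays runnable without any vision library.

```python
from typing import List


def detect_text(image: str) -> List[str]:
    # Stand-in for stage 1 (text detection): each whitespace-separated
    # chunk of the fake "image" is treated as one text region.
    return image.split()


def segment_characters(region: str) -> List[str]:
    # Stand-in for stage 2 (character segmentation): one patch per character.
    return list(region)


def classify_character(patch: str) -> str:
    # Stand-in for stage 3 (character classification).
    return patch.upper()


def photo_ocr(image: str) -> List[str]:
    # The pipeline: the output of each module feeds the next one.
    words = []
    for region in detect_text(image):
        patches = segment_characters(region)
        words.append("".join(classify_character(p) for p in patches))
    return words


print(photo_ocr("stop sign"))
```

Because each stage is its own function, any one of them could be swapped for a learned model or handed to a different engineer without touching the others.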
How do the individual models work?
Here focus on a sliding windows classifier
As mentioned, stage 1 is text detection
Unusual problem in computer vision - different rectangles (which surround text) may have different aspect ratios (aspect ratio being height : width)
Let's start with a simpler example
Pedestrian detection
This is a slightly simpler problem because the aspect ratio remains pretty constant
Building our detection system
Now we have a new image - how do we find pedestrians in it?
- Start by taking a rectangular 82 x 36 patch in the image
- Run patch through classifier - hopefully in this example it will return y = 0
- Next slide the rectangle over to the right a little bit and re-run
- Then slide again
- The amount you slide each rectangle over is a parameter called the step-size or stride
- Could use 1 pixel
- Best, but computationally expensive
- More commonly 5-8 pixels used
- So, keep stepping rectangle along all the way to the right
- Eventually get to the end
- Then move back to the left hand side but step down a bit too
- Repeat until you've covered the whole image
- Now, we initially started with quite a small rectangle
- So now we can take a larger image patch (of the same aspect ratio)
- Each time we process the image patch, we're resizing the larger patch to a smaller image, then running that smaller image through the classifier
- Hopefully, by changing the patch size and rastering repeatedly across the image, you eventually recognize all the pedestrians in the picture
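The sliding-window procedure above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the image is a list of pixel rows, 82 x 36 is read as height x width, `classifier` is any function returning 1 for "pedestrian", and the scale factors and nearest-neighbour resize are illustrative choices not taken from the notes.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel values


def sliding_windows(img_h: int, img_w: int, win_h: int, win_w: int,
                    stride: int) -> Iterator[Tuple[int, int]]:
    # Raster scan: step right along a row, then move down by one stride.
    for top in range(0, img_h - win_h + 1, stride):
        for left in range(0, img_w - win_w + 1, stride):
            yield top, left


def resize_nearest(patch: Image, out_h: int, out_w: int) -> Image:
    # Nearest-neighbour resize so every patch matches the classifier's input size.
    in_h, in_w = len(patch), len(patch[0])
    return [[patch[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def detect(image: Image, classifier: Callable[[Image], int],
           base: Tuple[int, int] = (82, 36),
           scales: Sequence[float] = (1.0, 1.5, 2.0),
           stride: int = 8) -> List[Tuple[int, int, int, int]]:
    img_h, img_w = len(image), len(image[0])
    hits = []
    for s in scales:
        # Larger window, same aspect ratio as the base patch.
        win_h, win_w = int(base[0] * s), int(base[1] * s)
        for top, left in sliding_windows(img_h, img_w, win_h, win_w, stride):
            patch = [row[left:left + win_w] for row in image[top:top + win_h]]
            small = resize_nearest(patch, base[0], base[1])
            if classifier(small) == 1:
                hits.append((top, left, win_h, win_w))
    return hits
```

A smaller stride checks more positions at a higher cost: the number of classifier calls grows roughly with the inverse square of the stride.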
Text detection example
Like pedestrian detection, we generate a labeled training set with positive examples (patches containing text) and negative examples (patches without text)
Having trained the classifier we apply it to an image
So, run a sliding window classifier at a fixed rectangle size
White regions show where the text detection system thinks text is
- Different shades of gray correspond to probability associated with how sure the classifier is the section contains text
- Black - no text
- White - text
- For text detection, we want to draw rectangles around all the regions where there is text in the image
Take classifier output and apply an expansion algorithm
Takes each of white regions and expands it
How do we implement this
Say, for every pixel, is it within some distance of a white pixel?
Look at connected white regions in the image above
This example misses a piece of text on the door because the aspect ratio is wrong
Very hard to read
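The expansion step and the connected-region step can be sketched as follows. This is a minimal sketch that assumes the classifier output has already been thresholded to a 0/1 mask. Treating "within some distance" as a square neighbourhood of `radius` pixels and using a width-to-height cut-off of 1.5 are illustrative assumptions, not values from the notes.

```python
from collections import deque
from typing import List, Tuple

Mask = List[List[int]]          # 1 = "text" (white), 0 = "no text" (black)
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def expand(mask: Mask, radius: int) -> Mask:
    # A pixel becomes white if it lies within `radius` of any white pixel.
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                for rr in range(max(0, r - radius), min(h, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(w, c + radius + 1)):
                        out[rr][cc] = 1
    return out


def bounding_boxes(mask: Mask) -> List[Box]:
    # Breadth-first search over 4-connected white regions, one box per region.
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue = deque([(r, c)])
                top, left, bottom, right = r, c, r, c
                while queue:
                    y, x = queue.popleft()
                    top, bottom = min(top, y), max(bottom, y)
                    left, right = min(left, x), max(right, x)
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                boxes.append((top, left, bottom, right))
    return boxes


def text_like(box: Box, min_ratio: float = 1.5) -> bool:
    # Keep only boxes that are clearly wider than tall, as lines of text usually are.
    top, left, bottom, right = box
    return (right - left + 1) / (bottom - top + 1) >= min_ratio
```

An aspect-ratio filter like `text_like` is also why a detector of this kind can miss text whose box has an unexpected shape, as with the text on the door.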
Stage two is character segmentation
- Use supervised learning algorithm
- Look in a defined image patch and decide, is there a split between two characters?
- We train a classifier to try and classify between positive and negative examples
- Run that classifier on the regions detected as containing text in the previous section
- Use a 1-dimensional sliding window to move along text regions
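The 1-dimensional sliding window can be sketched like this. It is a toy sketch: the text region is reduced to one "ink" value per pixel column, and `blank_middle` is a hypothetical stand-in for the trained split / no-split classifier.

```python
from typing import Callable, List, Sequence


def find_splits(columns: Sequence[int], win_w: int,
                is_split: Callable[[Sequence[int]], bool],
                stride: int = 1) -> List[int]:
    # Slide a fixed-width window left to right along the text region and
    # record the centre column of every window classified as a split.
    splits = []
    for left in range(0, len(columns) - win_w + 1, stride):
        window = columns[left:left + win_w]
        if is_split(window):
            splits.append(left + win_w // 2)
    return splits


def blank_middle(window: Sequence[int]) -> bool:
    # Stand-in classifier: call it a split when the middle column has no ink.
    return window[len(window) // 2] == 0


ink_per_column = [3, 4, 0, 5, 5, 0, 2]
print(find_splits(ink_per_column, win_w=3, is_split=blank_middle))
```

Only one loop is needed here because the window moves along a single axis, unlike the 2-dimensional scan used for detection.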
Character classification
Standard OCR: apply a standard supervised learning algorithm that takes an image patch as input and identifies which character it contains
Multi-class classification problem
We've seen over and over that one of the most reliable ways to get a high performance machine learning system is to take a low bias algorithm and train on a massive data set
- Where do we get so much data from?
- In ML we can do artificial data synthesis
- This doesn't apply to every problem
- If it applies to your problem, it can be a great way to generate loads of data
Two main principles
- Creating data from scratch
- Amplifying a small labeled training set we already have into a larger training set
Character recognition as an example of data synthesis
- If we go and collect a large labeled data set, it will consist of real character images cropped from photos
- How can we amplify this
- Modern computers often have a big font library
- If you go to websites, huge free font libraries
- For more training data, take characters from different fonts and paste them against random backgrounds
- After some work, can build a synthetic training set
- Random background
- Maybe some blurring/distortion filters
- Takes thought and work to make it look realistic
- If you do a sloppy job this won't help!
- So unlimited supply of training examples
- This is an example of creating new data from scratch
- Other way is to introduce distortion into existing data
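Creating character examples from scratch can be sketched as below. This is a toy sketch only: the two-entry `FONTS` table of 3 x 5 bitmaps is a hypothetical stand-in for a real font library, and the flat gray background with uniform noise is far cruder than what a realistic synthetic set would need.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical miniature "font library": each glyph is a 3x5 bitmap.
FONTS: Dict[str, Dict[str, List[str]]] = {
    "block": {"T": ["###", ".#.", ".#.", ".#.", ".#."]},
    "serif": {"T": ["###", ".#.", ".#.", ".#.", "###"]},
}


def synthesize(char: str, rng: random.Random, size: int = 8,
               noise: float = 0.1) -> Tuple[List[List[float]], str]:
    # Pick a random font and a random background gray level.
    glyph = FONTS[rng.choice(sorted(FONTS))][char]
    background = rng.random()
    ink = 0.0 if background > 0.5 else 1.0  # keep the character visible
    # Noisy background, clamped to the valid [0, 1] pixel range.
    image = [[min(1.0, max(0.0, background + rng.uniform(-noise, noise)))
              for _ in range(size)] for _ in range(size)]
    # Paste the glyph at a random position inside the image.
    top = rng.randint(0, size - len(glyph))
    left = rng.randint(0, size - len(glyph[0]))
    for r, row in enumerate(glyph):
        for c, cell in enumerate(row):
            if cell == "#":
                image[top + r][left + c] = ink
    return image, char


rng = random.Random(0)
training_set = [synthesize("T", rng) for _ in range(100)]
```

The label comes for free because the generator knows which character it pasted, which is what makes the supply of examples effectively unlimited.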
Another example: speech recognition
- Learn from audio clip - what were the words
- Have a labeled training example
- Introduce audio distortions into the examples
- So only took one example
- Created lots of new ones!
- Any distortions you introduce should be representative of the noise and distortion your classifier may actually encounter
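Amplifying one labeled clip into many can be sketched as follows. This is a minimal sketch assuming audio is a list of samples and that the distortion is a background recording mixed in at some gain. The function names, the looping of short backgrounds and the gain values are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Clip = List[float]


def mix(clean: Sequence[float], background: Sequence[float], gain: float) -> Clip:
    # Overlay a background recording on the clean clip, looping the
    # background if it is shorter than the clip.
    return [c + gain * background[i % len(background)]
            for i, c in enumerate(clean)]


def amplify(clean: Sequence[float], label: str,
            backgrounds: Sequence[Sequence[float]],
            gains: Sequence[float]) -> List[Tuple[Clip, str]]:
    # One labeled clip becomes len(backgrounds) * len(gains) labeled clips.
    # The label is unchanged because the spoken words are unchanged.
    return [(mix(clean, bg, g), label) for bg in backgrounds for g in gains]


clean = [1.0, 0.0, -1.0, 0.0]
backgrounds = [[0.5], [1.0, -1.0]]
examples = amplify(clean, "hello", backgrounds, gains=(0.25, 0.5))
print(len(examples))
```

Using recorded background sounds rather than arbitrary random noise keeps the synthetic clips close to conditions the classifier could plausibly meet.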
Getting more data
Before creating new data, make sure you have a low bias classifier
- Plot learning curve
If the classifier is not low bias, increase the number of features until it is
- Then create large artificial training set
Very important question: How much work would it be to get 10x as much data as we currently have?
- Often the answer is, "Not that hard"
- This is often a huge way to improve an algorithm
- Good question to ask yourself or ask the team
How many minutes/hours does it take to get a certain number of examples
- Say we have 1000 examples
- 10 seconds to label an example
- So we need another 9000 examples, i.e. 90000 seconds of labeling
- Comes to 25 hours, i.e. a few days of work
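The estimate above is simple enough to keep as a helper for other data sets. The function name and parameters here are illustrative.

```python
def labeling_hours(current_examples: int, factor: int,
                   seconds_per_example: float) -> float:
    # Extra examples needed to reach `factor` times the current set size.
    extra = current_examples * (factor - 1)
    return extra * seconds_per_example / 3600


# 1000 examples today, target 10x, 10 seconds to label each new example.
print(labeling_hours(1000, 10, 10))  # 25.0
```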
Crowd sourcing is also a good way to get data
Risk or reliability issues
E.g. Amazon Mechanical Turk
Throughout the course we have repeatedly said that one of the most valuable resources is developer time
Pick the right thing for you and your team to work on
Avoid spending a lot of time to realize the work was pointless in terms of enhancing performance
Photo OCR pipeline
- Three modules
- Each one could have a small team on it
- Where should you allocate resources?
- Good to have a single real number as an evaluation metric
- So, character accuracy for this example
- Find that our test set has 72% accuracy
Ceiling analysis on our pipeline
We go to the first module
Mess around with the test set - manually tell the algorithm where the text is
Simulate if your text detection system was 100% accurate
- So we're feeding the character segmentation module with 100% accurate data now
Accuracy goes up to 89%
Next do the same for the character segmentation
- Accuracy goes up to 90% now
Finally do the same for character recognition
- Goes up to 100%
Having done this we can quantitatively show what the upside to improving each module would be
- Perfect text detection improves accuracy by 17%!
- Would bring the biggest gain if we could improve
- Perfect character segmentation would improve it by 1%
- Not worth working on
- Perfect character recognition would improve it by 10%
- Might be worth working on, depends if it looks easy or not
The "ceiling" is the upper bound on how much the overall system could improve if a given module were made perfect
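The per-module gains follow directly from the cumulative accuracies. A minimal sketch, with accuracies held as whole percentages and a hypothetical function name:

```python
from typing import List, Sequence, Tuple


def ceiling_gains(baseline: int,
                  stages: Sequence[Tuple[str, int]]) -> List[Tuple[str, int]]:
    # `stages` lists, in pipeline order, the overall accuracy measured once
    # that module (and every module before it) is replaced by ground truth.
    gains = []
    previous = baseline
    for name, accuracy in stages:
        gains.append((name, accuracy - previous))
        previous = accuracy
    return gains


stages = [
    ("text detection", 89),
    ("character segmentation", 90),
    ("character recognition", 100),
]
for name, gain in ceiling_gains(72, stages):
    print(f"{name}: +{gain} points")
```

Each gain is the most the overall accuracy could rise from perfecting that one module, so the module with the largest gain is the natural place to spend effort first.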
Other example - face recognition
- NB this is not how it's done in practice
- Probably more complicated than is used in practice
- How would you do ceiling analysis for this
- Overall system is 85%
- + Perfect background -> 85.1%
- Not a crucial step
- + Perfect face detection -> 91%
- Most important module to focus on
- + Perfect eyes -> 95%
- + Perfect nose -> 96%
- + Perfect mouth -> 97%
- + Perfect logistic regression -> 100%
- Cautionary tale
- Two engineers spent 18 months improving background pre-processing
- Turns out it had no impact on overall performance
- Could have saved three years of man power if they'd done ceiling analysis