This project demonstrates the integration of OpenAI's GPT-4 Vision API with a HoloLens application. Users can capture images using the HoloLens camera and receive descriptive responses from the GPT-4V model.
[Demo video: gpt4-vision-hololens-demo.mp4]

Example response: 'A laptop displaying a webpage with the header "Let's build from here" is placed next to a spiral notebook and a pen on a dark surface.'
- Newtonsoft.Json
- MRTK Foundation
- MRTK Standard Assets
- Open the `GPT4 Vision Example` scene.
- Specify your OpenAI API key in the GameObject `GPT4Vision > OpenAIWrapper` (or hardcode it into the `OpenAIWrapper.cs` class).
- Specify your base prompt (which is sent to OpenAI together with the image), e.g. "Describe this image."
- Specify max tokens, sampling temperature, and image detail for the OpenAI API call (see the request sketch after this list).
- Build the app as `.appx` (or deploy to HoloLens directly, e.g. via Visual Studio) and install it on your HoloLens.
- Run the app. Press the camera button to capture a photo using the HoloLens' PV (photo/video) camera, which gets sent to OpenAI's API.
- See the inference result (based on your prompt) displayed on the label.
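For reference, the sketch below shows how such a request can be assembled in Unity. The field names (`max_tokens`, `temperature`, `detail`) follow OpenAI's public chat-completions vision API, but the class and serialization style are illustrative only and do not necessarily match `OpenAIWrapper.cs` (which can use Newtonsoft.Json instead of manual string building):

```csharp
using System.Collections;
using System.Globalization;
using System.Text;
using UnityEngine;
using UnityEngine.Networking;

public class VisionRequestSketch : MonoBehaviour
{
    // Illustrative values; in this project they are set on the GPT4Vision GameObject.
    [SerializeField] private string apiKey      = "sk-...";
    [SerializeField] private string basePrompt  = "Describe this image.";
    [SerializeField] private int    maxTokens   = 300;
    [SerializeField] private float  temperature = 0.7f;
    [SerializeField] private string imageDetail = "low"; // "low", "high", or "auto"

    public IEnumerator SendImageRequest(byte[] jpegBytes)
    {
        string base64Image = System.Convert.ToBase64String(jpegBytes);

        // Base prompt and captured image travel together as one user message.
        // (Real code should build this with a JSON serializer such as Newtonsoft.Json;
        // plain concatenation breaks if the prompt contains quotes.)
        string payload =
            "{\"model\":\"gpt-4-vision-preview\"," +
            "\"max_tokens\":" + maxTokens + "," +
            "\"temperature\":" + temperature.ToString(CultureInfo.InvariantCulture) + "," +
            "\"messages\":[{\"role\":\"user\",\"content\":[" +
            "{\"type\":\"text\",\"text\":\"" + basePrompt + "\"}," +
            "{\"type\":\"image_url\",\"image_url\":{" +
            "\"url\":\"data:image/jpeg;base64," + base64Image + "\"," +
            "\"detail\":\"" + imageDetail + "\"}}]}]}";

        using (UnityWebRequest req = new UnityWebRequest("https://api.openai.com/v1/chat/completions", "POST"))
        {
            req.uploadHandler   = new UploadHandlerRaw(Encoding.UTF8.GetBytes(payload));
            req.downloadHandler = new DownloadHandlerBuffer();
            req.SetRequestHeader("Content-Type", "application/json");
            req.SetRequestHeader("Authorization", "Bearer " + apiKey);

            yield return req.SendWebRequest();

            if (req.result == UnityWebRequest.Result.Success)
                Debug.Log(req.downloadHandler.text); // description sits in choices[0].message.content
            else
                Debug.LogError(req.error);
        }
    }
}
```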
- Make sure you have the dependencies from above installed.
- Import the package via `Assets > Import Package`.
- Either open the `GPT4 Vision Example` scene, or import the `GPT4Vision` prefab into your own scene.
- Edit the base prompt, tokens, temperature, and image detail as described above.
- Optional: call `CapturePhoto()` on the `GPT4Vision` prefab directly, in case you do not want to use the button and label within the prefab (see the trigger sketch below).
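If you trigger capture from your own scripts, a call could look like the following. The `GPT4Vision` component type and its public `CapturePhoto()` method are assumed here based on the prefab described above; adjust the names to whatever the prefab actually exposes:

```csharp
using UnityEngine;

public class MyCaptureTrigger : MonoBehaviour
{
    // Assign the GPT4Vision prefab instance in the Inspector.
    // NOTE: component type assumed from the prefab above -- adjust if it differs.
    [SerializeField] private GPT4Vision gpt4Vision;

    void Update()
    {
        // Example trigger: capture on Space (in the editor) instead of the built-in button.
        if (Input.GetKeyDown(KeyCode.Space))
        {
            gpt4Vision.CapturePhoto(); // kicks off photo capture + OpenAI inference
        }
    }
}
```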
For some reason, the built-in `UnityEngine.Windows.WebCam` approach provided by Microsoft is really slow (~1.2 s per captured photo on average, regardless of resolution). Inference speed on OpenAI's servers can also vary quite a bit. If you need this approach in real time, skip `PhotoCapture` altogether (use Research Mode instead) and consider hosting your own LMM. Feel free to message me if you need some pointers.
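For context, below is a condensed version of Microsoft's documented `PhotoCapture` flow (not this project's exact code) that the ~1.2 s figure refers to. Note the chain of async callbacks (create, start photo mode, take photo, stop photo mode) that has to run for every single capture:

```csharp
using System.Linq;
using UnityEngine;
using UnityEngine.Windows.WebCam;

public class SlowPhotoCaptureExample : MonoBehaviour
{
    private PhotoCapture photoCapture;

    public void TakePhoto()
    {
        // Step 1: create the capture object (async).
        PhotoCapture.CreateAsync(false, captureObject =>
        {
            photoCapture = captureObject;

            Resolution res = PhotoCapture.SupportedResolutions
                .OrderByDescending(r => r.width * r.height)
                .First();

            CameraParameters parameters = new CameraParameters
            {
                hologramOpacity        = 0.0f,
                cameraResolutionWidth  = res.width,
                cameraResolutionHeight = res.height,
                pixelFormat            = CapturePixelFormat.BGRA32
            };

            // Step 2: start photo mode (async).
            photoCapture.StartPhotoModeAsync(parameters, startResult =>
            {
                // Step 3: actually take the photo (async).
                photoCapture.TakePhotoAsync((photoResult, frame) =>
                {
                    Texture2D texture = new Texture2D(res.width, res.height);
                    frame.UploadImageDataToTexture(texture);
                    // ... encode texture to JPEG and send to OpenAI ...

                    // Step 4: tear everything down again (async).
                    photoCapture.StopPhotoModeAsync(stopResult =>
                    {
                        photoCapture.Dispose();
                        photoCapture = null;
                    });
                });
            });
        });
    }
}
```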
This project is a barebones prototype for now and still work in progress. Feel free to create a PR.