diff --git a/docs/docs/tutorials/local_rag.ipynb b/docs/docs/tutorials/local_rag.ipynb index ddfd5642a198c..e4c3f536b7ca7 100644 --- a/docs/docs/tutorials/local_rag.ipynb +++ b/docs/docs/tutorials/local_rag.ipynb @@ -19,17 +19,28 @@ "\n", ":::\n", "\n", - "The popularity of projects like [PrivateGPT](https://github.com/imartinez/privateGPT), [llama.cpp](https://github.com/ggerganov/llama.cpp), [GPT4All](https://github.com/nomic-ai/gpt4all), and [llamafile](https://github.com/Mozilla-Ocho/llamafile) underscore the importance of running LLMs locally.\n", + "The popularity of projects like [llama.cpp](https://github.com/ggerganov/llama.cpp), [Ollama](https://github.com/ollama/ollama), and [llamafile](https://github.com/Mozilla-Ocho/llamafile) underscores the importance of running LLMs locally.\n", "\n", - "LangChain has [integrations](https://integrations.langchain.com/) with many open-source LLMs that can be run locally.\n", + "LangChain has integrations with [many open-source LLM providers](/docs/how_to/local_llms) that can be run locally.\n", "\n", - "See [here](/docs/how_to/local_llms) for setup instructions for these LLMs. \n", + "This guide will show how to run `LLaMA 3.1` via one provider, [Ollama](/docs/integrations/providers/ollama/), locally (e.g., on your laptop), using local embeddings and a local LLM. However, you can set up and swap in other local providers, such as [LlamaCPP](/docs/integrations/chat/llamacpp/), if you prefer.\n", "\n", - "For example, here we show how to run `GPT4All` or `LLaMA2` locally (e.g., on your laptop) using local embeddings and a local LLM.\n", + "**Note:** This guide uses a [chat model](/docs/concepts/#chat-models) wrapper that takes care of formatting your input prompt for the specific local model you're using. However, if you are prompting local models directly with a [text-in/text-out LLM](/docs/concepts/#llms) wrapper, you may need to use a prompt tailored for your specific model. This will often [require the inclusion of special tokens](https://huggingface.co/blog/llama2#how-to-prompt-llama-2). [Here's an example for LLaMA 2](https://smith.langchain.com/hub/rlm/rag-prompt-llama).\n", "\n", - "## Document Loading \n", + "## Setup\n", "\n", - "First, install packages needed for local embeddings and vector storage." + "First, we'll need to set up Ollama.\n", + "\n", + "The instructions [on their GitHub repo](https://github.com/ollama/ollama) provide details, which we summarize here:\n", + "\n", + "- [Download](https://ollama.com/download) and run their desktop app\n", + "- From the command line, fetch models from [this list of options](https://ollama.com/library). For this guide, you'll need:\n", + " - A general-purpose model like `llama3.1:8b`, which you can pull with something like `ollama pull llama3.1:8b`\n", + " - A [text embedding model](https://ollama.com/search?c=embedding) like `nomic-embed-text`, which you can pull with something like `ollama pull nomic-embed-text`\n", + "- When the app is running, all models are automatically served on `localhost:11434`\n", + "- Note that your model choice will depend on your hardware capabilities\n", + "\n", + "Next, install packages needed for local embeddings, vector storage, and inference."
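Before installing the Python packages below, it can help to sanity-check that the Ollama server is up and that the models you pulled are visible to it. The snippet below is a minimal sketch of such a check, not part of the tutorial itself: it assumes the `requests` package is available and that your Ollama version exposes the `/api/tags` model-listing endpoint described in Ollama's REST API docs, so verify both against your installation.

```python
import requests  # assumed to be installed; any HTTP client would work

# Ollama serves a local HTTP API on port 11434 by default.
# /api/tags is documented as listing the models that have been pulled locally;
# treat the exact endpoint and response shape as assumptions to confirm for your version.
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()

pulled = [m["name"] for m in resp.json().get("models", [])]
print(pulled)  # expect entries like 'llama3.1:8b' and 'nomic-embed-text:latest'
```

If the request fails or a model is missing from the list, revisit the download and `ollama pull` steps above before continuing.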
] }, { @@ -39,7 +50,22 @@ "metadata": {}, "outputs": [], "source": [ - "%pip install --upgrade --quiet langchain langchain-community langchainhub gpt4all langchain-chroma " + "# Document loading, retrieval methods and text splitting\n", + "%pip install -qU langchain langchain_community\n", + "\n", + "# Local vector store via Chroma\n", + "%pip install -qU langchain_chroma\n", + "\n", + "# Local inference and embeddings via Ollama\n", + "%pip install -qU langchain_ollama" + ] + }, + { + "cell_type": "markdown", + "id": "02b7914e", + "metadata": {}, + "source": [ + "You can also [see this page](/docs/integrations/text_embedding/) for a full list of available embeddings models" ] }, { @@ -47,20 +73,22 @@ "id": "5e7543fa", "metadata": {}, "source": [ - "Load and split an example document.\n", + "## Document Loading\n", + "\n", + "Now let's load and split an example document.\n", "\n", - "We'll use a blog post on agents as an example." + "We'll use a [blog post](https://lilianweng.github.io/posts/2023-06-23-agent/) by Lilian Weng on agents as an example." ] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "id": "f8cf5765", "metadata": {}, "outputs": [], "source": [ + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", "from langchain_community.document_loaders import WebBaseLoader\n", - "from langchain_text_splitters import RecursiveCharacterTextSplitter\n", "\n", "loader = WebBaseLoader(\"https://lilianweng.github.io/posts/2023-06-23-agent/\")\n", "data = loader.load()\n", @@ -74,20 +102,22 @@ "id": "131d5059", "metadata": {}, "source": [ - "Next, the below steps will download the `GPT4All` embeddings locally (if you don't already have them)." + "Next, the below steps will initialize your vector store. We use [`nomic-embed-text`](https://ollama.com/library/nomic-embed-text), but you can explore other providers or options as well:" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "id": "fdce8923", "metadata": {}, "outputs": [], "source": [ "from langchain_chroma import Chroma\n", - "from langchain_community.embeddings import GPT4AllEmbeddings\n", + "from langchain_ollama import OllamaEmbeddings\n", "\n", - "vectorstore = Chroma.from_documents(documents=all_splits, embedding=GPT4AllEmbeddings())" + "local_embeddings = OllamaEmbeddings(model=\"nomic-embed-text\")\n", + "\n", + "vectorstore = Chroma.from_documents(documents=all_splits, embedding=local_embeddings)" ] }, { @@ -95,12 +125,12 @@ "id": "29137915", "metadata": {}, "source": [ - "Test similarity search is working with our local embeddings." + "And now we have a working vector store! Test that similarity search is working:" ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 4, "id": "b0c55e98", "metadata": {}, "outputs": [ @@ -110,7 +140,7 @@ "4" ] }, - "execution_count": 3, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } @@ -123,17 +153,17 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 5, "id": "32b43339", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "Document(page_content='Task decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\\\\n1.\", \"What are the subgoals for achieving XYZ?\", (2) by using task-specific instructions; e.g. \"Write a story outline.\" for writing a novel, or (3) with human inputs.', metadata={'description': 'Building agents with LLM (large language model) as its core controller is a cool concept. 
Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\\nAgent System Overview In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:', 'language': 'en', 'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'title': \"LLM Powered Autonomous Agents | Lil'Log\"})" + "Document(metadata={'description': 'Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\\nAgent System Overview In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:', 'language': 'en', 'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'title': \"LLM Powered Autonomous Agents | Lil'Log\"}, page_content='Task decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\\\\n1.\", \"What are the subgoals for achieving XYZ?\", (2) by using task-specific instructions; e.g. \"Write a story outline.\" for writing a novel, or (3) with human inputs.')" ] }, - "execution_count": 7, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } @@ -142,260 +172,102 @@ "docs[0]" ] }, - { - "cell_type": "markdown", - "id": "557cd9b8", - "metadata": {}, - "source": [ - "## Model \n", - "\n", - "### LLaMA2\n", - "\n", - "Note: new versions of `llama-cpp-python` use GGUF model files (see [here](https://github.com/abetlen/llama-cpp-python/pull/633)).\n", - "\n", - "If you have an existing GGML model, see [here](/docs/integrations/llms/llamacpp) for instructions for conversion for GGUF. \n", - " \n", - "And / or, you can download a GGUF converted model (e.g., [here](https://huggingface.co/TheBloke)).\n", - "\n", - "Finally, as noted in detail [here](/docs/how_to/local_llms) install `llama-cpp-python`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9f218576", - "metadata": {}, - "outputs": [], - "source": [ - "%pip install --upgrade --quiet llama-cpp-python" - ] - }, - { - "cell_type": "markdown", - "id": "0dd1804f", - "metadata": {}, - "source": [ - "To enable use of GPU on Apple Silicon, follow the steps [here](https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md) to use the Python binding `with Metal support`.\n", - "\n", - "In particular, ensure that `conda` is using the correct virtual environment that you created (`miniforge3`).\n", - "\n", - "E.g., for me:\n", - "\n", - "```\n", - "conda activate /Users/rlm/miniforge3/envs/llama\n", - "```\n", - "\n", - "With this confirmed:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5884779a-957e-4c4c-b447-bc8385edc67e", - "metadata": {}, - "outputs": [], - "source": [ - "! 
CMAKE_ARGS=\"-DLLAMA_METAL=on\" FORCE_CMAKE=1 /Users/rlm/miniforge3/envs/llama/bin/pip install -U llama-cpp-python --no-cache-dir" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "cd7164e3", - "metadata": {}, - "outputs": [], - "source": [ - "from langchain_community.llms import LlamaCpp" - ] - }, { "cell_type": "markdown", "id": "fcf81052", "metadata": {}, "source": [ - "Setting model parameters as noted in the [llama.cpp docs](/docs/integrations/llms/llamacpp)." + "Next, set up a model. We use Ollama with `llama3.1:8b` here, but you can [explore other providers](/docs/how_to/local_llms/) or [model options depending on your hardware setup](https://ollama.com/library):" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "id": "af1176bb-d52a-4cf0-b983-8b7433d45b4f", "metadata": {}, "outputs": [], "source": [ - "n_gpu_layers = 1 # Metal set to 1 is enough.\n", - "n_batch = 512 # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.\n", - "\n", - "# Make sure the model path is correct for your system!\n", - "llm = LlamaCpp(\n", - " model_path=\"/Users/rlm/Desktop/Code/llama.cpp/models/llama-2-13b-chat.ggufv3.q4_0.bin\",\n", - " n_gpu_layers=n_gpu_layers,\n", - " n_batch=n_batch,\n", - " n_ctx=2048,\n", - " f16_kv=True, # MUST set to True, otherwise you will run into problem after a couple of calls\n", - " verbose=True,\n", + "from langchain_ollama import ChatOllama\n", + "\n", + "model = ChatOllama(\n", + " model=\"llama3.1:8b\",\n", ")" ] }, { "cell_type": "markdown", - "id": "3831b16a", + "id": "8c4f7adf", "metadata": {}, "source": [ - "Note that these indicate that [Metal was enabled properly](/docs/integrations/llms/llamacpp):\n", - "\n", - "```\n", - "ggml_metal_init: allocating\n", - "ggml_metal_init: using MPS\n", - "```" + "Test it to make sure you've set everything up properly:" ] }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 7, "id": "bf0162e0-8c41-4344-88ae-ff2bbaeb12eb", "metadata": {}, "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Llama.generate: prefix-match hit\n" - ] - }, { "name": "stdout", "output_type": "stream", "text": [ - "by jonathan \n", + "**The scene is set: a packed arena, the crowd on their feet. In the blue corner, we have Stephen Colbert, aka \"The O'Reilly Factor\" himself. In the red corner, the challenger, John Oliver. The judges are announced as Tina Fey, Larry Wilmore, and Patton Oswalt. The crowd roars as the two opponents face off.**\n", "\n", - "Here's the hypothetical rap battle:\n", + "**Stephen Colbert (aka \"The Truth with a Twist\"):**\n", + "Yo, I'm the king of satire, the one they all fear\n", + "My show's on late, but my jokes are clear\n", + "I skewer the politicians, with precision and might\n", + "They tremble at my wit, day and night\n", "\n", - "[Stephen Colbert]: Yo, this is Stephen Colbert, known for my comedy show. I'm here to put some sense in your mind, like an enema do-go. Your opponent? A man of laughter and witty quips, John Oliver! Now let's see who gets the most laughs while taking shots at each other\n", + "**John Oliver:**\n", + "Hold up, Stevie boy, you may have had your time\n", + "But I'm the new kid on the block, with a different prime\n", + "Time to wake up from that 90s coma, son\n", + "My show's got bite, and my facts are never done\n", "\n", - "[John Oliver]: Yo, this is John Oliver, known for my own comedy show. I'm here to take your mind on an adventure through wit and humor. 
But first, allow me to you to our contestant: Stephen Colbert! His show has been around since the '90s, but it's time to see who can out-rap whom\n", + "**Stephen Colbert:**\n", + "Oh, so you think you're the one, with the \"Last Week\" crown\n", + "But your jokes are stale, like the ones I wore down\n", + "I'm the master of absurdity, the lord of the spin\n", + "You're just a British import, trying to fit in\n", "\n", - "[Stephen Colbert]: You claim to be a witty man, John Oliver, with your British charm and clever remarks. But my knows that I'm America's funnyman! Who's the one taking you? Nobody!\n", + "**John Oliver:**\n", + "Stevie, my friend, you may have been the first\n", + "But I've got the skill and the wit, that's never blurred\n", + "My show's not afraid, to take on the fray\n", + "I'm the one who'll make you think, come what may\n", "\n", - "[John Oliver]: Hey Stephen Colbert, don't get too cocky. You may" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ + "**Stephen Colbert:**\n", + "Well, it's time for a showdown, like two old friends\n", + "Let's see whose satire reigns supreme, till the very end\n", + "But I've got a secret, that might just seal your fate\n", + "My humor's contagious, and it's already too late!\n", + "\n", + "**John Oliver:**\n", + "Bring it on, Stevie! I'm ready for you\n", + "I'll take on your jokes, and show them what to do\n", + "My sarcasm's sharp, like a scalpel in the night\n", + "You're just a relic of the past, without a fight\n", + "\n", + "**The judges deliberate, weighing the rhymes and the flow. Finally, they announce their decision:**\n", + "\n", + "Tina Fey: I've got to go with John Oliver. His jokes were sharper, and his delivery was smoother.\n", "\n", - "llama_print_timings: load time = 4481.74 ms\n", - "llama_print_timings: sample time = 183.05 ms / 256 runs ( 0.72 ms per token, 1398.53 tokens per second)\n", - "llama_print_timings: prompt eval time = 456.05 ms / 13 tokens ( 35.08 ms per token, 28.51 tokens per second)\n", - "llama_print_timings: eval time = 7375.20 ms / 255 runs ( 28.92 ms per token, 34.58 tokens per second)\n", - "llama_print_timings: total time = 8388.92 ms\n" + "Larry Wilmore: Agreed! But Stephen Colbert's still got that old-school charm.\n", + "\n", + "Patton Oswalt: You know what? It's a tie. Both of them brought the heat!\n", + "\n", + "**The crowd goes wild as both opponents take a bow. The rap battle may be over, but the satire war is just beginning...\n" ] - }, - { - "data": { - "text/plain": [ - "\"by jonathan \\n\\nHere's the hypothetical rap battle:\\n\\n[Stephen Colbert]: Yo, this is Stephen Colbert, known for my comedy show. I'm here to put some sense in your mind, like an enema do-go. Your opponent? A man of laughter and witty quips, John Oliver! Now let's see who gets the most laughs while taking shots at each other\\n\\n[John Oliver]: Yo, this is John Oliver, known for my own comedy show. I'm here to take your mind on an adventure through wit and humor. But first, allow me to you to our contestant: Stephen Colbert! His show has been around since the '90s, but it's time to see who can out-rap whom\\n\\n[Stephen Colbert]: You claim to be a witty man, John Oliver, with your British charm and clever remarks. But my knows that I'm America's funnyman! Who's the one taking you? Nobody!\\n\\n[John Oliver]: Hey Stephen Colbert, don't get too cocky. 
You may\"" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" } ], "source": [ - "llm.invoke(\"Simulate a rap battle between Stephen Colbert and John Oliver\")" - ] - }, - { - "cell_type": "markdown", - "id": "0d9579a7", - "metadata": {}, - "source": [ - "### GPT4All\n", - "\n", - "Similarly, we can use `GPT4All`.\n", - "\n", - "[Download the GPT4All model binary](/docs/integrations/llms/gpt4all).\n", - "\n", - "The Model Explorer on the [GPT4All](https://gpt4all.io/index.html) is a great way to choose and download a model.\n", - "\n", - "Then, specify the path that you downloaded to to.\n", - "\n", - "E.g., for me, the model lives here:\n", - "\n", - "`/Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "57c1aec0-04c7-479e-b9bf-af3c547ba0a3", - "metadata": {}, - "outputs": [], - "source": [ - "from langchain_community.llms import GPT4All\n", - "\n", - "gpt4all = GPT4All(\n", - " model=\"/Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin\",\n", - " max_tokens=2048,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "e6d012e4-0eef-4734-a826-89ec74fe9f88", - "metadata": {}, - "source": [ - "### llamafile\n", - "\n", - "One of the simplest ways to run an LLM locally is using a [llamafile](https://github.com/Mozilla-Ocho/llamafile). All you need to do is:\n", - "\n", - "1) Download a llamafile from [HuggingFace](https://huggingface.co/models?other=llamafile)\n", - "2) Make the file executable\n", - "3) Run the file\n", - "\n", - "llamafiles bundle model weights and a [specially-compiled](https://github.com/Mozilla-Ocho/llamafile?tab=readme-ov-file#technical-details) version of [`llama.cpp`](https://github.com/ggerganov/llama.cpp) into a single file that can run on most computers without any additional dependencies. They also come with an embedded inference server that provides an [API](https://github.com/Mozilla-Ocho/llamafile/blob/main/llama.cpp/server/README.md#api-endpoints) for interacting with your model. \n", - "\n", - "Here's a simple bash script that shows all 3 setup steps:\n", - "\n", - "```bash\n", - "# Download a llamafile from HuggingFace\n", - "wget https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile\n", - "\n", - "# Make the file executable. On Windows, instead just rename the file to end in \".exe\".\n", - "chmod +x TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile\n", - "\n", - "# Start the model server. Listens at http://localhost:8080 by default.\n", - "./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile --server --nobrowser\n", - "```\n", - "\n", - "After you run the above setup steps, you can interact with the model via LangChain:" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "735e45b6-9aff-463e-aae4-bbf8ac2b21c5", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'\\n-1 1/2 (8 oz. Pounds) ground beef, browned and cooked until no longer pink\\n-3 cups whole wheat spaghetti\\n-4 (10 oz) cans diced tomatoes with garlic and basil\\n-2 eggs, beaten\\n-1 cup grated parmesan cheese\\n-1/2 teaspoon salt\\n-1/4 teaspoon black pepper\\n-1 cup breadcrumbs (16 oz)\\n-2 tablespoons olive oil\\n\\nInstructions:\\n1. Cook spaghetti according to package directions. Drain and set aside.\\n2. In a large skillet, brown ground beef over medium heat until no longer pink. Drain any excess grease.\\n3. 
Stir in diced tomatoes with garlic and basil, and season with salt and pepper. Cook for 5 to 7 minutes or until sauce is heated through. Set aside.\\n4. In a large bowl, beat eggs with a fork or whisk until fluffy. Add cheese, salt, and black pepper. Set aside.\\n5. In another bowl, combine breadcrumbs and olive oil. Dip each spaghetti into the egg mixture and then coat in the breadcrumb mixture. Place on baking sheet lined with parchment paper to prevent sticking. Repeat until all spaghetti are coated.\\n6. Heat oven to 375 degrees. Bake for 18 to 20 minutes, or until lightly golden brown.\\n7. Serve hot with meatballs and sauce on the side. Enjoy!'" - ] - }, - "execution_count": 1, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from langchain_community.llms.llamafile import Llamafile\n", - "\n", - "llamafile = Llamafile()\n", + "response_message = model.invoke(\n", + " \"Simulate a rap battle between Stephen Colbert and John Oliver\"\n", + ")\n", "\n", - "llamafile.invoke(\"Here is my grandmother's beloved recipe for spaghetti and meatballs:\")" + "print(response_message.content)" ] }, { @@ -405,79 +277,49 @@ "source": [ "## Using in a chain\n", "\n", - "We can create a summarization chain with either model by passing in the retrieved docs and a simple prompt.\n", + "We can create a summarization chain with either model by passing in retrieved docs and a simple prompt.\n", "\n", - "It formats the prompt template using the input key values provided and passes the formatted string to `GPT4All`, `LLama-V2`, or another specified LLM." + "It formats the prompt template using the input key values provided and passes the formatted string to the specified model:" ] }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 8, "id": "18a3716d", "metadata": {}, "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Llama.generate: prefix-match hit\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Based on the retrieved documents, the main themes are:\n", - "1. Task decomposition: The ability to break down complex tasks into smaller subtasks, which can be handled by an LLM or other components of the agent system.\n", - "2. LLM as the core controller: The use of a large language model (LLM) as the primary controller of an autonomous agent system, complemented by other key components such as a knowledge graph and a planner.\n", - "3. Potentiality of LLM: The idea that LLMs have the potential to be used as powerful general problem solvers, not just for generating well-written copies but also for solving complex tasks and achieving human-like intelligence.\n", - "4. Challenges in long-term planning: The challenges in planning over a lengthy history and effectively exploring the solution space, which are important limitations of current LLM-based autonomous agent systems." - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\n", - "llama_print_timings: load time = 1191.88 ms\n", - "llama_print_timings: sample time = 134.47 ms / 193 runs ( 0.70 ms per token, 1435.25 tokens per second)\n", - "llama_print_timings: prompt eval time = 39470.18 ms / 1055 tokens ( 37.41 ms per token, 26.73 tokens per second)\n", - "llama_print_timings: eval time = 8090.85 ms / 192 runs ( 42.14 ms per token, 23.73 tokens per second)\n", - "llama_print_timings: total time = 47943.12 ms\n" - ] - }, { "data": { "text/plain": [ - "'\\nBased on the retrieved documents, the main themes are:\\n1. 
Task decomposition: The ability to break down complex tasks into smaller subtasks, which can be handled by an LLM or other components of the agent system.\\n2. LLM as the core controller: The use of a large language model (LLM) as the primary controller of an autonomous agent system, complemented by other key components such as a knowledge graph and a planner.\\n3. Potentiality of LLM: The idea that LLMs have the potential to be used as powerful general problem solvers, not just for generating well-written copies but also for solving complex tasks and achieving human-like intelligence.\\n4. Challenges in long-term planning: The challenges in planning over a lengthy history and effectively exploring the solution space, which are important limitations of current LLM-based autonomous agent systems.'" + "'The main themes in these documents are:\\n\\n1. **Task Decomposition**: The process of breaking down complex tasks into smaller, manageable subgoals is crucial for efficient task handling.\\n2. **Autonomous Agent System**: A system powered by Large Language Models (LLMs) that can perform planning, reflection, and refinement to improve the quality of final results.\\n3. **Challenges in Planning and Decomposition**:\\n\\t* Long-term planning and task decomposition are challenging for LLMs.\\n\\t* Adjusting plans when faced with unexpected errors is difficult for LLMs.\\n\\t* Humans learn from trial and error, making them more robust than LLMs in certain situations.\\n\\nOverall, the documents highlight the importance of task decomposition and planning in autonomous agent systems powered by LLMs, as well as the challenges that still need to be addressed.'" ] }, - "execution_count": 27, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from langchain_core.output_parsers import StrOutputParser\n", - "from langchain_core.prompts import PromptTemplate\n", + "from langchain_core.prompts import ChatPromptTemplate\n", "\n", - "# Prompt\n", - "prompt = PromptTemplate.from_template(\n", + "prompt = ChatPromptTemplate.from_template(\n", " \"Summarize the main themes in these retrieved docs: {docs}\"\n", ")\n", "\n", "\n", - "# Chain\n", + "# Convert loaded documents into strings by concatenating their content\n", + "# and ignoring metadata\n", "def format_docs(docs):\n", " return \"\\n\\n\".join(doc.page_content for doc in docs)\n", "\n", "\n", - "chain = {\"docs\": format_docs} | prompt | llm | StrOutputParser()\n", + "chain = {\"docs\": format_docs} | prompt | model | StrOutputParser()\n", "\n", - "# Run\n", "question = \"What are the approaches to Task Decomposition?\"\n", + "\n", "docs = vectorstore.similarity_search(question)\n", + "\n", "chain.invoke(docs)" ] }, @@ -486,184 +328,54 @@ "id": "3cce6977-52e7-4944-89b4-c161d04f6698", "metadata": {}, "source": [ - "## Q&A \n", - "\n", - "We can also use the LangChain Prompt Hub to store and fetch prompts that are model-specific.\n", + "## Q&A\n", "\n", - "Let's try with a default RAG prompt, [here](https://smith.langchain.com/hub/rlm/rag-prompt)." + "You can also perform question-answering with your local model and vector store. 
Here's an example with a simple string prompt:" ] }, { "cell_type": "code", - "execution_count": 3, - "id": "59ed5f0d-7089-41cc-8486-af37b690dd33", + "execution_count": 9, + "id": "67cefb46-acd3-4c2a-a8f6-b62c7c3e30dc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template=\"You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\\nQuestion: {question} \\nContext: {context} \\nAnswer:\"))]" + "'Task decomposition can be done through (1) simple prompting using LLM, (2) task-specific instructions, or (3) human inputs. This approach helps break down large tasks into smaller, manageable subgoals for efficient handling of complex tasks. It enables agents to plan ahead and improve the quality of final results through reflection and refinement.'" ] }, - "execution_count": 3, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "from langchain import hub\n", + "from langchain_core.runnables import RunnablePassthrough\n", "\n", - "rag_prompt = hub.pull(\"rlm/rag-prompt\")\n", - "rag_prompt.messages" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "id": "c01c1725", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Llama.generate: prefix-match hit\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Task can be done by down a task into smaller subtasks, using simple prompting like \"Steps for XYZ.\" or task-specific like \"Write a story outline\" for writing a novel." - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\n", - "llama_print_timings: load time = 11326.20 ms\n", - "llama_print_timings: sample time = 33.03 ms / 47 runs ( 0.70 ms per token, 1422.86 tokens per second)\n", - "llama_print_timings: prompt eval time = 1387.31 ms / 242 tokens ( 5.73 ms per token, 174.44 tokens per second)\n", - "llama_print_timings: eval time = 1321.62 ms / 46 runs ( 28.73 ms per token, 34.81 tokens per second)\n", - "llama_print_timings: total time = 2801.08 ms\n" - ] - }, - { - "data": { - "text/plain": [ - "{'output_text': '\\nTask can be done by down a task into smaller subtasks, using simple prompting like \"Steps for XYZ.\" or task-specific like \"Write a story outline\" for writing a novel.'}" - ] - }, - "execution_count": 28, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from langchain_core.runnables import RunnablePassthrough, RunnablePick\n", + "RAG_TEMPLATE = \"\"\"\n", + "You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. 
Use three sentences maximum and keep the answer concise.\n", + "\n", + "\n", + "{context}\n", + "\n", + "\n", + "Answer the following question:\n", + "\n", + "{question}\"\"\"\n", + "\n", + "rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)\n", "\n", - "# Chain\n", "chain = (\n", - " RunnablePassthrough.assign(context=RunnablePick(\"context\") | format_docs)\n", + " RunnablePassthrough.assign(context=lambda input: format_docs(input[\"context\"]))\n", " | rag_prompt\n", - " | llm\n", + " | model\n", " | StrOutputParser()\n", ")\n", "\n", - "# Run\n", - "chain.invoke({\"context\": docs, \"question\": question})" - ] - }, - { - "cell_type": "markdown", - "id": "2e5913f0-cf92-4e21-8794-0502ba11b202", - "metadata": {}, - "source": [ - "Now, let's try with [a prompt specifically for LLaMA](https://smith.langchain.com/hub/rlm/rag-prompt-llama), which [includes special tokens](https://huggingface.co/blog/llama2#how-to-prompt-llama-2)." - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "78f6862d-b7a6-4e03-84e4-45667185bf9b", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "ChatPromptTemplate(input_variables=['question', 'context'], output_parser=None, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['question', 'context'], output_parser=None, partial_variables={}, template=\"[INST]<> You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.<> \\nQuestion: {question} \\nContext: {context} \\nAnswer: [/INST]\", template_format='f-string', validate_template=True), additional_kwargs={})])" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Prompt\n", - "rag_prompt_llama = hub.pull(\"rlm/rag-prompt-llama\")\n", - "rag_prompt_llama.messages" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "id": "67cefb46-acd3-4c2a-a8f6-b62c7c3e30dc", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Llama.generate: prefix-match hit\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - " Sure, I'd be happy to help! Based on the context, here are some to task:\n", - "\n", - "1. LLM with simple prompting: This using a large model (LLM) with simple prompts like \"Steps for XYZ\" or \"What are the subgoals for achieving XYZ?\" to decompose tasks into smaller steps.\n", - "2. Task-specific: Another is to use task-specific, such as \"Write a story outline\" for writing a novel, to guide the of tasks.\n", - "3. Human inputs:, human inputs can be used to supplement the process, in cases where the task a high degree of creativity or expertise.\n", - "\n", - "As fores in long-term and task, one major is that LLMs to adjust plans when faced with errors, making them less robust to humans who learn from trial and error." 
- ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\n", - "llama_print_timings: load time = 11326.20 ms\n", - "llama_print_timings: sample time = 144.81 ms / 207 runs ( 0.70 ms per token, 1429.47 tokens per second)\n", - "llama_print_timings: prompt eval time = 1506.13 ms / 258 tokens ( 5.84 ms per token, 171.30 tokens per second)\n", - "llama_print_timings: eval time = 6231.92 ms / 206 runs ( 30.25 ms per token, 33.06 tokens per second)\n", - "llama_print_timings: total time = 8158.41 ms\n" - ] - }, - { - "data": { - "text/plain": [ - "{'output_text': ' Sure, I\\'d be happy to help! Based on the context, here are some to task:\\n\\n1. LLM with simple prompting: This using a large model (LLM) with simple prompts like \"Steps for XYZ\" or \"What are the subgoals for achieving XYZ?\" to decompose tasks into smaller steps.\\n2. Task-specific: Another is to use task-specific, such as \"Write a story outline\" for writing a novel, to guide the of tasks.\\n3. Human inputs:, human inputs can be used to supplement the process, in cases where the task a high degree of creativity or expertise.\\n\\nAs fores in long-term and task, one major is that LLMs to adjust plans when faced with errors, making them less robust to humans who learn from trial and error.'}" - ] - }, - "execution_count": 26, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Chain\n", - "chain = (\n", - " RunnablePassthrough.assign(context=RunnablePick(\"context\") | format_docs)\n", - " | rag_prompt_llama\n", - " | llm\n", - " | StrOutputParser()\n", - ")\n", + "question = \"What are the approaches to Task Decomposition?\"\n", + "\n", + "docs = vectorstore.similarity_search(question)\n", "\n", "# Run\n", "chain.invoke({\"context\": docs, \"question\": question})" @@ -676,82 +388,64 @@ "source": [ "## Q&A with retrieval\n", "\n", - "Instead of manually passing in docs, we can automatically retrieve them from our vector store based on the user question.\n", - "\n", - "This will use a QA default prompt (shown [here](https://github.com/langchain-ai/langchain/blob/275b926cf745b5668d3ea30236635e20e7866442/langchain/chains/retrieval_qa/prompt.py#L4)) and will retrieve from the vectorDB." + "Finally, instead of manually passing in docs, you can automatically retrieve them from our vector store based on the user question:" ] }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 10, "id": "86c7a349", "metadata": {}, "outputs": [], "source": [ "retriever = vectorstore.as_retriever()\n", + "\n", "qa_chain = (\n", " {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n", " | rag_prompt\n", - " | llm\n", + " | model\n", " | StrOutputParser()\n", ")" ] }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 11, "id": "112ca227", "metadata": {}, "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Llama.generate: prefix-match hit\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - " Sure! Based on the context, here's my answer to your:\n", - "\n", - "There are several to task,:\n", - "\n", - "1. LLM-based with simple prompting, such as \"Steps for XYZ\" or \"What are the subgoals for achieving XYZ?\"\n", - "2. Task-specific, like \"Write a story outline\" for writing a novel.\n", - "3. Human inputs to guide the process.\n", - "\n", - "These can be used to decompose complex tasks into smaller, more manageable subtasks, which can help improve the and effectiveness of task. 
However, long-term and task can being due to the need to plan over a lengthy history and explore the space., LLMs may to adjust plans when faced with errors, making them less robust to human learners who can learn from trial and error." - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\n", - "llama_print_timings: load time = 11326.20 ms\n", - "llama_print_timings: sample time = 139.20 ms / 200 runs ( 0.70 ms per token, 1436.76 tokens per second)\n", - "llama_print_timings: prompt eval time = 1532.26 ms / 258 tokens ( 5.94 ms per token, 168.38 tokens per second)\n", - "llama_print_timings: eval time = 5977.62 ms / 199 runs ( 30.04 ms per token, 33.29 tokens per second)\n", - "llama_print_timings: total time = 7916.21 ms\n" - ] - }, { "data": { "text/plain": [ - "{'query': 'What are the approaches to Task Decomposition?',\n", - " 'result': ' Sure! Based on the context, here\\'s my answer to your:\\n\\nThere are several to task,:\\n\\n1. LLM-based with simple prompting, such as \"Steps for XYZ\" or \"What are the subgoals for achieving XYZ?\"\\n2. Task-specific, like \"Write a story outline\" for writing a novel.\\n3. Human inputs to guide the process.\\n\\nThese can be used to decompose complex tasks into smaller, more manageable subtasks, which can help improve the and effectiveness of task. However, long-term and task can being due to the need to plan over a lengthy history and explore the space., LLMs may to adjust plans when faced with errors, making them less robust to human learners who can learn from trial and error.'}" + "'Task decomposition can be done through (1) simple prompting in Large Language Models (LLM), (2) using task-specific instructions, or (3) with human inputs. This process involves breaking down large tasks into smaller, manageable subgoals for efficient handling of complex tasks.'" ] }, - "execution_count": 30, + "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ + "question = \"What are the approaches to Task Decomposition?\"\n", + "\n", "qa_chain.invoke(question)" ] + }, + { + "cell_type": "markdown", + "id": "e75d3e9e", + "metadata": {}, + "source": [ + "## Next steps\n", + "\n", + "You've now seen how to build a RAG application using all local components. 
RAG is a very deep topic, and you might be interested in the following guides that discuss and demonstrate additional techniques:\n", "\n", "- [Video: Reliable, fully local RAG agents with LLaMA 3](https://www.youtube.com/watch?v=-ROS6gfYIts) for an agentic approach to RAG with local models\n", "- [Video: Building Corrective RAG from scratch with open-source, local LLMs](https://www.youtube.com/watch?v=E2shqsYwxck)\n", "- [Conceptual guide on retrieval](/docs/concepts/#retrieval) for an overview of various retrieval techniques you can apply to improve performance\n", "- [How-to guides on RAG](/docs/how_to/#qa-with-rag) for a deeper dive into the specifics of RAG\n", "- [How to run models locally](/docs/how_to/local_llms/) for different approaches to setting up different providers" ] } ], "metadata": { @@ -770,7 +464,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.1" + "version": "3.10.5" } }, "nbformat": 4, diff --git a/docs/docs/tutorials/rag.ipynb b/docs/docs/tutorials/rag.ipynb index c1c973485197d..6ac5a6e55b639 100644 --- a/docs/docs/tutorials/rag.ipynb +++ b/docs/docs/tutorials/rag.ipynb @@ -936,7 +936,8 @@ "- [Return sources](/docs/how_to/qa_sources): Learn how to return source documents\n", "- [Streaming](/docs/how_to/streaming): Learn how to stream outputs and intermediate steps\n", "- [Add chat history](/docs/how_to/message_history): Learn how to add chat history to your app\n", - "- [Retrieval conceptual guide](/docs/concepts/#retrieval): A high-level overview of specific retrieval techniques" + "- [Retrieval conceptual guide](/docs/concepts/#retrieval): A high-level overview of specific retrieval techniques\n", + "- [Build a local RAG application](/docs/tutorials/local_rag): Create an app similar to the one above using all local components" ] } ], "metadata": { @@ -956,7 +957,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.4" + "version": "3.10.5" } }, "nbformat": 4,