From 0fdda12a94663fb79c17307dd822563261e4ac68 Mon Sep 17 00:00:00 2001 From: Alex Cabrera Date: Tue, 9 Jan 2024 07:18:29 -0800 Subject: [PATCH] feat: add blog back (#59) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * feat: add blog back * add newsletter content * updates * minor * year and title * updates * fix footer newsletter name --------- Co-authored-by: Alex Bäuerle --- blog/2022-08-26-welcome.md | 25 ------------ blog/2024-01-09-newsletter.md | 72 +++++++++++++++++++++++++++++++++++ blog/authors.yml | 12 ++++-- docusaurus.config.js | 16 ++++---- 4 files changed, 88 insertions(+), 37 deletions(-) delete mode 100644 blog/2022-08-26-welcome.md create mode 100644 blog/2024-01-09-newsletter.md diff --git a/blog/2022-08-26-welcome.md b/blog/2022-08-26-welcome.md deleted file mode 100644 index 2775f9b6..00000000 --- a/blog/2022-08-26-welcome.md +++ /dev/null @@ -1,25 +0,0 @@ ---- -slug: welcome-post -title: "Zeno: Interactive Model Evaluation" -authors: - name: Alex Cabrera - title: PhD Student @ CMU - url: https://cabreraalex.com - image_url: https://cabreraalex.com/images/profile.png -tags: ["introduction"] ---- - -Model evaluation is a key part of the machine learning development process. -Evaluation often stops at measuring metrics such as accuracy, but aggregate metrics are a limited view of model performance. -It's important to look at specific failures or subgroups of data to develop a nuanced view of model performance and discover issues such as [biases](http://gendershades.org/) and [safety concerns](https://en.wikipedia.org/wiki/Death_of_Elaine_Herzberg). - -![safety concerns](/img/overall.png) - -[**Zeno**](https://zenoml.com) is a framework for fine-grained evaluation of diverse machine learning models. -Zeno is primarily a UI for interactively exploring your data, model outputs, and metrics. -You can interactively filter your data using metadata and create slices to track and compare model performance. 
-Lastly, you can create visualizations and reports to summarize model performance and share insights with other stakeholders. - -### Get Started - -Interested in using Zeno? Get started with our [Quickstart](/docs/intro) guide! diff --git a/blog/2024-01-09-newsletter.md b/blog/2024-01-09-newsletter.md new file mode 100644 index 00000000..68a805bd --- /dev/null +++ b/blog/2024-01-09-newsletter.md @@ -0,0 +1,72 @@ +--- +slug: newsletter-24-01 +title: "Zeno's Notes on AI Evaluation | January 2024" +authors: [ac, ab] +tags: ["zeno's notes"] +--- + +Welcome to the first edition of the **Zeno's Notes** newsletter! +Each month, we'll discuss the community's work around Zeno, interesting research and projects on AI evaluation, and new Zeno features. + +Before we dive in, we wanted to look back at the last few months since we launched [Zeno Hub](https://hub.zenoml.com). +Our users have created over **800 projects** and **1,400 slices** to evaluate more than **10,000 AI systems**! +These insights have been used to author over **160 reports**, communicating interesting findings and insights. +It's exciting to see how Zeno is being used to make AI evaluation more accessible and transparent. + +## 🌎 Community + +_Highlighting work from the Zeno community._ + +### [An In-Depth Look at Gemini's Language Abilities](https://arxiv.org/abs/2312.11444) + +Researchers at CMU, including the Zeno team, conducted a [deep dive into Gemini's language abilities](https://x.com/gneubig/status/1737108966931673191?s=20). +They compared Gemini Pro, Google's newly released LLM, with GPT-3.5-Turbo, GPT-4-Turbo, and Mixtral. +Overall, they found that Gemini approaches but lags behind GPT-3.5-Turbo in all English tasks, yet performs better in translation into languages it supports. +For more detailed results, [read the paper](https://arxiv.org/abs/2312.11444) or explore the code on [GitHub](https://t.co/S7S9473xtP). +Each section of the paper is linked to a Zeno report for further exploration! 
+ +### [HuggingFace is Dropping DROP](https://huggingface.co/blog/leaderboard-drop-dive) + +The HuggingFace [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) is a popular repository for comparing new LLMs. HuggingFace recently added [three new benchmarks](https://twitter.com/clefourrier/status/1722555555338956840) to their leaderboard to better represent real-world performance. +After receiving feedback from the community, they noticed significant fluctuations in the scores for one benchmark, DROP. +[Using Zeno](https://hub.zenoml.com/report/1255/DROP%20Benchmark%20Exploration), they uncovered the reason behind the variance and decided to remove DROP from the leaderboard until a revised version of the benchmark is developed. + +## 📰 Evaluation News + +_Interesting news from the world of AI evaluation._ + +### [CRUXEval](https://crux-eval.github.io/) + +Researchers from MIT and Meta AI Research released a new evaluation dataset for code reasoning, understanding, and execution. +Instead of having models generate code, CRUXEval asks models to predict either the output or the input of a given function based on its signature. +This dataset, which includes 800 Python functions, supplements classic code generation datasets such as HumanEval and MBPP. +They compared multiple open and closed-source models on their new benchmarks and found that there is quite a bit of room for improvement. + +### [CommonGen Leaderboard](https://inklab.usc.edu/CommonGen/leaderboard.html) + +CommonGen is a challenging benchmark task asking models to generate coherent sentences describing everyday scenarios. +The researchers behind the benchmark, from USC, Allen AI, and UW, recently updated their [eval repository](https://github.com/allenai/CommonGen-Eval) with a [new leaderboard](https://inklab.usc.edu/CommonGen/leaderboard.html) for the task, showing how state-of-the-art models, including GPT-4, perform significantly worse than humans.
+The authors argue that the task is so hard because it requires relational reasoning using background common sense knowledge and the models need to be able to generalize to unseen concept combinations. + +## ✨ New in Zeno + +_Updates to Zeno that you'll love._ + +### Integrations + +We've been focusing on making it even easier for you to analyze your evaluation results in Zeno by [integrating Zeno into other AI evaluation frameworks](https://zenoml.com/docs/integrations/). +You can now directly upload your model outputs if you're using the [EleutherAI LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) or the [Ragas Framework](https://docs.ragas.io/en/latest/index.html). + +- **Ragas** is a library for model-graded evaluation of RAG applications. We've added [a detailed tutorial](https://docs.ragas.io/en/latest/howtos/integrations/zeno.html) on how to use Zeno to investigate your evaluation results. You can view an example of this in Zeno [here](https://hub.zenoml.com/project/b35c83b8-0b22-4b9c-aedb-80964011d7a7/ragas%20FICA%20eval). + +- **EleutherAI LM Evaluation Harness** is a popular library for running LLM benchmarks. We wrote a script that allows you to directly upload all your evaluation data as a Zeno project, enabling you to compare different models across various benchmarks provided by EleutherAI. To start visualizing your LM Evaluation Harness data in Zeno, follow [these instructions](https://github.com/EleutherAI/lm-evaluation-harness#visualizing-results) or take a look at our [example notebook](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/examples/visualize-zeno.ipynb). We've already used this integration for some of our own projects, such as [this evaluation of the Mamba architecture](https://hub.zenoml.com/project/ba44d31c-9e02-4330-bdbe-0760dfe85dc4/Mamba%20Eval_hellaswag)! + +### Documentation + +We've also been working on improving our documentation to make it easier for you to get started with Zeno. 
+This includes [use cases](http://localhost:3000/docs/examples/), [tutorials](https://zenoml.com/docs/tutorials/), and [integration guides](https://zenoml.com/docs/integrations/). +If you have any suggestions for what you'd like to see in our documentation, please let us know! + +--- + +_If you have questions about Zeno or anything we've highlighted in this newsletter, have ideas for new Zeno features or content for a future issue of Zeno's Notes, or simply want to say hi, get in touch via [email](mailto:hello@zenoml.com) or join our [Discord](https://discord.gg/km62pDKAkE)._ diff --git a/blog/authors.yml b/blog/authors.yml index 87413168..5b2c1b4d 100644 --- a/blog/authors.yml +++ b/blog/authors.yml @@ -1,5 +1,11 @@ -endi: - name: Angel Alexander Cabrera - title: PhD Student @ CMU +ac: + name: Alex Cabrera + title: PhD Candidate @ CMU url: https://cabreraalex.com image_url: https://cabreraalex.com/images/profile.png + +ab: + name: Alex Bäuerle + title: Researcher @ CMU + url: https://a13x.io + image_url: https://a13x.io/images/alex.jpeg diff --git a/docusaurus.config.js b/docusaurus.config.js index ae3d6d94..ed585978 100644 --- a/docusaurus.config.js +++ b/docusaurus.config.js @@ -39,17 +39,9 @@ const config = { }, docs: { sidebarPath: require.resolve("./sidebars.js"), - // Please change this to your repo. - // Remove this to remove the "edit this page" links. - // editUrl: - // "https://github.com/facebook/docusaurus/tree/main/packages/create-docusaurus/templates/shared/", }, blog: { showReadingTime: true, - // Please change this to your repo. - // Remove this to remove the "edit this page" links. 
- // editUrl: - // "https://github.com/facebook/docusaurus/tree/main/packages/create-docusaurus/templates/shared/", }, theme: { customCss: require.resolve("./src/css/custom.css"), @@ -84,12 +76,14 @@ const config = { position: "left", label: "Docs", }, + { to: "blog", label: "Blog", position: "left" }, // or position: 'right' { to: "/about", label: "About", position: "left" }, { to: "/faq", label: "FAQ", position: "left" }, { type: "html", position: "right", - value: "Sign up", + value: + "Sign up", }, { type: "html", @@ -120,6 +114,10 @@ const config = { label: "Docs", to: "/docs/intro", }, + { + label: "Blog", + to: "/blog/", + }, { label: "About", to: "/about/",