feat: add blog back (#59)
* feat: add blog back

* add newsletter content

* updates

* minor

* year and title

* updates

* fix footer newsletter name

---------

Co-authored-by: Alex Bäuerle <alex@a13x.io>
cabreraalex and Sparkier authored Jan 9, 2024
1 parent c117e26 commit 0fdda12
Showing 4 changed files with 88 additions and 37 deletions.
25 changes: 0 additions & 25 deletions blog/2022-08-26-welcome.md

This file was deleted.

72 changes: 72 additions & 0 deletions blog/2024-01-09-newsletter.md
@@ -0,0 +1,72 @@
---
slug: newsletter-24-01
title: "Zeno's Notes on AI Evaluation | January 2024"
authors: [ac, ab]
tags: ["zeno's notes"]
---

Welcome to the first edition of the **Zeno's Notes** newsletter!
Each month, we'll discuss the community's work around Zeno, interesting research and projects on AI evaluation, and new Zeno features.

Before we dive in, we wanted to look back at the last few months since we launched [Zeno Hub](https://hub.zenoml.com).
Our users have created over **800 projects** and **1,400 slices** to evaluate more than **10,000 AI systems**!
These insights have been used to author over **160 reports**, communicating interesting findings and insights.
It's exciting to see how Zeno is being used to make AI evaluation more accessible and transparent.

## 🌎 Community

_Highlighting work from the Zeno community._

### [An In-Depth Look at Gemini's Language Abilities](https://arxiv.org/abs/2312.11444)

Researchers at CMU, including the Zeno team, conducted a [deep dive into Gemini's language abilities](https://x.com/gneubig/status/1737108966931673191?s=20).
They compared Gemini Pro, Google's newly released LLM, with GPT-3.5-Turbo, GPT-4-Turbo, and Mixtral.
Overall, they found that Gemini approaches but lags behind GPT-3.5-Turbo in all English tasks, yet performs better in translation into languages it supports.
For more detailed results, [read the paper](https://arxiv.org/abs/2312.11444) or explore the code on [GitHub](https://t.co/S7S9473xtP).
Each section of the paper is linked to a Zeno report for further exploration!

### [HuggingFace is Dropping DROP](https://huggingface.co/blog/leaderboard-drop-dive)

The HuggingFace [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) is a popular repository for comparing new LLMs. HuggingFace recently added [three new benchmarks](https://twitter.com/clefourrier/status/1722555555338956840) to their leaderboard to better represent real-world performance.
After receiving feedback from the community, they noticed significant fluctuations in the scores for one benchmark, DROP.
[Using Zeno](https://hub.zenoml.com/report/1255/DROP%20Benchmark%20Exploration), they uncovered the reason behind the variance and decided to remove DROP from the leaderboard until a revised version of the benchmark is developed.

## 📰 Evaluation News

_Interesting news from the world of AI evaluation._

### [CRUXEval](https://crux-eval.github.io/)

Researchers from MIT and Meta AI Research released a new evaluation dataset for code reasoning, understanding, and execution.
Instead of having models generate code, CRUXEval gives models a function and asks them to predict either its output for a given input or an input that produces a given output.
This dataset, which includes 800 Python functions, supplements classic code generation datasets such as HumanEval and MBPP.
They compared multiple open and closed-source models on their new benchmarks and found that there is quite a bit of room for improvement.
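
To make the two task types concrete, here is a minimal sketch of what an output-prediction and an input-prediction check could look like. The function `f` and all values are hypothetical illustrations written for this note, not items from the CRUXEval dataset, and the scoring shown is an assumption about how such checks might be run.

```python
def f(s: str) -> str:
    # Hypothetical benchmark-style function (not taken from CRUXEval):
    # uppercase the characters at even positions, leave the rest unchanged.
    return "".join(ch.upper() if i % 2 == 0 else ch for i, ch in enumerate(s))


# Output prediction: the model sees f and an input and must state the output.
given_input = "zeno"
predicted_output = "ZeNo"  # stand-in for a model's answer
output_correct = f(given_input) == predicted_output

# Input prediction: the model sees f and an output and must propose
# any input that produces it; the answer is checked by executing f.
target_output = "HuB"
predicted_input = "hub"  # stand-in for a model's answer
input_correct = f(predicted_input) == target_output

print(output_correct, input_correct)
```

Both checks are verified by running the function, so any input that reproduces the target output counts as correct in the input-prediction case.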

### [CommonGen Leaderboard](https://inklab.usc.edu/CommonGen/leaderboard.html)

CommonGen is a challenging benchmark task asking models to generate coherent sentences describing everyday scenarios.
The researchers behind the benchmark, from USC, Allen AI, and UW, recently updated their [eval repository](https://github.com/allenai/CommonGen-Eval) with a [new leaderboard](https://inklab.usc.edu/CommonGen/leaderboard.html) for the task, showing that state-of-the-art models, including GPT-4, perform significantly worse than humans.
The authors argue that the task is so hard because it requires relational reasoning over background commonsense knowledge, and because models need to generalize to unseen concept combinations.

## ✨ New in Zeno

_Updates to Zeno that you'll love._

### Integrations

We've been focusing on making it even easier for you to analyze your evaluation results in Zeno by [integrating Zeno into other AI evaluation frameworks](https://zenoml.com/docs/integrations/).
You can now directly upload your model outputs if you're using the [EleutherAI LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) or the [Ragas Framework](https://docs.ragas.io/en/latest/index.html).

- **Ragas** is a library for model-graded evaluation of RAG applications. We've added [a detailed tutorial](https://docs.ragas.io/en/latest/howtos/integrations/zeno.html) on how to use Zeno to investigate your evaluation results. You can view an example of this in Zeno [here](https://hub.zenoml.com/project/b35c83b8-0b22-4b9c-aedb-80964011d7a7/ragas%20FICA%20eval).

- **EleutherAI LM Evaluation Harness** is a popular library for running LLM benchmarks. We wrote a script that allows you to directly upload all your evaluation data as a Zeno project, enabling you to compare different models across various benchmarks provided by EleutherAI. To start visualizing your LM Evaluation Harness data in Zeno, follow [these instructions](https://github.com/EleutherAI/lm-evaluation-harness#visualizing-results) or take a look at our [example notebook](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/examples/visualize-zeno.ipynb). We've already used this integration for some of our own projects, such as [this evaluation of the Mamba architecture](https://hub.zenoml.com/project/ba44d31c-9e02-4330-bdbe-0760dfe85dc4/Mamba%20Eval_hellaswag)!

### Documentation

We've also been working on improving our documentation to make it easier for you to get started with Zeno.
This includes [use cases](http://localhost:3000/docs/examples/), [tutorials](https://zenoml.com/docs/tutorials/), and [integration guides](https://zenoml.com/docs/integrations/).
If you have any suggestions for what you'd like to see in our documentation, please let us know!

---

_If you have questions about Zeno or anything we've highlighted in this newsletter, have ideas for new Zeno features or content for a future issue of Zeno's Notes, or simply want to say hi, get in touch via [email](mailto:hello@zenoml.com) or join our [Discord](https://discord.gg/km62pDKAkE)._
12 changes: 9 additions & 3 deletions blog/authors.yml
@@ -1,5 +1,11 @@
endi:
name: Angel Alexander Cabrera
title: PhD Student @ CMU
ac:
name: Alex Cabrera
title: PhD Candidate @ CMU
url: https://cabreraalex.com
image_url: https://cabreraalex.com/images/profile.png

ab:
name: Alex Bäuerle
title: Researcher @ CMU
url: https://a13x.io
image_url: https://a13x.io/images/alex.jpeg
16 changes: 7 additions & 9 deletions docusaurus.config.js
@@ -39,17 +39,9 @@ const config = {
},
docs: {
sidebarPath: require.resolve("./sidebars.js"),
// Please change this to your repo.
// Remove this to remove the "edit this page" links.
// editUrl:
// "https://github.com/facebook/docusaurus/tree/main/packages/create-docusaurus/templates/shared/",
},
blog: {
showReadingTime: true,
// Please change this to your repo.
// Remove this to remove the "edit this page" links.
// editUrl:
// "https://github.com/facebook/docusaurus/tree/main/packages/create-docusaurus/templates/shared/",
},
theme: {
customCss: require.resolve("./src/css/custom.css"),
@@ -84,12 +76,14 @@ const config = {
position: "left",
label: "Docs",
},
{ to: "blog", label: "Blog", position: "left" }, // or position: 'right'
{ to: "/about", label: "About", position: "left" },
{ to: "/faq", label: "FAQ", position: "left" },
{
type: "html",
position: "right",
value: "<a style='margin-left: 10px' href='https://hub.zenoml.com/signup'>Sign up</a>",
value:
"<a style='margin-left: 10px' href='https://hub.zenoml.com/signup'>Sign up</a>",
},
{
type: "html",
@@ -120,6 +114,10 @@ const config = {
label: "Docs",
to: "/docs/intro",
},
{
label: "Blog",
to: "/blog/",
},
{
label: "About",
to: "/about/",