v0.10.0
⭐ Highlights
🚀 Making Pipelines more scalable
You can now easily scale and distribute Haystack Pipelines thanks to the new integration of the Ray framework (https://ray.io/).
Ray allows distributing a Pipeline's components across a cluster of machines. The individual components of a Pipeline can be scaled independently: for instance, an extractive QA Pipeline deployment can have three replicas of the Reader and a single replica of the Retriever. This enables efficient resource utilization through horizontal scaling of components. You can use Ray via the new RayPipeline class (#1255).
To set the number of replicas, add replicas in the YAML config for the node in a pipeline:
components:
  ...
pipelines:
  - name: ray_query_pipeline
    type: RayPipeline
    nodes:
      - name: ESRetriever
        replicas: 2  # number of replicas to create on the Ray cluster
        inputs: [ Query ]
A RayPipeline currently can only be created with a YAML Pipeline config:
from haystack.pipeline import RayPipeline
pipeline = RayPipeline.load_from_yaml(path="my_pipelines.yaml", pipeline_name="my_query_pipeline")
pipeline.run(query="What is the capital of Germany?")
See docs for more details
😍 Making Pipelines more user-friendly
The old Pipeline design came with a couple of flaws:
- Impossible to route certain parameters (e.g. top_k) to dedicated nodes
- Incorrect parameters in pipeline.run() are silently swallowed
- Hard to understand what is in **kwargs when working with node.run() methods
- Hard to debug
We tackled those with a big refactoring of the Pipeline class and changed how data is passed between nodes (#1321).
This now comes with a few breaking changes:
Component params like top_k, no_ans_boost for Pipeline.run() must be passed in a params dict
pipeline.run(query="Why?", params={"top_k":10, "no_ans_boost":0.5})
Component specific top_ks like top_k_reader, top_k_retriever are now replaced with top_k. To disambiguate, the params can be "targeted" to a specific node.
pipeline.run(query="Why?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5})
See the breaking changes section and the docs for details
📈 Better evaluation metric for QA: Semantic Answer Similarity (SAS)
The evaluation of question answering models compares ground-truth annotations with model predictions. However, as of today, this comparison is mostly lexical and therefore misses answers that have no lexical overlap with the annotation but are still semantically similar, treating correct answers as false. This underestimation of the true performance of models hinders user acceptance in applications and complicates a fair comparison of different models. Therefore, there is a need for an evaluation metric that is based on semantics instead of pure string similarity. In our recent EMNLP paper, we proposed "SAS", a cross-encoder-based metric for the estimation of semantic answer similarity. We compared it to seven existing metrics and found that it correlates better with human judgement. See our paper (#1338)
You can use it in Haystack like this:
...
# initialize the node with a SAS model
eval_reader = EvalAnswers(sas_model="sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
# define a pipeline
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=eval_retriever, name="EvalDocuments", inputs=["ESRetriever"])
p.add_node(component=reader, name="QAReader", inputs=["EvalDocuments"])
p.add_node(component=eval_reader, name="EvalAnswers", inputs=["QAReader"])
...
See our updated Tutorial 5 for a full example.
🤯 New nodes: Doc Classifier, Re-Ranker, QuestionGenerator & more
More nodes, more use cases:
- FARMClassifier node for Document Classification: tag a document at indexing time or add a class downstream in your inference pipeline (#1265)
- SentenceTransformersRanker: re-rank your documents after retrieval to maximize the relevance of your results. This implementation uses the popular sentence-transformers models (#1209)
- QuestionGenerator: Question Answering systems are trained to find an answer given a question and a document; but with the recent advances in generative NLP, there are now models that can read a document and suggest questions that can be answered by that document. All this power is now available to you via the QuestionGenerator class. QuestionGenerator models can be trained on Question Answering datasets, but instead of predicting answers, they take a document as input and are trained to output questions. This can be useful when you want to add "autosuggest" questions to your search bar or accelerate labeling processes. See docs (#1267)
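To give a feel for how these nodes slot into existing code, here is a minimal sketch (not part of the official examples): the import paths, the model name, and the generate() call are assumptions and may differ slightly in your Haystack version.
# Minimal sketch: using the new Ranker and QuestionGenerator nodes
# (import paths, model name and generate() call are assumptions; check the docs)
from haystack.pipeline import Pipeline
from haystack.ranker import SentenceTransformersRanker
from haystack.question_generator import QuestionGenerator

# Re-rank retrieved documents with a pre-trained cross-encoder
ranker = SentenceTransformersRanker(model_name_or_path="cross-encoder/ms-marco-MiniLM-L-12-v2")
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])  # any initialized retriever
p.add_node(component=ranker, name="Ranker", inputs=["ESRetriever"])

# Suggest questions that a given document can answer
question_generator = QuestionGenerator()
questions = question_generator.generate("Berlin is the capital and largest city of Germany.")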
🔭 Better support for OpenSearch
We now support approximate nearest neighbour (ANN) search in OpenSearch (#1225) and fixed some initialization issues.
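As a rough sketch of what this enables: an ANN-backed OpenSearch index can be combined with a dense retriever as below. The class name, module path and constructor parameters are assumptions, not part of this release note; see the document store docs for the exact API.
# Rough sketch (class name, module path and parameters are assumptions; see the docs)
from haystack.document_store import OpenSearchDocumentStore
from haystack.retriever.dense import DensePassageRetriever

# Documents and embeddings live in OpenSearch; queries use approximate kNN search
document_store = OpenSearchDocumentStore(host="localhost", port=9200, index="document")
retriever = DensePassageRetriever(document_store=document_store)
document_store.update_embeddings(retriever)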
📑 New Tutorials
- Tutorial 13 - Question Generation: Jupyter notebook | Colab | Python
- Tutorial 14 - Query Classifier: Jupyter notebook | Colab | Python
⚠️ Breaking Changes
probability field removed from results #1340
Having two fields, probability and score, in answers / documents returned from nodes often caused confusion.
From now on, we'll only have one field called score that is in the range [0,1]. In QA results, this field is populated with the old probability value, so you can simply switch to this one. These fields have changed in both the Python and REST APIs.
Old:
{
"query": "Who is the father of Arya Stark?",
"answers": [
{
"answer": "Lord Eddard Stark",
"score": 14.684528350830078,
"probability": 0.9044522047042847,
"context": ...,
...
},
...
]
}
New:
{
"query": "Who is the father of Arya Stark?",
"answers": [
{
"answer": "Lord Eddard Stark",
"score": 0.9044522047042847,
"context": ...,
...
},
...
]
}
Removed Finder #1326
After being deprecated a few months ago, Finder is now gone - R.I.P.
Params in Pipeline.run() #1321
Component params like top_k, no_ans_boost for Pipeline.run() must be passed in a params dict
Old:
pipeline.run(query="Why?", top_k_retriever=10, no_ans_boost=0.5)
New:
pipeline.run(query="Why?", params={"top_k":10, "no_ans_boost":0.5})
Component specific top_ks like top_k_reader, top_k_retriever are now replaced with top_k. To disambiguate, the params can be "targeted" to a specific node.
Old:
pipeline.run(query="Why?", top_k_retriever=10, top_k_reader=5)
New:
pipeline.run(query="Why?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5})
Also, custom nodes must not have **kwargs in their run methods anymore and should only return the data (e.g. answers) they produce themselves.
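For illustration, a custom node under the new contract could look roughly like the sketch below; the BaseComponent import path and the exact return convention are assumptions, so check the custom nodes docs for your version.
# Rough sketch of a custom node without **kwargs
# (import path and return convention are assumptions; see the custom nodes docs)
from haystack.schema import BaseComponent

class QueryCleaner(BaseComponent):
    outgoing_edges = 1

    def run(self, query: str):
        # Only return the data this node produces itself
        output = {"query": query.strip()}
        return output, "output_1"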
🤓 Detailed Changes
Crawler
Converter
- Add ImageToTextConverter and PDFToTextOCRConverter that utilize OCR #1349
Preprocessor
- Add PreProcessor optional language parameter. #1160
- Improve preprocessing logging #1263
- Make PreProcessor.process() work on lists of documents #1163
Pipeline
- Add Ray integration for Pipelines #1255
- MostSimilarDocumentsPipeline introduced #1413
- QoL function: access certain nodes in pipeline #1441
- Refactor replicas config for Ray Pipelines #1378
- Add simple docs2answer node to allow FAQ style QA / Doc search in API #1361
- Allow for batch indexing when using Pipelines fix #1168 #1231
Document Stores
- Implement OpenSearch ANN #1225
- Bump Weaviate version to 1.7.0 #1412
- Catch Elastic's search_phase_execution and raise with descriptive message. #1371
- Fix behavior of delete_documents() with filters for Milvus #1354
- delete_all_documents() replaced by delete_documents() #1377
- Support OpenDistro init #1334
- Integrate filters with knn queries in OpenDistroElasticsearchDocumentStore #1301
- feat: add support for elastic search to connect without any authentication #1294
- Raise warning when labels are overwritten #1257
- Fix SQLAlchemy relationship warnings #1289
- Added explicit refresh call during refresh_type is false in update em… #1259
- Add id in write_labels() for SQLDocumentStore #1253
- ElasticsearchDocumentStore get_label_count() bug fixed. #1252
- SQLDocumentStore get_label_count() index bug fixed. #1251
Retriever
- Adding multi gpu support for DPR inference #1414
- Ensure num_hard_negatives is 0 when embedding passages #1402
- global_loss_buffer_size to the DensePassageRetriever, fix exceeds max_size #1245
Summarizer
- Transformer summarizer truncation bug fixed #1309
Document Classifier
- Add FARMClassifier node for Document Classification #1265
Re-Ranker
- Add SentenceTransformersRanker with pre-trained Cross-Encoder #1209
Reader
- Use Reader's device by default #1208
Generator
- Add QuestionGenerator #1267
Evaluation
- Add new QA eval metric: Semantic Answer Similarity (SAS) #1338
REST API
- Fix handling of filters in Search REST API #1431
- Add support for Dense Retrievers in REST API Indexing Pipeline #1430
- Add Header in sample REST API Search Request #1293
- Fix convert integer CONCURRENT_REQUEST_PER_WORKER #1247
- Env var CONCURRENT_REQUEST_PER_WORKER #1235
- Small UI and REST API fixes #1223
- Add scaffold for defining custom components for Pipelines #1205
Docker
- Update DocumentStore env in docker-compose #1450
- Enable docker-compose for GPUs & Add public UI image #1406
- Fix tesseract installation in Dockerfile #1405
User Interface
- Allow multiple files to upload for Haystack UI #1323
- Add faq annotation #1333
- Upgrade streamlit #1279
Documentation and Tutorials
- new docs version for 0.9.0 #1217
- Added functionality for Google Colab usecase in Crawler Module #1436
- Update sentence transformer model in FAQ tutorial #1401
- crawler api docs updated. #1388
- Add support for no Docker envs in Tutorial 13 #1365
- Rag tutorial fixes #1375
- Editing docs read.me for new docs website workflow #1372
- Add query classifier usage docs #1348
- Adding tutorial 13 and 14 #1364
- Remove Finder from tutorials #1329
- Tutorial1 remove finder class from import #1328
- Update docstring for RAG #1149
- Update README.md for tutorial 13 Question Generation #1325
- add query classifier colab and jupyter notebook #1324
- Remove pipeline eval example script #1297
- Change variable names in tutorials #1286
- Add links to tutorial 12 to readme #1274
- Encapsulate tutorial code in method #1266
- Fix Links #1199
Misc
- Improve document stores unit test parametrization #1202
- Version tag added to Haystack #1216
- Add type ignore to resolve mypy errors #1427
- Bump pillow from 8.2.0 to 8.3.2 #1423
- Add sentence-transformers as mandatory dependency and remove from dev… #1387
- Adjust WeaviateDocumentStore import #1379
- Update test documentation in readme #1355
- Add tests for Crawler #1339
- Suppress FAISS logs & apex warnings #1315
- Pin Weaviate version #1306
- Relax typing for meta data #1224
🙏 Big thanks to all contributors! ❤️
A big thank you to all the contributors for this release: @prikmm @akkefa @MichelBartels @hammer @ramgarg102 @bishalgaire @MarkusSagen @dfhssilva @srevinsaju @demarant @mosheber @MichaelBitard @guillim @vblagoje @stefanondisponibile @cambiumproject @bobvanluijt @tanay1337 @Timoeller @annagruendler @PiffPaffM @oryx1729 @bogdankostic @brandenchan @shahrukhx01 @julian-risch @tholor
We would like to thank everyone who participated in the insightful discussions on GitHub and our community Slack!