
Performance expectations? #22

Open
Gerbert-Kaandorp opened this issue Jun 22, 2024 · 1 comment

Comments

@Gerbert-Kaandorp

Hi Vincent!

Me again :), thanks a lot for adding the types in the last release!
It is working out great in my dev stack and I don't need a converter any more! 🎉🎉

So, I wanted to test the performance of this setup a bit.
I downloaded this Pokemon dataset:

https://triplydb.com/academy/pokemon

It is about 4.893 MB (~29,000 triples).
I am using the following function to insert the triples over HTTP via the endpoint:

import rdflib
import requests

def upload_data_from_disk_multi_graph(url, filename='data/pokemon.trig', format="trig", batch_size=3000):
    ds = rdflib.Dataset()
    ds.parse(filename, format=format)

    # Iterate over each named graph in the dataset
    for graph in ds.graphs():
        graph_uri = graph.identifier
        batches = []
        batch = []

        # Collect the graph's triples into batches of at most batch_size
        for s, p, o in graph:
            batch.append((s, p, o))
            if len(batch) >= batch_size:
                batches.append(batch)
                batch = []
        if batch:
            batches.append(batch)

        # Build and execute one INSERT DATA update per batch
        for batch in batches:
            parts = [f"INSERT DATA {{ GRAPH <{graph_uri}> {{"]
            for s, p, o in batch:
                # n3() serializes URIRefs, blank nodes and literals correctly
                parts.append(f"{s.n3()} {p.n3()} {o.n3()} .")
            parts.append("} }")
            insert_query = " ".join(parts)
            print(f"Executing batch insert for graph {graph_uri}: {len(insert_query)} characters.")

            response = requests.post(url, data={'update': insert_query}, headers={'Accept': 'application/ld+json'})
            print("Response Status:", response.status_code)

Turns out, this is extremely slow. :(

I am not sure if I am even using the API the right way.
Do you know what I am doing wrong?
Or is this performance normal for RDFLib?

Thanks for reading.
Gerbert

@vemonet
Owner

vemonet commented Jul 25, 2024

Hi @Gerbert-Kaandorp, we are just executing the provided update query using:

# prepareUpdate comes from rdflib.plugins.sparql
parsed_update = prepareUpdate(update_query, initNs=graph_ns)
self.graph.update(parsed_update, "sparql")

So I guess this is just RDFLib not being very fast at inserting data through update queries.

And in general, I don't think INSERT DATA is a fast way to load a lot of data into any triplestore. Triplestores usually provide a separate call specifically for bulk-loading Turtle/XML files, which we could also add relatively easily here: a call that takes an RDF file and parses it into the graph used by the endpoint (a sketch of what that could look like is below). INSERT DATA is more aimed at making small changes on the fly from an application, adding a few dozen or a few hundred triples at a time.
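For illustration, here is a minimal sketch of what such a bulk-load call could look like, assuming a FastAPI app with direct access to the endpoint's graph (the /bulk-load route and its parameters are hypothetical, not something rdflib-endpoint provides today):

import rdflib
from fastapi import FastAPI, File, UploadFile

g = rdflib.Dataset()
app = FastAPI()

# Hypothetical bulk-load route: parse the uploaded RDF file straight into
# the endpoint's graph, skipping SPARQL update parsing entirely
@app.post("/bulk-load")
async def bulk_load(file: UploadFile = File(...), format: str = "turtle"):
    g.parse(data=await file.read(), format=format)
    return {"triples_in_store": len(g)}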

If you have control over the server where you deploy the endpoint, then the recommended way is to parse the file you want to load with RDFLib and use the resulting graph when instantiating the SparqlEndpoint.
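For example, something like this (reusing the file and format from your example):

import rdflib
from rdflib_endpoint import SparqlEndpoint

# Load the data in-process, with no HTTP or SPARQL update overhead
ds = rdflib.Dataset()
ds.parse("data/pokemon.trig", format="trig")

# Serve the pre-loaded graph, e.g. with: uvicorn main:app
app = SparqlEndpoint(graph=ds)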
