Any description how to use the library? #9
Comments
I think I found something. Without group_by_subject, it works fine. However, I am looking into the Enron dataset, which has no In-Reply-To or References fields in the mail headers, so I need to group by subject. In that code path, the message attribute of the container is accessed in the wrong way: the message attribute of the variable 'existing' seems to be accessible with get('message'). I will keep posting if things work. Edit: I think I will open a pull request. In the bottom part of the code (group_by_subject), some things are messed up or not updated. I checked the code this is forked from to understand what it was meant to do and found the changes necessary to make it work. |
Hello @benelot , thanks for opening this issue!
I just added an example; the
Yes, a thread is a hierarchical structure composed of
We also wanted to run email threading on the Enron dataset at some point. Unfortunately, the JWZ algorithm fundamentally relies on the When I last looked into this, I didn't find any good alternative to JWZ that wouldn't need those header fields. There are certainly a few research papers.
But I haven't seen any implementation of those, and I'm not sure if the results would be reliable enough to use in practice. If you find some solution for Enron dataset threading I would be interested to know.
I'm not sure I have understood the issue, but in any case I am sure there are issues; a PR to fix them would be welcome. |
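For datasets without those header fields, the subject fallback discussed above can be illustrated roughly as follows. This is only a minimal sketch using the standard library, not the code in this repository: `normalize_subject` and `group_by_normalized_subject` are hypothetical names, and the prefix pattern (Re:, Fwd:, Re[2]: and similar) is an assumption about what the subjects look like.

```python
import re
from collections import defaultdict

# Assumed reply/forward markers: "Re:", "RE:", "Fw:", "Fwd:", "Re[2]:" ...
_PREFIX = re.compile(r'^\s*((re|fwd?)(\[\d+\])?\s*:\s*)+', re.IGNORECASE)


def normalize_subject(subject):
    """Strip leading reply/forward markers, collapse whitespace, lowercase."""
    stripped = _PREFIX.sub('', subject or '')
    return ' '.join(stripped.split()).lower()


def group_by_normalized_subject(messages):
    """Group (message_id, subject) pairs by their normalized subject.

    Hypothetical helper: messages whose subject is empty after
    normalization are left ungrouped.
    """
    groups = defaultdict(list)
    for message_id, subject in messages:
        key = normalize_subject(subject)
        if key:
            groups[key].append(message_id)
    return dict(groups)


groups = group_by_normalized_subject([
    ("a", "Gas prices"),
    ("b", "RE: Gas prices"),
    ("c", "Fwd: Re: gas  prices"),
    ("d", "Lunch?"),
    ("e", ""),
])
```

Grouping like this only recovers which messages share a subject, not who replied to whom, so exact duplicates in the data end up in the same group as genuine replies.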
Hi @rth, The problem I have when running the code is that in some places, the message attribute of the container is accessed by container.message. This does not work here; I have to use get("message") instead. You seem to mix both notations in several locations. Without that change, I run into the above-mentioned exception. Furthermore, I had to revert to the former implementation of is_dummy. You seem to have changed that and I am not sure why. Your current implementation always finds 'message' as its key and thus none of the containers are considered a dummy. I will send you a pull request for this some time next week so you can decide if the changes work for you. Maybe it has something to do with the fact that I am running the code on Python 3.6? Who knows, I don't. Regarding Enron: The dataset contains a lot of duplicates, so it might seem like it clusters together messages that have the same subject. However, without the duplicates I get mostly threads of 2 and a smaller number of threads of 3 mails. The rest seems to stay unthreaded. |
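To illustrate the is_dummy point, here is a small sketch of the pitfall as I understand it. `DictContainer` and both `is_dummy_*` functions are hypothetical stand-ins written for this comment, not the library's actual JwzContainer or is_dummy; the assumption is that the container is dict-like and always has a 'message' key, set to None when empty.

```python
class DictContainer(dict):
    """Hypothetical dict-like container (NOT the library's JwzContainer).

    Assumption: the 'message' key is always created, with None as the
    placeholder for an empty (dummy) container.
    """

    def __init__(self, message=None):
        super().__init__(message=message, children=[])


def is_dummy_by_key(container):
    # Pitfall: the key always exists, so this never reports a dummy.
    return 'message' not in container


def is_dummy_by_value(container):
    # Looks at the stored value instead: a dummy holds no message.
    return container.get('message') is None


empty = DictContainer()
filled = DictContainer(message="some message object")
```

With a key-based check every container looks non-empty, which would match the behaviour described above where none of the containers are considered a dummy.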
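Note that there are two types of objects: a generic container class Container, JwzContainer (that behaves like a dict, so you can store anything there, including a "message" key) and the Message class that has a message attribute. So you end up with something like,
container = JwzContainer()
container['message'] = Message()
container['message'].message = "something"
I'm not saying that this is a good situation. But that's the result of evolution from the original implementation to the current code base, where I needed a more generic container class (also used to represent hierarchical clustering in this example).
So if you run the included example/parse_mailbox.py (or adapt it to use mailbox.mbox instead of parse_mailbox) you should not run into any issues.
I don't remember, it's been a while. Mostly I validated the obtained threading in test_threading_fedora_June2010; if you have another way of validating the results, a PR would definitely be welcome. |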
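The difference between the two access styles can be reproduced in isolation. The classes below are hypothetical stand-ins that only mimic the layout described above (a dict-like container holding an object with a message attribute); they are not the library's own JwzContainer and Message.

```python
class Message:
    """Stand-in for the message class: it has a `message` attribute."""

    def __init__(self, message=None):
        self.message = message


class DictContainer(dict):
    """Stand-in for a dict-like container: entries are keys, not attributes."""


container = DictContainer()
container['message'] = Message()
container['message'].message = "something"

# Attribute access on the container fails, because 'message' is a dict key.
try:
    container.message
    attribute_access_works = True
except AttributeError:
    attribute_access_works = False

# Key access returns the Message object, whose own attribute holds the payload.
inner = container.get('message')
payload = inner.message if inner is not None else None
```

So `container.message` and `container['message'].message` refer to two different levels, which is one way mixing both notations could lead to the exception reported above.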
Do you depend on JwzContainer in other locations outside of this
repository (I saw that this is part of a larger project)? Otherwise I
could refactor the internals to make things clean again.
…On Sun, Feb 4, 2018 at 1:26 PM Roman Yurchak ***@***.***> wrote:
The problem I have when running the code is that in some places, the
message attribute of the container is accessed by container.message.
Note that there are two types of objects, a generic container class
Container, JwzContainer (that behaves like a dict and so you can store
anything there, including a "message" key) and the Message class that has
a message attribute. So you end up with something like,
container = JwzContainer()
container['message'] = Message()
container['message'].message = "something"
I'm not saying that this is a good situation. But that's the result of
evolution from the original implementation to the current code base where I
needed a more generic container class (also used to represent
hierarchical clustering in this example
<http://freediscovery.io/doc/stable/python/examples/birch_cluster_hierarchy.html#sphx-glr-python-examples-birch-cluster-hierarchy-py>
).
This does not work here, instead, I have to use get("message") instead.
So if you run the included example/parse_mailbox.py (or adapt it to use
mailbox.mbox instead of parse_mailbox) you should not run into any issues.
Furthermore, I had to revert to the former implementation of is_dummy. You
seem to have changed that and I am not sure why.
I don't remember, it's been a while. Mostly I validated the obtained
threading in test_threading_fedora_June2010
<https://github.com/FreeDiscovery/jwzthreading/blob/d462a36fb823603ea3ce056f1006c06e166ae6b1/jwzthreading/tests/test_newsgroups.py#L54>
if you have another way of validating the results, a PR would definitely be
welcome..
|
Sure, feel free to do so. The larger project currently bundles jwzthreading, so it won't be affected by a refactoring here. (and I could always update the bundled version later). Thanks. |
Hello!
I am looking for advice on how to use the library. I have my mail in the Unix mbox format and would like to get it threaded. How do I get this to work?
I tried to use the 2010-January.txt file from your tests and wanted to use it as an mbox directly with the jwzthreading.py:
However, the library breaks with:
Thanks for any hints.
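For reference, reading an mbox file and pulling out the header fields that threading relies on can be done with the standard library alone. This is a rough sketch, not the library's own loader: `read_threading_headers` is a hypothetical helper, and the two-message sample mailbox is made up for illustration.

```python
import mailbox
import os
import tempfile

# Made-up two-message mailbox in mbox format (illustration only).
SAMPLE = (
    "From alice@example.com Mon Jan  4 10:00:00 2010\n"
    "Message-ID: <1@example.com>\n"
    "Subject: Hello\n"
    "\n"
    "First message.\n"
    "\n"
    "From bob@example.com Mon Jan  4 11:00:00 2010\n"
    "Message-ID: <2@example.com>\n"
    "In-Reply-To: <1@example.com>\n"
    "References: <1@example.com>\n"
    "Subject: Re: Hello\n"
    "\n"
    "A reply.\n"
)


def read_threading_headers(path):
    """Return one dict per message with the headers used for threading."""
    rows = []
    box = mailbox.mbox(path)
    try:
        for msg in box:
            rows.append({
                'message_id': msg.get('Message-ID'),
                'in_reply_to': msg.get('In-Reply-To'),
                # References is a whitespace-separated list of message ids
                'references': (msg.get('References') or '').split(),
                'subject': msg.get('Subject', ''),
            })
    finally:
        box.close()
    return rows


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.mbox')
    with open(path, 'w', newline='\n') as fh:
        fh.write(SAMPLE)
    headers = read_threading_headers(path)
```

If In-Reply-To and References come back empty for every message, that is a sign the mailbox cannot be threaded from headers alone.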
Also, is it then possible to get a list of emails belonging to a certain thread? I saw that the dictionary at the end of the threading method was dropped, so you no longer have a subject=emails dict.
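One way to get the list of emails in a thread, assuming the result is a tree of containers, is to walk that tree and collect every message found. The node layout below (a dict with 'message' and 'children' keys, and None for empty placeholder containers) is an assumption made for this sketch and may not match the library's actual container class; `iter_thread_messages` is a hypothetical helper.

```python
def iter_thread_messages(node):
    """Yield every message in a thread tree, depth first.

    Assumed node layout: {'message': <object or None>, 'children': [...]}.
    Placeholder nodes (message is None) are skipped, but their children
    are still visited.
    """
    stack = [node]
    while stack:
        current = stack.pop()
        if current.get('message') is not None:
            yield current['message']
        # reversed() so that children are visited in their original order
        stack.extend(reversed(current.get('children', [])))


thread = {'message': 'root', 'children': [
    {'message': None, 'children': [
        {'message': 'reply A', 'children': []},
    ]},
    {'message': 'reply B', 'children': []},
]}

messages = list(iter_thread_messages(thread))
```

Calling this once per root container would give a flat list of emails per thread, similar in spirit to the subject=emails dict mentioned above.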