More info on configuration options #4

Open
RacheleSprugnoli opened this issue Jul 24, 2023 · 2 comments

Comments

@RacheleSprugnoli

Hi, thanks for providing this code! Could you please give more information (e.g. a brief explanation) of the following options?

  • max_align=5
  • top_k=3
  • win=5
  • skip=-0.1
  • margin=True
  • len_penalty=True
  • is_split=False

Thank you in advance!
Rachele

@bfsujason
Owner

bfsujason commented Jul 24, 2023

max_align limits the alignment types that are allowed, such as 1:1, 1:2, etc. max_align=5 means the allowed alignments are 1:0, 0:1, 1:1, 1:2, 2:1, 2:2, 2:3, and 3:2. You can set this parameter to a higher value if the corpus to be aligned contains many complex alignments.

top_k is for the search of k nearest target neighbors of each source sentence in the first-step alignment.
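
To illustrate the idea behind top_k, here is a minimal pure-Python sketch of a k-nearest-neighbor search over sentence vectors using cosine similarity. It is not Bertalign's actual implementation; the function names `cosine` and `top_k_neighbors` and the toy 2-dimensional vectors are assumptions made up for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_k_neighbors(src_vecs, tgt_vecs, k=3):
    """For each source vector, return the indices of its k most similar targets."""
    neighbors = []
    for s in src_vecs:
        # Score every target sentence against this source sentence.
        scored = [(cosine(s, t), j) for j, t in enumerate(tgt_vecs)]
        # Highest similarity first; keep only the best k candidates.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        neighbors.append([j for _, j in scored[:k]])
    return neighbors

# Toy embeddings (hypothetical): 2 source sentences, 3 target sentences.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[1.0, 0.1], [0.1, 1.0], [1.0, 1.0]]
candidates = top_k_neighbors(src, tgt, k=2)
print(candidates)
```

A larger k keeps more candidate targets per source sentence, at the cost of a larger search.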

win is the search window of dynamic programming in the second-step alignment.

skip is the predefined simililarity score for 1:0 and 0:1 alignments. If your corpus consists of many omissions and insertions, you can set this value to a larger one, e.g. skip=0.
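
The effect of skip can be seen in a toy dynamic program that only allows 1:1, 1:0 and 0:1 steps. This is a simplified sketch, not Bertalign's code: the function `align`, the similarity matrix, and the restriction to three step types are all assumptions for illustration.

```python
def align(sim, skip):
    """Best monotonic path through a similarity matrix.

    sim[i][j] is the similarity of source sentence i and target sentence j.
    A 1:1 step earns sim[i][j]; a 1:0 or 0:1 step earns the fixed `skip` score.
    Returns the list of steps, e.g. [(1, 1), (1, 0), ...].
    """
    n, m = len(sim), len(sim[0])
    neg_inf = float("-inf")
    score = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            steps = []
            if i > 0 and j > 0:
                steps.append((score[i - 1][j - 1] + sim[i - 1][j - 1], (1, 1)))
            if i > 0:
                steps.append((score[i - 1][j] + skip, (1, 0)))  # source left unaligned
            if j > 0:
                steps.append((score[i][j - 1] + skip, (0, 1)))  # target left unaligned
            for cand, step in steps:
                if cand > score[i][j]:
                    score[i][j], back[i][j] = cand, step
    # Walk back from the end to recover the chosen steps.
    path, i, j = [], n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        path.append((di, dj))
        i, j = i - di, j - dj
    path.reverse()
    return path

# Hypothetical scores: the second source/target pair is a poor match (-0.05).
sim = [[0.9, 0.1],
       [0.1, -0.05]]
strict = align(sim, skip=-0.1)  # skipping is penalized
loose = align(sim, skip=0.0)    # skipping is free
print(strict, loose)
```

In this toy case, skip=-0.1 keeps the weak pair as a 1:1 alignment, while skip=0 leaves both sentences unaligned, which matches the advice to raise skip for corpora with many omissions and insertions.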

margin represents modified cosine similarity as proposed in https://doi.org/10.1093/llc/fqac089.

len_penalty takes into account the length difference between source and target sentences when calculating the similarity of sentence pairs.

If is_split=True, it means the corpus has already been split into sentences. Otherwise, bertalign uses sentence-splitter to split the bitexts into sentences.

@jdough1982

jdough1982 commented May 12, 2024

Hi. Is there a way to specify max_align with more granularity? For my use case, I would like to limit the allowable alignments to 1:1, 1:2, ..., 1:n, and the inverse thereof (1:1, 2:1, ..., n:1). In other words, I want to exclude 1:0, 0:1, and many-to-many alignments.

EDIT: nvm modifying get_alignment_types or hardcoding second_alignment_types seems to have done the trick.
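
For anyone with the same use case, the restricted set of alignment types described above could be generated along these lines. This is only a sketch: `one_to_many_types` is a hypothetical helper, not part of bertalign, and how its result would be wired into get_alignment_types or second_alignment_types is not shown here because it depends on the library's internals.

```python
def one_to_many_types(max_n):
    """Return alignment types 1:1, 1:2, ..., 1:n and 2:1, ..., n:1.

    Excludes 1:0, 0:1 and all many-to-many types such as 2:2.
    Each type is a (source_count, target_count) tuple.
    """
    types = [(1, 1)]
    for n in range(2, max_n + 1):
        types.append((1, n))  # one source sentence, n target sentences
        types.append((n, 1))  # n source sentences, one target sentence
    return types

allowed = one_to_many_types(3)
print(allowed)
```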

matgille added a commit to matgille/mutilingual_collator that referenced this issue May 23, 2024