ValueError: Cannot take a larger sample than population when replace is False #55

Open
SSMK-wq opened this issue May 12, 2022 · 3 comments

SSMK-wq commented May 12, 2022

I am trying to dedupe my dataframe, which has a column Test_names. It has only around 40 rows.

So I tried the code below, from this tutorial: https://pypi.org/project/pandas-dedupe/

import pandas as pd
import pandas_dedupe

df = pd.read_excel('names.xlsx')
df_clean = pandas_dedupe.dedupe_dataframe(df, ['Test_names'])

I got the below error

ValueError: Cannot take a larger sample than population when replace is False

I also tried the below

df_clean = pd.read_excel('clean_names.xlsx')
df_messy = pd.read_excel('test_names.xlsx')

# initiate deduplication
df_final = pandas_dedupe.gazetteer_dataframe(df_clean, df_messy, 'Test_names', canonicalize=True)

And got the same error

ValueError: Cannot take a larger sample than population when replace is False
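The wording of this error suggests that somewhere a sample is being drawn without replacement and the requested size exceeds the number of available items. Which call inside dedupe raises it is not established in this thread, so the following is only a minimal sketch of that class of failure, using the standard library's `random.sample` (whose message is worded differently). The function names and the request for 1500 items are hypothetical, chosen for illustration.

```python
import random


def sample_rows(n_rows, sample_size, seed=0):
    """Draw sample_size distinct row indices from n_rows rows (no replacement)."""
    rng = random.Random(seed)
    return rng.sample(range(n_rows), sample_size)


def sample_rows_capped(n_rows, sample_size, seed=0):
    """Same draw, but never ask for more rows than exist."""
    return sample_rows(n_rows, min(sample_size, n_rows), seed)


# 40 rows, as in the report; 1500 is a hypothetical requested sample size.
try:
    sample_rows(40, 1500)
    failed = False
except ValueError as exc:
    failed = True
    print(f"ValueError: {exc}")

# Capping the request at the population size avoids the error.
print(len(sample_rows_capped(40, 1500)))
```

Capping the sample size is shown only to illustrate why a small population can trigger the error; it is not a claim about how dedupe or pandas-dedupe handle it.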

quancore commented May 18, 2022

Same problem here. @Lyonk71 @ieriii

ieriii (Collaborator) commented May 19, 2022

Thanks for reporting this.
I had a look, and the easiest fix is to downgrade dedupe to version 2.0.13.

I'll take a closer look at the latest release of dedupe (version 2.0.14) and see how we can ensure compatibility.
Let me know if it works or if you have any further questions.
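
To check whether the workaround applies to a given environment, one option is to compare the installed dedupe version against 2.0.13. The sketch below assumes the package is installed under the distribution name `dedupe` and that its version string starts with three dot-separated numbers; the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile

KNOWN_GOOD = (2, 0, 13)  # version suggested as a workaround in this thread


def parse_version(text):
    """Turn '2.0.14' into (2, 0, 14), ignoring any non-numeric suffix."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_newer_than_known_good(installed):
    """True if the installed version is later than 2.0.13."""
    return parse_version(installed) > KNOWN_GOOD


def check_dedupe():
    """Describe the installed dedupe version relative to 2.0.13."""
    try:
        installed = version("dedupe")
    except PackageNotFoundError:
        return "dedupe is not installed"
    if is_newer_than_known_good(installed):
        return (f"dedupe {installed} is newer than 2.0.13; "
                "consider: pip install dedupe==2.0.13")
    return f"dedupe {installed} is at or below 2.0.13"


print(check_dedupe())
```

If the check reports a newer version, pinning with `pip install dedupe==2.0.13` is the downgrade described above.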

@sarbaniAi

Hi all, I am using the PostgreSQL approach with my own data (~40K records), and I am getting the same error:
"ValueError: Cannot take a larger sample than population when replace is False".
