NA OFI: hostname not working on debian loopback interface #535

Open
utkarshayachit opened this issue Dec 17, 2021 · 9 comments

@utkarshayachit

utkarshayachit commented Dec 17, 2021

My objective is to make the server run on a specific port. It seems that explicitly specifying the address should work. However, it does not work if I use the hostname instead of an IP address or interface name when specifying the address for the server.

#include <unistd.h>
#include <thallium.hpp>
#include <iostream>
#include <sstream>
#include <string>

namespace tl = thallium;

/**
 * This is the ParaT server executable.
 */
int main(int argc, char* argv[])
{
  char buffer[256];
  gethostname(buffer, 256);

  std::ostringstream str;
  str << "tcp://" << buffer << ":11111";
  tl::engine myEngine(str.str(), THALLIUM_SERVER_MODE);
  std::cout << "requested address: " << str.str() << std::endl;
  std::cout << "server running at address: " << myEngine.self() << std::endl;
  return 0;
}

The output from this is as follows:

requested address: tcp://miron:11111
server running at address: ofi+tcp;ofi_rxm://192.168.1.73:33499

Platform (please complete the following information):

  • System description: Ubuntu 20.04.3 LTS
  • Compiler version: GCC 9.3.0
  • Plugin and protocol used [e.g. ofi, psm2]
  • Dependency version: libfabric-1.13.0
@soumagne
Member

@utkarshayachit PR #537 should fix your issue; I just merged it to master. Would you be able to try it out?

@soumagne soumagne changed the title hostname not working with starting an engine NA OFI: hostname not working with starting an engine Dec 22, 2021
@soumagne soumagne added this to the mercury-2.1.1 milestone Dec 22, 2021
@utkarshayachit
Author

Thanks for the fix. It still fails for me, although now I get an error message rather than it just picking a random port.

Here's the output from the same test code from the issue.

# [31877.645652] mercury->fatal: [error] /tmp/utkarsh/spack-stage/spack-stage-mercury-master-tubj37zytqb2mcdmpbrhiqe4mvbexaro/spack-src/src/na/na_ofi.c:2034
 # na_ofi_domain_open(): No provider found for "tcp;ofi_rxm" provider on domain "miron"
[error] Could not initialize hg_class
terminate called after throwing an instance of 'thallium::margo_exception'
  what():  [/home/utkarsh/Kitware/Mochi/spack/opt/spack/linux-ubuntu20.04-broadwell/gcc-9.3.0/mochi-thallium-develop-yufqz3w5lpooqe2rmafnxkynt7mr4kyc/include/thallium/engine.hpp:180][margo_init_ext] Could not initialize Margo

@soumagne
Member

soumagne commented Dec 24, 2021

Thanks. I think we'll get to the bottom of it. Could you please export HG_LOG_LEVEL=warning and HG_LOG_SUBSYS=na and rerun your test? That will give us more details. The output of fi_info on your system would also be helpful. I can't be sure until I see the logs, but it looks like your hostname cannot be resolved for some reason.

@soumagne
Member

Thinking more about it, you might also want to double-check that your /etc/hosts does not associate your hostname with an interface that is down, or something similar. I can't think of anything else that would be significantly different on your system compared to the ones we use.
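
A rough way to check this is sketched below: given an IPv4 address (for example, the one your hostname resolves to), walk the local interfaces with the standard getifaddrs call and report which interface holds that address and whether it is up. The struct and function names here are hypothetical, and the exact-address comparison is an assumption made for illustration, not a description of Mercury's own check.

```cpp
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <net/if.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <string>

// Result of looking up an IPv4 address among the local interfaces.
// (Hypothetical helper type, not part of Mercury or Thallium.)
struct InterfaceMatch {
  bool found = false;  // true if some interface holds exactly this address
  std::string name;    // interface name, e.g. "eth0" (empty if not found)
  bool up = false;     // true if the matching interface has IFF_UP set
};

// Looks for a local interface whose IPv4 address equals `ip` exactly.
InterfaceMatch find_interface_for_ipv4(const std::string& ip)
{
  InterfaceMatch match;

  in_addr wanted{};
  if (inet_pton(AF_INET, ip.c_str(), &wanted) != 1) {
    return match;  // not a valid IPv4 literal
  }

  ifaddrs* list = nullptr;
  if (getifaddrs(&list) != 0) {
    return match;  // could not enumerate interfaces
  }

  for (ifaddrs* p = list; p != nullptr; p = p->ifa_next) {
    if (p->ifa_addr == nullptr || p->ifa_addr->sa_family != AF_INET) {
      continue;  // skip entries without an IPv4 address
    }
    const auto* sin = reinterpret_cast<const sockaddr_in*>(p->ifa_addr);
    if (sin->sin_addr.s_addr == wanted.s_addr) {
      match.found = true;
      match.name = p->ifa_name;
      match.up = (p->ifa_flags & IFF_UP) != 0;
      break;
    }
  }

  freeifaddrs(list);
  return match;
}
```

If `found` is false, or `found` is true but `up` is false, the hostname is mapped to an address that no usable interface currently carries.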

@utkarshayachit
Author

here are the results

# [481.582259] mercury->na: [error] /tmp/utkarsh/spack-stage/spack-stage-mercury-master-tubj37zytqb2mcdmpbrhiqe4mvbexaro/spack-src/src/na/na_ip.c:212
 # na_ip_check_interface(): No ifa_name match found for IP
# [481.582289] mercury->cls: [warning] /tmp/utkarsh/spack-stage/spack-stage-mercury-master-tubj37zytqb2mcdmpbrhiqe4mvbexaro/spack-src/src/na/na_ofi.c:3762
 # na_ofi_initialize(): Could not find matching interface for miron, attempting to use it as domain name
# [481.583238] mercury->fatal: [error] /tmp/utkarsh/spack-stage/spack-stage-mercury-master-tubj37zytqb2mcdmpbrhiqe4mvbexaro/spack-src/src/na/na_ofi.c:2034
 # na_ofi_domain_open(): No provider found for "tcp;ofi_rxm" provider on domain "miron"
# [481.583258] mercury->cls: [error] /tmp/utkarsh/spack-stage/spack-stage-mercury-master-tubj37zytqb2mcdmpbrhiqe4mvbexaro/spack-src/src/na/na_ofi.c:3834
 # na_ofi_initialize(): Could not open domain for tcp;ofi_rxm, miron
# [481.583270] mercury->cls: [error] /tmp/utkarsh/spack-stage/spack-stage-mercury-master-tubj37zytqb2mcdmpbrhiqe4mvbexaro/spack-src/src/na/na.c:339
 # NA_Initialize_opt(): Could not initialize plugin
[error] Could not initialize hg_class
terminate called after throwing an instance of 'thallium::margo_exception'
  what():  [/home/utkarsh/Kitware/Mochi/spack/opt/spack/linux-ubuntu20.04-broadwell/gcc-9.3.0/mochi-thallium-develop-yufqz3w5lpooqe2rmafnxkynt7mr4kyc/include/thallium/engine.hpp:180][margo_init_ext] Could not initialize Margo
fish: “env HG_LOG_LEVEL=warning HG_LOG…” terminated by signal SIGABRT (Abort)

cat /etc/hosts
127.0.0.1       view-localhost
127.0.0.1       localhost
127.0.1.1       miron

@soumagne
Member

soumagne commented Jan 3, 2022

Thanks. I believe the third line in your /etc/hosts file (127.0.1.1 miron) is causing the issue. You should either remove it or replace 127.0.1.1 with your permanent IP address, as documented here: https://www.debian.org/doc/manuals/debian-reference/ch05.en.html#_the_hostname_resolution
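
To check for this kind of /etc/hosts entry on another machine, a minimal C++ sketch could resolve the hostname with the standard getaddrinfo call and flag any result in 127.0.0.0/8. This is not part of Mercury or Thallium, and the helper names `is_loopback_ipv4` and `resolve_ipv4` are hypothetical.

```cpp
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <string>
#include <vector>

// True if `ip` is a valid IPv4 literal inside 127.0.0.0/8 (the loopback block).
bool is_loopback_ipv4(const std::string& ip)
{
  in_addr addr{};
  if (inet_pton(AF_INET, ip.c_str(), &addr) != 1) {
    return false;  // not a valid IPv4 literal
  }
  // s_addr is in network byte order; the top octet of the host-order
  // value is the first number of the dotted quad.
  return (ntohl(addr.s_addr) >> 24) == 127;
}

// Returns the IPv4 addresses that the system resolver reports for `host`
// (this consults /etc/hosts first on a typical Linux setup).
std::vector<std::string> resolve_ipv4(const std::string& host)
{
  std::vector<std::string> result;

  addrinfo hints{};
  hints.ai_family = AF_INET;
  hints.ai_socktype = SOCK_STREAM;

  addrinfo* list = nullptr;
  if (getaddrinfo(host.c_str(), nullptr, &hints, &list) != 0) {
    return result;  // resolution failed; return an empty list
  }

  for (addrinfo* p = list; p != nullptr; p = p->ai_next) {
    char text[INET_ADDRSTRLEN];
    const auto* sin = reinterpret_cast<const sockaddr_in*>(p->ai_addr);
    if (inet_ntop(AF_INET, &sin->sin_addr, text, sizeof(text)) != nullptr) {
      result.emplace_back(text);
    }
  }

  freeaddrinfo(list);
  return result;
}
```

With the /etc/hosts shown above, resolve_ipv4("miron") would presumably return 127.0.1.1, for which is_loopback_ipv4 returns true.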

@soumagne
Member

soumagne commented Jan 3, 2022

Having said that, we should probably also be able to support this type of loopback IP. I'll have a look to see if we can do that. In that case I'd expect miron:11111 to be resolved as 127.0.1.1:11111, which is probably not the IP you'd want anyway, but we should somehow support it.
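
If the goal is a fixed port on a routable address, one option that avoids hostname resolution entirely is to pick an interface address directly and build the address string from it. The sketch below takes the first IPv4 address on an interface that is up and not flagged as loopback. The helper names are hypothetical, and "first match" is an arbitrary choice made for illustration; a machine with several interfaces may need a more specific rule.

```cpp
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <net/if.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <string>

// Builds an address string such as "tcp://192.168.1.73:11111".
std::string make_address(const std::string& protocol,
                         const std::string& host, int port)
{
  return protocol + "://" + host + ":" + std::to_string(port);
}

// Returns the first IPv4 address found on an interface that is up and
// not a loopback interface, or an empty string if there is none.
std::string first_non_loopback_ipv4()
{
  std::string result;

  ifaddrs* list = nullptr;
  if (getifaddrs(&list) != 0) {
    return result;  // could not enumerate interfaces
  }

  for (ifaddrs* p = list; p != nullptr; p = p->ifa_next) {
    if (p->ifa_addr == nullptr || p->ifa_addr->sa_family != AF_INET) {
      continue;  // skip entries without an IPv4 address
    }
    if ((p->ifa_flags & IFF_UP) == 0 || (p->ifa_flags & IFF_LOOPBACK) != 0) {
      continue;  // skip interfaces that are down or loopback
    }
    char text[INET_ADDRSTRLEN];
    const auto* sin = reinterpret_cast<const sockaddr_in*>(p->ifa_addr);
    if (inet_ntop(AF_INET, &sin->sin_addr, text, sizeof(text)) != nullptr) {
      result = text;
      break;
    }
  }

  freeifaddrs(list);
  return result;
}
```

The string from make_address("tcp", first_non_loopback_ipv4(), 11111) could then be passed to tl::engine in place of the hostname-based one in the test program from the issue, after checking that the returned address is not empty.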

@utkarshayachit
Author

FWIW, removing it does indeed seem to solve the issue.

@soumagne
Member

soumagne commented Jan 3, 2022

Great, thanks for confirming.

@soumagne soumagne changed the title NA OFI: hostname not working with starting an engine NA OFI: hostname not working on debian loopback interface Aug 5, 2022
@soumagne soumagne modified the milestones: mercury-2.2.0, future Aug 5, 2022