diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md new file mode 100644 index 0000000..c7ce4e0 --- /dev/null +++ b/CODE_OF_CONDUCT.md @@ -0,0 +1,86 @@ +## Contributor Covenant Code of Conduct + + +### Our Pledge + +In the interest of fostering an open and welcoming environment, we as +contributors and maintainers pledge to making participation in our project +and our community a harassment-free experience for everyone, regardless of +age, body size, disability, ethnicity, sex characteristics, gender identity +and expression, level of experience, education, socio-economic status, +nationality, personal appearance, race, religion, or sexual identity and +orientation. + + +### Our Standards + +Examples of behavior that contributes to creating a positive environment +include: + +- Gracefully accepting constructive criticism +- Being respectful of differing viewpoints and experiences +- Focusing on what is best for the community +- Using welcoming and inclusive language +- Showing empathy towards other community members + +Examples of unacceptable behavior by participants include: + +- Trolling, insulting/derogatory comments, and personal or political + attacks +- Public or private harassment +- Publishing others' private information, such as a physical or electronic + address, without explicit permission +- The use of sexualized language or imagery and unwelcome sexual attention + or advances + + +### Our Responsibilities + +Project maintainers are responsible for clarifying the standards of +acceptable behavior and are expected to take appropriate and fair +corrective action in response to any instances of unacceptable behavior. + +Project maintainers have the right and responsibility to remove, edit, or +reject comments, commits, code, wiki edits, issues, and other contributions +that are not aligned to this Code of Conduct, or to ban temporarily or +permanently any contributor for other behaviors that they deem +inappropriate, threatening, offensive, or harmful. 
+ + +### Scope + +This Code of Conduct applies both within project spaces and in public +spaces when an individual is expressly representing the project or its +community. Examples of representing a project or community include using +an official project e-mail address, posting via an official project social +media account, or acting as an appointed representative at an online or +offline event. Association with a project does not in itself imply +representation of the project. Representation may be further defined and +clarified by project maintainers. + + +### Enforcement + +Instances of abusive, harassing, or otherwise unacceptable behavior may be +reported by contacting the project team at . All +complaints will be reviewed and investigated and will result in a response +that is deemed necessary and appropriate to the circumstances. The project +team is obligated to maintain confidentiality with regard to the reporter +of an incident. Further details of specific enforcement policies may be +posted separately. + +Project maintainers who do not follow or enforce the Code of Conduct in +good faith may face temporary or permanent repercussions as determined by +other members of the project's leadership. + + +### Attribution + +This Code of Conduct is adapted from the [Contributor Covenant][homepage], +version 1.4, available at +. + +[homepage]: https://www.contributor-covenant.org + +For answers to common questions about this code of conduct, see +. diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 0000000..12f4b5a --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,127 @@ +## Contributing to SG-t-SNE-Π + + +We welcome contributions that help improve the SG-t-SNE-Π project! Please read +through the guidelines in this document before reporting an issue or +submitting a request. We will do our best to respond to all issues and +requests, but please bear in mind that it may take us a while. 
+ + + + + +### Contents + + +- [Types of contributions](#contrib-types) +- [Bug reports](#bug-reports) +- [Pull requests](#pull-requests) +- [Code and documentation style](#style) + + + + + +### Types of contributions + + +- Bug reports. +- Compatibility patches (support for different `C++` compilers). +- Minor patches (documentation clarification and typos; + code/documentation formatting; naming convention amendments). +- Functionality updates. +- Testing (testing or demo scripts for existing functionality). +- Re-implementations (performance improvements; system/language support + extensions). +- Anything else, as long as its utility and functionality are described. + + + + + +### Bug reports + + +Please [open a new issue][github-new-issue] for each unreported bug. +Specify "[BUG] *1-sentence-description-of-bug*" as the issue title, and +list the following information in the issue body: + +- Brief summary and background. +- Bug description: what should happen, and what happens instead. +- Version of compiler, operating system, and relevant libraries. +- Code for a concise script that reproduces and illustrates the + bug. +- Any other relevant notes (e.g., what you think causes the bug, any + steps you may have taken to identify or resolve it, etc.). + + +[github-new-issue]: https://help.github.com/articles/creating-an-issue/ + + + + + +### Pull requests + + +Please submit a [pull request][github-pull-request] for each code or +documentation contribution to SG-t-SNE-Π. When submitting a pull request, please +adhere to the following: + +- Clearly identify the [type of your contribution](#contrib-types) in the + title and body of your pull request. + - If your contributions span multiple types, please separate them + into individual pull requests. Minor patches should be lumped into + a single pull request. +- Include a brief description of the rationale, functionality, and + implementation of your contribution. 
+- Include a testing or demo script (named `test_xxx` or + `demo_xxx`) that can be used to illustrate and validate your + contribution. Include a brief description of the script in the pull + request body. +- [Squash partial commits][github-squash-commit]. +- If applicable, draft some relevant text to be added to or amended in + the README. Please include the text in the pull request comments, + *not* as part of the commit. + +We encourage you to open a new issue to discuss any intended contributions +prior to developing or submitting a pull request. + + +[github-pull-request]: https://help.github.com/articles/about-pull-requests/ + +[github-squash-commit]: https://help.github.com/articles/about-pull-request-merges/ + + + + + +### Code and documentation style + + +Please try to follow the style conventions in the SG-t-SNE-Π repository when submitting pull requests. Use [Doxygen][doxygen-documentation] to document functions and scripts. We generally try to observe the following rules: + +- The code should be clear, stable, and efficient. Clarity and stability + take precedence over efficiency and performance. The code should be + self-documented if possible (avoid referring to descriptions in + existing issues or pull requests). +- Function interface documentation should be comprehensive and follow the + format of existing functions (e.g., `sgtsnepi`). +- Function and variable names are in `camelCase`; script names are in + `snake_case`. Typically, matrix/array names start with an uppercase + letter, while scalar/vector/function names start with a lowercase + letter. +- All code blocks should be briefly documented. +- We prefer 2-space indentation (no tabs), operator/operand alignment + across multiple lines, and 80-column line width. 
+ +[doxygen-documentation]: http://www.doxygen.nl + + + + + + + + + diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..f288702 --- /dev/null +++ b/LICENSE @@ -0,0 +1,674 @@ + GNU GENERAL PUBLIC LICENSE + Version 3, 29 June 2007 + + Copyright (C) 2007 Free Software Foundation, Inc. + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The GNU General Public License is a free, copyleft license for +software and other kinds of works. + + The licenses for most software and other practical works are designed +to take away your freedom to share and change the works. By contrast, +the GNU General Public License is intended to guarantee your freedom to +share and change all versions of a program--to make sure it remains free +software for all its users. We, the Free Software Foundation, use the +GNU General Public License for most of our software; it applies also to +any other work released this way by its authors. You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +them if you wish), that you receive source code or can get it if you +want it, that you can change the software or use pieces of it in new +free programs, and that you know you can do these things. + + To protect your rights, we need to prevent others from denying you +these rights or asking you to surrender the rights. Therefore, you have +certain responsibilities if you distribute copies of the software, or if +you modify it: responsibilities to respect the freedom of others. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must pass on to the recipients the same +freedoms that you received. You must make sure that they, too, receive +or can get the source code. 
And you must show them these terms so they +know their rights. + + Developers that use the GNU GPL protect your rights with two steps: +(1) assert copyright on the software, and (2) offer you this License +giving you legal permission to copy, distribute and/or modify it. + + For the developers' and authors' protection, the GPL clearly explains +that there is no warranty for this free software. For both users' and +authors' sake, the GPL requires that modified versions be marked as +changed, so that their problems will not be attributed erroneously to +authors of previous versions. + + Some devices are designed to deny users access to install or run +modified versions of the software inside them, although the manufacturer +can do so. This is fundamentally incompatible with the aim of +protecting users' freedom to change the software. The systematic +pattern of such abuse occurs in the area of products for individuals to +use, which is precisely where it is most unacceptable. Therefore, we +have designed this version of the GPL to prohibit the practice for those +products. If such problems arise substantially in other domains, we +stand ready to extend this provision to those domains in future versions +of the GPL, as needed to protect the freedom of users. + + Finally, every program is threatened constantly by software patents. +States should not allow patents to restrict development and use of +software on general-purpose computers, but in those that do, we wish to +avoid the special danger that patents applied to a free program could +make it effectively proprietary. To prevent this, the GPL assures that +patents cannot be used to render the program non-free. + + The precise terms and conditions for copying, distribution and +modification follow. + + TERMS AND CONDITIONS + + 0. Definitions. + + "This License" refers to version 3 of the GNU General Public License. 
+ + "Copyright" also means copyright-like laws that apply to other kinds of +works, such as semiconductor masks. + + "The Program" refers to any copyrightable work licensed under this +License. Each licensee is addressed as "you". "Licensees" and +"recipients" may be individuals or organizations. + + To "modify" a work means to copy from or adapt all or part of the work +in a fashion requiring copyright permission, other than the making of an +exact copy. The resulting work is called a "modified version" of the +earlier work or a work "based on" the earlier work. + + A "covered work" means either the unmodified Program or a work based +on the Program. + + To "propagate" a work means to do anything with it that, without +permission, would make you directly or secondarily liable for +infringement under applicable copyright law, except executing it on a +computer or modifying a private copy. Propagation includes copying, +distribution (with or without modification), making available to the +public, and in some countries other activities as well. + + To "convey" a work means any kind of propagation that enables other +parties to make or receive copies. Mere interaction with a user through +a computer network, with no transfer of a copy, is not conveying. + + An interactive user interface displays "Appropriate Legal Notices" +to the extent that it includes a convenient and prominently visible +feature that (1) displays an appropriate copyright notice, and (2) +tells the user that there is no warranty for the work (except to the +extent that warranties are provided), that licensees may convey the +work under this License, and how to view a copy of this License. If +the interface presents a list of user commands or options, such as a +menu, a prominent item in the list meets this criterion. + + 1. Source Code. + + The "source code" for a work means the preferred form of the work +for making modifications to it. "Object code" means any non-source +form of a work. 
+ + A "Standard Interface" means an interface that either is an official +standard defined by a recognized standards body, or, in the case of +interfaces specified for a particular programming language, one that +is widely used among developers working in that language. + + The "System Libraries" of an executable work include anything, other +than the work as a whole, that (a) is included in the normal form of +packaging a Major Component, but which is not part of that Major +Component, and (b) serves only to enable use of the work with that +Major Component, or to implement a Standard Interface for which an +implementation is available to the public in source code form. A +"Major Component", in this context, means a major essential component +(kernel, window system, and so on) of the specific operating system +(if any) on which the executable work runs, or a compiler used to +produce the work, or an object code interpreter used to run it. + + The "Corresponding Source" for a work in object code form means all +the source code needed to generate, install, and (for an executable +work) run the object code and to modify the work, including scripts to +control those activities. However, it does not include the work's +System Libraries, or general-purpose tools or generally available free +programs which are used unmodified in performing those activities but +which are not part of the work. For example, Corresponding Source +includes interface definition files associated with source files for +the work, and the source code for shared libraries and dynamically +linked subprograms that the work is specifically designed to require, +such as by intimate data communication or control flow between those +subprograms and other parts of the work. + + The Corresponding Source need not include anything that users +can regenerate automatically from other parts of the Corresponding +Source. + + The Corresponding Source for a work in source code form is that +same work. + + 2. 
Basic Permissions. + + All rights granted under this License are granted for the term of +copyright on the Program, and are irrevocable provided the stated +conditions are met. This License explicitly affirms your unlimited +permission to run the unmodified Program. The output from running a +covered work is covered by this License only if the output, given its +content, constitutes a covered work. This License acknowledges your +rights of fair use or other equivalent, as provided by copyright law. + + You may make, run and propagate covered works that you do not +convey, without conditions so long as your license otherwise remains +in force. You may convey covered works to others for the sole purpose +of having them make modifications exclusively for you, or provide you +with facilities for running those works, provided that you comply with +the terms of this License in conveying all material for which you do +not control copyright. Those thus making or running the covered works +for you must do so exclusively on your behalf, under your direction +and control, on terms that prohibit them from making any copies of +your copyrighted material outside their relationship with you. + + Conveying under any other circumstances is permitted solely under +the conditions stated below. Sublicensing is not allowed; section 10 +makes it unnecessary. + + 3. Protecting Users' Legal Rights From Anti-Circumvention Law. + + No covered work shall be deemed part of an effective technological +measure under any applicable law fulfilling obligations under article +11 of the WIPO copyright treaty adopted on 20 December 1996, or +similar laws prohibiting or restricting circumvention of such +measures. 
+ + When you convey a covered work, you waive any legal power to forbid +circumvention of technological measures to the extent such circumvention +is effected by exercising rights under this License with respect to +the covered work, and you disclaim any intention to limit operation or +modification of the work as a means of enforcing, against the work's +users, your or third parties' legal rights to forbid circumvention of +technological measures. + + 4. Conveying Verbatim Copies. + + You may convey verbatim copies of the Program's source code as you +receive it, in any medium, provided that you conspicuously and +appropriately publish on each copy an appropriate copyright notice; +keep intact all notices stating that this License and any +non-permissive terms added in accord with section 7 apply to the code; +keep intact all notices of the absence of any warranty; and give all +recipients a copy of this License along with the Program. + + You may charge any price or no price for each copy that you convey, +and you may offer support or warranty protection for a fee. + + 5. Conveying Modified Source Versions. + + You may convey a work based on the Program, or the modifications to +produce it from the Program, in the form of source code under the +terms of section 4, provided that you also meet all of these conditions: + + a) The work must carry prominent notices stating that you modified + it, and giving a relevant date. + + b) The work must carry prominent notices stating that it is + released under this License and any conditions added under section + 7. This requirement modifies the requirement in section 4 to + "keep intact all notices". + + c) You must license the entire work, as a whole, under this + License to anyone who comes into possession of a copy. This + License will therefore apply, along with any applicable section 7 + additional terms, to the whole of the work, and all its parts, + regardless of how they are packaged. 
This License gives no + permission to license the work in any other way, but it does not + invalidate such permission if you have separately received it. + + d) If the work has interactive user interfaces, each must display + Appropriate Legal Notices; however, if the Program has interactive + interfaces that do not display Appropriate Legal Notices, your + work need not make them do so. + + A compilation of a covered work with other separate and independent +works, which are not by their nature extensions of the covered work, +and which are not combined with it such as to form a larger program, +in or on a volume of a storage or distribution medium, is called an +"aggregate" if the compilation and its resulting copyright are not +used to limit the access or legal rights of the compilation's users +beyond what the individual works permit. Inclusion of a covered work +in an aggregate does not cause this License to apply to the other +parts of the aggregate. + + 6. Conveying Non-Source Forms. + + You may convey a covered work in object code form under the terms +of sections 4 and 5, provided that you also convey the +machine-readable Corresponding Source under the terms of this License, +in one of these ways: + + a) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by the + Corresponding Source fixed on a durable physical medium + customarily used for software interchange. 
+ + b) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by a + written offer, valid for at least three years and valid for as + long as you offer spare parts or customer support for that product + model, to give anyone who possesses the object code either (1) a + copy of the Corresponding Source for all the software in the + product that is covered by this License, on a durable physical + medium customarily used for software interchange, for a price no + more than your reasonable cost of physically performing this + conveying of source, or (2) access to copy the + Corresponding Source from a network server at no charge. + + c) Convey individual copies of the object code with a copy of the + written offer to provide the Corresponding Source. This + alternative is allowed only occasionally and noncommercially, and + only if you received the object code with such an offer, in accord + with subsection 6b. + + d) Convey the object code by offering access from a designated + place (gratis or for a charge), and offer equivalent access to the + Corresponding Source in the same way through the same place at no + further charge. You need not require recipients to copy the + Corresponding Source along with the object code. If the place to + copy the object code is a network server, the Corresponding Source + may be on a different server (operated by you or a third party) + that supports equivalent copying facilities, provided you maintain + clear directions next to the object code saying where to find the + Corresponding Source. Regardless of what server hosts the + Corresponding Source, you remain obligated to ensure that it is + available for as long as needed to satisfy these requirements. 
+ + e) Convey the object code using peer-to-peer transmission, provided + you inform other peers where the object code and Corresponding + Source of the work are being offered to the general public at no + charge under subsection 6d. + + A separable portion of the object code, whose source code is excluded +from the Corresponding Source as a System Library, need not be +included in conveying the object code work. + + A "User Product" is either (1) a "consumer product", which means any +tangible personal property which is normally used for personal, family, +or household purposes, or (2) anything designed or sold for incorporation +into a dwelling. In determining whether a product is a consumer product, +doubtful cases shall be resolved in favor of coverage. For a particular +product received by a particular user, "normally used" refers to a +typical or common use of that class of product, regardless of the status +of the particular user or of the way in which the particular user +actually uses, or expects or is expected to use, the product. A product +is a consumer product regardless of whether the product has substantial +commercial, industrial or non-consumer uses, unless such uses represent +the only significant mode of use of the product. + + "Installation Information" for a User Product means any methods, +procedures, authorization keys, or other information required to install +and execute modified versions of a covered work in that User Product from +a modified version of its Corresponding Source. The information must +suffice to ensure that the continued functioning of the modified object +code is in no case prevented or interfered with solely because +modification has been made. 
+ + If you convey an object code work under this section in, or with, or +specifically for use in, a User Product, and the conveying occurs as +part of a transaction in which the right of possession and use of the +User Product is transferred to the recipient in perpetuity or for a +fixed term (regardless of how the transaction is characterized), the +Corresponding Source conveyed under this section must be accompanied +by the Installation Information. But this requirement does not apply +if neither you nor any third party retains the ability to install +modified object code on the User Product (for example, the work has +been installed in ROM). + + The requirement to provide Installation Information does not include a +requirement to continue to provide support service, warranty, or updates +for a work that has been modified or installed by the recipient, or for +the User Product in which it has been modified or installed. Access to a +network may be denied when the modification itself materially and +adversely affects the operation of the network or violates the rules and +protocols for communication across the network. + + Corresponding Source conveyed, and Installation Information provided, +in accord with this section must be in a format that is publicly +documented (and with an implementation available to the public in +source code form), and must require no special password or key for +unpacking, reading or copying. + + 7. Additional Terms. + + "Additional permissions" are terms that supplement the terms of this +License by making exceptions from one or more of its conditions. +Additional permissions that are applicable to the entire Program shall +be treated as though they were included in this License, to the extent +that they are valid under applicable law. 
If additional permissions +apply only to part of the Program, that part may be used separately +under those permissions, but the entire Program remains governed by +this License without regard to the additional permissions. + + When you convey a copy of a covered work, you may at your option +remove any additional permissions from that copy, or from any part of +it. (Additional permissions may be written to require their own +removal in certain cases when you modify the work.) You may place +additional permissions on material, added by you to a covered work, +for which you have or can give appropriate copyright permission. + + Notwithstanding any other provision of this License, for material you +add to a covered work, you may (if authorized by the copyright holders of +that material) supplement the terms of this License with terms: + + a) Disclaiming warranty or limiting liability differently from the + terms of sections 15 and 16 of this License; or + + b) Requiring preservation of specified reasonable legal notices or + author attributions in that material or in the Appropriate Legal + Notices displayed by works containing it; or + + c) Prohibiting misrepresentation of the origin of that material, or + requiring that modified versions of such material be marked in + reasonable ways as different from the original version; or + + d) Limiting the use for publicity purposes of names of licensors or + authors of the material; or + + e) Declining to grant rights under trademark law for use of some + trade names, trademarks, or service marks; or + + f) Requiring indemnification of licensors and authors of that + material by anyone who conveys the material (or modified versions of + it) with contractual assumptions of liability to the recipient, for + any liability that these contractual assumptions directly impose on + those licensors and authors. + + All other non-permissive additional terms are considered "further +restrictions" within the meaning of section 10. 
If the Program as you +received it, or any part of it, contains a notice stating that it is +governed by this License along with a term that is a further +restriction, you may remove that term. If a license document contains +a further restriction but permits relicensing or conveying under this +License, you may add to a covered work material governed by the terms +of that license document, provided that the further restriction does +not survive such relicensing or conveying. + + If you add terms to a covered work in accord with this section, you +must place, in the relevant source files, a statement of the +additional terms that apply to those files, or a notice indicating +where to find the applicable terms. + + Additional terms, permissive or non-permissive, may be stated in the +form of a separately written license, or stated as exceptions; +the above requirements apply either way. + + 8. Termination. + + You may not propagate or modify a covered work except as expressly +provided under this License. Any attempt otherwise to propagate or +modify it is void, and will automatically terminate your rights under +this License (including any patent licenses granted under the third +paragraph of section 11). + + However, if you cease all violation of this License, then your +license from a particular copyright holder is reinstated (a) +provisionally, unless and until the copyright holder explicitly and +finally terminates your license, and (b) permanently, if the copyright +holder fails to notify you of the violation by some reasonable means +prior to 60 days after the cessation. + + Moreover, your license from a particular copyright holder is +reinstated permanently if the copyright holder notifies you of the +violation by some reasonable means, this is the first time you have +received notice of violation of this License (for any work) from that +copyright holder, and you cure the violation prior to 30 days after +your receipt of the notice. 
+ + Termination of your rights under this section does not terminate the +licenses of parties who have received copies or rights from you under +this License. If your rights have been terminated and not permanently +reinstated, you do not qualify to receive new licenses for the same +material under section 10. + + 9. Acceptance Not Required for Having Copies. + + You are not required to accept this License in order to receive or +run a copy of the Program. Ancillary propagation of a covered work +occurring solely as a consequence of using peer-to-peer transmission +to receive a copy likewise does not require acceptance. However, +nothing other than this License grants you permission to propagate or +modify any covered work. These actions infringe copyright if you do +not accept this License. Therefore, by modifying or propagating a +covered work, you indicate your acceptance of this License to do so. + + 10. Automatic Licensing of Downstream Recipients. + + Each time you convey a covered work, the recipient automatically +receives a license from the original licensors, to run, modify and +propagate that work, subject to this License. You are not responsible +for enforcing compliance by third parties with this License. + + An "entity transaction" is a transaction transferring control of an +organization, or substantially all assets of one, or subdividing an +organization, or merging organizations. If propagation of a covered +work results from an entity transaction, each party to that +transaction who receives a copy of the work also receives whatever +licenses to the work the party's predecessor in interest had or could +give under the previous paragraph, plus a right to possession of the +Corresponding Source of the work from the predecessor in interest, if +the predecessor has it or can get it with reasonable efforts. + + You may not impose any further restrictions on the exercise of the +rights granted or affirmed under this License. 
For example, you may +not impose a license fee, royalty, or other charge for exercise of +rights granted under this License, and you may not initiate litigation +(including a cross-claim or counterclaim in a lawsuit) alleging that +any patent claim is infringed by making, using, selling, offering for +sale, or importing the Program or any portion of it. + + 11. Patents. + + A "contributor" is a copyright holder who authorizes use under this +License of the Program or a work on which the Program is based. The +work thus licensed is called the contributor's "contributor version". + + A contributor's "essential patent claims" are all patent claims +owned or controlled by the contributor, whether already acquired or +hereafter acquired, that would be infringed by some manner, permitted +by this License, of making, using, or selling its contributor version, +but do not include claims that would be infringed only as a +consequence of further modification of the contributor version. For +purposes of this definition, "control" includes the right to grant +patent sublicenses in a manner consistent with the requirements of +this License. + + Each contributor grants you a non-exclusive, worldwide, royalty-free +patent license under the contributor's essential patent claims, to +make, use, sell, offer for sale, import and otherwise run, modify and +propagate the contents of its contributor version. + + In the following three paragraphs, a "patent license" is any express +agreement or commitment, however denominated, not to enforce a patent +(such as an express permission to practice a patent or covenant not to +sue for patent infringement). To "grant" such a patent license to a +party means to make such an agreement or commitment not to enforce a +patent against the party. 
+ + If you convey a covered work, knowingly relying on a patent license, +and the Corresponding Source of the work is not available for anyone +to copy, free of charge and under the terms of this License, through a +publicly available network server or other readily accessible means, +then you must either (1) cause the Corresponding Source to be so +available, or (2) arrange to deprive yourself of the benefit of the +patent license for this particular work, or (3) arrange, in a manner +consistent with the requirements of this License, to extend the patent +license to downstream recipients. "Knowingly relying" means you have +actual knowledge that, but for the patent license, your conveying the +covered work in a country, or your recipient's use of the covered work +in a country, would infringe one or more identifiable patents in that +country that you have reason to believe are valid. + + If, pursuant to or in connection with a single transaction or +arrangement, you convey, or propagate by procuring conveyance of, a +covered work, and grant a patent license to some of the parties +receiving the covered work authorizing them to use, propagate, modify +or convey a specific copy of the covered work, then the patent license +you grant is automatically extended to all recipients of the covered +work and works based on it. + + A patent license is "discriminatory" if it does not include within +the scope of its coverage, prohibits the exercise of, or is +conditioned on the non-exercise of one or more of the rights that are +specifically granted under this License. 
You may not convey a covered +work if you are a party to an arrangement with a third party that is +in the business of distributing software, under which you make payment +to the third party based on the extent of your activity of conveying +the work, and under which the third party grants, to any of the +parties who would receive the covered work from you, a discriminatory +patent license (a) in connection with copies of the covered work +conveyed by you (or copies made from those copies), or (b) primarily +for and in connection with specific products or compilations that +contain the covered work, unless you entered into that arrangement, +or that patent license was granted, prior to 28 March 2007. + + Nothing in this License shall be construed as excluding or limiting +any implied license or other defenses to infringement that may +otherwise be available to you under applicable patent law. + + 12. No Surrender of Others' Freedom. + + If conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot convey a +covered work so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you may +not convey it at all. For example, if you agree to terms that obligate you +to collect a royalty for further conveying from those to whom you convey +the Program, the only way you could satisfy both those terms and this +License would be to refrain entirely from conveying the Program. + + 13. Use with the GNU Affero General Public License. + + Notwithstanding any other provision of this License, you have +permission to link or combine any covered work with a work licensed +under version 3 of the GNU Affero General Public License into a single +combined work, and to convey the resulting work. 
The terms of this +License will continue to apply to the part which is the covered work, +but the special requirements of the GNU Affero General Public License, +section 13, concerning interaction through a network will apply to the +combination as such. + + 14. Revised Versions of this License. + + The Free Software Foundation may publish revised and/or new versions of +the GNU General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + + Each version is given a distinguishing version number. If the +Program specifies that a certain numbered version of the GNU General +Public License "or any later version" applies to it, you have the +option of following the terms and conditions either of that numbered +version or of any later version published by the Free Software +Foundation. If the Program does not specify a version number of the +GNU General Public License, you may choose any version ever published +by the Free Software Foundation. + + If the Program specifies that a proxy can decide which future +versions of the GNU General Public License can be used, that proxy's +public statement of acceptance of a version permanently authorizes you +to choose that version for the Program. + + Later license versions may give you additional or different +permissions. However, no additional obligations are imposed on any +author or copyright holder as a result of your choosing to follow a +later version. + + 15. Disclaimer of Warranty. + + THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY +APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT +HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY +OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, +THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. 
THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM +IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF +ALL NECESSARY SERVICING, REPAIR OR CORRECTION. + + 16. Limitation of Liability. + + IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS +THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY +GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE +USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF +DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD +PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), +EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF +SUCH DAMAGES. + + 17. Interpretation of Sections 15 and 16. + + If the disclaimer of warranty and limitation of liability provided +above cannot be given local legal effect according to their terms, +reviewing courts shall apply local law that most closely approximates +an absolute waiver of all civil liability in connection with the +Program, unless a warranty or assumption of liability accompanies a +copy of the Program in return for a fee. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +state the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. 
+ + + Copyright (C) + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see . + +Also add information on how to contact you by electronic and paper mail. + + If the program does terminal interaction, make it output a short +notice like this when it starts in an interactive mode: + + Copyright (C) + This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, your program's commands +might be different; for a GUI interface, you would use an "about box". + + You should also get your employer (if you work as a programmer) or school, +if any, to sign a "copyright disclaimer" for the program, if necessary. +For more information on this, and how to apply and follow the GNU GPL, see +. + + The GNU General Public License does not permit incorporating your program +into proprietary programs. If your program is a subroutine library, you +may consider it more useful to permit linking proprietary applications with +the library. If this is what you want to do, use the GNU Lesser General +Public License instead of this License. But first, please read +. 
diff --git a/Makefile.in b/Makefile.in new file mode 100644 index 0000000..10082ec --- /dev/null +++ b/Makefile.in @@ -0,0 +1,224 @@ +# #################################################################### +# +# C/C++ Makefile +# +# Author: Dimitris Floros +# +# #################################################################### + +# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% ENVIROMENT + +# set SHELL to bash +SHELL := /bin/bash + +MACHINESPEC = -mtune=native + +MATLABROOT = @MATLABROOT@ +INTELROOT = /opt/intel/ + + +# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% COMMANDS - FLAGS + +MV = mv +CP = cp + +CXX = @CXX@ +CFLAGS += @CXXFLAGS@ -fopenmp -fcilkplus +LIBS = @LIBS@ + +MEX = @MEX@ +MEXFLAGS = @MEXFLAGS@ + +BUILDMEX = @ENABLE_MATLAB@ + +# # get OS name-type (OSX or linux setup) +OSNAME := $(shell uname) + + +ifeq ($(OSNAME),Darwin) # OS X + + # package manager + PKGMANAGER = port + DEPENDENCIES = flann tbb metis fftw-3 + + # architectures + ARCH = maci64 + + # MEX extension + MEXEXT = mexmaci64 + + # (nothing in OS X) + MEXRPATH = + + # MEX symbol map + MEXSYM = -bundle -Wl,-exported_symbols_list,$(MATLABROOT)/extern/lib/$(ARCH)/mexFunction.map + +else # linux + + # package manager + PKGMANAGER = apt-get + DEPENDENCIES = libtbb-dev libflann-dev libmetis-dev libfftw3-dev + + MEX = $(CXX) + MEXFLAGS= $(CFLAGS) + + # architectures + ARCH = glnxa64 + + # MEX extension + MEXEXT = mexa64 + + # relative paths for linux + MEXRPATH = -Wl,-rpath=$(MATLABROOT)/bin/$(ARCH) + MEXRPATH += -Wl,-rpath=$(INTELROOT)/linux/lib/intel64_lin + + # MEX symbol map + MEXSYM = -shared -Wl,--version-script,$(MATLABROOT)/extern/lib/$(ARCH)/mexFunction.map +endif + +# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% COMPILATION INCLUDES/LIBRARIES + +# MATLAB linking +MEXINC = -I$(MATLABROOT)/extern/include +MEXLIB = -L$(MATLABROOT)/bin/$(ARCH) +MEXLIB += -fno-common $(MEXSYM) +MEXLIB += -lmx -lmex -lmat +LIBS += -lcilkrts + +ifeq ($(CXX), icpc) + LIBS += -lirc -limf -lsvml + CFLAGS += -wd3947,3946,10006,3950 +endif + +# 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% SOURCES / DIRECTORIES + +SRCS = sgtsne.cpp sparsematrix.cpp utils.cpp \ + gradient_descend.cpp csb_wrapper.cpp \ + qq.cpp nuconv.cpp graph_rescaling.cpp \ + dataReloc.cpp timers.cpp pq.cpp + +MEXS = sgtsnepi.$(MEXEXT) perplexityEqualize.$(MEXEXT) + +DEMOS = demo_perplexity_equalization demo_stochastic_matrix + +MEXS := $(addprefix matlab/, $(MEXS) ) +OBJS := $(addprefix build/, $(SRCS:.cpp=.o) ) +DEMOS := $(addprefix bin/, $(DEMOS) ) + +# update SRCS for dependencies +SRCS += $(MEXS:.$(MEXEXT)=.cpp) knn.cpp test_modules.cpp + +DEPDIR = build/.d + +# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% DEFINE TARGET RULES + +# default "make" target +.DEFAULT_GOAL = all + +# ==================== Target rules + +dependencies: ## Build dependencies + $(PKGMANAGER) install $(DEPENDENCIES) + +ifeq ($(BUILDMEX),yes) +all: sgtsnepi demos test matlab ## Build library, demos, and MEX wrappers +else +all: sgtsnepi demos test +endif + +sgtsnepi: lib/libsgtsnepi.a ## Build static library + +tsnepi: bin/tsnepi + +demos: $(DEMOS) ## Build demo scripts + +test: bin/test_modules ## Build and run testing scripts + +matlab: $(MEXS) ## Build MEX wrappers (MATLAB required) + +lib/libsgtsnepi.a: $(OBJS) + ar rcs $@ $(OBJS) + +# ==================== Documentation + +documentation: ## Build doxygen documentation + doxygen docs/doxygen.config + +# ==================== Demo scripts + +bin/demo_stochastic_matrix: build/demo_stochastic_matrix.o lib/libsgtsnepi.a ## Stochastic matrix | λ rescale + $(LINK.o) $+ $(OUTPUT_OPTION) $(LIBS) + +bin/tsnepi: build/tsnepi.o lib/libsgtsnepi.a ## Stochastic matrix | λ rescale + $(LINK.o) $+ $(OUTPUT_OPTION) $(LIBS) + +bin/demo_perplexity_equalization: ## Conventional t-SNE | Perplexity equalize +bin/demo_perplexity_equalization: build/demo_perplexity_equalization.o lib/libsgtsnepi.a + $(LINK.o) $+ $(OUTPUT_OPTION) $(LIBS) + +bin/test_modules: build/test_modules.o lib/libsgtsnepi.a + $(LINK.o) $+ $(OUTPUT_OPTION) $(LIBS) + +matlab/%.$(MEXEXT): 
build/%_mex.o lib/libsgtsnepi.a + $(LINKMEX.o) $+ $(OUTPUT_OPTION) $(LIBS) $(MEXLIB) + + +# ==================== Miscellaneous + +clean: ## Clean-up intermediate outputs + $(RM) build/*.o src/*~ + $(RM) build/.d/* + +cleandocs: ## Remove documentation outputs + $(RM) -r docs/html + +purge: clean ## Remove library and executables + $(RM) $(MEXS) + $(RM) $(DEMOS) + $(RM) bin/test_modules + $(RM) bin/tsnepi + $(RM) lib/libsgtsnepi.a + +help: ## Echo Makefile documentation + @echo + @grep -E '(^[a-zA-Z_-]+:.*?##.*$$)|(^# ====================)' $(firstword $(MAKEFILE_LIST)) | awk 'BEGIN {FS = ":.*?## "}{printf "\033[32m %-35s\033[0m %s\n", $$1, $$2}' | sed -e 's/\[32m # ====================/[33m===============/' + @echo "" + @echo -e "\033[033m*** DEFAULT:\033[0m \033[032m$(.DEFAULT_GOAL)\033[0m" + @echo "" + + +# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% COMPILATION RULES (DO NOT CHANGE) + +$(shell mkdir -p build >/dev/null) +$(shell mkdir -p $(DEPDIR) >/dev/null) +$(shell mkdir -p bin >/dev/null) +$(shell mkdir -p lib >/dev/null) + +DEPFLAGS = -MT $@ -MMD -MP -MF $(DEPDIR)/$(notdir $*.Td) + +LINKMEX.o = $(MEX) $(MEXFLAGS) $(LIBPATH) $(TARGET_ARCH) +COMPILEMEX.cc = $(CXX) $(DEPFLAGS) $(CFLAGS) $(MEXINC) $(MEXRPATH) $(TARGET_ARCH) $(INCPATH) -c +LINK.o = $(CXX) $(CFLAGS) $(LIBPATH) $(TARGET_ARCH) +COMPILE.cc = $(CXX) $(DEPFLAGS) $(CFLAGS) $(TARGET_ARCH) $(INCPATH) -c +POSTCOMPILE = @mv -f $(DEPDIR)/$(notdir $*.Td) $(DEPDIR)/$(notdir $*.d) && touch $@ + +build/%_mex.o : src/%_mex.cpp +build/%_mex.o : src/%_mex.cpp $(DEPDIR)/%.d + $(COMPILEMEX.cc) $(OUTPUT_OPTION) $< + $(POSTCOMPILE) + +build/%.o : src/%.cpp +build/%.o : src/%.cpp $(DEPDIR)/%.d + $(COMPILE.cc) $(OUTPUT_OPTION) $< + $(POSTCOMPILE) + +build/%.o : csb/%.cpp +build/%.o : csb/%.cpp $(DEPDIR)/%.d + $(COMPILE.cc) $(OUTPUT_OPTION) $< -DALIGN=64 + $(POSTCOMPILE) + +$(DEPDIR)/%.d: ; +.PRECIOUS: $(DEPDIR)/%.d + +include $(wildcard $(patsubst %,$(DEPDIR)/%.d,$(basename $(SRCS)))) + diff --git a/README.md b/README.md new file 
mode 100644 index 0000000..dff09ee --- /dev/null +++ b/README.md @@ -0,0 +1,408 @@ +# SG-t-SNE-Π
Swift Neighbor Embedding of Sparse Stochastic Graphs + +[![GitHub license](https://img.shields.io/github/license/fcdimitr/sgtsnepi.svg)](https://github.com/fcdimitr/sgtsnepi/blob/master/LICENSE) +[![GitHub issues](https://img.shields.io/github/issues/fcdimitr/sgtsnepi.svg)](https://github.com/fcdimitr/sgtsnepi/issues/) + +- [Overview](#overview) + - [Precursor algorithms](#precursor-algorithms) + - [Approximation of the gradient](#approximation-of-the-gradient) + - [SG-t-SNE-Π](#sg-t-sne-π) + - [Accelerated accumulation of attractive + interactions](#accelerated-accumulation-of-attractive-interactions) + - [Accelerated accumulation of repulsive + interactions](#accelerated-accumulation-of-repulsive-interactions) + - [Rapid intra-term and inter-term data + relocations](#rapid-intra-term-and-inter-term-data-relocations) + - [Supplementary material](#supplementary-material) +- [References](#references) +- [Getting started](#getting-started) + - [System environment](#system-environment) + - [Prerequisites](#prerequisites) + - [Installation](#installation) + - [Basic instructions](#basic-instructions) + - [Support of the conventional t-SNE](#support-of-the-conventional-t-sne) + - [MATLAB interface](#matlab-interface) + - [Usage demo](#usage-demo) +- [License and community guidelines](#license-and-community-guidelines) +- [Contributors](#contributors) +- [Acknowledgements](#acknowledgements) + +## Overview + +We introduce SG-t-SNE-, a high-performance software for swift +embedding of a large, sparse, stochastic graph + into a -dimensional space +() on a shared-memory computer. The algorithm SG-t-SNE and the +software t-SNE- were first described in Reference [[1](#Pitsianis2019)]. +The algorithm is built upon precursors for embedding a -nearest +neighbor (NN) graph, which is distance-based and regular with +constant degree . In practice, the precursor algorithms are also +limited up to 2D embedding or suffer from overly long latency in 3D +embedding. 
SG-t-SNE removes the algorithmic restrictions and enables +-dimensional embedding of arbitrary stochastic graphs, including, but +not restricted to, NN graphs. SG-t-SNE- expedites the +computation with high-performance functions and materializes 3D +embedding in shorter time than 2D embedding with any precursor algorithm +on modern laptop/desktop computers. + +### Precursor algorithms + +The original t-SNE [[2](#Maaten2008)] has given rise to several variants. Two +of the variants, [t-SNE-BH](https://lvdmaaten.github.io/tsne/) [[3](#VanDerMaaten2014)] and +[FIt-SNE](https://github.com/KlugerLab/FIt-SNE) [[4](#Linderman2019)], are distinctive and representative in their +approximation approaches to reducing algorithmic complexity. They are, +however, limited to NN graph embedding. Specifically, at the user +interface, a set of data points, +, is provided in terms of their +feature vectors in an -dimensional vector space +equipped with a metric/distance function. The input parameters include + for the embedding dimension, for the number of near-neighbors, +and for the perplexity. A t-SNE algorithm maps the data points + to data points in a +-dimensional space. + +There are two basic algorithmic stages in a conventional t-SNE +algorithm. In the preprocessing stage, the NN graph is generated from +the feature vectors according to the metric function +and input parameter . Each data point is associated with a graph +vertex. Next, the NN graph is cast into a stochastic one, +, and symmetrized to +, + +

+ + where + is the binary-valued adjacency matrix of the +NN graph, with zero diagonal elements (i.e., the graph has no +self-loops), and is the distance between and +. The Gaussian parameters are determined by the +point-wise equations related to the same perplexity value , + + +

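The perplexity condition above is usually solved one point at a time. The following is a minimal sketch, not the library's implementation: it finds the Gaussian precision for a single point by bisection, assuming the common t-SNE convention that the Shannon entropy (natural logarithm) of the point's conditional distribution equals the logarithm of the perplexity. The function names `conditional_probabilities`, `entropy`, and `equalize_perplexity` are hypothetical and chosen for illustration only.

```python
import math


def conditional_probabilities(sq_dists, beta):
    """Gaussian conditional probabilities of one point's neighbors.

    sq_dists: squared distances from the point to its k nearest neighbors.
    beta:     Gaussian precision, beta = 1 / (2 * sigma**2).
    """
    d_min = min(sq_dists)  # shift exponents for numerical stability
    weights = [math.exp(-beta * (d - d_min)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in nats; zero probabilities contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def equalize_perplexity(sq_dists, perplexity, tol=1e-10, max_iter=200):
    """Bisection on beta so that exp(entropy) matches the perplexity."""
    if not 1.0 < perplexity < len(sq_dists):
        raise ValueError("perplexity must lie strictly between 1 and k")
    target = math.log(perplexity)

    # Entropy decreases as beta grows; bracket the root in [lo, hi].
    lo, hi = 0.0, 1.0
    for _ in range(100):
        if entropy(conditional_probabilities(sq_dists, hi)) <= target:
            break
        lo, hi = hi, 2.0 * hi

    beta = 0.5 * (lo + hi)
    for _ in range(max_iter):
        beta = 0.5 * (lo + hi)
        h = entropy(conditional_probabilities(sq_dists, beta))
        if abs(h - target) < tol:
            break
        if h > target:
            lo = beta  # distribution too flat: sharpen it
        else:
            hi = beta  # distribution too peaked: flatten it
    return beta, conditional_probabilities(sq_dists, beta)
```

Bisection is used here only because it is simple and robust; the sketch assumes the neighbor distances are not all tied, so that the entropy can actually reach the target.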
+ + + +The next stage is to determine and locate the embedding coordinates + by minimizing the +Kullback-Leibler divergence + +
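$$\mathrm{KL}(\mathbf{P} \,\|\, \mathbf{Q}) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \tag{3}$$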
+ + +where matrix is made of the +ensemble regulated by the Student t-distribution, + + +
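$$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}, \quad i \neq j \tag{4}$$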
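As an illustration of the objective just described, here is a naive, dense sketch of the Student t similarities and the Kullback-Leibler divergence in plain Python. It is written under the assumption that the input matrix has a zero diagonal and sums to one; the names `student_t_similarities` and `kl_divergence` are hypothetical, and the quadratic-time loops are for clarity only, not a reflection of how the library computes these quantities.

```python
import math


def student_t_similarities(Y):
    """Return (Q, Z) for embedding points Y (a list of coordinate lists).

    Q[i][j] is the normalized Student t similarity of points i and j,
    and Z is the normalization constant (sum of all kernel values).
    """
    n = len(Y)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
            W[i][j] = W[j][i] = 1.0 / (1.0 + sq_dist)
    Z = sum(sum(row) for row in W)
    Q = [[w / Z for w in row] for row in W]
    return Q, Z


def kl_divergence(P, Q):
    """Sum of p * log(p / q) over the nonzero entries of P."""
    total = 0.0
    for row_p, row_q in zip(P, Q):
        for p, q in zip(row_p, row_q):
            if p > 0.0:
                total += p * math.log(p / q)
    return total
```

Because only the nonzero entries of the input matrix contribute to the sum, a sparse input keeps the divergence itself cheap to evaluate; the normalization constant is the part that couples all pairs of embedding points.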
+ + +In other words, the objective of +(3) is to find the optimal +stochastic matching between and defined, +respectively, over the feature vector set and the embedding +coordinate set . The optimal matching is obtained numerically +by applying the gradient descent method. A main difference among the +precursor algorithms lies in how the gradient of the objective function +is computed. + +### Approximation of the gradient + +The computation per iteration step is dominated by the calculation of +the gradient. Van der Maaten reformulated the gradient into two +terms [[3](#VanDerMaaten2014)]: + + +
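$$\frac{\partial\,\mathrm{KL}}{\partial y_i} = \underbrace{4 \sum_{j \neq i} p_{ij}\, q_{ij} Z\, (y_i - y_j)}_{\text{attractive term}} \;-\; \underbrace{4 \sum_{j \neq i} q_{ij}^2 Z\, (y_i - y_j)}_{\text{repulsive term}}, \qquad Z = \sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1} \tag{5}$$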
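The two-term split of the gradient can be sketched directly in plain Python. This is an illustrative, unaccelerated version under stated assumptions: the input matrix is symmetric, sums to one, and is stored as a dictionary of nonzeros; the names `gradient_terms` and `kl_objective` are hypothetical. The attractive loop visits only the nonzeros of the sparse matrix, while the repulsive loop visits every pair of points, which is exactly the quadratic cost that the approximation schemes discussed in this section avoid.

```python
import math


def _weights(Y):
    """Unnormalized Student t kernel values w_ij and their sum Z."""
    n = len(Y)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
            W[i][j] = W[j][i] = 1.0 / (1.0 + sq_dist)
    return W, sum(sum(row) for row in W)


def gradient_terms(P_sparse, Y):
    """Attractive and repulsive parts of the KL gradient.

    P_sparse: dict mapping (i, j) to p_ij; assumed symmetric, summing to 1.
    Y:        list of embedding points (lists of coordinates).
    """
    n, dim = len(Y), len(Y[0])
    W, Z = _weights(Y)
    attractive = [[0.0] * dim for _ in range(n)]
    repulsive = [[0.0] * dim for _ in range(n)]

    # Attractive term: touches only the nonzeros of P.
    for (i, j), p in P_sparse.items():
        for c in range(dim):
            attractive[i][c] += 4.0 * p * W[i][j] * (Y[i][c] - Y[j][c])

    # Repulsive term: touches every pair of points, O(n^2) in this sketch.
    for i in range(n):
        for j in range(n):
            if i != j:
                coeff = 4.0 * W[i][j] ** 2 / Z
                for c in range(dim):
                    repulsive[i][c] -= coeff * (Y[i][c] - Y[j][c])
    return attractive, repulsive


def kl_objective(P_sparse, Y):
    """KL divergence for a sparse P, using q_ij = w_ij / Z."""
    W, Z = _weights(Y)
    return sum(p * math.log(p * Z / W[i][j])
               for (i, j), p in P_sparse.items())
```

The helper `kl_objective` is included only so that the sum of the two terms can be compared against a finite-difference derivative of the objective when experimenting with the sketch.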
+ + +The attractive interaction term can be cast as the sum of +matrix-vector products with the sparse matrix +. The vectors are composed of the +embedding coordinates, one in each dimension. The repulsive interaction +term can be cast as the sum of matrix-vector products with the +dense matrix . For clarity, we simply +refer to the two terms as the term and the +term, respectively. + +The (repulsion) term is in fact a broad-support, dense +convolution with the Student t-distribution kernel on non-equispaced, +scattered data points. As the matrix is dense, a naive method for +calculating the term takes arithmetic operations. The quadratic +complexity limits the practical use of t-SNE to small graphs. Two types +of existing approaches reduce the quadratic complexity to , +they are typified by t-SNE-BH and FIt-SNE. The algorithm t-SNE-BH, +introduced by van der Maaten [[3](#VanDerMaaten2014)], is based on the +Barnes-Hut algorithm. The broad-support convolution is factored into + convolutions of narrow support, at multiple spatial levels, +each narrowly supported algorithm takes operations. FIt-SNE, +presented by Linderman et al. [[4](#Linderman2019)], may be viewed as based +on non-uniform fast Fourier transforms. The execution time of each +approximate algorithm becomes dominated by the +(attraction) term computation. The execution time also faces a steep +rise from 2D to 3D embedding. + +### SG-t-SNE-Π + +With the algorithm SG-t-SNE we extend the use of t-SNE to any sparse +stochastic graph . The key input +is the stochastic matrix , +, associated with the graph, where is not +restricted to the form of +(1). +We introduce a parametrized, non-linear rescaling mechanism to explore +the graph sparsity. We determine rescaling parameters by + + +

+ + +where is an input parameter and is a monotonically +increasing function. We set in the present version of +SG-t-SNE-. Unlike +(2), the rescaling mechanism +(6) imposes no constraint on the graph, +its solution exists unconditionally. For the conventional t-SNE as a +special case, we set by default. One may still make use of +and exploit the benefit of rescaling (). + +With the implementation SG-t-SNE-, we accelerate the entire +gradient calculation of SG-t-SNE and enable practical 3D embedding of +large sparse graphs on modern desktop and laptop computers. We +accelerate the computation of both and terms +by utilizing the matrix structures and the memory architecture in +tandem. + +#### Accelerated accumulation of attractive interactions + +The matrix in the attractive interaction term of +(5) has the same sparsity pattern as +matrix , regardless of iterative changes in . +Sparsity patterns are generally irregular. Matrix-vector products with +irregular sparse matrix invoke irregular memory accesses and incur +non-equal, prolonged access latencies on hierarchical memories. We +moderate memory accesses by permuting the rows and columns of matrix + such that rows and columns with similar nonzero patterns +are placed closer together. The permuted matrix becomes block-sparse +with denser blocks, resulting in better data locality in memory reads +and writes. + +The permuted matrix is stored in the Compressed Sparse +Blocks (CSB) storage format [[5](#Buluc2009)]. We utilize the CSB routines +for accessing the matrix and calculating the matrix-vector products with +the sparse matrix . The elements of the +matrix are formed on the fly during the calculation of the attractive +interaction term. + +#### Accelerated accumulation of repulsive interactions + +We factor the convolution in the repulsive interaction term of +(5) into three consecutive convolutional +operations. 
We introduce an internal equispaced grid within the spatial +domain of the embedding points at each iteration, similar to the +approach used in FIt-SNE [[4](#Linderman2019)]. The three convolutional +operations are: + +`S2G`: Local translation of the scattered (embedding) points to their +neighboring grid points. + +`G2G`: Convolution across the grid with the same t-distribution kernel +function, which is symmetric, of broad support, and aperiodic. + +`G2S`: Local translation of the gridded data to the scattered points. + +The `G2S` operation is a gridded interpolation and `S2G` is its +transpose; the arithmetic complexity is , where is the +interpolation window size per side. Convolution on the grid takes + arithmetic operations, where is the +number of grid points, i.e., the grid size. The grid size is determined +by the range of the embedding points at each iteration, with respect to +the error tolerance set by default or specified by the user. In the +current implementation, the local interpolation method employed by +SG-t-SNE- is accurate up to cubic polynomials in separable +variables (). + +Although the arithmetic complexity is substantially reduced in +comparison to the quadratic complexity of the direct way, the factored +operations suffer either from memory access latency or memory capacity +issues, which were not recognized or resolved in existing t-SNE +software. The scattered translation incurs high memory access latency. +The aperiodic convolution on the grid suffers from excessive use of +memory when the grid is periodically extended in all sides at once by +zero padding. The exponential memory growth with limits the +embedding dimension or the graph size. + +We resolve these memory latency and capacity issues in SG-t-SNE-. +Prior to `S2G`, we relocate the scattered data points to the grid bins. +This binning process has two immediate benefits. It improves data +locality in the subsequent interpolation. 
It also establishes a data +partition for parallel, multi-threaded execution of the scattered +interpolation. We omit the parallelization details. For `G2G`, we +implement aperiodic convolution by operator splitting, without using +extra memory. + +#### Rapid intra-term and inter-term data relocations + +In sparse or structured matrix computation of arithmetic +complexity, the execution time is dominated by memory accesses. We have +described in the previous sections how we use +intra-term permutations to improve data locality and reduce memory +access latency in computing the [attraction](#accelerated-accumulation-of-attractive-interactions) and the [repulsion](#accelerated-accumulation-of-repulsive-interactions) terms of +(5). In addition, we permute and relocate +in memory the embedding data points between the two terms, at every +iteration step. The inter-term data relocation is carried out at +multiple layers, exploiting the block-wise memory hierarchy. The data +permutation overhead is more than repaid by the much shorter time for +arithmetic calculation with the permuted data. We use Π in the +software name SG-t-SNE-Π to signify the importance and the role of +the permutations in accelerating t-SNE algorithms, including the +conventional one, and enabling 3D embeddings. + +### Supplementary material + +Supplementary material and performance plots are found at +. + + +## References + +[1] N. Pitsianis, A.-S. Iliopoulos, D. Floros, +and X. Sun. [Spaceland embedding of sparse stochastic graphs](https://arxiv.org/abs/1906.05582). In +*Proceedings of IEEE High Performance Extreme Computing +Conference*, 2019. (To appear.) + +[2] L. van der Maaten and G. Hinton. [Visualizing data using +t-SNE](http://www.jmlr.org/papers/v9/vandermaaten08a.html). *Journal of Machine Learning Research* 9(Nov):2579–2605, 2008. + +[3] L. van der Maaten. [Accelerating t-SNE using tree-based +algorithms](http://jmlr.org/papers/v15/vandermaaten14a.html).
*Journal of Machine Learning Research* 15(Oct):3221–3245, +2014. + +[4] G. C. Linderman, M. Rachh, J. G. Hoskins, +S. Steinerberger, and Y. Kluger. [Fast interpolation-based t-SNE +for improved visualization of single-cell RNA-seq data](https://doi.org/10.1038/s41592-018-0308-4). *Nature +Methods* 16(3):243–245, 2019. + +[5] A. Buluç, J. T. Fineman, M. Frigo, +J. R. Gilbert, and C. E. Leiserson. [Parallel sparse matrix-vector and +matrix-transpose-vector multiplication using compressed sparse +blocks](https://doi.org/10.1145/1583991.1584053). In *Proceedings of Annual Symposium on Parallelism in +Algorithms and Architectures*, pp. 233–244, 2009. + + +## Getting started + +### System environment + +SG-t-SNE-Π is developed for shared-memory computers with +multi-threading, running Linux or macOS. The source +code must be compiled with a `C++` compiler that supports Cilk. The +current release is tested with the `GNU g++` compiler 7.4.0 and the +`Intel` `icpc` compiler 19.0.4.233. + +### Prerequisites + +SG-t-SNE-Π uses the following open-source software: + +- [FFTW3](http://www.fftw.org/) 3.3.8 + +- [METIS](http://glaros.dtc.umn.edu/gkhome/metis/metis/overview) 5.1.0 + +- [FLANN](https://www.cs.ubc.ca/research/flann/) 1.9.1 + +- [Intel TBB](https://01.org/tbb) 2019 + +- [Doxygen](http://www.doxygen.nl/) 1.8.14 + +On Ubuntu: + + sudo apt-get install libtbb-dev libflann-dev libmetis-dev libfftw3-dev doxygen + +On macOS (with MacPorts): + + sudo port install flann tbb metis fftw-3 + +### Installation + +#### Basic instructions + +To generate the SG-t-SNE-Π library, tests, and demo programs: + + ./configure + make all + +To specify the `C++` compiler: + + ./configure CXX= + +To test whether the installation is successful: + + bin/test_modules + +To generate the documentation: + + make documentation + +#### Support of the conventional t-SNE + +SG-t-SNE-Π supports the conventional t-SNE algorithm through a set +of preprocessing functions.
Issue + + make tsnepi + +to generate the `bin/tsnepi` binary, which is fully compatible with the +[existing wrappers](https://github.com/lvdmaaten/bhtsne/) provided by van der Maaten [[3](#VanDerMaaten2014)]. + +#### MATLAB interface + +To compile the SG-t-SNE-Π `MATLAB` wrappers, use the +`--enable-matlab` option in the `configure` command. The default +`MATLAB` installation path is `/opt/local/matlab`; if `MATLAB` is installed elsewhere, set +`MATLABROOT`: + + ./configure --enable-matlab MATLABROOT= + +### Usage demo + +We provide two data sets of modest size for demonstrating stochastic +graph embedding with SG-t-SNE-Π: + + tar -xvzf data/mobius-graph.tar.gz + bin/demo_stochastic_matrix mobius-graph.mtx + + tar -xvzf data/pbmc-graph.tar.gz + bin/demo_stochastic_matrix pbmc-graph.mtx + +The [MNIST data set](http://yann.lecun.com/exdb/mnist/) can be tested using [existing wrappers](https://github.com/lvdmaaten/bhtsne/) provided +by van der Maaten [[3](#VanDerMaaten2014)]. + +## License and community guidelines + +The SG-t-SNE-Π library is licensed under the [GNU General Public +License v3.0](https://github.com/fcdimitr/sgtsnepi/blob/master/LICENSE). +To contribute to SG-t-SNE-Π or report any problem, follow our +[contribution +guidelines](https://github.com/fcdimitr/sgtsnepi/blob/master/CONTRIBUTING.md) +and [code of +conduct](https://github.com/fcdimitr/sgtsnepi/blob/master/CODE_OF_CONDUCT.md).
+ +## Contributors + +*Design and development*:\ +Nikos Pitsianis1,2, Dimitris Floros1, +Alexandros-Stavros Iliopoulos2, Xiaobai +Sun2\ +1 Department of Electrical and Computer Engineering, +Aristotle University of Thessaloniki, Thessaloniki 54124, Greece\ +2 Department of Computer Science, Duke University, Durham, NC +27708, USA + +## Acknowledgements + +*Alpha test participants*:\ +Xenofon Theodoridis + + + + diff --git a/README.pdf b/README.pdf new file mode 100644 index 0000000..68af44a Binary files /dev/null and b/README.pdf differ diff --git a/configure b/configure new file mode 100755 index 0000000..f2cb6a0 --- /dev/null +++ b/configure @@ -0,0 +1,5663 @@ +#! /bin/sh +# Guess values for system-dependent variables and create Makefiles. +# Generated by GNU Autoconf 2.69 for sgtsnepi version-1.0. +# +# +# Copyright (C) 1992-1996, 1998-2012 Free Software Foundation, Inc. +# +# +# This configure script is free software; the Free Software Foundation +# gives unlimited permission to copy, distribute and modify it. +## -------------------- ## +## M4sh Initialization. ## +## -------------------- ## + +# Be more Bourne compatible +DUALCASE=1; export DUALCASE # for MKS sh +if test -n "${ZSH_VERSION+set}" && (emulate sh) >/dev/null 2>&1; then : + emulate sh + NULLCMD=: + # Pre-4.2 versions of Zsh do word splitting on ${1+"$@"}, which + # is contrary to our usage. Disable this feature. + alias -g '${1+"$@"}'='"$@"' + setopt NO_GLOB_SUBST +else + case `(set -o) 2>/dev/null` in #( + *posix*) : + set -o posix ;; #( + *) : + ;; +esac +fi + + +as_nl=' +' +export as_nl +# Printing a long string crashes Solaris 7 /usr/bin/printf. +as_echo='\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\' +as_echo=$as_echo$as_echo$as_echo$as_echo$as_echo +as_echo=$as_echo$as_echo$as_echo$as_echo$as_echo$as_echo +# Prefer a ksh shell builtin over an external printf program on Solaris, +# but without wasting forks for bash or zsh. 
+if test -z "$BASH_VERSION$ZSH_VERSION" \ + && (test "X`print -r -- $as_echo`" = "X$as_echo") 2>/dev/null; then + as_echo='print -r --' + as_echo_n='print -rn --' +elif (test "X`printf %s $as_echo`" = "X$as_echo") 2>/dev/null; then + as_echo='printf %s\n' + as_echo_n='printf %s' +else + if test "X`(/usr/ucb/echo -n -n $as_echo) 2>/dev/null`" = "X-n $as_echo"; then + as_echo_body='eval /usr/ucb/echo -n "$1$as_nl"' + as_echo_n='/usr/ucb/echo -n' + else + as_echo_body='eval expr "X$1" : "X\\(.*\\)"' + as_echo_n_body='eval + arg=$1; + case $arg in #( + *"$as_nl"*) + expr "X$arg" : "X\\(.*\\)$as_nl"; + arg=`expr "X$arg" : ".*$as_nl\\(.*\\)"`;; + esac; + expr "X$arg" : "X\\(.*\\)" | tr -d "$as_nl" + ' + export as_echo_n_body + as_echo_n='sh -c $as_echo_n_body as_echo' + fi + export as_echo_body + as_echo='sh -c $as_echo_body as_echo' +fi + +# The user is always right. +if test "${PATH_SEPARATOR+set}" != set; then + PATH_SEPARATOR=: + (PATH='/bin;/bin'; FPATH=$PATH; sh -c :) >/dev/null 2>&1 && { + (PATH='/bin:/bin'; FPATH=$PATH; sh -c :) >/dev/null 2>&1 || + PATH_SEPARATOR=';' + } +fi + + +# IFS +# We need space, tab and new line, in precisely that order. Quoting is +# there to prevent editors from complaining about space-tab. +# (If _AS_PATH_WALK were called with IFS unset, it would disable word +# splitting by setting IFS to empty value.) +IFS=" "" $as_nl" + +# Find who we are. Look in the path if we contain no directory separator. +as_myself= +case $0 in #(( + *[\\/]* ) as_myself=$0 ;; + *) as_save_IFS=$IFS; IFS=$PATH_SEPARATOR +for as_dir in $PATH +do + IFS=$as_save_IFS + test -z "$as_dir" && as_dir=. + test -r "$as_dir/$0" && as_myself=$as_dir/$0 && break + done +IFS=$as_save_IFS + + ;; +esac +# We did not find ourselves, most probably we were run as `sh COMMAND' +# in which case we are not to be found in the path. +if test "x$as_myself" = x; then + as_myself=$0 +fi +if test ! 
-f "$as_myself"; then + $as_echo "$as_myself: error: cannot find myself; rerun with an absolute file name" >&2 + exit 1 +fi + +# Unset variables that we do not need and which cause bugs (e.g. in +# pre-3.0 UWIN ksh). But do not cause bugs in bash 2.01; the "|| exit 1" +# suppresses any "Segmentation fault" message there. '((' could +# trigger a bug in pdksh 5.2.14. +for as_var in BASH_ENV ENV MAIL MAILPATH +do eval test x\${$as_var+set} = xset \ + && ( (unset $as_var) || exit 1) >/dev/null 2>&1 && unset $as_var || : +done +PS1='$ ' +PS2='> ' +PS4='+ ' + +# NLS nuisances. +LC_ALL=C +export LC_ALL +LANGUAGE=C +export LANGUAGE + +# CDPATH. +(unset CDPATH) >/dev/null 2>&1 && unset CDPATH + +# Use a proper internal environment variable to ensure we don't fall + # into an infinite loop, continuously re-executing ourselves. + if test x"${_as_can_reexec}" != xno && test "x$CONFIG_SHELL" != x; then + _as_can_reexec=no; export _as_can_reexec; + # We cannot yet assume a decent shell, so we have to provide a +# neutralization value for shells without unset; and this also +# works around shells that cannot unset nonexistent variables. +# Preserve -v and -x to the replacement shell. +BASH_ENV=/dev/null +ENV=/dev/null +(unset BASH_ENV) >/dev/null 2>&1 && unset BASH_ENV ENV +case $- in # (((( + *v*x* | *x*v* ) as_opts=-vx ;; + *v* ) as_opts=-v ;; + *x* ) as_opts=-x ;; + * ) as_opts= ;; +esac +exec $CONFIG_SHELL $as_opts "$as_myself" ${1+"$@"} +# Admittedly, this is quite paranoid, since all the known shells bail +# out after a failed `exec'. +$as_echo "$0: could not re-execute with $CONFIG_SHELL" >&2 +as_fn_exit 255 + fi + # We don't want this to propagate to other subprocesses. + { _as_can_reexec=; unset _as_can_reexec;} +if test "x$CONFIG_SHELL" = x; then + as_bourne_compatible="if test -n \"\${ZSH_VERSION+set}\" && (emulate sh) >/dev/null 2>&1; then : + emulate sh + NULLCMD=: + # Pre-4.2 versions of Zsh do word splitting on \${1+\"\$@\"}, which + # is contrary to our usage. 
Disable this feature. + alias -g '\${1+\"\$@\"}'='\"\$@\"' + setopt NO_GLOB_SUBST +else + case \`(set -o) 2>/dev/null\` in #( + *posix*) : + set -o posix ;; #( + *) : + ;; +esac +fi +" + as_required="as_fn_return () { (exit \$1); } +as_fn_success () { as_fn_return 0; } +as_fn_failure () { as_fn_return 1; } +as_fn_ret_success () { return 0; } +as_fn_ret_failure () { return 1; } + +exitcode=0 +as_fn_success || { exitcode=1; echo as_fn_success failed.; } +as_fn_failure && { exitcode=1; echo as_fn_failure succeeded.; } +as_fn_ret_success || { exitcode=1; echo as_fn_ret_success failed.; } +as_fn_ret_failure && { exitcode=1; echo as_fn_ret_failure succeeded.; } +if ( set x; as_fn_ret_success y && test x = \"\$1\" ); then : + +else + exitcode=1; echo positional parameters were not saved. +fi +test x\$exitcode = x0 || exit 1 +test -x / || exit 1" + as_suggested=" as_lineno_1=";as_suggested=$as_suggested$LINENO;as_suggested=$as_suggested" as_lineno_1a=\$LINENO + as_lineno_2=";as_suggested=$as_suggested$LINENO;as_suggested=$as_suggested" as_lineno_2a=\$LINENO + eval 'test \"x\$as_lineno_1'\$as_run'\" != \"x\$as_lineno_2'\$as_run'\" && + test \"x\`expr \$as_lineno_1'\$as_run' + 1\`\" = \"x\$as_lineno_2'\$as_run'\"' || exit 1 +test \$(( 1 + 1 )) = 2 || exit 1" + if (eval "$as_required") 2>/dev/null; then : + as_have_required=yes +else + as_have_required=no +fi + if test x$as_have_required = xyes && (eval "$as_suggested") 2>/dev/null; then : + +else + as_save_IFS=$IFS; IFS=$PATH_SEPARATOR +as_found=false +for as_dir in /bin$PATH_SEPARATOR/usr/bin$PATH_SEPARATOR$PATH +do + IFS=$as_save_IFS + test -z "$as_dir" && as_dir=. + as_found=: + case $as_dir in #( + /*) + for as_base in sh bash ksh sh5; do + # Try only shells that exist, to save several forks. 
+ as_shell=$as_dir/$as_base + if { test -f "$as_shell" || test -f "$as_shell.exe"; } && + { $as_echo "$as_bourne_compatible""$as_required" | as_run=a "$as_shell"; } 2>/dev/null; then : + CONFIG_SHELL=$as_shell as_have_required=yes + if { $as_echo "$as_bourne_compatible""$as_suggested" | as_run=a "$as_shell"; } 2>/dev/null; then : + break 2 +fi +fi + done;; + esac + as_found=false +done +$as_found || { if { test -f "$SHELL" || test -f "$SHELL.exe"; } && + { $as_echo "$as_bourne_compatible""$as_required" | as_run=a "$SHELL"; } 2>/dev/null; then : + CONFIG_SHELL=$SHELL as_have_required=yes +fi; } +IFS=$as_save_IFS + + + if test "x$CONFIG_SHELL" != x; then : + export CONFIG_SHELL + # We cannot yet assume a decent shell, so we have to provide a +# neutralization value for shells without unset; and this also +# works around shells that cannot unset nonexistent variables. +# Preserve -v and -x to the replacement shell. +BASH_ENV=/dev/null +ENV=/dev/null +(unset BASH_ENV) >/dev/null 2>&1 && unset BASH_ENV ENV +case $- in # (((( + *v*x* | *x*v* ) as_opts=-vx ;; + *v* ) as_opts=-v ;; + *x* ) as_opts=-x ;; + * ) as_opts= ;; +esac +exec $CONFIG_SHELL $as_opts "$as_myself" ${1+"$@"} +# Admittedly, this is quite paranoid, since all the known shells bail +# out after a failed `exec'. +$as_echo "$0: could not re-execute with $CONFIG_SHELL" >&2 +exit 255 +fi + + if test x$as_have_required = xno; then : + $as_echo "$0: This script requires a shell more modern than all" + $as_echo "$0: the shells that I found on your system." + if test x${ZSH_VERSION+set} = xset ; then + $as_echo "$0: In particular, zsh $ZSH_VERSION has bugs and should" + $as_echo "$0: be upgraded to zsh 4.3.4 or later." + else + $as_echo "$0: Please tell bug-autoconf@gnu.org about your system, +$0: including any error possibly output before this +$0: message. Then install a modern shell, or manually run +$0: the script under such a shell if you do have one." 
+ fi + exit 1 +fi +fi +fi +SHELL=${CONFIG_SHELL-/bin/sh} +export SHELL +# Unset more variables known to interfere with behavior of common tools. +CLICOLOR_FORCE= GREP_OPTIONS= +unset CLICOLOR_FORCE GREP_OPTIONS + +## --------------------- ## +## M4sh Shell Functions. ## +## --------------------- ## +# as_fn_unset VAR +# --------------- +# Portably unset VAR. +as_fn_unset () +{ + { eval $1=; unset $1;} +} +as_unset=as_fn_unset + +# as_fn_set_status STATUS +# ----------------------- +# Set $? to STATUS, without forking. +as_fn_set_status () +{ + return $1 +} # as_fn_set_status + +# as_fn_exit STATUS +# ----------------- +# Exit the shell with STATUS, even in a "trap 0" or "set -e" context. +as_fn_exit () +{ + set +e + as_fn_set_status $1 + exit $1 +} # as_fn_exit + +# as_fn_mkdir_p +# ------------- +# Create "$as_dir" as a directory, including parents if necessary. +as_fn_mkdir_p () +{ + + case $as_dir in #( + -*) as_dir=./$as_dir;; + esac + test -d "$as_dir" || eval $as_mkdir_p || { + as_dirs= + while :; do + case $as_dir in #( + *\'*) as_qdir=`$as_echo "$as_dir" | sed "s/'/'\\\\\\\\''/g"`;; #'( + *) as_qdir=$as_dir;; + esac + as_dirs="'$as_qdir' $as_dirs" + as_dir=`$as_dirname -- "$as_dir" || +$as_expr X"$as_dir" : 'X\(.*[^/]\)//*[^/][^/]*/*$' \| \ + X"$as_dir" : 'X\(//\)[^/]' \| \ + X"$as_dir" : 'X\(//\)$' \| \ + X"$as_dir" : 'X\(/\)' \| . 2>/dev/null || +$as_echo X"$as_dir" | + sed '/^X\(.*[^/]\)\/\/*[^/][^/]*\/*$/{ + s//\1/ + q + } + /^X\(\/\/\)[^/].*/{ + s//\1/ + q + } + /^X\(\/\/\)$/{ + s//\1/ + q + } + /^X\(\/\).*/{ + s//\1/ + q + } + s/.*/./; q'` + test -d "$as_dir" && break + done + test -z "$as_dirs" || eval "mkdir $as_dirs" + } || test -d "$as_dir" || as_fn_error $? "cannot create directory $as_dir" + + +} # as_fn_mkdir_p + +# as_fn_executable_p FILE +# ----------------------- +# Test if FILE is an executable regular file. 
+as_fn_executable_p () +{ + test -f "$1" && test -x "$1" +} # as_fn_executable_p +# as_fn_append VAR VALUE +# ---------------------- +# Append the text in VALUE to the end of the definition contained in VAR. Take +# advantage of any shell optimizations that allow amortized linear growth over +# repeated appends, instead of the typical quadratic growth present in naive +# implementations. +if (eval "as_var=1; as_var+=2; test x\$as_var = x12") 2>/dev/null; then : + eval 'as_fn_append () + { + eval $1+=\$2 + }' +else + as_fn_append () + { + eval $1=\$$1\$2 + } +fi # as_fn_append + +# as_fn_arith ARG... +# ------------------ +# Perform arithmetic evaluation on the ARGs, and store the result in the +# global $as_val. Take advantage of shells that can avoid forks. The arguments +# must be portable across $(()) and expr. +if (eval "test \$(( 1 + 1 )) = 2") 2>/dev/null; then : + eval 'as_fn_arith () + { + as_val=$(( $* )) + }' +else + as_fn_arith () + { + as_val=`expr "$@" || test $? -eq 1` + } +fi # as_fn_arith + + +# as_fn_error STATUS ERROR [LINENO LOG_FD] +# ---------------------------------------- +# Output "`basename $0`: error: ERROR" to stderr. If LINENO and LOG_FD are +# provided, also output the error to LOG_FD, referencing LINENO. Then exit the +# script with STATUS, using 1 if that was 0. 
+as_fn_error () +{ + as_status=$1; test $as_status -eq 0 && as_status=1 + if test "$4"; then + as_lineno=${as_lineno-"$3"} as_lineno_stack=as_lineno_stack=$as_lineno_stack + $as_echo "$as_me:${as_lineno-$LINENO}: error: $2" >&$4 + fi + $as_echo "$as_me: error: $2" >&2 + as_fn_exit $as_status +} # as_fn_error + +if expr a : '\(a\)' >/dev/null 2>&1 && + test "X`expr 00001 : '.*\(...\)'`" = X001; then + as_expr=expr +else + as_expr=false +fi + +if (basename -- /) >/dev/null 2>&1 && test "X`basename -- / 2>&1`" = "X/"; then + as_basename=basename +else + as_basename=false +fi + +if (as_dir=`dirname -- /` && test "X$as_dir" = X/) >/dev/null 2>&1; then + as_dirname=dirname +else + as_dirname=false +fi + +as_me=`$as_basename -- "$0" || +$as_expr X/"$0" : '.*/\([^/][^/]*\)/*$' \| \ + X"$0" : 'X\(//\)$' \| \ + X"$0" : 'X\(/\)' \| . 2>/dev/null || +$as_echo X/"$0" | + sed '/^.*\/\([^/][^/]*\)\/*$/{ + s//\1/ + q + } + /^X\/\(\/\/\)$/{ + s//\1/ + q + } + /^X\/\(\/\).*/{ + s//\1/ + q + } + s/.*/./; q'` + +# Avoid depending upon Character Ranges. +as_cr_letters='abcdefghijklmnopqrstuvwxyz' +as_cr_LETTERS='ABCDEFGHIJKLMNOPQRSTUVWXYZ' +as_cr_Letters=$as_cr_letters$as_cr_LETTERS +as_cr_digits='0123456789' +as_cr_alnum=$as_cr_Letters$as_cr_digits + + + as_lineno_1=$LINENO as_lineno_1a=$LINENO + as_lineno_2=$LINENO as_lineno_2a=$LINENO + eval 'test "x$as_lineno_1'$as_run'" != "x$as_lineno_2'$as_run'" && + test "x`expr $as_lineno_1'$as_run' + 1`" = "x$as_lineno_2'$as_run'"' || { + # Blame Lee E. McMahon (1931-1989) for sed's syntax. 
:-) + sed -n ' + p + /[$]LINENO/= + ' <$as_myself | + sed ' + s/[$]LINENO.*/&-/ + t lineno + b + :lineno + N + :loop + s/[$]LINENO\([^'$as_cr_alnum'_].*\n\)\(.*\)/\2\1\2/ + t loop + s/-\n.*// + ' >$as_me.lineno && + chmod +x "$as_me.lineno" || + { $as_echo "$as_me: error: cannot create $as_me.lineno; rerun with a POSIX shell" >&2; as_fn_exit 1; } + + # If we had to re-execute with $CONFIG_SHELL, we're ensured to have + # already done that, so ensure we don't try to do so again and fall + # in an infinite loop. This has already happened in practice. + _as_can_reexec=no; export _as_can_reexec + # Don't try to exec as it changes $[0], causing all sort of problems + # (the dirname of $[0] is not the place where we might find the + # original and so on. Autoconf is especially sensitive to this). + . "./$as_me.lineno" + # Exit status is that of the last command. + exit +} + +ECHO_C= ECHO_N= ECHO_T= +case `echo -n x` in #((((( +-n*) + case `echo 'xy\c'` in + *c*) ECHO_T=' ';; # ECHO_T is single tab character. + xy) ECHO_C='\c';; + *) echo `echo ksh88 bug on AIX 6.1` > /dev/null + ECHO_T=' ';; + esac;; +*) + ECHO_N='-n';; +esac + +rm -f conf$$ conf$$.exe conf$$.file +if test -d conf$$.dir; then + rm -f conf$$.dir/conf$$.file +else + rm -f conf$$.dir + mkdir conf$$.dir 2>/dev/null +fi +if (echo >conf$$.file) 2>/dev/null; then + if ln -s conf$$.file conf$$ 2>/dev/null; then + as_ln_s='ln -s' + # ... but there are two gotchas: + # 1) On MSYS, both `ln -s file dir' and `ln file dir' fail. + # 2) DJGPP < 2.04 has no symlinks; `ln -s' creates a wrapper executable. + # In both cases, we have to default to `cp -pR'. + ln -s conf$$.file conf$$.dir 2>/dev/null && test ! -f conf$$.exe || + as_ln_s='cp -pR' + elif ln conf$$.file conf$$ 2>/dev/null; then + as_ln_s=ln + else + as_ln_s='cp -pR' + fi +else + as_ln_s='cp -pR' +fi +rm -f conf$$ conf$$.exe conf$$.dir/conf$$.file conf$$.file +rmdir conf$$.dir 2>/dev/null + +if mkdir -p . 
2>/dev/null; then + as_mkdir_p='mkdir -p "$as_dir"' +else + test -d ./-p && rmdir ./-p + as_mkdir_p=false +fi + +as_test_x='test -x' +as_executable_p=as_fn_executable_p + +# Sed expression to map a string onto a valid CPP name. +as_tr_cpp="eval sed 'y%*$as_cr_letters%P$as_cr_LETTERS%;s%[^_$as_cr_alnum]%_%g'" + +# Sed expression to map a string onto a valid variable name. +as_tr_sh="eval sed 'y%*+%pp%;s%[^_$as_cr_alnum]%_%g'" + + +test -n "$DJDIR" || exec 7<&0 </dev/null +exec 6>&1 + +# Name of the host. +# hostname on some systems (SVR3.2, old GNU/Linux) returns a bogus exit status, +# so uname gets run too. +ac_hostname=`(hostname || uname -n) 2>/dev/null | sed 1q` + +# +# Initializations. +# +ac_default_prefix=/usr/local +ac_clean_files= +ac_config_libobj_dir=. +LIBOBJS= +cross_compiling=no +subdirs= +MFLAGS= +MAKEFLAGS= + +# Identity of this package. +PACKAGE_NAME='sgtsnepi' +PACKAGE_TARNAME='sgtsnepi' +PACKAGE_VERSION='version-1.0' +PACKAGE_STRING='sgtsnepi version-1.0' +PACKAGE_BUGREPORT='' +PACKAGE_URL='' + +# Factoring default headers for most tests. 
+ac_includes_default="\ +#include <stdio.h> +#ifdef HAVE_SYS_TYPES_H +# include <sys/types.h> +#endif +#ifdef HAVE_SYS_STAT_H +# include <sys/stat.h> +#endif +#ifdef STDC_HEADERS +# include <stdlib.h> +# include <stddef.h> +#else +# ifdef HAVE_STDLIB_H +# include <stdlib.h> +# endif +#endif +#ifdef HAVE_STRING_H +# if !defined STDC_HEADERS && defined HAVE_MEMORY_H +# include <memory.h> +# endif +# include <string.h> +#endif +#ifdef HAVE_STRINGS_H +# include <strings.h> +#endif +#ifdef HAVE_INTTYPES_H +# include <inttypes.h> +#endif +#ifdef HAVE_STDINT_H +# include <stdint.h> +#endif +#ifdef HAVE_UNISTD_H +# include <unistd.h> +#endif" + +ac_subst_vars='ENABLE_MATLAB +EGREP +GREP +CXXCPP +LTLIBOBJS +LIBOBJS +INCS +MATLABROOT +MEX +MEXFLAGS +OBJEXT +EXEEXT +ac_ct_CXX +CPPFLAGS +LDFLAGS +CXXFLAGS +CXX +target_alias +host_alias +build_alias +LIBS +ECHO_T +ECHO_N +ECHO_C +DEFS +mandir +localedir +libdir +psdir +pdfdir +dvidir +htmldir +infodir +docdir +oldincludedir +includedir +localstatedir +sharedstatedir +sysconfdir +datadir +datarootdir +libexecdir +sbindir +bindir +program_transform_name +prefix +exec_prefix +PACKAGE_URL +PACKAGE_BUGREPORT +PACKAGE_STRING +PACKAGE_VERSION +PACKAGE_TARNAME +PACKAGE_NAME +PATH_SEPARATOR +SHELL' +ac_subst_files='' +ac_user_opts=' +enable_option_checking +enable_matlab +' + ac_precious_vars='build_alias +host_alias +target_alias +CXX +CXXFLAGS +LDFLAGS +LIBS +CPPFLAGS +CCC +CXXCPP' + + +# Initialize some variables set by options. +ac_init_help= +ac_init_version=false +ac_unrecognized_opts= +ac_unrecognized_sep= +# The variables have the same names as the options, with +# dashes changed to underlines. +cache_file=/dev/null +exec_prefix=NONE +no_create= +no_recursion= +prefix=NONE +program_prefix=NONE +program_suffix=NONE +program_transform_name=s,x,x, +silent= +site= +srcdir= +verbose= +x_includes=NONE +x_libraries=NONE + +# Installation directory options. +# These are left unexpanded so users can "make install exec_prefix=/foo" +# and all the variables that are supposed to be based on exec_prefix +# by default will actually change. 
+# Use braces instead of parens because sh, perl, etc. also accept them. +# (The list follows the same order as the GNU Coding Standards.) +bindir='${exec_prefix}/bin' +sbindir='${exec_prefix}/sbin' +libexecdir='${exec_prefix}/libexec' +datarootdir='${prefix}/share' +datadir='${datarootdir}' +sysconfdir='${prefix}/etc' +sharedstatedir='${prefix}/com' +localstatedir='${prefix}/var' +includedir='${prefix}/include' +oldincludedir='/usr/include' +docdir='${datarootdir}/doc/${PACKAGE_TARNAME}' +infodir='${datarootdir}/info' +htmldir='${docdir}' +dvidir='${docdir}' +pdfdir='${docdir}' +psdir='${docdir}' +libdir='${exec_prefix}/lib' +localedir='${datarootdir}/locale' +mandir='${datarootdir}/man' + +ac_prev= +ac_dashdash= +for ac_option +do + # If the previous option needs an argument, assign it. + if test -n "$ac_prev"; then + eval $ac_prev=\$ac_option + ac_prev= + continue + fi + + case $ac_option in + *=?*) ac_optarg=`expr "X$ac_option" : '[^=]*=\(.*\)'` ;; + *=) ac_optarg= ;; + *) ac_optarg=yes ;; + esac + + # Accept the important Cygnus configure options, so we can diagnose typos. 
+ + case $ac_dashdash$ac_option in + --) + ac_dashdash=yes ;; + + -bindir | --bindir | --bindi | --bind | --bin | --bi) + ac_prev=bindir ;; + -bindir=* | --bindir=* | --bindi=* | --bind=* | --bin=* | --bi=*) + bindir=$ac_optarg ;; + + -build | --build | --buil | --bui | --bu) + ac_prev=build_alias ;; + -build=* | --build=* | --buil=* | --bui=* | --bu=*) + build_alias=$ac_optarg ;; + + -cache-file | --cache-file | --cache-fil | --cache-fi \ + | --cache-f | --cache- | --cache | --cach | --cac | --ca | --c) + ac_prev=cache_file ;; + -cache-file=* | --cache-file=* | --cache-fil=* | --cache-fi=* \ + | --cache-f=* | --cache-=* | --cache=* | --cach=* | --cac=* | --ca=* | --c=*) + cache_file=$ac_optarg ;; + + --config-cache | -C) + cache_file=config.cache ;; + + -datadir | --datadir | --datadi | --datad) + ac_prev=datadir ;; + -datadir=* | --datadir=* | --datadi=* | --datad=*) + datadir=$ac_optarg ;; + + -datarootdir | --datarootdir | --datarootdi | --datarootd | --dataroot \ + | --dataroo | --dataro | --datar) + ac_prev=datarootdir ;; + -datarootdir=* | --datarootdir=* | --datarootdi=* | --datarootd=* \ + | --dataroot=* | --dataroo=* | --dataro=* | --datar=*) + datarootdir=$ac_optarg ;; + + -disable-* | --disable-*) + ac_useropt=`expr "x$ac_option" : 'x-*disable-\(.*\)'` + # Reject names that are not valid shell variable names. + expr "x$ac_useropt" : ".*[^-+._$as_cr_alnum]" >/dev/null && + as_fn_error $? 
"invalid feature name: $ac_useropt" + ac_useropt_orig=$ac_useropt + ac_useropt=`$as_echo "$ac_useropt" | sed 's/[-+.]/_/g'` + case $ac_user_opts in + *" +"enable_$ac_useropt" +"*) ;; + *) ac_unrecognized_opts="$ac_unrecognized_opts$ac_unrecognized_sep--disable-$ac_useropt_orig" + ac_unrecognized_sep=', ';; + esac + eval enable_$ac_useropt=no ;; + + -docdir | --docdir | --docdi | --doc | --do) + ac_prev=docdir ;; + -docdir=* | --docdir=* | --docdi=* | --doc=* | --do=*) + docdir=$ac_optarg ;; + + -dvidir | --dvidir | --dvidi | --dvid | --dvi | --dv) + ac_prev=dvidir ;; + -dvidir=* | --dvidir=* | --dvidi=* | --dvid=* | --dvi=* | --dv=*) + dvidir=$ac_optarg ;; + + -enable-* | --enable-*) + ac_useropt=`expr "x$ac_option" : 'x-*enable-\([^=]*\)'` + # Reject names that are not valid shell variable names. + expr "x$ac_useropt" : ".*[^-+._$as_cr_alnum]" >/dev/null && + as_fn_error $? "invalid feature name: $ac_useropt" + ac_useropt_orig=$ac_useropt + ac_useropt=`$as_echo "$ac_useropt" | sed 's/[-+.]/_/g'` + case $ac_user_opts in + *" +"enable_$ac_useropt" +"*) ;; + *) ac_unrecognized_opts="$ac_unrecognized_opts$ac_unrecognized_sep--enable-$ac_useropt_orig" + ac_unrecognized_sep=', ';; + esac + eval enable_$ac_useropt=\$ac_optarg ;; + + -exec-prefix | --exec_prefix | --exec-prefix | --exec-prefi \ + | --exec-pref | --exec-pre | --exec-pr | --exec-p | --exec- \ + | --exec | --exe | --ex) + ac_prev=exec_prefix ;; + -exec-prefix=* | --exec_prefix=* | --exec-prefix=* | --exec-prefi=* \ + | --exec-pref=* | --exec-pre=* | --exec-pr=* | --exec-p=* | --exec-=* \ + | --exec=* | --exe=* | --ex=*) + exec_prefix=$ac_optarg ;; + + -gas | --gas | --ga | --g) + # Obsolete; use --with-gas. 
+ with_gas=yes ;; + + -help | --help | --hel | --he | -h) + ac_init_help=long ;; + -help=r* | --help=r* | --hel=r* | --he=r* | -hr*) + ac_init_help=recursive ;; + -help=s* | --help=s* | --hel=s* | --he=s* | -hs*) + ac_init_help=short ;; + + -host | --host | --hos | --ho) + ac_prev=host_alias ;; + -host=* | --host=* | --hos=* | --ho=*) + host_alias=$ac_optarg ;; + + -htmldir | --htmldir | --htmldi | --htmld | --html | --htm | --ht) + ac_prev=htmldir ;; + -htmldir=* | --htmldir=* | --htmldi=* | --htmld=* | --html=* | --htm=* \ + | --ht=*) + htmldir=$ac_optarg ;; + + -includedir | --includedir | --includedi | --included | --include \ + | --includ | --inclu | --incl | --inc) + ac_prev=includedir ;; + -includedir=* | --includedir=* | --includedi=* | --included=* | --include=* \ + | --includ=* | --inclu=* | --incl=* | --inc=*) + includedir=$ac_optarg ;; + + -infodir | --infodir | --infodi | --infod | --info | --inf) + ac_prev=infodir ;; + -infodir=* | --infodir=* | --infodi=* | --infod=* | --info=* | --inf=*) + infodir=$ac_optarg ;; + + -libdir | --libdir | --libdi | --libd) + ac_prev=libdir ;; + -libdir=* | --libdir=* | --libdi=* | --libd=*) + libdir=$ac_optarg ;; + + -libexecdir | --libexecdir | --libexecdi | --libexecd | --libexec \ + | --libexe | --libex | --libe) + ac_prev=libexecdir ;; + -libexecdir=* | --libexecdir=* | --libexecdi=* | --libexecd=* | --libexec=* \ + | --libexe=* | --libex=* | --libe=*) + libexecdir=$ac_optarg ;; + + -localedir | --localedir | --localedi | --localed | --locale) + ac_prev=localedir ;; + -localedir=* | --localedir=* | --localedi=* | --localed=* | --locale=*) + localedir=$ac_optarg ;; + + -localstatedir | --localstatedir | --localstatedi | --localstated \ + | --localstate | --localstat | --localsta | --localst | --locals) + ac_prev=localstatedir ;; + -localstatedir=* | --localstatedir=* | --localstatedi=* | --localstated=* \ + | --localstate=* | --localstat=* | --localsta=* | --localst=* | --locals=*) + localstatedir=$ac_optarg ;; + + 
-mandir | --mandir | --mandi | --mand | --man | --ma | --m) + ac_prev=mandir ;; + -mandir=* | --mandir=* | --mandi=* | --mand=* | --man=* | --ma=* | --m=*) + mandir=$ac_optarg ;; + + -nfp | --nfp | --nf) + # Obsolete; use --without-fp. + with_fp=no ;; + + -no-create | --no-create | --no-creat | --no-crea | --no-cre \ + | --no-cr | --no-c | -n) + no_create=yes ;; + + -no-recursion | --no-recursion | --no-recursio | --no-recursi \ + | --no-recurs | --no-recur | --no-recu | --no-rec | --no-re | --no-r) + no_recursion=yes ;; + + -oldincludedir | --oldincludedir | --oldincludedi | --oldincluded \ + | --oldinclude | --oldinclud | --oldinclu | --oldincl | --oldinc \ + | --oldin | --oldi | --old | --ol | --o) + ac_prev=oldincludedir ;; + -oldincludedir=* | --oldincludedir=* | --oldincludedi=* | --oldincluded=* \ + | --oldinclude=* | --oldinclud=* | --oldinclu=* | --oldincl=* | --oldinc=* \ + | --oldin=* | --oldi=* | --old=* | --ol=* | --o=*) + oldincludedir=$ac_optarg ;; + + -prefix | --prefix | --prefi | --pref | --pre | --pr | --p) + ac_prev=prefix ;; + -prefix=* | --prefix=* | --prefi=* | --pref=* | --pre=* | --pr=* | --p=*) + prefix=$ac_optarg ;; + + -program-prefix | --program-prefix | --program-prefi | --program-pref \ + | --program-pre | --program-pr | --program-p) + ac_prev=program_prefix ;; + -program-prefix=* | --program-prefix=* | --program-prefi=* \ + | --program-pref=* | --program-pre=* | --program-pr=* | --program-p=*) + program_prefix=$ac_optarg ;; + + -program-suffix | --program-suffix | --program-suffi | --program-suff \ + | --program-suf | --program-su | --program-s) + ac_prev=program_suffix ;; + -program-suffix=* | --program-suffix=* | --program-suffi=* \ + | --program-suff=* | --program-suf=* | --program-su=* | --program-s=*) + program_suffix=$ac_optarg ;; + + -program-transform-name | --program-transform-name \ + | --program-transform-nam | --program-transform-na \ + | --program-transform-n | --program-transform- \ + | --program-transform | 
--program-transfor \ + | --program-transfo | --program-transf \ + | --program-trans | --program-tran \ + | --progr-tra | --program-tr | --program-t) + ac_prev=program_transform_name ;; + -program-transform-name=* | --program-transform-name=* \ + | --program-transform-nam=* | --program-transform-na=* \ + | --program-transform-n=* | --program-transform-=* \ + | --program-transform=* | --program-transfor=* \ + | --program-transfo=* | --program-transf=* \ + | --program-trans=* | --program-tran=* \ + | --progr-tra=* | --program-tr=* | --program-t=*) + program_transform_name=$ac_optarg ;; + + -pdfdir | --pdfdir | --pdfdi | --pdfd | --pdf | --pd) + ac_prev=pdfdir ;; + -pdfdir=* | --pdfdir=* | --pdfdi=* | --pdfd=* | --pdf=* | --pd=*) + pdfdir=$ac_optarg ;; + + -psdir | --psdir | --psdi | --psd | --ps) + ac_prev=psdir ;; + -psdir=* | --psdir=* | --psdi=* | --psd=* | --ps=*) + psdir=$ac_optarg ;; + + -q | -quiet | --quiet | --quie | --qui | --qu | --q \ + | -silent | --silent | --silen | --sile | --sil) + silent=yes ;; + + -sbindir | --sbindir | --sbindi | --sbind | --sbin | --sbi | --sb) + ac_prev=sbindir ;; + -sbindir=* | --sbindir=* | --sbindi=* | --sbind=* | --sbin=* \ + | --sbi=* | --sb=*) + sbindir=$ac_optarg ;; + + -sharedstatedir | --sharedstatedir | --sharedstatedi \ + | --sharedstated | --sharedstate | --sharedstat | --sharedsta \ + | --sharedst | --shareds | --shared | --share | --shar \ + | --sha | --sh) + ac_prev=sharedstatedir ;; + -sharedstatedir=* | --sharedstatedir=* | --sharedstatedi=* \ + | --sharedstated=* | --sharedstate=* | --sharedstat=* | --sharedsta=* \ + | --sharedst=* | --shareds=* | --shared=* | --share=* | --shar=* \ + | --sha=* | --sh=*) + sharedstatedir=$ac_optarg ;; + + -site | --site | --sit) + ac_prev=site ;; + -site=* | --site=* | --sit=*) + site=$ac_optarg ;; + + -srcdir | --srcdir | --srcdi | --srcd | --src | --sr) + ac_prev=srcdir ;; + -srcdir=* | --srcdir=* | --srcdi=* | --srcd=* | --src=* | --sr=*) + srcdir=$ac_optarg ;; + + 
-sysconfdir | --sysconfdir | --sysconfdi | --sysconfd | --sysconf \ + | --syscon | --sysco | --sysc | --sys | --sy) + ac_prev=sysconfdir ;; + -sysconfdir=* | --sysconfdir=* | --sysconfdi=* | --sysconfd=* | --sysconf=* \ + | --syscon=* | --sysco=* | --sysc=* | --sys=* | --sy=*) + sysconfdir=$ac_optarg ;; + + -target | --target | --targe | --targ | --tar | --ta | --t) + ac_prev=target_alias ;; + -target=* | --target=* | --targe=* | --targ=* | --tar=* | --ta=* | --t=*) + target_alias=$ac_optarg ;; + + -v | -verbose | --verbose | --verbos | --verbo | --verb) + verbose=yes ;; + + -version | --version | --versio | --versi | --vers | -V) + ac_init_version=: ;; + + -with-* | --with-*) + ac_useropt=`expr "x$ac_option" : 'x-*with-\([^=]*\)'` + # Reject names that are not valid shell variable names. + expr "x$ac_useropt" : ".*[^-+._$as_cr_alnum]" >/dev/null && + as_fn_error $? "invalid package name: $ac_useropt" + ac_useropt_orig=$ac_useropt + ac_useropt=`$as_echo "$ac_useropt" | sed 's/[-+.]/_/g'` + case $ac_user_opts in + *" +"with_$ac_useropt" +"*) ;; + *) ac_unrecognized_opts="$ac_unrecognized_opts$ac_unrecognized_sep--with-$ac_useropt_orig" + ac_unrecognized_sep=', ';; + esac + eval with_$ac_useropt=\$ac_optarg ;; + + -without-* | --without-*) + ac_useropt=`expr "x$ac_option" : 'x-*without-\(.*\)'` + # Reject names that are not valid shell variable names. + expr "x$ac_useropt" : ".*[^-+._$as_cr_alnum]" >/dev/null && + as_fn_error $? "invalid package name: $ac_useropt" + ac_useropt_orig=$ac_useropt + ac_useropt=`$as_echo "$ac_useropt" | sed 's/[-+.]/_/g'` + case $ac_user_opts in + *" +"with_$ac_useropt" +"*) ;; + *) ac_unrecognized_opts="$ac_unrecognized_opts$ac_unrecognized_sep--without-$ac_useropt_orig" + ac_unrecognized_sep=', ';; + esac + eval with_$ac_useropt=no ;; + + --x) + # Obsolete; use --with-x. 
+ with_x=yes ;; + + -x-includes | --x-includes | --x-include | --x-includ | --x-inclu \ + | --x-incl | --x-inc | --x-in | --x-i) + ac_prev=x_includes ;; + -x-includes=* | --x-includes=* | --x-include=* | --x-includ=* | --x-inclu=* \ + | --x-incl=* | --x-inc=* | --x-in=* | --x-i=*) + x_includes=$ac_optarg ;; + + -x-libraries | --x-libraries | --x-librarie | --x-librari \ + | --x-librar | --x-libra | --x-libr | --x-lib | --x-li | --x-l) + ac_prev=x_libraries ;; + -x-libraries=* | --x-libraries=* | --x-librarie=* | --x-librari=* \ + | --x-librar=* | --x-libra=* | --x-libr=* | --x-lib=* | --x-li=* | --x-l=*) + x_libraries=$ac_optarg ;; + + -*) as_fn_error $? "unrecognized option: \`$ac_option' +Try \`$0 --help' for more information" + ;; + + *=*) + ac_envvar=`expr "x$ac_option" : 'x\([^=]*\)='` + # Reject names that are not valid shell variable names. + case $ac_envvar in #( + '' | [0-9]* | *[!_$as_cr_alnum]* ) + as_fn_error $? "invalid variable name: \`$ac_envvar'" ;; + esac + eval $ac_envvar=\$ac_optarg + export $ac_envvar ;; + + *) + # FIXME: should be removed in autoconf 3.0. + $as_echo "$as_me: WARNING: you should use --build, --host, --target" >&2 + expr "x$ac_option" : ".*[^-._$as_cr_alnum]" >/dev/null && + $as_echo "$as_me: WARNING: invalid host type: $ac_option" >&2 + : "${build_alias=$ac_option} ${host_alias=$ac_option} ${target_alias=$ac_option}" + ;; + + esac +done + +if test -n "$ac_prev"; then + ac_option=--`echo $ac_prev | sed 's/_/-/g'` + as_fn_error $? "missing argument to $ac_option" +fi + +if test -n "$ac_unrecognized_opts"; then + case $enable_option_checking in + no) ;; + fatal) as_fn_error $? "unrecognized options: $ac_unrecognized_opts" ;; + *) $as_echo "$as_me: WARNING: unrecognized options: $ac_unrecognized_opts" >&2 ;; + esac +fi + +# Check all directory arguments for consistency. 
+for ac_var in exec_prefix prefix bindir sbindir libexecdir datarootdir \ + datadir sysconfdir sharedstatedir localstatedir includedir \ + oldincludedir docdir infodir htmldir dvidir pdfdir psdir \ + libdir localedir mandir +do + eval ac_val=\$$ac_var + # Remove trailing slashes. + case $ac_val in + */ ) + ac_val=`expr "X$ac_val" : 'X\(.*[^/]\)' \| "X$ac_val" : 'X\(.*\)'` + eval $ac_var=\$ac_val;; + esac + # Be sure to have absolute directory names. + case $ac_val in + [\\/$]* | ?:[\\/]* ) continue;; + NONE | '' ) case $ac_var in *prefix ) continue;; esac;; + esac + as_fn_error $? "expected an absolute directory name for --$ac_var: $ac_val" +done + +# There might be people who depend on the old broken behavior: `$host' +# used to hold the argument of --host etc. +# FIXME: To remove some day. +build=$build_alias +host=$host_alias +target=$target_alias + +# FIXME: To remove some day. +if test "x$host_alias" != x; then + if test "x$build_alias" = x; then + cross_compiling=maybe + elif test "x$build_alias" != "x$host_alias"; then + cross_compiling=yes + fi +fi + +ac_tool_prefix= +test -n "$host_alias" && ac_tool_prefix=$host_alias- + +test "$silent" = yes && exec 6>/dev/null + + +ac_pwd=`pwd` && test -n "$ac_pwd" && +ac_ls_di=`ls -di .` && +ac_pwd_ls_di=`cd "$ac_pwd" && ls -di .` || + as_fn_error $? "working directory cannot be determined" +test "X$ac_ls_di" = "X$ac_pwd_ls_di" || + as_fn_error $? "pwd does not report name of working directory" + + +# Find the source files, if location was not specified. +if test -z "$srcdir"; then + ac_srcdir_defaulted=yes + # Try the directory containing this script, then the parent directory. + ac_confdir=`$as_dirname -- "$as_myself" || +$as_expr X"$as_myself" : 'X\(.*[^/]\)//*[^/][^/]*/*$' \| \ + X"$as_myself" : 'X\(//\)[^/]' \| \ + X"$as_myself" : 'X\(//\)$' \| \ + X"$as_myself" : 'X\(/\)' \| . 
2>/dev/null || +$as_echo X"$as_myself" | + sed '/^X\(.*[^/]\)\/\/*[^/][^/]*\/*$/{ + s//\1/ + q + } + /^X\(\/\/\)[^/].*/{ + s//\1/ + q + } + /^X\(\/\/\)$/{ + s//\1/ + q + } + /^X\(\/\).*/{ + s//\1/ + q + } + s/.*/./; q'` + srcdir=$ac_confdir + if test ! -r "$srcdir/$ac_unique_file"; then + srcdir=.. + fi +else + ac_srcdir_defaulted=no +fi +if test ! -r "$srcdir/$ac_unique_file"; then + test "$ac_srcdir_defaulted" = yes && srcdir="$ac_confdir or .." + as_fn_error $? "cannot find sources ($ac_unique_file) in $srcdir" +fi +ac_msg="sources are in $srcdir, but \`cd $srcdir' does not work" +ac_abs_confdir=`( + cd "$srcdir" && test -r "./$ac_unique_file" || as_fn_error $? "$ac_msg" + pwd)` +# When building in place, set srcdir=. +if test "$ac_abs_confdir" = "$ac_pwd"; then + srcdir=. +fi +# Remove unnecessary trailing slashes from srcdir. +# Double slashes in file names in object file debugging info +# mess up M-x gdb in Emacs. +case $srcdir in +*/) srcdir=`expr "X$srcdir" : 'X\(.*[^/]\)' \| "X$srcdir" : 'X\(.*\)'`;; +esac +for ac_var in $ac_precious_vars; do + eval ac_env_${ac_var}_set=\${${ac_var}+set} + eval ac_env_${ac_var}_value=\$${ac_var} + eval ac_cv_env_${ac_var}_set=\${${ac_var}+set} + eval ac_cv_env_${ac_var}_value=\$${ac_var} +done + +# +# Report the --help message. +# +if test "$ac_init_help" = "long"; then + # Omit some internal or obsolete options to make the list less imposing. + # This message is too long to be a string in the A/UX 3.1 sh. + cat <<_ACEOF +\`configure' configures sgtsnepi version-1.0 to adapt to many kinds of systems. + +Usage: $0 [OPTION]... [VAR=VALUE]... + +To assign environment variables (e.g., CC, CFLAGS...), specify them as +VAR=VALUE. See below for descriptions of some of the useful variables. + +Defaults for the options are specified in brackets. 
+ +Configuration: + -h, --help display this help and exit + --help=short display options specific to this package + --help=recursive display the short help of all the included packages + -V, --version display version information and exit + -q, --quiet, --silent do not print \`checking ...' messages + --cache-file=FILE cache test results in FILE [disabled] + -C, --config-cache alias for \`--cache-file=config.cache' + -n, --no-create do not create output files + --srcdir=DIR find the sources in DIR [configure dir or \`..'] + +Installation directories: + --prefix=PREFIX install architecture-independent files in PREFIX + [$ac_default_prefix] + --exec-prefix=EPREFIX install architecture-dependent files in EPREFIX + [PREFIX] + +By default, \`make install' will install all the files in +\`$ac_default_prefix/bin', \`$ac_default_prefix/lib' etc. You can specify +an installation prefix other than \`$ac_default_prefix' using \`--prefix', +for instance \`--prefix=\$HOME'. + +For better control, use the options below. 
+ +Fine tuning of the installation directories: + --bindir=DIR user executables [EPREFIX/bin] + --sbindir=DIR system admin executables [EPREFIX/sbin] + --libexecdir=DIR program executables [EPREFIX/libexec] + --sysconfdir=DIR read-only single-machine data [PREFIX/etc] + --sharedstatedir=DIR modifiable architecture-independent data [PREFIX/com] + --localstatedir=DIR modifiable single-machine data [PREFIX/var] + --libdir=DIR object code libraries [EPREFIX/lib] + --includedir=DIR C header files [PREFIX/include] + --oldincludedir=DIR C header files for non-gcc [/usr/include] + --datarootdir=DIR read-only arch.-independent data root [PREFIX/share] + --datadir=DIR read-only architecture-independent data [DATAROOTDIR] + --infodir=DIR info documentation [DATAROOTDIR/info] + --localedir=DIR locale-dependent data [DATAROOTDIR/locale] + --mandir=DIR man documentation [DATAROOTDIR/man] + --docdir=DIR documentation root [DATAROOTDIR/doc/sgtsnepi] + --htmldir=DIR html documentation [DOCDIR] + --dvidir=DIR dvi documentation [DOCDIR] + --pdfdir=DIR pdf documentation [DOCDIR] + --psdir=DIR ps documentation [DOCDIR] +_ACEOF + + cat <<\_ACEOF +_ACEOF +fi + +if test -n "$ac_init_help"; then + case $ac_init_help in + short | recursive ) echo "Configuration of sgtsnepi version-1.0:";; + esac + cat <<\_ACEOF + +Optional Features: + --disable-option-checking ignore unrecognized --enable/--with options + --disable-FEATURE do not include FEATURE (same as --enable-FEATURE=no) + --enable-FEATURE[=ARG] include FEATURE [ARG=yes] + --enable-matlab Build MATLAB interface. + +Some influential environment variables: + CXX C++ compiler command + CXXFLAGS C++ compiler flags + LDFLAGS linker flags, e.g. -L<lib dir> if you have libraries in a + nonstandard directory <lib dir> + LIBS libraries to pass to the linker, e.g. -l<library> + CPPFLAGS (Objective) C/C++ preprocessor flags, e.g. 
-I<include dir> if + you have headers in a nonstandard directory <include dir> + CXXCPP C++ preprocessor + +Use these variables to override the choices made by `configure' or to help +it to find libraries and programs with nonstandard names/locations. + +Report bugs to the package provider. +_ACEOF +ac_status=$? +fi + +if test "$ac_init_help" = "recursive"; then + # If there are subdirs, report their specific --help. + for ac_dir in : $ac_subdirs_all; do test "x$ac_dir" = x: && continue + test -d "$ac_dir" || + { cd "$srcdir" && ac_pwd=`pwd` && srcdir=. && test -d "$ac_dir"; } || + continue + ac_builddir=. + +case "$ac_dir" in +.) ac_dir_suffix= ac_top_builddir_sub=. ac_top_build_prefix= ;; +*) + ac_dir_suffix=/`$as_echo "$ac_dir" | sed 's|^\.[\\/]||'` + # A ".." for each directory in $ac_dir_suffix. + ac_top_builddir_sub=`$as_echo "$ac_dir_suffix" | sed 's|/[^\\/]*|/..|g;s|/||'` + case $ac_top_builddir_sub in + "") ac_top_builddir_sub=. ac_top_build_prefix= ;; + *) ac_top_build_prefix=$ac_top_builddir_sub/ ;; + esac ;; +esac +ac_abs_top_builddir=$ac_pwd +ac_abs_builddir=$ac_pwd$ac_dir_suffix +# for backward compatibility: +ac_top_builddir=$ac_top_build_prefix + +case $srcdir in + .) # We are building in place. + ac_srcdir=. + ac_top_srcdir=$ac_top_builddir_sub + ac_abs_top_srcdir=$ac_pwd ;; + [\\/]* | ?:[\\/]* ) # Absolute name. + ac_srcdir=$srcdir$ac_dir_suffix; + ac_top_srcdir=$srcdir + ac_abs_top_srcdir=$srcdir ;; + *) # Relative name. + ac_srcdir=$ac_top_build_prefix$srcdir$ac_dir_suffix + ac_top_srcdir=$ac_top_build_prefix$srcdir + ac_abs_top_srcdir=$ac_pwd/$srcdir ;; +esac +ac_abs_srcdir=$ac_abs_top_srcdir$ac_dir_suffix + + cd "$ac_dir" || { ac_status=$?; continue; } + # Check for guested configure. 
+ if test -f "$ac_srcdir/configure.gnu"; then + echo && + $SHELL "$ac_srcdir/configure.gnu" --help=recursive + elif test -f "$ac_srcdir/configure"; then + echo && + $SHELL "$ac_srcdir/configure" --help=recursive + else + $as_echo "$as_me: WARNING: no configuration information is in $ac_dir" >&2 + fi || ac_status=$? + cd "$ac_pwd" || { ac_status=$?; break; } + done +fi + +test -n "$ac_init_help" && exit $ac_status +if $ac_init_version; then + cat <<\_ACEOF +sgtsnepi configure version-1.0 +generated by GNU Autoconf 2.69 + +Copyright (C) 2012 Free Software Foundation, Inc. +This configure script is free software; the Free Software Foundation +gives unlimited permission to copy, distribute and modify it. +_ACEOF + exit +fi + +## ------------------------ ## +## Autoconf initialization. ## +## ------------------------ ## + +# ac_fn_cxx_try_compile LINENO +# ---------------------------- +# Try to compile conftest.$ac_ext, and return whether this succeeded. +ac_fn_cxx_try_compile () +{ + as_lineno=${as_lineno-"$1"} as_lineno_stack=as_lineno_stack=$as_lineno_stack + rm -f conftest.$ac_objext + if { { ac_try="$ac_compile" +case "(($ac_try" in + *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;; + *) ac_try_echo=$ac_try;; +esac +eval ac_try_echo="\"\$as_me:${as_lineno-$LINENO}: $ac_try_echo\"" +$as_echo "$ac_try_echo"; } >&5 + (eval "$ac_compile") 2>conftest.err + ac_status=$? + if test -s conftest.err; then + grep -v '^ *+' conftest.err >conftest.er1 + cat conftest.er1 >&5 + mv -f conftest.er1 conftest.err + fi + $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5 + test $ac_status = 0; } && { + test -z "$ac_cxx_werror_flag" || + test ! 
-s conftest.err + } && test -s conftest.$ac_objext; then : + ac_retval=0 +else + $as_echo "$as_me: failed program was:" >&5 +sed 's/^/| /' conftest.$ac_ext >&5 + + ac_retval=1 +fi + eval $as_lineno_stack; ${as_lineno_stack:+:} unset as_lineno + as_fn_set_status $ac_retval + +} # ac_fn_cxx_try_compile + +# ac_fn_cxx_try_cpp LINENO +# ------------------------ +# Try to preprocess conftest.$ac_ext, and return whether this succeeded. +ac_fn_cxx_try_cpp () +{ + as_lineno=${as_lineno-"$1"} as_lineno_stack=as_lineno_stack=$as_lineno_stack + if { { ac_try="$ac_cpp conftest.$ac_ext" +case "(($ac_try" in + *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;; + *) ac_try_echo=$ac_try;; +esac +eval ac_try_echo="\"\$as_me:${as_lineno-$LINENO}: $ac_try_echo\"" +$as_echo "$ac_try_echo"; } >&5 + (eval "$ac_cpp conftest.$ac_ext") 2>conftest.err + ac_status=$? + if test -s conftest.err; then + grep -v '^ *+' conftest.err >conftest.er1 + cat conftest.er1 >&5 + mv -f conftest.er1 conftest.err + fi + $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5 + test $ac_status = 0; } > conftest.i && { + test -z "$ac_cxx_preproc_warn_flag$ac_cxx_werror_flag" || + test ! -s conftest.err + }; then : + ac_retval=0 +else + $as_echo "$as_me: failed program was:" >&5 +sed 's/^/| /' conftest.$ac_ext >&5 + + ac_retval=1 +fi + eval $as_lineno_stack; ${as_lineno_stack:+:} unset as_lineno + as_fn_set_status $ac_retval + +} # ac_fn_cxx_try_cpp + +# ac_fn_cxx_check_header_mongrel LINENO HEADER VAR INCLUDES +# --------------------------------------------------------- +# Tests whether HEADER exists, giving a warning if it cannot be compiled using +# the include files in INCLUDES and setting the cache variable VAR +# accordingly. +ac_fn_cxx_check_header_mongrel () +{ + as_lineno=${as_lineno-"$1"} as_lineno_stack=as_lineno_stack=$as_lineno_stack + if eval \${$3+:} false; then : + { $as_echo "$as_me:${as_lineno-$LINENO}: checking for $2" >&5 +$as_echo_n "checking for $2... 
" >&6; } +if eval \${$3+:} false; then : + $as_echo_n "(cached) " >&6 +fi +eval ac_res=\$$3 + { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_res" >&5 +$as_echo "$ac_res" >&6; } +else + # Is the header compilable? +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking $2 usability" >&5 +$as_echo_n "checking $2 usability... " >&6; } +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +$4 +#include <$2> +_ACEOF +if ac_fn_cxx_try_compile "$LINENO"; then : + ac_header_compiler=yes +else + ac_header_compiler=no +fi +rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_header_compiler" >&5 +$as_echo "$ac_header_compiler" >&6; } + +# Is the header present? +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking $2 presence" >&5 +$as_echo_n "checking $2 presence... " >&6; } +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include <$2> +_ACEOF +if ac_fn_cxx_try_cpp "$LINENO"; then : + ac_header_preproc=yes +else + ac_header_preproc=no +fi +rm -f conftest.err conftest.i conftest.$ac_ext +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_header_preproc" >&5 +$as_echo "$ac_header_preproc" >&6; } + +# So? What about this header? +case $ac_header_compiler:$ac_header_preproc:$ac_cxx_preproc_warn_flag in #(( + yes:no: ) + { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: $2: accepted by the compiler, rejected by the preprocessor!" >&5 +$as_echo "$as_me: WARNING: $2: accepted by the compiler, rejected by the preprocessor!" >&2;} + { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: $2: proceeding with the compiler's result" >&5 +$as_echo "$as_me: WARNING: $2: proceeding with the compiler's result" >&2;} + ;; + no:yes:* ) + { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: $2: present but cannot be compiled" >&5 +$as_echo "$as_me: WARNING: $2: present but cannot be compiled" >&2;} + { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: $2: check for missing prerequisite headers?" 
>&5 +$as_echo "$as_me: WARNING: $2: check for missing prerequisite headers?" >&2;} + { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: $2: see the Autoconf documentation" >&5 +$as_echo "$as_me: WARNING: $2: see the Autoconf documentation" >&2;} + { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: $2: section \"Present But Cannot Be Compiled\"" >&5 +$as_echo "$as_me: WARNING: $2: section \"Present But Cannot Be Compiled\"" >&2;} + { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: $2: proceeding with the compiler's result" >&5 +$as_echo "$as_me: WARNING: $2: proceeding with the compiler's result" >&2;} + ;; +esac + { $as_echo "$as_me:${as_lineno-$LINENO}: checking for $2" >&5 +$as_echo_n "checking for $2... " >&6; } +if eval \${$3+:} false; then : + $as_echo_n "(cached) " >&6 +else + eval "$3=\$ac_header_compiler" +fi +eval ac_res=\$$3 + { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_res" >&5 +$as_echo "$ac_res" >&6; } +fi + eval $as_lineno_stack; ${as_lineno_stack:+:} unset as_lineno + +} # ac_fn_cxx_check_header_mongrel + +# ac_fn_cxx_try_run LINENO +# ------------------------ +# Try to link conftest.$ac_ext, and return whether this succeeded. Assumes +# that executables *can* be run. +ac_fn_cxx_try_run () +{ + as_lineno=${as_lineno-"$1"} as_lineno_stack=as_lineno_stack=$as_lineno_stack + if { { ac_try="$ac_link" +case "(($ac_try" in + *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;; + *) ac_try_echo=$ac_try;; +esac +eval ac_try_echo="\"\$as_me:${as_lineno-$LINENO}: $ac_try_echo\"" +$as_echo "$ac_try_echo"; } >&5 + (eval "$ac_link") 2>&5 + ac_status=$? + $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5 + test $ac_status = 0; } && { ac_try='./conftest$ac_exeext' + { { case "(($ac_try" in + *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;; + *) ac_try_echo=$ac_try;; +esac +eval ac_try_echo="\"\$as_me:${as_lineno-$LINENO}: $ac_try_echo\"" +$as_echo "$ac_try_echo"; } >&5 + (eval "$ac_try") 2>&5 + ac_status=$? + $as_echo "$as_me:${as_lineno-$LINENO}: \$? 
= $ac_status" >&5 + test $ac_status = 0; }; }; then : + ac_retval=0 +else + $as_echo "$as_me: program exited with status $ac_status" >&5 + $as_echo "$as_me: failed program was:" >&5 +sed 's/^/| /' conftest.$ac_ext >&5 + + ac_retval=$ac_status +fi + rm -rf conftest.dSYM conftest_ipa8_conftest.oo + eval $as_lineno_stack; ${as_lineno_stack:+:} unset as_lineno + as_fn_set_status $ac_retval + +} # ac_fn_cxx_try_run + +# ac_fn_cxx_check_header_compile LINENO HEADER VAR INCLUDES +# --------------------------------------------------------- +# Tests whether HEADER exists and can be compiled using the include files in +# INCLUDES, setting the cache variable VAR accordingly. +ac_fn_cxx_check_header_compile () +{ + as_lineno=${as_lineno-"$1"} as_lineno_stack=as_lineno_stack=$as_lineno_stack + { $as_echo "$as_me:${as_lineno-$LINENO}: checking for $2" >&5 +$as_echo_n "checking for $2... " >&6; } +if eval \${$3+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +$4 +#include <$2> +_ACEOF +if ac_fn_cxx_try_compile "$LINENO"; then : + eval "$3=yes" +else + eval "$3=no" +fi +rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext +fi +eval ac_res=\$$3 + { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_res" >&5 +$as_echo "$ac_res" >&6; } + eval $as_lineno_stack; ${as_lineno_stack:+:} unset as_lineno + +} # ac_fn_cxx_check_header_compile + +# ac_fn_cxx_try_link LINENO +# ------------------------- +# Try to link conftest.$ac_ext, and return whether this succeeded. +ac_fn_cxx_try_link () +{ + as_lineno=${as_lineno-"$1"} as_lineno_stack=as_lineno_stack=$as_lineno_stack + rm -f conftest.$ac_objext conftest$ac_exeext + if { { ac_try="$ac_link" +case "(($ac_try" in + *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;; + *) ac_try_echo=$ac_try;; +esac +eval ac_try_echo="\"\$as_me:${as_lineno-$LINENO}: $ac_try_echo\"" +$as_echo "$ac_try_echo"; } >&5 + (eval "$ac_link") 2>conftest.err + ac_status=$? 
+ if test -s conftest.err; then + grep -v '^ *+' conftest.err >conftest.er1 + cat conftest.er1 >&5 + mv -f conftest.er1 conftest.err + fi + $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5 + test $ac_status = 0; } && { + test -z "$ac_cxx_werror_flag" || + test ! -s conftest.err + } && test -s conftest$ac_exeext && { + test "$cross_compiling" = yes || + test -x conftest$ac_exeext + }; then : + ac_retval=0 +else + $as_echo "$as_me: failed program was:" >&5 +sed 's/^/| /' conftest.$ac_ext >&5 + + ac_retval=1 +fi + # Delete the IPA/IPO (Inter Procedural Analysis/Optimization) information + # created by the PGI compiler (conftest_ipa8_conftest.oo), as it would + # interfere with the next link command; also delete a directory that is + # left behind by Apple's compiler. We do this before executing the actions. + rm -rf conftest.dSYM conftest_ipa8_conftest.oo + eval $as_lineno_stack; ${as_lineno_stack:+:} unset as_lineno + as_fn_set_status $ac_retval + +} # ac_fn_cxx_try_link +cat >config.log <<_ACEOF +This file contains any messages produced by compilers while +running configure, to aid debugging if configure makes a mistake. + +It was created by sgtsnepi $as_me version-1.0, which was +generated by GNU Autoconf 2.69. Invocation command line was + + $ $0 $@ + +_ACEOF +exec 5>>config.log +{ +cat <<_ASUNAME +## --------- ## +## Platform. 
## +## --------- ## + +hostname = `(hostname || uname -n) 2>/dev/null | sed 1q` +uname -m = `(uname -m) 2>/dev/null || echo unknown` +uname -r = `(uname -r) 2>/dev/null || echo unknown` +uname -s = `(uname -s) 2>/dev/null || echo unknown` +uname -v = `(uname -v) 2>/dev/null || echo unknown` + +/usr/bin/uname -p = `(/usr/bin/uname -p) 2>/dev/null || echo unknown` +/bin/uname -X = `(/bin/uname -X) 2>/dev/null || echo unknown` + +/bin/arch = `(/bin/arch) 2>/dev/null || echo unknown` +/usr/bin/arch -k = `(/usr/bin/arch -k) 2>/dev/null || echo unknown` +/usr/convex/getsysinfo = `(/usr/convex/getsysinfo) 2>/dev/null || echo unknown` +/usr/bin/hostinfo = `(/usr/bin/hostinfo) 2>/dev/null || echo unknown` +/bin/machine = `(/bin/machine) 2>/dev/null || echo unknown` +/usr/bin/oslevel = `(/usr/bin/oslevel) 2>/dev/null || echo unknown` +/bin/universe = `(/bin/universe) 2>/dev/null || echo unknown` + +_ASUNAME + +as_save_IFS=$IFS; IFS=$PATH_SEPARATOR +for as_dir in $PATH +do + IFS=$as_save_IFS + test -z "$as_dir" && as_dir=. + $as_echo "PATH: $as_dir" + done +IFS=$as_save_IFS + +} >&5 + +cat >&5 <<_ACEOF + + +## ----------- ## +## Core tests. ## +## ----------- ## + +_ACEOF + + +# Keep a trace of the command line. +# Strip out --no-create and --no-recursion so they do not pile up. +# Strip out --silent because we don't want to record it for future runs. +# Also quote any args containing shell meta-characters. +# Make two passes to allow for proper duplicate-argument suppression. 
+ac_configure_args= +ac_configure_args0= +ac_configure_args1= +ac_must_keep_next=false +for ac_pass in 1 2 +do + for ac_arg + do + case $ac_arg in + -no-create | --no-c* | -n | -no-recursion | --no-r*) continue ;; + -q | -quiet | --quiet | --quie | --qui | --qu | --q \ + | -silent | --silent | --silen | --sile | --sil) + continue ;; + *\'*) + ac_arg=`$as_echo "$ac_arg" | sed "s/'/'\\\\\\\\''/g"` ;; + esac + case $ac_pass in + 1) as_fn_append ac_configure_args0 " '$ac_arg'" ;; + 2) + as_fn_append ac_configure_args1 " '$ac_arg'" + if test $ac_must_keep_next = true; then + ac_must_keep_next=false # Got value, back to normal. + else + case $ac_arg in + *=* | --config-cache | -C | -disable-* | --disable-* \ + | -enable-* | --enable-* | -gas | --g* | -nfp | --nf* \ + | -q | -quiet | --q* | -silent | --sil* | -v | -verb* \ + | -with-* | --with-* | -without-* | --without-* | --x) + case "$ac_configure_args0 " in + "$ac_configure_args1"*" '$ac_arg' "* ) continue ;; + esac + ;; + -* ) ac_must_keep_next=true ;; + esac + fi + as_fn_append ac_configure_args " '$ac_arg'" + ;; + esac + done +done +{ ac_configure_args0=; unset ac_configure_args0;} +{ ac_configure_args1=; unset ac_configure_args1;} + +# When interrupted or exit'd, cleanup temporary files, and complete +# config.log. We remove comments because anyway the quotes in there +# would cause problems or look ugly. +# WARNING: Use '\'' to represent an apostrophe within the trap. +# WARNING: Do not start the trap code with a newline, due to a FreeBSD 4.0 bug. +trap 'exit_status=$? + # Save into config.log some information that might help in debugging. + { + echo + + $as_echo "## ---------------- ## +## Cache variables. 
## +## ---------------- ##" + echo + # The following way of writing the cache mishandles newlines in values, +( + for ac_var in `(set) 2>&1 | sed -n '\''s/^\([a-zA-Z_][a-zA-Z0-9_]*\)=.*/\1/p'\''`; do + eval ac_val=\$$ac_var + case $ac_val in #( + *${as_nl}*) + case $ac_var in #( + *_cv_*) { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: cache variable $ac_var contains a newline" >&5 +$as_echo "$as_me: WARNING: cache variable $ac_var contains a newline" >&2;} ;; + esac + case $ac_var in #( + _ | IFS | as_nl) ;; #( + BASH_ARGV | BASH_SOURCE) eval $ac_var= ;; #( + *) { eval $ac_var=; unset $ac_var;} ;; + esac ;; + esac + done + (set) 2>&1 | + case $as_nl`(ac_space='\'' '\''; set) 2>&1` in #( + *${as_nl}ac_space=\ *) + sed -n \ + "s/'\''/'\''\\\\'\'''\''/g; + s/^\\([_$as_cr_alnum]*_cv_[_$as_cr_alnum]*\\)=\\(.*\\)/\\1='\''\\2'\''/p" + ;; #( + *) + sed -n "/^[_$as_cr_alnum]*_cv_[_$as_cr_alnum]*=/p" + ;; + esac | + sort +) + echo + + $as_echo "## ----------------- ## +## Output variables. ## +## ----------------- ##" + echo + for ac_var in $ac_subst_vars + do + eval ac_val=\$$ac_var + case $ac_val in + *\'\''*) ac_val=`$as_echo "$ac_val" | sed "s/'\''/'\''\\\\\\\\'\'''\''/g"`;; + esac + $as_echo "$ac_var='\''$ac_val'\''" + done | sort + echo + + if test -n "$ac_subst_files"; then + $as_echo "## ------------------- ## +## File substitutions. ## +## ------------------- ##" + echo + for ac_var in $ac_subst_files + do + eval ac_val=\$$ac_var + case $ac_val in + *\'\''*) ac_val=`$as_echo "$ac_val" | sed "s/'\''/'\''\\\\\\\\'\'''\''/g"`;; + esac + $as_echo "$ac_var='\''$ac_val'\''" + done | sort + echo + fi + + if test -s confdefs.h; then + $as_echo "## ----------- ## +## confdefs.h. 
## +## ----------- ##" + echo + cat confdefs.h + echo + fi + test "$ac_signal" != 0 && + $as_echo "$as_me: caught signal $ac_signal" + $as_echo "$as_me: exit $exit_status" + } >&5 + rm -f core *.core core.conftest.* && + rm -f -r conftest* confdefs* conf$$* $ac_clean_files && + exit $exit_status +' 0 +for ac_signal in 1 2 13 15; do + trap 'ac_signal='$ac_signal'; as_fn_exit 1' $ac_signal +done +ac_signal=0 + +# confdefs.h avoids OS command line length limits that DEFS can exceed. +rm -f -r conftest* confdefs.h + +$as_echo "/* confdefs.h */" > confdefs.h + +# Predefined preprocessor variables. + +cat >>confdefs.h <<_ACEOF +#define PACKAGE_NAME "$PACKAGE_NAME" +_ACEOF + +cat >>confdefs.h <<_ACEOF +#define PACKAGE_TARNAME "$PACKAGE_TARNAME" +_ACEOF + +cat >>confdefs.h <<_ACEOF +#define PACKAGE_VERSION "$PACKAGE_VERSION" +_ACEOF + +cat >>confdefs.h <<_ACEOF +#define PACKAGE_STRING "$PACKAGE_STRING" +_ACEOF + +cat >>confdefs.h <<_ACEOF +#define PACKAGE_BUGREPORT "$PACKAGE_BUGREPORT" +_ACEOF + +cat >>confdefs.h <<_ACEOF +#define PACKAGE_URL "$PACKAGE_URL" +_ACEOF + + +# Let the site file select an alternate cache file if it wants to. +# Prefer an explicitly selected file to automatically selected ones. +ac_site_file1=NONE +ac_site_file2=NONE +if test -n "$CONFIG_SITE"; then + # We do not want a PATH search for config.site. 
+ case $CONFIG_SITE in #(( + -*) ac_site_file1=./$CONFIG_SITE;; + */*) ac_site_file1=$CONFIG_SITE;; + *) ac_site_file1=./$CONFIG_SITE;; + esac +elif test "x$prefix" != xNONE; then + ac_site_file1=$prefix/share/config.site + ac_site_file2=$prefix/etc/config.site +else + ac_site_file1=$ac_default_prefix/share/config.site + ac_site_file2=$ac_default_prefix/etc/config.site +fi +for ac_site_file in "$ac_site_file1" "$ac_site_file2" +do + test "x$ac_site_file" = xNONE && continue + if test /dev/null != "$ac_site_file" && test -r "$ac_site_file"; then + { $as_echo "$as_me:${as_lineno-$LINENO}: loading site script $ac_site_file" >&5 +$as_echo "$as_me: loading site script $ac_site_file" >&6;} + sed 's/^/| /' "$ac_site_file" >&5 + . "$ac_site_file" \ + || { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5 +$as_echo "$as_me: error: in \`$ac_pwd':" >&2;} +as_fn_error $? "failed to load site script $ac_site_file +See \`config.log' for more details" "$LINENO" 5; } + fi +done + +if test -r "$cache_file"; then + # Some versions of bash will fail to source /dev/null (special files + # actually), so we avoid doing that. DJGPP emulates it as a regular file. + if test /dev/null != "$cache_file" && test -f "$cache_file"; then + { $as_echo "$as_me:${as_lineno-$LINENO}: loading cache $cache_file" >&5 +$as_echo "$as_me: loading cache $cache_file" >&6;} + case $cache_file in + [\\/]* | ?:[\\/]* ) . "$cache_file";; + *) . "./$cache_file";; + esac + fi +else + { $as_echo "$as_me:${as_lineno-$LINENO}: creating cache $cache_file" >&5 +$as_echo "$as_me: creating cache $cache_file" >&6;} + >$cache_file +fi + +# Check that the precious variables saved in the cache have kept the same +# value. 
+ac_cache_corrupted=false +for ac_var in $ac_precious_vars; do + eval ac_old_set=\$ac_cv_env_${ac_var}_set + eval ac_new_set=\$ac_env_${ac_var}_set + eval ac_old_val=\$ac_cv_env_${ac_var}_value + eval ac_new_val=\$ac_env_${ac_var}_value + case $ac_old_set,$ac_new_set in + set,) + { $as_echo "$as_me:${as_lineno-$LINENO}: error: \`$ac_var' was set to \`$ac_old_val' in the previous run" >&5 +$as_echo "$as_me: error: \`$ac_var' was set to \`$ac_old_val' in the previous run" >&2;} + ac_cache_corrupted=: ;; + ,set) + { $as_echo "$as_me:${as_lineno-$LINENO}: error: \`$ac_var' was not set in the previous run" >&5 +$as_echo "$as_me: error: \`$ac_var' was not set in the previous run" >&2;} + ac_cache_corrupted=: ;; + ,);; + *) + if test "x$ac_old_val" != "x$ac_new_val"; then + # differences in whitespace do not lead to failure. + ac_old_val_w=`echo x $ac_old_val` + ac_new_val_w=`echo x $ac_new_val` + if test "$ac_old_val_w" != "$ac_new_val_w"; then + { $as_echo "$as_me:${as_lineno-$LINENO}: error: \`$ac_var' has changed since the previous run:" >&5 +$as_echo "$as_me: error: \`$ac_var' has changed since the previous run:" >&2;} + ac_cache_corrupted=: + else + { $as_echo "$as_me:${as_lineno-$LINENO}: warning: ignoring whitespace changes in \`$ac_var' since the previous run:" >&5 +$as_echo "$as_me: warning: ignoring whitespace changes in \`$ac_var' since the previous run:" >&2;} + eval $ac_var=\$ac_old_val + fi + { $as_echo "$as_me:${as_lineno-$LINENO}: former value: \`$ac_old_val'" >&5 +$as_echo "$as_me: former value: \`$ac_old_val'" >&2;} + { $as_echo "$as_me:${as_lineno-$LINENO}: current value: \`$ac_new_val'" >&5 +$as_echo "$as_me: current value: \`$ac_new_val'" >&2;} + fi;; + esac + # Pass precious variables to config.status. + if test "$ac_new_set" = set; then + case $ac_new_val in + *\'*) ac_arg=$ac_var=`$as_echo "$ac_new_val" | sed "s/'/'\\\\\\\\''/g"` ;; + *) ac_arg=$ac_var=$ac_new_val ;; + esac + case " $ac_configure_args " in + *" '$ac_arg' "*) ;; # Avoid dups. 
Use of quotes ensures accuracy. + *) as_fn_append ac_configure_args " '$ac_arg'" ;; + esac + fi +done +if $ac_cache_corrupted; then + { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5 +$as_echo "$as_me: error: in \`$ac_pwd':" >&2;} + { $as_echo "$as_me:${as_lineno-$LINENO}: error: changes in the environment can compromise the build" >&5 +$as_echo "$as_me: error: changes in the environment can compromise the build" >&2;} + as_fn_error $? "run \`make distclean' and/or \`rm $cache_file' and start over" "$LINENO" 5 +fi +## -------------------- ## +## Main body of script. ## +## -------------------- ## + +ac_ext=c +ac_cpp='$CPP $CPPFLAGS' +ac_compile='$CC -c $CFLAGS $CPPFLAGS conftest.$ac_ext >&5' +ac_link='$CC -o conftest$ac_exeext $CFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5' +ac_compiler_gnu=$ac_cv_c_compiler_gnu + + +{ $as_echo "$as_me:${as_lineno-$LINENO}: SG-t-SNE-Pi." >&5 +$as_echo "$as_me: SG-t-SNE-Pi." >&6;} + +printf "\nSetup language to C++\n\n" + +ac_ext=cpp +ac_cpp='$CXXCPP $CPPFLAGS' +ac_compile='$CXX -c $CXXFLAGS $CPPFLAGS conftest.$ac_ext >&5' +ac_link='$CXX -o conftest$ac_exeext $CXXFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5' +ac_compiler_gnu=$ac_cv_cxx_compiler_gnu + + +: ${CXXFLAGS="-O2 -fPIC -m64 -std=c++11 -mtune=native"} + +ac_ext=cpp +ac_cpp='$CXXCPP $CPPFLAGS' +ac_compile='$CXX -c $CXXFLAGS $CPPFLAGS conftest.$ac_ext >&5' +ac_link='$CXX -o conftest$ac_exeext $CXXFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5' +ac_compiler_gnu=$ac_cv_cxx_compiler_gnu +if test -z "$CXX"; then + if test -n "$CCC"; then + CXX=$CCC + else + if test -n "$ac_tool_prefix"; then + for ac_prog in g++ c++ gpp aCC CC cxx cc++ cl.exe FCC KCC RCC xlC_r xlC + do + # Extract the first word of "$ac_tool_prefix$ac_prog", so it can be a program name with args. +set dummy $ac_tool_prefix$ac_prog; ac_word=$2 +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for $ac_word" >&5 +$as_echo_n "checking for $ac_word... 
" >&6; } +if ${ac_cv_prog_CXX+:} false; then : + $as_echo_n "(cached) " >&6 +else + if test -n "$CXX"; then + ac_cv_prog_CXX="$CXX" # Let the user override the test. +else +as_save_IFS=$IFS; IFS=$PATH_SEPARATOR +for as_dir in $PATH +do + IFS=$as_save_IFS + test -z "$as_dir" && as_dir=. + for ac_exec_ext in '' $ac_executable_extensions; do + if as_fn_executable_p "$as_dir/$ac_word$ac_exec_ext"; then + ac_cv_prog_CXX="$ac_tool_prefix$ac_prog" + $as_echo "$as_me:${as_lineno-$LINENO}: found $as_dir/$ac_word$ac_exec_ext" >&5 + break 2 + fi +done + done +IFS=$as_save_IFS + +fi +fi +CXX=$ac_cv_prog_CXX +if test -n "$CXX"; then + { $as_echo "$as_me:${as_lineno-$LINENO}: result: $CXX" >&5 +$as_echo "$CXX" >&6; } +else + { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5 +$as_echo "no" >&6; } +fi + + + test -n "$CXX" && break + done +fi +if test -z "$CXX"; then + ac_ct_CXX=$CXX + for ac_prog in g++ c++ gpp aCC CC cxx cc++ cl.exe FCC KCC RCC xlC_r xlC +do + # Extract the first word of "$ac_prog", so it can be a program name with args. +set dummy $ac_prog; ac_word=$2 +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for $ac_word" >&5 +$as_echo_n "checking for $ac_word... " >&6; } +if ${ac_cv_prog_ac_ct_CXX+:} false; then : + $as_echo_n "(cached) " >&6 +else + if test -n "$ac_ct_CXX"; then + ac_cv_prog_ac_ct_CXX="$ac_ct_CXX" # Let the user override the test. +else +as_save_IFS=$IFS; IFS=$PATH_SEPARATOR +for as_dir in $PATH +do + IFS=$as_save_IFS + test -z "$as_dir" && as_dir=. 
+ for ac_exec_ext in '' $ac_executable_extensions; do + if as_fn_executable_p "$as_dir/$ac_word$ac_exec_ext"; then + ac_cv_prog_ac_ct_CXX="$ac_prog" + $as_echo "$as_me:${as_lineno-$LINENO}: found $as_dir/$ac_word$ac_exec_ext" >&5 + break 2 + fi +done + done +IFS=$as_save_IFS + +fi +fi +ac_ct_CXX=$ac_cv_prog_ac_ct_CXX +if test -n "$ac_ct_CXX"; then + { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_ct_CXX" >&5 +$as_echo "$ac_ct_CXX" >&6; } +else + { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5 +$as_echo "no" >&6; } +fi + + + test -n "$ac_ct_CXX" && break +done + + if test "x$ac_ct_CXX" = x; then + CXX="g++" + else + case $cross_compiling:$ac_tool_warned in +yes:) +{ $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: using cross tools not prefixed with host triplet" >&5 +$as_echo "$as_me: WARNING: using cross tools not prefixed with host triplet" >&2;} +ac_tool_warned=yes ;; +esac + CXX=$ac_ct_CXX + fi +fi + + fi +fi +# Provide some information about the compiler. +$as_echo "$as_me:${as_lineno-$LINENO}: checking for C++ compiler version" >&5 +set X $ac_compile +ac_compiler=$2 +for ac_option in --version -v -V -qversion; do + { { ac_try="$ac_compiler $ac_option >&5" +case "(($ac_try" in + *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;; + *) ac_try_echo=$ac_try;; +esac +eval ac_try_echo="\"\$as_me:${as_lineno-$LINENO}: $ac_try_echo\"" +$as_echo "$ac_try_echo"; } >&5 + (eval "$ac_compiler $ac_option >&5") 2>conftest.err + ac_status=$? + if test -s conftest.err; then + sed '10a\ +... rest of stderr output deleted ... + 10q' conftest.err >conftest.er1 + cat conftest.er1 >&5 + fi + rm -f conftest.er1 conftest.err + $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5 + test $ac_status = 0; } +done + +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. 
*/ + +int +main () +{ + + ; + return 0; +} +_ACEOF +ac_clean_files_save=$ac_clean_files +ac_clean_files="$ac_clean_files a.out a.out.dSYM a.exe b.out" +# Try to create an executable without -o first, disregard a.out. +# It will help us diagnose broken compilers, and finding out an intuition +# of exeext. +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether the C++ compiler works" >&5 +$as_echo_n "checking whether the C++ compiler works... " >&6; } +ac_link_default=`$as_echo "$ac_link" | sed 's/ -o *conftest[^ ]*//'` + +# The possible output files: +ac_files="a.out conftest.exe conftest a.exe a_out.exe b.out conftest.*" + +ac_rmfiles= +for ac_file in $ac_files +do + case $ac_file in + *.$ac_ext | *.xcoff | *.tds | *.d | *.pdb | *.xSYM | *.bb | *.bbg | *.map | *.inf | *.dSYM | *.o | *.obj ) ;; + * ) ac_rmfiles="$ac_rmfiles $ac_file";; + esac +done +rm -f $ac_rmfiles + +if { { ac_try="$ac_link_default" +case "(($ac_try" in + *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;; + *) ac_try_echo=$ac_try;; +esac +eval ac_try_echo="\"\$as_me:${as_lineno-$LINENO}: $ac_try_echo\"" +$as_echo "$ac_try_echo"; } >&5 + (eval "$ac_link_default") 2>&5 + ac_status=$? + $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5 + test $ac_status = 0; }; then : + # Autoconf-2.13 could set the ac_cv_exeext variable to `no'. +# So ignore a value of `no', otherwise this would lead to `EXEEXT = no' +# in a Makefile. We should not override ac_cv_exeext if it was cached, +# so that the user can short-circuit this test for compilers unknown to +# Autoconf. +for ac_file in $ac_files '' +do + test -f "$ac_file" || continue + case $ac_file in + *.$ac_ext | *.xcoff | *.tds | *.d | *.pdb | *.xSYM | *.bb | *.bbg | *.map | *.inf | *.dSYM | *.o | *.obj ) + ;; + [ab].out ) + # We found the default executable, but exeext='' is most + # certainly right. 
+ break;; + *.* ) + if test "${ac_cv_exeext+set}" = set && test "$ac_cv_exeext" != no; + then :; else + ac_cv_exeext=`expr "$ac_file" : '[^.]*\(\..*\)'` + fi + # We set ac_cv_exeext here because the later test for it is not + # safe: cross compilers may not add the suffix if given an `-o' + # argument, so we may need to know it at that point already. + # Even if this section looks crufty: it has the advantage of + # actually working. + break;; + * ) + break;; + esac +done +test "$ac_cv_exeext" = no && ac_cv_exeext= + +else + ac_file='' +fi +if test -z "$ac_file"; then : + { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5 +$as_echo "no" >&6; } +$as_echo "$as_me: failed program was:" >&5 +sed 's/^/| /' conftest.$ac_ext >&5 + +{ { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5 +$as_echo "$as_me: error: in \`$ac_pwd':" >&2;} +as_fn_error 77 "C++ compiler cannot create executables +See \`config.log' for more details" "$LINENO" 5; } +else + { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5 +$as_echo "yes" >&6; } +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for C++ compiler default output file name" >&5 +$as_echo_n "checking for C++ compiler default output file name... " >&6; } +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_file" >&5 +$as_echo "$ac_file" >&6; } +ac_exeext=$ac_cv_exeext + +rm -f -r a.out a.out.dSYM a.exe conftest$ac_cv_exeext b.out +ac_clean_files=$ac_clean_files_save +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for suffix of executables" >&5 +$as_echo_n "checking for suffix of executables... " >&6; } +if { { ac_try="$ac_link" +case "(($ac_try" in + *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;; + *) ac_try_echo=$ac_try;; +esac +eval ac_try_echo="\"\$as_me:${as_lineno-$LINENO}: $ac_try_echo\"" +$as_echo "$ac_try_echo"; } >&5 + (eval "$ac_link") 2>&5 + ac_status=$? + $as_echo "$as_me:${as_lineno-$LINENO}: \$? 
= $ac_status" >&5 + test $ac_status = 0; }; then : + # If both `conftest.exe' and `conftest' are `present' (well, observable) +# catch `conftest.exe'. For instance with Cygwin, `ls conftest' will +# work properly (i.e., refer to `conftest.exe'), while it won't with +# `rm'. +for ac_file in conftest.exe conftest conftest.*; do + test -f "$ac_file" || continue + case $ac_file in + *.$ac_ext | *.xcoff | *.tds | *.d | *.pdb | *.xSYM | *.bb | *.bbg | *.map | *.inf | *.dSYM | *.o | *.obj ) ;; + *.* ) ac_cv_exeext=`expr "$ac_file" : '[^.]*\(\..*\)'` + break;; + * ) break;; + esac +done +else + { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5 +$as_echo "$as_me: error: in \`$ac_pwd':" >&2;} +as_fn_error $? "cannot compute suffix of executables: cannot compile and link +See \`config.log' for more details" "$LINENO" 5; } +fi +rm -f conftest conftest$ac_cv_exeext +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_exeext" >&5 +$as_echo "$ac_cv_exeext" >&6; } + +rm -f conftest.$ac_ext +EXEEXT=$ac_cv_exeext +ac_exeext=$EXEEXT +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include <stdio.h> +int +main () +{ +FILE *f = fopen ("conftest.out", "w"); + return ferror (f) || fclose (f) != 0; + + ; + return 0; +} +_ACEOF +ac_clean_files="$ac_clean_files conftest.out" +# Check that the compiler produces executables we can run. If not, either +# the compiler is broken, or we cross compile. +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether we are cross compiling" >&5 +$as_echo_n "checking whether we are cross compiling... " >&6; } +if test "$cross_compiling" != yes; then + { { ac_try="$ac_link" +case "(($ac_try" in + *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;; + *) ac_try_echo=$ac_try;; +esac +eval ac_try_echo="\"\$as_me:${as_lineno-$LINENO}: $ac_try_echo\"" +$as_echo "$ac_try_echo"; } >&5 + (eval "$ac_link") 2>&5 + ac_status=$? + $as_echo "$as_me:${as_lineno-$LINENO}: \$?
= $ac_status" >&5 + test $ac_status = 0; } + if { ac_try='./conftest$ac_cv_exeext' + { { case "(($ac_try" in + *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;; + *) ac_try_echo=$ac_try;; +esac +eval ac_try_echo="\"\$as_me:${as_lineno-$LINENO}: $ac_try_echo\"" +$as_echo "$ac_try_echo"; } >&5 + (eval "$ac_try") 2>&5 + ac_status=$? + $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5 + test $ac_status = 0; }; }; then + cross_compiling=no + else + if test "$cross_compiling" = maybe; then + cross_compiling=yes + else + { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5 +$as_echo "$as_me: error: in \`$ac_pwd':" >&2;} +as_fn_error $? "cannot run C++ compiled programs. +If you meant to cross compile, use \`--host'. +See \`config.log' for more details" "$LINENO" 5; } + fi + fi +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $cross_compiling" >&5 +$as_echo "$cross_compiling" >&6; } + +rm -f conftest.$ac_ext conftest$ac_cv_exeext conftest.out +ac_clean_files=$ac_clean_files_save +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for suffix of object files" >&5 +$as_echo_n "checking for suffix of object files... " >&6; } +if ${ac_cv_objext+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +int +main () +{ + + ; + return 0; +} +_ACEOF +rm -f conftest.o conftest.obj +if { { ac_try="$ac_compile" +case "(($ac_try" in + *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;; + *) ac_try_echo=$ac_try;; +esac +eval ac_try_echo="\"\$as_me:${as_lineno-$LINENO}: $ac_try_echo\"" +$as_echo "$ac_try_echo"; } >&5 + (eval "$ac_compile") 2>&5 + ac_status=$? + $as_echo "$as_me:${as_lineno-$LINENO}: \$? 
= $ac_status" >&5 + test $ac_status = 0; }; then : + for ac_file in conftest.o conftest.obj conftest.*; do + test -f "$ac_file" || continue; + case $ac_file in + *.$ac_ext | *.xcoff | *.tds | *.d | *.pdb | *.xSYM | *.bb | *.bbg | *.map | *.inf | *.dSYM ) ;; + *) ac_cv_objext=`expr "$ac_file" : '.*\.\(.*\)'` + break;; + esac +done +else + $as_echo "$as_me: failed program was:" >&5 +sed 's/^/| /' conftest.$ac_ext >&5 + +{ { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5 +$as_echo "$as_me: error: in \`$ac_pwd':" >&2;} +as_fn_error $? "cannot compute suffix of object files: cannot compile +See \`config.log' for more details" "$LINENO" 5; } +fi +rm -f conftest.$ac_cv_objext conftest.$ac_ext +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_objext" >&5 +$as_echo "$ac_cv_objext" >&6; } +OBJEXT=$ac_cv_objext +ac_objext=$OBJEXT +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether we are using the GNU C++ compiler" >&5 +$as_echo_n "checking whether we are using the GNU C++ compiler... " >&6; } +if ${ac_cv_cxx_compiler_gnu+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +int +main () +{ +#ifndef __GNUC__ + choke me +#endif + + ; + return 0; +} +_ACEOF +if ac_fn_cxx_try_compile "$LINENO"; then : + ac_compiler_gnu=yes +else + ac_compiler_gnu=no +fi +rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext +ac_cv_cxx_compiler_gnu=$ac_compiler_gnu + +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_cxx_compiler_gnu" >&5 +$as_echo "$ac_cv_cxx_compiler_gnu" >&6; } +if test $ac_compiler_gnu = yes; then + GXX=yes +else + GXX= +fi +ac_test_CXXFLAGS=${CXXFLAGS+set} +ac_save_CXXFLAGS=$CXXFLAGS +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether $CXX accepts -g" >&5 +$as_echo_n "checking whether $CXX accepts -g... 
" >&6; } +if ${ac_cv_prog_cxx_g+:} false; then : + $as_echo_n "(cached) " >&6 +else + ac_save_cxx_werror_flag=$ac_cxx_werror_flag + ac_cxx_werror_flag=yes + ac_cv_prog_cxx_g=no + CXXFLAGS="-g" + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +int +main () +{ + + ; + return 0; +} +_ACEOF +if ac_fn_cxx_try_compile "$LINENO"; then : + ac_cv_prog_cxx_g=yes +else + CXXFLAGS="" + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +int +main () +{ + + ; + return 0; +} +_ACEOF +if ac_fn_cxx_try_compile "$LINENO"; then : + +else + ac_cxx_werror_flag=$ac_save_cxx_werror_flag + CXXFLAGS="-g" + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +int +main () +{ + + ; + return 0; +} +_ACEOF +if ac_fn_cxx_try_compile "$LINENO"; then : + ac_cv_prog_cxx_g=yes +fi +rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext +fi +rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext +fi +rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext + ac_cxx_werror_flag=$ac_save_cxx_werror_flag +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_prog_cxx_g" >&5 +$as_echo "$ac_cv_prog_cxx_g" >&6; } +if test "$ac_test_CXXFLAGS" = set; then + CXXFLAGS=$ac_save_CXXFLAGS +elif test $ac_cv_prog_cxx_g = yes; then + if test "$GXX" = yes; then + CXXFLAGS="-g -O2" + else + CXXFLAGS="-g" + fi +else + if test "$GXX" = yes; then + CXXFLAGS="-O2" + else + CXXFLAGS= + fi +fi +ac_ext=cpp +ac_cpp='$CXXCPP $CPPFLAGS' +ac_compile='$CXX -c $CXXFLAGS $CPPFLAGS conftest.$ac_ext >&5' +ac_link='$CXX -o conftest$ac_exeext $CXXFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5' +ac_compiler_gnu=$ac_cv_cxx_compiler_gnu + + +printf "\nSetup MATLAB linker (for optional MEX wrappers)\n\n" + +: ${MEX="${CXX}"} +: ${MEXFLAGS="${CXXFLAGS}"} +: ${MATLABROOT="/opt/local/matlab"} + +printf "using C++ flags: $CXXFLAGS\n" +printf "variable \$MATLABROOT is set to: $MATLABROOT\n" +printf "variable \$MEX is set to: $MEX\n" +printf "variable 
\$MEXFLAGS is set to: $MEXFLAGS\n" + + + + + + +printf "\nExporting temporary Makefile (in case dependencies are required)" + +ac_config_files="$ac_config_files Makefile" + +cat >confcache <<\_ACEOF +# This file is a shell script that caches the results of configure +# tests run on this system so they can be shared between configure +# scripts and configure runs, see configure's option --config-cache. +# It is not useful on other systems. If it contains results you don't +# want to keep, you may remove or edit it. +# +# config.status only pays attention to the cache file if you give it +# the --recheck option to rerun configure. +# +# `ac_cv_env_foo' variables (set or unset) will be overridden when +# loading this file, other *unset* `ac_cv_foo' will be assigned the +# following values. + +_ACEOF + +# The following way of writing the cache mishandles newlines in values, +# but we know of no workaround that is simple, portable, and efficient. +# So, we kill variables containing newlines. +# Ultrix sh set writes to stderr and can't be redirected directly, +# and sets the high bit in the cache file unless we assign to the vars. +( + for ac_var in `(set) 2>&1 | sed -n 's/^\([a-zA-Z_][a-zA-Z0-9_]*\)=.*/\1/p'`; do + eval ac_val=\$$ac_var + case $ac_val in #( + *${as_nl}*) + case $ac_var in #( + *_cv_*) { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: cache variable $ac_var contains a newline" >&5 +$as_echo "$as_me: WARNING: cache variable $ac_var contains a newline" >&2;} ;; + esac + case $ac_var in #( + _ | IFS | as_nl) ;; #( + BASH_ARGV | BASH_SOURCE) eval $ac_var= ;; #( + *) { eval $ac_var=; unset $ac_var;} ;; + esac ;; + esac + done + + (set) 2>&1 | + case $as_nl`(ac_space=' '; set) 2>&1` in #( + *${as_nl}ac_space=\ *) + # `set' does not quote correctly, so add quotes: double-quote + # substitution turns \\\\ into \\, and sed turns \\ into \. 
+ sed -n \ + "s/'/'\\\\''/g; + s/^\\([_$as_cr_alnum]*_cv_[_$as_cr_alnum]*\\)=\\(.*\\)/\\1='\\2'/p" + ;; #( + *) + # `set' quotes correctly as required by POSIX, so do not add quotes. + sed -n "/^[_$as_cr_alnum]*_cv_[_$as_cr_alnum]*=/p" + ;; + esac | + sort +) | + sed ' + /^ac_cv_env_/b end + t clear + :clear + s/^\([^=]*\)=\(.*[{}].*\)$/test "${\1+set}" = set || &/ + t end + s/^\([^=]*\)=\(.*\)$/\1=${\1=\2}/ + :end' >>confcache +if diff "$cache_file" confcache >/dev/null 2>&1; then :; else + if test -w "$cache_file"; then + if test "x$cache_file" != "x/dev/null"; then + { $as_echo "$as_me:${as_lineno-$LINENO}: updating cache $cache_file" >&5 +$as_echo "$as_me: updating cache $cache_file" >&6;} + if test ! -f "$cache_file" || test -h "$cache_file"; then + cat confcache >"$cache_file" + else + case $cache_file in #( + */* | ?:*) + mv -f confcache "$cache_file"$$ && + mv -f "$cache_file"$$ "$cache_file" ;; #( + *) + mv -f confcache "$cache_file" ;; + esac + fi + fi + else + { $as_echo "$as_me:${as_lineno-$LINENO}: not updating unwritable cache $cache_file" >&5 +$as_echo "$as_me: not updating unwritable cache $cache_file" >&6;} + fi +fi +rm -f confcache + +test "x$prefix" = xNONE && prefix=$ac_default_prefix +# Let make expand exec_prefix. +test "x$exec_prefix" = xNONE && exec_prefix='${prefix}' + +# Transform confdefs.h into DEFS. +# Protect against shell expansion while executing Makefile rules. +# Protect against Makefile macro expansion. +# +# If the first sed substitution is executed (which looks for macros that +# take arguments), then branch to the quote section. Otherwise, +# look for a macro that doesn't take arguments. 
+ac_script=' +:mline +/\\$/{ + N + s,\\\n,, + b mline +} +t clear +:clear +s/^[ ]*#[ ]*define[ ][ ]*\([^ (][^ (]*([^)]*)\)[ ]*\(.*\)/-D\1=\2/g +t quote +s/^[ ]*#[ ]*define[ ][ ]*\([^ ][^ ]*\)[ ]*\(.*\)/-D\1=\2/g +t quote +b any +:quote +s/[ `~#$^&*(){}\\|;'\''"<>?]/\\&/g +s/\[/\\&/g +s/\]/\\&/g +s/\$/$$/g +H +:any +${ + g + s/^\n// + s/\n/ /g + p +} +' +DEFS=`sed -n "$ac_script" confdefs.h` + + +ac_libobjs= +ac_ltlibobjs= +U= +for ac_i in : $LIBOBJS; do test "x$ac_i" = x: && continue + # 1. Remove the extension, and $U if already installed. + ac_script='s/\$U\././;s/\.o$//;s/\.obj$//' + ac_i=`$as_echo "$ac_i" | sed "$ac_script"` + # 2. Prepend LIBOBJDIR. When used with automake>=1.10 LIBOBJDIR + # will be set to the directory where LIBOBJS objects are built. + as_fn_append ac_libobjs " \${LIBOBJDIR}$ac_i\$U.$ac_objext" + as_fn_append ac_ltlibobjs " \${LIBOBJDIR}$ac_i"'$U.lo' +done +LIBOBJS=$ac_libobjs + +LTLIBOBJS=$ac_ltlibobjs + + + +: "${CONFIG_STATUS=./config.status}" +ac_write_fail=0 +ac_clean_files_save=$ac_clean_files +ac_clean_files="$ac_clean_files $CONFIG_STATUS" +{ $as_echo "$as_me:${as_lineno-$LINENO}: creating $CONFIG_STATUS" >&5 +$as_echo "$as_me: creating $CONFIG_STATUS" >&6;} +as_write_fail=0 +cat >$CONFIG_STATUS <<_ASEOF || as_write_fail=1 +#! $SHELL +# Generated by $as_me. +# Run this file to recreate the current configuration. +# Compiler output produced by configure, useful for debugging +# configure, is in config.log if it exists. + +debug=false +ac_cs_recheck=false +ac_cs_silent=false + +SHELL=\${CONFIG_SHELL-$SHELL} +export SHELL +_ASEOF +cat >>$CONFIG_STATUS <<\_ASEOF || as_write_fail=1 +## -------------------- ## +## M4sh Initialization. ## +## -------------------- ## + +# Be more Bourne compatible +DUALCASE=1; export DUALCASE # for MKS sh +if test -n "${ZSH_VERSION+set}" && (emulate sh) >/dev/null 2>&1; then : + emulate sh + NULLCMD=: + # Pre-4.2 versions of Zsh do word splitting on ${1+"$@"}, which + # is contrary to our usage. 
Disable this feature. + alias -g '${1+"$@"}'='"$@"' + setopt NO_GLOB_SUBST +else + case `(set -o) 2>/dev/null` in #( + *posix*) : + set -o posix ;; #( + *) : + ;; +esac +fi + + +as_nl=' +' +export as_nl +# Printing a long string crashes Solaris 7 /usr/bin/printf. +as_echo='\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\' +as_echo=$as_echo$as_echo$as_echo$as_echo$as_echo +as_echo=$as_echo$as_echo$as_echo$as_echo$as_echo$as_echo +# Prefer a ksh shell builtin over an external printf program on Solaris, +# but without wasting forks for bash or zsh. +if test -z "$BASH_VERSION$ZSH_VERSION" \ + && (test "X`print -r -- $as_echo`" = "X$as_echo") 2>/dev/null; then + as_echo='print -r --' + as_echo_n='print -rn --' +elif (test "X`printf %s $as_echo`" = "X$as_echo") 2>/dev/null; then + as_echo='printf %s\n' + as_echo_n='printf %s' +else + if test "X`(/usr/ucb/echo -n -n $as_echo) 2>/dev/null`" = "X-n $as_echo"; then + as_echo_body='eval /usr/ucb/echo -n "$1$as_nl"' + as_echo_n='/usr/ucb/echo -n' + else + as_echo_body='eval expr "X$1" : "X\\(.*\\)"' + as_echo_n_body='eval + arg=$1; + case $arg in #( + *"$as_nl"*) + expr "X$arg" : "X\\(.*\\)$as_nl"; + arg=`expr "X$arg" : ".*$as_nl\\(.*\\)"`;; + esac; + expr "X$arg" : "X\\(.*\\)" | tr -d "$as_nl" + ' + export as_echo_n_body + as_echo_n='sh -c $as_echo_n_body as_echo' + fi + export as_echo_body + as_echo='sh -c $as_echo_body as_echo' +fi + +# The user is always right. +if test "${PATH_SEPARATOR+set}" != set; then + PATH_SEPARATOR=: + (PATH='/bin;/bin'; FPATH=$PATH; sh -c :) >/dev/null 2>&1 && { + (PATH='/bin:/bin'; FPATH=$PATH; sh -c :) >/dev/null 2>&1 || + PATH_SEPARATOR=';' + } +fi + + +# IFS +# We need space, tab and new line, in precisely that order. Quoting is +# there to prevent editors from complaining about space-tab. +# (If _AS_PATH_WALK were called with IFS unset, it would disable word +# splitting by setting IFS to empty value.) 
+IFS=" "" $as_nl" + +# Find who we are. Look in the path if we contain no directory separator. +as_myself= +case $0 in #(( + *[\\/]* ) as_myself=$0 ;; + *) as_save_IFS=$IFS; IFS=$PATH_SEPARATOR +for as_dir in $PATH +do + IFS=$as_save_IFS + test -z "$as_dir" && as_dir=. + test -r "$as_dir/$0" && as_myself=$as_dir/$0 && break + done +IFS=$as_save_IFS + + ;; +esac +# We did not find ourselves, most probably we were run as `sh COMMAND' +# in which case we are not to be found in the path. +if test "x$as_myself" = x; then + as_myself=$0 +fi +if test ! -f "$as_myself"; then + $as_echo "$as_myself: error: cannot find myself; rerun with an absolute file name" >&2 + exit 1 +fi + +# Unset variables that we do not need and which cause bugs (e.g. in +# pre-3.0 UWIN ksh). But do not cause bugs in bash 2.01; the "|| exit 1" +# suppresses any "Segmentation fault" message there. '((' could +# trigger a bug in pdksh 5.2.14. +for as_var in BASH_ENV ENV MAIL MAILPATH +do eval test x\${$as_var+set} = xset \ + && ( (unset $as_var) || exit 1) >/dev/null 2>&1 && unset $as_var || : +done +PS1='$ ' +PS2='> ' +PS4='+ ' + +# NLS nuisances. +LC_ALL=C +export LC_ALL +LANGUAGE=C +export LANGUAGE + +# CDPATH. +(unset CDPATH) >/dev/null 2>&1 && unset CDPATH + + +# as_fn_error STATUS ERROR [LINENO LOG_FD] +# ---------------------------------------- +# Output "`basename $0`: error: ERROR" to stderr. If LINENO and LOG_FD are +# provided, also output the error to LOG_FD, referencing LINENO. Then exit the +# script with STATUS, using 1 if that was 0. +as_fn_error () +{ + as_status=$1; test $as_status -eq 0 && as_status=1 + if test "$4"; then + as_lineno=${as_lineno-"$3"} as_lineno_stack=as_lineno_stack=$as_lineno_stack + $as_echo "$as_me:${as_lineno-$LINENO}: error: $2" >&$4 + fi + $as_echo "$as_me: error: $2" >&2 + as_fn_exit $as_status +} # as_fn_error + + +# as_fn_set_status STATUS +# ----------------------- +# Set $? to STATUS, without forking. 
+as_fn_set_status () +{ + return $1 +} # as_fn_set_status + +# as_fn_exit STATUS +# ----------------- +# Exit the shell with STATUS, even in a "trap 0" or "set -e" context. +as_fn_exit () +{ + set +e + as_fn_set_status $1 + exit $1 +} # as_fn_exit + +# as_fn_unset VAR +# --------------- +# Portably unset VAR. +as_fn_unset () +{ + { eval $1=; unset $1;} +} +as_unset=as_fn_unset +# as_fn_append VAR VALUE +# ---------------------- +# Append the text in VALUE to the end of the definition contained in VAR. Take +# advantage of any shell optimizations that allow amortized linear growth over +# repeated appends, instead of the typical quadratic growth present in naive +# implementations. +if (eval "as_var=1; as_var+=2; test x\$as_var = x12") 2>/dev/null; then : + eval 'as_fn_append () + { + eval $1+=\$2 + }' +else + as_fn_append () + { + eval $1=\$$1\$2 + } +fi # as_fn_append + +# as_fn_arith ARG... +# ------------------ +# Perform arithmetic evaluation on the ARGs, and store the result in the +# global $as_val. Take advantage of shells that can avoid forks. The arguments +# must be portable across $(()) and expr. +if (eval "test \$(( 1 + 1 )) = 2") 2>/dev/null; then : + eval 'as_fn_arith () + { + as_val=$(( $* )) + }' +else + as_fn_arith () + { + as_val=`expr "$@" || test $? -eq 1` + } +fi # as_fn_arith + + +if expr a : '\(a\)' >/dev/null 2>&1 && + test "X`expr 00001 : '.*\(...\)'`" = X001; then + as_expr=expr +else + as_expr=false +fi + +if (basename -- /) >/dev/null 2>&1 && test "X`basename -- / 2>&1`" = "X/"; then + as_basename=basename +else + as_basename=false +fi + +if (as_dir=`dirname -- /` && test "X$as_dir" = X/) >/dev/null 2>&1; then + as_dirname=dirname +else + as_dirname=false +fi + +as_me=`$as_basename -- "$0" || +$as_expr X/"$0" : '.*/\([^/][^/]*\)/*$' \| \ + X"$0" : 'X\(//\)$' \| \ + X"$0" : 'X\(/\)' \| . 
2>/dev/null || +$as_echo X/"$0" | + sed '/^.*\/\([^/][^/]*\)\/*$/{ + s//\1/ + q + } + /^X\/\(\/\/\)$/{ + s//\1/ + q + } + /^X\/\(\/\).*/{ + s//\1/ + q + } + s/.*/./; q'` + +# Avoid depending upon Character Ranges. +as_cr_letters='abcdefghijklmnopqrstuvwxyz' +as_cr_LETTERS='ABCDEFGHIJKLMNOPQRSTUVWXYZ' +as_cr_Letters=$as_cr_letters$as_cr_LETTERS +as_cr_digits='0123456789' +as_cr_alnum=$as_cr_Letters$as_cr_digits + +ECHO_C= ECHO_N= ECHO_T= +case `echo -n x` in #((((( +-n*) + case `echo 'xy\c'` in + *c*) ECHO_T=' ';; # ECHO_T is single tab character. + xy) ECHO_C='\c';; + *) echo `echo ksh88 bug on AIX 6.1` > /dev/null + ECHO_T=' ';; + esac;; +*) + ECHO_N='-n';; +esac + +rm -f conf$$ conf$$.exe conf$$.file +if test -d conf$$.dir; then + rm -f conf$$.dir/conf$$.file +else + rm -f conf$$.dir + mkdir conf$$.dir 2>/dev/null +fi +if (echo >conf$$.file) 2>/dev/null; then + if ln -s conf$$.file conf$$ 2>/dev/null; then + as_ln_s='ln -s' + # ... but there are two gotchas: + # 1) On MSYS, both `ln -s file dir' and `ln file dir' fail. + # 2) DJGPP < 2.04 has no symlinks; `ln -s' creates a wrapper executable. + # In both cases, we have to default to `cp -pR'. + ln -s conf$$.file conf$$.dir 2>/dev/null && test ! -f conf$$.exe || + as_ln_s='cp -pR' + elif ln conf$$.file conf$$ 2>/dev/null; then + as_ln_s=ln + else + as_ln_s='cp -pR' + fi +else + as_ln_s='cp -pR' +fi +rm -f conf$$ conf$$.exe conf$$.dir/conf$$.file conf$$.file +rmdir conf$$.dir 2>/dev/null + + +# as_fn_mkdir_p +# ------------- +# Create "$as_dir" as a directory, including parents if necessary. 
+as_fn_mkdir_p () +{ + + case $as_dir in #( + -*) as_dir=./$as_dir;; + esac + test -d "$as_dir" || eval $as_mkdir_p || { + as_dirs= + while :; do + case $as_dir in #( + *\'*) as_qdir=`$as_echo "$as_dir" | sed "s/'/'\\\\\\\\''/g"`;; #'( + *) as_qdir=$as_dir;; + esac + as_dirs="'$as_qdir' $as_dirs" + as_dir=`$as_dirname -- "$as_dir" || +$as_expr X"$as_dir" : 'X\(.*[^/]\)//*[^/][^/]*/*$' \| \ + X"$as_dir" : 'X\(//\)[^/]' \| \ + X"$as_dir" : 'X\(//\)$' \| \ + X"$as_dir" : 'X\(/\)' \| . 2>/dev/null || +$as_echo X"$as_dir" | + sed '/^X\(.*[^/]\)\/\/*[^/][^/]*\/*$/{ + s//\1/ + q + } + /^X\(\/\/\)[^/].*/{ + s//\1/ + q + } + /^X\(\/\/\)$/{ + s//\1/ + q + } + /^X\(\/\).*/{ + s//\1/ + q + } + s/.*/./; q'` + test -d "$as_dir" && break + done + test -z "$as_dirs" || eval "mkdir $as_dirs" + } || test -d "$as_dir" || as_fn_error $? "cannot create directory $as_dir" + + +} # as_fn_mkdir_p +if mkdir -p . 2>/dev/null; then + as_mkdir_p='mkdir -p "$as_dir"' +else + test -d ./-p && rmdir ./-p + as_mkdir_p=false +fi + + +# as_fn_executable_p FILE +# ----------------------- +# Test if FILE is an executable regular file. +as_fn_executable_p () +{ + test -f "$1" && test -x "$1" +} # as_fn_executable_p +as_test_x='test -x' +as_executable_p=as_fn_executable_p + +# Sed expression to map a string onto a valid CPP name. +as_tr_cpp="eval sed 'y%*$as_cr_letters%P$as_cr_LETTERS%;s%[^_$as_cr_alnum]%_%g'" + +# Sed expression to map a string onto a valid variable name. +as_tr_sh="eval sed 'y%*+%pp%;s%[^_$as_cr_alnum]%_%g'" + + +exec 6>&1 +## ----------------------------------- ## +## Main body of $CONFIG_STATUS script. ## +## ----------------------------------- ## +_ASEOF +test $as_write_fail = 0 && chmod +x $CONFIG_STATUS || ac_write_fail=1 + +cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1 +# Save the log message, to keep $0 and so on meaningful, and to +# report actual input values of CONFIG_FILES etc. instead of their +# values after options handling. 
+ac_log=" +This file was extended by sgtsnepi $as_me version-1.0, which was +generated by GNU Autoconf 2.69. Invocation command line was + + CONFIG_FILES = $CONFIG_FILES + CONFIG_HEADERS = $CONFIG_HEADERS + CONFIG_LINKS = $CONFIG_LINKS + CONFIG_COMMANDS = $CONFIG_COMMANDS + $ $0 $@ + +on `(hostname || uname -n) 2>/dev/null | sed 1q` +" + +_ACEOF + +case $ac_config_files in *" +"*) set x $ac_config_files; shift; ac_config_files=$*;; +esac + + + +cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1 +# Files that config.status was made for. +config_files="$ac_config_files" + +_ACEOF + +cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1 +ac_cs_usage="\ +\`$as_me' instantiates files and other configuration actions +from templates according to the current configuration. Unless the files +and actions are specified as TAGs, all are instantiated by default. + +Usage: $0 [OPTION]... [TAG]... + + -h, --help print this help, then exit + -V, --version print version number and configuration settings, then exit + --config print configuration, then exit + -q, --quiet, --silent + do not print progress messages + -d, --debug don't remove temporary files + --recheck update $as_me by reconfiguring in the same conditions + --file=FILE[:TEMPLATE] + instantiate the configuration file FILE + +Configuration files: +$config_files + +Report bugs to the package provider." + +_ACEOF +cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1 +ac_cs_config="`$as_echo "$ac_configure_args" | sed 's/^ //; s/[\\""\`\$]/\\\\&/g'`" +ac_cs_version="\\ +sgtsnepi config.status version-1.0 +configured by $0, generated by GNU Autoconf 2.69, + with options \\"\$ac_cs_config\\" + +Copyright (C) 2012 Free Software Foundation, Inc. +This config.status script is free software; the Free Software Foundation +gives unlimited permission to copy, distribute and modify it." 
+ +ac_pwd='$ac_pwd' +srcdir='$srcdir' +test -n "\$AWK" || AWK=awk +_ACEOF + +cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1 +# The default lists apply if the user does not specify any file. +ac_need_defaults=: +while test $# != 0 +do + case $1 in + --*=?*) + ac_option=`expr "X$1" : 'X\([^=]*\)='` + ac_optarg=`expr "X$1" : 'X[^=]*=\(.*\)'` + ac_shift=: + ;; + --*=) + ac_option=`expr "X$1" : 'X\([^=]*\)='` + ac_optarg= + ac_shift=: + ;; + *) + ac_option=$1 + ac_optarg=$2 + ac_shift=shift + ;; + esac + + case $ac_option in + # Handling of the options. + -recheck | --recheck | --rechec | --reche | --rech | --rec | --re | --r) + ac_cs_recheck=: ;; + --version | --versio | --versi | --vers | --ver | --ve | --v | -V ) + $as_echo "$ac_cs_version"; exit ;; + --config | --confi | --conf | --con | --co | --c ) + $as_echo "$ac_cs_config"; exit ;; + --debug | --debu | --deb | --de | --d | -d ) + debug=: ;; + --file | --fil | --fi | --f ) + $ac_shift + case $ac_optarg in + *\'*) ac_optarg=`$as_echo "$ac_optarg" | sed "s/'/'\\\\\\\\''/g"` ;; + '') as_fn_error $? "missing file argument" ;; + esac + as_fn_append CONFIG_FILES " '$ac_optarg'" + ac_need_defaults=false;; + --he | --h | --help | --hel | -h ) + $as_echo "$ac_cs_usage"; exit ;; + -q | -quiet | --quiet | --quie | --qui | --qu | --q \ + | -silent | --silent | --silen | --sile | --sil | --si | --s) + ac_cs_silent=: ;; + + # This is an error. + -*) as_fn_error $? "unrecognized option: \`$1' +Try \`$0 --help' for more information." 
;; + + *) as_fn_append ac_config_targets " $1" + ac_need_defaults=false ;; + + esac + shift +done + +ac_configure_extra_args= + +if $ac_cs_silent; then + exec 6>/dev/null + ac_configure_extra_args="$ac_configure_extra_args --silent" +fi + +_ACEOF +cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1 +if \$ac_cs_recheck; then + set X $SHELL '$0' $ac_configure_args \$ac_configure_extra_args --no-create --no-recursion + shift + \$as_echo "running CONFIG_SHELL=$SHELL \$*" >&6 + CONFIG_SHELL='$SHELL' + export CONFIG_SHELL + exec "\$@" +fi + +_ACEOF +cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1 +exec 5>>config.log +{ + echo + sed 'h;s/./-/g;s/^.../## /;s/...$/ ##/;p;x;p;x' <<_ASBOX +## Running $as_me. ## +_ASBOX + $as_echo "$ac_log" +} >&5 + +_ACEOF +cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1 +_ACEOF + +cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1 + +# Handling of arguments. +for ac_config_target in $ac_config_targets +do + case $ac_config_target in + "Makefile") CONFIG_FILES="$CONFIG_FILES Makefile" ;; + + *) as_fn_error $? "invalid argument: \`$ac_config_target'" "$LINENO" 5;; + esac +done + + +# If the user did not use the arguments to specify the items to instantiate, +# then the envvar interface is used. Set only those that are not. +# We use the long form for the default assignment because of an extremely +# bizarre bug on SunOS 4.1.3. +if $ac_need_defaults; then + test "${CONFIG_FILES+set}" = set || CONFIG_FILES=$config_files +fi + +# Have a temporary directory for convenience. Make it in the build tree +# simply because there is no reason against having it here, and in addition, +# creating and moving files from /tmp can sometimes cause problems. +# Hook for its removal unless debugging. +# Note that there is a small window in which the directory will not be cleaned: +# after its creation but before its name has been assigned to `$tmp'. +$debug || +{ + tmp= ac_tmp= + trap 'exit_status=$? + : "${ac_tmp:=$tmp}" + { test ! 
-d "$ac_tmp" || rm -fr "$ac_tmp"; } && exit $exit_status +' 0 + trap 'as_fn_exit 1' 1 2 13 15 +} +# Create a (secure) tmp directory for tmp files. + +{ + tmp=`(umask 077 && mktemp -d "./confXXXXXX") 2>/dev/null` && + test -d "$tmp" +} || +{ + tmp=./conf$$-$RANDOM + (umask 077 && mkdir "$tmp") +} || as_fn_error $? "cannot create a temporary directory in ." "$LINENO" 5 +ac_tmp=$tmp + +# Set up the scripts for CONFIG_FILES section. +# No need to generate them if there are no CONFIG_FILES. +# This happens for instance with `./config.status config.h'. +if test -n "$CONFIG_FILES"; then + + +ac_cr=`echo X | tr X '\015'` +# On cygwin, bash can eat \r inside `` if the user requested igncr. +# But we know of no other shell where ac_cr would be empty at this +# point, so we can use a bashism as a fallback. +if test "x$ac_cr" = x; then + eval ac_cr=\$\'\\r\' +fi +ac_cs_awk_cr=`$AWK 'BEGIN { print "a\rb" }' </dev/null 2>/dev/null` +if test "$ac_cs_awk_cr" = "a${ac_cr}b"; then + ac_cs_awk_cr='\\r' +else + ac_cs_awk_cr=$ac_cr +fi + +echo 'BEGIN {' >"$ac_tmp/subs1.awk" && +_ACEOF + + +{ + echo "cat >conf$$subs.awk <<_ACEOF" && + echo "$ac_subst_vars" | sed 's/.*/&!$&$ac_delim/' && + echo "_ACEOF" +} >conf$$subs.sh || + as_fn_error $? "could not make $CONFIG_STATUS" "$LINENO" 5 +ac_delim_num=`echo "$ac_subst_vars" | grep -c '^'` +ac_delim='%!_!# ' +for ac_last_try in false false false false false :; do + . ./conf$$subs.sh || + as_fn_error $? "could not make $CONFIG_STATUS" "$LINENO" 5 + + ac_delim_n=`sed -n "s/.*$ac_delim\$/X/p" conf$$subs.awk | grep -c X` + if test $ac_delim_n = $ac_delim_num; then + break + elif $ac_last_try; then + as_fn_error $? "could not make $CONFIG_STATUS" "$LINENO" 5 + else + ac_delim="$ac_delim!$ac_delim _$ac_delim!!
" + fi +done +rm -f conf$$subs.sh + +cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1 +cat >>"\$ac_tmp/subs1.awk" <<\\_ACAWK && +_ACEOF +sed -n ' +h +s/^/S["/; s/!.*/"]=/ +p +g +s/^[^!]*!// +:repl +t repl +s/'"$ac_delim"'$// +t delim +:nl +h +s/\(.\{148\}\)..*/\1/ +t more1 +s/["\\]/\\&/g; s/^/"/; s/$/\\n"\\/ +p +n +b repl +:more1 +s/["\\]/\\&/g; s/^/"/; s/$/"\\/ +p +g +s/.\{148\}// +t nl +:delim +h +s/\(.\{148\}\)..*/\1/ +t more2 +s/["\\]/\\&/g; s/^/"/; s/$/"/ +p +b +:more2 +s/["\\]/\\&/g; s/^/"/; s/$/"\\/ +p +g +s/.\{148\}// +t delim +' <conf$$subs.awk | sed ' +/^[^""]/{ + N + s/\n// +} +' >>$CONFIG_STATUS || ac_write_fail=1 +rm -f conf$$subs.awk +cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1 +_ACAWK +cat >>"\$ac_tmp/subs1.awk" <<_ACAWK && + for (key in S) S_is_set[key] = 1 + FS = "" + +} +{ + line = $ 0 + nfields = split(line, field, "@") + substed = 0 + len = length(field[1]) + for (i = 2; i < nfields; i++) { + key = field[i] + keylen = length(key) + if (S_is_set[key]) { + value = S[key] + line = substr(line, 1, len) "" value "" substr(line, len + keylen + 3) + len += length(value) + length(field[++i]) + substed = 1 + } else + len += 1 + keylen + } + + print line +} + +_ACAWK +_ACEOF +cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1 +if sed "s/$ac_cr//" < /dev/null > /dev/null 2>&1; then + sed "s/$ac_cr\$//; s/$ac_cr/$ac_cs_awk_cr/g" +else + cat +fi < "$ac_tmp/subs1.awk" > "$ac_tmp/subs.awk" \ + || as_fn_error $? "could not setup config files machinery" "$LINENO" 5 +_ACEOF + +# VPATH may cause trouble with some makes, so we remove sole $(srcdir), +# ${srcdir} and @srcdir@ entries from VPATH if srcdir is ".", strip leading and +# trailing colons and then remove the whole line if VPATH becomes empty +# (actually we leave an empty line to preserve line numbers).
+if test "x$srcdir" = x.; then + ac_vpsub='/^[ ]*VPATH[ ]*=[ ]*/{ +h +s/// +s/^/:/ +s/[ ]*$/:/ +s/:\$(srcdir):/:/g +s/:\${srcdir}:/:/g +s/:@srcdir@:/:/g +s/^:*// +s/:*$// +x +s/\(=[ ]*\).*/\1/ +G +s/\n// +s/^[^=]*=[ ]*$// +}' +fi + +cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1 +fi # test -n "$CONFIG_FILES" + + +eval set X " :F $CONFIG_FILES " +shift +for ac_tag +do + case $ac_tag in + :[FHLC]) ac_mode=$ac_tag; continue;; + esac + case $ac_mode$ac_tag in + :[FHL]*:*);; + :L* | :C*:*) as_fn_error $? "invalid tag \`$ac_tag'" "$LINENO" 5;; + :[FH]-) ac_tag=-:-;; + :[FH]*) ac_tag=$ac_tag:$ac_tag.in;; + esac + ac_save_IFS=$IFS + IFS=: + set x $ac_tag + IFS=$ac_save_IFS + shift + ac_file=$1 + shift + + case $ac_mode in + :L) ac_source=$1;; + :[FH]) + ac_file_inputs= + for ac_f + do + case $ac_f in + -) ac_f="$ac_tmp/stdin";; + *) # Look for the file first in the build tree, then in the source tree + # (if the path is not absolute). The absolute path cannot be DOS-style, + # because $ac_f cannot contain `:'. + test -f "$ac_f" || + case $ac_f in + [\\/$]*) false;; + *) test -f "$srcdir/$ac_f" && ac_f="$srcdir/$ac_f";; + esac || + as_fn_error 1 "cannot find input file: \`$ac_f'" "$LINENO" 5;; + esac + case $ac_f in *\'*) ac_f=`$as_echo "$ac_f" | sed "s/'/'\\\\\\\\''/g"`;; esac + as_fn_append ac_file_inputs " '$ac_f'" + done + + # Let's still pretend it is `configure' which instantiates (i.e., don't + # use $as_me), people would be surprised to read: + # /* config.h. Generated by config.status. */ + configure_input='Generated from '` + $as_echo "$*" | sed 's|^[^:]*/||;s|:[^:]*/|, |g' + `' by configure.' + if test x"$ac_file" != x-; then + configure_input="$ac_file. $configure_input" + { $as_echo "$as_me:${as_lineno-$LINENO}: creating $ac_file" >&5 +$as_echo "$as_me: creating $ac_file" >&6;} + fi + # Neutralize special characters interpreted by sed in replacement strings. 
+ case $configure_input in #( + *\&* | *\|* | *\\* ) + ac_sed_conf_input=`$as_echo "$configure_input" | + sed 's/[\\\\&|]/\\\\&/g'`;; #( + *) ac_sed_conf_input=$configure_input;; + esac + + case $ac_tag in + *:-:* | *:-) cat >"$ac_tmp/stdin" \ + || as_fn_error $? "could not create $ac_file" "$LINENO" 5 ;; + esac + ;; + esac + + ac_dir=`$as_dirname -- "$ac_file" || +$as_expr X"$ac_file" : 'X\(.*[^/]\)//*[^/][^/]*/*$' \| \ + X"$ac_file" : 'X\(//\)[^/]' \| \ + X"$ac_file" : 'X\(//\)$' \| \ + X"$ac_file" : 'X\(/\)' \| . 2>/dev/null || +$as_echo X"$ac_file" | + sed '/^X\(.*[^/]\)\/\/*[^/][^/]*\/*$/{ + s//\1/ + q + } + /^X\(\/\/\)[^/].*/{ + s//\1/ + q + } + /^X\(\/\/\)$/{ + s//\1/ + q + } + /^X\(\/\).*/{ + s//\1/ + q + } + s/.*/./; q'` + as_dir="$ac_dir"; as_fn_mkdir_p + ac_builddir=. + +case "$ac_dir" in +.) ac_dir_suffix= ac_top_builddir_sub=. ac_top_build_prefix= ;; +*) + ac_dir_suffix=/`$as_echo "$ac_dir" | sed 's|^\.[\\/]||'` + # A ".." for each directory in $ac_dir_suffix. + ac_top_builddir_sub=`$as_echo "$ac_dir_suffix" | sed 's|/[^\\/]*|/..|g;s|/||'` + case $ac_top_builddir_sub in + "") ac_top_builddir_sub=. ac_top_build_prefix= ;; + *) ac_top_build_prefix=$ac_top_builddir_sub/ ;; + esac ;; +esac +ac_abs_top_builddir=$ac_pwd +ac_abs_builddir=$ac_pwd$ac_dir_suffix +# for backward compatibility: +ac_top_builddir=$ac_top_build_prefix + +case $srcdir in + .) # We are building in place. + ac_srcdir=. + ac_top_srcdir=$ac_top_builddir_sub + ac_abs_top_srcdir=$ac_pwd ;; + [\\/]* | ?:[\\/]* ) # Absolute name. + ac_srcdir=$srcdir$ac_dir_suffix; + ac_top_srcdir=$srcdir + ac_abs_top_srcdir=$srcdir ;; + *) # Relative name. 
+ ac_srcdir=$ac_top_build_prefix$srcdir$ac_dir_suffix + ac_top_srcdir=$ac_top_build_prefix$srcdir + ac_abs_top_srcdir=$ac_pwd/$srcdir ;; +esac +ac_abs_srcdir=$ac_abs_top_srcdir$ac_dir_suffix + + + case $ac_mode in + :F) + # + # CONFIG_FILE + # + +_ACEOF + +cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1 +# If the template does not know about datarootdir, expand it. +# FIXME: This hack should be removed a few years after 2.60. +ac_datarootdir_hack=; ac_datarootdir_seen= +ac_sed_dataroot=' +/datarootdir/ { + p + q +} +/@datadir@/p +/@docdir@/p +/@infodir@/p +/@localedir@/p +/@mandir@/p' +case `eval "sed -n \"\$ac_sed_dataroot\" $ac_file_inputs"` in +*datarootdir*) ac_datarootdir_seen=yes;; +*@datadir@*|*@docdir@*|*@infodir@*|*@localedir@*|*@mandir@*) + { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: $ac_file_inputs seems to ignore the --datarootdir setting" >&5 +$as_echo "$as_me: WARNING: $ac_file_inputs seems to ignore the --datarootdir setting" >&2;} +_ACEOF +cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1 + ac_datarootdir_hack=' + s&@datadir@&$datadir&g + s&@docdir@&$docdir&g + s&@infodir@&$infodir&g + s&@localedir@&$localedir&g + s&@mandir@&$mandir&g + s&\\\${datarootdir}&$datarootdir&g' ;; +esac +_ACEOF + +# Neutralize VPATH when `$srcdir' = `.'. +# Shell code in configure.ac might set extrasub. +# FIXME: do we really want to maintain this feature? 
+cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1 +ac_sed_extra="$ac_vpsub +$extrasub +_ACEOF +cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1 +:t +/@[a-zA-Z_][a-zA-Z_0-9]*@/!b +s|@configure_input@|$ac_sed_conf_input|;t t +s&@top_builddir@&$ac_top_builddir_sub&;t t +s&@top_build_prefix@&$ac_top_build_prefix&;t t +s&@srcdir@&$ac_srcdir&;t t +s&@abs_srcdir@&$ac_abs_srcdir&;t t +s&@top_srcdir@&$ac_top_srcdir&;t t +s&@abs_top_srcdir@&$ac_abs_top_srcdir&;t t +s&@builddir@&$ac_builddir&;t t +s&@abs_builddir@&$ac_abs_builddir&;t t +s&@abs_top_builddir@&$ac_abs_top_builddir&;t t +$ac_datarootdir_hack +" +eval sed \"\$ac_sed_extra\" "$ac_file_inputs" | $AWK -f "$ac_tmp/subs.awk" \ + >$ac_tmp/out || as_fn_error $? "could not create $ac_file" "$LINENO" 5 + +test -z "$ac_datarootdir_hack$ac_datarootdir_seen" && + { ac_out=`sed -n '/\${datarootdir}/p' "$ac_tmp/out"`; test -n "$ac_out"; } && + { ac_out=`sed -n '/^[ ]*datarootdir[ ]*:*=/p' \ + "$ac_tmp/out"`; test -z "$ac_out"; } && + { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: $ac_file contains a reference to the variable \`datarootdir' +which seems to be undefined. Please make sure it is defined" >&5 +$as_echo "$as_me: WARNING: $ac_file contains a reference to the variable \`datarootdir' +which seems to be undefined. Please make sure it is defined" >&2;} + + rm -f "$ac_tmp/stdin" + case $ac_file in + -) cat "$ac_tmp/out" && rm -f "$ac_tmp/out";; + *) rm -f "$ac_file" && mv "$ac_tmp/out" "$ac_file";; + esac \ + || as_fn_error $? "could not create $ac_file" "$LINENO" 5 + ;; + + + + esac + +done # for ac_tag + + +as_fn_exit 0 +_ACEOF +ac_clean_files=$ac_clean_files_save + +test $ac_write_fail = 0 || + as_fn_error $? "write failure creating $CONFIG_STATUS" "$LINENO" 5 + + +# configure is writing to config.log, and then calls config.status. +# config.status does its own redirection, appending to config.log. 
+# Unfortunately, on DOS this fails, as config.log is still kept open +# by configure, so config.status won't be able to write to it; its +# output is simply discarded. So we exec the FD to /dev/null, +# effectively closing config.log, so it can be properly (re)opened and +# appended to by config.status. When coming back to configure, we +# need to make the FD available again. +if test "$no_create" != yes; then + ac_cs_success=: + ac_config_status_args= + test "$silent" = yes && + ac_config_status_args="$ac_config_status_args --quiet" + exec 5>/dev/null + $SHELL $CONFIG_STATUS $ac_config_status_args || ac_cs_success=false + exec 5>>config.log + # Use ||, not &&, to avoid exiting from the if with $? = 1, which + # would make configure fail if this is the last instruction. + $ac_cs_success || as_fn_exit 1 +fi +if test -n "$ac_unrecognized_opts" && test "$enable_option_checking" != no; then + { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: unrecognized options: $ac_unrecognized_opts" >&5 +$as_echo "$as_me: WARNING: unrecognized options: $ac_unrecognized_opts" >&2;} +fi + + +printf "\nCheck header dependencies\n\n" + + +ac_ext=cpp +ac_cpp='$CXXCPP $CPPFLAGS' +ac_compile='$CXX -c $CXXFLAGS $CPPFLAGS conftest.$ac_ext >&5' +ac_link='$CXX -o conftest$ac_exeext $CXXFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5' +ac_compiler_gnu=$ac_cv_cxx_compiler_gnu +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking how to run the C++ preprocessor" >&5 +$as_echo_n "checking how to run the C++ preprocessor... " >&6; } +if test -z "$CXXCPP"; then + if ${ac_cv_prog_CXXCPP+:} false; then : + $as_echo_n "(cached) " >&6 +else + # Double quotes because CXXCPP needs to be expanded + for CXXCPP in "$CXX -E" "/lib/cpp" + do + ac_preproc_ok=false +for ac_cxx_preproc_warn_flag in '' yes +do + # Use a header file that comes with gcc, so configuring glibc + # with a fresh cross-compiler works. + # Prefer <limits.h> to <assert.h> if __STDC__ is defined, since + # <limits.h> exists even on freestanding compilers.
+ # On the NeXT, cc -E runs the code through the compiler's parser, + # not just through cpp. "Syntax error" is here to catch this case. + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#ifdef __STDC__ +# include <limits.h> +#else +# include <assert.h> +#endif + Syntax error +_ACEOF +if ac_fn_cxx_try_cpp "$LINENO"; then : + +else + # Broken: fails on valid input. +continue +fi +rm -f conftest.err conftest.i conftest.$ac_ext + + # OK, works on sane cases. Now check whether nonexistent headers + # can be detected and how. + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include <ac_nonexistent.h> +_ACEOF +if ac_fn_cxx_try_cpp "$LINENO"; then : + # Broken: success on invalid input. +continue +else + # Passes both tests. +ac_preproc_ok=: +break +fi +rm -f conftest.err conftest.i conftest.$ac_ext + +done +# Because of `break', _AC_PREPROC_IFELSE's cleaning code was skipped. +rm -f conftest.i conftest.err conftest.$ac_ext +if $ac_preproc_ok; then : + break +fi + + done + ac_cv_prog_CXXCPP=$CXXCPP + +fi + CXXCPP=$ac_cv_prog_CXXCPP +else + ac_cv_prog_CXXCPP=$CXXCPP +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $CXXCPP" >&5 +$as_echo "$CXXCPP" >&6; } +ac_preproc_ok=false +for ac_cxx_preproc_warn_flag in '' yes +do + # Use a header file that comes with gcc, so configuring glibc + # with a fresh cross-compiler works. + # Prefer <limits.h> to <assert.h> if __STDC__ is defined, since + # <limits.h> exists even on freestanding compilers. + # On the NeXT, cc -E runs the code through the compiler's parser, + # not just through cpp. "Syntax error" is here to catch this case. + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#ifdef __STDC__ +# include <limits.h> +#else +# include <assert.h> +#endif + Syntax error +_ACEOF +if ac_fn_cxx_try_cpp "$LINENO"; then : + +else + # Broken: fails on valid input. +continue +fi +rm -f conftest.err conftest.i conftest.$ac_ext + + # OK, works on sane cases. Now check whether nonexistent headers + # can be detected and how.
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include <ac_nonexistent.h> +_ACEOF +if ac_fn_cxx_try_cpp "$LINENO"; then : + # Broken: success on invalid input. +continue +else + # Passes both tests. +ac_preproc_ok=: +break +fi +rm -f conftest.err conftest.i conftest.$ac_ext + +done +# Because of `break', _AC_PREPROC_IFELSE's cleaning code was skipped. +rm -f conftest.i conftest.err conftest.$ac_ext +if $ac_preproc_ok; then : + +else + { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5 +$as_echo "$as_me: error: in \`$ac_pwd':" >&2;} +as_fn_error $? "C++ preprocessor \"$CXXCPP\" fails sanity check +See \`config.log' for more details" "$LINENO" 5; } +fi + +ac_ext=cpp +ac_cpp='$CXXCPP $CPPFLAGS' +ac_compile='$CXX -c $CXXFLAGS $CPPFLAGS conftest.$ac_ext >&5' +ac_link='$CXX -o conftest$ac_exeext $CXXFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $LIBS >&5' +ac_compiler_gnu=$ac_cv_cxx_compiler_gnu + + +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for grep that handles long lines and -e" >&5 +$as_echo_n "checking for grep that handles long lines and -e... " >&6; } +if ${ac_cv_path_GREP+:} false; then : + $as_echo_n "(cached) " >&6 +else + if test -z "$GREP"; then + ac_path_GREP_found=false + # Loop through the user's path and test for each of PROGNAME-LIST + as_save_IFS=$IFS; IFS=$PATH_SEPARATOR +for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin +do + IFS=$as_save_IFS + test -z "$as_dir" && as_dir=. + for ac_prog in grep ggrep; do + for ac_exec_ext in '' $ac_executable_extensions; do + ac_path_GREP="$as_dir/$ac_prog$ac_exec_ext" + as_fn_executable_p "$ac_path_GREP" || continue +# Check for GNU ac_path_GREP and select it if it is found.
+ # Check for GNU $ac_path_GREP +case `"$ac_path_GREP" --version 2>&1` in +*GNU*) + ac_cv_path_GREP="$ac_path_GREP" ac_path_GREP_found=:;; +*) + ac_count=0 + $as_echo_n 0123456789 >"conftest.in" + while : + do + cat "conftest.in" "conftest.in" >"conftest.tmp" + mv "conftest.tmp" "conftest.in" + cp "conftest.in" "conftest.nl" + $as_echo 'GREP' >> "conftest.nl" + "$ac_path_GREP" -e 'GREP$' -e '-(cannot match)-' < "conftest.nl" >"conftest.out" 2>/dev/null || break + diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break + as_fn_arith $ac_count + 1 && ac_count=$as_val + if test $ac_count -gt ${ac_path_GREP_max-0}; then + # Best one so far, save it but keep looking for a better one + ac_cv_path_GREP="$ac_path_GREP" + ac_path_GREP_max=$ac_count + fi + # 10*(2^10) chars as input seems more than enough + test $ac_count -gt 10 && break + done + rm -f conftest.in conftest.tmp conftest.nl conftest.out;; +esac + + $ac_path_GREP_found && break 3 + done + done + done +IFS=$as_save_IFS + if test -z "$ac_cv_path_GREP"; then + as_fn_error $? "no acceptable grep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5 + fi +else + ac_cv_path_GREP=$GREP +fi + +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_GREP" >&5 +$as_echo "$ac_cv_path_GREP" >&6; } + GREP="$ac_cv_path_GREP" + + +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for egrep" >&5 +$as_echo_n "checking for egrep... " >&6; } +if ${ac_cv_path_EGREP+:} false; then : + $as_echo_n "(cached) " >&6 +else + if echo a | $GREP -E '(a|b)' >/dev/null 2>&1 + then ac_cv_path_EGREP="$GREP -E" + else + if test -z "$EGREP"; then + ac_path_EGREP_found=false + # Loop through the user's path and test for each of PROGNAME-LIST + as_save_IFS=$IFS; IFS=$PATH_SEPARATOR +for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin +do + IFS=$as_save_IFS + test -z "$as_dir" && as_dir=. 
+ for ac_prog in egrep; do + for ac_exec_ext in '' $ac_executable_extensions; do + ac_path_EGREP="$as_dir/$ac_prog$ac_exec_ext" + as_fn_executable_p "$ac_path_EGREP" || continue +# Check for GNU ac_path_EGREP and select it if it is found. + # Check for GNU $ac_path_EGREP +case `"$ac_path_EGREP" --version 2>&1` in +*GNU*) + ac_cv_path_EGREP="$ac_path_EGREP" ac_path_EGREP_found=:;; +*) + ac_count=0 + $as_echo_n 0123456789 >"conftest.in" + while : + do + cat "conftest.in" "conftest.in" >"conftest.tmp" + mv "conftest.tmp" "conftest.in" + cp "conftest.in" "conftest.nl" + $as_echo 'EGREP' >> "conftest.nl" + "$ac_path_EGREP" 'EGREP$' < "conftest.nl" >"conftest.out" 2>/dev/null || break + diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break + as_fn_arith $ac_count + 1 && ac_count=$as_val + if test $ac_count -gt ${ac_path_EGREP_max-0}; then + # Best one so far, save it but keep looking for a better one + ac_cv_path_EGREP="$ac_path_EGREP" + ac_path_EGREP_max=$ac_count + fi + # 10*(2^10) chars as input seems more than enough + test $ac_count -gt 10 && break + done + rm -f conftest.in conftest.tmp conftest.nl conftest.out;; +esac + + $ac_path_EGREP_found && break 3 + done + done + done +IFS=$as_save_IFS + if test -z "$ac_cv_path_EGREP"; then + as_fn_error $? "no acceptable egrep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5 + fi +else + ac_cv_path_EGREP=$EGREP +fi + + fi +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_EGREP" >&5 +$as_echo "$ac_cv_path_EGREP" >&6; } + EGREP="$ac_cv_path_EGREP" + + +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for ANSI C header files" >&5 +$as_echo_n "checking for ANSI C header files... " >&6; } +if ${ac_cv_header_stdc+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. 
*/ +#include <stdlib.h> +#include <stdarg.h> +#include <string.h> +#include <float.h> + +int +main () +{ + + ; + return 0; +} +_ACEOF +if ac_fn_cxx_try_compile "$LINENO"; then : + ac_cv_header_stdc=yes +else + ac_cv_header_stdc=no +fi +rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext + +if test $ac_cv_header_stdc = yes; then + # SunOS 4.x string.h does not declare mem*, contrary to ANSI. + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include <string.h> + +_ACEOF +if (eval "$ac_cpp conftest.$ac_ext") 2>&5 | + $EGREP "memchr" >/dev/null 2>&1; then : + +else + ac_cv_header_stdc=no +fi +rm -f conftest* + +fi + +if test $ac_cv_header_stdc = yes; then + # ISC 2.0.2 stdlib.h does not declare free, contrary to ANSI. + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include <stdlib.h> + +_ACEOF +if (eval "$ac_cpp conftest.$ac_ext") 2>&5 | + $EGREP "free" >/dev/null 2>&1; then : + +else + ac_cv_header_stdc=no +fi +rm -f conftest* + +fi + +if test $ac_cv_header_stdc = yes; then + # /bin/cc in Irix-4.0.5 gets non-ANSI ctype macros unless using -ansi. + if test "$cross_compiling" = yes; then : + : +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include <ctype.h> +#include <stdlib.h> +#if ((' ' & 0x0FF) == 0x020) +# define ISLOWER(c) ('a' <= (c) && (c) <= 'z') +# define TOUPPER(c) (ISLOWER(c) ? 'A' + ((c) - 'a') : (c)) +#else +# define ISLOWER(c) \ + (('a' <= (c) && (c) <= 'i') \ + || ('j' <= (c) && (c) <= 'r') \ + || ('s' <= (c) && (c) <= 'z')) +# define TOUPPER(c) (ISLOWER(c) ? 
((c) | 0x40) : (c)) +#endif + +#define XOR(e, f) (((e) && !(f)) || (!(e) && (f))) +int +main () +{ + int i; + for (i = 0; i < 256; i++) + if (XOR (islower (i), ISLOWER (i)) + || toupper (i) != TOUPPER (i)) + return 2; + return 0; +} +_ACEOF +if ac_fn_cxx_try_run "$LINENO"; then : + +else + ac_cv_header_stdc=no +fi +rm -f core *.core core.conftest.* gmon.out bb.out conftest$ac_exeext \ + conftest.$ac_objext conftest.beam conftest.$ac_ext +fi + +fi +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_header_stdc" >&5 +$as_echo "$ac_cv_header_stdc" >&6; } +if test $ac_cv_header_stdc = yes; then + +$as_echo "#define STDC_HEADERS 1" >>confdefs.h + +fi + +# On IRIX 5.3, sys/types and inttypes.h are conflicting. +for ac_header in sys/types.h sys/stat.h stdlib.h string.h memory.h strings.h \ + inttypes.h stdint.h unistd.h +do : + as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh` +ac_fn_cxx_check_header_compile "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default +" +if eval test \"x\$"$as_ac_Header"\" = x"yes"; then : + cat >>confdefs.h <<_ACEOF +#define `$as_echo "HAVE_$ac_header" | $as_tr_cpp` 1 +_ACEOF + +fi + +done + + +for ac_header in fftw3.h +do : + ac_fn_cxx_check_header_mongrel "$LINENO" "fftw3.h" "ac_cv_header_fftw3_h" "$ac_includes_default" +if test "x$ac_cv_header_fftw3_h" = xyes; then : + cat >>confdefs.h <<_ACEOF +#define HAVE_FFTW3_H 1 +_ACEOF + +else + as_fn_error $? "A working FFTW3 installation is required. Issue make dependencies and re-run configure. Check that the path executable are in your path." "$LINENO" 5 +fi + +done + +for ac_header in flann/flann.h +do : + ac_fn_cxx_check_header_mongrel "$LINENO" "flann/flann.h" "ac_cv_header_flann_flann_h" "$ac_includes_default" +if test "x$ac_cv_header_flann_flann_h" = xyes; then : + cat >>confdefs.h <<_ACEOF +#define HAVE_FLANN_FLANN_H 1 +_ACEOF + +else + as_fn_error $? "A working FLANN installation is required. Issue make dependencies and re-run configure. 
Check that the path executable are in your path." "$LINENO" 5 +fi + +done + +for ac_header in metis.h +do : + ac_fn_cxx_check_header_mongrel "$LINENO" "metis.h" "ac_cv_header_metis_h" "$ac_includes_default" +if test "x$ac_cv_header_metis_h" = xyes; then : + cat >>confdefs.h <<_ACEOF +#define HAVE_METIS_H 1 +_ACEOF + +else + as_fn_error $? "A working METIS installation is required. Issue make dependencies and re-run configure. Check that the path executable are in your path." "$LINENO" 5 +fi + +done + +for ac_header in tbb/scalable_allocator.h +do : + ac_fn_cxx_check_header_mongrel "$LINENO" "tbb/scalable_allocator.h" "ac_cv_header_tbb_scalable_allocator_h" "$ac_includes_default" +if test "x$ac_cv_header_tbb_scalable_allocator_h" = xyes; then : + cat >>confdefs.h <<_ACEOF +#define HAVE_TBB_SCALABLE_ALLOCATOR_H 1 +_ACEOF + +else + as_fn_error $? "A working TBB installation is required. Issue make dependencies and re-run configure. Check that the path executable are in your path." "$LINENO" 5 +fi + +done + +for ac_header in cilk/cilk.h +do : + ac_fn_cxx_check_header_mongrel "$LINENO" "cilk/cilk.h" "ac_cv_header_cilk_cilk_h" "$ac_includes_default" +if test "x$ac_cv_header_cilk_cilk_h" = xyes; then : + cat >>confdefs.h <<_ACEOF +#define HAVE_CILK_CILK_H 1 +_ACEOF + +else + as_fn_error $? "A working Cilk installation is required" "$LINENO" 5 +fi + +done + +for ac_header in omp.h +do : + ac_fn_cxx_check_header_mongrel "$LINENO" "omp.h" "ac_cv_header_omp_h" "$ac_includes_default" +if test "x$ac_cv_header_omp_h" = xyes; then : + cat >>confdefs.h <<_ACEOF +#define HAVE_OMP_H 1 +_ACEOF + +else + as_fn_error $? "An OMP installation is required" "$LINENO" 5 +fi + +done + + +printf "\nCheck library dependencies\n\n" + +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing fftw_plan_many_dft" >&5 +$as_echo_n "checking for library containing fftw_plan_many_dft... 
" >&6; } +if ${ac_cv_search_fftw_plan_many_dft+:} false; then : + $as_echo_n "(cached) " >&6 +else + ac_func_search_save_LIBS=$LIBS +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +/* Override any GCC internal prototype to avoid an error. + Use char because int might match the return type of a GCC + builtin and then its argument prototype would still apply. */ +#ifdef __cplusplus +extern "C" +#endif +char fftw_plan_many_dft (); +int +main () +{ +return fftw_plan_many_dft (); + ; + return 0; +} +_ACEOF +for ac_lib in '' fftw3; do + if test -z "$ac_lib"; then + ac_res="none required" + else + ac_res=-l$ac_lib + LIBS="-l$ac_lib $ac_func_search_save_LIBS" + fi + if ac_fn_cxx_try_link "$LINENO"; then : + ac_cv_search_fftw_plan_many_dft=$ac_res +fi +rm -f core conftest.err conftest.$ac_objext \ + conftest$ac_exeext + if ${ac_cv_search_fftw_plan_many_dft+:} false; then : + break +fi +done +if ${ac_cv_search_fftw_plan_many_dft+:} false; then : + +else + ac_cv_search_fftw_plan_many_dft=no +fi +rm conftest.$ac_ext +LIBS=$ac_func_search_save_LIBS +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_search_fftw_plan_many_dft" >&5 +$as_echo "$ac_cv_search_fftw_plan_many_dft" >&6; } +ac_res=$ac_cv_search_fftw_plan_many_dft +if test "$ac_res" != no; then : + test "$ac_res" = "none required" || LIBS="$ac_res $LIBS" + +else + as_fn_error $? "A working FFTW3 installation is required. Issue make dependencies and re-run configure." "$LINENO" 5 +fi + +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing fftw_plan_with_nthreads" >&5 +$as_echo_n "checking for library containing fftw_plan_with_nthreads... " >&6; } +if ${ac_cv_search_fftw_plan_with_nthreads+:} false; then : + $as_echo_n "(cached) " >&6 +else + ac_func_search_save_LIBS=$LIBS +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +/* Override any GCC internal prototype to avoid an error. 
+ Use char because int might match the return type of a GCC + builtin and then its argument prototype would still apply. */ +#ifdef __cplusplus +extern "C" +#endif +char fftw_plan_with_nthreads (); +int +main () +{ +return fftw_plan_with_nthreads (); + ; + return 0; +} +_ACEOF +for ac_lib in '' fftw3_threads; do + if test -z "$ac_lib"; then + ac_res="none required" + else + ac_res=-l$ac_lib + LIBS="-l$ac_lib $ac_func_search_save_LIBS" + fi + if ac_fn_cxx_try_link "$LINENO"; then : + ac_cv_search_fftw_plan_with_nthreads=$ac_res +fi +rm -f core conftest.err conftest.$ac_objext \ + conftest$ac_exeext + if ${ac_cv_search_fftw_plan_with_nthreads+:} false; then : + break +fi +done +if ${ac_cv_search_fftw_plan_with_nthreads+:} false; then : + +else + ac_cv_search_fftw_plan_with_nthreads=no +fi +rm conftest.$ac_ext +LIBS=$ac_func_search_save_LIBS +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_search_fftw_plan_with_nthreads" >&5 +$as_echo "$ac_cv_search_fftw_plan_with_nthreads" >&6; } +ac_res=$ac_cv_search_fftw_plan_with_nthreads +if test "$ac_res" != no; then : + test "$ac_res" = "none required" || LIBS="$ac_res $LIBS" + +else + as_fn_error $? "A parallel FFTW3 installation is required. Issue make dependencies and re-run configure." "$LINENO" 5 +fi + +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing scalable_malloc" >&5 +$as_echo_n "checking for library containing scalable_malloc... " >&6; } +if ${ac_cv_search_scalable_malloc+:} false; then : + $as_echo_n "(cached) " >&6 +else + ac_func_search_save_LIBS=$LIBS +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +/* Override any GCC internal prototype to avoid an error. + Use char because int might match the return type of a GCC + builtin and then its argument prototype would still apply. 
*/ +#ifdef __cplusplus +extern "C" +#endif +char scalable_malloc (); +int +main () +{ +return scalable_malloc (); + ; + return 0; +} +_ACEOF +for ac_lib in '' tbbmalloc; do + if test -z "$ac_lib"; then + ac_res="none required" + else + ac_res=-l$ac_lib + LIBS="-l$ac_lib $ac_func_search_save_LIBS" + fi + if ac_fn_cxx_try_link "$LINENO"; then : + ac_cv_search_scalable_malloc=$ac_res +fi +rm -f core conftest.err conftest.$ac_objext \ + conftest$ac_exeext + if ${ac_cv_search_scalable_malloc+:} false; then : + break +fi +done +if ${ac_cv_search_scalable_malloc+:} false; then : + +else + ac_cv_search_scalable_malloc=no +fi +rm conftest.$ac_ext +LIBS=$ac_func_search_save_LIBS +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_search_scalable_malloc" >&5 +$as_echo "$ac_cv_search_scalable_malloc" >&6; } +ac_res=$ac_cv_search_scalable_malloc +if test "$ac_res" != no; then : + test "$ac_res" = "none required" || LIBS="$ac_res $LIBS" + +else + as_fn_error $? "A TBB installation is required. Issue make dependencies and re-run configure." "$LINENO" 5 +fi + +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing METIS_NodeND" >&5 +$as_echo_n "checking for library containing METIS_NodeND... " >&6; } +if ${ac_cv_search_METIS_NodeND+:} false; then : + $as_echo_n "(cached) " >&6 +else + ac_func_search_save_LIBS=$LIBS +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +/* Override any GCC internal prototype to avoid an error. + Use char because int might match the return type of a GCC + builtin and then its argument prototype would still apply. 
*/ +#ifdef __cplusplus +extern "C" +#endif +char METIS_NodeND (); +int +main () +{ +return METIS_NodeND (); + ; + return 0; +} +_ACEOF +for ac_lib in '' metis; do + if test -z "$ac_lib"; then + ac_res="none required" + else + ac_res=-l$ac_lib + LIBS="-l$ac_lib $ac_func_search_save_LIBS" + fi + if ac_fn_cxx_try_link "$LINENO"; then : + ac_cv_search_METIS_NodeND=$ac_res +fi +rm -f core conftest.err conftest.$ac_objext \ + conftest$ac_exeext + if ${ac_cv_search_METIS_NodeND+:} false; then : + break +fi +done +if ${ac_cv_search_METIS_NodeND+:} false; then : + +else + ac_cv_search_METIS_NodeND=no +fi +rm conftest.$ac_ext +LIBS=$ac_func_search_save_LIBS +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_search_METIS_NodeND" >&5 +$as_echo "$ac_cv_search_METIS_NodeND" >&6; } +ac_res=$ac_cv_search_METIS_NodeND +if test "$ac_res" != no; then : + test "$ac_res" = "none required" || LIBS="$ac_res $LIBS" + +else + as_fn_error $? "A METIS installation is required. Issue make dependencies and re-run configure." "$LINENO" 5 +fi + +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing flann_find_nearest_neighbors_double" >&5 +$as_echo_n "checking for library containing flann_find_nearest_neighbors_double... " >&6; } +if ${ac_cv_search_flann_find_nearest_neighbors_double+:} false; then : + $as_echo_n "(cached) " >&6 +else + ac_func_search_save_LIBS=$LIBS +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +/* Override any GCC internal prototype to avoid an error. + Use char because int might match the return type of a GCC + builtin and then its argument prototype would still apply. 
*/ +#ifdef __cplusplus +extern "C" +#endif +char flann_find_nearest_neighbors_double (); +int +main () +{ +return flann_find_nearest_neighbors_double (); + ; + return 0; +} +_ACEOF +for ac_lib in '' flann; do + if test -z "$ac_lib"; then + ac_res="none required" + else + ac_res=-l$ac_lib + LIBS="-l$ac_lib $ac_func_search_save_LIBS" + fi + if ac_fn_cxx_try_link "$LINENO"; then : + ac_cv_search_flann_find_nearest_neighbors_double=$ac_res +fi +rm -f core conftest.err conftest.$ac_objext \ + conftest$ac_exeext + if ${ac_cv_search_flann_find_nearest_neighbors_double+:} false; then : + break +fi +done +if ${ac_cv_search_flann_find_nearest_neighbors_double+:} false; then : + +else + ac_cv_search_flann_find_nearest_neighbors_double=no +fi +rm conftest.$ac_ext +LIBS=$ac_func_search_save_LIBS +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_search_flann_find_nearest_neighbors_double" >&5 +$as_echo "$ac_cv_search_flann_find_nearest_neighbors_double" >&6; } +ac_res=$ac_cv_search_flann_find_nearest_neighbors_double +if test "$ac_res" != no; then : + test "$ac_res" = "none required" || LIBS="$ac_res $LIBS" + +else + as_fn_error $? "A FLANN installation is required. Issue make dependencies and re-run configure." "$LINENO" 5 +fi + + + +# Check whether --enable-matlab was given. +if test "${enable_matlab+set}" = set; then : + enableval=$enable_matlab; ENABLE_MATLAB=yes +else + ENABLE_MATLAB=no +fi + + + + +printf "\nBuild Makefile\n\n" + +cat >confcache <<\_ACEOF +# This file is a shell script that caches the results of configure +# tests run on this system so they can be shared between configure +# scripts and configure runs, see configure's option --config-cache. +# It is not useful on other systems. If it contains results you don't +# want to keep, you may remove or edit it. +# +# config.status only pays attention to the cache file if you give it +# the --recheck option to rerun configure. 
+# +# `ac_cv_env_foo' variables (set or unset) will be overridden when +# loading this file, other *unset* `ac_cv_foo' will be assigned the +# following values. + +_ACEOF + +# The following way of writing the cache mishandles newlines in values, +# but we know of no workaround that is simple, portable, and efficient. +# So, we kill variables containing newlines. +# Ultrix sh set writes to stderr and can't be redirected directly, +# and sets the high bit in the cache file unless we assign to the vars. +( + for ac_var in `(set) 2>&1 | sed -n 's/^\([a-zA-Z_][a-zA-Z0-9_]*\)=.*/\1/p'`; do + eval ac_val=\$$ac_var + case $ac_val in #( + *${as_nl}*) + case $ac_var in #( + *_cv_*) { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: cache variable $ac_var contains a newline" >&5 +$as_echo "$as_me: WARNING: cache variable $ac_var contains a newline" >&2;} ;; + esac + case $ac_var in #( + _ | IFS | as_nl) ;; #( + BASH_ARGV | BASH_SOURCE) eval $ac_var= ;; #( + *) { eval $ac_var=; unset $ac_var;} ;; + esac ;; + esac + done + + (set) 2>&1 | + case $as_nl`(ac_space=' '; set) 2>&1` in #( + *${as_nl}ac_space=\ *) + # `set' does not quote correctly, so add quotes: double-quote + # substitution turns \\\\ into \\, and sed turns \\ into \. + sed -n \ + "s/'/'\\\\''/g; + s/^\\([_$as_cr_alnum]*_cv_[_$as_cr_alnum]*\\)=\\(.*\\)/\\1='\\2'/p" + ;; #( + *) + # `set' quotes correctly as required by POSIX, so do not add quotes. + sed -n "/^[_$as_cr_alnum]*_cv_[_$as_cr_alnum]*=/p" + ;; + esac | + sort +) | + sed ' + /^ac_cv_env_/b end + t clear + :clear + s/^\([^=]*\)=\(.*[{}].*\)$/test "${\1+set}" = set || &/ + t end + s/^\([^=]*\)=\(.*\)$/\1=${\1=\2}/ + :end' >>confcache +if diff "$cache_file" confcache >/dev/null 2>&1; then :; else + if test -w "$cache_file"; then + if test "x$cache_file" != "x/dev/null"; then + { $as_echo "$as_me:${as_lineno-$LINENO}: updating cache $cache_file" >&5 +$as_echo "$as_me: updating cache $cache_file" >&6;} + if test ! 
-f "$cache_file" || test -h "$cache_file"; then + cat confcache >"$cache_file" + else + case $cache_file in #( + */* | ?:*) + mv -f confcache "$cache_file"$$ && + mv -f "$cache_file"$$ "$cache_file" ;; #( + *) + mv -f confcache "$cache_file" ;; + esac + fi + fi + else + { $as_echo "$as_me:${as_lineno-$LINENO}: not updating unwritable cache $cache_file" >&5 +$as_echo "$as_me: not updating unwritable cache $cache_file" >&6;} + fi +fi +rm -f confcache + +test "x$prefix" = xNONE && prefix=$ac_default_prefix +# Let make expand exec_prefix. +test "x$exec_prefix" = xNONE && exec_prefix='${prefix}' + +# Transform confdefs.h into DEFS. +# Protect against shell expansion while executing Makefile rules. +# Protect against Makefile macro expansion. +# +# If the first sed substitution is executed (which looks for macros that +# take arguments), then branch to the quote section. Otherwise, +# look for a macro that doesn't take arguments. +ac_script=' +:mline +/\\$/{ + N + s,\\\n,, + b mline +} +t clear +:clear +s/^[ ]*#[ ]*define[ ][ ]*\([^ (][^ (]*([^)]*)\)[ ]*\(.*\)/-D\1=\2/g +t quote +s/^[ ]*#[ ]*define[ ][ ]*\([^ ][^ ]*\)[ ]*\(.*\)/-D\1=\2/g +t quote +b any +:quote +s/[ `~#$^&*(){}\\|;'\''"<>?]/\\&/g +s/\[/\\&/g +s/\]/\\&/g +s/\$/$$/g +H +:any +${ + g + s/^\n// + s/\n/ /g + p +} +' +DEFS=`sed -n "$ac_script" confdefs.h` + + +ac_libobjs= +ac_ltlibobjs= +U= +for ac_i in : $LIBOBJS; do test "x$ac_i" = x: && continue + # 1. Remove the extension, and $U if already installed. + ac_script='s/\$U\././;s/\.o$//;s/\.obj$//' + ac_i=`$as_echo "$ac_i" | sed "$ac_script"` + # 2. Prepend LIBOBJDIR. When used with automake>=1.10 LIBOBJDIR + # will be set to the directory where LIBOBJS objects are built. 
+ as_fn_append ac_libobjs " \${LIBOBJDIR}$ac_i\$U.$ac_objext" + as_fn_append ac_ltlibobjs " \${LIBOBJDIR}$ac_i"'$U.lo' +done +LIBOBJS=$ac_libobjs + +LTLIBOBJS=$ac_ltlibobjs + + + +: "${CONFIG_STATUS=./config.status}" +ac_write_fail=0 +ac_clean_files_save=$ac_clean_files +ac_clean_files="$ac_clean_files $CONFIG_STATUS" +{ $as_echo "$as_me:${as_lineno-$LINENO}: creating $CONFIG_STATUS" >&5 +$as_echo "$as_me: creating $CONFIG_STATUS" >&6;} +as_write_fail=0 +cat >$CONFIG_STATUS <<_ASEOF || as_write_fail=1 +#! $SHELL +# Generated by $as_me. +# Run this file to recreate the current configuration. +# Compiler output produced by configure, useful for debugging +# configure, is in config.log if it exists. + +debug=false +ac_cs_recheck=false +ac_cs_silent=false + +SHELL=\${CONFIG_SHELL-$SHELL} +export SHELL +_ASEOF +cat >>$CONFIG_STATUS <<\_ASEOF || as_write_fail=1 +## -------------------- ## +## M4sh Initialization. ## +## -------------------- ## + +# Be more Bourne compatible +DUALCASE=1; export DUALCASE # for MKS sh +if test -n "${ZSH_VERSION+set}" && (emulate sh) >/dev/null 2>&1; then : + emulate sh + NULLCMD=: + # Pre-4.2 versions of Zsh do word splitting on ${1+"$@"}, which + # is contrary to our usage. Disable this feature. + alias -g '${1+"$@"}'='"$@"' + setopt NO_GLOB_SUBST +else + case `(set -o) 2>/dev/null` in #( + *posix*) : + set -o posix ;; #( + *) : + ;; +esac +fi + + +as_nl=' +' +export as_nl +# Printing a long string crashes Solaris 7 /usr/bin/printf. +as_echo='\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\' +as_echo=$as_echo$as_echo$as_echo$as_echo$as_echo +as_echo=$as_echo$as_echo$as_echo$as_echo$as_echo$as_echo +# Prefer a ksh shell builtin over an external printf program on Solaris, +# but without wasting forks for bash or zsh. 
+if test -z "$BASH_VERSION$ZSH_VERSION" \ + && (test "X`print -r -- $as_echo`" = "X$as_echo") 2>/dev/null; then + as_echo='print -r --' + as_echo_n='print -rn --' +elif (test "X`printf %s $as_echo`" = "X$as_echo") 2>/dev/null; then + as_echo='printf %s\n' + as_echo_n='printf %s' +else + if test "X`(/usr/ucb/echo -n -n $as_echo) 2>/dev/null`" = "X-n $as_echo"; then + as_echo_body='eval /usr/ucb/echo -n "$1$as_nl"' + as_echo_n='/usr/ucb/echo -n' + else + as_echo_body='eval expr "X$1" : "X\\(.*\\)"' + as_echo_n_body='eval + arg=$1; + case $arg in #( + *"$as_nl"*) + expr "X$arg" : "X\\(.*\\)$as_nl"; + arg=`expr "X$arg" : ".*$as_nl\\(.*\\)"`;; + esac; + expr "X$arg" : "X\\(.*\\)" | tr -d "$as_nl" + ' + export as_echo_n_body + as_echo_n='sh -c $as_echo_n_body as_echo' + fi + export as_echo_body + as_echo='sh -c $as_echo_body as_echo' +fi + +# The user is always right. +if test "${PATH_SEPARATOR+set}" != set; then + PATH_SEPARATOR=: + (PATH='/bin;/bin'; FPATH=$PATH; sh -c :) >/dev/null 2>&1 && { + (PATH='/bin:/bin'; FPATH=$PATH; sh -c :) >/dev/null 2>&1 || + PATH_SEPARATOR=';' + } +fi + + +# IFS +# We need space, tab and new line, in precisely that order. Quoting is +# there to prevent editors from complaining about space-tab. +# (If _AS_PATH_WALK were called with IFS unset, it would disable word +# splitting by setting IFS to empty value.) +IFS=" "" $as_nl" + +# Find who we are. Look in the path if we contain no directory separator. +as_myself= +case $0 in #(( + *[\\/]* ) as_myself=$0 ;; + *) as_save_IFS=$IFS; IFS=$PATH_SEPARATOR +for as_dir in $PATH +do + IFS=$as_save_IFS + test -z "$as_dir" && as_dir=. + test -r "$as_dir/$0" && as_myself=$as_dir/$0 && break + done +IFS=$as_save_IFS + + ;; +esac +# We did not find ourselves, most probably we were run as `sh COMMAND' +# in which case we are not to be found in the path. +if test "x$as_myself" = x; then + as_myself=$0 +fi +if test ! 
-f "$as_myself"; then + $as_echo "$as_myself: error: cannot find myself; rerun with an absolute file name" >&2 + exit 1 +fi + +# Unset variables that we do not need and which cause bugs (e.g. in +# pre-3.0 UWIN ksh). But do not cause bugs in bash 2.01; the "|| exit 1" +# suppresses any "Segmentation fault" message there. '((' could +# trigger a bug in pdksh 5.2.14. +for as_var in BASH_ENV ENV MAIL MAILPATH +do eval test x\${$as_var+set} = xset \ + && ( (unset $as_var) || exit 1) >/dev/null 2>&1 && unset $as_var || : +done +PS1='$ ' +PS2='> ' +PS4='+ ' + +# NLS nuisances. +LC_ALL=C +export LC_ALL +LANGUAGE=C +export LANGUAGE + +# CDPATH. +(unset CDPATH) >/dev/null 2>&1 && unset CDPATH + + +# as_fn_error STATUS ERROR [LINENO LOG_FD] +# ---------------------------------------- +# Output "`basename $0`: error: ERROR" to stderr. If LINENO and LOG_FD are +# provided, also output the error to LOG_FD, referencing LINENO. Then exit the +# script with STATUS, using 1 if that was 0. +as_fn_error () +{ + as_status=$1; test $as_status -eq 0 && as_status=1 + if test "$4"; then + as_lineno=${as_lineno-"$3"} as_lineno_stack=as_lineno_stack=$as_lineno_stack + $as_echo "$as_me:${as_lineno-$LINENO}: error: $2" >&$4 + fi + $as_echo "$as_me: error: $2" >&2 + as_fn_exit $as_status +} # as_fn_error + + +# as_fn_set_status STATUS +# ----------------------- +# Set $? to STATUS, without forking. +as_fn_set_status () +{ + return $1 +} # as_fn_set_status + +# as_fn_exit STATUS +# ----------------- +# Exit the shell with STATUS, even in a "trap 0" or "set -e" context. +as_fn_exit () +{ + set +e + as_fn_set_status $1 + exit $1 +} # as_fn_exit + +# as_fn_unset VAR +# --------------- +# Portably unset VAR. +as_fn_unset () +{ + { eval $1=; unset $1;} +} +as_unset=as_fn_unset +# as_fn_append VAR VALUE +# ---------------------- +# Append the text in VALUE to the end of the definition contained in VAR. 
Take +# advantage of any shell optimizations that allow amortized linear growth over +# repeated appends, instead of the typical quadratic growth present in naive +# implementations. +if (eval "as_var=1; as_var+=2; test x\$as_var = x12") 2>/dev/null; then : + eval 'as_fn_append () + { + eval $1+=\$2 + }' +else + as_fn_append () + { + eval $1=\$$1\$2 + } +fi # as_fn_append + +# as_fn_arith ARG... +# ------------------ +# Perform arithmetic evaluation on the ARGs, and store the result in the +# global $as_val. Take advantage of shells that can avoid forks. The arguments +# must be portable across $(()) and expr. +if (eval "test \$(( 1 + 1 )) = 2") 2>/dev/null; then : + eval 'as_fn_arith () + { + as_val=$(( $* )) + }' +else + as_fn_arith () + { + as_val=`expr "$@" || test $? -eq 1` + } +fi # as_fn_arith + + +if expr a : '\(a\)' >/dev/null 2>&1 && + test "X`expr 00001 : '.*\(...\)'`" = X001; then + as_expr=expr +else + as_expr=false +fi + +if (basename -- /) >/dev/null 2>&1 && test "X`basename -- / 2>&1`" = "X/"; then + as_basename=basename +else + as_basename=false +fi + +if (as_dir=`dirname -- /` && test "X$as_dir" = X/) >/dev/null 2>&1; then + as_dirname=dirname +else + as_dirname=false +fi + +as_me=`$as_basename -- "$0" || +$as_expr X/"$0" : '.*/\([^/][^/]*\)/*$' \| \ + X"$0" : 'X\(//\)$' \| \ + X"$0" : 'X\(/\)' \| . 2>/dev/null || +$as_echo X/"$0" | + sed '/^.*\/\([^/][^/]*\)\/*$/{ + s//\1/ + q + } + /^X\/\(\/\/\)$/{ + s//\1/ + q + } + /^X\/\(\/\).*/{ + s//\1/ + q + } + s/.*/./; q'` + +# Avoid depending upon Character Ranges. +as_cr_letters='abcdefghijklmnopqrstuvwxyz' +as_cr_LETTERS='ABCDEFGHIJKLMNOPQRSTUVWXYZ' +as_cr_Letters=$as_cr_letters$as_cr_LETTERS +as_cr_digits='0123456789' +as_cr_alnum=$as_cr_Letters$as_cr_digits + +ECHO_C= ECHO_N= ECHO_T= +case `echo -n x` in #((((( +-n*) + case `echo 'xy\c'` in + *c*) ECHO_T=' ';; # ECHO_T is single tab character. 
+ xy) ECHO_C='\c';; + *) echo `echo ksh88 bug on AIX 6.1` > /dev/null + ECHO_T=' ';; + esac;; +*) + ECHO_N='-n';; +esac + +rm -f conf$$ conf$$.exe conf$$.file +if test -d conf$$.dir; then + rm -f conf$$.dir/conf$$.file +else + rm -f conf$$.dir + mkdir conf$$.dir 2>/dev/null +fi +if (echo >conf$$.file) 2>/dev/null; then + if ln -s conf$$.file conf$$ 2>/dev/null; then + as_ln_s='ln -s' + # ... but there are two gotchas: + # 1) On MSYS, both `ln -s file dir' and `ln file dir' fail. + # 2) DJGPP < 2.04 has no symlinks; `ln -s' creates a wrapper executable. + # In both cases, we have to default to `cp -pR'. + ln -s conf$$.file conf$$.dir 2>/dev/null && test ! -f conf$$.exe || + as_ln_s='cp -pR' + elif ln conf$$.file conf$$ 2>/dev/null; then + as_ln_s=ln + else + as_ln_s='cp -pR' + fi +else + as_ln_s='cp -pR' +fi +rm -f conf$$ conf$$.exe conf$$.dir/conf$$.file conf$$.file +rmdir conf$$.dir 2>/dev/null + + +# as_fn_mkdir_p +# ------------- +# Create "$as_dir" as a directory, including parents if necessary. +as_fn_mkdir_p () +{ + + case $as_dir in #( + -*) as_dir=./$as_dir;; + esac + test -d "$as_dir" || eval $as_mkdir_p || { + as_dirs= + while :; do + case $as_dir in #( + *\'*) as_qdir=`$as_echo "$as_dir" | sed "s/'/'\\\\\\\\''/g"`;; #'( + *) as_qdir=$as_dir;; + esac + as_dirs="'$as_qdir' $as_dirs" + as_dir=`$as_dirname -- "$as_dir" || +$as_expr X"$as_dir" : 'X\(.*[^/]\)//*[^/][^/]*/*$' \| \ + X"$as_dir" : 'X\(//\)[^/]' \| \ + X"$as_dir" : 'X\(//\)$' \| \ + X"$as_dir" : 'X\(/\)' \| . 2>/dev/null || +$as_echo X"$as_dir" | + sed '/^X\(.*[^/]\)\/\/*[^/][^/]*\/*$/{ + s//\1/ + q + } + /^X\(\/\/\)[^/].*/{ + s//\1/ + q + } + /^X\(\/\/\)$/{ + s//\1/ + q + } + /^X\(\/\).*/{ + s//\1/ + q + } + s/.*/./; q'` + test -d "$as_dir" && break + done + test -z "$as_dirs" || eval "mkdir $as_dirs" + } || test -d "$as_dir" || as_fn_error $? "cannot create directory $as_dir" + + +} # as_fn_mkdir_p +if mkdir -p . 
2>/dev/null; then + as_mkdir_p='mkdir -p "$as_dir"' +else + test -d ./-p && rmdir ./-p + as_mkdir_p=false +fi + + +# as_fn_executable_p FILE +# ----------------------- +# Test if FILE is an executable regular file. +as_fn_executable_p () +{ + test -f "$1" && test -x "$1" +} # as_fn_executable_p +as_test_x='test -x' +as_executable_p=as_fn_executable_p + +# Sed expression to map a string onto a valid CPP name. +as_tr_cpp="eval sed 'y%*$as_cr_letters%P$as_cr_LETTERS%;s%[^_$as_cr_alnum]%_%g'" + +# Sed expression to map a string onto a valid variable name. +as_tr_sh="eval sed 'y%*+%pp%;s%[^_$as_cr_alnum]%_%g'" + + +exec 6>&1 +## ----------------------------------- ## +## Main body of $CONFIG_STATUS script. ## +## ----------------------------------- ## +_ASEOF +test $as_write_fail = 0 && chmod +x $CONFIG_STATUS || ac_write_fail=1 + +cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1 +# Save the log message, to keep $0 and so on meaningful, and to +# report actual input values of CONFIG_FILES etc. instead of their +# values after options handling. +ac_log=" +This file was extended by sgtsnepi $as_me version-1.0, which was +generated by GNU Autoconf 2.69. Invocation command line was + + CONFIG_FILES = $CONFIG_FILES + CONFIG_HEADERS = $CONFIG_HEADERS + CONFIG_LINKS = $CONFIG_LINKS + CONFIG_COMMANDS = $CONFIG_COMMANDS + $ $0 $@ + +on `(hostname || uname -n) 2>/dev/null | sed 1q` +" + +_ACEOF + +case $ac_config_files in *" +"*) set x $ac_config_files; shift; ac_config_files=$*;; +esac + + + +cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1 +# Files that config.status was made for. +config_files="$ac_config_files" + +_ACEOF + +cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1 +ac_cs_usage="\ +\`$as_me' instantiates files and other configuration actions +from templates according to the current configuration. Unless the files +and actions are specified as TAGs, all are instantiated by default. + +Usage: $0 [OPTION]... [TAG]... 
+ + -h, --help print this help, then exit + -V, --version print version number and configuration settings, then exit + --config print configuration, then exit + -q, --quiet, --silent + do not print progress messages + -d, --debug don't remove temporary files + --recheck update $as_me by reconfiguring in the same conditions + --file=FILE[:TEMPLATE] + instantiate the configuration file FILE + +Configuration files: +$config_files + +Report bugs to the package provider." + +_ACEOF +cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1 +ac_cs_config="`$as_echo "$ac_configure_args" | sed 's/^ //; s/[\\""\`\$]/\\\\&/g'`" +ac_cs_version="\\ +sgtsnepi config.status version-1.0 +configured by $0, generated by GNU Autoconf 2.69, + with options \\"\$ac_cs_config\\" + +Copyright (C) 2012 Free Software Foundation, Inc. +This config.status script is free software; the Free Software Foundation +gives unlimited permission to copy, distribute and modify it." + +ac_pwd='$ac_pwd' +srcdir='$srcdir' +test -n "\$AWK" || AWK=awk +_ACEOF + +cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1 +# The default lists apply if the user does not specify any file. +ac_need_defaults=: +while test $# != 0 +do + case $1 in + --*=?*) + ac_option=`expr "X$1" : 'X\([^=]*\)='` + ac_optarg=`expr "X$1" : 'X[^=]*=\(.*\)'` + ac_shift=: + ;; + --*=) + ac_option=`expr "X$1" : 'X\([^=]*\)='` + ac_optarg= + ac_shift=: + ;; + *) + ac_option=$1 + ac_optarg=$2 + ac_shift=shift + ;; + esac + + case $ac_option in + # Handling of the options. 
+ -recheck | --recheck | --rechec | --reche | --rech | --rec | --re | --r) + ac_cs_recheck=: ;; + --version | --versio | --versi | --vers | --ver | --ve | --v | -V ) + $as_echo "$ac_cs_version"; exit ;; + --config | --confi | --conf | --con | --co | --c ) + $as_echo "$ac_cs_config"; exit ;; + --debug | --debu | --deb | --de | --d | -d ) + debug=: ;; + --file | --fil | --fi | --f ) + $ac_shift + case $ac_optarg in + *\'*) ac_optarg=`$as_echo "$ac_optarg" | sed "s/'/'\\\\\\\\''/g"` ;; + '') as_fn_error $? "missing file argument" ;; + esac + as_fn_append CONFIG_FILES " '$ac_optarg'" + ac_need_defaults=false;; + --he | --h | --help | --hel | -h ) + $as_echo "$ac_cs_usage"; exit ;; + -q | -quiet | --quiet | --quie | --qui | --qu | --q \ + | -silent | --silent | --silen | --sile | --sil | --si | --s) + ac_cs_silent=: ;; + + # This is an error. + -*) as_fn_error $? "unrecognized option: \`$1' +Try \`$0 --help' for more information." ;; + + *) as_fn_append ac_config_targets " $1" + ac_need_defaults=false ;; + + esac + shift +done + +ac_configure_extra_args= + +if $ac_cs_silent; then + exec 6>/dev/null + ac_configure_extra_args="$ac_configure_extra_args --silent" +fi + +_ACEOF +cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1 +if \$ac_cs_recheck; then + set X $SHELL '$0' $ac_configure_args \$ac_configure_extra_args --no-create --no-recursion + shift + \$as_echo "running CONFIG_SHELL=$SHELL \$*" >&6 + CONFIG_SHELL='$SHELL' + export CONFIG_SHELL + exec "\$@" +fi + +_ACEOF +cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1 +exec 5>>config.log +{ + echo + sed 'h;s/./-/g;s/^.../## /;s/...$/ ##/;p;x;p;x' <<_ASBOX +## Running $as_me. ## +_ASBOX + $as_echo "$ac_log" +} >&5 + +_ACEOF +cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1 +_ACEOF + +cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1 + +# Handling of arguments. +for ac_config_target in $ac_config_targets +do + case $ac_config_target in + "Makefile") CONFIG_FILES="$CONFIG_FILES Makefile" ;; + + *) as_fn_error $? 
"invalid argument: \`$ac_config_target'" "$LINENO" 5;; + esac +done + + +# If the user did not use the arguments to specify the items to instantiate, +# then the envvar interface is used. Set only those that are not. +# We use the long form for the default assignment because of an extremely +# bizarre bug on SunOS 4.1.3. +if $ac_need_defaults; then + test "${CONFIG_FILES+set}" = set || CONFIG_FILES=$config_files +fi + +# Have a temporary directory for convenience. Make it in the build tree +# simply because there is no reason against having it here, and in addition, +# creating and moving files from /tmp can sometimes cause problems. +# Hook for its removal unless debugging. +# Note that there is a small window in which the directory will not be cleaned: +# after its creation but before its name has been assigned to `$tmp'. +$debug || +{ + tmp= ac_tmp= + trap 'exit_status=$? + : "${ac_tmp:=$tmp}" + { test ! -d "$ac_tmp" || rm -fr "$ac_tmp"; } && exit $exit_status +' 0 + trap 'as_fn_exit 1' 1 2 13 15 +} +# Create a (secure) tmp directory for tmp files. + +{ + tmp=`(umask 077 && mktemp -d "./confXXXXXX") 2>/dev/null` && + test -d "$tmp" +} || +{ + tmp=./conf$$-$RANDOM + (umask 077 && mkdir "$tmp") +} || as_fn_error $? "cannot create a temporary directory in ." "$LINENO" 5 +ac_tmp=$tmp + +# Set up the scripts for CONFIG_FILES section. +# No need to generate them if there are no CONFIG_FILES. +# This happens for instance with `./config.status config.h'. +if test -n "$CONFIG_FILES"; then + + +ac_cr=`echo X | tr X '\015'` +# On cygwin, bash can eat \r inside `` if the user requested igncr. +# But we know of no other shell where ac_cr would be empty at this +# point, so we can use a bashism as a fallback. 
+if test "x$ac_cr" = x; then + eval ac_cr=\$\'\\r\' +fi +ac_cs_awk_cr=`$AWK 'BEGIN { print "a\rb" }' /dev/null` +if test "$ac_cs_awk_cr" = "a${ac_cr}b"; then + ac_cs_awk_cr='\\r' +else + ac_cs_awk_cr=$ac_cr +fi + +echo 'BEGIN {' >"$ac_tmp/subs1.awk" && +_ACEOF + + +{ + echo "cat >conf$$subs.awk <<_ACEOF" && + echo "$ac_subst_vars" | sed 's/.*/&!$&$ac_delim/' && + echo "_ACEOF" +} >conf$$subs.sh || + as_fn_error $? "could not make $CONFIG_STATUS" "$LINENO" 5 +ac_delim_num=`echo "$ac_subst_vars" | grep -c '^'` +ac_delim='%!_!# ' +for ac_last_try in false false false false false :; do + . ./conf$$subs.sh || + as_fn_error $? "could not make $CONFIG_STATUS" "$LINENO" 5 + + ac_delim_n=`sed -n "s/.*$ac_delim\$/X/p" conf$$subs.awk | grep -c X` + if test $ac_delim_n = $ac_delim_num; then + break + elif $ac_last_try; then + as_fn_error $? "could not make $CONFIG_STATUS" "$LINENO" 5 + else + ac_delim="$ac_delim!$ac_delim _$ac_delim!! " + fi +done +rm -f conf$$subs.sh + +cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1 +cat >>"\$ac_tmp/subs1.awk" <<\\_ACAWK && +_ACEOF +sed -n ' +h +s/^/S["/; s/!.*/"]=/ +p +g +s/^[^!]*!// +:repl +t repl +s/'"$ac_delim"'$// +t delim +:nl +h +s/\(.\{148\}\)..*/\1/ +t more1 +s/["\\]/\\&/g; s/^/"/; s/$/\\n"\\/ +p +n +b repl +:more1 +s/["\\]/\\&/g; s/^/"/; s/$/"\\/ +p +g +s/.\{148\}// +t nl +:delim +h +s/\(.\{148\}\)..*/\1/ +t more2 +s/["\\]/\\&/g; s/^/"/; s/$/"/ +p +b +:more2 +s/["\\]/\\&/g; s/^/"/; s/$/"\\/ +p +g +s/.\{148\}// +t delim +' >$CONFIG_STATUS || ac_write_fail=1 +rm -f conf$$subs.awk +cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1 +_ACAWK +cat >>"\$ac_tmp/subs1.awk" <<_ACAWK && + for (key in S) S_is_set[key] = 1 + FS = "" + +} +{ + line = $ 0 + nfields = split(line, field, "@") + substed = 0 + len = length(field[1]) + for (i = 2; i < nfields; i++) { + key = field[i] + keylen = length(key) + if (S_is_set[key]) { + value = S[key] + line = substr(line, 1, len) "" value "" substr(line, len + keylen + 3) + len += length(value) + 
length(field[++i]) + substed = 1 + } else + len += 1 + keylen + } + + print line +} + +_ACAWK +_ACEOF +cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1 +if sed "s/$ac_cr//" < /dev/null > /dev/null 2>&1; then + sed "s/$ac_cr\$//; s/$ac_cr/$ac_cs_awk_cr/g" +else + cat +fi < "$ac_tmp/subs1.awk" > "$ac_tmp/subs.awk" \ + || as_fn_error $? "could not setup config files machinery" "$LINENO" 5 +_ACEOF + +# VPATH may cause trouble with some makes, so we remove sole $(srcdir), +# ${srcdir} and @srcdir@ entries from VPATH if srcdir is ".", strip leading and +# trailing colons and then remove the whole line if VPATH becomes empty +# (actually we leave an empty line to preserve line numbers). +if test "x$srcdir" = x.; then + ac_vpsub='/^[ ]*VPATH[ ]*=[ ]*/{ +h +s/// +s/^/:/ +s/[ ]*$/:/ +s/:\$(srcdir):/:/g +s/:\${srcdir}:/:/g +s/:@srcdir@:/:/g +s/^:*// +s/:*$// +x +s/\(=[ ]*\).*/\1/ +G +s/\n// +s/^[^=]*=[ ]*$// +}' +fi + +cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1 +fi # test -n "$CONFIG_FILES" + + +eval set X " :F $CONFIG_FILES " +shift +for ac_tag +do + case $ac_tag in + :[FHLC]) ac_mode=$ac_tag; continue;; + esac + case $ac_mode$ac_tag in + :[FHL]*:*);; + :L* | :C*:*) as_fn_error $? "invalid tag \`$ac_tag'" "$LINENO" 5;; + :[FH]-) ac_tag=-:-;; + :[FH]*) ac_tag=$ac_tag:$ac_tag.in;; + esac + ac_save_IFS=$IFS + IFS=: + set x $ac_tag + IFS=$ac_save_IFS + shift + ac_file=$1 + shift + + case $ac_mode in + :L) ac_source=$1;; + :[FH]) + ac_file_inputs= + for ac_f + do + case $ac_f in + -) ac_f="$ac_tmp/stdin";; + *) # Look for the file first in the build tree, then in the source tree + # (if the path is not absolute). The absolute path cannot be DOS-style, + # because $ac_f cannot contain `:'. 
+ test -f "$ac_f" || + case $ac_f in + [\\/$]*) false;; + *) test -f "$srcdir/$ac_f" && ac_f="$srcdir/$ac_f";; + esac || + as_fn_error 1 "cannot find input file: \`$ac_f'" "$LINENO" 5;; + esac + case $ac_f in *\'*) ac_f=`$as_echo "$ac_f" | sed "s/'/'\\\\\\\\''/g"`;; esac + as_fn_append ac_file_inputs " '$ac_f'" + done + + # Let's still pretend it is `configure' which instantiates (i.e., don't + # use $as_me), people would be surprised to read: + # /* config.h. Generated by config.status. */ + configure_input='Generated from '` + $as_echo "$*" | sed 's|^[^:]*/||;s|:[^:]*/|, |g' + `' by configure.' + if test x"$ac_file" != x-; then + configure_input="$ac_file. $configure_input" + { $as_echo "$as_me:${as_lineno-$LINENO}: creating $ac_file" >&5 +$as_echo "$as_me: creating $ac_file" >&6;} + fi + # Neutralize special characters interpreted by sed in replacement strings. + case $configure_input in #( + *\&* | *\|* | *\\* ) + ac_sed_conf_input=`$as_echo "$configure_input" | + sed 's/[\\\\&|]/\\\\&/g'`;; #( + *) ac_sed_conf_input=$configure_input;; + esac + + case $ac_tag in + *:-:* | *:-) cat >"$ac_tmp/stdin" \ + || as_fn_error $? "could not create $ac_file" "$LINENO" 5 ;; + esac + ;; + esac + + ac_dir=`$as_dirname -- "$ac_file" || +$as_expr X"$ac_file" : 'X\(.*[^/]\)//*[^/][^/]*/*$' \| \ + X"$ac_file" : 'X\(//\)[^/]' \| \ + X"$ac_file" : 'X\(//\)$' \| \ + X"$ac_file" : 'X\(/\)' \| . 2>/dev/null || +$as_echo X"$ac_file" | + sed '/^X\(.*[^/]\)\/\/*[^/][^/]*\/*$/{ + s//\1/ + q + } + /^X\(\/\/\)[^/].*/{ + s//\1/ + q + } + /^X\(\/\/\)$/{ + s//\1/ + q + } + /^X\(\/\).*/{ + s//\1/ + q + } + s/.*/./; q'` + as_dir="$ac_dir"; as_fn_mkdir_p + ac_builddir=. + +case "$ac_dir" in +.) ac_dir_suffix= ac_top_builddir_sub=. ac_top_build_prefix= ;; +*) + ac_dir_suffix=/`$as_echo "$ac_dir" | sed 's|^\.[\\/]||'` + # A ".." for each directory in $ac_dir_suffix. 
+ ac_top_builddir_sub=`$as_echo "$ac_dir_suffix" | sed 's|/[^\\/]*|/..|g;s|/||'` + case $ac_top_builddir_sub in + "") ac_top_builddir_sub=. ac_top_build_prefix= ;; + *) ac_top_build_prefix=$ac_top_builddir_sub/ ;; + esac ;; +esac +ac_abs_top_builddir=$ac_pwd +ac_abs_builddir=$ac_pwd$ac_dir_suffix +# for backward compatibility: +ac_top_builddir=$ac_top_build_prefix + +case $srcdir in + .) # We are building in place. + ac_srcdir=. + ac_top_srcdir=$ac_top_builddir_sub + ac_abs_top_srcdir=$ac_pwd ;; + [\\/]* | ?:[\\/]* ) # Absolute name. + ac_srcdir=$srcdir$ac_dir_suffix; + ac_top_srcdir=$srcdir + ac_abs_top_srcdir=$srcdir ;; + *) # Relative name. + ac_srcdir=$ac_top_build_prefix$srcdir$ac_dir_suffix + ac_top_srcdir=$ac_top_build_prefix$srcdir + ac_abs_top_srcdir=$ac_pwd/$srcdir ;; +esac +ac_abs_srcdir=$ac_abs_top_srcdir$ac_dir_suffix + + + case $ac_mode in + :F) + # + # CONFIG_FILE + # + +_ACEOF + +cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1 +# If the template does not know about datarootdir, expand it. +# FIXME: This hack should be removed a few years after 2.60. +ac_datarootdir_hack=; ac_datarootdir_seen= +ac_sed_dataroot=' +/datarootdir/ { + p + q +} +/@datadir@/p +/@docdir@/p +/@infodir@/p +/@localedir@/p +/@mandir@/p' +case `eval "sed -n \"\$ac_sed_dataroot\" $ac_file_inputs"` in +*datarootdir*) ac_datarootdir_seen=yes;; +*@datadir@*|*@docdir@*|*@infodir@*|*@localedir@*|*@mandir@*) + { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: $ac_file_inputs seems to ignore the --datarootdir setting" >&5 +$as_echo "$as_me: WARNING: $ac_file_inputs seems to ignore the --datarootdir setting" >&2;} +_ACEOF +cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1 + ac_datarootdir_hack=' + s&@datadir@&$datadir&g + s&@docdir@&$docdir&g + s&@infodir@&$infodir&g + s&@localedir@&$localedir&g + s&@mandir@&$mandir&g + s&\\\${datarootdir}&$datarootdir&g' ;; +esac +_ACEOF + +# Neutralize VPATH when `$srcdir' = `.'. +# Shell code in configure.ac might set extrasub. 
+# FIXME: do we really want to maintain this feature? +cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1 +ac_sed_extra="$ac_vpsub +$extrasub +_ACEOF +cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1 +:t +/@[a-zA-Z_][a-zA-Z_0-9]*@/!b +s|@configure_input@|$ac_sed_conf_input|;t t +s&@top_builddir@&$ac_top_builddir_sub&;t t +s&@top_build_prefix@&$ac_top_build_prefix&;t t +s&@srcdir@&$ac_srcdir&;t t +s&@abs_srcdir@&$ac_abs_srcdir&;t t +s&@top_srcdir@&$ac_top_srcdir&;t t +s&@abs_top_srcdir@&$ac_abs_top_srcdir&;t t +s&@builddir@&$ac_builddir&;t t +s&@abs_builddir@&$ac_abs_builddir&;t t +s&@abs_top_builddir@&$ac_abs_top_builddir&;t t +$ac_datarootdir_hack +" +eval sed \"\$ac_sed_extra\" "$ac_file_inputs" | $AWK -f "$ac_tmp/subs.awk" \ + >$ac_tmp/out || as_fn_error $? "could not create $ac_file" "$LINENO" 5 + +test -z "$ac_datarootdir_hack$ac_datarootdir_seen" && + { ac_out=`sed -n '/\${datarootdir}/p' "$ac_tmp/out"`; test -n "$ac_out"; } && + { ac_out=`sed -n '/^[ ]*datarootdir[ ]*:*=/p' \ + "$ac_tmp/out"`; test -z "$ac_out"; } && + { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: $ac_file contains a reference to the variable \`datarootdir' +which seems to be undefined. Please make sure it is defined" >&5 +$as_echo "$as_me: WARNING: $ac_file contains a reference to the variable \`datarootdir' +which seems to be undefined. Please make sure it is defined" >&2;} + + rm -f "$ac_tmp/stdin" + case $ac_file in + -) cat "$ac_tmp/out" && rm -f "$ac_tmp/out";; + *) rm -f "$ac_file" && mv "$ac_tmp/out" "$ac_file";; + esac \ + || as_fn_error $? "could not create $ac_file" "$LINENO" 5 + ;; + + + + esac + +done # for ac_tag + + +as_fn_exit 0 +_ACEOF +ac_clean_files=$ac_clean_files_save + +test $ac_write_fail = 0 || + as_fn_error $? "write failure creating $CONFIG_STATUS" "$LINENO" 5 + + +# configure is writing to config.log, and then calls config.status. +# config.status does its own redirection, appending to config.log. 
+# Unfortunately, on DOS this fails, as config.log is still kept open +# by configure, so config.status won't be able to write to it; its +# output is simply discarded. So we exec the FD to /dev/null, +# effectively closing config.log, so it can be properly (re)opened and +# appended to by config.status. When coming back to configure, we +# need to make the FD available again. +if test "$no_create" != yes; then + ac_cs_success=: + ac_config_status_args= + test "$silent" = yes && + ac_config_status_args="$ac_config_status_args --quiet" + exec 5>/dev/null + $SHELL $CONFIG_STATUS $ac_config_status_args || ac_cs_success=false + exec 5>>config.log + # Use ||, not &&, to avoid exiting from the if with $? = 1, which + # would make configure fail if this is the last instruction. + $ac_cs_success || as_fn_exit 1 +fi +if test -n "$ac_unrecognized_opts" && test "$enable_option_checking" != no; then + { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: unrecognized options: $ac_unrecognized_opts" >&5 +$as_echo "$as_me: WARNING: unrecognized options: $ac_unrecognized_opts" >&2;} +fi + diff --git a/csb/SSEspmv.cpp b/csb/SSEspmv.cpp new file mode 100644 index 0000000..54ffef4 --- /dev/null +++ b/csb/SSEspmv.cpp @@ -0,0 +1,1294 @@ +#include +#include +#include +#include +#include +#include +#include // MMX +#include // SSE +#include // SSE 2 +#include // SSE 3 + +#ifndef AMD + #include // SSSE 3 + #include // SSE 4.1 + #include // SSE 4.2 + #include // SSE ?? 
(AES) +#endif + +#ifdef ICC + #include +#else + #include // SSE4A (amd's popcount) +#endif + +using namespace std; + + +const uint64_t masktable64[64] = {0x8000000000000000, 0x4000000000000000, 0x2000000000000000, 0x1000000000000000, + 0x0800000000000000, 0x0400000000000000, 0x0200000000000000, 0x0100000000000000, + 0x0080000000000000, 0x0040000000000000, 0x0020000000000000, 0x0010000000000000, + 0x0008000000000000, 0x0004000000000000, 0x0002000000000000, 0x0001000000000000, + 0x0000800000000000, 0x0000400000000000, 0x0000200000000000, 0x0000100000000000, + 0x0000080000000000, 0x0000040000000000, 0x0000020000000000, 0x0000010000000000, + 0x0000008000000000, 0x0000004000000000, 0x0000002000000000, 0x0000001000000000, + 0x0000000800000000, 0x0000000400000000, 0x0000000200000000, 0x0000000100000000, + 0x0000000080000000, 0x0000000040000000, 0x0000000020000000, 0x0000000010000000, + 0x0000000008000000, 0x0000000004000000, 0x0000000002000000, 0x0000000001000000, + 0x0000000000800000, 0x0000000000400000, 0x0000000000200000, 0x0000000000100000, + 0x0000000000080000, 0x0000000000040000, 0x0000000000020000, 0x0000000000010000, + 0x0000000000008000, 0x0000000000004000, 0x0000000000002000, 0x0000000000001000, + 0x0000000000000800, 0x0000000000000400, 0x0000000000000200, 0x0000000000000100, + 0x0000000000000080, 0x0000000000000040, 0x0000000000000020, 0x0000000000000010, + 0x0000000000000008, 0x0000000000000004, 0x0000000000000002, 0x0000000000000001 }; + + +const unsigned short masktable16[16] = {0x8000, 0x4000, 0x2000, 0x1000, 0x0800, 0x0400, 0x0200, 0x0100, + 0x0080, 0x0040, 0x0020, 0x0010, 0x0008, 0x0004, 0x0002, 0x0001 }; + +//--------------------------------------- +// Type Definitions +//--------------------------------------- + +typedef signed char ssp_s8; +typedef unsigned char ssp_u8; + +typedef signed short ssp_s16; +typedef unsigned short ssp_u16; + +typedef signed int ssp_s32; +typedef unsigned int ssp_u32; + +typedef float ssp_f32; +typedef double ssp_f64; + 
+typedef signed long long ssp_s64; +typedef unsigned long long ssp_u64; + +typedef union +{ +__m128 f; +__m128d d; +__m128i i; +__m64 m64[ 2]; +ssp_u64 u64[ 2]; +ssp_s64 s64[ 2]; +ssp_f64 f64[ 2]; +ssp_u32 u32[ 4]; +ssp_s32 s32[ 4]; +ssp_f32 f32[ 4]; +ssp_u16 u16[ 8]; +ssp_s16 s16[ 8]; +ssp_u8 u8 [16]; +ssp_s8 s8 [16]; +} ssp_m128; + + +/** + * \SSE4_1{SSE2,_mm_blendv_pd} + * ISSUE: Does not short-circuit, i.e. loads 'a' regardless of the mask value + * Question: Does the original blendv_pd (in SSE4.1) short-circuit? + */ +inline __m128d ssp_blendv_pd_SSE2( __m128d a, __m128d b, __m128d mask ) +{ + ssp_m128 A, B, Mask; + A.d = a; + B.d = b; + Mask.d = mask; + +// _MM_SHUFFLE(z,y,x,w) does not select anything; the macro just builds an immediate +// that expands to the following value: (z<<6) | (y<<4) | (x<<2) | w + + Mask.i = _mm_shuffle_epi32( Mask.i, _MM_SHUFFLE(3, 3, 1, 1) ); + Mask.i = _mm_srai_epi32 ( Mask.i, 31 ); + + B.i = _mm_and_si128( B.i, Mask.i ); + A.i = _mm_andnot_si128( Mask.i, A.i ); + A.i = _mm_or_si128( A.i, B.i ); + return A.d; +} + + +#ifdef AMD + #define _mm_blendv_pd ssp_blendv_pd_SSE2 +#endif + +#ifdef ICC + #define __builtin_popcountll _mm_popcnt_u64 + #define __builtin_popcount _mm_popcnt_u32 +#endif + +// 6-bit reversal table (64 entries), used by BitReverse to reverse 16-bit values +const unsigned char BitReverseTable64[] = +{ + 0x0, 0x20, 0x10, 0x30, 0x8, 0x28, 0x18, 0x38, + 0x4, 0x24, 0x14, 0x34, 0xc, 0x2c, 0x1c, 0x3c, + 0x2, 0x22, 0x12, 0x32, 0xa, 0x2a, 0x1a, 0x3a, + 0x6, 0x26, 0x16, 0x36, 0xe, 0x2e, 0x1e, 0x3e, + 0x1, 0x21, 0x11, 0x31, 0x9, 0x29, 0x19, 0x39, + 0x5, 0x25, 0x15, 0x35, 0xd, 0x2d, 0x1d, 0x3d, + 0x3, 0x23, 0x13, 0x33, 0xb, 0x2b, 0x1b, 0x3b, + 0x7, 0x27, 0x17, 0x37, 0xf, 0x2f, 0x1f, 0x3f +}; + + +// reverse a 16-bit value, 6 bits at a time +unsigned short BitReverse(unsigned short v) +{ + unsigned short c = (BitReverseTable64[v & 0x3f] << 10) | + (BitReverseTable64[(v >> 6) & 0x3f] << 4) | + (BitReverseTable64[(v >> 12) & 0x0f] >> 2); + + return c; +} + + +inline void 
atomicallyIncrementDouble(volatile double *target, const double by){ + asm volatile( + "movq %0, %%rax \n\t" // rax = *(%0) + "xorpd %%xmm0, %%xmm0 \n\t" // xmm0 = [0.0,0.0] + "movsd %1, %%xmm0\n\t" // xmm0[lo] = *(%1) + "1:\n\t" + // rax (containing *target) was last set at startup or by a failed cmpxchg + "movq %%rax, %%xmm1\n\t" // xmm1[lo] = rax + "addsd %%xmm0, %%xmm1\n\t" // xmm1 = xmm0 + xmm1 = by + xmm1 + "movq %%xmm1, %%r8 \n\t" // r8 = xmm1[lo] + "lock cmpxchgq %%r8, %0\n\t" // if(*(%0)==rax){ZF=1;*(%0)=r8}else{ZF=0;rax=*(%0);} + "jnz 1b\n\t" // jump back if failed (ZF=0) + : "=m"(*target) // outputs + : "m"(by) // inputs + : "cc", "memory", "%rax", "%r8", "%xmm0", "%xmm1" // clobbered + ); + return; +} + +void symcsr(const double * __restrict V, const uint64_t * __restrict M, const unsigned * __restrict bot, const unsigned nrb, + const double * __restrict X, const double * __restrict XT, double * Y, double * YT, unsigned lowmask, unsigned nlowbits) +{ + static const size_t NMortonRows64[] = + { + 0, 1, 0, 1, 2, 3, 2, 3, 0, 1, 0, 1, 2, 3, 2, 3, + 4, 5, 4, 5, 6, 7, 6, 7, 4, 5, 4, 5, 6, 7, 6, 7, + 0, 1, 0, 1, 2, 3, 2, 3, 0, 1, 0, 1, 2, 3, 2, 3, + 4, 5, 4, 5, 6, 7, 6, 7, 4, 5, 4, 5, 6, 7, 6, 7 + }; + static const size_t NMortonCols64[] = + { + 0, 0, 1, 1, 0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, + 0, 0, 1, 1, 0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, + 4, 4, 5, 5, 4, 4, 5, 5, 6, 6, 7, 7, 6, 6, 7, 7, + 4, 4, 5, 5, 4, 4, 5, 5, 6, 6, 7, 7, 6, 6, 7, 7 + }; + + for(unsigned i=0; i<nrb; ++i) + { + const unsigned Ci = bot[i] & lowmask; + const unsigned Ri = (bot[i] >> nlowbits) & lowmask; + uint64_t mask = M[i]; + for(size_t j=0; j<64; ++j) + { + if(mask & masktable64[j]) + { atomicallyIncrementDouble(&Y[Ri+NMortonRows64[j]], (*V) * X[Ci+NMortonCols64[j]]); + atomicallyIncrementDouble(&YT[Ci+NMortonCols64[j]], (*V) * XT[Ri+NMortonRows64[j]]); + ++V; + } + } + } +} + + +void symcsr(const double * __restrict V, const unsigned short * __restrict M, const unsigned * __restrict bot, const unsigned nrb, + const double * __restrict X, const double * __restrict XT, 
double * Y, double * YT, unsigned lowmask, unsigned nlowbits) +{ + static const size_t NMortonRows16[] = { 0, 1, 0, 1, 2, 3, 2, 3, 0, 1, 0, 1, 2, 3, 2, 3 }; + static const size_t NMortonCols16[] = { 0, 0, 1, 1, 0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3 }; + + for(unsigned i=0; i<nrb; ++i) + { + const unsigned Ci = bot[i] & lowmask; + const unsigned Ri = (bot[i] >> nlowbits) & lowmask; + unsigned short mask = M[i]; + for(size_t j=0; j<16; ++j) + { + if(mask & masktable16[j]) + { atomicallyIncrementDouble(&Y[Ri+NMortonRows16[j]], (*V) * X[Ci+NMortonCols16[j]]); + atomicallyIncrementDouble(&YT[Ci+NMortonCols16[j]], (*V) * XT[Ri+NMortonRows16[j]]); + ++V; + } + } + } +} + +void symcsr(const double * __restrict V, const unsigned char * __restrict M, const unsigned * __restrict bot, const unsigned nrb, + const double * __restrict X, const double * __restrict XT, double * Y, double * YT, unsigned lowmask, unsigned nlowbits) +{ + for(unsigned i=0; i<nrb; ++i) + { + const unsigned Ci = bot[i] & lowmask; + const unsigned Ri = (bot[i] >> nlowbits) & lowmask; + unsigned char mask = M[i]; + if(mask & 0x8) + { + atomicallyIncrementDouble(&Y[Ri+0], (*V) * X[Ci+0]); + atomicallyIncrementDouble(&YT[Ci+0], (*V) * XT[Ri+0]); + V++; + } + + if(mask & 0x4) + { + atomicallyIncrementDouble(&Y[Ri+1], (*V) * X[Ci+0]); + atomicallyIncrementDouble(&YT[Ci+0], (*V) * XT[Ri+1]); + V++; + } + + if(mask & 0x2) + { + atomicallyIncrementDouble(&Y[Ri+0], (*V) * X[Ci+1]); + atomicallyIncrementDouble(&YT[Ci+1], (*V) * XT[Ri+0]); + V++; + } + + if(mask & 0x1) + { + atomicallyIncrementDouble(&Y[Ri+1], (*V) * X[Ci+1]); + atomicallyIncrementDouble(&YT[Ci+1], (*V) * XT[Ri+1]); + V++; + } + } +} + + +/** + * Symmetric SpMV inner kernel using bitmasked register blocks + * 2-by-2 potentially diagonal case (X == XT and Y == YT) + * We can still use the __restrict keyword because we only use one alias for both X and XT + **/ +void SSEsym(const double * __restrict V, const unsigned char * __restrict M, const unsigned * __restrict bot, const unsigned nrb, + const double * __restrict X, double * __restrict Y, unsigned lowmask, unsigned nlbits) +{ + const double * __restrict _V = V-1; + + 
// use popcnt to index into nonzero stream + // use blendv where 1 = zero + for(unsigned ind=0;ind<nrb;++ind) + { + const unsigned Ci = bot[ind] & lowmask; + const unsigned Ri = (bot[ind] >> nlbits) & lowmask; + + const uint64_t m64 = (uint64_t) M[ind]; // upcast to 64 bit, fill-in zeros from left + const uint64_t Zi = ((~m64) << 60); // a 1 denotes a zero + const uint64_t Zil = Zi << 1; + +#ifdef AMD + __m128i Z01QW = _mm_unpacklo_epi64 (_mm_loadl_epi64((__m128i*)&Zi), _mm_loadl_epi64((__m128i*)&Zil)); +#else + __m128i Z01QW = _mm_insert_epi64(_mm_loadl_epi64((__m128i*)&Zi),Zil,1); +#endif + __m128i Z23QW = _mm_slli_epi64(Z01QW, 2); + + __m128d Y01QW = _mm_loadu_pd(&Y[Ri]); + + //-------------------------------------------------------------------------- + __m128d X00QW = _mm_loaddup_pd(&X[0+Ci]); // load and duplicate a double into 128-bit registers. + __m128d X11QW = _mm_loaddup_pd(&X[1+Ci]); + __m128d X01QW = _mm_loadu_pd(&X[Ri]); // the transpose of X aliases X itself + + // {0,2} 02 {0,1} + // {1,3} <- 13 {2,3} + + __m128d A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0x8)])), _mm_setzero_pd(),(__m128d)Z01QW); // ERROR here ! 
[invalid read of _V, debug with sym matrix] + __m128d A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xE)])), _mm_setzero_pd(),(__m128d)Z23QW); + + Y01QW = _mm_add_pd(_mm_mul_pd(X00QW,A01QW),Y01QW); + Y01QW = _mm_add_pd(_mm_mul_pd(X11QW,A23QW),Y01QW); + __m128d Y00QW = _mm_mul_pd(X01QW, A01QW); + __m128d Y11QW = _mm_mul_pd(X01QW, A23QW); + + //-------------------------------------------------------------------------- + _V += __builtin_popcount(M[ind] & 0x0F); + //-------------------------------------------------------------------------- + + ssp_m128 yt0, yt1; + yt0.d = Y00QW; + yt1.d = Y11QW; + + _mm_store_pd(&Y[Ri],Y01QW); + + // The additional Y_T updates should come after we stored Y[Ri] back, otherwise they will be lost + Y[Ci+0] += yt0.f64[0] + yt0.f64[1]; + Y[Ci+1] += yt1.f64[0] + yt1.f64[1]; + } +} + + +/** + * Symmetric SpMV inner kernel using bitmasked register blocks + * 2-by-2 general case (X != XT and Y != YT) + * assumes strict-aliasing on X and Y + **/ +void SSEsym(const double * __restrict V, const unsigned char * __restrict M, const unsigned * __restrict bot, const unsigned nrb, + const double * __restrict X, const double * __restrict XT, double * Y, double * YT, unsigned lowmask, unsigned nlbits) +{ + const double * __restrict _V = V-1; + + // use popcnt to index into nonzero stream + // use blendv where 1 = zero + for(unsigned ind=0;ind<nrb;++ind) + { + const unsigned Ci = bot[ind] & lowmask; + const unsigned Ri = (bot[ind] >> nlbits) & lowmask; + + const uint64_t m64 = (uint64_t) M[ind]; // upcast to 64 bit, fill-in zeros from left + const uint64_t Zi = ((~m64) << 60); // a 1 denotes a zero + const uint64_t Zil = Zi << 1; + +#ifdef AMD + __m128i Z01QW = _mm_unpacklo_epi64 (_mm_loadl_epi64((__m128i*)&Zi), _mm_loadl_epi64((__m128i*)&Zil)); +#else + __m128i Z01QW = _mm_insert_epi64(_mm_loadl_epi64((__m128i*)&Zi),Zil,1); +#endif + __m128i Z23QW = _mm_slli_epi64(Z01QW, 2); + + __m128d Y01QW = _mm_loadu_pd(&Y[Ri]); + + //-------------------------------------------------------------------------- + __m128d 
X00QW = _mm_loaddup_pd(&X[0+Ci]); // load and duplicate a double into 128-bit registers. + __m128d X11QW = _mm_loaddup_pd(&X[1+Ci]); + __m128d X01QW = _mm_loadu_pd(&XT[Ri]); // use the transpose of X + + // {0,2} 02 {0,1} + // {1,3} <- 13 {2,3} + + __m128d A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0x8)])), _mm_setzero_pd(),(__m128d)Z01QW); + __m128d A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xE)])), _mm_setzero_pd(),(__m128d)Z23QW); + + Y01QW = _mm_add_pd(_mm_mul_pd(X00QW,A01QW),Y01QW); + Y01QW = _mm_add_pd(_mm_mul_pd(X11QW,A23QW),Y01QW); + __m128d YT0QW = _mm_mul_pd(X01QW, A01QW); + __m128d YT1QW = _mm_mul_pd(X01QW, A23QW); + + //-------------------------------------------------------------------------- + _V += __builtin_popcount(M[ind] & 0x0F); + //-------------------------------------------------------------------------- + + ssp_m128 yt0, yt1; + yt0.d = YT0QW; + yt1.d = YT1QW; + + YT[Ci+0] += yt0.f64[0] + yt0.f64[1]; + YT[Ci+1] += yt1.f64[0] + yt1.f64[1]; + _mm_store_pd(&Y[Ri],Y01QW); + } +} + + +/** + * SpMV (usually used as a subroutine) using bitmasked register blocks + * This version works only with double values, unsigned indices, and 2x2 register blocks + * @param[in] nrb number of register blocks for this compressed sparse block only + * @param[in] bot the local part of the bottom array, i.e. 
{lower row bits}.{higher row bits} + * \attention SSEspmv should only be called within a single compressed sparse block and + * X and Y should already be partially indexed by the higher order bits + * We don't need any template specialization based on the register block size + * because for different block sizes, M's type differs, hence creating overloaded definitions + **/ +void SSEspmv(const double * __restrict V, const unsigned char * __restrict M, const unsigned * __restrict bot, const unsigned nrb, const double * __restrict X, double * Y, unsigned lcmask, unsigned lrmask, unsigned clbits) +{ + const double * __restrict _V = V-1; + + // use popcnt to index into nonzero stream + // use blendv where 1 = zero + for(unsigned ind=0;ind<nrb;++ind) + { + const unsigned Ci = bot[ind] & lcmask; + const unsigned Ri = (bot[ind] >> clbits) & lrmask; + + const uint64_t m64 = (uint64_t) M[ind]; // upcast to 64 bit, fill-in zeros from left + const uint64_t Zi = ((~m64) << 60); // a 1 denotes a zero + const uint64_t Zil = Zi << 1; + +#ifdef AMD + __m128i Z01QW = _mm_unpacklo_epi64 (_mm_loadl_epi64((__m128i*)&Zi), _mm_loadl_epi64((__m128i*)&Zil)); +#else + __m128i Z01QW = _mm_insert_epi64(_mm_loadl_epi64((__m128i*)&Zi),Zil,1); +#endif + __m128i Z23QW = _mm_slli_epi64(Z01QW, 2); + + __m128d Y01QW = _mm_loadu_pd(&Y[Ri]); + + //-------------------------------------------------------------------------- + __m128d X00QW = _mm_loaddup_pd(&X[0+Ci]); // load and duplicate a double into 128-bit registers. 
+ __m128d X11QW = _mm_loaddup_pd(&X[1+Ci]); + + // {0,2} 02 {0,1} + // {1,3} <- 13 {2,3} + + Y01QW = _mm_add_pd(_mm_mul_pd(X00QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0x8)])),_mm_setzero_pd(),(__m128d)Z01QW)),Y01QW); + Y01QW = _mm_add_pd(_mm_mul_pd(X11QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xE)])),_mm_setzero_pd(),(__m128d)Z23QW)),Y01QW); + + //-------------------------------------------------------------------------- + _V += __builtin_popcount(M[ind] & 0x0F); + //-------------------------------------------------------------------------- + + _mm_store_pd(&Y[Ri],Y01QW); + } +} + +// 8x8 version, using uint64_t for M +// Possibly aliasing (Y=YT or X=XT) version for the blocks right on the diagonal +void SSEsym(const double * __restrict V, const uint64_t * __restrict M, const unsigned * __restrict bot, const unsigned nrb, + const double * __restrict X, double * Y, unsigned lowmask, unsigned nlbits) +{ + const double * __restrict _V = V-1; + + for(unsigned ind=0;ind<nrb;++ind) + { + const unsigned Ci = bot[ind] & lowmask; + const unsigned Ri = (bot[ind] >> nlbits) & lowmask; + const uint64_t Zi = ~M[ind]; // a 1 denotes a zero + const uint64_t Zil = Zi << 1; + +#ifdef AMD + __m128i Z01QW = _mm_unpacklo_epi64 (_mm_loadl_epi64((__m128i*)&Zi), _mm_loadl_epi64((__m128i*)&Zil)); +#else + __m128i Z01QW = _mm_insert_epi64(_mm_loadl_epi64((__m128i*)&Zi),Zil,1); +#endif + __m128i Z23QW = _mm_slli_epi64(Z01QW, 2); + __m128i Z45QW = _mm_slli_epi64(Z01QW, 4); + __m128i Z67QW = _mm_slli_epi64(Z01QW, 6); + + __m128d Y01QW = _mm_loadu_pd(&Y[Ri]); + __m128d Y23QW = _mm_loadu_pd(&Y[Ri+2]); + __m128d Y45QW = _mm_loadu_pd(&Y[Ri+4]); + __m128d Y67QW = _mm_loadu_pd(&Y[Ri+6]); + + //-------------------------------------------------------------------------- + __m128d X00QW = _mm_loaddup_pd(&X[0+Ci]); // load and duplicate a double into 128-bit registers. 
+ __m128d X11QW = _mm_loaddup_pd(&X[1+Ci]); + __m128d X22QW = _mm_loaddup_pd(&X[2+Ci]); + __m128d X33QW = _mm_loaddup_pd(&X[3+Ci]); + + __m128d X01QW = _mm_loadu_pd(&X[Ri]); // the transpose of X aliases X itself + __m128d X23QW = _mm_loadu_pd(&X[Ri+2]); + __m128d X45QW = _mm_loadu_pd(&X[Ri+4]); + __m128d X67QW = _mm_loadu_pd(&X[Ri+6]); + + __m128d A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0x8000000000000000)])), _mm_setzero_pd(),(__m128d)Z01QW); + __m128d A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xE000000000000000)])), _mm_setzero_pd(),(__m128d)Z23QW); + __m128d A45QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xF800000000000000)])), _mm_setzero_pd(),(__m128d)Z45QW); + __m128d A67QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFE00000000000000)])), _mm_setzero_pd(),(__m128d)Z67QW); + + Y01QW = _mm_add_pd(_mm_mul_pd(X00QW, A01QW), Y01QW); Z01QW=_mm_slli_epi64(Z01QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X00QW, A45QW), Y23QW); Z45QW=_mm_slli_epi64(Z45QW,8); + Y01QW = _mm_add_pd(_mm_mul_pd(X11QW, A23QW), Y01QW); Z23QW=_mm_slli_epi64(Z23QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X11QW, A67QW), Y23QW); Z67QW=_mm_slli_epi64(Z67QW,8); + + __m128d Y00QW = _mm_mul_pd(X01QW, A01QW); + __m128d Y11QW = _mm_mul_pd(X01QW, A23QW); + Y00QW = _mm_add_pd(_mm_mul_pd(X23QW, A45QW), Y00QW); + Y11QW = _mm_add_pd(_mm_mul_pd(X23QW, A67QW), Y11QW); + + // reuse variables for the second half of the first quadrand + A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFF80000000000000)])), _mm_setzero_pd(),(__m128d)Z01QW); + A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFE0000000000000)])), _mm_setzero_pd(),(__m128d)Z23QW); + A45QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFF8000000000000)])), _mm_setzero_pd(),(__m128d)Z45QW); 
+ A67QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFE000000000000)])), _mm_setzero_pd(),(__m128d)Z67QW); + + Y01QW = _mm_add_pd(_mm_mul_pd(X22QW, A01QW), Y01QW); Z01QW=_mm_slli_epi64(Z01QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X22QW, A45QW), Y23QW); Z45QW=_mm_slli_epi64(Z45QW,8); + Y01QW = _mm_add_pd(_mm_mul_pd(X33QW, A23QW), Y01QW); Z23QW=_mm_slli_epi64(Z23QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X33QW, A67QW), Y23QW); Z67QW=_mm_slli_epi64(Z67QW,8); + + __m128d Y22QW = _mm_mul_pd(X01QW, A01QW); // the transpose (lower-triangular) updates + __m128d Y33QW = _mm_mul_pd(X01QW, A23QW); + Y22QW = _mm_add_pd(_mm_mul_pd(X23QW, A45QW), Y22QW); + Y33QW = _mm_add_pd(_mm_mul_pd(X23QW, A67QW), Y33QW); + + // Second quadrand + A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFF800000000000)])), _mm_setzero_pd(),(__m128d)Z01QW); + A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFE00000000000)])), _mm_setzero_pd(),(__m128d)Z23QW); + A45QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFF80000000000)])), _mm_setzero_pd(),(__m128d)Z45QW); + A67QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFE0000000000)])), _mm_setzero_pd(),(__m128d)Z67QW); + + Y45QW = _mm_add_pd(_mm_mul_pd(X00QW, A01QW), Y45QW); Z01QW=_mm_slli_epi64(Z01QW,8); + Y67QW = _mm_add_pd(_mm_mul_pd(X00QW, A45QW), Y67QW); Z45QW=_mm_slli_epi64(Z45QW,8); + Y45QW = _mm_add_pd(_mm_mul_pd(X11QW, A23QW), Y45QW); Z23QW=_mm_slli_epi64(Z23QW,8); + Y67QW = _mm_add_pd(_mm_mul_pd(X11QW, A67QW), Y67QW); Z67QW=_mm_slli_epi64(Z67QW,8); + + Y00QW = _mm_add_pd(_mm_mul_pd(X45QW, A01QW), Y00QW); // the transpose updates + Y11QW = _mm_add_pd(_mm_mul_pd(X45QW, A23QW), Y11QW); + Y00QW = _mm_add_pd(_mm_mul_pd(X67QW, A45QW), Y00QW); + Y11QW = _mm_add_pd(_mm_mul_pd(X67QW, A67QW), Y11QW); + + A01QW = 
_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFF8000000000)])), _mm_setzero_pd(),(__m128d)Z01QW); + A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFE000000000)])), _mm_setzero_pd(),(__m128d)Z23QW); + A45QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFF800000000)])), _mm_setzero_pd(),(__m128d)Z45QW); + A67QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFE00000000)])), _mm_setzero_pd(),(__m128d)Z67QW); + + Y45QW = _mm_add_pd(_mm_mul_pd(X22QW, A01QW), Y45QW); Z01QW=_mm_slli_epi64(Z01QW,8); + Y67QW = _mm_add_pd(_mm_mul_pd(X22QW, A45QW), Y67QW); Z45QW=_mm_slli_epi64(Z45QW,8); + Y45QW = _mm_add_pd(_mm_mul_pd(X33QW, A23QW), Y45QW); Z23QW=_mm_slli_epi64(Z23QW,8); + Y67QW = _mm_add_pd(_mm_mul_pd(X33QW, A67QW), Y67QW); Z67QW=_mm_slli_epi64(Z67QW,8); + + Y22QW = _mm_add_pd(_mm_mul_pd(X45QW, A01QW), Y22QW); // the transpose updates + Y33QW = _mm_add_pd(_mm_mul_pd(X45QW, A23QW), Y33QW); + Y22QW = _mm_add_pd(_mm_mul_pd(X67QW, A45QW), Y22QW); + Y33QW = _mm_add_pd(_mm_mul_pd(X67QW, A67QW), Y33QW); + + + // Reuse registers (e.g., X00QW <- X44QW) + X00QW = _mm_loaddup_pd(&X[4+Ci]); + X11QW = _mm_loaddup_pd(&X[5+Ci]); + X22QW = _mm_loaddup_pd(&X[6+Ci]); + X33QW = _mm_loaddup_pd(&X[7+Ci]); + + // Third quadrand + A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFF80000000)])), _mm_setzero_pd(),(__m128d)Z01QW); + A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFE0000000)])), _mm_setzero_pd(),(__m128d)Z23QW); + A45QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFF8000000)])), _mm_setzero_pd(),(__m128d)Z45QW); + A67QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFE000000)])), _mm_setzero_pd(),(__m128d)Z67QW); + + Y01QW = _mm_add_pd(_mm_mul_pd(X00QW, 
A01QW), Y01QW); Z01QW=_mm_slli_epi64(Z01QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X00QW, A45QW), Y23QW); Z45QW=_mm_slli_epi64(Z45QW,8); + Y01QW = _mm_add_pd(_mm_mul_pd(X11QW, A23QW), Y01QW); Z23QW=_mm_slli_epi64(Z23QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X11QW, A67QW), Y23QW); Z67QW=_mm_slli_epi64(Z67QW,8); + + __m128d Y44QW = _mm_mul_pd(X01QW, A01QW); + __m128d Y55QW = _mm_mul_pd(X01QW, A23QW); + Y44QW = _mm_add_pd(_mm_mul_pd(X23QW, A45QW), Y44QW); + Y55QW = _mm_add_pd(_mm_mul_pd(X23QW, A67QW), Y55QW); + + A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFF800000)])), _mm_setzero_pd(),(__m128d)Z01QW); + A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFE00000)])), _mm_setzero_pd(),(__m128d)Z23QW); + A45QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFF80000)])), _mm_setzero_pd(),(__m128d)Z45QW); + A67QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFE0000)])), _mm_setzero_pd(),(__m128d)Z67QW); + + Y01QW = _mm_add_pd(_mm_mul_pd(X22QW, A01QW), Y01QW); Z01QW=_mm_slli_epi64(Z01QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X22QW, A45QW), Y23QW); Z45QW=_mm_slli_epi64(Z45QW,8); + Y01QW = _mm_add_pd(_mm_mul_pd(X33QW, A23QW), Y01QW); Z23QW=_mm_slli_epi64(Z23QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X33QW, A67QW), Y23QW); Z67QW=_mm_slli_epi64(Z67QW,8); + + __m128d Y66QW = _mm_mul_pd(X01QW, A01QW); + __m128d Y77QW = _mm_mul_pd(X01QW, A23QW); + Y66QW = _mm_add_pd(_mm_mul_pd(X23QW, A45QW), Y66QW); + Y77QW = _mm_add_pd(_mm_mul_pd(X23QW, A67QW), Y77QW); + + // Fourth quadrand + A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFF8000)])), _mm_setzero_pd(),(__m128d)Z01QW); + A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFE000)])), _mm_setzero_pd(),(__m128d)Z23QW); + A45QW = 
_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFF800)])), _mm_setzero_pd(),(__m128d)Z45QW); + A67QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFFE00)])), _mm_setzero_pd(),(__m128d)Z67QW); + + Y45QW = _mm_add_pd(_mm_mul_pd(X00QW, A01QW), Y45QW); Z01QW=_mm_slli_epi64(Z01QW,8); + Y67QW = _mm_add_pd(_mm_mul_pd(X00QW, A45QW), Y67QW); Z45QW=_mm_slli_epi64(Z45QW,8); + Y45QW = _mm_add_pd(_mm_mul_pd(X11QW, A23QW), Y45QW); Z23QW=_mm_slli_epi64(Z23QW,8); + Y67QW = _mm_add_pd(_mm_mul_pd(X11QW, A67QW), Y67QW); Z67QW=_mm_slli_epi64(Z67QW,8); + + Y44QW = _mm_add_pd(_mm_mul_pd(X45QW, A01QW), Y44QW); // the transpose updates + Y55QW = _mm_add_pd(_mm_mul_pd(X45QW, A23QW), Y55QW); + Y44QW = _mm_add_pd(_mm_mul_pd(X67QW, A45QW), Y44QW); + Y55QW = _mm_add_pd(_mm_mul_pd(X67QW, A67QW), Y55QW); + + A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFFF80)])), _mm_setzero_pd(),(__m128d)Z01QW); + A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFFFE0)])), _mm_setzero_pd(),(__m128d)Z23QW); + A45QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFFFF8)])), _mm_setzero_pd(),(__m128d)Z45QW); + A67QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFFFFE)])), _mm_setzero_pd(),(__m128d)Z67QW); + + Y45QW = _mm_add_pd(_mm_mul_pd(X22QW, A01QW), Y45QW); // no need to shift ZxxQW + Y67QW = _mm_add_pd(_mm_mul_pd(X22QW, A45QW), Y67QW); + Y45QW = _mm_add_pd(_mm_mul_pd(X33QW, A23QW), Y45QW); + Y67QW = _mm_add_pd(_mm_mul_pd(X33QW, A67QW), Y67QW); + + Y66QW = _mm_add_pd(_mm_mul_pd(X45QW, A01QW), Y66QW); // the transpose updates + Y77QW = _mm_add_pd(_mm_mul_pd(X45QW, A23QW), Y77QW); + Y66QW = _mm_add_pd(_mm_mul_pd(X67QW, A45QW), Y66QW); + Y77QW = _mm_add_pd(_mm_mul_pd(X67QW, A67QW), Y77QW); + + 
//-------------------------------------------------------------------------- + _V += __builtin_popcountll(M[ind]); + //-------------------------------------------------------------------------- + + _mm_store_pd(&Y[Ri],Y01QW); + _mm_store_pd(&Y[Ri+2],Y23QW); + _mm_store_pd(&Y[Ri+4],Y45QW); + _mm_store_pd(&Y[Ri+6],Y67QW); + + // These mirror updates come after the stores, otherwise we lose the updates + + ssp_m128 yt0, yt1, yt2, yt3,yt4,yt5,yt6,yt7; + yt0.d = Y00QW; + yt1.d = Y11QW; + yt2.d = Y22QW; + yt3.d = Y33QW; + yt4.d = Y44QW; + yt5.d = Y55QW; + yt6.d = Y66QW; + yt7.d = Y77QW; + + Y[Ci+0] += yt0.f64[0] + yt0.f64[1]; + Y[Ci+1] += yt1.f64[0] + yt1.f64[1]; + Y[Ci+2] += yt2.f64[0] + yt2.f64[1]; + Y[Ci+3] += yt3.f64[0] + yt3.f64[1]; + Y[Ci+4] += yt4.f64[0] + yt4.f64[1]; + Y[Ci+5] += yt5.f64[0] + yt5.f64[1]; + Y[Ci+6] += yt6.f64[0] + yt6.f64[1]; + Y[Ci+7] += yt7.f64[0] + yt7.f64[1]; + } +} + + +// 8x8 version, using uint64_t for M +// No aliasing between Y and YT +void SSEsym(const double * __restrict V, const uint64_t * __restrict M, const unsigned * __restrict bot, const unsigned nrb, + const double * __restrict X, const double * __restrict XT, double * __restrict Y, double * __restrict YT, unsigned lowmask, unsigned nlbits) +{ + const double * __restrict _V = V-1; + + for(unsigned ind=0;ind<nrb;++ind) + { + const unsigned Ci = bot[ind] & lowmask; + const unsigned Ri = (bot[ind] >> nlbits) & lowmask; + const uint64_t Zi = ~M[ind]; // a 1 denotes a zero + const uint64_t Zil = Zi << 1; + +#ifdef AMD + __m128i Z01QW = _mm_unpacklo_epi64 (_mm_loadl_epi64((__m128i*)&Zi), _mm_loadl_epi64((__m128i*)&Zil)); +#else + __m128i Z01QW = _mm_insert_epi64(_mm_loadl_epi64((__m128i*)&Zi),Zil,1); +#endif + __m128i Z23QW = _mm_slli_epi64(Z01QW, 2); + __m128i Z45QW = _mm_slli_epi64(Z01QW, 4); + __m128i Z67QW = _mm_slli_epi64(Z01QW, 6); + + __m128d Y01QW = _mm_loadu_pd(&Y[Ri]); + __m128d Y23QW = _mm_loadu_pd(&Y[Ri+2]); + __m128d Y45QW = _mm_loadu_pd(&Y[Ri+4]); + __m128d Y67QW = _mm_loadu_pd(&Y[Ri+6]); + + 
//-------------------------------------------------------------------------- + __m128d X00QW = _mm_loaddup_pd(&X[0+Ci]); // load and duplicate a double into 128-bit registers. + __m128d X11QW = _mm_loaddup_pd(&X[1+Ci]); + __m128d X22QW = _mm_loaddup_pd(&X[2+Ci]); + __m128d X33QW = _mm_loaddup_pd(&X[3+Ci]); + + __m128d X01QW = _mm_loadu_pd(&XT[Ri]); + __m128d X23QW = _mm_loadu_pd(&XT[Ri+2]); + __m128d X45QW = _mm_loadu_pd(&XT[Ri+4]); + __m128d X67QW = _mm_loadu_pd(&XT[Ri+6]); + + __m128d A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0x8000000000000000)])), _mm_setzero_pd(),(__m128d)Z01QW); + __m128d A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xE000000000000000)])), _mm_setzero_pd(),(__m128d)Z23QW); + __m128d A45QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xF800000000000000)])), _mm_setzero_pd(),(__m128d)Z45QW); + __m128d A67QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFE00000000000000)])), _mm_setzero_pd(),(__m128d)Z67QW); + + Y01QW = _mm_add_pd(_mm_mul_pd(X00QW, A01QW), Y01QW); Z01QW=_mm_slli_epi64(Z01QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X00QW, A45QW), Y23QW); Z45QW=_mm_slli_epi64(Z45QW,8); + Y01QW = _mm_add_pd(_mm_mul_pd(X11QW, A23QW), Y01QW); Z23QW=_mm_slli_epi64(Z23QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X11QW, A67QW), Y23QW); Z67QW=_mm_slli_epi64(Z67QW,8); + + __m128d YT0QW = _mm_mul_pd(X01QW, A01QW); + __m128d YT1QW = _mm_mul_pd(X01QW, A23QW); + YT0QW = _mm_add_pd(_mm_mul_pd(X23QW, A45QW), YT0QW); + YT1QW = _mm_add_pd(_mm_mul_pd(X23QW, A67QW), YT1QW); + + // reuse variables for the second half of the first quadrand + A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFF80000000000000)])), _mm_setzero_pd(),(__m128d)Z01QW); + A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFE0000000000000)])), _mm_setzero_pd(),(__m128d)Z23QW); + 
A45QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFF8000000000000)])), _mm_setzero_pd(),(__m128d)Z45QW); + A67QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFE000000000000)])), _mm_setzero_pd(),(__m128d)Z67QW); + + Y01QW = _mm_add_pd(_mm_mul_pd(X22QW, A01QW), Y01QW); Z01QW=_mm_slli_epi64(Z01QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X22QW, A45QW), Y23QW); Z45QW=_mm_slli_epi64(Z45QW,8); + Y01QW = _mm_add_pd(_mm_mul_pd(X33QW, A23QW), Y01QW); Z23QW=_mm_slli_epi64(Z23QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X33QW, A67QW), Y23QW); Z67QW=_mm_slli_epi64(Z67QW,8); + + __m128d YT2QW = _mm_mul_pd(X01QW, A01QW); // the transpose (lower-triangular) updates + __m128d YT3QW = _mm_mul_pd(X01QW, A23QW); + YT2QW = _mm_add_pd(_mm_mul_pd(X23QW, A45QW), YT2QW); + YT3QW = _mm_add_pd(_mm_mul_pd(X23QW, A67QW), YT3QW); + + // Second quadrand + A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFF800000000000)])), _mm_setzero_pd(),(__m128d)Z01QW); + A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFE00000000000)])), _mm_setzero_pd(),(__m128d)Z23QW); + A45QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFF80000000000)])), _mm_setzero_pd(),(__m128d)Z45QW); + A67QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFE0000000000)])), _mm_setzero_pd(),(__m128d)Z67QW); + + Y45QW = _mm_add_pd(_mm_mul_pd(X00QW, A01QW), Y45QW); Z01QW=_mm_slli_epi64(Z01QW,8); + Y67QW = _mm_add_pd(_mm_mul_pd(X00QW, A45QW), Y67QW); Z45QW=_mm_slli_epi64(Z45QW,8); + Y45QW = _mm_add_pd(_mm_mul_pd(X11QW, A23QW), Y45QW); Z23QW=_mm_slli_epi64(Z23QW,8); + Y67QW = _mm_add_pd(_mm_mul_pd(X11QW, A67QW), Y67QW); Z67QW=_mm_slli_epi64(Z67QW,8); + + YT0QW = _mm_add_pd(_mm_mul_pd(X45QW, A01QW), YT0QW); // the transpose updates + YT1QW = _mm_add_pd(_mm_mul_pd(X45QW, A23QW), YT1QW); + YT0QW = 
_mm_add_pd(_mm_mul_pd(X67QW, A45QW), YT0QW); + YT1QW = _mm_add_pd(_mm_mul_pd(X67QW, A67QW), YT1QW); + + A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFF8000000000)])), _mm_setzero_pd(),(__m128d)Z01QW); + A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFE000000000)])), _mm_setzero_pd(),(__m128d)Z23QW); + A45QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFF800000000)])), _mm_setzero_pd(),(__m128d)Z45QW); + A67QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFE00000000)])), _mm_setzero_pd(),(__m128d)Z67QW); + + Y45QW = _mm_add_pd(_mm_mul_pd(X22QW, A01QW), Y45QW); Z01QW=_mm_slli_epi64(Z01QW,8); + Y67QW = _mm_add_pd(_mm_mul_pd(X22QW, A45QW), Y67QW); Z45QW=_mm_slli_epi64(Z45QW,8); + Y45QW = _mm_add_pd(_mm_mul_pd(X33QW, A23QW), Y45QW); Z23QW=_mm_slli_epi64(Z23QW,8); + Y67QW = _mm_add_pd(_mm_mul_pd(X33QW, A67QW), Y67QW); Z67QW=_mm_slli_epi64(Z67QW,8); + + YT2QW = _mm_add_pd(_mm_mul_pd(X45QW, A01QW), YT2QW); // the transpose updates + YT3QW = _mm_add_pd(_mm_mul_pd(X45QW, A23QW), YT3QW); + YT2QW = _mm_add_pd(_mm_mul_pd(X67QW, A45QW), YT2QW); + YT3QW = _mm_add_pd(_mm_mul_pd(X67QW, A67QW), YT3QW); + + ssp_m128 yt0, yt1, yt2, yt3; + yt0.d = YT0QW; + yt1.d = YT1QW; + yt2.d = YT2QW; + yt3.d = YT3QW; + + YT[Ci+0] += yt0.f64[0] + yt0.f64[1]; + YT[Ci+1] += yt1.f64[0] + yt1.f64[1]; + YT[Ci+2] += yt2.f64[0] + yt2.f64[1]; + YT[Ci+3] += yt3.f64[0] + yt3.f64[1]; + + // Reuse registers (e.g., X00QW <- X44QW) + X00QW = _mm_loaddup_pd(&X[4+Ci]); + X11QW = _mm_loaddup_pd(&X[5+Ci]); + X22QW = _mm_loaddup_pd(&X[6+Ci]); + X33QW = _mm_loaddup_pd(&X[7+Ci]); + + // Third quadrand + A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFF80000000)])), _mm_setzero_pd(),(__m128d)Z01QW); + A23QW = 
_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFE0000000)])), _mm_setzero_pd(),(__m128d)Z23QW); + A45QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFF8000000)])), _mm_setzero_pd(),(__m128d)Z45QW); + A67QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFE000000)])), _mm_setzero_pd(),(__m128d)Z67QW); + + Y01QW = _mm_add_pd(_mm_mul_pd(X00QW, A01QW), Y01QW); Z01QW=_mm_slli_epi64(Z01QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X00QW, A45QW), Y23QW); Z45QW=_mm_slli_epi64(Z45QW,8); + Y01QW = _mm_add_pd(_mm_mul_pd(X11QW, A23QW), Y01QW); Z23QW=_mm_slli_epi64(Z23QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X11QW, A67QW), Y23QW); Z67QW=_mm_slli_epi64(Z67QW,8); + + YT0QW = _mm_mul_pd(X01QW, A01QW); // reuse Y(1:4) registers for Y(5:8) + YT1QW = _mm_mul_pd(X01QW, A23QW); + YT0QW = _mm_add_pd(_mm_mul_pd(X23QW, A45QW), YT0QW); + YT1QW = _mm_add_pd(_mm_mul_pd(X23QW, A67QW), YT1QW); + + A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFF800000)])), _mm_setzero_pd(),(__m128d)Z01QW); + A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFE00000)])), _mm_setzero_pd(),(__m128d)Z23QW); + A45QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFF80000)])), _mm_setzero_pd(),(__m128d)Z45QW); + A67QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFE0000)])), _mm_setzero_pd(),(__m128d)Z67QW); + + Y01QW = _mm_add_pd(_mm_mul_pd(X22QW, A01QW), Y01QW); Z01QW=_mm_slli_epi64(Z01QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X22QW, A45QW), Y23QW); Z45QW=_mm_slli_epi64(Z45QW,8); + Y01QW = _mm_add_pd(_mm_mul_pd(X33QW, A23QW), Y01QW); Z23QW=_mm_slli_epi64(Z23QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X33QW, A67QW), Y23QW); Z67QW=_mm_slli_epi64(Z67QW,8); + + YT2QW = _mm_mul_pd(X01QW, A01QW); + YT3QW = _mm_mul_pd(X01QW, A23QW); + 
YT2QW = _mm_add_pd(_mm_mul_pd(X23QW, A45QW), YT2QW); + YT3QW = _mm_add_pd(_mm_mul_pd(X23QW, A67QW), YT3QW); + + // Fourth quadrant + A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFF8000)])), _mm_setzero_pd(),(__m128d)Z01QW); + A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFE000)])), _mm_setzero_pd(),(__m128d)Z23QW); + A45QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFF800)])), _mm_setzero_pd(),(__m128d)Z45QW); + A67QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFFE00)])), _mm_setzero_pd(),(__m128d)Z67QW); + + Y45QW = _mm_add_pd(_mm_mul_pd(X00QW, A01QW), Y45QW); Z01QW=_mm_slli_epi64(Z01QW,8); + Y67QW = _mm_add_pd(_mm_mul_pd(X00QW, A45QW), Y67QW); Z45QW=_mm_slli_epi64(Z45QW,8); + Y45QW = _mm_add_pd(_mm_mul_pd(X11QW, A23QW), Y45QW); Z23QW=_mm_slli_epi64(Z23QW,8); + Y67QW = _mm_add_pd(_mm_mul_pd(X11QW, A67QW), Y67QW); Z67QW=_mm_slli_epi64(Z67QW,8); + + YT0QW = _mm_add_pd(_mm_mul_pd(X45QW, A01QW), YT0QW); // the transpose updates + YT1QW = _mm_add_pd(_mm_mul_pd(X45QW, A23QW), YT1QW); + YT0QW = _mm_add_pd(_mm_mul_pd(X67QW, A45QW), YT0QW); + YT1QW = _mm_add_pd(_mm_mul_pd(X67QW, A67QW), YT1QW); + + A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFFF80)])), _mm_setzero_pd(),(__m128d)Z01QW); + A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFFFE0)])), _mm_setzero_pd(),(__m128d)Z23QW); + A45QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFFFF8)])), _mm_setzero_pd(),(__m128d)Z45QW); + A67QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFFFFE)])), _mm_setzero_pd(),(__m128d)Z67QW); + + Y45QW = _mm_add_pd(_mm_mul_pd(X22QW, A01QW), Y45QW); // no need to shift ZxxQW + Y67QW = 
_mm_add_pd(_mm_mul_pd(X22QW, A45QW), Y67QW); + Y45QW = _mm_add_pd(_mm_mul_pd(X33QW, A23QW), Y45QW); + Y67QW = _mm_add_pd(_mm_mul_pd(X33QW, A67QW), Y67QW); + + YT2QW = _mm_add_pd(_mm_mul_pd(X45QW, A01QW), YT2QW); // the transpose updates + YT3QW = _mm_add_pd(_mm_mul_pd(X45QW, A23QW), YT3QW); + YT2QW = _mm_add_pd(_mm_mul_pd(X67QW, A45QW), YT2QW); + YT3QW = _mm_add_pd(_mm_mul_pd(X67QW, A67QW), YT3QW); + + //-------------------------------------------------------------------------- + _V += __builtin_popcountll(M[ind]); + //-------------------------------------------------------------------------- + + _mm_store_pd(&Y[Ri],Y01QW); + _mm_store_pd(&Y[Ri+2],Y23QW); + _mm_store_pd(&Y[Ri+4],Y45QW); + _mm_store_pd(&Y[Ri+6],Y67QW); + + yt0.d = YT0QW; + yt1.d = YT1QW; + yt2.d = YT2QW; + yt3.d = YT3QW; + + YT[Ci+4] += yt0.f64[0] + yt0.f64[1]; + YT[Ci+5] += yt1.f64[0] + yt1.f64[1]; + YT[Ci+6] += yt2.f64[0] + yt2.f64[1]; + YT[Ci+7] += yt3.f64[0] + yt3.f64[1]; + } +} + + + +// Possibly aliasing (Y=YT or X=XT) version for the blocks right on the diagonal +void SSEsym(const double * __restrict V, const unsigned short * __restrict M, const unsigned * __restrict bot, const unsigned nrb, + const double * __restrict X, double * Y, unsigned lowmask, unsigned nlbits) +{ + const double * __restrict _V = V-1; + + for(unsigned ind=0;ind> nlbits) & lowmask; + + const uint64_t m64 = (uint64_t) M[ind]; // upcast to 64 bit, fill-in zeros from left + const uint64_t Zi = ((~m64) << 48); // a 1 denotes a zero + const uint64_t Zil = Zi << 1; + +#ifdef AMD + __m128i Z01QW = _mm_unpacklo_epi64 (_mm_loadl_epi64((__m128i*)&Zi), _mm_loadl_epi64((__m128i*)&Zil)); +#else + __m128i Z01QW = _mm_insert_epi64(_mm_loadl_epi64((__m128i*)&Zi),Zil,1); +#endif + __m128i Z23QW = _mm_slli_epi64(Z01QW, 2); + __m128i Z45QW = _mm_slli_epi64(Z01QW, 4); + __m128i Z67QW = _mm_slli_epi64(Z01QW, 6); + + __m128d Y01QW = _mm_loadu_pd(&Y[Ri]); + __m128d Y23QW = _mm_loadu_pd(&Y[Ri+2]); + + 
//-------------------------------------------------------------------------- + __m128d X00QW = _mm_loaddup_pd(&X[0+Ci]); // load and duplicate a double into 128-bit registers. + __m128d X11QW = _mm_loaddup_pd(&X[1+Ci]); + __m128d X22QW = _mm_loaddup_pd(&X[2+Ci]); + __m128d X33QW = _mm_loaddup_pd(&X[3+Ci]); + + __m128d X01QW = _mm_loadu_pd(&X[Ri]); // the transpose of X aliases X itself + __m128d X23QW = _mm_loadu_pd(&X[Ri+2]); + + __m128d A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0x8000)])), _mm_setzero_pd(),(__m128d)Z01QW); + __m128d A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xE000)])), _mm_setzero_pd(),(__m128d)Z23QW); + __m128d A45QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xF800)])), _mm_setzero_pd(),(__m128d)Z45QW); + __m128d A67QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xFE00)])), _mm_setzero_pd(),(__m128d)Z67QW); + + Y01QW = _mm_add_pd(_mm_mul_pd(X00QW, A01QW), Y01QW); Z01QW=_mm_slli_epi64(Z01QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X00QW, A45QW), Y23QW); Z45QW=_mm_slli_epi64(Z45QW,8); + Y01QW = _mm_add_pd(_mm_mul_pd(X11QW, A23QW), Y01QW); Z23QW=_mm_slli_epi64(Z23QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X11QW, A67QW), Y23QW); Z67QW=_mm_slli_epi64(Z67QW,8); + + __m128d Y00QW = _mm_mul_pd(X01QW, A01QW); + __m128d Y11QW = _mm_mul_pd(X01QW, A23QW); + Y00QW = _mm_add_pd(_mm_mul_pd(X23QW, A45QW), Y00QW); + Y11QW = _mm_add_pd(_mm_mul_pd(X23QW, A67QW), Y11QW); + + // reuse variables for the second half of A + A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xFF80)])), _mm_setzero_pd(),(__m128d)Z01QW); + A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xFFE0)])), _mm_setzero_pd(),(__m128d)Z23QW); + A45QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xFFF8)])), _mm_setzero_pd(),(__m128d)Z45QW); + A67QW = 
_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xFFFE)])), _mm_setzero_pd(),(__m128d)Z67QW); + + Y01QW = _mm_add_pd(_mm_mul_pd(X22QW, A01QW), Y01QW); // the shifts on ZxxQW are unnecessary after this point + Y23QW = _mm_add_pd(_mm_mul_pd(X22QW, A45QW), Y23QW); + Y01QW = _mm_add_pd(_mm_mul_pd(X33QW, A23QW), Y01QW); + Y23QW = _mm_add_pd(_mm_mul_pd(X33QW, A67QW), Y23QW); + + __m128d Y22QW = _mm_mul_pd(X01QW, A01QW); + __m128d Y33QW = _mm_mul_pd(X01QW, A23QW); + Y22QW = _mm_add_pd(_mm_mul_pd(X23QW, A45QW), Y22QW); + Y33QW = _mm_add_pd(_mm_mul_pd(X23QW, A67QW), Y33QW); + + //-------------------------------------------------------------------------- + _V += __builtin_popcount(M[ind]); + //-------------------------------------------------------------------------- + + _mm_store_pd(&Y[Ri],Y01QW); + _mm_store_pd(&Y[Ri+2],Y23QW); + + // These mirror updates come after the stores, otherwise we lose the updates + ssp_m128 yt0, yt1, yt2, yt3; + yt0.d = Y00QW; + yt1.d = Y11QW; + yt2.d = Y22QW; + yt3.d = Y33QW; + + Y[Ci+0] += yt0.f64[0] + yt0.f64[1]; + Y[Ci+1] += yt1.f64[0] + yt1.f64[1]; + Y[Ci+2] += yt2.f64[0] + yt2.f64[1]; + Y[Ci+3] += yt3.f64[0] + yt3.f64[1]; + } +} + +void SSEsym(const double * __restrict V, const unsigned short * __restrict M, const unsigned * __restrict bot, const unsigned nrb, + const double * __restrict X, const double * __restrict XT, double * Y, double * YT, unsigned lowmask, unsigned nlbits) +{ + const double * __restrict _V = V-1; + + // use popcnt to index into nonzero stream + // use blendv where 1 = zero + for(unsigned ind=0;ind> nlbits) & lowmask; + + const uint64_t m64 = (uint64_t) M[ind]; // upcast to 64 bit, fill-in zeros from left + const uint64_t Zi = ((~m64) << 48); // a 1 denotes a zero + const uint64_t Zil = Zi << 1; + +#ifdef AMD + __m128i Z01QW = _mm_unpacklo_epi64 (_mm_loadl_epi64((__m128i*)&Zi), _mm_loadl_epi64((__m128i*)&Zil)); +#else + __m128i Z01QW = _mm_insert_epi64(_mm_loadl_epi64((__m128i*)&Zi),Zil,1); 
+#endif + __m128i Z23QW = _mm_slli_epi64(Z01QW, 2); + __m128i Z45QW = _mm_slli_epi64(Z01QW, 4); + __m128i Z67QW = _mm_slli_epi64(Z01QW, 6); + + __m128d Y01QW = _mm_loadu_pd(&Y[Ri]); + __m128d Y23QW = _mm_loadu_pd(&Y[Ri+2]); + + //-------------------------------------------------------------------------- + __m128d X00QW = _mm_loaddup_pd(&X[0+Ci]); // load and duplicate a double into 128-bit registers. + __m128d X11QW = _mm_loaddup_pd(&X[1+Ci]); + __m128d X22QW = _mm_loaddup_pd(&X[2+Ci]); + __m128d X33QW = _mm_loaddup_pd(&X[3+Ci]); + + __m128d X01QW = _mm_loadu_pd(&XT[Ri]); + __m128d X23QW = _mm_loadu_pd(&XT[Ri+2]); + + __m128d A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0x8000)])), _mm_setzero_pd(),(__m128d)Z01QW); + __m128d A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xE000)])), _mm_setzero_pd(),(__m128d)Z23QW); + __m128d A45QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xF800)])), _mm_setzero_pd(),(__m128d)Z45QW); + __m128d A67QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xFE00)])), _mm_setzero_pd(),(__m128d)Z67QW); + + // Operations rescheduled for maximum parallelism (they follow a 1-3-2-4 order) + // {0,2} 02** {0,1} + // {1,3} <- 13** {2,3} + // * **** * + // * **** * + + // * **** {4,5} + // * <- **** {6,7} + // {4,6} 46** * + // {5,7} 57** * + Y01QW = _mm_add_pd(_mm_mul_pd(X00QW, A01QW), Y01QW); Z01QW=_mm_slli_epi64(Z01QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X00QW, A45QW), Y23QW); Z45QW=_mm_slli_epi64(Z45QW,8); + Y01QW = _mm_add_pd(_mm_mul_pd(X11QW, A23QW), Y01QW); Z23QW=_mm_slli_epi64(Z23QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X11QW, A67QW), Y23QW); Z67QW=_mm_slli_epi64(Z67QW,8); + + __m128d YT0QW = _mm_mul_pd(X01QW, A01QW); + __m128d YT1QW = _mm_mul_pd(X01QW, A23QW); + YT0QW = _mm_add_pd(_mm_mul_pd(X23QW, A45QW), YT0QW); + YT1QW = _mm_add_pd(_mm_mul_pd(X23QW, A67QW), YT1QW); + + // write YT back (Safe since we 
know that Y is not an alias to YT) + ssp_m128 yt0, yt1; + yt0.d = YT0QW; + yt1.d = YT1QW; + + YT[Ci+0] += yt0.f64[0] + yt0.f64[1]; + YT[Ci+1] += yt1.f64[0] + yt1.f64[1]; + + + // reuse variables for the second half of A + A01QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xFF80)])), _mm_setzero_pd(),(__m128d)Z01QW); + A23QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xFFE0)])), _mm_setzero_pd(),(__m128d)Z23QW); + A45QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xFFF8)])), _mm_setzero_pd(),(__m128d)Z45QW); + A67QW = _mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xFFFE)])), _mm_setzero_pd(),(__m128d)Z67QW); + + Y01QW = _mm_add_pd(_mm_mul_pd(X22QW, A01QW), Y01QW); // the shifts on ZxxQW are unnecessary after this point + Y23QW = _mm_add_pd(_mm_mul_pd(X22QW, A45QW), Y23QW); + Y01QW = _mm_add_pd(_mm_mul_pd(X33QW, A23QW), Y01QW); + Y23QW = _mm_add_pd(_mm_mul_pd(X33QW, A67QW), Y23QW); + + __m128d YT2QW = _mm_mul_pd(X01QW, A01QW); + __m128d YT3QW = _mm_mul_pd(X01QW, A23QW); + YT2QW = _mm_add_pd(_mm_mul_pd(X23QW, A45QW), YT2QW); + YT3QW = _mm_add_pd(_mm_mul_pd(X23QW, A67QW), YT3QW); + + // write YT back (Safe since we know that Y is not an alias to YT) + ssp_m128 yt2, yt3; + yt2.d = YT2QW; + yt3.d = YT3QW; + + YT[Ci+2] += yt2.f64[0] + yt2.f64[1]; + YT[Ci+3] += yt3.f64[0] + yt3.f64[1]; + + + //-------------------------------------------------------------------------- + _V += __builtin_popcount(M[ind]); + //-------------------------------------------------------------------------- + + _mm_store_pd(&Y[Ri],Y01QW); + _mm_store_pd(&Y[Ri+2],Y23QW); + } +} + + +/** + * SpMV (usually used as a subroutine) using bitmasked register blocks + * This version works only with double values, unsigned indices, and 4x4 register blocks + * @param[in] nrb number of register blocks for this compressed sparse block only + * @param[in] bot the local part of the bottom 
array, i.e. {lower row bits}.{higher row bits} + * \attention SSEspmv should only be called within a single compressed sparse block and + * X and Y should already be partially indexed by the higher order bits + * We don't need any template specialization based on the register block size + * because for different block sizes, M's type differs, hence creating overloaded definitions + **/ +void SSEspmv(const double * __restrict V, const unsigned short * __restrict M, const unsigned * __restrict bot, const unsigned nrb, const double * __restrict X, double * Y, unsigned lcmask, unsigned lrmask, unsigned clbits) +{ + const double * __restrict _V = V-1; + + // use popcnt to index into nonzero stream + // use blendv where 1 = zero + for(unsigned ind=0;ind> clbits) & lrmask; + + const uint64_t m64 = (uint64_t) M[ind]; // upcast to 64 bit, fill-in zeros from left + const uint64_t Zi = ((~m64) << 48); // a 1 denotes a zero + const uint64_t Zil = Zi << 1; + +#ifdef AMD + __m128i Z01QW = _mm_unpacklo_epi64 (_mm_loadl_epi64((__m128i*)&Zi), _mm_loadl_epi64((__m128i*)&Zil)); +#else + __m128i Z01QW = _mm_insert_epi64(_mm_loadl_epi64((__m128i*)&Zi),Zil,1); +#endif + __m128i Z23QW = _mm_slli_epi64(Z01QW, 2); + __m128i Z45QW = _mm_slli_epi64(Z01QW, 4); + __m128i Z67QW = _mm_slli_epi64(Z01QW, 6); + + __m128d Y01QW = _mm_loadu_pd(&Y[Ri]); + __m128d Y23QW = _mm_loadu_pd(&Y[Ri+2]); + + //-------------------------------------------------------------------------- + __m128d X00QW = _mm_loaddup_pd(&X[0+Ci]); // load and duplicate a double into 128-bit registers. 
+ __m128d X11QW = _mm_loaddup_pd(&X[1+Ci]); + __m128d X22QW = _mm_loaddup_pd(&X[2+Ci]); + __m128d X33QW = _mm_loaddup_pd(&X[3+Ci]); + + // Operations rescheduled for maximum parallelism (they follow a 1-3-2-4 order) + // {0,2} 02** {0,1} + // {1,3} <- 13** {2,3} + // * **** * + // * **** * + + // * **** {4,5} + // * <- **** {6,7} + // {4,6} 46** * + // {5,7} 57** * + Y01QW = _mm_add_pd(_mm_mul_pd(X00QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0x8000)])),_mm_setzero_pd(),(__m128d)Z01QW)),Y01QW);Z01QW=_mm_slli_epi64(Z01QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X00QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xF800)])),_mm_setzero_pd(),(__m128d)Z45QW)),Y23QW);Z45QW=_mm_slli_epi64(Z45QW,8); + Y01QW = _mm_add_pd(_mm_mul_pd(X11QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xE000)])),_mm_setzero_pd(),(__m128d)Z23QW)),Y01QW);Z23QW=_mm_slli_epi64(Z23QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X11QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xFE00)])),_mm_setzero_pd(),(__m128d)Z67QW)),Y23QW);Z67QW=_mm_slli_epi64(Z67QW,8); + + // {8,0} **80 * + // {9,1} <- **91 * + // * **** {8,9} + // * **** {0,1} + + // * **** * + // * <- **** * + // {2,4} **24 {2,3} + // {3,5} **35 {4,5} + Y01QW = _mm_add_pd(_mm_mul_pd(X22QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xFF80)])),_mm_setzero_pd(),(__m128d)Z01QW)),Y01QW); + Y23QW = _mm_add_pd(_mm_mul_pd(X22QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xFFF8)])),_mm_setzero_pd(),(__m128d)Z45QW)),Y23QW); + Y01QW = _mm_add_pd(_mm_mul_pd(X33QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xFFE0)])),_mm_setzero_pd(),(__m128d)Z23QW)),Y01QW); + Y23QW = _mm_add_pd(_mm_mul_pd(X33QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcount(M[ind]&0xFFFE)])),_mm_setzero_pd(),(__m128d)Z67QW)),Y23QW); + + 
//-------------------------------------------------------------------------- + _V += __builtin_popcount(M[ind]); + //-------------------------------------------------------------------------- + + _mm_store_pd(&Y[Ri],Y01QW); + _mm_store_pd(&Y[Ri+2],Y23QW); + } +} + + //-------------------------------------------------------------------------- +// M is of type uint64_t --> 8x8 register blocks +void SSEspmv(const double * __restrict V, const uint64_t * __restrict M, const unsigned * __restrict bot, const unsigned nrb, const double * __restrict X, double * Y, unsigned lcmask, unsigned lrmask, unsigned clbits) +{ + const double * __restrict _V = V-1; + + // use popcnt to index into nonzero stream + // use blendv where 1 = zero + for(unsigned ind=0;ind> clbits) & lrmask; + const uint64_t Zi = ~M[ind]; // a 1 denotes a zero + + __m128d Y01QW = _mm_loadu_pd(&Y[Ri]); + __m128d Y23QW = _mm_loadu_pd(&Y[Ri+2]); + __m128d Y45QW = _mm_loadu_pd(&Y[Ri+4]); + __m128d Y67QW = _mm_loadu_pd(&Y[Ri+6]); + +#ifdef AMD + const uint64_t Zil = Zi << 1; + __m128i Z01QW = _mm_unpacklo_epi64 (_mm_loadl_epi64((__m128i*)&Zi), _mm_loadl_epi64((__m128i*)&Zil)); +#else + __m128i Z01QW = _mm_insert_epi64(_mm_loadl_epi64((__m128i*)&Zi),Zi<<1,1); // Z01[0][63] = Z[63] +#endif + __m128i Z23QW = _mm_slli_epi64(Z01QW, 2); + __m128i Z45QW = _mm_slli_epi64(Z01QW, 4); + __m128i Z67QW = _mm_slli_epi64(Z01QW, 6); + + //-------------------------------------------------------------------------- + __m128d X00QW = _mm_loaddup_pd(&X[0+Ci]); // load and duplicate a double into 128-bit registers. 
+ __m128d X11QW = _mm_loaddup_pd(&X[1+Ci]); + __m128d X22QW = _mm_loaddup_pd(&X[2+Ci]); + __m128d X33QW = _mm_loaddup_pd(&X[3+Ci]); + + // Operations rescheduled for maximum parallelism (they follow a 1-3-2-4 order) + // {0,2} 02** {0,1} + // {1,3} <- 13** {2,3} + // * **** * + // * **** * + + // * **** {4,5} + // * <- **** {6,7} + // {4,6} 46** * + // {5,7} 57** * + Y01QW = _mm_add_pd(_mm_mul_pd(X00QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0x8000000000000000)])),_mm_setzero_pd(),(__m128d)Z01QW)),Y01QW);Z01QW=_mm_slli_epi64(Z01QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X00QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xF800000000000000)])),_mm_setzero_pd(),(__m128d)Z45QW)),Y23QW);Z45QW=_mm_slli_epi64(Z45QW,8); + Y01QW = _mm_add_pd(_mm_mul_pd(X11QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xE000000000000000)])),_mm_setzero_pd(),(__m128d)Z23QW)),Y01QW);Z23QW=_mm_slli_epi64(Z23QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X11QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFE00000000000000)])),_mm_setzero_pd(),(__m128d)Z67QW)),Y23QW);Z67QW=_mm_slli_epi64(Z67QW,8); + + // {8,0} **80 * + // {9,1} <- **91 * + // * **** {8,9} + // * **** {0,1} + + // * **** * + // * <- **** * + // {2,4} **24 {2,3} + // {3,5} **35 {4,5} + Y01QW = _mm_add_pd(_mm_mul_pd(X22QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFF80000000000000)])),_mm_setzero_pd(),(__m128d)Z01QW)),Y01QW);Z01QW=_mm_slli_epi64(Z01QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X22QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFF8000000000000)])),_mm_setzero_pd(),(__m128d)Z45QW)),Y23QW);Z45QW=_mm_slli_epi64(Z45QW,8); + Y01QW = _mm_add_pd(_mm_mul_pd(X33QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFE0000000000000)])),_mm_setzero_pd(),(__m128d)Z23QW)),Y01QW);Z23QW=_mm_slli_epi64(Z23QW,8); + Y23QW 
= _mm_add_pd(_mm_mul_pd(X33QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFE000000000000)])),_mm_setzero_pd(),(__m128d)Z67QW)),Y23QW);Z67QW=_mm_slli_epi64(Z67QW,8); + + //-------------------------------------------------------------------------- + Y45QW = _mm_add_pd(_mm_mul_pd(X00QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFF800000000000)])),_mm_setzero_pd(),(__m128d)Z01QW)),Y45QW);Z01QW=_mm_slli_epi64(Z01QW,8); + Y67QW = _mm_add_pd(_mm_mul_pd(X00QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFF80000000000)])),_mm_setzero_pd(),(__m128d)Z45QW)),Y67QW);Z45QW=_mm_slli_epi64(Z45QW,8); + Y45QW = _mm_add_pd(_mm_mul_pd(X11QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFE00000000000)])),_mm_setzero_pd(),(__m128d)Z23QW)),Y45QW);Z23QW=_mm_slli_epi64(Z23QW,8); + Y67QW = _mm_add_pd(_mm_mul_pd(X11QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFE0000000000)])),_mm_setzero_pd(),(__m128d)Z67QW)),Y67QW);Z67QW=_mm_slli_epi64(Z67QW,8); + + Y45QW = _mm_add_pd(_mm_mul_pd(X22QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFF8000000000)])),_mm_setzero_pd(),(__m128d)Z01QW)),Y45QW);Z01QW=_mm_slli_epi64(Z01QW,8); + Y45QW = _mm_add_pd(_mm_mul_pd(X33QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFE000000000)])),_mm_setzero_pd(),(__m128d)Z23QW)),Y45QW);Z23QW=_mm_slli_epi64(Z23QW,8); + Y67QW = _mm_add_pd(_mm_mul_pd(X22QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFF800000000)])),_mm_setzero_pd(),(__m128d)Z45QW)),Y67QW);Z45QW=_mm_slli_epi64(Z45QW,8); + Y67QW = _mm_add_pd(_mm_mul_pd(X33QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFE00000000)])),_mm_setzero_pd(),(__m128d)Z67QW)),Y67QW);Z67QW=_mm_slli_epi64(Z67QW,8); + + 
//-------------------------------------------------------------------------- + //-------------------------------------------------------------------------- + // Reuse registers (e.g., X00QW <- X44QW) + X00QW = _mm_loaddup_pd(&X[4+Ci]); + X11QW = _mm_loaddup_pd(&X[5+Ci]); + X22QW = _mm_loaddup_pd(&X[6+Ci]); + X33QW = _mm_loaddup_pd(&X[7+Ci]); + + Y01QW = _mm_add_pd(_mm_mul_pd(X00QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFF80000000)])),_mm_setzero_pd(),(__m128d)Z01QW)),Y01QW);Z01QW=_mm_slli_epi64(Z01QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X00QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFF8000000)])),_mm_setzero_pd(),(__m128d)Z45QW)),Y23QW);Z45QW=_mm_slli_epi64(Z45QW,8); + Y01QW = _mm_add_pd(_mm_mul_pd(X11QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFE0000000)])),_mm_setzero_pd(),(__m128d)Z23QW)),Y01QW);Z23QW=_mm_slli_epi64(Z23QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X11QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFE000000)])),_mm_setzero_pd(),(__m128d)Z67QW)),Y23QW);Z67QW=_mm_slli_epi64(Z67QW,8); + + Y01QW = _mm_add_pd(_mm_mul_pd(X22QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFF800000)])),_mm_setzero_pd(),(__m128d)Z01QW)),Y01QW);Z01QW=_mm_slli_epi64(Z01QW,8); + Y23QW = _mm_add_pd(_mm_mul_pd(X22QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFF80000)])),_mm_setzero_pd(),(__m128d)Z45QW)),Y23QW);Z45QW=_mm_slli_epi64(Z45QW,8); + Y01QW = _mm_add_pd(_mm_mul_pd(X33QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFE00000)])),_mm_setzero_pd(),(__m128d)Z23QW)),Y01QW);Z23QW=_mm_slli_epi64(Z23QW,8); + Y23QW = 
_mm_add_pd(_mm_mul_pd(X33QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFE0000)])),_mm_setzero_pd(),(__m128d)Z67QW)),Y23QW);Z67QW=_mm_slli_epi64(Z67QW,8); + + Y45QW = _mm_add_pd(_mm_mul_pd(X00QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFF8000)])),_mm_setzero_pd(),(__m128d)Z01QW)),Y45QW);Z01QW=_mm_slli_epi64(Z01QW,8); + Y67QW = _mm_add_pd(_mm_mul_pd(X00QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFF800)])),_mm_setzero_pd(),(__m128d)Z45QW)),Y67QW);Z45QW=_mm_slli_epi64(Z45QW,8); + Y45QW = _mm_add_pd(_mm_mul_pd(X11QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFE000)])),_mm_setzero_pd(),(__m128d)Z23QW)),Y45QW);Z23QW=_mm_slli_epi64(Z23QW,8); + Y67QW = _mm_add_pd(_mm_mul_pd(X11QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFFE00)])),_mm_setzero_pd(),(__m128d)Z67QW)),Y67QW);Z67QW=_mm_slli_epi64(Z67QW,8); + + Y45QW = _mm_add_pd(_mm_mul_pd(X22QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFFF80)])),_mm_setzero_pd(),(__m128d)Z01QW)),Y45QW); + Y67QW = _mm_add_pd(_mm_mul_pd(X22QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFFFF8)])),_mm_setzero_pd(),(__m128d)Z45QW)),Y67QW); + Y45QW = _mm_add_pd(_mm_mul_pd(X33QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFFFE0)])),_mm_setzero_pd(),(__m128d)Z23QW)),Y45QW); + Y67QW = _mm_add_pd(_mm_mul_pd(X33QW,_mm_blendv_pd((__m128d)_mm_loadu_ps((float*)&(_V[__builtin_popcountll(M[ind]&0xFFFFFFFFFFFFFFFE)])),_mm_setzero_pd(),(__m128d)Z67QW)),Y67QW); + + //-------------------------------------------------------------------------- + _V += __builtin_popcountll(M[ind]); + //-------------------------------------------------------------------------- + + _mm_store_pd(&Y[Ri],Y01QW); + 
_mm_store_pd(&Y[Ri+2],Y23QW); + _mm_store_pd(&Y[Ri+4],Y45QW); + _mm_store_pd(&Y[Ri+6],Y67QW); + } +} + +void popcountall(const unsigned char * __restrict M, unsigned * __restrict counts, size_t n) +{ + // only the last four bits count in every location M[i] + size_t nn = n/8; + for(size_t i=0; i +#include +#include +#include // MMX +#include // SSE +#include // SSE 2 +#include // SSE 3 + +#ifndef AMD + #include // SSSE 3 + #include // SSE 4.1 + #include // SSE 4.2 + #include // SSE ?? (AES) +#else + #include // SSE4A (amd's popcount) + #include +#endif + +#include "timer.clock_gettime.c" + + +//--------------------------------------- +// Type Definitions +//--------------------------------------- + +typedef signed char ssp_s8; +typedef unsigned char ssp_u8; + +typedef signed short ssp_s16; +typedef unsigned short ssp_u16; + +typedef signed int ssp_s32; +typedef unsigned int ssp_u32; + +typedef float ssp_f32; +typedef double ssp_f64; + +typedef signed long long ssp_s64; +typedef unsigned long long ssp_u64; + +typedef union +{ +__m128 f; +__m128d d; +__m128i i; +__m64 m64[ 2]; +ssp_u64 u64[ 2]; +ssp_s64 s64[ 2]; +ssp_f64 f64[ 2]; +ssp_u32 u32[ 4]; +ssp_s32 s32[ 4]; +ssp_f32 f32[ 4]; +ssp_u16 u16[ 8]; +ssp_s16 s16[ 8]; +ssp_u8 u8 [16]; +ssp_s8 s8 [16]; +} ssp_m128; + + +/** \SSE4_1{SSE2,_mm_blendv_pd} */ +inline __m128d ssp_blendv_pd_SSE2( __m128d a, __m128d b, __m128d mask ) +{ + ssp_m128 A, B, Mask; + A.d = a; + B.d = b; + Mask.d = mask; + +// _MM_SHUFFLE(z,y,x,w) does not select anything, this macro just creates a mask +// expands to the following value: (z<<6) | (y<<4) | (x<<2) | w + + Mask.i = _mm_shuffle_epi32( Mask.i, _MM_SHUFFLE(3, 3, 1, 1) ); + Mask.i = _mm_srai_epi32 ( Mask.i, 31 ); + + B.i = _mm_and_si128( B.i, Mask.i ); + A.i = _mm_andnot_si128( Mask.i, A.i ); + A.i = _mm_or_si128( A.i, B.i ); + return A.d; +} + + +#ifdef AMD + #define _mm_blendv_pd ssp_blendv_pd_SSE2 +#endif + +/** + * SpMV (usually used as a subroutine) using bitmasked register blocks + 
* This version works only with doubles and 8x8 register blocks + * @param[in] nbr number of register blocks + * @param[in] bot the local part of the bottom array, i.e. {lower row bits}.{higher row bits} + **/ +template +void SSEspmv(const double *V, const uint64_t *M, const IT *bot, const IT nrb, const double *X, double *Y){ + const double *_V = V-1; + __m128d Y01QW = _mm_setzero_pd(); + __m128d Y23QW = _mm_setzero_pd(); + __m128d Y45QW = _mm_setzero_pd(); + __m128d Y67QW = _mm_setzero_pd(); + // use popcnt to index into nonzero stream + // use blendv where 1 = zero + for(IT i=0;i +#include +#include + +#ifdef __APPLE__ +#include +#else +#include +#endif + +#include "promote.h" + +template +struct inf_plus{ + T operator()(const T& a, const T& b) const { + T inf = std::numeric_limits::max(); + if (a == inf || b == inf){ + return inf; + } + return a + b; + } +}; + +// (+,*) on scalars +template +struct PTSR +{ + typedef typename promote_trait::T_promote T_promote; + + static T_promote add(const T1 & arg1, const T2 & arg2) + { + return (static_cast(arg1) + + static_cast(arg2) ); + } + static T_promote multiply(const T1 & arg1, const T2 & arg2) + { + return (static_cast(arg1) * + static_cast(arg2) ); + } + // y += ax overload with a=1 + static void axpy(const T2 & x, T_promote & y) + { + y += x; + } + + static void axpy(T1 a, const T2 & x, T_promote & y) + { + y += a*x; + } +}; + + +template +struct UnrollerL { + template + static void step(Lambda& func) { + func(Begin); + UnrollerL::step(func); + } +}; + +template +struct UnrollerL { + template + static void step(Lambda& func) { + // base case is when Begin=End; do nothing + } +}; + + +// (+,*) on std:array's +template +struct PTSRArray +{ + typedef typename promote_trait::T_promote T_promote; + + // y <- a*x + y overload with a=1 + static void axpy(const array & b, array & c) + { + const T2 * __restrict barr = b.data(); + T_promote * __restrict carr = c.data(); + __assume_aligned(barr, ALIGN); + 
__assume_aligned(carr, ALIGN); + + #pragma simd + for(int i=0; i::step ( multadd ); + } + + // Todo: Do partial unrolling; this code will bloat for D > 32 + static void axpy(T1 a, const array & b, array & c) + { + const T2 * __restrict barr = b.data(); + T_promote * __restrict carr = c.data(); + __assume_aligned(barr, ALIGN); + __assume_aligned(carr, ALIGN); + + #pragma simd + for(int i=0; i::step ( multadd ); + } +}; + +// (min,+) on scalars +template +struct MPSR +{ + typedef typename promote_trait::T_promote T_promote; + + static T_promote add(const T1 & arg1, const T2 & arg2) + { + return std::min + (static_cast(arg1), static_cast(arg2)); + } + static T_promote multiply(const T1 & arg1, const T2 & arg2) + { + return inf_plus< T_promote > + (static_cast(arg1), static_cast(arg2)); + } +}; + + +#endif diff --git a/csb/SuiteSparse_config.h b/csb/SuiteSparse_config.h new file mode 100644 index 0000000..e14a171 --- /dev/null +++ b/csb/SuiteSparse_config.h @@ -0,0 +1,247 @@ +/* ========================================================================== */ +/* === SuiteSparse_config =================================================== */ +/* ========================================================================== */ + +/* Configuration file for SuiteSparse: a Suite of Sparse matrix packages + * (AMD, COLAMD, CCOLAMD, CAMD, CHOLMOD, UMFPACK, CXSparse, and others). + * + * SuiteSparse_config.h provides the definition of the long integer. On most + * systems, a C program can be compiled in LP64 mode, in which long's and + * pointers are both 64-bits, and int's are 32-bits. Windows 64, however, uses + * the LLP64 model, in which int's and long's are 32-bits, and long long's and + * pointers are 64-bits. + * + * SuiteSparse packages that include long integer versions are + * intended for the LP64 mode. However, as a workaround for Windows 64 + * (and perhaps other systems), the long integer can be redefined. 
+ * + * If _WIN64 is defined, then the __int64 type is used instead of long. + * + * The long integer can also be defined at compile time. For example, this + * could be added to SuiteSparse_config.mk: + * + * CFLAGS = -O -D'SuiteSparse_long=long long' \ + * -D'SuiteSparse_long_max=9223372036854775801' -D'SuiteSparse_long_idd="lld"' + * + * This file defines SuiteSparse_long as either long (on all but _WIN64) or + * __int64 on Windows 64. The intent is that a SuiteSparse_long is always a + * 64-bit integer in a 64-bit code. ptrdiff_t might be a better choice than + * long; it is always the same size as a pointer. + * + * This file also defines the SUITESPARSE_VERSION and related definitions. + * + * Copyright (c) 2012, Timothy A. Davis. No licensing restrictions apply + * to this file or to the SuiteSparse_config directory. + * Author: Timothy A. Davis. + */ + +#ifndef SUITESPARSE_CONFIG_H +#define SUITESPARSE_CONFIG_H + +#ifdef __cplusplus +extern "C" { +#endif + +#include +#include + +/* ========================================================================== */ +/* === SuiteSparse_long ===================================================== */ +/* ========================================================================== */ + +#ifndef SuiteSparse_long + +#ifdef _WIN64 + +#define SuiteSparse_long __int64 +#define SuiteSparse_long_max _I64_MAX +#define SuiteSparse_long_idd "I64d" + +#else + +#define SuiteSparse_long long +#define SuiteSparse_long_max LONG_MAX +#define SuiteSparse_long_idd "ld" + +#endif +#define SuiteSparse_long_id "%" SuiteSparse_long_idd +#endif + +/* ========================================================================== */ +/* === SuiteSparse_config parameters and functions ========================== */ +/* ========================================================================== */ + +/* SuiteSparse-wide parameters are placed in this struct. It is meant to be + an extern, globally-accessible struct. 
It is not meant to be updated + frequently by multiple threads. Rather, if an application needs to modify + SuiteSparse_config, it should do it once at the beginning of the application, + before multiple threads are launched. + + The intent of these function pointers is that they not be used in your + application directly, except to assign them to the desired user-provided + functions. Rather, you should use the + */ + +struct SuiteSparse_config_struct +{ + void *(*malloc_func) (size_t) ; /* pointer to malloc */ + void *(*calloc_func) (size_t, size_t) ; /* pointer to calloc */ + void *(*realloc_func) (void *, size_t) ; /* pointer to realloc */ + void (*free_func) (void *) ; /* pointer to free */ + int (*printf_func) (const char *, ...) ; /* pointer to printf */ + double (*hypot_func) (double, double) ; /* pointer to hypot */ + int (*divcomplex_func) (double, double, double, double, double *, double *); +} ; + +extern struct SuiteSparse_config_struct SuiteSparse_config ; + +void SuiteSparse_start ( void ) ; /* called to start SuiteSparse */ + +void SuiteSparse_finish ( void ) ; /* called to finish SuiteSparse */ + +void *SuiteSparse_malloc /* pointer to allocated block of memory */ +( + size_t nitems, /* number of items to malloc (>=1 is enforced) */ + size_t size_of_item /* sizeof each item */ +) ; + +void *SuiteSparse_calloc /* pointer to allocated block of memory */ +( + size_t nitems, /* number of items to calloc (>=1 is enforced) */ + size_t size_of_item /* sizeof each item */ +) ; + +void *SuiteSparse_realloc /* pointer to reallocated block of memory, or + to original block if the realloc failed. 
*/ +( + size_t nitems_new, /* new number of items in the object */ + size_t nitems_old, /* old number of items in the object */ + size_t size_of_item, /* sizeof each item */ + void *p, /* old object to reallocate */ + int *ok /* 1 if successful, 0 otherwise */ +) ; + +void *SuiteSparse_free /* always returns NULL */ +( + void *p /* block to free */ +) ; + +void SuiteSparse_tic /* start the timer */ +( + double tic [2] /* output, contents undefined on input */ +) ; + +double SuiteSparse_toc /* return time in seconds since last tic */ +( + double tic [2] /* input: from last call to SuiteSparse_tic */ +) ; + +double SuiteSparse_time /* returns current wall clock time in seconds */ +( + void +) ; + +/* returns sqrt (x^2 + y^2), computed reliably */ +double SuiteSparse_hypot (double x, double y) ; + +/* complex division of c = a/b */ +int SuiteSparse_divcomplex +( + double ar, double ai, /* real and imaginary parts of a */ + double br, double bi, /* real and imaginary parts of b */ + double *cr, double *ci /* real and imaginary parts of c */ +) ; + +/* determine which timer to use, if any */ +#ifndef NTIMER +#ifdef _POSIX_C_SOURCE +#if _POSIX_C_SOURCE >= 199309L +#define SUITESPARSE_TIMER_ENABLED +#endif +#endif +#endif + +/* SuiteSparse printf macro */ +#define SUITESPARSE_PRINTF(params) \ +{ \ + if (SuiteSparse_config.printf_func != NULL) \ + { \ + (void) (SuiteSparse_config.printf_func) params ; \ + } \ +} + +/* ========================================================================== */ +/* === SuiteSparse version ================================================== */ +/* ========================================================================== */ + +/* SuiteSparse is not a package itself, but a collection of packages, some of + * which must be used together (UMFPACK requires AMD, CHOLMOD requires AMD, + * COLAMD, CAMD, and CCOLAMD, etc). A version number is provided here for the + * collection itself. 
The versions of packages within each version of + * SuiteSparse are meant to work together. Combining one package from one + * version of SuiteSparse, with another package from another version of + * SuiteSparse, may or may not work. + * + * SuiteSparse contains the following packages: + * + * SuiteSparse_config version 4.4.1 (version always the same as SuiteSparse) + * AMD version 2.4.1 + * BTF version 1.2.1 + * CAMD version 2.4.1 + * CCOLAMD version 2.9.1 + * CHOLMOD version 3.0.3 + * COLAMD version 2.9.1 + * CSparse version 3.1.4 + * CXSparse version 3.1.4 + * KLU version 1.3.2 + * LDL version 2.2.1 + * RBio version 2.2.1 + * SPQR version 2.0.0 + * GPUQREngine version 1.0.0 + * SuiteSparse_GPURuntime version 1.0.0 + * UMFPACK version 5.7.1 + * MATLAB_Tools various packages & M-files + * + * Other package dependencies: + * BLAS required by CHOLMOD and UMFPACK + * LAPACK required by CHOLMOD + * METIS 4.0.1 required by CHOLMOD (optional) and KLU (optional) + * CUBLAS, CUDART NVIDIA libraries required by CHOLMOD and SPQR when + * they are compiled with GPU acceleration. + */ + + +int SuiteSparse_version /* returns SUITESPARSE_VERSION */ +( + /* output, not defined on input. Not used if NULL. Returns + the three version codes in version [0..2]: + version [0] is SUITESPARSE_MAIN_VERSION + version [1] is SUITESPARSE_SUB_VERSION + version [2] is SUITESPARSE_SUBSUB_VERSION + */ + int version [3] +) ; + +/* Versions prior to 4.2.0 do not have the above function. 
The following + code fragment will work with any version of SuiteSparse: + + #ifdef SUITESPARSE_HAS_VERSION_FUNCTION + v = SuiteSparse_version (NULL) ; + #else + v = SUITESPARSE_VERSION ; + #endif +*/ +#define SUITESPARSE_HAS_VERSION_FUNCTION + +#define SUITESPARSE_DATE "Jan 7, 2015" +#define SUITESPARSE_VER_CODE(main,sub) ((main) * 1000 + (sub)) +#define SUITESPARSE_MAIN_VERSION 4 +#define SUITESPARSE_SUB_VERSION 4 +#define SUITESPARSE_SUBSUB_VERSION 2 +#define SUITESPARSE_VERSION \ + SUITESPARSE_VER_CODE(SUITESPARSE_MAIN_VERSION,SUITESPARSE_SUB_VERSION) + +#ifdef __cplusplus +} +#endif +#endif diff --git a/csb/bandMat.cpp b/csb/bandMat.cpp new file mode 100644 index 0000000..55f8d4d --- /dev/null +++ b/csb/bandMat.cpp @@ -0,0 +1,356 @@ +#include "bandMat.hpp" + + +// ================================================== +// === UTILITIES + +void transformCSCtoCSR(double * const acsr, + int * const ja, + int * const ia, + double * const acsc, + int * const ja1, + int * const ia1, + unsigned int const n) { + + MKL_INT job[8] = {1, 0, 0, 0, 0, 1, 0, 0}; + + MKL_INT m = (MKL_INT) n; + + MKL_INT info = 0; // status output; kept on the stack so nothing is leaked + + mkl_dcsrcsc (job , &m , + acsr , ja , ia, + acsc , ja1 , ia1 , &info ); + +} + +void transformCSCtoCSR(float * const acsr, + int * const ja, + int * const ia, + float * const acsc, + int * const ja1, + int * const ia1, + unsigned int const n) { + + MKL_INT job[8] = {1, 0, 0, 0, 0, 1, 0, 0}; + + MKL_INT m = (MKL_INT) n; + + MKL_INT info = 0; // status output; kept on the stack so nothing is leaked + + mkl_scsrcsc (job , &m , + acsr , ja , ia, + acsc , ja1 , ia1 , &info ); + +} + + +// ================================================== +// === MATRIX-VECTOR PRODUCT ROUTINES + + +// -------------------------------------------------- +// ---------- ?GBMV: BANDED MATRIX VECTOR + +void bandMatComputation( double * const y, + double const * const A, + double const * const x, + unsigned int const n, + unsigned int const b, + unsigned int const nOfVec){ + + unsigned int kl = (b - 1) / 2; + + for ( int
jj = 0; jj < nOfVec; jj++ ) { + + cblas_dgbmv(CblasColMajor, CblasNoTrans, n, n, kl, kl, + 1, A, b, &(x[jj*n]), 1, 0, &(y[jj*n]), 1); + } + +} + +void bandMatComputation( float * const y, + float const * const A, + float const * const x, + unsigned int const n, + unsigned int const b, + unsigned int const nOfVec){ + + unsigned int kl = (b - 1) / 2; + + for ( int jj = 0; jj < nOfVec; jj++ ) { + + cblas_sgbmv(CblasColMajor, CblasNoTrans, n, n, kl, kl, + 1, A, b, &(x[jj*n]), 1, 0, &(y[jj*n]), 1); + + } + +} + + +// -------------------------------------------------- +// ---------- ?CSCMV: MATRIX VECTOR PRODUCT USING SPARSE CSC + +void sparseComputation( double * const y, + double const * const values, + int const * const rows, + int const * const columns, + double const * const x, + unsigned int const n, + unsigned int const nOfVec){ + + char matdescra[6] = {'g', 'l', 'n', 'c', 'x', 'x'}; + char transa = 'n'; + double alpha = 1; + double beta = 0; + + MKL_INT m = n; + MKL_INT k = n; + + for ( int jj = 0; jj < nOfVec; jj++) + + mkl_dcscmv(&transa, &m, &k, &alpha, matdescra, values, rows, + columns, &(columns[1]), &(x[jj*n]), &beta, &(y[jj*n])); + +} + +void sparseComputation( float * const y, + float const * const values, + int const * const rows, + int const * const columns, + float const * const x, + unsigned int const n, + unsigned int const nOfVec){ + + char matdescra[6] = {'g', 'l', 'n', 'c', 'x', 'x'}; + char transa = 'n'; + float alpha = 1; + float beta = 0; + + MKL_INT m = n; + MKL_INT k = n; + + for ( int jj = 0; jj < nOfVec; jj++) + + mkl_scscmv(&transa, &m, &k, &alpha, matdescra, values, rows, + columns, &(columns[1]), &(x[jj*n]), &beta, &(y[jj*n])); + +} + + +// -------------------------------------------------- +// ---------- ?CSRMV: MATRIX VECTOR PRODUCT USING SPARSE CSR + +void sparseComputationCSR( double * const y, + double const * const values, + int const * const rows, + int const * const columns, + double const * const x, + unsigned int const n, + 
unsigned int const nOfVec){ + + char transa = 'n'; + + MKL_INT m = n; + + for ( int jj = 0; jj < nOfVec; jj++) + + mkl_cspblas_dcsrgemv (&transa , &m , values, rows, columns , &(x[jj*n]), &(y[jj*n]) ); + +} + +void sparseComputationCSR( float * const y, + float const * const values, + int const * const rows, + int const * const columns, + float const * const x, + unsigned int const n, + unsigned int const nOfVec){ + + char transa = 'n'; + + MKL_INT m = n; + + for ( int jj = 0; jj < nOfVec; jj++) + + mkl_cspblas_scsrgemv (&transa , &m , values , rows , columns , &(x[jj*n]), &(y[jj*n]) ); + +} + + +// -------------------------------------------------- +// ---------- ?SBMV: SYMMETRIC BANDED MATRIX VECTOR + +void symBandMatComputation( double * const y, + double const * const A, + double const * const x, + unsigned int const n, + unsigned int const b, + unsigned int const nOfVec){ + + for ( int jj = 0; jj < nOfVec; jj++) + + cblas_dsbmv(CblasColMajor, CblasUpper, n, b-1, + 1, A, b, &(x[jj*n]), 1, 0, &(y[jj*n]), 1); + +} + +void symBandMatComputation( float * const y, + float const * const A, + float const * const x, + unsigned int const n, + unsigned int const b, + unsigned int const nOfVec){ + + for ( int jj = 0; jj < nOfVec; jj++) + + cblas_ssbmv(CblasColMajor, CblasUpper, n, b-1, + 1, A, b, &(x[jj*n]), 1, 0, &(y[jj*n]), 1); + +} + + +// -------------------------------------------------- +// ---------- ?CSRSYMV: SYMMETRICAL MATRIX VECTOR PRODUCT + +void symSparseComputation( double * const y, + double const * const values, + int const * const rows, + int const * const columns, + double const * const x, + unsigned int const n, + unsigned int const nOfVec){ + + char uplo[1] = {'u'}; + + MKL_INT m = n; + + for ( int jj = 0; jj < nOfVec; jj++) + + mkl_cspblas_dcsrsymv(uplo, &m, values, rows, columns, &(x[jj*n]), &(y[jj*n])); + +} + +void symSparseComputation( float * const y, + float const * const values, + int const * const rows, + int const * const columns, + float 
const * const x, + unsigned int const n, + unsigned int const nOfVec){ + + char uplo[1] = {'u'}; + + MKL_INT m = n; + + for ( int jj = 0; jj < nOfVec; jj++) + + mkl_cspblas_scsrsymv(uplo, &m, values, rows, columns, &(x[jj*n]), &(y[jj*n])); + +} + + +// ================================================== +// === MATRIX-MATRIX PRODUCT ROUTINES + +// -------------------------------------------------- +// --- CSR MATRIX-MATRIX PRODUCT + +void sparseMatrixMatrixComputationCSR( double * const y, + double const * const values, + int const * const columns, + int const * const rows, + double const * const x, + unsigned int const n, + unsigned int const nOfVec){ + + char transa[1] = {'N'}; + + char matdescra[6] = {'g', 'l', 'n', 'c', 'x', 'x'}; + + MKL_INT m = (MKL_INT) n; + MKL_INT nVecs = (MKL_INT) nOfVec; + MKL_INT k = (MKL_INT) n; + + double alpha = 1; + double beta = 0; + + mkl_dcsrmm (transa, &m, &nVecs, &k, &alpha, matdescra, values, columns, + rows, &(rows[1]), x, &nVecs, &beta, y, &nVecs ); + +} + +void sparseMatrixMatrixComputationCSR( float * const y, + float const * const values, + int const * const columns, + int const * const rows, + float const * const x, + unsigned int const n, + unsigned int const nOfVec){ + + char transa[1] = {'N'}; + + char matdescra[6] = {'g', 'l', 'n', 'c', 'x', 'x'}; + + MKL_INT m = (MKL_INT) n; + MKL_INT nVecs = (MKL_INT) nOfVec; + MKL_INT k = (MKL_INT) n; + + float alpha = 1; + float beta = 0; + + mkl_scsrmm (transa, &m, &nVecs, &k, &alpha, matdescra, values, columns, + rows, &(rows[1]), x, &nVecs, &beta, y, &nVecs ); + +} + +// -------------------------------------------------- +// --- CSC MATRIX-MATRIX PRODUCT + +void sparseMatrixMatrixComputation( double * const y, + double const * const values, + int const * const rows, + int const * const columns, + double const * const x, + unsigned int const n, + unsigned int const nOfVec){ + + char transa[1] = {'N'}; + + char matdescra[6] = {'g', 'l', 'n', 'c', 'x', 'x'}; + + MKL_INT m = 
(MKL_INT) n; + MKL_INT nVecs = (MKL_INT) nOfVec; + MKL_INT k = (MKL_INT) n; + + double alpha = 1; + double beta = 0; + + mkl_dcscmm (transa, &m, &nVecs, &k, &alpha, matdescra, values, rows, + columns, &(columns[1]), x, &nVecs, &beta, y, &nVecs ); + +} + +void sparseMatrixMatrixComputation( float * const y, + float const * const values, + int const * const rows, + int const * const columns, + float const * const x, + unsigned int const n, + unsigned int const nOfVec){ + + char transa[1] = {'N'}; + + char matdescra[6] = {'g', 'l', 'n', 'c', 'x', 'x'}; + + MKL_INT m = (MKL_INT) n; + MKL_INT nVecs = (MKL_INT) nOfVec; + MKL_INT k = (MKL_INT) n; + + float alpha = 1; + float beta = 0; + + mkl_scscmm (transa, &m, &nVecs, &k, &alpha, matdescra, values, rows, + columns, &(columns[1]), x, &nVecs, &beta, y, &nVecs ); + +} diff --git a/csb/bicsb.cpp b/csb/bicsb.cpp new file mode 100644 index 0000000..cd75df9 --- /dev/null +++ b/csb/bicsb.cpp @@ -0,0 +1,1901 @@ +#include +#include "bicsb.h" +#include "utility.h" +#include + +// Choose block size as big as possible given the following constraints +// 1) The bot array is addressible by IT +// 2) The parts of x & y vectors that a block touches fits into L2 cache [assuming a saxpy() operation] +// 3) There's enough parallel slackness for block rows (at least SLACKNESS * CILK_NPROC) +template +void BiCsb::Init(int workers, IT forcelogbeta) +{ + ispar = (workers > 1); + IT roundrowup = nextpoweroftwo(m); + IT roundcolup = nextpoweroftwo(n); + + // if indices are negative, highestbitset returns -1, + // but that will be caught by the sizereq below + IT rowbits = highestbitset(roundrowup); + IT colbits = highestbitset(roundcolup); + bool sizereq; + if (ispar) + { + sizereq = ((IntPower<2>(rowbits) > SLACKNESS * workers) + && (IntPower<2>(colbits) > SLACKNESS * workers)); + } + else + { + sizereq = ((rowbits > 1) && (colbits > 1)); + } + + if(!sizereq) + { + cerr << "Matrix too small for this library" << endl; + return; + } + + 
rowlowbits = rowbits-1; + collowbits = colbits-1; + IT inf = numeric_limits::max(); + IT maxbits = highestbitset(inf); + + rowhighbits = rowbits-rowlowbits; // # higher order bits for rows (has at least one bit) + colhighbits = colbits-collowbits; // # higher order bits for cols (has at least one bit) + if(ispar) + { + while(IntPower<2>(rowhighbits) < SLACKNESS * workers) + { + rowhighbits++; + rowlowbits--; + } + } + + // calculate the space that suby occupies in L2 cache + IT yL2 = IntPower<2>(rowlowbits) * sizeof(NT); + while(yL2 > L2SIZE) + { + yL2 /= 2; + rowhighbits++; + rowlowbits--; + } + + // calculate the space that subx occupies in L2 cache + IT xL2 = IntPower<2>(collowbits) * sizeof(NT); + while(xL2 > L2SIZE) + { + xL2 /= 2; + colhighbits++; + collowbits--; + } + + // blocks need to be square for correctness (maybe generalize this later?) + while(rowlowbits+collowbits > maxbits) + { + if(rowlowbits > collowbits) + { + rowhighbits++; + rowlowbits--; + } + else + { + colhighbits++; + collowbits--; + } + } + while(rowlowbits > collowbits) + { + rowhighbits++; + rowlowbits--; + } + while(rowlowbits < collowbits) + { + colhighbits++; + collowbits--; + } + assert (collowbits == rowlowbits); + + lowrowmask = IntPower<2>(rowlowbits) - 1; + lowcolmask = IntPower<2>(collowbits) - 1; + if(forcelogbeta != 0) + { + IT candlowmask = IntPower<2>(forcelogbeta) -1; + cout << "Forcing beta to "<< (candlowmask+1) << " instead of the chosen " << (lowrowmask+1) << endl; + cout << "Warning : No checks are performed on the beta you have forced, anything can happen !" 
<< endl; + lowrowmask = lowcolmask = candlowmask; + rowlowbits = collowbits = forcelogbeta; + rowhighbits = rowbits-rowlowbits; + colhighbits = colbits-collowbits; + } + else + { + //cout << "Choussing Beta m: " << m << "n: " << n << endl; + double sqrtn = sqrt(sqrt(static_cast(m) * static_cast(n))); + IT logbeta = static_cast(ceil(log2(sqrtn))) + 2; + if(rowlowbits > logbeta) + { + //cout << "Row Low bits" << endl; + rowlowbits = collowbits = logbeta; + lowrowmask = lowcolmask = IntPower<2>(logbeta) -1; + rowhighbits = rowbits-rowlowbits; + colhighbits = colbits-collowbits; + } + //cout << "Low row mask:" << lowriwmask << endl; + // cout << "Beta chosen to be "<< (lowrowmask+1) << endl; + } + highrowmask = ((roundrowup - 1) ^ lowrowmask); + highcolmask = ((roundcolup - 1) ^ lowcolmask); + + // nbc = #{block columns} = #{blocks in any block row}, nbr = #{block rows) + IT blcdimrow = lowrowmask + 1; + IT blcdimcol = lowcolmask + 1; + nbr = static_cast(ceil(static_cast(m) / static_cast(blcdimrow))); + nbc = static_cast(ceil(static_cast(n) / static_cast(blcdimcol))); + + blcrange = (lowrowmask+1) * (lowcolmask+1); // range indexed by one block + mortoncmp = MortonCompare(rowlowbits, collowbits, lowrowmask, lowcolmask); +} + +// Partial template specialization for booleans +// Does not check cache considerations as this is mostly likely +// to be used for gaxpy() with multiple rhs vectors (we don't know how many and what type at this point) +template +void BiCsb::Init(int workers, IT forcelogbeta) +{ + ispar = (workers > 1); + IT roundrowup = nextpoweroftwo(m); + IT roundcolup = nextpoweroftwo(n); + + // if indices are negative, highestbitset returns -1, + // but that will be caught by the sizereq below + IT rowbits = highestbitset(roundrowup); + IT colbits = highestbitset(roundcolup); + bool sizereq; + if (ispar) + { + sizereq = ((IntPower<2>(rowbits) > SLACKNESS * workers) + && (IntPower<2>(colbits) > SLACKNESS * workers)); + } + else + { + sizereq = ((rowbits > 1) 
&& (colbits > 1)); + } + + if(!sizereq) + { + cerr << "Matrix too small for this library" << endl; + return; + } + + rowlowbits = rowbits-1; + collowbits = colbits-1; + IT inf = numeric_limits::max(); + IT maxbits = highestbitset(inf); + + rowhighbits = rowbits-rowlowbits; // # higher order bits for rows (has at least one bit) + colhighbits = colbits-collowbits; // # higher order bits for cols (has at least one bit) + if(ispar) + { + while(IntPower<2>(rowhighbits) < SLACKNESS * workers) + { + rowhighbits++; + rowlowbits--; + } + } + + // blocks need to be square for correctness (maybe generalize this later?) + while(rowlowbits+collowbits > maxbits) + { + if(rowlowbits > collowbits) + { + rowhighbits++; + rowlowbits--; + } + else + { + colhighbits++; + collowbits--; + } + } + while(rowlowbits > collowbits) + { + rowhighbits++; + rowlowbits--; + } + while(rowlowbits < collowbits) + { + colhighbits++; + collowbits--; + } + assert (collowbits == rowlowbits); + + lowrowmask = IntPower<2>(rowlowbits) - 1; + lowcolmask = IntPower<2>(collowbits) - 1; + if(forcelogbeta != 0) + { + IT candlowmask = IntPower<2>(forcelogbeta) -1; + cout << "Forcing beta to "<< (candlowmask+1) << " instead of the chosen " << (lowrowmask+1) << endl; + cout << "Warning : No checks are performed on the beta you have forced, anything can happen !" 
<< endl; + lowrowmask = lowcolmask = candlowmask; + rowlowbits = collowbits = forcelogbeta; + rowhighbits = rowbits-rowlowbits; + colhighbits = colbits-collowbits; + } + else + { + double sqrtn = sqrt(sqrt(static_cast(m) * static_cast(n))); + IT logbeta = static_cast(ceil(log2(sqrtn))) + 2; + if(rowlowbits > logbeta) + { + rowlowbits = collowbits = logbeta; + lowrowmask = lowcolmask = IntPower<2>(logbeta) -1; + rowhighbits = rowbits-rowlowbits; + colhighbits = colbits-collowbits; + } + // cout << "Beta chosen to be "<< (lowrowmask+1) << endl; + } + highrowmask = ((roundrowup - 1) ^ lowrowmask); + highcolmask = ((roundcolup - 1) ^ lowcolmask); + + // nbc = #{block columns} = #{blocks in any block row}, nbr = #{block rows) + IT blcdimrow = lowrowmask + 1; + IT blcdimcol = lowcolmask + 1; + nbr = static_cast(ceil(static_cast(m) / static_cast(blcdimrow))); + nbc = static_cast(ceil(static_cast(n) / static_cast(blcdimcol))); + + blcrange = (lowrowmask+1) * (lowcolmask+1); // range indexed by one block + mortoncmp = MortonCompare(rowlowbits, collowbits, lowrowmask, lowcolmask); +} + + +// Constructing empty BiCsb objects (size = 0) are not allowed. 
+template +BiCsb::BiCsb (IT size, IT rows, IT cols, int workers): nz(size),m(rows),n(cols) +{ + assert(nz != 0 && n != 0 && m != 0); + Init(workers); + + num = (NT*) aligned_malloc( nz * sizeof(NT)); + bot = (IT*) aligned_malloc( nz * sizeof(IT)); + top = allocate2D(nbr, nbc+1); +} + +// Partial template specialization for booleans +template +BiCsb::BiCsb (IT size, IT rows, IT cols, int workers): nz(size),m(rows),n(cols) +{ + assert(nz != 0 && n != 0 && m != 0); + Init(workers); + bot = (IT*) aligned_malloc( nz * sizeof(IT)); + top = allocate2D(nbr, nbc+1); +} + +// copy constructor +template +BiCsb::BiCsb (const BiCsb & rhs) + : nz(rhs.nz), m(rhs.m), n(rhs.n), blcrange(rhs.blcrange), nbr(rhs.nbr), nbc(rhs.nbc), + rowhighbits(rhs.rowhighbits), rowlowbits(rhs.rowlowbits), highrowmask(rhs.highrowmask), lowrowmask(rhs.lowrowmask), + colhighbits(rhs.colhighbits), collowbits(rhs.collowbits), highcolmask(rhs.highcolmask), lowcolmask(rhs.lowcolmask), + mortoncmp(rhs.mortoncmp), ispar(rhs.ispar) +{ + if(nz > 0) + { + num = (NT*) aligned_malloc( nz * sizeof(NT)); + bot = (IT*) aligned_malloc( nz * sizeof(IT)); + + copy (rhs.num, rhs.num + nz, num); + copy (rhs.bot, rhs.bot + nz, bot); + } + if ( nbr > 0) + { + top = allocate2D(nbr, nbc+1); + for(IT i=0; i +BiCsb::BiCsb (const BiCsb & rhs) + : nz(rhs.nz), m(rhs.m), n(rhs.n), blcrange(rhs.blcrange), nbr(rhs.nbr), nbc(rhs.nbc), + rowhighbits(rhs.rowhighbits), rowlowbits(rhs.rowlowbits), highrowmask(rhs.highrowmask), lowrowmask(rhs.lowrowmask), + colhighbits(rhs.colhighbits), collowbits(rhs.collowbits), highcolmask(rhs.highcolmask), lowcolmask(rhs.lowcolmask), + mortoncmp(rhs.mortoncmp), ispar(rhs.ispar) +{ + if(nz > 0) + { + bot = (IT*) aligned_malloc( nz * sizeof(IT)); + copy (rhs.bot, rhs.bot + nz, bot); + } + if ( nbr > 0) + { + top = allocate2D(nbr, nbc+1); + for(IT i=0; i +BiCsb & BiCsb::operator= (const BiCsb & rhs) +{ + if(this != &rhs) + { + if(nz > 0) // if the existing object is not empty, make it empty + { + 
aligned_free(bot); + aligned_free(num); + } + if(nbr > 0) + { + deallocate2D(top, nbr); + } + ispar = rhs.ispar; + nz = rhs.nz; + n = rhs.n; + m = rhs.m; + nbr = rhs.nbr; + nbc = rhs.nbc; + blcrange = rhs.blcrange; + rowhighbits = rhs.rowhighbits; + rowlowbits = rhs.rowlowbits; + highrowmask = rhs.highrowmask; + lowrowmask = rhs.lowrowmask; + colhighbits = rhs.colhighbits; + collowbits = rhs.collowbits; + highcolmask = rhs.highcolmask; + lowcolmask= rhs.lowcolmask; + mortoncmp = rhs.mortoncmp; + if(nz > 0) // if the copied object is not empty + { + num = (NT*) aligned_malloc( nz * sizeof(NT)); + bot = (IT*) aligned_malloc( nz * sizeof(IT)); + copy (rhs.num, rhs.num + nz, num); + copy (rhs.bot, rhs.bot + nz, bot); + } + if ( nbr > 0) + { + top = allocate2D(nbr, nbc+1); + for(IT i=0; i +BiCsb & BiCsb::operator= (const BiCsb & rhs) +{ + if(this != &rhs) + { + if(nz > 0) // if the existing object is not empty, make it empty + { + aligned_free(bot); + } + if(nbr > 0) + { + deallocate2D(top, nbr); + } + ispar = rhs.ispar; + nz = rhs.nz; + n = rhs.n; + m = rhs.m; + nbr = rhs.nbr; + nbc = rhs.nbc; + blcrange = rhs.blcrange; + rowhighbits = rhs.rowhighbits; + rowlowbits = rhs.rowlowbits; + highrowmask = rhs.highrowmask; + lowrowmask = rhs.lowrowmask; + colhighbits = rhs.colhighbits; + collowbits = rhs.collowbits; + highcolmask = rhs.highcolmask; + lowcolmask= rhs.lowcolmask; + mortoncmp = rhs.mortoncmp; + if(nz > 0) // if the copied object is not empty + { + bot = (IT*) aligned_malloc( nz * sizeof(IT)); + copy (rhs.bot, rhs.bot + nz, bot); + } + if ( nbr > 0) + { + top = allocate2D(nbr, nbc+1); + for(IT i=0; i +BiCsb::~BiCsb() +{ + if( nz > 0) + { + aligned_free((unsigned char*) num); + aligned_free((unsigned char*) bot); + } + if ( nbr > 0) + { + deallocate2D(top, nbr); + } +} + +template +BiCsb::~BiCsb() +{ + if( nz > 0) + { + aligned_free((unsigned char*) bot); + } + if ( nbr > 0) + { + deallocate2D(top, nbr); + } +} + +template +BiCsb::BiCsb (Csc & csc, int workers, IT 
forcelogbeta):nz(csc.nz),m(csc.m),n(csc.n) +{ + typedef std::pair ipair; + typedef std::pair mypair; + assert(nz != 0 && n != 0 && m != 0); + if(forcelogbeta == 0) + Init(workers); + else + Init(workers, forcelogbeta); + + num = (NT*) aligned_malloc( nz * sizeof(NT)); + bot = (IT*) aligned_malloc( nz * sizeof(IT)); + top = allocate2D(nbr, nbc+1); + mypair * pairarray = new mypair[nz]; + IT k = 0; + for(IT j = 0; j < n; ++j) + { + for (IT i = csc.jc [j] ; i < csc.jc[j+1] ; ++i) // scan the jth column + { + // concatenate the higher/lower order half of both row (first) index and col (second) index bits + IT hindex = (((highrowmask & csc.ir[i] ) >> rowlowbits) << colhighbits) + | ((highcolmask & j) >> collowbits); + IT lindex = ((lowrowmask & csc.ir[i]) << collowbits) | (lowcolmask & j) ; + + // i => location of that nonzero in csc.ir and csc.num arrays + pairarray[k++] = mypair(hindex, ipair(lindex,i)); + } + } + sort(pairarray, pairarray+nz); // sort according to hindex + SortBlocks(pairarray, csc.num); + delete [] pairarray; +} + +template +template // to provide conversion from arbitrary Csc<> to specialized BiCsb +BiCsb::BiCsb (Csc & csc, int workers):nz(csc.nz),m(csc.m),n(csc.n) +{ + typedef std::pair ipair; + typedef std::pair mypair; + assert(nz != 0 && n != 0 && m != 0); + Init(workers); + + bot = (IT*) aligned_malloc( nz * sizeof(IT)); + top = allocate2D(nbr, nbc+1); + mypair * pairarray = new mypair[nz]; + IT k = 0; + for(IT j = 0; j < n; ++j) + { + for (IT i = csc.jc [j] ; i < csc.jc[j+1] ; ++i) // scan the jth column + { + // concatenate the higher/lower order half of both row (first) index and col (second) index bits + IT hindex = (((highrowmask & csc.ir[i] ) >> rowlowbits) << colhighbits) + | ((highcolmask & j) >> collowbits); + IT lindex = ((lowrowmask & csc.ir[i]) << collowbits) | (lowcolmask & j) ; + + // i => location of that nonzero in csc.ir and csc.num arrays + pairarray[k++] = mypair(hindex, ipair(lindex,i)); + } + } + sort(pairarray, 
pairarray+nz); // sort according to hindex + SortBlocks(pairarray); + delete [] pairarray; +} + +// Assumption: rowindices (ri) and colindices(ci) are "parallel arrays" sorted w.r.t. column index values +template +BiCsb::BiCsb (IT size, IT rows, IT cols, IT * ri, IT * ci, NT * val, int workers, IT forcelogbeta) + :nz(size),m(rows),n(cols) +{ + typedef std::pair ipair; + typedef std::pair mypair; + assert(nz != 0 && n != 0 && m != 0); + Init(workers, forcelogbeta); + + num = (NT*) aligned_malloc( nz * sizeof(NT)); + bot = (IT*) aligned_malloc( nz * sizeof(IT)); + top = allocate2D(nbr, nbc+1); + mypair * pairarray = new mypair[nz]; + for(IT k = 0; k < nz; ++k) + { + // concatenate the higher/lower order half of both row (first) index and col (second) index bits + IT hindex = (((highrowmask & ri[k] ) >> rowlowbits) << colhighbits) | ((highcolmask & ci[k]) >> collowbits); + IT lindex = ((lowrowmask & ri[k]) << collowbits) | (lowcolmask & ci[k]) ; + + // k is stored in order to retrieve the location of this nonzero in val array + pairarray[k] = mypair(hindex, ipair(lindex, k)); + } + sort(pairarray, pairarray+nz); // sort according to hindex + SortBlocks(pairarray, val); + delete [] pairarray; +} + +template +BiCsb::BiCsb (IT size, IT rows, IT cols, IT * ri, IT * ci, int workers, IT forcelogbeta) + :nz(size),m(rows),n(cols) +{ + typedef std::pair ipair; + typedef std::pair mypair; + assert(nz != 0 && n != 0 && m != 0); + Init(workers, forcelogbeta); + + bot = (IT*) aligned_malloc( nz * sizeof(IT)); + top = allocate2D(nbr, nbc+1); + mypair * pairarray = new mypair[nz]; + for(IT k = 0; k < nz; ++k) + { + // concatenate the higher/lower order half of both row (first) index and col (second) index bits + IT hindex = (((highrowmask & ri[k] ) >> rowlowbits) << colhighbits) | ((highcolmask & ci[k]) >> collowbits); + IT lindex = ((lowrowmask & ri[k]) << collowbits) | (lowcolmask & ci[k]) ; + + // k is stored in order to retrieve the location of this nonzero in val array + 
pairarray[k] = mypair(hindex, ipair(lindex, k)); + } + sort(pairarray, pairarray+nz); // sort according to hindex + SortBlocks(pairarray); + delete [] pairarray; +} + +template +void BiCsb::SortBlocks(pair > * pairarray, NT * val) +{ + typedef typename std::pair > mypair; + IT cnz = 0; + IT ldim = IntPower<2>(colhighbits); // leading dimension (not always equal to nbc) + for(IT i = 0; i < nbr; ++i) + { + for(IT j = 0; j < nbc; ++j) + { + top[i][j] = cnz; + IT prevcnz = cnz; + vector< mypair > blocknz; + while(cnz < nz && pairarray[cnz].first == ((i*ldim)+j) ) // as long as we're in this block + { + IT lowbits = pairarray[cnz].second.first; + IT rlowbits = ((lowbits >> collowbits) & lowrowmask); + IT clowbits = (lowbits & lowcolmask); + IT bikey = BitInterleaveLow(rlowbits, clowbits); + + blocknz.push_back(mypair(bikey, pairarray[cnz++].second)); + } + // sort the block into bitinterleaved order + sort(blocknz.begin(), blocknz.end()); + + for(IT k=prevcnz; k +void BiCsb::SortBlocks(pair > * pairarray) +{ + typedef pair > mypair; + IT cnz = 0; + IT ldim = IntPower<2>(colhighbits); // leading dimension (not always equal to nbc) + for(IT i = 0; i < nbr; ++i) + { + for(IT j = 0; j < nbc; ++j) + { + top[i][j] = cnz; + IT prevcnz = cnz; + std::vector blocknz; + while(cnz < nz && pairarray[cnz].first == ((i*ldim)+j) ) // as long as we're in this block + { + IT lowbits = pairarray[cnz].second.first; + IT rlowbits = ((lowbits >> collowbits) & lowrowmask); + IT clowbits = (lowbits & lowcolmask); + IT bikey = BitInterleaveLow(rlowbits, clowbits); + + blocknz.push_back(mypair(bikey, pairarray[cnz++].second)); + } + // sort the block into bitinterleaved order + sort(blocknz.begin(), blocknz.end()); + + for(IT k=prevcnz; k +template +void BiCsb::BMult(IT** chunks, IT start, IT end, const RHS * __restrict x, LHS * __restrict y, IT ysize) const +{ + assert(end-start > 0); // there should be at least one chunk + if (end-start == 1) // single chunk + { + if((chunks[end] - 
chunks[start]) == 1) // chunk consists of a single (normally dense) block + { + IT chi = ( (chunks[start] - chunks[0]) << collowbits); + + // m-chi > lowcolmask for all blocks except the last skinny tall one. + // if the last one is regular too, then it has m-chi = lowcolmask+1 + if(ysize == (lowrowmask+1) && (m-chi) > lowcolmask ) // parallelize if it is a regular/complete block + { + const RHS * __restrict subx = &x[chi]; + BlockPar( *(chunks[start]) , *(chunks[end]), subx, y, 0, blcrange, BREAKEVEN * ysize); + } + else // otherwise block parallelization will fail + { + SubSpMV(chunks[0], chunks[start]-chunks[0], chunks[end]-chunks[0], x, y); + } + } + else // a number of sparse blocks with a total of at most O(\beta) nonzeros + { + SubSpMV(chunks[0], chunks[start]-chunks[0], chunks[end]-chunks[0], x, y); + } + } + else + { + // divide chunks into half + IT mid = (start+end)/2; + + cilk_spawn BMult(chunks, start, mid, x, y, ysize); + if(SYNCHED) + { + BMult(chunks, mid, end, x, y, ysize); + } + else + { + LHS * temp = new LHS[ysize](); + // note the empty set of parentheses as the initializer; because of it, + // even if LHS is a built-in type (such as double,int) the elements are value-initialized + // The C++ standard says that a value-initialized POD type is zero-initialized; + // for non-POD types (such as std::array), the caller should make sure default construction yields zero + + BMult(chunks, mid, end, x, temp, ysize); + cilk_sync; + +#pragma simd + for(IT i=0; i +template +void BiCsb::BMult(IT** chunks, IT start, IT end, const RHS * __restrict x, LHS * __restrict y, IT ysize) const +{ + assert(end-start > 0); // there should be at least one chunk + if (end-start == 1) // single chunk + { + if((chunks[end] - chunks[start]) == 1) // chunk consists of a single (normally dense) block + { + IT chi = ( (chunks[start] - chunks[0]) << collowbits); + + // m-chi > lowcolmask for all blocks except the last skinny tall one.
+ // if the last one is regular too, then it has m-chi = lowcolmask+1 + if(ysize == (lowrowmask+1) && (m-chi) > lowcolmask ) // parallelize if it is a regular/complete block + { + const RHS * __restrict subx = &x[chi]; + BlockPar( *(chunks[start]) , *(chunks[end]), subx, y, 0, blcrange, BREAKEVEN * ysize); + } + else // otherwise block parallelization will fail + { + SubSpMV(chunks[0], chunks[start]-chunks[0], chunks[end]-chunks[0], x, y); + } + } + else // a number of sparse blocks with a total of at most O(\beta) nonzeros + { + SubSpMV(chunks[0], chunks[start]-chunks[0], chunks[end]-chunks[0], x, y); + } + } + else + { + // divide chunks into half + IT mid = (start+end)/2; + + cilk_spawn BMult(chunks, start, mid, x, y, ysize); + if(SYNCHED) + { + BMult(chunks, mid, end, x, y, ysize); + } + else + { + LHS * temp = new LHS[ysize](); + // note the empty set of parentheses as the initializer, therefore + // even if LHS is a built-in type (such as double,int) it will be value-initialized + // The C++ standard says that: A value-initialized POD type is zero-initialized, + // for non-POD types (such as std::array), the caller should make sure value-initialization yields zero + + BMult(chunks, mid, end, x, temp, ysize); + cilk_sync; + +#pragma simd + for(IT i=0; i > * >] chunks {a vector of pointers to vectors of pairs} + * Each vector of pairs is a chunk and each pair is a block within that chunk + * chunks[i] is valid for i = {start,start+1,...,end-1} + **/ +template +template +void BiCsb::BTransMult(vector< vector< tuple > * > & chunks, IT start, IT end, const RHS * __restrict x, LHS * __restrict y, IT ysize) const +{ +#ifdef STATS + blockparcalls += 1; +#endif + assert(end-start > 0); // there should be at least one chunk + if (end-start == 1) // single chunk (note that single chunk does not mean single block) + { + if(chunks[start]->size() == 1) // chunk consists of a single (normally dense) block + { + // get the block row id higher order bits to index x (because this 
is A'x) + auto block = chunks[start]->front(); // get the tuple representing this compressed sparse block + IT chi = ( get<2>(block) << rowlowbits); + + // m-chi > lowrowmask for all blocks except the last skinny tall one. + // if the last one is regular too, then it has m-chi = lowcolmask+1 + // parallelize if it is a regular/complete block (and if it is worth it) + + if(ysize == (lowrowmask+1) && (m-chi) > lowrowmask && (get<1>(block)-get<0>(block)) > BREAKEVEN * ysize) + { + const RHS * __restrict subx = &x[chi]; + BlockParT( get<0>(block) , get<1>(block), subx, y, 0, blcrange, BREAKEVEN * ysize); + } + else // otherwise block parallelization will fail + { + SubSpMVTrans(*(chunks[start]), x, y); + } + } + else // a number of sparse blocks with a total of at most O(\beta) nonzeros + { + SubSpMVTrans(*(chunks[start]), x, y); + } + } + else // multiple chunks + { + IT mid = (start+end)/2; + cilk_spawn BTransMult(chunks, start, mid, x, y, ysize); + if(SYNCHED) + { + BTransMult(chunks, mid, end, x, y, ysize); + } + else + { + LHS * temp = new LHS[ysize](); + BTransMult(chunks, mid, end, x, temp, ysize); + cilk_sync; + +#pragma simd + for(IT i=0; i +template +void BiCsb::BTransMult(vector< vector< tuple > * > & chunks, IT start, IT end, const RHS * __restrict x, LHS * __restrict y, IT ysize) const +{ + assert(end-start > 0); // there should be at least one chunk + if (end-start == 1) // single chunk (note that single chunk does not mean single block) + { + if(chunks[start]->size() == 1) // chunk consists of a single (normally dense) block + { + // get the block row id higher order bits to index x (because this is A'x) + auto block = chunks[start]->front(); // get the tuple representing this compressed sparse block + IT chi = ( get<2>(block) << rowlowbits); + + // m-chi > lowrowmask for all blocks except the last skinny tall one. 
+ // if the last one is regular too, then it has m-chi = lowcolmask+1 + if(ysize == (lowrowmask+1) && (m-chi) > lowrowmask ) // parallelize if it is a regular/complete block + { + const RHS * __restrict subx = &x[chi]; + BlockParT( get<0>(block) , get<1>(block), subx, y, 0, blcrange, BREAKEVEN * ysize); + } + else // otherwise block parallelization will fail + { + SubSpMVTrans(*(chunks[start]), x, y); + } + } + else // a number of sparse blocks with a total of at most O(\beta) nonzeros + { + SubSpMVTrans(*(chunks[start]), x, y); + } + } + else // multiple chunks + { + IT mid = (start+end)/2; + cilk_spawn BTransMult(chunks, start, mid, x, y, ysize); + if(SYNCHED) + { + BTransMult(chunks, mid, end, x, y, ysize); + } + else + { + LHS * temp = new LHS[ysize](); + BTransMult(chunks, mid, end, x, temp, ysize); + cilk_sync; + +#pragma simd + for(IT i=0; i No aliases for a[0], a[1], ... +// bstart/bend: block start/end index (to the top array) +template +template +void BiCsb::SubSpMV(IT * __restrict btop, IT bstart, IT bend, const RHS * __restrict x, LHS * __restrict suby) const +{ + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + + __m128i lcms = _mm_set1_epi32 (lowcolmask); + __m128i lrms = _mm_set1_epi32 (lowrowmask); + + for (IT j = bstart ; j < bend ; ++j) // for all blocks inside that block row + { + // get higher order bits for column indices + IT chi = (j << collowbits); + const RHS * __restrict subx = &x[chi]; + +#ifdef SIMDUNROLL + IT start = btop[j]; + IT range = (btop[j+1]-btop[j]) >> 2; + + if(range > ROLLING) + { + for (IT k = 0 ; k < range ; ++k) // for all nonzeros within ith block (expected =~ nnz/n = c) + { + // ABAB: how to ensure alignment on the stack? + // float a[4] __attribute__((aligned(0x1000))); +#define ALIGN16 __attribute__((aligned(16))) + + IT ALIGN16 rli4[4]; IT ALIGN16 cli4[4]; + NT ALIGN16 x4[4]; NT ALIGN16 y4[4]; + + // _mm_srli_epi32: Shifts the 4 signed or unsigned 32-bit integers to right by shifting in zeros. 
+ IT pin = start + (k << 2); + + __m128i bots = _mm_loadu_si128((__m128i*) &r_bot[pin]); // load 4 consecutive r_bot elements + __m128i clis = _mm_and_si128( bots, lcms); + __m128i rlis = _mm_and_si128( _mm_srli_epi32(bots, collowbits), lrms); + _mm_store_si128 ((__m128i*) cli4, clis); + _mm_store_si128 ((__m128i*) rli4, rlis); + + x4[0] = subx[cli4[0]]; + x4[1] = subx[cli4[1]]; + x4[2] = subx[cli4[2]]; + x4[3] = subx[cli4[3]]; + + __m128d Y01QW = _mm_mul_pd((__m128d)_mm_loadu_pd(&r_num[pin]), (__m128d)_mm_load_pd(&x4[0])); + __m128d Y23QW = _mm_mul_pd((__m128d)_mm_loadu_pd(&r_num[pin+2]), (__m128d)_mm_load_pd(&x4[2])); + + _mm_store_pd(&y4[0],Y01QW); + _mm_store_pd(&y4[2],Y23QW); + + suby[rli4[0]] += y4[0]; + suby[rli4[1]] += y4[1]; + suby[rli4[2]] += y4[2]; + suby[rli4[3]] += y4[3]; + } + for(IT k=start+4*range; k> collowbits) & lowrowmask); + IT cli = (r_bot[k] & lowcolmask); + SR::axpy(r_num[k], subx[cli], suby[rli]); + } + } + else + { +#endif + for(IT k=btop[j]; k> collowbits) & lowrowmask); + IT cli = (r_bot[k] & lowcolmask); + SR::axpy(r_num[k], subx[cli], suby[rli]); + } +#ifdef SIMDUNROLL + } +#endif + } +} + +// double* restrict a; --> No aliases for a[0], a[1], ... 
+// bstart/bend: block start/end index (to the top array) +template +template +void BiCsb::SubSpMV_tar(IT * __restrict btop, IT bstart, IT bend, const RHS * __restrict x, LHS * __restrict suby) const +{ + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + + __m128i lcms = _mm_set1_epi32 (lowcolmask); + __m128i lrms = _mm_set1_epi32 (lowrowmask); + + for (IT j = bstart ; j < bend ; ++j) // for all blocks inside that block row + { + // get higher order bits for column indices + IT chi = (j << collowbits); + const RHS * __restrict subx = &x[chi]; + +#ifdef SIMDUNROLL + IT start = btop[j]; + IT range = (btop[j+1]-btop[j]) >> 2; + + if(range > ROLLING) + { + for (IT k = 0 ; k < range ; ++k) // for all nonzeros within ith block (expected =~ nnz/n = c) + { + // ABAB: how to ensure alignment on the stack? + // float a[4] __attribute__((aligned(0x1000))); +#define ALIGN16 __attribute__((aligned(16))) + + IT ALIGN16 rli4[4]; IT ALIGN16 cli4[4]; + NT ALIGN16 x4[4]; NT ALIGN16 y4[4]; + + // _mm_srli_epi32: Shifts the 4 signed or unsigned 32-bit integers to right by shifting in zeros. 
+ IT pin = start + (k << 2); + + __m128i bots = _mm_loadu_si128((__m128i*) &r_bot[pin]); // load 4 consecutive r_bot elements + __m128i clis = _mm_and_si128( bots, lcms); + __m128i rlis = _mm_and_si128( _mm_srli_epi32(bots, collowbits), lrms); + _mm_store_si128 ((__m128i*) cli4, clis); + _mm_store_si128 ((__m128i*) rli4, rlis); + + x4[0] = subx[cli4[0]]; + x4[1] = subx[cli4[1]]; + x4[2] = subx[cli4[2]]; + x4[3] = subx[cli4[3]]; + + __m128d Y01QW = _mm_mul_pd((__m128d)_mm_loadu_pd(&r_num[pin]), (__m128d)_mm_load_pd(&x4[0])); + __m128d Y23QW = _mm_mul_pd((__m128d)_mm_loadu_pd(&r_num[pin+2]), (__m128d)_mm_load_pd(&x4[2])); + + _mm_store_pd(&y4[0],Y01QW); + _mm_store_pd(&y4[2],Y23QW); + + suby[rli4[0]] += y4[0]; + suby[rli4[1]] += y4[1]; + suby[rli4[2]] += y4[2]; + suby[rli4[3]] += y4[3]; + } + for(IT k=start+4*range; k> collowbits) & lowrowmask); + IT cli = (r_bot[k] & lowcolmask); + SR::axpy(r_num[k], subx[cli], suby[rli]); + } + } + else + { +#endif + for(IT k=btop[j]; k> collowbits) & lowrowmask); + IT cli = (r_bot[k] & lowcolmask); + SR::axpy(r_num[k], subx[cli], suby[rli]); + } +#ifdef SIMDUNROLL + } +#endif + } +} + +// Partial boolean specialization on NT=bool +template +template +void BiCsb::SubSpMV(IT * __restrict btop, IT bstart, IT bend, const RHS * __restrict x, LHS * __restrict suby) const +{ + IT * __restrict r_bot = bot; + for (IT j = bstart ; j < bend ; ++j) // for all blocks inside that block row or chunk + { + // get higher order bits for column indices + IT chi = (j << collowbits); + const RHS * __restrict subx = &x[chi]; + for (IT k = btop[j] ; k < btop[j+1] ; ++k) // for all nonzeros within ith block (expected =~ nnz/n = c) + { + IT rli = ((r_bot[k] >> collowbits) & lowrowmask); + IT cli = (r_bot[k] & lowcolmask); + SR::axpy(subx[cli], suby[rli]); // suby [rli] += subx [cli] where subx and suby are vectors. + } + } +} + +//! 
SubSpMVTrans's chunked version +template +template +void BiCsb::SubSpMVTrans(const vector< tuple > & chunk, const RHS * __restrict x, LHS * __restrict suby) const +{ + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + for(auto itr = chunk.begin(); itr != chunk.end(); ++itr) // over all blocks within this chunk + { + // get the starting point for accessing x + IT chi = ( get<2>(*itr) << rowlowbits); + const RHS * __restrict subx = &x[chi]; + + IT nzbeg = get<0>(*itr); + IT nzend = get<1>(*itr); + + for (IT k = nzbeg ; k < nzend ; ++k) + { + // Note the swap in cli/rli + IT cli = ((r_bot[k] >> collowbits) & lowrowmask); + IT rli = (r_bot[k] & lowcolmask); + SR::axpy(r_num[k], subx[cli], suby[rli]); // suby [rli] += r_num[k] * subx [cli] where subx and suby are vectors. + } + } +} + +//! SubSpMVTrans's chunked version with boolean specialization +template +template +void BiCsb::SubSpMVTrans(const vector< tuple > & chunk, const RHS * __restrict x, LHS * __restrict suby) const +{ + IT * __restrict r_bot = bot; + for(auto itr = chunk.begin(); itr != chunk.end(); ++itr) + { + // get the starting point for accessing x + IT chi = ( get<2>(*itr) << rowlowbits); + const RHS * __restrict subx = &x[chi]; + + IT nzbeg = get<0>(*itr); + IT nzend = get<1>(*itr); + + for (IT k = nzbeg ; k < nzend ; ++k) + { + // Note the swap in cli/rli + IT cli = ((r_bot[k] >> collowbits) & lowrowmask); + IT rli = (r_bot[k] & lowcolmask); + SR::axpy(subx[cli], suby[rli]); // suby [rli] += subx [cli] where subx and suby are vectors. 
+ } + } +} + +template +template +void BiCsb::SubSpMVTrans(IT col, IT rowstart, IT rowend, const RHS * __restrict x, LHS * __restrict suby) const +{ + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + for(IT i= rowstart; i < rowend; ++i) + { + // get the starting point for accessing x + IT chi = (i << rowlowbits); + const RHS * __restrict subx = &x[chi]; + + for (IT k = top[i][col] ; k < top[i][col+1] ; ++k) + { + // Note the swap in cli/rli + IT cli = ((r_bot[k] >> collowbits) & lowrowmask); + IT rli = (r_bot[k] & lowcolmask); + SR::axpy(r_num[k], subx[cli], suby[rli]); // suby [rli] += r_num[k] * subx [cli] where subx and suby are vectors. + } + } +} + + +template +template +void BiCsb::SubSpMVTrans(IT col, IT rowstart, IT rowend, const RHS * __restrict x, LHS * __restrict suby) const +{ + IT * __restrict r_bot = bot; + for(IT i= rowstart; i < rowend; ++i) + { + // get the starting point for accessing x + IT chi = (i << rowlowbits); + const RHS * __restrict subx = &x[chi]; + for (IT k = top[i][col] ; k < top[i][col+1] ; ++k) + { + // Note the swap in cli/rli + IT cli = ((r_bot[k] >> collowbits) & lowrowmask); + IT rli = (r_bot[k] & lowcolmask); + SR::axpy(subx[cli], suby[rli]); // suby [rli] += subx [cli] where subx and suby are vectors. + } + } +} + +// Parallelize the block itself (A*x version) +// start/end: element start/end positions (indices to the bot array) +// bot[start...end] always fall in the same block +// PRECONDITION: rangeend-rangebeg is a power of two +// TODO: we rely on the particular implementation of lower_bound for correctness, which is dangerous ! +// what if lhs (instead of rhs) parameter to the comparison object is the splitter? 
+template +template +void BiCsb::BlockPar(IT start, IT end, const RHS * __restrict subx, LHS * __restrict suby, + IT rangebeg, IT rangeend, IT cutoff) const +{ + assert(IsPower2(rangeend-rangebeg)); + if(end - start < cutoff) + { + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + for (IT k = start ; k < end ; ++k) + { + IT rli = ((r_bot[k] >> collowbits) & lowrowmask); + IT cli = (r_bot[k] & lowcolmask); + SR::axpy(r_num[k], subx[cli], suby[rli]); // suby [rli] += r_num[k] * subx [cli] where subx and suby are vectors. + } + } + else + { + // Lower_bound is a version of binary search: it attempts to find the element value in an ordered range [first, last) + // Specifically, it returns the first position where value could be inserted without violating the ordering + IT halfrange = (rangebeg+rangeend)/2; + IT qrt1range = (rangebeg+halfrange)/2; + IT qrt3range = (halfrange+rangeend)/2; + + IT * mid = std::lower_bound(&bot[start], &bot[end], halfrange, mortoncmp); + IT * left = std::lower_bound(&bot[start], mid, qrt1range, mortoncmp); + IT * right = std::lower_bound(mid, &bot[end], qrt3range, mortoncmp); + + /* ------- + | 0 2 | + | 1 3 | + ------- */ + // subtracting two pointers pointing to the same array gives you the # of elements separating them + // we're *sure* that the differences are 1) non-negative, 2) small enough to be indexed by an IT + IT size0 = static_cast (left - &bot[start]); + IT size1 = static_cast (mid - left); + IT size2 = static_cast (right - mid); + IT size3 = static_cast (&bot[end] - right); + + IT ncutoff = std::max(cutoff/2, MINNNZTOPAR); + + // We can choose to perform [0,3] in parallel and then [1,2] in parallel + // or perform [0,1] in parallel and then [2,3] in parallel + // Decision is based on the balance, i.e. 
we pick the more balanced parallelism + if( ( absdiff(size0,size3) + absdiff(size1,size2) ) < ( absdiff(size0,size1) + absdiff(size2,size3) ) ) + { + cilk_spawn BlockPar(start, start+size0, subx, suby, rangebeg, qrt1range, ncutoff); // multiply subblock_0 + BlockPar(end-size3, end, subx, suby, qrt3range, rangeend, ncutoff); // multiply subblock_3 + cilk_sync; + + cilk_spawn BlockPar(start+size0, start+size0+size1, subx, suby, qrt1range, halfrange, ncutoff); // multiply subblock_1 + BlockPar(start+size0+size1, end-size3, subx, suby, halfrange, qrt3range, ncutoff); // multiply subblock_2 + cilk_sync; + } + else + { + cilk_spawn BlockPar(start, start+size0, subx, suby, rangebeg, qrt1range, ncutoff); // multiply subblock_0 + BlockPar(start+size0, start+size0+size1, subx, suby, qrt1range, halfrange, ncutoff); // multiply subblock_1 + cilk_sync; + + cilk_spawn BlockPar(start+size0+size1, end-size3, subx, suby, halfrange, qrt3range, ncutoff); // multiply subblock_2 + BlockPar(end-size3, end, subx, suby, qrt3range, rangeend, ncutoff); // multiply subblock_3 + cilk_sync; + } + } +} + + +template +template +void BiCsb::BlockPar(IT start, IT end, const RHS * __restrict subx, LHS * __restrict suby, + IT rangebeg, IT rangeend, IT cutoff) const +{ + assert(IsPower2(rangeend-rangebeg)); + if(end - start < cutoff) + { + IT * __restrict r_bot = bot; + for (IT k = start ; k < end ; ++k) + { + IT rli = ((r_bot[k] >> collowbits) & lowrowmask); + IT cli = (r_bot[k] & lowcolmask); + SR::axpy(subx[cli], suby[rli]); // suby [rli] += subx [cli] where subx and suby are vectors. 
+ } + } + else + { + // Lower_bound is a version of binary search: it attempts to find the element value in an ordered range [first, last) + // Specifically, it returns the first position where value could be inserted without violating the ordering + IT halfrange = (rangebeg+rangeend)/2; + IT qrt1range = (rangebeg+halfrange)/2; + IT qrt3range = (halfrange+rangeend)/2; + + IT * mid = std::lower_bound(&bot[start], &bot[end], halfrange, mortoncmp); + IT * left = std::lower_bound(&bot[start], mid, qrt1range, mortoncmp); + IT * right = std::lower_bound(mid, &bot[end], qrt3range, mortoncmp); + + /* ------- + | 0 2 | + | 1 3 | + ------- */ + // subtracting two pointers pointing to the same array gives you the # of elements separating them + // we're *sure* that the differences are 1) non-negative, 2) small enough to be indexed by an IT + IT size0 = static_cast (left - &bot[start]); + IT size1 = static_cast (mid - left); + IT size2 = static_cast (right - mid); + IT size3 = static_cast (&bot[end] - right); + + IT ncutoff = std::max(cutoff/2, MINNNZTOPAR); + + // We can choose to perform [0,3] in parallel and then [1,2] in parallel + // or perform [0,1] in parallel and then [2,3] in parallel + // Decision is based on the balance, i.e. 
we pick the more balanced parallelism + if( ( absdiff(size0,size3) + absdiff(size1,size2) ) < ( absdiff(size0,size1) + absdiff(size2,size3) ) ) + { + cilk_spawn BlockPar(start, start+size0, subx, suby, rangebeg, qrt1range, ncutoff); // multiply subblock_0 + BlockPar(end-size3, end, subx, suby, qrt3range, rangeend, ncutoff); // multiply subblock_3 + cilk_sync; + + cilk_spawn BlockPar(start+size0, start+size0+size1, subx, suby, qrt1range, halfrange, ncutoff); // multiply subblock_1 + BlockPar(start+size0+size1, end-size3, subx, suby, halfrange, qrt3range, ncutoff); // multiply subblock_2 + cilk_sync; + } + else + { + cilk_spawn BlockPar(start, start+size0, subx, suby, rangebeg, qrt1range, ncutoff); // multiply subblock_0 + BlockPar(start+size0, start+size0+size1, subx, suby, qrt1range, halfrange, ncutoff); // multiply subblock_1 + cilk_sync; + + cilk_spawn BlockPar(start+size0+size1, end-size3, subx, suby, halfrange, qrt3range, ncutoff); // multiply subblock_2 + BlockPar(end-size3, end, subx, suby, qrt3range, rangeend, ncutoff); // multiply subblock_3 + cilk_sync; + } + } +} + +// Parallelize the block itself (A'*x version) +// start/end: element start/end positions (indices to the bot array) +// bot[start...end] always fall in the same block +template +template +void BiCsb::BlockParT(IT start, IT end, const RHS * __restrict subx, LHS * __restrict suby, + IT rangebeg, IT rangeend, IT cutoff) const +{ + if(end - start < cutoff) + { + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + for (IT k = start ; k < end ; ++k) + { + // Note the swap in cli/rli + IT cli = ((r_bot[k] >> collowbits) & lowrowmask); + IT rli = (r_bot[k] & lowcolmask); + SR::axpy(r_num[k], subx[cli], suby[rli]); // suby [rli] += r_num[k] * subx [cli] where subx and suby are vectors. 
+ } + } + else + { + IT halfrange = (rangebeg+rangeend)/2; + IT qrt1range = (rangebeg+halfrange)/2; + IT qrt3range = (halfrange+rangeend)/2; + + // Lower_bound is a version of binary search: it attempts to find the element value in an ordered range [first, last) + // Specifically, it returns the first position where value could be inserted without violating the ordering + IT * mid = std::lower_bound(&bot[start], &bot[end], halfrange, mortoncmp); + IT * left = std::lower_bound(&bot[start], mid, qrt1range, mortoncmp); + IT * right = std::lower_bound(mid, &bot[end], qrt3range, mortoncmp); + + /* ------- + | 0 1 | + | 2 3 | + ------- */ + // subtracting two pointers pointing to the same array gives you the # of elements separating them + // we're *sure* that the differences are 1) non-negative, 2) small enough to be indexed by an IT + IT size0 = static_cast (left - &bot[start]); + IT size1 = static_cast (mid - left); + IT size2 = static_cast (right - mid); + IT size3 = static_cast (&bot[end] - right); + + IT ncutoff = std::max(cutoff/2, MINNNZTOPAR); + + // We can choose to perform [0,3] in parallel and then [1,2] in parallel + // or perform [0,2] in parallel and then [1,3] in parallel + // Decision is based on the balance, i.e. 
we pick the more balanced parallelism + if( ( absdiff(size0,size3) + absdiff(size1,size2) ) < ( absdiff(size0,size2) + absdiff(size1,size3) ) ) + { + cilk_spawn BlockParT(start, start+size0, subx, suby, rangebeg, qrt1range, ncutoff); // multiply subblock_0 + BlockParT(end-size3, end, subx, suby, qrt3range, rangeend, ncutoff); // multiply subblock_3 + cilk_sync; + + cilk_spawn BlockParT(start+size0, start+size0+size1, subx, suby, qrt1range, halfrange, ncutoff);// multiply subblock_1 + BlockParT(start+size0+size1, end-size3, subx, suby, halfrange, qrt3range, ncutoff); // multiply subblock_2 + cilk_sync; + } + else + { + cilk_spawn BlockParT(start, start+size0, subx, suby, rangebeg, qrt1range, ncutoff); // multiply subblock_0 + BlockParT(start+size0+size1, end-size3, subx, suby, halfrange, qrt3range, ncutoff); // multiply subblock_2 + cilk_sync; + + cilk_spawn BlockParT(start+size0, start+size0+size1, subx, suby, qrt1range, halfrange, ncutoff);// multiply subblock_1 + BlockParT(end-size3, end, subx, suby, qrt3range, rangeend, ncutoff); // multiply subblock_3 + cilk_sync; + } + } +} + + +template +template +void BiCsb::BlockParT(IT start, IT end, const RHS * __restrict subx, LHS * __restrict suby, + IT rangebeg, IT rangeend, IT cutoff) const +{ + if(end - start < cutoff) + { + IT * __restrict r_bot = bot; + for (IT k = start ; k < end ; ++k) + { + // Note the swap in cli/rli + IT cli = ((r_bot[k] >> collowbits) & lowrowmask); + IT rli = (r_bot[k] & lowcolmask); + SR::axpy(subx[cli], suby[rli]); // suby [rli] += subx [cli] where subx and suby are vectors. 
+ } + } + else + { + IT halfrange = (rangebeg+rangeend)/2; + IT qrt1range = (rangebeg+halfrange)/2; + IT qrt3range = (halfrange+rangeend)/2; + + // Lower_bound is a version of binary search: it attempts to find the element value in an ordered range [first, last) + // Specifically, it returns the first position where value could be inserted without violating the ordering + IT * mid = std::lower_bound(&bot[start], &bot[end], halfrange, mortoncmp); + IT * left = std::lower_bound(&bot[start], mid, qrt1range, mortoncmp); + IT * right = std::lower_bound(mid, &bot[end], qrt3range, mortoncmp); + + /* ------- + | 0 1 | + | 2 3 | + ------- */ + // subtracting two pointers pointing to the same array gives you the # of elements separating them + // we're *sure* that the differences are 1) non-negative, 2) small enough to be indexed by an IT + IT size0 = static_cast (left - &bot[start]); + IT size1 = static_cast (mid - left); + IT size2 = static_cast (right - mid); + IT size3 = static_cast (&bot[end] - right); + + IT ncutoff = std::max(cutoff/2, MINNNZTOPAR); + + // We can choose to perform [0,3] in parallel and then [1,2] in parallel + // or perform [0,2] in parallel and then [1,3] in parallel + // Decision is based on the balance, i.e. 
we pick the more balanced parallelism + if( ( absdiff(size0,size3) + absdiff(size1,size2) ) < ( absdiff(size0,size2) + absdiff(size1,size3) ) ) + { + cilk_spawn BlockParT(start, start+size0, subx, suby, rangebeg, qrt1range, ncutoff); // multiply subblock_0 + BlockParT(end-size3, end, subx, suby, qrt3range, rangeend, ncutoff); // multiply subblock_3 + cilk_sync; + + cilk_spawn BlockParT(start+size0, start+size0+size1, subx, suby, qrt1range, halfrange, ncutoff);// multiply subblock_1 + BlockParT(start+size0+size1, end-size3, subx, suby, halfrange, qrt3range, ncutoff); // multiply subblock_2 + cilk_sync; + } + else + { + cilk_spawn BlockParT(start, start+size0, subx, suby, rangebeg, qrt1range, ncutoff); // multiply subblock_0 + BlockParT(start+size0+size1, end-size3, subx, suby, halfrange, qrt3range, ncutoff); // multiply subblock_2 + cilk_sync; + + cilk_spawn BlockParT(start+size0, start+size0+size1, subx, suby, qrt1range, halfrange, ncutoff);// multiply subblock_1 + BlockParT(end-size3, end, subx, suby, qrt3range, rangeend, ncutoff); // multiply subblock_3 + cilk_sync; + } + } +} + +// Print stats to an ofstream object +template +ofstream & BiCsb::PrintStats(ofstream & outfile) const +{ + if(nz == 0) + { + outfile << "## Matrix Doesn't have any nonzeros" < blocksizes(ntop); + for(IT i=0; i (top[i][j+1]-top[i][j]); + } + } + sort(blocksizes.begin(), blocksizes.end()); + outfile<< "## Total nonzeros: "<< accumulate(blocksizes.begin(), blocksizes.end(), 0) << endl; + + outfile << "## Nonzero distribution (sorted) of blocks follows: \n" ; + for(IT i=0; i< ntop; ++i) + { + outfile << blocksizes[i] << "\n"; + } + outfile << endl; + return outfile; +} + +// Print top level statistics to file +template +ofstream & BiCsb::PrintTopLevel(ofstream & outfile) const +{ + if(nz == 0) + { + outfile << "## Matrix Doesn't have any nonzeros" < blocksizes(ntop); + for(IT i=0; i (top[i][j+1]-top[i][j]); + } + } + + for(IT i=0; i +ofstream & BiCsb::PrintTopLevelSparse(ofstream & outfile) 
const +{ + if(nz == 0) + { + outfile << "## Matrix Doesn't have any nonzeros" < blocksizes(ntop); + for(IT i=0; i (top[i][j+1]-top[i][j]); + } + } + + // first row contains top-level size + outfile << nbr << "," << nbc << "," << endl; + + for(IT i=0; i 0) // if block contains nz elems + outfile << i << "," << j << "," << blocksizes[i*nbc+j] << endl; + + } + } + + return outfile; +} + +///////////////////////////////// +// t-SNE kernel Implementation // +// September 2017 // +// by Kostas Mylonakis // +///////////////////////////////// + + +// double* restrict a; --> No aliases for a[0], a[1], ... +// bstart/bend: block start/end index (to the top array) +template +template +void BiCsb::SubtSNEkernel(IT * __restrict btop, IT bstart, IT bend, const RHS * __restrict x, LHS * __restrict suby, IT rhi) const +{ + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + + __m128i lcms = _mm_set1_epi32 (lowcolmask); + __m128i lrms = _mm_set1_epi32 (lowrowmask); + + IT DIM = 3; + + const RHS * __restrict subxx = &x[DIM*rhi]; + RHS Yj[3] = {0}; + RHS Yi[3] = {0}; + for (IT j = bstart ; j < bend ; ++j) // for all blocks inside that block row + { + // get higher order bits for column indices + IT chi = (j << collowbits); + const RHS * __restrict subx = &x[DIM * chi]; + + for(IT k=btop[j]; k> collowbits) & lowrowmask); + IT cli = (r_bot[k] & lowcolmask); + + Yi[:] = subx[cli*DIM+ 0:DIM]; + Yj[:] = subxx[rli*DIM+ 0:DIM]; + + /* distance computation */ + RHS dist = __sec_reduce_add( (Yj[:] - Yi[:])*(Yj[:] - Yi[:]) ); + + + /* P_{ij} \times Q_{ij} */ + const RHS p_times_q = r_num[k] / (1+dist); + suby[rli*DIM + 0:DIM] += p_times_q * (Yj[:] - Yi[:]); + + + } + } +} + +// double* restrict a; --> No aliases for a[0], a[1], ... 
+// bstart/bend: block start/end index (to the top array) +template +template +void BiCsb::SubtSNEkernel2D(IT * __restrict btop, IT bstart, IT bend, const RHS * __restrict x, LHS * __restrict suby, IT rhi) const +{ + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + + __m128i lcms = _mm_set1_epi32 (lowcolmask); + __m128i lrms = _mm_set1_epi32 (lowrowmask); + + IT DIM = 2; + + const RHS * __restrict subxx = &x[DIM*rhi]; + RHS Yj[2] = {0}; + RHS Yi[2] = {0}; + for (IT j = bstart ; j < bend ; ++j) // for all blocks inside that block row + { + // get higher order bits for column indices + IT chi = (j << collowbits); + const RHS * __restrict subx = &x[DIM * chi]; + + for(IT k=btop[j]; k> collowbits) & lowrowmask); + IT cli = (r_bot[k] & lowcolmask); + + Yi[:] = subx[cli*DIM+ 0:DIM]; + Yj[:] = subxx[rli*DIM+ 0:DIM]; + + /* distance computation */ + RHS dist = __sec_reduce_add( (Yj[:] - Yi[:])*(Yj[:] - Yi[:]) ); + + + /* P_{ij} \times Q_{ij} */ + const RHS p_times_q = r_num[k] / (1+dist); + suby[rli*DIM + 0:DIM] += p_times_q * (Yj[:] - Yi[:]); + + + } + } +} + +// double* restrict a; --> No aliases for a[0], a[1], ... 
+// bstart/bend: block start/end index (to the top array) +template +template +void BiCsb::SubtSNEkernel4D(IT * __restrict btop, IT bstart, IT bend, const RHS * __restrict x, LHS * __restrict suby, IT rhi) const +{ + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + + __m128i lcms = _mm_set1_epi32 (lowcolmask); + __m128i lrms = _mm_set1_epi32 (lowrowmask); + + IT DIM = 4; + + const RHS * __restrict subxx = &x[DIM*rhi]; + RHS Yj[4] = {0}; + RHS Yi[4] = {0}; + for (IT j = bstart ; j < bend ; ++j) // for all blocks inside that block row + { + // get higher order bits for column indices + IT chi = (j << collowbits); + const RHS * __restrict subx = &x[DIM * chi]; + + for(IT k=btop[j]; k> collowbits) & lowrowmask); + IT cli = (r_bot[k] & lowcolmask); + + Yi[:] = subx[cli*DIM+ 0:DIM]; + Yj[:] = subxx[rli*DIM+ 0:DIM]; + + /* distance computation */ + RHS dist = __sec_reduce_add( (Yj[:] - Yi[:])*(Yj[:] - Yi[:]) ); + + + /* P_{ij} \times Q_{ij} */ + const RHS p_times_q = r_num[k] / (1+dist); + suby[rli*DIM + 0:DIM] += p_times_q * (Yj[:] - Yi[:]); + + + } + } +} + +// double* restrict a; --> No aliases for a[0], a[1], ... 
+// bstart/bend: block start/end index (to the top array) +template +template +void BiCsb::SubtSNEkernel1D(IT * __restrict btop, IT bstart, IT bend, const RHS * __restrict x, LHS * __restrict suby, IT rhi) const +{ + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + + __m128i lcms = _mm_set1_epi32 (lowcolmask); + __m128i lrms = _mm_set1_epi32 (lowrowmask); + + IT DIM = 1; + + const RHS * __restrict subxx = &x[DIM*rhi]; + RHS Yj[1] = {0}; + RHS Yi[1] = {0}; + for (IT j = bstart ; j < bend ; ++j) // for all blocks inside that block row + { + // get higher order bits for column indices + IT chi = (j << collowbits); + const RHS * __restrict subx = &x[DIM * chi]; + + for(IT k=btop[j]; k> collowbits) & lowrowmask); + IT cli = (r_bot[k] & lowcolmask); + + Yi[:] = subx[cli*DIM+ 0:DIM]; + Yj[:] = subxx[rli*DIM+ 0:DIM]; + + /* distance computation */ + RHS dist = __sec_reduce_add( (Yj[:] - Yi[:])*(Yj[:] - Yi[:]) ); + + + /* P_{ij} \times Q_{ij} */ + const RHS p_times_q = r_num[k] / (1+dist); + suby[rli*DIM + 0:DIM] += p_times_q * (Yj[:] - Yi[:]); + + + } + } +} + + + +template +template +void BiCsb::SubtSNEkernel_tar(IT * __restrict btop, IT bstart, IT bend, const RHS * __restrict x, LHS * __restrict suby, IT rhi) const +{ + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + + __m128i lcms = _mm_set1_epi32 (lowcolmask); + __m128i lrms = _mm_set1_epi32 (lowrowmask); + + IT DIM = 3; + + const RHS * __restrict subxx = &x[DIM*rhi]; + RHS Yj[3] = {0}; + RHS Yi[3] = {0}; + for (IT j = bstart ; j < bend ; ++j) // for all blocks inside that block row + { + // get higher order bits for column indices + IT chi = (j << collowbits); + const RHS * __restrict subx = &x[DIM * chi]; + + for(IT k=btop[j]; k> collowbits) & lowrowmask); + IT cli = (r_bot[k] & lowcolmask); + + Yi[:] = subx[cli*DIM+ 0:DIM]; + Yj[:] = subxx[rli*DIM+ 0:DIM]; + + /* distance computation */ + RHS dist = __sec_reduce_add( (Yj[:] - Yi[:])*(Yj[:] - Yi[:]) ); + + + /* P_{ij} \times Q_{ij} */ + 
const RHS p_times_q = r_num[k] / (1+dist); + suby[rli*DIM + 0:DIM] += p_times_q * (Yj[:] - Yi[:]); + + + } + } +} + +template +template +void BiCsb::SubtSNEkernel(IT * __restrict btop, IT bstart, IT bend, + const RHS * __restrict x_row, + const RHS * __restrict x_col, + LHS * __restrict suby, + IT rhi) const +{ + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + + __m128i lcms = _mm_set1_epi32 (lowcolmask); + __m128i lrms = _mm_set1_epi32 (lowrowmask); + + IT DIM = 3; + + const RHS * __restrict subxx = &x_row[DIM*rhi]; + RHS Yj[3] = {0}; + RHS Yi[3] = {0}; + for (IT j = bstart ; j < bend ; ++j) // for all blocks inside that block row + { + // get higher order bits for column indices + IT chi = (j << collowbits); + const RHS * __restrict subx = &x_col[DIM * chi]; + + for(IT k=btop[j]; k> collowbits) & lowrowmask); + IT cli = (r_bot[k] & lowcolmask); + +#ifdef GCC_BUG + for(int di=0; di No aliases for a[0], a[1], ... +// bstart/bend: block start/end index (to the top array) +template +template +void BiCsb::SubtSNEcost(IT * __restrict btop, IT bstart, IT bend, + const RHS * __restrict x, + LHS * __restrict suby, + IT rhi, + int DIM, + double alpha, + double zeta) const +{ + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + + __m128i lcms = _mm_set1_epi32 (lowcolmask); + __m128i lrms = _mm_set1_epi32 (lowrowmask); + + const RHS * __restrict subxx = &x[DIM*rhi]; + RHS Yj[10] = {0}; + RHS Yi[10] = {0}; + for (IT j = bstart ; j < bend ; ++j) // for all blocks inside that block row + { + // get higher order bits for column indices + IT chi = (j << collowbits); + const RHS * __restrict subx = &x[DIM * chi]; + + for(IT k=btop[j]; k> collowbits) & lowrowmask); + IT cli = (r_bot[k] & lowcolmask); + + Yi[0:DIM] = subx[cli*DIM+ 0:DIM]; + Yj[0:DIM] = subxx[rli*DIM+ 0:DIM]; + // for(int d =0; d +#include +#include +#include <numeric> // for std::accumulate() +#include <limits> // C++ style numeric_limits +#include +#include "csc.h" +#include "mortoncompare.h" + +using 
namespace std; + +// CSB variant where nonzeros "within each block" are sorted w.r.t. the bit-interleaved order +// Implementer's (Aydin) notes: +// - to ensure correctness in BlockPar, we use square blocks (lowcolmask = highcolmask) +template +class BiCsb +{ +public: + BiCsb ():nz(0), m(0), n(0), nbc(0), nbr(0) {} // default constructor (dummy) + + BiCsb (IT size,IT rows, IT cols, int workers); + BiCsb (IT size,IT rows, IT cols, IT * ri, IT * ci, NT * val, int workers, IT forcelogbeta = 0); + + BiCsb (const BiCsb & rhs); // copy constructor + ~BiCsb(); + BiCsb & operator=(const BiCsb & rhs); // assignment operator + BiCsb (Csc & csc, int workers, IT forcelogbeta = 0); + + ofstream & PrintStats(ofstream & outfile) const; + IT colsize() const { return n;} + IT rowsize() const { return m;} + IT getNbc() const { return nbc;} + IT getNbr() const { return nbr;} + IT getBeta() const { return rowlowbits;} + IT numnonzeros() const { return nz; } + bool isPar() const { return ispar; } + + /* function to output top level statistics to CSV (in dense format) */ + ofstream & PrintTopLevel(ofstream & outfile) const; + + /* function to output top level statistics to CSV (in sparse format) */ + ofstream & PrintTopLevelSparse(ofstream & outfile) const; + +private: + void Init(int workers, IT forcelogbeta = 0); + + template + void SubSpMV(IT * btop, IT bstart, IT bend, const RHS * __restrict x, LHS * __restrict suby) const; + + template + void SubSpMV_tar(IT * btop, IT bstart, IT bend, const RHS * __restrict x, LHS * __restrict suby) const; + + template + void SubtSNEcost(IT * btop, IT bstart, IT bend, + const RHS * __restrict x, + LHS * __restrict suby, + IT rhi, int dim, + double alpha, double zeta) const; + + + template + void SubtSNEkernel(IT * btop, IT bstart, IT bend, + const RHS * __restrict x, + LHS * __restrict suby, + IT rhi) const; + + template + void SubtSNEkernel1D(IT * btop, IT bstart, IT bend, + const RHS * __restrict x, + LHS * __restrict suby, + IT rhi) const; + + 
template + void SubtSNEkernel2D(IT * btop, IT bstart, IT bend, + const RHS * __restrict x, + LHS * __restrict suby, + IT rhi) const; + + + template + void SubtSNEkernel4D(IT * btop, IT bstart, IT bend, + const RHS * __restrict x, + LHS * __restrict suby, + IT rhi) const; + + template + void SubtSNEkernel_tar(IT * btop, IT bstart, IT bend, + const RHS * __restrict x, + LHS * __restrict suby, + IT rhi) const; + + template + void SubtSNEkernel(IT * __restrict btop, IT bstart, IT bend, + const RHS * __restrict x_row, + const RHS * __restrict x_col, + LHS * __restrict suby, + IT rhi) const; + + template + void SubSpMVTrans(IT col, IT rowstart, IT rowend, const RHS * __restrict x, LHS * __restrict suby) const; + + template + void SubSpMVTrans(const vector< tuple > & chunk, const RHS * __restrict x, LHS * __restrict suby) const; + + template + void BMult(IT** chunks, IT start, IT end, const RHS * __restrict x, LHS * __restrict y, IT ysize) const; + + template + void BTransMult(vector< vector< tuple > * > & chunks, IT start, IT end, const RHS * __restrict x, LHS * __restrict y, IT ysize) const; + + template + void BlockPar(IT start, IT end, const RHS * __restrict subx, LHS * __restrict suby, + IT rangebeg, IT rangeend, IT cutoff) const; + + template + void BlockParT(IT start, IT end, const RHS * __restrict subx, LHS * __restrict suby, + IT rangebeg, IT rangeend, IT cutoff) const; + + void SortBlocks(pair > * pairarray, NT * val); + + IT ** top ; // pointers array (indexed by higher-order bits of the coordinate index), size ~= ntop+1 + IT * bot; // contains lower-order bits of the coordinate index, size nnz + NT * num; // contains numerical values, size nnz + + bool ispar; + IT nz; // # nonzeros + IT m; // # rows + IT n; // # columns + IT blcrange; // range indexed by one block + + IT nbc; // #{column blocks} = #{blocks in any block row} + IT nbr; // #{block rows) + + IT rowlowbits; // # lower order bits for rows + IT rowhighbits; + IT highrowmask; // mask with the first 
log(m)/2 bits = 1 and the other bits = 0 + IT lowrowmask; + + IT collowbits; // # lower order bits for columns + IT colhighbits; + IT highcolmask; // mask with the first log(n)/2 bits = 1 and the other bits = 0 + IT lowcolmask; + + MortonCompare mortoncmp; // comparison operator w.r.t. the N-morton layout + + template + friend void bicsb_gespmv (const BiCsb & A, const RHS * x, LHS * y); + + template + friend void bicsb_gespmv_tar (const BiCsb & A, const RHS * x, LHS * y); + + template + friend void bicsb_gespmvt (const BiCsb & A, const RHS * __restrict x, LHS * __restrict y); + + template + friend float RowImbalance(const CSB & A); // just befriend the BiCsb instantiation + + template + friend float ColImbalance(const BiCsb & A); + + template + friend void bicsb_tsne (const BiCsb & A, const RHS * x, LHS * y); + + template + friend void bicsb_tsne4D (const BiCsb & A, const RHS * x, LHS * y); + + template + friend void bicsb_tsne2D (const BiCsb & A, const RHS * x, LHS * y); + + template + friend void bicsb_tsne1D (const BiCsb & A, const RHS * x, LHS * y); + + template + friend void bicsb_tsne_tar (const BiCsb & A, const RHS * x, LHS * y); + + template + friend void bicsb_tsne (const BiCsb & A, + const RHS * x_row, + const RHS * x_col, LHS * y); + + template + friend void bicsb_tsne_cost (const BiCsb & A, + const RHS * x, LHS * y, int dim, + double alpha, double zeta); +}; + + +// Partial template specialization +template +class BiCsb +{ +public: + BiCsb ():nz(0), m(0), n(0), nbc(0), nbr(0) {} // default constructor (dummy) + + BiCsb (IT size,IT rows, IT cols, int workers); + BiCsb (IT size,IT rows, IT cols, IT * ri, IT * ci, int workers, IT forcelogbeta = 0); + + BiCsb (const BiCsb & rhs); // copy constructor + ~BiCsb(); + BiCsb & operator=(const BiCsb & rhs); // assignment operator + + template + BiCsb (Csc & csc, int workers); + + IT colsize() const { return n;} + IT rowsize() const { return m;} + IT numnonzeros() const { return nz; } + bool isPar() const { return 
ispar; } + +private: + void Init(int workers, IT forcelogbeta = 0); + + template + void SubSpMV(IT * btop, IT bstart, IT bend, const RHS * __restrict x, LHS * __restrict suby) const; + template + void SubtSNEkernel(IT * btop, IT bstart, IT bend, const RHS * __restrict x, LHS * __restrict suby, IT rhi) const; + + template + void SubtSNEkernel2D(IT * btop, IT bstart, IT bend, const RHS * __restrict x, LHS * __restrict suby, IT rhi) const; + + template + void SubtSNEkernel1D(IT * btop, IT bstart, IT bend, const RHS * __restrict x, LHS * __restrict suby, IT rhi) const; + + template + void SubSpMVTrans(IT col, IT rowstart, IT rowend, const RHS * __restrict x, LHS * __restrict suby) const; + + template + void SubSpMVTrans(const vector< tuple > & chunk, const RHS * __restrict x, LHS * __restrict suby) const; + + template + void BMult(IT ** chunks, IT start, IT end, const RHS * __restrict x, LHS * __restrict y, IT ysize) const; + + template + void BTransMult(vector< vector< tuple > * > & chunks, IT start, IT end, const RHS * __restrict x, LHS * __restrict y, IT ysize) const; + + template + void BlockPar(IT start, IT end, const RHS * __restrict subx, LHS * __restrict suby, + IT rangebeg, IT rangeend, IT cutoff) const; + + template + void BlockParT(IT start, IT end, const RHS * __restrict subx, LHS * __restrict suby, + IT rangebeg, IT rangeend, IT cutoff) const; + + void SortBlocks(pair > * pairarray); + + IT ** top ; // pointers array (indexed by higher-order bits of the coordinate index), size ~= ntop+1 + IT * bot; // contains lower-order bits of the coordinate index, size nnz + + bool ispar; + IT nz; // # nonzeros + IT m; // # rows + IT n; // # columns + IT blcrange; // range indexed by one block + + IT nbc; // #{column blocks} = #{blocks in any block row} + IT nbr; // #{block rows) + + IT rowlowbits; // # lower order bits for rows + IT rowhighbits; + IT highrowmask; // mask with the first log(m)/2 bits = 1 and the other bits = 0 + IT lowrowmask; + + IT collowbits; // # 
lower order bits for columns + IT colhighbits; + IT highcolmask; // mask with the first log(n)/2 bits = 1 and the other bits = 0 + IT lowcolmask; + + MortonCompare mortoncmp; // comparison operator w.r.t. the N-morton layout + + template + friend void bicsb_gespmv (const BiCsb & A, const RHS * __restrict x, LHS * __restrict y); + + template + friend void bicsb_gespmvt (const BiCsb & A, const RHS * __restrict x, LHS * __restrict y); + + template + friend float RowImbalance(const CSB & A); // befriend any CSB instantiation + + template + friend float ColImbalance(const BiCsb & A); +}; + +#include "friends.h" +#include "bicsb.cpp" +#endif + + +/*------------------------------------------------------------ + * + * AUTHORS + * + * Dimitris Floros fcdimitr@auth.gr + * + * VERSION + * + * 0.3 - December 16, 2017 + * + * CHANGELOG + * + * 0.3 (Dec 16, 2017) - Dimitris + * * incorporated TAR and TAR+ codes + * + * 0.2 (Dec 12, 2017) - Dimitris + * * added sparse output of top-level statistics + * + * 0.1 (Dec 08, 2017) - Dimitris + * * added custom function to get top-level statistics + * + * ----------------------------------------------------------*/ diff --git a/csb/bmcsb.cpp b/csb/bmcsb.cpp new file mode 100644 index 0000000..eca0d5f --- /dev/null +++ b/csb/bmcsb.cpp @@ -0,0 +1,527 @@ +#include "bmcsb.h" +#include "utility.h" + +// Choose block size as big as possible given the following constraints +// 1) The bot array is addressible by IT +// 2) The parts of x & y vectors that a block touches fits into L2 cache [assuming a saxpy() operation] +// 3) There's enough parallel slackness for block rows (at least SLACKNESS * CILK_NPROC) +template +void BmCsb::Init(int workers, IT forcelogbeta) +{ + ispar = (workers > 1); + IT roundrowup = nextpoweroftwo(m); + IT roundcolup = nextpoweroftwo(n); + + // if indices are negative, highestbitset returns -1, + // but that will be caught by the sizereq below + IT rowbits = highestbitset(roundrowup); + IT colbits = 
highestbitset(roundcolup); + bool sizereq; + if (ispar) + { + sizereq = ((IntPower<2>(rowbits) > SLACKNESS * workers) + && (IntPower<2>(colbits) > SLACKNESS * workers)); + } + else + { + sizereq = ((rowbits > 1) && (colbits > 1)); + } + if(!sizereq) + { + cerr << "Matrix too small for this library" << endl; + return; + } + + rowlowbits = rowbits-1; + collowbits = colbits-1; + IT inf = numeric_limits::max(); + IT maxbits = highestbitset(inf); + + rowhighbits = rowbits-rowlowbits; // # higher order bits for rows (has at least one bit) + colhighbits = colbits-collowbits; // # higher order bits for cols (has at least one bit) + if(ispar) + { + while(IntPower<2>(rowhighbits) < SLACKNESS * workers) + { + rowhighbits++; + rowlowbits--; + } + } + + // calculate the space that suby occupies in L2 cache + IT yL2 = IntPower<2>(rowlowbits) * sizeof(NT); + while(yL2 > L2SIZE) + { + yL2 /= 2; + rowhighbits++; + rowlowbits--; + } + + // calculate the space that subx occupies in L2 cache + IT xL2 = IntPower<2>(collowbits) * sizeof(NT); + while(xL2 > L2SIZE) + { + xL2 /= 2; + colhighbits++; + collowbits--; + } + + // blocks need to be square for correctness (maybe generalize this later?) + while(rowlowbits+collowbits > maxbits) + { + if(rowlowbits > collowbits) + { + rowhighbits++; + rowlowbits--; + } + else + { + colhighbits++; + collowbits--; + } + } + while(rowlowbits > collowbits) + { + rowhighbits++; + rowlowbits--; + } + while(rowlowbits < collowbits) + { + colhighbits++; + collowbits--; + } + assert (collowbits == rowlowbits); + lowrowmask = IntPower<2>(rowlowbits) - 1; + lowcolmask = IntPower<2>(collowbits) - 1; + if(forcelogbeta != 0) + { + IT candlowmask = IntPower<2>(forcelogbeta) -1; + cout << "Forcing beta to "<< (candlowmask+1) << " instead of the chosen " << (lowrowmask+1) << endl; + cout << "Warning : No checks are performed on the beta you have forced, anything can happen !" 
<< endl; + lowrowmask = lowcolmask = candlowmask; + rowlowbits = collowbits = forcelogbeta; + rowhighbits = rowbits-rowlowbits; + colhighbits = colbits-collowbits; + } + else + { + double sqrtn = sqrt(sqrt(static_cast(m) * static_cast(n))); + IT logbeta = static_cast(ceil(log2(sqrtn))) + 2; + if(rowlowbits > logbeta) + { + rowlowbits = collowbits = logbeta; + lowrowmask = lowcolmask = IntPower<2>(logbeta) -1; + rowhighbits = rowbits-rowlowbits; + colhighbits = colbits-collowbits; + } + cout << "Beta chosen to be "<< (lowrowmask+1) << endl; + } + highrowmask = ((roundrowup - 1) ^ lowrowmask); + highcolmask = ((roundcolup - 1) ^ lowcolmask); + + // nbc = #{block columns} = #{blocks in any block row}, nbr = #{block rows) + IT blcdimrow = lowrowmask + 1; + IT blcdimcol = lowcolmask + 1; + nbr = static_cast(ceil(static_cast(m) / static_cast(blcdimrow))); + nbc = static_cast(ceil(static_cast(n) / static_cast(blcdimcol))); + + blcrange = (lowrowmask+1) * (lowcolmask+1); // range indexed by one block + mortoncmp = MortonCompare(rowlowbits, collowbits, lowrowmask, lowcolmask); +} + + +// copy constructor +template +BmCsb::BmCsb (const BmCsb & rhs) +: nz(rhs.nz), m(rhs.m), n(rhs.n), blcrange(rhs.blcrange), nbr(rhs.nbr), nbc(rhs.nbc), nrb(rhs.nrb), +rowhighbits(rhs.rowhighbits), rowlowbits(rhs.rowlowbits), highrowmask(rhs.highrowmask), lowrowmask(rhs.lowrowmask), +colhighbits(rhs.colhighbits), collowbits(rhs.collowbits), highcolmask(rhs.highcolmask), lowcolmask(rhs.lowcolmask), +mortoncmp(rhs.mortoncmp), ispar(rhs.ispar) +{ + if(nz > 0) // nz > 0 iff nrb > 0 + { + num = new NT[nz+2](); num++; + bot = new IT[nrb]; + masks = new MTYPE[nrb]; + + copy ( rhs.num, rhs.num+nz+1, num); + copy ( rhs.bot, rhs.bot+nrb, bot ); + copy ( rhs.masks, rhs.masks+nrb, masks ); + } + if ( nbr > 0) + { + top = new IT* [nbr]; + for(IT i=0; i +BmCsb & BmCsb::operator= (const BmCsb & rhs) +{ + if(this != &rhs) + { + if(nz > 0) // if the existing object is not empty + { + // make it empty + delete [] 
masks; + delete [] bot; + delete [] (--num); + } + if(nbr > 0) + { + for(IT i=0; i 0) // if the copied object is not empty + { + num = new NT[nz+2](); num++; + bot = new IT[nrb]; + masks = new MTYPE[nrb]; + + copy ( rhs.num, rhs.num+nz+1, num); + copy ( rhs.bot, rhs.bot+nrb, bot ); + copy ( rhs.masks, rhs.masks+nrb, masks ); + } + if(nbr > 0) + { + top = new IT* [nbr]; + for(IT i=0; i +BmCsb::~BmCsb() +{ + if( nz > 0) + { + delete [] masks; + delete [] bot; + delete [] (--num); + } + if ( nbr > 0) + { + for(IT i=0; i +BmCsb::BmCsb (Csc & csc, int workers):nz(csc.nz), m(csc.m),n(csc.n) +{ + typedef std::pair ipair; + typedef std::pair mypair; + + assert(nz != 0 && n != 0 && m != 0); + Init(workers); + + num = new NT[nz+2](); num++; // Padding for SSEspmv (the blendv operation) + // bot is later to be resized to nrb (number of register blocks) + // nrb < nz as the worst case happens when each register block contains only one nonzero + + top = allocate2D(nbr, nbc+1); + mypair * pairarray = new mypair[nz]; + IT k = 0; + for(IT j = 0; j < n; ++j) + { + for (IT i = csc.jc [j] ; i < csc.jc[j+1] ; ++i) // scan the jth column + { + // concatenate the higher/lower order half of both row (first) index and col (second) index bits + IT hindex = (((highrowmask & csc.ir[i] ) >> rowlowbits) << colhighbits) + | ((highcolmask & j) >> collowbits); + IT lindex = ((lowrowmask & csc.ir[i]) << collowbits) | (lowcolmask & j) ; + + // i => location of that nonzero in csc.ir and csc.num arrays + pairarray[k++] = mypair(hindex, ipair(lindex,i)); + } + } + sort(pairarray, pairarray+nz); // sort according to hindex + SortBlocks(pairarray, csc.num); + delete [] pairarray; +} + +template +void BmCsb::SortBlocks(pair > * pairarray, NT * val) +{ + typedef pair > mypair; + IT cnz = 0; + IT crb = 0; // current register block + IT ldim = IntPower<2>(colhighbits); // leading dimension (not always equal to nbc) + vector tempbot; + vector M; + for(IT i = 0; i < nbr; ++i) + { + for(IT j = 0; j < nbc; 
++j) + { + top[i][j] = tempbot.size(); // top array now points to register blocks (instead of nonzeros) + IT prevcnz = cnz; + std::vector blocknz; + while(cnz < nz && pairarray[cnz].first == ((i*ldim)+j) ) // as long as we're in this block + { + IT lowbits = pairarray[cnz].second.first; + IT rlowbits = ((lowbits >> collowbits) & lowrowmask); + IT clowbits = (lowbits & lowcolmask); + IT bikey = BitInterleaveLow(rlowbits, clowbits); + + blocknz.push_back(mypair(bikey, pairarray[cnz++].second)); + } + // sort the block into bitinterleaved order + sort(blocknz.begin(), blocknz.end()); + + int lastregblk = -1; + IT bnz = blocknz.size(); + + for(IT bcur=0; bcur < bnz; ++bcur) + { + int curregblk = getDivident(blocknz[bcur].first, RBSIZE); + if(curregblk > lastregblk) // new register block + { + lastregblk = curregblk; + M.push_back((MTYPE) 0); + + // The following lines implement a get_head function that returns + // the top-left index of the register block that this nonzero belongs + IT Ci = blocknz[bcur].second.first & lowcolmask; + IT Ri = (blocknz[bcur].second.first >> collowbits) & lowrowmask; + Ci -= getModulo(Ci,RBDIM); + Ri -= getModulo(Ri,RBDIM); + IT lefttop = ((lowrowmask & Ri) << collowbits) | (lowcolmask & Ci); + + tempbot.push_back(lefttop); + } + M.back() |= GetMaskTable(getModulo(blocknz[bcur].first, RBSIZE)); + } + for(IT k=prevcnz; k +void BmCsb::BMult(IT** chunks, IT start, IT end, const NT * x, NT * y, IT ysize, IT * __restrict sumscan) const +{ + assert(end-start > 0); // there should be at least one chunk + if (end-start == 1) // single chunk + { + if((chunks[end] - chunks[start]) == 1) // chunk consists of a single (normally dense) block + { + IT chi = ( (chunks[start] - chunks[0]) << collowbits); + + // m-chi > lowcolmask for all blocks except the last skinny tall one. 
+ // if the last one is regular too, then it has m-chi = lowcolmask+1 + if(ysize == (lowrowmask+1) && (m-chi) > lowcolmask ) // parallelize if it is a regular/complete block + { + const NT * __restrict subx = &x[chi]; + BlockPar( *(chunks[start]) , *(chunks[end]), subx, y, 0, blcrange, BREAKNRB * ysize, sumscan); + } + else // otherwise block parallelization will fail + { + SubSpMV(chunks[0], chunks[start]-chunks[0], chunks[end]-chunks[0], x, y, sumscan); + } + } + else // a number of sparse blocks with a total of at most O(\beta) nonzeros + { + SubSpMV(chunks[0], chunks[start]-chunks[0], chunks[end]-chunks[0], x, y, sumscan); + } + } + else + { + IT mid = (start+end)/2; // divide chunks into half + cilk_spawn BMult(chunks, start, mid, x, y, ysize, sumscan); + if(SYNCHED) + { + BMult(chunks, mid, end, x, y, ysize, sumscan); + } + else + { + NT * temp = new NT[ysize]; + std::fill_n(temp, ysize, 0.0); + BMult(chunks, mid, end, x, temp, ysize, sumscan); + cilk_sync; + for(IT i=0; i +void BmCsb::BlockPar(IT start, IT end, const NT * __restrict subx, NT * __restrict suby, + IT rangebeg, IT rangeend, IT cutoff, IT * __restrict sumscan) const +{ + assert(IsPower2(rangeend-rangebeg)); + if(end - start < cutoff) + { + SSEspmv(num + sumscan[start], masks + start, bot + start, end-start, subx, suby, lowcolmask, lowrowmask, collowbits); + } + else + { + // Lower_bound is a version of binary search: it attempts to find the element value in an ordered range [first, last) + // Specifically, it returns the first position where value could be inserted without violating the ordering + IT halfrange = (rangebeg+rangeend)/2; + IT qrt1range = (rangebeg+halfrange)/2; + IT qrt3range = (halfrange+rangeend)/2; + + IT * mid = std::lower_bound(&bot[start], &bot[end], halfrange, mortoncmp); + IT * left = std::lower_bound(&bot[start], mid, qrt1range, mortoncmp); + IT * right = std::lower_bound(mid, &bot[end], qrt3range, mortoncmp); + + /* ------- + | 0 2 | + | 1 3 | + ------- */ + // 
subtracting two pointers pointing to the same array gives you the # of elements separating them + // we're *sure* that the differences are 1) non-negative, 2) small enough to be indexed by an IT + IT size0 = static_cast (left - &bot[start]); + IT size1 = static_cast (mid - left); + IT size2 = static_cast (right - mid); + IT size3 = static_cast (&bot[end] - right); + + IT ncutoff = std::max(cutoff/2, MINNRBTOPAR); + + // We can choose to perform [0,3] in parallel and then [1,2] in parallel + // or perform [0,1] in parallel and then [2,3] in parallel + // Decision is based on the balance, i.e. we pick the more balanced parallelism + if( ( absdiff(size0,size3) + absdiff(size1,size2) ) < ( absdiff(size0,size1) + absdiff(size2,size3) ) ) + { + cilk_spawn BlockPar(start, start+size0, subx, suby, rangebeg, qrt1range, ncutoff,sumscan); // multiply subblock_0 + BlockPar(end-size3, end, subx, suby, qrt3range, rangeend, ncutoff,sumscan); // multiply subblock_3 + cilk_sync; + + cilk_spawn BlockPar(start+size0, start+size0+size1, subx, suby, qrt1range, halfrange, ncutoff,sumscan); // multiply subblock_1 + BlockPar(start+size0+size1, end-size3, subx, suby, halfrange, qrt3range, ncutoff,sumscan); // multiply subblock_2 + cilk_sync; + } + else + { + cilk_spawn BlockPar(start, start+size0, subx, suby, rangebeg, qrt1range, ncutoff,sumscan); // multiply subblock_0 + BlockPar(start+size0, start+size0+size1, subx, suby, qrt1range, halfrange, ncutoff,sumscan); // multiply subblock_1 + cilk_sync; + + cilk_spawn BlockPar(start+size0+size1, end-size3, subx, suby, halfrange, qrt3range, ncutoff,sumscan); // multiply subblock_2 + BlockPar(end-size3, end, subx, suby, qrt3range, rangeend, ncutoff,sumscan); // multiply subblock_3 + cilk_sync; + } + } +} + + +// double* restrict a; --> No aliases for a[0], a[1], ... 
+// bstart/bend: block start/end index (to the top array) +template +void BmCsb::SubSpMV(IT * __restrict btop, IT bstart, IT bend, const NT * __restrict x, NT * __restrict suby, IT * __restrict sumscan) const +{ + for (IT j = bstart ; j < bend ; ++j) // for all blocks inside that block row + { + IT chi = (j << collowbits); // &x[chi] addresses the higher order bits for column indices + + if(btop[j+1] - btop[j] > 0) + { + SSEspmv(num + sumscan[btop[j]], masks + btop[j], bot + btop[j], btop[j+1]-btop[j], x+chi, suby, lowcolmask, lowrowmask, collowbits); + } + } +} + + +// Print stats to an ofstream object +template +ofstream & BmCsb::PrintStats(ofstream & outfile) const +{ + if(nz == 0) + { + outfile << "## Matrix Doesn't have any nonzeros" < blocksizes(ntop); + for(IT i=0; i (top[i][j+1]-top[i][j]); + } + } + sort(blocksizes.begin(), blocksizes.end()); + outfile<< "## Total number of nonzeros: " << nz << endl; + outfile<< "## Total number of register blocks: "<< accumulate(blocksizes.begin(), blocksizes.end(), 0) << endl; + outfile<< "## Average fill ratio is: " << static_cast(nz) / static_cast((RBSIZE * nrb)) << endl; + outfile<< "## The histogram of fill ratios within register blocks:" << endl; + + unsigned * counts = new unsigned[nrb]; + popcountall(masks, counts, nrb); + printhistogram(counts, nrb, RBSIZE); + delete [] counts; + + outfile << "## Nonzero distribution (sorted) of blocks follows: \n" ; + for(IT i=0; i< ntop; ++i) + { + outfile << blocksizes[i] << "\n"; + } + outfile << endl; + return outfile; +} diff --git a/csb/bmcsb.h b/csb/bmcsb.h new file mode 100644 index 0000000..f2e880a --- /dev/null +++ b/csb/bmcsb.h @@ -0,0 +1,90 @@ +#ifndef _BMCSB_H +#define _BMCSB_H + +#include +#include +#include +#include // for std::accumulate() +#include // C++ style numeric_limits +#include "csc.h" +#include "mortoncompare.h" + +using namespace std; + +void SSEspmv(const double * __restrict V, const uint64_t * __restrict M, const unsigned * __restrict bot, const 
unsigned nrb, const double * __restrict X, double * Y, unsigned lcmask, unsigned lrmask, unsigned clbits); + +void SSEspmv(const double * __restrict V, const unsigned short * __restrict M, const unsigned * __restrict bot, const unsigned nrb, const double * __restrict X, double * Y, unsigned lcmask, unsigned lrmask, unsigned clbits); + +void SSEspmv(const double * __restrict V, const unsigned char * __restrict M, const unsigned * __restrict bot, const unsigned nrb, const double * __restrict X, double * Y, unsigned lcmask, unsigned lrmask, unsigned clbits); + +template +class BmCsb +{ +public: + BmCsb ():nz(0), m(0), n(0), nbc(0), nbr(0) {} // default constructor (dummy) + + BmCsb (const BmCsb & rhs); // copy constructor + ~BmCsb(); + BmCsb & operator=(const BmCsb & rhs); // assignment operator + BmCsb (Csc & csc, int workers); + + ofstream & PrintStats(ofstream & outfile) const; + IT colsize() const { return n;} + IT rowsize() const { return m;} + IT numregb() const { return nrb;} + IT numnonzeros() const { return nz; } + bool isPar() const { return ispar; } + +private: + typedef typename int_least_helper::least MTYPE; + + void Init(int workers, IT forcelogbeta = 0); + + void SubSpMV(IT * btop, IT bstart, IT bend, const NT * __restrict x, NT * __restrict suby, IT * __restrict sumscan) const; + + void BMult(IT** chunks, IT start, IT end, const NT * __restrict x, NT * __restrict y, IT ysize, IT * __restrict sumscan) const; + + + void BlockPar(IT start, IT end, const NT * __restrict subx, NT * __restrict suby, + IT rangebeg, IT rangeend, IT cutoff, IT * __restrict sumscan) const; + + void SortBlocks(pair > * pairarray, NT * val); + + IT ** top ; // pointers array (indexed by higher-order bits of the coordinate index), size = nbr*(nbc+1) + IT * bot; // contains lower-order bits of the coordinate index, size nrb + MTYPE * masks; // array of masks, size nrb + NT * num; // contains numerical values, size nnz + + bool ispar; + IT nz; // # nonzeros + IT m; // # rows + IT n; 
// # columns + IT blcrange; // range indexed by one block + + IT nbc; // #{column blocks} = #{blocks in any block row} + IT nbr; // #{block rows} + IT nrb; // #{register blocks} + + IT rowlowbits; // # lower order bits for rows + IT rowhighbits; + IT highrowmask; // mask with the first log(m)/2 bits = 1 and the other bits = 0 + IT lowrowmask; + + IT collowbits; // # lower order bits for columns + IT colhighbits; + IT highcolmask; // mask with the first log(n)/2 bits = 1 and the other bits = 0 + IT lowcolmask; + + MortonCompare mortoncmp; // comparison operator w.r.t. the (inverted N)-morton layout + + template + friend void bmcsb_gespmv (const BmCsb & A, const NU * x, NU * y); + + template + friend float RowImbalance(const CSB & A); // befriend any CSB instantiation +}; + + +#include "friends.h" +#include "bmcsb.cpp" +#endif diff --git a/csb/bmsym.cpp b/csb/bmsym.cpp new file mode 100644 index 0000000..879297f --- /dev/null +++ b/csb/bmsym.cpp @@ -0,0 +1,681 @@ +#include "bmsym.h" +#include "utility.h" + +// Choose block size as big as possible given the following constraints +// 1) The bot array is addressable by IT +// 2) The parts of x & y vectors that a block touches fit into L2 cache [assuming a saxpy() operation] +// 3) There's enough parallel slackness for block rows (at least SLACKNESS * CILK_NPROC) +template +void BmSym::Init(int workers, IT forcelogbeta) +{ + ispar = (workers > 1); + IT roundup = nextpoweroftwo(n); + + // if indices are negative, highestbitset returns -1, + // but that will be caught by the sizereq below + IT nbits = highestbitset(roundup); + bool sizereq; + if (ispar) + { + sizereq = (IntPower<2>(nbits) > SLACKNESS * workers); + } + else + { + sizereq = (nbits > 1); + } + if(!sizereq) + { + cerr << "Matrix too small for this library" << endl; + return; + } + + nlowbits = nbits-1; + IT inf = numeric_limits::max(); + IT maxbits = highestbitset(inf); + + nhighbits = nbits-nlowbits; // # higher order bits for rows (has at least one bit) + 
if(ispar) + { + while(IntPower<2>(nhighbits) < SLACKNESS * workers) + { + nhighbits++; + nlowbits--; + } + } + + // calculate the space that suby and subx occupy in L2 cache + IT yL2 = IntPower<2>(nlowbits) * sizeof(NT); + while(yL2 > L2SIZE) + { + yL2 /= 2; + nhighbits++; + nlowbits--; + } + + lowmask = IntPower<2>(nlowbits) - 1; + if(forcelogbeta != 0) + { + IT candlowmask = IntPower<2>(forcelogbeta) -1; + cout << "Forcing beta to "<< (candlowmask+1) << " instead of the chosen " << (lowmask+1) << endl; + cout << "Warning : No checks are performed on the beta you have forced, anything can happen !" << endl; + lowmask = candlowmask; + nlowbits = forcelogbeta; + nhighbits = nbits-nlowbits; + } + else + { + double sqrtn = sqrt(static_cast(n)); IT logbeta = static_cast(ceil(log2(sqrtn))) + 2; + if(nlowbits > logbeta) + { + nlowbits = logbeta; + lowmask = IntPower<2>(logbeta) -1; + nhighbits = nbits-nlowbits; + } + cout << "Beta chosen to be "<< (lowmask+1) << endl; + } + highmask = ((roundup - 1) ^ lowmask); + + IT blcdim = lowmask + 1; + ncsb = static_cast(ceil(static_cast(n) / static_cast(blcdim))); + + blcrange = (lowmask+1) * (lowmask+1); // range indexed by one block + mortoncmp = MortCompSym(nlowbits, lowmask); +} + + +// copy constructor +template +BmSym::BmSym (const BmSym & rhs) +: nz(rhs.nz), n(rhs.n), blcrange(rhs.blcrange), ncsb(rhs.ncsb), nrb(rhs.nrb), +nhighbits(rhs.nhighbits), nlowbits(rhs.nlowbits), diagonal(rhs.diagonal), +highmask(rhs.highmask), lowmask(rhs.lowmask), mortoncmp(rhs.mortoncmp), ispar(rhs.ispar) +{ + if(nz > 0) // nz > 0 iff nrb > 0 + { + num = new NT[nz+2](); // pad from both sides + num++; + + bot = new IT[nrb]; + masks = new MTYPE[nrb]; + scansum = new IT[nrb]; + + copy ( rhs.num, rhs.num+nz+1, num); + copy ( rhs.bot, rhs.bot+nrb, bot ); + copy ( rhs.masks, rhs.masks+nrb, masks ); + copy ( rhs.scansum, rhs.scansum+nrb, scansum ); + } + if ( ncsb > 0) + { + top = new IT* [ncsb]; + for(IT i=0; i +BmSym & BmSym::operator= (const BmSym & 
rhs) +{ + if(this != &rhs) + { + if(nz > 0) // if the existing object is not empty + { + // make it empty + delete [] scansum; + delete [] masks; + delete [] bot; + delete [] (--num); + } + if(ncsb > 0) + { + for(IT i=0; i 0) // if the copied object is not empty + { + num = new NT[nz+2](); num++; + bot = new IT[nrb]; + masks = new MTYPE[nrb]; + scansum = new IT[nrb]; + + copy ( rhs.num, rhs.num+nz+1, num); + copy ( rhs.bot, rhs.bot+nrb, bot ); + copy ( rhs.masks, rhs.masks+nrb, masks ); + copy ( rhs.scansum, rhs.scansum+nrb, scansum ); + } + if(ncsb > 0) + { + top = new IT* [ncsb]; + for(IT i=0; i +BmSym::~BmSym() +{ + if( nz > 0) + { + delete [] scansum; + delete [] masks; + delete [] bot; + delete [] (--num); + } + if ( ncsb > 0) + { + for(IT i=0; i +BmSym::BmSym (Csc & csc, int workers):nz(csc.nz), n(csc.n) +{ + typedef std::pair ipair; + typedef std::pair mypair; + + assert(nz != 0 && n != 0); + Init(workers); + + top = new IT* [ncsb]; + for(IT i=0; i> nlowbits) << nhighbits) | ((highmask & j) >> nlowbits); + IT lindex = ((lowmask & csc.ir[i]) << nlowbits) | (lowmask & j) ; + + // i => location of that nonzero in csc.ir and csc.num arrays + pairarray[k++] = mypair(hindex, ipair(lindex,i)); + } + } + sort(pairarray, pairarray+nz); // sort according to hindex + SortBlocks(pairarray, csc.num); + delete [] pairarray; +} + +template +void BmSym::SortBlocks(pair > * pairarray, NT * val) +{ + typedef pair > mypair; + IT cnz = 0; + IT crb = 0; // current register block + IT ldim = IntPower<2>(nhighbits); // leading dimension (not always equal to ncsb) + vector tempnum; + vector tempbot; + vector M; + for(IT i = 0; i < ncsb; ++i) + { + for(IT j = 0; j < (ncsb-i); ++j) + { + top[i][j] = tempbot.size(); // top points to register blocks + IT prevcnz = cnz; + std::vector blocknz; + while(cnz < nz && pairarray[cnz].first == ((i*ldim)+(j+i)) ) // as long as we're in this block + { + IT interlowbits = pairarray[cnz].second.first; + IT rlowbits = ((interlowbits >> nlowbits) & 
lowmask); + IT clowbits = (interlowbits & lowmask); + IT bikey = BitInterleaveLow(rlowbits, clowbits); + + if(j == 0 && rlowbits == clowbits) + { + diagonal.push_back(make_pair((i << nlowbits)+rlowbits, val[pairarray[cnz++].second.second])); + } + else + { + blocknz.push_back(mypair(bikey, pairarray[cnz++].second)); + } + } + // sort the block into bitinterleaved order + sort(blocknz.begin(), blocknz.end()); + + int lastregblk = -1; + for(typename vector::iterator itr = blocknz.begin(); itr != blocknz.end(); ++itr) + { + tempnum.push_back( val[itr->second.second] ); + + int curregblk = getDivident(itr->first, RBSIZE); + if(curregblk > lastregblk) // new register block + { + lastregblk = curregblk; + M.push_back((MTYPE) 0); + + // The following lines implement a get_head function that returns + // the top-left index of the register block that this nonzero belongs + IT Ci = itr->second.first & lowmask; + IT Ri = (itr->second.first >> nlowbits) & lowmask; + Ci -= getModulo(Ci,RBDIM); + Ri -= getModulo(Ri,RBDIM); + IT lefttop = ((lowmask & Ri) << nlowbits) | (lowmask & Ci); + + tempbot.push_back(lefttop); + } + M.back() |= GetMaskTable(getModulo(itr->first, RBSIZE)); + } + } + top[i][ncsb-i] = tempbot.size(); + } + + assert (cnz == nz); + nz = tempnum.size(); // update the number of off-diagonal nonzeros + nrb = tempbot.size(); // update the number of off-diagonal register blocks + masks = new MTYPE[nrb]; + scansum = new IT[nrb]; + bot = new IT[nrb]; + num = new NT[nz+2](); num++; // padded for blendv in both sides + + copy(M.begin(), M.end(), masks); + prescan(scansum, masks, nrb); + copy(tempbot.begin(), tempbot.end(), bot); + copy(tempnum.begin(), tempnum.end(), num); +} + +template +void BmSym::DivideIterationSpace(IT * & lspace, IT * & rspace, IT & lsize, IT & rsize, IT size, IT d) const +{ + if(d == 1) + { + lsize = size-size/2; + rsize = size/2; + lspace = new IT[lsize]; + rspace = new IT[rsize]; + for(IT i=0; i rsize) + { + lspace[lsize-1] = size-1; + } + } + 
else if(d == 2) + { + IT chunksfour = size/4; // we alternate chunks of two + IT rest = size - 4*chunksfour; // rest is modulus 4 + lsize = 2*chunksfour; + rsize = 2*chunksfour; + if(rest > 2) + { + rsize += (rest-2); + lsize += 2; + } + else + { + lsize += rest; + } + + lspace = new IT[lsize]; + rspace = new IT[rsize]; + + for(IT i=0; i +void BmSym::MultAddAtomics(NT * __restrict y, const NT * __restrict x, const IT d) const +{ + cilk_for(IT i=0; i< ncsb-d; ++i) // all blocks at the dth diagonal and beyond + { + IT rhi = (i << nlowbits); + + cilk_for(IT j=d; j < (ncsb-i); ++j) + { + IT chi = ((j+i) << nlowbits); + symcsr(num+scansum[top[i][j]], masks+top[i][j], bot+top[i][j], top[i][j+1]-top[i][j], x+chi, x+rhi, y+rhi, y+chi, lowmask, nlowbits); + } + } +} + + +template +void BmSym::MultMainDiag(NT * __restrict y, const NT * __restrict x) const +{ + if(Imbalance(0) > 2 * BALANCETH) // factor of 2: main diagonal has twice as much parallelism as other diagonals + { + cilk_for(IT i=0; i< ncsb; ++i) // in main diagonal, j = i + { + IT hi = (i << nlowbits); + + if(i == (ncsb-1) && (n-hi) <= lowmask) // last iteration and it's irregular (can't parallelize) + { + SSEsym(num + scansum[top[i][0]], masks + top[i][0], bot + top[i][0], top[i][1]-top[i][0], x+hi, y+hi, lowmask, nlowbits); + } + else + { + BlockTriPar(top[i][0], top[i][1], x+hi, y+hi, 0, blcrange, BREAKNRB * (nlowbits+1)); + } + } + } + else // No need for block parallelization + { + cilk_for(IT i=0; i< ncsb; ++i) // in main diagonal, j = i + { + IT hi = (i << nlowbits); + SSEsym(num + scansum[top[i][0]], masks + top[i][0], bot + top[i][0], top[i][1]-top[i][0], x+hi, y+hi, lowmask, nlowbits); + } + } + + const IT diagsize = diagonal.size(); + cilk_for(IT i=0; i < diagsize; ++i) + { + y[diagonal[i].first] += diagonal[i].second * x[diagonal[i].first]; // process the diagonal + } +} + + +// Multiply the nth block diagonal +// which is composed of blocks A[i][i+n] +template +void BmSym::MultDiag(NT * __restrict y, 
const NT * __restrict x, const IT d) const +{ + IT * lspace; + IT * rspace; + IT lsize, rsize; + DivideIterationSpace(lspace, rspace, lsize, rsize, ncsb-d, d); + + IT lsum = 0; + IT rsum = 0; + for(IT k=0; k BALANCETH * lave) // relative denser block + && (!(lspace[i] == (ncsb-d-1) && (n-chi) <= lowmask))) // and parallelizable + { + BlockPar(start, end, x+chi, x+rhi, y+rhi, y+chi, 0, blcrange, BREAKNRB * (nlowbits+1)); + } + else + { + SSEsym(num + scansum[start], masks + start, bot + start, end-start, x+chi, x+rhi, y+rhi, y+chi, lowmask, nlowbits); + } + } + cilk_for(IT j=0; j< rsize; ++j) + { + IT rhi = (rspace[j] << nlowbits) ; + IT chi = ((rspace[j]+d) << nlowbits); + IT start = top[rspace[j]][d]; + IT end = top[rspace[j]][d+1]; + + if((top[rspace[j]][d+1] - top[rspace[j]][d] > BALANCETH * rave) // relative denser block + && (!(rspace[j] == (ncsb-d-1) && (n-chi) <= lowmask))) // and parallelizable + { + BlockPar(start, end, x+chi, x+rhi, y+rhi, y+chi, 0, blcrange, BREAKNRB * (nlowbits+1)); + } + else + { + SSEsym(num + scansum[start], masks + start, bot + start, end-start, x+chi, x+rhi, y+rhi, y+chi, lowmask, nlowbits); + } + } + delete [] lspace; + delete [] rspace; +} + +// Block parallelization for upper triangular compressed sparse blocks +// start/end: element start/end positions (indices to the bot array) +// bot[start...end] always fall in the `same block +// PRECONDITION: rangeend-rangebeg is a power of two +template +void BmSym::BlockTriPar(IT start, IT end, const NT * __restrict subx, NT * __restrict suby, + IT rangebeg, IT rangeend, IT cutoff) const +{ + assert(IsPower2(rangeend-rangebeg)); + if(end - start < cutoff) + { + SSEsym(num + scansum[start], masks + start, bot + start, end-start, subx, suby, lowmask, nlowbits); + } + else + { + // Lower_bound is a version of binary search: it attempts to find the element value in an ordered range [first, last) + // Specifically, it returns the first position where value could be inserted without violating 
the ordering + IT halfrange = (rangebeg+rangeend)/2; + IT qrt1range = (rangebeg+halfrange)/2; + IT qrt3range = (halfrange+rangeend)/2; + + IT * mid = std::lower_bound(&bot[start], &bot[end], halfrange, mortoncmp); // divides in mid column + IT * right = std::lower_bound(mid, &bot[end], qrt3range, mortoncmp); + + /* ------- + | 0 2 | + | 1 3 | + ------- */ + // subtracting two pointers pointing to the same array gives you the # of elements separating them + // In the symmetric case, quadrant "1" doesn't exist (size1 = 0) + IT size0 = static_cast (mid - &bot[start]); + IT size2 = static_cast (right - mid); + IT size3 = static_cast (&bot[end] - right); + + IT ncutoff = std::max(cutoff/2, MINNRBTOPAR); + + cilk_spawn BlockTriPar(start, start+size0, subx, suby, rangebeg, qrt1range, ncutoff); // multiply subblock_0 + BlockTriPar(end-size3, end, subx, suby, qrt3range, rangeend, ncutoff); // multiply subblock_3 + cilk_sync; + + BlockPar(start+size0, end-size3, subx, subx, suby, suby, halfrange, qrt3range, ncutoff); // multiply subblock_2 + } +} + +// Parallelize the block itself +// start/end: element start/end positions (indices to the bot array) +// bot[start...end] always fall in the same block +// PRECONDITION: rangeend-rangebeg is a power of two +// TODO: we rely on the particular implementation of lower_bound for correctness, which is dangerous ! +// what if lhs (instead of rhs) parameter to the comparison object is the splitter? +template +void BmSym::BlockPar(IT start, IT end, const NT * __restrict subx, const NT * __restrict subx_mirror, + NT * __restrict suby, NT * __restrict suby_mirror, IT rangebeg, IT rangeend, IT cutoff) const +{ + assert(IsPower2(rangeend-rangebeg)); + if(end - start < cutoff) + { + // Aliasing is not an issue here. 
BlockPar is only called on off-diagonal register blocks + SSEsym(num + scansum[start], masks + start, bot + start, end-start, subx, subx_mirror, suby, suby_mirror, lowmask, nlowbits); + } + else + { + // Lower_bound is a version of binary search: it attempts to find the element value in an ordered range [first, last) + // Specifically, it returns the first position where value could be inserted without violating the ordering + IT halfrange = (rangebeg+rangeend)/2; + IT qrt1range = (rangebeg+halfrange)/2; + IT qrt3range = (halfrange+rangeend)/2; + + IT * mid = std::lower_bound(&bot[start], &bot[end], halfrange, mortoncmp); + IT * left = std::lower_bound(&bot[start], mid, qrt1range, mortoncmp); + IT * right = std::lower_bound(mid, &bot[end], qrt3range, mortoncmp); + + /* ------- + | 0 2 | + | 1 3 | + ------- */ + // subtracting two pointers pointing to the same array gives you the # of elements separating them + // we're *sure* that the differences are 1) non-negative, 2) small enough to be indexed by an IT + IT size0 = static_cast (left - &bot[start]); + IT size1 = static_cast (mid - left); + IT size2 = static_cast (right - mid); + IT size3 = static_cast (&bot[end] - right); + + IT ncutoff = std::max(cutoff/2, MINNRBTOPAR); + + // We only perform [0,3] in parallel and then [1,2] in parallel because the symmetric update causes races when + // performing [0,1] in parallel (as it would perform [0,2] in the fictitious lower triangular part) + cilk_spawn BlockPar(start, start+size0, subx, subx_mirror, suby, suby_mirror, rangebeg, qrt1range, ncutoff); // multiply subblock_0 + BlockPar(end-size3, end, subx, subx_mirror, suby, suby_mirror, qrt3range, rangeend, ncutoff); // multiply subblock_3 + cilk_sync; + + cilk_spawn BlockPar(start+size0, start+size0+size1, subx, subx_mirror, suby, suby_mirror, qrt1range, halfrange, ncutoff); // multiply subblock_1 + BlockPar(start+size0+size1, end-size3, subx, subx_mirror, suby, suby_mirror, halfrange, qrt3range, ncutoff); // multiply 
subblock_2 + cilk_sync; + } +} + + +// double* restrict a; --> No aliases for a[0], a[1], ... +// bstart/bend: block start/end index (to the top array) +template +void BmSym::SeqSpMV(const NT * __restrict x, NT * __restrict y) const +{ + const IT diagsize = diagonal.size(); + for(IT i=0; i < diagsize; ++i) + { + y[diagonal[i].first] += diagonal[i].second * x[diagonal[i].first]; // process the diagonal + } + for (IT i = 0 ; i < ncsb ; ++i) // for all block rows of A + { + IT rhi = (i << nlowbits); + for (IT j = 1 ; j < (ncsb-i) ; ++j) // for all blocks inside that block row + { + IT chi = ((j+i) << nlowbits); + SSEsym(num + scansum[top[i][j]], masks+top[i][j], bot+top[i][j], top[i][j+1]-top[i][j], x+chi, x+rhi, y+rhi, y+chi, lowmask, nlowbits); + } + + SSEsym(num + scansum[top[i][0]], masks+top[i][0], bot+top[i][0], top[i][1]-top[i][0], x+rhi, y+rhi, lowmask, nlowbits); + } +} + +// Imbalance in the dth block diagonal (the main diagonal is the 0th) +template +float BmSym::Imbalance(IT d) const +{ + if(ncsb <= d+1) + { + return 0.0; // no such diagonal exist + } + // get the average without the last left-over blockrow + IT size = ncsb-d-1; + IT * sums = new IT[size]; + for(size_t i=0; i< size; ++i) + { + sums[i] = top[i][d+1] - top[i][d]; + } + IT max = *max_element(sums, sums+size); + IT mean = accumulate(sums, sums+size, 0.0) / size; + delete [] sums; + + return static_cast(max) / mean; +} + + +// Total number of register blocks in the dth block diagonal (the main diagonal is the 0th) +template +IT BmSym::nrbsum(IT d) const +{ + IT sum = 0; + for(size_t i=0; i< ncsb-d; ++i) + { + sum += (top[i][d+1] - top[i][d]); + } + return sum; +} + +// Print stats to an ofstream object +template +ofstream & BmSym::PrintStats(ofstream & outfile) const +{ + if(nz == 0) + { + outfile << "## Matrix Doesn't have any nonzeros" <(nz) / static_cast((RBSIZE * nrb)) << endl; + outfile << "## Number of real blocks is "<< ntop << endl; + outfile << "## Main (0th) block diagonal imbalance: 
" << Imbalance(0) << endl; + outfile << "## 1st block diagonal imbalance: " << Imbalance(1) << endl; + outfile << "## 2nd block diagonal imbalance: " << Imbalance(2) << endl; + + outfile << "## nrb ratios (block diagonal 0,1,2): " << static_cast(nrbsum(0)) / nrb << ", " + << static_cast(nrbsum(1)) / nrb << ", " << static_cast(nrbsum(2)) / nrb << endl; + outfile << "## atomics ratio: " << static_cast(nrb-nrbsum(0)-nrbsum(1)-nrbsum(2))/nrb << endl; + + outfile<< "## Total number of nonzeros: " << nz << endl; + outfile<< "## Total number of register blocks: "<< nrb << endl; + return outfile; +} + + +template +ofstream & BmSym::Dump(ofstream & outfile) const +{ + for(IT i =0; i> nlowbits) & lowmask); + IT cli = bot[k] & lowmask; + outfile << "A(" << rli << "," << cli << ")=" << num[k] << endl; + } + } + } + return outfile; +} diff --git a/csb/bmsym.h b/csb/bmsym.h new file mode 100644 index 0000000..040a67d --- /dev/null +++ b/csb/bmsym.h @@ -0,0 +1,119 @@ +#ifndef _BMSYM_H +#define _BMSYM_H + +#include +#include +#include +#include // for std:accumulate() +#include // C++ style numeric_limits +#include +#include +#include "csc.h" +#include "mortoncompare.h" + +using namespace std; + +void symcsr(const double * __restrict V, const unsigned char * __restrict M, const unsigned * __restrict bot, const unsigned nrb, + const double * __restrict X, const double * __restrict XT, double * Y, double * YT, unsigned lcmask, unsigned nlbits); + +void symcsr(const double * __restrict V, const unsigned short * __restrict M, const unsigned * __restrict bot, const unsigned nrb, + const double * __restrict X, const double * __restrict XT, double * Y, double * YT, unsigned lcmask, unsigned nlbits); + +void symcsr(const double * __restrict V, const uint64_t * __restrict M, const unsigned * __restrict bot, const unsigned nrb, + const double * __restrict X, const double * __restrict XT, double * Y, double * YT, unsigned lcmask, unsigned nlbits); + +void SSEsym(const double * __restrict V, 
const unsigned char * __restrict M, const unsigned * __restrict bot, const unsigned nrb, + const double * __restrict X, const double * __restrict XT, double * Y, double * YT, unsigned lowmask, unsigned nlbits); + +void SSEsym(const double * __restrict V, const unsigned char * __restrict M, const unsigned * __restrict bot, const unsigned nrb, + const double * __restrict X, double * Y, unsigned lowmask, unsigned nlbits); + +void SSEsym(const double * __restrict V, const unsigned short * __restrict M, const unsigned * __restrict bot, const unsigned nrb, + const double * __restrict X, const double * __restrict XT, double * Y, double * YT, unsigned lowmask, unsigned nlbits); + +void SSEsym(const double * __restrict V, const unsigned short * __restrict M, const unsigned * __restrict bot, const unsigned nrb, + const double * __restrict X, double * Y, unsigned lowmask, unsigned nlbits); + +void SSEsym(const double * __restrict V, const uint64_t * __restrict M, const unsigned * __restrict bot, const unsigned nrb, + const double * __restrict X, const double * __restrict XT, double * Y, double * YT, unsigned lowmask, unsigned nlbits); + +void SSEsym(const double * __restrict V, const uint64_t * __restrict M, const unsigned * __restrict bot, const unsigned nrb, + const double * __restrict X, double * Y, unsigned lowmask, unsigned nlbits); + +/* Symmetric CSB implementation +** Only upper triangle is stored +** top[i][0] gives the ith diagonal block for every i +** Since this class works only for symmetric (hence square) matrices, +** each compressed sparse block is (lowbits+1)x(lowbits+1) and ncsb = nbr = nbc +*/ +template +class BmSym +{ +public: + BmSym ():nz(0), n(0), ncsb(0) {} // default constructor (dummy) + + BmSym (const BmSym & rhs); // copy constructor + ~BmSym(); + BmSym & operator=(const BmSym & rhs); // assignment operator + BmSym (Csc & csc, int workers); + + ofstream & PrintStats(ofstream & outfile) const; + ofstream & Dump(ofstream & outfile) const; + IT 
colsize() const { return n;} + IT rowsize() const { return n;} + IT numregb() const { return nrb;} + bool isPar() const { return ispar; } + +private: + typedef typename int_least_helper::least MTYPE; + + void Init(int workers, IT forcelogbeta = 0); + void SeqSpMV(const NT * __restrict x, NT * __restrict y) const; + void BMult(IT** chunks, IT start, IT end, const NT * __restrict x, NT * __restrict y, IT ysize) const; + + void BlockPar(IT start, IT end, const NT * __restrict subx, const NT * __restrict subx_mirror, + NT * __restrict suby, NT * __restrict suby_mirror, IT rangebeg, IT rangeend, IT cutoff) const; + void BlockTriPar(IT start, IT end, const NT * __restrict subx, NT * __restrict suby, IT rangebeg, IT rangeend, IT cutoff) const; + + void SortBlocks(pair > * pairarray, NT * val); + void DivideIterationSpace(IT * & lspace, IT * & rspace, IT & lsize, IT & rsize, IT size, IT d) const; + + void MultAddAtomics(NT * __restrict y, const NT * __restrict x, const IT d) const; + void MultDiag(NT * __restrict y, const NT * __restrict x, const IT d) const; + void MultMainDiag(NT * __restrict y, const NT * __restrict x) const; + + float Imbalance(IT d) const; + IT nrbsum(IT d) const; + + IT ** top ; // pointers array (indexed by higher-order bits of the coordinate index), size = nbr*(nbc+1) + IT * bot; // contains lower-order bits of the coordinate index, size nrb + IT * scansum; // prefix-sums on popcounts of masks, size nrb + MTYPE * masks; // array of masks, size nrb + NT * num; // contains numerical values, size nnz + + vector< pair > diagonal; + + bool ispar; + IT nz; // # nonzeros + IT nrb; + IT n; // #{rows} = #{columns} + IT blcrange; // range indexed by one block + + IT ncsb; // #{block rows) = #{block cols} + + IT nlowbits; // # lower order bits (for both rows and columns) + IT nhighbits; + IT highmask; // mask with the first log(n)/2 bits = 1 and the other bits = 0 + IT lowmask; + + MortCompSym mortoncmp; // comparison operator w.r.t. 
the (inverted N)-morton layout + + template + friend void bmsym_gespmv (const BmSym & A, const NU * x, NU * y); +}; + + +#include "friends.h" +#include "bmsym.cpp" +#endif + diff --git a/csb/bssb.cpp b/csb/bssb.cpp new file mode 100644 index 0000000..19b9181 --- /dev/null +++ b/csb/bssb.cpp @@ -0,0 +1,1554 @@ +/** + * @file bssb.cpp + * @author Nikos Sismanis + * @date Wed Sep 19 13:51:08 2017 + * + * @brief BSSB core functions + * + * Core function for BSSB implementation. Not all of them are visible + * through the corresponding header + * + * + */ + +#include "benchmark_csb.hpp" + +#include "orders.hpp" + +#include "bssb.hpp" + + +bool operator<(const boxHelper &lhs, const boxHelper &rhs) { + return lhs.size < rhs.size; +} + + +/* PQ Kernel */ +void computeLeaf(sparseBlock* Ps, double* F, double* Y, int dim){ + + double *vv = Ps->vv; + int32_t *ju = Ps->ju; + int32_t *ii = Ps->ii; + int32_t *li = Ps->li; + int32_t m = Ps->Nrow; + int32_t n = Ps->Ncol; + + int32_t nn = Ps->nuj; + + int32_t ss = 0; + + int32_t R = Ps->row; + int32_t C = Ps->col; + + + double* Y0i = &Y[R * dim]; + double* Y0j = &Y[C * dim]; + + double* F0i = &F[R * dim]; + + + + for (uint32_t j = 0; j < nn; j++) { + + double accum[DIM] = {0}; + double Ftemp[DIM] = {0}; + double Yj[DIM] = {0}; + double Yi[DIM] = {0}; + + const int32_t k = li[j]; /* number of nonzero elements of each column */ + + + Yj[:] = Y0j[ ju[j]*dim + 0:dim ]; + + accum[:] = 0; + + /* for each non zero element */ + for (uint32_t idx = 0; idx < k; idx++) { + + const uint32_t i = (ii[ss + idx]); + + Yi[:] = Y0i[ i * dim + 0:dim ]; + + /* distance computation */ + double dist = __sec_reduce_add( (Yj[:] - Yi[:])*(Yj[:] - Yi[:]) ); + + // FILE *f_i = fopen( "csc_i.bin", "ab" ); + // FILE *f_j = fopen( "csc_j.bin", "ab" ); + // FILE *f_v = fopen( "csc_v.bin", "ab" ); + + // int i_bin = i+R; + // int j_bin = ju[j]+C; + // double v_bin = vv[ss+idx]; + + // fwrite( &i_bin, sizeof(i_bin), 1, f_i ); + // fwrite( &j_bin, sizeof(j_bin), 1, 
f_j ); + // fwrite( &v_bin, sizeof(v_bin), 1, f_v ); + + // fclose( f_i ); fclose( f_j ); fclose( f_v ); + + /* P_{ij} \times Q_{ij} */ + double p_times_q = vv[ss+idx] / (1+dist); + + Ftemp[:] = p_times_q * ( Yj[:] - Yi[:] ); + + /* F_{attr}(i,j) */ + F0i[i*dim + 0:dim] -= Ftemp[:]; + } + + ss += k; + + } + + +} + + +/* SpMV Multiplication Kernel */ +void blockMatMult(sparseBlock* Ps, double* F, double* Y){ + + double *vv = Ps->vv; + int32_t *ju = Ps->ju; + int32_t *ii = Ps->ii; + int32_t *li = Ps->li; + int32_t m = Ps->Nrow; + int32_t n = Ps->Ncol; + + int32_t nn = Ps->nuj; + + int32_t ss = 0; + + int32_t R = Ps->row; + int32_t C = Ps->col; + + + double* Y0i = &Y[R]; + double* Y0j = &Y[C]; + + double* F0i = &F[R]; + + + for (uint32_t j = 0; j < nn; j++) { + + double accum = 0; + double Ftemp = 0; + double Yj = 0; + double Yi = 0; + + Yj = Y0j[ ju[j] ]; + + + const int32_t k = li[j]; + for (uint32_t idx = 0; idx < k; idx++) { + + const uint32_t i = (ii[ss + idx]); + + double pval = vv[ ss + idx]; + + Ftemp = pval * Yj; + + F0i[i] += Ftemp; + } + ss += k; + } + + +} + +/** + * Compute B = A for CSR matrix A, CSC matrix B + * + * Also, with the appropriate arguments can also be used to: + * - compute B = A^t for CSR matrix A, CSR matrix B + * - compute B = A^t for CSC matrix A, CSC matrix B + * - convert CSC->CSR + * + * Input Arguments: + * I n_row - number of rows in A + * I n_col - number of columns in A + * I Ap[n_row+1] - row pointer + * I Aj[nnz(A)] - column indices + * T Ax[nnz(A)] - nonzeros + * + * Output Arguments: + * I Bp[n_col+1] - column pointer + * I Bj[nnz(A)] - row indices + * T Bx[nnz(A)] - nonzeros + * + * Note: + * Output arrays Bp, Bj, Bx must be preallocated + * + * Note: + * Input: column indices *are not* assumed to be in sorted order + * Output: row indices *will be* in sorted order + * + * Complexity: Linear. 
Specifically O(nnz(A) + max(n_row,n_col)) + * + */ +template +void csr_tocsc(const I n_row, + const I n_col, + const I Ap[], + const I Aj[], + const T Ax[], + I Bp[], + I Bi[], + T Bx[]) +{ + const I nnz = Ap[n_row]; + + //compute number of non-zero entries per column of A + std::fill(Bp, Bp + n_col, 0); + + for (I n = 0; n < nnz; n++){ + Bp[Aj[n]]++; + } + + //cumsum the nnz per column to get Bp[] + for(I col = 0, cumsum = 0; col < n_col; col++){ + I temp = Bp[col]; + Bp[col] = cumsum; + cumsum += temp; + } + Bp[n_col] = nnz; + + for(I row = 0; row < n_row; row++){ + for(I jj = Ap[row]; jj < Ap[row+1]; jj++){ + I col = Aj[jj]; + I dest = Bp[col]; + + Bi[dest] = row; + Bx[dest] = Ax[jj]; + + Bp[col]++; + } + } + + for(I col = 0, last = 0; col <= n_col; col++){ + I temp = Bp[col]; + Bp[col] = last; + last = temp; + } +} + +void csc2csr_top(top_lvl_csc *BSSB_CSC, int nCol, int nRow){ + + int *csc_jc = BSSB_CSC->jc; + int *csc_ir = BSSB_CSC->ir; + sparseBlock *csc_sb = BSSB_CSC->Pb; + + int nnz = csc_jc[nCol]; + + int *csr_jr = (int *) malloc((nCol+1)*sizeof(int)); + int *csr_ic = (int *) malloc(nnz*sizeof(int)); + sparseBlock *csr_sb = (sparseBlock *) malloc(nnz*sizeof(sparseBlock)); + + csr_tocsc(nRow, + nCol, + csc_jc, + csc_ir, + csc_sb, + csr_jr, + csr_ic, + csr_sb); + + BSSB_CSC->jc = csr_jr; + BSSB_CSC->ir = csr_ic; + BSSB_CSC->Pb = csr_sb; + + free( csc_jc ); + free( csc_ir ); + free( csc_sb ); + +} + +void updateLeafOrder(node_t *box, + int32_t *leafMap, + int32_t *leafStart, + int32_t *leafSize){ + + if(box->numChild == 0){ + box->leafId = leafMap[box->strParticle]; + //leafStart[box->leafId] = box->strParticle; + //leafSize[box->leafId] = box->pop; + } else{ + + node_t *cs = box->first_child; + for(int i=0; inumChild; i++){ + updateLeafOrder(cs, leafMap, leafStart, leafSize); + cs = cs->next_sibling; + } + + } +} + +void updateLeafOrder(int32_t *leafMap, + int32_t *leafStart, + int32_t *leafSize, + int N){ + + int offset = 0; + int leafCount = 0; + for(int 
i=0; inumChild == 0){ + + leafMap[box->strParticle] = box->pop; + + } else { + + + node_t *cs = box->first_child; + for(int i=0; i<box->numChild; i++){ + createLeafMap(leafMap, cs); + cs = cs->next_sibling; + } + + } + + +} + + +void fillMap(int32_t *col2leaves, int N, int f){ + + for(int i=0; iir = (int32_t *)malloc(nnz * sizeof(int32_t)); + Ps->Pb = (sparseBlock *)malloc(nnz * sizeof(sparseBlock)); + + /* initialize with zeros */ + for(int i=0; i<nnz; i++){Ps->Pb[i].nnz = 0;} + + /* Traverse every column block */ + for(int bi=0; bijc[bi]; // Start of the compressed block column in csc packing + + for(int j=cStr; j 0){ // if the block is non-empty + Ps->Pb[blkstr+cursor].nnz = rowCount[i]; + Ps->ir[blkstr+cursor] = i; + cursor++; + } + + } + + free(rowCount); + } + +} + + +/* Finish the construction of every block and + turn it to CSC2 packing*/ +void finishBlock(sparseBlock *Ps){ + + /* Find unique columns */ + int32_t nCols = Ps->Ncol; + int32_t nRows = Ps->Nrow; + + int32_t *lj = (int32_t *)calloc(nCols, sizeof(int32_t)); + int32_t *li = (int32_t *)calloc(nRows, sizeof(int32_t)); + + /* Count the nonzero elements of each column and each row */ + for(int i=0; i<Ps->nnz; i++){ + lj[Ps->jj[i]]++; + li[Ps->ii[i]]++; + } + + + /* Count the unique */ + int nuj = 0; + for(int i=0; i<nCols; i++){ + if(lj[i] > 0){ + nuj++; + } + } + + + int nui = 0; + for(int i=0; i<nRows; i++){ + if(li[i] > 0){ + nui++; + } + } + + /* Allocate space for the unique */ + Ps->nuj = nuj; Ps->nui = nui; + Ps->ju = (int32_t *)malloc(nuj * sizeof(int32_t)); // unique columns + Ps->iu = (int32_t *)malloc(nui * sizeof(int32_t)); // unique rows + Ps->li = (int32_t *)malloc(nuj * sizeof(int32_t)); // Histogram of columns + + /* Move unique elements */ + nuj = 0; nui = 0; + /* gather unique columns */ + for(int j=0; j<nCols; j++){ + if(lj[j] > 0){ + Ps->ju[nuj] = j; + Ps->li[nuj] = lj[j]; + nuj++; + } + } + + /* gather unique rows */ + for(int i=0; i<nRows; i++){ + if(li[i] > 0){ + Ps->iu[nui] = i; + nui++; + } + } + + free(lj); + free(li); + +} + +/* Sparse top level */ +void formSparseBlockSparseTop(top_lvl_csc *Ps, + int32_t *rowMap, +
int32_t *colMap, + int32_t *rowStart, + int32_t *colStart, + int32_t *rowSize, + int32_t *colSize, + int *ir, + int *jc, + double *vv, + int Ncols, + int nRowBlocks, + int nColBlocks, + int nnz){ + + + /* Block memory allocation */ + for(int j=0; jjc[j+1] - Ps->jc[j]; + + for (unsigned int idx = 0; idx < k; idx++) { + + int blk = Ps->jc[j] + idx; + const unsigned int i = Ps->ir[blk]; + + + int bnnz = Ps->Pb[blk].nnz; + Ps->Pb[blk].ii = (int32_t *)malloc(bnnz * sizeof(int32_t)); + Ps->Pb[blk].vv = (double *)malloc(bnnz * sizeof(double)); + Ps->Pb[blk].jj = (int32_t *)malloc(bnnz * sizeof(int32_t)); + + Ps->Pb[blk].row = rowStart[i]; + Ps->Pb[blk].col = colStart[j]; + + Ps->Pb[blk].Nrow = rowSize[i]; + Ps->Pb[blk].Ncol = colSize[j]; + } + + } + + + /* scan and assign the points */ + for(int bi=0; bijc[bi]; + + /* fisrt scan to count */ + for(int j=cStr; j 0){ + /* + rowBlocks[cursor].ii = (int32_t *)malloc(rowCount[i]*sizeof(int32_t)); + rowBlocks[cursor].jj = (int32_t *)malloc(rowCount[i]*sizeof(int32_t)); + rowBlocks[cursor].vv = (double *)malloc(rowCount[i]*sizeof(double)); + rowBlocks[cursor].nnz = rowCount[i]; + */ + rowBlocks[i].ii = (int32_t *)malloc(rowCount[i]*sizeof(int32_t)); + rowBlocks[i].jj = (int32_t *)malloc(rowCount[i]*sizeof(int32_t)); + rowBlocks[i].vv = (double *)malloc(rowCount[i]*sizeof(double)); + rowBlocks[i].nnz = rowCount[i]; + + } else{ + rowBlocks[i].nnz = 0; + } + } + + /* Pass again to store */ + for(int j=cStr; j 0){ + //Ps->Pb[blkstr+cursor].ii = rowBlocks[i].ii; + //Ps->Pb[blkstr+cursor].jj = rowBlocks[i].jj; + //Ps->Pb[blkstr+cursor].vv = rowBlocks[i].vv; + + memcpy(Ps->Pb[blkstr+cursor].ii, rowBlocks[i].ii, rowCount[i] * sizeof(int32_t)); + memcpy(Ps->Pb[blkstr+cursor].jj, rowBlocks[i].jj, rowCount[i] * sizeof(int32_t)); + memcpy(Ps->Pb[blkstr+cursor].vv, rowBlocks[i].vv, rowCount[i] * sizeof(double)); + cursor++; + } + + } + + + for(int i=0; i0){ + free( rowBlocks[i].ii ); + free( rowBlocks[i].jj ); + free( rowBlocks[i].vv ); + } + 
+ } + + + free(rowBlocks); + free(rowCount); + free(ptr); + + } + + + /* finish the block construction */ + for(int j=0; jjc[j+1] - Ps->jc[j]; + for (unsigned int idx = 0; idx < k; idx++) { + + int blk = Ps->jc[j] + idx; + + if( Ps->Pb[blk].nnz > 0 ){ + finishBlock(&Ps->Pb[blk]); + } + + } + } + +} + + +/* Function that scans the input CSC matrix and computes + the number of nnz blocks of the top level structure */ +int32_t count_nnz_from_top(top_lvl_csc * const Ps, + int32_t const * const rowMap, + int32_t const * const colMap, + int32_t const * const colStart, + int32_t const * const colSizes, + int const * const ir, + int const * const jc, + int const nRowBlocks, + int const nColBlocks){ + + Ps->jc = (int32_t*)calloc((nColBlocks+1), sizeof(int32_t)); + + int blkcount = 0; + for(int bi=0; bijc[bi]; + + for(int j=cStr; j 0){ + Ps->jc[bi+1]++; + blkcount++; + } + } + + } + + /* scan prefix */ + int offset = 0; + for(int i=0; ijc[i]; + Ps->jc[i] += offset; + offset += size; + } + + + return blkcount; +} + + +/* Function used for debugging - computes the nnz of the top level matrix + from the dense top level */ +int32_t count_nnz_toplvl(top_lvl_csc *Ps, int32_t *block_nnz, int nLeaves){ + + Ps->jc = (int32_t*)calloc((nLeaves+1), sizeof(int32_t)); + int totalcount = 0; + + for(int j=0; j 0){ + totalcount++; + Ps->jc[j+1]++; + } + } + } + + /* scan prefix */ + int offset = 0; + for(int i=0; ijc[i]; + Ps->jc[i] += offset; + offset += size; + } + + + for(int i=0; i<8; i++){ + printf("scan: %d\n", Ps->jc[i]); + } + + + return totalcount; + +} + +/* Function used for debugging - Matches the dense and sparse top levels */ +int32_t toplvl_checksum(top_lvl_csc *Ps, int32_t *block_nnz, int nLeaves, int N){ + + int count = 0; + int pass = 1; + for(int j=0; jjc[j+1] - Ps->jc[j]; + + for (unsigned int idx = 0; idx < k; idx++) { + + const unsigned int i = (Ps->ir[Ps->jc[j] + idx]); + count += Ps->Pb[Ps->jc[j] + idx].nnz; + + pass &= (Ps->Pb[Ps->jc[j] + idx].nnz == block_nnz[j *
nLeaves + i]); + + } + } + + printf("MATCH: %s\n", (pass) ? "PASS" : "FAIL" ); + return count; +} + + +void mergeLeafBoxes(int32_t *leafMap, int N, int pThresLow){ + + int offset = 0; + + int prev_box = 0; + int current_box = N; + int next_box = 0; + + + int leafCount_old = 0; + int leafCount_new = 0; + for(int i=0; ip ); + free( B->col2leaves ); + free( B->boxStart ); + free( B->boxSizes ); + free( B ); +} + +leafMergeBuffs *mergeBoxRec(node_t *box, + int32_t *p, + int32_t *col2leaves, + int32_t *boxStart, + int32_t *boxSizes, + int numBoxes, + int N, + int pThresLow, + int pThresHigh){ + + leafMergeBuffs *leafBuffs = (leafMergeBuffs *)malloc(sizeof(leafMergeBuffs)); + + + if( box->numChild==0){ + + /* pick up the part of the matrix + that belongs to the leaf */ + leafBuffs->nLeaves = 1; + leafBuffs->numBoxes = 1; + leafBuffs->nPoints = box->pop; + leafBuffs->leavesStart = box->strParticle; + /* Get the part of the permutation vector */ + leafBuffs->p = (int32_t *)malloc(box->pop * sizeof(int32_t)); + memcpy(leafBuffs->p, &p[box->strParticle], box->pop * sizeof(int32_t)); + /* Get the map of the columns */ + leafBuffs->col2leaves = (int32_t *)malloc(box->pop * sizeof(int32_t)); + memcpy(leafBuffs->col2leaves, &col2leaves[box->strParticle], box->pop * sizeof(int32_t)); + /* Get the start of the box */ + leafBuffs->boxStart = (int32_t *)malloc(sizeof(int32_t)); + /* Get the size of the box */ + leafBuffs->boxSizes = (int32_t *)malloc(sizeof(int32_t)); + leafBuffs->boxSizes[0] = boxSizes[box->leafId]; + + + } else { + + + leafMergeBuffs **childBuffs = (leafMergeBuffs **)malloc(box->numChild * sizeof(leafMergeBuffs*)); + + node_t *current = box->first_child; + for(int i=0; i<box->numChild; i++){ + childBuffs[i] = mergeBoxRec(current, + p, + col2leaves, + boxStart, + boxSizes, + numBoxes, + N, + pThresLow, + pThresHigh); + current = current->next_sibling; + + } + + /* Set up the merge buffers */ + leafBuffs->nLeaves = 0; + leafBuffs->nPoints = 0; + leafBuffs->numBoxes =
0; + for(int i=0; i<box->numChild; i++){ + leafBuffs->nLeaves += childBuffs[i]->nLeaves; + leafBuffs->nPoints += childBuffs[i]->nPoints; + leafBuffs->numBoxes += childBuffs[i]->numBoxes; + } + + /* Collect the parts of the permutation vector */ + leafBuffs->p = (int32_t *)malloc(leafBuffs->nPoints*sizeof(int32_t)); + + /* Collect the parts of the column map */ + leafBuffs->col2leaves = (int32_t *)malloc(leafBuffs->nPoints * sizeof(int32_t)); + + + /* Collect the box pointers */ + leafBuffs->boxStart = (int32_t *)malloc(leafBuffs->nLeaves * sizeof(int32_t)); + + + /* Collect the box sizes */ + leafBuffs->boxSizes = (int32_t *)malloc(leafBuffs->nLeaves *sizeof(int32_t)); + + int offsetPoints = 0; + int offsetLeaves = 0; + for(int i=0; i<box->numChild; i++){ + memcpy(&leafBuffs->p[offsetPoints], childBuffs[i]->p, childBuffs[i]->nPoints*sizeof(int32_t)); + + + memcpy(&leafBuffs->col2leaves[offsetPoints], childBuffs[i]->col2leaves, + childBuffs[i]->nPoints*sizeof(int32_t)); + /* + memcpy(&leafBuffs->boxStart[offsetLeaves], childBuffs[i]->boxStart, + childBuffs[i]->nLeaves*sizeof(int32_t)); + */ + memcpy(&leafBuffs->boxSizes[offsetLeaves], childBuffs[i]->boxSizes, + childBuffs[i]->nLeaves*sizeof(int32_t)); + + offsetPoints += childBuffs[i]->nPoints; + offsetLeaves += childBuffs[i]->nLeaves; + } + + + /* Free the buffers of the children */ + for(int i=0; i<box->numChild; i++){ + freeMergeBuffs( childBuffs[i] ); + } + free(childBuffs); + + /* Prefix scan to find the start of each box */ + int offset = 0; + for(int i=0; i<leafBuffs->nLeaves; i++){ + leafBuffs->boxStart[i] = offset; + offset += leafBuffs->boxSizes[i]; + } + + if( leafBuffs->numBoxes < 16 ){ + int *pmerge = (int *)malloc( leafBuffs->nPoints * sizeof(int32_t) ); + leafBuffs->nLeaves = mergeBoxBase(pmerge, + leafBuffs->col2leaves, + leafBuffs->boxStart, + leafBuffs->boxSizes, + leafBuffs->nLeaves, + leafBuffs->nPoints, + pThresLow, + pThresHigh); + + // ------- Fix the permutation vector + int *pbuff = (int*)malloc(leafBuffs->nPoints * sizeof(int)); +
for(int i=0; inPoints; i++){ + pbuff[i] = leafBuffs->p[pmerge[i]]; + } + memcpy(leafBuffs->p, pbuff, leafBuffs->nPoints * sizeof(int)); + + free(pbuff); + free(pmerge); + } + + + + } + + return leafBuffs; + +} + +int mergeBoxBase(int32_t *p, + int32_t *col2leaves, + int32_t *boxStart, + int32_t *boxSizes, + int numBoxes, + int N, int pThresLow, int pThresHigh){ + + boxHelper *helper = (boxHelper *)malloc(numBoxes * sizeof(boxHelper)); + boxHelper *mergedBoxes = (boxHelper *)malloc(numBoxes * sizeof(boxHelper)); + + + /* Check which boxes must be mearged */ + for(int i=0; i 0){ + printf("PROBLEM: %d, %d\n", i, mergedLeafMap[i]); + } + } + printf("check: %d, N: %d\n", chsum, N); + printf("Merging: %s\n", (chsum==N) ? "PASS" : "FAIL"); +#endif + + free(leafMap); + leafMap = mergedLeafMap; + +#endif + + // ----- get a vector that says its element which block belongs to + int32_t nLeaves = mapClo2Leaves(col2leaves, leafMap, N); + + // ----- allocate 2 vectors of size nLeaves + int32_t *leafStart = (int32_t *)calloc( nLeaves, sizeof(int32_t) ); + int32_t *leafSize = (int32_t *)calloc( nLeaves, sizeof(int32_t) ); + + + // ------ Index of tree leaves + updateLeafOrder(tree, + col2leaves, + leafStart, + leafSize); + + + /* Linearize the leaves of the tree */ + updateLeafOrder(leafMap, + leafStart, + leafSize, + N); + + +#ifdef MERGE_LEAVES + int *p = (int *)malloc(N * sizeof(int)); + for(int i=0; inLeaves, nLeaves); + + + /* Copy the new order and free helping buffers */ + nLeaves = mergeBuffs->nLeaves; + memcpy(pmerge, mergeBuffs->p, N*sizeof(int32_t)); + memcpy(leafStart, mergeBuffs->boxStart, nLeaves*sizeof(int32_t)); + memcpy(leafSize, mergeBuffs->boxSizes, nLeaves*sizeof(int32_t)); + /* Update the column map */ + updateColMap(col2leaves, + leafStart, + leafSize, + N, nLeaves); + + /* Clean the memory buffers used for the leaf merging */ + freeMergeBuffs( mergeBuffs ); + free( p ); + + int *pm_inv = cs_pinv( pmerge, N ); + cs *Cp2 = cs_permute( Cp, pm_inv, pmerge, 1 ); 
+ Cp = Cp2; + +#endif + + + + /* Verify sum */ +#ifdef VERIFY + int lsum = 0; + for(int i=0; ii, + Cp->p, + nLeaves, + nLeaves); + + + /* Form the data structure */ + + printf("#### Number of source leaves: %d\n", nLeaves); + printf("#### Number of target leaves: %d\n", nLeaves); + + + + //printf("NNZ at the top: %d\n", nnz_top); + + /* Count the nnz per non-empty block of the top level */ + count_block_nnz_sptop(Ptop, + col2leaves, + col2leaves, + leafStart, + leafSize, + Cp->i, + Cp->p, + N, + nLeaves, + nnz_top); + + printf("Form the new matrix"); + /* Form the non-empty blocks of the top level (store and pack points) */ + formSparseBlockSparseTop(Ptop, + col2leaves, + col2leaves, + leafStart, + leafStart, + leafSize, + leafSize, + Cp->i, + Cp->p, + Cp->x, + N, + nLeaves, + nLeaves, + nnz); + + + csc2csr_top(Ptop, nLeaves, nLeaves); + + pm[0] = nLeaves; + pn[0] = nLeaves; + + free( leafMap ); + free( leafStart ); + free( leafSize ); + free( col2leaves ); + // freeBlockSparseTopSparse(Ptop, nLeaves); + +#ifdef MERGE_LEAVES + cs_spfree(Cp); +#endif + +} + + +/* Free the 2nd level blocks */ +void freeSparseBlock(sparseBlock *Ps){ + + free( Ps->ju ); + free( Ps->li ); + free( Ps->ii ); + free( Ps->jj ); + free( Ps->vv ); + free( Ps->iu ); + +} + + + +/* Free top level sparse structure */ +void freeBlockSparseTopSparse(top_lvl_csc *Ps, int nColBlocks, bool rec){ + + for(int j=0; jjc[j+1] - Ps->jc[j]; + + for (unsigned int idx = 0; idx < k; idx++) { + + int blk = Ps->jc[j] + idx; + + sparseBlock *Bl = &Ps->Pb[blk]; + + if (rec) + freeSparseBlock(Bl); + + } + } + + free( Ps->ir ); + free( Ps->jc ); + free( Ps->Pb ); + +} + +/* traverse the bssb structure in parallel + csc packing of the top level */ +void traverse_csc_top(double *F, + double *Y, + top_lvl_csc *Ps, + int nRowBlocks, + int nColBlocks, + int dim){ + + cilk_for(int j=0; jjc[j]; + + const int k = Ps->jc[j+1] - offCol; + + + for (unsigned int idx = 0; idx < k; idx++) { + + int blk = offCol + idx; + + 
sparseBlock *Bl = &Ps->Pb[blk]; + + computeLeaf(Bl, F, Y, dim); + + } + + } + + +} + + +/* Top level caller function */ +int CSC2BSSB(top_lvl_csc **Pt, int *p, cs *C, + double *X, int N, int d, int nnz, + int maxLevel, int pThres, int pThresMin, + int nworkers){ + + // ----- GENERATE BSSB STRUCTURE + + // prepare permutation vector and permuted data points + double *Xt = (double *)malloc(N*d*sizeof(double)); + uint32_t *pc = (uint32_t *)malloc(N*sizeof(uint32_t)); + + + + node_t *tree = NULL; + treeOrder(&tree, Xt, pc, X, N, d, maxLevel, pThres, nworkers); + + + // permute rows and columns of P to tree order + for(int i=0; inLeaves, nLeaves); + + /* Update the permutation vector */ + int *pbuff = (int *)malloc(N * sizeof(int)); + for(int i=0; ip[i]]; + } + memcpy(pointId, pbuff, N * sizeof(int)); + + free( pbuff ); + free( p ); + + /* Clean memory */ + freeMergeBuffs( mergeBuffs ); + free( leafStart ); + free( leafSize ); + free( leafMap ); + free( col2leaves ); +} diff --git a/csb/cs.hpp b/csb/cs.hpp new file mode 100644 index 0000000..128c561 --- /dev/null +++ b/csb/cs.hpp @@ -0,0 +1,737 @@ +#ifndef _CXS_H +#define _CXS_H +#include +#include +#include +#include +#ifdef MATLAB_MEX_FILE +#include "mex.h" +#endif + +#ifdef __cplusplus +#ifndef NCOMPLEX +#include +typedef std::complex cs_complex_t ; +#endif +extern "C" { +#else +#ifndef NCOMPLEX +#include +#define cs_complex_t double _Complex +#endif +#endif + +#define CS_VER 3 /* CXSparse Version */ +#define CS_SUBVER 1 +#define CS_SUBSUB 4 +#define CS_DATE "Oct 10, 2014" /* CXSparse release date */ +#define CS_COPYRIGHT "Copyright (c) Timothy A. 
Davis, 2006-2014" +#define CXSPARSE + +#include "SuiteSparse_config.h" +#define cs_long_t SuiteSparse_long +#define cs_long_t_id SuiteSparse_long_id +#define cs_long_t_max SuiteSparse_long_max + +/* -------------------------------------------------------------------------- */ +/* double/int version of CXSparse */ +/* -------------------------------------------------------------------------- */ + +/* --- primary CSparse routines and data structures ------------------------- */ + +typedef struct cs_di_sparse /* matrix in compressed-column or triplet form */ +{ + int nzmax ; /* maximum number of entries */ + int m ; /* number of rows */ + int n ; /* number of columns */ + int *p ; /* column pointers (size n+1) or col indices (size nzmax) */ + int *i ; /* row indices, size nzmax */ + double *x ; /* numerical values, size nzmax */ + int nz ; /* # of entries in triplet matrix, -1 for compressed-col */ +} cs_di ; + +cs_di *cs_di_add (const cs_di *A, const cs_di *B, double alpha, double beta) ; +int cs_di_cholsol (int order, const cs_di *A, double *b) ; +int cs_di_dupl (cs_di *A) ; +int cs_di_entry (cs_di *T, int i, int j, double x) ; +int cs_di_lusol (int order, const cs_di *A, double *b, double tol) ; +int cs_di_gaxpy (const cs_di *A, const double *x, double *y) ; +cs_di *cs_di_multiply (const cs_di *A, const cs_di *B) ; +int cs_di_qrsol (int order, const cs_di *A, double *b) ; +cs_di *cs_di_transpose (const cs_di *A, int values) ; +cs_di *cs_di_compress (const cs_di *T) ; +double cs_di_norm (const cs_di *A) ; +int cs_di_print (const cs_di *A, int brief) ; +cs_di *cs_di_load (FILE *f) ; + +/* utilities */ +void *cs_di_calloc (int n, size_t size) ; +void *cs_di_free (void *p) ; +void *cs_di_realloc (void *p, int n, size_t size, int *ok) ; +cs_di *cs_di_spalloc (int m, int n, int nzmax, int values, int t) ; +cs_di *cs_di_spfree (cs_di *A) ; +int cs_di_sprealloc (cs_di *A, int nzmax) ; +void *cs_di_malloc (int n, size_t size) ; + +/* --- secondary CSparse routines and data 
structures ----------------------- */ + +typedef struct cs_di_symbolic /* symbolic Cholesky, LU, or QR analysis */ +{ + int *pinv ; /* inverse row perm. for QR, fill red. perm for Chol */ + int *q ; /* fill-reducing column permutation for LU and QR */ + int *parent ; /* elimination tree for Cholesky and QR */ + int *cp ; /* column pointers for Cholesky, row counts for QR */ + int *leftmost ; /* leftmost[i] = min(find(A(i,:))), for QR */ + int m2 ; /* # of rows for QR, after adding fictitious rows */ + double lnz ; /* # entries in L for LU or Cholesky; in V for QR */ + double unz ; /* # entries in U for LU; in R for QR */ +} cs_dis ; + +typedef struct cs_di_numeric /* numeric Cholesky, LU, or QR factorization */ +{ + cs_di *L ; /* L for LU and Cholesky, V for QR */ + cs_di *U ; /* U for LU, r for QR, not used for Cholesky */ + int *pinv ; /* partial pivoting for LU */ + double *B ; /* beta [0..n-1] for QR */ +} cs_din ; + +typedef struct cs_di_dmperm_results /* cs_di_dmperm or cs_di_scc output */ +{ + int *p ; /* size m, row permutation */ + int *q ; /* size n, column permutation */ + int *r ; /* size nb+1, block k is rows r[k] to r[k+1]-1 in A(p,q) */ + int *s ; /* size nb+1, block k is cols s[k] to s[k+1]-1 in A(p,q) */ + int nb ; /* # of blocks in fine dmperm decomposition */ + int rr [5] ; /* coarse row decomposition */ + int cc [5] ; /* coarse column decomposition */ +} cs_did ; + +int *cs_di_amd (int order, const cs_di *A) ; +cs_din *cs_di_chol (const cs_di *A, const cs_dis *S) ; +cs_did *cs_di_dmperm (const cs_di *A, int seed) ; +int cs_di_droptol (cs_di *A, double tol) ; +int cs_di_dropzeros (cs_di *A) ; +int cs_di_happly (const cs_di *V, int i, double beta, double *x) ; +int cs_di_ipvec (const int *p, const double *b, double *x, int n) ; +int cs_di_lsolve (const cs_di *L, double *x) ; +int cs_di_ltsolve (const cs_di *L, double *x) ; +cs_din *cs_di_lu (const cs_di *A, const cs_dis *S, double tol) ; +cs_di *cs_di_permute (const cs_di *A, const int *pinv, 
const int *q, + int values) ; +int *cs_di_pinv (const int *p, int n) ; +int cs_di_pvec (const int *p, const double *b, double *x, int n) ; +cs_din *cs_di_qr (const cs_di *A, const cs_dis *S) ; +cs_dis *cs_di_schol (int order, const cs_di *A) ; +cs_dis *cs_di_sqr (int order, const cs_di *A, int qr) ; +cs_di *cs_di_symperm (const cs_di *A, const int *pinv, int values) ; +int cs_di_usolve (const cs_di *U, double *x) ; +int cs_di_utsolve (const cs_di *U, double *x) ; +int cs_di_updown (cs_di *L, int sigma, const cs_di *C, const int *parent) ; + +/* utilities */ +cs_dis *cs_di_sfree (cs_dis *S) ; +cs_din *cs_di_nfree (cs_din *N) ; +cs_did *cs_di_dfree (cs_did *D) ; + +/* --- tertiary CSparse routines -------------------------------------------- */ + +int *cs_di_counts (const cs_di *A, const int *parent, const int *post, + int ata) ; +double cs_di_cumsum (int *p, int *c, int n) ; +int cs_di_dfs (int j, cs_di *G, int top, int *xi, int *pstack, + const int *pinv) ; +int *cs_di_etree (const cs_di *A, int ata) ; +int cs_di_fkeep (cs_di *A, int (*fkeep) (int, int, double, void *), + void *other) ; +double cs_di_house (double *x, double *beta, int n) ; +int *cs_di_maxtrans (const cs_di *A, int seed) ; +int *cs_di_post (const int *parent, int n) ; +cs_did *cs_di_scc (cs_di *A) ; +int cs_di_scatter (const cs_di *A, int j, double beta, int *w, double *x, + int mark, cs_di *C, int nz) ; +int cs_di_tdfs (int j, int k, int *head, const int *next, int *post, + int *stack) ; +int cs_di_leaf (int i, int j, const int *first, int *maxfirst, int *prevleaf, + int *ancestor, int *jleaf) ; +int cs_di_reach (cs_di *G, const cs_di *B, int k, int *xi, const int *pinv) ; +int cs_di_spsolve (cs_di *L, const cs_di *B, int k, int *xi, double *x, + const int *pinv, int lo) ; +int cs_di_ereach (const cs_di *A, int k, const int *parent, int *s, int *w) ; +int *cs_di_randperm (int n, int seed) ; + +/* utilities */ +cs_did *cs_di_dalloc (int m, int n) ; +cs_di *cs_di_done (cs_di *C, void *w, void *x, 
int ok) ; +int *cs_di_idone (int *p, cs_di *C, void *w, int ok) ; +cs_din *cs_di_ndone (cs_din *N, cs_di *C, void *w, void *x, int ok) ; +cs_did *cs_di_ddone (cs_did *D, cs_di *C, void *w, int ok) ; + + +/* -------------------------------------------------------------------------- */ +/* double/cs_long_t version of CXSparse */ +/* -------------------------------------------------------------------------- */ + +/* --- primary CSparse routines and data structures ------------------------- */ + +typedef struct cs_dl_sparse /* matrix in compressed-column or triplet form */ +{ + cs_long_t nzmax ; /* maximum number of entries */ + cs_long_t m ; /* number of rows */ + cs_long_t n ; /* number of columns */ + cs_long_t *p ; /* column pointers (size n+1) or col indlces (size nzmax) */ + cs_long_t *i ; /* row indices, size nzmax */ + double *x ; /* numerical values, size nzmax */ + cs_long_t nz ; /* # of entries in triplet matrix, -1 for compressed-col */ +} cs_dl ; + +cs_dl *cs_dl_add (const cs_dl *A, const cs_dl *B, double alpha, double beta) ; +cs_long_t cs_dl_cholsol (cs_long_t order, const cs_dl *A, double *b) ; +cs_long_t cs_dl_dupl (cs_dl *A) ; +cs_long_t cs_dl_entry (cs_dl *T, cs_long_t i, cs_long_t j, double x) ; +cs_long_t cs_dl_lusol (cs_long_t order, const cs_dl *A, double *b, double tol) ; +cs_long_t cs_dl_gaxpy (const cs_dl *A, const double *x, double *y) ; +cs_dl *cs_dl_multiply (const cs_dl *A, const cs_dl *B) ; +cs_long_t cs_dl_qrsol (cs_long_t order, const cs_dl *A, double *b) ; +cs_dl *cs_dl_transpose (const cs_dl *A, cs_long_t values) ; +cs_dl *cs_dl_compress (const cs_dl *T) ; +double cs_dl_norm (const cs_dl *A) ; +cs_long_t cs_dl_print (const cs_dl *A, cs_long_t brief) ; +cs_dl *cs_dl_load (FILE *f) ; + +/* utilities */ +void *cs_dl_calloc (cs_long_t n, size_t size) ; +void *cs_dl_free (void *p) ; +void *cs_dl_realloc (void *p, cs_long_t n, size_t size, cs_long_t *ok) ; +cs_dl *cs_dl_spalloc (cs_long_t m, cs_long_t n, cs_long_t nzmax, cs_long_t values, + 
cs_long_t t) ; +cs_dl *cs_dl_spfree (cs_dl *A) ; +cs_long_t cs_dl_sprealloc (cs_dl *A, cs_long_t nzmax) ; +void *cs_dl_malloc (cs_long_t n, size_t size) ; + +/* --- secondary CSparse routines and data structures ----------------------- */ + +typedef struct cs_dl_symbolic /* symbolic Cholesky, LU, or QR analysis */ +{ + cs_long_t *pinv ; /* inverse row perm. for QR, fill red. perm for Chol */ + cs_long_t *q ; /* fill-reducing column permutation for LU and QR */ + cs_long_t *parent ; /* elimination tree for Cholesky and QR */ + cs_long_t *cp ; /* column pointers for Cholesky, row counts for QR */ + cs_long_t *leftmost ; /* leftmost[i] = min(find(A(i,:))), for QR */ + cs_long_t m2 ; /* # of rows for QR, after adding fictitious rows */ + double lnz ; /* # entries in L for LU or Cholesky; in V for QR */ + double unz ; /* # entries in U for LU; in R for QR */ +} cs_dls ; + +typedef struct cs_dl_numeric /* numeric Cholesky, LU, or QR factorization */ +{ + cs_dl *L ; /* L for LU and Cholesky, V for QR */ + cs_dl *U ; /* U for LU, r for QR, not used for Cholesky */ + cs_long_t *pinv ; /* partial pivoting for LU */ + double *B ; /* beta [0..n-1] for QR */ +} cs_dln ; + +typedef struct cs_dl_dmperm_results /* cs_dl_dmperm or cs_dl_scc output */ +{ + cs_long_t *p ; /* size m, row permutation */ + cs_long_t *q ; /* size n, column permutation */ + cs_long_t *r ; /* size nb+1, block k is rows r[k] to r[k+1]-1 in A(p,q) */ + cs_long_t *s ; /* size nb+1, block k is cols s[k] to s[k+1]-1 in A(p,q) */ + cs_long_t nb ; /* # of blocks in fine dmperm decomposition */ + cs_long_t rr [5] ; /* coarse row decomposition */ + cs_long_t cc [5] ; /* coarse column decomposition */ +} cs_dld ; + +cs_long_t *cs_dl_amd (cs_long_t order, const cs_dl *A) ; +cs_dln *cs_dl_chol (const cs_dl *A, const cs_dls *S) ; +cs_dld *cs_dl_dmperm (const cs_dl *A, cs_long_t seed) ; +cs_long_t cs_dl_droptol (cs_dl *A, double tol) ; +cs_long_t cs_dl_dropzeros (cs_dl *A) ; +cs_long_t cs_dl_happly (const cs_dl *V, 
cs_long_t i, double beta, double *x) ; +cs_long_t cs_dl_ipvec (const cs_long_t *p, const double *b, double *x, cs_long_t n) ; +cs_long_t cs_dl_lsolve (const cs_dl *L, double *x) ; +cs_long_t cs_dl_ltsolve (const cs_dl *L, double *x) ; +cs_dln *cs_dl_lu (const cs_dl *A, const cs_dls *S, double tol) ; +cs_dl *cs_dl_permute (const cs_dl *A, const cs_long_t *pinv, const cs_long_t *q, + cs_long_t values) ; +cs_long_t *cs_dl_pinv (const cs_long_t *p, cs_long_t n) ; +cs_long_t cs_dl_pvec (const cs_long_t *p, const double *b, double *x, cs_long_t n) ; +cs_dln *cs_dl_qr (const cs_dl *A, const cs_dls *S) ; +cs_dls *cs_dl_schol (cs_long_t order, const cs_dl *A) ; +cs_dls *cs_dl_sqr (cs_long_t order, const cs_dl *A, cs_long_t qr) ; +cs_dl *cs_dl_symperm (const cs_dl *A, const cs_long_t *pinv, cs_long_t values) ; +cs_long_t cs_dl_usolve (const cs_dl *U, double *x) ; +cs_long_t cs_dl_utsolve (const cs_dl *U, double *x) ; +cs_long_t cs_dl_updown (cs_dl *L, cs_long_t sigma, const cs_dl *C, + const cs_long_t *parent) ; + +/* utilities */ +cs_dls *cs_dl_sfree (cs_dls *S) ; +cs_dln *cs_dl_nfree (cs_dln *N) ; +cs_dld *cs_dl_dfree (cs_dld *D) ; + +/* --- tertiary CSparse routines -------------------------------------------- */ + +cs_long_t *cs_dl_counts (const cs_dl *A, const cs_long_t *parent, + const cs_long_t *post, cs_long_t ata) ; +double cs_dl_cumsum (cs_long_t *p, cs_long_t *c, cs_long_t n) ; +cs_long_t cs_dl_dfs (cs_long_t j, cs_dl *G, cs_long_t top, cs_long_t *xi, + cs_long_t *pstack, const cs_long_t *pinv) ; +cs_long_t *cs_dl_etree (const cs_dl *A, cs_long_t ata) ; +cs_long_t cs_dl_fkeep (cs_dl *A, + cs_long_t (*fkeep) (cs_long_t, cs_long_t, double, void *), void *other) ; +double cs_dl_house (double *x, double *beta, cs_long_t n) ; +cs_long_t *cs_dl_maxtrans (const cs_dl *A, cs_long_t seed) ; +cs_long_t *cs_dl_post (const cs_long_t *parent, cs_long_t n) ; +cs_dld *cs_dl_scc (cs_dl *A) ; +cs_long_t cs_dl_scatter (const cs_dl *A, cs_long_t j, double beta, cs_long_t *w, + 
double *x, cs_long_t mark,cs_dl *C, cs_long_t nz) ; +cs_long_t cs_dl_tdfs (cs_long_t j, cs_long_t k, cs_long_t *head, const cs_long_t *next, + cs_long_t *post, cs_long_t *stack) ; +cs_long_t cs_dl_leaf (cs_long_t i, cs_long_t j, const cs_long_t *first, + cs_long_t *maxfirst, cs_long_t *prevleaf, cs_long_t *ancestor, cs_long_t *jleaf) ; +cs_long_t cs_dl_reach (cs_dl *G, const cs_dl *B, cs_long_t k, cs_long_t *xi, + const cs_long_t *pinv) ; +cs_long_t cs_dl_spsolve (cs_dl *L, const cs_dl *B, cs_long_t k, cs_long_t *xi, + double *x, const cs_long_t *pinv, cs_long_t lo) ; +cs_long_t cs_dl_ereach (const cs_dl *A, cs_long_t k, const cs_long_t *parent, + cs_long_t *s, cs_long_t *w) ; +cs_long_t *cs_dl_randperm (cs_long_t n, cs_long_t seed) ; + +/* utilities */ +cs_dld *cs_dl_dalloc (cs_long_t m, cs_long_t n) ; +cs_dl *cs_dl_done (cs_dl *C, void *w, void *x, cs_long_t ok) ; +cs_long_t *cs_dl_idone (cs_long_t *p, cs_dl *C, void *w, cs_long_t ok) ; +cs_dln *cs_dl_ndone (cs_dln *N, cs_dl *C, void *w, void *x, cs_long_t ok) ; +cs_dld *cs_dl_ddone (cs_dld *D, cs_dl *C, void *w, cs_long_t ok) ; + + +/* -------------------------------------------------------------------------- */ +/* complex/int version of CXSparse */ +/* -------------------------------------------------------------------------- */ + +#ifndef NCOMPLEX + +/* --- primary CSparse routines and data structures ------------------------- */ + +typedef struct cs_ci_sparse /* matrix in compressed-column or triplet form */ +{ + int nzmax ; /* maximum number of entries */ + int m ; /* number of rows */ + int n ; /* number of columns */ + int *p ; /* column pointers (size n+1) or col indices (size nzmax) */ + int *i ; /* row indices, size nzmax */ + cs_complex_t *x ; /* numerical values, size nzmax */ + int nz ; /* # of entries in triplet matrix, -1 for compressed-col */ +} cs_ci ; + +cs_ci *cs_ci_add (const cs_ci *A, const cs_ci *B, cs_complex_t alpha, + cs_complex_t beta) ; +int cs_ci_cholsol (int order, const cs_ci *A, 
cs_complex_t *b) ; +int cs_ci_dupl (cs_ci *A) ; +int cs_ci_entry (cs_ci *T, int i, int j, cs_complex_t x) ; +int cs_ci_lusol (int order, const cs_ci *A, cs_complex_t *b, double tol) ; +int cs_ci_gaxpy (const cs_ci *A, const cs_complex_t *x, cs_complex_t *y) ; +cs_ci *cs_ci_multiply (const cs_ci *A, const cs_ci *B) ; +int cs_ci_qrsol (int order, const cs_ci *A, cs_complex_t *b) ; +cs_ci *cs_ci_transpose (const cs_ci *A, int values) ; +cs_ci *cs_ci_compress (const cs_ci *T) ; +double cs_ci_norm (const cs_ci *A) ; +int cs_ci_print (const cs_ci *A, int brief) ; +cs_ci *cs_ci_load (FILE *f) ; + +/* utilities */ +void *cs_ci_calloc (int n, size_t size) ; +void *cs_ci_free (void *p) ; +void *cs_ci_realloc (void *p, int n, size_t size, int *ok) ; +cs_ci *cs_ci_spalloc (int m, int n, int nzmax, int values, int t) ; +cs_ci *cs_ci_spfree (cs_ci *A) ; +int cs_ci_sprealloc (cs_ci *A, int nzmax) ; +void *cs_ci_malloc (int n, size_t size) ; + +/* --- secondary CSparse routines and data structures ----------------------- */ + +typedef struct cs_ci_symbolic /* symbolic Cholesky, LU, or QR analysis */ +{ + int *pinv ; /* inverse row perm. for QR, fill red. 
perm for Chol */ + int *q ; /* fill-reducing column permutation for LU and QR */ + int *parent ; /* elimination tree for Cholesky and QR */ + int *cp ; /* column pointers for Cholesky, row counts for QR */ + int *leftmost ; /* leftmost[i] = min(find(A(i,:))), for QR */ + int m2 ; /* # of rows for QR, after adding fictitious rows */ + double lnz ; /* # entries in L for LU or Cholesky; in V for QR */ + double unz ; /* # entries in U for LU; in R for QR */ +} cs_cis ; + +typedef struct cs_ci_numeric /* numeric Cholesky, LU, or QR factorization */ +{ + cs_ci *L ; /* L for LU and Cholesky, V for QR */ + cs_ci *U ; /* U for LU, r for QR, not used for Cholesky */ + int *pinv ; /* partial pivoting for LU */ + double *B ; /* beta [0..n-1] for QR */ +} cs_cin ; + +typedef struct cs_ci_dmperm_results /* cs_ci_dmperm or cs_ci_scc output */ +{ + int *p ; /* size m, row permutation */ + int *q ; /* size n, column permutation */ + int *r ; /* size nb+1, block k is rows r[k] to r[k+1]-1 in A(p,q) */ + int *s ; /* size nb+1, block k is cols s[k] to s[k+1]-1 in A(p,q) */ + int nb ; /* # of blocks in fine dmperm decomposition */ + int rr [5] ; /* coarse row decomposition */ + int cc [5] ; /* coarse column decomposition */ +} cs_cid ; + +int *cs_ci_amd (int order, const cs_ci *A) ; +cs_cin *cs_ci_chol (const cs_ci *A, const cs_cis *S) ; +cs_cid *cs_ci_dmperm (const cs_ci *A, int seed) ; +int cs_ci_droptol (cs_ci *A, double tol) ; +int cs_ci_dropzeros (cs_ci *A) ; +int cs_ci_happly (const cs_ci *V, int i, double beta, cs_complex_t *x) ; +int cs_ci_ipvec (const int *p, const cs_complex_t *b, cs_complex_t *x, int n) ; +int cs_ci_lsolve (const cs_ci *L, cs_complex_t *x) ; +int cs_ci_ltsolve (const cs_ci *L, cs_complex_t *x) ; +cs_cin *cs_ci_lu (const cs_ci *A, const cs_cis *S, double tol) ; +cs_ci *cs_ci_permute (const cs_ci *A, const int *pinv, const int *q, + int values) ; +int *cs_ci_pinv (const int *p, int n) ; +int cs_ci_pvec (const int *p, const cs_complex_t *b, cs_complex_t *x, int 
n) ; +cs_cin *cs_ci_qr (const cs_ci *A, const cs_cis *S) ; +cs_cis *cs_ci_schol (int order, const cs_ci *A) ; +cs_cis *cs_ci_sqr (int order, const cs_ci *A, int qr) ; +cs_ci *cs_ci_symperm (const cs_ci *A, const int *pinv, int values) ; +int cs_ci_usolve (const cs_ci *U, cs_complex_t *x) ; +int cs_ci_utsolve (const cs_ci *U, cs_complex_t *x) ; +int cs_ci_updown (cs_ci *L, int sigma, const cs_ci *C, const int *parent) ; + +/* utilities */ +cs_cis *cs_ci_sfree (cs_cis *S) ; +cs_cin *cs_ci_nfree (cs_cin *N) ; +cs_cid *cs_ci_dfree (cs_cid *D) ; + +/* --- tertiary CSparse routines -------------------------------------------- */ + +int *cs_ci_counts (const cs_ci *A, const int *parent, const int *post, + int ata) ; +double cs_ci_cumsum (int *p, int *c, int n) ; +int cs_ci_dfs (int j, cs_ci *G, int top, int *xi, int *pstack, + const int *pinv) ; +int *cs_ci_etree (const cs_ci *A, int ata) ; +int cs_ci_fkeep (cs_ci *A, int (*fkeep) (int, int, cs_complex_t, void *), + void *other) ; +cs_complex_t cs_ci_house (cs_complex_t *x, double *beta, int n) ; +int *cs_ci_maxtrans (const cs_ci *A, int seed) ; +int *cs_ci_post (const int *parent, int n) ; +cs_cid *cs_ci_scc (cs_ci *A) ; +int cs_ci_scatter (const cs_ci *A, int j, cs_complex_t beta, int *w, + cs_complex_t *x, int mark,cs_ci *C, int nz) ; +int cs_ci_tdfs (int j, int k, int *head, const int *next, int *post, + int *stack) ; +int cs_ci_leaf (int i, int j, const int *first, int *maxfirst, int *prevleaf, + int *ancestor, int *jleaf) ; +int cs_ci_reach (cs_ci *G, const cs_ci *B, int k, int *xi, const int *pinv) ; +int cs_ci_spsolve (cs_ci *L, const cs_ci *B, int k, int *xi, + cs_complex_t *x, const int *pinv, int lo) ; +int cs_ci_ereach (const cs_ci *A, int k, const int *parent, int *s, int *w) ; +int *cs_ci_randperm (int n, int seed) ; + +/* utilities */ +cs_cid *cs_ci_dalloc (int m, int n) ; +cs_ci *cs_ci_done (cs_ci *C, void *w, void *x, int ok) ; +int *cs_ci_idone (int *p, cs_ci *C, void *w, int ok) ; +cs_cin *cs_ci_ndone 
(cs_cin *N, cs_ci *C, void *w, void *x, int ok) ; +cs_cid *cs_ci_ddone (cs_cid *D, cs_ci *C, void *w, int ok) ; + + +/* -------------------------------------------------------------------------- */ +/* complex/cs_long_t version of CXSparse */ +/* -------------------------------------------------------------------------- */ + +/* --- primary CSparse routines and data structures ------------------------- */ + +typedef struct cs_cl_sparse /* matrix in compressed-column or triplet form */ +{ + cs_long_t nzmax ; /* maximum number of entries */ + cs_long_t m ; /* number of rows */ + cs_long_t n ; /* number of columns */ + cs_long_t *p ; /* column pointers (size n+1) or col indlces (size nzmax) */ + cs_long_t *i ; /* row indices, size nzmax */ + cs_complex_t *x ; /* numerical values, size nzmax */ + cs_long_t nz ; /* # of entries in triplet matrix, -1 for compressed-col */ +} cs_cl ; + +cs_cl *cs_cl_add (const cs_cl *A, const cs_cl *B, cs_complex_t alpha, + cs_complex_t beta) ; +cs_long_t cs_cl_cholsol (cs_long_t order, const cs_cl *A, cs_complex_t *b) ; +cs_long_t cs_cl_dupl (cs_cl *A) ; +cs_long_t cs_cl_entry (cs_cl *T, cs_long_t i, cs_long_t j, cs_complex_t x) ; +cs_long_t cs_cl_lusol (cs_long_t order, const cs_cl *A, cs_complex_t *b, + double tol) ; +cs_long_t cs_cl_gaxpy (const cs_cl *A, const cs_complex_t *x, cs_complex_t *y) ; +cs_cl *cs_cl_multiply (const cs_cl *A, const cs_cl *B) ; +cs_long_t cs_cl_qrsol (cs_long_t order, const cs_cl *A, cs_complex_t *b) ; +cs_cl *cs_cl_transpose (const cs_cl *A, cs_long_t values) ; +cs_cl *cs_cl_compress (const cs_cl *T) ; +double cs_cl_norm (const cs_cl *A) ; +cs_long_t cs_cl_print (const cs_cl *A, cs_long_t brief) ; +cs_cl *cs_cl_load (FILE *f) ; + +/* utilities */ +void *cs_cl_calloc (cs_long_t n, size_t size) ; +void *cs_cl_free (void *p) ; +void *cs_cl_realloc (void *p, cs_long_t n, size_t size, cs_long_t *ok) ; +cs_cl *cs_cl_spalloc (cs_long_t m, cs_long_t n, cs_long_t nzmax, cs_long_t values, + cs_long_t t) ; +cs_cl 
*cs_cl_spfree (cs_cl *A) ; +cs_long_t cs_cl_sprealloc (cs_cl *A, cs_long_t nzmax) ; +void *cs_cl_malloc (cs_long_t n, size_t size) ; + +/* --- secondary CSparse routines and data structures ----------------------- */ + +typedef struct cs_cl_symbolic /* symbolic Cholesky, LU, or QR analysis */ +{ + cs_long_t *pinv ; /* inverse row perm. for QR, fill red. perm for Chol */ + cs_long_t *q ; /* fill-reducing column permutation for LU and QR */ + cs_long_t *parent ; /* elimination tree for Cholesky and QR */ + cs_long_t *cp ; /* column pointers for Cholesky, row counts for QR */ + cs_long_t *leftmost ; /* leftmost[i] = min(find(A(i,:))), for QR */ + cs_long_t m2 ; /* # of rows for QR, after adding fictitious rows */ + double lnz ; /* # entries in L for LU or Cholesky; in V for QR */ + double unz ; /* # entries in U for LU; in R for QR */ +} cs_cls ; + +typedef struct cs_cl_numeric /* numeric Cholesky, LU, or QR factorization */ +{ + cs_cl *L ; /* L for LU and Cholesky, V for QR */ + cs_cl *U ; /* U for LU, r for QR, not used for Cholesky */ + cs_long_t *pinv ; /* partial pivoting for LU */ + double *B ; /* beta [0..n-1] for QR */ +} cs_cln ; + +typedef struct cs_cl_dmperm_results /* cs_cl_dmperm or cs_cl_scc output */ +{ + cs_long_t *p ; /* size m, row permutation */ + cs_long_t *q ; /* size n, column permutation */ + cs_long_t *r ; /* size nb+1, block k is rows r[k] to r[k+1]-1 in A(p,q) */ + cs_long_t *s ; /* size nb+1, block k is cols s[k] to s[k+1]-1 in A(p,q) */ + cs_long_t nb ; /* # of blocks in fine dmperm decomposition */ + cs_long_t rr [5] ; /* coarse row decomposition */ + cs_long_t cc [5] ; /* coarse column decomposition */ +} cs_cld ; + +cs_long_t *cs_cl_amd (cs_long_t order, const cs_cl *A) ; +cs_cln *cs_cl_chol (const cs_cl *A, const cs_cls *S) ; +cs_cld *cs_cl_dmperm (const cs_cl *A, cs_long_t seed) ; +cs_long_t cs_cl_droptol (cs_cl *A, double tol) ; +cs_long_t cs_cl_dropzeros (cs_cl *A) ; +cs_long_t cs_cl_happly (const cs_cl *V, cs_long_t i, double beta, 
cs_complex_t *x) ; +cs_long_t cs_cl_ipvec (const cs_long_t *p, const cs_complex_t *b, + cs_complex_t *x, cs_long_t n) ; +cs_long_t cs_cl_lsolve (const cs_cl *L, cs_complex_t *x) ; +cs_long_t cs_cl_ltsolve (const cs_cl *L, cs_complex_t *x) ; +cs_cln *cs_cl_lu (const cs_cl *A, const cs_cls *S, double tol) ; +cs_cl *cs_cl_permute (const cs_cl *A, const cs_long_t *pinv, const cs_long_t *q, + cs_long_t values) ; +cs_long_t *cs_cl_pinv (const cs_long_t *p, cs_long_t n) ; +cs_long_t cs_cl_pvec (const cs_long_t *p, const cs_complex_t *b, + cs_complex_t *x, cs_long_t n) ; +cs_cln *cs_cl_qr (const cs_cl *A, const cs_cls *S) ; +cs_cls *cs_cl_schol (cs_long_t order, const cs_cl *A) ; +cs_cls *cs_cl_sqr (cs_long_t order, const cs_cl *A, cs_long_t qr) ; +cs_cl *cs_cl_symperm (const cs_cl *A, const cs_long_t *pinv, cs_long_t values) ; +cs_long_t cs_cl_usolve (const cs_cl *U, cs_complex_t *x) ; +cs_long_t cs_cl_utsolve (const cs_cl *U, cs_complex_t *x) ; +cs_long_t cs_cl_updown (cs_cl *L, cs_long_t sigma, const cs_cl *C, + const cs_long_t *parent) ; + +/* utilities */ +cs_cls *cs_cl_sfree (cs_cls *S) ; +cs_cln *cs_cl_nfree (cs_cln *N) ; +cs_cld *cs_cl_dfree (cs_cld *D) ; + +/* --- tertiary CSparse routines -------------------------------------------- */ + +cs_long_t *cs_cl_counts (const cs_cl *A, const cs_long_t *parent, + const cs_long_t *post, cs_long_t ata) ; +double cs_cl_cumsum (cs_long_t *p, cs_long_t *c, cs_long_t n) ; +cs_long_t cs_cl_dfs (cs_long_t j, cs_cl *G, cs_long_t top, cs_long_t *xi, + cs_long_t *pstack, const cs_long_t *pinv) ; +cs_long_t *cs_cl_etree (const cs_cl *A, cs_long_t ata) ; +cs_long_t cs_cl_fkeep (cs_cl *A, + cs_long_t (*fkeep) (cs_long_t, cs_long_t, cs_complex_t, void *), void *other) ; +cs_complex_t cs_cl_house (cs_complex_t *x, double *beta, cs_long_t n) ; +cs_long_t *cs_cl_maxtrans (const cs_cl *A, cs_long_t seed) ; +cs_long_t *cs_cl_post (const cs_long_t *parent, cs_long_t n) ; +cs_cld *cs_cl_scc (cs_cl *A) ; +cs_long_t cs_cl_scatter (const cs_cl 
*A, cs_long_t j, cs_complex_t beta, + cs_long_t *w, cs_complex_t *x, cs_long_t mark,cs_cl *C, cs_long_t nz) ; +cs_long_t cs_cl_tdfs (cs_long_t j, cs_long_t k, cs_long_t *head, const cs_long_t *next, + cs_long_t *post, cs_long_t *stack) ; +cs_long_t cs_cl_leaf (cs_long_t i, cs_long_t j, const cs_long_t *first, + cs_long_t *maxfirst, cs_long_t *prevleaf, cs_long_t *ancestor, cs_long_t *jleaf) ; +cs_long_t cs_cl_reach (cs_cl *G, const cs_cl *B, cs_long_t k, cs_long_t *xi, + const cs_long_t *pinv) ; +cs_long_t cs_cl_spsolve (cs_cl *L, const cs_cl *B, cs_long_t k, cs_long_t *xi, + cs_complex_t *x, const cs_long_t *pinv, cs_long_t lo) ; +cs_long_t cs_cl_ereach (const cs_cl *A, cs_long_t k, const cs_long_t *parent, + cs_long_t *s, cs_long_t *w) ; +cs_long_t *cs_cl_randperm (cs_long_t n, cs_long_t seed) ; + +/* utilities */ +cs_cld *cs_cl_dalloc (cs_long_t m, cs_long_t n) ; +cs_cl *cs_cl_done (cs_cl *C, void *w, void *x, cs_long_t ok) ; +cs_long_t *cs_cl_idone (cs_long_t *p, cs_cl *C, void *w, cs_long_t ok) ; +cs_cln *cs_cl_ndone (cs_cln *N, cs_cl *C, void *w, void *x, cs_long_t ok) ; +cs_cld *cs_cl_ddone (cs_cld *D, cs_cl *C, void *w, cs_long_t ok) ; + +#endif + +/* -------------------------------------------------------------------------- */ +/* Macros for constructing each version of CSparse */ +/* -------------------------------------------------------------------------- */ + +#ifdef CS_LONG +#define CS_INT cs_long_t +#define CS_INT_MAX cs_long_t_max +#define CS_ID cs_long_t_id +#ifdef CS_COMPLEX +#define CS_ENTRY cs_complex_t +#define CS_NAME(nm) cs_cl ## nm +#define cs cs_cl +#else +#define CS_ENTRY double +#define CS_NAME(nm) cs_dl ## nm +#define cs cs_dl +#endif +#else +#define CS_INT int +#define CS_INT_MAX INT_MAX +#define CS_ID "%d" +#ifdef CS_COMPLEX +#define CS_ENTRY cs_complex_t +#define CS_NAME(nm) cs_ci ## nm +#define cs cs_ci +#else +#define CS_ENTRY double +#define CS_NAME(nm) cs_di ## nm +#define cs cs_di +#endif +#endif + +#ifdef CS_COMPLEX +#define 
CS_REAL(x) creal(x) +#define CS_IMAG(x) cimag(x) +#define CS_CONJ(x) conj(x) +#define CS_ABS(x) cabs(x) +#else +#define CS_REAL(x) (x) +#define CS_IMAG(x) (0.) +#define CS_CONJ(x) (x) +#define CS_ABS(x) fabs(x) +#endif + +#define CS_MAX(a,b) (((a) > (b)) ? (a) : (b)) +#define CS_MIN(a,b) (((a) < (b)) ? (a) : (b)) +#define CS_FLIP(i) (-(i)-2) +#define CS_UNFLIP(i) (((i) < 0) ? CS_FLIP(i) : (i)) +#define CS_MARKED(w,j) (w [j] < 0) +#define CS_MARK(w,j) { w [j] = CS_FLIP (w [j]) ; } +#define CS_CSC(A) (A && (A->nz == -1)) +#define CS_TRIPLET(A) (A && (A->nz >= 0)) + +/* --- primary CSparse routines and data structures ------------------------- */ + +#define cs_add CS_NAME (_add) +#define cs_cholsol CS_NAME (_cholsol) +#define cs_dupl CS_NAME (_dupl) +#define cs_entry CS_NAME (_entry) +#define cs_lusol CS_NAME (_lusol) +#define cs_gaxpy CS_NAME (_gaxpy) +#define cs_multiply CS_NAME (_multiply) +#define cs_qrsol CS_NAME (_qrsol) +#define cs_transpose CS_NAME (_transpose) +#define cs_compress CS_NAME (_compress) +#define cs_norm CS_NAME (_norm) +#define cs_print CS_NAME (_print) +#define cs_load CS_NAME (_load) + +/* utilities */ +#define cs_calloc CS_NAME (_calloc) +#define cs_free CS_NAME (_free) +#define cs_realloc CS_NAME (_realloc) +#define cs_spalloc CS_NAME (_spalloc) +#define cs_spfree CS_NAME (_spfree) +#define cs_sprealloc CS_NAME (_sprealloc) +#define cs_malloc CS_NAME (_malloc) + +/* --- secondary CSparse routines and data structures ----------------------- */ +#define css CS_NAME (s) +#define csn CS_NAME (n) +#define csd CS_NAME (d) + +#define cs_amd CS_NAME (_amd) +#define cs_chol CS_NAME (_chol) +#define cs_dmperm CS_NAME (_dmperm) +#define cs_droptol CS_NAME (_droptol) +#define cs_dropzeros CS_NAME (_dropzeros) +#define cs_happly CS_NAME (_happly) +#define cs_ipvec CS_NAME (_ipvec) +#define cs_lsolve CS_NAME (_lsolve) +#define cs_ltsolve CS_NAME (_ltsolve) +#define cs_lu CS_NAME (_lu) +#define cs_permute CS_NAME (_permute) +#define cs_pinv CS_NAME (_pinv) 
+#define cs_pvec CS_NAME (_pvec) +#define cs_qr CS_NAME (_qr) +#define cs_schol CS_NAME (_schol) +#define cs_sqr CS_NAME (_sqr) +#define cs_symperm CS_NAME (_symperm) +#define cs_usolve CS_NAME (_usolve) +#define cs_utsolve CS_NAME (_utsolve) +#define cs_updown CS_NAME (_updown) + +/* utilities */ +#define cs_sfree CS_NAME (_sfree) +#define cs_nfree CS_NAME (_nfree) +#define cs_dfree CS_NAME (_dfree) + +/* --- tertiary CSparse routines -------------------------------------------- */ +#define cs_counts CS_NAME (_counts) +#define cs_cumsum CS_NAME (_cumsum) +#define cs_dfs CS_NAME (_dfs) +#define cs_etree CS_NAME (_etree) +#define cs_fkeep CS_NAME (_fkeep) +#define cs_house CS_NAME (_house) +#define cs_invmatch CS_NAME (_invmatch) +#define cs_maxtrans CS_NAME (_maxtrans) +#define cs_post CS_NAME (_post) +#define cs_scc CS_NAME (_scc) +#define cs_scatter CS_NAME (_scatter) +#define cs_tdfs CS_NAME (_tdfs) +#define cs_reach CS_NAME (_reach) +#define cs_spsolve CS_NAME (_spsolve) +#define cs_ereach CS_NAME (_ereach) +#define cs_randperm CS_NAME (_randperm) +#define cs_leaf CS_NAME (_leaf) + +/* utilities */ +#define cs_dalloc CS_NAME (_dalloc) +#define cs_done CS_NAME (_done) +#define cs_idone CS_NAME (_idone) +#define cs_ndone CS_NAME (_ndone) +#define cs_ddone CS_NAME (_ddone) + +/* -------------------------------------------------------------------------- */ +/* Conversion routines */ +/* -------------------------------------------------------------------------- */ + +#ifndef NCOMPLEX +cs_di *cs_i_real (cs_ci *A, int real) ; +cs_ci *cs_i_complex (cs_di *A, int real) ; +cs_dl *cs_l_real (cs_cl *A, cs_long_t real) ; +cs_cl *cs_l_complex (cs_dl *A, cs_long_t real) ; +#endif + +#ifdef __cplusplus +} +#endif +#endif diff --git a/csb/csb_wrapper.cpp b/csb/csb_wrapper.cpp new file mode 100644 index 0000000..1bc482b --- /dev/null +++ b/csb/csb_wrapper.cpp @@ -0,0 +1,252 @@ +/*! + \file csb_wrapper.cpp + \brief Wrapper for CSB object and routines. 
+ + \author Dimitris Floros + \date 2019-07-12 +*/ + +// -------------------------------------------------- +// Include headers + +#include +#include +#include + +#include + +#include "csb_wrapper.hpp" + +#include "triple.h" +#include "csc.h" +#include "bicsb.h" +#include "bmcsb.h" +#include "spvec.h" +#include "Semirings.h" + +#define RHS3DIM 3 + + +/* + * Although this is a template, the result is always a CSB object. + * This is a hack to get around the CSB library. + */ +template +BiCsb * prepareCSB( NT *values, IT *rows, IT *cols, + IT nzmax, IT m, IT n, + int workers, int forcelogbeta ){ + + // generate CSC object (CSB definitions) + Csc * csc; + csc = new Csc(); + + csc->SetPointers( cols, rows, values, nzmax, m, n, 0 ); + + if (workers == 0) + workers = __cilkrts_get_nworkers(); + else{ + std::string sworkers = std::to_string(workers); + __cilkrts_set_param("nworkers", sworkers.c_str()); + } + + BiCsb *bicsb = new BiCsb(*csc, workers, forcelogbeta); + + // clean CSB-type CSC object + delete( csc ); + + return bicsb; +} + +template +void deallocate( BiCsb * bicsb ){ + + // release the CSB object + delete bicsb; +} + +static double getMillisecondsExp( struct timeval begin, struct timeval end ) { + + return + ((double) (end.tv_sec - begin.tv_sec) * 1000 ) + + ((double) (end.tv_usec - begin.tv_usec) / 1000 ); +} + +// -------------------------------------------------- +// Function definitions + +INDEXTYPE csb_pq +( double *t_day_csb,double *t_day_csb_tar, + BiCsb * bicsb, + double * const x_in, + double * const y_out, + int n, int dim, int iter, + int workers, INDEXTYPE forcelogbeta ) { + + struct timeval begin, end; + + // find CSB block size + INDEXTYPE actual_beta = bicsb->getBeta(); + + // prepare template type for CSB routine + typedef PTSR PTDD; + + ///////////////////// + // Run experiments // + ///////////////////// + switch (dim){ + + case 1: + bicsb_tsne1D(*bicsb, x_in, y_out); + break; + + case 2: + bicsb_tsne2D(*bicsb, x_in, y_out); + 
break; + + case 3: + bicsb_tsne(*bicsb, x_in, y_out); + break; + + case 4: + bicsb_tsne4D(*bicsb, x_in, y_out); + break; + + } + + // return actual beta + return actual_beta; + +} + +INDEXTYPE csb_pq +( double *t_day_csb,double *t_day_csb_tar, + BiCsb * bicsb, + float * const x_in, + float * const y_out, + int n, int dim, int iter, + int workers, INDEXTYPE forcelogbeta ) { + + struct timeval begin, end; + + // find CSB block size + INDEXTYPE actual_beta = bicsb->getBeta(); + + // prepare template type for CSB routine + typedef PTSR PTDD; + + ///////////////////// + // Run experiments // + ///////////////////// + switch (dim){ + + case 1: + bicsb_tsne1D(*bicsb, x_in, y_out); + break; + + case 2: + bicsb_tsne2D(*bicsb, x_in, y_out); + break; + + case 3: + bicsb_tsne(*bicsb, x_in, y_out); + break; + + case 4: + bicsb_tsne4D(*bicsb, x_in, y_out); + break; + + } + + // return actual beta + return actual_beta; + +} + +float tsne_cost +( BiCsb * bicsb, + float * const x_in, int N, + int dim, float alpha, float zeta) { + + float *y_out = (float*) calloc(N, sizeof(float)); + + // prepare template type for CSB routine + typedef PTSR PTDD; + + ///////////////////// + // Run SPMV // + ///////////////////// + + bicsb_tsne_cost(*bicsb, x_in, y_out, dim, alpha, zeta); + float total_cost = 0; + + for (int i =0; i * bicsb, + double * const x_in, int N, + int dim, double alpha, double zeta) { + + double *y_out = (double*) calloc(N, sizeof(double)); + + // prepare template type for CSB routine + typedef PTSR PTDD; + + ///////////////////// + // Run SPMV // + ///////////////////// + + bicsb_tsne_cost(*bicsb, x_in, y_out, dim, alpha, zeta); + double total_cost = 0; + + for (int i =0; i * prepareCSB +(float *vals, uint32_t *rows, uint32_t *cols, + uint32_t nzmax, uint32_t m, uint32_t n, + int workers, int forcelogbeta ); + +template +BiCsb * prepareCSB +(double *vals, uint32_t *rows, uint32_t *cols, + uint32_t nzmax, uint32_t m, uint32_t n, + int workers, int forcelogbeta ); + +template 
+void deallocate( BiCsb * bicsb ); + +template +void deallocate( BiCsb * bicsb ); + + +/**------------------------------------------------------------ +* +* AUTHORS +* +* Dimitris Floros fcdimitr@auth.gr +* +* VERSION +* +* 1.0 - July 13, 2018 +* +* CHANGELOG +* +* 1.0 (Jul 13, 2018) - Dimitris +* * all interaction types in one file +* +* ----------------------------------------------------------*/ diff --git a/csb/csb_wrapper.hpp b/csb/csb_wrapper.hpp new file mode 100644 index 0000000..f66dbe0 --- /dev/null +++ b/csb/csb_wrapper.hpp @@ -0,0 +1,74 @@ +/*! + \file csb_wrapper.hpp + \brief Wrapper for CSB object and routines. + + \author Dimitris Floros + \date 2019-07-12 +*/ + + +#ifndef _H_EXP_STAT +#define _H_EXP_STAT + +#define INDEXTYPE uint32_t + + +template +class BiCsb; + +INDEXTYPE csb_pq +( double *t_day_csb,double *t_day_csb_tar, + BiCsb * bicsb, + double * const x_in, + double * const y_out, + int n, int dim, int iter, + int workers, INDEXTYPE forcelogbeta ); + +INDEXTYPE csb_pq +( double *t_day_csb,double *t_day_csb_tar, + BiCsb * bicsb, + float * const x_in, + float * const y_out, + int n, int dim, int iter, + int workers, INDEXTYPE forcelogbeta ); + + +template +BiCsb * prepareCSB(NT *vals, IT *rows, IT *cols, + IT nzmax, IT m, IT n, + int workers, int forcelogbeta); + + +float tsne_cost +( BiCsb * bicsb, + float * const x_in, int N, + int dim, float alpha, float zeta); + + +double tsne_cost +( BiCsb * bicsb, + double * const x_in, int N, + int dim, double alpha, double zeta); + +template +void deallocate( BiCsb * bicsb ); + +#endif + + +/**------------------------------------------------------------ +* +* AUTHORS +* +* Dimitris Floros fcdimitr@auth.gr +* +* VERSION +* +* 1.0 - July 13, 2018 +* +* CHANGELOG +* +* 1.0 (Jul 13, 2018) - Dimitris +* * all interaction types in one file +* +* ----------------------------------------------------------*/ diff --git a/csb/csbsym.cpp b/csb/csbsym.cpp new file mode 100644 index 0000000..83e7ceb --- /dev/null +++ 
b/csb/csbsym.cpp @@ -0,0 +1,803 @@ +#include "csbsym.h" +#include +#include "utility.h" + +// Choose block size as big as possible given the following constraints +// 1) The bot array is addressible by IT +// 2) The parts of x & y vectors that a block touches fits into L2 cache [assuming a saxpy() operation] +// 3) There's enough parallel slackness for block rows (at least SLACKNESS * CILK_NPROC) +template +void CsbSym::Init(int workers, IT forcelogbeta) +{ + ispar = (workers > 1); + IT roundup = nextpoweroftwo(n); + + // if indices are negative, highestbitset returns -1, + // but that will be caught by the sizereq below + IT nbits = highestbitset(roundup); + bool sizereq; + if (ispar) + { + sizereq = (IntPower<2>(nbits) > SLACKNESS * workers); + } + else + { + sizereq = (nbits > 1); + } + if(!sizereq) + { + cerr << "Matrix too small for this library" << endl; + return; + } + + nlowbits = nbits-1; + IT inf = numeric_limits::max(); + IT maxbits = highestbitset(inf); + + nhighbits = nbits-nlowbits; // # higher order bits for rows (has at least one bit) + if(ispar) + { + while(IntPower<2>(nhighbits) < SLACKNESS * workers) + { + nhighbits++; + nlowbits--; + } + } + + // calculate the space that suby and subx occupy in L2 cache + IT yL2 = IntPower<2>(nlowbits) * sizeof(NT); + while(yL2 > L2SIZE) + { + yL2 /= 2; + nhighbits++; + nlowbits--; + } + + lowmask = IntPower<2>(nlowbits) - 1; + if(forcelogbeta != 0) + { + IT candlowmask = IntPower<2>(forcelogbeta) -1; + cout << "Forcing beta to "<< (candlowmask+1) << " instead of the chosen " << (lowmask+1) << endl; + cout << "Warning : No checks are performed on the beta you have forced, anything can happen !" 
<< endl; + lowmask = candlowmask; + nlowbits = forcelogbeta; + nhighbits = nbits-nlowbits; + } + else + { + double sqrtn = sqrt(static_cast(n)); + IT logbeta = static_cast(ceil(log2(sqrtn))) + 2; + if(nlowbits > logbeta) + { + nlowbits = logbeta; + lowmask = IntPower<2>(logbeta) -1; + nhighbits = nbits-nlowbits; + } + cout << "Beta chosen to be "<< (lowmask+1) << endl; + } + highmask = ((roundup - 1) ^ lowmask); + + IT blcdim = lowmask + 1; + ncsb = static_cast(ceil(static_cast(n) / static_cast(blcdim))); + + blcrange = (lowmask+1) * (lowmask+1); // range indexed by one block + mortoncmp = MortCompSym(nlowbits, lowmask); +} + + +// copy constructor +template +CsbSym::CsbSym (const CsbSym & rhs) +: nz(rhs.nz), n(rhs.n), blcrange(rhs.blcrange), ncsb(rhs.ncsb), nhighbits(rhs.nhighbits), nlowbits(rhs.nlowbits), +highmask(rhs.highmask), lowmask(rhs.lowmask), mortoncmp(rhs.mortoncmp), ispar(rhs.ispar), diagonal(rhs.diagonal) +{ + if(nz > 0) // nz > 0 iff nrb > 0 + { + num = new NT[nz](); + bot = new IT[nz]; + + copy ( rhs.num, rhs.num+nz, num); + copy ( rhs.bot, rhs.bot+nz, bot ); + } + if ( ncsb > 0) + { + top = new IT* [ncsb]; + for(IT i=0; i +CsbSym & CsbSym::operator= (const CsbSym & rhs) +{ + if(this != &rhs) + { + if(nz > 0) // if the existing object is not empty + { + // make it empty + delete [] bot; + delete [] num; + } + if(ncsb > 0) + { + for(IT i=0; i 0) // if the copied object is not empty + { + num = new NT[nz](); + bot = new IT[nz]; + + copy ( rhs.num, rhs.num+nz, num); + copy ( rhs.bot, rhs.bot+nz, bot ); + } + if(ncsb > 0) + { + top = new IT* [ncsb]; + for(IT i=0; i +CsbSym::~CsbSym() +{ + if( nz > 0) + { + delete [] bot; + delete [] num; + } + if ( ncsb > 0) + { + for(IT i=0; i +CsbSym::CsbSym (Csc & csc, int workers):nz(csc.nz), n(csc.n) +{ + typedef std::pair ipair; + typedef std::pair mypair; + + assert(nz != 0 && n != 0); + Init(workers); + + top = new IT* [ncsb]; + for(IT i=0; i> nlowbits) << nhighbits) | ((highmask & j) >> nlowbits); + IT lindex = 
((lowmask & csc.ir[i]) << nlowbits) | (lowmask & j) ; + + // i => location of that nonzero in csc.ir and csc.num arrays + pairarray[k++] = mypair(hindex, ipair(lindex,i)); + } + } + sort(pairarray, pairarray+nz); // sort according to hindex + SortBlocks(pairarray, csc.num); + delete [] pairarray; +} + +template +void CsbSym::SortBlocks(pair > * pairarray, NT * val) +{ + typedef pair > mypair; + IT cnz = 0; + vector tempbot; + vector tempnum; + IT ldim = IntPower<2>(nhighbits); // leading dimension (not always equal to ncsb) + for(IT i = 0; i < ncsb; ++i) + { + for(IT j = 0; j < (ncsb-i); ++j) + { + top[i][j] = tempbot.size(); + IT prevcnz = cnz; + std::vector blocknz; + while(cnz < nz && pairarray[cnz].first == ((i*ldim)+(j+i)) ) // as long as we're in this block + { + IT interlowbits = pairarray[cnz].second.first; + IT rlowbits = ((interlowbits >> nlowbits) & lowmask); + IT clowbits = (interlowbits & lowmask); + IT bikey = BitInterleaveLow(rlowbits, clowbits); + + if(j == 0 && rlowbits == clowbits) + { + diagonal.push_back(make_pair((i << nlowbits)+rlowbits, val[pairarray[cnz++].second.second])); + } + else + { + blocknz.push_back(mypair(bikey, pairarray[cnz++].second)); + } + } + // sort the block into bitinterleaved order + sort(blocknz.begin(), blocknz.end()); + typename vector::iterator itr; + + for( itr = blocknz.begin(); itr != blocknz.end(); ++itr) + { + tempbot.push_back( itr->second.first ); + tempnum.push_back( val[itr->second.second] ); + } + } + top[i][ncsb-i] = tempbot.size(); + } + + assert (cnz == (tempbot.size() + diagonal.size())); + nz = tempbot.size(); // update the number of off-diagonal nonzeros + bot = new IT[nz]; + num = new NT[nz]; + + copy(tempbot.begin(), tempbot.end(), bot); + copy(tempnum.begin(), tempnum.end(), num); + sort(diagonal.begin(), diagonal.end()); +} + +template +void CsbSym::DivideIterationSpace(IT * & lspace, IT * & rspace, IT & lsize, IT & rsize, IT size, IT d) const +{ + if(d == 1) + { + lsize = size-size/2; + rsize = 
size/2; + lspace = new IT[lsize]; + rspace = new IT[rsize]; + for(IT i=0; i rsize) + { + lspace[lsize-1] = size-1; + } + } + else + { + IT chunks = size / (2*d); + int rest = size - (2*d*chunks); // rest is modulus 2d + lsize = d*chunks; // initial estimates + rsize = d*chunks; + if(rest > d) // first d goes to lsize, rest goes to rsize + { + rsize += (rest-d); + lsize += d; + } + else // all goes to lsize + { + lsize += rest; + } + lspace = new IT[lsize]; + rspace = new IT[rsize]; + int remrest = (int) rest; // needs to be signed integer since we're looping it until negative + if(d == 2) + { + for(IT i=0; i 0) lspace[2*chunks+0] = 4*chunks+0; + if(remrest-- > 0) lspace[2*chunks+1] = 4*chunks+1; + if(remrest-- > 0) rspace[2*chunks+0] = 4*chunks+2; + } + else if(d == 3) + { + for(IT i=0; i 0) lspace[3*chunks+0] = 6*chunks+0; + if(remrest-- > 0) lspace[3*chunks+1] = 6*chunks+1; + if(remrest-- > 0) lspace[3*chunks+2] = 6*chunks+2; + if(remrest-- > 0) rspace[3*chunks+0] = 6*chunks+3; + if(remrest-- > 0) rspace[3*chunks+1] = 6*chunks+4; + } + else if(d == 4) + { + + for(IT i=0; i 0) lspace[4*chunks+0] = 8*chunks+0; + if(remrest-- > 0) lspace[4*chunks+1] = 8*chunks+1; + if(remrest-- > 0) lspace[4*chunks+2] = 8*chunks+2; + if(remrest-- > 0) lspace[4*chunks+3] = 8*chunks+3; + if(remrest-- > 0) rspace[4*chunks+0] = 8*chunks+4; + if(remrest-- > 0) rspace[4*chunks+1] = 8*chunks+5; + if(remrest-- > 0) rspace[4*chunks+2] = 8*chunks+6; + } + else + { + cout << "Diagonal d = " << d << " is not yet supported" << endl; + } + } +} + +template +void CsbSym::MultAddAtomics(NT * __restrict y, const NT * __restrict x, const IT d) const +{ + cilk_for(IT i=0; i< ncsb-d; ++i) // all blocks at the dth diagonal and beyond + { + IT rhi = (i << nlowbits); + NT * __restrict suby = &y[rhi]; + const NT * __restrict subx_mirror = &x[rhi]; + + cilk_for(IT j=d; j < (ncsb-i); ++j) + { + IT chi = ((j+i) << nlowbits); + const NT * __restrict subx = &x[chi]; + NT * __restrict suby_mirror = &y[chi]; + + 
IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + for(IT k=top[i][j]; k> nlowbits) & lowmask); + IT cli = (r_bot[k] & lowmask); + + atomicallyIncrementDouble(&suby[rli], val * subx[cli]); + atomicallyIncrementDouble(&suby_mirror[cli], val * subx_mirror[rli]); +#ifdef STATS + atomicflops += 2; +#endif + } + } + } +} + + +template +void CsbSym::MultMainDiag(NT * __restrict y, const NT * __restrict x) const +{ + if(Imbalance(0) > 2 * BALANCETH) + { + cilk_for(IT i=0; i< ncsb; ++i) // in main diagonal, j = i + { + IT hi = (i << nlowbits); + NT * __restrict suby = &y[hi]; + const NT * __restrict subx = &x[hi]; + + if(i == (ncsb-1) && (n-hi) <= lowmask) // last iteration and it's irregular (can't parallelize) + { + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + for(IT k=top[i][0]; k> nlowbits) & lowmask); + IT cli = (ind & lowmask); + + suby[rli] += val * subx[cli]; + suby[cli] += val * subx[rli]; // symmetric update + } + } + else + { + BlockTriPar(top[i][0], top[i][1], subx, suby, 0, blcrange, BREAKEVEN * (nlowbits+1)); + } + } + } + else // No need for block parallelization + { + cilk_for(IT i=0; i< ncsb; ++i) // in main diagonal, j = i + { + IT hi = (i << nlowbits); + NT * __restrict suby = &y[hi]; + const NT * __restrict subx = &x[hi]; + + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + for(IT k=top[i][0]; k> nlowbits) & lowmask); + IT cli = (ind & lowmask); + + suby[rli] += val * subx[cli]; + suby[cli] += val * subx[rli]; // symmetric update + } + } + } + const IT diagsize = diagonal.size(); + cilk_for(IT i=0; i < diagsize; ++i) + { + y[diagonal[i].first] += diagonal[i].second * x[diagonal[i].first]; // process the diagonal + } +} + + +// Multiply the dth block diagonal +// which is composed of blocks A[i][i+d] +template +void CsbSym::MultDiag(NT * __restrict y, const NT * __restrict x, const IT d) const +{ + if(d == 0) + { + MultMainDiag(y, x); + return; + } + IT * lspace; + IT * rspace; + IT lsize, rsize; + 
DivideIterationSpace(lspace, rspace, lsize, rsize, ncsb-d, d); + IT lsum = 0; + IT rsum = 0; + for(IT k=0; k BALANCETH * lave) // relatively denser block + && (!(lspace[i] == (ncsb-d-1) && (n-chi) <= lowmask))) // and parallelizable + { + BlockPar(top[lspace[i]][d], top[lspace[i]][d+1], subx, subx_mirror, suby, suby_mirror, 0, blcrange, BREAKEVEN * (nlowbits+1)); + } + else + { + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + for(IT k=top[lspace[i]][d]; k> nlowbits) & lowmask); + IT cli = (r_bot[k] & lowmask); + suby[rli] += r_num[k] * subx[cli]; + suby_mirror[cli] += r_num[k] * subx_mirror[rli]; // symmetric update + } + } + } + + cilk_for(IT j=0; j< rsize; ++j) + { + IT rhi = (rspace[j] << nlowbits) ; + IT chi = ((rspace[j]+d) << nlowbits); + NT * __restrict suby = &y[rhi]; + NT * __restrict suby_mirror = &y[chi]; + const NT * __restrict subx = &x[chi]; + const NT * __restrict subx_mirror = &x[rhi]; + + if((top[rspace[j]][d+1] - top[rspace[j]][d] > BALANCETH * rave) // relatively denser block + && (!(rspace[j] == (ncsb-d-1) && (n-chi) <= lowmask))) // and parallelizable + { + BlockPar(top[rspace[j]][d], top[rspace[j]][d+1], subx, subx_mirror, suby, suby_mirror, 0, blcrange, BREAKEVEN * (nlowbits+1)); + } + else + { + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + for(IT k=top[rspace[j]][d]; k> nlowbits) & lowmask); + IT cli = (r_bot[k] & lowmask); + suby[rli] += r_num[k] * subx[cli]; + suby_mirror[cli] += r_num[k] * subx_mirror[rli]; // symmetric update + } + } + } + delete [] lspace; + delete [] rspace; +} + +// Block parallelization for upper triangular compressed sparse blocks +// start/end: element start/end positions (indices to the bot array) +// bot[start...end] always fall in the same block +// PRECONDITION: rangeend-rangebeg is a power of two +template +void CsbSym::BlockTriPar(IT start, IT end, const NT * __restrict subx, NT * __restrict suby, + IT rangebeg, IT rangeend, IT cutoff) const +{ + 
assert(IsPower2(rangeend-rangebeg)); + if(end - start < cutoff) + { + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + for(IT k=start; k> nlowbits) & lowmask); + IT cli = (ind & lowmask); + + suby[rli] += val * subx[cli]; + suby[cli] += val * subx[rli]; // symmetric update + } + } + else + { + // Lower_bound is a version of binary search: it attempts to find the element value in an ordered range [first, last) + // Specifically, it returns the first position where value could be inserted without violating the ordering + IT halfrange = (rangebeg+rangeend)/2; + IT qrt1range = (rangebeg+halfrange)/2; + IT qrt3range = (halfrange+rangeend)/2; + + IT * mid = std::lower_bound(&bot[start], &bot[end], halfrange, mortoncmp); // divides in mid column + IT * right = std::lower_bound(mid, &bot[end], qrt3range, mortoncmp); + + /* ------- + | 0 2 | + | 1 3 | + ------- */ + // subtracting two pointers pointing to the same array gives you the # of elements separating them + // In the symmetric case, quadrant "1" doesn't exist (size1 = 0) + IT size0 = static_cast (mid - &bot[start]); + IT size2 = static_cast (right - mid); + IT size3 = static_cast (&bot[end] - right); + + IT ncutoff = std::max(cutoff/2, MINNNZTOPAR); + + cilk_spawn BlockTriPar(start, start+size0, subx, suby, rangebeg, qrt1range, ncutoff); // multiply subblock_0 + BlockTriPar(end-size3, end, subx, suby, qrt3range, rangeend, ncutoff); // multiply subblock_3 + cilk_sync; + + BlockPar(start+size0, end-size3, subx, subx, suby, suby, halfrange, qrt3range, ncutoff); // multiply subblock_2 + } +} + +// Parallelize the block itself +// start/end: element start/end positions (indices to the bot array) +// bot[start...end] always fall in the same block +// PRECONDITION: rangeend-rangebeg is a power of two +// TODO: we rely on the particular implementation of lower_bound for correctness, which is dangerous ! +// what if lhs (instead of rhs) parameter to the comparison object is the splitter? 
+template +void CsbSym::BlockPar(IT start, IT end, const NT * __restrict subx, const NT * __restrict subx_mirror, + NT * __restrict suby, NT * __restrict suby_mirror, IT rangebeg, IT rangeend, IT cutoff) const +{ + assert(IsPower2(rangeend-rangebeg)); + if(end - start < cutoff) + { + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + for(IT k=start; k> nlowbits) & lowmask); + IT cli = (ind & lowmask); + + suby[rli] += val * subx[cli]; + suby_mirror[cli] += val * subx_mirror[rli]; // symmetric update + } + } + else + { + // Lower_bound is a version of binary search: it attempts to find the element value in an ordered range [first, last) + // Specifically, it returns the first position where value could be inserted without violating the ordering + IT halfrange = (rangebeg+rangeend)/2; + IT qrt1range = (rangebeg+halfrange)/2; + IT qrt3range = (halfrange+rangeend)/2; + + IT * mid = std::lower_bound(&bot[start], &bot[end], halfrange, mortoncmp); + IT * left = std::lower_bound(&bot[start], mid, qrt1range, mortoncmp); + IT * right = std::lower_bound(mid, &bot[end], qrt3range, mortoncmp); + + /* ------- + | 0 2 | + | 1 3 | + ------- */ + // subtracting two pointers pointing to the same array gives you the # of elements separating them + // we're *sure* that the differences are 1) non-negative, 2) small enough to be indexed by an IT + IT size0 = static_cast (left - &bot[start]); + IT size1 = static_cast (mid - left); + IT size2 = static_cast (right - mid); + IT size3 = static_cast (&bot[end] - right); + + IT ncutoff = std::max(cutoff/2, MINNNZTOPAR); + + // We only perform [0,3] in parallel and then [1,2] in parallel because the symmetric update causes races when + // performing [0,1] in parallel (as it would perform [0,2] in the fictitious lower triangular part) + cilk_spawn BlockPar(start, start+size0, subx, subx_mirror, suby, suby_mirror, rangebeg, qrt1range, ncutoff); // multiply subblock_0 + BlockPar(end-size3, end, subx, subx_mirror, suby, suby_mirror, 
qrt3range, rangeend, ncutoff); // multiply subblock_3 + cilk_sync; + + cilk_spawn BlockPar(start+size0, start+size0+size1, subx, subx_mirror, suby, suby_mirror, qrt1range, halfrange, ncutoff); // multiply subblock_1 + BlockPar(start+size0+size1, end-size3, subx, subx_mirror, suby, suby_mirror, halfrange, qrt3range, ncutoff); // multiply subblock_2 + cilk_sync; + } +} + + +// double* restrict a; --> No aliases for a[0], a[1], ... +// bstart/bend: block start/end index (to the top array) +template +void CsbSym::SeqSpMV(const NT * __restrict x, NT * __restrict y) const +{ + const IT diagsize = diagonal.size(); + for(IT i=0; i < diagsize; ++i) + { + y[diagonal[i].first] += diagonal[i].second * x[diagonal[i].first]; // process the diagonal + } + for (IT i = 0 ; i < ncsb ; ++i) // for all block rows of A + { + IT rhi = (i << nlowbits); + NT * suby = &y[rhi]; + const NT * subx_mirror = &x[rhi]; + + IT * __restrict r_bot = bot; + NT * __restrict r_num = num; + for (IT j = 0 ; j < (ncsb-i) ; ++j) // for all blocks inside that block row + { + // get higher order bits for column indices + IT chi = ((j+i) << nlowbits); + const NT * __restrict subx = &x[chi]; + NT * __restrict suby_mirror = &y[chi]; + + for(IT k=top[i][j]; k> nlowbits) & lowmask); + IT cli = (r_bot[k] & lowmask); + NT val = r_num[k]; + suby[rli] += val * subx[cli]; + suby_mirror[cli] += val * subx_mirror[rli]; // symmetric update + } + } + } +} + +// Imbalance in the dth block diagonal (the main diagonal is the 0th) +template +float CsbSym::Imbalance(IT d) const +{ + if (ncsb <= d+1) + { + return 0.0; //pointless + } + + IT size = ncsb-d-1; + IT * sums = new IT[size]; + for(size_t i=0; i< size; ++i) + { + sums[i] = top[i][d+1] - top[i][d]; + } + IT max = *max_element(sums, sums+size); + IT mean = accumulate(sums, sums+size, 0.0) / size; + delete [] sums; + + return static_cast(max) / mean; +} + + +// Total number of nonzeros in the dth block diagonal (the main diagonal is the 0th) +template +IT CsbSym::nzsum(IT 
d) const +{ + IT sum = 0; + for(size_t i=0; i< ncsb-d; ++i) + { + sum += (top[i][d+1] - top[i][d]); + } + return sum; +} + +// Print stats to an ofstream object +template +ofstream & CsbSym::PrintStats(ofstream & outfile) const +{ + if(nz == 0) + { + outfile << "## Matrix Doesn't have any nonzeros" <(nzsum(0)) / nz << ", " + << static_cast(nzsum(1)) / nz << ", " << static_cast(nzsum(2)) / nz << endl; + outfile << "## atomics ratio: " << static_cast(nz-nzsum(0)-nzsum(1)-nzsum(2))/nz << endl; + + std::vector blocksizes; + for(IT i=0; i (top[i][j+1]-top[i][j])); + } + } + sort(blocksizes.begin(), blocksizes.end()); + outfile<< "## Total number of nonzeros: " << 2*nz +diagonal.size()<< endl; + outfile<< "## Total number of stored nonzeros: "<< nz+diagonal.size() << endl; + outfile<< "## Size of diagonal: " << diagonal.size() << endl; + + outfile << "## Nonzero distribution (sorted) of blocks follows: \n" ; + std::copy(blocksizes.begin(), blocksizes.end(), ostream_iterator(outfile,"\n")); + outfile << endl; + return outfile; +} + + +template +ofstream & CsbSym::Dump(ofstream & outfile) const +{ + for(typename vector< pair >::const_iterator itr = diagonal.begin(); itr != diagonal.end(); ++itr) + { + outfile << itr->first << " " << itr->second << "\n"; + } + for(IT i =0; i> nlowbits) & lowmask); + IT cli = bot[k] & lowmask; + outfile << "A(" << rli << "," << cli << ")=" << num[k] << endl; + } + } + } + return outfile; +} diff --git a/csb/csbsym.h b/csb/csbsym.h new file mode 100644 index 0000000..5401387 --- /dev/null +++ b/csb/csbsym.h @@ -0,0 +1,106 @@ +#ifndef _CSBSYM_H +#define _CSBSYM_H + +#include +#include +#include +#include // for std:accumulate() +#include // C++ style numeric_limits +#include +#include +#include "csc.h" +#include "mortoncompare.h" + +using namespace std; + + +inline void atomicallyIncrementDouble(volatile double *target, const double by){ + asm volatile( + "movq %0, %%rax \n\t" // rax = *(%0) + "xorpd %%xmm0, %%xmm0 \n\t" // xmm0 = [0.0,0.0] + 
"movsd %1, %%xmm0\n\t" // xmm0[lo] = *(%1) + "1:\n\t" + // rax (containing *target) was last set at startup or by a failed cmpxchg + "movq %%rax, %%xmm1\n\t" // xmm1[lo] = rax + "addsd %%xmm0, %%xmm1\n\t" // xmm1 = xmm0 + xmm1 = by + xmm1 + "movq %%xmm1, %%r8 \n\t" // r8 = xmm1[lo] + "lock cmpxchgq %%r8, %0\n\t" // if(*(%0)==rax){ZF=1;*(%0)=r8}else{ZF=0;rax=*(%0);} + "jnz 1b\n\t" // jump back if failed (ZF=0) + : "=m"(*target) // outputs + : "m"(by) // inputs + : "cc", "memory", "%rax", "%r8", "%xmm0", "%xmm1" // clobbered + ); + return; +} + +/* Symmetric CSB implementation +** Only upper triangle is stored +** top[i][0] gives the ith diagonal block for every i +** Since this class works only for symmetric (hence square) matrices, +** each compressed sparse block is (lowbits+1)x(lowbits+1) and ncsb = nbr = nbc +*/ +template +class CsbSym +{ +public: + CsbSym ():nz(0), n(0), ncsb(0) {} // default constructor (dummy) + + CsbSym (const CsbSym & rhs); // copy constructor + ~CsbSym(); + CsbSym & operator=(const CsbSym & rhs); // assignment operator + CsbSym (Csc & csc, int workers); + + ofstream & PrintStats(ofstream & outfile) const; + ofstream & Dump(ofstream & outfile) const; + IT colsize() const { return n;} + IT rowsize() const { return n;} + bool isPar() const { return ispar; } + +private: + void Init(int workers, IT forcelogbeta = 0); + void SeqSpMV(const NT * __restrict x, NT * __restrict y) const; + void BMult(IT** chunks, IT start, IT end, const NT * __restrict x, NT * __restrict y, IT ysize) const; + + void BlockPar(IT start, IT end, const NT * __restrict subx, const NT * __restrict subx_mirror, + NT * __restrict suby, NT * __restrict suby_mirror, IT rangebeg, IT rangeend, IT cutoff) const; + void BlockTriPar(IT start, IT end, const NT * __restrict subx, NT * __restrict suby, IT rangebeg, IT rangeend, IT cutoff) const; + + void SortBlocks(pair > * pairarray, NT * val); + void DivideIterationSpace(IT * & lspace, IT * & rspace, IT & lsize, IT & rsize, IT size, 
IT d) const; + + void MultAddAtomics(NT * __restrict y, const NT * __restrict x, const IT d) const; + void MultDiag(NT * __restrict y, const NT * __restrict x, const IT d) const; + void MultMainDiag(NT * __restrict y, const NT * __restrict x) const; + + float Imbalance(IT d) const; + IT nzsum(IT d) const; + + IT ** top ; // pointers array (indexed by higher-order bits of the coordinate index), size = nbr*(nbc+1) + IT * bot; // contains lower-order bits of the coordinate index, size nnz + NT * num; // contains numerical values, size nnz + + vector< pair > diagonal; + + bool ispar; + IT nz; // # nonzeros + IT n; // #{rows} = #{columns} + IT blcrange; // range indexed by one block + + IT ncsb; // #{block rows) = #{block cols} + + IT nlowbits; // # lower order bits (for both rows and columns) + IT nhighbits; + IT highmask; // mask with the first log(n)/2 bits = 1 and the other bits = 0 + IT lowmask; + + MortCompSym mortoncmp; // comparison operator w.r.t. the (inverted N)-morton layout + + template + friend void csbsym_gespmv (const CsbSym & A, const NU * x, NU * y); +}; + + +#include "friends.h" +#include "csbsym.cpp" +#endif + diff --git a/csb/csc.cpp b/csb/csc.cpp new file mode 100644 index 0000000..b08a189 --- /dev/null +++ b/csb/csc.cpp @@ -0,0 +1,290 @@ +#include "csc.h" +#include "utility.h" +#include + + +template +Csc::Csc (ITYPE size, ITYPE rows, ITYPE cols, bool isSym): nz(size),m(rows),n(cols),issym(isSym),logicalnz(size) +{ + // Constructing empty Csc objects (size = 0) are not allowed. 
+ assert(size != 0 && n != 0); + num = new T[nz]; + ir = new ITYPE[nz]; + jc = new ITYPE[n+1]; +} + + +// copy constructor +template +Csc::Csc (const Csc & rhs): nz(rhs.nz), m(rhs.m), n(rhs.n), issym(rhs.issym), logicalnz(rhs.logicalnz) +{ + if(nz > 0) + { + num = new T[nz]; + ir = new ITYPE[nz]; + + for(ITYPE i=0; i< nz; ++i) + num[i]= rhs.num[i]; + for(ITYPE i=0; i< nz; ++i) + ir[i]= rhs.ir[i]; + } + if ( n > 0) + { + jc = new ITYPE[n+1]; + for(ITYPE i=0; i< n+1; i++) + jc[i] = rhs.jc[i]; + } +} + +template +Csc & Csc::operator= (const Csc & rhs) +{ + if(this != &rhs) + { + if(nz > 0) // if the existing object is not empty + { + // make it empty + delete [] ir; + delete [] num; + } + if(n > 0) + { + delete [] jc; + } + + nz = rhs.nz; + m = rhs.m; + n = rhs.n; + issym = rhs.issym; + logicalnz = rhs.logicalnz; + if(rhs.nz > 0) // if the copied object is not empty + { + num = new T[nz]; + ir = new ITYPE[nz]; + + for(ITYPE i=0; i< nz; ++i) + num[i]= rhs.num[i]; + for(ITYPE i=0; i< nz; ++i) + ir[i]= rhs.ir[i]; + } + if(rhs.n > 0) + { + jc = new ITYPE[n+1]; + for(ITYPE i=0; i< n+1; ++i) + jc[i] = rhs.jc[i]; + } + } + return *this; +} + + +template +Csc::~Csc() +{ + if( nz > 0) + { + delete [] ir; + delete [] num; + } + if ( n > 0) + { + delete [] jc; + } +} + + +// Construct a Csc object from an array of "triple"s +// The symmetric case is resilient in the sense that it covers both cases +// (a) triples only contain the upper triangular part, or (b) the whole matrix +template +Csc::Csc(Triple * triples, ITYPE size, ITYPE rows, ITYPE cols, bool isSym) +:nz(size),m(rows),n(cols),issym(isSym) +{ + // Constructing empty Csc objects (size = 0) are not allowed. 
+ assert(size != 0 && n != 0); + + num = new T[nz]; + ir = new ITYPE[nz]; + jc = new ITYPE[n+1]; + + ITYPE * w = new ITYPE[n]; // workspace of size n (# of columns) + + for(ITYPE k = 0; k < n; ++k) + w[k] = 0; + + if(issym) + { + logicalnz = 0; + for (ITYPE k = 0 ; k < nz ; ++k) + { + if(triples[k].col >= triples[k].row) // only the upper triangular part + { + if(triples[k].col > triples[k].row) + ++logicalnz; // count each nonzero twice except for the diagonal + ITYPE tmp = triples[k].col; + w [ tmp ]++ ; + ++logicalnz; + } + } + } + else + { + logicalnz = nz; + for (ITYPE k = 0 ; k < nz ; ++k) + { + ITYPE tmp = triples[k].col; + w [ tmp ]++ ; // column counts (i.e, w holds the "col difference array") + } + } + + if(nz > 0) + { + jc[n] = CumulativeSum (w, n) ; // cumulative sum of w + for(ITYPE k = 0; k < n; ++k) + jc[k] = w[k]; + + ITYPE last; + if(issym) + { + for(ITYPE k = 0; k < nz; ++k) + { + if(triples[k].col >= triples[k].row) // only the upper triangular part + { + ir[ last = w[ triples[k].col ]++ ] = triples[k].row ; + num[last] = triples[k].val ; + } + } + nz = last + 1; // actual number of nonzeros that are physically stored + Resize(nz); + } + else + { + for (ITYPE k = 0 ; k < nz ; ++k) + { + ir[ last = w[ triples[k].col ]++ ] = triples[k].row ; + num[last] = triples[k].val ; + } + assert(((last+1) == nz)); + } + } + delete [] w; +} + + +// Construct a Csc object from parallel arrays +template +Csc::Csc(ITYPE * ri, ITYPE * ci, T * val, ITYPE size, ITYPE rows, ITYPE cols, bool isSym) +:nz(size),m(rows),n(cols),issym(isSym) +{ + // Constructing empty Csc objects (size = 0) are not allowed. 
+ assert(size != 0 && n != 0); + + num = new T[nz]; + ir = new ITYPE[nz]; + jc = new ITYPE[n+1]; + + ITYPE * w = new ITYPE[n]; // workspace of size n (# of columns) + + for(ITYPE k = 0; k < n; ++k) + w[k] = 0; + + if(issym) + { + logicalnz = 0; + for (ITYPE k = 0 ; k < nz ; ++k) + { + if(ci[k] >= ri[k]) // only the upper part + { + if(ci[k] > ri[k]) + ++logicalnz; // count each nonzero twice except for the diagonal + ITYPE tmp = ci[k]; + w[ tmp ]++; + ++logicalnz; + } + } + } + else + { + logicalnz = nz; + for (ITYPE k = 0; k < nz; ++k) + { + ITYPE tmp = ci[k]; + w[ tmp ]++; // column counts (i.e., w holds the "col difference array") + } + } + if(nz > 0) + { + jc[n] = CumulativeSum (w, n) ; // cumulative sum of w + for(ITYPE k = 0; k < n; ++k) + jc[k] = w[k]; + + ITYPE last = 0; // initialized so the debug print below never reads an indeterminate value + if(issym) + { + for (ITYPE k = 0 ; k < nz ; ++k) + { + if(ci[k] >= ri[k]) // only the upper part + { + ir[ last = w[ ci[k] ]++ ] = ri[k] ; + num[last] = val[k] ; + } + } + nz = last+1; // actual nnz that are physically stored + Resize(nz); + } + else + { + cout << "nz=" << nz << " last=" << last << endl; + for (ITYPE k = 0 ; k < nz ; ++k) + { + ir[ last = w[ ci[k] ]++ ] = ri[k] ; + num[last] = val[k] ; + } + cout << "nz=" << nz << " last=" << last << endl; + assert(((last+1) == nz)); + } + } + delete [] w; +} + + +// Resizes the maximum # nonzeros, doesn't change the matrix dimensions +template +void Csc::Resize(ITYPE nsize) +{ + if(nsize == nz) + { + // No need to do anything!
+ return; + } + else if(nsize == 0) + { + delete [] num; + delete [] ir; + nz = 0; + return; + } + + T * tmpnum = num; + ITYPE * tmpir = ir; + num = new T[nsize]; + ir = new ITYPE[nsize]; + + if(nsize > nz) // Grow it + { + for(ITYPE i=0; i< nz; ++i) // copy all of the old elements + num[i] = tmpnum[i]; + for(ITYPE i=0; i< nz; ++i) // copy all of the old elements + ir[i] = tmpir[i]; + } + else // Shrink it + { + for(ITYPE i=0; i< nsize; ++i) // copy only a portion of the old elements + num[i] = tmpnum[i]; + for(ITYPE i=0; i< nsize; ++i) // copy only a portion of the old elements + ir[i] = tmpir[i]; + } + delete [] tmpir; // delete the memory pointed by previous pointers + delete [] tmpnum; + nz = nsize; +} diff --git a/csb/csc.h b/csb/csc.h new file mode 100644 index 0000000..c65eacc --- /dev/null +++ b/csb/csc.h @@ -0,0 +1,223 @@ +#ifndef _CSC_H_ +#define _CSC_H_ + +#include "triple.h" +#include +#include + +using namespace std; + + +template +struct Triple; + +template +class Csc +{ +public: + Csc ():nz(0), m(0), n(0), logicalnz(0), issym(false) {} // default constructor + Csc (ITYPE size,ITYPE rows, ITYPE cols, bool isSym=false); + Csc (const Csc & rhs); // copy constructor + ~Csc(); + Csc & operator=(const Csc & rhs); // assignment operator + Csc (Triple * triples, ITYPE size, ITYPE rows, ITYPE cols, bool isSym=false); + Csc (ITYPE * ri, ITYPE * ci, T * val, ITYPE size, ITYPE rows, ITYPE cols, bool isSym=false); + + // we have to use another function because the compiler will reject another constructor with the same signature + void SetPointers (ITYPE * colpointers, ITYPE * rowindices, T * vals, ITYPE size, ITYPE rows, ITYPE cols, bool fortran) + { + jc = colpointers; + ir = rowindices; + num = vals; + nz = size; + m = rows; + n = cols; + issym = false; + logicalnz = size; + + if(fortran) + { + transform(jc, jc+n+1, jc, bind2nd(minus(),1)); + transform(ir, ir+nz, ir, bind2nd(minus(),1)); + } + } + + // symmetric pointer initialization + void SetPointersSym 
(ITYPE * colpointers, ITYPE * rowindices, T * vals, ITYPE size, ITYPE sizeNz, ITYPE rows, ITYPE cols, bool fortran) + { + jc = colpointers; + ir = rowindices; + num = vals; + nz = size; + m = rows; + n = cols; + issym = true; + logicalnz = sizeNz; + + if(fortran) + { + transform(jc, jc+n+1, jc, bind2nd(minus(),1)); + transform(ir, ir+nz, ir, bind2nd(minus(),1)); + } + } + + ITYPE colsize() const { return n;} + ITYPE rowsize() const { return m;} + ITYPE * getjc() const { return jc;} + ITYPE * getir() const { return ir;} + T * getnum() const { return num;} + ITYPE getlogicalnnz() const + { + return logicalnz; + } + + // function to print CSC stats for debugging + void printStats() const { + printf(" nz = %d\n" , nz ); + printf(" m = %d\n" , m ); + printf(" n = %d\n" , n ); + printf(" issym = %d\n" , issym ); + printf(" logicalnz = %d\n" , logicalnz ); + + for (int j = 0; j < n; j++) + for (int i = jc[j]; i < jc[j+1]; i++) + printf(" A[%d, %d] = %g\n", ir[i], j, num[i] ); + + } + +private: + void Resize(ITYPE nsize); + bool issym; + + ITYPE * jc ; // col pointers, size n+1 + ITYPE * ir; // row indices, size nnz + T * num; // numerical values, size nnz + + ITYPE logicalnz; + ITYPE nz; + ITYPE m; // number of rows + ITYPE n; // number of columns + + template + friend class CsbSym; + template + friend class BiCsb; + template + friend class BmCsb; + template + friend class BmSym; + + template + friend void csc_gaxpy (const Csc & A, U * x, U * y); + + template + friend void csc_gaxpy_trans (const Csc & A, U * x, U * y); + + template + friend void csc_gaxpy_mm(const Csc & A, array * x, array * y); + + template + friend void csc_gaxpy_mm_trans(const Csc & A, array * x, array * y); +}; + +/* y = A*x+y */ +template +void csc_gaxpy (const Csc & A, T * x, T * y) +{ + if(A.issym) + { + for (ITYPE j = 0 ; j < A.n ; ++j) // for all columns of A + { + for (ITYPE k = A.jc [j] ; k < A.jc [j+1] ; ++k) + { + y [ A.ir[k] ] += A.num[k] * x [j] ; + if( j != A.ir[k] ) + y [ j ] += A.num[k] 
* x[ A.ir[k] ] ; // perform the symmetric update + } + } + } + else + { + for (ITYPE j = 0 ; j < A.n ; ++j) // for all columns of A + { + for (ITYPE k = A.jc [j] ; k < A.jc [j+1] ; ++k) // scale jth column with x[j] + { + y [ A.ir[k] ] += A.num[k] * x [j] ; + } + } + } +} + + +/* y = A' x + y */ +template +void csc_gaxpy_trans(const Csc & A, T * x, T * y) +{ + if(A.issym) + { + cout << "Trying to run A'x on a symmetric matrix doesn't make sense" << endl; + cout << "Are you sure you're using the right data structure?" << endl; + return; + } + + for (ITYPE j = 0; j< A.n; ++j) + { + for(ITYPE k= A.jc[j]; k < A.jc[j+1]; ++k) + { + y[j] += A.num[k] * x [ A.ir[k] ]; + } + } +} + + +/* Y = A X + Y */ +template +void csc_gaxpy_mm(const Csc & A, array * x, array * y) +{ + if(A.issym) + { + cout << "Symmetric csc_gaxpy_mm not implemented yet" << endl; + return; + } + + for (IT j = 0 ; j < A.n ; ++j) // for all columns of A + { + for (IT k = A.jc[j] ; k < A.jc[j+1] ; ++k) // scale jth column with x[j] + { + for(int i=0; i +void csc_gaxpy_mm_trans(const Csc & A, array * x, array * y) +{ + if(A.issym) + { + cout << "Trying to run A'x on a symmetric matrix doesn't make sense" << endl; + cout << "Are you sure you're using the right data structure?" << endl; + return; + } + + for (IT j = 0; j< A.n; ++j) + { + for(IT k= A.jc[j]; k < A.jc[j+1]; ++k) + { + for(int i=0; i + * @date Thu Jul 19, 2018 + * + * @brief Implementations for stationary and nonstationary + * computations using CSR storage format. 
+ * + * @version 1.0 + * + * + */ + +#include "csr_routines.hpp" +#include "cilk/cilk.h" +#include "mkl.h" + +#ifdef PARFLG +#define FOR cilk_for +#else +#define FOR for +#endif + + +void computeSubDistSparse( double * Fattr, + double * const Y, + double const * const p_sp, + int * ir, + int * jc, + int const n, + int const d) { + + // loop over rows of matrix (cilk_for or for) + FOR (int i = 0; i < n; i++) { + + // loop over rows of matrix (cilk_for or for) + // FOR (int i = 0; i < n; i++) { + + double Fi[3] = {0}; + double Yi[3]; + + const int nnzi = jc[i+1] - jc[i]; + + Yi[:] = Y[ (i*d) + 0:d ]; + + // for each non zero element of row i + for (int k = 0; k < nnzi; k++) { + + double Ydij[3]; + const int idx = jc[i]+k; + const int j = (ir[idx]); + + // compute on-the-fly vector Yi - Yj + Ydij[:] = Yi[:] - Y[ (j*d) + 0:d ]; + + // compute euclidean distance between Yi and Yj + double dist = __sec_reduce_add( Ydij[:]*Ydij[:] ); + + // P(i,j) / ( 1 + dist(Y[i,:],Y[j,:]) ) + const double p_times_q = p_sp[idx] / (1+dist); + + // Fi += P[i,j] * Q[i,j] * (Y[i,:] - Y[j,:]) + Fi[:] += p_times_q * ( Ydij[:] ); + + } + + // update final output vector F[i,:] + Fattr[ (i*d) + 0:d ] = Fi[:]; + + } + +} + + + +// -------------------------------------------------- +// ---------- ?CSRMV: MATRIX VECTOR PRODUCT USING SPARSE CSR + +void sparseComputationCSR( double * const y, + double const * const values, + int const * const rows, + int const * const columns, + double const * const x, + unsigned int const n, + unsigned int const nOfVec){ + + char transa = 'n'; + + MKL_INT m = n; + + for ( int jj = 0; jj < nOfVec; jj++) + + mkl_cspblas_dcsrgemv (&transa , &m , values, rows, columns , &(x[jj*n]), &(y[jj*n]) ); + +} + +void sparseComputationCSR( float * const y, + float const * const values, + int const * const rows, + int const * const columns, + float const * const x, + unsigned int const n, + unsigned int const nOfVec){ + + char transa = 'n'; + + MKL_INT m = n; + + for ( int jj = 0; 
jj < nOfVec; jj++) + + mkl_cspblas_scsrgemv (&transa , &m , values , rows , columns , &(x[jj*n]), &(y[jj*n]) ); + +} + + +// ================================================== +// === MATRIX-MATRIX PRODUCT ROUTINES + +// -------------------------------------------------- +// --- CSR MATRIX-MATRIX PRODUCT + + void sparseMatrixMatrixComputationCSR( double * const y, + double const * const values, + int const * const columns, + int const * const rows, + double const * const x, + unsigned int const n, + unsigned int const nOfVec){ + + char transa[1] = {'N'}; + + char matdescra[6] = {'g', 'l', 'n', 'c', 'x', 'x'}; + + MKL_INT m = (MKL_INT) n; + MKL_INT nVecs = (MKL_INT) nOfVec; + MKL_INT k = (MKL_INT) n; + + double alpha = 1; + double beta = 0; + + mkl_dcsrmm (transa, &m, &nVecs, &k, &alpha, matdescra, values, columns, + rows, &(rows[1]), x, &nVecs, &beta, y, &nVecs ); + + } + + void sparseMatrixMatrixComputationCSR( float * const y, + float const * const values, + int const * const columns, + int const * const rows, + float const * const x, + unsigned int const n, + unsigned int const nOfVec){ + + char transa[1] = {'N'}; + + char matdescra[6] = {'g', 'l', 'n', 'c', 'x', 'x'}; + + MKL_INT m = (MKL_INT) n; + MKL_INT nVecs = (MKL_INT) nOfVec; + MKL_INT k = (MKL_INT) n; + + float alpha = 1; + float beta = 0; + + mkl_scsrmm (transa, &m, &nVecs, &k, &alpha, matdescra, values, columns, + rows, &(rows[1]), x, &nVecs, &beta, y, &nVecs ); + + } + + + // ================================================== + // CUSTOM IMPLEMENTATION OF CSR + +void computeSubDistSparse_spmv( double * Fattr, + double * const Y, + double const * const p_sp, + int * ir, + int * jc, + int const n, + int const d) { + + // loop over rows of matrix (cilk_for or for) + FOR (int i = 0; i < n; i++) { + + double Fi[1] = {0}; + + const int nnzi = jc[i+1] - jc[i]; + + // for each nnz of row i + for (int k = 0; k < nnzi; k++) { + + const int idx = jc[i] + k; + const int j = ir[idx]; + + // Fi += P[i,j] * Y[j,:] + 
Fi[:] += p_sp[idx] * Y[ (j*d) + 0:d ]; + + } + + // update final output vector F[i,:] + Fattr[ (i*d) + 0:d ] = Fi[:]; + + } + +} + + + + +/* ********************************************************************** + * + * AUTHORS + * + * Dimitris Floros fcdimitr@auth.gr + * + * VERSION + * + * 1.1 - October 25, 2017 + * + * CHANGELOG + * + * 1.1 (Oct 25, 2017) - Dimitris + * * added multiple different parallelism options + * - different grainsizes + * - openmp + * 1.0 (Oct 18, 2017) - Dimitris + * * fixed rows --> columns notation ( i <--> j ) + * * cleaned up and simplified code + * 0.1 (???) - ??? + * * initial implementation + * + * ********************************************************************** */ diff --git a/csb/csr_routines.hpp b/csb/csr_routines.hpp new file mode 100644 index 0000000..e75d4da --- /dev/null +++ b/csb/csr_routines.hpp @@ -0,0 +1,129 @@ +/* ********************************************************************** + * + * CSR_ROUTINES + * ---------------------------------------------------------------------- + * + * Header file containing declarations of CSR routines for experiments. + * + * ********************************************************************** */ + +#ifndef _CSR_ROUTINES +#define _CSR_ROUTINES + + +/** + Double precision sparse matrix vector product using MKL DCSRGEMV, + with matrix stored in Compressed Sparse Row format (CSR).
+ + @param y Output vector y (result of matrix-vector product) + @param values Vector containing the nnz elements of sparse matrix + @param rows Row pointers vector + @param columns Column indices vector + @param x Input vector x + @param n Number of rows (and columns) of the matrix + */ +void sparseComputationCSR( double * const y, + double const * const values, + int const * const rows, + int const * const columns, + double const * const x, + unsigned int const n, + unsigned int const nOfVec); + +void sparseComputationCSR( float * const y, + float const * const values, + int const * const rows, + int const * const columns, + float const * const x, + unsigned int const n, + unsigned int const nOfVec); + + +// -------------------------------------------------- +// GENERAL SPARSE MATRIX MATRIX PRODUCT -- CSR FORMAT (MKL_?CSRMM) + +/** + Double precision sparse matrix matrix product using MKL DCSRMM, + with matrix stored in Compressed Sparse Row format (CSR). + + @param y Output matrix y (result of matrix-matrix product) + @param values Vector containing the nnz elements of sparse matrix + @param columns Column indices vector + @param rows Row pointers vector + @param x Input matrix x (nOfVec vectors of length n) + @param n Number of rows (and columns) of the matrix + @param nOfVec Number of vectors + */ +void sparseMatrixMatrixComputationCSR( double * const y, + double const * const values, + int const * const columns, + int const * const rows, + double const * const x, + unsigned int const n, + unsigned int const nOfVec); + + +/** + Single precision sparse matrix matrix product using MKL SCSRMM, + with matrix stored in Compressed Sparse Row format (CSR).
+ + @param y Output vector y (result of matrix-vector product) + @param values Vector containing the nnz elements of sparse matrix + @param columns Column indices vector + @param rows Row pointers vector + @param x Input vector x + @param n Number of rows (and columns) of the matrix + @param nOfVec Number of vectors + */ +void sparseMatrixMatrixComputationCSR( float * const y, + float const * const values, + int const * const columns, + int const * const rows, + float const * const x, + unsigned int const n, + unsigned int const nOfVec); + + +/** + * COMPUTESUBDISTSPARSE: Custom implementation of nonstationary 3RHS + * code using CSR. + */ +void computeSubDistSparse( double * Fattr, + double * const Y, + double const * const p_sp, + int * ir, + int * jc, + int const n, + int const d); + +/** + * COMPUTESUBDISTSPARSE_SPMV: Custom implementation of stationary 1RHS + * code using CSR. + */ +void computeSubDistSparse_spmv( double * Fattr, + double * const Y, + double const * const p_sp, + int * ir, + int * jc, + int const n, + int const d); + + +#endif + +/* ********************************************************************** + * + * AUTHORS + * + * Dimitris Floros fcdimitr@auth.gr + * + * VERSION + * + * 0.1 - July 19, 2018 + * + * CHANGELOG + * + * 0.1 (Jul 19, 2018) - Dimitris + * * initial implementation + * + * ********************************************************************** */ diff --git a/csb/friends.h b/csb/friends.h new file mode 100644 index 0000000..83fcf0b --- /dev/null +++ b/csb/friends.h @@ -0,0 +1,1188 @@ +#ifndef _FRIENDS_H_ +#define _FRIENDS_H_ + +#include +#include +#include "bicsb.h" +#include "bmcsb.h" +#include "bmsym.h" +#include "csbsym.h" +#include "utility.h" +#include "timer.gettimeofday.c" + +using namespace std; + +template +class BiCsb; + +template +class BmCsb; + +double prescantime; + + +#if (__GNUC__ == 4 && (__GNUC_MINOR__ < 7) ) +#define emplace_back push_back +#endif + + + +// SpMV with Bit-Masked CSB +// No semiring or type promotion support yet +template
+void bmcsb_gespmv (const BmCsb & A, const NT * __restrict x, NT * __restrict y) +{ + double t0 = timer_seconds_since_init(); + + unsigned * scansum = new unsigned[A.nrb]; + unsigned sum = prescan(scansum, A.masks, A.nrb); + + double t1 = timer_seconds_since_init(); + prescantime += (t1-t0); + + IT ysize = A.lowrowmask + 1; // size of the output subarray (per block row - except the last) + + if( A.isPar() ) + { + float rowave = static_cast(A.numnonzeros()) / (A.nbr-1); + cilk_for (IT i = 0 ; i < A.nbr ; ++i) // for all block rows of A + { + IT * btop = A.top [i]; // get the pointer to this block row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); + NT * suby = &y[rhi]; + if( A.top[i][A.nbc] - A.top[i][0] > BALANCETH * rowave) + { + IT thsh = ysize * BREAKNRB; + vector chunks; + chunks.push_back(btop); + for(IT j =0; j < A.nbc; ) + { + IT count = btop[j+1] - btop[j]; + if(count < thsh && j < A.nbc) + { + while(count < thsh && j < A.nbc) + { + ++j; count += btop[j+1] - btop[j]; // increment first: modifying and reading j in one expression is unsequenced + } + chunks.push_back(btop+j); // push, but exclude the block that caused the overflow + } + else + { + chunks.push_back(btop+(++j)); // don't exclude the overflow block if it is the only block in that chunk + } + } + // In std::vector, the elements are stored contiguously so that we can + // treat &chunks[0] as an array of pointers to IT w/out literally copying it to IT** + if(i==(A.nbr-1)) // last iteration + { + A.BMult(&chunks[0], 0, chunks.size()-1, x, suby, A.rowsize() - ysize*i, scansum); + } + else + { + A.BMult(&chunks[0], 0, chunks.size()-1, x, suby, ysize, scansum); + } + } + else + { + A.SubSpMV(btop, 0, A.nbc, x, suby, scansum); + } + } + } + + else + { + cilk_for (IT i = 0 ; i < A.nbr ; ++i) // for all block rows of A + { + IT * btop = A.top [i]; // get the pointer to this block row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); + NT * suby = &y[rhi]; + + A.SubSpMV(btop, 0, A.nbc, x, suby, scansum); + } + } + delete [] scansum; +} + +/** + * Operation y = A*x+y on a semiring
SR + * A: a general CSB matrix (no specialization on booleans is necessary as this loop is independent of numerical values) + * x: a column vector or a set of column vectors (i.e. array of structs, array of std:arrays, etc)) + * SR::multiply() handles the multiple rhs and type promotions, etc. + **/ +template +void bicsb_gespmv (const BiCsb & A, const RHS * __restrict x, LHS * __restrict y) +{ + IT ysize = A.lowrowmask + 1; // size of the output subarray (per block row - except the last) + + + +#ifdef PRINT_PAR /* log: parallelization strategies */ + printf("ENTERED logfile creation\n"); + /* prepare filename */ + char fileName[ 180 ]; + sprintf( fileName, "csb-stat_n-%d_m-%d_blk-%d_nbr-%d_nbc-%d_nnz-%d_alpha-%d_gamma-%0.2f.log", + A.n, A.m, ysize, A.nbr, A.nbc, A.numnonzeros(), BREAKEVEN, BALANCETH ); + + printf("FILENAME %s\n", fileName); + /* open output stream */ + ofstream statsFile( fileName ); + + /* print matrix size and block size */ + statsFile << A.m << "," << A.n << "," << ysize << endl; +#endif + + if(A.isPar() ) + { + /* row average */ + float rowave = static_cast(A.numnonzeros()) / (A.nbr-1); /* why -1? 
*/ + +#ifdef BREAK_NBR /* break cilk_for rows to enforce consecutive writing */ + + int np = __cilkrts_get_nworkers(); + int blockSize = BREAK_NBR * np; + for(IT ib = 0; ib < A.nbr; ib += blockSize ) { +#endif + +#ifdef BREAK_NBC /* break columns in 2 sequential blocks */ + + for (IT jb = 0; jb < A.nbc; jb += ((int)ceil(A.nbc/BREAK_NBC)) ) { + +#endif + +#ifdef PRINT_PAR + for (IT i = 0 ; i < A.nbr ; ++i) // for all block rows of A +#else + + #ifdef SHOW_TIMESTAMP /* show worker timestamp */ + struct timeval startwtime; + gettimeofday (&startwtime, NULL); + #endif + + #ifdef BREAK_NBR /* break cilk_for rows to enforce consecutive writing */ + /* cilk for */ + #pragma cilk grainsize = BREAK_NBR + cilk_for (IT i = ib ; i < min(ib + blockSize, A.nbr) ; ++i) + #else + cilk_for (IT i = 0 ; i < A.nbr ; ++i) // for all block rows of A + #endif + +#endif + { + +#ifdef SHOW_TIMESTAMP /* show worker timestamp */ + + int whoamI = __cilkrts_get_worker_number(); + struct timeval endwtime; + gettimeofday (&endwtime, NULL); + + double time = (double)((endwtime.tv_usec - startwtime.tv_usec)/1.0e6 + + endwtime.tv_sec - startwtime.tv_sec); + + printf("Worker %d started iteration %2d at time %4.1f\n", + whoamI, i, time); + +#endif + + IT * btop = A.top [i]; // get the pointer to this block row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); /* row block id (Row High Index) */ + LHS * suby = &y[rhi]; /* get y for row block id */ + +#ifdef PRINT_PAR + statsFile << i; +#endif + + /* check whether to parallelize the j or not */ + if( btop[A.nbc] - btop[0] > /* number of non-zeros in block row */ + std::max( static_cast(BALANCETH * rowave), /* 2 * rowave {default} */ + static_cast(BREAKEVEN * ysize) ) ) /* 4 * ysize {default} */ + { + IT thsh = BREAKEVEN * ysize; /* 4 * ysize {default} */ + vector chunks; /* generate vectors of 'chunks' */ + chunks.push_back(btop); +#ifdef PRINT_PAR + statsFile << ",0"; +#endif + for(IT j =0; j < A.nbc; ) /* loop through block columns */ + { + IT count 
= btop[j+1] - btop[j]; /* nnz in block j */ + if(count < thsh && j < A.nbc) + { + /* concatenate until count threshold */ + while(count < thsh && j < A.nbc) + { + ++j; count += btop[j+1] - btop[j]; /* increment first: modifying and reading j in one expression is unsequenced */ + } +#ifdef PRINT_PAR + statsFile << "," << j; +#endif + // push, but exclude the block that caused the + // overflow + chunks.push_back(btop+j); + } + else + { +#ifdef PRINT_PAR + statsFile << "," << j+1; +#endif + // don't exclude the overflow block if it is the + // only block in that chunk + chunks.push_back(btop+(++j)); + } + } + // In std::vector, the elements are stored contiguously + // so that we can treat &chunks[0] as an array of + // pointers to IT w/out literally copying it to IT** + if(i==(A.nbr-1)) // last iteration + { + A.template BMult(&chunks[0], 0, chunks.size()-1, + x, suby, A.rowsize() - ysize*i); + } + else + { + // chunksize-1 because we always insert a dummy chunk + A.template BMult(&chunks[0], 0, chunks.size()-1, + x, suby, ysize); + } +#ifdef PRINT_PAR + statsFile << ",0 " << A.nbc << endl; +#endif + } + else /* no parallelism among j blocks */ + { +#ifdef PRINT_PAR + statsFile << ",0," << A.nbc << endl; +#endif + +#ifdef BREAK_NBC /* break columns in 2 sequential blocks */ + + A.template SubSpMV(btop, + jb, min(jb + ((int)ceil(A.nbc/BREAK_NBC)), A.nbc), + x, suby); +#else + + A.template SubSpMV(btop, 0, A.nbc, x, suby); + +#endif + + } + } + +#ifdef BREAK_NBC /* break columns in 2 sequential blocks */ + + } /* for (jb, A.nbc) */ + +#endif + +#ifdef BREAK_NBR /* break rows */ + + } /* for (ib, A.nbr) */ + +#endif + + } + else{ + +#ifdef BREAK_NBC /* --- IFDEF break columns in 2 sequential blocks */ + + for (IT jb = 0; jb < A.nbc; jb += ((int)ceil(A.nbc/BREAK_NBC)) ) { + +#endif /* --- ENDIF */ + + for (IT i = 0 ; i < A.nbr ; ++i) // for all block rows of A + { + IT * btop = A.top [i]; // get the pointer to this block row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); + LHS * suby = &y[rhi]; + +#ifdef BREAK_NBC /* break columns in 2 sequential
blocks */ + + A.template SubSpMV(btop, + jb, min(jb + ((int)ceil(A.nbc/BREAK_NBC)), A.nbc), + x, suby); +#else + + A.template SubSpMV(btop, 0, A.nbc, x, suby); + +#endif + } + +#ifdef BREAK_NBC /* --- IFDEF close loop over block of columns */ + } +#endif /* --- ENDIF */ + + } +} + + +/** + * Operation y = A*x+y on a semiring SR + * A: a general CSB matrix (no specialization on booleans is necessary as this loop is independent of numerical values) + * x: a column vector or a set of column vectors (i.e. array of structs, array of std:arrays, etc)) + * SR::multiply() handles the multiple rhs and type promotions, etc. + **/ +template +void bicsb_gespmv_tar (const BiCsb & A, const RHS * __restrict x, LHS * __restrict y) +{ + IT ysize = A.lowrowmask + 1; // size of the output subarray (per block row - except the last) + + + +#ifdef PRINT_PAR /* log: parallelization strategies */ + printf("ENTERED logfile creation\n"); + /* prepare filename */ + char fileName[ 180 ]; + sprintf( fileName, "csb-stat_n-%d_m-%d_blk-%d_nbr-%d_nbc-%d_nnz-%d_alpha-%d_gamma-%0.2f.log", + A.n, A.m, ysize, A.nbr, A.nbc, A.numnonzeros(), BREAKEVEN, BALANCETH ); + + printf("FILENAME %s\n", fileName); + /* open output stream */ + ofstream statsFile( fileName ); + + /* print matrix size and block size */ + statsFile << A.m << "," << A.n << "," << ysize << endl; +#endif + + if(A.isPar() ) + { + /* row average */ + float rowave = static_cast(A.numnonzeros()) / (A.nbr-1); /* why -1? 
*/ + +#ifdef BREAK_NBR /* break cilk_for rows to enforce consecutive writing */ + + int np = __cilkrts_get_nworkers(); + int blockSize = BREAK_NBR * np; + for(IT ib = 0; ib < A.nbr; ib += blockSize ) { +#endif + +#ifdef BREAK_NBC /* break columns in 2 sequential blocks */ + + for (IT jb = 0; jb < A.nbc; jb += ((int)ceil(A.nbc/BREAK_NBC)) ) { + +#endif + +#ifdef PRINT_PAR + for (IT i = 0 ; i < A.nbr ; ++i) // for all block rows of A +#else + + #ifdef SHOW_TIMESTAMP /* show worker timestamp */ + struct timeval startwtime; + gettimeofday (&startwtime, NULL); + #endif + + #ifdef BREAK_NBR /* break cilk_for rows to enforce consecutive writing */ + /* cilk for */ + #pragma cilk grainsize = BREAK_NBR + cilk_for (IT i = ib ; i < min(ib + blockSize, A.nbr) ; ++i) + #else + cilk_for (IT i = 0 ; i < A.nbr ; ++i) // for all block rows of A + #endif + +#endif + { + +#ifdef SHOW_TIMESTAMP /* show worker timestamp */ + + int whoamI = __cilkrts_get_worker_number(); + struct timeval endwtime; + gettimeofday (&endwtime, NULL); + + double time = (double)((endwtime.tv_usec - startwtime.tv_usec)/1.0e6 + + endwtime.tv_sec - startwtime.tv_sec); + + printf("Worker %d started iteration %2d at time %4.1f\n", + whoamI, i, time); + +#endif + + IT * btop = A.top [i]; // get the pointer to this block row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); /* row block id (Row High Index) */ + LHS * suby = &y[rhi]; /* get y for row block id */ + +#ifdef PRINT_PAR + statsFile << i; +#endif + + /* check whether to parallelize the j or not */ + if( btop[A.nbc] - btop[0] > /* number of non-zeros in block row */ + std::max( static_cast(BALANCETH * rowave), /* 2 * rowave {default} */ + static_cast(BREAKEVEN * ysize) ) ) /* 4 * ysize {default} */ + { + IT thsh = BREAKEVEN * ysize; /* 4 * ysize {default} */ + vector chunks; /* generate vectors of 'chunks' */ + chunks.push_back(btop); +#ifdef PRINT_PAR + statsFile << ",0"; +#endif + for(IT j =0; j < A.nbc; ) /* loop through block columns */ + { + IT count 
= btop[j+1] - btop[j]; /* nnz in block j */ + if(count < thsh && j < A.nbc) + { + /* concatenate until count threshold */ + while(count < thsh && j < A.nbc) + { + ++j; count += btop[j+1] - btop[j]; /* increment first: modifying and reading j in one expression is unsequenced */ + } +#ifdef PRINT_PAR + statsFile << "," << j; +#endif + // push, but exclude the block that caused the + // overflow + chunks.push_back(btop+j); + } + else + { +#ifdef PRINT_PAR + statsFile << "," << j+1; +#endif + // don't exclude the overflow block if it is the + // only block in that chunk + chunks.push_back(btop+(++j)); + } + } + // In std::vector, the elements are stored contiguously + // so that we can treat &chunks[0] as an array of + // pointers to IT w/out literally copying it to IT** + if(i==(A.nbr-1)) // last iteration + { + A.template BMult(&chunks[0], 0, chunks.size()-1, + x, suby, A.rowsize() - ysize*i); + } + else + { + // chunksize-1 because we always insert a dummy chunk + A.template BMult(&chunks[0], 0, chunks.size()-1, + x, suby, ysize); + } +#ifdef PRINT_PAR + statsFile << ",0 " << A.nbc << endl; +#endif + } + else /* no parallelism among j blocks */ + { +#ifdef PRINT_PAR + statsFile << ",0," << A.nbc << endl; +#endif + +#ifdef BREAK_NBC /* break columns in 2 sequential blocks */ + + A.template SubSpMV_tar(btop, + jb, min(jb + ((int)ceil(A.nbc/BREAK_NBC)), A.nbc), + x, suby); +#else + + A.template SubSpMV_tar(btop, 0, A.nbc, x, suby); + +#endif + + } + } + +#ifdef BREAK_NBC /* break columns in 2 sequential blocks */ + + } /* for (jb, A.nbc) */ + +#endif + +#ifdef BREAK_NBR /* break rows */ + + } /* for (ib, A.nbr) */ + +#endif + + } + else{ + +#ifdef BREAK_NBC /* --- IFDEF break columns in 2 sequential blocks */ + + for (IT jb = 0; jb < A.nbc; jb += ((int)ceil(A.nbc/BREAK_NBC)) ) { + +#endif /* --- ENDIF */ + + for (IT i = 0 ; i < A.nbr ; ++i) // for all block rows of A + { + IT * btop = A.top [i]; // get the pointer to this block row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); + LHS * suby = &y[rhi]; + +#ifdef BREAK_NBC /* break columns in 2
sequential blocks */ + + A.template SubSpMV_tar(btop, + jb, min(jb + ((int)ceil(A.nbc/BREAK_NBC)), A.nbc), + x, suby); +#else + + A.template SubSpMV_tar(btop, 0, A.nbc, x, suby); + +#endif + } + +#ifdef BREAK_NBC /* --- IFDEF close loop over block of columns */ + } +#endif /* --- ENDIF */ + + } +} + + + +/** + * Operation y = (A^t)*x+y on a semiring SR + * A: a general CSB matrix (no specialization on booleans is necessary as this loop is independent of numerical values) + * x: a column vector or a set of column vectors (i.e. array of structs, array of std::arrays, etc.) + * SR::multiply() handles the multiple rhs and type promotions, etc. + */ +template +void bicsb_gespmvt (const BiCsb & A, const RHS * __restrict x, LHS * __restrict y) +{ + IT ysize = A.lowcolmask + 1; // size of the output subarray (per block column - except the last) + + // A.top (nbr=3, nbc=4): + // 0 5 17 21 24 + // 24 28 33 39 53 + // 53 60 61 70 72 + + vector colsums(A.nbc,0); + cilk_for(IT j=0; j(A.numnonzeros()) / (A.nbc-1); + cilk_for (IT j = 0 ; j < A.nbc ; ++j) // for all block columns of A + { + IT rhi = ((j << A.collowbits) & A.highcolmask); + LHS * suby = &y[rhi]; + typedef typename std::tuple IntTriple; + typedef typename std::vector< IntTriple > ChunkType; + vector< ChunkType * > chunks; // we will have to manage + + // the second condition is == natural == because if colsums[j] < BREAKEVEN * ysize, + // then the whole row will be a single chunk of sparse blocks that runs as a single strand + if( colsums[j] > BALANCETH * colave && colsums[j] > BREAKEVEN * ysize) + { + IT thsh = BREAKEVEN * ysize; + // each chunk is represented by a vector of blocks + // each block is represented by its {begin, end} pointers to bot array AND its -row- block id (within the block column) + // get<0>(tuple): begin pointer to bot, get<1>(tuple): end pointer to bot, get<2>(tuple): row block id + + for(IT i =0; i < A.nbr; ++i ) + { + ChunkType * chunk = new ChunkType(); + chunk->emplace_back( IntTriple
(A.top[i][j], A.top[i][j+1], i)); + IT count = A.top[i][j+1] - A.top[i][j]; + + if(count < thsh) + { + // while adding the next (i+1) element wouldn't exceed the chunk limit + while(i < A.nbr-1 && (count+A.top[i+1][j+1] - A.top[i+1][j]) < thsh ) + { + i++; // move to next one before push + if(A.top[i][j+1] - A.top[i][j] > 0) + { + chunk->emplace_back( IntTriple (A.top[i][j], A.top[i][j+1], i)); + count += A.top[i][j+1] - A.top[i][j]; + } + } + // push, but exclude the block that caused the overflow + chunks.push_back(chunk); // emplace_back wouldn't buy anything for simple structures like pointers + } + else // already above the limit by itself => single dense block + { + chunks.push_back(chunk); + } + } + if(j==(A.nbc-1)) // last iteration + { + A.template BTransMult(chunks, 0, chunks.size(), x, suby, A.colsize() - ysize*j); + } + else + { + A.template BTransMult(chunks, 0, chunks.size(), x, suby, ysize); // chunksize (no -1) as there is no dummy chunk + } + + // call the destructor of each chunk vector + for_each(chunks.begin(), chunks.end(), [](ChunkType * pPtr){ delete pPtr; }); + } + else + { + A.template SubSpMVTrans(j, 0, A.nbr, x, suby); + } + } + } + else + { + cilk_for (IT j =0; j< A.nbc; ++j) // for all block columns of A + { + IT rhi = ((j << A.collowbits) & A.highcolmask); + LHS * suby = &y[rhi]; + + A.template SubSpMVTrans(j, 0, A.nbr, x, suby); + } + } +} + +// SpMV with symmetric CSB +// No semiring or type promotion support yet +template +void csbsym_gespmv (const CsbSym & A, const NT * __restrict x, NT * __restrict y) +{ + #pragma isat marker SM2_begin + //if( A.isPar() ) + //{ + #pragma isat tuning name(tune_tempy) scope(SM1_begin, SM1_end) measure(SM2_begin, SM2_end) variable(SPAWNS, range(1,6)) variable(NDIAGS, range(1,11)) search(dependent) + #pragma isat marker SM1_begin + #define SPAWNS 1 // how many you do in parallel at a time + #define NDIAGS 3 // how many you do in total + NT ** t_y = new NT* [SPAWNS]; + t_y[0] = y; // alias t_y[0] to y 
+ for(int i=1; i 1) + { + A.MultDiag(t_y[0], x, j*SPAWNS); // maps to A.MultMainDiag(y,x) if j = 0 + --remdiags; // decrease remaining diagonals + int i = 1; + for(; (i < SPAWNS) && (remdiags > 1) ; ++i) + { + cilk_spawn A.MultDiag(t_y[i], x, j*SPAWNS + i); + --remdiags; + } + if(i < SPAWNS && remdiags == 1) + { + cilk_spawn A.MultAddAtomics(t_y[i], x, j*SPAWNS + i); + --remdiags; + } + cilk_sync; + } + else if(remdiags == 1) + { + A.MultAddAtomics(t_y[0], x, j*SPAWNS); // will only happen if remdiags is 1 when the outer loop started + --remdiags; + } + } + + cilk_for(int j=0; j< A.n; ++j) + { + for(int i=1; i +void bmsym_gespmv (const BmSym & A, const NT * __restrict x, NT * __restrict y) +{ + if( A.isPar() ) + { + NT * y1 = new NT[A.n](); + NT * y2 = new NT[A.n](); + NT * y3; + + IT size0 = A.nrbsum(0); + IT size1 = A.nrbsum(1); + IT size2 = A.nrbsum(2); + + if(size0+size1+size2 != A.nrb) + { + y3 = new NT[A.n](); + cilk_spawn A.MultAddAtomics(y3,x,3); + } + + cilk_spawn A.MultDiag(y1,x,1); + cilk_spawn A.MultDiag(y2,x,2); + A.MultMainDiag(y, x); + + cilk_sync; + + if(size0+size1+size2 != A.nrb) + { + cilk_for(int i=0; i< A.n; ++i) + { + y[i] += y1[i] + y2[i] + y3[i]; + } + delete [] y3; + } + else + { + cilk_for(int i=0; i< A.n; ++i) + { + y[i] += y1[i] + y2[i]; + } + } + + delete [] y1; + delete [] y2; + } + else + { + A.SeqSpMV(x, y); + } +} + +// Works on any CSB-like data structure +template <class CSB> +float RowImbalance(const CSB & A) +{ + // get the average without the last left-over blockrow + float rowave = static_cast<float>(*(A.top[A.nbr-1])) / (A.nbr-1); + unsigned rowmax = 0; + for(size_t i=1; i< A.nbr; ++i) + { + rowmax = std::max(rowmax, *(A.top[i]) - *(A.top[i-1])); + } + return static_cast<float>(rowmax) / rowave; +} + + +template +float ColImbalance(const BiCsb & A) +{ + vector sum(A.nbc-1); + cilk_for(IT j=1; j< A.nbc; ++j) // ignore the last block column + { + IT * blocknnz = new IT[A.nbr]; // nnz per block responsible + for(IT i=0; i(A.nbc-1); + vector::iterator
colmax = std::max_element(sum.begin(), sum.end()); + return (*colmax) / colave; +} + +///////////////////////////////// +// t-SNE kernel Implementation // +// September 2017 // +// by Kostas Mylonakis // +///////////////////////////////// + +/** + * Operation tsne kernel + * A: a general CSB matrix + * x: a column vector or a set of column vectors (i.e. array of structs, array of std:arrays, etc)) + * SR::multiply() handles the multiple rhs and type promotions, etc. + **/ +template + void bicsb_tsne (const BiCsb & A, const RHS * __restrict x, LHS * __restrict y) +{ + IT ysize = A.lowrowmask + 1; // size of the output + // subarray (per block + // row - except the + // last) + if(A.isPar() ){ + int workers = __cilkrts_get_nworkers(); + + cilk_for (IT i = 0; i < A.nbr ; i++) // for all block rows + { + IT * btop = A.top [i]; // get the + // pointer to + // this block + // row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); + LHS * suby = &y[3*rhi]; + + A.template SubtSNEkernel(btop, 0, A.nbc, x, suby, rhi); + + } + + + }else{ + + for (IT i = 0 ; i < A.nbr ; ++i) // for all block rows of A + { + IT * btop = A.top [i]; // get the pointer to this block row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); + LHS * suby = &y[3*rhi]; + + + A.template SubtSNEkernel(btop, 0, A.nbc, x, suby, rhi); + + } + + } +} + +/** + * Operation tsne kernel + * A: a general CSB matrix + * x: a column vector or a set of column vectors (i.e. array of structs, array of std:arrays, etc)) + * SR::multiply() handles the multiple rhs and type promotions, etc. 
+ **/ +template + void bicsb_tsne1D (const BiCsb & A, const RHS * __restrict x, LHS * __restrict y) +{ + IT ysize = A.lowrowmask + 1; // size of the output + // subarray (per block + // row - except the + // last) + if(A.isPar() ){ + int workers = __cilkrts_get_nworkers(); + + cilk_for (IT i = 0; i < A.nbr ; i++) // for all block rows + { + IT * btop = A.top [i]; // get the + // pointer to + // this block + // row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); + LHS * suby = &y[rhi]; + + A.template SubtSNEkernel1D(btop, 0, A.nbc, x, suby, rhi); + + } + + + }else{ + + for (IT i = 0 ; i < A.nbr ; ++i) // for all block rows of A + { + IT * btop = A.top [i]; // get the pointer to this block row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); + LHS * suby = &y[rhi]; + + + A.template SubtSNEkernel1D(btop, 0, A.nbc, x, suby, rhi); + + } + + } +} + +/** + * Operation tsne kernel + * A: a general CSB matrix + * x: a column vector or a set of column vectors (i.e. array of structs, array of std:arrays, etc)) + * SR::multiply() handles the multiple rhs and type promotions, etc. 
+ **/ +template + void bicsb_tsne2D (const BiCsb & A, const RHS * __restrict x, LHS * __restrict y) +{ + IT ysize = A.lowrowmask + 1; // size of the output + // subarray (per block + // row - except the + // last) + if(A.isPar() ){ + int workers = __cilkrts_get_nworkers(); + + cilk_for (IT i = 0; i < A.nbr ; i++) // for all block rows + { + IT * btop = A.top [i]; // get the + // pointer to + // this block + // row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); + LHS * suby = &y[2*rhi]; + + A.template SubtSNEkernel2D(btop, 0, A.nbc, x, suby, rhi); + + } + + + }else{ + + for (IT i = 0 ; i < A.nbr ; ++i) // for all block rows of A + { + IT * btop = A.top [i]; // get the pointer to this block row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); + LHS * suby = &y[2*rhi]; + + + A.template SubtSNEkernel2D(btop, 0, A.nbc, x, suby, rhi); + + } + + } +} + + + +/** + * Operation tsne kernel + * A: a general CSB matrix + * x: a column vector or a set of column vectors (i.e. array of structs, array of std:arrays, etc)) + * SR::multiply() handles the multiple rhs and type promotions, etc. 
+ **/ +template + void bicsb_tsne4D (const BiCsb & A, const RHS * __restrict x, LHS * __restrict y) +{ + IT ysize = A.lowrowmask + 1; // size of the output + // subarray (per block + // row - except the + // last) + if(A.isPar() ){ + int workers = __cilkrts_get_nworkers(); + + cilk_for (IT i = 0; i < A.nbr ; i++) // for all block rows + { + IT * btop = A.top [i]; // get the + // pointer to + // this block + // row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); + LHS * suby = &y[4*rhi]; + + A.template SubtSNEkernel4D(btop, 0, A.nbc, x, suby, rhi); + + } + + + }else{ + + for (IT i = 0 ; i < A.nbr ; ++i) // for all block rows of A + { + IT * btop = A.top [i]; // get the pointer to this block row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); + LHS * suby = &y[4*rhi]; + + + A.template SubtSNEkernel4D(btop, 0, A.nbc, x, suby, rhi); + + } + + } +} + + +/** + * Operation tsne kernel + * A: a general CSB matrix + * x: a column vector or a set of column vectors (i.e. array of structs, array of std:arrays, etc)) + * SR::multiply() handles the multiple rhs and type promotions, etc. 
+ **/ +template + void bicsb_tsne_tar (const BiCsb & A, const RHS * __restrict x, LHS * __restrict y) +{ + IT ysize = A.lowrowmask + 1; // size of the output + // subarray (per block + // row - except the + // last) + if(A.isPar() ){ + int workers = __cilkrts_get_nworkers(); + +#ifdef BREAK_NBC /* --- IFDEF break columns in 2 sequential blocks */ + + for (IT jb = 0; jb < A.nbc; jb += ((int)ceil(A.nbc/BREAK_NBC)) ) { + +#endif /* --- ENDIF */ + +#ifdef GRAIN_1 + + #pragma cilk grainsize = 1 + cilk_for (int thr = 0; thr < workers; thr++){ + for (IT i = thr ; i < A.nbr ; i+=workers) // for all block + // rows of A + +#else + cilk_for (IT i = 0; i < A.nbr ; i++) // for all block rows + +#endif + { + IT * btop = A.top [i]; // get the + // pointer to + // this block + // row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); + LHS * suby = &y[3*rhi]; + +#ifdef BREAK_NBC /* --- IFDEF break columns in 2 sequential blocks */ + + A.template SubtSNEkernel_tar(btop, + jb, min(jb + ((int)ceil(A.nbc/BREAK_NBC)), A.nbc), + x, suby, rhi); + +#else /* --- ELSE original CSB */ + + A.template SubtSNEkernel_tar(btop, 0, A.nbc, x, suby, rhi); + +#endif /* --- ENDIF */ + } +#ifdef GRAIN_1 + } +#endif + +#ifdef BREAK_NBC /* --- IFDEF close loop over block of columns */ + } +#endif /* --- ENDIF */ + + + }else{ + +#ifdef BREAK_NBC /* --- IFDEF break columns in 2 sequential blocks */ + + for (IT jb = 0; jb < A.nbc; jb += ((int)ceil(A.nbc/BREAK_NBC)) ) { + +#endif /* --- ENDIF */ + + for (IT i = 0 ; i < A.nbr ; ++i) // for all block rows of A + { + IT * btop = A.top [i]; // get the pointer to this block row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); + LHS * suby = &y[3*rhi]; + +#ifdef BREAK_NBC /* --- IFDEF break columns in 2 sequential blocks */ + + A.template SubtSNEkernel_tar(btop, + jb, min(jb + ((int)ceil(A.nbc/BREAK_NBC)), A.nbc), + x, suby, rhi); + +#else /* --- ELSE original CSB */ + + A.template SubtSNEkernel_tar(btop, 0, A.nbc, x, suby, rhi); + +#endif /* --- ENDIF */ + + } + 
+#ifdef BREAK_NBC /* --- IFDEF close loop over block of columns */ + } +#endif /* --- ENDIF */ + + } +} + + + +/*******************************************/ +/* Implementation for multiple CSB objects */ +/*******************************************/ + +template + void bicsb_tsne (const BiCsb & A, + const RHS * __restrict x_row, + const RHS * __restrict x_col, + LHS * __restrict y) +{ + IT ysize = A.lowrowmask + 1; // size of the output + // subarray (per block + // row - except the + // last) + if(A.isPar() ){ + int workers = __cilkrts_get_nworkers(); + +#ifdef GRAIN_1 + + #pragma cilk grainsize = 1 + cilk_for (int thr = 0; thr < workers; thr++){ + for (IT i = thr ; i < A.nbr ; i+=workers) // for all block + // rows of A +#else + + cilk_for (IT i = 0; i < A.nbr ; i++) // for all block rows + +#endif + { + IT * btop = A.top [i]; // get the + // pointer to + // this block + // row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); + LHS * suby = &y[3*rhi]; + A.template SubtSNEkernel(btop, 0, A.nbc, x_row, x_col, suby, rhi); + } +#ifdef GRAIN_1 + } +#endif + + }else{ + + for (IT i = 0 ; i < A.nbr ; ++i) // for all block rows of A + { + IT * btop = A.top [i]; // get the pointer to this block row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); + LHS * suby = &y[3*rhi]; + A.template SubtSNEkernel(btop, 0, A.nbc, x_row, x_col, suby, rhi); + } + } +} + + +/** + * Operation tsne cost + * A: a general CSB matrix + * x: a column vector or a set of column vectors (i.e. array of structs, array of std:arrays, etc)) + * SR::multiply() handles the multiple rhs and type promotions, etc. 
+ * alpha: Scaling parameter + * zeta: Normalization parameter + **/ +template + void bicsb_tsne_cost (const BiCsb & A, + const RHS * __restrict x, + LHS * __restrict y, + int dim, double alpha, double zeta) +{ + IT ysize = A.lowrowmask + 1; // size of the output + // subarray (per block + // row - except the + // last) + if(A.isPar() ){ + int workers = __cilkrts_get_nworkers(); + + + cilk_for (IT i = 0; i < A.nbr ; i++) // for all block rows + { + IT * btop = A.top [i]; // get the + // pointer to + // this block + // row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); + LHS * suby = &y[rhi]; + + + A.template SubtSNEcost(btop, 0, A.nbc, x, suby, rhi, dim, alpha, zeta); + + } + + + + }else{ + + + for (IT i = 0 ; i < A.nbr ; ++i) // for all block rows of A + { + IT * btop = A.top [i]; // get the pointer to this block row + IT rhi = ((i << A.rowlowbits) & A.highrowmask); + LHS * suby = &y[rhi]; + + + A.template SubtSNEcost(btop, 0, A.nbc, x, suby, rhi, dim, alpha, zeta); + + } + + } +} + + + +#endif + diff --git a/csb/matfile.hpp b/csb/matfile.hpp new file mode 100644 index 0000000..2e66c29 --- /dev/null +++ b/csb/matfile.hpp @@ -0,0 +1,23 @@ +#ifndef _H_MATFILE +#define _H_MATFILE + +#define CODE_VERSION_S1 0 +#define CODE_VERSION_S3 1 +#define CODE_VERSION_NS3 2 + +/** + * Read kNN graph data from MAT-file. 
+ */ +int readMATdata( cs **C, int **perm, + double **lhsgold, double **rhsgold, + double *perplexity, + const char* basepath, const char *dataset, + const long long datasize, const long long knn, + const char* permName, const int VERSION, + const int flagSym ); + + + + + +#endif diff --git a/csb/matfileio.cpp b/csb/matfileio.cpp new file mode 100644 index 0000000..0610129 --- /dev/null +++ b/csb/matfileio.cpp @@ -0,0 +1,299 @@ +#include +#include +#include +#include "mat.h" + +#include "cs.hpp" + +#include "matfile.hpp" + +void* safe_malloc(size_t n, char *name) { + void *p = malloc(n); + if (p == NULL) { + fprintf(stderr, "Fatal: failed to allocate %zd bytes for %s.\n", + n, name); + exit(1); + } + return p; +} + +cs *calcSimMat( cs *C, const double perplexity ) { + + // get transpose + cs *Ct = cs_transpose( C, -1 ); + + // get symmetrized kNN graph + cs *Csym = cs_add( C, Ct, 0.5, 0.5); + + cs_spfree( Ct ); + + return Csym; + +} + +cs *generateSparseMatrix( int *row, int *col, double *val, + int datasize, int knn, int flagSym ){ + + int nnz = datasize * knn; + + // prepare sparse matrix using suite-sparse + cs *C = cs_spalloc (datasize, datasize, nnz, 1, 1) ; + + printf("FILLING CS MATRIX\n"); + + for (int i=0; i matSize[1]){ + printf("k=%d too large!\n", knn); + return(1); + } + + mxPermList = matGetVariable( pmat, permStructName ); + printf("Name struct: %s\n", permStructName); + + if ( mxPermList == NULL ) { + + printf("Permutation struct %s was not found!\n", permStructName); + mxPerm = NULL; + + } else { + + printf("Perm name: %s", permName); + mxPerm = mxGetField( mxPermList, 0, permName ); + printf(" read\n"); + + } + + if (mxPerm == NULL){ + + printf("Permutation %s was not found!\n", permName); + + *perm = NULL; + + } else { + + printf("Permutation %s ", permName); + + perm_ptr = (int *) mxGetData( mxPerm ); + *perm = (int *) safe_malloc( datasize * sizeof(int), "perm" ); + for (int i = 0; i < datasize; i++){ + perm[0][ i ] = perm_ptr[ i ] - 1; + } + 
mxDestroyArray( mxPermList ); + + printf("copied\n"); + } + + printf("Sparse matrix "); + // pass data to pointers + kidx_ptr = (int *) mxGetData( kidx ); + kdist_ptr = (double *) mxGetData( kdist ); + + // printf( "KIDX size: [%dx%d]\n", matSize[0], matSize[1] ); + + printf( "Size to allocate %zu (%d - %d - %d)\n", + datasize * knn * sizeof(int), + datasize, knn, sizeof(int) ); + + row = (int *) safe_malloc( datasize * knn * sizeof(int) , "row" ); + col = (int *) safe_malloc( datasize * knn * sizeof(int) , "col" ); + val = (double *) safe_malloc( datasize * knn * sizeof(double), "val" ); + + for (int j = 0; j < knn; j++) + for (int i = 0; i < datasize; i++){ + row[ i + j*datasize ] = i; + col[ i + j*datasize ] = kidx_ptr[ i + j*datasize ] - 1; + } + + memcpy( val, kdist_ptr, datasize*knn*sizeof(double) ); + + printf("copied\n"); + +#ifdef VERIFY + // -------------------------------------------------- + // Check for existence of ground truth + + int dim; + + // prepare buffer for variable LHS + char lhsName[100]; + + switch (VERSION) { + + case CODE_VERSION_S1: + sprintf( lhsName, "lhs1_test_s_k%d", knn ); + dim = 1; + break; + + case CODE_VERSION_S3: + sprintf( lhsName, "lhs3_test_s_k%d", knn ); + dim = 3; + break; + + case CODE_VERSION_NS3: + sprintf( lhsName, "lhs3_test_ns_k%d", knn ); + dim = 3; + break; + + default: + printf("Unknown version %d\n", VERSION); + break; + + } + + // -------------------- RHS + + mxArray *mx_rhsgold = matGetVariable( pmat, "rhs3_test" ); + + if (mx_rhsgold == NULL) { + printf( "Ground truth data unavailable\n" ); + rhsgold[0] = NULL; + } else { + double *mx_rhsgold_data = (double *) mxGetData( mx_rhsgold ); + rhsgold[0] = (double *) safe_malloc( datasize * dim * sizeof(double), + "rhs" ); + + for (int i = 0; i < datasize; i++) + for (int j = 0; j < dim; j++) + rhsgold[0][ j + i*dim ] = mx_rhsgold_data[ i + j*datasize ]; + + mxDestroyArray( mx_rhsgold ); + } + + // -------------------- LHS + + mxArray *mx_lhsgold = matGetVariable( 
pmat, lhsName ); + + if (mx_lhsgold == NULL) { + printf( "Ground truth data unavailable\n" ); + lhsgold[0] = NULL; + } else { + double *mx_lhsgold_data = (double *) mxGetData( mx_lhsgold ); + lhsgold[0] = (double *) safe_malloc( datasize * dim * sizeof(double), + "lhs" ); + + for (int i = 0; i < datasize; i++) + for (int j = 0; j < dim; j++) + lhsgold[0][ j + i*dim ] = mx_lhsgold_data[ i + j*datasize ]; + + mxDestroyArray( mx_lhsgold ); + } + + // -------------------- check if perplexity exists + mxArray *mx_perpl = matGetVariable( pmat, "perplexity" ); + + if (mx_perpl == NULL){ + printf( "Unknown perplexity\n" ); + perplexity[0] = 0; + } else { + perplexity[0] = mxGetScalar( mx_perpl ); + printf( "Perplexity = %4.2f\n", perplexity[0] ); + mxDestroyArray( mx_perpl ); + } + +#else + rhsgold[0] = NULL; + lhsgold[0] = NULL; + perplexity[0] = 0; + printf( "Not verifying results\n" ); +#endif + + // build sparse matrix + C[0] = generateSparseMatrix( row, col, val, datasize, knn, flagSym ); + + // free unnecessary variables + free( row ); + free( val ); + free( col ); + printf("Buffers freed\n"); + + // destroy array read + mxDestroyArray( kidx ); + mxDestroyArray( kdist ); + + // close MAT-file -- otherwise error + if (matClose(pmat) != 0) { + printf("Error closing file %s\n", matName); + return(1); + } else { + printf("Closed file %s\n", matName); + } + + return 0; + +} diff --git a/csb/matfileio.hpp b/csb/matfileio.hpp new file mode 100644 index 0000000..6ac7948 --- /dev/null +++ b/csb/matfileio.hpp @@ -0,0 +1,27 @@ +#ifndef _H_MATFILEIO +#define _H_MATFILEIO + +#define CODE_VERSION_S1 0 +#define CODE_VERSION_S3 1 +#define CODE_VERSION_NS3 2 + +/** + * Read kNN graph data from MAT-file.
+ */ +CS_INT readMATdata( cs **C, CS_INT **perm, + double **lhss1gold, + double **lhss3gold, + double **lhsns3gold, + double **rhsgold1, + double **rhsgold3, + double *perplexity, + const char* basepath, const char *dataset, + const CS_INT datasize, const CS_INT knn, + const char* permName, + const CS_INT flagSym ); + + + + + +#endif diff --git a/csb/matfileio_all.cpp b/csb/matfileio_all.cpp new file mode 100644 index 0000000..392544d --- /dev/null +++ b/csb/matfileio_all.cpp @@ -0,0 +1,349 @@ +#include +#include +#include +#include "mat.h" +#include + +#include "cs.hpp" + +#include "matfileio.hpp" + +void* safe_malloc(size_t n, char *name) { + void *p = malloc(n); + if (p == NULL) { + fprintf(stderr, "Fatal: failed to allocate %zd bytes for %s.\n", + n, name); + exit(1); + } + return p; +} + +cs *calcSimMat( cs *C, const double perplexity ) { + + // get transpose + cs *Ct = cs_transpose( C, -1 ); + + // get symmetrized kNN graph + cs *Csym = cs_add( C, Ct, 0.5, 0.5); + + cs_spfree( Ct ); + + return Csym; + +} + +cs *generateSparseMatrix( CS_INT *row, CS_INT *col, double *val, + CS_INT datasize, CS_INT knn, CS_INT flagSym ){ + + CS_INT nnz = ( (CS_INT) datasize ) * ( (CS_INT) knn ); + + // prepare sparse matrix using suite-sparse + cs *C = cs_spalloc( (CS_INT) datasize, + (CS_INT) datasize, + nnz, + 1, 1) ; + + std::cout << "FILLING CS MATRIX\n" << std::endl; + + for (CS_INT i=0; i matSize[1]){ + printf("k=%d too large!\n", knn); + return(1); + } + + mxPermList = matGetVariable( pmat, permStructName ); + printf("Name struct: %s\n", permStructName); + + if ( mxPermList == NULL ) { + + printf("Permutation struct %s was not found!\n", permStructName); + mxPerm = NULL; + + } else { + + printf("Perm name: %s", permName); + mxPerm = mxGetField( mxPermList, 0, permName ); + printf(" read\n"); + + } + + if (mxPerm == NULL){ + + printf("Permutation %s was not found!\n", permName); + + *perm = NULL; + + } else { + + printf("Permutation %s ", permName); + + perm_ptr = (int *) 
mxGetData( mxPerm ); + *perm = (CS_INT *) safe_malloc( datasize * sizeof(CS_INT), "perm" ); + for (CS_INT i = 0; i < datasize; i++){ + perm[0][ i ] = (CS_INT) perm_ptr[ i ] - 1; + } + mxDestroyArray( mxPermList ); + + printf("copied\n"); + } + + printf("Sparse matrix "); + // pass data to pointers + kidx_ptr = (int *) mxGetData( kidx ); + kdist_ptr = (double *) mxGetData( kdist ); + + // printf( "KIDX size: [%dx%d]\n", matSize[0], matSize[1] ); + + printf( "Size to allocate %zu (%d - %d - %d)\n", + datasize * knn * sizeof(CS_INT), + datasize, knn, sizeof(CS_INT) ); + + row = (CS_INT *) safe_malloc( datasize * knn * sizeof(CS_INT) , "row" ); + col = (CS_INT *) safe_malloc( datasize * knn * sizeof(CS_INT) , "col" ); + val = (double *) safe_malloc( datasize * knn * sizeof(double), "val" ); + + for (CS_INT j = 0; j < knn; j++) + for (CS_INT i = 0; i < datasize; i++){ + row[ i + j*datasize ] = i; + col[ i + j*datasize ] = (CS_INT) kidx_ptr[ i + j*datasize ] - 1; + } + + memcpy( val, kdist_ptr, datasize*knn*sizeof(double) ); + + printf("copied\n"); + +#ifdef VERIFY + // -------------------------------------------------- + // Check for existence of ground truth + + char lhsNameS1[100]; + char lhsNameS3[100]; + char lhsNameNS3[100]; + + sprintf( lhsNameS1, "lhs1_test_s_k%d", knn ); + + sprintf( lhsNameS3, "lhs3_test_s_k%d", knn ); + + sprintf( lhsNameNS3, "lhs3_test_ns_k%d", knn ); + + // -------------------- RHS + + mxArray *mx_rhsgold = matGetVariable( pmat, "rhs3_test" ); + + if (mx_rhsgold == NULL) { + printf( "RHS -- Ground truth data unavailable\n" ); + rhsgold1[0] = NULL; + rhsgold3[0] = NULL; + } else { + printf( "RHS -- " ); + double *mx_rhsgold_data = (double *) mxGetData( mx_rhsgold ); + rhsgold1[0] = (double *) safe_malloc( datasize * sizeof(double), + "rhs" ); + + rhsgold3[0] = (double *) safe_malloc( datasize * 3 * sizeof(double), + "rhs" ); + + for (CS_INT i = 0; i < datasize; i++) + for (CS_INT j = 0; j < 1; j++) + rhsgold1[0][ j + i ] = mx_rhsgold_data[ i 
+ j*datasize ]; + + for (CS_INT i = 0; i < datasize; i++) + for (CS_INT j = 0; j < 3; j++) + rhsgold3[0][ j + i*3 ] = mx_rhsgold_data[ i + j*datasize ]; + + mxDestroyArray( mx_rhsgold ); + printf( "READ\n"); + } + + // -------------------- LHS + + mxArray *mx_lhsgold; + + mx_lhsgold = matGetVariable( pmat, lhsNameS1 ); + + if (mx_lhsgold == NULL) { + printf( "LHS -- %s -- Ground truth data unavailable\n", lhsNameS1 ); + lhss1gold[0] = NULL; + } else { + printf( "LHS -- %s -- ", lhsNameS1 ); + double *mx_lhsgold_data = (double *) mxGetData( mx_lhsgold ); + lhss1gold[0] = (double *) safe_malloc( datasize * 1 * sizeof(double), + "lhs" ); + + for (CS_INT i = 0; i < datasize; i++) + for (CS_INT j = 0; j < 1; j++) + lhss1gold[0][ j + i ] = mx_lhsgold_data[ i + j*datasize ]; + + mxDestroyArray( mx_lhsgold ); + printf( "READ\n"); + } + + mx_lhsgold = matGetVariable( pmat, lhsNameS3 ); + + if (mx_lhsgold == NULL) { + printf( "LHS -- %s -- Ground truth data unavailable\n", lhsNameS3 ); + lhss3gold[0] = NULL; + } else { + printf( "LHS -- %s -- ", lhsNameS3 ); + double *mx_lhsgold_data = (double *) mxGetData( mx_lhsgold ); + lhss3gold[0] = (double *) safe_malloc( datasize * 3 * sizeof(double), + "lhs" ); + + for (CS_INT i = 0; i < datasize; i++) + for (CS_INT j = 0; j < 3; j++) + lhss3gold[0][ j + i*3 ] = mx_lhsgold_data[ i + j*datasize ]; + + mxDestroyArray( mx_lhsgold ); + printf( "READ\n"); + } + + mx_lhsgold = matGetVariable( pmat, lhsNameNS3 ); + + if (mx_lhsgold == NULL) { + printf( "LHS -- %s -- Ground truth data unavailable\n", lhsNameNS3 ); + lhsns3gold[0] = NULL; + } else { + printf( "LHS -- %s -- ", lhsNameNS3 ); + double *mx_lhsgold_data = (double *) mxGetData( mx_lhsgold ); + lhsns3gold[0] = (double *) safe_malloc( datasize * 3 * sizeof(double), + "lhs" ); + + for (CS_INT i = 0; i < datasize; i++) + for (CS_INT j = 0; j < 3; j++) + lhsns3gold[0][ j + i*3 ] = mx_lhsgold_data[ i + j*datasize ]; + + mxDestroyArray( mx_lhsgold ); + printf( "READ\n"); + } + + + //
-------------------- check if perplexity exists + mxArray *mx_perpl = matGetVariable( pmat, "perplexity" ); + + if (mx_perpl == NULL){ + printf( "Unknown perplexity\n" ); + perplexity[0] = 0; + } else { + perplexity[0] = mxGetScalar( mx_perpl ); + printf( "Perplexity = %4.2f\n", perplexity[0] ); + mxDestroyArray( mx_perpl ); + } + +#else + lhss1gold[0] = NULL; + lhss3gold[0] = NULL; + lhsns3gold[0] = NULL; + rhsgold1[0] = NULL; + rhsgold3[0] = NULL; + perplexity[0] = 0; + printf( "Not verifying results\n" ); +#endif + + // build sparse matrix + C[0] = generateSparseMatrix( row, col, val, datasize, knn, flagSym ); + + // free unnecessary variables + free( row ); + free( val ); + free( col ); + printf("Buffers freed\n"); + + // destroy array read + mxDestroyArray( kidx ); + mxDestroyArray( kdist ); + + // close MAT-file -- otherwise error + if (matClose(pmat) != 0) { + printf("Error closing file %s\n", matName); + return(1); + } else { + printf("Closed file %s\n", matName); + } + + return 0; + +} diff --git a/csb/matmul.h b/csb/matmul.h new file mode 100644 index 0000000..1910b3c --- /dev/null +++ b/csb/matmul.h @@ -0,0 +1,34 @@ +/* author: Aydin Buluc (aydin@cs.ucsb.edu) ---------------------- */ +/* description: Helper class in order to get rid of copying */ +/* and temporaries during the y += A*x or A+=B*C calls */ +/* acknowledgment: This technique is described in Stroustrup, */ +/* The C++ Programming Language, 3rd Edition, */ +/* Section 22.4.7 [Temporaries, Copying and Loops] */ + + + +#ifndef _MAT_MUL_H +#define _MAT_MUL_H + + +template <typename OPT1, typename OPT2> +struct Matmul +{ + const OPT1 & op1; // just keeps references to objects + const OPT2 & op2; + + // Constructor + Matmul(const OPT1 & operand1, const OPT2 & operand2): op1(operand1), op2(operand2) { } + + // No need for operator BT() because we have the corresponding copy constructor + // and assignment operators to evaluate and return result ! 
+}; + +template +inline Matmul< OPT1,OPT2 > operator* (const OPT1 & operand1, const OPT2 & operand2) +{ + return Matmul< OPT1,OPT2 >(operand1,operand2); //! Just defer the multiplication +} + +#endif + diff --git a/csb/mortoncompare.h b/csb/mortoncompare.h new file mode 100644 index 0000000..48fe749 --- /dev/null +++ b/csb/mortoncompare.h @@ -0,0 +1,56 @@ +#ifndef _MORTONCOMPARE_H_ +#define _MORTONCOMPARE_H_ + + +template +class MortonCompare: public binary_function< ITYPE , ITYPE , bool > // (par1, par2, return_type) +{ +public: + MortonCompare() {} + MortonCompare (ITYPE nrbits, ITYPE ncbits, ITYPE rmask, ITYPE cmask ) + : nrowbits(nrbits), ncolbits(ncbits), rowmask(rmask), colmask(cmask) {} + + // rhs is the splitter that is already in bit-interleaved order + // lhs is the actual value that is in row-major order + bool operator()(const ITYPE & lhs, const ITYPE & rhs) const + { + ITYPE rlowbits = ((lhs >> ncolbits) & rowmask); + ITYPE clowbits = (lhs & colmask); + ITYPE bikey = BitInterleaveLow(rlowbits, clowbits); + + return bikey < rhs; + } + +private: + ITYPE nrowbits; + ITYPE ncolbits; + ITYPE rowmask; + ITYPE colmask; +}; + +template +class MortCompSym: public binary_function< ITYPE , ITYPE , bool > // (par1, par2, return_type) +{ +public: + MortCompSym() {} + MortCompSym(ITYPE bits, ITYPE lowmask): nbits(bits), lmask(lowmask) {} + + // rhs is the splitter that is already in bit-interleaved order + // lhs is the actual value that is in row-major order + bool operator()(const ITYPE & lhs, const ITYPE & rhs) const + { + ITYPE rlowbits = ((lhs >> nbits) & lmask); + ITYPE clowbits = (lhs & lmask); + ITYPE bikey = BitInterleaveLow(rlowbits, clowbits); + + return bikey < rhs; + } + +private: + ITYPE nbits; + ITYPE lmask; + +}; + +#endif + diff --git a/csb/promote.h b/csb/promote.h new file mode 100644 index 0000000..0fc280a --- /dev/null +++ b/csb/promote.h @@ -0,0 +1,37 @@ +#ifndef _PROMOTE_H_ +#define _PROMOTE_H_ + +template +struct promote_trait { }; + +#define 
DECLARE_PROMOTE(A,B,C) \ + template <> struct promote_trait \ + { \ + typedef C T_promote; \ + }; + +DECLARE_PROMOTE(int, bool,int); +DECLARE_PROMOTE(unsigned, bool, unsigned); +DECLARE_PROMOTE(float, bool, float); +DECLARE_PROMOTE(double, bool, double); +DECLARE_PROMOTE(long long, bool, long long); +DECLARE_PROMOTE(unsigned long long, bool, unsigned long long); +DECLARE_PROMOTE(bool, int, int); +DECLARE_PROMOTE(bool, unsigned, unsigned); +DECLARE_PROMOTE(bool, float, float); +DECLARE_PROMOTE(bool, double, double); +DECLARE_PROMOTE(bool, long long, long long); +DECLARE_PROMOTE(bool, unsigned long long, unsigned long long); +DECLARE_PROMOTE(bool, bool, bool); +DECLARE_PROMOTE(float, int, float); +DECLARE_PROMOTE(double, int, double); +DECLARE_PROMOTE(int, float, float); +DECLARE_PROMOTE(int, double, double); +DECLARE_PROMOTE(float, float, float); +DECLARE_PROMOTE(double, double, double); +DECLARE_PROMOTE(int, int, int); +DECLARE_PROMOTE(unsigned, unsigned, unsigned); +DECLARE_PROMOTE(long long, long long, long long); +DECLARE_PROMOTE(unsigned long long, unsigned long long, unsigned long long); + +#endif diff --git a/csb/spvec.cpp b/csb/spvec.cpp new file mode 100644 index 0000000..17f3bc4 --- /dev/null +++ b/csb/spvec.cpp @@ -0,0 +1,173 @@ +#include "spvec.h" +#include "utility.h" +#if (__GNUC__ == 4 && (__GNUC_MINOR__ < 7) ) + #include "randgen.h" +#else + #include +#endif +#include + +// constructor that generates a junk dense vector +template +Spvec::Spvec (ITYPE dim) +{ + assert(dim != 0); + n = static_cast(ceil(static_cast(dim)/RBDIM)) * RBDIM; + padding = n-dim; + if(padding) + cout << "Padded vector to size " << n << " for register blocking" << endl; + arr = new T[n]; +} + +template +Spvec::Spvec (T * darr, ITYPE dim) +{ + assert(dim != 0); + + n = static_cast(ceil(static_cast(dim)/RBDIM)) * RBDIM; + padding = n-dim; + if(padding) + cout << "Padded vector to size " << n << " for register blocking" << endl; + + arr = new T[n](); // zero initialized PID + + 
for(ITYPE i=0; i< n; ++i) + { + arr[i] = darr[i]; + } +} + +// copy constructor +template +Spvec::Spvec (const Spvec & rhs): n(rhs.n),padding(rhs.padding) +{ + if(n > 0) + { + arr = new T[n]; + + for(ITYPE i=0; i< n; ++i) + arr[i]= rhs.arr[i]; + } +} + +template +Spvec & Spvec::operator= (const Spvec & rhs) +{ + if(this != &rhs) + { + if(n > 0) + { + delete [] arr; + } + + n = rhs.n; + padding = rhs.padding; + if(n > 0) + { + arr = new T[n]; + for(ITYPE i=0; i< n; ++i) + arr[i]= rhs.arr[i]; + } + } + return *this; +} + + +template +Spvec::~Spvec() +{ + if ( n > 0) + { + delete [] arr; + } +} + +template +Spvec & Spvec::operator+=(const Matmul< Csc, Spvec > & matmul) +{ + if((n-padding == matmul.op1.rowsize()) && (matmul.op1.colsize() == matmul.op2.size())) // check compliance + { + csc_gaxpy(matmul.op1, const_cast< T * >(matmul.op2.arr), arr); + } + else + { + cout<< "Detected noncompliant matvec..." << endl; + } + return *this; +} + +template +Spvec & Spvec::operator+=(const Matmul< BiCsb, Spvec > & matmul) +{ + typedef PTSR< T, T> PTDD; + if((n-padding == matmul.op1.rowsize()) && (matmul.op1.colsize() == matmul.op2.size())) // check compliance + { + bicsb_gespmv(matmul.op1, matmul.op2.arr, arr); + } + else + { + cout<< "Detected noncompliant matvec..." 
<< endl; + } + return *this; +} + +// populate the vector with random entries +// currently, only works for T "double" and "float" +template +void Spvec::fillrandom() +{ +#if (__GNUC__ == 4 && (__GNUC_MINOR__ < 7) ) + RandGen G; + for(ITYPE i=0; i< n; ++i) + { + arr[i] = G.RandReal(); + } +#else + std::uniform_real_distribution distribution(0.0f, 1.0f); //Values between 0 and 1 + std::mt19937 engine; // Mersenne twister MT19937 + auto generator = std::bind(distribution, engine); + std::generate_n(arr, n, generator); +#endif +} + +// populate the vector with zeros +template +void Spvec::fillzero() +{ + for(ITYPE i=0; i< n; ++i) + { + arr[i] = 0; + } +} + +template +void Verify(Spvec & control, Spvec & test, string name, IT m) +{ + vectorerror(m); + std::transform(&control[0], (&control[0])+m, &test[0], error.begin(), absdiff()); + auto maxelement = std::max_element(error.begin(), error.end()); + cout << "Max error is: " << *maxelement << " on y[" << maxelement-error.begin()<<"]=" << test[maxelement-error.begin()] << endl; + NT machEps = machineEpsilon(); + cout << "Absolute machine epsilon is: " << machEps <<" and y[" << maxelement-error.begin() << "]*EPSILON becomes " + << machEps * test[maxelement-error.begin()] << endl; + + NT sqrtm = sqrt(static_cast(m)); + cout << "sqrt(n) * relative error is: " << abs(machEps * test[maxelement-error.begin()]) * sqrtm << endl; + if ( (abs(machEps * test[maxelement-error.begin()]) * sqrtm) < abs(*maxelement)) + { + cout << "*** ATTENTION ***: error is more than sqrt(n) times the relative machine epsilon" << endl; + } + +#ifdef DEBUG + cout << ": \n"; + for(IT i=0; i abs(sqrtm * machEps * test[i])) + { + cout << i << "\t" << control[i] << " " << test[i] << "\n"; + } + } +#endif +} + + diff --git a/csb/spvec.h b/csb/spvec.h new file mode 100644 index 0000000..3dac526 --- /dev/null +++ b/csb/spvec.h @@ -0,0 +1,52 @@ +#ifndef _SPVEC_H_ +#define _SPVEC_H_ + +#include "csc.h" +#include "bicsb.h" +#include "matmul.h" +#include 
"Semirings.h" + +template +class Spvec +{ +public: + Spvec (): n(0) {}; + Spvec (ITYPE dim); + Spvec (T * darr, ITYPE dim); + Spvec (const Spvec & rhs); + ~Spvec(); + Spvec & operator=(const Spvec & rhs); + + T& operator[] (const ITYPE nIndex) + { + return arr[nIndex]; + } + + //! Delayed evaluations using compositors for SpMV operation... y <- y + Ax + Spvec & operator+=(const Matmul< Csc, Spvec > & matmul); + Spvec & operator+=(const Matmul< BiCsb, Spvec > & matmul); + + void fillzero(); + void fillrandom(); + void fillone() + { + std::fill(arr,arr+n, static_cast(1.0)); + } + void fillfota() + { + for(ITYPE i =0; i(1.0); + } + + ITYPE size() const { return n-padding;} // return the real size + T * getarr(){ return arr;} + +private: + T * arr; + ITYPE n; + ITYPE padding; +}; + +#include "spvec.cpp" +#endif + diff --git a/csb/timer.gettimeofday.c b/csb/timer.gettimeofday.c new file mode 100644 index 0000000..b59526c --- /dev/null +++ b/csb/timer.gettimeofday.c @@ -0,0 +1,26 @@ +#ifndef _MYCLOCK_ +#define _MYCLOCK_ + + +#include + +struct timeval timer_ApplicationStartTime; +int timer_initialized = 0; + +void timer_init(){ + if(timer_initialized){fprintf(stderr,"timer_init() must be called once and only once\n");exit(0);} + timer_initialized = 1; + struct timezone timer_TimeZone; + gettimeofday(&timer_ApplicationStartTime,&timer_TimeZone); +} + +double timer_seconds_since_init(){ + if(!timer_initialized){fprintf(stderr,"timer_init() must be called first\n");exit(0);} + struct timezone timer_TimeZone; + struct timeval timer_CurrentTime; + gettimeofday(&timer_CurrentTime,&timer_TimeZone); + double rv = 1.0*(timer_CurrentTime.tv_sec-timer_ApplicationStartTime.tv_sec)+1e-6*(timer_CurrentTime.tv_usec-timer_ApplicationStartTime.tv_usec); + return(rv); +} + +#endif diff --git a/csb/transform_to_csb.hpp b/csb/transform_to_csb.hpp new file mode 100644 index 0000000..bf1baf5 --- /dev/null +++ b/csb/transform_to_csb.hpp @@ -0,0 +1,658 @@ +#include "benchmark_csb.hpp" + 
+#include "triple.h" +#include "csc.h" +#include "bicsb.h" +#include "bmcsb.h" +#include "spvec.h" +#include "Semirings.h" + +#include + +/** + * Structure holding BSSB object, using CSB for each block + * + * NOTE: Currently not working optimally. The overhead for accessing + * and processing each block appears to be higher than it should be. + * + */ +typedef struct{ + int *ir; // Start of every block row + int *jc; // Index of the column blocks + BiCsb **Sb; // Array of the sparse blocks CSB + int *rowStart; // Row start of each block (global) + int *colStart; // Col start of each block (global) + int nRow; // Number of block rows +} bssb; + +typedef struct{ + int rowStart; // Offset of global row + int colStart; // Offset of global col + int Nrow; // Number of rows in the block + int Ncol; // Number of columns in the block + int *iu; // Array of row indices with nonzero entries + int niu; // Number of rows with nonzero entries + int *li; // Number of nonzero entries in every row with nonzero entries + int *jj; // Column index of the nonzero entries + double *vv; // Value of the nonzero entries + int nnz; // Number of non zeros in the block +} sparseBlock_CSR; + +/** + * Custom BSSB object, using our bottom level code + * + */ +typedef struct{ + int *ir; // Start of every block row + int *jc; // Index of the column blocks + sparseBlock_CSR *Sb; // Array of the custom sparse blocks + int nRow; // Number of block rows +} bssb_custom; + +/* traverse the bssb structure in parallel */ +void traverse_bssb(double *F, + double *Y, + bssb BSSB, + int dim){ + + cilk_for(int i=0; i *Bl = BSSB.Sb[blk]; + + int rowStart = BSSB.rowStart[blk]; + int colStart = BSSB.colStart[blk]; + + // printf("Blk (%d,%d): offset (%d,%d)\n", i, j, + // rowStart, colStart); + typedef PTSR PTDD; + bicsb_gespmv(Bl[0], &Y[colStart], &F[rowStart]); + // computeLeaf(Bl, F, Y, dim); + + } + + } + + +} + +/* traverse the bssb structure in parallel */ +void traverse_bssb_tsne(double *F, + double *Y, + bssb 
BSSB, + int dim){ + + cilk_for(int i=0; i *Bl = BSSB.Sb[blk]; + + int rowStart = BSSB.rowStart[blk]; + int colStart = BSSB.colStart[blk]; + + // printf("Blk (%d,%d): offset (%d,%d)\n", i, j, + // rowStart, colStart); + typedef PTSR PTDD; + bicsb_tsne(Bl[0], &Y[dim*rowStart], &Y[dim*colStart], &F[dim*rowStart]); + + // computeLeaf(Bl, F, Y, dim); + + } + + } + + +} + + +void csc2csb_bot(bssb *BSSB, top_lvl_csc *BSSB_CSR, int nCol, int nRow, + int workers, int csbBeta){ + + int *csr_jr = BSSB_CSR->jc; + int *csr_ic = BSSB_CSR->ir; + sparseBlock *csr_sb = BSSB_CSR->Pb; + + int nnz = csr_jr[nCol]; + + // PREPARE NEW OBJECT BSSB + + BSSB->ir = csr_jr; + BSSB->jc = csr_ic; + BSSB->Sb = (BiCsb **) + malloc(nnz*sizeof(BiCsb*)); + BSSB->rowStart = (int *)malloc(nnz*sizeof(int)); + BSSB->colStart = (int *)malloc(nnz*sizeof(int)); + BSSB->nRow = nRow; + + // PASS TRHOUGH OLD STRUCT TO CONSTRUCT NEW OBJECT + + for(int i=0; innz; + + BSSB->rowStart[blk] = Bl->row; + BSSB->colStart[blk] = Bl->col; + + INDEXTYPE * rowindices = new INDEXTYPE[nnzBlk]; + INDEXTYPE * colindices = new INDEXTYPE[nnzBlk]; + VALUETYPE * vals = new VALUETYPE[nnzBlk]; + + int32_t nnzCol = Bl->nuj; + double *vv = Bl->vv; + int32_t *ju = Bl->ju; + int32_t *ii = Bl->ii; + int32_t *li = Bl->li; + int32_t m = Bl->Nrow; + int32_t n = Bl->Ncol; + int32_t ss = 0; + + printf("B[%d,%d] = Row: %d Col:%d | %dx%d nnz: %d\n", i, j, Bl->row, Bl->col, m,n, nnzBlk); + + int itemIter = 0; + + for (uint32_t j_blk = 0; j_blk < nnzCol; j_blk++) { + + const int32_t k_blk = li[j_blk]; + // printf(" k_blk = %d\n", k_blk); + for (uint32_t idx_blk = 0; idx_blk < k_blk; idx_blk++) { + + const uint32_t i_blk = (ii[ss + idx_blk]); + // printf(" Bp[%d,%d] = %.2g\n", i_blk, ju[j_blk], vv[ss+idx_blk]); + rowindices[itemIter] = i_blk; + colindices[itemIter] = ju[j_blk]; + vals[itemIter] = vv[ss+idx_blk]; + itemIter++; + } + + ss += k_blk; + + } + + Csc * csc; + csc = new Csc(rowindices, colindices, vals, nnzBlk, m, n); + + + float 
csbBetaNew = 0.0; + if (csbBeta == 0){ + csbBetaNew = floor( (double) log2(max(m,n))); + } + + BSSB->Sb[blk] = new BiCsb(*csc, workers, + (int)csbBetaNew); + + printf("Blk stats: nRow = %d | nCol = %d\n", + BSSB->Sb[blk]->getNbr(), + BSSB->Sb[blk]->getNbc()); + + freeSparseBlock(Bl); + + delete [] rowindices; + delete [] colindices; + delete [] vals; + + // computeLeaf(Bl, F, Y, dim); + + } + + } + +} + + + + +void csc2csr_bot(bssb_custom *BSSB, top_lvl_csc *BSSB_CSR, int nCol, int nRow, + int workers, int csbBeta, bool freePrev){ + + int *csr_jr = BSSB_CSR->jc; + int *csr_ic = BSSB_CSR->ir; + sparseBlock *csr_sb = BSSB_CSR->Pb; + + int nnz = csr_jr[nCol]; + + // PREPARE NEW OBJECT BSSB + + BSSB->ir = csr_jr; + BSSB->jc = csr_ic; + BSSB->Sb = (sparseBlock_CSR *) + malloc(nnz*sizeof(sparseBlock_CSR)); + BSSB->nRow = nRow; + + int total_nnz = 0; + int total_niu = 0; + + // PASS TRHOUGH OLD STRUCT TO CONSTRUCT NEW OBJECT + + for(int i=0; iSb[blk]; + + int nnzBlk = Bl->nnz; + + // pass scalar values + Sb_temp->rowStart = Bl->row; + Sb_temp->colStart = Bl->col; + Sb_temp->Nrow = Bl->Nrow; + Sb_temp->Ncol = Bl->Ncol; + Sb_temp->nnz = Bl->nnz; + + // allocate sparse block elements + Sb_temp->vv = (double *)malloc(Sb_temp->nnz*sizeof(double)); + Sb_temp->jj = (int *)malloc(Sb_temp->nnz*sizeof(double)); + + // prepare array of vectors + std::vector< std::pair > **rows = + (std::vector< std::pair > **) + malloc(Sb_temp->Nrow*sizeof(std::vector< std::pair >*)); + + // allocate vectors + for (int iVec = 0; iVecNrow; iVec++){ + + rows[iVec] = new std::vector< std::pair >(); + + } + + // transform CSC to CSR + int32_t nnzCol = Bl->nuj; + int itemIter = 0; + int32_t ss = 0; + + for (uint32_t j_blk = 0; j_blk < nnzCol; j_blk++) { + + const int32_t k_blk = Bl->li[j_blk]; + // printf(" k_blk = %d\n", k_blk); + for (uint32_t idx_blk = 0; idx_blk < k_blk; idx_blk++) { + + uint32_t i_blk = (Bl->ii[ss + idx_blk]); + double v_blk = (Bl->vv[ss + idx_blk]); + std::pair 
newPair(Bl->ju[j_blk], v_blk); + rows[i_blk]->push_back( newPair ); + + } + + ss += k_blk; + + } + + int niu = 0; + + // find number of nnz rows + for (int iVec = 0; iVecNrow; iVec++){ + if (rows[iVec]->size() > 0) niu++; + } + + Sb_temp->niu = niu; + + // update total values + total_nnz += Sb_temp->nnz; + total_niu += Sb_temp->niu; + + // allocate remaining vectors + Sb_temp->iu = (int *)malloc(niu*sizeof(int)); + Sb_temp->li = (int *)malloc(niu*sizeof(int)); + + int iterRow = 0; + int iterNnz = 0; + + // fill vectors + for (int iVec = 0; iVecNrow; iVec++){ + + if (rows[iVec]->size() > 0){ // if not empty + + Sb_temp->iu[iterRow] = iVec; + Sb_temp->li[iterRow] = rows[iVec]->size(); + + for (int jVec = 0; jVecsize(); jVec++){ + + Sb_temp->jj[iterNnz] = rows[iVec]->at(jVec).first; + Sb_temp->vv[iterNnz] = rows[iVec]->at(jVec).second; + + // FILE *f_i = fopen( "csr_con_i.bin", "ab" ); + // FILE *f_j = fopen( "csr_con_j.bin", "ab" ); + // FILE *f_v = fopen( "csr_con_v.bin", "ab" ); + + // int i_bin = iVec+Sb_temp->rowStart; + // int j_bin = Sb_temp->jj[iterNnz]+Sb_temp->colStart; + // double v_bin = Sb_temp->vv[iterNnz]; + + // fwrite( &i_bin, sizeof(i_bin), 1, f_i ); + // fwrite( &j_bin, sizeof(j_bin), 1, f_j ); + // fwrite( &v_bin, sizeof(v_bin), 1, f_v ); + + // fclose( f_i ); fclose( f_j ); fclose( f_v ); + + iterNnz++; + + } + + iterRow++; + + } + + } + + // printf("Block (%d,%d) row: [%d,%d] col:[%d,%d] nnz=%d gold=%d\n", + // i, j, + // Sb_temp->rowStart, Sb_temp->rowStart + Sb_temp->Nrow, + // Sb_temp->colStart, Sb_temp->colStart + Sb_temp->Ncol, + // iterNnz, nnzBlk); + + // de-allocate vectors + for (int iVec = 0; iVecNrow; iVec++){ + + delete rows[iVec]; + + } + + free(rows); + + if (freePrev) + freeSparseBlock(Bl); + + } // finish block row + + } // finish block col + + printf("total nnz: %d, total niu:%d\n", total_nnz, total_niu); + + int *global_iu = (int *)malloc( total_niu*sizeof(int) ); + int *global_li = (int *)malloc( total_niu*sizeof(int) ); + int 
*global_jj = (int *)malloc( total_nnz*sizeof(int) ); + double *global_vv = (double *)malloc( total_nnz*sizeof(double) ); + + // traverse and update pointers + + total_niu = 0; + total_nnz = 0; + + for(int i=0; inRow; i++){ + + const int offRow = BSSB->ir[i]; + + const int k = BSSB->ir[i+1] - offRow; + + // printf("\n New for \n\n"); + + for (unsigned int idx = 0; idx < k; idx++) { + + int blk = offRow + idx; + + int j = BSSB->jc[blk]; + + sparseBlock_CSR *Bl = &BSSB->Sb[blk]; + + for ( int iter = 0; iter < Bl->nnz; iter++ ){ + + global_jj[total_nnz + iter] = Bl->jj[iter]; + global_vv[total_nnz + iter] = Bl->vv[iter]; + + } + + for ( int iter = 0; iter < Bl->niu; iter++ ){ + + global_iu[total_niu + iter] = Bl->iu[iter]; + global_li[total_niu + iter] = Bl->li[iter]; + + } + + free( Bl->jj ); + free( Bl->vv ); + free( Bl->iu ); + free( Bl->li ); + + Bl->jj = &global_jj[total_nnz]; + Bl->vv = &global_vv[total_nnz]; + + Bl->iu = &global_iu[total_niu]; + Bl->li = &global_li[total_niu]; + + total_niu += Bl->niu; + total_nnz += Bl->nnz; + + } + + } + +} + + +/* PQ Kernel */ +void computeLeafCSRtsne(sparseBlock_CSR* Ps, double* F, double* Y, int dim){ + + double *vv = Ps->vv; + int *iu = Ps->iu; + int *jj = Ps->jj; + int *li = Ps->li; + int m = Ps->Nrow; + int n = Ps->Ncol; + + int nn = Ps->niu; + + int ss = 0; + + int R = Ps->rowStart; + int C = Ps->colStart; + + double* Y0i = &Y[R * dim]; + double* Y0j = &Y[C * dim]; + + double* F0i = &F[R * dim]; + + for (int i = 0; i < nn; i++) { + + double accum[3] = {0}; + double Ftemp[3] = {0}; + double Yj[3] = {0}; + double Yi[3] = {0}; + + const int k = li[i]; /* number of nonzero elements of each row */ + + + Yi[:] = Y0i[ iu[i]*dim + 0:dim ]; + accum[:] = 0; + + /* for each non zero element */ + for (int idx = 0; idx < k; idx++) { + + const int j = (jj[ss + idx]); + + Yj[:] = Y0j[ j * dim + 0:dim ]; + + /* distance computation */ + double dist = __sec_reduce_add( (Yi[:] - Yj[:])*(Yi[:] - Yj[:]) ); + + // FILE *f_i = fopen( 
"csr_i.bin", "ab" ); + // FILE *f_j = fopen( "csr_j.bin", "ab" ); + // FILE *f_v = fopen( "csr_v.bin", "ab" ); + + // int i_bin = iu[i] + R; + // int j_bin = j + C; + // double v_bin = vv[ss+idx]; + + // fwrite( &i_bin, sizeof(i_bin), 1, f_i ); + // fwrite( &j_bin, sizeof(j_bin), 1, f_j ); + // fwrite( &v_bin, sizeof(v_bin), 1, f_v ); + + // fclose( f_i ); fclose( f_j ); fclose( f_v ); + + /* P_{ij} \times Q_{ij} */ + double p_times_q = vv[ss+idx] / (1+dist); + + Ftemp[:] = p_times_q * ( Yi[:] - Yj[:] ); + + /* F_{attr}(i,j) */ + accum[:] += Ftemp[:]; + } + + F0i[iu[i]*dim + 0:dim] += accum[:]; + ss += k; + + } + + +} + + +/* travese the bssb structure in parallel */ +void traverse_bssb_csr_tsne(double *F, + double *Y, + bssb_custom BSSB, + int dim, + int nworkers){ + + +#pragma cilk grainsize = 1 + cilk_for (int thr = 0; thr < nworkers; thr++){ + + for(int i=thr; ivv; + int *iu = Ps->iu; + int *jj = Ps->jj; + int *li = Ps->li; + int m = Ps->Nrow; + int n = Ps->Ncol; + + int nn = Ps->niu; + int ss = 0; + + int R = Ps->rowStart; + int C = Ps->colStart; + + double* Y0j = &Y[C]; + + double* F0i = &F[R]; + + for (int i = 0; i < nn; i++) { + + double accum = 0; + + // nnz of this row + const int k = li[i]; + + /* for each non zero element in row */ + for (int idx = 0; idx < k; idx++) { + + // get column index + const int j = (jj[ss + idx]); + + // update results + F0i[iu[i]] += vv[ ss+idx ] * Y0j[ j ]; + + } + + // increase iterator by k + ss += k; + + } + + +} + + +/* travese the bssb structure in parallel */ +void traverse_bssb_csr_spmv(double *F, + double *Y, + bssb_custom BSSB){ + + cilk_for(int i=0; i +#include +#include "utility.h" +#include "spvec.h" +using namespace std; + +template +struct Triple +{ + ITYPE row; // row index + ITYPE col; // col index + T val; // value +}; + +template +struct ColSortCompare: // struct instead of class so that operator() is public + public binary_function< Triple, Triple, bool > // (par1, par2, return_type) + { + inline bool 
operator()(const Triple & lhs, const Triple & rhs) const + { + if(lhs.col == rhs.col) + { + return lhs.row < rhs.row; + } + else + { + return lhs.col < rhs.col; + } + } + }; + +template +struct RowSortCompare: // struct instead of class so that operator() is public + public binary_function< Triple, Triple, bool > // (par1, par2, return_type) + { + inline bool operator()(const Triple & lhs, const Triple & rhs) const + { + if(lhs.row == rhs.row) + { + return lhs.col < rhs.col; + } + else + { + return lhs.row < rhs.row; + } + } + }; + +template +struct BitSortCompare: // struct instead of class so that operator() is public + public binary_function< Triple, Triple, bool > // (par1, par2, return_type) + { + inline bool operator()(const Triple & lhs, const Triple & rhs) const + { + return BitInterleave(lhs.row, lhs.col) < BitInterleave(rhs.row, rhs.col); + } + }; + +template +void triples_gaxpy(Triple * triples, Spvec & x, Spvec & y, ITYPE nnz) +{ + for(ITYPE i=0; i< nnz; ++i) + { + y [triples[i].row] += triples[i].val * x [triples[i].col] ; + } +}; + + +#endif + diff --git a/csb/utility.h b/csb/utility.h new file mode 100644 index 0000000..46b17ea --- /dev/null +++ b/csb/utility.h @@ -0,0 +1,513 @@ +#ifndef _UTILITY_H +#define _UTILITY_H +#include +#define __int64 long long +#include +#include +#include +#include +#include +#include +#include +#include // MMX +#include // SSE +#include // SSE 2 +#include // SSE 3 + +#include + +using namespace std; + +#include +#include +#define SYNCHED __cilkrts_synched() +#define DETECT __cilkscreen_enable_checking() +#define ENDDETECT __cilkscreen_disable_checking() +#define WORKERS __cilkrts_get_nworkers() + +#ifdef BWTEST + #define UNROLL 100 +#else + #define UNROLL 1 +#endif + +#ifndef CILK_STUB +#ifdef __cplusplus +extern "C" { +#endif +/* + * __cilkrts_synched + * + * Allows an application to determine if there are any outstanding + * children at this instant. 
This function will examine the current + * full frame to determine this. + */ + +CILK_EXPORT __CILKRTS_NOTHROW +int __cilkrts_synched(void); + +#ifdef __cplusplus +} // extern "C" +#endif +#else /* CILK_STUB */ +/* Stubs for the api functions */ +#define __cilkrts_synched() (1) +#endif /* CILK_STUB */ + +#ifdef STATS + #include + cilk::reducer_opadd<__int64> blockparcalls; + cilk::reducer_opadd<__int64> subspmvcalls; + cilk::reducer_opadd<__int64> atomicflops; +#endif + +void * address; +void * base; + +using namespace std; + +// convert category to type + template< int Category > struct int_least_helper {}; // default is empty + template<> struct int_least_helper<8> { typedef uint64_t least; }; // 8x8 blocks require 64-bit bitmasks + template<> struct int_least_helper<4> { typedef unsigned short least; }; // 4x4 blocks require 16-bit bitmasks + template<> struct int_least_helper<2> { typedef unsigned char least; }; // 2x2 blocks require 4-bit bitmasks, so we waste half of the array here + +const uint64_t masktable64[64] = {0x8000000000000000, 0x4000000000000000, 0x2000000000000000, 0x1000000000000000, + 0x0800000000000000, 0x0400000000000000, 0x0200000000000000, 0x0100000000000000, + 0x0080000000000000, 0x0040000000000000, 0x0020000000000000, 0x0010000000000000, + 0x0008000000000000, 0x0004000000000000, 0x0002000000000000, 0x0001000000000000, + 0x0000800000000000, 0x0000400000000000, 0x0000200000000000, 0x0000100000000000, + 0x0000080000000000, 0x0000040000000000, 0x0000020000000000, 0x0000010000000000, + 0x0000008000000000, 0x0000004000000000, 0x0000002000000000, 0x0000001000000000, + 0x0000000800000000, 0x0000000400000000, 0x0000000200000000, 0x0000000100000000, + 0x0000000080000000, 0x0000000040000000, 0x0000000020000000, 0x0000000010000000, + 0x0000000008000000, 0x0000000004000000, 0x0000000002000000, 0x0000000001000000, + 0x0000000000800000, 0x0000000000400000, 0x0000000000200000, 0x0000000000100000, + 0x0000000000080000, 0x0000000000040000, 
0x0000000000020000, 0x0000000000010000, + 0x0000000000008000, 0x0000000000004000, 0x0000000000002000, 0x0000000000001000, + 0x0000000000000800, 0x0000000000000400, 0x0000000000000200, 0x0000000000000100, + 0x0000000000000080, 0x0000000000000040, 0x0000000000000020, 0x0000000000000010, + 0x0000000000000008, 0x0000000000000004, 0x0000000000000002, 0x0000000000000001 }; + + +const unsigned short masktable16[16] = {0x8000, 0x4000, 0x2000, 0x1000, 0x0800, 0x0400, 0x0200, 0x0100, + 0x0080, 0x0040, 0x0020, 0x0010, 0x0008, 0x0004, 0x0002, 0x0001 }; + + +const unsigned char masktable4[4] = { 0x08, 0x04, 0x02, 0x01 }; // mask for 2x2 register blocks + + +template +MTYPE GetMaskTable(unsigned int index) +{ + return 0; +} + + +template <> +uint64_t GetMaskTable(unsigned int index) +{ + return masktable64[index]; +} + +template <> +unsigned short GetMaskTable(unsigned int index) +{ + return masktable16[index]; +} + + +template <> +unsigned char GetMaskTable(unsigned int index) +{ + return masktable4[index]; +} + +#ifndef RHSDIM +#define RHSDIM 1 +#endif +#define BALANCETH 2.0 +//#define BALANCETH 1.0 +#define RBDIM 8 +#define RBSIZE (RBDIM*RBDIM) // size of a register block (8x8 in this case) +#define SLACKNESS 8 +#define KBYTE 1024 +#define L2SIZE (256*KBYTE / RHSDIM) // less than half of the L2 Cache (L2 should hold x & y at the same time) - scaled back by RHSDIM +#define CLSIZE 64 // cache line size + +/* Tuning Parameters */ +#define BREAKEVEN 4 // A block (or subblock) with less than (BREAKEVEN * dimension) nonzeros won't be parallelized +#define MINNNZTOPAR 128 // A block (or subblock) with less than MINNNZTOPAR nonzeros won't be parallelized +#define BREAKNRB (8/RBDIM) // register blocked version of BREAKEVEN +#define MINNRBTOPAR (256/RBDIM) // register blocked version of MINNNZPAR +#define LOGSERIAL 15 +#define ROLLING 20 + +#define EPSILON 0.0001 +#define REPEAT 10 + +// "absolute" difference macro that has no possibility of unsigned wrap +#define absdiff(x,y) ( (x) > 
(y) ? (x-y) : (y-x)) + + +unsigned rmasks[32] = { 0x00000001, 0x00000002, 0x00000004, 0x00000008, + 0x00000010, 0x00000020, 0x00000040, 0x00000080, + 0x00000100, 0x00000200, 0x00000400, 0x00000800, + 0x00001000, 0x00002000, 0x00004000, 0x00008000, + 0x00010000, 0x00020000, 0x00040000, 0x00080000, + 0x00100000, 0x00200000, 0x00400000, 0x00800000, + 0x01000000, 0x02000000, 0x04000000, 0x08000000, + 0x10000000, 0x20000000, 0x40000000, 0x80000000 }; + + +void popcountall(const uint64_t * __restrict M, unsigned * __restrict count, size_t size); +void popcountall(const unsigned short * __restrict M, unsigned * __restrict count, size_t size); +void popcountall(const unsigned char * __restrict M, unsigned * __restrict count, size_t size); + + +template +void printhistogram(const T * scansum, size_t size, unsigned bins) +{ + ofstream outfile; + outfile.open("hist.csv"); + vector hist(bins); // an STD-vector is zero initialized + for(size_t i=0; i< size; ++i) + hist[scansum[i]]++; + + outfile << "Fill_ratio" << "," << "count" << endl; + for(size_t i=0; i< bins; ++i) + { + outfile << static_cast(i) / bins << "," << hist[i] << "\n"; + } +} + +struct thread_data +{ + unsigned sum; + unsigned * beg; + unsigned * end; +}; + +unsigned int highestbitset(unsigned __int64 v); + +template +unsigned prescan(unsigned * a, MTYPE * const M, int n) +{ + unsigned * end = a+n; + unsigned * _a = a; + MTYPE * __restrict _M = M; + unsigned int lgn; + unsigned sum = 0; + while ((lgn = highestbitset(n)) > LOGSERIAL) + { + unsigned _n = rmasks[lgn]; // _n: biggest power of two that is less than n + int numthreads = SLACKNESS*WORKERS; + thread_data * thdatas = new thread_data[numthreads]; + unsigned share = _n/numthreads; + cilk_for(int t=0; t < numthreads; ++t) + { + popcountall(_M+t*share, _a+t*share, ((t+1)==numthreads)?(_n-t*share):share); + thdatas[t].sum = 0; + thdatas[t].beg = _a + t*share; + thdatas[t].end = _a + (((t+1)==numthreads)?_n:((t+1)*share)); + thdatas[t].sum = 
accumulate(thdatas[t].beg, thdatas[t].end, thdatas[t].sum); + } + for(int t=0; t +ITYPE CumulativeSum (ITYPE * arr, ITYPE size) +{ + ITYPE prev; + ITYPE tempnz = 0 ; + for (ITYPE i = 0 ; i < size ; ++i) + { + prev = arr[i]; + arr[i] = tempnz; + tempnz += prev ; + } + return (tempnz) ; // return sum +} + + +template +T machineEpsilon() +{ + T machEps = 1.0; + do { + machEps /= static_cast(2.0); + // If next epsilon yields 1, then break, because current + // epsilon is the machine epsilon. + } + while ((T)(static_cast(1.0) + (machEps/static_cast(2.0))) != 1.0); + + return machEps; +} + + +template +void iota(_ForwardIter __first, _ForwardIter __last, T __value) +{ + while (__first != __last) + *__first++ = __value++; +} + +template +T ** allocate2D(I m, I n) +{ + T ** array = new T*[m]; + for(I i = 0; i +void deallocate2D(T ** array, I m) +{ + for(I i = 0; i +struct absdiff : binary_function +{ + T operator () ( T const &arg1, T const &arg2 ) const + { + using std::abs; + return abs( arg1 - arg2 ); + } +}; + + + +template +void MultAdd(double & a, const double & b, const double & c) +{ + for(int i=0; i +ITYPE BitInterleaveLow(ITYPE x, ITYPE y) +{ + ITYPE z = 0; // z gets the resulting Morton Number. + int ite = sizeof(z) * CHAR_BIT / 2; + + for (int i = 0; i < ite; ++i) + { + // bitwise shift operations have precedence over bitwise OR and AND + z |= (x & (1 << i)) << i | (y & (1 << i)) << (i + 1); + } + return z; +} + +// bit interleave x and y, and return result z (which is twice in size) +template +OTYPE BitInterleave(ITYPE x, ITYPE y) +{ + OTYPE z = 0; // z gets the resulting Morton Number. 
+ int ite = sizeof(x) * CHAR_BIT; + + for (int i = 0; i < ite; ++i) + { + // bitwise shift operations have precedence over bitwise OR and AND + z |= (x & (1 << i)) << i | (y & (1 << i)) << (i + 1); + } + return z; +} + +template +inline unsigned IntPower(unsigned exponent) +{ + unsigned i = 1; + unsigned power = 1; + + while ( i <= exponent ) + { + power *= BASE; + i++; + } + return power; +} + +template <> +inline unsigned IntPower<2>(unsigned exponent) +{ + return rmasks[exponent]; +} + + + +// T should be uint32, uint64, int32 or int64; force concept requirement +template +bool IsPower2(T x) +{ + return ( (x>0) && ((x & (x-1)) == 0)); +} + +unsigned int nextpoweroftwo(unsigned int v) +{ + // compute the next highest power of 2 of 32(or 64)-bit n + // essentially does 1 << (lg(n - 1)+1). + + unsigned int n = v-1; + + // any "0" that is immediately right to a "1" becomes "1" (post: any zero has at least two "1"s to its left) + n |= n >> 1; + + // turn two more adjacent "0" to "1" (post: any zero has at least four "1"s to its left) + n |= n >> 2; + n |= n >> 4; // post: any zero has at least 8 "1"s to its left + n |= n >> 8; // post: any zero has at least 16 "1"s to its left + n |= n >> 16; // post: any zero has at least 32 "1"s to its left + + return ++n; +} + +// 64-bit version +// note: least significant bit is the "zeroth" bit +// pre: v > 0 +unsigned int highestbitset(unsigned __int64 v) +{ + // b in binary is {10,1100, 11110000, 1111111100000000 ...} + const unsigned __int64 b[] = {0x2ULL, 0xCULL, 0xF0ULL, 0xFF00ULL, 0xFFFF0000ULL, 0xFFFFFFFF00000000ULL}; + const unsigned int S[] = {1, 2, 4, 8, 16, 32}; + int i; + + unsigned int r = 0; // result of log2(v) will go here + for (i = 5; i >= 0; i--) + { + if (v & b[i]) // highestbitset is on the left half (i.e. v > S[i] for sure) + { + v >>= S[i]; + r |= S[i]; + } + } + return r; +} + +__int64 highestbitset(__int64 v) +{ + if(v < 0) + { + cerr << "Indices can not be negative, aborting..." 
<< endl; + return -1; + } + else + { + unsigned __int64 uv = static_cast< unsigned __int64 >(v); + unsigned __int64 ur = highestbitset(uv); + return static_cast< __int64 > (ur); + } +} + +// 32-bit version +// note: least significant bit is the "zeroth" bit +// pre: v > 0 +unsigned int highestbitset(unsigned int v) +{ + // b in binary is {10,1100, 11110000, 1111111100000000 ...} + const unsigned int b[] = {0x2, 0xC, 0xF0, 0xFF00, 0xFFFF0000}; + const unsigned int S[] = {1, 2, 4, 8, 16}; + int i; + + unsigned int r = 0; + for (i = 4; i >= 0; i--) + { + if (v & b[i]) // highestbitset is on the left half (i.e. v > S[i] for sure) + { + v >>= S[i]; + r |= S[i]; + } + } + return r; +} + +int highestbitset(int v) +{ + if(v < 0) + { + cerr << "Indices can not be negative, aborting..." << endl; + return -1; + } + else + { + unsigned int uv = static_cast< unsigned int> (v); + unsigned int ur = highestbitset(uv); + return static_cast< int > (ur); + } +} + +/* This function will return n % d. + d must be one of: 1, 2, 4, 8, 16, 32, … */ +inline unsigned int getModulo(unsigned int n, unsigned int d) +{ + return ( n & (d-1) ); +} + +// Same requirement (d=2^k) here as well +inline unsigned int getDivident(unsigned int n, unsigned int d) +{ + while((d = d >> 1)) + n = n >> 1; + return n; +} + +#endif + diff --git a/csb/utils.cpp b/csb/utils.cpp new file mode 100644 index 0000000..7c79d40 --- /dev/null +++ b/csb/utils.cpp @@ -0,0 +1,401 @@ +#include "utils.hpp" +#include "cs.hpp" +#include + +template +void verifyVectorEqual(T const * const f_new, T const * const f_gold, + CS_INT n, CS_INT dim, + double const ERR_THRES){ + + for (CS_INT i=0; im; + CS_INT n = A->n; + CS_INT *Ap = A->p; + CS_INT *Ai = A->i; + double *Ax = A->x; + CS_INT nzmax = A->nzmax; + CS_INT nz = A->nz; + + CS_INT *nElem = (CS_INT *) calloc( m, sizeof( CS_INT ) ); + CS_INT nznew = 0; + + // --- COUNTING + + // loop through every column + for (CS_INT j = 0 ; j < m ; j++) { + + // get range for NNZ for current 
col + for (CS_INT p = Ap [j] ; p < Ap [j+1] ; p++) { + + // if element is in upper triangular keep + if (j >= Ai[p]){ + nElem[j]++; + nznew++; + } + + } + + } + +#ifdef DEBUG + for (CS_INT j = 0; j < n; j++) + printf("%d ", nElem[j]); + printf("\n"); +#endif + + // allocate new matrix + cs *C_sym = cs_spalloc (m, n, nznew, 1, 0) ; + + CS_INT *Ap_sym = C_sym->p; + CS_INT *Ai_sym = C_sym->i; + double *Ax_sym = C_sym->x; + + // --- COPYING + + // first element is zero + CS_INT offset = 0; + + for (CS_INT j=0; j= Ai[p]){ + + // copy index and value + Ai_sym[ k_sym ] = Ai[p]; + Ax_sym[ k_sym ] = Ax[p]; + + // increment counter + k_sym++; + } + + } + + } + +#ifdef DEBUG + printf( "NNZ: %d, k_sym: %d", nznew, k_sym ); +#endif + + free( nElem ); + + return C_sym; + +} + + +void printDenseMatrix( double *matrix, CS_INT m, CS_INT n ) { + + for (CS_INT row=0; row=0 && r=0 && r= lim) + return C; + + CS_INT r,c; + + r = j + i; + c = i; + if ( r>=0 && r +T *permuteDataPoints( T* x, CS_INT *p, CS_INT n, CS_INT ldim ){ + + T *y = (T *)malloc(n*ldim*sizeof(T)); + + CS_INT i; + + cilk_for( i=0; i + * @date Wed Sep 20 13:51:08 2017 + * + * @brief Utility functions + * + * Various independed functions used throught the project + * + * + */ + +#ifndef _H_UTILS +#define _H_UTILS +#include "cs.hpp" +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include // std::replace + +/** + * @brief Verifies equality between vectors + * + * Verifies that input vectors are equal (with respect to ERR_THRES) + * + * @param f_new The new vector that need to be checked + * @param f_gold The gold, correct vector + * @param n The number of elements + * @param dim Dimension of each element + * @param ERR_THRES Error threshold + */ +template +void verifyVectorEqual(T const * const f_new, T const * const f_gold, + CS_INT n, CS_INT dim, double const ERR_THRES); + +/** + * @brief Prints dense matrix + * + * @param matrix The matrix to print + * @param m Number of rows 
+ * @param n Number of columns + */ +void printDenseMatrix( double *matrix, CS_INT m, CS_INT n ); + +/** + * @brief Transforms dense banded to CS sparse + * + * @param B Input dense + * @param n Size of matrix + * @param b Bandwidth (nnz per row) + * + * @return + */ +cs *band2sparse( double *B, CS_INT n, CS_INT b ); + +/** + * @brief Transforms dense banded to CS sparse with limited nnz + * elements + * + * @param B Input dense + * @param n Size of matrix + * @param b Bandwidth (nnz per row) + * @param lim Total nnz limitation + * + * @return + */ +cs *band2sparseLim( double *B, CS_INT n, CS_INT b, CS_INT lim ); + +/** + * @brief Generates a banded matrix + * + * @param n Size of matrix + * @param b Bandwidth + * + * @return Banded matrix + */ +double *generateBanded( CS_INT n, CS_INT b ); + +cs *genSymBandSparse( CS_INT n, CS_INT b, CS_INT lim ); + +void printMinTime( double *x, CS_INT n ); + +void exportTime2csv( double *x, FILE *fp, CS_INT n ); + +std::string getHostnameDateFilename(); + +cs *make_sym (cs *A); + +void exportBenchmarkResults( std::string prefix, double **times, char **names, + CS_INT nExp, CS_INT iter ); + +void exportBenchmarkResult( std::string prefix, double *times, CS_INT iter ); + + +template +T *permuteDataPoints( T* x, CS_INT *p, CS_INT n, CS_INT ldim ); + +static double tic (void) { + return (clock () / (double) CLOCKS_PER_SEC); +} + +static double toc (double t) { + double s = tic () ; return (CS_MAX (0, s-t)); +} + +static double getMilliseconds( struct timeval begin, struct timeval end ) { + + return + ((double) (end.tv_sec - begin.tv_sec) * 1000 ) + + ((double) (end.tv_usec - begin.tv_usec) / 1000 ); +} + +void setThreadsNum(int nworkers); + +void extractDimensions(double *y, double *x, CS_INT N, CS_INT ldim, CS_INT d); + +/** + * Extract upper triangular from a sparse matrix in SuiteSparse CSC. 
+ * + * @param A Original sparse matrix + * + * @return New sparse matrix in CSC with only upper triangular + */ +cs * triu( cs const * const A ); + +/** + * Check whether string a starts with string b + */ +CS_INT StartsWith(const char *a, const char *b); + +#endif diff --git a/data/mobius-graph.tar.gz b/data/mobius-graph.tar.gz new file mode 100644 index 0000000..e4a6dd5 Binary files /dev/null and b/data/mobius-graph.tar.gz differ diff --git a/data/pbmc-graph.tar.gz b/data/pbmc-graph.tar.gz new file mode 100644 index 0000000..5ec4d98 Binary files /dev/null and b/data/pbmc-graph.tar.gz differ diff --git a/docs/MAINPAGE.md b/docs/MAINPAGE.md new file mode 100644 index 0000000..3c84c37 --- /dev/null +++ b/docs/MAINPAGE.md @@ -0,0 +1,87 @@ +# sgtsnepi: Swift Neighbor Embedding of Sparse Stochastic Graphs + +## Getting started + +### System environment + +SG-t-SNE-Π is developed for shared-memory computers with +multi-threading, running Linux or macOS operating system. The source +code is (to be) compiled by a `C++` compiler supporting Cilk. The +current release is tested with the `GNU g++` compiler 7.4.0 and the +`Intel` `icpc` compiler 19.0.4.233. + +### Prerequisites + +SG-t-SNE-Π uses the following open-source software: + +- `FFTW3` 3.3.8 + +- `METIS` 5.1.0 + +- `FLANN` 1.9.1 + +- `Intel TBB` 2019 + +- `Doxygen` 1.8.14 + +On `Ubuntu`: + + sudo apt-get install libtbb-dev libflann-dev libmetis-dev libfftw3-dev doxygen + +On `macOS`: + + sudo port install flann tbb metis fftw-3 + +### Installation + +#### Basic instructions + +To generate the SG-t-SNE-Π library, test and demo programs: + + ./configure + make all + +To specify the `C++` compiler: + + ./configure CXX= + +To test whether the installation is successful: + + bin/test_modules + +To generate the documentation: + + make documentation + +#### Support of the conventional t-SNE + +SG-t-SNE-Π supports the conventional t-SNE algorithm, through a set +of preprocessing functions. 
Issue + + make tsnepi + +to generate the `bin/tsnepi` binary, which is fully compatible with the +[existing wrappers](https://github.com/lvdmaaten/bhtsne/) provided by van der Maaten [[6](#VanDerMaaten2014)]. + +#### MATLAB interface + +To compile the SG-t-SNE-Π `MATLAB` wrappers, use the +`--enable-matlab` option in the `configure` command. The default +`MATLAB` installation path is `/opt/local/matlab`; otherwise, set +`MATLABROOT`: + + ./configure --enable-matlab MATLABROOT= + +### Usage demo + +We provide two data sets of modest size for demonstrating stochastic +graph embedding with SG-t-SNE-Π: + + tar -xvzf data/mobius-graph.tar.gz + bin/demo_stochastic_matrix mobius-graph.mtx + + tar -xvzf data/pbmc-graph.tar.gz + bin/demo_stochastic_matrix pbmc-graph.mtx + +The [MNIST data set](http://yann.lecun.com/exdb/mnist/) can be tested using [existing wrappers](https://github.com/lvdmaaten/bhtsne/) provided +by van der Maaten [[6](#VanDerMaaten2014)]. diff --git a/docs/doxygen.config b/docs/doxygen.config new file mode 100644 index 0000000..23e01cf --- /dev/null +++ b/docs/doxygen.config @@ -0,0 +1,2484 @@ +# Doxyfile 1.8.14 + +# This file describes the settings to be used by the documentation system +# doxygen (www.doxygen.org) for a project. +# +# All text after a double hash (##) is considered a comment and is placed in +# front of the TAG it is preceding. +# +# All text after a single hash (#) is considered a comment and will be ignored. +# The format is: +# TAG = value [value, ...] +# For lists, items can also be appended using: +# TAG += value [value, ...] +# Values that contain spaces should be placed between quotes (\" \"). + +#--------------------------------------------------------------------------- +# Project related configuration options +#--------------------------------------------------------------------------- + +# This tag specifies the encoding used for all characters in the config file +# that follow. 
The default is UTF-8 which is also the encoding used for all text +# before the first occurrence of this tag. Doxygen uses libiconv (or the iconv +# built into libc) for the transcoding. See +# https://www.gnu.org/software/libiconv/ for the list of possible encodings. +# The default value is: UTF-8. + +DOXYFILE_ENCODING = UTF-8 + +# The PROJECT_NAME tag is a single word (or a sequence of words surrounded by +# double-quotes, unless you are using Doxywizard) that should identify the +# project for which the documentation is generated. This name is used in the +# title of most generated pages and in a few other places. +# The default value is: My Project. + +PROJECT_NAME = "SG-t-SNE-Pi" + +# The PROJECT_NUMBER tag can be used to enter a project or revision number. This +# could be handy for archiving the generated documentation or if some version +# control system is used. + +PROJECT_NUMBER = + +# Using the PROJECT_BRIEF tag one can provide an optional one line description +# for a project that appears at the top of each page and should give viewer a +# quick idea about the purpose of the project. Keep the description short. + +PROJECT_BRIEF = + +# With the PROJECT_LOGO tag one can specify a logo or an icon that is included +# in the documentation. The maximum height of the logo should not exceed 55 +# pixels and the maximum width should not exceed 200 pixels. Doxygen will copy +# the logo to the output directory. + +PROJECT_LOGO = + +# The OUTPUT_DIRECTORY tag is used to specify the (relative or absolute) path +# into which the generated documentation will be written. If a relative path is +# entered, it will be relative to the location where doxygen was started. If +# left blank the current directory will be used. + +OUTPUT_DIRECTORY = docs + +# If the CREATE_SUBDIRS tag is set to YES then doxygen will create 4096 sub- +# directories (in 2 levels) under the output directory of each output format and +# will distribute the generated files over these directories. 
Enabling this +# option can be useful when feeding doxygen a huge amount of source files, where +# putting all generated files in the same directory would otherwise cause +# performance problems for the file system. +# The default value is: NO. + +CREATE_SUBDIRS = NO + +# If the ALLOW_UNICODE_NAMES tag is set to YES, doxygen will allow non-ASCII +# characters to appear in the names of generated files. If set to NO, non-ASCII +# characters will be escaped, for example _xE3_x81_x84 will be used for Unicode +# U+3044. +# The default value is: NO. + +ALLOW_UNICODE_NAMES = NO + +# The OUTPUT_LANGUAGE tag is used to specify the language in which all +# documentation generated by doxygen is written. Doxygen will use this +# information to generate all constant output in the proper language. +# Possible values are: Afrikaans, Arabic, Armenian, Brazilian, Catalan, Chinese, +# Chinese-Traditional, Croatian, Czech, Danish, Dutch, English (United States), +# Esperanto, Farsi (Persian), Finnish, French, German, Greek, Hungarian, +# Indonesian, Italian, Japanese, Japanese-en (Japanese with English messages), +# Korean, Korean-en (Korean with English messages), Latvian, Lithuanian, +# Macedonian, Norwegian, Persian (Farsi), Polish, Portuguese, Romanian, Russian, +# Serbian, Serbian-Cyrillic, Slovak, Slovene, Spanish, Swedish, Turkish, +# Ukrainian and Vietnamese. +# The default value is: English. + +OUTPUT_LANGUAGE = English + +# If the BRIEF_MEMBER_DESC tag is set to YES, doxygen will include brief member +# descriptions after the members that are listed in the file and class +# documentation (similar to Javadoc). Set to NO to disable this. +# The default value is: YES. + +BRIEF_MEMBER_DESC = YES + +# If the REPEAT_BRIEF tag is set to YES, doxygen will prepend the brief +# description of a member or function before the detailed description +# +# Note: If both HIDE_UNDOC_MEMBERS and BRIEF_MEMBER_DESC are set to NO, the +# brief descriptions will be completely suppressed. 
+# The default value is: YES. + +REPEAT_BRIEF = YES + +# This tag implements a quasi-intelligent brief description abbreviator that is +# used to form the text in various listings. Each string in this list, if found +# as the leading text of the brief description, will be stripped from the text +# and the result, after processing the whole list, is used as the annotated +# text. Otherwise, the brief description is used as-is. If left blank, the +# following values are used ($name is automatically replaced with the name of +# the entity):The $name class, The $name widget, The $name file, is, provides, +# specifies, contains, represents, a, an and the. + +ABBREVIATE_BRIEF = "The $name class" \ + "The $name widget" \ + "The $name file" \ + is \ + provides \ + specifies \ + contains \ + represents \ + a \ + an \ + the + +# If the ALWAYS_DETAILED_SEC and REPEAT_BRIEF tags are both set to YES then +# doxygen will generate a detailed section even if there is only a brief +# description. +# The default value is: NO. + +ALWAYS_DETAILED_SEC = NO + +# If the INLINE_INHERITED_MEMB tag is set to YES, doxygen will show all +# inherited members of a class in the documentation of that class as if those +# members were ordinary class members. Constructors, destructors and assignment +# operators of the base classes will not be shown. +# The default value is: NO. + +INLINE_INHERITED_MEMB = NO + +# If the FULL_PATH_NAMES tag is set to YES, doxygen will prepend the full path +# before files name in the file list and in the header files. If set to NO the +# shortest path that makes the file name unique will be used +# The default value is: YES. + +FULL_PATH_NAMES = YES + +# The STRIP_FROM_PATH tag can be used to strip a user-defined part of the path. +# Stripping is only done if one of the specified strings matches the left-hand +# part of the path. The tag can be used to show relative paths in the file list. 
+# If left blank the directory from which doxygen is run is used as the path to +# strip. +# +# Note that you can specify absolute paths here, but also relative paths, which +# will be relative from the directory where doxygen is started. +# This tag requires that the tag FULL_PATH_NAMES is set to YES. + +STRIP_FROM_PATH = + +# The STRIP_FROM_INC_PATH tag can be used to strip a user-defined part of the +# path mentioned in the documentation of a class, which tells the reader which +# header file to include in order to use a class. If left blank only the name of +# the header file containing the class definition is used. Otherwise one should +# specify the list of include paths that are normally passed to the compiler +# using the -I flag. + +STRIP_FROM_INC_PATH = + +# If the SHORT_NAMES tag is set to YES, doxygen will generate much shorter (but +# less readable) file names. This can be useful if your file system doesn't +# support long names like on DOS, Mac, or CD-ROM. +# The default value is: NO. + +SHORT_NAMES = NO + +# If the JAVADOC_AUTOBRIEF tag is set to YES then doxygen will interpret the +# first line (until the first dot) of a Javadoc-style comment as the brief +# description. If set to NO, the Javadoc-style will behave just like regular Qt- +# style comments (thus requiring an explicit @brief command for a brief +# description.) +# The default value is: NO. + +JAVADOC_AUTOBRIEF = NO + +# If the QT_AUTOBRIEF tag is set to YES then doxygen will interpret the first +# line (until the first dot) of a Qt-style comment as the brief description. If +# set to NO, the Qt-style will behave just like regular Qt-style comments (thus +# requiring an explicit \brief command for a brief description.) +# The default value is: NO. + +QT_AUTOBRIEF = NO + +# The MULTILINE_CPP_IS_BRIEF tag can be set to YES to make doxygen treat a +# multi-line C++ special comment block (i.e. a block of //! or /// comments) as +# a brief description. This used to be the default behavior. 
The new default is +# to treat a multi-line C++ comment block as a detailed description. Set this +# tag to YES if you prefer the old behavior instead. +# +# Note that setting this tag to YES also means that rational rose comments are +# not recognized any more. +# The default value is: NO. + +MULTILINE_CPP_IS_BRIEF = NO + +# If the INHERIT_DOCS tag is set to YES then an undocumented member inherits the +# documentation from any documented member that it re-implements. +# The default value is: YES. + +INHERIT_DOCS = YES + +# If the SEPARATE_MEMBER_PAGES tag is set to YES then doxygen will produce a new +# page for each member. If set to NO, the documentation of a member will be part +# of the file/class/namespace that contains it. +# The default value is: NO. + +SEPARATE_MEMBER_PAGES = NO + +# The TAB_SIZE tag can be used to set the number of spaces in a tab. Doxygen +# uses this value to replace tabs by spaces in code fragments. +# Minimum value: 1, maximum value: 16, default value: 4. + +TAB_SIZE = 4 + +# This tag can be used to specify a number of aliases that act as commands in +# the documentation. An alias has the form: +# name=value +# For example adding +# "sideeffect=@par Side Effects:\n" +# will allow you to put the command \sideeffect (or @sideeffect) in the +# documentation, which will result in a user-defined paragraph with heading +# "Side Effects:". You can put \n's in the value part of an alias to insert +# newlines (in the resulting output). You can put ^^ in the value part of an +# alias to insert a newline as if a physical newline was in the original file. + +ALIASES = + +# This tag can be used to specify a number of word-keyword mappings (TCL only). +# A mapping has the form "name=value". For example adding "class=itcl::class" +# will allow you to use the command class in the itcl::class meaning. + +TCL_SUBST = + +# Set the OPTIMIZE_OUTPUT_FOR_C tag to YES if your project consists of C sources +# only. 
Doxygen will then generate output that is more tailored for C. For +# instance, some of the names that are used will be different. The list of all +# members will be omitted, etc. +# The default value is: NO. + +OPTIMIZE_OUTPUT_FOR_C = NO + +# Set the OPTIMIZE_OUTPUT_JAVA tag to YES if your project consists of Java or +# Python sources only. Doxygen will then generate output that is more tailored +# for that language. For instance, namespaces will be presented as packages, +# qualified scopes will look different, etc. +# The default value is: NO. + +OPTIMIZE_OUTPUT_JAVA = NO + +# Set the OPTIMIZE_FOR_FORTRAN tag to YES if your project consists of Fortran +# sources. Doxygen will then generate output that is tailored for Fortran. +# The default value is: NO. + +OPTIMIZE_FOR_FORTRAN = NO + +# Set the OPTIMIZE_OUTPUT_VHDL tag to YES if your project consists of VHDL +# sources. Doxygen will then generate output that is tailored for VHDL. +# The default value is: NO. + +OPTIMIZE_OUTPUT_VHDL = NO + +# Doxygen selects the parser to use depending on the extension of the files it +# parses. With this tag you can assign which parser to use for a given +# extension. Doxygen has a built-in mapping, but you can override or extend it +# using this tag. The format is ext=language, where ext is a file extension, and +# language is one of the parsers supported by doxygen: IDL, Java, Javascript, +# C#, C, C++, D, PHP, Objective-C, Python, Fortran (fixed format Fortran: +# FortranFixed, free formatted Fortran: FortranFree, unknown formatted Fortran: +# Fortran. In the latter case the parser tries to guess whether the code is fixed +# or free formatted code, this is the default for Fortran type files), VHDL. For +# instance to make doxygen treat .inc files as Fortran files (default is PHP), +# and .f files as C (default is Fortran), use: inc=Fortran f=C. +# +# Note: For files without extension you can use no_extension as a placeholder. 
+# +# Note that for custom extensions you also need to set FILE_PATTERNS otherwise +# the files are not read by doxygen. + +EXTENSION_MAPPING = + +# If the MARKDOWN_SUPPORT tag is enabled then doxygen pre-processes all comments +# according to the Markdown format, which allows for more readable +# documentation. See http://daringfireball.net/projects/markdown/ for details. +# The output of markdown processing is further processed by doxygen, so you can +# mix doxygen, HTML, and XML commands with Markdown formatting. Disable only in +# case of backward compatibility issues. +# The default value is: YES. + +MARKDOWN_SUPPORT = YES + +# When the TOC_INCLUDE_HEADINGS tag is set to a non-zero value, all headings up +# to that level are automatically included in the table of contents, even if +# they do not have an id attribute. +# Note: This feature currently applies only to Markdown headings. +# Minimum value: 0, maximum value: 99, default value: 0. +# This tag requires that the tag MARKDOWN_SUPPORT is set to YES. + +TOC_INCLUDE_HEADINGS = 0 + +# When enabled doxygen tries to link words that correspond to documented +# classes, or namespaces to their corresponding documentation. Such a link can +# be prevented in individual cases by putting a % sign in front of the word or +# globally by setting AUTOLINK_SUPPORT to NO. +# The default value is: YES. + +AUTOLINK_SUPPORT = YES + +# If you use STL classes (i.e. std::string, std::vector, etc.) but do not want +# to include (a tag file for) the STL sources as input, then you should set this +# tag to YES in order to let doxygen match functions declarations and +# definitions whose arguments contain STL classes (e.g. func(std::string); +# versus func(std::string) {}). This also makes the inheritance and collaboration +# diagrams that involve STL classes more complete and accurate. +# The default value is: NO. 
+BUILTIN_STL_SUPPORT = NO + +# If you use Microsoft's C++/CLI language, you should set this option to YES to +# enable parsing support. +# The default value is: NO. + +CPP_CLI_SUPPORT = NO + +# Set the SIP_SUPPORT tag to YES if your project consists of sip (see: +# https://www.riverbankcomputing.com/software/sip/intro) sources only. Doxygen +# will parse them like normal C++ but will assume all classes use public instead +# of private inheritance when no explicit protection keyword is present. +# The default value is: NO. + +SIP_SUPPORT = NO + +# For Microsoft's IDL there are propget and propput attributes to indicate +# getter and setter methods for a property. Setting this option to YES will make +# doxygen replace the get and set methods by a property in the documentation. +# This will only work if the methods are indeed getting or setting a simple +# type. If this is not the case, or you want to show the methods anyway, you +# should set this option to NO. +# The default value is: YES. + +IDL_PROPERTY_SUPPORT = YES + +# If member grouping is used in the documentation and the DISTRIBUTE_GROUP_DOC +# tag is set to YES then doxygen will reuse the documentation of the first +# member in the group (if any) for the other members of the group. By default +# all members of a group must be documented explicitly. +# The default value is: NO. + +DISTRIBUTE_GROUP_DOC = NO + +# If one adds a struct or class to a group and this option is enabled, then also +# any nested class or struct is added to the same group. By default this option +# is disabled and one has to add nested compounds explicitly via \ingroup. +# The default value is: NO. + +GROUP_NESTED_COMPOUNDS = NO + +# Set the SUBGROUPING tag to YES to allow class member groups of the same type +# (for instance a group of public functions) to be put as a subgroup of that +# type (e.g. under the Public Functions section). Set it to NO to prevent +# subgrouping. 
Alternatively, this can be done per class using the +# \nosubgrouping command. +# The default value is: YES. + +SUBGROUPING = YES + +# When the INLINE_GROUPED_CLASSES tag is set to YES, classes, structs and unions +# are shown inside the group in which they are included (e.g. using \ingroup) +# instead of on a separate page (for HTML and Man pages) or section (for LaTeX +# and RTF). +# +# Note that this feature does not work in combination with +# SEPARATE_MEMBER_PAGES. +# The default value is: NO. + +INLINE_GROUPED_CLASSES = NO + +# When the INLINE_SIMPLE_STRUCTS tag is set to YES, structs, classes, and unions +# with only public data fields or simple typedef fields will be shown inline in +# the documentation of the scope in which they are defined (i.e. file, +# namespace, or group documentation), provided this scope is documented. If set +# to NO, structs, classes, and unions are shown on a separate page (for HTML and +# Man pages) or section (for LaTeX and RTF). +# The default value is: NO. + +INLINE_SIMPLE_STRUCTS = NO + +# When TYPEDEF_HIDES_STRUCT tag is enabled, a typedef of a struct, union, or +# enum is documented as struct, union, or enum with the name of the typedef. So +# typedef struct TypeS {} TypeT, will appear in the documentation as a struct +# with name TypeT. When disabled the typedef will appear as a member of a file, +# namespace, or class. And the struct will be named TypeS. This can typically be +# useful for C code in case the coding convention dictates that all compound +# types are typedef'ed and only the typedef is referenced, never the tag name. +# The default value is: NO. + +TYPEDEF_HIDES_STRUCT = NO + +# The size of the symbol lookup cache can be set using LOOKUP_CACHE_SIZE. This +# cache is used to resolve symbols given their name and scope. Since this can be +# an expensive process and often the same symbol appears multiple times in the +# code, doxygen keeps a cache of pre-resolved symbols. 
If the cache is too small +# doxygen will become slower. If the cache is too large, memory is wasted. The +# cache size is given by this formula: 2^(16+LOOKUP_CACHE_SIZE). The valid range +# is 0..9, the default is 0, corresponding to a cache size of 2^16=65536 +# symbols. At the end of a run doxygen will report the cache usage and suggest +# the optimal cache size from a speed point of view. +# Minimum value: 0, maximum value: 9, default value: 0. + +LOOKUP_CACHE_SIZE = 0 + +#--------------------------------------------------------------------------- +# Build related configuration options +#--------------------------------------------------------------------------- + +# If the EXTRACT_ALL tag is set to YES, doxygen will assume all entities in +# documentation are documented, even if no documentation was available. Private +# class members and static file members will be hidden unless the +# EXTRACT_PRIVATE respectively EXTRACT_STATIC tags are set to YES. +# Note: This will also disable the warnings about undocumented members that are +# normally produced when WARNINGS is set to YES. +# The default value is: NO. + +EXTRACT_ALL = NO + +# If the EXTRACT_PRIVATE tag is set to YES, all private members of a class will +# be included in the documentation. +# The default value is: NO. + +EXTRACT_PRIVATE = NO + +# If the EXTRACT_PACKAGE tag is set to YES, all members with package or internal +# scope will be included in the documentation. +# The default value is: NO. + +EXTRACT_PACKAGE = NO + +# If the EXTRACT_STATIC tag is set to YES, all static members of a file will be +# included in the documentation. +# The default value is: NO. + +EXTRACT_STATIC = NO + +# If the EXTRACT_LOCAL_CLASSES tag is set to YES, classes (and structs) defined +# locally in source files will be included in the documentation. If set to NO, +# only classes defined in header files are included. Does not have any effect +# for Java sources. +# The default value is: YES. 
+EXTRACT_LOCAL_CLASSES = YES + +# This flag is only useful for Objective-C code. If set to YES, local methods, +# which are defined in the implementation section but not in the interface are +# included in the documentation. If set to NO, only methods in the interface are +# included. +# The default value is: NO. + +EXTRACT_LOCAL_METHODS = NO + +# If this flag is set to YES, the members of anonymous namespaces will be +# extracted and appear in the documentation as a namespace called +# 'anonymous_namespace{file}', where file will be replaced with the base name of +# the file that contains the anonymous namespace. By default anonymous namespaces +# are hidden. +# The default value is: NO. + +EXTRACT_ANON_NSPACES = NO + +# If the HIDE_UNDOC_MEMBERS tag is set to YES, doxygen will hide all +# undocumented members inside documented classes or files. If set to NO these +# members will be included in the various overviews, but no documentation +# section is generated. This option has no effect if EXTRACT_ALL is enabled. +# The default value is: NO. + +HIDE_UNDOC_MEMBERS = NO + +# If the HIDE_UNDOC_CLASSES tag is set to YES, doxygen will hide all +# undocumented classes that are normally visible in the class hierarchy. If set +# to NO, these classes will be included in the various overviews. This option +# has no effect if EXTRACT_ALL is enabled. +# The default value is: NO. + +HIDE_UNDOC_CLASSES = NO + +# If the HIDE_FRIEND_COMPOUNDS tag is set to YES, doxygen will hide all friend +# (class|struct|union) declarations. If set to NO, these declarations will be +# included in the documentation. +# The default value is: NO. + +HIDE_FRIEND_COMPOUNDS = NO + +# If the HIDE_IN_BODY_DOCS tag is set to YES, doxygen will hide any +# documentation blocks found inside the body of a function. If set to NO, these +# blocks will be appended to the function's detailed documentation block. +# The default value is: NO. 
+ +HIDE_IN_BODY_DOCS = NO + +# The INTERNAL_DOCS tag determines if documentation that is typed after a +# \internal command is included. If the tag is set to NO then the documentation +# will be excluded. Set it to YES to include the internal documentation. +# The default value is: NO. + +INTERNAL_DOCS = NO + +# If the CASE_SENSE_NAMES tag is set to NO then doxygen will only generate file +# names in lower-case letters. If set to YES, upper-case letters are also +# allowed. This is useful if you have classes or files whose names only differ +# in case and if your file system supports case sensitive file names. Windows +# and Mac users are advised to set this option to NO. +# The default value is: system dependent. + +CASE_SENSE_NAMES = NO + +# If the HIDE_SCOPE_NAMES tag is set to NO then doxygen will show members with +# their full class and namespace scopes in the documentation. If set to YES, the +# scope will be hidden. +# The default value is: NO. + +HIDE_SCOPE_NAMES = NO + +# If the HIDE_COMPOUND_REFERENCE tag is set to NO (default) then doxygen will +# append additional text to a page's title, such as Class Reference. If set to +# YES the compound reference will be hidden. +# The default value is: NO. + +HIDE_COMPOUND_REFERENCE= NO + +# If the SHOW_INCLUDE_FILES tag is set to YES then doxygen will put a list of +# the files that are included by a file in the documentation of that file. +# The default value is: YES. + +SHOW_INCLUDE_FILES = YES + +# If the SHOW_GROUPED_MEMB_INC tag is set to YES then Doxygen will add for each +# grouped member an include statement to the documentation, telling the reader +# which file to include in order to use the member. +# The default value is: NO. + +SHOW_GROUPED_MEMB_INC = NO + +# If the FORCE_LOCAL_INCLUDES tag is set to YES then doxygen will list include +# files with double quotes in the documentation rather than with sharp brackets. +# The default value is: NO. 
+ +FORCE_LOCAL_INCLUDES = NO + +# If the INLINE_INFO tag is set to YES then a tag [inline] is inserted in the +# documentation for inline members. +# The default value is: YES. + +INLINE_INFO = YES + +# If the SORT_MEMBER_DOCS tag is set to YES then doxygen will sort the +# (detailed) documentation of file and class members alphabetically by member +# name. If set to NO, the members will appear in declaration order. +# The default value is: YES. + +SORT_MEMBER_DOCS = YES + +# If the SORT_BRIEF_DOCS tag is set to YES then doxygen will sort the brief +# descriptions of file, namespace and class members alphabetically by member +# name. If set to NO, the members will appear in declaration order. Note that +# this will also influence the order of the classes in the class list. +# The default value is: NO. + +SORT_BRIEF_DOCS = NO + +# If the SORT_MEMBERS_CTORS_1ST tag is set to YES then doxygen will sort the +# (brief and detailed) documentation of class members so that constructors and +# destructors are listed first. If set to NO the constructors will appear in the +# respective orders defined by SORT_BRIEF_DOCS and SORT_MEMBER_DOCS. +# Note: If SORT_BRIEF_DOCS is set to NO this option is ignored for sorting brief +# member documentation. +# Note: If SORT_MEMBER_DOCS is set to NO this option is ignored for sorting +# detailed member documentation. +# The default value is: NO. + +SORT_MEMBERS_CTORS_1ST = NO + +# If the SORT_GROUP_NAMES tag is set to YES then doxygen will sort the hierarchy +# of group names into alphabetical order. If set to NO the group names will +# appear in their defined order. +# The default value is: NO. + +SORT_GROUP_NAMES = NO + +# If the SORT_BY_SCOPE_NAME tag is set to YES, the class list will be sorted by +# fully-qualified names, including namespaces. If set to NO, the class list will +# be sorted only by class name, not including the namespace part. +# Note: This option is not very useful if HIDE_SCOPE_NAMES is set to YES. 
+# Note: This option applies only to the class list, not to the alphabetical +# list. +# The default value is: NO. + +SORT_BY_SCOPE_NAME = NO + +# If the STRICT_PROTO_MATCHING option is enabled and doxygen fails to do proper +# type resolution of all parameters of a function it will reject a match between +# the prototype and the implementation of a member function even if there is +# only one candidate or it is obvious which candidate to choose by doing a +# simple string match. By disabling STRICT_PROTO_MATCHING doxygen will still +# accept a match between prototype and implementation in such cases. +# The default value is: NO. + +STRICT_PROTO_MATCHING = NO + +# The GENERATE_TODOLIST tag can be used to enable (YES) or disable (NO) the todo +# list. This list is created by putting \todo commands in the documentation. +# The default value is: YES. + +GENERATE_TODOLIST = YES + +# The GENERATE_TESTLIST tag can be used to enable (YES) or disable (NO) the test +# list. This list is created by putting \test commands in the documentation. +# The default value is: YES. + +GENERATE_TESTLIST = YES + +# The GENERATE_BUGLIST tag can be used to enable (YES) or disable (NO) the bug +# list. This list is created by putting \bug commands in the documentation. +# The default value is: YES. + +GENERATE_BUGLIST = YES + +# The GENERATE_DEPRECATEDLIST tag can be used to enable (YES) or disable (NO) +# the deprecated list. This list is created by putting \deprecated commands in +# the documentation. +# The default value is: YES. + +GENERATE_DEPRECATEDLIST= YES + +# The ENABLED_SECTIONS tag can be used to enable conditional documentation +# sections, marked by \if ... \endif and \cond +# ... \endcond blocks. + +ENABLED_SECTIONS = + +# The MAX_INITIALIZER_LINES tag determines the maximum number of lines that the +# initial value of a variable or macro / define can have for it to appear in the +# documentation. 
If the initializer consists of more lines than specified here +# it will be hidden. Use a value of 0 to hide initializers completely. The +# appearance of the value of individual variables and macros / defines can be +# controlled using \showinitializer or \hideinitializer command in the +# documentation regardless of this setting. +# Minimum value: 0, maximum value: 10000, default value: 30. + +MAX_INITIALIZER_LINES = 30 + +# Set the SHOW_USED_FILES tag to NO to disable the list of files generated at +# the bottom of the documentation of classes and structs. If set to YES, the +# list will mention the files that were used to generate the documentation. +# The default value is: YES. + +SHOW_USED_FILES = YES + +# Set the SHOW_FILES tag to NO to disable the generation of the Files page. This +# will remove the Files entry from the Quick Index and from the Folder Tree View +# (if specified). +# The default value is: YES. + +SHOW_FILES = YES + +# Set the SHOW_NAMESPACES tag to NO to disable the generation of the Namespaces +# page. This will remove the Namespaces entry from the Quick Index and from the +# Folder Tree View (if specified). +# The default value is: YES. + +SHOW_NAMESPACES = YES + +# The FILE_VERSION_FILTER tag can be used to specify a program or script that +# doxygen should invoke to get the current version for each file (typically from +# the version control system). Doxygen will invoke the program by executing (via +# popen()) the command command input-file, where command is the value of the +# FILE_VERSION_FILTER tag, and input-file is the name of an input file provided +# by doxygen. Whatever the program writes to standard output is used as the file +# version. For an example see the documentation. + +FILE_VERSION_FILTER = + +# The LAYOUT_FILE tag can be used to specify a layout file which will be parsed +# by doxygen. The layout file controls the global structure of the generated +# output files in an output format independent way. 
To create the layout file +# that represents doxygen's defaults, run doxygen with the -l option. You can +# optionally specify a file name after the option, if omitted DoxygenLayout.xml +# will be used as the name of the layout file. +# +# Note that if you run doxygen from a directory containing a file called +# DoxygenLayout.xml, doxygen will parse it automatically even if the LAYOUT_FILE +# tag is left empty. + +LAYOUT_FILE = + +# The CITE_BIB_FILES tag can be used to specify one or more bib files containing +# the reference definitions. This must be a list of .bib files. The .bib +# extension is automatically appended if omitted. This requires the bibtex tool +# to be installed. See also https://en.wikipedia.org/wiki/BibTeX for more info. +# For LaTeX the style of the bibliography can be controlled using +# LATEX_BIB_STYLE. To use this feature you need bibtex and perl available in the +# search path. See also \cite for info how to create references. + +CITE_BIB_FILES = + +#--------------------------------------------------------------------------- +# Configuration options related to warning and progress messages +#--------------------------------------------------------------------------- + +# The QUIET tag can be used to turn on/off the messages that are generated to +# standard output by doxygen. If QUIET is set to YES this implies that the +# messages are off. +# The default value is: NO. + +QUIET = NO + +# The WARNINGS tag can be used to turn on/off the warning messages that are +# generated to standard error (stderr) by doxygen. If WARNINGS is set to YES +# this implies that the warnings are on. +# +# Tip: Turn warnings on while writing the documentation. +# The default value is: YES. + +WARNINGS = YES + +# If the WARN_IF_UNDOCUMENTED tag is set to YES then doxygen will generate +# warnings for undocumented members. If EXTRACT_ALL is set to YES then this flag +# will automatically be disabled. +# The default value is: YES. 
+ +WARN_IF_UNDOCUMENTED = YES + +# If the WARN_IF_DOC_ERROR tag is set to YES, doxygen will generate warnings for +# potential errors in the documentation, such as not documenting some parameters +# in a documented function, or documenting parameters that don't exist or using +# markup commands wrongly. +# The default value is: YES. + +WARN_IF_DOC_ERROR = YES + +# This WARN_NO_PARAMDOC option can be enabled to get warnings for functions that +# are documented, but have no documentation for their parameters or return +# value. If set to NO, doxygen will only warn about wrong or incomplete +# parameter documentation, but not about the absence of documentation. +# The default value is: NO. + +WARN_NO_PARAMDOC = NO + +# If the WARN_AS_ERROR tag is set to YES then doxygen will immediately stop when +# a warning is encountered. +# The default value is: NO. + +WARN_AS_ERROR = NO + +# The WARN_FORMAT tag determines the format of the warning messages that doxygen +# can produce. The string should contain the $file, $line, and $text tags, which +# will be replaced by the file and line number from which the warning originated +# and the warning text. Optionally the format may contain $version, which will +# be replaced by the version of the file (if it could be obtained via +# FILE_VERSION_FILTER) +# The default value is: $file:$line: $text. + +WARN_FORMAT = "$file:$line: $text" + +# The WARN_LOGFILE tag can be used to specify a file to which warning and error +# messages should be written. If left blank the output is written to standard +# error (stderr). + +WARN_LOGFILE = + +#--------------------------------------------------------------------------- +# Configuration options related to the input files +#--------------------------------------------------------------------------- + +# The INPUT tag is used to specify the files and/or directories that contain +# documented source files. You may enter file names like myfile.cpp or +# directories like /usr/src/myproject. 
Separate the files or directories with +# spaces. See also FILE_PATTERNS and EXTENSION_MAPPING. +# Note: If this tag is empty the current directory is searched. + +INPUT = src/sgtsne.hpp src/knn.hpp src/types.hpp \ + src/sparsematrix.hpp src/utils.hpp \ + docs/MAINPAGE.md + +# This tag can be used to specify the character encoding of the source files +# that doxygen parses. Internally doxygen uses the UTF-8 encoding. Doxygen uses +# libiconv (or the iconv built into libc) for the transcoding. See the libiconv +# documentation (see: https://www.gnu.org/software/libiconv/) for the list of +# possible encodings. +# The default value is: UTF-8. + +INPUT_ENCODING = UTF-8 + +# If the value of the INPUT tag contains directories, you can use the +# FILE_PATTERNS tag to specify one or more wildcard patterns (like *.cpp and +# *.h) to filter out the source-files in the directories. +# +# Note that for custom extensions or not directly supported extensions you also +# need to set EXTENSION_MAPPING for the extension otherwise the files are not +# read by doxygen. +# +# If left blank the following patterns are tested: *.c, *.cc, *.cxx, *.cpp, +# *.c++, *.java, *.ii, *.ixx, *.ipp, *.i++, *.inl, *.idl, *.ddl, *.odl, *.h, +# *.hh, *.hxx, *.hpp, *.h++, *.cs, *.d, *.php, *.php4, *.php5, *.phtml, *.inc, +# *.m, *.markdown, *.md, *.mm, *.dox, *.py, *.pyw, *.f90, *.f95, *.f03, *.f08, +# *.f, *.for, *.tcl, *.vhd, *.vhdl, *.ucf and *.qsf.
+ +FILE_PATTERNS = *.c \ + *.cc \ + *.cxx \ + *.cpp \ + *.c++ \ + *.java \ + *.ii \ + *.ixx \ + *.ipp \ + *.i++ \ + *.inl \ + *.idl \ + *.ddl \ + *.odl \ + *.h \ + *.hh \ + *.hxx \ + *.hpp \ + *.h++ \ + *.cs \ + *.d \ + *.php \ + *.php4 \ + *.php5 \ + *.phtml \ + *.inc \ + *.m \ + *.markdown \ + *.md \ + *.mm \ + *.dox \ + *.py \ + *.pyw \ + *.f90 \ + *.f95 \ + *.f03 \ + *.f08 \ + *.f \ + *.for \ + *.tcl \ + *.vhd \ + *.vhdl \ + *.ucf \ + *.qsf + +# The RECURSIVE tag can be used to specify whether or not subdirectories should +# be searched for input files as well. +# The default value is: NO. + +RECURSIVE = NO + +# The EXCLUDE tag can be used to specify files and/or directories that should be +# excluded from the INPUT source files. This way you can easily exclude a +# subdirectory from a directory tree whose root is specified with the INPUT tag. +# +# Note that relative paths are relative to the directory from which doxygen is +# run. + +EXCLUDE = + +# The EXCLUDE_SYMLINKS tag can be used to select whether or not files or +# directories that are symbolic links (a Unix file system feature) are excluded +# from the input. +# The default value is: NO. + +EXCLUDE_SYMLINKS = NO + +# If the value of the INPUT tag contains directories, you can use the +# EXCLUDE_PATTERNS tag to specify one or more wildcard patterns to exclude +# certain files from those directories. +# +# Note that the wildcards are matched against the file with absolute path, so to +# exclude all test directories for example use the pattern */test/* + +EXCLUDE_PATTERNS = + +# The EXCLUDE_SYMBOLS tag can be used to specify one or more symbol names +# (namespaces, classes, functions, etc.) that should be excluded from the +# output. The symbol name can be a fully qualified name, a word, or if the +# wildcard * is used, a substring. 
Examples: ANamespace, AClass, +# AClass::ANamespace, ANamespace::*Test +# +# Note that the wildcards are matched against the file with absolute path, so to +# exclude all test directories use the pattern */test/* + +EXCLUDE_SYMBOLS = + +# The EXAMPLE_PATH tag can be used to specify one or more files or directories +# that contain example code fragments that are included (see the \include +# command). + +EXAMPLE_PATH = + +# If the value of the EXAMPLE_PATH tag contains directories, you can use the +# EXAMPLE_PATTERNS tag to specify one or more wildcard patterns (like *.cpp and +# *.h) to filter out the source-files in the directories. If left blank all +# files are included. + +EXAMPLE_PATTERNS = * + +# If the EXAMPLE_RECURSIVE tag is set to YES then subdirectories will be +# searched for input files to be used with the \include or \dontinclude commands +# irrespective of the value of the RECURSIVE tag. +# The default value is: NO. + +EXAMPLE_RECURSIVE = NO + +# The IMAGE_PATH tag can be used to specify one or more files or directories +# that contain images that are to be included in the documentation (see the +# \image command). + +IMAGE_PATH = + +# The INPUT_FILTER tag can be used to specify a program that doxygen should +# invoke to filter for each input file. Doxygen will invoke the filter program +# by executing (via popen()) the command: +# +# <filter> <input-file> +# +# where <filter> is the value of the INPUT_FILTER tag, and <input-file> is the +# name of an input file. Doxygen will then use the output that the filter +# program writes to standard output. If FILTER_PATTERNS is specified, this tag +# will be ignored. +# +# Note that the filter must not add or remove lines; it is applied before the +# code is scanned, but not when the output code is generated. If lines are added +# or removed, the anchors will not be placed correctly.
+# +# Note that for custom extensions or not directly supported extensions you also +# need to set EXTENSION_MAPPING for the extension otherwise the files are not +# properly processed by doxygen. + +INPUT_FILTER = + +# The FILTER_PATTERNS tag can be used to specify filters on a per file pattern +# basis. Doxygen will compare the file name with each pattern and apply the +# filter if there is a match. The filters are a list of the form: pattern=filter +# (like *.cpp=my_cpp_filter). See INPUT_FILTER for further information on how +# filters are used. If the FILTER_PATTERNS tag is empty or if none of the +# patterns match the file name, INPUT_FILTER is applied. +# +# Note that for custom extensions or not directly supported extensions you also +# need to set EXTENSION_MAPPING for the extension otherwise the files are not +# properly processed by doxygen. + +FILTER_PATTERNS = + +# If the FILTER_SOURCE_FILES tag is set to YES, the input filter (if set using +# INPUT_FILTER) will also be used to filter the input files that are used for +# producing the source files to browse (i.e. when SOURCE_BROWSER is set to YES). +# The default value is: NO. + +FILTER_SOURCE_FILES = NO + +# The FILTER_SOURCE_PATTERNS tag can be used to specify source filters per file +# pattern. A pattern will override the setting for FILTER_PATTERN (if any) and +# it is also possible to disable source filtering for a specific pattern using +# *.ext= (so without naming a filter). +# This tag requires that the tag FILTER_SOURCE_FILES is set to YES. + +FILTER_SOURCE_PATTERNS = + +# If the USE_MDFILE_AS_MAINPAGE tag refers to the name of a markdown file that +# is part of the input, its contents will be placed on the main page +# (index.html). This can be useful if you have a project on for instance GitHub +# and want to reuse the introduction page also for the doxygen output. 
+ +USE_MDFILE_AS_MAINPAGE = MAINPAGE.md + +#--------------------------------------------------------------------------- +# Configuration options related to source browsing +#--------------------------------------------------------------------------- + +# If the SOURCE_BROWSER tag is set to YES then a list of source files will be +# generated. Documented entities will be cross-referenced with these sources. +# +# Note: To get rid of all source code in the generated output, make sure that +# also VERBATIM_HEADERS is set to NO. +# The default value is: NO. + +SOURCE_BROWSER = NO + +# Setting the INLINE_SOURCES tag to YES will include the body of functions, +# classes and enums directly into the documentation. +# The default value is: NO. + +INLINE_SOURCES = NO + +# Setting the STRIP_CODE_COMMENTS tag to YES will instruct doxygen to hide any +# special comment blocks from generated source code fragments. Normal C, C++ and +# Fortran comments will always remain visible. +# The default value is: YES. + +STRIP_CODE_COMMENTS = YES + +# If the REFERENCED_BY_RELATION tag is set to YES then for each documented +# function all documented functions referencing it will be listed. +# The default value is: NO. + +REFERENCED_BY_RELATION = NO + +# If the REFERENCES_RELATION tag is set to YES then for each documented function +# all documented entities called/used by that function will be listed. +# The default value is: NO. + +REFERENCES_RELATION = NO + +# If the REFERENCES_LINK_SOURCE tag is set to YES and SOURCE_BROWSER tag is set +# to YES then the hyperlinks from functions in REFERENCES_RELATION and +# REFERENCED_BY_RELATION lists will link to the source code. Otherwise they will +# link to the documentation. +# The default value is: YES. 
+ +REFERENCES_LINK_SOURCE = YES + +# If SOURCE_TOOLTIPS is enabled (the default) then hovering a hyperlink in the +# source code will show a tooltip with additional information such as prototype, +# brief description and links to the definition and documentation. Since this +# will make the HTML file larger and loading of large files a bit slower, you +# can opt to disable this feature. +# The default value is: YES. +# This tag requires that the tag SOURCE_BROWSER is set to YES. + +SOURCE_TOOLTIPS = YES + +# If the USE_HTAGS tag is set to YES then the references to source code will +# point to the HTML generated by the htags(1) tool instead of doxygen built-in +# source browser. The htags tool is part of GNU's global source tagging system +# (see https://www.gnu.org/software/global/global.html). You will need version +# 4.8.6 or higher. +# +# To use it do the following: +# - Install the latest version of global +# - Enable SOURCE_BROWSER and USE_HTAGS in the config file +# - Make sure the INPUT points to the root of the source tree +# - Run doxygen as normal +# +# Doxygen will invoke htags (and that will in turn invoke gtags), so these +# tools must be available from the command line (i.e. in the search path). +# +# The result: instead of the source browser generated by doxygen, the links to +# source code will now point to the output of htags. +# The default value is: NO. +# This tag requires that the tag SOURCE_BROWSER is set to YES. + +USE_HTAGS = NO + +# If the VERBATIM_HEADERS tag is set to YES then doxygen will generate a +# verbatim copy of the header file for each class for which an include is +# specified. Set to NO to disable this. +# See also: Section \class. +# The default value is: YES.
+ +VERBATIM_HEADERS = YES + +#--------------------------------------------------------------------------- +# Configuration options related to the alphabetical class index +#--------------------------------------------------------------------------- + +# If the ALPHABETICAL_INDEX tag is set to YES, an alphabetical index of all +# compounds will be generated. Enable this if the project contains a lot of +# classes, structs, unions or interfaces. +# The default value is: YES. + +ALPHABETICAL_INDEX = YES + +# The COLS_IN_ALPHA_INDEX tag can be used to specify the number of columns in +# which the alphabetical index list will be split. +# Minimum value: 1, maximum value: 20, default value: 5. +# This tag requires that the tag ALPHABETICAL_INDEX is set to YES. + +COLS_IN_ALPHA_INDEX = 5 + +# In case all classes in a project start with a common prefix, all classes will +# be put under the same header in the alphabetical index. The IGNORE_PREFIX tag +# can be used to specify a prefix (or a list of prefixes) that should be ignored +# while generating the index headers. +# This tag requires that the tag ALPHABETICAL_INDEX is set to YES. + +IGNORE_PREFIX = + +#--------------------------------------------------------------------------- +# Configuration options related to the HTML output +#--------------------------------------------------------------------------- + +# If the GENERATE_HTML tag is set to YES, doxygen will generate HTML output +# The default value is: YES. + +GENERATE_HTML = YES + +# The HTML_OUTPUT tag is used to specify where the HTML docs will be put. If a +# relative path is entered the value of OUTPUT_DIRECTORY will be put in front of +# it. +# The default directory is: html. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_OUTPUT = html + +# The HTML_FILE_EXTENSION tag can be used to specify the file extension for each +# generated HTML page (for example: .htm, .php, .asp). +# The default value is: .html. 
+# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_FILE_EXTENSION = .html + +# The HTML_HEADER tag can be used to specify a user-defined HTML header file for +# each generated HTML page. If the tag is left blank doxygen will generate a +# standard header. +# +# To get valid HTML the header file must include any scripts and style sheets +# that doxygen needs, which depend on the configuration options used (e.g. +# the setting GENERATE_TREEVIEW). It is highly recommended to start with a +# default header using +# doxygen -w html new_header.html new_footer.html new_stylesheet.css +# YourConfigFile +# and then modify the file new_header.html. See also section "Doxygen usage" +# for information on how to generate the default header that doxygen normally +# uses. +# Note: The header is subject to change so you typically have to regenerate the +# default header when upgrading to a newer version of doxygen. For a description +# of the possible markers and block names see the documentation. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_HEADER = + +# The HTML_FOOTER tag can be used to specify a user-defined HTML footer for each +# generated HTML page. If the tag is left blank doxygen will generate a standard +# footer. See HTML_HEADER for more information on how to generate a default +# footer and what special commands can be used inside the footer. See also +# section "Doxygen usage" for information on how to generate the default footer +# that doxygen normally uses. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_FOOTER = + +# The HTML_STYLESHEET tag can be used to specify a user-defined cascading style +# sheet that is used by each HTML page. It can be used to fine-tune the look of +# the HTML output. If left blank doxygen will generate a default style sheet. +# See also section "Doxygen usage" for information on how to generate the style +# sheet that doxygen normally uses.
+# Note: It is recommended to use HTML_EXTRA_STYLESHEET instead of this tag, as +# it is more robust and this tag (HTML_STYLESHEET) will in the future become +# obsolete. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_STYLESHEET = + +# The HTML_EXTRA_STYLESHEET tag can be used to specify additional user-defined +# cascading style sheets that are included after the standard style sheets +# created by doxygen. Using this option one can overrule certain style aspects. +# This is preferred over using HTML_STYLESHEET since it does not replace the +# standard style sheet and is therefore more robust against future updates. +# Doxygen will copy the style sheet files to the output directory. +# Note: The order of the extra style sheet files is of importance (e.g. the last +# style sheet in the list overrules the setting of the previous ones in the +# list). For an example see the documentation. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_EXTRA_STYLESHEET = + +# The HTML_EXTRA_FILES tag can be used to specify one or more extra images or +# other source files which should be copied to the HTML output directory. Note +# that these files will be copied to the base HTML output directory. Use the +# $relpath^ marker in the HTML_HEADER and/or HTML_FOOTER files to load these +# files. In the HTML_STYLESHEET file, use the file name only. Also note that the +# files will be copied as-is; there are no commands or markers available. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_EXTRA_FILES = + +# The HTML_COLORSTYLE_HUE tag controls the color of the HTML output. Doxygen +# will adjust the colors in the style sheet and background images according to +# this color. Hue is specified as an angle on a colorwheel, see +# https://en.wikipedia.org/wiki/Hue for more information. For instance the value +# 0 represents red, 60 is yellow, 120 is green, 180 is cyan, 240 is blue, 300 +# purple, and 360 is red again. 
+# Minimum value: 0, maximum value: 359, default value: 220. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_COLORSTYLE_HUE = 220 + +# The HTML_COLORSTYLE_SAT tag controls the purity (or saturation) of the colors +# in the HTML output. For a value of 0 the output will use grayscales only. A +# value of 255 will produce the most vivid colors. +# Minimum value: 0, maximum value: 255, default value: 100. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_COLORSTYLE_SAT = 100 + +# The HTML_COLORSTYLE_GAMMA tag controls the gamma correction applied to the +# luminance component of the colors in the HTML output. Values below 100 +# gradually make the output lighter, whereas values above 100 make the output +# darker. The value divided by 100 is the actual gamma applied, so 80 represents +# a gamma of 0.8, the value 220 represents a gamma of 2.2, and 100 does not +# change the gamma. +# Minimum value: 40, maximum value: 240, default value: 80. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_COLORSTYLE_GAMMA = 80 + +# If the HTML_TIMESTAMP tag is set to YES then the footer of each generated HTML +# page will contain the date and time when the page was generated. Setting this +# to YES can help to show when doxygen was last run and thus if the +# documentation is up to date. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_TIMESTAMP = NO + +# If the HTML_DYNAMIC_MENUS tag is set to YES then the generated HTML +# documentation will contain a main index with vertical navigation menus that +# are dynamically created via Javascript. If disabled, the navigation index will +# consist of multiple levels of tabs that are statically embedded in every HTML +# page. Disable this option to support browsers that do not have Javascript, +# like the Qt help browser. +# The default value is: YES. +# This tag requires that the tag GENERATE_HTML is set to YES.
+ +HTML_DYNAMIC_MENUS = YES + +# If the HTML_DYNAMIC_SECTIONS tag is set to YES then the generated HTML +# documentation will contain sections that can be hidden and shown after the +# page has loaded. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_DYNAMIC_SECTIONS = NO + +# With HTML_INDEX_NUM_ENTRIES one can control the preferred number of entries +# shown in the various tree structured indices initially; the user can expand +# and collapse entries dynamically later on. Doxygen will expand the tree to +# such a level that at most the specified number of entries are visible (unless +# a fully collapsed tree already exceeds this amount). So setting the number of +# entries to 1 will produce a fully collapsed tree by default. 0 is a special +# value representing an infinite number of entries and will result in a fully +# expanded tree by default. +# Minimum value: 0, maximum value: 9999, default value: 100. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_INDEX_NUM_ENTRIES = 100 + +# If the GENERATE_DOCSET tag is set to YES, additional index files will be +# generated that can be used as input for Apple's Xcode 3 integrated development +# environment (see: https://developer.apple.com/tools/xcode/), introduced with +# OSX 10.5 (Leopard). To create a documentation set, doxygen will generate a +# Makefile in the HTML output directory. Running make will produce the docset in +# that directory and running make install will install the docset in +# ~/Library/Developer/Shared/Documentation/DocSets so that Xcode will find it at +# startup. See https://developer.apple.com/tools/creatingdocsetswithdoxygen.html +# for more information. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTML is set to YES. + +GENERATE_DOCSET = NO + +# This tag determines the name of the docset feed.
A documentation feed provides +# an umbrella under which multiple documentation sets from a single provider +# (such as a company or product suite) can be grouped. +# The default value is: Doxygen generated docs. +# This tag requires that the tag GENERATE_DOCSET is set to YES. + +DOCSET_FEEDNAME = "Doxygen generated docs" + +# This tag specifies a string that should uniquely identify the documentation +# set bundle. This should be a reverse domain-name style string, e.g. +# com.mycompany.MyDocSet. Doxygen will append .docset to the name. +# The default value is: org.doxygen.Project. +# This tag requires that the tag GENERATE_DOCSET is set to YES. + +DOCSET_BUNDLE_ID = org.doxygen.Project + +# The DOCSET_PUBLISHER_ID tag specifies a string that should uniquely identify +# the documentation publisher. This should be a reverse domain-name style +# string, e.g. com.mycompany.MyDocSet.documentation. +# The default value is: org.doxygen.Publisher. +# This tag requires that the tag GENERATE_DOCSET is set to YES. + +DOCSET_PUBLISHER_ID = org.doxygen.Publisher + +# The DOCSET_PUBLISHER_NAME tag identifies the documentation publisher. +# The default value is: Publisher. +# This tag requires that the tag GENERATE_DOCSET is set to YES. + +DOCSET_PUBLISHER_NAME = Publisher + +# If the GENERATE_HTMLHELP tag is set to YES then doxygen generates three +# additional HTML index files: index.hhp, index.hhc, and index.hhk. The +# index.hhp is a project file that can be read by Microsoft's HTML Help Workshop +# (see: http://www.microsoft.com/en-us/download/details.aspx?id=21138) on +# Windows. +# +# The HTML Help Workshop contains a compiler that can convert all HTML output +# generated by doxygen into a single compiled HTML file (.chm). Compiled HTML +# files are now used as the Windows 98 help format, and will replace the old +# Windows help format (.hlp) on all Windows platforms in the future. 
Compressed +# HTML files also contain an index, a table of contents, and you can search for +# words in the documentation. The HTML workshop also contains a viewer for +# compressed HTML files. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTML is set to YES. + +GENERATE_HTMLHELP = NO + +# The CHM_FILE tag can be used to specify the file name of the resulting .chm +# file. You can add a path in front of the file if the result should not be +# written to the html output directory. +# This tag requires that the tag GENERATE_HTMLHELP is set to YES. + +CHM_FILE = + +# The HHC_LOCATION tag can be used to specify the location (absolute path +# including file name) of the HTML help compiler (hhc.exe). If non-empty, +# doxygen will try to run the HTML help compiler on the generated index.hhp. +# The file has to be specified with full path. +# This tag requires that the tag GENERATE_HTMLHELP is set to YES. + +HHC_LOCATION = + +# The GENERATE_CHI flag controls if a separate .chi index file is generated +# (YES) or that it should be included in the master .chm file (NO). +# The default value is: NO. +# This tag requires that the tag GENERATE_HTMLHELP is set to YES. + +GENERATE_CHI = NO + +# The CHM_INDEX_ENCODING is used to encode HtmlHelp index (hhk), content (hhc) +# and project file content. +# This tag requires that the tag GENERATE_HTMLHELP is set to YES. + +CHM_INDEX_ENCODING = + +# The BINARY_TOC flag controls whether a binary table of contents is generated +# (YES) or a normal table of contents (NO) in the .chm file. Furthermore it +# enables the Previous and Next buttons. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTMLHELP is set to YES. + +BINARY_TOC = NO + +# The TOC_EXPAND flag can be set to YES to add extra items for group members to +# the table of contents of the HTML help documentation and to the tree view. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTMLHELP is set to YES. 
+ +TOC_EXPAND = NO + +# If the GENERATE_QHP tag is set to YES and both QHP_NAMESPACE and +# QHP_VIRTUAL_FOLDER are set, an additional index file will be generated that +# can be used as input for Qt's qhelpgenerator to generate a Qt Compressed Help +# (.qch) of the generated HTML documentation. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTML is set to YES. + +GENERATE_QHP = NO + +# If the QHG_LOCATION tag is specified, the QCH_FILE tag can be used to specify +# the file name of the resulting .qch file. The path specified is relative to +# the HTML output folder. +# This tag requires that the tag GENERATE_QHP is set to YES. + +QCH_FILE = + +# The QHP_NAMESPACE tag specifies the namespace to use when generating Qt Help +# Project output. For more information please see Qt Help Project / Namespace +# (see: http://doc.qt.io/qt-4.8/qthelpproject.html#namespace). +# The default value is: org.doxygen.Project. +# This tag requires that the tag GENERATE_QHP is set to YES. + +QHP_NAMESPACE = org.doxygen.Project + +# The QHP_VIRTUAL_FOLDER tag specifies the namespace to use when generating Qt +# Help Project output. For more information please see Qt Help Project / Virtual +# Folders (see: http://doc.qt.io/qt-4.8/qthelpproject.html#virtual-folders). +# The default value is: doc. +# This tag requires that the tag GENERATE_QHP is set to YES. + +QHP_VIRTUAL_FOLDER = doc + +# If the QHP_CUST_FILTER_NAME tag is set, it specifies the name of a custom +# filter to add. For more information please see Qt Help Project / Custom +# Filters (see: http://doc.qt.io/qt-4.8/qthelpproject.html#custom-filters). +# This tag requires that the tag GENERATE_QHP is set to YES. + +QHP_CUST_FILTER_NAME = + +# The QHP_CUST_FILTER_ATTRS tag specifies the list of the attributes of the +# custom filter to add. For more information please see Qt Help Project / Custom +# Filters (see: http://doc.qt.io/qt-4.8/qthelpproject.html#custom-filters). 
+# This tag requires that the tag GENERATE_QHP is set to YES. + +QHP_CUST_FILTER_ATTRS = + +# The QHP_SECT_FILTER_ATTRS tag specifies the list of the attributes this +# project's filter section matches. Qt Help Project / Filter Attributes (see: +# http://doc.qt.io/qt-4.8/qthelpproject.html#filter-attributes). +# This tag requires that the tag GENERATE_QHP is set to YES. + +QHP_SECT_FILTER_ATTRS = + +# The QHG_LOCATION tag can be used to specify the location of Qt's +# qhelpgenerator. If non-empty doxygen will try to run qhelpgenerator on the +# generated .qhp file. +# This tag requires that the tag GENERATE_QHP is set to YES. + +QHG_LOCATION = + +# If the GENERATE_ECLIPSEHELP tag is set to YES, additional index files will be +# generated, together with the HTML files, they form an Eclipse help plugin. To +# install this plugin and make it available under the help contents menu in +# Eclipse, the contents of the directory containing the HTML and XML files needs +# to be copied into the plugins directory of eclipse. The name of the directory +# within the plugins directory should be the same as the ECLIPSE_DOC_ID value. +# After copying Eclipse needs to be restarted before the help appears. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTML is set to YES. + +GENERATE_ECLIPSEHELP = NO + +# A unique identifier for the Eclipse help plugin. When installing the plugin +# the directory name containing the HTML and XML files should also have this +# name. Each documentation set should have its own identifier. +# The default value is: org.doxygen.Project. +# This tag requires that the tag GENERATE_ECLIPSEHELP is set to YES. + +ECLIPSE_DOC_ID = org.doxygen.Project + +# If you want full control over the layout of the generated HTML pages it might +# be necessary to disable the index and replace it with your own. The +# DISABLE_INDEX tag can be used to turn on/off the condensed index (tabs) at top +# of each HTML page. 
A value of NO enables the index and the value YES disables +# it. Since the tabs in the index contain the same information as the navigation +# tree, you can set this option to YES if you also set GENERATE_TREEVIEW to YES. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTML is set to YES. + +DISABLE_INDEX = NO + +# The GENERATE_TREEVIEW tag is used to specify whether a tree-like index +# structure should be generated to display hierarchical information. If the tag +# value is set to YES, a side panel will be generated containing a tree-like +# index structure (just like the one that is generated for HTML Help). For this +# to work a browser that supports JavaScript, DHTML, CSS and frames is required +# (i.e. any modern browser). Windows users are probably better off using the +# HTML help feature. Via custom style sheets (see HTML_EXTRA_STYLESHEET) one can +# further fine-tune the look of the index. As an example, the default style +# sheet generated by doxygen has an example that shows how to put an image at +# the root of the tree instead of the PROJECT_NAME. Since the tree basically has +# the same information as the tab index, you could consider setting +# DISABLE_INDEX to YES when enabling this option. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTML is set to YES. + +GENERATE_TREEVIEW = NO + +# The ENUM_VALUES_PER_LINE tag can be used to set the number of enum values that +# doxygen will group on one line in the generated HTML documentation. +# +# Note that a value of 0 will completely suppress the enum values from appearing +# in the overview section. +# Minimum value: 0, maximum value: 20, default value: 4. +# This tag requires that the tag GENERATE_HTML is set to YES. + +ENUM_VALUES_PER_LINE = 4 + +# If the treeview is enabled (see GENERATE_TREEVIEW) then this tag can be used +# to set the initial width (in pixels) of the frame in which the tree is shown. 
+# Minimum value: 0, maximum value: 1500, default value: 250. +# This tag requires that the tag GENERATE_HTML is set to YES. + +TREEVIEW_WIDTH = 250 + +# If the EXT_LINKS_IN_WINDOW option is set to YES, doxygen will open links to +# external symbols imported via tag files in a separate window. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTML is set to YES. + +EXT_LINKS_IN_WINDOW = NO + +# Use this tag to change the font size of LaTeX formulas included as images in +# the HTML documentation. When you change the font size after a successful +# doxygen run you need to manually remove any form_*.png images from the HTML +# output directory to force them to be regenerated. +# Minimum value: 8, maximum value: 50, default value: 10. +# This tag requires that the tag GENERATE_HTML is set to YES. + +FORMULA_FONTSIZE = 10 + +# Use the FORMULA_TRANSPARENT tag to determine whether or not the images +# generated for formulas are transparent PNGs. Transparent PNGs are not +# supported properly for IE 6.0, but are supported on all modern browsers. +# +# Note that when changing this option you need to delete any form_*.png files in +# the HTML output directory before the changes take effect. +# The default value is: YES. +# This tag requires that the tag GENERATE_HTML is set to YES. + +FORMULA_TRANSPARENT = YES + +# Enable the USE_MATHJAX option to render LaTeX formulas using MathJax (see +# https://www.mathjax.org) which uses client side Javascript for the rendering +# instead of using pre-rendered bitmaps. Use this if you do not have LaTeX +# installed or if you want formulas to look prettier in the HTML output. When +# enabled you may also need to install MathJax separately and configure the path +# to it using the MATHJAX_RELPATH option. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTML is set to YES.
+ +USE_MATHJAX = NO + +# When MathJax is enabled you can set the default output format to be used for +# the MathJax output. See the MathJax site (see: +# http://docs.mathjax.org/en/latest/output.html) for more details. +# Possible values are: HTML-CSS (which is slower, but has the best +# compatibility), NativeMML (i.e. MathML) and SVG. +# The default value is: HTML-CSS. +# This tag requires that the tag USE_MATHJAX is set to YES. + +MATHJAX_FORMAT = HTML-CSS + +# When MathJax is enabled you need to specify the location relative to the HTML +# output directory using the MATHJAX_RELPATH option. The destination directory +# should contain the MathJax.js script. For instance, if the mathjax directory +# is located at the same level as the HTML output directory, then +# MATHJAX_RELPATH should be ../mathjax. The default value points to the MathJax +# Content Delivery Network so you can quickly see the result without installing +# MathJax. However, it is strongly recommended to install a local copy of +# MathJax from https://www.mathjax.org before deployment. +# The default value is: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/. +# This tag requires that the tag USE_MATHJAX is set to YES. + +MATHJAX_RELPATH = https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/ + +# The MATHJAX_EXTENSIONS tag can be used to specify one or more MathJax +# extension names that should be enabled during MathJax rendering. For example +# MATHJAX_EXTENSIONS = TeX/AMSmath TeX/AMSsymbols +# This tag requires that the tag USE_MATHJAX is set to YES. + +MATHJAX_EXTENSIONS = + +# The MATHJAX_CODEFILE tag can be used to specify a file with javascript pieces +# of code that will be used on startup of the MathJax code. See the MathJax site +# (see: http://docs.mathjax.org/en/latest/output.html) for more details. For an +# example see the documentation. +# This tag requires that the tag USE_MATHJAX is set to YES. 
+ +MATHJAX_CODEFILE = + +# When the SEARCHENGINE tag is enabled doxygen will generate a search box for +# the HTML output. The underlying search engine uses javascript and DHTML and +# should work on any modern browser. Note that when using HTML help +# (GENERATE_HTMLHELP), Qt help (GENERATE_QHP), or docsets (GENERATE_DOCSET) +# there is already a search function so this one should typically be disabled. +# For large projects the javascript based search engine can be slow, then +# enabling SERVER_BASED_SEARCH may provide a better solution. It is possible to +# search using the keyboard; to jump to the search box use <access key> + S +# (what the <access key> is depends on the OS and browser, but it is typically +# <CTRL>, <ALT>/<option>