Announcement with regards to a major refactor of the plugin #158

arizvisa · 2022-10-31T21:16:40Z

arizvisa
Oct 31, 2022
Maintainer

Announcement about refactoring

Currently active development is being done in an off-branch labeled "persistence-refactor" (yea, I get this is a super shitty name). With the introduction of wrappers around the Hex-Rays API, I've come to the conclusion that there's a number of things that need to maintain state and so this branch reorganizes the entire repository in order to facilitate this.

What

The major change you'll notice is that the directory structure of the plugin has been completely reorganized. This puts all "internal" modules inside the "misc" directory, all application-specific modules inside the "application" directory, and general tools under the "tools" directory. The original directory structure was in all actuality just a piece of history since this plugin grew somewhat organically. Now it's organized in a way that actually makes sense.

Some of the additions include moving state that used to be kept within the base modules into their own "temporal" modules. Examples of this include the instruction decoding which used to be part of the "instruction" module. Things like this are now temporal in that their namespace will switch depending on the current state of the disassembler. So with regards to the "instruction" module, the operand decoders and other stuff would change depending on the processor that's detected.

Specifically, the contents of the "instruction" module will remain the same, but most of its internals are now in a module that you may import called "architecture". This "architecture" module uses the contents of the "procs" directory in order to determine an architecture's register state and how an instruction's operands are to be decoded, and if you want to add your own minsc-like decoder for an architecture it's simply a matter of dropping code in a module and registering it with minsc.

What (references)

I'm also re-working the way that references work. A user reported that the way the function.down api worked was slow, and he was 100% right because it was actually doing more than what the user actually wanted. Both function.up and function.down are actually used for references (rather than function calls), but since their output is always an integer, it's hard to distinguish that. Really, the best way to determine the callables for a function is to look at the basic blocks from the flowchart, but that's not how people think about things. So, as a result of this conversation, I've decided to make references first-class integers so that you can still do arithmetic with them, but also so you don't need to unpack their address to use them with a function that takes an address. This way you can use them wherever, but you can also check if a reference has metadata which you can then use to infer more information about it.
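A minimal sketch of what a "first-class integer" reference could look like, in plain Python without IDA. The `Reference` class, its `access` attribute layout, and the `describe` helper are hypothetical names used for illustration and are not the plugin's actual implementation.

```python
class Reference(int):
    """An integer address that also carries optional metadata (hypothetical)."""

    def __new__(cls, address, access=None):
        self = super().__new__(cls, address)
        self.access = access  # e.g. "r", "w", "x", or None when unknown
        return self

    def __repr__(self):
        return "Reference({:#x}, access={!r})".format(int(self), self.access)


def describe(ea):
    """Stand-in for any function that expects a plain integer address."""
    return "{:#x}".format(ea)


ref = Reference(0x401000, access="x")

# The reference can be passed wherever an address is expected...
text = describe(ref)

# ...and ordinary arithmetic still works. In this sketch the result is a
# plain int, so the metadata does not survive the addition.
plain = ref + 4
```

Whether arithmetic should preserve the metadata is a design choice that this sketch does not try to settle.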

References have always had metadata attached (in their "access" attribute, which used to be "reftype") so that you can determine if a reference is being read from (load), written to (store), or executed. This was actually done via an immutable interface. But now, these will be flags that you can independently check and modify. Most importantly, you can now merge references together (which will be done implicitly in certain cases). This way you can merge (union) data-references with code-references and get the exact access type that you'd expect (that being a "load" and an "execute", or just a simple "execute"). From these flags you can distinguish whether a reference requires you to dereference an address, or use it directly. So, if you combine it with the pythonic type of an operand/address (or the regular type information) you can easily determine the operations you need to perform in order to get the value for a specific reversing artifact in whatever language you use.
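The merging of access flags described above can be sketched with Python's standard `enum.Flag`. The `Access` type, `merge`, and `needs_dereference` are hypothetical names, and the rule that a load or store implies a dereference is an assumption made for this example rather than the plugin's documented behaviour.

```python
import enum


class Access(enum.Flag):
    """Hypothetical access flags for a reference."""
    READ = 1      # load
    WRITE = 2     # store
    EXECUTE = 4   # execute


def merge(*accesses):
    """Union any number of access flags into one value."""
    result = Access(0)
    for access in accesses:
        result |= access
    return result


def needs_dereference(access):
    """Assumption: a load or store means the address gets dereferenced."""
    return bool(access & (Access.READ | Access.WRITE))


data_ref = Access.READ
code_ref = Access.EXECUTE

# Merging a data-reference with a code-reference gives "load" and "execute".
combined = merge(data_ref, code_ref)
```

Each flag can then be checked independently, for example with `Access.EXECUTE in combined`.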

Why

The primary reason for this refactor is due to the Hex-Rays microregister set being separated from the disassembler's set of registers, yet still being somewhat related. The Hex-Rays decompiler also exposes modifications to these registers as intervals, so if you're trying to interact with the Hex-Rays microcode (for some reason), you'll need to go through a process to figure out which part of the register is actually being modified. The part that resolves this to the actual register (so you can promote to the full register or demote to a partial register) is already complete, and if you're using the internal debugger (as opposed to the external one) you should be able to use it to evaluate expressions to get the location being referenced as determined by Hex-Rays. This way when I integrate the "tree-sitter" parser for Hex-Rays, you can interact with the operations mostly using Python's basic types.
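To illustrate resolving an interval to a register, here is a small self-contained sketch. The register table and its byte offsets are invented for the example and do not reflect the real Hex-Rays microregister layout; `resolve` and `promote` are hypothetical helper names.

```python
# Hypothetical layout: name -> (byte offset, byte size) within a register file.
REGISTERS = {
    "rax": (0, 8), "eax": (0, 4), "ax": (0, 2), "al": (0, 1), "ah": (1, 1),
    "rcx": (8, 8), "ecx": (8, 4), "cx": (8, 2), "cl": (8, 1), "ch": (9, 1),
}


def covering(offset, size):
    """List (size, name) for every register that fully covers the interval."""
    return [
        (rsize, name)
        for name, (roff, rsize) in REGISTERS.items()
        if roff <= offset and offset + size <= roff + rsize
    ]


def resolve(offset, size):
    """Return the smallest register covering [offset, offset + size)."""
    candidates = covering(offset, size)
    if not candidates:
        raise KeyError((offset, size))
    return min(candidates)[1]


def promote(name):
    """Return the largest register that contains the given register."""
    offset, size = REGISTERS[name]
    return max(covering(offset, size))[1]


partial = resolve(1, 1)   # a one-byte modification at offset 1
full = promote(partial)   # the full register that contains it
```

Demotion would be the reverse lookup: starting from a full register and selecting the partial register that matches a narrower interval.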

The other reason for refactoring the codebase like this is to completely avoid the whole vdui_t garbage from Hex-Rays, which requires the user to be navigated to the desired function with the pseudocode view open in order to use it. The tools inside vdui_t are super-powerful, and are a pretty big part of what you'd want to interact with. Unfortunately, it requires interactivity and is super-flaky if you're trying to interact with a microcode pass that is not currently being viewed. So to deal with this, as well as to be able to map an mblock_t back to the actual mba_t, cfunc_t, or cfuncptr_t, the loader was tweaked a bit to allow support for temporal modules which can maintain state while currently inside the database.

Other than architecture, the other major temporal module is the hooks module. This used to be exposed via the ui.hook namespace (and still is), but now you can just import hook and then interact with whatever hooks you need to. This allows you to set up any kind of hook at any point where IDAPython is actually loaded. Attached to this hooks module are different properties which allow you to attach a Python callable to any of IDA's available hooks, and it also includes proper support for Hex-Rays' notifications.
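The general idea of attaching Python callables to named hooks can be sketched as a tiny registry. This is a toy model only: `HookRegistry`, its methods, and the "renamed" event name are hypothetical and do not mirror the interface of the plugin's hook module.

```python
from collections import defaultdict


class HookRegistry:
    """Toy registry that maps an event name to a list of Python callables."""

    def __init__(self):
        self._callbacks = defaultdict(list)

    def add(self, event, callback):
        """Attach a callable to the named event."""
        self._callbacks[event].append(callback)

    def discard(self, event, callback):
        """Detach a callable if it is currently attached."""
        if callback in self._callbacks[event]:
            self._callbacks[event].remove(callback)

    def notify(self, event, *args):
        """Call every attached callable and collect the results."""
        return [callback(*args) for callback in list(self._callbacks[event])]


hooks = HookRegistry()
seen = []


def on_renamed(ea, name):
    seen.append((ea, name))


hooks.add("renamed", on_renamed)
hooks.notify("renamed", 0x401000, "main")
```

A persistent module would additionally have to keep such a registry alive for the lifetime of the database, which this sketch does not attempt.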

When

You can find this inside the "persistence-refactor" branch which I'm doing all development work in. I'm keeping it in a separate branch because the changes in this branch aren't really bugfixes or incremental changes. Rather, they're still experiments. As an example, I've been experimenting with changing the data structure used for pattern matching so that it's a little more accurate and doesn't result in accidentally determining the correct callable for a constraint by doing a complete type check. Another example is that I'm considering refactoring the tag cache so that it's significantly faster on large databases, possibly with the option of using native code instead of pure-Python.

Compatibility

Pretty much everything should be the same and work exactly the same if you're using the public interface. The one thing that will likely change is that you won't be able to pickle your entire namespace anymore with things like dill if you're trying to save your state between open databases. This is due to the whole temporal modules thing and is related to some of the strange Python trickery that I'm performing to hide objects and other state from the user. This'll probably be fixed eventually (especially if people demand it), but it's not a priority in my opinion.

Conclusion

After this, I'm not sure if I'm going to keep maintaining the html documentation for each release as it's going to take some time for it to catch up to these changes. Its maintenance is a ton of work and really serves as more of an advertisement (rather than a reference) due to the idea of this plugin being to take advantage of Python's auto-completion and provide simple/useful documentation natively via Python's help() or the ? expression that IDAPython provides. If you feel otherwise, let me know here...or through the usual channels.

arizvisa · 2023-05-03T02:57:39Z

arizvisa
May 3, 2023
Maintainer Author

So far, here's a list of the major changes that have happened.

1. Directory structure has been reorganized. Internal modules now reside within the misc/ directory. Processor-specific functionality can now be added directly to the processor module in procs/. There are now 3 methods of installation depending on the scope in which you want to use the plugin (in a plugin, for an entire instance, or just for a specific database).
2. Some modules are now persistent which allows them to track database state. This was the original reason for the new branch, since interacting with Hex-Rays is pretty flaky if you're trying to monitor what the user is doing. Currently, the hook, architecture, and internal __catalog__ modules are persistent and can be used to interact with state across the scope of a database. I will probably add one or two more when developing a safe way to access Hex-Rays at arbitrary times and encapsulating logging (thanks rolfr for giving me the idea).
3. A number of algorithmic improvements. This should improve performance significantly. Some algorithms such as the one used for multicase functions were running in O(n), but this wasn't noticed because of being clever with the order of function definitions. More trees are being used in various places, such as when mapping the u-instruction intervals from Hex-Rays to the disassembler, etc.
4. Additional native types which can be treated as primitives. Prior to this branch, the integer was the core type for everything. The branch adds native support for references, registers, locations, and boundaries. Many of these can now be used in place of integers and they also include additional operators for doing checks. The structure type also includes additional operators that can be used for quick alignment calculations, like if you want to see how a structure can tile across a range of memory (specifically, use the multiplication or power operator for this).
5. Access types have been fixed. The access type was broken a long time ago before the plugin went public, but now it has been fixed. Memory accesses can be treated as an '&rwx' set which can be used to identify the type of reference that is associated with an instruction operand or whatever. Operand references also include an additional operator that can be used to determine the type of reference without having to unpack it out of its tuple. This completely deprecates the instruction.op_state and instruction.ops_state set of functions.
6. Structure paths and explicit structure paths are better distinguished from one another. This originated from a Discord monologue, but now the ins.op_structure and ins.op_structurepath functions have clearly different semantics and do a better job of calculating the right offset to apply to an operand when referencing your desired field.
7. The ctypes module is now only used for interacting with the IDA SDK library. Prior to this, ctypes was used as a sort of hack to deal with decoding structures and arrays since there aren't any other typesystem libraries that I like (other than my own). This of course isn't included since that's a completely different project in a completely different repository. Now all encoding and decoding of types are handled natively and it includes support for integer and IEEE float sizes that were not supported by ctypes. Decoding of unions and nested structures is now natively supported.
8. All mathematica-ese functions and aliases have been removed. These were actually just remnants of history and needed to be removed. There are now significantly fewer aliases for both namespaces and functions which makes it easier for auto-complete to find your desired functionality. The naming scheme now opts for either straight English or ida-speak for functions (i.e. structure vs struc, function vs func, etc) and single-letter aliases for namespaces.
9. Better support for idaapi.tinfo_t and type libraries in general. Now there's a number of functions for interacting with this type as well as utilities for searching and constructing types. Another addition that is related to this is the support for parsing C++ templates out of strings.
10. More things can be serialized reliably with pickle. Many things can now be serialized, although the previous feature of being able to use the dill module to save your entire workspace has been deprecated (as a result of the persistent modules addition from point 2). Although I had considered this incredibly useful and had used it for a very long time, workspace preservation had essentially lost priority in favor of additional Hex-Rays support.
11. Better support for pythonic types. Originally I had considered this a gimmick, but a random one-time user convinced me that it's more useful than I'd originally thought. Things like structure members which can take a pythonic type can also take an idaapi.tinfo_t to destructively apply a type. Also, pythonic types can be returned for operands and registers or explicitly applied to an address similar to a structure. This can be useful for updating a reference with the exact same type as an operand where an idaapi.tinfo_t is not relevant.
12. A number of the namespace matchers now include flags in their output that can be used to quickly identify information that may be pertinent to your searching.
13. The database.address namespace has been greatly expanded and is essentially a core namespace now. It can be used for doing all sorts of things such as converting bounds_t or ranges into a list of addresses, filtering using any of the idaapi.next_* and idaapi.prev_* functions, names to addresses, offset calculation, etc.
14. All namespaces that are treated as functions will always return lists. If you want an iterator, simply use your desired namespace as $namespace.iterate. Since all output is converted to hex anyways, returning everything as a list by default makes it much quicker to see what is in a particular dataset.
15. Better support for all of the available string types which also includes support for arbitrary encodings and decodings. Now all string types are represented as a tuple that describes the character width and the length prefix. This has also been added to pythonic types using the str type. Now if you pack these in a tuple, you can specify a wide-character string with a 2-byte length prefix as (str, 2, 2). The set.string and get.string functions have also been improved in response. The available encodings depend entirely upon Python's codecs module, so if your database includes a weird encoding and you wish to use it, you might have to specify it as an index when applying your string.
16. The database.config namespace has been renamed to database.information. This is a minor change as there's still an alias for the original name, but calling it database.config did not make sense to me since I ended up needing access to it practically everywhere. There have been a couple of changes to this namespace. One example is that database.information.path now takes a parameter so that you can use it to generate a path relative to the database location.
17. Many performance and consistency improvements. Now all the hooks are scheduled which should reduce the database build time. Common tools that end up being variations due to spotty disassembler support are packed into internal modules so that the same logic is used throughout the plugin. We still aim for backwards compatibility with older versions of the disassembler, but now a lot of the version checking is pulled out of the function which removes some conditionals from common hot paths.
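As an illustration of the string tuple mentioned in the list above, the following sketch decodes a length-prefixed string described by (str, width, prefix) from raw bytes. It is not the plugin's code: the `decode` helper, the codec table, and the assumption that the prefix counts characters (rather than bytes) in little-endian order are all choices made for this example.

```python
import struct

# Assumed mapping from character width to a little-endian codec name.
CODECS = {1: "latin-1", 2: "utf-16-le", 4: "utf-32-le"}

# struct format for each supported length-prefix size (0 means no prefix).
PREFIXES = {0: None, 1: "<B", 2: "<H", 4: "<I"}


def decode(data, pythonic):
    """Decode bytes using a (str, character width, prefix size) description."""
    kind, width, prefix = pythonic
    if kind is not str:
        raise TypeError(kind)
    fmt = PREFIXES[prefix]
    if fmt is None:
        count = len(data) // width          # no prefix: consume everything
    else:
        (count,) = struct.unpack_from(fmt, data, 0)
    raw = data[prefix:prefix + count * width]
    return raw.decode(CODECS[width])


# A wide-character string with a 2-byte length prefix, i.e. (str, 2, 2).
blob = struct.pack("<H", 2) + "hi".encode("utf-16-le")
wide = decode(blob, (str, 2, 2))
```

Other encodings would be looked up by name through Python's `codecs` machinery instead of the fixed table used here.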

To be continued in the next post some time in the future...

0 replies

arizvisa · 2024-07-19T05:01:28Z

arizvisa
Jul 19, 2024
Maintainer Author

So, originally I was planning to merge the "persistence-refactor" branch into mainline this year and include the recently introduced decompiler utilities too. At least, that was the goal...

However, I learned pretty recently that the entire structure/enumeration/frame API was going to be culled in favor of the local type library api that is based on tinfo_t. This is supposed to happen in the next major version of IDA. This unfortunately means that it is likely that all of the structure/enumeration/frame functionality inside the ida-minsc plugin will end up being busted during that release.

Addresses, Identifiers, and Type information

In IDA, an identifier or address is generally treated as a first-class citizen. Anything with an identifier/address can have a name, can be referenced with an xref, and can contain other information stored for it within a netnode.

The structure and enumeration API actually uses this identifier internally for storing all of its information. Other than some minor additions and the member offset being shifted, the frame api is essentially the same as the structure api. Structure-related operations such as operand structure paths, structures being applied to an address, etc. accomplish this by associating the structure id with the address.

Type information, however, is somewhat "optional". Not all addresses/identifiers will have a type associated with them, and the type information can even be removed/disassociated from an address/identifier.

Due to this original design, this plugin has always treated type information as a secondary source of information for an address. That's not to say that the type information is ignored, but rather that it is deprioritized when determining the boundaries of the type applied to an address/identifier. Essentially, the ida-minsc plugin relies on the disassembler/decompiler to manage the type information for an address/identifier implicitly and always assumes that the "flags" for an address/identifier are the actual "truth" of the type.

Support @ Hex-Rays

So, I questioned support@ about how they plan to progressively deprecate this API, specifically since members don't have IDs anymore and are now coupled to the udtmembervec_t (which is essentially an array of all the members belonging to a structure).

The following list of questions is from memory, so it probably doesn't cover all of the concerns I asked about. (I'll correct this eventually, but it's not like anybody reads any of the shit I post here anyways).

1. Will the FF_STRUCT flag associated with an address still be used, or is this planned to be removed as well? (This affects next_that and prev_that which are used to quickly locate the previous/next structure associated with an address).
2. How will structure/union paths associated with an operand or references to a structure member appear? These are generally stored in an opinfo_t as an array of identifiers containing the member decisions needed to display a specific path through a union/structure for an operand. As a udm_t doesn't have an identifier, I expect this to change significantly. I'm also guessing that this would mean that you can't apply a structure path for a function frame to an operand.
3. How does one go from a udm_t directly to the parent type that contains it? The previous API uncouples structures/unions from their members which allows you to deal with a smaller subset of members rather than all of them at the same time. This way if you have a structure that contains 20k members, you can avoid having to process all of them at the same time, instead focusing on members within a specific range.
4. Will changes to the frame API reflect how the decompiler deals with the stack frame or will this change as well? This question is not a big deal.
5. Presently, I'm unsure how to find the member that comes before or after a "hole" in a structure. This was handled originally via the get_prev_member_idx and get_next_member_idx. However, in 8.4 the find_udm api can only be used for finding members. So, similar to question 3, this might require enumerating all 20k members of a structure to identify a "hole" that some user might want to fill.
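The "hole" concern from the last question can be made concrete with a plain-Python sketch. It models a structure as a list of (offset, size, name) tuples sorted by offset and uses the standard `bisect` module to find the neighbours of an offset. The data model and the `neighbors` and `is_hole` helpers are hypothetical and do not use any IDA API.

```python
import bisect


def neighbors(members, offset):
    """Return the members at-or-before and after `offset` (None if absent).

    `members` is a list of (offset, size, name) tuples sorted by offset.
    The list of starting offsets is rebuilt on each call for brevity; a
    real cache would keep it around to avoid the linear pass.
    """
    starts = [member[0] for member in members]
    index = bisect.bisect_right(starts, offset)
    before = members[index - 1] if index > 0 else None
    after = members[index] if index < len(members) else None
    return before, after


def is_hole(members, offset):
    """True when no member covers the given offset."""
    before, _ = neighbors(members, offset)
    return before is None or offset >= before[0] + before[1]


members = [(0, 4, "a"), (4, 4, "b"), (16, 8, "c")]
before, after = neighbors(members, 10)   # offset 10 lies between "b" and "c"
```

With a sorted index like this, only the lookup is logarithmic. Without one, every query means walking all of the members, which is the cost the question is worried about.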

I also asked about how to accomplish some of these things with the current 8.3 API in order to ease transitioning to the new API. Unfortunately, I didn't get an answer other than "Here's what works for me in 8.4."

What this (might) affect

Now...There is a LOT of code in this plugin that is based on the structure api in its older form. All of the interval arithmetic, structure and frame members, and structure encoding and decoding utilise the soon-to-be-deprecated structure API. Type references and even serialization/deserialization are also based on this API. Essentially when this API is removed by Hex-Rays, a lot of the things that I (personally) find useful about this plugin will end up ceasing to work.

Here is a non-exhaustive list:

Structure/union/frame encoding and decoding
Structure/union/frame intervals and arithmetic
Reading or applying paths via ins.op_struc and ins.op_strucpath
Serialization/Deserialization of types
References to members and structures/unions/frames
Structure/union/frame slice selections, assignments, and removals
Structure/union/frame/enum/member indexing and tagging
Structure/union/frame layouting and fragments
Pythonic types involving structures/unions and variadic structures
Pythonic types associated with members
The enumeration module could end up being deprecated in its current form.
I'll need to refactor some of the APIs in order to properly support alignment.

Conclusion

Because of the uncertainty associated with the changes that are going to happen in the future, I expect the "persistence-refactor" branch will be delayed until a couple minor versions after the next major version of IDA.

It is also pretty likely that, during the transition to the new local type library api, plugin compatibility with certain versions of IDA will not be able to be maintained until the local type library becomes as mature as the old structure/union/frame API.

The way I plan to approach this transition is by creating an entirely separate implementation of the misc/structure.py module. I am also considering maintaining a separate cache for mapping structure/union/frame members to their parent type until I can be sure that there is a way to keep members uncoupled from their type.

So in the end, if I don't end up putting a gun to my head (and pulling the trigger), this will result in there being two versions of the misc/structure.py module. There will be the current version of the module which uses the old structure api, and then there will be a different version of that same module for new versions of IDA. This distinctly separate version will attempt to utilise the tinfo_t API in order to maintain as much of the present functionality as possible. I will also end up introducing the misc/enumeration.py module so that I can attempt to mirror the current functionality for enumerations using tinfo_t.
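One way to picture two implementations behind the same interface is a version-based selector. Everything here is hypothetical: the class names, the `select_backend` helper, and the cutoff of 900 (chosen on the assumption that the next major version is IDA 9.0) are made up for illustration and are not how the plugin is known to do it.

```python
class LegacyStructureBackend:
    """Placeholder for an implementation built on the old structure API."""
    name = "legacy"


class TinfoStructureBackend:
    """Placeholder for an implementation built on the tinfo_t API."""
    name = "tinfo"


# (minimum SDK version, backend class), ordered from newest to oldest.
# The 900 cutoff is an assumption for this example.
BACKENDS = [
    (900, TinfoStructureBackend),
    (0, LegacyStructureBackend),
]


def select_backend(sdk_version):
    """Return an instance of the first backend that supports the version."""
    for minimum, backend in BACKENDS:
        if sdk_version >= minimum:
            return backend()
    raise RuntimeError("unsupported SDK version: {!r}".format(sdk_version))


old = select_backend(830)
new = select_backend(900)
```

Callers would then only talk to the selected backend, so the rest of the code does not need to know which API sits underneath.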

3 replies

arizvisa Aug 12, 2024
Maintainer Author

Examining the leaked SDK, all structure, enumeration, and frame events were removed in favor of lt_udm_created, lt_udm_deleted, lt_udm_renamed, lt_udm_changed, lt_udt_expanded, frame_created, frame_udm_created, frame_udm_deleted, frame_udm_renamed, frame_udm_changed, and frame_expanded.

arizvisa Aug 12, 2024
Maintainer Author

Also looks like find_binary has another parameter attached to it, for the string literal encoding.

arizvisa Sep 2, 2024
Maintainer Author

https://docs.hex-rays.com/pre-release/developer-guide/idapython/idapython-porting-guide-ida-9
