Hello! I see something similar was suggested a few years ago, but with how much hardware and multi-core machines have improved since then, it may be time to at least revisit the idea.
Please hear me out - I know adding multiprocessing support is not inherently easy, but there is an approach I've used myself that may not be too difficult for you to implement. (And before you suggest I take the source and do it myself: I can't. My language of choice is Python, and you're far more skilled than I :-) ).
What worked for me is starting multiple separate copies of the program, one per CPU core, with the full list of files to process split among them. For instance, with 4 CPU cores you launch 4 copies, each given a quarter of the file list. Once all the copies finish, the whole list is done.
With each copy being its own separate process (not just a thread of one process), a surprising number of function-reentrancy problems go away entirely, since each process gets its own private copy of the program's state.
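For illustration, here's a minimal Python sketch of the driver side of this approach (Python only because that's what I used; `./process_files` is a hypothetical stand-in for the actual program, and I'm assuming it accepts file paths as command-line arguments):

```python
#!/usr/bin/env python3
"""Driver sketch: split a file list across N separate copies of a worker program."""
import os
import subprocess
import sys

def chunk(items, n):
    """Split `items` into n roughly equal contiguous chunks."""
    k, r = divmod(len(items), n)
    start = 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        yield items[start:end]
        start = end

def main():
    files = sys.argv[1:]          # full list of files to process
    ncores = os.cpu_count() or 1  # one worker per CPU core
    # Launch one fully separate OS process per chunk of the file list.
    workers = [
        subprocess.Popen(["./process_files", *part])
        for part in chunk(files, ncores) if part
    ]
    # Wait for every copy to finish; the whole list is done when they all are.
    for w in workers:
        w.wait()

if __name__ == "__main__":
    main()
```

Because each worker is a fully separate process, there's no shared state to lock; the OS scheduler simply spreads the copies across the available cores.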
I know - multiprocessing can be tricky. For a CPU-intensive program like this one, however, the speedup should scale roughly with the number of cores.
Please, just let the idea settle in the back of your mind for a bit, and see how you feel about it after a while. It would be a huge speed boost if you can pull it off. Thank you for at least considering it. :-)