This might be a bit of a stretch, but to help the image generator learn what we’re aiming to make, perhaps a base image could be set, or offered as an optional input, to give the generator an idea of what we’re looking for on some of the prompts?
Human Pose Estimation (HPE) is already an existing technology; search the web for "HPE".