Has anyone done this kind of user-mode sampling before?
What is the performance impact on the process?
For Locks and Waits analysis, the collector does a lot of extra work to monitor sync objects, I/O waits, and threading APIs, and in an application with 1000+ threads this will affect performance. My suggestion is to use the Pause/Resume API from the ittnotify library. That way you collect data only for the code region (time period) you are interested in, which reduces overhead. Read this article.
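As a rough sketch of the Pause/Resume idea: `__itt_pause()` and `__itt_resume()` are the ittnotify calls, while `COLLECTION_PAUSE`, `COLLECTION_RESUME`, `USE_ITTNOTIFY`, `do_work` and `run_with_collection_window` are names made up for this illustration. It assumes you build with `-DUSE_ITTNOTIFY` and link against the ittnotify library when profiling; without that define the macros compile to nothing, so the same source builds anywhere.

```c
#include <stdio.h>

/* Assumption: define USE_ITTNOTIFY and link the ittnotify library for a
   profiled build. Otherwise the macros below are no-ops (hypothetical
   wrappers, not part of ittnotify itself). */
#ifdef USE_ITTNOTIFY
#include <ittnotify.h>
#define COLLECTION_PAUSE()  __itt_pause()
#define COLLECTION_RESUME() __itt_resume()
#else
#define COLLECTION_PAUSE()  ((void)0)
#define COLLECTION_RESUME() ((void)0)
#endif

/* Hypothetical stand-in for the application's real work. */
static long do_work(long iterations)
{
    long sum = 0;
    for (long i = 0; i < iterations; ++i) {
        sum += i % 7;
    }
    return sum;
}

/* Runs an uninteresting setup phase with collection paused, then the
   region of interest with collection resumed. */
long run_with_collection_window(long iterations)
{
    COLLECTION_PAUSE();               /* setup: not worth the overhead */
    long setup = do_work(iterations);
    (void)setup;

    COLLECTION_RESUME();              /* region of interest starts */
    long result = do_work(iterations);
    COLLECTION_PAUSE();               /* region of interest ends */

    return result;
}
```

Calling `run_with_collection_window(n)` from the thread that drives the workload should limit data collection to the second `do_work` call. Whether the collection should also be started in a paused state depends on how you launch the analysis.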
Is it possible to use Hardware Event-Based Sampling to get results similar to Locks and Waits?
Hardware event-based sampling with stack collection enabled is another option. It shows context switches, wait time, etc. for each function, and the timeline pane shows each thread's CPU usage. However, it does not provide CPU time information per sync object.