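Peterson's lock was designed for precisely this scenario. However, it is only correct if each thread's stores and loads become visible to the other processor in the right order. The algorithm assumes a sequentially consistent memory model, so on some platforms you will need memory fences (or sequentially consistent atomic operations) before and/or after some of the stores and loads. In particular, a thread's store to its own flag must become visible before it loads the other thread's flag, and that store-then-load ordering is not guaranteed even on x86 without a fence.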
You also need to bear in mind that Peterson's lock is not a replacement for a general-purpose mutex: each thread that tries to enter the critical section must have a specific ID. The basic algorithm only works with two specific threads, though it is possible to extend it to more.
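
As a rough illustration of both points, here is a minimal sketch of the two-thread version in C++11, assuming you are happy to let `std::atomic` with `std::memory_order_seq_cst` supply the required ordering rather than placing explicit fences yourself. The names `PetersonLock`, `interested_`, `turn_` and `run_demo` are made up for this example and are not from any library.

```cpp
#include <atomic>
#include <thread>

// Hypothetical two-thread Peterson lock. The caller must pass id 0 or 1,
// and the two threads must use different ids.
class PetersonLock {
    std::atomic<bool> interested_[2] = {{false}, {false}};
    std::atomic<int> turn_{0};

public:
    void lock(int id) {
        const int other = 1 - id;
        // Announce interest, then give priority to the other thread.
        // seq_cst keeps these stores ordered before the loads below;
        // weaker orderings would allow the store/load reordering that
        // breaks the algorithm.
        interested_[id].store(true, std::memory_order_seq_cst);
        turn_.store(other, std::memory_order_seq_cst);
        // Wait while the other thread is interested and has priority.
        while (interested_[other].load(std::memory_order_seq_cst) &&
               turn_.load(std::memory_order_seq_cst) == other) {
            std::this_thread::yield();
        }
    }

    void unlock(int id) {
        interested_[id].store(false, std::memory_order_seq_cst);
    }
};

// Illustrative driver: two threads increment a plain int under the lock
// and the final count is returned.
int run_demo(int iterations) {
    PetersonLock lock;
    int counter = 0;  // not atomic; protected only by the lock
    auto worker = [&](int id) {
        for (int i = 0; i < iterations; ++i) {
            lock.lock(id);
            ++counter;
            lock.unlock(id);
        }
    };
    std::thread t0(worker, 0);
    std::thread t1(worker, 1);
    t0.join();
    t1.join();
    return counter;
}
```

Note how `lock` and `unlock` take the thread's ID as a parameter, which is why this cannot simply be dropped in where a general-purpose mutex is expected. Using `seq_cst` on every operation is the conservative choice; some of the orderings can probably be relaxed, but that needs careful analysis for your target platform.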