Progress has been made since the last post. My modifications to boxfit now allow for basic inverse-Compton radiation. Here is a reference spectrum generated using the shipped settings. The current method uses the definition of the Compton parameter (Y) laid out in Nakar et al. (ApJ, 703, 675) and works mainly in the slow-cooling regime, with placeholders for the other regimes.
The orange curve is the SSC-enabled spectrum, and it behaves exactly as expected above the cooling break.
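For reference, the expected behavior under the Thomson-limit prescription I am assuming for now (the standard slow-cooling form; the exact expressions in the code may differ) looks like this:

```latex
% Thomson-limit Compton parameter, slow-cooling case (assumed form):
Y\,(1+Y) \simeq \frac{\epsilon_e}{\epsilon_B}
          \left(\frac{\gamma_c}{\gamma_m}\right)^{2-p},
% and its leading effect on the synchrotron spectrum:
\nu_c \;\rightarrow\; \frac{\nu_c}{(1+Y)^2},
\qquad
F_\nu(\nu > \nu_c) \;\rightarrow\; \frac{F_\nu}{1+Y}.
```

So the SSC curve should track the synchrotron-only curve below the cooling break and sit a factor of roughly (1+Y) below it above the break, which is what the plot shows.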
The next step is to implement the full parameterization of Y based on Nakar et al. as well as Beniamini et al. (MNRAS, 454, 1073), which includes the Klein-Nishina effect at higher frequencies. I do worry a bit about how computationally expensive this will be, but I can't really speak to optimizations until I have a better idea of what the algorithm will look like.
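For planning purposes, here is a rough sketch of the kind of fixed-point iteration I expect this to require: once Klein-Nishina suppression enters, the right-hand side of the Y equation depends on the cooling Lorentz factor, which itself depends on Y. Everything here is illustrative; kn_factor() is a crude hypothetical stand-in for the real frequency-dependent prescription in those papers, and none of the names or numbers come from boxfit.

```cpp
#include <cmath>
#include <cstdio>

// Crude placeholder for Klein-Nishina suppression: full scattering below
// a critical Lorentz factor, steep suppression above it. The actual
// prescription (Nakar et al. 2009; Beniamini et al. 2015) is piecewise
// in frequency/electron energy and more involved.
static double kn_factor(double gamma_c, double gamma_hat)
{
    return (gamma_c < gamma_hat) ? 1.0 : std::pow(gamma_hat / gamma_c, 2.0);
}

// Iterate Y(1+Y) = eta * (eps_e/eps_B) * kn_factor(gamma_c), where
// gamma_c itself shrinks as (1+Y) grows.
static double solve_Y_kn(double eps_e, double eps_B, double eta,
                         double gamma_c0, double gamma_hat)
{
    double Y = 0.0;
    for (int i = 0; i < 50; ++i) {
        double gamma_c = gamma_c0 / (1.0 + Y);        // IC cooling lowers gamma_c
        double rhs = eta * (eps_e / eps_B) * kn_factor(gamma_c, gamma_hat);
        double Y_new = 0.5 * (std::sqrt(1.0 + 4.0 * rhs) - 1.0);  // Y(1+Y)=rhs
        if (std::fabs(Y_new - Y) < 1e-6 * (1.0 + Y))
            return Y_new;
        Y = Y_new;
    }
    return Y;
}

int main()
{
    // Illustrative numbers only, not boxfit defaults.
    std::printf("Y = %g\n", solve_Y_kn(0.1, 0.01, 0.3, 1.0e5, 3.0e4));
    return 0;
}
```

A loop like this would have to run per fluid cell and per observation time, which is where my worry about computational cost comes from.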
I am still working on the CUDA port, but I haven't had much time to think about how to restructure the data to fit what CUDA can work with. In the meantime I have started experimenting with GCC's built-in parallelization: the -ftree-parallelize-loops= option has given modest but tangible improvements over single-threaded execution. I am also working on an OpenMP version, but naively parallelizing the loops actually slows things down a bit, probably because of cache-line invalidations between threads. In the next couple of days I am going to try SIMD instructions and see whether I can rearrange the memory accesses to avoid those hiccups and gain an additional speedup, since OpenMP will fully utilize the machine's cores in a way the auto-parallelized g++ build does not. I am also thinking about replacing CUDA with OpenACC for the initial port, as it is largely vendor-agnostic and a bit simpler to code in than OpenCL.
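As a concrete example of what I am comparing, here is a minimal sketch of the cache-line issue and the reduction that avoids it. The names emissivity() and flux_sum are placeholders, not boxfit symbols; only the flags and pragmas are real.

```cpp
// Compile with e.g.:  g++ -O3 -fopenmp -march=native example.cpp
// (the GCC auto-parallelized build uses -ftree-parallelize-loops=N instead)
#include <omp.h>
#include <vector>
#include <cstdio>

// Placeholder for some per-element integrand; not a boxfit function.
static double emissivity(int i) { return 1.0 / (1.0 + i); }

int main()
{
    const int n = 1 << 22;

    // Naive version: every thread repeatedly writes a neighbouring element
    // of a shared array, so the same cache line bounces between cores
    // (false sharing) and the parallel loop can end up slower than serial.
    std::vector<double> partial(omp_get_max_threads(), 0.0);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        partial[omp_get_thread_num()] += emissivity(i);

    // Reduction version: each thread accumulates privately and OpenMP
    // combines the partial sums once at the end; the simd clause also
    // asks the compiler to vectorize the inner iterations.
    double flux_sum = 0.0;
    #pragma omp parallel for simd reduction(+:flux_sum)
    for (int i = 0; i < n; ++i)
        flux_sum += emissivity(i);

    std::printf("flux_sum = %g\n", flux_sum);
    return 0;
}
```

The reduction-plus-simd pattern is what I plan to test first, since it addresses both the cache behavior and the vectorization in one place.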