Coding with Large Language Models: Infrastructure and Evaluation Across Models