
What’s Your Agent’s GPA? A Framework for Evaluating AI Agent Reliability

As demand grows for agentic systems that can plan, call tools and adapt, the stakes are rising. These systems now power enterprise-grade workflows and can serve thousands of business users, making one question especially critical: Can you trust your agents to work as intended?

An agent’s answer may appear successful, but the path it took to get there may not be. Was the goal achieved efficiently? Did the plan make sense? Were the right tools used? Did the agent follow through? Without visibility into these steps, teams risk deploying agents that look reliable but create hidden costs in production. Inaccuracies can waste compute, inflate latency and lead to the wrong business decisions, all of which erode trust at scale.

Today’s eval methods can fall short. They often judge only the final answer, missing the agent’s decision-making process and, with it, end-to-end performance. Ground-truth data sets with expected agent outcomes and trajectories annotated by experts are valuable but expensive to build and maintain. And outcome-focused benchmarks confirm whether an agent succeeded, but provide little insight into why it failed or how to fix it.

To address this gap, the Snowflake AI Research team developed the Agent GPA (Goal-Plan-Action) framework, available in the open source TruLens library, to evaluate agents across goals, plans and actions, surfacing internal errors such as hallucinations, poor tool use or missed plan steps. 

In benchmark testing, Agent GPA judges consistently outperformed baseline LLM judges, demonstrating a systematic way to evaluate and improve agents at scale:

  • Consistent, trustworthy coverage, with Agent GPA judges matching all 570 human-annotated failures across all error severities.
  • Increased confidence in LLM evals, with 95% error detection, a 1.8x improvement over baseline methods.
  • Faster debugging with 86% error localization, compared to 49% for baseline judges.

In this blog, we’ll walk through how the Agent GPA framework works, share benchmark results from our paper, and show how you can measure your own agent’s GPA using TruLens or directly inside Snowflake.

Understanding the Agent GPA framework

Agent GPA evaluates agents across three critical phases of their reasoning and execution process — Goal, Plan and Action — using quantifiable metrics to capture what the agent produced and how it got there.

  • Goal: Was the response relevant, grounded and accurate from the user’s perspective?
  • Plan: Did the agent design and follow a sound roadmap, selecting appropriate tools for each step?
  • Action: Were those tools executed effectively and efficiently?
Figure 1: The Agent GPA framework includes five sets of evaluation metrics to measure how well an agent’s goals, plans and actions are aligned.

Agent GPA uses LLM judges to score each metric, surfacing issues like hallucinations, reasoning gaps and inefficient tool use, making agent behavior transparent and easier to debug.
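To make that concrete, here is a minimal sketch of what an LLM-judge metric can look like in code. This is illustrative only, not the TruLens implementation: the prompt wording and the call_llm helper (any function that sends a prompt to a chat model and returns its text) are assumptions.

```python
# Minimal sketch of an LLM-judge metric (illustrative only; not the TruLens implementation).
# call_llm is a stand-in for any function that sends a prompt to a chat model and returns text.
import json

JUDGE_TEMPLATE = """You are an impartial evaluator.
Metric: {metric}
Criteria: {criteria}
Agent trace excerpt:
{trace}
Return JSON: {{"score": <float between 0 and 1>, "reason": "<one sentence>"}}"""

def judge_metric(call_llm, metric: str, criteria: str, trace: str) -> dict:
    """Score one Agent GPA-style metric on a trace excerpt using an LLM judge."""
    prompt = JUDGE_TEMPLATE.format(metric=metric, criteria=criteria, trace=trace)
    return json.loads(call_llm(prompt))  # e.g., {"score": 0.4, "reason": "Two planned steps were skipped."}
```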

Alignment across Goal-Action

  • Answer correctness: Does the agent’s final answer align with the ground-truth reference? The LLM judge compares the agent’s response to the verified answer to determine factual accuracy.
  • Answer relevance: Is the agent’s final answer relevant to the query? For example, a relevant answer to “What is the weather today?” is “Cloudy,” whereas “With a chance of meatballs” could be delicious but, sadly, irrelevant (a scoring sketch for this example follows after this list).
  • Groundedness: Is the agent’s final answer backed up by evidence from previously retrieved context? 
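Reusing the judge_metric sketch from above, scoring answer relevance for the weather example might look like this (again, call_llm and the criteria wording are illustrative assumptions):

```python
# Scoring answer relevance for the weather example above (illustrative).
result = judge_metric(
    call_llm,  # hypothetical chat-model wrapper from the earlier sketch
    metric="Answer relevance",
    criteria="The final answer must directly address the user's query.",
    trace='Query: "What is the weather today?"\nFinal answer: "With a chance of meatballs."',
)
print(result["score"], result["reason"])  # expect a low score with a one-sentence justification
```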

Alignment across Goal-Plan

  • Plan quality: Did the agent design an effective roadmap to reach the goal? High-quality plans break down the problem into the right subtasks and assign the best tools for each.
  • Tool selection: Did the agent choose the correct tool for each subtask? This is a special case of measuring plan quality. 
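Beyond the LLM judge, tool selection also lends itself to a simple deterministic sanity check that some teams run alongside the judge. The sketch below is a hedged illustration with made-up plan contents and tool names, not part of the Agent GPA judges themselves.

```python
# Illustrative companion check for tool selection: did every planned step name a tool
# the agent actually has? (Plan contents and tool names here are made up.)
PLAN = [
    {"step": "Find the paper's publication year", "tool": "web_search"},
    {"step": "Extract the results table", "tool": "pdf_reader"},
]
REGISTERED_TOOLS = {"web_search", "pdf_reader", "calculator"}

unknown = [s["tool"] for s in PLAN if s["tool"] not in REGISTERED_TOOLS]
print("All planned tools are registered." if not unknown else f"Unknown tools in plan: {unknown}")
```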

Alignment across Plan-Action

  • Plan adherence: Did the agent follow through on its plan? Skipped, reordered or repeated steps often signal reasoning or execution errors.
  • Tool calling: Are tool calls valid and complete, with correct parameters and appropriate use of outputs?
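As a rough illustration of plan adherence, the core comparison is between the steps the agent planned and the steps that actually appear in the executed trace. The sketch below uses toy step names; real traces are richer, and matching steps is fuzzier than exact string equality.

```python
# Toy plan-adherence check: which planned steps never show up in the executed trace?
planned = ["search for source", "read source", "compute answer", "write summary"]
executed = ["search for source", "write summary"]

skipped = [step for step in planned if step not in executed]
adherence = 1 - len(skipped) / len(planned)
print(f"Plan adherence: {adherence:.2f}; skipped steps: {skipped}")  # 0.50; two steps skipped
```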

Alignment across Goal-Plan-Action

  • Logical consistency: Are the agent’s steps coherent and grounded in prior context? This checks for contradictions, ignored instructions or reasoning errors.
  • Execution efficiency: Did the agent reach the goal without wasted steps? This captures redundancies, superfluous tool calls or inefficient use of resources.
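One simple signal behind execution efficiency is repeated, identical tool calls, which indicate wasted work. The sketch below is a hedged illustration with made-up calls, not the actual Agent GPA judge.

```python
# Illustrative execution-efficiency signal: flag repeated, identical tool calls in a trace.
from collections import Counter

tool_calls = [
    ("web_search", "GAIA benchmark"),
    ("web_search", "GAIA benchmark"),  # duplicate call: wasted compute and latency
    ("pdf_reader", "paper.pdf"),
]
repeats = {call: count for call, count in Counter(tool_calls).items() if count > 1}
print(f"Redundant tool calls: {repeats}")
```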

Note: Tool-related evaluations in Agent GPA focus only on agent-controlled behavior, such as tool selection and tool calling. In production, teams often add enterprise-specific tool quality checks, such as retrieval relevance or API throughput, which fall outside the agent’s control.

In the example shown in Figure 2, you can see how Agent GPA evaluates an agent’s reasoning trace, scoring each metric to reveal where the performance breaks down.

Figure 2: An illustration showing how the Agent GPA judges evaluate the trace to pinpoint weak spots in an agent’s reasoning.

Here, the Agent GPA judges found low Plan Adherence: multiple steps were omitted and the plan was largely disregarded early on, which can be fixed by adding more explicit subgoals within the orchestration layer. That breakdown, in turn, produced low Answer Relevance, because the response lacked information it would have included with proper planning; restoring the missing steps from the plan will improve this score.

Now, instead of treating agents as black boxes, Agent GPA makes their behavior observable and debuggable. These targeted insights help builders quickly refine agents for more accurate, reliable performance. 

Benchmarking Agent GPA

To validate the Agent GPA framework, the Snowflake AI Research team benchmarked it on the TRAIL/GAIA data set. 

Because our research focuses primarily on evaluating end-to-end agent performance, we used only the 117 traces from the TRAIL/GAIA subset, which contain a total of 570 annotated internal agent errors. These traces span diverse reasoning tasks, from open-ended question answering to multi-step tool use, and capture a full range of low-, medium- and high-impact failures.

In the TRAIL benchmark:

  • Low-impact errors are minor issues, such as typos, formatting mistakes, or redundant steps, that don’t affect the correctness of the final answer but reveal small inefficiencies or surface-level reasoning flaws.
  • Medium-impact errors involve partial reasoning or execution mistakes, such as choosing an inappropriate tool or skipping a step that slightly alters the outcome.
  • High-impact errors are severe failures that lead to incorrect or fabricated results, including hallucinated data, broken reasoning chains or major plan deviations.

Each TRAIL/GAIA trace was generated using Hugging Face’s Open-Deep-Research Agent, a multi-agent architecture consisting of a Manager Agent that plans and delegates, and a Search Agent that executes tasks, such as web search and text retrieval. The data set was split into two parts: (1) a dev set for optimizing the LLM judges applied to each metric; and (2) a test set, used only to measure how effectively the LLM judges detected and localized errors.

Our evaluation surfaced three key takeaways that validate the scalable value of the Agent GPA framework.

Consistent, trustworthy error coverage

Error coverage measures how many of the human-annotated agent errors the LLM judges correctly detect. A high coverage score means the judge can recognize a wide range of reasoning, planning, or execution mistakes that occur within an agent’s trace.

Across the dev (289 errors) and test (281 errors) splits of the TRAIL/GAIA data set, every one of the 570 human-annotated internal agent errors was covered by at least one of the Agent GPA judges.

This shows that Agent GPA provides broad coverage across agent failure modes, spanning low impact (like typos and formatting errors); medium impact (like improper tool selection); and high impact (like data fabrication).

Near-human error detection accuracy

In the chart below, we compare the error coverage of a baseline LLM judge from TRAIL to the GPA judges. While the baseline TRAIL judge could only identify 55% of the human-annotated errors, we find that Agent GPA judges caught 95% (267/281) of the errors on the test set, shown in the “All” columns.

This near-human-level performance demonstrates that Agent GPA can reliably detect a wide range of agent failures, from minor reasoning gaps to major execution errors, without requiring manual annotation.

Figure 3: Error detection coverage of the Agent GPA judges compared with the baseline TRAIL judge on the test set.
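In raw numbers, the detection result above is straightforward arithmetic over the test split:

```python
# Error detection on the TRAIL/GAIA test split, as reported above.
annotated_errors = 281        # human-annotated errors in the test set
detected_by_gpa_judges = 267  # errors caught by the Agent GPA judges
print(f"Agent GPA detection: {detected_by_gpa_judges / annotated_errors:.0%}")  # 95%
```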

Faster, more targeted debugging

Error localization measures whether a judge can identify exactly where in the reasoning trace an error occurred. This visibility makes debugging more actionable, allowing developers to trace specific reasoning steps, tool calls or plan deviations that led to the failure.

In the chart below, Agent GPA judges achieved 86% localization accuracy, correctly identifying the span of the reasoning trace where each error occurred, compared to 49% for baseline judges. They correctly identified the location of 241 out of 281 human-annotated errors on the test set. The baseline judges localized only 138 out of 281 errors, highlighting the precision and reliability of Agent GPA’s evaluation across all impact levels.

Figure 4: Error localization accuracy of the Agent GPA judges compared with the baseline judges on the test set.
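Again in raw numbers over the test split:

```python
# Error localization on the TRAIL/GAIA test split, as reported above.
total_errors = 281
gpa_localized = 241
baseline_localized = 138
print(f"Agent GPA localization: {gpa_localized / total_errors:.0%}")       # 86%
print(f"Baseline localization:  {baseline_localized / total_errors:.0%}")  # 49%
```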

Together, these results confirm that Agent GPA delivers comprehensive and interpretable assessments of agent reliability across a broad spectrum of error types. 

You can read the full benchmark report in our paper.

Building more trustworthy agents

The Agent GPA framework is available through TruLens, an open source framework for evaluating, tracking and optimizing AI. Select evaluation capabilities are also part of Snowflake Intelligence (in private preview), giving developers a clear path to build, debug and trust agentic systems at enterprise scale.
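For exact setup steps and API names, rely on the TruLens documentation and the resources below; the sketch here only illustrates the overall shape of the workflow: capture an agent trace, run each GPA judge over it, and collect per-metric scores and reasons. All names in it are placeholders, not the real TruLens interface.

```python
# Illustrative workflow only; consult the TruLens docs for the actual Agent GPA API.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class MetricScore:
    metric: str
    score: float
    reason: str

def evaluate_agent_trace(trace: str, judges: Dict[str, Callable[[str], Tuple[float, str]]]) -> List[MetricScore]:
    """Run each GPA judge (a callable returning (score, reason)) over a captured trace."""
    return [MetricScore(metric, *judge(trace)) for metric, judge in judges.items()]

# judges = {"plan_adherence": plan_judge, "groundedness": groundedness_judge, ...}
# report = evaluate_agent_trace(captured_trace, judges)
```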

For step-by-step instructions on how to set up and use Agent GPA evaluations, check out these resources:
