Traditional conversational systems first extract text from speech using automatic speech recognition (ASR) and then predict intent from the transcription. We present Skit-S2I, the first publicly available Indian-accented spoken language understanding (SLU) dataset in the banking domain with a conversational tone. The end-to-end SLU approach instead predicts speaker intent directly from the speech signal, which avoids errors cascading from ASR and reduces latency. We evaluate several baseline models and pretrained speech encoders, and find that self-supervised learning (SSL) representations perform slightly better than ASR-based representations on this classification task.
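
To make the end-to-end idea concrete, the following is a minimal sketch of intent classification on top of encoder features, not the models evaluated in this work. It assumes that a pretrained speech encoder has already turned the audio into a sequence of frame-level feature vectors; the mean pooling, the linear classifier, the function names, and the intent labels are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level features over time into one utterance vector."""
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_intent(
    frames: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
    intents: Sequence[str],
) -> Tuple[str, List[float]]:
    """Predict an intent directly from speech features, with no transcript.

    `frames` stands in for the output of a pretrained speech encoder
    (assumption); `weights` and `bias` are a hypothetical linear classifier
    with one row per intent.
    """
    pooled = mean_pool(frames)
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return intents[best], probs


# Toy usage: two frames of 2-dimensional features and two made-up intents.
toy_frames = [[1.0, 0.0], [3.0, 2.0]]
toy_weights = [[1.0, 0.0], [0.0, 1.0]]
toy_bias = [0.0, 0.0]
toy_intents = ["check_balance", "block_card"]  # hypothetical labels
intent, probs = predict_intent(toy_frames, toy_weights, toy_bias, toy_intents)
print(intent, probs)
```

In a cascaded system the same classifier would consume text features derived from an ASR transcript, so any recognition error would propagate into the intent prediction; here the classifier sees only speech-derived features.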