My IIT R Journey


Talk about

  • Why am I writing this?
    • Soumik Dhaityari
    • Prajwal Bhatt
  • Opinions
  • Highs and Lows
  • Groups
    • People are doing amazing work
      • IARC, OnRec, Nayyar etc.
    • Drams
    • My IMG story
    • First year
    • DSG
  • Machine Learning
  • Getting things done at IIT R
  • Admin, Faculty, Department, Non-Circuital
    • DAPC, Minors
    • Double Majors
  • CP
  • Intern & Placements
    • Talk about Schlumberger
      • Why this was my last interview and other interviews Adobe, GS etc.
      • Prajwal
      • Data Science profile
      • Interview
      • Rescind offer (probably the best thing)
  • Higher Studies
    • How I always wanted to pursue this and was fixated on it. I remember my mentor (Prajwal) telling me this
    • Later I started asking, why?
    • Talk about Aarush, Dakshit
  • Open Source Everyone from heaven come here.
    • Benefits
      • Learning from the best
      • Connections & Interactions
      • Impact
      • Opportunities
      • Real World Experience
    • Dive Into Deep Learning
    • Programs
      • GSoC
      • MLH
      • LF AI
      • Outreachy
      • GSoD
      • Quansight Labs, Open Teams
      • Talk about design and Marketing opportunities
  • Why it is important for you to surround yourself with like minded people
    • Groups
    • Seniors and their impact (How to learn and connect)
    • Avg of five people around you
      • Talk about Aman startup journey
      • PM PhD journey
      • Aaradhya Faraz Madhvendra
  • TT Boys
  • Helping Others
  • Connecting with people
    • How open source helps
  • My future
  • Buying courses
  • Not everything is rainbows and sunshine.
    • I’ve made some poor decisions in the past and definitely a lot of mistakes.
    • My first semester priorities. I don’t want to talk about this in detail. But YEAH, I made some mistakes, and for a long time I used to blame someone else for them. This is again one of my biggest learning experiences.
    • Reading more
  • You will make mistakes
    • Accept that no one will ever be perfect. Don’t worry about “what will this person think of me?”
  • My two cents

  • Set high standards for yourself
  • Judging a startup
    • Internship stipend
    • Should you be working with them
    • People
  • The “what difference does it make” approach
    • The ability to ask dumb questions is a super power.


A shark in a fish tank grows 8 inches, but in an ocean it can grow up to 8 feet or more. Change your environment and watch your growth.

Find your own path. It will probably be a “Road not taken”

I’ll talk about our approach on filtering out inscrutable audios from VASR.

There are situations in the Call Center Automation (CCA) pipeline where user utterances are bad. Bad here means noise, static, silence, or background murmur that renders the downstream SLU systems helpless. We started with a proposal and prepared a dataset so that an ML system could learn to reject these audios.


  • No more misfires from the SLU side, which ultimately leads to a better user experience.
  • Save compute and time by skipping bad audios.
  • The whole system can be reused for all our audio-based tasks to predict and filter out the poor audios, which avoids sample noise in those tasks.


We prepared a dataset of intent-tagged conversations with specially marked intents, which tell us that these utterances are bad and that passing them further into SLU will result in errors. We also sampled non-bad utterances (tagged with regular intents) to make this a classification problem.

In total, there are 9,928 audio samples labelled as bad and 20,000 labelled as good.

The raw labels were not very useful as they were, so we cleaned and preprocessed the data to create 2 broad categories, one of which has sub-classes.

  • audio-bad
    • audio-noisy: Noisy audio.
    • audio-silent: Silent audio.
    • audio-talking: Background talking.
    • hold-phone: Music from keeping on hold.
  • audio-good
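
The label clean-up above can be sketched as a simple mapping from tags to the two broad categories. This is only an illustration: the four sub-class names come from the list above, while the function name and the example of a regular intent (`confirm_booking`) are made up for the sketch and are not taken from our actual tag set.

```python
# Sub-classes of audio-bad, as listed in the post.
BAD_SUBCLASSES = {"audio-noisy", "audio-silent", "audio-talking", "hold-phone"}


def to_broad_category(tag: str) -> str:
    """Collapse a cleaned tag into one of the two broad categories.

    Assumption: every tag that is not a bad-audio sub-class is a regular
    intent and therefore counts as audio-good.
    """
    return "audio-bad" if tag in BAD_SUBCLASSES else "audio-good"


def to_binary_label(tag: str) -> int:
    """1 = audio-bad (reject), 0 = audio-good (pass on to SLU)."""
    return 1 if to_broad_category(tag) == "audio-bad" else 0


# "confirm_booking" is a hypothetical regular intent, used only as an example.
labels = [to_binary_label(t) for t in ["audio-noisy", "confirm_booking", "hold-phone"]]
```

Keeping the sub-class around next to the binary label is useful later, for example when checking which kind of bad audio the model struggles with.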


If we are going to reject these bad audios then we need to do so with:

  • High Precision: We should not be rejecting good audios which are perfectly interpretable and understandable.
  • Low Latency: This system should have little to no latency, otherwise it will just slow down our whole VASR flow after being deployed and integrated.
  • Online: The model should be capable of running in an online setting where continuous chunks of audios are fed into the system.
Binary Audio Classification based on Log-Mel Spectrograms

These features (spectrograms) could be generated once, offline, after processing all the audios in the dataset. For deployment, however, the model has to take raw audio as input and predict the class directly, so the feature generation needs to happen on the fly. This was easily incorporated through a few transforms done within the model using torchaudio.

Even though this architecture is simple, it got us an accuracy of about 87%. Accuracy is not the number we care about, though. As explained earlier, our metric of choice is precision. We are still in the process of improving these initial baseline numbers. One simple way to increase precision is to raise the decision threshold, which trades off some coverage: fewer audios get rejected, so the support shrinks.
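
The threshold trade-off can be illustrated with a few lines of Python. The scores and labels below are toy values invented for this example, not outputs of our model, and the function name is hypothetical. The positive class is audio-bad, so precision here means "of the audios we rejected, how many were really bad".

```python
from typing import List, Optional, Tuple


def precision_and_support(
    scores: List[float], labels: List[int], threshold: float
) -> Tuple[Optional[float], int]:
    """Precision and support of the reject decision at a given threshold.

    scores: predicted probability that an audio is bad.
    labels: 1 = bad, 0 = good.
    Returns (precision, number of rejected audios); precision is None
    when nothing is rejected.
    """
    rejected = [label for score, label in zip(scores, labels) if score >= threshold]
    if not rejected:
        return None, 0
    return sum(rejected) / len(rejected), len(rejected)


# Toy data, made up for illustration only.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.20]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

low = precision_and_support(scores, labels, threshold=0.5)    # 4 of 6 rejected are bad
high = precision_and_support(scores, labels, threshold=0.75)  # 3 of 3 rejected are bad
```

On this toy data, raising the threshold from 0.5 to 0.75 removes both good audios from the rejected set, but one truly bad audio (score 0.60) now slips through to SLU.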

Misclassification Analysis

We also did a post-prediction analysis of the misclassified audios, which revealed an interesting pattern in the dataset and in the kind of audios that the model found hard to predict.

Briefly, these errors fall into 3 major types, which helps us understand where we can make improvements.

  • Type 1 (Very Short Utterance): say, 0.2 seconds of speech in an audio of 6 seconds. Since most of such an audio is noise, our model sometimes predicts it to be noisy rather than good. This can probably be fixed with VAD, which can trim the non-speech segments in such short-utterance audios.
  • Type 2 (Long Audios): the audio is longer than 6.5 seconds and the speaker talks in the latter half. Since we chose to cut off our features (log-mels) at 6.5 seconds, the latter part of the audio is truncated, which causes these errors.
  • Type 3 (Ambiguous / Wrongly Labeled): there are samples in the dataset which are not perfectly labelled. One may say these audios are debatable: some may find them bad, others may think they are OK. This type of label noise needs to be tackled.
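
To make the Type 1 idea concrete, here is a rough sketch of trimming non-speech segments. It uses a plain frame-energy threshold as a crude stand-in for a real VAD, which we have not described in this post. The function name, frame length, threshold and the 8 kHz sample rate are all assumptions, and the audio is synthetic.

```python
import math
from typing import List


def trim_low_energy(samples: List[float], frame_len: int = 160,
                    threshold: float = 0.02) -> List[float]:
    """Keep the span from the first to the last frame whose RMS exceeds threshold.

    This is a simple energy gate, not a proper VAD: loud background noise
    would pass through it just like speech.
    """
    active_starts = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms > threshold:
            active_starts.append(start)
    if not active_starts:
        return []
    return samples[active_starts[0]:active_starts[-1] + frame_len]


SAMPLE_RATE = 8000  # assumption: telephony audio

# Synthetic 6-second audio: faint alternating noise with a 0.2 s loud burst at 3.0 s.
audio = [0.001 if i % 2 == 0 else -0.001 for i in range(6 * SAMPLE_RATE)]
for i in range(3 * SAMPLE_RATE, 3 * SAMPLE_RATE + 1600):
    audio[i] = 0.5 if i % 2 == 0 else -0.5

trimmed = trim_low_energy(audio)
trimmed_seconds = len(trimmed) / SAMPLE_RATE
```

On this synthetic clip the 6 seconds shrink to the 0.2 second burst, so the classifier would see mostly speech instead of mostly background. Real call audio with strong noise would need an actual VAD model rather than an energy gate.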

Needless to say, there is room to improve these results. But having a solid baseline model early on is important for incremental improvements over time, and after a few iterations we expect to see these models in our production systems.

That’s all for now. This simple model was really powerful. Unfortunately, the code cannot be open sourced due to legal reasons and privacy restrictions on the dataset.