Speech recognition is poor and costs too much.
There are two ways to transcribe audio or video content: manually, or with a speech-to-text service from a big tech company.
Manual transcription takes an average person around 5 hours, or an expert around 3 hours, and reaches 95% accuracy, but at a huge cost.
If you use a speech-to-text solution from a big tech company, you get the transcript in half the running time of the audio or video: for 1 hour of content, the transcript arrives in 30 minutes. This is fast, but accuracy drops to about 80%. Updates from big tech companies are also infrequent.
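To make the trade-off concrete, here is a rough sketch of the arithmetic behind these figures. It assumes that the accuracy percentages are measured per word and that one hour of speech contains about 9,000 words; both are illustrative assumptions, not figures from the text above, and the function names are hypothetical.

```python
def automated_turnaround_minutes(content_minutes: float) -> float:
    """Turnaround for automated speech-to-text: half the content length."""
    return content_minutes * 0.5


def expected_errors(word_count: int, accuracy_percent: int) -> int:
    """Expected number of wrong words, assuming accuracy is per word."""
    return word_count * (100 - accuracy_percent) // 100


# Assumption: about 9,000 spoken words in one hour of content.
WORDS_PER_HOUR = 9000

turnaround = automated_turnaround_minutes(60)          # 1 hour of content
automated_errors = expected_errors(WORDS_PER_HOUR, 80)  # automated, about 80%
manual_errors = expected_errors(WORDS_PER_HOUR, 95)     # manual, 95%

print(f"Automated turnaround: {turnaround:.0f} minutes")
print(f"Expected errors at 80% accuracy: {automated_errors}")
print(f"Expected errors at 95% accuracy: {manual_errors}")
```

Under these assumptions, the automated transcript leaves four times as many words to correct as the manual one, which is the accuracy cost of the faster turnaround.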