Hi Gary,
Thank you for taking the time to share your thoughts on our service. We truly appreciate your feedback and value your opinion.
We would like to take this opportunity to explain that our service is a "task-optimized" solution, designed to deliver the highest quality and best experience for specific tasks, with enterprise-level scalability and support for containerization in data-sensitive scenarios. This sets us apart from services that offer generic solutions, and allows our customers to achieve their desired results without prompt engineering or fine-tuning.
We also strive to continuously optimize latency. As a GenAI solution, our service generates responses token by token, which can result in higher latency than classification ML models. The length of the input document also affects latency, with longer documents producing longer response times.
Our service has been successfully used by numerous enterprise customers for large-scale document and conversation processing, some of it in real time or near real time. We would be happy to share some recommendations on how to use our service more efficiently. For example:
- Use multi-threading to maximize usage up to the rate limit provisioned for your account.
- Send multiple requests simultaneously without waiting for the completion of previous requests.
- Wait briefly, e.g. 0.5–1 second, before polling for the results.
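To make these recommendations concrete, here is a minimal Python sketch of the pattern: submit documents concurrently from a thread pool, then wait briefly before polling for each result. The `submit_job` and `get_result` functions below are hypothetical placeholders, not our actual SDK; please substitute the real client calls for your account.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the service's submit/poll API
# (illustrative names only; use your SDK's actual calls).
def submit_job(doc: str) -> str:
    """Pretend to submit a document and return a job id."""
    return f"job-{abs(hash(doc)) % 10000}"

def get_result(job_id: str) -> str:
    """Pretend to fetch a completed job's result."""
    return f"done:{job_id}"

def process(doc: str) -> str:
    job_id = submit_job(doc)   # fire the request without waiting on others
    time.sleep(0.5)            # brief pause before the first poll
    return get_result(job_id)  # then query for the result

docs = ["doc-a", "doc-b", "doc-c"]

# Multi-threading: process all documents concurrently, sized to stay
# within the rate limit provisioned for your account (4 is an example).
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, docs))

print(results)
```

With a thread pool sized to your provisioned rate limit, total wall-clock time approaches that of the slowest single request rather than the sum of all of them.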
Our team is committed to continuously improving our service, including reducing latency. We welcome your feedback and would be delighted to schedule a call with you to discuss your thoughts in more detail. Please feel free to reach out to us at mslangsvc@microsoft.com.
Thank you again for your feedback. We look forward to providing you with the best service possible.