Fail-safety?

6 Comments

  • Joey

    Hi Alex-
    Thanks for your detailed post. I presented it to members of our team in charge of the Chat Conversations product, and they want to give it the consideration it deserves. They will be in touch as soon as possible to continue this dialogue, and feel free to reach out if you have any additional thoughts to add.

  • Alex

    Great!

    One possible workaround I've thought of is a trigger that transfers channels to the human department when a user has requested chat and waited a certain amount of time without being served (similar to the default chat rescuer trigger). However, I've had problems getting the department-transfer part of the trigger to work during testing (the trigger itself fires, as demonstrated by the successful delivery of a message from the trigger to the user).

    Another problem that I forgot to mention is the chatbot server failing in the middle of serving customers. The customer would be in the chatbot department, still assigned to the now-inactive chatbot agent. Customers could manually end the chat and re-enter, but it would probably be a challenge to get users to remember or understand that solution (especially since it's not as simple as a refresh: authenticated users don't create new channels/sessions on refresh). Additionally, any trigger that routes incoming chats to the chatbot department would have to route them to the human department instead. Since the chatbot would still be marked as "active" within Zendesk, the trigger has no way of knowing when to change routing.

    These problems make the second solution I mentioned in my first post (single department, chatbot leaving the channel) more appealing. Since no triggers are used and the chatbot only intercepts incoming chats while it is active, incoming chats should be seen by agents normally when the chatbot fails. It turns out the in-memory store for transferred channels probably wouldn't be necessary in this case, as the Conversations API message subscription stops delivering messages from channels that the chatbot has exited (and if a store were needed, tagging the user channel would probably suffice).

    However, this brings up another issue. Since authenticated users seem to be reconnected to their old sessions after refreshes (and, I suppose, never lose their sessions on SPAs), it is impossible for the chatbot to interact with them again after transferring them in the single-department solution (granted, I don't know when authenticated user sessions expire and I haven't tested this extensively). The chatbot has left the channel, so it can't see new messages, even if the user opens chat a long time after their last chat session. As a result, every chat initiated after a transfer would essentially circumvent the transfer workflow (start with the chatbot, transfer to humans if necessary).

  • Yu-Hsuan Chao

    Hi Alex,

    Thanks for explaining your solutions in detail. Because bots cannot fully follow the assigned Chat routing rules, we recommend that bots using the Conversations API have their own dedicated agent department (https://develop.zendesk.com/hc/en-us/articles/360001331787-Getting-started-with-the-Chat-Conversations-API). You can set all incoming chats to go to the department the bot is in, and let it either resolve incoming chats or route them to other departments where the human agents are.

    We expect you to have a bot status monitor set up that restarts/reconnects the bot to Zendesk should it fail. This way, the bot will be able to continue serving existing visitor chats and new incoming chats in the bot department once it's back online, and the mechanism for department transfer to human agents will remain the same. If it takes too long for the bot to come back online for whatever reason, you can always have humans check the visitor list and proactively reach out to those who are waiting to be served by the bot (https://support.zendesk.com/hc/en-us/articles/360022367833-Browsing-your-site-s-visitors).
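
    A restart/reconnect loop of the kind described above could look roughly like the following. This is only a minimal sketch, not part of any Zendesk SDK: `run_with_reconnect`, `run_session`, and the delay parameters are hypothetical names, and `run_session` stands in for whatever function connects your bot and blocks until the connection ends or raises on failure.

```python
import time
from typing import Callable


def run_with_reconnect(
    run_session: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run run_session, retrying after failures with capped exponential backoff.

    Returns the number of failed attempts that happened before run_session
    returned cleanly. Re-raises the last error once max_attempts is reached.
    """
    failures = 0
    while True:
        try:
            # Hypothetical: connects the bot and blocks until it exits.
            run_session()
            return failures
        except Exception:
            failures += 1
            if failures >= max_attempts:
                raise  # give up and let an outer supervisor or alert take over
            # Delay doubles after each failure: 1s, 2s, 4s, ... up to max_delay.
            delay = min(max_delay, base_delay * 2 ** (failures - 1))
            sleep(delay)
```

    The `sleep` parameter is injectable only so the loop can be exercised without real waiting; in production the default `time.sleep` would be used, and giving up after `max_attempts` is the point at which humans might be alerted to check the visitor list.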

    I hope the explanation helps :) Please feel free to comment on this post if the suggestions don't work for your particular setup!

    On the other hand, I'm curious whether there is any special reason for you to invest so much in fail-overs for situations where the bot goes offline unexpectedly. In many cases that we've seen, clients seem to be happy with a bot auto-reconnection mechanism. Do unresponsive bot servers happen often in your domain? Are there any special concerns that we can help with?

  • Pierre-Gilles Leymarie

    Hi everyone,

    How should we handle app restarts?

    We all improve our backends on a daily basis. When the backend restarts, even if it takes only 1-2 seconds, does that mean all messages received during that time are lost?

    I was thinking of starting a new instance while the old one is still running (which will disconnect the first instance from the Zendesk websocket server), then shutting down the old one.

    In this situation, are we guaranteed that 100% of messages are handled?

    "On the other hand, I'm curious whether there is any special reason for you to invest so much in fail-overs for situations where the bot goes offline unexpectedly"

    A server with an SLA of 100% doesn't exist; we all have some downtime due to various errors (crashes, hardware failures, ...). We can't just pray that no messages were sent during downtime ^^

  • Joey

    Hi Pierre-

    I spoke with our engineering team; here is their response:

    Our recommended flow for bot/app restart:
    - send startAgentSession mutation
    - connect to the new WS endpoint
    - finally, query channels to check whether any chats were missed, and re-establish the previous subscriptions

    Note that the above steps are also recommended to avoid losing chats between a disconnection and the subsequent reconnection (EOS event).
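
    The three steps above can be sketched as a single restart routine. This is an illustration under stated assumptions, not the Conversations API client itself: `restart_bot` and its parameters are hypothetical names, and the four callables stand in for your own code that sends the `startAgentSession` mutation, opens the websocket, queries channels, and subscribes to a channel.

```python
from typing import Callable, Iterable


def restart_bot(
    start_agent_session: Callable[[], str],
    connect: Callable[[str], None],
    query_channels: Callable[[], Iterable[str]],
    subscribe: Callable[[str], None],
    known_channels: set[str],
) -> list[str]:
    """Follow the restart flow and return the channel ids that were missed.

    known_channels holds the channel ids the bot knew about before the
    restart (assumed to be persisted somewhere by the caller); it is
    updated in place with every channel found after reconnecting.
    """
    # Step 1: start a new agent session (assumed to yield the new WS endpoint).
    websocket_url = start_agent_session()

    # Step 2: connect to the new websocket endpoint.
    connect(websocket_url)

    # Step 3: query channels, detect chats that arrived while offline,
    # and re-establish subscriptions for every current channel.
    current = list(query_channels())
    missed = [channel_id for channel_id in current if channel_id not in known_channels]
    for channel_id in current:
        subscribe(channel_id)

    known_channels.update(current)
    return missed
```

    The returned list is where a bot would catch up on chats it missed, for example by greeting those visitors or transferring them to the human department.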

  • Pierre-Gilles Leymarie

    Thanks for your answer! We'll do that. 
