Administration

Scott Veirs edited this page Oct 27, 2022 · 14 revisions

Join the admin team!

We welcome your help with administering the real-time inference system. If you'd like to get involved, please check out the on-boarding info for the Orcasound open source community and let us know you'd like to help maintain the system as a devops volunteer.

Once on-boarded, your first step will be to read the information in this wiki and discuss any questions with existing administrators. Once you have a handle on how the system works, we can grant appropriate access to cloud-based services so you can begin helping.

The OrcaHello project is based in Azure. Most communication about the project occurs in the Orcasound Slack year-round and via Teams during Microsoft hackathons. Other aspects of Orcasound open source projects, including the management of the "swarm" of hydrophone nodes and the live-listening web app designed for community scientists, are deployed beyond Azure and are administered via general Orcasound communication tools (Orcasound Slack, GitHub, and Trello).

Administrator handbook

Overview of the system (last update 23 Sep 2022)

Here is a schematic of the OrcaHello system:

OrcaHello schematic showing data flow through the machine learning pipeline to moderation & notification components.

In a nutshell, some administration may be required for each pictured component of OrcaHello:

  1. Pulling audio data streamed from Orcasound hydrophones as quickly as possible (involves coordination with the broader Orcasound administrative community);
  2. Predicting when orca calls occur in the data stream (involves keeping the model running, strategizing with data scientists, and working with moderators to improve models);
  3. Maintaining the databases and storage of model outputs (including training data from other Orcasound projects like the Pod.Cast annotation tool).

An overall responsibility of administrators is to monitor the costs of the OrcaHello system. This was first done (2019-2022) to track usage of credits granted to the project by Microsoft's AI for Earth program via an Azure subscription sponsorship, first to Orca Conservancy, and later to the University of Washington's Detect2Protect initiative. In fall 2022, the AI for Earth program ended and all sponsorships expired on Oct 24, so now costs are monitored to ensure the Orcasound community can continue to fund OrcaHello through a pay-as-you-go subscription.

Oct. 2022 admin focus: optimize system uptime, binary call classifier performance, and costs within AKS

Archived notes

Nov. 2021 admin focus: stabilizing deployment

Prakruti posed a key question: "What Azure service shall we use in the long run?"

Michelle's initial suggestions of options (with pros / cons):

  1. Azure Container Instances (ACI)
     • Pro: we're using it already
     • Con: prone to failing frequently, no horizontal scalability (if/when we need it)
  2. Azure Kubernetes Service (AKS)
     • Pro: horizontal scalability, declarative method allows for easy cluster migration/adding hydrophones, cheaper than ACI
     • Con: need to manage the cluster (K8s version upgrades), higher barrier to entry knowledge-wise
  3. Virtual machines (VM)
     • Pro: lowest bar to entry, cheapest option
     • Con: managing VMs, no horizontal scalability
  4. Azure Container Apps
     • Pro: no need to maintain infrastructure, horizontal scalability
     • Con: preview service (just announced), most expensive option

In 2022, we transitioned to AKS in preparation for the annual fall (October) hackathon.

Common administrative tasks

Troubleshoot the live system

Triage checklist:

  1. Hydrophones are broken
  2. AWS buckets are not picking up audio correctly
  3. Inference system is broken (AKS + Azure storage)
  4. Cosmos DB is broken
  5. Moderator portal is broken
  6. Notification system is broken (Azure Functions)
  7. SendGrid is broken
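The checklist above can be worked through in order, stopping at the first component that fails. Here is a minimal sketch of that idea, assuming the list is meant to be checked top to bottom. The names `TRIAGE_ORDER` and `first_failure` are hypothetical, and the per-component check functions are placeholders you would need to write yourself; nothing here is part of the OrcaHello codebase.

```python
from typing import Callable, List, Optional, Tuple

# Component labels copied from the triage checklist, in the same order.
TRIAGE_ORDER = [
    "Hydrophones",
    "AWS buckets",
    "Inference system (AKS + Azure storage)",
    "Cosmos DB",
    "Moderator portal",
    "Notification system (Azure Functions)",
    "SendGrid",
]

Check = Tuple[str, Callable[[], bool]]


def first_failure(checks: List[Check]) -> Optional[str]:
    """Run the checks in order and return the name of the first one that fails.

    Each check is a zero-argument function (hypothetical, supplied by the admin)
    that returns True when the component looks healthy. A check that raises an
    exception is treated as a failure. Returns None if every check passes.
    """
    for name, check in checks:
        try:
            healthy = bool(check())
        except Exception:
            healthy = False
        if not healthy:
            return name
    return None


if __name__ == "__main__":
    # Placeholder checks that always pass; replace each lambda with a real probe.
    checks = [(name, lambda: True) for name in TRIAGE_ORDER]
    print(first_failure(checks) or "all components look healthy")
```

Stopping at the first failure keeps attention on the earliest broken component in the list, since problems further down may only be symptoms of it.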

Troubleshooting sub-steps for each level of triage

  1. Ensure the containers are running. (In Azure, view container instances -> look for something like live-inference-system-aci-allhydrophones... -> State should be "Running".)

  2. Make sure Orcasound data is reaching the Azure blob storage (look in e.g. livemlaudiospecstorage.blob.core.windows.net/audiowavs).

  3. Check that new rows have recently been added to the Cosmos DB database.

To check, view the Azure Cosmos DB databases and run a query.

Example queries:

  • SELECT * FROM c WHERE c.timestamp LIKE "%2021-10-02%"
  • SELECT * FROM c WHERE c.comments LIKE "%transients%" (matches all records whose comment includes the string "transients")
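The example queries above can also be built programmatically. The sketch below constructs parameterized versions of both queries; the helper names `day_query` and `comment_query` are hypothetical, and it assumes Cosmos DB accepts a query parameter as the LIKE pattern (if it does not, inline the pattern as in the examples above). The commented usage relies on the `azure-cosmos` Python SDK, with placeholder account, database, and container names that you would need to fill in.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Query = Tuple[str, List[Dict[str, str]]]


def day_query(day: str) -> Query:
    """Build a query for records whose timestamp contains `day` (YYYY-MM-DD)."""
    # Validate the date format early; raises ValueError on malformed input.
    datetime.strptime(day, "%Y-%m-%d")
    return (
        "SELECT * FROM c WHERE c.timestamp LIKE @pattern",
        [{"name": "@pattern", "value": f"%{day}%"}],
    )


def comment_query(text: str) -> Query:
    """Build a query for records whose comments contain `text`."""
    return (
        "SELECT * FROM c WHERE c.comments LIKE @pattern",
        [{"name": "@pattern", "value": f"%{text}%"}],
    )


if __name__ == "__main__":
    query, params = day_query("2021-10-02")
    print(query, params)

    # Running the query requires the azure-cosmos package and real credentials.
    # The URL, key, database, and container names below are placeholders:
    #
    # from azure.cosmos import CosmosClient
    # client = CosmosClient("<account-url>", credential="<account-key>")
    # container = (client.get_database_client("<database>")
    #                    .get_container_client("<container>"))
    # rows = list(container.query_items(
    #     query=query, parameters=params, enable_cross_partition_query=True))
    # print(len(rows), "rows found")
```

An empty result for today's date would suggest that no new rows are arriving, which points back to triage step 3.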

Add new admin

Restart ACI container(s)

Add a hydrophone location

Add a moderator

Manage moderator email list

Manage subscriber email list(s)

Machine learning processes

Access to High-Performance Computing (HPC) resources within Azure

Guidance from AI for Earth:

You may now request GPU access for your Azure account using the following steps. As a courtesy, please
only request the SKU/region you need and not ask for all SKUs/regions. Also, to fairly distribute the
limited number of GPUs that AI for Earth has reserved for grantees, we are limiting each grantee access
to only 12 additional cores of each SKU/region.

1. Go to the Azure Portal: https://portal.azure.com
2. Select Help + support
3. Select New support request
4. Select “Issue type” as "Service and subscription limits (quotas)" from the drop-down list
5. Select the subscription ID (make sure it’s the Subscription that you sent to us for whitelisting)
6. Quota type = Compute/VM (cores/vCPUs) subscription limit increases
7. Change the Support Method as appropriate
8. Set the Request Details for “Resource Manager” deployment model (the GPU SKUs are not deployable via the Classic deployment model)
   a. Select your Severity level,
   b. Select the Deployment model,
   c. Select the Location,
   d. Select SKU Family (multiple selections are possible),
   e. You can see the current quota limit,
   f. Fill the required new quota limit,
   g. Click on “Save and continue”
9. Verify contact info and click on “Next: Review + create” to generate a request.

Please contact AI4EHelp@microsoft.com if you have any difficulties with requesting access.

Re-training the OrcaHello model for SRKW calls
