
Administration

Scott Veirs edited this page Nov 11, 2021 · 14 revisions

Join the admin team!

We welcome your help with administering the real-time inference system. If you'd like to get involved, please check out the on-boarding info for the Orcasound open source community and let us know you'd like to help maintain the system.

Once on-boarded, your first step will be to read the information in this wiki and discuss any questions with the existing administrators. Once you have a handle on how the system works, we can grant you appropriate access to cloud services so you can begin helping.

The OrcaHello project is based in Azure and most of the communication about the project occurs via Teams (default tools during Microsoft hackathons). Other aspects of Orcasound open source projects, including the management of the "swarm" of hydrophone nodes and the live-listening web app designed for community scientists, are deployed beyond Azure and are administered via general Orcasound communication tools.

Administrator handbook

Overview of the system (last update 15 Oct 2021)

Here is a schematic of the OrcaHello system:

OrcaHello schematic showing data flow through the machine learning pipeline to moderation & notification components.

In a nutshell, some administration may be required for each pictured component of OrcaHello:

  1. Pulling audio data streamed from Orcasound hydrophones as quickly as possible (involves coordination with the broader Orcasound administrative community);
  2. Predicting when orca calls occur in the data stream (involves keeping the model running, strategizing with data scientists, and working with moderators to improve models);
  3. Maintaining the databases and storage of model outputs (including training data from other Orcasound projects like the Pod.Cast annotation tool).

An overall responsibility of administrators is to monitor the costs of the OrcaHello system relative to the credits granted to the project by Microsoft's AI for Earth program via an Azure subscription.
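As an illustrative aid for that cost-monitoring responsibility, a quick burn-rate estimate shows how long the remaining credits will last. All figures below are placeholders, not actual grant numbers; substitute the project's real credit balance and recent spend (e.g. from Azure Cost Management):

```python
# Rough burn-rate estimate for a granted Azure subscription.
# All dollar figures here are hypothetical placeholders.

def days_of_credit_remaining(credit_remaining_usd: float,
                             spend_last_30_days_usd: float) -> float:
    """Estimate days until the remaining credit is exhausted,
    assuming spend continues at the last 30 days' average rate."""
    if spend_last_30_days_usd <= 0:
        return float("inf")  # no measurable spend
    daily_rate = spend_last_30_days_usd / 30.0
    return credit_remaining_usd / daily_rate

# Hypothetical example: $1,200 of credit left, $300 spent in the last 30 days
print(round(days_of_credit_remaining(1200.0, 300.0)))  # 120 days at that rate
```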

**Nov. 2021 admin focus: stabilizing deployment**

Prakruti posed a key question: **"What Azure service shall we use in the long run?"**

Michelle's initial suggestions (with pros/cons):

  1. Azure Container Instances (ACI)
  • Pro: we're using it already
  • Con: prone to failing frequently; no horizontal scalability (if/when we need it)
  2. Azure Kubernetes Service (AKS)
  • Pro: horizontal scalability; declarative method allows for easy cluster migration/adding hydrophones; cheaper than ACI
  • Con: need to manage the cluster (K8s version upgrades); higher knowledge barrier to entry
  3. Virtual machines (VM)
  • Pro: lowest barrier to entry; cheapest option
  • Con: managing VMs; no horizontal scalability
  4. Azure Container Apps
  • Pro: no need to maintain infrastructure; horizontal scalability
  • Con: preview service (just announced); most expensive option

Common administrative tasks

Troubleshoot the live system

Triage checklist:

  1. Hydrophones are broken
  2. AWS buckets are not picking up audio correctly
  3. Inference system is broken
  4. Cosmos DB is broken
  5. Moderator portal is broken
  6. Notification system (Azure Functions) is broken
  7. SendGrid is broken

Troubleshooting sub-steps for each level of triage

  1. Ensure the containers are running. (In Azure, view Container Instances -> look for something like `live-inference-system-aci-allhydrophones...` -> State should be "Running".)

  2. Make sure Orcasound data is reaching Azure Blob Storage (look in e.g. `livemlaudiospecstorage.blob.core.windows.net/audiowavs`).

  3. Check that new rows have recently been added to the Cosmos DB database.

View Azure Cosmos DB databases ->

Example queries:

  • `SELECT * FROM c WHERE c.timestamp LIKE "%2021-10-02%"` <- matches all records with a timestamp from 2021-10-02
  • `SELECT * FROM c WHERE c.comments LIKE "%transients%"` <- matches all records whose comment includes the string "transients"
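The sub-steps above can be sketched in Python. This is an illustrative sketch only: the query text comes from the examples above, but the helper names and the 30-minute staleness threshold are assumptions, and actually listing blobs or running queries requires credentials plus the azure-storage-blob / azure-cosmos SDKs.

```python
from datetime import datetime, timedelta, timezone

# Sub-step 2: flag when the newest blob in audiowavs looks stale.
# The 30-minute threshold is an assumption; tune it to the expected
# cadence of incoming Orcasound audio.
def audio_is_stale(latest_blob_modified: datetime,
                   now: datetime,
                   max_age: timedelta = timedelta(minutes=30)) -> bool:
    """True if no new audio has landed within max_age."""
    return (now - latest_blob_modified) > max_age

# Sub-step 3: build the Cosmos DB SQL queries shown above.
def detections_on_date(date: str) -> str:
    """Query for all records whose timestamp contains `date`."""
    return f'SELECT * FROM c WHERE c.timestamp LIKE "%{date}%"'

def detections_with_comment(substring: str) -> str:
    """Query for all records whose comment includes `substring`."""
    return f'SELECT * FROM c WHERE c.comments LIKE "%{substring}%"'

now = datetime(2021, 10, 2, 12, 0, tzinfo=timezone.utc)
print(audio_is_stale(now - timedelta(hours=2), now))  # True -> investigate upstream
print(detections_on_date("2021-10-02"))
```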

Add new admin

Restart ACI container(s)
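A restart can be scripted around the standard `az container restart` CLI command. This is a minimal sketch: the resource group and container group names below are hypothetical placeholders (look up the real names in the Azure Portal), and running the command requires the Azure CLI to be installed and logged in via `az login`.

```python
import subprocess

# Hypothetical placeholders -- find the actual resource group and container
# group (something like live-inference-system-aci-allhydrophones...) in the
# Azure Portal before running.
RESOURCE_GROUP = "<your-resource-group>"
CONTAINER_GROUP = "<container-group-name>"

def restart_cmd(resource_group: str, container_group: str) -> list:
    """Build the `az container restart` invocation as an argument list."""
    return ["az", "container", "restart",
            "--resource-group", resource_group,
            "--name", container_group]

# To actually restart (requires Azure CLI login), uncomment:
# subprocess.run(restart_cmd(RESOURCE_GROUP, CONTAINER_GROUP), check=True)
```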

Add a hydrophone location

Add a moderator

Manage moderator email list

Manage subscriber email list(s)

Machine learning processes

Access to High-Performance Computing (HPC) resources within Azure

Guidance from AI for Earth:

You may now request GPU access for your Azure account using the following steps. As a courtesy, please only request the SKU/region you need and do not ask for all SKUs/regions. Also, to fairly distribute the limited number of GPUs that AI for Earth has reserved for grantees, we are limiting each grantee's access to only 12 additional cores of each SKU/region.

1. Go to the Azure Portal: https://portal.azure.com
2. Select Help + support
3. Select New support request
4. Select "Issue type" as "Service and subscription limits (quotas)" from the drop-down list
5. Select the subscription ID (make sure it's the subscription that you sent to us for whitelisting)
6. Quota type = Compute/VM (cores/vCPUs) subscription limit increases
7. Change the Support Method as appropriate
8. Set the Request Details for the "Resource Manager" deployment model (the GPU SKUs are not deployable via the Classic deployment model):
   a. Select your Severity level
   b. Select the Deployment model
   c. Select the Location
   d. Select the SKU Family (multiple selections are possible)
   e. You can see the current quota limit
   f. Fill in the required new quota limit
   g. Click on "Save and continue"
9. Verify contact info and click on "Next: Review + create" to generate the request.

Please contact AI4EHelp@microsoft.com if you have any difficulties with requesting access.

Re-training the OrcaHello model for SRKW calls
