Must be a US Citizen or Green Card holder.
This is a hybrid position.
Statement of Work: GBS requires the services of a Storage System Administrator III to provide labor services to support the DOE National Energy Research Scientific Computing Center (NERSC) Storage Systems Group (SSG) Team's hardware and software environment at the Lawrence Berkeley National Laboratory's NERSC facilities in Berkeley, CA.
The hardware and software are part of a High Performance Computing (HPC) system environment and include storage systems, servers in support of storage systems, storage services, software, and network components.
The work will require active interaction/participation with clients and the Team to troubleshoot and resolve technical issues with production storage systems.

Baseline Hardware and Software Environment Support

Baseline Equipment, QTY Estimated:
- 14 racks of Elastic Storage System computer storage - Community File System - manufactured by IBM
- 43 disk arrays - NetApp
- 80 storage servers - Supermicro
- 12 Elastic Storage System enclosures - IBM
- 44 storage servers - test development environment - Supermicro
- 48 mid-range servers - HPE
- 164 enterprise tape drives - installed in IBM tape libraries
- 3 tape libraries - manufactured by IBM
- 3 director-level Fibre Channel switches - Brocade

Baseline Software:
- IBM Spectrum Scale
- IBM Red Hat Linux, CentOS
- High Performance Storage System

Required skills/Level of Experience: (Numbered skills/experiences are a priority, listed in order of importance)
- Bachelor's degree or equivalent experience and a minimum of three years of computing or storage experience; or equivalent experience
- Strong understanding of Linux fundamentals including file systems, networking, and automation tools like Ansible or Puppet
- Experience using one or more interpreted programming or scripting languages such as Python and Bash to automate system management tasks
- Ability to work effectively and collaboratively on a team and on technical projects, as well as give and receive constructive feedback to foster communication and trust
- Experience with hardware installation and replacement, running cables, cable management, racking systems, and labeling
- Strong organizational skills and ability to effectively manage priorities across many projects ranging from immediate problem resolution to long-term strategic planning
- Strong written and verbal communication skills and the ability to document and describe complex tasks to audiences of varying familiarity with storage technologies

Task Description:

Team Interaction/Participation
- Participate in weekly team meetings to maintain awareness of open projects and goals
  - as allowable to maintain internal info and activities with other vendors, NDA, etc.
- Monitor Slack for direct messages and other channels for issues related to storage systems
  - limited to certain channels at the discretion of the University
- Respond to email in a timely manner as determined by the University Technical Representative
- Participate as a proactive team member

Potential participation in on-call 24/7 responsibilities
- Potential participation in production storage system problem determination and resolution
  - engage with other team members for advice when in doubt and vendor support when needed
- One-week rotation among 3-4 other individuals
- Average 2-hour on-site response time in emergency situations

Hardware activities
- Communicate discovered and suspected hardware issues to the storage team
  - Slack or email for awareness
  - ServiceNow ticket for tracking status and closure
- Monitor for and respond to hardware issues on all systems from multiple vendors
  - as needed, open support cases with upstream vendors
  - coordinate with SSG team for replacement of components live or with down-time when required
  - monitoring requires proactive parsing of logs and monitoring of graphical user interfaces (GUIs) to detect issues, rather than reactively waiting until something fails
  - see issues through to resolution, e.g. disk controller failure: confirm that the replacement is requested, arrives, and is installed, and that the return material authorization (RMA) is sent back
  - amber light walk at least weekly
- Work with on-site technicians as needed from the University and vendors
- Install/de-install hardware as needed
  - rack and cable both new and existing equipment
  - contribute to larger-scale integration responsibilities shared with other groups, e.g. making storage systems available to new compute resources

Software activities
- At the Client's discretion, determine for all storage system components (OS/kernel/firmware/etc.) when updates are needed
  - read release notes, determine any impact of upgrades and the fixes provided
  - communicate concerns/issues to the team
  - via GitLab issues, document the upgrade plan, date of change(s) and systems involved, any issues encountered, and potential risks
- Identify areas for routine process optimization and implement solutions
  - automation of common tasks, contributing to monitoring infrastructure
- Develop scripts and tools and contribute them to the internal GitLab repository
- Contribute to integration and implementation planning for future system upgrades and deployments

Nice to have skills:
- Demonstrated contributions to the high-performance storage community (e.g., conference presentations, open-source software)
- Ability to present and describe systems and issues to technical staff as well as higher-level management
- Understanding of file system internals, prior work developing storage systems, or experience troubleshooting and optimizing parallel I/O