MSU AI Club

Inspiration

The surge in AI data center construction since the AI boom began in 2020 has created a persistent concern about the environmental impact of unregulated data centers. That concern has deepened under the current U.S. administration, which has stepped away from international environmental regulatory frameworks. AI has since become a divisive topic: should data centers (AI-focused or otherwise) be allowed at all, given their energy and water consumption (both quantifiable environmental variables) and the noise pollution they bring to surrounding communities? As a team, we believe that data centers that currently operate unregulated need to be held accountable. They are often run by multinational companies that continue to operate this way because of weak executive enforcement, gaps in state law, or the limited effect of activist and union protests against data centers. We created this tool for organizations and environmental regulators, including those working on environmental public policy accountability, who want to use AI to regulate environmental impact and secure environmental justice for marginalized communities and other communities affected by unregulated data centers around the U.S.

What it does

Data Center Energy Predictor is an AI-powered geospatial platform that forecasts the electricity consumption and environmental impact of data centers at facility-level granularity.

Core Functionality:

Regional Attribution Model
  When facility-specific data is unavailable, we count the data centers in each region and divide total regional energy consumption by that count to estimate average consumption per data center. For example, if a region consumes 10,000 MWh and contains 5 data centers, we estimate 2,000 MWh per facility.
Multi-State Coverage
  We track 500+ data centers across Virginia, Ohio, Illinois, and Michigan. Users can filter by state or region to see local impact.
Interactive Geospatial Visualization
  Our map interface, built with ArcGIS and Tableau, allows users to:
    Click any data center to see predicted annual energy consumption
    View facility capacity, operational dates, and efficiency metrics (PUE)
    See regional consumption heatmaps color-coded by intensity
    Filter by state, county, or grid operator region
Predictive Machine Learning
  We use XGBoost models trained on 15 years of EIA (U.S. Energy Information Administration) commercial electricity data (2010 through 2025) to forecast energy consumption based on facility capacity, operational dates, and regional trends.
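As an illustration, the regional attribution step described above could be sketched as follows. This is a minimal sketch, not our production code: the function name `estimate_per_facility_mwh`, the region labels, and the second region's numbers are hypothetical. Only the 10,000 MWh / 5 facilities example comes from the description above.

```python
from collections import Counter


def estimate_per_facility_mwh(regional_mwh, facility_regions):
    """Split each region's total consumption evenly across its data centers.

    regional_mwh: dict mapping region name -> total consumption in MWh
    facility_regions: list with one region name per known facility
    """
    counts = Counter(facility_regions)
    estimates = {}
    for region, total_mwh in regional_mwh.items():
        n_facilities = counts.get(region, 0)
        if n_facilities == 0:
            # No known facilities in this region, so nothing to attribute.
            continue
        estimates[region] = total_mwh / n_facilities
    return estimates


# Hypothetical inputs; "region_a" mirrors the worked example in the text.
regional_mwh = {"region_a": 10_000.0, "region_b": 4_500.0}
facility_regions = ["region_a"] * 5 + ["region_b"] * 3

per_facility = estimate_per_facility_mwh(regional_mwh, facility_regions)
print(per_facility)
```

Note that this sketch attributes the entire regional figure to data centers, exactly as the worked example does. Separating out non-data-center demand is a distinct problem.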

Who Benefits:
  Communities near data centers gain evidence to demand fair infrastructure investment
  Environmental organizations get facility-specific data to hold tech companies accountable
  Municipal planners assess true grid impact before approving permits
  Journalists and researchers access independent, verifiable energy predictions
Unlike existing dashboards that show only state-level aggregates, we provide facility-specific predictions, enabling targeted policy interventions and community advocacy.

How We Built It

Data Collection & Processing

Data Center Locations
  Scraped addresses, capacities, and coordinates from DataCenterMap
  Cross-referenced with construction permits from county databases (Shovels.ai, Loudoun County, etc.)
  Used AI-assisted web scraping (Claude and ChatGPT APIs) to extract operational dates from news articles and company announcements
  Built comprehensive CSV databases for Virginia (193 facilities), Ohio, Illinois, and Michigan
Energy Consumption Data
  Queried EIA OpenData API for commercial electricity consumption by region
  Pulled data from PJM Interconnection (serves Virginia, Ohio, Michigan) and MISO (serves Illinois)
  Extracted monthly time series from 2010 through 2025
  Normalized to MWh per year for consistency
Capacity Calculations
  For facilities without published energy consumption, we used Rishabh’s formula to estimate it from capacity:
  Annual Energy (MWh/year) = Capacity (MW) × 0.9 × 8,760 hours
  where 0.9 is the industry-standard capacity factor accounting for downtime.
Data Cleaning & Validation
  Filtered inactive facilities (1980s/1990s data centers no longer operational)
  Removed duplicates and validated geocoding accuracy (>95% accuracy target)
  Added PUE (Power Usage Effectiveness) placeholders (1.5 default) where facility-specific data was unavailable
  Cross-referenced predictions against IEA (International Energy Agency) baseline reports for validation
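The capacity formula and the cleaning steps above can be sketched together. This is a hedged illustration only: the record fields (`name`, `address`, `capacity_mw`, `pue`, `active`) and the name-plus-address duplicate key are assumptions made for the example, not the actual CSV schema. The 0.9 capacity factor, the 8,760 hours, and the 1.5 PUE default are the values stated in this write-up.

```python
HOURS_PER_YEAR = 8_760
CAPACITY_FACTOR = 0.9  # value stated in the write-up
DEFAULT_PUE = 1.5      # placeholder stated in the write-up


def annual_energy_mwh(capacity_mw, capacity_factor=CAPACITY_FACTOR):
    """Annual Energy (MWh/year) = Capacity (MW) x capacity factor x 8,760 hours."""
    return capacity_mw * capacity_factor * HOURS_PER_YEAR


def clean_facilities(records):
    """Drop inactive and duplicate facilities, fill missing PUE, estimate energy."""
    seen = set()
    cleaned = []
    for rec in records:
        if not rec.get("active", True):
            continue  # skip facilities flagged as no longer operational
        # Assumed duplicate key: case-insensitive name + address.
        key = (rec["name"].strip().lower(), rec["address"].strip().lower())
        if key in seen:
            continue
        seen.add(key)
        row = dict(rec)
        if row.get("pue") is None:
            row["pue"] = DEFAULT_PUE
        capacity = row.get("capacity_mw")
        # Leave the estimate empty when capacity is unknown rather than guess.
        row["annual_mwh"] = annual_energy_mwh(capacity) if capacity is not None else None
        cleaned.append(row)
    return cleaned


# Hypothetical records for illustration only.
records = [
    {"name": "Example DC 1", "address": "1 Main St", "capacity_mw": 10.0, "pue": None, "active": True},
    {"name": "example dc 1 ", "address": "1 main st", "capacity_mw": 10.0, "pue": None, "active": True},
    {"name": "Legacy DC", "address": "2 Old Rd", "capacity_mw": 2.0, "pue": 2.0, "active": False},
    {"name": "Example DC 2", "address": "3 Oak Ave", "capacity_mw": None, "pue": 1.2, "active": True},
]

cleaned = clean_facilities(records)
for row in cleaned:
    print(row["name"], row["pue"], row["annual_mwh"])
```

In this sketch a 10 MW facility comes out at 10 × 0.9 × 8,760 = 78,840 MWh per year. Facilities with no capacity value are kept but left without an estimate, so the gap stays visible downstream.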

Technical Stack
Machine Learning:
  Python + Pandas for data processing
  XGBoost for energy consumption prediction
  GeoPandas for spatial analysis and regional joins
Visualization:
  ArcGIS Online for interactive geospatial maps
  Tableau for data quality dashboards and facility distribution charts
  Folium (alternative) for web-based mapping
Data Sources:
  EIA OpenData API (primary energy data)
  DataCenterMap (facility locations and capacity)
  County construction permit databases
  IEA Energy and AI reports (validation)
Deployment:
  Railway for cloud hosting
  GitHub for version control and code collaboration
  Google Drive for team data sharing

Challenges we ran into

The Water to Energy Pivot (February 2026)
  Challenge: We originally planned to predict water consumption using USGS public supply data matched to HUC-12 watersheds. After two weeks of data collection, we discovered a fatal flaw: the data showed a correlation of negative 0.0007% with data center locations, which is effectively no relationship. The USGS data also stopped at 2020, missing the entire AI boom.
  Resolution: We consulted Leah Morin, JMC librarian and data scientist, who recommended pivoting to EIA energy data. This was a major strategic change, but the right scientific decision.
Operational Date Scarcity
  Challenge: Only 5.2% of facilities (10 out of 193 Virginia data centers) had confirmed operational dates. Without dates, we could not do clean before-and-after energy consumption analysis.
  Resolution:
    Automated web scraping using AI APIs to extract dates from unstructured news text
    Expanded sources: construction permits, company press releases, local news
    Target: 50+ facilities with confirmed dates by final checkpoint
Data Quality Inconsistencies
  Challenge: DataCenterMap includes facilities from the 1980s onward with no operational status field. Many are inactive. Some show “under construction” but have been stalled for years. Capacity data is missing for 40%+ of facilities.
  Resolution:
    Filter by construction date (post-2010 more likely active)
    Use regional averages or similar-facility comparisons for missing capacity
    Flag data quality issues transparently in methodology documentation
    Accept that facility-level data is imperfect; focus on regional accuracy
Attribution Complexity
  Challenge: Energy consumption changes in a region could come from non-data-center sources (new recreational centers, population growth, industrial expansion). How do we isolate data center impact?
  Resolution:
    Build control dataset: regions with no data centers but similar demographics
    Focus on before-and-after analysis for facilities with confirmed operational dates
    Use correlation analysis to validate attribution model
    Document limitations openly
Team Coordination Across Technical Domains
  Challenge: Team members had different technical backgrounds (data science, computer science, economics/policy). Coordinating data handoffs and ensuring consistent methodology was complex.
  Resolution:
    Weekly Thursday standup meetings (3pm to 5pm)
    Clear task delegation by expertise (Rishabh and Justin on ML, Delger on web scraping, Meli on coordination)
    Standardized CSV naming conventions and GitHub branch structure
    Shared Google Drive folder for data uploads
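One way to combine the control dataset and the before-and-after analysis described under Attribution Complexity is a simple difference-in-differences comparison. We name that technique here as an illustration, not as a confirmed part of our pipeline. The function names, years, and consumption figures below are all hypothetical.

```python
from statistics import mean


def before_after_change(series, start_year):
    """Mean consumption from start_year onward minus mean consumption before it.

    series: dict mapping year -> regional consumption (any consistent unit)
    """
    before = [value for year, value in series.items() if year < start_year]
    after = [value for year, value in series.items() if year >= start_year]
    if not before or not after:
        raise ValueError("need at least one year on each side of start_year")
    return mean(after) - mean(before)


def diff_in_diff(treated, control, start_year):
    """Change in the data center region minus change in the control region."""
    return before_after_change(treated, start_year) - before_after_change(control, start_year)


# Hypothetical series: a facility comes online in the treated region in 2018.
treated_region = {2016: 100.0, 2017: 102.0, 2018: 130.0, 2019: 134.0}
control_region = {2016: 80.0, 2017: 82.0, 2018: 84.0, 2019: 86.0}

effect = diff_in_diff(treated_region, control_region, start_year=2018)
print(effect)
```

The control region's own growth is subtracted out, so background trends such as population growth are less likely to be mistaken for data center load. This only holds to the extent that the two regions would otherwise have moved in parallel, which is itself an assumption that needs checking.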

Accomplishments that we're proud of

Successfully Pivoted Under Pressure
  When our water consumption approach failed after two weeks of work, we made a data-driven decision to pivot to energy. We rebuilt the entire data pipeline in under three weeks and delivered a working prototype by Checkpoint. This demonstrated scientific rigor and adaptability.
Built a Novel Attribution Model
  We tackled a problem that even major research institutions struggle with: estimating facility-specific energy consumption when operators do not publish data. Our regional attribution model (count facilities → divide energy → estimate average) is simple, transparent, and scientifically defensible.
Processed 500+ Data Centers Across 4 States
  We manually reviewed 720 Virginia facilities, filtered to 193 with confirmed addresses, and expanded to Ohio, Illinois, and Michigan. To our knowledge, this is the largest independent data center energy database built by students.
Integrated AI for Data Collection
  We used Claude and ChatGPT APIs to automate operational date extraction from unstructured news articles. This shows a practical application of generative AI for public interest research, not just commercial use cases.
Achieved External Validation
  Our Virginia energy consumption predictions match IEA (International Energy Agency) U.S. data center estimates when adjusted for Virginia’s share of national facilities. This external validation gives our model credibility.

What we learned

1. Built with Zero Funding
  We completed this project with $25 in MSU AI Club API credits and personal contributions for cloud hosting. No corporate sponsors, no grants. Just students solving a real problem with public data and open-source tools.

2. Youth-Led, Community-Focused
  We are college students with no financial ties to Big Tech. Our incentive structure is accountability, not profit. This independence is our competitive advantage.

3. Data Quality Matters More Than Model Complexity
  We initially focused on building sophisticated ML models (XGBoost, LSTM). But we learned that garbage in, garbage out. Spending time cleaning data, filtering inactive facilities, and validating geocoding accuracy was more impactful than tweaking hyperparameters.

4. Start with Simple Baselines
  Our regional attribution model (divide total energy by facility count) is mathematically simple. But it works. We learned not to overcomplicate when a straightforward approach is defensible and transparent.

5. Public APIs Are Powerful
  EIA OpenData API gave us 15 years of commercial electricity data for free. No web scraping legal issues, no paywalls. We learned to prioritize public data sources over proprietary datasets.

6. AI Can Augment, Not Replace, Human Judgment
  ChatGPT and Claude helped us extract operational dates from news articles 10x faster than manual reading. But we still had to verify outputs and flag low-confidence extractions. AI is a tool, not a replacement for critical thinking.

7. Pivot Early When Data Shows Problems
  We could have spent another month trying to salvage the water approach. Instead, we pivoted after two weeks when correlation analysis showed it would not work. Cutting losses early saved the project.

8. Task Delegation by Expertise
  Assigning tasks based on team skills (Rishabh on ML, Delger on web scraping, Justin on databases, Meli on coordination) was more effective than equal division of labor.

9. Weekly Check-Ins Prevent Bottlenecks
  Our Thursday standups caught issues early (e.g., Diggy discovering the operational status problem, Rishabh finding PUE data gaps). Without these, we would have discovered problems too late.

10. Documentation Is Not Optional
  Writing methodology documents, README files, and data dictionaries felt like overhead. But when MIT Solve asked for technical proof, we had it ready. Documentation is a competitive advantage.

11. Data Centers Are Not Regulated Like We Thought
  We assumed facility-level energy reporting was required by law. It is not. Only aggregate company reports exist, and those are voluntary. This regulatory gap is why our tool matters.

12. Community Advocacy Needs Data
  Environmental groups told us their biggest challenge is lack of evidence. Tech companies say “trust our sustainability reports.” Communities need independent, verifiable numbers to negotiate in permit hearings.

13. Energy Infrastructure Is Political
  Data centers lobby for expedited permits and grid priority. Utility rate increases get passed to residential customers. This is not just a technical problem. It is a power dynamic problem. Our tool shifts leverage.

What's next

Immediate Next Steps (By April 17 Final Showcase)
1. Complete Multi-State Data Collection
  Finalize Ohio, Illinois, Michigan facility counts
  Target: 500+ active data centers with >95% geocoding accuracy
  Expand operational dates from 10 to 50+ facilities
2. Deploy Production Application
  Integrate Tableau visualizations with Rishabh’s energy consumption calculations
  Launch multi-state interactive map on Railway
  Add regional filtering (select state → see facilities)
  Include tooltips with facility details (name, capacity, predicted MWh)
3. Create Demo Video
  Record 3 to 5 minute walkthrough showing live application
  Upload to YouTube (unlisted) for MSU AI Club website
  Demonstrate regional attribution model in action
4. Complete MIT Solve Application
  Submit technical documentation and methodology write-up
  Provide proof points: GitHub repo, demo video, validation data
  Apply by March 22 deadline

Medium-Term Goals (Summer 2026)
1. Expand Geographic Coverage
  Add Texas (massive data center expansion in Dallas, Austin)
  Add North Carolina (Google, Apple facilities)
  Add Arizona (Meta data center cluster)
  Target: 1,000+ facilities across 10 states
2. Incorporate Real Operational Data
  Partner with utilities or grid operators for actual consumption validation
  Cross-reference predictions with company sustainability reports where available
  Build case studies: Did our predictions match reality?
3. Develop Public API
  Allow journalists, researchers, and advocacy groups to query our database
  Provide JSON endpoints for facility search, regional summaries, time series predictions
  Open-source codebase on GitHub for replication and auditing
4. Pilot with Environmental Justice Organizations
  Deploy with Piedmont Environmental Council (Virginia)
  Gather user feedback on interface usability and data needs
  Iterate based on real-world community advocacy use cases

Long-Term Vision (2027 and Beyond)
1. Build Community Energy Intelligence Infrastructure
  Just as PurpleAir democratized air quality monitoring through citizen science, we want to democratize energy accountability. Every data center hub globally should have independent, community-accessible predictions for facility-level consumption.
2. Influence Policy and Regulation
  Our model could inform:
    County permit processes (require energy impact assessments before approval)
    Utility rate cases (ensure data centers pay fair share of grid upgrades)
    State renewable energy mandates (track whether data centers meet clean energy commitments)
    Federal transparency legislation (require facility-specific reporting)
3. Expand to Other Infrastructure
  The attribution methodology we developed (count facilities → divide regional impact → estimate average) could apply to:
    Cryptocurrency mining operations (energy consumption)
    Industrial warehouses (water consumption)
    Manufacturing facilities (carbon emissions)
4. Scale Sustainably
  We aim for grant funding, not venture capital. This is a public good tool, not a startup. Estimated annual operating costs: $5,000 (cloud hosting, API credits, domain). One full-time equivalent can maintain the platform and expand coverage quarterly.

Data Center Energy Impact Analysis