Methodology

A detailed overview of our data collection, processing, and analysis methods

Data Collection

Historical Trip Data

We collected historical Citi Bike trip data covering January 2024 through October 2025 (22 months), sourced from Citi Bike System Data. The raw dataset comprises 96 CSV files from the monthly data releases, each containing detailed trip records.

Total raw dataset size: ~15 GB
Total trips in raw data: ~84 million
Filtered dataset: 529,908 trips involving Columbia stations

Real-Time Station Status

Live station status data is fetched from the Citi Bike GBFS (General Bikeshare Feed Specification) API. This provides real-time information on bike availability, dock availability, and station status.
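As a rough illustration of how a station status response can be reduced to the fields mentioned above, the sketch below works on an already-decoded `station_status` payload. The field names (`num_bikes_available`, `num_docks_available`, `is_renting`, `is_returning`) follow the GBFS specification; the `summarize_status` helper, the sample payload, and its station IDs are hypothetical, and the network fetch itself is left out.

```python
from typing import Any


def summarize_status(payload: dict[str, Any], station_ids: set[str]) -> list[dict[str, Any]]:
    """Extract availability fields for selected stations from a decoded
    GBFS station_status response (hypothetical helper, not project code)."""
    rows = []
    for station in payload["data"]["stations"]:
        if station["station_id"] not in station_ids:
            continue
        rows.append({
            "station_id": station["station_id"],
            "bikes": station.get("num_bikes_available", 0),
            "docks": station.get("num_docks_available", 0),
            # Depending on the GBFS version these flags are 0/1 or true/false,
            # so bool() is used to normalize both encodings.
            "active": bool(station.get("is_renting")) and bool(station.get("is_returning")),
        })
    return rows


# Hypothetical payload; the IDs are placeholders and may not match the
# identifiers used in the trip data or in the live feed.
sample_payload = {
    "last_updated": 1700000000,
    "ttl": 60,
    "data": {
        "stations": [
            {"station_id": "station-a", "num_bikes_available": 4,
             "num_docks_available": 11, "is_renting": 1, "is_returning": 1},
            {"station_id": "station-b", "num_bikes_available": 0,
             "num_docks_available": 20, "is_renting": 0, "is_returning": 1},
            {"station_id": "station-c", "num_bikes_available": 9,
             "num_docks_available": 2, "is_renting": 1, "is_returning": 1},
        ]
    },
}

for row in summarize_status(sample_payload, {"station-a", "station-b"}):
    print(row)
```

Keeping the parsing separate from the HTTP request makes the logic easy to test against a saved response.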

Station Selection

We focused our analysis on 7 Citi Bike stations in the Columbia University area, covering the Morningside Heights and Manhattanville neighborhoods:

Broadway & W 122 St (Station ID: 7783.18)
Morningside Dr & Amsterdam Ave (Station ID: 7741.04)
W 120 St & Claremont Ave (Station ID: 7745.07)
Amsterdam Ave & W 119 St (Station ID: 7727.07)
W 116 St & Broadway (Station ID: 7713.11)
W 116 St & Amsterdam Ave (Station ID: 7692.11)
W 113 St & Broadway (Station ID: 7713.01)

These stations were selected for their proximity to the Columbia University campus and their relevance to student and staff commuting patterns.

Data Processing

Filtering Process

From the complete Citi Bike dataset, we filtered trips where either the origin (start_station_id) or destination (end_station_id) was one of our 7 Columbia-area stations. This resulted in 529,908 trips.

Data Schema

Each trip record contains 13 fields:

ride_id: Unique trip identifier
rideable_type: Bike type (classic_bike, electric_bike)
started_at, ended_at: Trip start and end timestamps
start_station_name, start_station_id: Origin station details
end_station_name, end_station_id: Destination station details
start_lat, start_lng, end_lat, end_lng: GPS coordinates
member_casual: User type (member or casual rider)
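One way to picture the 13-field record is as a typed structure. The sketch below is an assumption about how a row could be modeled, not the project's code: the `Trip` class and `from_row` helper are hypothetical, the sample values (including coordinates) are illustrative, and it assumes every field is present in the row.

```python
from dataclasses import dataclass, fields
from datetime import datetime


@dataclass(frozen=True)
class Trip:
    """One trip record with the 13 fields listed above (hypothetical model)."""
    ride_id: str
    rideable_type: str
    started_at: datetime
    ended_at: datetime
    start_station_name: str
    start_station_id: str  # kept as text so IDs like "7713.10" are not altered
    end_station_name: str
    end_station_id: str
    start_lat: float
    start_lng: float
    end_lat: float
    end_lng: float
    member_casual: str

    @classmethod
    def from_row(cls, row: dict[str, str]) -> "Trip":
        """Build a Trip from a CSV row whose values are all strings."""
        values = {}
        for f in fields(cls):
            raw = row[f.name]
            if f.name in ("started_at", "ended_at"):
                values[f.name] = datetime.fromisoformat(raw)
            elif f.name.endswith(("_lat", "_lng")):
                values[f.name] = float(raw)
            else:
                values[f.name] = raw
        return cls(**values)


# Illustrative row; the values are made up for the example.
sample_row = {
    "ride_id": "ABC123", "rideable_type": "electric_bike",
    "started_at": "2024-01-05 08:15:30.123", "ended_at": "2024-01-05 08:27:02.456",
    "start_station_name": "W 116 St & Broadway", "start_station_id": "7713.11",
    "end_station_name": "W 113 St & Broadway", "end_station_id": "7713.01",
    "start_lat": "40.808", "start_lng": "-73.964",
    "end_lat": "40.805", "end_lng": "-73.966",
    "member_casual": "member",
}

trip = Trip.from_row(sample_row)
print(trip.ride_id, trip.ended_at - trip.started_at)
```

Station IDs are stored as strings rather than floats because they act as identifiers, and numeric parsing could drop trailing zeros.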

Data Pipeline

Load all 96 CSV files using Python pandas
Parse datetime columns (started_at, ended_at)
Filter trips involving Columbia stations
Sort by trip start time, then end time
Export to consolidated CSV file (columbia_filtered_citibike.csv, ~52 MB)
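The steps above can be sketched as follows. The project itself uses pandas; this version deliberately sticks to the Python standard library so that it runs without extra dependencies, and should be read as an approximation of the pipeline rather than its actual implementation. The `run_pipeline` and `involves_columbia` names and the example input pattern are hypothetical; the station IDs are the seven listed under Station Selection.

```python
import csv
from datetime import datetime

# The seven Columbia-area station IDs, kept as strings to match the CSV text.
COLUMBIA_STATION_IDS = {
    "7783.18", "7741.04", "7745.07", "7727.07",
    "7713.11", "7692.11", "7713.01",
}


def involves_columbia(row: dict[str, str]) -> bool:
    """True if the trip starts or ends at one of the selected stations."""
    return (row["start_station_id"] in COLUMBIA_STATION_IDS
            or row["end_station_id"] in COLUMBIA_STATION_IDS)


def run_pipeline(input_paths: list[str], output_path: str) -> int:
    """Load, filter, sort, and export trips; returns the number of rows kept."""
    kept: list[dict[str, str]] = []
    fieldnames = None
    for path in input_paths:
        with open(path, newline="", encoding="utf-8") as fh:
            reader = csv.DictReader(fh)
            # Reuse the header of the first file for the output.
            fieldnames = fieldnames or reader.fieldnames
            kept.extend(row for row in reader if involves_columbia(row))
    if fieldnames is None:
        raise ValueError("no input files with a header row were provided")

    # Parse the timestamps only for ordering: start time first, then end time.
    kept.sort(key=lambda r: (datetime.fromisoformat(r["started_at"]),
                             datetime.fromisoformat(r["ended_at"])))

    with open(output_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)
    return len(kept)


# Example call (the input location is an assumption):
#   import glob
#   run_pipeline(sorted(glob.glob("raw/*.csv")), "columbia_filtered_citibike.csv")
```

Filtering each file as it is read keeps memory use proportional to the filtered output rather than to the full raw dataset.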

Analysis Methods

Temporal Analysis

We analyze usage patterns across multiple time scales:

Seasonal trends: Monthly aggregation to identify seasonal variations
Weekly patterns: Day-of-week analysis to understand weekday vs. weekend usage
Hourly distribution: Hour-of-day analysis to identify peak usage times
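A minimal sketch of the three aggregations, assuming trip start times have already been parsed into `datetime` objects. The `temporal_counts` function is a hypothetical helper; grouping on the start time (rather than the end time) is also an assumption.

```python
from collections import Counter
from datetime import datetime


def temporal_counts(start_times: list[datetime]) -> tuple[Counter, Counter, Counter]:
    """Count trips per calendar month, per day of week, and per hour of day."""
    by_month: Counter = Counter()
    by_weekday: Counter = Counter()
    by_hour: Counter = Counter()
    for ts in start_times:
        by_month[ts.strftime("%Y-%m")] += 1  # e.g. "2024-01"
        by_weekday[ts.weekday()] += 1        # 0 = Monday ... 6 = Sunday
        by_hour[ts.hour] += 1                # 0-23
    return by_month, by_weekday, by_hour


# Illustrative timestamps, not taken from the dataset.
sample_starts = [
    datetime(2024, 1, 1, 8, 30),   # Monday morning
    datetime(2024, 1, 1, 17, 5),   # Monday evening
    datetime(2024, 1, 6, 8, 45),   # Saturday morning
    datetime(2024, 2, 3, 13, 0),   # Saturday afternoon
]

months, weekdays, hours = temporal_counts(sample_starts)
print(dict(months))
print(dict(weekdays))
print(dict(hours))
```

Counting on the pair `(ts.weekday(), ts.hour)` in the same loop would give the hour-by-day grid used for a heatmap.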

Station-Level Metrics

Key metrics calculated for each station:

Total trips: Count of trips originating or ending at the station
Inflow/Outflow balance: Net difference between arriving and departing bikes
Bike type distribution: Breakdown of classic vs. electric bike usage
User type distribution: Member vs. casual rider proportions
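The metrics above could be computed along these lines. This is a sketch under stated assumptions: `station_metrics` is a hypothetical helper, the balance is defined here as arrivals minus departures (the sign convention is our assumption), and a round trip that starts and ends at the same station is counted as one trip for that station.

```python
def station_metrics(trips: list[dict[str, str]], station_ids: set[str]) -> dict[str, dict]:
    """Per-station trip counts, inflow/outflow balance, and usage shares."""
    stats = {
        sid: {"total_trips": 0, "departures": 0, "arrivals": 0,
              "electric": 0, "member": 0}
        for sid in station_ids
    }
    for trip in trips:
        start, end = trip["start_station_id"], trip["end_station_id"]
        if start in stats:
            stats[start]["departures"] += 1
        if end in stats:
            stats[end]["arrivals"] += 1
        # Each station touched by the trip counts it once, even for round trips.
        for sid in stats.keys() & {start, end}:
            s = stats[sid]
            s["total_trips"] += 1
            if trip["rideable_type"] == "electric_bike":
                s["electric"] += 1
            if trip["member_casual"] == "member":
                s["member"] += 1

    for s in stats.values():
        n = s["total_trips"]
        s["net_inflow"] = s["arrivals"] - s["departures"]  # assumed sign convention
        s["electric_share"] = s["electric"] / n if n else None
        s["member_share"] = s["member"] / n if n else None
    return stats


# Illustrative trips with placeholder station IDs "A", "B", and "X".
sample_trips = [
    {"start_station_id": "A", "end_station_id": "B",
     "rideable_type": "electric_bike", "member_casual": "member"},
    {"start_station_id": "B", "end_station_id": "A",
     "rideable_type": "classic_bike", "member_casual": "casual"},
    {"start_station_id": "A", "end_station_id": "A",
     "rideable_type": "classic_bike", "member_casual": "member"},
    {"start_station_id": "X", "end_station_id": "A",
     "rideable_type": "electric_bike", "member_casual": "casual"},
]

metrics = station_metrics(sample_trips, {"A", "B"})
print(metrics["A"])
```

A positive `net_inflow` under this convention means more bikes arrived than left, which is the kind of imbalance rebalancing would need to correct.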

Visualization Tools

We use interactive visualizations built with Plotly and deck.gl to present findings:

Time series charts: Track usage trends over time
Heatmaps: Display hour-by-day usage patterns
Bar charts: Compare metrics across stations
3D map visualization: Real-time station status with geographic context

Demand Forecasting

We developed a machine learning model to predict hourly bike demand at Columbia-area stations, enabling proactive rebalancing strategies. We evaluated several regression models to identify the best-performing approach.

Model Selection: Compared Linear Regression, Random Forest, Gradient Boosting, and XGBoost
Best Model: XGBoost, selected based on R² and MAE
Features: 27 engineered features including temporal patterns, lag variables, rolling averages, and academic calendar indicators
Performance: R² = 0.722, MAE = 1.63 departures/hour
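To make the feature and metric terms concrete, the sketch below shows one way to derive a lag variable and a rolling average from an hourly departure series, along with the standard definitions of MAE and R². It is illustrative only: the function names, the lag of 1 hour, and the 3-hour window are assumptions, and the project's actual 27 features and evaluation setup are described on the Demand Forecasting page.

```python
from statistics import mean


def lag_and_rolling(series: list[float], lag: int = 1, window: int = 3) -> list[tuple]:
    """For each hour t, return (value at t - lag, mean of the `window` hours
    before t). Only past values are used, so the target at t never leaks
    into its own features. Entries are None where history is too short."""
    rows = []
    for t in range(len(series)):
        lag_value = series[t - lag] if t >= lag else None
        rolling = mean(series[t - window:t]) if t >= window else None
        rows.append((lag_value, rolling))
    return rows


def mae(y_true: list[float], y_pred: list[float]) -> float:
    """Mean absolute error: average of |actual - predicted|."""
    return mean(abs(a - p) for a, p in zip(y_true, y_pred))


def r2(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.
    Undefined (division by zero) when y_true is constant."""
    y_bar = mean(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - y_bar) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot


# Illustrative hourly departures, not taken from the dataset.
departures = [3, 5, 2, 4, 6]
features = lag_and_rolling(departures)
print(features)

print(mae([2, 4, 6], [3, 4, 5]))
print(r2([2, 4, 6], [3, 4, 5]))
```

Under these definitions, an MAE of 1.63 departures/hour means predictions are off by about 1.6 departures per hour on average, and an R² of 0.722 means the model accounts for roughly 72% of the variance in hourly departures.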

For detailed methodology including feature engineering, model comparison, and evaluation, see the Demand Forecasting page.

Technologies & Tools

Data Processing

Python 3.x
pandas (data manipulation)
NumPy (numerical operations)
Jupyter notebooks (exploratory analysis)

Visualization

Plotly (interactive charts)
deck.gl (3D map visualization)
MapLibre (map rendering)

Web Application

Next.js 15 (React framework)
TypeScript
Tailwind CSS (styling)

Backend

FastAPI (Python web framework)
RESTful API design
GBFS API integration

Machine Learning

scikit-learn (Linear Regression, Random Forest)
XGBoost (gradient boosting)

Deployment

Vercel (frontend hosting)
Railway (backend hosting)
Cloudflare (DNS management)