What Do Travellers Really Think About Swiss Train Stations?

An analysis of Google Maps reviews for SBB stations

Introduction

A train station serves many functions at once: transit node, waiting space, and first point of contact with a city. This analysis examines how passengers experienced SBB stations in Switzerland across all of these dimensions, drawing on real reviews to look beyond the network’s headline performance metrics.

This analysis covers 22,000+ Google Maps reviews across 61 SBB stations, exploring ratings, sentiment, language patterns, and what passengers consistently praise or complain about.

Scope and Methodology

Stations were selected using SBB’s trafimage dataset as a filter. I assumed that if a station appears on the official schematic map, it has enough operational significance and foot traffic to generate useful review volume. Smaller stations were excluded as they produce too few reviews on Google Maps to say anything meaningful. Both Bern Europaplatz stations were merged into one entry for crawling.

Not every review for every station was captured. Reviews longer than 340 characters are cut off. The dataset is enough to show what this kind of analysis can reveal, not to make definitive claims about the network or the stations as a whole.
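
To gauge how much the cut-off matters, one option is to count how many review texts sit at the limit. The snippet below is a minimal sketch, assuming the scraper truncates at exactly 340 characters; the constant name, the helper `is_probably_truncated`, and the sample texts are illustrative and not part of the project code.

```python
# Assumption: the scraper cuts review text at this many characters.
TRUNCATION_LIMIT = 340

def is_probably_truncated(text, limit=TRUNCATION_LIMIT):
    """Return True when a review text is at least as long as the cut-off."""
    return text is not None and len(text) >= limit

# Illustrative sample, not real reviews (None stands for a rating-only review)
sample_texts = ["Clean and well signposted.", "x" * 340, None, "Too crowded at rush hour."]

flags = [is_probably_truncated(t) for t in sample_texts]
share_truncated = sum(flags) / len(sample_texts)
print(f"Probably truncated: {sum(flags)} of {len(sample_texts)} ({share_truncated:.0%})")
```

If the share turns out to be large for a station, aspect and sentiment labels for that station rest on incomplete text and deserve extra caution.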

Where This Could Go

This project is a proof of concept. The same analysis can be done on the complete set of reviews and mentions from Google Maps, social media, and travel forums in near-real-time. For example, spikes in negative sentiment around a specific station might show up days before formal complaints do, which could be used to develop an early warning system for issues at stations. The same approach could extend beyond stations to other SBB facilities like ticket offices, parking areas, or bike rental points, anywhere public reviews accumulate and operational decisions depend on customer experience.
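
The early-warning idea could look roughly like the sketch below. It flags a week when a station's share of negative reviews sits far above its own recent history. Everything here is an assumption for illustration: the function name `find_spikes`, the four-week window, the z-score threshold of 2.0, and the made-up weekly shares.

```python
from statistics import mean, stdev

def find_spikes(weekly_negative_share, window=4, z_threshold=2.0):
    """Return indices of weeks whose negative share is far above the trailing window."""
    spikes = []
    for i in range(window, len(weekly_negative_share)):
        history = weekly_negative_share[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            continue  # flat history: a z-score is not defined
        z = (weekly_negative_share[i] - mu) / sigma
        if z >= z_threshold:
            spikes.append(i)
    return spikes

# Made-up weekly shares of 1-2 star reviews for one hypothetical station
shares = [0.10, 0.12, 0.09, 0.11, 0.10, 0.30, 0.12]
spike_weeks = find_spikes(shares)
print(spike_weeks)
```

A trailing window keeps the baseline local to each station, so a station that is always criticised does not trigger alerts on its normal level.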

TL;DR

I scraped and analysed 22,000+ Google Maps reviews across 61 SBB stations and ran each one through both a keyword filter and an AI classifier (GPT-4o-mini) that picks out which aspect of the station is being discussed and whether the sentiment is positive or negative. A few things stood out:

The network of stations averages 4.16 stars, which is high. The negative reviews are not spread evenly, though: they pile up at a small group of stations.
Reviewers are harshest on stations in their own linguistic region. Tourists writing in English are by far the most generous.
Complaints split into two useful groups. Damage control (Safety, Crowds, Toilets) is where investment mostly just stops the bleeding, because even the top rated stations don’t get praised for these. Visible wins (Cleanliness, Connections, Staff/Service, Food & Shops, Signage/Nav) is where the best stations actually get praise, so investment can lift sentiment as well as reduce complaints.
Safety, Staff/Service, and Crowds drive the biggest share of negative ratings by volume. Reviews that mention Staff/Service, Connections, or Safety also tend to come with the lowest ratings (1.6-2.0★ averages).

The final deliverable is a ranked list of action items per station, split by the categories Damage control and Visible wins.
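
To make the keyword filter and the two action groups more concrete, here is a minimal sketch. It is not the project's actual filter: the keyword lists and the helper names `tag_aspects` and `group_of` are assumptions, and only the grouping of aspects into Damage control and Visible wins is taken from the summary above.

```python
import re

# Hypothetical keyword lists; the real filter is not shown in this section.
ASPECT_KEYWORDS = {
    'Safety': ['unsafe', 'dangerous', 'police'],
    'Toilets': ['toilet', 'toilets', 'wc'],
    'Cleanliness': ['clean', 'dirty', 'filthy'],
    'Signage/Nav': ['sign', 'signs', 'signage'],
}

# Grouping as described in the summary
DAMAGE_CONTROL = {'Safety', 'Crowds', 'Toilets'}
VISIBLE_WINS = {'Cleanliness', 'Connections', 'Staff/Service', 'Food & Shops', 'Signage/Nav'}

def tag_aspects(text):
    """Return the aspects whose keywords appear as whole words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(a for a, kws in ASPECT_KEYWORDS.items() if words & set(kws))

def group_of(aspect):
    """Map an aspect to its action group."""
    if aspect in DAMAGE_CONTROL:
        return 'Damage control'
    if aspect in VISIBLE_WINS:
        return 'Visible wins'
    return 'Other'

tags = tag_aspects("The toilets were dirty and the signage is confusing.")
print([(t, group_of(t)) for t in tags])
```

Whole-word matching is a deliberate choice in this sketch: short keywords such as "wc" would otherwise match inside unrelated words.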

Setup

import warnings
warnings.filterwarnings('ignore')

import pathlib
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import plotly.io as pio
from sklearn.feature_extraction.text import CountVectorizer
import pycountry
import nltk
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords
pio.templates.default = 'plotly_white'
pio.renderers.default = 'plotly_mimetype+notebook_connected'

# Shared color palettes used across all charts for consistency
RATING_SCALE = ['#d73027', '#f46d43', '#fdae61', '#a6d96a', '#1a9641']
LANG_COLORS  = {'de': '#2196F3', 'fr': '#E91E63', 'it': '#4CAF50', 'en': '#FF9800', 'rm': '#9C27B0', 'other': '#9E9E9E'}
ASPECT_COLORS = {
    'Safety': '#d32f2f', 'Crowds': '#e57373', 'Toilets': '#ef6c00',
    'Cleanliness': '#1976d2', 'Connections': '#0288d1', 'Food & Shops': '#388e3c',
    'Staff/Service': '#7b1fa2', 'Signage/Nav': '#ab47bc', 'Seating/Waiting': '#5d4037',
    'Accessibility': '#00796b', 'Lifts/Escalators': '#00838f', 'Parking/Bikes': '#616161',
}

# Shared thresholds: used across all reliability filters
MIN_REVIEWS_PER_STATION = 30   # excludes stations with too few reviews for stable averages
MIN_COMPLAINTS_PER_ASPECT = 10  # excludes aspects with too few mentions to be meaningful

DATA_DIR = '../data/raw'

stations = pd.read_csv(f'{DATA_DIR}/stations.csv')
reviews  = pd.read_csv(f'{DATA_DIR}/reviews.csv')

# Parse dates
reviews['date_estimated'] = pd.to_datetime(reviews['date_estimated'], errors='coerce')
reviews = reviews[reviews['date_estimated'].notna()].copy()
reviews['year_month'] = reviews['date_estimated'].dt.to_period('M')
reviews['year']       = reviews['date_estimated'].dt.year

# Normalise language codes
reviews['lang'] = reviews['language'].where(reviews['language'].isin(['de','fr','it','en','rm']), other='other')

# Only stations that were successfully scraped
done = stations[stations['scrape_status'] == 'done'].copy()

# Attach scraped review counts to done stations
scraped_counts = reviews.groupby('opuic').size().rename('scraped_reviews')
done = done.join(scraped_counts, on='opuic')
done['scraped_reviews'] = done['scraped_reviews'].fillna(0).astype(int)

print(f'Stations scraped : {len(done)}')
print(f'Reviews collected: {len(reviews):,}')
print(f'Date range       : {reviews["date_estimated"].min().date()} → {reviews["date_estimated"].max().date()}')

Stations scraped : 61
Reviews collected: 22,621
Date range       : 2011-05-12 → 2026-05-08

1: Where Are the Stations?

Before diving into the numbers, let’s put the stations on a map. The interactive map below shows every scraped station. Colour encodes the overall Google rating (red = low, green = high), and size reflects the total number of reviews. Hover over any dot for details.

fig = px.scatter_mapbox(
    done.dropna(subset=['latitude','longitude','overall_rating']),
    lat='latitude', lon='longitude',
    color='overall_rating',
    size='review_count_google',
    size_max=30,
    hover_name='name',
    hover_data={
        'overall_rating': ':.1f',
        'review_count_google': True,
        'scraped_reviews': True,
        'latitude': False,
        'longitude': False,
    },
    color_continuous_scale=RATING_SCALE,
    range_color=[1.0, 5.0],
    zoom=5.5,
    center={'lat': 46.8, 'lon': 8.2},
    mapbox_style='carto-positron',
    title='SBB Station Ratings across Switzerland',
    height=600,
    labels={'overall_rating': 'Rating', 'review_count_google': 'Total reviews', 'scraped_reviews': 'Collected reviews'},
)
fig.update_layout(coloraxis_colorbar_title='Rating')
fig.show()

Figure 1: SBB stations across Switzerland. Colour = overall Google rating, size = total review count.

Map at a glance

Most of the 61 stations cluster in the 4.0-4.5 star range. The negative tail is small and only affects a few stations (Yverdon-les-Bains, Genève-Aéroport, Lenzburg, Olten, and a handful of others).

2: The Dataset

Now that the geography is on the table, let’s have a look at what the dataset itself contains. How many reviews, in which languages, and how many actually carry text rather than just a star rating.

total_google = done['review_count_google'].sum()
total_scraped = done['scraped_reviews'].sum()
coverage = total_scraped / total_google * 100

print(f'Total reviews on Google : {total_google:,}')
print(f'Reviews collected       : {total_scraped:,}  ({coverage:.0f}% coverage)')
print(f'Reviews with text       : {reviews["text"].notna().sum():,}  ({reviews["text"].notna().mean()*100:.0f}%)')
print()
print('Review languages:')
print(reviews['lang'].value_counts().to_string())

Total reviews on Google : 37,154
Reviews collected       : 22,596  (61% coverage)
Reviews with text       : 14,783  (65%)

Review languages:
lang
other    11057
de        5549
en        3711
fr        1727
it         577

# Star-only vs written reviews
n_text   = reviews['text'].notna().sum()
n_notext = len(reviews) - n_text

fig = go.Figure(go.Pie(
    labels=['Rating + written review', 'Rating-only'],
    values=[n_text, n_notext],
    hole=0.4,
    marker_colors=['#2196F3', '#9E9E9E'],
    textinfo='percent+label',
    textposition='outside',
))
fig.update_layout(title=f'Review type: {len(reviews):,} total reviews')
fig.show()
print(f'Rating + written review: {n_text:,}  ({n_text/len(reviews)*100:.0f}%)')
print(f'Rating-only: {n_notext:,}  ({n_notext/len(reviews)*100:.0f}%)')

Figure 2: Share of scraped reviews that include written text alongside the rating vs a star-only rating. About 65% of the 22,621 scraped reviews carry text.

Rating + written review: 14,783  (65%)
Rating-only: 7,838  (35%)

written = reviews[reviews['text'].notna()].copy()

lang_counts = written['lang'].value_counts().reset_index()
lang_counts.columns = ['language', 'count']
lang_labels = {'de': 'German', 'fr': 'French', 'it': 'Italian', 'en': 'English', 'rm': 'Romansh', 'other': 'Other'}
lang_counts['label'] = lang_counts['language'].map(lang_labels)

fig = px.pie(
    lang_counts, values='count', names='label',
    color='language', color_discrete_map=LANG_COLORS,
    title=f'Largest Group of Review Languages: {len(written):,} written reviews',
    hole=0.4,
)
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.show()

# Full language breakdown
_ZH = {'zh-tw': 'Chinese (Trad.)', 'zh-cn': 'Chinese (Simp.)'}

def lang_name(code):
    if code in _ZH:
        return _ZH[code]
    lang = pycountry.languages.get(alpha_2=code)
    return lang.name if lang else code

lang_counts = written['language'].value_counts(dropna=False).reset_index()
lang_counts.columns = ['language', 'count']
lang_counts['label'] = lang_counts['language'].apply(
    lambda c: 'Undetected' if pd.isna(c) else lang_name(c)
)
lang_counts['pct'] = (lang_counts['count'] / len(written) * 100).round(1)

print(f'All languages in written reviews ({len(written):,} total):')
print(lang_counts[['label', 'count', 'pct']].to_string(index=False))

All languages in written reviews (14,783 total):
                  label  count  pct
                 German   5549 37.5
                English   3711 25.1
                 French   1727 11.7
             Undetected   1075  7.3
                Italian    577  3.9
                Spanish    407  2.8
                 Korean    224  1.5
             Portuguese    196  1.3
               Romanian    116  0.8
                Russian    115  0.8
               Japanese    109  0.7
                 Arabic     91  0.6
                Turkish     84  0.6
                  Dutch     77  0.5
                Catalan     58  0.4
              Afrikaans     46  0.3
        Chinese (Trad.)     44  0.3
                 Danish     42  0.3
                 Polish     41  0.3
                   Thai     40  0.3
              Ukrainian     36  0.2
        Chinese (Simp.)     36  0.2
             Indonesian     35  0.2
              Hungarian     33  0.2
              Norwegian     32  0.2
                Swedish     31  0.2
                  Czech     30  0.2
                Finnish     24  0.2
   Modern Greek (1453-)     23  0.2
                Tagalog     21  0.1
                 Hebrew     20  0.1
               Croatian     19  0.1
               Estonian     18  0.1
                 Slovak     16  0.1
                 Somali     14  0.1
             Vietnamese     13  0.1
                  Welsh     10  0.1
              Bulgarian      9  0.1
              Slovenian      9  0.1
                Persian      5  0.0
             Lithuanian      4  0.0
               Albanian      4  0.0
             Macedonian      3  0.0
                Latvian      3  0.0
                  Tamil      3  0.0
Swahili (macrolanguage)      2  0.0
                   Urdu      1  0.0

Romansh, Switzerland’s fourth national language, is absent from the dataset. No reviews in Romansh (rm) were detected. Romansh speakers number around 40,000 and are concentrated in rural Graubünden valleys, none of which are served by stations in this dataset. Romansh speakers are also typically bilingual in German, making German the likely choice when writing a review.

References: https://www.rtr.ch/emissiuns/decodar-nossa-cultura/raetoromanisch/fakten-geschichte/fakten-und-zahlen-raetoromanische-sprache https://www.bfs.admin.ch/asset/de/23366958

About the dataset

Of the 22,000+ reviews, roughly two-thirds (65%) carry text rather than just stars. Among written reviews, German (37.5%) and English (25.1%) dominate, followed by French (11.7%) and Italian (3.9%). Romansh is entirely absent.

3: Ratings Across Stations

How are ratings distributed, which stations sit at the top and bottom, and which are the most polarising?

fig = px.histogram(
    done, x='overall_rating',
    nbins=20,
    title='Distribution of Station Ratings (Google aggregate)',
    labels={'overall_rating': 'Overall Rating', 'count': 'Number of Stations'},
    color_discrete_sequence=['#2196F3'],
)
fig.add_vline(
    x=done['overall_rating'].mean(), line_dash='dash', line_color='#E91E63',
    annotation_text=f" Mean: {done['overall_rating'].mean():.2f}",
    annotation_position='top right',
)
fig.update_layout(bargap=0.05)
fig.show()

print(f"Mean rating : {done['overall_rating'].mean():.2f}")
print(f"Median      : {done['overall_rating'].median():.2f}")
print(f"Std dev     : {done['overall_rating'].std():.2f}")

Figure 4: Distribution of overall station ratings.

Mean rating : 4.16
Median      : 4.20
Std dev     : 0.28

Overall, the analysed stations have a high mean rating of 4.16 (median 4.20, standard deviation 0.28), which points to broad satisfaction with the station facilities.
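
Note that 4.16 is an unweighted mean across stations, so a station with 17 reviews counts as much as one with 7,000. The sketch below contrasts that with a review-weighted mean. The four rows are copied from the leaderboard further down and were picked only for illustration, so the resulting numbers describe this subset and not the network.

```python
# (station, Google rating, reviews on Google), copied from the leaderboard
rows = [
    ('Luzern', 4.4, 7155),
    ('Zürich HB', 4.4, 5250),
    ('Rorschach Stadt', 4.6, 17),
    ('Yverdon-les-Bains', 3.3, 1151),
]

# Every station counts once
unweighted = sum(rating for _, rating, _ in rows) / len(rows)

# Every review counts once
weighted = sum(rating * n for _, rating, n in rows) / sum(n for _, _, n in rows)

print(f"Unweighted mean: {unweighted:.2f}")
print(f"Weighted mean  : {weighted:.2f}")
```

Which of the two is the right headline number depends on the question: the unweighted mean describes the typical station, the weighted mean describes the typical review.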

top10 = done.nlargest(10,  'overall_rating')[['name','overall_rating','review_count_google']]
bot10 = done.nsmallest(10, 'overall_rating')[['name','overall_rating','review_count_google']]

fig = make_subplots(rows=1, cols=2,
                    subplot_titles=['Top 10 Highest Rated', 'Bottom 10 Lowest Rated'],
                    horizontal_spacing=0.20)

fig.add_trace(go.Bar(
    x=top10['overall_rating'], y=top10['name'],
    orientation='h', marker_color='#1a9641',
    text=top10['overall_rating'].round(1), textposition='outside',
    name='Top 10',
), row=1, col=1)

fig.add_trace(go.Bar(
    x=bot10['overall_rating'], y=bot10['name'],
    orientation='h', marker_color='#d73027',
    text=bot10['overall_rating'].round(1), textposition='outside',
    name='Bottom 10',
), row=1, col=2)

fig.update_xaxes(range=[0, 5.5])
fig.update_yaxes(side='right', row=1, col=2)
fig.update_layout(height=400, showlegend=False, title_text='Best and Worst Rated SBB Stations')
fig.show()

Figure 5: The 10 best- and 10 worst-rated stations.

Before jumping into the controversial cases, here is a quick leaderboard. I sort every station by its overall Google rating and compare it to the scraped average to spot-check that the data lines up. The table also shows the number of reviews scraped and the total number of reviews available on Google.

# Per-station metrics
station_stats = reviews.groupby('opuic').agg(
    avg_rating    = ('rating', 'mean'),
    pct_with_text = ('text', lambda x: x.notna().mean() * 100),
    review_count  = ('id', 'count'),
).reset_index()

leaderboard = (
    done[['opuic','name','overall_rating','review_count_google']]
    .merge(station_stats, on='opuic', how='left')
    .sort_values('overall_rating', ascending=False)
    .reset_index(drop=True)
)
leaderboard.index += 1  # 1-based rank

leaderboard_display = leaderboard[[
    'name', 'overall_rating', 'avg_rating',
    'review_count_google', 'review_count', 'pct_with_text',
]].copy()
leaderboard_display.columns = [
    'Station', 'Google Rating', 'Avg Scraped Rating',
    'Reviews on Google', 'Reviews Scraped', '% with Text',
]
leaderboard_display = leaderboard_display.round(2)

print(f"Total reviews scraped across all stations: {int(leaderboard['review_count'].sum()):,}")
print()

with pd.option_context('display.max_rows', None):
    display(leaderboard_display)

Total reviews scraped across all stations: 22,596

	Station	Google Rating	Avg Scraped Rating	Reviews on Google	Reviews Scraped	% with Text
1	Rorschach Hafen	4.7	4.68	103	103	42.72
2	Rorschach Stadt	4.6	4.59	17	17	35.29
3	Genève-Eaux-Vives	4.6	4.56	39	39	51.28
4	Rapperswil SG	4.6	4.57	212	212	44.34
5	Thun	4.5	4.51	378	359	51.53
6	Rorschach	4.5	4.49	76	76	34.21
7	Locarno	4.5	4.46	135	135	48.89
8	Montreux	4.5	4.52	346	338	51.48
9	Nyon	4.5	4.52	243	243	42.39
10	Bern	4.5	4.46	1348	816	68.50
11	Lugano	4.4	4.37	871	672	64.14
12	Arth-Goldau	4.4	4.37	362	340	41.47
13	Brig	4.4	4.37	198	198	42.93
14	Zürich HB	4.4	4.33	5250	2223	99.73
15	Luzern	4.4	4.40	7155	2243	97.82
16	Chur	4.3	4.32	495	465	55.91
17	Schaffhausen	4.3	4.33	270	267	53.93
18	St. Gallen	4.3	4.27	461	414	50.00
19	Zug	4.3	4.31	406	376	43.88
20	Opfikon	4.3	4.30	23	23	47.83
21	Bellinzona	4.3	4.28	376	341	40.76
22	Martigny	4.3	4.26	120	120	40.00
23	Kreuzlingen Hafen	4.3	4.32	47	47	51.06
24	Neuchâtel	4.3	4.28	224	223	42.15
25	Zürich Enge	4.2	4.19	134	134	42.54
26	Zürich Stadelhofen	4.2	4.16	691	541	57.86
27	Winterthur	4.2	4.19	650	496	54.03
28	Sargans	4.2	4.25	153	153	54.90
29	Delémont	4.2	4.25	73	73	45.21
30	Burgdorf	4.2	4.20	71	71	40.85
31	Zürich Oerlikon	4.2	4.11	2457	1150	78.52
32	Genève	4.2	4.18	1046	750	68.40
33	Baden	4.1	4.16	1439	643	62.83
34	Landquart	4.1	4.05	79	79	39.24
35	Kreuzlingen	4.1	4.05	80	80	51.25
36	Basel SBB	4.1	3.97	2171	1323	79.82
37	Rotkreuz	4.1	4.14	99	99	36.36
38	Frauenfeld	4.1	4.13	115	115	40.87
39	Bülach	4.1	4.10	78	78	42.31
40	Fribourg/Freiburg	4.1	4.10	496	406	43.84
41	La Chaux-de-Fonds	4.1	4.13	136	136	50.00
42	Vevey	4.1	4.14	183	183	46.45
43	Sion	4.1	4.05	221	221	42.08
44	Biel/Bienne	4.1	4.06	568	452	50.66
45	Visp	4.1	4.14	256	256	53.12
46	Lausanne	4.0	3.93	2069	1008	74.40
47	Solothurn	3.9	3.89	686	500	53.60
48	Uster	3.9	3.89	394	362	40.88
49	Zürich Hardbrücke	3.9	3.92	280	277	41.88
50	Brugg AG	3.9	3.87	178	178	46.63
51	Oensingen	3.9	3.90	41	41	36.59
52	Glattbrugg	3.8	3.79	48	48	33.33
53	Bern Europaplatz	3.8	3.80	25	25	36.00
54	Aarau	3.8	3.84	366	358	51.12
55	Wil SG	3.8	3.81	187	187	44.92
56	Bern Wankdorf	3.8	3.83	60	60	53.33
57	Olten	3.8	3.76	554	459	50.11
58	Langenthal	3.7	3.66	94	94	44.68
59	Lenzburg	3.7	3.66	301	288	48.61
60	Genève-Aéroport	3.6	3.64	369	363	57.30
61	Yverdon-les-Bains	3.3	3.13	1151	619	60.74
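
The spot-check can also be automated by flagging stations where the scraped average drifts away from the Google rating. The sketch below uses five rows copied from the table above. The 0.1-star tolerance is my assumption; since the Google rating is shown with one decimal, gaps of up to 0.05 can come from rounding alone.

```python
# (station, Google rating, average scraped rating), copied from the leaderboard
rows = [
    ('Yverdon-les-Bains', 3.3, 3.13),
    ('Basel SBB', 4.1, 3.97),
    ('Zürich Oerlikon', 4.2, 4.11),
    ('Thun', 4.5, 4.51),
    ('Luzern', 4.4, 4.40),
]

GAP_TOLERANCE = 0.1  # assumption: larger gaps are worth a second look

flagged = [name for name, google, scraped in rows if abs(google - scraped) > GAP_TOLERANCE]
print(flagged)
```

In this subset the flagged stations are ones where only part of the reviews was scraped, so a gap more likely reflects sampling than a data error.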

Most Controversial Stations

To find the most controversial cases, let’s look at stations with the highest standard deviation in ratings, where some reviewers give 5 stars and others give 1. Only stations with at least 30 reviews are included to filter out statistical noise.

MIN_REVIEWS = MIN_REVIEWS_PER_STATION  # shared threshold from Setup (30): too few reviews give unstable variance

controversy = (
    reviews.groupby('opuic')
    .agg(std=('rating', 'std'), mean=('rating', 'mean'), count=('rating', 'count'))
    .reset_index()
    .merge(done[['opuic', 'name', 'overall_rating']], on='opuic', how='inner')
)
controversy = controversy[controversy['count'] >= MIN_REVIEWS].copy()
controversy = controversy.sort_values('std', ascending=False).reset_index(drop=True)
controversy['std']  = controversy['std'].round(2)
controversy['mean'] = controversy['mean'].round(2)

top_n = 15

# Explicit order for both charts: ascending std so most controversial appears at top
order15 = controversy.head(top_n).sort_values('std')['name'].tolist()
order10 = controversy.head(10).sort_values('std')['name'].tolist()

# Bar chart: std dev coloured by mean rating
fig = px.bar(
    controversy.head(top_n),
    x='std', y='name',
    orientation='h',
    color='mean',
    color_continuous_scale=RATING_SCALE,
    range_color=[1, 5],
    text='std',
    category_orders={'name': order15},
    title=f'Top {top_n} Most Controversial Stations (highest rating std dev)',
    labels={'std': 'Std deviation', 'name': '', 'mean': 'Avg rating'},
)
fig.update_traces(textposition='outside')
fig.update_layout(
    coloraxis_colorbar_title='Avg rating',
    margin_r=60,
    height=480,
)
fig.show()

# Box plot: rating distribution for top 10 most controversial
top10_names = controversy.head(10)['name'].tolist()
plot_df = (
    reviews
    .merge(done[['opuic', 'name']], on='opuic', how='left')
    [lambda df: df['name'].isin(top10_names)]
)

print(f"Most controversial (≥{MIN_REVIEWS} reviews):")
print(controversy.head(10)[['name', 'overall_rating', 'mean', 'std', 'count']]
      .rename(columns={'overall_rating': 'google_rating', 'mean': 'avg_scraped', 'count': 'reviews'})
      .to_string(index=False))
print()
print("Most consistent:")
print(controversy.tail(5)[['name', 'overall_rating', 'mean', 'std', 'count']]
      .rename(columns={'overall_rating': 'google_rating', 'mean': 'avg_scraped', 'count': 'reviews'})
      .to_string(index=False))

Figure 6: Most polarising stations: highest standard deviation in scraped review ratings.

Most controversial (≥30 reviews):
             name  google_rating  avg_scraped  std  reviews
  Genève-Aéroport            3.6         3.64 1.51      363
        Oensingen            3.9         3.90 1.48       41
           Wil SG            3.8         3.81 1.47      187
       Langenthal            3.7         3.66 1.41       94
            Aarau            3.8         3.84 1.37      358
         Lenzburg            3.7         3.66 1.36      288
        Basel SBB            4.1         3.97 1.34     1323
            Olten            3.8         3.76 1.33      459
             Sion            4.1         4.05 1.33      221
Yverdon-les-Bains            3.3         3.13 1.32      619

Most consistent:
           name  google_rating  avg_scraped  std  reviews
           Thun            4.5         4.51 0.85      359
       Montreux            4.5         4.52 0.84      338
  Rapperswil SG            4.6         4.57 0.83      212
           Nyon            4.5         4.52 0.81      243
Rorschach Hafen            4.7         4.68 0.56      103

The most controversial stations cluster around 3.3–4.1 stars. This is solidly mid-range, but their high standard deviations reveal split opinions rather than universal mediocrity. These are stations where some travellers have a perfectly fine experience while others are frustrated, likely driven by specific pain points (safety, cleanliness, crowds) that don’t affect everyone equally.

The most consistent stations, by contrast, are all top-rated (4.5+). This suggests that genuinely good stations leave little room for disagreement: consistency and quality go hand in hand. No station is consistently bad; stations are either consistently forgettable or consistently good.
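
Why standard deviation works as a polarisation signal is easiest to see with two made-up stations that share the same mean. In the sketch below, both rating lists are invented for illustration; `statistics.stdev` is the sample standard deviation, which matches the default of the pandas `std` aggregation used above.

```python
from statistics import mean, stdev

# Two hypothetical stations with the same average rating
polarised = [1, 1, 1, 5, 5, 5, 5, 5, 5, 5]   # love it or hate it
lukewarm  = [4, 4, 4, 4, 4, 4, 4, 4, 3, 3]   # everyone roughly agrees

def extreme_share(ratings):
    """Share of reviews that are either 1 or 5 stars."""
    return sum(r in (1, 5) for r in ratings) / len(ratings)

for label, ratings in [('polarised', polarised), ('lukewarm', lukewarm)]:
    print(f"{label:9s} mean={mean(ratings):.2f} std={stdev(ratings):.2f} "
          f"extreme share={extreme_share(ratings):.0%}")
```

The mean alone cannot separate the two cases, while the spread can. The share of 1- and 5-star reviews is a simpler companion metric that is easier to explain to a non-technical audience.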

Ratings at a glance

The network mean is 4.16 stars, with most stations close to it. The interesting cases are the polarising stations (Genève-Aéroport, Oensingen, Wil SG), where reviewers actively disagree about the same place. Oensingen and Wil SG have improved markedly over time, so their older, lower ratings inflate the spread. At Genève-Aéroport, many reviewers also appear to rate the airport itself, which may create a discrepancy between the ratings of the train station and those of the airport.

4: Temporal Trends

Google Maps reviews are timestamped, so I can ask: when do travellers write reviews, and has opinion about the analysed stations shifted over time? Note that 2026 only covers data up to early May, so its sample is incomplete.

# Review volume and average rating per year (2022–2026)
# Older reviews use 'X years ago' which loses month-level precision, so we limit to recent years.
yearly = (
    reviews[reviews['year'] >= 2022]
    .groupby('year')
    .agg(count=('id', 'size'), avg_rating=('rating', 'mean'))
    .reset_index()
)

fig = make_subplots(
    rows=2, cols=1,
    shared_xaxes=True,
    vertical_spacing=0.08,
    subplot_titles=['Review Volume', 'Average Rating'],
)

fig.add_trace(
    go.Bar(x=yearly['year'], y=yearly['count'],
           name='Review count', marker_color='#90CAF9', opacity=0.9, showlegend=False),
    row=1, col=1,
)
fig.add_trace(
    go.Scatter(x=yearly['year'], y=yearly['avg_rating'],
               name='Avg rating', mode='lines+markers',
               line=dict(color='#E91E63', width=2), marker=dict(size=8), showlegend=False),
    row=2, col=1,
)

fig.update_xaxes(tickmode='array', tickvals=yearly['year'].tolist())
fig.update_yaxes(title_text='Reviews', row=1, col=1)
fig.update_yaxes(title_text='Avg rating', range=[0, 5.0], row=2, col=1)
fig.update_layout(title='Review Volume and Average Rating by Year (2022–2026)', hovermode='x unified', height=450)
fig.show()

Figure 7: Review volume and average rating per year, 2022-2026.
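
Because 2026 only runs to early May, one way to compare it fairly with earlier years is to restrict every year to the same part of the calendar. The sketch below does this with made-up `(date, rating)` pairs; the helper name `year_to_date_means` is hypothetical. It only makes sense where the estimated dates are precise to the month, which the code comment above says is not the case for older reviews.

```python
from collections import defaultdict
from datetime import date

CUTOFF = (5, 8)  # month, day: the last review date in the dataset is 2026-05-08

# Made-up (date, rating) pairs for illustration
sample = [
    (date(2025, 2, 1), 4), (date(2025, 4, 30), 5), (date(2025, 11, 3), 2),
    (date(2026, 1, 15), 5), (date(2026, 5, 8), 4),
]

def year_to_date_means(rows, cutoff=CUTOFF):
    """Average rating per year, using only reviews on or before the cut-off day."""
    by_year = defaultdict(list)
    for d, rating in rows:
        if (d.month, d.day) <= cutoff:
            by_year[d.year].append(rating)
    return {year: sum(r) / len(r) for year, r in sorted(by_year.items())}

ytd = year_to_date_means(sample)
print(ytd)
```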

So far, 2026 shows a slight uptick compared with the previous years. However, the rating can only be judged once the year is complete. The following chart therefore uses completed years only.

# Rating drift: From 2022–2023 to 2024–2025 (2026 excluded due to partial data)
recent  = reviews[reviews['year'].isin([2024,2025])].groupby('opuic')['rating'].mean().rename('recent')
earlier = reviews[reviews['year'].isin([2022,2023])].groupby('opuic')['rating'].mean().rename('earlier')

drift = pd.concat([recent, earlier], axis=1).dropna()
drift['change'] = drift['recent'] - drift['earlier']
drift = drift.join(done.set_index('opuic')['name']).reset_index()
drift = drift.sort_values('change')

fig = px.bar(
    drift, x='change', y='name',
    orientation='h',
    color='change',
    color_continuous_scale='RdYlGn',
    range_color=[-1, 1],
    title='Rating change from 2022–2023 to 2024–2025',
    labels={'change': 'Rating change', 'name': ''},
    height=max(500, len(drift) * 22 + 100),
)
fig.add_vline(x=0, line_dash='dash', line_color='gray')
fig.update_layout(coloraxis_showscale=False)
fig.show()

print("Most improved:")
print(drift.nlargest(5, 'change')[['name','earlier','recent','change']].round(2).to_string(index=False))
print("\nMost declined:")
print(drift.nsmallest(5, 'change')[['name','earlier','recent','change']].round(2).to_string(index=False))

Figure 8: Rating drift from 2022-2023 to 2024-2025, top movers in both directions.

Most improved:
         name  earlier  recent  change
       Wil SG     3.04    3.83    0.80
    Oensingen     3.25    4.00    0.75
       Bülach     3.42    4.16    0.74
         Sion     3.66    4.33    0.67
Bern Wankdorf     3.55    4.14    0.59

Most declined:
             name  earlier  recent  change
Yverdon-les-Bains     4.75    3.28   -1.47
       Langenthal     4.29    3.07   -1.22
        Rorschach     4.83    4.00   -0.83
Genève-Eaux-Vives     4.88    4.18   -0.69
       Frauenfeld     4.28    3.71   -0.57

The biggest improvers (Wil SG, Oensingen, Bülach) gained 0.7–0.8 stars, which may mean that recent interventions or renovations are being noticed by reviewers. Sion and Bern Wankdorf show similar upward momentum (0.6–0.7 stars), albeit from a higher baseline.

On the decline side, Yverdon-les-Bains stands out with a dramatic 1.5-star drop, consistent with its position as the lowest-rated station in the dataset. Langenthal follows with a 1.2-star decline. The smaller drops at Rorschach, Genève-Eaux-Vives, and Frauenfeld may partly reflect regression to the mean. All three had very high earlier ratings (4.3–4.9) based on relatively small review volumes.
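
To keep small samples from dominating the list of movers, the drift could carry a reliability flag based on how many reviews each window contains. The sketch below is one possible way to do that. The minimum of 30 reviews per window simply mirrors `MIN_REVIEWS_PER_STATION` and is an assumption, as are the helper name and the made-up ratings.

```python
MIN_PER_WINDOW = 30  # assumption: mirrors MIN_REVIEWS_PER_STATION

def drift_with_support(earlier, recent, min_n=MIN_PER_WINDOW):
    """Return (rating change, reliable) for two lists of ratings."""
    change = sum(recent) / len(recent) - sum(earlier) / len(earlier)
    reliable = len(earlier) >= min_n and len(recent) >= min_n
    return round(change, 2), reliable

# Made-up ratings: 8 early reviews vs 40 recent ones
earlier = [5, 5, 5, 5, 5, 5, 4, 4]
recent = [4] * 40

change, reliable = drift_with_support(earlier, recent)
print(change, reliable)
```

A large drop that rests on eight early reviews would then still be shown, but marked as unreliable instead of ranking next to well-supported changes.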

Trends at a glance

Most stations’ ratings are stable across 2022-2025. A handful (Wil SG, Bülach, Sion) gained 0.7-0.8 stars, possibly after visible improvements; Yverdon-les-Bains dropped sharply, reinforcing its position as the worst-rated station.

5: Language & Regional Patterns

Switzerland has four official languages. I look at whether station ratings differ by linguistic region, and how the mix of review languages varies across the country.

Rating by Linguistic Region

I group stations by linguistic region (German, French, Italian) and compare average ratings to see if travellers in one part of Switzerland are systematically harsher than in another. Romansh is not included as no analyzed station is within the language area.

# Use the linguistic_region column added to stations.csv
done['region'] = done['linguistic_region']

region_stats = done.groupby('region').agg(
    stations=('opuic','count'),
    avg_rating=('overall_rating','mean'),
    median_rating=('overall_rating','median'),
).reset_index()

print(region_stats.to_string(index=False))

          region  stations  avg_rating  median_rating
       Bilingual         2    4.100000            4.1
French (Romandy)        13    4.138462            4.2
          German        43    4.151163            4.2
Italian (Ticino)         3    4.400000            4.4

Figure 9

Station ratings are roughly consistent across linguistic regions, averaging 4.1-4.4 stars. The small differences between regions should not be over-interpreted given the per-region sample sizes (n=2 to n=43 stations).
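
One way to check whether a gap between two regions could be chance is a permutation test on the station ratings. The sketch below is not part of the analysis: the two rating lists are made up (a small region of 3 stations against a larger one of 10), and the function name and parameters are my own.

```python
import random

def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign stations to regions at random
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        if diff >= observed - 1e-12:  # small tolerance for float ties
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Made-up station ratings for two hypothetical regions
small_region = [4.4, 4.5, 4.3]
large_region = [4.2, 4.1, 3.9, 4.3, 4.5, 3.8, 4.6, 4.1, 4.2, 4.0]

p = permutation_p_value(small_region, large_region)
print(f"p-value: {p:.2f}")
```

With only three stations in one group, even a gap of more than 0.2 stars is compatible with random assignment in this toy example, which illustrates why the regional averages call for caution.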

Language Mix by Station

Which stations attract the most multilingual review bases? I take the 20 most-reviewed stations and break down their written reviews by reviewer language. This is a proxy for which stations have the most international vs local traffic.

# Language mix for the 20 most-reviewed stations (written reviews only)
top20_opuic = done.nlargest(20, 'scraped_reviews')['opuic'].tolist()
top20_rev   = written[written['opuic'].isin(top20_opuic)].copy()
top20_names = done.set_index('opuic')['name'].to_dict()
top20_rev['station_name'] = top20_rev['opuic'].map(top20_names)

lang_mix = (
    top20_rev.groupby(['station_name','lang'])
    .size()
    .reset_index(name='count')
)

# Sort stations by total written review count
order = top20_rev.groupby('station_name').size().sort_values(ascending=True).index.tolist()

fig = px.bar(
    lang_mix, x='count', y='station_name',
    color='lang',
    orientation='h',
    category_orders={'station_name': order},
    color_discrete_map=LANG_COLORS,
    title='Review Language Mix: Top 20 Stations (written reviews only)',
    labels={'count': 'Number of reviews', 'station_name': '', 'lang': 'Language'},
    height=600,
)
fig.show()

Figure 10: Language mix at the 20 most-reviewed stations. Tourist hubs like Zürich HB and Genève-Aéroport skew toward English, while regional stations lean toward the local language.

Not surprisingly, the stations most frequented by tourists, such as Zürich HB and Luzern, have a large share of reviews in English. However, at all stations a large chunk of the reviews is still written in the language of the local canton.

Does the Reviewer’s Language Affect Ratings?

Another question I asked myself while looking at the data is whether the reviewer’s language influences ratings in the respective linguistic regions of Switzerland. For example, how does someone from a Swiss German canton rate train stations in Romandy?

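For simplicity, reviews written in languages that are not native to Switzerland are assumed to be written by tourists. As coded here, the Tourist group also absorbs every review without a detected language, including the 7,838 rating-only reviews and the 1,075 written reviews whose language was not detected, so its numbers should be read with that in mind.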

# ── Rating distribution by reviewer language ─────────────────────────────────
SWISS_LANGS = {'de', 'fr', 'it', 'rm'}
GROUP_LABELS = {'de': 'German', 'fr': 'French', 'it': 'Italian', 'rm': 'Romansh'}

rev_lang = reviews.copy()
rev_lang['reviewer_label'] = rev_lang['language'].apply(
    lambda l: GROUP_LABELS.get(l, 'Tourist')
)

order = ['German', 'French', 'Italian', 'Tourist']
plot_df = rev_lang[rev_lang['reviewer_label'].isin(order)]

fig = make_subplots(rows=1, cols=4, subplot_titles=order, shared_yaxes=True)

for col, label in enumerate(order, 1):
    subset = plot_df[plot_df['reviewer_label'] == label]
    counts = subset['rating'].value_counts().reindex([1,2,3,4,5], fill_value=0)
    pcts = (counts / counts.sum() * 100).round(1)
    fig.add_trace(
        go.Bar(
            x=pcts.index, y=pcts.values,
            marker_color=[RATING_SCALE[r-1] for r in pcts.index],
            showlegend=False,
            text=[f'{v:.0f}%' for v in pcts.values], textposition='outside',
        ),
        row=1, col=col,
    )
    fig.update_xaxes(tickvals=[1,2,3,4,5], title_text='Stars', row=1, col=col)

fig.update_yaxes(title_text='% of reviews', range=[0, 65], row=1, col=1)
fig.update_layout(
    title='Rating Distribution by Reviewer Language (% of group)',
    height=350,
)
fig.show()

print(plot_df.groupby('reviewer_label')['rating'].agg(['mean','median','count']).loc[order].round(2))

Figure 11: Rating distribution by reviewer language.

                mean  median  count
reviewer_label                     
German          3.88     4.0   5549
French          3.69     4.0   1727
Italian         4.25     5.0    577
Tourist         4.31     5.0  14768

Interestingly, tourists give the best ratings on average (4.31). The harshest reviews come from French-speaking reviewers (3.69), followed by German-speaking reviewers (3.88). This raises a follow-up question: do people rate stations the same way when they are in a region where their primary language is not spoken? For example, does a German speaker rate stations in their own language region the same as stations in another language region of Switzerland?

Show code

# Cross-regional bias heatmap: reviewer language × station region
rev_region = reviews.merge(done[['opuic','linguistic_region']], on='opuic', how='left')
rev_region['reviewer_label'] = rev_region['language'].apply(
    lambda l: {'de':'German','fr':'French','it':'Italian'}.get(l, 'Tourist')
)

pivot = (
    rev_region
    .groupby(['reviewer_label','linguistic_region'])['rating']
    .mean()
    .unstack()
)

row_order = ['German','French','Italian','Tourist']
col_order = ['German','French (Romandy)','Italian (Ticino)','Bilingual']
pivot = pivot.reindex(
    index=[r for r in row_order if r in pivot.index],
    columns=[c for c in col_order if c in pivot.columns],
)

fig = px.imshow(
    pivot,
    color_continuous_scale='RdYlGn',
    range_color=[3.5, 5.0],
    text_auto='.2f',
    title='Average Rating: Reviewer Language × Station Linguistic Region',
    labels={'x': 'Station region', 'y': 'Reviewer language', 'color': 'Avg rating'},
    aspect='auto',
)
fig.update_layout(coloraxis_colorbar_title='Avg rating', height=350)
fig.show()

Figure 12: Reviewer language by station region: which combinations are harshest.

Reviewers are harshest in their home region. German speakers rate the Swiss German-region stations lowest (3.86), French speakers rate Romandy stations lowest (3.53), and Italian speakers reserve their lowest scores for Ticino (4.00). Even the bilingual regions of Biel/Bienne and Fribourg/Freiburg follow this pattern. This likely reflects the familiarity effect. Daily commuters notice every flaw, while visitors passing through tend to rate the overall experience more generously.
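The home-versus-away comparison can be made explicit with a small calculation. The sketch below is a minimal illustration on hypothetical toy rows, not the real dataset; the `HOME_REGION` mapping and the function name `home_away_gap` are assumptions introduced here, and bilingual regions are simply treated as "away" for every language.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows: (reviewer_language, station_region, rating)
ROWS = [
    ("German", "German", 3), ("German", "German", 4),
    ("German", "French (Romandy)", 5), ("German", "Italian (Ticino)", 4),
    ("French", "French (Romandy)", 3), ("French", "French (Romandy)", 4),
    ("French", "German", 5), ("French", "German", 4),
    ("Tourist", "German", 5),
]

# Assumed mapping from reviewer language to its "home" region label
HOME_REGION = {
    "German": "German",
    "French": "French (Romandy)",
    "Italian": "Italian (Ticino)",
}

def home_away_gap(rows):
    """Return {language: (home_mean, away_mean, away_minus_home)}."""
    buckets = defaultdict(lambda: {"home": [], "away": []})
    for lang, region, rating in rows:
        if lang not in HOME_REGION:
            continue  # tourists have no home region in this scheme
        key = "home" if region == HOME_REGION[lang] else "away"
        buckets[lang][key].append(rating)
    result = {}
    for lang, b in buckets.items():
        if b["home"] and b["away"]:  # need both sides to compute a gap
            h, a = mean(b["home"]), mean(b["away"])
            result[lang] = (h, a, a - h)
    return result

gaps = home_away_gap(ROWS)
for lang, (h, a, d) in gaps.items():
    print(f"{lang}: home {h:.2f}, away {a:.2f}, gap {d:+.2f}")
```

A positive gap means the group rates stations outside its own region more generously. On real data the group sizes differ a lot, so the gap for small groups should be read with caution.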

Language and region at a glance

Linguistic regions rate similarly on average, but reviewers are harshest in their home region. Tourists writing in a language other than German, French or Italian give the most generous ratings overall.

6: What Travellers Say

Star ratings are useful, but much more insight can be gained from the review text itself. I look at the most common word pairs (bigrams) used by happy vs unhappy reviewers (stop words removed), identify the most common topics with a keyword-based pass, and then switch to an AI classifier which, in addition to detecting the topic, also detects the sentiment and summarises the gist of the issue.

Show code

# Build multilingual stopword set (NLTK base + domain-specific terms)
STOP = set()
for lang in ['german', 'french', 'english', 'italian']:
    STOP.update(stopwords.words(lang))

# Domain-specific terms not covered by NLTK
STOP.update([
    'bahnhof', 'gare', 'stazione', 'station', 'train', 'zug', 'bahn',
    'treno', 'sbb', 'good', 'great', 'nice', 'well', 'really',
    'place', 'très', 'molto', 'sehr',
])

negative = reviews[reviews['rating'] <= 2]['text']

Show code

# Top bigrams per rating tier
def top_ngrams(text_series, n=2, top_k=15):
    corpus = text_series.dropna().str.lower().tolist()
    vec = CountVectorizer(ngram_range=(n,n), stop_words=list(STOP), min_df=2)
    X   = vec.fit_transform(corpus)
    counts = X.sum(axis=0).A1
    terms  = vec.get_feature_names_out()
    return pd.Series(counts, index=terms).nlargest(top_k)

tiers = {
    '1–2 ★': reviews[reviews['rating'] <= 2]['text'],
    '3 ★':   reviews[reviews['rating'] == 3]['text'],
    '4–5 ★': reviews[reviews['rating'] >= 4]['text'],
}

fig = make_subplots(rows=1, cols=3, subplot_titles=list(tiers.keys()),
                    horizontal_spacing=0.12)
colors = ['#d73027', '#fdae61', '#1a9641']

tier_ngrams = {}
for col, (tier, series), color in zip(range(1,4), tiers.items(), colors):
    ng = top_ngrams(series, n=2, top_k=12)
    tier_ngrams[tier] = ng
    fig.add_trace(
        go.Bar(x=ng.values, y=ng.index, orientation='h',
               marker_color=color, showlegend=False),
        row=1, col=col,
    )

fig.update_layout(height=450, width=1100, title_text='Most Common Bigrams by Rating Tier')
fig.show()

Figure 13: Most common bigrams per rating tier.

1-2 star reviews focus on concrete problems: unsafe atmosphere (“mal fréquenté/fréquentée”), overcrowding (“immer mehr”, “beaucoup trop”), and specific stations which face problems, such as Zürich HB. Passport control and 1st class issues also surface.

3 star reviews are ambivalent. Stations “fulfil their purpose” and are “ganz ok”, with shops and good connections mentioned, but recurring issues dampen the experience (“schöner, leider…”, “leider oft”, “seit Jahren”). “Nothing special” is the single most common phrase.

4-5 star reviews show a clear pattern: satisfied reviewers frequently rate the city rather than the station (“schöne Stadt”, “beautiful city”, “belle ville”, “old town”). Practical qualities like navigation (“easy navigate”), connections, and shopping also feature prominently. The much higher counts in this tier reflect both the larger volume of positive reviews and the more consistent vocabulary happy reviewers use.

What the positive bigrams are about

Many of the positive bigrams (“beautiful city”, “great place”, “must visit”) are not really about the station at all. Some stations sit inside or right next to a famous destination (Genève-Aéroport, Lausanne, Lugano, Locarno), and reviewers often praise the city or the airport rather than the platform, signage, or facilities. Their 5-star review is honest, but it inflates the station’s rating for reasons SBB cannot directly influence.

Show code

# Review length vs star rating
written['text_len'] = written['text'].str.len()

fig = px.box(
    written, x='rating', y='text_len',
    color='rating',
    color_discrete_sequence=RATING_SCALE,
    title='Review Length by Star Rating (written reviews only)',
    labels={'text_len': 'Characters', 'rating': 'Star rating'},
    category_orders={'rating': [1,2,3,4,5]},
)
fig.update_layout(showlegend=False)
fig.show()

print(written.groupby('rating')['text_len'].median().rename('median_chars').astype(int))

Figure 14: Character count of reviews by star rating. Unhappy travellers write substantially more: 1-star reviews have a median of 124 characters, vs 49 for 5-star reviews.

rating
1    124
2     99
3     63
4     59
5     49
Name: median_chars, dtype: int64

Observation: even though the review text is cut off after a certain length (~340 characters), the pattern still shows that lower-rated reviews tend to be longer. Unhappy travellers have more to say.
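Why the medians survive the truncation can be shown with a short calculation. The sketch below uses hypothetical review lengths (not real data) and assumes a hard cap of 340 characters: as long as more than half of the values lie below the cap, the median is unchanged, while the mean is pulled down.

```python
from statistics import mean, median

CAP = 340  # approximate cut-off mentioned in the text

# Hypothetical full review lengths in characters (not real data)
full_lengths = [20, 45, 60, 124, 150, 500, 900]

# Truncation replaces every longer review with the cap value
truncated = [min(n, CAP) for n in full_lengths]

print("median:", median(full_lengths), "->", median(truncated))
print("mean:  ", round(mean(full_lengths), 1), "->", round(mean(truncated), 1))
```

The reported medians (49 to 124 characters) all sit well below the cap, which is why the median is the safer statistic here than the mean.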

Complaint Analysis

Low-star reviews contain the most actionable signal. As a first pass, I use a naive keyword-anchored approach. I match reviews against keywords in German, French, Italian, and English to bucket them into aspects (Cleanliness, Safety, Connections, etc.). This shows roughly what people talk about and where pain points concentrate, but has clear limitations that I revisit further down with an AI-based approach.

# Aspect-based complaint analysis (keyword anchoring)
ASPECTS = {
    'Cleanliness':      ['clean','dirty','sauber','dreckig','schmutzig','propre','sale','müll','abfall','filth',
                         'geruch','stink','smell','odeur','hygiene','graffiti','ordentlich','sporco','pulito','immondizia','puzza','déchets'],
    'Toilets':          ['toilet','wc','restroom','bathroom','toilette','klo',
                         'geschlossen','closed','fermé','kostenpflichtig','pay','bagno','sanitär'],
    'Lifts/Escalators': ['lift','elevator','escalator','rolltreppe','aufzug','ascenseur','escalier roulant',
                         'defekt','broken','kaputt','out of order','hors service','treppe','stairs','escalier','ascensore'],
    'Food & Shops':     ['shop','restaurant','food','essen','kiosk','migros','coop','café','coffee','kaffee','snack',
                         'bar','bakery','bäckerei','supermarché','laden','bistro','takeaway'],
    'Safety':           ['safe','unsafe','sicher','unsicher','security','polizei','dunkel','dark','gefährlich',
                         'drug','drogen','drogue','droga','dealer','needle','nadel','seringue','siringa','junkie','süchtig','rauschgift',
                         'diebstahl','theft','vol','betrunken','drunk','ivre','ubriaco','aggressiv','aggression','belästigung','gewalt','pericoloso'],
    'Signage/Nav':      ['signage','confus','wegweiser','orient','übersicht','indication','panneau',
                         'schild','beschilderung','anzeigetafel','orientation','display','abfahrt','departures','unübersichtlich'],
    'Parking/Bikes':    ['parking','parkplatz','parkhaus','velo','fahrrad','bike','vélo',
                         'e-bike','velostall','fahrradständer','gestohlen','stolen','moto'],
    'Connections':      ['connection','anschluss','verspätung','delay','pünktlich','correspondance','retard',
                         'missed','verpasst','ausfall','cancel','gleis','platform','voie','binario','fahrplan','horaire','ritardo'],
    'Crowds':           ['crowd','overcrowd','voll','überfüllt','bondé','busy',
                         'gedränge','rush hour','stosszeit','queue','warteschlange','heures de pointe'],
    'Accessibility':    ['wheelchair','rollstuhl','handicap','barrier','barriere','accessible','behinderung',
                         'ramp','rampe','kinderwagen','stroller','poussette','senior','elderly','blind'],
    'Seating/Waiting':  ['bench','seat','sitz','sitzplatz','banc','panchina','waiting area','warteplatz',
                         'warteraum',"salle d'attente","sala d'aspetto"],
    'Staff/Service':    ['staff','personal','mitarbeiter','freundlich','rude','helpful','hilfe','personnel',
                         'service','unhelpful','unfreundlich','scortese','aimable','impoli'],
}

Show code

all_stars = reviews['text'].dropna().str.lower()

aspect_counts = {
    asp: all_stars.str.contains('|'.join(kws), regex=True).sum()
    for asp, kws in ASPECTS.items()
}
aspect_df = (
    pd.Series(aspect_counts)
    .sort_values()
    .reset_index()
    .rename(columns={'index':'aspect', 0:'mentions'})
)
aspect_df.columns = ['aspect','mentions']
aspect_df['pct'] = (aspect_df['mentions'] / len(all_stars) * 100).round(1)

fig = px.bar(
    aspect_df, x='mentions', y='aspect',
    orientation='h',
    text=aspect_df['pct'].map('{:.1f}%'.format),
    title=f'Most Mentioned Topics in all Reviews ({len(all_stars):,} reviews)',
    labels={'mentions': 'Reviews mentioning topic', 'aspect': ''},
    color='aspect',
    color_discrete_map=ASPECT_COLORS,
)
fig.update_traces(textposition='outside')
fig.update_layout(showlegend=False, margin_r=80, height=500)
fig.show()

Food & Shops is the most discussed aspect (12.4%), followed by Cleanliness (8.4%) and Connections (5.5%). These shares sum to about 26%, but a review can count under more than one aspect, so together the three cover at most roughly a quarter of all reviews. The lower bars (Toilets, Safety, Crowds, etc.) are sparser but more diagnostic, since they almost always indicate problems rather than praise. To focus on what is being talked about in negative reviews, I restrict the next chart to low-rated reviews only.

Show code

# Aspect mentions in 1-3 star reviews only
low_star_12 = reviews[reviews['rating'] <= 3]['text'].dropna().str.lower()

aspect_counts_12 = {
    asp: low_star_12.str.contains('|'.join(kws), regex=True).sum()
    for asp, kws in ASPECTS.items()
}
aspect_df_12 = (
    pd.Series(aspect_counts_12)
    .sort_values()
    .reset_index()
)
aspect_df_12.columns = ['aspect', 'mentions']
aspect_df_12['pct'] = (aspect_df_12['mentions'] / len(low_star_12) * 100).round(1)

fig = px.bar(
    aspect_df_12, x='mentions', y='aspect',
    orientation='h',
    text=aspect_df_12['pct'].map('{:.1f}%'.format),
    title=f'Most Mentioned Complaint Topics in 1-3 Star Reviews ({len(low_star_12):,} reviews)',
    labels={'mentions': 'Reviews mentioning topic', 'aspect': ''},
    color='aspect',
    color_discrete_map=ASPECT_COLORS,
)
fig.update_traces(textposition='outside')
fig.update_layout(showlegend=False, margin_r=80, height=500)
fig.show()

Show code

# Per-station dominant complaint topic (stations rated <= 4.5 with >= 5 written 1-3 star reviews)
low_rev = reviews[reviews['rating'] <= 3].merge(done[['opuic','name','overall_rating']], on='opuic', how='left').copy()
low_rev = low_rev[low_rev['overall_rating'] <= 4.5]
low_rev['text_lower'] = low_rev['text'].str.lower()

rows = []
for opuic, grp in low_rev.groupby('opuic'):
    txt = grp['text_lower'].dropna()
    if len(txt) < 5:
        continue
    scores = {asp: txt.str.contains('|'.join(kws), regex=True).sum() for asp, kws in ASPECTS.items()}
    top_asp = max(scores, key=scores.get)
    rows.append({
        'Station': grp['name'].iloc[0],
        'Rating': grp['overall_rating'].iloc[0],
        'Top complaint': top_asp,
        'Mentions': scores[top_asp],
        'Reviews analysed': len(txt),
        '% mentioning of written station reviews': round(scores[top_asp] / len(txt) * 100, 1),
    })

pain_df = (
    pd.DataFrame(rows)
    .sort_values('Rating', ascending=True)
    .reset_index(drop=True)
)
pain_df.index += 1
pain_df

	Station	Rating	Top complaint	Mentions	Reviews analysed	% mentioning of written station reviews
1	Yverdon-les-Bains	3.3	Safety	31	231	13.4
2	Genève-Aéroport	3.6	Safety	18	104	17.3
3	Langenthal	3.7	Connections	5	21	23.8
4	Lenzburg	3.7	Crowds	12	58	20.7
5	Glattbrugg	3.8	Cleanliness	1	11	9.1
6	Bern Europaplatz	3.8	Food & Shops	1	5	20.0
7	Wil SG	3.8	Safety	11	43	25.6
8	Aarau	3.8	Safety	17	84	20.2
9	Bern Wankdorf	3.8	Lifts/Escalators	4	13	30.8
10	Olten	3.8	Connections	15	91	16.5
11	Solothurn	3.9	Safety	9	81	11.1
12	Oensingen	3.9	Toilets	2	5	40.0
13	Uster	3.9	Food & Shops	11	52	21.2
14	Zürich Hardbrücke	3.9	Cleanliness	7	53	13.2
15	Brugg AG	3.9	Cleanliness	7	31	22.6
16	Lausanne	4.0	Connections	37	219	16.9
17	La Chaux-de-Fonds	4.1	Staff/Service	2	17	11.8
18	Landquart	4.1	Signage/Nav	1	7	14.3
19	Vevey	4.1	Safety	8	29	27.6
20	Kreuzlingen	4.1	Safety	2	9	22.2
21	Frauenfeld	4.1	Toilets	3	14	21.4
22	Biel/Bienne	4.1	Staff/Service	9	67	13.4
23	Fribourg/Freiburg	4.1	Safety	8	51	15.7
24	Baden	4.1	Food & Shops	9	85	10.6
25	Bülach	4.1	Cleanliness	2	10	20.0
26	Basel SBB	4.1	Toilets	40	294	13.6
27	Sion	4.1	Staff/Service	3	28	10.7
28	Visp	4.1	Staff/Service	5	39	12.8
29	Rotkreuz	4.1	Cleanliness	2	9	22.2
30	Sargans	4.2	Cleanliness	4	16	25.0
31	Delémont	4.2	Toilets	1	5	20.0
32	Burgdorf	4.2	Safety	2	5	40.0
33	Winterthur	4.2	Cleanliness	8	61	13.1
34	Genève	4.2	Toilets	20	128	15.6
35	Zürich Stadelhofen	4.2	Connections	9	76	11.8
36	Zürich Enge	4.2	Cleanliness	2	12	16.7
37	Zürich Oerlikon	4.2	Food & Shops	19	198	9.6
38	Bellinzona	4.3	Toilets	4	32	12.5
39	Martigny	4.3	Toilets	1	9	11.1
40	Neuchâtel	4.3	Toilets	2	20	10.0
41	Zug	4.3	Toilets	8	39	20.5
42	St. Gallen	4.3	Toilets	12	54	22.2
43	Schaffhausen	4.3	Toilets	4	28	14.3
44	Chur	4.3	Safety	11	49	22.4
45	Brig	4.4	Food & Shops	4	14	28.6
46	Luzern	4.4	Food & Shops	42	274	15.3
47	Arth-Goldau	4.4	Cleanliness	3	22	13.6
48	Lugano	4.4	Toilets	6	67	9.0
49	Zürich HB	4.4	Cleanliness	47	335	14.0
50	Montreux	4.5	Staff/Service	3	21	14.3
51	Nyon	4.5	Cleanliness	2	11	18.2
52	Bern	4.5	Cleanliness	6	83	7.2
53	Thun	4.5	Connections	3	22	13.6
54	Locarno	4.5	Signage/Nav	2	11	18.2

Figure 17: Dominant complaint topic per station (keyword anchoring). Sorted by rating of the station.

Limitations of the keyword-anchored approach

The keyword-based approach is a useful first pass, but it has clear limitations:

It cannot distinguish sentiment. “clean” matches both “very clean station” and “not clean at all”. A 3-star review saying “the food was great but connections were terrible” gets counted under both Food & Shops and Connections as complaints, even though only one is negative.
It double-counts multi-aspect reviews. One review mentioning “toilet” and “dirty” gets counted under both Toilets and Cleanliness, inflating totals.
Some keywords are ambiguous. “bar” (Food & Shops) matches “bar” in other contexts. “closed” (Toilets) matches anything being closed. “dark” (Safety) could describe atmosphere or aesthetics.
It misses complaints phrased without keywords. “I had to wait 45 minutes for the next train” has no Connections keyword.
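The limitations listed above can be reproduced on a handful of sentences. The sketch below is a minimal illustration with a tiny subset of the keyword lists; `ASPECTS_DEMO`, `match_substring`, and `match_whole_word` are names introduced here for the example. The whole-word variant is one possible mitigation, not what the analysis itself uses.

```python
import re

# Tiny subset of the keyword lists, for illustration only
ASPECTS_DEMO = {
    "Cleanliness": ["clean", "dirty"],
    "Food & Shops": ["bar", "food"],
    "Connections": ["connection", "delay"],
}

def match_substring(text, aspects):
    """Substring matching, equivalent to str.contains('|'.join(kws))."""
    text = text.lower()
    return [a for a, kws in aspects.items() if re.search("|".join(kws), text)]

def match_whole_word(text, aspects):
    """Variant with word boundaries, which avoids hits inside longer words."""
    text = text.lower()
    hits = []
    for aspect, kws in aspects.items():
        pattern = r"\b(?:" + "|".join(re.escape(k) for k in kws) + r")\b"
        if re.search(pattern, text):
            hits.append(aspect)
    return hits

examples = [
    "Very clean station",                           # praise
    "Not clean at all",                             # complaint, same keyword
    "The barrier was broken",                       # 'bar' inside 'barrier'
    "I had to wait 45 minutes for the next train",  # no keyword at all
]
for text in examples:
    print(f"{text!r}: substring={match_substring(text, ASPECTS_DEMO)} "
          f"whole_word={match_whole_word(text, ASPECTS_DEMO)}")
```

Word boundaries remove the false hit on "barrier", but they also miss inflected forms and German compounds, and neither variant can tell praise from complaint or catch the keyword-free waiting-time complaint.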

Today’s AI models are reasonably good at extracting aspect-level sentiment, so in the next section I use an AI model to classify the sentiment and the topics (aspects) the reviews are talking about.

AI-Powered Aspect Classification

The keyword approach above cannot distinguish a complaint from a compliment. To fix this I run every review through GPT-4o-mini (see src/sbb_reviews/analysis/classify_aspects.py) with a structured prompt that asks the model to return, for each review:

which aspect is mentioned (constrained to a fixed list of 12: Cleanliness, Toilets, Lifts/Escalators, Food & Shops, Safety, Signage/Nav, Parking/Bikes, Connections, Crowds, Accessibility, Seating/Waiting, Staff/Service)
whether it is a complaint, praise, or neutral observation
a concrete reason in the model’s own words (e.g. “drug addicts loitering near entrance” rather than just Safety)
the exact phrases from the review text that support the classification

Reviews are processed in batches of 10. The fixed aspect list should prevent the model from inventing categories, and the requirement to quote supporting phrases makes spot-checking easy.
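The batching and the two safeguards (fixed aspect list, quoted phrases) can be sketched as follows. This is not the code in classify_aspects.py: the field names `reason` and `phrases` and the helpers `batched` and `validate_mention` are hypothetical, chosen only to illustrate how such output could be checked automatically.

```python
ALLOWED_ASPECTS = {
    "Cleanliness", "Toilets", "Lifts/Escalators", "Food & Shops", "Safety",
    "Signage/Nav", "Parking/Bikes", "Connections", "Crowds", "Accessibility",
    "Seating/Waiting", "Staff/Service",
}
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}

def batched(items, size=10):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def validate_mention(mention, review_text):
    """Return a list of problems; an empty list means the mention passes."""
    problems = []
    if mention.get("aspect") not in ALLOWED_ASPECTS:
        problems.append("unknown aspect")
    if mention.get("sentiment") not in ALLOWED_SENTIMENTS:
        problems.append("unknown sentiment")
    if not mention.get("reason"):
        problems.append("missing reason")
    # Quoted phrases must literally occur in the review (case-insensitive)
    for phrase in mention.get("phrases", []):
        if phrase.lower() not in review_text.lower():
            problems.append(f"phrase not found in review: {phrase!r}")
    return problems

review = "Drug addicts loitering near the entrance, but the bakery is great."
good = {"aspect": "Safety", "sentiment": "negative",
        "reason": "drug addicts loitering near entrance",
        "phrases": ["Drug addicts loitering"]}
bad = {"aspect": "Atmosphere", "sentiment": "negative",
       "reason": "bad smell", "phrases": ["smells bad"]}

print(validate_mention(good, review))
print(validate_mention(bad, review))
print([len(b) for b in batched(list(range(25)))])
```

A check of this kind catches invented categories and phrases the model did not actually find in the text, which complements manual spot-checking.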

Run sbb-reviews classify once to generate data/derived/complaint_aspects_ai.csv, then execute the cells below to explore the results.

Throughout the rest of this section I limit each aspect-level chart to aspects with at least MIN_COMPLAINTS_PER_ASPECT = 10 AI-classified negative mentions across the network. Aspects with fewer mentions are too sparse to support reliable comparisons of severity or volume, so they are dropped from the upcoming visualizations.

Coverage and validation

Approximately 2-3% of reviews could not be classified because the model’s batch response exceeded the token limit and was truncated mid-output. Those batches were skipped rather than partially saved. I spot-checked roughly 50 random classifications and found the aspect labels and sentiments to be consistent with the source text. The results are representative but not exhaustive.
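Skipping truncated batches rather than saving them partially could look like the sketch below. It assumes the model returns each batch as a JSON array, which the text does not state; `parse_batch` and the `review_id` field are hypothetical names for this illustration.

```python
import json

def parse_batch(raw):
    """Return the parsed list, or None if the response is not a complete JSON array."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # typically a response cut off mid-output
    return data if isinstance(data, list) else None

# Hypothetical raw responses: one complete, one truncated
responses = [
    '[{"review_id": 1, "aspect": "Safety", "sentiment": "negative"}]',
    '[{"review_id": 2, "aspect": "Toilets", "sentim',
]

kept, skipped = [], 0
for raw in responses:
    parsed = parse_batch(raw)
    if parsed is None:
        skipped += 1  # whole batch dropped, nothing partially saved
    else:
        kept.extend(parsed)

print(f"kept {len(kept)} mentions, skipped {skipped} batch(es)")
```

Counting the skipped batches makes it straightforward to report coverage, and the affected reviews could be re-run in smaller batches later.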

Show code

# Load AI classification results
AI_CSV = pathlib.Path('../data/derived/complaint_aspects_ai.csv')

if not AI_CSV.exists():
    print("complaint_aspects_ai.csv not found in data/derived/ -- run `sbb-reviews classify` first.")
else:
    ai = pd.read_csv(AI_CSV)
    print(f"Rows loaded: {len(ai):,}")
    print(f"Reviews: {ai.drop_duplicates(subset=['station_name','date_estimated','text']).shape[0]:,}")
    print()
    print("Sentiment breakdown:")
    print(ai['sentiment'].value_counts().to_string())
    print()
    print(f"Aspect breakdown (all sentiments with count >= {MIN_COMPLAINTS_PER_ASPECT}):")
    print(ai['aspect'].value_counts()[lambda x: x >= MIN_COMPLAINTS_PER_ASPECT].to_string())

Rows loaded: 11,857
Reviews: 8,813

Sentiment breakdown:
sentiment
positive    7354
negative    3430
neutral     1073

Aspect breakdown (all sentiments with count >= 10):
aspect
Food & Shops        2852
Connections         1833
Cleanliness         1710
Signage/Nav         1135
Staff/Service        998
Crowds               874
Safety               740
Accessibility        545
Toilets              436
Seating/Waiting      352
Parking/Bikes        196
Lifts/Escalators     114

Sanity check: Keyword vs AI

To see how much the keyword-based and AI-based approaches differ, I compare the per-aspect share between the keyword approach (low-star reviews only) and the AI approach (negative-sentiment mentions only).

Show code

# Keyword vs AI: per-aspect share of complaints
if 'ai' in dir() and len(ai):
    # Keyword counts in low-star (<=3) reviews
    low_kw = reviews[reviews['rating'] <= 3]['text'].dropna().str.lower()
    kw_counts = {asp: low_kw.str.contains('|'.join(kws), regex=True).sum() for asp, kws in ASPECTS.items()}
    kw_df = pd.DataFrame({'aspect': list(kw_counts.keys()), 'count': list(kw_counts.values())})
    kw_df['method'] = 'Keyword (low-star)'
    kw_df['pct'] = kw_df['count'] / kw_df['count'].sum() * 100

    # AI counts in negative-sentiment mentions
    ai_neg = ai[ai['sentiment'] == 'negative']
    ai_df = ai_neg['aspect'].value_counts().reset_index()
    ai_df.columns = ['aspect', 'count']
    # Apply the section-wide aspect floor
    ai_df = ai_df[ai_df['count'] >= MIN_COMPLAINTS_PER_ASPECT]
    kw_df = kw_df[kw_df['aspect'].isin(ai_df['aspect'])]
    ai_df['method'] = 'AI (negative sentiment)'
    ai_df['pct'] = ai_df['count'] / ai_df['count'].sum() * 100

    compare = pd.concat([kw_df, ai_df], ignore_index=True)
    # Order by total share for a clean visual
    order = compare.groupby('aspect')['pct'].sum().sort_values(ascending=False).index.tolist()

    fig = px.bar(
        compare, x='aspect', y='pct', color='method', barmode='group',
        category_orders={'aspect': order},
        color_discrete_map={'Keyword (low-star)': '#9e9e9e', 'AI (negative sentiment)': '#1976d2'},
        title='Per-aspect share of complaints: keyword vs AI',
        labels={'pct': '% of method\'s total complaints', 'aspect': ''},
    )
    fig.update_layout(xaxis_tickangle=-30, legend_title='', height=450)
    fig.show()

    # Print the diff so the magnitudes are explicit
    pivot = compare.pivot(index='aspect', columns='method', values='pct').fillna(0).round(1)
    pivot['diff (AI - KW)'] = (pivot['AI (negative sentiment)'] - pivot['Keyword (low-star)']).round(1)
    print('Per-aspect share (%):')
    print(pivot.sort_values('AI (negative sentiment)', ascending=False).to_string())

Figure 18: Per-aspect share of complaints: keyword approach vs AI classifier.

Per-aspect share (%):
method            AI (negative sentiment)  Keyword (low-star)  diff (AI - KW)
aspect                                                                       
Safety                               16.8                12.6             4.2
Crowds                               12.9                 6.3             6.6
Staff/Service                        11.1                10.0             1.1
Signage/Nav                          10.2                 6.0             4.2
Toilets                              10.0                12.5            -2.5
Cleanliness                           9.4                13.1            -3.7
Connections                           8.4                13.5            -5.1
Food & Shops                          7.0                13.3            -6.3
Seating/Waiting                       5.7                 4.3             1.4
Accessibility                         4.4                 2.2             2.2
Parking/Bikes                         2.6                 2.8            -0.2
Lifts/Escalators                      1.7                 3.3            -1.6

The two methods rank the top aspects differently: the AI puts Safety and Crowds first, while the keyword approach puts Connections, Food & Shops, and Cleanliness on top because their keywords match plenty of positive mentions inside low-star reviews (“great food”, “good connections”), inflating those counts. The AI, by filtering on sentiment, also catches more Crowds and Signage/Nav complaints whose phrasing avoids the obvious keywords. The lists still overlap on multiple topics (Safety, Toilets, Cleanliness, Staff/Service).
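The AI column of the comparison is simply each aspect's count divided by the total across aspects. As a worked check, the sketch below recomputes those shares from the negative-mention counts printed in the aspect severity ranking of this section; the keyword shares for two aspects are copied from the comparison table, and the helper name `shares` is introduced here.

```python
# Negative-mention counts per aspect, copied from the severity ranking table
AI_NEG_COUNTS = {
    "Safety": 573, "Crowds": 439, "Staff/Service": 379, "Signage/Nav": 346,
    "Toilets": 339, "Cleanliness": 319, "Connections": 285, "Food & Shops": 239,
    "Seating/Waiting": 194, "Accessibility": 149, "Parking/Bikes": 87,
    "Lifts/Escalators": 57,
}

# Keyword shares (%) for two aspects, copied from the comparison table
KW_SHARE = {"Safety": 12.6, "Food & Shops": 13.3}

def shares(counts):
    """Convert raw counts to percentage shares of the total, rounded to 0.1."""
    total = sum(counts.values())
    return {aspect: round(n / total * 100, 1) for aspect, n in counts.items()}

ai_share = shares(AI_NEG_COUNTS)
for aspect, kw in KW_SHARE.items():
    diff = round(ai_share[aspect] - kw, 1)
    print(f"{aspect}: AI {ai_share[aspect]}%, keyword {kw}%, diff {diff:+}")
```

Because both columns are normalised to 100% within their own method, the diff compares how each method distributes its complaints, not how many complaints each method finds.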

Show code

# Complaint counts per aspect (negative only)
if 'ai' in dir() and len(ai):
    neg = ai[ai['sentiment'] == 'negative'].copy()

    aspect_counts = neg['aspect'].value_counts().reset_index()
    aspect_counts.columns = ['aspect', 'complaints']
    aspect_counts = aspect_counts[aspect_counts['complaints'] >= MIN_COMPLAINTS_PER_ASPECT]

    fig = px.bar(
        aspect_counts, x='aspect', y='complaints',
        color='aspect', color_discrete_map=ASPECT_COLORS,
        title=f'Complaint topics: AI classification ({len(neg):,} negative mentions)',
        labels={'complaints': 'Negative mentions', 'aspect': ''},
    )
    fig.update_layout(showlegend=False, xaxis_tickangle=-45, height=500)
    fig.show()

Figure 19: Total complaint counts per aspect, AI-classified negative mentions only.

Total Negative Mentions and Impact per Aspect

Show code

# Aspect severity ranking: which complaints are associated with the lowest ratings?
if 'ai' in dir() and len(ai):
    neg = ai[ai['sentiment'] == 'negative'].copy()

    severity = (
        neg.groupby('aspect')
        .agg(
            avg_rating=('rating', 'mean'),
            median_rating=('rating', 'median'),
            complaint_count=('rating', 'count'),
        )
        .sort_values('avg_rating')
        .reset_index()
    )
    severity = severity[severity['complaint_count'] >= MIN_COMPLAINTS_PER_ASPECT].reset_index(drop=True)
    severity.index += 1
    severity.index.name = 'severity_rank'
    severity['avg_rating'] = severity['avg_rating'].round(2)
    severity['label'] = severity.apply(lambda r: f"{r['avg_rating']:.2f} (n={int(r['complaint_count'])})", axis=1)

    print("Aspect severity ranking (lowest avg rating = most damaging):")
    display(severity)

    fig = px.bar(
        severity.sort_values('avg_rating'),
        x='aspect', y='avg_rating',
        color='avg_rating',
        color_continuous_scale='RdYlGn',
        title='Aspect Severity: Average Review Rating per Negative Aspect',
        labels={'avg_rating': 'Avg rating of reviews', 'aspect': ''},
        text='label',
    )
    fig.update_layout(
        coloraxis_showscale=False,
        xaxis_tickangle=-45,
        yaxis_range=[0, 5],
        height=500,
    )
    fig.update_traces(textposition='outside')
    fig.show()

Aspect severity ranking (lowest avg rating = most damaging):

	aspect	avg_rating	median_rating	complaint_count	label
severity_rank
1	Staff/Service	1.64	1.0	379	1.64 (n=379)
2	Connections	1.85	1.0	285	1.85 (n=285)
3	Safety	2.02	2.0	573	2.02 (n=573)
4	Cleanliness	2.37	2.0	319	2.37 (n=319)
5	Toilets	2.50	2.0	339	2.50 (n=339)
6	Seating/Waiting	2.55	3.0	194	2.55 (n=194)
7	Accessibility	2.58	3.0	149	2.58 (n=149)
8	Food & Shops	2.67	3.0	239	2.67 (n=239)
9	Lifts/Escalators	2.68	3.0	57	2.68 (n=57)
10	Signage/Nav	2.75	3.0	346	2.75 (n=346)
11	Crowds	2.76	3.0	439	2.76 (n=439)
12	Parking/Bikes	2.86	3.0	87	2.86 (n=87)

Figure 20: Average review rating per negative aspect (lowest = most damaging), shown as (a) a table and (b) a bar chart.

Negative Staff/Service and Connections complaints are the most severe, averaging just 1.6 and 1.9 stars respectively. These aspects appear in reviews where travellers are most frustrated, suggesting that poor staff interactions and missed or delayed connections trigger harsher ratings than other issues. Safety ranks third (2.0 stars avg) with the highest complaint volume (573 mentions), making it both severe and widespread.

In contrast, aspects like Parking/Bikes, Crowds, and Signage/Navigation average closer to 3 stars, meaning reviewers who complain about these still rate the station more moderately. This suggests that while these are common pain points, they are less likely to drive a 1-star review on their own.

The median rating reinforces the pattern: Staff/Service, Connections, and Safety all have a median of 1.0-2.0 stars, while lower-severity aspects cluster around a median of 3.0 stars.

Volume vs Severity in One View

The two previous charts show volume and severity separately, but the actionable view is both at once. Aspects in the bottom-right of the scatter are both widely complained about and dragging ratings down the most.
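One way to turn "widespread and severe" into a short list is to split both axes at their median. This is a hypothetical rule chosen for illustration, not part of the original analysis; the volume and rating values are copied from the severity ranking table.

```python
from statistics import median

# (negative mentions, average rating) per aspect, from the severity ranking table
ASPECT_STATS = {
    "Staff/Service": (379, 1.64), "Connections": (285, 1.85),
    "Safety": (573, 2.02), "Cleanliness": (319, 2.37),
    "Toilets": (339, 2.50), "Seating/Waiting": (194, 2.55),
    "Accessibility": (149, 2.58), "Food & Shops": (239, 2.67),
    "Lifts/Escalators": (57, 2.68), "Signage/Nav": (346, 2.75),
    "Crowds": (439, 2.76), "Parking/Bikes": (87, 2.86),
}

# Assumed cut-offs: the median of each axis
vol_cut = median(v for v, _ in ASPECT_STATS.values())
sev_cut = median(r for _, r in ASPECT_STATS.values())

# High volume AND low average rating = bottom-right of the scatter
priority = sorted(
    aspect for aspect, (volume, rating) in ASPECT_STATS.items()
    if volume > vol_cut and rating < sev_cut
)
print(f"volume cut-off: {vol_cut}, rating cut-off: {sev_cut:.3f}")
print("widespread and severe:", priority)
```

With these cut-offs Connections narrowly misses the list on volume despite its low average rating, which shows how sensitive such a rule is to where the line is drawn.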

Show code

# Volume vs severity scatter
if 'ai' in dir() and len(ai):
    neg = ai[ai['sentiment'] == 'negative'].copy()
    vs = (
        neg.groupby('aspect')
        .agg(volume=('rating', 'count'), severity=('rating', 'mean'))
        .reset_index()
    )
    vs = vs[vs['volume'] >= MIN_COMPLAINTS_PER_ASPECT]

    fig = px.scatter(
        vs, x='volume', y='severity', text='aspect', size='volume', color='aspect',
        color_discrete_map=ASPECT_COLORS,
        title='Aspect Volume vs Severity (negative mentions, lower y = harsher ratings)',
        labels={'volume': 'Number of negative mentions', 'severity': 'Avg review rating'},
        size_max=55,
    )
    fig.update_traces(textposition='top center')
    fig.update_layout(showlegend=False, height=500, yaxis_range=[1, 4])
    fig.show()

Figure 21: Volume vs severity per aspect. Bottom-right = widespread and severe.

The previous charts show which complaint aspects are most common and most severe across the entire network, but they do not tell me where the problems are concentrated. The next cell breaks complaints down by station to answer: Which stations have the most complaints, and what are they about?

The table shows each station’s dominant complaint topic (the aspect with the most negative mentions), sorted by overall rating. The heatmap then visualises the full complaint profile of the 20 worst-rated stations, normalised to percentages so that stations with different review volumes are directly comparable. This reveals whether a station’s complaints are concentrated in one area or spread across many.
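The row-wise normalisation behind the heatmap can be illustrated on hypothetical counts (not real data). The sketch shows that two stations with very different complaint volumes end up with the same profile when their complaints are distributed the same way; `row_percentages` is a name introduced here.

```python
# Hypothetical complaint counts per station and aspect (not real data)
COUNTS = {
    "Station A": {"Safety": 40, "Toilets": 10, "Crowds": 50},
    "Station B": {"Safety": 4, "Toilets": 1, "Crowds": 5},
    "Station C": {"Safety": 9, "Toilets": 1, "Crowds": 0},
}

def row_percentages(counts):
    """Convert each station's counts into percentages of its own total."""
    result = {}
    for station, row in counts.items():
        total = sum(row.values())
        result[station] = {
            aspect: (round(n / total * 100, 1) if total else 0.0)
            for aspect, n in row.items()
        }
    return result

profile = row_percentages(COUNTS)
for station, row in profile.items():
    print(station, row)
```

The flip side is that small stations produce large percentages from very few mentions: for Bern Europaplatz, 2 mentions out of 5 analysed reviews already amount to 40%.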

What travellers complain about, at a glance

Safety, Crowds, Staff/Service, Signage/Nav, and Toilets dominate the negative reviews. Staff/Service complaints are the most damaging (avg 1.6 stars), Safety the most widespread (573 mentions).

Show code

# Per-station complaint summary (AI)
if 'ai' in dir() and len(ai):
    neg = ai[ai['sentiment'] == 'negative'].copy()

    # Aspect-level filter (consistent with the section-wide threshold)
    aspect_totals = neg['aspect'].value_counts()
    keep_aspects = aspect_totals[aspect_totals >= MIN_COMPLAINTS_PER_ASPECT].index
    neg = neg[neg['aspect'].isin(keep_aspects)]

    # Number of unique reviews analysed per station (denominator for %)
    reviews_analysed = (
        ai.drop_duplicates(['station_name', 'date_estimated', 'text'])
        .groupby('station_name')
        .size()
        .rename('reviews_analysed')
    )

    station_aspect = (
        neg.groupby(['station_name', 'aspect'])
        .size()
        .reset_index(name='complaints')
    )

    # Attach station rating + total reviews analysed
    stn_ratings = done.drop_duplicates('name')[['name', 'overall_rating']]
    station_aspect = (
        station_aspect
        .merge(stn_ratings, left_on='station_name', right_on='name', how='left')
        .drop(columns='name')
        .merge(reviews_analysed, on='station_name', how='left')
    )

    # Dominant complaint per station, sorted by rating ascending (worst first)
    dominant = (
        station_aspect
        .sort_values('complaints', ascending=False)
        .drop_duplicates(subset='station_name')
        .sort_values('overall_rating', ascending=True)
        .reset_index(drop=True)
    )
    dominant['pct_mentioning'] = (dominant['complaints'] / dominant['reviews_analysed'] * 100).round(1)
    dominant.index += 1

    table = dominant[['station_name', 'overall_rating', 'aspect', 'complaints', 'reviews_analysed', 'pct_mentioning']].copy()
    table.columns = ['Station', 'Rating', 'Top complaint', 'Mentions', 'Reviews analysed', '% mentioning of analysed station reviews']

    print('Dominant complaint topic per station (AI), sorted by rating:')
    with pd.option_context('display.max_rows', None):
        display(table)

    # Heatmap: station × aspect complaint profile (top 20 worst-rated only)
    top20 = dominant.head(20)['station_name'].tolist()
    pivot = (
        station_aspect[station_aspect['station_name'].isin(top20)]
        .pivot(index='station_name', columns='aspect', values='complaints')
        .fillna(0)
        .astype(int)
    )
    # Sort heatmap rows by rating ascending
    rating_order = dominant[dominant['station_name'].isin(top20)].set_index('station_name')['overall_rating']
    pivot = pivot.loc[rating_order.sort_values().index]

    # Normalize to percentages (row-wise)
    pivot = pivot.div(pivot.sum(axis=1), axis=0) * 100

    fig = px.imshow(
        pivot,
        color_continuous_scale='Reds',
        title='Complaint profile: Top 20 worst-rated stations (% of complaints per aspect)',
        labels={'color': '% of complaints'},
        aspect='auto',
        height=600,
        text_auto='.0f',
    )
    fig.show()

Dominant complaint topic per station (AI), sorted by rating:

	Station	Rating	Top complaint	Mentions	Reviews analysed	% mentioning of analysed station reviews
1	Yverdon-les-Bains	3.3	Safety	114	261	43.7
2	Genève-Aéroport	3.6	Crowds	29	167	17.4
3	Lenzburg	3.7	Crowds	20	71	28.2
4	Langenthal	3.7	Safety	6	27	22.2
5	Aarau	3.8	Safety	32	137	23.4
6	Bern Wankdorf	3.8	Accessibility	4	24	16.7
7	Bern Europaplatz	3.8	Signage/Nav	2	5	40.0
8	Glattbrugg	3.8	Cleanliness	3	12	25.0
9	Olten	3.8	Safety	17	127	13.4
10	Wil SG	3.8	Safety	14	53	26.4
11	Uster	3.9	Cleanliness	8	96	8.3
12	Oensingen	3.9	Toilets	1	4	25.0
13	Solothurn	3.9	Safety	26	131	19.8
14	Zürich Hardbrücke	3.9	Cleanliness	11	81	13.6
15	Brugg AG	3.9	Cleanliness	8	54	14.8
16	Lausanne	4.0	Connections	29	437	6.6
17	Rotkreuz	4.1	Toilets	2	23	8.7
18	Sion	4.1	Safety	9	45	20.0
19	Kreuzlingen	4.1	Cleanliness	3	19	15.8
20	Fribourg/Freiburg	4.1	Cleanliness	10	88	11.4
21	Vevey	4.1	Safety	11	39	28.2
22	Bülach	4.1	Parking/Bikes	2	14	14.3
23	Baden	4.1	Signage/Nav	12	257	4.7
24	Biel/Bienne	4.1	Safety	13	92	14.1
25	Visp	4.1	Safety	14	78	17.9
26	Basel SBB	4.1	Toilets	51	776	6.6
27	La Chaux-de-Fonds	4.1	Crowds	4	22	18.2
28	Frauenfeld	4.1	Safety	4	25	16.0
29	Landquart	4.1	Toilets	1	23	4.3
30	Burgdorf	4.2	Connections	3	12	25.0
31	Zürich Enge	4.2	Cleanliness	5	41	12.2
32	Zürich Oerlikon	4.2	Staff/Service	36	617	5.8
33	Zürich Stadelhofen	4.2	Crowds	15	181	8.3
34	Delémont	4.2	Staff/Service	2	14	14.3
35	Genève	4.2	Toilets	25	327	7.6
36	Winterthur	4.2	Crowds	13	147	8.8
37	Sargans	4.2	Cleanliness	4	47	8.5
38	Opfikon	4.3	Toilets	2	4	50.0
39	Martigny	4.3	Crowds	4	23	17.4
40	Bellinzona	4.3	Crowds	5	70	7.1
41	Kreuzlingen Hafen	4.3	Cleanliness	1	12	8.3
42	Neuchâtel	4.3	Parking/Bikes	4	35	11.4
43	Chur	4.3	Safety	18	113	15.9
44	Zug	4.3	Toilets	8	89	9.0
45	Schaffhausen	4.3	Safety	10	60	16.7
46	St. Gallen	4.3	Toilets	11	104	10.6
47	Luzern	4.4	Crowds	55	1374	4.0
48	Zürich HB	4.4	Signage/Nav	101	1618	6.2
49	Brig	4.4	Toilets	4	40	10.0
50	Lugano	4.4	Staff/Service	13	189	6.9
51	Arth-Goldau	4.4	Toilets	3	73	4.1
52	Thun	4.5	Safety	5	61	8.2
53	Nyon	4.5	Staff/Service	3	30	10.0
54	Bern	4.5	Crowds	20	198	10.1
55	Rorschach	4.5	Food & Shops	1	11	9.1
56	Locarno	4.5	Connections	5	24	20.8
57	Montreux	4.5	Staff/Service	4	49	8.2
58	Rapperswil SG	4.6	Staff/Service	2	33	6.1
59	Genève-Eaux-Vives	4.6	Seating/Waiting	1	14	7.1
60	Rorschach Stadt	4.6	Connections	1	3	33.3
61	Rorschach Hafen	4.7	Staff/Service	2	12	16.7

Figure 22: Per-station complaint profile for the 20 worst-rated stations × aspect, normalised to percentages so stations with different review volumes are comparable.

Keep in mind that not all stations have enough negative reviews to derive actionable items. For example, Oensingen, Bern Wankdorf, and Bern Europaplatz each have fewer than 5 negative-sentiment mentions for their most-mentioned aspect, so further analysis is needed before deriving concrete improvements for these stations.

7: From Insights to Action

The complaint data above is rich but does not yet point to concrete actions. Two further scripts synthesise it into actionable outputs:

sbb-reviews summarize-complaints identifies the top 20 worst-rated stations and ranks their (station, aspect) complaint groups by a priority score. The score combines how much each aspect contributes to a station’s complaints, how severe those complaints are, how recent they are, and how poorly the station is rated overall, so the items at the top are concentrated, harsh, fresh, and at stations already known to be problematic. The full formula and what each term does is spelled out just before the chart at the end of the section.
sbb-reviews summarize-strengths profiles the top 20 best-rated stations, surfaces the aspects they consistently excel at, and runs a gap analysis against the worst stations.

Run sbb-reviews summarize-complaints and sbb-reviews summarize-strengths to generate the CSVs below.

Below I first look at what top stations get praised for, then contrast that with what worst stations get complained about, and finally turn that contrast into a concrete list of priority action items grouped by expected return on investment.

What the Best Stations Do Right

Among the top 20 best-rated stations, only six aspects qualify as genuine strengths (≥ 5 % of a station’s positive reviews, ≥ 5 mentions). The chart sorts them by how many stations share each strength, which is used as a proxy for transferability.
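The two thresholds can be sketched as a simple filter. This is a minimal illustration and not the actual `sbb-reviews summarize-strengths` implementation: the helper `find_strengths`, the input frame, and its column names are assumptions made for the example, and the percentage is computed against a station's positive mentions as a stand-in for its positive reviews.

```python
import pandas as pd


def find_strengths(positive: pd.DataFrame, min_pct: float = 5.0,
                   min_mentions: int = 5) -> pd.DataFrame:
    """Return (station, aspect) pairs that pass both thresholds.

    `positive` is assumed to hold one row per positive aspect mention,
    with hypothetical columns `station_name` and `aspect`.
    """
    counts = (
        positive.groupby(["station_name", "aspect"])
        .size()
        .reset_index(name="mentions")
    )
    # Share of each station's positive mentions that falls on each aspect
    station_totals = counts.groupby("station_name")["mentions"].transform("sum")
    counts["pct_of_station"] = counts["mentions"] / station_totals * 100

    keep = (counts["pct_of_station"] >= min_pct) & (counts["mentions"] >= min_mentions)
    return counts[keep].reset_index(drop=True)


# Toy data (invented): station A has 6 Cleanliness and 4 Toilets mentions,
# station B has only 3 Cleanliness mentions.
toy = pd.DataFrame({
    "station_name": ["A"] * 10 + ["B"] * 3,
    "aspect": ["Cleanliness"] * 6 + ["Toilets"] * 4 + ["Cleanliness"] * 3,
})
strengths = find_strengths(toy)
print(strengths)
```

In the toy data only Cleanliness at station A survives: Toilets at A and Cleanliness at B both fall short of the 5-mention minimum even though their percentage share is high. Requiring both conditions keeps tiny stations from producing "strengths" out of a handful of reviews.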

Show code

# Cross-station strength frequency (best-rated stations)
_highlights_csv = pathlib.Path("../data/derived/station_highlights.csv")
if not _highlights_csv.exists():
    print("station_highlights.csv not found -- run `sbb-reviews summarize-strengths` first.")
else:
    highlights = pd.read_csv(_highlights_csv)

    freq = (
        highlights.groupby("aspect")
        .agg(stations=("station_name", "nunique"), avg_pct=("pct_of_station", "mean"))
        .sort_values("stations", ascending=False)
        .reset_index()
    )
    freq["avg_pct"] = freq["avg_pct"].round(1)

    fig = px.bar(
        freq.sort_values("stations"),
        x="stations", y="aspect",
        orientation="h",
        color="aspect", color_discrete_map=ASPECT_COLORS,
        text="stations",
        title="Transferable Strengths: Aspects Shared by Top-20 SBB Stations",
        labels={"stations": "Number of top-20 stations praised for this aspect", "aspect": ""},
    )
    fig.update_traces(textposition="outside")
    fig.update_layout(showlegend=False, height=400)
    fig.show()

Figure 23: Number of top-20 best-rated stations that excel at each aspect.

Overall, Connections and Food & Shops are the aspects most mentioned in positive reviews, closely followed by Cleanliness and Staff/Service.

Visible Wins vs Damage Control

Comparing what top-20 stations get praised for against what bottom-20 stations get complained about reveals two fundamentally different types of investment:

Category	Meaning
Visible win	Top stations get praised for this aspect. Investment has a visible payoff in positive sentiment.
Damage control	Even top stations do not get praised here. Fixing it stops complaints but will not generate praise. Travellers expect it to “just work”.

Both categories require investment. The difference is in the expected return.
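The distinction can be expressed as a small decision rule. The sketch below is an assumption about how such a rule might look, not the logic inside `sbb-reviews summarize-strengths`: the function name, its arguments, and the `min_complaints` cut-off for the "minor" bucket are all hypothetical.

```python
def classify_aspect(top_stations_praised: int, complaints_at_worst: int,
                    min_complaints: int = 1) -> str:
    """Classify an aspect by where investment is likely to pay off.

    top_stations_praised: number of best-rated stations praised for the aspect.
    complaints_at_worst:  complaint count for the aspect at worst-rated stations.
    min_complaints:       hypothetical cut-off below which the aspect is "minor".
    """
    if complaints_at_worst < min_complaints:
        return "minor"           # too few complaints to prioritise
    if top_stations_praised > 0:
        return "visible_win"     # top stations show that praise is achievable
    return "damage_control"      # nobody is praised; fixing it only stops complaints


# Invented example values, for illustration only
print(classify_aspect(top_stations_praised=0, complaints_at_worst=50))
print(classify_aspect(top_stations_praised=4, complaints_at_worst=30))
print(classify_aspect(top_stations_praised=2, complaints_at_worst=0))
```

The rule deliberately ignores cost: an aspect lands in one bucket or the other purely on whether any top station earns praise for it.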

Show code

# Gap analysis -- visible wins vs damage control
_gap_csv = pathlib.Path("../data/derived/strength_gaps.csv")
if not _gap_csv.exists():
    print("strength_gaps.csv not found -- run `sbb-reviews summarize-strengths` first.")
else:
    gaps = pd.read_csv(_gap_csv)
    gaps["display_category"] = gaps["category"].map({
        "learnable": "visible_win",
        "systemic_gap": "damage_control",
        "minor": "minor",
    })

    category_colors = {"damage_control": "#d32f2f", "visible_win": "#388e3c", "minor": "#9e9e9e"}

    fig = px.bar(
        gaps[gaps["display_category"] != "minor"].sort_values("complaint_rows", ascending=False),
        x="aspect",
        y="complaint_rows",
        color="display_category",
        color_discrete_map=category_colors,
        text="strength_stations",
        labels={
            "complaint_rows": "Complaint rows (worst stations)",
            "aspect": "Aspect",
            "display_category": "Category",
            "strength_stations": "Top stations excelling",
        },
        title="Visible Wins vs Damage Control by Aspect",
    )
    fig.update_traces(texttemplate="%{text} top stations", textposition="outside")
    fig.update_layout(height=420, xaxis_tickangle=-20)
    fig.show()

Figure 24: Aspect-level gap analysis: visible wins vs damage control.

All aspects on this chart require investment. Food & Shops needs retail buildout, Connections needs platforms and scheduling work, Cleanliness needs staff and maintenance. The distinction is not cost itself. It is whether top stations get praised for the aspect, which signals whether spending shows up as positive sentiment or just as the absence of complaints.

In this dataset, Safety, Crowds, and Toilets fall on the damage-control side, while Cleanliness, Connections, Food & Shops, Staff/Service, and Signage/Nav are visible wins.

Connections deserves a footnote: it appears as a visible win because top transfer hubs are praised for being well-connected, but the core levers (frequency, scheduling) depend on network planning, not station-level effort. Local levers (clear transfer signage, real-time departure info, staff assistance for missed connections) are still worth pursuing.

Action priorities at a glance

Worst-rated stations share systemic Safety, Crowds, and Toilets issues that need infrastructure investment. Cleanliness, Connections, Food & Shops, Staff/Service, and Signage/Nav are visible wins where top stations show a clear playbook to copy.

Derivation for the Top Improvements to Ensure Customer Satisfaction

Each (station, aspect) complaint group is scored to surface the most urgent items. The priority score is:

priority = pct_of_station × severity_weight × recency_weight × (1 + station_weight)

where:

pct_of_station is the share of a station’s negative complaints that this aspect accounts for. Using a percentage instead of a raw count makes the score station-size-neutral: an aspect that drives half the complaints at a small station counts the same as one that drives half at a big station.
severity_weight = (6 - avg_rating) / 5 captures how harshly travellers rated reviews that mentioned this aspect. A 1-star average gives a weight of 1.0. A 5-star average gives 0.2.
recency_weight multiplies by how fresh the complaints are, based on the most recent complaint date in the group. Active issues (last 90 days) get a 2.0× boost. Stale ones (no complaints in 2+ years) get 0.2×.
station_weight = 5 - station_overall_rating lifts the score for stations whose overall rating is already dragging the network down, so two equally severe complaints get prioritised at the worse-rated station.
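The formula can be written out as a short function. This is a sketch based only on the definitions above, not the code behind `sbb-reviews summarize-complaints`. The function names are chosen for the example, and `recency_weight` is passed in directly because the text gives only its two endpoints (2.0 for the last 90 days, 0.2 for 2+ years), not the values in between.

```python
def severity_weight(avg_rating: float) -> float:
    """1-star average -> 1.0, 5-star average -> 0.2."""
    return (6 - avg_rating) / 5


def station_weight(station_overall_rating: float) -> float:
    """Larger for stations with a lower overall rating."""
    return 5 - station_overall_rating


def priority_score(pct_of_station: float, avg_rating: float,
                   recency_weight: float, station_overall_rating: float) -> float:
    """priority = pct_of_station x severity x recency x (1 + station_weight)."""
    return (
        pct_of_station
        * severity_weight(avg_rating)
        * recency_weight
        * (1 + station_weight(station_overall_rating))
    )


# Hypothetical complaint group: 40% of a station's complaints, 2.0-star
# average, active in the last 90 days (recency weight 2.0).
at_weak_station = priority_score(40, 2.0, 2.0, station_overall_rating=3.8)
at_strong_station = priority_score(40, 2.0, 2.0, station_overall_rating=4.4)
print(round(at_weak_station, 1), round(at_strong_station, 1))
```

With these invented inputs the same complaint group scores about 140.8 at a 3.8-star station and about 102.4 at a 4.4-star station, which is the effect the station term is meant to have: identical complaints rank higher where the station is already rated worse.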

The chart and the two tables below group each action item by whether the aspect is a visible win or damage control, as defined earlier in the section. The items shown are the top recommendations according to this priority score. The summary text for each item was generated by AI.

Show code

# Priority action items (worst-rated stations, split by upside profile)
from IPython.display import Markdown

_action_csv = pathlib.Path("../data/derived/action_items.csv")
_gaps_csv = pathlib.Path("../data/derived/strength_gaps.csv")

if not _action_csv.exists() or not _gaps_csv.exists():
    print("action_items.csv or strength_gaps.csv not found. Run `sbb-reviews summarize-complaints` and `summarize-strengths` first.")
else:
    actions = pd.read_csv(_action_csv)
    aspect_cat = pd.read_csv(_gaps_csv)[["aspect", "category"]]
    actions = actions.merge(aspect_cat, on="aspect", how="left")
    actions["bucket"] = actions["category"].map({
        "learnable": "Visible Win",
        "systemic_gap": "Damage Control",
    }).fillna("Other")

    def render_items(df, heading, subtitle):
        lines = [f"### {heading}", f"_{subtitle}_", ""]
        for i, row in enumerate(df.itertuples(index=False), 1):
            lines.append(
                f"**{i}. {row.station_name}** ({row.station_overall_rating:.1f}★) — "
                f"*{row.aspect}* — {row.complaint_count} complaints, "
                f"{row.pct_of_station:.0f}% of station's complaints, priority score {row.priority_score:.0f}"
            )
            lines.append("")
            lines.append(f"> {row.summary}")
            lines.append("")
        return Markdown("\n".join(lines))

    visible_wins = actions[actions["bucket"] == "Visible Win"].head(10)
    hygiene_fixes = actions[actions["bucket"] == "Damage Control"].head(10)

    print(f"Total action items: {len(actions)}  |  Stations: {actions['station_name'].nunique()}")
    display(render_items(
        hygiene_fixes,
        "Damage control: top 10",
        "No top station excels here; payoff is complaint reduction.",
    ))
    display(render_items(
        visible_wins,
        "Visible Wins: top 10",
        "Top stations validate this aspect; payoff is positive sentiment.",
    ))

    # Combined priority chart, colored by bucket
    plot_df = actions.head(20).copy()
    plot_df["label"] = plot_df["station_name"] + ": " + plot_df["aspect"]
    fig = px.bar(
        plot_df,
        x="priority_score",
        y="label",
        color="bucket",
        orientation="h",
        # Keys must match the values of the "bucket" column exactly
        color_discrete_map={
            "Visible Win": "#388e3c",
            "Damage Control": "#d32f2f",
            "Other": "#9e9e9e",
        },
        labels={"priority_score": "Priority Score", "label": "", "bucket": "Category"},
        title="Top 20 Priority Action Items: Visible Wins vs Damage Control",
    )
    fig.update_layout(yaxis={"categoryorder": "total ascending"}, height=550, margin={"l": 280})
    fig.show()

Total action items: 56  |  Stations: 16

Damage control: top 10

No top station excels here; payoff is complaint reduction.

1. Yverdon-les-Bains (3.3★) — Safety — 114 complaints, 63% of station’s complaints, priority score 136

Complaints describe Yverdon-les-Bains as unsafe due to persistent drug dealing, addicts, beggars, drunk individuals, and people harassing others around the station. Reviewers also mention weed smoke, fights at night, a gloomy atmosphere, and police presence seen as frequent but ineffective.

2. Solothurn (3.9★) — Safety — 26 complaints, 38% of station’s complaints, priority score 131

Safety complaints at Solothurn center on an unsettling atmosphere created by drug users, dealers, alcoholics, beggars, and other suspicious individuals, especially in the evening and after dark. Several reviews also mention aggressive solicitation and describe the station as functioning like a public drug scene.

3. Lenzburg (3.7★) — Crowds — 20 complaints, 33% of station’s complaints, priority score 124

Crowding complaints at Lenzburg consistently say the station is too small for passenger volumes, especially at peak times. Narrow platforms, too few underpasses, and congested exits create overcrowding, navigation difficulties, and a sense of danger or panic.

4. Aarau (3.8★) — Safety — 32 complaints, 38% of station’s complaints, priority score 123

Aarau is described as feeling unsafe because of loitering intoxicated groups, aggressive or unsavory individuals, and chaotic conditions around the entrance and outside the station. Complaints highlight nighttime danger, weekend fights, smoking and noise, and a general sense that authorities are not keeping the area under control.

5. Wil SG (3.8★) — Safety — 14 complaints, 40% of station’s complaints, priority score 117

Complaints about Wil SG focus on a persistently unsafe atmosphere linked to drug users, alcoholics, beggars, and other undesirable individuals around the station and bus area. Reviewers particularly mention poor lighting, weak security or police presence, nighttime risk for women, and even reports of assaults and violent incidents.

6. Genève-Aéroport (3.6★) — Crowds — 29 complaints, 24% of station’s complaints, priority score 98

Complaints about Genève-Aéroport focus on severe overcrowding and queue management problems, especially long waits at passport control, security, and luggage collection. Reviewers describe the airport as chaotic, slow, claustrophobic, and poorly organized, with too few checkpoints open and added inconvenience when facilities are closed.

7. Olten (3.8★) — Safety — 17 complaints, 24% of station’s complaints, priority score 81

Complaints about Olten highlight strong personal safety concerns linked to drug use, drunk or aggressive individuals, beggars, and threatening encounters. Reviewers particularly describe the station and surrounding area as uncomfortable or unsafe at night, especially for women, with reports of assaults, intimidation, and broader crime concerns.

8. Zürich Hardbrücke (3.9★) — Safety — 7 complaints, 13% of station’s complaints, priority score 48

Complaints about Zürich Hardbrücke center on feeling unsafe, especially at night, with reports of being followed, attacked with objects, and discomfort around loitering youth. Reviewers also mention dark surroundings, smoking concerns, and dangerous bike-pedestrian traffic conflicts as contributing to the station’s unsafe atmosphere.

9. Basel SBB (4.1★) — Toilets — 51 complaints, 15% of station’s complaints, priority score 41

Basel SBB toilet complaints mainly concern having to pay high fees for the station’s toilets, often cash-only or coin-operated, with broken payment/access systems that can leave passengers unable to use them. Many also describe the facilities as dirty, outdated, poorly maintained, and insufficiently available, including long waits and missing toilets in some areas.

10. Zürich Hardbrücke (3.9★) — Crowds — 7 complaints, 13% of station’s complaints, priority score 39

Zürich Hardbrücke is mainly criticized for overcrowding, especially at peak times, where heavy commuter traffic reduces comfort. Reviews also mention a dark, chaotic atmosphere during construction, evening crowds involving smokers and intoxicated or drug-using youths, and generally stressful circulation through the station.

(a) Top 20 priority action items, coloured by category.

Visible Wins: top 10

Top stations validate this aspect; payoff is positive sentiment.

1. Brugg AG (3.9★) — Cleanliness — 8 complaints, 24% of station’s complaints, priority score 82

Complaints about Brugg AG center on poor cleanliness and neglect, with repeated mentions of a dirty station, dirty platforms, and a general rundown feel. Reviewers also highlight persistent unpleasant odors, including urine smells, and filthy areas where people drink alcohol.

2. Zürich Hardbrücke (3.9★) — Cleanliness — 11 complaints, 20% of station’s complaints, priority score 72

Complaints about Zürich Hardbrücke focus on poor cleanliness, with repeated descriptions of dirty platforms and generally dirty station conditions. Bad smells, especially urine and smoke, along with an ugly, uncomfortable environment and broken lifts, reinforce the sense of neglect.

3. Lausanne (4.0★) — Connections — 29 complaints, 14% of station’s complaints, priority score 46

Complaints about connections at Lausanne center on chronic delays, cancellations, and trains rarely running on time, often causing missed appointments. Passengers also report last-minute platform or route changes, connections not waiting, disruption from ongoing construction and limited platform capacity, and poor or absent information when problems occur.

4. Basel SBB (4.1★) — Staff/Service — 41 complaints, 12% of station’s complaints, priority score 44

Basel SBB staff/service complaints are dominated by rude, dismissive, and unhelpful staff at ticket, information, and service counters, including poor support in English and with ticketing or refund issues. Reviews also mention unfair fines, long queues and waiting times, and lack of customer service during holidays.

5. Zürich Hardbrücke (3.9★) — Connections — 7 complaints, 13% of station’s complaints, priority score 44

At Zürich Hardbrücke, connection complaints focus on frequent delays, especially on the S6 to Baden, linked to bottlenecks, high train density, and limited track capacity. Passengers also mention confusing platform layouts, difficult transfers between bus stops, overcrowded trains, and a lack of direct service.

6. Genève-Aéroport (3.6★) — Signage/Nav — 18 complaints, 15% of station’s complaints, priority score 42

Genève-Aéroport signage/navigation complaints point to unclear or missing signage throughout the station-airport interface, including check-in desks, terminals, buses, train departures, and security. Passengers describe the layout as disorganized and confusing, with poor communication of platform changes and too little accessible information for timetables and other services.

7. Genève-Aéroport (3.6★) — Staff/Service — 22 complaints, 18% of station’s complaints, priority score 39

Complaints at Genève-Aéroport focus on poor staff service: rude or dismissive behavior, little help with ticket changes and reimbursements, and weak support for non-French speakers. Travelers also report missing or understaffed service points, badly managed passport/check-in queues, slow security and manual processes, plus limited Wi‑Fi and poor late-evening service availability.

8. Olten (3.8★) — Food & Shops — 8 complaints, 11% of station’s complaints, priority score 38

At Olten, the main issue is a weak retail and food offer, with few shops, mostly takeaway options, and no proper restaurant. Complaints also repeatedly highlight missing ATMs for Swiss francs, inconvenient shop locations, and prices seen as too high for the limited choice.

9. Genève-Aéroport (3.6★) — Food & Shops — 15 complaints, 12% of station’s complaints, priority score 38

At Genève-Aéroport, food and retail complaints center on very high prices paired with too little choice, including expensive drinks, poor-quality dining, and few shops or stalls open. Travelers also criticize early closing times, long queues when outlets open late, and weak amenities such as confusing locker policies and lack of charging points.

10. Biel/Bienne (4.1★) — Staff/Service — 7 complaints, 14% of station’s complaints, priority score 37

Complaints center on consistently poor staff interactions at Biel/Bienne, with employees described as aggressive, arrogant, unfriendly, and not sufficiently helpful in English. Service frustrations are worsened by cancellations and the lack of staffed ticket counters after 8 PM, leaving passengers without support.

Figure 25

8: Takeaways & Limitations

What I found out

Overall sentiment is positive. The network averages 4.16 stars, yet the negative tail concentrates around a handful of stations where issues are systemic rather than scattered.
Reviewers are harshest in their home region. Local reviewers consistently rate their own region’s stations lower than out-of-region reviewers do. Tourists give the most generous ratings overall.
Ratings drift over time, but slowly. Most stations are stable across 2022-2025. A few (for example, Wil SG, Oensingen, Bülach) gained 0.6-0.8 stars, suggesting recent improvements were noticed.
Two strategic categories of complaint emerged. Safety, Crowds, and Toilets are damage control: high complaint volume but no praise even at top stations, so investment stops the bleeding without lifting sentiment. Cleanliness, Connections, Food & Shops, Staff/Service, and Signage/Nav are visible wins: top stations show that positive sentiment is achievable.
Staff/Service, Connections, and Safety complaints appear in the harshest reviews (average ratings of 1.6, 1.9, and 2.0 stars, well below the other aspects). Combined with how often each appears, Safety, Staff/Service, and Crowds dominate the negative-rating story: Crowds costs fewer stars per complaint but shows up in very high volume, while Connections is severe but less frequent.

What this analysis cannot tell me

No causal claims. Correlation between aspect and rating is not proof of cause. A station may be rated low for reasons not captured in its review text at all.
Coverage is partial. Roughly 2-3% of reviews were lost to AI batch truncation. In addition, scraped review texts are cut off at ~340 characters, so long, nuanced reviews are clipped before the model ever sees them. Romansh reviewers are absent entirely.
Self-selection bias. Google Maps reviewers are not a representative sample of travellers. The people who bother writing a review are usually the ones with strong feelings, either very satisfied or very upset. The everyday commuter who finds the station “fine” rarely shows up in the data, so the dataset over-represents extremes in both directions.
Review volume is not a measure of importance. Some aspects show up less often because the affected population is smaller, not because the issue matters less. Accessibility is the clearest example: wheelchair users, parents with strollers, and travellers with mobility constraints make up a small share of reviewers, so the aspect ranks low on volume and falls out of the priority charts. That does not make it any less critical to the people for whom a missing lift or an inaccessible platform is the difference between using the station and not using it. Review-derived priorities are a useful complement to, not a substitute for, dedicated accessibility audits and SBB’s existing standards.
No within-station granularity. A complaint at Zürich HB does not tell me whether the issue was on a platform, in the underground retail level, or at a ticket counter. The whole station is treated as one unit, which also makes cross-station comparison tricky: “crowds” at a 100,000-passenger-per-day hub means something very different from “crowds” at a small regional station, but in the dataset they look identical.
Snapshot, not stream. This is a one-shot analysis that only captures the data as of the date the reviews were scraped. The real value would come from running it continuously, so that changes in sentiment are picked up as they happen.

What I would do next

Turn the scraper and AI classifier into a scheduled job and add more data sources, such as social media and travel forums, so sentiment can be tracked monthly per station.
Add a comparison baseline: rate other major European stations the same way to see whether SBB’s 4.16 is genuinely high or just typical for the segment.
Cross-reference complaint timelines against operational events (renovation projects, schedule changes) to test whether interventions show up in sentiment data.